Merge ~mdeslaur/ubuntu-cve-tracker:perf-part2 into ubuntu-cve-tracker:master
- Git
- lp:~mdeslaur/ubuntu-cve-tracker
- perf-part2
- Merge into master
Status: | Merged |
---|---|
Merge reported by: | Marc Deslauriers |
Merged at revision: | fc0b57572b30d8f996a0370758b8cc94163061ad |
Proposed branch: | ~mdeslaur/ubuntu-cve-tracker:perf-part2 |
Merge into: | ubuntu-cve-tracker:master |
Diff against target: |
290 lines (+77/-41) 2 files modified
scripts/check-syntax (+34/-31) scripts/cve_lib.py (+43/-10) |
Related bugs: |
Reviewer | Review Type | Date Requested | Status |
---|---|---|---|
Alex Murray | Approve | ||
Review via email: mp+461633@code.launchpad.net |
Commit message
commit b1839ea0984addc
Author: Marc Deslauriers <email address hidden>
Date: Fri Mar 1 08:53:41 2024 -0500
check-syntax: improve performance by eliminating subprojects check
check_cve() was attempting to determine if a specified CVE path
was in a subproject by going through all the subprojects directories,
but just to end up using find_cve(). We should just unconditionally
use find_cve() here to find the base CVE file, unless the CVE
specified a path to the boilerplates directory.
commit fb769299890f579
Author: Marc Deslauriers <email address hidden>
Date: Fri Mar 1 08:50:47 2024 -0500
Improve find_external_
In find_external_
if the CVE wasn't found in the top-level. Also, only call realpath
if the file exists. Add option to disable realpath lookups.
Description of the change
BEFORE:
$ time ./scripts/
real 2m25.430s
user 15m53.184s
sys 2m57.945s
AFTER:
$ time ./scripts/
real 1m26.673s
user 9m49.054s
sys 1m16.596s
Seth Arnold (seth-arnold) wrote : | # |
- 6a7dbc9... by Marc Deslauriers
-
Remove find_external_
subproject_ cves() use to improve performance To check if the main CVE file incorrectly contains a subproject
entry, find_external_subproject_ cves() was used but that is expensive,
so instead, just check if the entry was found in the main CVE file
instead. - c6e5588... by Marc Deslauriers
-
Disable use of realpath() in find_external_
subproject_ cves() The only remaining use of find_external_
subproject_ cves() is when
load_cve() loads subproject files. I can't think of a good
reason why we would need to use realpath() here, so turn it
off by default. Also make code a little less redundant.
Marc Deslauriers (mdeslaur) wrote : | # |
I have added two more commits to this merge proposal to clean up find_external_
Marc Deslauriers (mdeslaur) wrote : | # |
The two new commits shaved off another few seconds
$ time ./scripts/
real 1m18.166s
user 8m54.048s
sys 1m4.496s
- 5a276aa... by Marc Deslauriers
-
check-syntax: remove disk access from is_* functions
is_active(), is_embargoed(), and is_retired() would all check if
the file existed on disk, which is expensive. Instead, just check
the path for these locations. Also, replace manual checks for the
boilerplate directory with a is_boilerplate() function.
Marc Deslauriers (mdeslaur) wrote : | # |
The removed file checks in the last commit reduced the runtime pretty good:
$ time ./scripts/
real 0m54.352s
user 6m27.479s
sys 0m12.230s
- f81d2da... by Marc Deslauriers
-
cve_lib.py: if we're loading boilerplates, don't try in subprojects
- e429156... by Marc Deslauriers
-
Build up a cache of subproject CVEs and directories to improve performance
- aca4f51... by Marc Deslauriers
-
cve_lib.py: use itertools cache on get_subproject_
details( )
Marc Deslauriers (mdeslaur) wrote : | # |
I've added some caching, and now get:
$ time ./scripts/
real 0m49.048s
user 5m40.278s
sys 0m6.770s
- fc0b575... by Marc Deslauriers
-
cve_lib.py: use lru_cache so it works with focal's python
Alex Murray (alexmurray) wrote : | # |
Thanks for working on this Marc - I'll give it a run locally.
@seth-arnold - regarding realpath - I added thus to support reading a CVE from stdin for check-syntax - ie:
./scripts/
(since this then allows to support integrating check-syntax with an external editor for on-the-fly syntax checking).
But performance improvements are more important than having this work at the moment so I am more than happy to forego it whilst we get performance improved, then I can look to add it back if it is still needed (and try and retain the performance gains in the process).
Alex Murray (alexmurray) wrote : | # |
LGTM (a small fixup is needed to get --stdin working again with this but that can land after):
diff --git a/scripts/
index ba2645516a4.
--- a/scripts/
+++ b/scripts/
@@ -538,14 +538,18 @@ def check_cve(cve):
cve_okay = True
srcmap = dict()
try:
- # Try to locate the base CVE and load that, but not if this is a
- # boilerplate path
- if is_boilerplate(
+ if opt.stdin:
+ # fixup cve below once we have loaded it
- cve = os.path.
else:
- cve = os.path.
- cvepath = cve_lib.
+ # Try to locate the base CVE and load that, but not if this is a
+ # boilerplate path
+ if is_boilerplate(
+ cvepath = cve
+ cve = os.path.
+ else:
+ cve = os.path.
+ cvepath = cve_lib.
data = cve_lib.
except ValueError as e:
print(e, file=sys.stderr)
Preview Diff
1 | diff --git a/scripts/check-syntax b/scripts/check-syntax |
2 | index b740331..ba26455 100755 |
3 | --- a/scripts/check-syntax |
4 | +++ b/scripts/check-syntax |
5 | @@ -107,18 +107,21 @@ def ever_existed(pkg): |
6 | return False |
7 | |
8 | |
9 | -def is_active(cve): |
10 | - return os.path.exists(os.path.join(cve_lib.active_dir, cve)) |
11 | +def is_active(cvepath): |
12 | + return cvepath.startswith(cve_lib.active_dir) |
13 | |
14 | |
15 | -def is_embargoed(cve): |
16 | - return os.path.exists(os.path.join(cve_lib.embargoed_dir, cve)) |
17 | +def is_embargoed(cvepath): |
18 | + return cvepath.startswith(cve_lib.embargoed_dir) |
19 | |
20 | |
21 | -def is_retired(cve): |
22 | - return os.path.exists(os.path.join(cve_lib.retired_dir, cve)) |
23 | +def is_retired(cvepath): |
24 | + return cvepath.startswith(cve_lib.retired_dir) |
25 | |
26 | |
27 | +def is_boilerplate(cvepath): |
28 | + return cvepath.startswith(cve_lib.boilerplates_dir) |
29 | + |
30 | def is_unpublished_kernel(kernel, release=None, debug=False): |
31 | for u in unpublished_kernels: |
32 | unpub = u.split("/") |
33 | @@ -352,7 +355,7 @@ elif len(args) == 0: |
34 | if filename.startswith("%s/CVE-" % dir): |
35 | args += [filename] |
36 | continue |
37 | - if cve_lib.boilerplates_dir in filename: |
38 | + if is_boilerplate(filename): |
39 | args += [filename] |
40 | |
41 | else: |
42 | @@ -380,6 +383,9 @@ ignored = cve_lib.parse_CVEs_from_uri("%s/not-for-us.txt" % cve_lib.ignored_dir) |
43 | |
44 | debug("ignored %s" % ignored) |
45 | |
46 | +# Build up the cache of subproject CVEs |
47 | +cve_lib.build_subproject_dir_cache() |
48 | + |
49 | cna_cves_set = CVEs_from_CNA() |
50 | # Just run this if we're not specifying specific CVEs |
51 | if len(cna_cves_set) > 0 and all_files: |
52 | @@ -529,21 +535,18 @@ def get_cve_path(cve, rel): |
53 | return cvepath |
54 | |
55 | def check_cve(cve): |
56 | - if re.match(r"EMB-", cve): |
57 | - cvepath = os.path.join(cve_lib.embargoed_dir, cve) |
58 | - elif re.match(r"CVE-", cve): |
59 | - cvepath = os.path.join(cve_lib.active_dir, cve) |
60 | - else: |
61 | - cvepath = cve |
62 | - cve = os.path.basename(cvepath) |
63 | - |
64 | cve_okay = True |
65 | srcmap = dict() |
66 | try: |
67 | - if cvepath in cve_lib.find_external_subproject_cves(cve): |
68 | - data = cve_lib.load_cve(cve_lib.find_cve(cve), opt.strict, srcmap=srcmap) |
69 | + # Try to locate the base CVE and load that, but not if this is a |
70 | + # boilerplate path |
71 | + if is_boilerplate(cve): |
72 | + cvepath = cve |
73 | + cve = os.path.basename(cve) |
74 | else: |
75 | - data = cve_lib.load_cve(cvepath, opt.strict, srcmap=srcmap) |
76 | + cve = os.path.basename(cve) |
77 | + cvepath = cve_lib.find_cve(cve) |
78 | + data = cve_lib.load_cve(cvepath, opt.strict, srcmap=srcmap) |
79 | except ValueError as e: |
80 | print(e, file=sys.stderr) |
81 | return False |
82 | @@ -580,7 +583,7 @@ def check_cve(cve): |
83 | cve_okay = False |
84 | |
85 | # verify candidate field matches the CVE file name but ignore if we don't have one |
86 | - if "stdin" not in cve and cve_lib.boilerplates_dir not in cvepath and not data["Candidate"] == cve: |
87 | + if "stdin" not in cve and not is_boilerplate(cvepath) and not data["Candidate"] == cve: |
88 | filename = srcmap["Candidate"][0] |
89 | linenum = srcmap["Candidate"][1] |
90 | print( |
91 | @@ -606,7 +609,7 @@ def check_cve(cve): |
92 | for rel in all_required_releases: |
93 | if rel in cve_lib.external_releases: |
94 | # for external releases skip boilerplates and embargoed CVEs |
95 | - if 'boilerplates' in cvepath or is_embargoed(cve): |
96 | + if is_boilerplate(cvepath) or is_embargoed(cvepath): |
97 | continue |
98 | |
99 | if rel in source: |
100 | @@ -615,7 +618,7 @@ def check_cve(cve): |
101 | _, series = cve_lib.product_series(rel) |
102 | # If the series is not listed in the original CVE, |
103 | # we won't create the CVE in the subproject. |
104 | - if not is_active(cve) and series not in listed_series: |
105 | + if not is_active(cvepath) and series not in listed_series: |
106 | skip_aliases = True |
107 | |
108 | # We are not adding CVEs in ignored. |
109 | @@ -655,7 +658,7 @@ def check_cve(cve): |
110 | missing_releases = all_required_releases - listed_releases |
111 | for rel in missing_releases: |
112 | # only warn on active CVEs |
113 | - if is_active(cve) and \ |
114 | + if is_active(cvepath) and \ |
115 | rel in source and pkg in source[rel]: |
116 | |
117 | # If the subproject doesn't require cve triage, |
118 | @@ -689,7 +692,7 @@ def check_cve(cve): |
119 | linenum = srcmap["pkgs"][pkg][release][1] |
120 | # warn if the cve file contains external subproject releases |
121 | # and it is not located in the subprojects folder |
122 | - if rel in cve_lib.external_releases and filename not in cve_lib.find_external_subproject_cves(cve): |
123 | + if rel in cve_lib.external_releases and filename == cvepath: |
124 | print( |
125 | "%s: %d: external release '%s' listed in internal CVE file" |
126 | % (filename, linenum, rel), |
127 | @@ -800,7 +803,7 @@ def check_cve(cve): |
128 | |
129 | # Check that package exists in a given release |
130 | if not ever_existed(pkg): |
131 | - if is_active(cve) and not cve_lib.is_active_esm_release(release): |
132 | + if is_active(cvepath) and not cve_lib.is_active_esm_release(release): |
133 | # forcibly skip linux-lts-backport packages and # |
134 | # other derived kernels since want to track them |
135 | # before they end up fully in the archive; |
136 | @@ -825,7 +828,7 @@ def check_cve(cve): |
137 | else: |
138 | if rel in source: |
139 | if pkg not in source[rel]: |
140 | - if is_active(cve): |
141 | + if is_active(cvepath): |
142 | if is_unpublished_kernel(pkg, rel): |
143 | if opt.debug: |
144 | print( |
145 | @@ -903,7 +906,7 @@ def check_cve(cve): |
146 | # looking for kernels that were in the |
147 | # unpublished_kernels list that have subsequently been |
148 | # published and should be checked for existence. |
149 | - if is_active(cve) and is_unpublished_kernel(pkg, rel): |
150 | + if is_active(cvepath) and is_unpublished_kernel(pkg, rel): |
151 | # we'll only issue a warning the first time we come |
152 | # across a kernel that's been published |
153 | if pkg not in warned_kernels: |
154 | @@ -937,7 +940,7 @@ def check_cve(cve): |
155 | # Verify priority for any CVE with a supported package |
156 | if ( |
157 | len(supported) |
158 | - and (is_active(cve) or is_embargoed(cve)) |
159 | + and (is_active(cvepath) or is_embargoed(cvepath)) |
160 | and ("Priority" not in data or data["Priority"][0] not in cve_lib.priorities) |
161 | ): |
162 | filename = srcmap["Priority"][0] if "Priority" in srcmap else cvepath |
163 | @@ -1005,7 +1008,7 @@ def check_cve(cve): |
164 | # check to see if the description has been changed to rejected by |
165 | # MITRE, we should move CVE to ignored state. |
166 | if ( |
167 | - is_active(cve) |
168 | + is_active(cvepath) |
169 | and "Description" in data |
170 | and data["Description"].lstrip().startswith("** REJECT **") |
171 | ): |
172 | @@ -1065,7 +1068,7 @@ def check_cve(cve): |
173 | cve_okay = False |
174 | |
175 | # Either PublicDate or CRD must be set to something |
176 | - if (cve_lib.boilerplates_dir not in cvepath and |
177 | + if (not is_boilerplate(cvepath) and |
178 | ("PublicDate" not in data or data["PublicDate"] == "") and |
179 | ("CRD" not in data or data["CRD"] == "") |
180 | ): |
181 | @@ -1082,11 +1085,11 @@ def check_cve(cve): |
182 | ) |
183 | |
184 | for d in ["PublicDate", "PublicDateAtUSN", "CRD"]: |
185 | - if cve_lib.boilerplates_dir in cvepath: |
186 | + if is_boilerplate(cvepath): |
187 | continue |
188 | passes = [] |
189 | # Embargoed CVEs can have an empty or unknown PublicDate |
190 | - if d == "PublicDate" and is_embargoed(cve): |
191 | + if d == "PublicDate" and is_embargoed(cvepath): |
192 | passes = ["", "unknown"] |
193 | if d == "PublicDateAtUSN": |
194 | # PublicDateAtUSN can be empty |
195 | diff --git a/scripts/cve_lib.py b/scripts/cve_lib.py |
196 | index 2bd1d31..ccd89e7 100755 |
197 | --- a/scripts/cve_lib.py |
198 | +++ b/scripts/cve_lib.py |
199 | @@ -26,6 +26,7 @@ import urllib.error |
200 | import urllib.request |
201 | |
202 | from functools import reduce |
203 | +from functools import lru_cache |
204 | |
205 | def set_cve_dir(path): |
206 | '''Return a path with CVEs in it. Specifically: |
207 | @@ -744,6 +745,7 @@ def product_series(rel): |
208 | return product, series |
209 | |
210 | # get the subproject details for rel along with it's canonical name, product and series |
211 | +@lru_cache |
212 | def get_subproject_details(rel): |
213 | """Return the canonical name,product,series,details tuple for rel.""" |
214 | canon, product, series, details, release = None, None, None, None, None |
215 | @@ -974,17 +976,34 @@ def find_files_recursive(path, name): |
216 | matches.append(filepath) |
217 | return matches |
218 | |
219 | -def find_external_subproject_cves(cve): |
220 | +def find_external_subproject_cves(cve, realpath=False): |
221 | """Return the list of external subproject CVE snippets for the given CVE.""" |
222 | cves = [] |
223 | - for rel in external_releases: |
224 | - # fallback to the series specific subdir rather than just the |
225 | - # top-level project directory even though this is preferred |
226 | - for d in [get_external_subproject_cve_dir(rel), |
227 | - get_external_subproject_dir(rel)]: |
228 | - path = os.path.realpath(os.path.join(d, cve)) |
229 | - if os.path.exists(path) and path not in cves: |
230 | - cves.append(path) |
231 | + # Use the cache if it's not empty |
232 | + if subproject_dir_cache_cves: |
233 | + if cve not in subproject_dir_cache_cves: |
234 | + return cves |
235 | + for entry in subproject_dir_cache_dirs: |
236 | + path = os.path.join(entry, cve) |
237 | + if os.path.exists(path): |
238 | + if realpath: |
239 | + path = os.path.realpath(path) |
240 | + if path not in cves: |
241 | + cves.append(path) |
242 | + else: |
243 | + for rel in external_releases: |
244 | + # fallback to the series specific subdir rather than just the |
245 | + # top-level project directory even though this is preferred |
246 | + for path in [get_external_subproject_dir(rel), |
247 | + get_external_subproject_cve_dir(rel)]: |
248 | + path = os.path.join(path, cve) |
249 | + if os.path.exists(path): |
250 | + if realpath: |
251 | + path = os.path.realpath(path) |
252 | + if path not in cves: |
253 | + cves.append(path) |
254 | + break |
255 | + |
256 | return cves |
257 | |
258 | # Keys in config.yml for a external subproject |
259 | @@ -1620,8 +1639,21 @@ if os.path.islink(embargoed_dir): |
260 | EXIT_FAIL = 1 |
261 | EXIT_OKAY = 0 |
262 | |
263 | +subproject_dir_cache_cves = set() |
264 | +subproject_dir_cache_dirs = set() |
265 | + |
266 | config = {} |
267 | |
268 | +def build_subproject_dir_cache(): |
269 | + """Build a pre-cached list of the subproject files""" |
270 | + for rel in external_releases: |
271 | + # fallback to the series specific subdir rather than just the |
272 | + # top-level project directory even though this is preferred |
273 | + for path in [get_external_subproject_dir(rel), |
274 | + get_external_subproject_cve_dir(rel)]: |
275 | + subproject_dir_cache_dirs.update([path]) |
276 | + cves = glob.glob("CVE-*", root_dir=path) |
277 | + subproject_dir_cache_cves.update(cves) |
278 | |
279 | def parse_CVEs_from_uri(url): |
280 | """Return a list of all CVE numbers mentioned in the given URL.""" |
281 | @@ -2529,7 +2561,8 @@ def load_cve(cvefile, strict=False, srcmap=None): |
282 | |
283 | data['pkgs'] = affected |
284 | |
285 | - code, msg = load_external_subproject_cve_data(cvefile, data, srcmap, code, msg) |
286 | + if not "boilerplate" in cvefile: |
287 | + code, msg = load_external_subproject_cve_data(cvefile, data, srcmap, code, msg) |
288 | |
289 | if code != EXIT_OKAY: |
290 | raise ValueError(msg.strip()) |
Nice!
I'm not sure why we have realpath() scattered all over the tools. I didn't see anything very compelling in git history to explain why we have them.
If none of us can recall why we've got them, at some point we ought to flip that switch and find out what breaks. It might take a while.
Thanks