Merge ~mdeslaur/ubuntu-cve-tracker:perf-part2 into ubuntu-cve-tracker:master

Proposed by Marc Deslauriers
Status: Merged
Merge reported by: Marc Deslauriers
Merged at revision: fc0b57572b30d8f996a0370758b8cc94163061ad
Proposed branch: ~mdeslaur/ubuntu-cve-tracker:perf-part2
Merge into: ubuntu-cve-tracker:master
Diff against target: 290 lines (+77/-41)
2 files modified
scripts/check-syntax (+34/-31)
scripts/cve_lib.py (+43/-10)
Reviewer Review Type Date Requested Status
Alex Murray Approve
Review via email: mp+461633@code.launchpad.net

Commit message

commit b1839ea0984addc4f1c18d64600bb36e62d5250e
Author: Marc Deslauriers <email address hidden>
Date: Fri Mar 1 08:53:41 2024 -0500

    check-syntax: improve performance by eliminating subprojects check

    check_cve() was attempting to determine if a specified CVE path
    was in a subproject by going through all the subprojects directories,
    but just to end up using find_cve(). We should just unconditionally
    use find_cve() here to find the base CVE file, unless the CVE
    specified a path to the boilerplates directory.

commit fb769299890f579b2ac5983b0a6ff1b14999820a
Author: Marc Deslauriers <email address hidden>
Date: Fri Mar 1 08:50:47 2024 -0500

    Improve find_external_subproject_cves() performance

    In find_external_subproject_cves(), only look inside subdirectories
    if the CVE wasn't found in the top-level. Also, only call realpath
    if the file exists. Add option to disable realpath lookups.

Description of the change

BEFORE:

$ time ./scripts/check-syntax -j8

real 2m25.430s
user 15m53.184s
sys 2m57.945s

AFTER:

$ time ./scripts/check-syntax -j8

real 1m26.673s
user 9m49.054s
sys 1m16.596s

To post a comment you must log in.
Revision history for this message
Seth Arnold (seth-arnold) wrote :

Nice!

I'm not sure why we have realpath() scattered all over the tools. I didn't see anything very compelling in git history to explain why we have them.

If none of us can recall why we've got them, at some point we ought to flip that switch and find out what breaks. It might take a while.

Thanks

6a7dbc9... by Marc Deslauriers

Remove find_external_subproject_cves() use to improve performance

To check if the main CVE file incorrectly contains a subproject
entry, find_external_subproject_cves() was used but that is expensive,
so instead, just check if the entry was found in the main CVE file
instead.

c6e5588... by Marc Deslauriers

Disable use of realpath() in find_external_subproject_cves()

The only remaining use of find_external_subproject_cves() is when
load_cve() loads subproject files. I can't think of a good
reason why we would need to use realpath() here, so turn it
off by default. Also make code a little less redundant.

Revision history for this message
Marc Deslauriers (mdeslaur) wrote :

I have added two more commits to this merge proposal to clean up find_external_subproject_cves() further and to completely remove its use in check-syntax.

Revision history for this message
Marc Deslauriers (mdeslaur) wrote :

The two new commits shaved off another few seconds

$ time ./scripts/check-syntax -j8

real 1m18.166s
user 8m54.048s
sys 1m4.496s

5a276aa... by Marc Deslauriers

check-syntax: remove disk access from is_* functions

is_active(), is_embargoed(), and is_retired() would all check if
the file existed on disk, which is expensive. Instead, just check
the path for these locations. Also, replace manual checks for the
boilerplate directory with a is_boilerplate() function.

Revision history for this message
Marc Deslauriers (mdeslaur) wrote :

The removed file checks in the last commit reduced the runtime pretty good:

$ time ./scripts/check-syntax -j8

real 0m54.352s
user 6m27.479s
sys 0m12.230s

f81d2da... by Marc Deslauriers

cve_lib.py: if we're loading boilerplates, don't try in subprojects

e429156... by Marc Deslauriers

Build up a cache of subproject CVEs and directories to improve performance

aca4f51... by Marc Deslauriers

cve_lib.py: use itertools cache on get_subproject_details()

Revision history for this message
Marc Deslauriers (mdeslaur) wrote :

I've added some caching, and now get:

$ time ./scripts/check-syntax -j8

real 0m49.048s
user 5m40.278s
sys 0m6.770s

fc0b575... by Marc Deslauriers

cve_lib.py: use lru_cache so it works with focal's python

Revision history for this message
Alex Murray (alexmurray) wrote :

Thanks for working on this Marc - I'll give it a run locally.

@seth-arnold - regarding realpath - I added thus to support reading a CVE from stdin for check-syntax - ie:

./scripts/check-syntax --stdin < active/CVE-YYYY-NNNNN

(since this then allows to support integrating check-syntax with an external editor for on-the-fly syntax checking).

But performance improvements are more important than having this work at the moment so I am more than happy to forego it whilst we get performance improved, then I can look to add it back if it is still needed (and try and retain the performance gains in the process).

Revision history for this message
Alex Murray (alexmurray) wrote :

LGTM (a small fixup is needed to get --stdin working again with this but that can land after):

diff --git a/scripts/check-syntax b/scripts/check-syntax
index ba2645516a4..1dd37810acd 100755
--- a/scripts/check-syntax
+++ b/scripts/check-syntax
@@ -538,14 +538,18 @@ def check_cve(cve):
     cve_okay = True
     srcmap = dict()
     try:
- # Try to locate the base CVE and load that, but not if this is a
- # boilerplate path
- if is_boilerplate(cve):
+ if opt.stdin:
+ # fixup cve below once we have loaded it
             cvepath = cve
- cve = os.path.basename(cve)
         else:
- cve = os.path.basename(cve)
- cvepath = cve_lib.find_cve(cve)
+ # Try to locate the base CVE and load that, but not if this is a
+ # boilerplate path
+ if is_boilerplate(cve):
+ cvepath = cve
+ cve = os.path.basename(cve)
+ else:
+ cve = os.path.basename(cve)
+ cvepath = cve_lib.find_cve(cve)
         data = cve_lib.load_cve(cvepath, opt.strict, srcmap=srcmap)
     except ValueError as e:
         print(e, file=sys.stderr)

review: Approve

Preview Diff

[H/L] Next/Prev Comment, [J/K] Next/Prev File, [N/P] Next/Prev Hunk
1diff --git a/scripts/check-syntax b/scripts/check-syntax
2index b740331..ba26455 100755
3--- a/scripts/check-syntax
4+++ b/scripts/check-syntax
5@@ -107,18 +107,21 @@ def ever_existed(pkg):
6 return False
7
8
9-def is_active(cve):
10- return os.path.exists(os.path.join(cve_lib.active_dir, cve))
11+def is_active(cvepath):
12+ return cvepath.startswith(cve_lib.active_dir)
13
14
15-def is_embargoed(cve):
16- return os.path.exists(os.path.join(cve_lib.embargoed_dir, cve))
17+def is_embargoed(cvepath):
18+ return cvepath.startswith(cve_lib.embargoed_dir)
19
20
21-def is_retired(cve):
22- return os.path.exists(os.path.join(cve_lib.retired_dir, cve))
23+def is_retired(cvepath):
24+ return cvepath.startswith(cve_lib.retired_dir)
25
26
27+def is_boilerplate(cvepath):
28+ return cvepath.startswith(cve_lib.boilerplates_dir)
29+
30 def is_unpublished_kernel(kernel, release=None, debug=False):
31 for u in unpublished_kernels:
32 unpub = u.split("/")
33@@ -352,7 +355,7 @@ elif len(args) == 0:
34 if filename.startswith("%s/CVE-" % dir):
35 args += [filename]
36 continue
37- if cve_lib.boilerplates_dir in filename:
38+ if is_boilerplate(filename):
39 args += [filename]
40
41 else:
42@@ -380,6 +383,9 @@ ignored = cve_lib.parse_CVEs_from_uri("%s/not-for-us.txt" % cve_lib.ignored_dir)
43
44 debug("ignored %s" % ignored)
45
46+# Build up the cache of subproject CVEs
47+cve_lib.build_subproject_dir_cache()
48+
49 cna_cves_set = CVEs_from_CNA()
50 # Just run this if we're not specifying specific CVEs
51 if len(cna_cves_set) > 0 and all_files:
52@@ -529,21 +535,18 @@ def get_cve_path(cve, rel):
53 return cvepath
54
55 def check_cve(cve):
56- if re.match(r"EMB-", cve):
57- cvepath = os.path.join(cve_lib.embargoed_dir, cve)
58- elif re.match(r"CVE-", cve):
59- cvepath = os.path.join(cve_lib.active_dir, cve)
60- else:
61- cvepath = cve
62- cve = os.path.basename(cvepath)
63-
64 cve_okay = True
65 srcmap = dict()
66 try:
67- if cvepath in cve_lib.find_external_subproject_cves(cve):
68- data = cve_lib.load_cve(cve_lib.find_cve(cve), opt.strict, srcmap=srcmap)
69+ # Try to locate the base CVE and load that, but not if this is a
70+ # boilerplate path
71+ if is_boilerplate(cve):
72+ cvepath = cve
73+ cve = os.path.basename(cve)
74 else:
75- data = cve_lib.load_cve(cvepath, opt.strict, srcmap=srcmap)
76+ cve = os.path.basename(cve)
77+ cvepath = cve_lib.find_cve(cve)
78+ data = cve_lib.load_cve(cvepath, opt.strict, srcmap=srcmap)
79 except ValueError as e:
80 print(e, file=sys.stderr)
81 return False
82@@ -580,7 +583,7 @@ def check_cve(cve):
83 cve_okay = False
84
85 # verify candidate field matches the CVE file name but ignore if we don't have one
86- if "stdin" not in cve and cve_lib.boilerplates_dir not in cvepath and not data["Candidate"] == cve:
87+ if "stdin" not in cve and not is_boilerplate(cvepath) and not data["Candidate"] == cve:
88 filename = srcmap["Candidate"][0]
89 linenum = srcmap["Candidate"][1]
90 print(
91@@ -606,7 +609,7 @@ def check_cve(cve):
92 for rel in all_required_releases:
93 if rel in cve_lib.external_releases:
94 # for external releases skip boilerplates and embargoed CVEs
95- if 'boilerplates' in cvepath or is_embargoed(cve):
96+ if is_boilerplate(cvepath) or is_embargoed(cvepath):
97 continue
98
99 if rel in source:
100@@ -615,7 +618,7 @@ def check_cve(cve):
101 _, series = cve_lib.product_series(rel)
102 # If the series is not listed in the original CVE,
103 # we won't create the CVE in the subproject.
104- if not is_active(cve) and series not in listed_series:
105+ if not is_active(cvepath) and series not in listed_series:
106 skip_aliases = True
107
108 # We are not adding CVEs in ignored.
109@@ -655,7 +658,7 @@ def check_cve(cve):
110 missing_releases = all_required_releases - listed_releases
111 for rel in missing_releases:
112 # only warn on active CVEs
113- if is_active(cve) and \
114+ if is_active(cvepath) and \
115 rel in source and pkg in source[rel]:
116
117 # If the subproject doesn't require cve triage,
118@@ -689,7 +692,7 @@ def check_cve(cve):
119 linenum = srcmap["pkgs"][pkg][release][1]
120 # warn if the cve file contains external subproject releases
121 # and it is not located in the subprojects folder
122- if rel in cve_lib.external_releases and filename not in cve_lib.find_external_subproject_cves(cve):
123+ if rel in cve_lib.external_releases and filename == cvepath:
124 print(
125 "%s: %d: external release '%s' listed in internal CVE file"
126 % (filename, linenum, rel),
127@@ -800,7 +803,7 @@ def check_cve(cve):
128
129 # Check that package exists in a given release
130 if not ever_existed(pkg):
131- if is_active(cve) and not cve_lib.is_active_esm_release(release):
132+ if is_active(cvepath) and not cve_lib.is_active_esm_release(release):
133 # forcibly skip linux-lts-backport packages and #
134 # other derived kernels since want to track them
135 # before they end up fully in the archive;
136@@ -825,7 +828,7 @@ def check_cve(cve):
137 else:
138 if rel in source:
139 if pkg not in source[rel]:
140- if is_active(cve):
141+ if is_active(cvepath):
142 if is_unpublished_kernel(pkg, rel):
143 if opt.debug:
144 print(
145@@ -903,7 +906,7 @@ def check_cve(cve):
146 # looking for kernels that were in the
147 # unpublished_kernels list that have subsequently been
148 # published and should be checked for existence.
149- if is_active(cve) and is_unpublished_kernel(pkg, rel):
150+ if is_active(cvepath) and is_unpublished_kernel(pkg, rel):
151 # we'll only issue a warning the first time we come
152 # across a kernel that's been published
153 if pkg not in warned_kernels:
154@@ -937,7 +940,7 @@ def check_cve(cve):
155 # Verify priority for any CVE with a supported package
156 if (
157 len(supported)
158- and (is_active(cve) or is_embargoed(cve))
159+ and (is_active(cvepath) or is_embargoed(cvepath))
160 and ("Priority" not in data or data["Priority"][0] not in cve_lib.priorities)
161 ):
162 filename = srcmap["Priority"][0] if "Priority" in srcmap else cvepath
163@@ -1005,7 +1008,7 @@ def check_cve(cve):
164 # check to see if the description has been changed to rejected by
165 # MITRE, we should move CVE to ignored state.
166 if (
167- is_active(cve)
168+ is_active(cvepath)
169 and "Description" in data
170 and data["Description"].lstrip().startswith("** REJECT **")
171 ):
172@@ -1065,7 +1068,7 @@ def check_cve(cve):
173 cve_okay = False
174
175 # Either PublicDate or CRD must be set to something
176- if (cve_lib.boilerplates_dir not in cvepath and
177+ if (not is_boilerplate(cvepath) and
178 ("PublicDate" not in data or data["PublicDate"] == "") and
179 ("CRD" not in data or data["CRD"] == "")
180 ):
181@@ -1082,11 +1085,11 @@ def check_cve(cve):
182 )
183
184 for d in ["PublicDate", "PublicDateAtUSN", "CRD"]:
185- if cve_lib.boilerplates_dir in cvepath:
186+ if is_boilerplate(cvepath):
187 continue
188 passes = []
189 # Embargoed CVEs can have an empty or unknown PublicDate
190- if d == "PublicDate" and is_embargoed(cve):
191+ if d == "PublicDate" and is_embargoed(cvepath):
192 passes = ["", "unknown"]
193 if d == "PublicDateAtUSN":
194 # PublicDateAtUSN can be empty
195diff --git a/scripts/cve_lib.py b/scripts/cve_lib.py
196index 2bd1d31..ccd89e7 100755
197--- a/scripts/cve_lib.py
198+++ b/scripts/cve_lib.py
199@@ -26,6 +26,7 @@ import urllib.error
200 import urllib.request
201
202 from functools import reduce
203+from functools import lru_cache
204
205 def set_cve_dir(path):
206 '''Return a path with CVEs in it. Specifically:
207@@ -744,6 +745,7 @@ def product_series(rel):
208 return product, series
209
210 # get the subproject details for rel along with it's canonical name, product and series
211+@lru_cache
212 def get_subproject_details(rel):
213 """Return the canonical name,product,series,details tuple for rel."""
214 canon, product, series, details, release = None, None, None, None, None
215@@ -974,17 +976,34 @@ def find_files_recursive(path, name):
216 matches.append(filepath)
217 return matches
218
219-def find_external_subproject_cves(cve):
220+def find_external_subproject_cves(cve, realpath=False):
221 """Return the list of external subproject CVE snippets for the given CVE."""
222 cves = []
223- for rel in external_releases:
224- # fallback to the series specific subdir rather than just the
225- # top-level project directory even though this is preferred
226- for d in [get_external_subproject_cve_dir(rel),
227- get_external_subproject_dir(rel)]:
228- path = os.path.realpath(os.path.join(d, cve))
229- if os.path.exists(path) and path not in cves:
230- cves.append(path)
231+ # Use the cache if it's not empty
232+ if subproject_dir_cache_cves:
233+ if cve not in subproject_dir_cache_cves:
234+ return cves
235+ for entry in subproject_dir_cache_dirs:
236+ path = os.path.join(entry, cve)
237+ if os.path.exists(path):
238+ if realpath:
239+ path = os.path.realpath(path)
240+ if path not in cves:
241+ cves.append(path)
242+ else:
243+ for rel in external_releases:
244+ # fallback to the series specific subdir rather than just the
245+ # top-level project directory even though this is preferred
246+ for path in [get_external_subproject_dir(rel),
247+ get_external_subproject_cve_dir(rel)]:
248+ path = os.path.join(path, cve)
249+ if os.path.exists(path):
250+ if realpath:
251+ path = os.path.realpath(path)
252+ if path not in cves:
253+ cves.append(path)
254+ break
255+
256 return cves
257
258 # Keys in config.yml for a external subproject
259@@ -1620,8 +1639,21 @@ if os.path.islink(embargoed_dir):
260 EXIT_FAIL = 1
261 EXIT_OKAY = 0
262
263+subproject_dir_cache_cves = set()
264+subproject_dir_cache_dirs = set()
265+
266 config = {}
267
268+def build_subproject_dir_cache():
269+ """Build a pre-cached list of the subproject files"""
270+ for rel in external_releases:
271+ # fallback to the series specific subdir rather than just the
272+ # top-level project directory even though this is preferred
273+ for path in [get_external_subproject_dir(rel),
274+ get_external_subproject_cve_dir(rel)]:
275+ subproject_dir_cache_dirs.update([path])
276+ cves = glob.glob("CVE-*", root_dir=path)
277+ subproject_dir_cache_cves.update(cves)
278
279 def parse_CVEs_from_uri(url):
280 """Return a list of all CVE numbers mentioned in the given URL."""
281@@ -2529,7 +2561,8 @@ def load_cve(cvefile, strict=False, srcmap=None):
282
283 data['pkgs'] = affected
284
285- code, msg = load_external_subproject_cve_data(cvefile, data, srcmap, code, msg)
286+ if not "boilerplate" in cvefile:
287+ code, msg = load_external_subproject_cve_data(cvefile, data, srcmap, code, msg)
288
289 if code != EXIT_OKAY:
290 raise ValueError(msg.strip())

Subscribers

People subscribed via source and target branches