Merge ~mdeslaur/ubuntu-cve-tracker:perf-part4 into ubuntu-cve-tracker:master

Proposed by Marc Deslauriers
Status: Merged
Merge reported by: Marc Deslauriers
Merged at revision: 06df23418f8851d45c435207f0e5831826805293
Proposed branch: ~mdeslaur/ubuntu-cve-tracker:perf-part4
Merge into: ubuntu-cve-tracker:master
Diff against target: 129 lines (+80/-6)
4 files modified
scripts/active_edit (+1/-6)
scripts/gen-source-map-cache (+50/-0)
scripts/packages-mirror (+9/-0)
scripts/source_map.py (+20/-0)
Reviewer Review Type Date Requested Status
Alex Murray Approve
Review via email: mp+461826@code.launchpad.net

Commit message

commit 7e46124a0687ea0b4ae955cd23d4d3fbde6ec9ed
Author: Marc Deslauriers <email address hidden>
Date: Tue Mar 5 12:40:42 2024 -0500

    gen-source-map-cache: create pickles of cached source map data

    Generating a source map is an expensive operation. This script
    generates source map data and stores it into pickles in the
    packages-mirror directory for use by source_map.py later on.

commit 45f1855791946615b8b7fd261979e7168addadcb
Author: Marc Deslauriers <email address hidden>
Date: Tue Mar 5 12:42:59 2024 -0500

    packages-mirror: call gen-source-map-cache to update pickles

commit 52c25d36857a9284fd4066f4f1dff1c580dd8691
Author: Marc Deslauriers <email address hidden>
Date: Tue Mar 5 12:43:26 2024 -0500

    source_map.py: improve performance by using cached data

    If a pickle containing cached information is available, load that
    instead of parsing all the Sources files.

commit 5762b55746375254778688fe3d0d64f1811d45f7 (HEAD -> perf-part4)
Author: Marc Deslauriers <email address hidden>
Date: Tue Mar 5 12:45:45 2024 -0500

    active_edit: use the default source_map options to get the cache

Description of the change

This branch creates pickles of cached source map data whenever packages-mirror has been run. This massively improves runtime of small scripts such as active_edit:

active_edit before:
$ time ./scripts/active_edit -c CVE-2024-NNN1 -p python-django

real 0m3.406s
user 0m3.290s
sys 0m0.116s

active_edit after:
$ time ./scripts/active_edit -c CVE-2024-NNN1 -p python-django

real 0m0.443s
user 0m0.387s
sys 0m0.056s

check-syntax before:
$ time ./scripts/check-syntax -j8
real 0m50.289s
user 6m6.409s
sys 0m6.995s

check-syntax after:
$ time ./scripts/check-syntax -j8
real 0m48.658s
user 6m19.725s
sys 0m7.410s

To post a comment you must log in.
Revision history for this message
Marc Deslauriers (mdeslaur) wrote :

This needs work still, 5762b55746375254778688fe3d0d64f1811d45f7 is incorrect.

Revision history for this message
Marc Deslauriers (mdeslaur) wrote :

So 5762b55746375254778688fe3d0d64f1811d45f7 modifies the way active_edit will create a new CVE file. It will no longer add the end of standard support releases...for example:

Patches_anacron:
upstream_anacron: needs-triage
trusty/esm_anacron: needs-triage
esm-infra/xenial_anacron: needs-triage
esm-infra/bionic_anacron: needs-triage
focal_anacron: needs-triage
jammy_anacron: needs-triage
mantic_anacron: needs-triage
devel_anacron: needs-triage

instead of:

Patches_anacron:
upstream_anacron: needs-triage
trusty_anacron: ignored (end of standard support)
trusty/esm_anacron: needs-triage
xenial_anacron: ignored (end of standard support)
esm-infra/xenial_anacron: needs-triage
bionic_anacron: ignored (end of standard support)
esm-infra/bionic_anacron: needs-triage
focal_anacron: needs-triage
jammy_anacron: needs-triage
mantic_anacron: needs-triage
devel_anacron: needs-triage

Do we really need those releases to be listed in new CVE files? All they do is add noise, and if nothing uses them, perhaps we should simply leave them out?

Revision history for this message
Alex Murray (alexmurray) wrote :

Re: the old releaase - I always thought they seemed redundant so I am happy to seem them go.

One question - why not use `detailed=True` for the cache so that it is as complete as possible since we are generating it out-of-band then we may as well put as much in it as we can IMO.

Revision history for this message
Marc Deslauriers (mdeslaur) wrote :

> Re: the old releaase - I always thought they seemed redundant so I am happy to
> seem them go.
>
> One question - why not use `detailed=True` for the cache so that it is as
> complete as possible since we are generating it out-of-band then we may as
> well put as much in it as we can IMO.

Good point, I'll measure if there's a runtime search performance difference with a bigger dataset, and if it's negligible, I'll switch it to detailed.

Revision history for this message
Marc Deslauriers (mdeslaur) wrote :

So there's a good difference with the bigger dataset, almost half a second longer:

detailed=True:
$ time ./scripts/active_edit -c CVE-2024-NNN1 -p python-django

real 0m0.848s
user 0m0.752s
sys 0m0.096s

detailed=False:
$ time ./scripts/active_edit -c CVE-2024-NNN1 -p python-django

real 0m0.440s
user 0m0.383s
sys 0m0.056s

I think we're better off keeping a non-detailed dataset, and if we think other tools could benefit from this, we can certainly generate a detailed dataset too. There's no reason why we can't generate a few different pickles there, depending on usage.

Revision history for this message
Alex Murray (alexmurray) wrote :

Awesome - LGTM - thanks for doing that additional analysis Marc - I agree, the extra overhead is not worth it so let's go for performance and like you say we can add additional caches in the future if needed.

review: Approve
Revision history for this message
Eduardo Barretto (ebarretto) wrote :

> So 5762b55746375254778688fe3d0d64f1811d45f7 modifies the way active_edit will
> create a new CVE file. It will no longer add the end of standard support
> releases...for example:
>
> Patches_anacron:
> upstream_anacron: needs-triage
> trusty/esm_anacron: needs-triage
> esm-infra/xenial_anacron: needs-triage
> esm-infra/bionic_anacron: needs-triage
> focal_anacron: needs-triage
> jammy_anacron: needs-triage
> mantic_anacron: needs-triage
> devel_anacron: needs-triage
>
>
> instead of:
>
> Patches_anacron:
> upstream_anacron: needs-triage
> trusty_anacron: ignored (end of standard support)
> trusty/esm_anacron: needs-triage
> xenial_anacron: ignored (end of standard support)
> esm-infra/xenial_anacron: needs-triage
> bionic_anacron: ignored (end of standard support)
> esm-infra/bionic_anacron: needs-triage
> focal_anacron: needs-triage
> jammy_anacron: needs-triage
> mantic_anacron: needs-triage
> devel_anacron: needs-triage
>
> Do we really need those releases to be listed in new CVE files? All they do is
> add noise, and if nothing uses them, perhaps we should simply leave them out?

from an OVAL point of view, I don't think this will be a problem as we moved into having one OVAL (CVE- or PKG-based) per release, therefore it will consider trusty/esm in that case, instead of trusty, for example.

I believe that won't create an issue for the Web CVE Tracker, but maybe it might be worth testing with a single CVE file and seeing if anything happens in the website.

Revision history for this message
Paulo Flabiano Smorigo (pfsmorigo) wrote :

LGTM. Maybe just fix the lpci failures:
scripts/packages-mirror: line 358: ./scripts/gen-source-map-cache: No such file or directory

06df234... by Marc Deslauriers

packages-mirror: don't generate pickles if the tool is missing

Revision history for this message
Seth Arnold (seth-arnold) wrote :

Nice, LGTM :)

Preview Diff

[H/L] Next/Prev Comment, [J/K] Next/Prev File, [N/P] Next/Prev Hunk
diff --git a/scripts/active_edit b/scripts/active_edit
index 321f39a..a0e0efa 100755
--- a/scripts/active_edit
+++ b/scripts/active_edit
@@ -37,12 +37,7 @@ parser.add_option("-C", "--cvss", help="CVSS3.1 rating", metavar="CVSS:3.1/AV:_/
37parser.add_option("-d", "--description", help="Description", default=None)37parser.add_option("-d", "--description", help="Description", default=None)
38(options, args) = parser.parse_args()38(options, args) = parser.parse_args()
3939
4040source = source_map.load(detailed=False)
41source_releases = []
42for release in cve_lib.all_releases:
43 if cve_lib.is_active_esm_release(release) or cve_lib.is_active_release(release):
44 source_releases.append(release)
45source = source_map.load(releases=source_releases, skip_eol_releases=False, detailed=False)
4641
47added_rel_pkg = []42added_rel_pkg = []
4843
diff --git a/scripts/gen-source-map-cache b/scripts/gen-source-map-cache
49new file mode 10075544new file mode 100755
index 0000000..afc6956
--- /dev/null
+++ b/scripts/gen-source-map-cache
@@ -0,0 +1,50 @@
1#!/usr/bin/python3
2# -*- coding: utf-8 -*-
3#
4# This script pre-calculates some sources maps and saves them as pickle
5# files to speed up being used by various tools.
6#
7# Author: Marc Deslauriers <marc.deslauriers@canonical.com>
8# Copyright (C) 2024 Canonical Ltd.
9#
10# This script is distributed under the terms and conditions of the GNU General
11# Public License, Version 3 or later. See http://www.gnu.org/copyleft/gpl.html
12# for details.
13#
14
15import sys
16import os
17import pickle
18import cve_lib
19import source_map
20
21def main():
22 config = source_map.read_config_file(os.path.expanduser("~/.ubuntu-cve-tracker.conf"))
23 if 'packages_mirror' not in config:
24 sys.stderr.write("\rERROR: Could not find packages_mirror in config file!\n")
25 sys.exit(1)
26
27 mirror = config['packages_mirror']
28 sources_pickle = os.path.join(mirror, 'sources_map.pickle')
29 sources_pickle_dev_proposed = os.path.join(mirror, 'sources_map_dev_proposed.pickle')
30
31 # Remove these first so that source_map.load doesn't load them!
32 if os.path.exists(sources_pickle):
33 os.unlink(sources_pickle)
34 if os.path.exists(sources_pickle_dev_proposed):
35 os.unlink(sources_pickle_dev_proposed)
36
37 sources = source_map.load(detailed=False)
38 sources_dev_proposed = source_map.load(pockets=["-proposed"],
39 releases=[cve_lib.devel_release],
40 detailed=False)
41
42 # Use protocol=3 to remain compatible with focal
43 with open(sources_pickle, 'wb') as handle:
44 pickle.dump(sources, handle, protocol=3)
45
46 with open(sources_pickle_dev_proposed, 'wb') as handle:
47 pickle.dump(sources_dev_proposed, handle, protocol=3)
48
49if __name__ == '__main__':
50 main()
diff --git a/scripts/packages-mirror b/scripts/packages-mirror
index f7a9886..067e128 100755
--- a/scripts/packages-mirror
+++ b/scripts/packages-mirror
@@ -350,3 +350,12 @@ fi
350for path in "$outPath" "$debianPath"; do350for path in "$outPath" "$debianPath"; do
351 find "$path" -name Sources.gz -exec zgrep '^Package: .*' {} \; | cut -c10- | sort -u > "$path/sources"351 find "$path" -name Sources.gz -exec zgrep '^Package: .*' {} \; | cut -c10- | sort -u > "$path/sources"
352done352done
353
354# generate source map cache pickles
355SOURCE_MAP_SCRIPT=$(dirname "$0")/gen-source-map-cache
356if [ -f "$SOURCE_MAP_SCRIPT" ]; then
357 if [ "$verbosity_args" != "-q" ]; then
358 echo "Generating Source Map Cache"
359 fi
360 $SOURCE_MAP_SCRIPT
361fi
diff --git a/scripts/source_map.py b/scripts/source_map.py
index 226b92e..c5ec9ad 100755
--- a/scripts/source_map.py
+++ b/scripts/source_map.py
@@ -18,6 +18,7 @@ import subprocess
18import sys18import sys
19import cve_lib19import cve_lib
20import yaml20import yaml
21import pickle
2122
22built_using_tags = ["Built-Using", "Static-Built-Using", "X-Cargo-Built-Using"]23built_using_tags = ["Built-Using", "Static-Built-Using", "X-Cargo-Built-Using"]
23apt_pkg.init_system()24apt_pkg.init_system()
@@ -216,6 +217,25 @@ def load(data_type='sources', pockets=None, releases=None, skip_eol_releases=Tru
216 if data_type not in ['sources', 'packages']:217 if data_type not in ['sources', 'packages']:
217 raise ValueError("'data_type' should be either 'sources' or 'packages'")218 raise ValueError("'data_type' should be either 'sources' or 'packages'")
218219
220 # Attempt to load from cached pickle files
221 config = read_config_file(os.path.expanduser("~/.ubuntu-cve-tracker.conf"))
222 if 'packages_mirror' in config:
223 mirror = config['packages_mirror']
224
225 if data_type == 'sources' and skip_eol_releases == True and arch == "amd64" and detailed == False:
226 if pockets == None and releases == None:
227 sources_pickle = os.path.join(mirror, 'sources_map.pickle')
228 if os.path.exists(sources_pickle):
229 with open(sources_pickle, 'rb') as handle:
230 map = pickle.load(handle)
231 return map
232 elif pockets == ["-proposed"] and releases == [cve_lib.devel_release]:
233 sources_pickle_dev_proposed = os.path.join(mirror, 'sources_map_dev_proposed.pickle')
234 if os.path.exists(sources_pickle_dev_proposed):
235 with open(sources_pickle_dev_proposed, 'rb') as handle:
236 map = pickle.load(handle)
237 return map
238
219 map = dict()239 map = dict()
220 if data_type == 'sources':240 if data_type == 'sources':
221 for item in _find_sources(pockets=pockets, releases=releases, skip_eol_releases=skip_eol_releases, arch=arch):241 for item in _find_sources(pockets=pockets, releases=releases, skip_eol_releases=skip_eol_releases, arch=arch):

Subscribers

People subscribed via source and target branches