Merge ~mdeslaur/ubuntu-cve-tracker:perf-part4 into ubuntu-cve-tracker:master

Proposed by Marc Deslauriers
Status: Merged
Merge reported by: Marc Deslauriers
Merged at revision: 06df23418f8851d45c435207f0e5831826805293
Proposed branch: ~mdeslaur/ubuntu-cve-tracker:perf-part4
Merge into: ubuntu-cve-tracker:master
Diff against target: 129 lines (+80/-6)
4 files modified
scripts/active_edit (+1/-6)
scripts/gen-source-map-cache (+50/-0)
scripts/packages-mirror (+9/-0)
scripts/source_map.py (+20/-0)
Reviewer Review Type Date Requested Status
Alex Murray Approve
Review via email: mp+461826@code.launchpad.net

Commit message

commit 7e46124a0687ea0b4ae955cd23d4d3fbde6ec9ed
Author: Marc Deslauriers <email address hidden>
Date: Tue Mar 5 12:40:42 2024 -0500

    gen-source-map-cache: create pickles of cached source map data

    Generating a source map is an expensive operation. This script
    generates source map data and stores it into pickles in the
    packages-mirror directory for use by source_map.py later on.

commit 45f1855791946615b8b7fd261979e7168addadcb
Author: Marc Deslauriers <email address hidden>
Date: Tue Mar 5 12:42:59 2024 -0500

    packages-mirror: call gen-source-map-cache to update pickles

commit 52c25d36857a9284fd4066f4f1dff1c580dd8691
Author: Marc Deslauriers <email address hidden>
Date: Tue Mar 5 12:43:26 2024 -0500

    source_map.py: improve performance by using cached data

    If a pickle containing cached information is available, load that
    instead of parsing all the Sources files.

commit 5762b55746375254778688fe3d0d64f1811d45f7 (HEAD -> perf-part4)
Author: Marc Deslauriers <email address hidden>
Date: Tue Mar 5 12:45:45 2024 -0500

    active_edit: use the default source_map options to get the cache

Description of the change

This branch creates pickles of cached source map data whenever packages-mirror has been run. This massively improves runtime of small scripts such as active_edit:

active_edit before:
$ time ./scripts/active_edit -c CVE-2024-NNN1 -p python-django

real 0m3.406s
user 0m3.290s
sys 0m0.116s

active_edit after:
$ time ./scripts/active_edit -c CVE-2024-NNN1 -p python-django

real 0m0.443s
user 0m0.387s
sys 0m0.056s

check-syntax before:
$ time ./scripts/check-syntax -j8
real 0m50.289s
user 6m6.409s
sys 0m6.995s

check-syntax after:
$ time ./scripts/check-syntax -j8
real 0m48.658s
user 6m19.725s
sys 0m7.410s

To post a comment you must log in.
Revision history for this message
Marc Deslauriers (mdeslaur) wrote :

This needs work still, 5762b55746375254778688fe3d0d64f1811d45f7 is incorrect.

Revision history for this message
Marc Deslauriers (mdeslaur) wrote :

So 5762b55746375254778688fe3d0d64f1811d45f7 modifies the way active_edit will create a new CVE file. It will no longer add the end of standard support releases...for example:

Patches_anacron:
upstream_anacron: needs-triage
trusty/esm_anacron: needs-triage
esm-infra/xenial_anacron: needs-triage
esm-infra/bionic_anacron: needs-triage
focal_anacron: needs-triage
jammy_anacron: needs-triage
mantic_anacron: needs-triage
devel_anacron: needs-triage

instead of:

Patches_anacron:
upstream_anacron: needs-triage
trusty_anacron: ignored (end of standard support)
trusty/esm_anacron: needs-triage
xenial_anacron: ignored (end of standard support)
esm-infra/xenial_anacron: needs-triage
bionic_anacron: ignored (end of standard support)
esm-infra/bionic_anacron: needs-triage
focal_anacron: needs-triage
jammy_anacron: needs-triage
mantic_anacron: needs-triage
devel_anacron: needs-triage

Do we really need those releases to be listed in new CVE files? All they do is add noise, and if nothing uses them, perhaps we should simply leave them out?

Revision history for this message
Alex Murray (alexmurray) wrote :

Re: the old releaase - I always thought they seemed redundant so I am happy to seem them go.

One question - why not use `detailed=True` for the cache so that it is as complete as possible since we are generating it out-of-band then we may as well put as much in it as we can IMO.

Revision history for this message
Marc Deslauriers (mdeslaur) wrote :

> Re: the old releaase - I always thought they seemed redundant so I am happy to
> seem them go.
>
> One question - why not use `detailed=True` for the cache so that it is as
> complete as possible since we are generating it out-of-band then we may as
> well put as much in it as we can IMO.

Good point, I'll measure if there's a runtime search performance difference with a bigger dataset, and if it's negligible, I'll switch it to detailed.

Revision history for this message
Marc Deslauriers (mdeslaur) wrote :

So there's a good difference with the bigger dataset, almost half a second longer:

detailed=True:
$ time ./scripts/active_edit -c CVE-2024-NNN1 -p python-django

real 0m0.848s
user 0m0.752s
sys 0m0.096s

detailed=False:
$ time ./scripts/active_edit -c CVE-2024-NNN1 -p python-django

real 0m0.440s
user 0m0.383s
sys 0m0.056s

I think we're better off keeping a non-detailed dataset, and if we think other tools could benefit from this, we can certainly generate a detailed dataset too. There's no reason why we can't generate a few different pickles there, depending on usage.

Revision history for this message
Alex Murray (alexmurray) wrote :

Awesome - LGTM - thanks for doing that additional analysis Marc - I agree, the extra overhead is not worth it so let's go for performance and like you say we can add additional caches in the future if needed.

review: Approve
Revision history for this message
Eduardo Barretto (ebarretto) wrote :

> So 5762b55746375254778688fe3d0d64f1811d45f7 modifies the way active_edit will
> create a new CVE file. It will no longer add the end of standard support
> releases...for example:
>
> Patches_anacron:
> upstream_anacron: needs-triage
> trusty/esm_anacron: needs-triage
> esm-infra/xenial_anacron: needs-triage
> esm-infra/bionic_anacron: needs-triage
> focal_anacron: needs-triage
> jammy_anacron: needs-triage
> mantic_anacron: needs-triage
> devel_anacron: needs-triage
>
>
> instead of:
>
> Patches_anacron:
> upstream_anacron: needs-triage
> trusty_anacron: ignored (end of standard support)
> trusty/esm_anacron: needs-triage
> xenial_anacron: ignored (end of standard support)
> esm-infra/xenial_anacron: needs-triage
> bionic_anacron: ignored (end of standard support)
> esm-infra/bionic_anacron: needs-triage
> focal_anacron: needs-triage
> jammy_anacron: needs-triage
> mantic_anacron: needs-triage
> devel_anacron: needs-triage
>
> Do we really need those releases to be listed in new CVE files? All they do is
> add noise, and if nothing uses them, perhaps we should simply leave them out?

from an OVAL point of view, I don't think this will be a problem as we moved into having one OVAL (CVE- or PKG-based) per release, therefore it will consider trusty/esm in that case, instead of trusty, for example.

I believe that won't create an issue for the Web CVE Tracker, but maybe it might be worth testing with a single CVE file and seeing if anything happens in the website.

Revision history for this message
Paulo Flabiano Smorigo (pfsmorigo) wrote :

LGTM. Maybe just fix the lpci failures:
scripts/packages-mirror: line 358: ./scripts/gen-source-map-cache: No such file or directory

06df234... by Marc Deslauriers

packages-mirror: don't generate pickles if the tool is missing

Revision history for this message
Seth Arnold (seth-arnold) wrote :

Nice, LGTM :)

Preview Diff

[H/L] Next/Prev Comment, [J/K] Next/Prev File, [N/P] Next/Prev Hunk
1diff --git a/scripts/active_edit b/scripts/active_edit
2index 321f39a..a0e0efa 100755
3--- a/scripts/active_edit
4+++ b/scripts/active_edit
5@@ -37,12 +37,7 @@ parser.add_option("-C", "--cvss", help="CVSS3.1 rating", metavar="CVSS:3.1/AV:_/
6 parser.add_option("-d", "--description", help="Description", default=None)
7 (options, args) = parser.parse_args()
8
9-
10-source_releases = []
11-for release in cve_lib.all_releases:
12- if cve_lib.is_active_esm_release(release) or cve_lib.is_active_release(release):
13- source_releases.append(release)
14-source = source_map.load(releases=source_releases, skip_eol_releases=False, detailed=False)
15+source = source_map.load(detailed=False)
16
17 added_rel_pkg = []
18
19diff --git a/scripts/gen-source-map-cache b/scripts/gen-source-map-cache
20new file mode 100755
21index 0000000..afc6956
22--- /dev/null
23+++ b/scripts/gen-source-map-cache
24@@ -0,0 +1,50 @@
25+#!/usr/bin/python3
26+# -*- coding: utf-8 -*-
27+#
28+# This script pre-calculates some sources maps and saves them as pickle
29+# files to speed up being used by various tools.
30+#
31+# Author: Marc Deslauriers <marc.deslauriers@canonical.com>
32+# Copyright (C) 2024 Canonical Ltd.
33+#
34+# This script is distributed under the terms and conditions of the GNU General
35+# Public License, Version 3 or later. See http://www.gnu.org/copyleft/gpl.html
36+# for details.
37+#
38+
39+import sys
40+import os
41+import pickle
42+import cve_lib
43+import source_map
44+
45+def main():
46+ config = source_map.read_config_file(os.path.expanduser("~/.ubuntu-cve-tracker.conf"))
47+ if 'packages_mirror' not in config:
48+ sys.stderr.write("\rERROR: Could not find packages_mirror in config file!\n")
49+ sys.exit(1)
50+
51+ mirror = config['packages_mirror']
52+ sources_pickle = os.path.join(mirror, 'sources_map.pickle')
53+ sources_pickle_dev_proposed = os.path.join(mirror, 'sources_map_dev_proposed.pickle')
54+
55+ # Remove these first so that source_map.load doesn't load them!
56+ if os.path.exists(sources_pickle):
57+ os.unlink(sources_pickle)
58+ if os.path.exists(sources_pickle_dev_proposed):
59+ os.unlink(sources_pickle_dev_proposed)
60+
61+ sources = source_map.load(detailed=False)
62+ sources_dev_proposed = source_map.load(pockets=["-proposed"],
63+ releases=[cve_lib.devel_release],
64+ detailed=False)
65+
66+ # Use protocol=3 to remain compatible with focal
67+ with open(sources_pickle, 'wb') as handle:
68+ pickle.dump(sources, handle, protocol=3)
69+
70+ with open(sources_pickle_dev_proposed, 'wb') as handle:
71+ pickle.dump(sources_dev_proposed, handle, protocol=3)
72+
73+if __name__ == '__main__':
74+ main()
75diff --git a/scripts/packages-mirror b/scripts/packages-mirror
76index f7a9886..067e128 100755
77--- a/scripts/packages-mirror
78+++ b/scripts/packages-mirror
79@@ -350,3 +350,12 @@ fi
80 for path in "$outPath" "$debianPath"; do
81 find "$path" -name Sources.gz -exec zgrep '^Package: .*' {} \; | cut -c10- | sort -u > "$path/sources"
82 done
83+
84+# generate source map cache pickles
85+SOURCE_MAP_SCRIPT=$(dirname "$0")/gen-source-map-cache
86+if [ -f "$SOURCE_MAP_SCRIPT" ]; then
87+ if [ "$verbosity_args" != "-q" ]; then
88+ echo "Generating Source Map Cache"
89+ fi
90+ $SOURCE_MAP_SCRIPT
91+fi
92diff --git a/scripts/source_map.py b/scripts/source_map.py
93index 226b92e..c5ec9ad 100755
94--- a/scripts/source_map.py
95+++ b/scripts/source_map.py
96@@ -18,6 +18,7 @@ import subprocess
97 import sys
98 import cve_lib
99 import yaml
100+import pickle
101
102 built_using_tags = ["Built-Using", "Static-Built-Using", "X-Cargo-Built-Using"]
103 apt_pkg.init_system()
104@@ -216,6 +217,25 @@ def load(data_type='sources', pockets=None, releases=None, skip_eol_releases=Tru
105 if data_type not in ['sources', 'packages']:
106 raise ValueError("'data_type' should be either 'sources' or 'packages'")
107
108+ # Attempt to load from cached pickle files
109+ config = read_config_file(os.path.expanduser("~/.ubuntu-cve-tracker.conf"))
110+ if 'packages_mirror' in config:
111+ mirror = config['packages_mirror']
112+
113+ if data_type == 'sources' and skip_eol_releases == True and arch == "amd64" and detailed == False:
114+ if pockets == None and releases == None:
115+ sources_pickle = os.path.join(mirror, 'sources_map.pickle')
116+ if os.path.exists(sources_pickle):
117+ with open(sources_pickle, 'rb') as handle:
118+ map = pickle.load(handle)
119+ return map
120+ elif pockets == ["-proposed"] and releases == [cve_lib.devel_release]:
121+ sources_pickle_dev_proposed = os.path.join(mirror, 'sources_map_dev_proposed.pickle')
122+ if os.path.exists(sources_pickle_dev_proposed):
123+ with open(sources_pickle_dev_proposed, 'rb') as handle:
124+ map = pickle.load(handle)
125+ return map
126+
127 map = dict()
128 if data_type == 'sources':
129 for item in _find_sources(pockets=pockets, releases=releases, skip_eol_releases=skip_eol_releases, arch=arch):

Subscribers

People subscribed via source and target branches