Merge ~eslerm/ubuntu-cve-tracker:nvd-api-client into ubuntu-cve-tracker:master
Status: | Merged |
---|---|
Merged at revision: | 6ac40f852bfe5131090cacc156ce6af015c2be35 |
Proposed branch: | ~eslerm/ubuntu-cve-tracker:nvd-api-client |
Merge into: | ubuntu-cve-tracker:master |
Diff against target: | 417 lines (+411/-0), 1 file modified: scripts/nvd_api_client.py (+411/-0) |
Related bugs: |
Reviewer | Review Type | Date Requested | Status |
---|---|---|---|
Seth Arnold | | | Approve |
Steve Beattie | | | Pending |
Alex Murray | | | Pending |
Review via email: mp+448538@code.launchpad.net |
Commit message
nvd-api-client init
Description of the change
Alex Murray (alexmurray) wrote:
Mark Esler (eslerm) wrote:
Thank you Alex!
1. Can do.
Could we add `[DEFAULT]` to the top of our team's ~/.ubuntu-
Then the config becomes a valid INI file for the Python built-in configparser: https:/
2. automagic --since is doable, but I have a concern about --init
The maintenance function searches a time span of modified CVE records. NVD adds metrics after CVE record creation, so --since needs to be set to the most recent lastModified value from the local dataset. Just finding the most recently modified local file is good enough: even if there is an older lastModified in the local dataset, the overlap between them is small.
Their API doesn't document this, but searching a lastModified date of >6 months returns a 404. I should handle that.
(Note that these API searches download unpublished CVEs. This is acceptable/desired, since the CVE List is the primary source of CVE data which should drive triage. NVD data is supplemental to CVE List data.)
With an automated --init, a misconfigured path could cause 1.3G of API strain to NVD. I added a prompt in the init function to slow down users. Is it okay to keep that? Item X might change the context of --init.
3. That would be helpful :D
I reworked --debug and uncommitted changes look like:
```
./scripts/
DEBUG: searching for modified NVD CVEs between 2023-07-
DEBUG: local NVD mirror path is "/home/
DEBUG: saved results 0 through 2000 of 4834
DEBUG: saved results 2000 through 4000 of 4834
DEBUG: saved results 4000 through 4834 of 4834
DEBUG: NVD sync complete \o/
```
How does that look?
X. Ideally we should maintain NVD data from a central source to prevent discrepancies. Seth suggested that we may want to explore Canonistack.
Y. I'm hoping this work will also benefit https:/
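On handling the lastModified limit from item 2: one option is to split a long span into consecutive shorter searches instead of failing. The following is a minimal sketch, assuming a 120-day cap (the value the proposed script checks against; the real limit is not confirmed here). `split_window` and `MAX_DAYS` are hypothetical names and are not part of the branch.

```python
from datetime import datetime, timedelta, timezone

# assumption: the API rejects lastModified spans longer than this many days
MAX_DAYS = 120


def split_window(start, end, max_days=MAX_DAYS):
    """yield (start, end) pairs covering start..end in spans of at most max_days"""
    step = timedelta(days=max_days)
    while start < end:
        stop = min(start + step, end)
        yield start, stop
        start = stop


if __name__ == "__main__":
    since = datetime(2023, 1, 1, tzinfo=timezone.utc)
    until = datetime(2023, 10, 1, tzinfo=timezone.utc)
    for lo, hi in split_window(since, until):
        # each pair could become one lastModStartDate/lastModEndDate query
        print(lo.isoformat(), hi.isoformat())
```

Each pair would still need the same paging and sleeping as a single search, so a long gap costs proportionally more requests.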
Mark Esler (eslerm) wrote:
Alex, I've added automation and believe this is ready for review.
Alex Murray (alexmurray) wrote:
Thanks Mark - apologies for the delay in this review.
The only other thing that would be great to see is some tests. Whilst we haven't traditionally had a lot of tests for our different internal tools, I think we should always aim to improve things going forward, so would you be able to add some tests for this as well?
Mark Esler (eslerm) wrote:
Hi Alex, thanks for the review.
I can add tests. Test suggestions are very welcome.
Mark Esler (eslerm) wrote:
This now natively works with ~/.ubuntu-
I did not want to implement reading this file like the rest of UCT, since that requires a non-built-in library. Ultimately, I believe we should use the INI standard so we can use configparser. (This could just be adding `[DEFAULT]` to the top of the config, and updating UCT...)
Seth Arnold (seth-arnold) wrote:
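To illustrate the `[DEFAULT]` idea without touching the file on disk, the header can be supplied in memory when it is missing. This is a rough sketch, not the code in the branch: `read_flat_conf` is a hypothetical helper, and it assumes every non-comment line is a plain `key=value` pair and that any surrounding quotes should be dropped.

```python
import configparser


def read_flat_conf(text):
    """parse key=value text, supplying a [DEFAULT] header if none is present"""
    parser = configparser.ConfigParser(interpolation=None)
    try:
        parser.read_string(text)
    except configparser.MissingSectionHeaderError:
        # start again with a fresh parser and a header added in memory
        parser = configparser.ConfigParser(interpolation=None)
        parser.read_string("[DEFAULT]\n" + text)
    # assumption: shell-style values may be quoted, so strip the quotes
    return {key: value.strip("\"'") for key, value in parser["DEFAULT"].items()}


if __name__ == "__main__":
    print(read_flat_conf("nvd_path=/tmp/nvd/\n"))
    print(read_flat_conf('[DEFAULT]\nnvd_path="/tmp/nvd/"\n'))
```

Both calls go through configparser, so the same reader would cover a headerless file and a real INI file.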
Thanks for tackling this! It'd be nice to have some test cases where that makes sense, and I'm afraid that having two different configuration files for this will cause problems in the long run. I'd suggest trimming out the new ~/.config/
Thanks
Steve Beattie (sbeattie) wrote:
On Tue, Oct 17, 2023 at 09:12:46PM -0000, Mark Esler wrote:
> This now natively works with ~/.ubuntu-
>
> I did not want to implement reading this file like the rest of UCT, since that requires a non-built-in library. Ultimately, I believe we should use the INI standard so we can use configparser. (This could just be adding `[DEFAULT]` to the top of the config, And updating UCT...)
The reason to not use an INI style format is because shell scripts also
use this config file via '.' / 'source' (see e.g. packages-mirror).
We could fix that by having a python script that converts the [insert
flame war over config file format here] config file to
something that shell scripts could evaluate dynamically; and in fact
it'd be nice to have a cve_lib.sh that shell scripts could source that
did this work for them and also accumulate common UCT shell
functionality.
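As a rough illustration of the conversion idea, a script could print the `[DEFAULT]` section of an INI file as quoted shell assignments. This is only a sketch under assumptions: `ini_to_shell` and `VALID_NAME` are hypothetical names, keys that are not valid shell variable names are silently skipped, and nothing here reflects how packages-mirror or cve_lib actually read the config.

```python
import configparser
import re
import shlex

# assumption: only keys that are valid shell variable names are emitted
VALID_NAME = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")


def ini_to_shell(text):
    """return shell assignment lines for the [DEFAULT] section of INI text"""
    parser = configparser.ConfigParser(interpolation=None)
    # keep key case as written; configparser lowercases keys by default
    parser.optionxform = str
    parser.read_string(text)
    lines = []
    for key, value in parser["DEFAULT"].items():
        if VALID_NAME.fullmatch(key):
            # shlex.quote makes the value safe for the shell to evaluate
            lines.append(f"{key}={shlex.quote(value)}")
    return "\n".join(lines)


if __name__ == "__main__":
    print(ini_to_shell("[DEFAULT]\nnvd_path=/tmp/nvd/\nname=a b\n"))
```

A shell script could then evaluate that output rather than sourcing the config file directly; the exact wiring is left open here.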
--
Steve Beattie
<email address hidden>
Preview Diff
1 | diff --git a/scripts/nvd_api_client.py b/scripts/nvd_api_client.py |
2 | new file mode 100755 |
3 | index 0000000..493b3ad |
4 | --- /dev/null |
5 | +++ b/scripts/nvd_api_client.py |
6 | @@ -0,0 +1,411 @@ |
7 | +#!/usr/bin/env python3 |
8 | + |
9 | +""" |
10 | +nvd-api-client: download and maintain NVD's CVE dataset |
11 | + |
12 | + |
13 | +Configure path to local NVD mirror by creating an INI file located in |
14 | +~/.config/nvd-api-client.conf similar to: |
15 | + |
16 | + [DEFAULT] |
17 | + nvd_path=/home/eslerm/mirrors/nvd/ |
18 | + |
19 | +Alternatively, the non-INI ~/.ubuntu-cve-tracker.conf can be used with the same |
20 | +key. |
21 | + |
22 | +Make sure to create this directory! |
23 | + |
24 | + |
25 | +nvd-api-client has three primary modes: |
26 | + |
27 | + --init |
28 | + |
29 | + To initialize the mirror by downloading NVD's CVE dataset, run: |
30 | + ./scripts/nvd_api_client --init |
31 | + and follow the prompt. |
32 | + |
33 | + --maintain-since |
34 | + |
35 | + To maintain your NVD CVE dataset mirror, run the following command with the |
36 | + date set to the last time maintenance was run: |
37 | + ./scripts/nvd_api_client --maintain-since 2022-12-25 |
38 | + The above command will download all CVEs since December 25th 2022 UTC until |
39 | + now. |
40 | + |
41 | + ISO-8601 datetime is also allowed as maintenance input: |
42 | + ./scripts/nvd_api_client --maintain-since 2023-08-01T00:00:00 |
43 | + ./scripts/nvd_api_client --maintain-since 2023-08-01T00:00:00.000001+00:00 |
44 | + |
45 | + The --maintain-since value must be within 120 days of today. (This is an |
46 | + undocumented API restriction.) |
47 | + |
48 | + --auto |
49 | + |
50 | + To automatically maintain your dataset (without needing to know when |
51 | + maintenance was last run) run: |
52 | + ./scripts/nvd_api_client --auto |
53 | + |
54 | +All modes accept --debug or --verbose which print information in stderr. |
55 | + nb: use these options to monitor update progress |
56 | +""" |
57 | + |
58 | + |
59 | +__author__ = "Mark Esler" |
60 | +__copyright__ = "Copyright (C) 2023 Canonical Ltd." |
61 | +__license__ = "BSD-3-Clause" |
62 | +__version__ = "1.0" |
63 | + |
64 | + |
65 | +import argparse |
66 | +import configparser |
67 | +from datetime import datetime, timezone |
68 | +import json |
69 | +from pathlib import Path |
70 | +import sys |
71 | +import time |
72 | +from typing import Optional |
73 | +import requests |
74 | + |
75 | + |
76 | +# API Client Headers |
77 | +HEADERS = {"Accept-Language": "en-US", "User-Agent": "nvd-api-client"} |
78 | + |
79 | + |
80 | +# NVD_API_KEY not implemented |
81 | +NVD_API_KEY = None |
82 | + |
83 | + |
84 | +# seconds to wait after a request |
85 | +# maximally efficient timing isn't critical |
86 | +# NVD's public rate limit is 5 requests in a rolling 30 second window |
87 | +# public default based on 5 / 30 * 2 = 12, round down to 10 requests a minute |
88 | +# sleeping 6.0 seconds aligns with NVD's Best Practices |
89 | +if NVD_API_KEY: |
90 | + # 50 requests in a rolling 30 second window |
91 | + RATE_LIMIT = 0.60 |
92 | +else: |
93 | + RATE_LIMIT = 6.0 |
94 | + |
95 | + |
96 | +# requests timeout |
97 | +TIMEOUT = 30.0 |
98 | + |
99 | + |
100 | +def debug(msg: str) -> None: |
101 | + """print to stderr""" |
102 | + print("DEBUG: " + msg, file=sys.stderr) |
103 | + |
104 | + |
105 | +def find_conf() -> Path: |
106 | + """find configuration file""" |
107 | + for filename in [".ubuntu-cve-tracker.conf", ".config/nvd-api-client.conf"]: |
108 | + path = Path.home() / filename |
109 | + if path.is_file(): |
110 | + return path |
111 | + raise ValueError( |
112 | + """ |
113 | +No configuration file. |
114 | +Create ~/.ubuntu-cve-tracker.conf or ~/.config/nvd-api-client.conf""" |
115 | + ) |
116 | + |
117 | + |
118 | +def load_path(conf: Path) -> Path: |
119 | + """ |
120 | + read configuration file for path to local NVD mirror |
121 | + |
122 | + UCT does not use an INI style configuration file. Code in this section |
123 | + is a little messy to accommodate this. The rest of UCT requires a |
124 | + non-built-in Python package, which this seeks to avoid. |
125 | + """ |
126 | + config = configparser.ConfigParser() |
127 | + try: |
128 | + # nb: encoding is unset |
129 | + with open(conf) as file: |
130 | + try: |
131 | + config.read_file(file) |
132 | + try: |
133 | + path = Path(config["DEFAULT"]["nvd_path"]) |
134 | + except KeyError as exc: |
135 | + raise KeyError( |
136 | + "nvd_path not defined in configuration file" |
137 | + ) from exc |
138 | + # this is for uct not using an INI file |
139 | + except configparser.MissingSectionHeaderError: |
140 | + for line in file: |
141 | + if line.startswith("nvd_path="): |
142 | + path = Path(line.split("=")[1][:-1]) |
143 | + except OSError as exc: |
144 | + msg = f"error reading {conf}" |
145 | + raise OSError(msg) from exc |
146 | + if DEBUG: |
147 | + debug(f"local NVD mirror path is {path}") |
148 | + return path |
149 | + |
150 | + |
151 | +def verify_dirs() -> Path: |
152 | + """create directory structure if needed and return local NVD mirror path""" |
153 | + config_path = find_conf() |
154 | + nvd_path = load_path(config_path) |
155 | + |
156 | + nvd_path.mkdir(parents=True, exist_ok=True) |
157 | + |
158 | + current_year = int(time.strftime("%Y", time.gmtime())) |
159 | + for i in range(1999, current_year + 1): |
160 | + Path(nvd_path / str(i)).mkdir(parents=True, exist_ok=True) |
161 | + |
162 | + return nvd_path |
163 | + |
164 | + |
165 | +def get_url(url: str) -> requests.models.Response: |
166 | + """ |
167 | + return a url response after sleeping |
168 | + |
169 | + NOTE: could be modified for https://github.com/tomasbasham/ratelimit |
170 | + """ |
171 | + if VERBOSE: |
172 | + debug(f"requesting {url}") |
173 | + response = requests.get(url, timeout=TIMEOUT, headers=HEADERS) |
174 | + |
175 | + if response.status_code != 200: |
176 | + msg = f"API response: {response.status_code}" |
177 | + raise Exception(msg) |
178 | + |
179 | + time.sleep(RATE_LIMIT) |
180 | + |
181 | + return response |
182 | + |
183 | + |
184 | +def save_cve(page_json: dict, nvd_path: Path) -> None: |
185 | + """save all json files from a page""" |
186 | + for i in page_json["vulnerabilities"]: |
187 | + cve = i["cve"] |
188 | + year = cve["id"][4:8] |
189 | + file_path = Path(f'{nvd_path / year / cve["id"]}.json') |
190 | + if VERBOSE: |
191 | + debug(f'saving {cve["id"]}') |
192 | + with open(file_path, "w", encoding="utf-8") as file: |
193 | + json.dump(cve, file) |
194 | + |
195 | + |
196 | +def save_pages(query: Optional[str] = None) -> None: |
197 | + """ |
198 | + get all pages of CVE results and save them |
199 | + |
200 | + see https://nvd.nist.gov/developers/vulnerabilities for parameters |
201 | + """ |
202 | + |
203 | + nvd_path = verify_dirs() |
204 | + |
205 | + base_url = "https://services.nvd.nist.gov/rest/json/cves/2.0" |
206 | + start_index = 0 |
207 | + results_per_page = 2000 |
208 | + total_results = results_per_page + 1 |
209 | + |
210 | + while start_index < total_results: |
211 | + if query: |
212 | + url = ( |
213 | + f"{base_url}?{query}&" |
214 | + + f"resultsPerPage={results_per_page}&startIndex={start_index}" |
215 | + ) |
216 | + else: |
217 | + url = ( |
218 | + f"{base_url}?" |
219 | + + f"resultsPerPage={results_per_page}&startIndex={start_index}" |
220 | + ) |
221 | + |
222 | + page = get_url(url) |
223 | + page_json = page.json() |
224 | + page.close() |
225 | + |
226 | + save_cve(page_json, nvd_path) |
227 | + |
228 | + total_results = page_json["totalResults"] |
229 | + |
230 | + if DEBUG: |
231 | + if total_results == 0: |
232 | + debug("no new updates from NVD") |
233 | + elif (start_index + results_per_page) >= total_results: |
234 | + debug( |
235 | + f"saved results {start_index} through {total_results}" |
236 | + + f" of {total_results}" |
237 | + ) |
238 | + else: |
239 | + debug( |
240 | + f"saved results {start_index} through {start_index + results_per_page}" |
241 | + + f" of {total_results}" |
242 | + ) |
243 | + |
244 | + start_index += results_per_page |
245 | + |
246 | + |
247 | +def nvd_init() -> None: |
248 | + """ |
249 | + create initial NVD dataset |
250 | + |
251 | + NVD's Best Practices for Initial Data Population state: |
252 | + - Users should start by calling the API beginning with a startIndex of 0 |
253 | + - Iterative requests should increment the startIndex by the value of |
254 | + resultsPerPage until the response's startIndex has exceeded the value |
255 | + in totalResults |
256 | + NVD text accessed Aug 1st 2023 |
257 | + - https://nvd.nist.gov/developers/start-here |
258 | + """ |
259 | + res = input( |
260 | + 'Are you certain that you want to download all NVD data? Enter "Yes" to agree: ' |
261 | + ) |
262 | + if res == "Yes": |
263 | + save_pages() |
264 | + |
265 | + |
266 | +def nvd_maintain(since: datetime) -> None: |
267 | + """ |
268 | + maintain NVD dataset |
269 | + |
270 | + set the since datetime to the time that NVD dataset was last maintained |
271 | + |
272 | + it is not recommended to run this function more than once every two hours |
273 | + |
274 | + large organizations should use a single requester |
275 | + |
276 | + see https://nvd.nist.gov/developers/vulnerabilities for parameters |
277 | + |
278 | + NVD's Best Practices for Maintaining Data state: |
279 | + - After initial data population has occurred, the last modified date |
280 | + parameters provide an efficient way to update a user's local |
281 | + repository and stay within the API rate limits. No more than once |
282 | + every two hours, automated requests should include a range where |
283 | + lastModStartDate equals the time of the last CVE or CPE received and |
284 | + lastModEndDate equals the current time. |
285 | + - It is recommended that users "sleep" their scripts for six seconds |
286 | + between requests. |
287 | + - It is recommended to use the default resultsPerPage value as this value |
288 | + has been optimized for the API response. |
289 | + - Enterprise scale development should enforce these practices through a |
290 | + single requestor to ensure all users are in sync and have the latest |
291 | + CVE, Change History, CPE, and CPE match criteria information. |
292 | + NVD text accessed Aug 1st 2023 |
293 | + - https://nvd.nist.gov/developers/start-here |
294 | + """ |
295 | + start_date = since.isoformat() |
296 | + end_date = datetime.now(timezone.utc).isoformat() |
297 | + |
298 | + if DEBUG: |
299 | + debug(f"searching for modified NVD CVEs between {start_date} and {end_date}") |
300 | + |
301 | + query = f"lastModStartDate={start_date}&lastModEndDate={end_date}".replace( |
302 | + "+", "%2B" |
303 | + ) |
304 | + |
305 | + save_pages(query) |
306 | + |
307 | + |
308 | +def check_last_modified(last_modified: datetime) -> None: |
309 | + """raise error if an unallowed lastModified date is requested""" |
310 | + delta = last_modified - datetime.now(timezone.utc) |
311 | + if delta.days < -120: |
312 | + msg = "NVD API does not allow searching lastModified dates greater than 120 days ago" |
313 | + raise argparse.ArgumentTypeError(msg) |
314 | + |
315 | + |
316 | +# https://stackoverflow.com/questions/25470844/specify-date-format-for-python- |
317 | +# argparse-input-arguments |
318 | +def format_date(date_str: str) -> datetime: |
319 | + """ |
320 | + verify and format a date string into a datetime for NVD's API |
321 | + |
322 | + always returns UTC |
323 | + |
324 | + note that converting the datetime to a string requires .replace("+", "%2B") |
325 | + before running get_url() |
326 | + """ |
327 | + try: |
328 | + # API requires microseconds |
329 | + date = datetime.strptime(date_str, "%Y-%m-%d").replace( |
330 | + tzinfo=timezone.utc, microsecond=1 |
331 | + ) |
332 | + except ValueError: |
333 | + try: |
334 | + date = datetime.fromisoformat(date_str).replace(tzinfo=timezone.utc) |
335 | + except ValueError as exc: |
336 | + msg = f"not a valid date: {date_str}" |
337 | + raise argparse.ArgumentTypeError(msg) from exc |
338 | + return date |
339 | + |
340 | + |
341 | +def nvd_last_modified_file() -> datetime: |
342 | + """ |
343 | + search local dataset for most recent lastModified value |
344 | + |
345 | + inefficiency is fine if user does not know when maintenance was last run |
346 | + """ |
347 | + nvd_path = verify_dirs() |
348 | + if DEBUG: |
349 | + debug("searching NVD dataset for most recent lastModified value") |
350 | + # compare strings instead of datetimes |
351 | + last_modified_string = "0" |
352 | + for path in nvd_path.rglob("*"): |
353 | + if path.is_dir(): |
354 | + continue |
355 | + try: |
356 | + # nb: encoding is unset |
357 | + with open(path) as file: |
358 | + data = json.load(file) |
359 | + except OSError as exc: |
360 | + msg = f"error reading {path}" |
361 | + raise OSError(msg) from exc |
362 | + if data["lastModified"] > last_modified_string: |
363 | + last_modified_string = data["lastModified"] |
364 | + if DEBUG: |
365 | + debug(f"most recent lastModified value is: {last_modified_string}") |
366 | + last_modified = format_date(last_modified_string) |
367 | + check_last_modified(last_modified) |
368 | + return last_modified |
369 | + |
370 | + |
371 | +def nvd_auto() -> None: |
372 | + """run nvd_maintain with most recent lastModified value in dataset""" |
373 | + last_modified = nvd_last_modified_file() |
374 | + check_last_modified(last_modified) |
375 | + nvd_maintain(last_modified) |
376 | + |
377 | + |
378 | +if __name__ == "__main__": |
379 | + parser = argparse.ArgumentParser(description="NVD API Client") |
380 | + parser.add_argument( |
381 | + "--init", |
382 | + help="initialize mirror of NVD dataset", |
383 | + action="store_true", |
384 | + ) |
385 | + parser.add_argument( |
386 | + "-s", |
387 | + "--maintain-since", |
388 | + help="maintain NVD dataset since YYYY-MM-DD or ISO-8601 datetime", |
389 | + type=format_date, |
390 | + ) |
391 | + parser.add_argument("--auto", help="automated maintenance", action="store_true") |
392 | + parser.add_argument("--debug", help="add debug info", action="store_true") |
393 | + parser.add_argument("--verbose", help="add verbose debug info", action="store_true") |
394 | + |
395 | + args = parser.parse_args() |
396 | + |
397 | + if args.verbose: |
398 | + VERBOSE = True |
399 | + DEBUG = True |
400 | + elif args.debug: |
401 | + VERBOSE = False |
402 | + DEBUG = True |
403 | + else: |
404 | + VERBOSE = False |
405 | + DEBUG = False |
406 | + |
407 | + if args.init: |
408 | + nvd_init() |
409 | + elif args.auto: |
410 | + nvd_auto() |
411 | + elif args.maintain_since: |
412 | + nvd_maintain(args.maintain_since) |
413 | + else: |
414 | + raise ValueError("an argument is needed, see --help") |
415 | + |
416 | + if DEBUG: |
417 | + debug("NVD sync complete \\o/") |
A few high-level comments (I haven't yet actually run the code but will try that soon)
1. You should be able to use the configobj package to parse the configuration file rather than hand-parsing this (see cve_lib.py for some historical code for this)
2. Would it be possible to make the script as automagic as possible? i.e. when it is run, it goes and looks for existing json and if that doesn't exist, then it does --init automatically. But if it does exist, then instead it uses the timestamp of that json to infer the --since date? You can keep both the --init and --since parameters as I can imagine they may be useful, but in general when we can infer and do-the-right-thing I think we should.
3. It might be useful to show a progress bar or similar AND perhaps show some indication when sleeping, since the current implementation looks like it will sleep for 6 seconds after each request, which will take a long time. It would be good to give the user some kind of indication of how long this is expected to take, or at least show some kind of progress along the way so they don't think the script has hung. See the use of progressbar in sis-changes or check-cves for inspiration
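For point 3, a rough time estimate can be derived from the first response alone, since it reports the total number of results. The sketch below assumes the page size (2000) and sleep (6 seconds) used in the proposed script. The function names are hypothetical, and it deliberately does not use the progressbar module, whose usage in sis-changes and check-cves is not shown here.

```python
import math

RESULTS_PER_PAGE = 2000  # page size used in the proposed script
SLEEP_SECONDS = 6.0  # sleep after each request in the proposed script


def pages_needed(total_results, per_page=RESULTS_PER_PAGE):
    """number of requests needed; at least one request is always made"""
    return max(1, math.ceil(total_results / per_page))


def estimate_seconds(total_results, per_page=RESULTS_PER_PAGE, sleep=SLEEP_SECONDS):
    """lower bound on run time: counts sleeping only, not download time"""
    return pages_needed(total_results, per_page) * sleep


def progress_line(page_number, total_results, sleep=SLEEP_SECONDS):
    """one status line to print after page_number pages have been saved"""
    pages = pages_needed(total_results)
    remaining = (pages - page_number) * sleep
    return f"page {page_number}/{pages}, about {remaining:.0f}s of sleeping left"


if __name__ == "__main__":
    total = 4834  # totalResults value from the debug output earlier in the thread
    for page in range(1, pages_needed(total) + 1):
        print(progress_line(page, total))
```

Printing a line like this after every page would at least show that the script has not hung, even without a real progress bar.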