Merge ~racb/git-ubuntu:changelog-date-parsing into git-ubuntu:master
- Git
- lp:~racb/git-ubuntu
- changelog-date-parsing
- Merge into master
Status: | Merged |
---|---|
Merged at revision: | 6ff151dc208809e1b8e3108c69c05090174e6221 |
Proposed branch: | ~racb/git-ubuntu:changelog-date-parsing |
Merge into: | git-ubuntu:master |
Diff against target: |
195 lines (+135/-6) 3 files modified
gitubuntu/git_repository.py (+71/-6) gitubuntu/git_repository_test.py (+59/-0) gitubuntu/importer.py (+5/-0) |
Related bugs: |
Reviewer | Review Type | Date Requested | Status |
---|---|---|---|
Server Team CI bot | continuous-integration | Approve | |
Bryce Harrington | Approve | ||
Review via email: mp+387855@code.launchpad.net |
Commit message
Make Jenkins happy
Description of the change
Robie Basak (racb) wrote : | # |
Robie Basak (racb) wrote : | # |
Another minor issue might be that git-ubuntu.
Robie Basak (racb) wrote : | # |
Thanks to Colin in #ubuntu-devel for clarifying how it's supposed to work. I think I've now got it correct and with a force push the branch is now ready.
Server Team CI bot (server-team-bot) wrote : | # |
PASSED: Continuous integration, rev:f5cdca0822e
https:/
Executed test runs:
SUCCESS: VM Setup
SUCCESS: Build
SUCCESS: Unit Tests
IN_PROGRESS: Declarative: Post Actions
Click here to trigger a rebuild:
https:/
Bryce Harrington (bryce) wrote : | # |
8d447a009bc6458
- In code docs, param changelog_
- strptime() throws a ValueError on parsing error. May be worth testing for in git_authorship()?
26a3f3b66b4ae42
- Looks good, a concise solution to a rather messy problem. The set of regex's provided seem to cover the various combinations, and looks easy to add more cases to.
- In the assertion message, why is %r used ("Could not parse date %r" % changelog_
- Thanks for documenting the encountered cases for the invalid dates in the test case.
f5cdca0822e95d6
- Setting LC_ALL also may be changing collation, which can change sorting behavior. I doubt that will be an issue at all for git-ubuntu, but if odd things start happening could look unsetting LC_ALL and instead setting LC_TIME and LC_CTYPE.
- That aside, LGTM +1
Robie Basak (racb) wrote : | # |
Thanks for the review! I've addressed your comments in a new branch, and
I'll merge that as soon as CI passes.
On Wed, Jul 22, 2020 at 04:57:12PM -0000, Bryce Harrington wrote:
> Review: Approve
>
> 8d447a009bc6458
> - In code docs, param changelog_
Fixed, thanks.
> - strptime() throws a ValueError on parsing error. May be worth testing for in git_authorship()?
I added a new test that verifies the known bad cases that should fail.
> 26a3f3b66b4ae42
> - Looks good, a concise solution to a rather messy problem. The set of regex's provided seem to cover the various combinations, and looks easy to add more cases to.
> - In the assertion message, why is %r used ("Could not parse date %r" % changelog_
%r ensures to escape things that might be awkward for the terminal. As
we're reporting something wrong, it seems more likely than average that
this might happen here - for example with control characters or similar
that might not appear in a human-readable way in logs otherwise.
> - Thanks for documenting the encountered cases for the invalid dates in the test case.
>
> f5cdca0822e95d6
> - Setting LC_ALL also may be changing collation, which can change sorting behavior. I doubt that will be an issue at all for git-ubuntu, but if odd things start happening could look unsetting LC_ALL and instead setting LC_TIME and LC_CTYPE.
I agree there's a risk of disrupting things here. But ultimately we
probably want stable behaviour, so I think it's better to risk this
disruption now rather than later.
Server Team CI bot (server-team-bot) wrote : | # |
PASSED: Continuous integration, rev:6ff151dc208
https:/
Executed test runs:
SUCCESS: VM Setup
SUCCESS: Build
SUCCESS: Unit Tests
IN_PROGRESS: Declarative: Post Actions
Click here to trigger a rebuild:
https:/
Preview Diff
1 | diff --git a/gitubuntu/git_repository.py b/gitubuntu/git_repository.py |
2 | index 4c24176..43c8355 100644 |
3 | --- a/gitubuntu/git_repository.py |
4 | +++ b/gitubuntu/git_repository.py |
5 | @@ -8,6 +8,7 @@ import datetime |
6 | import enum |
7 | from functools import lru_cache |
8 | import itertools |
9 | +import locale |
10 | import logging |
11 | import os |
12 | import posixpath |
13 | @@ -773,6 +774,75 @@ class Changelog: |
14 | return ret |
15 | return self._shell_srcpkg |
16 | |
17 | + @staticmethod |
18 | + def _parse_changelog_date(changelog_timestamp_string): |
19 | + """Convert changelog timestamp into datetime object |
20 | + |
21 | + This function currently requires the locale to have been set to C.UTF-8 |
22 | + by the caller. This would typically be done at the main entry point to |
23 | + the importer. |
24 | + |
25 | + :param str changelog_timestamp_string: the timestamp part of the the |
26 | + signoff line from a changelog entry |
27 | + :rtype: datetime.datetime |
28 | + :returns: the timestamp as a datetime object |
29 | + :raises ValueError: if the string could not be parsed |
30 | + """ |
31 | + # We avoid using something like dateutil.parser here because the |
32 | + # parsing behaviour of malformed or unusually formatted dates must be |
33 | + # precisely as specified and not ever change behaviour. If it did, then |
34 | + # imports would no longer be reproducible. |
35 | + # |
36 | + # However, adding new a form of parsing an unambigious date is |
37 | + # acceptable if the spec is first updated accordingly since that would |
38 | + # only introduce new imports that would have previously failed. |
39 | + # |
40 | + # time.strptime ignores time zones, so we must use datetime.strptime() |
41 | + |
42 | + # strptime doesn't support anything other than standard locale names |
43 | + # for days of the week, so handle the "Thur" abbreviation as a special |
44 | + # case as defined in the spec as it is unambiguous. |
45 | + adjusted_changelog_timestamp_string = re.sub( |
46 | + r'^Thur,', |
47 | + 'Thu,', |
48 | + changelog_timestamp_string, |
49 | + ) |
50 | + |
51 | + # Ensure that strptime will parse using the standard English names for |
52 | + # weekdays and months. We actually want to check for C.UTF-8, but |
53 | + # something seems to convert that to en_US.UTF-8. Either will do. It |
54 | + # would be an improvement for this function to become |
55 | + # locale-setting-independent, but that would mean finding some other |
56 | + # way to reproducibly parse changelog timestamps and finding that isn't |
57 | + # a priority right now. |
58 | + assert any([ |
59 | + locale.getlocale(locale.LC_TIME) == ('C', 'UTF-8'), |
60 | + locale.getlocale(locale.LC_TIME) == ('en_US', 'UTF-8'), |
61 | + locale.getlocale(locale.LC_TIME) == (None, None) |
62 | + ]) |
63 | + |
64 | + acceptable_date_formats = [ |
65 | + '%a, %d %b %Y %H:%M:%S %z', # standard |
66 | + '%A, %d %b %Y %H:%M:%S %z', # full day of week |
67 | + '%d %b %Y %H:%M:%S %z', # missing day of week |
68 | + '%a, %d %B %Y %H:%M:%S %z', # full month name |
69 | + '%A, %d %B %Y %H:%M:%S %z', # full day of week and month name |
70 | + '%d %B %Y %H:%M:%S %z', # missing day of week with full month |
71 | + # name |
72 | + ] |
73 | + for date_format in acceptable_date_formats: |
74 | + try: |
75 | + return datetime.datetime.strptime( |
76 | + adjusted_changelog_timestamp_string, |
77 | + date_format, |
78 | + ) |
79 | + except ValueError: |
80 | + pass |
81 | + else: |
82 | + raise ValueError( |
83 | + "Could not parse date %r" % changelog_timestamp_string, |
84 | + ) |
85 | + |
86 | @property |
87 | def git_authorship(self): |
88 | """Extract last changelog entry's maintainer and timestamp |
89 | @@ -789,12 +859,7 @@ class Changelog: |
90 | if m is None: |
91 | raise ValueError('Cannot get authorship') |
92 | |
93 | - # time.strptime ignores time zones, so we must use datetime.strptime() |
94 | - parsed_date = datetime.datetime.strptime( |
95 | - self.date, |
96 | - '%a, %d %b %Y %H:%M:%S %z', |
97 | - ) |
98 | - |
99 | + parsed_date = self._parse_changelog_date(self.date) |
100 | epoch_seconds = int(parsed_date.timestamp()) |
101 | |
102 | # seconds -> minutes |
103 | diff --git a/gitubuntu/git_repository_test.py b/gitubuntu/git_repository_test.py |
104 | index 2637915..4f52339 100644 |
105 | --- a/gitubuntu/git_repository_test.py |
106 | +++ b/gitubuntu/git_repository_test.py |
107 | @@ -108,6 +108,65 @@ def test_changelog_maintainer_invalid(): |
108 | test_changelog.maintainer |
109 | |
110 | |
111 | +@pytest.mark.parametrize(['input_date_string', 'expected_result'], [ |
112 | + # The normal complete form |
113 | + ('Mon, 12 May 2016 08:14:34 -0700', (2016, 5, 12, 8, 14, 34, -7)), |
114 | + # Day of week missing, such as in: |
115 | + # datefudge 1.12 |
116 | + ('12 May 2016 08:14:34 -0700', (2016, 5, 12, 8, 14, 34, -7)), |
117 | + # Full (not abbreviated) month name, such as in: |
118 | + # dnsmasq 2.32-2 |
119 | + # dropbear 0.42-1 |
120 | + # e2fsprogs 1.42.11-1 |
121 | + # efibootmgr 0.5.4-7 |
122 | + # hunspell-br 0.11-1 |
123 | + # kubuntu-default-settings 1:6.06-22 |
124 | + # libvformat 1.13-4 |
125 | + ('12 June 2016 08:14:34 -0700', (2016, 6, 12, 8, 14, 34, -7)), |
126 | + # Full (not abbreviated) day of week name, such as in: |
127 | + # logcheck 1.2.22a |
128 | + ('Thursday, 15 May 2016 08:14:34 -0700', (2016, 5, 15, 8, 14, 34, -7)), |
129 | + # Part-abbreviated day of week name, such as in: |
130 | + # kubuntu-meta 1.76 |
131 | + ('Thur, 15 May 2016 08:14:34 -0700', (2016, 5, 15, 8, 14, 34, -7)), |
132 | +]) |
133 | +def test_parse_changelog_date(input_date_string, expected_result): |
134 | + """_parse_changelog_date should parse a basic date string correctly |
135 | + |
136 | + :param str input_date_string: the timestamp part of the changelog signoff |
137 | + line |
138 | + :param tuple(int, int, int, int, int, int, int) expected_result: the |
139 | + expected parse result in (year, month, day, hour, minute, second, |
140 | + timezone_offset_in_hours) form. The actual expected result needs to be |
141 | + a datetime.datetime object; to avoid duplication in test parameters |
142 | + this will be instantiated within the test. |
143 | + """ |
144 | + actual_result = target.Changelog._parse_changelog_date(input_date_string) |
145 | + expected_result_datetime = datetime.datetime( |
146 | + *expected_result[:6], |
147 | + tzinfo=datetime.timezone(datetime.timedelta(hours=expected_result[6])), |
148 | + ) |
149 | + assert actual_result == expected_result_datetime |
150 | + |
151 | + |
152 | +@pytest.mark.parametrize(['input_date_string'], [ |
153 | + ('Mon, 30 Feb 2020 15:50:58 +0200',), # ghostscript 9.50~dfsg-5ubuntu4 |
154 | + ('Mon, 03 Sep 2018 00:43:25 -7000',), # lxqt-config 0.13.0-0ubuntu4 |
155 | + ('Tue, 17 May 2008 10:93:55 -0500',), # iscsitarget |
156 | + # 0.4.15+svn148-2.1ubuntu1 |
157 | + ('Monu, 22 Jan 2007 22:10:50 -0500',), # mail-spf-perl 2.004-0ubuntu1 |
158 | + ('Wed, 29 Augl 2007 16:14:11 +0200',), # nut 2.2.0-2 |
159 | +]) |
160 | +def test_changelog_date_parse_errors(input_date_string): |
161 | + """_parse_changelog_date should raise ValueError on illegal dates |
162 | + |
163 | + :param str input_date_string: the timestamp part of the changelog signoff |
164 | + line |
165 | + """ |
166 | + with pytest.raises(ValueError): |
167 | + target.Changelog._parse_changelog_date(input_date_string) |
168 | + |
169 | + |
170 | @pytest.mark.parametrize( |
171 | 'changelog_name, name, email, epoch_seconds, offset', [ |
172 | ( |
173 | diff --git a/gitubuntu/importer.py b/gitubuntu/importer.py |
174 | index 014d6fd..880c834 100644 |
175 | --- a/gitubuntu/importer.py |
176 | +++ b/gitubuntu/importer.py |
177 | @@ -28,6 +28,7 @@ import argparse |
178 | import datetime |
179 | import functools |
180 | import getpass |
181 | +import locale |
182 | import logging |
183 | import os |
184 | import re |
185 | @@ -474,6 +475,10 @@ def main( |
186 | Returns 0 on successful import (which includes non-fatal failures); |
187 | 1 otherwise. |
188 | """ |
189 | + # For import reproduciblity, we must use the C.UTF-8 locale to parse dates, |
190 | + # etc. |
191 | + locale.setlocale(locale.LC_ALL, 'C.UTF-8') |
192 | + |
193 | logging.info('Ubuntu Server Team importer v%s' % VERSION) |
194 | |
195 | if not repo: |
This branch is ready except for the final commit. I don't think the assertion is correct. We might also have to ensure that CI is running in C.UTF-8 - otherwise the assertion (when fixed) will fail.