Merge ~racb/git-ubuntu:changelog-date-parsing into git-ubuntu:master

Proposed by Robie Basak
Status: Merged
Merged at revision: 6ff151dc208809e1b8e3108c69c05090174e6221
Proposed branch: ~racb/git-ubuntu:changelog-date-parsing
Merge into: git-ubuntu:master
Diff against target: 195 lines (+135/-6)
3 files modified
gitubuntu/git_repository.py (+71/-6)
gitubuntu/git_repository_test.py (+59/-0)
gitubuntu/importer.py (+5/-0)
Reviewer Review Type Date Requested Status
Server Team CI bot continuous-integration Approve
Bryce Harrington Approve
Review via email: mp+387855@code.launchpad.net

Commit message

Make Jenkins happy

To post a comment you must log in.
Revision history for this message
Robie Basak (racb) wrote :

This branch is ready except for the final commit. I don't think the assertion is correct. We might also have to ensure that CI is running in C.UTF-8 - otherwise the assertion (when fixed) will fail.

Revision history for this message
Robie Basak (racb) wrote :

Another minor issue might be that git-ubuntu.self-test will start to fail when run in a different locale. I'm not sure how this should work - ideally git-ubuntu.self-test would mainly run in the user's locale except for the changelog date parsing.

Revision history for this message
Robie Basak (racb) wrote :

Thanks to Colin in #ubuntu-devel for clarifying how it's supposed to work. I think I've now got it correct and with a force push the branch is now ready.

Revision history for this message
Server Team CI bot (server-team-bot) wrote :

PASSED: Continuous integration, rev:f5cdca0822e95d6954ffaf43db6a94819ea4bd3e
https://jenkins.ubuntu.com/server/job/git-ubuntu-ci/530/
Executed test runs:
    SUCCESS: VM Setup
    SUCCESS: Build
    SUCCESS: Unit Tests
    IN_PROGRESS: Declarative: Post Actions

Click here to trigger a rebuild:
https://jenkins.ubuntu.com/server/job/git-ubuntu-ci/530//rebuild

review: Approve (continuous-integration)
Revision history for this message
Bryce Harrington (bryce) wrote :

8d447a009bc64585603751c5ec766927cb3b565f
  - In code docs, param changelog_date_timestamp should be changelog_timestamp_string
  - strptime() throws a ValueError on parsing error. May be worth testing for in git_authorship()?

26a3f3b66b4ae429116f11f1c7c493fc4ff7fcc4
  - Looks good, a concise solution to a rather messy problem. The set of regex's provided seem to cover the various combinations, and looks easy to add more cases to.
  - In the assertion message, why is %r used ("Could not parse date %r" % changelog_timestamp_string,) ? It appears to behave properly, but I would have used %s so I'm curious.
  - Thanks for documenting the encountered cases for the invalid dates in the test case.

f5cdca0822e95d6954ffaf43db6a94819ea4bd3e
  - Setting LC_ALL also may be changing collation, which can change sorting behavior. I doubt that will be an issue at all for git-ubuntu, but if odd things start happening could look unsetting LC_ALL and instead setting LC_TIME and LC_CTYPE.
  - That aside, LGTM +1

review: Approve
Revision history for this message
Robie Basak (racb) wrote :

Thanks for the review! I've addressed your comments in a new branch, and
I'll merge that as soon as CI passes.

On Wed, Jul 22, 2020 at 04:57:12PM -0000, Bryce Harrington wrote:
> Review: Approve
>
> 8d447a009bc64585603751c5ec766927cb3b565f
> - In code docs, param changelog_date_timestamp should be changelog_timestamp_string

Fixed, thanks.

> - strptime() throws a ValueError on parsing error. May be worth testing for in git_authorship()?

I added a new test that verifies the known bad cases that should fail.

> 26a3f3b66b4ae429116f11f1c7c493fc4ff7fcc4
> - Looks good, a concise solution to a rather messy problem. The set of regex's provided seem to cover the various combinations, and looks easy to add more cases to.
> - In the assertion message, why is %r used ("Could not parse date %r" % changelog_timestamp_string,) ? It appears to behave properly, but I would have used %s so I'm curious.

%r ensures to escape things that might be awkward for the terminal. As
we're reporting something wrong, it seems more likely than average that
this might happen here - for example with control characters or similar
that might not appear in a human-readable way in logs otherwise.

> - Thanks for documenting the encountered cases for the invalid dates in the test case.
>
> f5cdca0822e95d6954ffaf43db6a94819ea4bd3e
> - Setting LC_ALL also may be changing collation, which can change sorting behavior. I doubt that will be an issue at all for git-ubuntu, but if odd things start happening could look unsetting LC_ALL and instead setting LC_TIME and LC_CTYPE.

I agree there's a risk of disrupting things here. But ultimately we
probably want stable behaviour, so I think it's better to risk this
disruption now rather than later.

Revision history for this message
Server Team CI bot (server-team-bot) wrote :

PASSED: Continuous integration, rev:6ff151dc208809e1b8e3108c69c05090174e6221
https://jenkins.ubuntu.com/server/job/git-ubuntu-ci/532/
Executed test runs:
    SUCCESS: VM Setup
    SUCCESS: Build
    SUCCESS: Unit Tests
    IN_PROGRESS: Declarative: Post Actions

Click here to trigger a rebuild:
https://jenkins.ubuntu.com/server/job/git-ubuntu-ci/532//rebuild

review: Approve (continuous-integration)

Preview Diff

[H/L] Next/Prev Comment, [J/K] Next/Prev File, [N/P] Next/Prev Hunk
1diff --git a/gitubuntu/git_repository.py b/gitubuntu/git_repository.py
2index 4c24176..43c8355 100644
3--- a/gitubuntu/git_repository.py
4+++ b/gitubuntu/git_repository.py
5@@ -8,6 +8,7 @@ import datetime
6 import enum
7 from functools import lru_cache
8 import itertools
9+import locale
10 import logging
11 import os
12 import posixpath
13@@ -773,6 +774,75 @@ class Changelog:
14 return ret
15 return self._shell_srcpkg
16
17+ @staticmethod
18+ def _parse_changelog_date(changelog_timestamp_string):
19+ """Convert changelog timestamp into datetime object
20+
21+ This function currently requires the locale to have been set to C.UTF-8
22+ by the caller. This would typically be done at the main entry point to
23+ the importer.
24+
25+ :param str changelog_timestamp_string: the timestamp part of the the
26+ signoff line from a changelog entry
27+ :rtype: datetime.datetime
28+ :returns: the timestamp as a datetime object
29+ :raises ValueError: if the string could not be parsed
30+ """
31+ # We avoid using something like dateutil.parser here because the
32+ # parsing behaviour of malformed or unusually formatted dates must be
33+ # precisely as specified and not ever change behaviour. If it did, then
34+ # imports would no longer be reproducible.
35+ #
36+ # However, adding new a form of parsing an unambigious date is
37+ # acceptable if the spec is first updated accordingly since that would
38+ # only introduce new imports that would have previously failed.
39+ #
40+ # time.strptime ignores time zones, so we must use datetime.strptime()
41+
42+ # strptime doesn't support anything other than standard locale names
43+ # for days of the week, so handle the "Thur" abbreviation as a special
44+ # case as defined in the spec as it is unambiguous.
45+ adjusted_changelog_timestamp_string = re.sub(
46+ r'^Thur,',
47+ 'Thu,',
48+ changelog_timestamp_string,
49+ )
50+
51+ # Ensure that strptime will parse using the standard English names for
52+ # weekdays and months. We actually want to check for C.UTF-8, but
53+ # something seems to convert that to en_US.UTF-8. Either will do. It
54+ # would be an improvement for this function to become
55+ # locale-setting-independent, but that would mean finding some other
56+ # way to reproducibly parse changelog timestamps and finding that isn't
57+ # a priority right now.
58+ assert any([
59+ locale.getlocale(locale.LC_TIME) == ('C', 'UTF-8'),
60+ locale.getlocale(locale.LC_TIME) == ('en_US', 'UTF-8'),
61+ locale.getlocale(locale.LC_TIME) == (None, None)
62+ ])
63+
64+ acceptable_date_formats = [
65+ '%a, %d %b %Y %H:%M:%S %z', # standard
66+ '%A, %d %b %Y %H:%M:%S %z', # full day of week
67+ '%d %b %Y %H:%M:%S %z', # missing day of week
68+ '%a, %d %B %Y %H:%M:%S %z', # full month name
69+ '%A, %d %B %Y %H:%M:%S %z', # full day of week and month name
70+ '%d %B %Y %H:%M:%S %z', # missing day of week with full month
71+ # name
72+ ]
73+ for date_format in acceptable_date_formats:
74+ try:
75+ return datetime.datetime.strptime(
76+ adjusted_changelog_timestamp_string,
77+ date_format,
78+ )
79+ except ValueError:
80+ pass
81+ else:
82+ raise ValueError(
83+ "Could not parse date %r" % changelog_timestamp_string,
84+ )
85+
86 @property
87 def git_authorship(self):
88 """Extract last changelog entry's maintainer and timestamp
89@@ -789,12 +859,7 @@ class Changelog:
90 if m is None:
91 raise ValueError('Cannot get authorship')
92
93- # time.strptime ignores time zones, so we must use datetime.strptime()
94- parsed_date = datetime.datetime.strptime(
95- self.date,
96- '%a, %d %b %Y %H:%M:%S %z',
97- )
98-
99+ parsed_date = self._parse_changelog_date(self.date)
100 epoch_seconds = int(parsed_date.timestamp())
101
102 # seconds -> minutes
103diff --git a/gitubuntu/git_repository_test.py b/gitubuntu/git_repository_test.py
104index 2637915..4f52339 100644
105--- a/gitubuntu/git_repository_test.py
106+++ b/gitubuntu/git_repository_test.py
107@@ -108,6 +108,65 @@ def test_changelog_maintainer_invalid():
108 test_changelog.maintainer
109
110
111+@pytest.mark.parametrize(['input_date_string', 'expected_result'], [
112+ # The normal complete form
113+ ('Mon, 12 May 2016 08:14:34 -0700', (2016, 5, 12, 8, 14, 34, -7)),
114+ # Day of week missing, such as in:
115+ # datefudge 1.12
116+ ('12 May 2016 08:14:34 -0700', (2016, 5, 12, 8, 14, 34, -7)),
117+ # Full (not abbreviated) month name, such as in:
118+ # dnsmasq 2.32-2
119+ # dropbear 0.42-1
120+ # e2fsprogs 1.42.11-1
121+ # efibootmgr 0.5.4-7
122+ # hunspell-br 0.11-1
123+ # kubuntu-default-settings 1:6.06-22
124+ # libvformat 1.13-4
125+ ('12 June 2016 08:14:34 -0700', (2016, 6, 12, 8, 14, 34, -7)),
126+ # Full (not abbreviated) day of week name, such as in:
127+ # logcheck 1.2.22a
128+ ('Thursday, 15 May 2016 08:14:34 -0700', (2016, 5, 15, 8, 14, 34, -7)),
129+ # Part-abbreviated day of week name, such as in:
130+ # kubuntu-meta 1.76
131+ ('Thur, 15 May 2016 08:14:34 -0700', (2016, 5, 15, 8, 14, 34, -7)),
132+])
133+def test_parse_changelog_date(input_date_string, expected_result):
134+ """_parse_changelog_date should parse a basic date string correctly
135+
136+ :param str input_date_string: the timestamp part of the changelog signoff
137+ line
138+ :param tuple(int, int, int, int, int, int, int) expected_result: the
139+ expected parse result in (year, month, day, hour, minute, second,
140+ timezone_offset_in_hours) form. The actual expected result needs to be
141+ a datetime.datetime object; to avoid duplication in test parameters
142+ this will be instantiated within the test.
143+ """
144+ actual_result = target.Changelog._parse_changelog_date(input_date_string)
145+ expected_result_datetime = datetime.datetime(
146+ *expected_result[:6],
147+ tzinfo=datetime.timezone(datetime.timedelta(hours=expected_result[6])),
148+ )
149+ assert actual_result == expected_result_datetime
150+
151+
152+@pytest.mark.parametrize(['input_date_string'], [
153+ ('Mon, 30 Feb 2020 15:50:58 +0200',), # ghostscript 9.50~dfsg-5ubuntu4
154+ ('Mon, 03 Sep 2018 00:43:25 -7000',), # lxqt-config 0.13.0-0ubuntu4
155+ ('Tue, 17 May 2008 10:93:55 -0500',), # iscsitarget
156+ # 0.4.15+svn148-2.1ubuntu1
157+ ('Monu, 22 Jan 2007 22:10:50 -0500',), # mail-spf-perl 2.004-0ubuntu1
158+ ('Wed, 29 Augl 2007 16:14:11 +0200',), # nut 2.2.0-2
159+])
160+def test_changelog_date_parse_errors(input_date_string):
161+ """_parse_changelog_date should raise ValueError on illegal dates
162+
163+ :param str input_date_string: the timestamp part of the changelog signoff
164+ line
165+ """
166+ with pytest.raises(ValueError):
167+ target.Changelog._parse_changelog_date(input_date_string)
168+
169+
170 @pytest.mark.parametrize(
171 'changelog_name, name, email, epoch_seconds, offset', [
172 (
173diff --git a/gitubuntu/importer.py b/gitubuntu/importer.py
174index 014d6fd..880c834 100644
175--- a/gitubuntu/importer.py
176+++ b/gitubuntu/importer.py
177@@ -28,6 +28,7 @@ import argparse
178 import datetime
179 import functools
180 import getpass
181+import locale
182 import logging
183 import os
184 import re
185@@ -474,6 +475,10 @@ def main(
186 Returns 0 on successful import (which includes non-fatal failures);
187 1 otherwise.
188 """
189+ # For import reproduciblity, we must use the C.UTF-8 locale to parse dates,
190+ # etc.
191+ locale.setlocale(locale.LC_ALL, 'C.UTF-8')
192+
193 logging.info('Ubuntu Server Team importer v%s' % VERSION)
194
195 if not repo:

Subscribers

People subscribed via source and target branches