Merge lp:~dpm/launchpad/translations-exporter into lp:launchpad

Proposed by David Planella on 2012-09-14
Status: Work in progress
Proposed branch: lp:~dpm/launchpad/translations-exporter
Merge into: lp:launchpad
Diff against target: 335 lines (+315/-0)
3 files modified
cronscripts/translations-export-stats.py (+25/-0)
lib/lp/services/config/schema-lazr.conf (+6/-0)
lib/lp/translations/scripts/export_stats.py (+284/-0)
To merge this branch: bzr merge lp:~dpm/launchpad/translations-exporter
Reviewer Review Type Date Requested Status
Abel Deuring (community) code 2012-09-14 Approve on 2013-04-12
Review via email: mp+124373@code.launchpad.net

Description of the Change

This is a result of the request on RT #55759 to move the Ubuntu translations exporter script [1] into the Launchpad tree.

As it's my first ever Launchpad merge proposal, I'm submitting this as work in progress to get some initial feedback if overall what I'm trying to do and the code looks good. I'll start working on the tests once that's been reviewed and I'm certain I'm doing the right thing :).

In short, this code is thought to be run daily to provide a tarball with an export of Ubuntu translations stats. The data is fetched by querying the Launchpad database. The data will then be used to produce the translations coverage report at release time [2] and also to provide a list of of priority templates to help the community focus on the most important translations to complete, as well as producing graphs of their progress.

There are some TODO comments in the code for a couple of areas I was not sure how to go about. The most important one is the fact that the tarballs that are generated with the database dump are uploaded to Librarian, but as they are not listed anywhere in Launchpad, I have no way of knowing their URL in advance when I'll want to fetch them. I thought perhaps a URL alias such as the ones we use for https://translations.launchpad.net/ubuntu/hardy/+latest-full-language-pack could be useful (e.g. .../+latest-stats-export). I could not really figure out how to do it. Any pointers? Or other approaches?

Thanks.

[1] https://launchpad.net/lp-get-ul10nstats/
[2] http://people.canonical.com/~dpm/stats/ubuntu-12.04-translation-stats.html

To post a comment you must log in.
Abel Deuring (adeuring) wrote :
Download full text (15.2 KiB)

Hi David,

firstly, a few general remarks:

- The big obstacle: "thou shalt not increase the LOC count for (the branch
  you are landing code in) unless" (quoted from
  https://dev.launchpad.net/PolicyAndProcess/MaintenanceCosts?highlight=%28loc%29

  See the wiki page for more details -- you can either try to convice
  somebody in charge that this is a good change, or you can "compess" the
  existing LP code base first.

  Personally, I think this script would be worth to add, but I am not the
  project lead...

> This is a result of the request on RT #55759 to move the Ubuntu translations
> exporter script [1] into the Launchpad tree.
>
> As it's my first ever Launchpad merge proposal, I'm submitting this as work
> in progress to get some initial feedback if overall what I'm trying to do
> and the code looks good. I'll start working on the tests once that's been
> reviewed and I'm certain I'm doing the right thing :).
>
> In short, this code is thought to be run daily to provide a tarball with an
> export of Ubuntu translations stats. The data is fetched by querying the
> Launchpad database. The data will then be used to produce the translations
> coverage report at release time [2] and also to provide a list of of
> priority templates to help the community focus on the most important
> translations to complete, as well as producing graphs of their progress.
>
> There are some TODO comments in the code for a couple of areas I was not
> sure how to go about. The most important one is the fact that the tarballs
> that are generated with the database dump are uploaded to Librarian, but as
> they are not listed anywhere in Launchpad, I have no way of knowing their
> URL in advance when I'll want to fetch them. I thought perhaps a URL alias
> such as the ones we use for
> https://translations.launchpad.net/ubuntu/hardy/+latest-full-language-pack
> could be useful (e.g. .../+latest-stats-export). I could not really figure
> out how to do it. Any pointers? Or other approaches?

Such a URL would mean that the Librarian data would have to be fed through
the app server in order to reach the client.

But the core problem is that a Librarian file that is not referenced from
anywhere will be deleted by a cron job. So we need a column in some LP
DB table that points to the LibraryFileAlias. I think a new column
DistroSeries.translation_statistics would do the job.

Adding this column requires a separate branch and merge proposal,
see https://dev.launchpad.net/PolicyAndProcess/DatabaseSchemaChangesProcess

This may look a bit scary, but just adding a column is not difficult.
But I suspect that the column will need an index so that the garbo job
that deletes unreferenced LFA records can run efficiently. But stub
ask stub if my suspicion is right. Anyway, the index should be added
in a separate branch.

database/schema/patch-2209-28-1.sql and database/schema/patch-2209-28-2.sql
are an example for such a DB schema change.

And once the DB column exists, you can add a new property to the Python
class DistroSeries and export it to the API.

>
> Thanks.
>
> [1] https://launchpad.net/lp-get-ul10nstats/
> [2] http://people.canonical.com/~dpm/stats/ubuntu-12...

David Planella (dpm) wrote :
Download full text (16.4 KiB)

Al 14/09/12 13:38, En/na Abel Deuring ha escrit:
> Hi David,
>
> firstly, a few general remarks:
>
> - The big obstacle: "thou shalt not increase the LOC count for (the branch
> you are landing code in) unless" (quoted from
> https://dev.launchpad.net/PolicyAndProcess/MaintenanceCosts?highlight=%28loc%29
>
> See the wiki page for more details -- you can either try to convice
> somebody in charge that this is a good change, or you can "compess" the
> existing LP code base first.
>
> Personally, I think this script would be worth to add, but I am not the
> project lead...
>

Thanks for the heads up. As per the conversation just now on
#launchpad-dev, a waiver has been granted to move this code into LP:

<lifeless> I'll grant a waiver
 this code already exists in the wrong place
 its not adding debt to move it into LP
<lifeless> its reducing debt by getting it into the right place.

http://irclogs.ubuntu.com/2012/09/14/%23launchpad-dev.html#t11:54

I'll apply fixes for all the rest of points mentioned and will talk to
stub to double-check on the database changes you're describing, as
suggested.

Thanks!

Cheers,
David.

>
>> This is a result of the request on RT #55759 to move the Ubuntu translations
>> exporter script [1] into the Launchpad tree.
>>
>> As it's my first ever Launchpad merge proposal, I'm submitting this as work
>> in progress to get some initial feedback if overall what I'm trying to do
>> and the code looks good. I'll start working on the tests once that's been
>> reviewed and I'm certain I'm doing the right thing :).
>>
>> In short, this code is thought to be run daily to provide a tarball with an
>> export of Ubuntu translations stats. The data is fetched by querying the
>> Launchpad database. The data will then be used to produce the translations
>> coverage report at release time [2] and also to provide a list of of
>> priority templates to help the community focus on the most important
>> translations to complete, as well as producing graphs of their progress.
>>
>> There are some TODO comments in the code for a couple of areas I was not
>> sure how to go about. The most important one is the fact that the tarballs
>> that are generated with the database dump are uploaded to Librarian, but as
>> they are not listed anywhere in Launchpad, I have no way of knowing their
>> URL in advance when I'll want to fetch them. I thought perhaps a URL alias
>> such as the ones we use for
>> https://translations.launchpad.net/ubuntu/hardy/+latest-full-language-pack
>> could be useful (e.g. .../+latest-stats-export). I could not really figure
>> out how to do it. Any pointers? Or other approaches?
>
> Such a URL would mean that the Librarian data would have to be fed through
> the app server in order to reach the client.
>
> But the core problem is that a Librarian file that is not referenced from
> anywhere will be deleted by a cron job. So we need a column in some LP
> DB table that points to the LibraryFileAlias. I think a new column
> DistroSeries.translation_statistics would do the job.
>
> Adding this column requires a separate branch and merge proposal,
> see https://dev.launchpad.net/PolicyAndProcess/Databas...

Robert Collins (lifeless) wrote :

On Fri, Sep 14, 2012 at 11:38 PM, Abel Deuring
<email address hidden> wrote:
> Hi David,
>
> firstly, a few general remarks:
>
> - The big obstacle: "thou shalt not increase the LOC count for (the branch
> you are landing code in) unless" (quoted from
> https://dev.launchpad.net/PolicyAndProcess/MaintenanceCosts?highlight=%28loc%29
>
> See the wiki page for more details -- you can either try to convice
> somebody in charge that this is a good change, or you can "compess" the
> existing LP code base first.
>
> Personally, I think this script would be worth to add, but I am not the
> project lead...

I'm granting a waiver; the code already exists, so moving into tree
reduces maintenance overhead.

Richard Harding (rharding) wrote :

I'm going to mark this as in progress then since it looks like it might need to wait on a db patch to get through the system before moving forward. Once the db side has been worked out it this can go forward per Abel.

Abel Deuring (adeuring) wrote :

On 14.09.2012 13:38, Abel Deuring wrote:

> But the core problem is that a Librarian file that is not referenced from
> anywhere will be deleted by a cron job. So we need a column in some LP
> DB table that points to the LibraryFileAlias. I think a new column
> DistroSeries.translation_statistics would do the job.

Another suggestion: It might make sense not only to add a column like
DistroSeries.translation_statistics, but also another column
DistroSeries.translation_statistics_updated, a datetime column
containing, well, the date and time of the most recent update of the
other column.

Abel Deuring (adeuring) wrote :

As stub wrote on IRC:

(10:08:00) stub: Assuming it hasn't atrophied, this can just land and run with the --output= option right now.
(10:08:36) stub: And I think it is a minor change to get the librarian url reported rather than the file alias id

so, r=me

review: Approve (code)

Unmerged revisions

15945. By David Planella on 2012-09-14

Converted Ubuntu translations export script to a LaunchpadCronScript

15944. By David Planella on 2012-09-13

Added exporter script, without modifications

15943. By David Planella on 2012-09-13

Switched to use Launchpad config

15942. By David Planella on 2012-09-13

Added wrapper for translations export script

Preview Diff

[H/L] Next/Prev Comment, [J/K] Next/Prev File, [N/P] Next/Prev Hunk
1=== added file 'cronscripts/translations-export-stats.py'
2--- cronscripts/translations-export-stats.py 1970-01-01 00:00:00 +0000
3+++ cronscripts/translations-export-stats.py 2012-09-14 09:49:44 +0000
4@@ -0,0 +1,25 @@
5+#!/usr/bin/python -S
6+#
7+# Copyright 2009 Canonical Ltd. This software is licensed under the
8+# GNU Affero General Public License version 3 (see the file LICENSE).
9+
10+# pylint: disable-msg=C0103,W0403
11+
12+"""Export translations stats"""
13+
14+__metaclass__ = type
15+__all__ = []
16+
17+import _pythonpath
18+
19+from lp.translations.scripts.export_stats import (
20+ TranslationsStatsExporter,
21+ )
22+from lp.services.config import config
23+
24+
25+if __name__ == '__main__':
26+ script = TranslationsStatsExporter(
27+ 'translations-export-stats',
28+ dbuser=config.translationsstatsexporter.dbuser)
29+ script.lock_and_run()
30
31=== modified file 'lib/lp/services/config/schema-lazr.conf'
32--- lib/lp/services/config/schema-lazr.conf 2012-08-09 04:44:13 +0000
33+++ lib/lp/services/config/schema-lazr.conf 2012-09-14 09:49:44 +0000
34@@ -1610,6 +1610,12 @@
35 dbuser: targetnamecacheupdater
36
37
38+[translationsstatsexporter]
39+# The database user which will be used by this process.
40+# datatype: string
41+dbuser: ubuntu_localisation_statistics
42+
43+
44 [updateremoteproduct]
45 # The database user to run this process as.
46 # datatype: string
47
48=== added file 'lib/lp/translations/scripts/export_stats.py'
49--- lib/lp/translations/scripts/export_stats.py 1970-01-01 00:00:00 +0000
50+++ lib/lp/translations/scripts/export_stats.py 2012-09-14 09:49:44 +0000
51@@ -0,0 +1,284 @@
52+# Copyright 2009 Canonical Ltd. This software is licensed under the
53+# GNU Affero General Public License version 3 (see the file LICENSE).
54+
55+"""Export Ubuntu translations statistics from the database for the use of
56+external clients.
57+
58+These stats are used to produce the translations coverage report at
59+release time [1] and also to produce a list of of priority templates
60+to help the community focus on the most important translations to complete,
61+as well as producing graphs of their progress.
62+
63+[1] http://people.canonical.com/~dpm/stats/ubuntu-12.04-translation-stats.html
64+"""
65+
66+__metaclass__ = type
67+
68+__all__ = [
69+ 'TranslationsStatsExporter',
70+ ]
71+
72+from lp.services.scripts.base import LaunchpadCronScript
73+from lp.services.database.sqlbase import (
74+ cursor,
75+ sqlvalues,
76+ )
77+from lp.services.librarian.interfaces.client import (
78+ ILibrarianClient,
79+ UploadFailed,
80+ )
81+from lp.services.tarfile_helpers import LaunchpadWriteTarFile
82+from zope.component import getUtility
83+from optparse import (
84+ Option,
85+ OptionValueError,
86+ )
87+from shutil import copyfileobj
88+import sys
89+import logging
90+import json
91+from tempfile import TemporaryFile
92+import datetime
93+
94+
95+# PostgreSQL queries to run against the given database to produce the desired
96+# output.
97+# If you need to add a new query, remember to add it to the QUERIES dict
98+# below
99+PACKAGES_QUERY = """
100+select
101+ sourcepackagename.name AS sourcepackage,
102+ potemplate.translation_domain,
103+ potemplate.name AS template_name,
104+ potemplate.date_last_updated AS template_last_updated,
105+ pofile.date_changed AS translation_last_updated,
106+ language.code AS language,
107+ (pofile.currentcount+pofile.rosettacount) AS translated,
108+ pofile.unreviewed_count AS unreviewed,
109+ pofile.updatescount AS changed,
110+ potemplate.messagecount AS total
111+ from potemplate
112+ join sourcepackagename on sourcepackagename=sourcepackagename.id
113+ join distroseries on distroseries=distroseries.id
114+ join pofile on pofile.potemplate=potemplate.id
115+ join language on pofile.language=language.id
116+ where
117+ (potemplate.iscurrent or sourcepackagename.name='openoffice.org') and
118+ distroseries.name='{0}'
119+ order by language.code,
120+ sourcepackagename.name,
121+ potemplate.translation_domain,
122+ date_last_updated;"""
123+
124+POTEMPLATES_QUERY = """
125+select
126+ sourcepackagename.name AS sourcepackage,
127+ potemplate.translation_domain,
128+ potemplate.name AS template_name,sqlvalues
129+ potemplate.messagecount AS total,
130+ potemplate.iscurrent AS enabled,
131+ potemplate.languagepack,
132+ potemplate.priority,
133+ potemplate.date_last_updated
134+ from potemplate
135+ join sourcepackagename on sourcepackagename=sourcepackagename.id
136+ join distroseries on distroseries=distroseries.id
137+ where
138+ distroseries.name='{0}'
139+ order by sourcepackagename.name,
140+ potemplate.name;"""
141+
142+# This is a test query in order to obtain some results from
143+# the database from a local Launchpad installation, which
144+# otherwise returns empty results for the above queries
145+ALL_POTEMPLATES_QUERY = """
146+select
147+ potemplate.translation_domain,
148+ potemplate.name AS template_name,
149+ potemplate.messagecount AS total,
150+ potemplate.iscurrent AS enabled,
151+ potemplate.languagepack,
152+ potemplate.priority,
153+ potemplate.date_last_updated
154+ from potemplate;
155+"""
156+
157+# Add PostgreSQL queries here using the format
158+# 'output-filename': SQL_QUERY_VARIABLE
159+# Any query you add will be executed against the
160+# given database
161+QUERIES = {'package-stats': PACKAGES_QUERY,
162+ 'potemplate-stats': POTEMPLATES_QUERY,
163+# 'all-potemplate-stats': ALL_POTEMPLATES_QUERY,
164+ }
165+
166+
167+class ExportQuery(object):
168+ """Class to contain a database query to be used to export translations
169+ data.
170+
171+ :param query: String that contains the database query to execute.
172+ :param query_id: String that describes the query.
173+ :param distro_codename: String to specify the distro series the query
174+ applies to (e.g. 'precise')
175+ :param output_file: String to optionally specify the output file path
176+ (or stdout) to save the query results to. Usually the results will
177+ be uploaded to the Librarian.
178+
179+ Supported storage formats for the results:
180+ - 'json': .json file containing a JSON structure
181+
182+ """
183+
184+ def __init__(self, query, query_id, distro_codename, output_file=None):
185+ self.query = query.format(sqlvalues(distro_codename))
186+ self.query_id = query_id
187+ self.distro_codename = distro_codename
188+ self.output_file = output_file
189+
190+ def run(self):
191+ """Run a database query and save the results to disk or stdout"""
192+
193+ cur = cursor()
194+ self.logger.debug("Executing query: {0}...".format(self.query))
195+ cur.execute(self.query)
196+
197+ # Fetch the descriptions to have them in the output
198+ desc = [col[0] for col in cur.description]
199+ logging.debug("Cursor description: {0}".format(desc))
200+
201+ results = cur.fetchall()
202+ self.logger.debug("Query results: {0}".format(results))
203+
204+ self.__publish_result(results, desc)
205+
206+ def __publish_result(self, results, description):
207+ """Publish the results of a database query to export translation data
208+ and save them to disk to a given file or to the Librarian
209+
210+ :param results: List of tuples containing the results from a
211+ translations export database query
212+ :param description: String containing the description of the query.
213+ It will be embedded in the filename of the file to save to.
214+ """
215+
216+ # Save the results into a JSON file and compress it into a tarball
217+ filehandle = TemporaryFile()
218+ archive = LaunchpadWriteTarFile(filehandle)
219+ output_string = self.__postprocess_result(results, description)
220+ archive.add_file(
221+ self.__get_output_filename('json'), output_string)
222+
223+ archive.close()
224+ size = filehandle.tell()
225+ filehandle.seek(0)
226+
227+ # Now save the tarball
228+ if self.output_file is not None:
229+ # Save the tarball to a file.
230+ if self.output_file == '-':
231+ output_filehandle = sys.stdout
232+ else:
233+ output_filehandle = file(self.output_file, 'wb')
234+
235+ copyfileobj(filehandle, output_filehandle)
236+ else:
237+ # Upload the tarball to the librarian.
238+ try:
239+ uploader = getUtility(ILibrarianClient)
240+ # For tar.gz files, the standard content type is
241+ # application/x-gtar. You can see more info on
242+ # http://en.wikipedia.org/wiki/List_of_archive_formats
243+ file_alias = uploader.addFile(
244+ name=self.__get_output_filename('.tar.gz'),
245+ size=size,
246+ file=filehandle,
247+ contentType='application/x-gtar')
248+ except UploadFailed as e:
249+ self.logger.error('Uploading to the Librarian failed: %s', e)
250+ return None
251+ except:
252+ # Bare except statements are used in order to prevent premature
253+ # termination of the script.
254+ self.logger.exception(
255+ 'Uncaught exception while uploading to the Librarian')
256+ return None
257+
258+ # TODO: register upload as 'latest-translations-stats-export'
259+ # TODO: delete old files
260+ self.logger.debug('Upload complete, file alias: %d' % file_alias)
261+
262+ def __get_output_filename(self, extension):
263+ """Builds and returns the full name of the output file to be saved to
264+ disk.
265+
266+ :param extension: String to specify the filename extension
267+ :return: String with the name of the output file
268+ """
269+ return(self.distro_codename + '-' + self.query_id + '.' + extension)
270+
271+ def __postprocess_result(self, results, description):
272+ """Process the results of a translations database export query and
273+ convert them to machine-readable formats.
274+
275+ Currently only JSON is supported.
276+
277+ :param results: List of tuples containing the results from a
278+ translations export database query
279+ :param description: String containing the description of the query.
280+ It will be embedded in the filename of the file to save to.
281+ :return: String with the converted results
282+ """
283+
284+ dthandler = lambda obj: obj.isoformat() \
285+ if isinstance(obj, datetime.datetime) else None
286+ data = [dict(zip(description, prop)) for prop in results]
287+
288+ output_string = json.dumps(data, default=dthandler)
289+
290+ return output_string
291+
292+
293+class TranslationsStatsExporter(LaunchpadCronScript):
294+ """Execute the required database queries to obtain a raw database dump
295+ in text files of a given format (currently JSON), containing statistics
296+ about translation coverage for all the languages and translatable
297+ packages under the Ubuntu project in Launchpad.
298+ """
299+
300+ description = "Export Ubuntu translations stats from the database."
301+ loglevel = logging.INFO
302+
303+ my_options = [
304+ Option(
305+ '-d', '--distro-codename', dest='distro_codename',
306+ default=None, action='store',
307+ help="The distro to get stats for (e.g. 'precise')"),
308+ Option(
309+ '-o', '--output',
310+ dest='output',
311+ default=None,
312+ action='store',
313+ help='A file to send the generated tarball to, rather than the'
314+ ' Libraran.'
315+ )
316+ ]
317+
318+ def add_my_options(self):
319+ """See `LaunchpadScript`."""
320+ self.parser.add_options(self.my_options)
321+
322+ def main(self):
323+ if not self.options.distro_codename:
324+ raise OptionValueError("A distro codename needs to be specified")
325+
326+ # TODO: validate distro codename
327+
328+ # Set up and run all specified queries
329+ for (query_id, query) in QUERIES.items():
330+ export_query = ExportQuery(query,
331+ query_id,
332+ self.options.distro_codename,
333+ self.options.output_file,
334+ )
335+ export_query.run()