= Bug 400544 =

The message-sharing migration script was using far too much memory for large projects and packages.  In previous branches (the latest still on approach for landing as I write this) we stopped the script from fetching msgid texts while merging.  This branch should further reduce memory footprint.

It doesn't do this by putting less stuff in memory per se.  It probably allocates more overall.  But it breaks things up so that less needs to be in memory at the same time.


== Details ==

The diff begins with some whitespace cleanups in potmsgset.py.  Pay no heed to them.  In the same file, getAllTranslationMessages now returns a Storm result set.  This lets the caller append sorting clauses to it.  That was more relevant in an earlier incarnation of this branch, but it's an improvement anyway so I left it in.  There is no default ordering any more, but I verified that no code relied on the old ordering anyway.

In the script, I moved the elimination of certain troublesome duplications among TranslationMessages into a separate phase of execution.  Since this brings the number of phases to 3, I also started numbering them clearly in the options list.

You'll note some new intermediate commits between phases.  The commit function has evolved to skip intermediate aborts on dry runs; skip commits or aborts when no txn has been set; and to run the garbage collector after every commit in hopes of relieving memory load somewhat.
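
As a rough illustration of those three rules, here is a minimal sketch.  It is not the script's actual code: end_transaction, its parameters, and FakeTransaction are made-up names, and the real method may differ in detail.

```python
import gc


class FakeTransaction:
    """Hypothetical stand-in for a transaction manager; records calls."""

    def __init__(self):
        self.calls = []

    def commit(self):
        self.calls.append("commit")

    def abort(self):
        self.calls.append("abort")


def end_transaction(txn, dry_run, intermediate=False):
    """Commit, or abort on a dry run.

    Returns the action taken: "commit", "abort", or None if skipped.
    """
    if txn is None:
        # No transaction manager has been set: nothing to commit or abort.
        return None
    if dry_run:
        if intermediate:
            # Dry runs skip the intermediate aborts.
            return None
        txn.abort()
        return "abort"
    txn.commit()
    # Run the garbage collector after every commit to relieve memory load.
    gc.collect()
    return "commit"
```

Passing None for txn makes the call a no-op, which is convenient where no transaction manager is available.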

Next in the diff comes the script's new execution phase.  It figures out which potmsgsets are "representative," and cleans up only those.  They are the only ones that are a problem for the rest of the algorithm.  There are probably few enough other cases that it's not worth going through a full cleanup.

Right below that you'll see how the _scrubPOTMsgSetTranslations invocation was removed from the previous first execution phase (which is now the second).

The extra intermediate commits I've added inside phases are a delicate matter.  Underneath Storm and above the database connection, psycopg batches query result sets.  If you commit transactions while iterating over a result set (especially if you're also removing rows that occur in it!) there's a risk of confusing the iteration.  This is why there are no _endTransaction calls while iterating over result sets.  Once the full result set has been loaded, this is no longer an issue.
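
The safe pattern can be sketched like this: exhaust the result set first, then commit between batches of work.  The function name, its parameters, and the batch size are assumptions made for illustration; they are not taken from the script.

```python
def process_in_batches(result_set, handle, end_transaction, batch_size=100):
    """Load the full result set, then commit after each batch of work.

    result_set: any iterable of rows (stands in for a database result set).
    handle: callable that processes one row.
    end_transaction: callable that commits the current transaction.
    """
    # Exhaust the iterator before the first commit, so that no commit
    # ever happens in the middle of iterating over the result set.
    rows = list(result_set)
    for start in range(0, len(rows), batch_size):
        for row in rows[start:start + batch_size]:
            handle(row)
        end_transaction()
    return len(rows)
```

The price is that the whole list has to fit in memory, which is why it helps to fetch only something small, such as ids.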

_mapExistingMessages is gone.  It was a helper for _scrubPOTMsgSetTranslations, but its approach involved loading all of a potmsgset's translations right up front.  That was a serious memory footprint risk, and it was still in there only because, once upon a time, the resulting dict was reused elsewhere in the algorithm.

There still is a two-stage approach, but with a different division of labour: one loop retrieves the TranslationMessages a potmsgset has for each language/variant pair, but keeps only their ids so as not to pin all those objects in the ORM cache.  A second loop then works through the TranslationMessages for one language at a time, deleting duplicates as it finds them.  Duplicates must have the same language, and a problematic potmsgset is one that has TranslationMessages in many languages, so this should be an effective way of breaking down the working sets.  Each TranslationMessage is (probably, depending on actual caching behaviour) reloaded from the database as needed.  All references to these objects are dropped once a language has been fully checked.
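
In outline, the two loops look something like the sketch below.  Everything in it is illustrative: Message, scrub_duplicates, load and delete are hypothetical names, and "same text" merely stands in for whatever the script really treats as a duplicate.

```python
from collections import defaultdict, namedtuple

# Hypothetical stand-in for a TranslationMessage row.
Message = namedtuple("Message", "id language variant text")


def scrub_duplicates(messages, load, delete):
    """Delete duplicate messages, one language/variant pair at a time.

    messages: iterable of Message-like objects.
    load: callable fetching a message by id (stands in for an ORM lookup).
    delete: callable removing a message by id.
    Returns the ids that were deleted.
    """
    # Stage one: remember only ids, grouped by (language, variant).
    ids_by_language = defaultdict(list)
    for message in messages:
        ids_by_language[(message.language, message.variant)].append(message.id)

    deleted = []
    # Stage two: deal with a single language's messages at a time.
    for ids in ids_by_language.values():
        seen = set()
        for message_id in ids:
            message = load(message_id)
            # Assumption: a duplicate is a message with the same text.
            if message.text in seen:
                delete(message_id)
                deleted.append(message_id)
            else:
                seen.add(message.text)
        # `seen` is rebuilt for the next language, so nothing lingers.
    return deleted
```

The point is that stage two only ever holds one language's worth of data at a time, however many languages the potmsgset has.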

Finally, there's the test.  As it turns out, TestMapMessages is now a neater match for the new execution phase so I renamed it to reflect that.  It holds the test cases for _scrubPOTMsgSetTranslations (the per-potmsgset method).  A test that used to cover POTMsgSet merging, and now covers the new execution phase instead, has moved in here as well.


== Lint, tests, and Q/A ==

No lint.  Test:
{{{
./bin/test -vv -t message_sharing_migration
}}}

To Q/A, try running the three phases consecutively on the wordpress project:
{{{
./scripts/rosetta/message-sharing-merge.py -vvv -D p wordpress
./scripts/rosetta/message-sharing-merge.py -vvv -P p wordpress
./scripts/rosetta/message-sharing-merge.py -vvv -T p wordpress
}}}

Watch each for memory footprint as it runs.  Previously, this would take up more than 2.5 GB of memory and be killed by angry sysadmins.  Note that this may still happen when using the --dry-run option, because of all the aborts being skipped.  But a "real" run should peak at lower address-space usage.


Jeroen