Merge lp:~maddevelopers/mg5amcnlo/smart_zeros into lp:~maddevelopers/mg5amcnlo/2.3.4
Status: | Merged |
---|---|
Merged at revision: | 393 |
Proposed branch: | lp:~maddevelopers/mg5amcnlo/smart_zeros |
Merge into: | lp:~maddevelopers/mg5amcnlo/2.3.4 |
Diff against target: | 782 lines (+306/-115) (has conflicts), 13 files modified |
Template/NLO/SubProcesses/makefile_loop.inc (+13/-2)
Template/loop_material/StandAlone/SubProcesses/makefile (+13/-2)
madgraph/iolibs/export_v4.py (+4/-4)
madgraph/iolibs/file_writers.py (+1/-0)
madgraph/iolibs/template_files/loop_optimized/helas_calls_split.inc (+5/-3)
madgraph/iolibs/template_files/loop_optimized/loop_matrix_standalone.inc (+7/-5)
madgraph/iolibs/template_files/loop_optimized/mp_compute_loop_coefs.inc (+9/-3)
madgraph/iolibs/template_files/loop_optimized/mp_helas_calls_split.inc (+3/-4)
madgraph/iolibs/template_files/loop_optimized/polynomial.inc (+6/-8)
madgraph/loop/loop_exporters.py (+46/-21)
madgraph/various/process_checks.py (+4/-0)
madgraph/various/q_polynomial.py (+194/-63)
tests/time_db (+1/-0)
Text conflict in madgraph/various/process_checks.py
To merge this branch: | bzr merge lp:~maddevelopers/mg5amcnlo/smart_zeros |
Related bugs: |
Reviewer | Review Type | Date Requested | Status |
---|---|---|---|
Hua-Sheng Shao | Approve | ||
Review via email: mp+287741@code.launchpad.net |
Description of the change
This branch brings a small but significant improvement to the computation of loop polynomial coefficients in MadLoop.
Profiling shows that most of the computation time of these coefficients comes from the 'update_wl' functions, which implement the tensorial product of the 'loop wavefunction polynomials' with the 'vertex polynomial updaters' provided by aloha.
Basically, I identified that within the current framework there are three ways of implementing this tensorial product:
a) Loop over only the wavefunction/
This was the only technique used before this branch.
b) Perform all 5 do-loops (3 over loop wf + updater indices and 2 over polynomial coefficients), but looping first over the updater indices and filtering out the updater coefficients which are zero.
c) Same as b), so all 5 do-loops, but this time looping first over the loop wavefunction coefficients, with a filter on the loop wf coefs which are zero.
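To make the difference between b) and c) concrete, here is a minimal Python sketch. It is a deliberately simplified stand-in, not the generated Fortran: polynomials are taken to be univariate (coefficient index = power, so two coefficients with indices a and b contribute to index a+b), whereas the real loop polynomials have a more involved coefficient indexing. The function names and the nested-list layout `[row][column][coefficient]` are assumptions made for illustration only.

```python
def _empty_result(wf, upd):
    # Result has one row per updater row, one column per wf column.
    n_coefs = len(upd[0][0]) + len(wf[0][0]) - 1
    return [[[0] * n_coefs for _ in wf[0]] for _ in upd]


def update_wl_reference(wf, upd):
    """Unfiltered product: out[i][j][a+b] += upd[i][k][a] * wf[k][j][b]."""
    out = _empty_result(wf, upd)
    for i in range(len(upd)):
        for k in range(len(wf)):
            for j in range(len(wf[0])):
                for a, u in enumerate(upd[i][k]):
                    for b, w in enumerate(wf[k][j]):
                        out[i][j][a + b] += u * w
    return out


def update_wl_filter_updater(wf, upd):
    """Analogue of b): visit updater entries first and skip the zero ones."""
    out = _empty_result(wf, upd)
    for i in range(len(upd)):
        for k in range(len(wf)):
            for a, u in enumerate(upd[i][k]):
                if u == 0:
                    continue  # the two inner loops are skipped entirely
                for j in range(len(wf[0])):
                    for b, w in enumerate(wf[k][j]):
                        out[i][j][a + b] += u * w
    return out


def update_wl_filter_wf(wf, upd):
    """Analogue of c): visit loop wf entries first and skip the zero ones."""
    out = _empty_result(wf, upd)
    for k in range(len(wf)):
        for j in range(len(wf[0])):
            for b, w in enumerate(wf[k][j]):
                if w == 0:
                    continue  # the two inner loops are skipped entirely
                for i in range(len(upd)):
                    for a, u in enumerate(upd[i][k]):
                        out[i][j][a + b] += u * w
    return out
```

All three functions return the same result; they differ only in which operand's zeros let the inner loops be skipped, which is why the sparsity of the updater coefficients matters for b).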
So a gain can be obtained by wisely choosing which strategy to use for the different cases of loop_wavefunction rank and updater rank.
The following choice is made, based on several empirical profiling runs.
-------
if ( loop_wf rank == 0 ) or (updater rank == 0 ) or (loop_wf rank == updater rank == 1)
-> Keep the original strategy a), which seems to be faster
else if ( loop_wf rank ) >= ( updater rank )
-> Use the strategy b) which exploits the fact that the loop polynomial is high rank in comparison to the vertex polynomial.
else
-> Use the strategy c) which exploits the fact that the vertex polynomial is high rank in comparison to the loop wf polynomial.
This is typically not a very relevant change as it is basically used only for the combination
(loop wf rank =1 , updater rank =2) in effective theories.
-------
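The selection rule above can be written as a small function. This is only a sketch of the decision logic as described here; the function name `choose_update_strategy` and the single-letter return values are hypothetical and are not taken from the actual exporter code.

```python
def choose_update_strategy(loop_wf_rank: int, updater_rank: int) -> str:
    """Return 'a', 'b' or 'c' following the rule stated in the description."""
    # Low-rank cases: the original strategy a) was found to be faster.
    if loop_wf_rank == 0 or updater_rank == 0:
        return "a"
    if loop_wf_rank == 1 and updater_rank == 1:
        return "a"
    # Loop polynomial at least as high rank as the vertex polynomial:
    # filter on the zero updater coefficients.
    if loop_wf_rank >= updater_rank:
        return "b"
    # Vertex polynomial of higher rank than the loop wf polynomial:
    # filter on the zero loop wf coefficients.
    return "c"
```

For instance, the combination (loop wf rank = 1, updater rank = 2) mentioned above maps to strategy c), while (loop wf rank = 2, updater rank = 1) maps to b).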
So the introduction of strategy b) is really what brings the improvement. However for this improvement to be large, it is necessary that polynomial.f be compiled without '-fbounds-check' and preferably with '-O3'.
I have therefore slightly altered the makefiles so that this source file in particular enforces the above, irrespective of what is in make_opts.
The improvements obtained are not a game-changer, but still welcome. Note that even though the implementation is such that the gain should be larger for more complicated processes, this is not guaranteed, as it also depends on the sparsity of the updater polynomial coefficients.
The gains below are relative to the loop polynomial coefficient computation only (i.e. not relative to the timing including loop reduction):
u d~ > e+ ve -> No gain, code completely identical in this case
g g > t t~ -> -25%
g g > t t~ g -> -18%
g g > t t~ g g -> -9%
u u~ > d d~ s s~ -> -27%
u u~ > d d~ s s~ g -> -23%
g g > x0 g (HEFT process) -> -40%
g g > x0 g g (HEFT process) -> -19%
g g > y2 g (y2 = massive spin-2 boson) -> -43%
g g > y2 g g (y2 = massive spin-2 boson) -> -31%
g g > h h -> -58%
g g > h h h -> -60%
g g > h h h h -> -64%
g g > z z -> -41%
g g > z z z -> -43%
The improvement is better for loop-induced processes, but unfortunately this is also where we are already dominated by the reduction time anyway, so it doesn't matter much :(.
Finally, one crazy idea would be to dynamically choose the optimal method among the three above for each UPDATE_WL call, with a training session. But ok, let's not go there...
And of course another avenue of optimization is to properly choose the optimal l-cut location of each loop so as to maximize the number of loop wavefunctions recycled.
But this was my one improvement of the year on the loop polynomial computation, anything more will wait for 2017 at least.
(Originally the hope was to get even larger gains by having aloha specify which coefficients can be zero and keeping an overall track of the list of non-zero coefficients.
But after 2 long days of testing and trying hard, the full-fledged tracking of zero coefs seems more expensive than what it saves, except when done with a partial filtering like above. Well... at least I tried.)
Anyway, as far as the review goes (I picked you Olivier, because we already discussed this a bit, but others who are reading this are welcome to give their opinion), there is not much to be done here:
Just make sure that things go smoothly for a couple of runs, double-check a couple of the timing improvements above (with the check timing -reuse command), and give me a green light (pretty please).
Hi Valentin,
If it doesn't compile with '-fbounds-check', doesn't that signify that some arrays go out of bounds, which might lead to compiler-dependent problems?
Cheers,
Rik