-
Notifications
You must be signed in to change notification settings - Fork 13
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Vertical loop fusion and demotion of temporaries #374
Conversation
Documentation for this branch can be viewed at https://sites.ecmwf.int/docs/loki/374/index.html |
Codecov ReportAttention: Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## main #374 +/- ##
==========================================
+ Coverage 95.54% 95.56% +0.02%
==========================================
Files 188 190 +2
Lines 39468 39691 +223
==========================================
+ Hits 37709 37930 +221
- Misses 1759 1761 +2
Flags with carried forward coverage won't be shown. Click here to find out more. ☔ View full report in Codecov by Sentry. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Overall this looks really great, and is a first step towards some very advanced optimisations. I've left a few comments, but they are mostly cosmetic. Once addressed, I think this should be good to go.
The direct inclusion in SCC also ok for now, as it will basically become a no-op unless any pragma markers are inserted. In the future, we might want to add this stuff to a more generic non-SCC loop-change transformation that can be prepended to SCC in config, or run in it's own right (eg. for CPU opts). But that is beyond this PR.
routine : :any:`Subroutine` | ||
The subroutine in the vertical loops should be fused and | ||
temporaries be demoted. | ||
horizontal : :any:`Dimension` |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think the horizontal
is not actually used here.
|
||
.. note:: | ||
This transfomation currently relies on pragmas being inserted in the input | ||
source files. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It might be good to briefly explain the pragma annotations needed. For example, listing and linking loop-interchange and loop-fusion notation, as well as the custom "-init" and "ignore" notations.
loop_var_map = {} | ||
with pragmas_attached(routine, ir.Loop): | ||
for loop in FindNodes(ir.Loop).visit(routine.body): | ||
if loop.variable.name.lower() == self.vertical.index.lower(): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is the .lower()
really needed here? Are there edge-cases where expression-string comparison not trigger?
parameters = get_pragma_parameters(loop.pragma, starts_with='fused-loop') | ||
group = parameters.get('group', 'default') | ||
if group == 'ignore': | ||
ignore = True |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Instead of tracking ignore
, I think you could simply continue
here, no?
if group == 'ignore': | ||
ignore = True | ||
if not ignore: | ||
loop_var_map[loop] = as_tuple(var.name.lower() for var in FindVariables().visit(loop.body) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would building the inverse of this map not make the logic much simple? As in var_loop_map[var.name] = loop
would allow you to then simply check for len(var_loop_map[var.name]) > 1
in the next part? Or am I misreading this?
Also, a CaseInsensitiveDict
might be useful here for var_loop_map
.
# look for 'loki k-caching ignore(var1, var2, ...)' pragmas within routine and ignore those vars | ||
for pragma in pragmas: | ||
if is_loki_pragma(pragma, starts_with='k-caching'): | ||
pragma_ignore = get_pragma_parameters(pragma, starts_with='k-caching').get('ignore', None) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Minor nitpick: I wonder if the is_loki_pragma
and get_pragma_parameters
can be combined to something like if pragma_ignore := get_pragma_parameters(pragma, starts_with='k-caching').get(...)
. Not strictly necessary, just wondering...
|
||
def correct_init_of_multilevel_arrays(self, routine, multilevel_local_arrays): | ||
""" | ||
Possibly handle initaliztion of those multilevel local arrays via |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
typo: "initialization"
new_pragmas = [pragma.clone(content=pragma.content.replace('-init', '')) if '-init' | ||
in pragma.content else pragma for pragma in pragmas] | ||
loop_map[loop] = (ir.Comment('! Loki generated loop for init ...'), | ||
Transformer(node_map_init).visit(loop.clone(\ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Two inline Transformer
invocations makes this very hard to read. Can we apply them in sequence to build the tuple, before putting them into the loop_map
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not sure what you mean.
I need to apply the Transformer twice with two different maps since the key would be overwritten otherwise ...
If you propose to introduce temporary variables: sure this is possible!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh, I just meant the cosmetics of it - the use of two transformers is fine, it just gets really hard to read if everything is applied in a single line. If you build the tuple incrementally, each Transformer.visit(..)
invocation could be a single line - but this is really just a cosmetic suggestion and not strictly necessary.
Thanks for reviewing @mlange05. I addressed all your requested changes. |
Ok, nitpicks and code stye comments have all been addressed. Many thanks for this and over to @reuterbal for final approval. As for the on/off switch, I see you argument, and I think having a safety switch built in is a good idea. Would you mind adding this please to the constructor? I'll check offline if having it enabled affects EC-physics. Once that switch is added, I think this is ready. On a wider note, I think that once we refactor this a little, we can have the loop-pragma steps as a separate "pre-processing Trafo" that can be added to the pipeline in the config files, rather than be part of SCC. Demotion of the vertical would then simply require additional parameters to the SCCDemote step - but all of this if beyond this initial PR. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for a very comprehensive and nicely structured implementation and the round of review and fixes already performed! This looks already very good to me. I've left a few (minor) questions and suggestions, and a comment for one potential source of problems.
1. loop interchange to expose vertical loops | ||
2. fuse vertical loops (possibly into multiple groups) | ||
3. find local arrays to be demoted and apply heuristics to check whether this is safe | ||
4. demote those arrays which are safe to be demoted |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Very nice docstring! 👏
13b73b5
to
829e214
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Very nice, many thanks!
Ok, sorry for the delay. I've just run a locally rebased version over this with ecphs-regression, and I can confirm that it works and is performance neutral. I'll trigger one more rebase, and then GTG from me. |
…some kind of ordering via potentially loking for 'insert' in the fuse pragma
…hat fuses loops and demotes temporaries in the vertical dimension
…is now necessary due to 'SCCFuseVerticalLoops')
…implemented check whether demotion is safe
…ct routines this transformation is applied to
…r functions defined by the Fortran standard (#390)
be63803
to
7a65ebe
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looking great now, and confirmed safe and fast with ecphys-regression. GTG.
Vertical loop fusion and demotion of temporaries in the vertical dimension if possible based on
pragmas in the input source code and heuristics.
This is intended to be a starting point for a more automated and sophisticated approach (including dataflow analysis) as well as further optimisations that require this as a first step.
To test with CLOUDSC, checkout branch
nams-vertical-loop-fusion
.This results in the following performance improvements (executed with
1 262144 128
):