Loop fusion with Loki
The objective of this notebook is to go through examples of how loop fusion can be performed using Loki. It is a continuation in a series of notebooks, and builds on the lessons of notebooks on `Reading and writing files with Loki
<https://git.ecmwf.int/projects/RDX/repos/loki/browse/example/01_reading_and_writing_files.ipynb>`__ and `Working with Loki's internal representation
<https://git.ecmwf.int/projects/RDX/repos/loki/browse/example/02_working_with_the_ir.ipynb>`__.
Let us start by parsing the file src/loop_fuse.F90
from the example
directory and pick out the loop_fuse
Subroutine from that file:
[1]:
from loki import Sourcefile
source = Sourcefile.from_file('src/loop_fuse.F90')
routine = source['loop_fuse']
routine
[1]:
Subroutine:: loop_fuse
loop_fuse
starts with an Import statement to load the parameters jpim
and jprb
. Even though we have not specified where the Module parkind1
is located, Loki is still able to successfully parse the file and treats jpim
and jprb
as a
DeferredTypeSymbol. We can verify this by examining the specification Section of the loop_fuse
:
[2]:
from loki import fgen
print(fgen(routine.spec))
USE parkind1, ONLY: jpim, jprb
IMPLICIT NONE
INTEGER(KIND=jpim), INTENT(IN) :: n
REAL(KIND=jprb), INTENT(IN) :: var_in(n, n, n)
REAL(KIND=jprb), INTENT(OUT) :: var_out(n, n, n)
INTEGER(KIND=jpim) :: i, j, k
[3]:
routine.spec.view()
<Section::>
<Import:: parkind1 => (DeferredTypeSymbol('jpim', ('scope', Subroutine:: loop_fuse)),
DeferredTypeSymbol('jprb', ('scope', Subroutine:: loop_fuse)))>
<Intrinsic:: IMPLICIT NONE>
<Comment:: >
<VariableDeclaration:: n>
<VariableDeclaration:: var_in(n, n, n)>
<VariableDeclaration:: var_out(n, n, n)>
<VariableDeclaration:: i, j, k>
Examining the body Section of loop_fuse
reveals a nested loop with three levels:
[4]:
print(fgen(routine.body))
DO k=1,n
DO j=1,n
DO i=1,n
var_out(i, j, k) = var_in(i, j, k)
END DO
DO i=1,n
var_out(i, j, k) = 2._JPRB*var_out(i, j, k)
END DO
END DO
END DO
As a first exercise, let us try to merge all the loops that use i
as the iteration variable. This will involve using Loki’s visitor utilities to traverse, search and manipulate Loki’s internal representation (IR). If you are unfamiliar with these topics, then a quick read of `Working with Loki's internal representation
<https://github.com/ecmwf-ifs/loki/blob/main/example/02_working_with_the_ir.ipynb>`__ is highly
recommended.
Let us start by identifying all the instances of Loop that use i
as the iteration variable:
[5]:
from loki import FindNodes,Loop,flatten
iloops = [node for node in FindNodes(Loop).visit(routine.body) if node.variable == 'i']
iloops
[5]:
[Loop:: i=1:n, Loop:: i=1:n]
As the output shows, the visitor search correctly identified both loops. Merging these loops comprises of three main steps. The first is to build a new loop that contains the body of both the loops indentified above:
[6]:
loop_body = flatten([loop.body for loop in iloops])
new_loop = Loop(variable=iloops[0].variable, body=loop_body, bounds=iloops[0].bounds)
print(fgen(new_loop.body))
var_out(i, j, k) = var_in(i, j, k)
var_out(i, j, k) = 2._JPRB*var_out(i, j, k)
new_loop
now contains both the Assignment statements of the original iloops
. The next step is to build a transformation map - a dictionary that maps the original node to its replacement:
[7]:
loop_map = {iloops[0]: new_loop, iloops[1]: None}
Since we want to merge two loops into one, the first loop is mapped to new_loop
and the secone is mapped to None
i.e. it will be deleted. With the transformation map defined, we can execute the Transformer:
[8]:
from loki import Transformer
routine.body = Transformer(loop_map).visit(routine.body)
print(routine.to_fortran())
assert len([node for node in FindNodes(Loop).visit(routine.body) if node.variable == 'i']) == 1
SUBROUTINE loop_fuse (n, var_in, var_out)
USE parkind1, ONLY: jpim, jprb
IMPLICIT NONE
INTEGER(KIND=jpim), INTENT(IN) :: n
REAL(KIND=jprb), INTENT(IN) :: var_in(n, n, n)
REAL(KIND=jprb), INTENT(OUT) :: var_out(n, n, n)
INTEGER(KIND=jpim) :: i, j, k
DO k=1,n
DO j=1,n
DO i=1,n
var_out(i, j, k) = var_in(i, j, k)
var_out(i, j, k) = 2._JPRB*var_out(i, j, k)
END DO
END DO
END DO
END SUBROUTINE loop_fuse
We have also added an assert
statement to programatically check the output of our loop tranformation. The assert
will allow pytest
to determine if this notebook continues to function as expected with future updates to Loki.
Loops separated by kernel call
Let us now try a more complex loop fusion example:
[9]:
routine = source['loop_fuse_v1']
print(routine.to_fortran())
SUBROUTINE loop_fuse_v1 (n, var_in, var_out)
USE parkind1, ONLY: jpim, jprb
IMPLICIT NONE
INTEGER(KIND=jpim), INTENT(IN) :: n
REAL(KIND=jprb), INTENT(IN) :: var_in(n, n, n)
REAL(KIND=jprb), INTENT(OUT) :: var_out(n, n, n)
INTEGER(KIND=jpim) :: i, j, k
DO k=1,n
DO j=1,n
DO i=1,n
var_out(i, j, k) = var_in(i, j, k)
END DO
DO i=1,n
var_out(i, j, k) = 2._JPRB*var_out(i, j, k)
END DO
END DO
CALL some_kernel(n, var_out(1, 1, k))
DO j=1,n
DO i=1,n
var_out(i, j, k) = var_out(i, j, k) + 1._JPRB
END DO
DO i=1,n
var_out(i, j, k) = 2._JPRB*var_out(i, j, k)
END DO
END DO
END DO
END SUBROUTINE loop_fuse_v1
In loop_fuse_v1
, there are two j
-loops separated by a CallStatement to a kernel that modifies var_out
. Therefore we can only merge the i
-loops within each j
-loop.
Using the visitor to locate the i
-loops as was done in the previous example is inappropriate in this case, because it will locate all four i
-loops:
[10]:
iloops = [node for node in FindNodes(Loop).visit(routine.body) if node.variable == 'i']
iloops
[10]:
[Loop:: i=1:n, Loop:: i=1:n, Loop:: i=1:n, Loop:: i=1:n]
Since we know the hierarchy of the loops, we can instead run the visitor on two levels. First to locate the j
-loops, and then locate the i
-loops within the body of each j
-loop:
[11]:
jloops = [node for node in FindNodes(Loop).visit(routine.body) if node.variable == 'j']
iloops = [[node for node in FindNodes(Loop).visit(loop.body) if node.variable == 'i'] for loop in jloops]
print(iloops[0], iloops[1])
[Loop:: i=1:n, Loop:: i=1:n] [Loop:: i=1:n, Loop:: i=1:n]
We can now merge the two blocks of i
-loops:
[12]:
for loop_block in iloops:
loop_body = flatten([loop.body for loop in loop_block])
new_loop = Loop(variable=loop_block[0].variable, body=loop_body, bounds=loop_block[0].bounds)
loop_map[loop_block[0]] = new_loop
loop_map.update({loop: None for loop in loop_block[1:]})
routine.body = Transformer(loop_map).visit(routine.body)
print(routine.to_fortran())
assert len([node for node in FindNodes(Loop).visit(routine.body) if node.variable == 'i']) == 2
SUBROUTINE loop_fuse_v1 (n, var_in, var_out)
USE parkind1, ONLY: jpim, jprb
IMPLICIT NONE
INTEGER(KIND=jpim), INTENT(IN) :: n
REAL(KIND=jprb), INTENT(IN) :: var_in(n, n, n)
REAL(KIND=jprb), INTENT(OUT) :: var_out(n, n, n)
INTEGER(KIND=jpim) :: i, j, k
DO k=1,n
DO j=1,n
DO i=1,n
var_out(i, j, k) = var_in(i, j, k)
var_out(i, j, k) = 2._JPRB*var_out(i, j, k)
END DO
END DO
CALL some_kernel(n, var_out(1, 1, k))
DO j=1,n
DO i=1,n
var_out(i, j, k) = var_out(i, j, k) + 1._JPRB
var_out(i, j, k) = 2._JPRB*var_out(i, j, k)
END DO
END DO
END DO
END SUBROUTINE loop_fuse_v1
In loop_fuse_v1
, identifying the two blocks of i
-loops was relatively straightforward because they were nested in different j
-loops. Let us now try an example where all the i
-loops and the kernel call are within the same j
-loop:
[13]:
routine = source['loop_fuse_v2']
print(routine.to_fortran())
SUBROUTINE loop_fuse_v2 (n, var_in, var_out)
USE parkind1, ONLY: jpim, jprb
IMPLICIT NONE
INTEGER(KIND=jpim), INTENT(IN) :: n
REAL(KIND=jprb), INTENT(IN) :: var_in(n, n, n)
REAL(KIND=jprb), INTENT(OUT) :: var_out(n, n, n)
INTEGER(KIND=jpim) :: i, j, k
DO k=1,n
DO j=1,n
DO i=1,n
var_out(i, j, k) = var_in(i, j, k)
END DO
DO i=1,n
var_out(i, j, k) = 2._JPRB*var_out(i, j, k)
END DO
CALL some_kernel(n, var_out(1, j, k))
DO i=1,n
var_out(i, j, k) = var_out(i, j, k) + 1._JPRB
END DO
DO i=1,n
var_out(i, j, k) = 2._JPRB*var_out(i, j, k)
END DO
END DO
END DO
END SUBROUTINE loop_fuse_v2
The FindNodes visitor we used previously returns an ordered list of nodes that match a specified type. Previously we only searched for nodes of type Loop. We can easily extend this to also search for CallStatement by passing both node-types as a tuple when initializing the visitor:
[14]:
from loki import CallStatement
jloops = [loop for loop in FindNodes(Loop).visit(routine.body) if loop.variable == 'j']
assert len(jloops) == 1
nodes = FindNodes((CallStatement,Loop)).visit(jloops[0].body)
nodes
[14]:
[Loop:: i=1:n, Loop:: i=1:n, Call:: some_kernel, Loop:: i=1:n, Loop:: i=1:n]
By first using FindNodes
to locate the j
-loop, and then applying FindNodes
to that we have built an ordered list (nodes
) containing just the i
-loops and the kernel call. We can now identify the loops that appear before and after the kernel call:
[15]:
call_loc = [count for count,node in enumerate(nodes) if isinstance(node,CallStatement)][0]
iloops[0] = [node for node in nodes[:call_loc] if node.variable == 'i']
iloops[1] = [node for node in nodes[call_loc+1:] if node.variable == 'i']
We can now fuse the two blocks of i
-loops:
[16]:
for loop_block in iloops:
loop_body = flatten([loop.body for loop in loop_block])
new_loop = Loop(variable=loop_block[0].variable, body=loop_body, bounds=loop_block[0].bounds)
loop_map[loop_block[0]] = new_loop
loop_map.update({loop: None for loop in loop_block[1:]})
routine.body = Transformer(loop_map).visit(routine.body)
print(routine.to_fortran())
assert len([node for node in FindNodes(Loop).visit(routine.body) if node.variable == 'i']) == 2
SUBROUTINE loop_fuse_v2 (n, var_in, var_out)
USE parkind1, ONLY: jpim, jprb
IMPLICIT NONE
INTEGER(KIND=jpim), INTENT(IN) :: n
REAL(KIND=jprb), INTENT(IN) :: var_in(n, n, n)
REAL(KIND=jprb), INTENT(OUT) :: var_out(n, n, n)
INTEGER(KIND=jpim) :: i, j, k
DO k=1,n
DO j=1,n
DO i=1,n
var_out(i, j, k) = var_in(i, j, k)
var_out(i, j, k) = 2._JPRB*var_out(i, j, k)
END DO
CALL some_kernel(n, var_out(1, j, k))
DO i=1,n
var_out(i, j, k) = var_out(i, j, k) + 1._JPRB
var_out(i, j, k) = 2._JPRB*var_out(i, j, k)
END DO
END DO
END DO
END SUBROUTINE loop_fuse_v2
Using the built-in loop_fusion
utility
To facilitate loop fusion and make it readily available to users, Loki has a built-in loop_fusion
transformation utility. However, currently this relies on manually annotating the loops with !$loki
pragmas. To illustrate how it works, we outline its mechanics in the following:
[17]:
routine = source['loop_fuse_pragma']
print(routine.to_fortran())
SUBROUTINE loop_fuse_pragma (n, var_in, var_out)
USE parkind1, ONLY: jpim, jprb
IMPLICIT NONE
INTEGER(KIND=jpim), INTENT(IN) :: n
REAL(KIND=jprb), INTENT(IN) :: var_in(n, n, n)
REAL(KIND=jprb), INTENT(OUT) :: var_out(n, n, n)
INTEGER(KIND=jpim) :: i, j, k
DO k=1,n
DO j=1,n
!$loki loop-fusion group( g1 )
DO i=1,n
var_out(i, j, k) = var_in(i, j, k)
END DO
!$loki loop-fusion group( g1 )
DO i=1,n
var_out(i, j, k) = 2._JPRB*var_out(i, j, k)
END DO
CALL some_kernel(n, var_out(1, j, k))
!$loki loop-fusion group( g2 )
DO i=1,n
var_out(i, j, k) = var_out(i, j, k) + 1._JPRB
END DO
!$loki loop-fusion group( g2 )
DO i=1,n
var_out(i, j, k) = 2._JPRB*var_out(i, j, k)
END DO
END DO
END DO
END SUBROUTINE loop_fuse_pragma
The routine loop_fuse_pragma
is identical to loop_fuse_v2
except for the i
-loops being preceded by !$loki loop-fusion
pragmas. The loops that are candidates for fusion have been assigned to the same group i.e. g1
or g2
. Examining the body of loop_fuse_pragma
reveals the following:
[18]:
routine.body.view()
<Section::>
<Comment:: >
<Loop:: k=1:n>
<Loop:: j=1:n>
<Comment:: >
<Pragma:: loki loop-fusion g...>
<Loop:: i=1:n>
<Assignment:: var_out(i, j, k) = var_in(i, j, k)>
<Pragma:: loki loop-fusion g...>
<Loop:: i=1:n>
<Assignment:: var_out(i, j, k) = 2._JPRB*var_out(i, j, k)>
<Comment:: >
<Call:: some_kernel>
<Comment:: >
<Pragma:: loki loop-fusion g...>
<Loop:: i=1:n>
<Assignment:: var_out(i, j, k) = var_out(i, j, k) + 1._JPRB>
<Pragma:: loki loop-fusion g...>
<Loop:: i=1:n>
<Assignment:: var_out(i, j, k) = 2._JPRB*var_out(i, j, k)>
<Comment:: >
<Comment:: >
One of the difficulties in parsing pragmas is that it is not always immediately clear whether they should be associated with the subsequent or preceeding node, or should stand alone; as examples think of the differing behaviours of !$omp do
, !$omp end do
and !$omp barrier
. Therefore in Loki, pragmas are not attached by default to other nodes. Instead, Loki treats pragmas essentially like comments, but gives them a separate node-type to easily distinguish them.
In situations where we do wish to associate pragmas with certain nodes, we can do so using the pragmas_attached context manager:
[19]:
from loki import pragmas_attached
with pragmas_attached(routine,Loop):
routine.body.view()
<Section::>
<Comment:: >
<Loop:: k=1:n>
<Loop:: j=1:n>
<Comment:: >
<Loop:: i=1:n>
<Assignment:: var_out(i, j, k) = var_in(i, j, k)>
<Loop:: i=1:n>
<Assignment:: var_out(i, j, k) = 2._JPRB*var_out(i, j, k)>
<Comment:: >
<Call:: some_kernel>
<Comment:: >
<Loop:: i=1:n>
<Assignment:: var_out(i, j, k) = var_out(i, j, k) + 1._JPRB>
<Loop:: i=1:n>
<Assignment:: var_out(i, j, k) = 2._JPRB*var_out(i, j, k)>
<Comment:: >
<Comment:: >
We can now visit the loops and sort them into their respective fusion groups:
[20]:
from loki import is_loki_pragma,get_pragma_parameters,Pragma
from collections import defaultdict
fusion_groups = defaultdict(list)
with pragmas_attached(routine,Loop):
for loop in FindNodes(Loop).visit(routine.body):
if is_loki_pragma(loop.pragma, starts_with='loop-fusion'):
parameters = get_pragma_parameters(loop.pragma, starts_with='loop-fusion')
group = parameters.get('group', 'default')
fusion_groups[group] += [loop]
print(f"g1: {fusion_groups['g1']}, g2: {fusion_groups['g2']}")
g1: [Loop:: i=1:n, Loop:: i=1:n], g2: [Loop:: i=1:n, Loop:: i=1:n]
fusion_groups
is now a dictionary with keys for the two fusion groups, and the associated Loop
nodes are values for each key. We can now create and apply a transformation map similar to how it was done in the previous examples:
[21]:
for group,loops in fusion_groups.items():
loop_body = flatten([loop.body for loop in loops])
new_loop = Loop(variable=loops[0].variable, body=loop_body, bounds=loops[0].bounds)
loop_map[loops[0]] = new_loop
loop_map.update({loop: None for loop in loops[1:]})
Since !$loki
pragmas are only intended to pass instructions/hints to Loki on source manipulations, and aren’t needed for the eventual compilation, we can also remove them from the routine:
[22]:
routine_copy = routine.clone()
routine.body = Transformer(loop_map).visit(routine.body)
pragma_map = {pragma: None for pragma in FindNodes(Pragma).visit(routine.body)}
routine.body = Transformer(pragma_map).visit(routine.body)
print(routine.to_fortran())
assert len([node for node in FindNodes(Loop).visit(routine.body) if node.variable == 'i']) == 2
SUBROUTINE loop_fuse_pragma (n, var_in, var_out)
USE parkind1, ONLY: jpim, jprb
IMPLICIT NONE
INTEGER(KIND=jpim), INTENT(IN) :: n
REAL(KIND=jprb), INTENT(IN) :: var_in(n, n, n)
REAL(KIND=jprb), INTENT(OUT) :: var_out(n, n, n)
INTEGER(KIND=jpim) :: i, j, k
DO k=1,n
DO j=1,n
DO i=1,n
var_out(i, j, k) = var_in(i, j, k)
var_out(i, j, k) = 2._JPRB*var_out(i, j, k)
END DO
CALL some_kernel(n, var_out(1, j, k))
DO i=1,n
var_out(i, j, k) = var_out(i, j, k) + 1._JPRB
var_out(i, j, k) = 2._JPRB*var_out(i, j, k)
END DO
END DO
END DO
END SUBROUTINE loop_fuse_pragma
You may have noticed in the previous code-cell we made a copy of the object routine
before applying a transformation to it. We can now apply the loop_fusion
utility directly on routine_copy
and compare the results:
[ ]:
from loki import do_loop_fusion
do_loop_fusion(routine_copy)
pragma_map = {pragma: None for pragma in FindNodes(Pragma).visit(routine_copy.body)}
routine_copy.body = Transformer(pragma_map).visit(routine_copy.body)
print(routine_copy.to_fortran())
loop_fuse_pragma: fused 4 loops in 2 groups.
SUBROUTINE loop_fuse_pragma (n, var_in, var_out)
USE parkind1, ONLY: jpim, jprb
IMPLICIT NONE
INTEGER(KIND=jpim), INTENT(IN) :: n
REAL(KIND=jprb), INTENT(IN) :: var_in(n, n, n)
REAL(KIND=jprb), INTENT(OUT) :: var_out(n, n, n)
INTEGER(KIND=jpim) :: i, j, k
DO k=1,n
DO j=1,n
! Loki loop-fusion group(g1)
DO i=1,n
! Loki loop-fusion - body 0 begin
var_out(i, j, k) = var_in(i, j, k)
! Loki loop-fusion - body 0 end
! Loki loop-fusion - body 1 begin
var_out(i, j, k) = 2._JPRB*var_out(i, j, k)
! Loki loop-fusion - body 1 end
END DO
! Loki loop-fusion group(g1) - loop hoisted
CALL some_kernel(n, var_out(1, j, k))
! Loki loop-fusion group(g2)
DO i=1,n
! Loki loop-fusion - body 0 begin
var_out(i, j, k) = var_out(i, j, k) + 1._JPRB
! Loki loop-fusion - body 0 end
! Loki loop-fusion - body 1 begin
var_out(i, j, k) = 2._JPRB*var_out(i, j, k)
! Loki loop-fusion - body 1 end
END DO
! Loki loop-fusion group(g2) - loop hoisted
END DO
END DO
END SUBROUTINE loop_fuse_pragma
As we can see, the built-in Loki utility loop_fusion
achieves an identical result to our manual transformation.