Loop fusion with Loki

The objective of this notebook is to go through examples of how loop fusion can be performed using Loki. It is a continuation in a series of notebooks, and builds on the lessons of notebooks on `Reading and writing files with Loki <https://git.ecmwf.int/projects/RDX/repos/loki/browse/example/01_reading_and_writing_files.ipynb>`__ and `Working with Loki's internal representation <https://git.ecmwf.int/projects/RDX/repos/loki/browse/example/02_working_with_the_ir.ipynb>`__.

Let us start by parsing the file src/loop_fuse.F90 from the example directory and pick out the loop_fuse Subroutine from that file:

[1]:

from loki import Sourcefile
source = Sourcefile.from_file('src/loop_fuse.F90')
routine = source['loop_fuse']
routine

[1]:

Subroutine:: loop_fuse

loop_fuse starts with an Import statement to load the parameters jpim and jprb. Even though we have not specified where the Module parkind1 is located, Loki is still able to successfully parse the file and treats jpim and jprb as a DeferredTypeSymbol. We can verify this by examining the specification Section of the loop_fuse:

[2]:

from loki import fgen
print(fgen(routine.spec))

USE parkind1, ONLY: jpim, jprb
IMPLICIT NONE

INTEGER(KIND=jpim), INTENT(IN) :: n
REAL(KIND=jprb), INTENT(IN) :: var_in(n, n, n)
REAL(KIND=jprb), INTENT(OUT) :: var_out(n, n, n)
INTEGER(KIND=jpim) :: i, j, k

[3]:

routine.spec.view()

<Section::>
  <Import:: parkind1 => (DeferredTypeSymbol('jpim', ('scope', Subroutine:: loop_fuse)),
  DeferredTypeSymbol('jprb', ('scope', Subroutine:: loop_fuse)))>
  <Intrinsic:: IMPLICIT NONE>
  <Comment:: >
  <VariableDeclaration:: n>
  <VariableDeclaration:: var_in(n, n, n)>
  <VariableDeclaration:: var_out(n, n, n)>
  <VariableDeclaration:: i, j, k>

Examining the body Section of loop_fuse reveals a nested loop with three levels:

[4]:

print(fgen(routine.body))


DO k=1,n
  DO j=1,n
    DO i=1,n
      var_out(i, j, k) = var_in(i, j, k)
    END DO
    DO i=1,n
      var_out(i, j, k) = 2._JPRB*var_out(i, j, k)
    END DO
  END DO
END DO

As a first exercise, let us try to merge all the loops that use i as the iteration variable. This will involve using Loki’s visitor utilities to traverse, search and manipulate Loki’s internal representation (IR). If you are unfamiliar with these topics, then a quick read of `Working with Loki's internal representation <https://github.com/ecmwf-ifs/loki/blob/main/example/02_working_with_the_ir.ipynb>`__ is highly recommended.

Let us start by identifying all the instances of Loop that use i as the iteration variable:

[5]:

from loki import FindNodes,Loop,flatten
iloops = [node for node in FindNodes(Loop).visit(routine.body) if node.variable == 'i']
iloops

[5]:

[Loop:: i=1:n, Loop:: i=1:n]

As the output shows, the visitor search correctly identified both loops. Merging these loops comprises of three main steps. The first is to build a new loop that contains the body of both the loops indentified above:

[6]:

loop_body = flatten([loop.body for loop in iloops])
new_loop = Loop(variable=iloops[0].variable, body=loop_body, bounds=iloops[0].bounds)
print(fgen(new_loop.body))

var_out(i, j, k) = var_in(i, j, k)
var_out(i, j, k) = 2._JPRB*var_out(i, j, k)

new_loop now contains both the Assignment statements of the original iloops. The next step is to build a transformation map - a dictionary that maps the original node to its replacement:

[7]:

loop_map = {iloops[0]: new_loop, iloops[1]: None}

Since we want to merge two loops into one, the first loop is mapped to new_loop and the secone is mapped to None i.e. it will be deleted. With the transformation map defined, we can execute the Transformer:

[8]:

from loki import Transformer
routine.body = Transformer(loop_map).visit(routine.body)
print(routine.to_fortran())

assert len([node for node in FindNodes(Loop).visit(routine.body) if node.variable == 'i']) == 1

SUBROUTINE loop_fuse (n, var_in, var_out)
  USE parkind1, ONLY: jpim, jprb
  IMPLICIT NONE

  INTEGER(KIND=jpim), INTENT(IN) :: n
  REAL(KIND=jprb), INTENT(IN) :: var_in(n, n, n)
  REAL(KIND=jprb), INTENT(OUT) :: var_out(n, n, n)
  INTEGER(KIND=jpim) :: i, j, k

  DO k=1,n
    DO j=1,n
      DO i=1,n
        var_out(i, j, k) = var_in(i, j, k)
        var_out(i, j, k) = 2._JPRB*var_out(i, j, k)
      END DO
    END DO
  END DO

END SUBROUTINE loop_fuse

We have also added an assert statement to programatically check the output of our loop tranformation. The assert will allow pytest to determine if this notebook continues to function as expected with future updates to Loki.

Loops separated by kernel call

Let us now try a more complex loop fusion example:

[9]:

routine = source['loop_fuse_v1']
print(routine.to_fortran())

SUBROUTINE loop_fuse_v1 (n, var_in, var_out)
  USE parkind1, ONLY: jpim, jprb
  IMPLICIT NONE

  INTEGER(KIND=jpim), INTENT(IN) :: n
  REAL(KIND=jprb), INTENT(IN) :: var_in(n, n, n)
  REAL(KIND=jprb), INTENT(OUT) :: var_out(n, n, n)
  INTEGER(KIND=jpim) :: i, j, k

  DO k=1,n
    DO j=1,n
      DO i=1,n
        var_out(i, j, k) = var_in(i, j, k)
      END DO
      DO i=1,n
        var_out(i, j, k) = 2._JPRB*var_out(i, j, k)
      END DO
    END DO

    CALL some_kernel(n, var_out(1, 1, k))

    DO j=1,n
      DO i=1,n
        var_out(i, j, k) = var_out(i, j, k) + 1._JPRB
      END DO
      DO i=1,n
        var_out(i, j, k) = 2._JPRB*var_out(i, j, k)
      END DO
    END DO
  END DO

END SUBROUTINE loop_fuse_v1

In loop_fuse_v1, there are two j-loops separated by a CallStatement to a kernel that modifies var_out. Therefore we can only merge the i-loops within each j-loop.

Using the visitor to locate the i-loops as was done in the previous example is inappropriate in this case, because it will locate all four i-loops:

[10]:

iloops = [node for node in FindNodes(Loop).visit(routine.body) if node.variable == 'i']
iloops

[10]:

[Loop:: i=1:n, Loop:: i=1:n, Loop:: i=1:n, Loop:: i=1:n]

Since we know the hierarchy of the loops, we can instead run the visitor on two levels. First to locate the j-loops, and then locate the i-loops within the body of each j-loop:

[11]:

jloops = [node for node in FindNodes(Loop).visit(routine.body) if node.variable == 'j']
iloops = [[node for node in FindNodes(Loop).visit(loop.body) if node.variable == 'i'] for loop in jloops]
print(iloops[0], iloops[1])

[Loop:: i=1:n, Loop:: i=1:n] [Loop:: i=1:n, Loop:: i=1:n]

We can now merge the two blocks of i-loops:

[12]:

for loop_block in iloops:
    loop_body = flatten([loop.body for loop in loop_block])
    new_loop = Loop(variable=loop_block[0].variable, body=loop_body, bounds=loop_block[0].bounds)
    loop_map[loop_block[0]] = new_loop
    loop_map.update({loop: None for loop in loop_block[1:]})
routine.body = Transformer(loop_map).visit(routine.body)
print(routine.to_fortran())

assert len([node for node in FindNodes(Loop).visit(routine.body) if node.variable == 'i']) == 2

SUBROUTINE loop_fuse_v1 (n, var_in, var_out)
  USE parkind1, ONLY: jpim, jprb
  IMPLICIT NONE

  INTEGER(KIND=jpim), INTENT(IN) :: n
  REAL(KIND=jprb), INTENT(IN) :: var_in(n, n, n)
  REAL(KIND=jprb), INTENT(OUT) :: var_out(n, n, n)
  INTEGER(KIND=jpim) :: i, j, k

  DO k=1,n
    DO j=1,n
      DO i=1,n
        var_out(i, j, k) = var_in(i, j, k)
        var_out(i, j, k) = 2._JPRB*var_out(i, j, k)
      END DO
    END DO

    CALL some_kernel(n, var_out(1, 1, k))

    DO j=1,n
      DO i=1,n
        var_out(i, j, k) = var_out(i, j, k) + 1._JPRB
        var_out(i, j, k) = 2._JPRB*var_out(i, j, k)
      END DO
    END DO
  END DO

END SUBROUTINE loop_fuse_v1

In loop_fuse_v1, identifying the two blocks of i-loops was relatively straightforward because they were nested in different j-loops. Let us now try an example where all the i-loops and the kernel call are within the same j-loop:

[13]:

routine = source['loop_fuse_v2']
print(routine.to_fortran())

SUBROUTINE loop_fuse_v2 (n, var_in, var_out)
  USE parkind1, ONLY: jpim, jprb
  IMPLICIT NONE

  INTEGER(KIND=jpim), INTENT(IN) :: n
  REAL(KIND=jprb), INTENT(IN) :: var_in(n, n, n)
  REAL(KIND=jprb), INTENT(OUT) :: var_out(n, n, n)
  INTEGER(KIND=jpim) :: i, j, k

  DO k=1,n
    DO j=1,n
      DO i=1,n
        var_out(i, j, k) = var_in(i, j, k)
      END DO
      DO i=1,n
        var_out(i, j, k) = 2._JPRB*var_out(i, j, k)
      END DO

      CALL some_kernel(n, var_out(1, j, k))

      DO i=1,n
        var_out(i, j, k) = var_out(i, j, k) + 1._JPRB
      END DO
      DO i=1,n
        var_out(i, j, k) = 2._JPRB*var_out(i, j, k)
      END DO
    END DO
  END DO

END SUBROUTINE loop_fuse_v2

The FindNodes visitor we used previously returns an ordered list of nodes that match a specified type. Previously we only searched for nodes of type Loop. We can easily extend this to also search for CallStatement by passing both node-types as a tuple when initializing the visitor:

[14]:

from loki import CallStatement
jloops = [loop for loop in FindNodes(Loop).visit(routine.body) if loop.variable == 'j']
assert len(jloops) == 1
nodes = FindNodes((CallStatement,Loop)).visit(jloops[0].body)
nodes

[14]:

[Loop:: i=1:n, Loop:: i=1:n, Call:: some_kernel, Loop:: i=1:n, Loop:: i=1:n]

By first using FindNodes to locate the j-loop, and then applying FindNodes to that we have built an ordered list (nodes) containing just the i-loops and the kernel call. We can now identify the loops that appear before and after the kernel call:

[15]:

call_loc = [count for count,node in enumerate(nodes) if isinstance(node,CallStatement)][0]
iloops[0] = [node for node in nodes[:call_loc] if node.variable == 'i']
iloops[1] = [node for node in nodes[call_loc+1:] if node.variable == 'i']

We can now fuse the two blocks of i-loops:

[16]:

for loop_block in iloops:
    loop_body = flatten([loop.body for loop in loop_block])
    new_loop = Loop(variable=loop_block[0].variable, body=loop_body, bounds=loop_block[0].bounds)
    loop_map[loop_block[0]] = new_loop
    loop_map.update({loop: None for loop in loop_block[1:]})
routine.body = Transformer(loop_map).visit(routine.body)
print(routine.to_fortran())

assert len([node for node in FindNodes(Loop).visit(routine.body) if node.variable == 'i']) == 2

SUBROUTINE loop_fuse_v2 (n, var_in, var_out)
  USE parkind1, ONLY: jpim, jprb
  IMPLICIT NONE

  INTEGER(KIND=jpim), INTENT(IN) :: n
  REAL(KIND=jprb), INTENT(IN) :: var_in(n, n, n)
  REAL(KIND=jprb), INTENT(OUT) :: var_out(n, n, n)
  INTEGER(KIND=jpim) :: i, j, k

  DO k=1,n
    DO j=1,n
      DO i=1,n
        var_out(i, j, k) = var_in(i, j, k)
        var_out(i, j, k) = 2._JPRB*var_out(i, j, k)
      END DO

      CALL some_kernel(n, var_out(1, j, k))

      DO i=1,n
        var_out(i, j, k) = var_out(i, j, k) + 1._JPRB
        var_out(i, j, k) = 2._JPRB*var_out(i, j, k)
      END DO
    END DO
  END DO

END SUBROUTINE loop_fuse_v2

Using the built-in `loop_fusion` utility

To facilitate loop fusion and make it readily available to users, Loki has a built-in loop_fusion transformation utility. However, currently this relies on manually annotating the loops with !$loki pragmas. To illustrate how it works, we outline its mechanics in the following:

[17]:

routine = source['loop_fuse_pragma']
print(routine.to_fortran())

SUBROUTINE loop_fuse_pragma (n, var_in, var_out)
  USE parkind1, ONLY: jpim, jprb
  IMPLICIT NONE

  INTEGER(KIND=jpim), INTENT(IN) :: n
  REAL(KIND=jprb), INTENT(IN) :: var_in(n, n, n)
  REAL(KIND=jprb), INTENT(OUT) :: var_out(n, n, n)
  INTEGER(KIND=jpim) :: i, j, k

  DO k=1,n
    DO j=1,n

!$loki loop-fusion group( g1 )
      DO i=1,n
        var_out(i, j, k) = var_in(i, j, k)
      END DO
!$loki loop-fusion group( g1 )
      DO i=1,n
        var_out(i, j, k) = 2._JPRB*var_out(i, j, k)
      END DO

      CALL some_kernel(n, var_out(1, j, k))

!$loki loop-fusion group( g2 )
      DO i=1,n
        var_out(i, j, k) = var_out(i, j, k) + 1._JPRB
      END DO
!$loki loop-fusion group( g2 )
      DO i=1,n
        var_out(i, j, k) = 2._JPRB*var_out(i, j, k)
      END DO

    END DO
  END DO

END SUBROUTINE loop_fuse_pragma

The routine loop_fuse_pragma is identical to loop_fuse_v2 except for the i-loops being preceded by !$loki loop-fusion pragmas. The loops that are candidates for fusion have been assigned to the same group i.e. g1 or g2. Examining the body of loop_fuse_pragma reveals the following:

[18]:

routine.body.view()

<Section::>
  <Comment:: >
  <Loop:: k=1:n>
    <Loop:: j=1:n>
      <Comment:: >
      <Pragma:: loki loop-fusion g...>
      <Loop:: i=1:n>
        <Assignment:: var_out(i, j, k) = var_in(i, j, k)>
      <Pragma:: loki loop-fusion g...>
      <Loop:: i=1:n>
        <Assignment:: var_out(i, j, k) = 2._JPRB*var_out(i, j, k)>
      <Comment:: >
      <Call:: some_kernel>
      <Comment:: >
      <Pragma:: loki loop-fusion g...>
      <Loop:: i=1:n>
        <Assignment:: var_out(i, j, k) = var_out(i, j, k) + 1._JPRB>
      <Pragma:: loki loop-fusion g...>
      <Loop:: i=1:n>
        <Assignment:: var_out(i, j, k) = 2._JPRB*var_out(i, j, k)>
      <Comment:: >
  <Comment:: >

One of the difficulties in parsing pragmas is that it is not always immediately clear whether they should be associated with the subsequent or preceeding node, or should stand alone; as examples think of the differing behaviours of !$omp do, !$omp end do and !$omp barrier. Therefore in Loki, pragmas are not attached by default to other nodes. Instead, Loki treats pragmas essentially like comments, but gives them a separate node-type to easily distinguish them.

In situations where we do wish to associate pragmas with certain nodes, we can do so using the pragmas_attached context manager:

[19]:

from loki import pragmas_attached
with pragmas_attached(routine,Loop):
    routine.body.view()

<Section::>
  <Comment:: >
  <Loop:: k=1:n>
    <Loop:: j=1:n>
      <Comment:: >
      <Loop:: i=1:n>
        <Assignment:: var_out(i, j, k) = var_in(i, j, k)>
      <Loop:: i=1:n>
        <Assignment:: var_out(i, j, k) = 2._JPRB*var_out(i, j, k)>
      <Comment:: >
      <Call:: some_kernel>
      <Comment:: >
      <Loop:: i=1:n>
        <Assignment:: var_out(i, j, k) = var_out(i, j, k) + 1._JPRB>
      <Loop:: i=1:n>
        <Assignment:: var_out(i, j, k) = 2._JPRB*var_out(i, j, k)>
      <Comment:: >
  <Comment:: >

We can now visit the loops and sort them into their respective fusion groups:

[20]:

from loki import is_loki_pragma,get_pragma_parameters,Pragma
from collections import defaultdict

fusion_groups = defaultdict(list)
with pragmas_attached(routine,Loop):
    for loop in FindNodes(Loop).visit(routine.body):
        if is_loki_pragma(loop.pragma, starts_with='loop-fusion'):
            parameters = get_pragma_parameters(loop.pragma, starts_with='loop-fusion')
            group = parameters.get('group', 'default')
            fusion_groups[group] += [loop]

print(f"g1: {fusion_groups['g1']}, g2: {fusion_groups['g2']}")

g1: [Loop:: i=1:n, Loop:: i=1:n], g2: [Loop:: i=1:n, Loop:: i=1:n]

fusion_groups is now a dictionary with keys for the two fusion groups, and the associated Loop nodes are values for each key. We can now create and apply a transformation map similar to how it was done in the previous examples:

[21]:

for group,loops in fusion_groups.items():
    loop_body = flatten([loop.body for loop in loops])
    new_loop = Loop(variable=loops[0].variable, body=loop_body, bounds=loops[0].bounds)
    loop_map[loops[0]] = new_loop
    loop_map.update({loop: None for loop in loops[1:]})

Since !$loki pragmas are only intended to pass instructions/hints to Loki on source manipulations, and aren’t needed for the eventual compilation, we can also remove them from the routine:

[22]:

routine_copy = routine.clone()
routine.body = Transformer(loop_map).visit(routine.body)
pragma_map = {pragma: None for pragma in FindNodes(Pragma).visit(routine.body)}
routine.body = Transformer(pragma_map).visit(routine.body)
print(routine.to_fortran())

assert len([node for node in FindNodes(Loop).visit(routine.body) if node.variable == 'i']) == 2

SUBROUTINE loop_fuse_pragma (n, var_in, var_out)
  USE parkind1, ONLY: jpim, jprb
  IMPLICIT NONE

  INTEGER(KIND=jpim), INTENT(IN) :: n
  REAL(KIND=jprb), INTENT(IN) :: var_in(n, n, n)
  REAL(KIND=jprb), INTENT(OUT) :: var_out(n, n, n)
  INTEGER(KIND=jpim) :: i, j, k

  DO k=1,n
    DO j=1,n

      DO i=1,n
        var_out(i, j, k) = var_in(i, j, k)
        var_out(i, j, k) = 2._JPRB*var_out(i, j, k)
      END DO

      CALL some_kernel(n, var_out(1, j, k))

      DO i=1,n
        var_out(i, j, k) = var_out(i, j, k) + 1._JPRB
        var_out(i, j, k) = 2._JPRB*var_out(i, j, k)
      END DO

    END DO
  END DO

END SUBROUTINE loop_fuse_pragma

You may have noticed in the previous code-cell we made a copy of the object routine before applying a transformation to it. We can now apply the loop_fusion utility directly on routine_copy and compare the results:

[23]:

from loki import loop_fusion
loop_fusion(routine_copy)
pragma_map = {pragma: None for pragma in FindNodes(Pragma).visit(routine_copy.body)}
routine_copy.body = Transformer(pragma_map).visit(routine_copy.body)
print(routine_copy.to_fortran())

loop_fuse_pragma: fused 4 loops in 2 groups.

SUBROUTINE loop_fuse_pragma (n, var_in, var_out)
  USE parkind1, ONLY: jpim, jprb
  IMPLICIT NONE

  INTEGER(KIND=jpim), INTENT(IN) :: n
  REAL(KIND=jprb), INTENT(IN) :: var_in(n, n, n)
  REAL(KIND=jprb), INTENT(OUT) :: var_out(n, n, n)
  INTEGER(KIND=jpim) :: i, j, k

  DO k=1,n
    DO j=1,n

      ! Loki loop-fusion group(g1)
      DO i=1,n
        ! Loki loop-fusion - body 0 begin
        var_out(i, j, k) = var_in(i, j, k)
        ! Loki loop-fusion - body 0 end
        ! Loki loop-fusion - body 1 begin
        var_out(i, j, k) = 2._JPRB*var_out(i, j, k)
        ! Loki loop-fusion - body 1 end
      END DO
      ! Loki loop-fusion group(g1) - loop hoisted

      CALL some_kernel(n, var_out(1, j, k))

      ! Loki loop-fusion group(g2)
      DO i=1,n
        ! Loki loop-fusion - body 0 begin
        var_out(i, j, k) = var_out(i, j, k) + 1._JPRB
        ! Loki loop-fusion - body 0 end
        ! Loki loop-fusion - body 1 begin
        var_out(i, j, k) = 2._JPRB*var_out(i, j, k)
        ! Loki loop-fusion - body 1 end
      END DO
      ! Loki loop-fusion group(g2) - loop hoisted

    END DO
  END DO

END SUBROUTINE loop_fuse_pragma

As we can see, the built-in Loki utility loop_fusion achieves an identical result to our manual transformation.

Loop fusion with Loki

Loops separated by kernel call

Using the built-in loop_fusion utility

Using the built-in `loop_fusion` utility