loki.transformations.pool_allocator

Classes

TemporariesPoolAllocatorTransformation(block_dim)

Transformation to inject a pool allocator that allocates a large scratch space per block on the driver and maps temporary arrays in kernels to this scratch space

class TemporariesPoolAllocatorTransformation(block_dim, stack_ptr_name='L', stack_end_name='U', stack_size_name='ISTSZ', stack_storage_name='ZSTACK', stack_argument_name='YDSTACK', stack_local_var_name='YLSTACK', local_ptr_var_name_pattern='IP_{name}', stack_int_type_kind=IntLiteral(8, None), directive=None, check_bounds=True, cray_ptr_loc_rhs=False)

Bases: Transformation

Transformation to inject a pool allocator that allocates a large scratch space per block on the driver and maps temporary arrays in kernels to this scratch space

The stack is provided via two integer variables, <stack name>_L and <stack name>_U, which are used as a stack pointer and stack end pointer, respectively. Naming is flexible and can be changed via options to the transformation.

The transformation must be applied in reverse traversal order, so that kernels are processed before the routines that call them. For each kernel, it will:

Add one or more arguments to the kernel call signature to pass the stack integer(s): only the stack pointer is passed, or additionally the stack end pointer if bounds checking is active
Create a local copy of the stack integer(s) inside the kernel
Determine the combined size of all local arrays that are to be allocated by the pool allocator, taking into account calls to nested kernels. This is reported in Item’s trafo_data.
Inject Cray pointer assignments and stack pointer increments for all temporaries
Pass the local copy/copies of the stack integer(s) as argument to any nested kernel calls
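
The size accounting for nested kernels can be illustrated with a minimal Python sketch. It assumes that a kernel needs its own temporaries plus the largest requirement among the kernels it calls: each callee works on a local copy of the stack pointer, so sibling calls can reuse the same region. The kernel names, byte counts and the `stack_bytes` helper are hypothetical, and this is not necessarily how Loki combines the sizes internally.

```python
# Hypothetical per-kernel data: bytes needed for a kernel's own temporaries
# and the kernels it calls. Names and numbers are illustrative only.
own_bytes = {"kernel_a": 4096, "kernel_b": 1024, "kernel_c": 2048}
callees = {"kernel_a": ["kernel_b", "kernel_c"], "kernel_b": [], "kernel_c": []}


def stack_bytes(name):
    """Own temporaries plus the largest requirement among nested kernels.

    Assumption: sibling calls reuse the same stack region, because each
    callee advances only its local copy of the stack pointer.
    """
    nested = max((stack_bytes(c) for c in callees[name]), default=0)
    return own_bytes[name] + nested


total = stack_bytes("kernel_a")
print(total)
```

Under this assumption, the value computed for the outermost kernel is what a driver would need to reserve per block.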

By default, all local temporary arrays are allocated by the pool allocator, but this can be restricted to include only those that have at least one dimension matching one of those provided in allocation_dims.

In a driver routine, the transformation will:

Determine the required scratch space from trafo_data
Allocate the scratch space to that size
Insert data transfers (for OpenACC offloading)
Insert data sharing clauses into OpenMP or OpenACC pragmas
Assign stack base pointer and end pointer for each block (identified via block_dim)
Pass the stack argument(s) to kernel calls
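
The per-block assignment of the stack base and end pointers can be sketched in Python. The sketch assumes a contiguous, column-major scratch array of shape (ISTSZ, nb) with 8-byte elements, so that block b starts (b-1)*ISTSZ elements after the start of the array. The base address, sizes and the `block_bounds` helper are hypothetical values chosen for illustration.

```python
ELEM_BYTES = 8    # assumed element size of the scratch array
ISTSZ = 100       # hypothetical stack size per block, in elements
NB = 4            # hypothetical number of blocks
BASE = 0x10000    # hypothetical address of the first scratch element


def block_bounds(b):
    """Return (base pointer, end pointer) in bytes for 1-based block b."""
    lower = BASE + (b - 1) * ISTSZ * ELEM_BYTES
    upper = lower + ISTSZ * ELEM_BYTES
    return lower, upper


bounds = [block_bounds(b) for b in range(1, NB + 1)]
for b, (lower, upper) in enumerate(bounds, start=1):
    print(b, hex(lower), hex(upper))
```

Each block's end pointer coincides with the next block's base pointer, so blocks never overlap and can be processed independently.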

With cray_ptr_loc_rhs=False the following stack/pool allocator will be generated:

SUBROUTINE DRIVER (...)
  ...
  INTEGER(KIND=8) :: ISTSZ
  REAL, ALLOCATABLE :: ZSTACK(:, :)
  INTEGER(KIND=8) :: YLSTACK_L
  INTEGER(KIND=8) :: YLSTACK_U
  ISTSZ = (MAX(C_SIZEOF(REAL(1, kind=jprb)), 8)*<array dim1>*<array dim2> + ...) / &
   & MAX(C_SIZEOF(REAL(1, kind=JPRB)), 8)
  ALLOCATE (ZSTACK(ISTSZ, nb))
  DO b=1,nb
    YLSTACK_L = LOC(ZSTACK(1, b))
    YLSTACK_U = YLSTACK_L + ISTSZ*MAX(C_SIZEOF(REAL(1, kind=JPRB)), 8)
    CALL KERNEL(..., YDSTACK_L=YLSTACK_L, YDSTACK_U=YLSTACK_U)
  END DO
  DEALLOCATE (ZSTACK)
END SUBROUTINE DRIVER

SUBROUTINE KERNEL(...)
  ...
  INTEGER(KIND=8) :: YLSTACK_L
  INTEGER(KIND=8) :: YLSTACK_U
  INTEGER(KIND=8), INTENT(INOUT) :: YDSTACK_L
  INTEGER(KIND=8), INTENT(INOUT) :: YDSTACK_U
  POINTER(IP_tmp1, tmp1)
  POINTER(IP_tmp2, tmp2)
  ...
  YLSTACK_L = YDSTACK_L
  YLSTACK_U = YDSTACK_U
  IP_tmp1 = YLSTACK_L
  YLSTACK_L = YLSTACK_L + <array dim1>*<array dim2>*MAX(C_SIZEOF(REAL(1, kind=jprb)), 8)
  IF (YLSTACK_L > YLSTACK_U) STOP
  IP_tmp2 = YLSTACK_L
  YLSTACK_L = YLSTACK_L + ...*MAX(C_SIZEOF(REAL(1, kind=jprb)), 8)
  IF (YLSTACK_L > YLSTACK_U) STOP
END SUBROUTINE KERNEL
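
The pointer arithmetic in the kernel above can be mimicked with a short Python sketch. It models the stack pointer as a plain byte address, advances it by the size of each temporary, and applies the same bounds check. The addresses, array sizes and the `allocate` helper are hypothetical; an 8-byte element size is assumed.

```python
ELEM_BYTES = 8  # assumed value of MAX(C_SIZEOF(REAL(1, kind=jprb)), 8)


def allocate(stack_l, stack_u, n_elements):
    """Mimic one Cray pointer assignment followed by a stack increment.

    Returns the address handed to the temporary and the advanced pointer.
    """
    ptr = stack_l
    stack_l = stack_l + n_elements * ELEM_BYTES
    if stack_l > stack_u:  # corresponds to IF (YLSTACK_L > YLSTACK_U) STOP
        raise MemoryError("pool allocator stack exhausted")
    return ptr, stack_l


# Hypothetical values passed in by the driver
ydstack_l = 4096
ydstack_u = ydstack_l + 30 * ELEM_BYTES

# Local copies, so the caller's stack pointer is left untouched
ylstack_l, ylstack_u = ydstack_l, ydstack_u
ip_tmp1, ylstack_l = allocate(ylstack_l, ylstack_u, 10)
ip_tmp2, ylstack_l = allocate(ylstack_l, ylstack_u, 20)
print(ip_tmp1, ip_tmp2, ylstack_l)
```

Because the check uses a strict comparison, a kernel that consumes exactly the reserved space passes; only an allocation that moves the pointer beyond the end pointer triggers the stop.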

With cray_ptr_loc_rhs=True the following stack/pool allocator will be generated:

SUBROUTINE driver (NLON, NZ, NB, field1, field2)
  ...
  INTEGER(KIND=8) :: ISTSZ
  REAL(KIND=JPRB), ALLOCATABLE :: ZSTACK(:, :)
  INTEGER(KIND=8) :: YLSTACK_L
  INTEGER(KIND=8) :: YLSTACK_U
  ISTSZ = <array dim1>*<array dim2>
  ALLOCATE (ZSTACK(ISTSZ, nb))
  DO b=1,nb
    YLSTACK_L = 1
    YLSTACK_U = YLSTACK_L + ISTSZ
    CALL KERNEL(..., YDSTACK_L=YLSTACK_L, YDSTACK_U=YLSTACK_U, ZSTACK=ZSTACK(:, b))
  END DO
  DEALLOCATE (ZSTACK)
END SUBROUTINE driver

SUBROUTINE KERNEL(...)
  ...
  INTEGER(KIND=8) :: YLSTACK_L
  INTEGER(KIND=8) :: YLSTACK_U
  INTEGER(KIND=8), INTENT(INOUT) :: YDSTACK_L
  INTEGER(KIND=8), INTENT(INOUT) :: YDSTACK_U
  REAL(KIND=JPRB), CONTIGUOUS, INTENT(INOUT) :: ZSTACK(:)
  POINTER(IP_tmp1, tmp1)
  POINTER(IP_tmp2, tmp2)
  ...
  YLSTACK_L = YDSTACK_L
  YLSTACK_U = YDSTACK_U
  IP_tmp1 = LOC(ZSTACK(YLSTACK_L))
  YLSTACK_L = YLSTACK_L + <array dim1>*<array dim2>
  IF (YLSTACK_L > YLSTACK_U) STOP
  IP_tmp2 = LOC(ZSTACK(YLSTACK_L))
  YLSTACK_L = YLSTACK_L + ...
  IF (YLSTACK_L > YLSTACK_U) STOP
END SUBROUTINE KERNEL
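
In this variant the stack pointer is a 1-based index into the scratch array rather than an address. The following Python sketch illustrates how temporaries then alias sections of the scratch space, using a `memoryview` over a typed array as a stand-in for one block's ZSTACK(:, b). The sizes and the `allocate` helper are hypothetical.

```python
from array import array

ISTSZ = 6                                # hypothetical stack size in elements
zstack = array("d", [0.0] * ISTSZ)       # stand-in for one block's scratch column
view = memoryview(zstack)


def allocate(storage, stack_l, stack_u, n_elements):
    """Return a view of n_elements starting at 1-based index stack_l."""
    start = stack_l
    stack_l = stack_l + n_elements
    if stack_l > stack_u:
        raise MemoryError("pool allocator stack exhausted")
    return storage[start - 1:start - 1 + n_elements], stack_l


ylstack_l = 1
ylstack_u = ylstack_l + ISTSZ
tmp1, ylstack_l = allocate(view, ylstack_l, ylstack_u, 2)
tmp2, ylstack_l = allocate(view, ylstack_l, ylstack_u, 4)

# Writes through the temporaries land in the scratch space itself
tmp1[0] = 1.5
tmp2[3] = 2.5
print(list(zstack))
```

No byte-size factor appears in the increments here, because the index counts elements of the scratch array directly.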

Parameters:

block_dim (Dimension) – Dimension object to define the blocking dimension to use for hoisted column arrays if hoisting is enabled.
stack_ptr_name (str, optional) – Name of the stack pointer variable to be appended to the generic stack name (default: 'L') resulting in e.g., '<stack name>_L'
stack_end_name (str, optional) – Name of the stack end pointer variable to be appended to the generic stack name (default: 'U') resulting in e.g., '<stack name>_U'
stack_size_name (str, optional) – Name of the variable that holds the size of the scratch space in the driver (default: 'ISTSZ')
stack_storage_name (str, optional) – Name of the scratch space variable that is allocated in the driver (default: 'ZSTACK')
stack_argument_name (str, optional) – Name of the stack argument that is added to kernels (default: 'YDSTACK')
stack_local_var_name (str, optional) – Name of the local copy of the stack argument (default: 'YLSTACK')
local_ptr_var_name_pattern (str, optional) – Python format string pattern for the name of the Cray pointer variable for each temporary (default: 'IP_{name}')
stack_int_type_kind (Literal or Variable) – Integer type kind used for the stack pointer variable(s) (default: '8' resulting in 'INTEGER(KIND=8)')
directive (str, optional) – Can be 'openmp' or 'openacc'. If given, insert data sharing clauses for the stack variable(s), and insert data transfer statements (for OpenACC only).
check_bounds (bool, optional) – Insert bounds-checks in the kernel to make sure the allocated stack size is not exceeded (default: True)
cray_ptr_loc_rhs (bool, optional) – If False, only the stack pointer is passed to the kernel(s) as an integer; if True, the whole stack array is passed to the kernel(s) and the calls to LOC() are made within the kernel(s) themselves (default: False)
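
As an illustration of how the naming options relate to the identifiers in the generated code above, the following Python sketch composes the variable names from the default option values. The composition with an underscore is inferred from the generated names (e.g. YDSTACK_L) and is an assumption about the result, not a description of Loki's internal code; `tmp1` is a hypothetical temporary.

```python
# Default option values, as listed in the parameter descriptions
stack_argument_name = "YDSTACK"
stack_local_var_name = "YLSTACK"
stack_ptr_name = "L"
stack_end_name = "U"
local_ptr_var_name_pattern = "IP_{name}"

# Names of the kernel arguments and their local copies
arg_ptr = f"{stack_argument_name}_{stack_ptr_name}"
arg_end = f"{stack_argument_name}_{stack_end_name}"
local_ptr = f"{stack_local_var_name}_{stack_ptr_name}"
local_end = f"{stack_local_var_name}_{stack_end_name}"

# Name of the Cray pointer for a hypothetical temporary called tmp1
cray_ptr = local_ptr_var_name_pattern.format(name="tmp1")

print(arg_ptr, arg_end, local_ptr, local_end, cray_ptr)
```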

reverse_traversal = True

process_ignored_items = True
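
The reverse traversal requirement means that every kernel is processed before any routine that calls it, so that the driver can read the accumulated stack size. A minimal sketch of such an ordering, using Python's standard `graphlib` module on a hypothetical call graph (Loki's own scheduler is not used here):

```python
from graphlib import TopologicalSorter

# Hypothetical call graph: each routine maps to the routines it calls.
# TopologicalSorter treats the mapped values as prerequisites, so callees
# are emitted before their callers.
calls = {
    "driver": {"kernel_a"},
    "kernel_a": {"kernel_b", "kernel_c"},
    "kernel_b": set(),
    "kernel_c": set(),
}

order = list(TopologicalSorter(calls).static_order())
print(order)
```

The relative order of `kernel_b` and `kernel_c` is not fixed, but both precede `kernel_a`, and `driver` always comes last.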

transform_subroutine(routine, **kwargs)

Defines the transformation to apply to Subroutine items.

For transformations that modify Subroutine objects, this method should be implemented. It gets called via the dispatch method apply().

Parameters:

routine (Subroutine) – The subroutine to be transformed.
**kwargs (optional) – Keyword arguments for the transformation.

static import_c_sizeof(routine)

Import the c_sizeof symbol if necessary.

import_allocation_types(routine, item)

Import all the variable types used in allocations.

apply_pool_allocator_to_temporaries(routine, item=None)

Apply pool allocator to local temporary arrays

This appends the relevant argument to the routine’s dummy argument list and creates the assignment for the local copy of the stack type. For all local arrays, a Cray pointer is instantiated and the temporaries are mapped via Cray pointers to the pool-allocated memory region.

The cumulative size of all temporary arrays is determined and returned.

create_pool_allocator(routine, stack_size)

Create a pool allocator in the driver

inject_pool_allocator_into_calls(routine, targets, ignore, driver=False)

Add the pool allocator argument into subroutine calls