loki.transformations.pool_allocator
Classes
Transformation to inject a pool allocator that allocates a large scratch space per block on the driver and maps temporary arrays in kernels to this scratch space
- class TemporariesPoolAllocatorTransformation(block_dim, stack_ptr_name='L', stack_end_name='U', stack_size_name='ISTSZ', stack_storage_name='ZSTACK', stack_argument_name='YDSTACK', stack_local_var_name='YLSTACK', local_ptr_var_name_pattern='IP_{name}', stack_int_type_kind=IntLiteral(8, None), directive=None, check_bounds=True, cray_ptr_loc_rhs=False)
Bases: Transformation
Transformation to inject a pool allocator that allocates a large scratch space per block on the driver and maps temporary arrays in kernels to this scratch space
The stack is provided via two integer variables, <stack name>_L and <stack name>_U, which are used as a stack pointer and stack end pointer, respectively. Naming is flexible and can be changed via options to the transformation.
The transformation needs to be applied in reverse order, which will do the following for each kernel:
- Add one or more arguments to the kernel call signature to pass the stack integer(s): either only the stack pointer, or additionally the stack end pointer if bounds checking is active
- Create a local copy of the stack derived type inside the kernel
- Determine the combined size of all local arrays that are to be allocated by the pool allocator, taking into account calls to nested kernels. This is reported in Item's trafo_data.
- Inject Cray pointer assignments and stack pointer increments for all temporaries
- Pass the local copy/copies of the stack integer(s) as argument to any nested kernel calls
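The kernel-side steps above can be illustrated with a small pure-Python model of the pointer arithmetic. This is only a sketch, not Loki code: the function names, the `StackOverflow` exception (standing in for the generated `STOP`) and the fixed 8-byte element size are assumptions made for illustration.

```python
ELEM_BYTES = 8  # assumed element size, standing in for MAX(C_SIZEOF(...), 8)


class StackOverflow(Exception):
    """Hypothetical stand-in for the generated `IF (L > U) STOP`."""


def allocate(stack_l, stack_u, n_elements, check_bounds=True):
    """Map one temporary: return (cray_pointer, incremented stack pointer)."""
    pointer = stack_l                    # IP_tmp = YLSTACK_L
    stack_l += n_elements * ELEM_BYTES   # YLSTACK_L = YLSTACK_L + <size>
    if check_bounds and stack_l > stack_u:
        raise StackOverflow(f"stack exceeded by {stack_l - stack_u} bytes")
    return pointer, stack_l


def nested_kernel(ydstack_l, ydstack_u):
    # Local copy of the incoming stack integers
    ylstack_l, ylstack_u = ydstack_l, ydstack_u
    ip_work, ylstack_l = allocate(ylstack_l, ylstack_u, 10)
    return ip_work


def kernel(ydstack_l, ydstack_u):
    ylstack_l, ylstack_u = ydstack_l, ydstack_u
    ip_tmp1, ylstack_l = allocate(ylstack_l, ylstack_u, 20)
    ip_tmp2, ylstack_l = allocate(ylstack_l, ylstack_u, 30)
    # Nested calls receive the local copy, so the caller's pointer is
    # unchanged on return and both calls start at the same address.
    first = nested_kernel(ylstack_l, ylstack_u)
    second = nested_kernel(ylstack_l, ylstack_u)
    return ip_tmp1, ip_tmp2, first, second
```

In this model, each temporary starts where the previous one ended, and a nested kernel begins allocating directly after its caller's temporaries.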
By default, all local array arguments are allocated by the pool allocator, but this can be restricted to include only those that have at least one dimension matching one of those provided in allocation_dims.
In a driver routine, the transformation will:
- Determine the required scratch space from trafo_data
- Allocate the scratch space to that size
- Insert data transfers (for OpenACC offloading)
- Insert data sharing clauses into OpenMP or OpenACC pragmas
- Assign stack base pointer and end pointer for each block (identified via block_dim)
- Pass the stack argument(s) to kernel calls
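The driver-side sizing and per-block pointer assignment can be sketched in plain Python. This sketch assumes that a kernel needs its own temporaries plus the largest requirement among its callees, since each callee works on a copy of the stack pointer and sibling calls can reuse the same region. That rule, the helper names and the dictionary layout are illustrative assumptions, not Loki's API or its `trafo_data` format.

```python
def required_elements(kernel, own_sizes, callees):
    """Elements needed by `kernel`: own temporaries plus the largest callee need."""
    nested = max(
        (required_elements(c, own_sizes, callees) for c in callees.get(kernel, [])),
        default=0,
    )
    return own_sizes[kernel] + nested


def block_pointers(base_address, stack_size, block, elem_bytes=8):
    """Byte addresses (L, U) for a 1-based block of ZSTACK(stack_size, nb).

    Models YLSTACK_L = LOC(ZSTACK(1, b)) for a contiguous, column-major array.
    """
    stack_l = base_address + (block - 1) * stack_size * elem_bytes
    stack_u = stack_l + stack_size * elem_bytes
    return stack_l, stack_u


# Hypothetical call tree: `kernel` calls `nested_a` and `nested_b`
own_sizes = {"kernel": 50, "nested_a": 10, "nested_b": 25}
callees = {"kernel": ["nested_a", "nested_b"]}
stack_size = required_elements("kernel", own_sizes, callees)
```

Because every block gets its own column of the scratch array, the `[L, U)` ranges of neighbouring blocks do not overlap, which is what allows blocks to be processed in parallel.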
With cray_ptr_loc_rhs=False the following stack/pool allocator will be generated:

    SUBROUTINE DRIVER (...)
      ...
      INTEGER(KIND=8) :: ISTSZ
      REAL, ALLOCATABLE :: ZSTACK(:, :)
      INTEGER(KIND=8) :: YLSTACK_L
      INTEGER(KIND=8) :: YLSTACK_U
      ISTSZ = (MAX(C_SIZEOF(REAL(1, kind=jprb)), 8)*<array dim1>*<array dim2> + ...) / &
       & MAX(C_SIZEOF(REAL(1, kind=JPRB)), 8)
      ALLOCATE (ZSTACK(ISTSZ, nb))
      DO b=1,nb
        YLSTACK_L = LOC(ZSTACK(1, b))
        YLSTACK_U = YLSTACK_L + ISTSZ*MAX(C_SIZEOF(REAL(1, kind=JPRB)), 8)
        CALL KERNEL(..., YDSTACK_L=YLSTACK_L, YDSTACK_U=YLSTACK_U)
      END DO
      DEALLOCATE (ZSTACK)
    END SUBROUTINE DRIVER

    SUBROUTINE KERNEL(...)
      ...
      INTEGER(KIND=8) :: YLSTACK_L
      INTEGER(KIND=8) :: YLSTACK_U
      INTEGER(KIND=8), INTENT(INOUT) :: YDSTACK_L
      INTEGER(KIND=8), INTENT(INOUT) :: YDSTACK_U
      POINTER(IP_tmp1, tmp1)
      POINTER(IP_tmp2, tmp2)
      ...
      YLSTACK_L = YDSTACK_L
      YLSTACK_U = YDSTACK_U
      IP_tmp1 = YLSTACK_L
      YLSTACK_L = YLSTACK_L + <array dim1>*<array dim2>*MAX(C_SIZEOF(REAL(1, kind=jprb)), 8)
      IF (YLSTACK_L > YLSTACK_U) STOP
      IP_tmp2 = YLSTACK_L
      YLSTACK_L = YLSTACK_L + ...*MAX(C_SIZEOF(REAL(1, kind=jprb)), 8)
      IF (YLSTACK_L > YLSTACK_U) STOP
    END SUBROUTINE KERNEL
With cray_ptr_loc_rhs=True the following stack/pool allocator will be generated:

    SUBROUTINE driver (NLON, NZ, NB, field1, field2)
      ...
      INTEGER(KIND=8) :: ISTSZ
      REAL(KIND=JPRB), ALLOCATABLE :: ZSTACK(:, :)
      INTEGER(KIND=8) :: YLSTACK_L
      INTEGER(KIND=8) :: YLSTACK_U
      ISTSZ = <array dim1>*<array dim2>
      ALLOCATE (ZSTACK(ISTSZ, nb))
      DO b=1,nb
        YLSTACK_L = 1
        YLSTACK_U = YLSTACK_L + ISTSZ
        CALL KERNEL(..., YDSTACK_L=YLSTACK_L, YDSTACK_U=YLSTACK_U, ZSTACK=ZSTACK(:, b))
      END DO
      DEALLOCATE (ZSTACK)
    END SUBROUTINE driver

    SUBROUTINE KERNEL(...)
      ...
      INTEGER(KIND=8) :: YLSTACK_L
      INTEGER(KIND=8) :: YLSTACK_U
      INTEGER(KIND=8), INTENT(INOUT) :: YDSTACK_L
      INTEGER(KIND=8), INTENT(INOUT) :: YDSTACK_U
      REAL(KIND=JPRB), CONTIGUOUS, INTENT(INOUT) :: ZSTACK(:)
      POINTER(IP_tmp1, tmp1)
      POINTER(IP_tmp2, tmp2)
      ...
      YLSTACK_L = YDSTACK_L
      YLSTACK_U = YDSTACK_U
      IP_tmp1 = LOC(ZSTACK(YLSTACK_L))
      YLSTACK_L = YLSTACK_L + <array dim1>*<array dim2>
      IF (YLSTACK_L > YLSTACK_U) STOP
      IP_tmp2 = LOC(ZSTACK(YLSTACK_L))
      YLSTACK_L = YLSTACK_L + ...
      IF (YLSTACK_L > YLSTACK_U) STOP
    END SUBROUTINE KERNEL
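The two variants differ in what the stack pointer counts: byte addresses when `cray_ptr_loc_rhs=False`, and 1-based element indices into `ZSTACK` when `cray_ptr_loc_rhs=True`. The following Python model is a sketch under assumptions (a fixed 8-byte element size, a made-up base address, hypothetical function names) that mirrors both schemes so the resulting Cray pointer values can be compared.

```python
def byte_mode_pointers(base_address, sizes, elem_bytes=8):
    """cray_ptr_loc_rhs=False: the stack pointer is a byte address."""
    stack_l = base_address               # YLSTACK_L = LOC(ZSTACK(1, b)) in the driver
    pointers = []
    for n_elements in sizes:
        pointers.append(stack_l)         # IP_tmp = YLSTACK_L
        stack_l += n_elements * elem_bytes
    return pointers


def index_mode_pointers(base_address, sizes, elem_bytes=8):
    """cray_ptr_loc_rhs=True: the stack pointer is a 1-based index into ZSTACK."""
    def loc(index):
        # Models LOC(ZSTACK(index)) for a contiguous array
        return base_address + (index - 1) * elem_bytes

    stack_l = 1                          # YLSTACK_L = 1 in the driver
    pointers = []
    for n_elements in sizes:
        pointers.append(loc(stack_l))    # IP_tmp = LOC(ZSTACK(YLSTACK_L))
        stack_l += n_elements
    return pointers
```

Under these assumptions both functions return the same addresses for the same list of temporary sizes; the difference is only whether `LOC()` is evaluated once in the driver or once per temporary in the kernel.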
- Parameters:
block_dim (Dimension) – Dimension object to define the blocking dimension to use for hoisted column arrays if hoisting is enabled.
stack_ptr_name (str, optional) – Name of the stack pointer variable to be appended to the generic stack name (default: 'L'), resulting in e.g. '<stack name>_L'
stack_end_name (str, optional) – Name of the stack end pointer variable to be appended to the generic stack name (default: 'U'), resulting in e.g. '<stack name>_U'
stack_size_name (str, optional) – Name of the variable that holds the size of the scratch space in the driver (default: 'ISTSZ')
stack_storage_name (str, optional) – Name of the scratch space variable that is allocated in the driver (default: 'ZSTACK')
stack_argument_name (str, optional) – Name of the stack argument that is added to kernels (default: 'YDSTACK')
stack_local_var_name (str, optional) – Name of the local copy of the stack argument (default: 'YLSTACK')
local_ptr_var_name_pattern (str, optional) – Python format string pattern for the name of the Cray pointer variable for each temporary (default: 'IP_{name}')
stack_int_type_kind (Literal or Variable) – Integer type kind used for the stack pointer variable(s) (default: '8', resulting in 'INTEGER(KIND=8)')
directive (str, optional) – Can be 'openmp' or 'openacc'. If given, insert data sharing clauses for the stack derived type, and insert data transfer statements (for OpenACC only).
check_bounds (bool, optional) – Insert bounds-checks in the kernel to make sure the allocated stack size is not exceeded (default: True)
cray_ptr_loc_rhs (bool, optional) – If False, only the stack variable(s) are passed as integers to the kernel(s) and LOC() is called in the driver. If True, the stack array is additionally passed to the kernel(s) and the calls to LOC() are placed within the kernel(s) themselves (default: False)
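How the naming options combine can be shown with a short Python sketch. The helper `stack_names` is hypothetical and not part of Loki; it only reproduces the naming scheme described above, using the documented defaults and assuming the generic stack name and the pointer suffix are joined with an underscore, as in the generated code.

```python
def stack_names(stack_argument_name="YDSTACK", stack_local_var_name="YLSTACK",
                stack_ptr_name="L", stack_end_name="U",
                local_ptr_var_name_pattern="IP_{name}", temporaries=()):
    """Derive the variable names that appear in the generated code."""
    return {
        "argument_ptr": f"{stack_argument_name}_{stack_ptr_name}",
        "argument_end": f"{stack_argument_name}_{stack_end_name}",
        "local_ptr": f"{stack_local_var_name}_{stack_ptr_name}",
        "local_end": f"{stack_local_var_name}_{stack_end_name}",
        # The pattern is a Python format string with a `name` field
        "cray_pointers": [local_ptr_var_name_pattern.format(name=t)
                          for t in temporaries],
    }


defaults = stack_names(temporaries=("tmp1", "tmp2"))
custom = stack_names(stack_ptr_name="PTR", local_ptr_var_name_pattern="P_{name}",
                     temporaries=("tmp1",))
```

With the defaults this yields the names used in the generated examples, such as `YDSTACK_L`, `YLSTACK_U` and `IP_tmp1`.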
- reverse_traversal = True
- process_ignored_items = True
- transform_subroutine(routine, **kwargs)
Defines the transformation to apply to Subroutine items.
For transformations that modify Subroutine objects, this method should be implemented. It gets called via the dispatch method apply().
- Parameters:
routine (Subroutine) – The subroutine to be transformed.
**kwargs (optional) – Keyword arguments for the transformation.
- static import_c_sizeof(routine)
Import the c_sizeof symbol if necessary.
- import_allocation_types(routine, item)
Import all the variable types used in allocations.
- apply_pool_allocator_to_temporaries(routine, item=None)
Apply pool allocator to local temporary arrays
This appends the relevant argument to the routine’s dummy argument list and creates the assignment for the local copy of the stack type. For all local arrays, a Cray pointer is instantiated and the temporaries are mapped via Cray pointers to the pool-allocated memory region.
The cumulative size of all temporary arrays is determined and returned.
- create_pool_allocator(routine, stack_size)
Create a pool allocator in the driver
- inject_pool_allocator_into_calls(routine, targets, ignore, driver=False)
Add the pool allocator argument into subroutine calls