loki.transformations.temporaries.pool_allocator
Classes
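EcstackPoolAllocatorTransformation | Analog to TemporariesPoolAllocatorTransformation |
TemporariesPoolAllocatorTransformation | Transformation to inject a pool allocator that allocates a large scratch space per block on the driver and maps temporary arrays in kernels to this scratch space |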
- class TemporariesPoolAllocatorTransformation(block_dim, horizontal=None, stack_ptr_name='L', stack_end_name='U', stack_size_name='ISTSZ', stack_storage_name='ZSTACK', stack_argument_name='YDSTACK', stack_local_var_name='YLSTACK', local_ptr_var_name_pattern='IP_{name}', stack_int_type_kind=IntLiteral(8, None), directive=None, check_bounds=True, cray_ptr_loc_rhs=False, stack_size_var_kind=None)
Bases:
Transformation
Transformation to inject a pool allocator that allocates a large scratch space per block on the driver and maps temporary arrays in kernels to this scratch space.
The stack is provided via two integer variables, <stack name>_L and <stack name>_U, which are used as a stack pointer and stack end pointer, respectively. Naming is flexible and can be changed via options to the transformation.
The transformation needs to be applied in reverse traversal order (see reverse_traversal), so that nested kernels are processed before their callers. For each kernel, it will do the following:
- Add an argument/arguments to the kernel call signature to pass the stack integer(s):
  either only the stack pointer is passed, or additionally the stack end pointer if bounds checking is active
- Create a local copy of the stack derived type inside the kernel
- Determine the combined size of all local arrays that are to be allocated by the pool allocator, taking into account calls to nested kernels. This is reported in Item’s trafo_data.
- Inject Cray pointer assignments and stack pointer increments for all temporaries
- Pass the local copy/copies of the stack integer(s) as argument to any nested kernel calls
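The kernel-side bookkeeping listed above can be modelled in a few lines of Python. This is only an illustrative sketch, not the code Loki generates: the function names and the dictionary of byte sizes are hypothetical, addresses are plain integers, and the rounding to multiples of 8 bytes follows the ISHFT(ISHFT(n + 7, -3), 3) pattern used in the generated Fortran.

```python
def round_up_to_8(nbytes):
    """Round a byte count up to the next multiple of 8 (ISHFT(ISHFT(n + 7, -3), 3))."""
    return ((nbytes + 7) >> 3) << 3


def allocate_temporaries(stack_ptr, stack_end, temporaries, check_bounds=True):
    """Hand out stack addresses to temporaries and bump the stack pointer.

    `temporaries` maps a temporary's name to its size in bytes (hypothetical input).
    Returns the address assigned to each Cray pointer and the updated stack pointer.
    """
    local_ptr = stack_ptr  # local copy, so the caller's stack pointer stays untouched
    addresses = {}
    for name, nbytes in temporaries.items():
        addresses[f"IP_{name}"] = local_ptr   # Cray pointer assignment
        local_ptr += round_up_to_8(nbytes)    # stack pointer increment
        if check_bounds and local_ptr > stack_end:
            raise MemoryError(f"stack exhausted while allocating {name}")
    return addresses, local_ptr


# Two temporaries of 20 and 36 bytes on a 64-byte stack starting at address 1000
addresses, top = allocate_temporaries(1000, 1064, {"tmp1": 20, "tmp2": 36})
print(addresses, top)
```

Because the caller's pointer is never advanced, the space used by a kernel is implicitly released when it returns, which is what makes the scheme behave like a stack.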
In a driver routine, the transformation will:
- Determine the required scratch space from trafo_data
- Allocate the scratch space to that size
- Insert data transfers (for OpenACC offloading)
- Insert data sharing clauses into OpenMP or OpenACC pragmas
- Assign stack base pointer and end pointer for each block (identified via block_dim)
- Pass the stack argument(s) to kernel calls
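As a rough illustration of the per-block pointer assignment, the sketch below computes the byte range that each block would receive from a scratch array shaped like ZSTACK(ISTSZ, nb). It assumes 8-byte REAL64 elements and Fortran's column-major layout for a contiguous allocation; the function name and the example base address are hypothetical.

```python
REAL64_BYTES = 8  # size of one REAL(KIND=REAL64) element


def block_stack_bounds(base_address, stack_size, num_blocks):
    """Return the (start, end) byte addresses of each block's slice of the scratch space."""
    bounds = []
    for b in range(1, num_blocks + 1):
        # Models LOC(ZSTACK(1, b)): column b starts (b - 1) full columns after the base
        start = base_address + (b - 1) * stack_size * REAL64_BYTES
        # Models YLSTACK_U = YLSTACK_L + ISTSZ * C_SIZEOF(REAL(1, kind=REAL64))
        end = start + stack_size * REAL64_BYTES
        bounds.append((start, end))
    return bounds


bounds = block_stack_bounds(base_address=4096, stack_size=10, num_blocks=3)
print(bounds)
```

Each block's range ends where the next one begins, so blocks can be processed independently without their temporaries overlapping.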
With cray_ptr_loc_rhs=False the following stack/pool allocator will be generated:

    SUBROUTINE DRIVER (...)
      ...
      INTEGER(KIND=8) :: ISTSZ
      REAL(KIND=REAL64), ALLOCATABLE :: ZSTACK(:, :)
      INTEGER(KIND=8) :: YLSTACK_L
      INTEGER(KIND=8) :: YLSTACK_U
      ISTSZ = ISHFT(7 + C_SIZEOF(REAL(1, kind=jprb))*<array dim1>*<array dim2>, -3) + ...
      ALLOCATE (ZSTACK(ISTSZ, nb))
      DO b=1,nb
        YLSTACK_L = LOC(ZSTACK(1, b))
        YLSTACK_U = YLSTACK_L + ISTSZ*C_SIZEOF(REAL(1, kind=REAL64))
        CALL KERNEL(..., YDSTACK_L=YLSTACK_L, YDSTACK_U=YLSTACK_U)
      END DO
      DEALLOCATE (ZSTACK)
    END SUBROUTINE DRIVER

    SUBROUTINE KERNEL(...)
      ...
      INTEGER(KIND=8) :: YLSTACK_L
      INTEGER(KIND=8) :: YLSTACK_U
      INTEGER(KIND=8), INTENT(INOUT) :: YDSTACK_L
      INTEGER(KIND=8), INTENT(INOUT) :: YDSTACK_U
      POINTER(IP_tmp1, tmp1)
      POINTER(IP_tmp2, tmp2)
      ...
      YLSTACK_L = YDSTACK_L
      YLSTACK_U = YDSTACK_U
      IP_tmp1 = YLSTACK_L
      YLSTACK_L = YLSTACK_L + ISHFT(ISHFT(<array dim1>*<array dim2>*C_SIZEOF(REAL(1, kind=JPRB)) + 7, -3), 3)
      IF (YLSTACK_L > YLSTACK_U) STOP
      IP_tmp2 = YLSTACK_L
      YLSTACK_L = YLSTACK_L + ISHFT(ISHFT(...*C_SIZEOF(REAL(1, kind=JPRB)) + 7, -3), 3)
      IF (YLSTACK_L > YLSTACK_U) STOP
    END SUBROUTINE KERNEL
With cray_ptr_loc_rhs=True the following stack/pool allocator will be generated:

    SUBROUTINE driver (NLON, NZ, NB, field1, field2)
      ...
      INTEGER(KIND=8) :: ISTSZ
      REAL(KIND=REAL64), ALLOCATABLE :: ZSTACK(:, :)
      INTEGER(KIND=8) :: YLSTACK_L
      INTEGER(KIND=8) :: YLSTACK_U
      ISTSZ = ISHFT(7 + C_SIZEOF(REAL(1, kind=jprb))*<array dim1>*<array dim2>, -3) + ...
      ALLOCATE (ZSTACK(ISTSZ, nb))
      DO b=1,nb
        YLSTACK_L = 1
        YLSTACK_U = YLSTACK_L + ISTSZ
        CALL KERNEL(..., YDSTACK_L=YLSTACK_L, YDSTACK_U=YLSTACK_U, ZSTACK=ZSTACK(:, b))
      END DO
      DEALLOCATE (ZSTACK)
    END SUBROUTINE driver

    SUBROUTINE KERNEL(...)
      ...
      INTEGER(KIND=8) :: YLSTACK_L
      INTEGER(KIND=8) :: YLSTACK_U
      INTEGER(KIND=8), INTENT(INOUT) :: YDSTACK_L
      INTEGER(KIND=8), INTENT(INOUT) :: YDSTACK_U
      REAL(KIND=REAL64), CONTIGUOUS, INTENT(INOUT) :: ZSTACK(:)
      POINTER(IP_tmp1, tmp1)
      POINTER(IP_tmp2, tmp2)
      ...
      YLSTACK_L = YDSTACK_L
      YLSTACK_U = YDSTACK_U
      IP_tmp1 = LOC(ZSTACK(YLSTACK_L))
      YLSTACK_L = YLSTACK_L + ISHFT(<array dim1>*<array dim2>*C_SIZEOF(REAL(1, kind=JPRB)) + 7, -3)
      IF (YLSTACK_L > YLSTACK_U) STOP
      IP_tmp2 = LOC(ZSTACK(YLSTACK_L))
      YLSTACK_L = YLSTACK_L + ISHFT(...*C_SIZEOF(REAL(1, kind=JPRB)) + 7, -3)
      IF (YLSTACK_L > YLSTACK_U) STOP
    END SUBROUTINE KERNEL
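In this second mode the stack pointer is a 1-based index into ZSTACK rather than a byte address, so sizes are counted in 8-byte words via ISHFT(n + 7, -3). The Python sketch below mirrors that arithmetic under simplifying assumptions: a single kernel without nested calls, and hypothetical function names and byte sizes. It is meant to show why the driver's size estimate and the kernel's increments agree, not to reproduce Loki's implementation.

```python
def words_needed(nbytes):
    """Number of 8-byte words that cover `nbytes` bytes (ISHFT(n + 7, -3))."""
    return (nbytes + 7) >> 3


def driver_stack_size(temp_bytes):
    """Driver side: sum the per-temporary word counts, as in the ISTSZ expression."""
    return sum(words_needed(n) for n in temp_bytes)


def kernel_offsets(stack_ptr, stack_end, temp_bytes):
    """Kernel side: record the ZSTACK index of each temporary and bump the pointer."""
    offsets = []
    for n in temp_bytes:
        offsets.append(stack_ptr)        # models IP_tmp = LOC(ZSTACK(YLSTACK_L))
        stack_ptr += words_needed(n)     # models the YLSTACK_L increment
        if stack_ptr > stack_end:        # models IF (YLSTACK_L > YLSTACK_U) STOP
            raise MemoryError("stack size exceeded")
    return offsets, stack_ptr


temp_bytes = [20, 36, 8]                 # hypothetical temporary sizes in bytes
istsz = driver_stack_size(temp_bytes)
offsets, top = kernel_offsets(1, 1 + istsz, temp_bytes)
print(istsz, offsets, top)
```

After the last allocation the pointer equals the end pointer exactly, so the bounds check passes with no slack to spare.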
- Parameters:
  block_dim (Dimension) – Dimension object to define the blocking dimension to use for hoisted column arrays if hoisting is enabled.
  stack_ptr_name (str, optional) – Name of the stack pointer variable to be appended to the generic stack name (default: 'L'), resulting in e.g. '<stack name>_L'
  stack_end_name (str, optional) – Name of the stack end pointer variable to be appended to the generic stack name (default: 'U'), resulting in e.g. '<stack name>_U'
  stack_size_name (str, optional) – Name of the variable that holds the size of the scratch space in the driver (default: 'ISTSZ')
  stack_storage_name (str, optional) – Name of the scratch space variable that is allocated in the driver (default: 'ZSTACK')
  stack_argument_name (str, optional) – Name of the stack argument that is added to kernels (default: 'YDSTACK')
  stack_local_var_name (str, optional) – Name of the local copy of the stack argument (default: 'YLSTACK')
  local_ptr_var_name_pattern (str, optional) – Python format string pattern for the name of the Cray pointer variable for each temporary (default: 'IP_{name}')
  stack_int_type_kind (Literal or Variable) – Integer type kind used for the stack pointer variable(s) (default: '8', resulting in 'INTEGER(KIND=8)')
  directive (str, optional) – Can be 'openmp' or 'openacc'. If given, insert data sharing clauses for the stack derived type, and insert data transfer statements (for OpenACC only).
  check_bounds (bool, optional) – Insert bounds checks in the kernel to make sure the allocated stack size is not exceeded (default: True)
  cray_ptr_loc_rhs (bool, optional) – If False, only the stack variable(s) are passed as integers to the kernel(s) and LOC() is called in the driver; if True, the stack array is additionally passed to the kernel(s) and the calls to LOC() are made within the kernel(s) themselves (default: False)
  stack_size_var_kind (Literal or Variable) – Defaults to stack_int_type_kind, but can be overridden if necessary.
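To make the naming options concrete, the sketch below shows how the default values combine into the variable names seen in the generated Fortran (YDSTACK_L, YLSTACK_U, IP_tmp1, and so on). The helper function is hypothetical and is not part of Loki; it only reproduces the string composition described in the parameter list.

```python
def stack_variable_names(
    temporaries,
    stack_argument_name="YDSTACK",
    stack_local_var_name="YLSTACK",
    stack_ptr_name="L",
    stack_end_name="U",
    local_ptr_var_name_pattern="IP_{name}",
    check_bounds=True,
):
    """Compose the names of the stack arguments, local copies and Cray pointers."""
    names = {
        "arguments": [f"{stack_argument_name}_{stack_ptr_name}"],
        "locals": [f"{stack_local_var_name}_{stack_ptr_name}"],
        # One Cray pointer per temporary, named via the Python format pattern
        "cray_pointers": [local_ptr_var_name_pattern.format(name=t) for t in temporaries],
    }
    if check_bounds:
        # The stack end pointer is only needed when bounds checking is active
        names["arguments"].append(f"{stack_argument_name}_{stack_end_name}")
        names["locals"].append(f"{stack_local_var_name}_{stack_end_name}")
    return names


names = stack_variable_names(["tmp1", "tmp2"])
print(names)
```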
- reverse_traversal = True
- process_ignored_items = True
- transform_subroutine(routine, **kwargs)
Defines the transformation to apply to Subroutine items.
For transformations that modify Subroutine objects, this method should be implemented. It gets called via the dispatch method apply().
- Parameters:
  routine (Subroutine) – The subroutine to be transformed.
  **kwargs (optional) – Keyword arguments for the transformation.
- add_driver_imports(routine)
- static import_c_sizeof(routine)
Import the c_sizeof symbol if necessary.
- static import_real64(routine)
Import the real64 symbol if necessary.
- import_allocation_types(routine, item)
Import all the variable types used in allocations.
- apply_pool_allocator_to_temporaries(routine, item=None)
Apply pool allocator to local temporary arrays
This appends the relevant argument to the routine’s dummy argument list and creates the assignment for the local copy of the stack type. For all local arrays, a Cray pointer is instantiated and the temporaries are mapped via Cray pointers to the pool-allocated memory region.
The cumulative size of all temporary arrays is determined and returned.
- create_pool_allocator(routine, stack_size)
Create a pool allocator in the driver
- inject_pool_allocator_into_calls(routine, targets, ignore, driver=False)
Add the pool allocator argument into subroutine calls
- class EcstackPoolAllocatorTransformation(block_dim, horizontal=None, stack_ptr_name='L', stack_end_name='U', stack_size_name='ISTSZ', stack_storage_name='ZSTACK', stack_argument_name='YDSTACK', stack_local_var_name='YLSTACK', local_ptr_var_name_pattern='IP_{name}', stack_int_type_kind=IntLiteral(8, None), directive=None, check_bounds=True, cray_ptr_loc_rhs=False, stack_size_var_kind=None)
Bases:
TemporariesPoolAllocatorTransformation
Analog to TemporariesPoolAllocatorTransformation; however, instead of inserting offload pragmas, it uses an externally defined module to get a pointer to an offloaded chunk of memory.
The minimal interface expected from ECSTACK should look like:

    MODULE ECSTACK_MOD
      IMPLICIT NONE
      TYPE TECSTACK
        ...
      CONTAINS
        PROCEDURE :: GET_STACK_PTR
      END TYPE TECSTACK
      PRIVATE
      TYPE(TECSTACK) :: ECSTACK
      PUBLIC :: TECSTACK, ECSTACK
    CONTAINS
      SUBROUTINE GET_STACK_PTR(SELF, PTR, KSIZE, NGPBLKS)
        CLASS(TECSTACK) :: SELF
        REAL(KIND=JPRD), POINTER, CONTIGUOUS, INTENT(INOUT) :: PTR(:, :)
        INTEGER(KIND=JPIM), INTENT(IN) :: KSIZE
        INTEGER(KIND=JPIM), INTENT(IN) :: NGPBLKS
        ...
      END SUBROUTINE GET_STACK_PTR
    END MODULE ECSTACK_MOD

- add_driver_imports(routine)
- static import_ecstack(routine)