Loki’s internal representation (IR)
Important
Loki is still under active development and has not yet seen a stable release. Interfaces can change at any time, objects may be renamed, or concepts may be re-thought. Make sure to sync your work to the current release frequently by rebasing feature branches and upstreaming more general applicable work in the form of pull requests.
Loki’s internal representation aims to achieve a balance between usability and general applicability. This means that in places there may be shortcuts taken to ease its use in the context of a source-to-source translation utility but may break with established practices in compiler theory. The IR was developed with Fortran source code in mind and that shows. Where there exist similar concepts in other languages, things are transferable. In other places, Fortran-specific annotations are included for the sole purpose of enabling string reproducibility.
The internal representation is vertically divided into different layers, roughly aligned with high level concepts found in Fortran and other programming languages:
Container data structures
Outermost are container data structures that conceptually translate to Fortran’s program-units, such as modules and subprograms.
Fortran modules are represented by Module
objects which comprise
a specification part (Module.spec
) and a list of Subroutine
objects contained in the module.
Subroutines and functions are represented by Subroutine
objects that
in turn have their own docstring (Subroutine.docstring
),
specification part (Subroutine.spec
), execution part
(Subroutine.body
), and contained subprograms
(Subroutine.members
).
To map these programming language concepts to source files and ease input or output operations, any number of these container data structures can be classes.
Available container classes
|
Class to handle and manipulate source files, storing |
|
Class to handle and manipulate source modules. |
|
Class to handle and manipulate a single subroutine. |
Control flow tree
Specification and execution parts of (sub)programs and modules are the central
components of container data structures. Each of them is represented by a tree
of control flow nodes, with a Section
as root node. This tree resembles
to some extend a hierarchical control flow graph where each node can have
control flow and expression nodes as children. Consequently, this separation on
node level is reflected in the internal representation, splitting the tree into
two levels:
Control flow (e.g., loops, conditionals, assignments, etc.); the corresponding classes are declared in
loki.ir
and described in this section.Expressions (e.g., scalar/array variables, literals, operators, etc.); this is based on Pymbolic with encapsulating classes declared in
loki.expression.symbols
and described below.
The split in IR levels is meant to control the complexity of individual tree traversals, and separate the symbolic expression layer from the more Fortran-specific elements of the IR tree. This loosely follows the principles outlined in Luporini et al..
All control flow nodes implement the common base class Node
and
can have an arbitrary number of children that are either control flow nodes
or expression nodes. Thus, any control flow node looks in principle like the
following:
Node
/ | \
+------+ | +---+
/ | \
/ | \
Expression Expression Node ...
As an example, consider a basic Fortran DO i=1,n
loop: it defines a loop
variable (i
), a loop range (1:n
) and a loop body. The body can be
one/multiple statements or other control flow structures and therefore is a
subtree of control flow nodes. Loop variable and range, however, are
expression nodes.
All control flow nodes fall into one of two categories:
InternalNode
: nodes that have abody
and therefore have other control flow nodes as children.LeafNode
: nodes that (generally) do not have any other control flow nodes as children.
Note that InternalNode
can have other properties than
body
in which control flow nodes are contained as children
(for example, else_body
in Conditional
).
All Node
may, however, have one or multiple expression trees
as children.
Note
All actual control flow nodes are implementations of one of the two base classes. Two notable exceptions to the above are the following:
MultiConditional
(for example, Fortran’sSELECT CASE
): It has multiple bodies and thus does not fit the above framework. Conceptually, these could be converted into nestedConditional
but it would break string reproducibility. For that reason they are retained as aLeafNode
for the time being.TypeDef
: This defines a new scope for symbols, which does not include symbols from the enclosing scope. Thus, it behaves like a leaf node although it has technically control flow nodes as children. It is therefore also implemented as aLeafNode
.
With this separation into two types of nodes, the schematics of the control flow layer of the internal representation are as follows:
InternalNode
|
body
/|||\
+---------------+ /|\ +-------------+
/ +------+ | +-----+ \
/ / | \ \
LeafNode InternalNode LeafNode LeafNode InternalNode ...
| |
body body
/ \ / \
/ \ ....
LeafNode InternalNode
|
...
Available control flow nodes
Abstract base classes
|
Base class for all node types in Loki's internal representation. |
|
Internal representation of a control flow node that has a traversable body property. |
|
Internal representation of a control flow node without a body. |
Internal node classes
|
Internal representation of a single code region. |
|
Internal representation of a code region in which names are associated with expressions or variables. |
|
Internal representation of a loop with induction variable and range. |
|
Internal representation of a while loop in source code. |
|
Internal representation of a conditional branching construct. |
|
Internal representation of a block of code defined by two matching pragmas. |
|
Internal representation of a Fortran interface block. |
Leaf node classes
|
Internal representation of a variable assignment. |
|
Internal representation of an inline conditional assignment using a ternary operator. |
|
Internal representation of a subroutine call. |
|
Internal representation of a variable allocation. |
|
Internal representation of a variable deallocation. |
|
Internal representation of a pointer nullification. |
|
Internal representation of a single comment. |
|
Internal representation of a block comment that is formed from multiple single-line comments. |
|
Internal representation of a pragma. |
|
Internal representation of a preprocessor directive. |
|
Internal representation of an import. |
|
Internal representation of a variable declaration. |
|
Internal representation of a procedure declaration. |
|
Internal representation of a |
|
Internal representation of Fortran statement function statements |
|
Internal representation of a derived type definition. |
|
Internal representation of a multi-value conditional (eg. |
|
Internal representation of a masked array assignment ( |
|
Catch-all generic node for corner-cases. |
|
Internal representation of an |
Expression tree
Many control flow nodes contain one or multiple expressions, such as the
right-hand side of an assignment (loki.ir.Assignment.rhs
) or the
condition of an IF
statement (loki.ir.Conditional.condition
).
Such expressions are represented by expression trees, comprising a single
node (e.g., the left-hand side of an assignment may be just a scalar variable)
or a large expression tree consisting of multiple nested sub-expressions.
Loki’s expression representation is based on Pymbolic but encapsulates all classes with bespoke own implementations. This allows to enrich expression nodes by attaching custom metadata, implementing bespoke comparison operators, or store type information.
The base class for all expression nodes is pymbolic.primitives.Expression
.
Available expression tree nodes
Typed symbol nodes
|
Base class for symbols that carry type information. |
|
Factory class for |
Internal representation of symbols with deferred type |
|
|
Expression node for scalar variables. |
|
Expression node for array variables. |
Internal representation of a symbol that represents a callable subroutine or function |
Literals
|
Factory class to instantiate the best-matching literal node. |
|
A floating point constant in an expression. |
|
An integer constant in an expression. |
|
A boolean constant in an expression. |
|
A string constant in an expression. |
Any literal not represented by a dedicated class. |
|
A list of constant literals, e.g., as used in Array Initialization Lists. |
Mix-ins
Mixin to enable comparing expressions to strings. |
Expression modules
Visitor classes for traversing and transforming all expression trees in Loki’s internal representation (IR). |
|
Mappers for traversing and transforming the Expression tree. |
|
Sub-classes of Pymbolic's native operations that allow us to express niche things like mathematically irrelevant parenthesis that nevertheless change code results. |
|
Expression tree node classes for Expression tree. |
Type information and scopes
Every symbol in an expressions tree (TypedSymbol
, such as Scalar
,
Array
, ProcedureSymbol
) has a type (represented by a
DataType
) and, possibly, other attributes associated with it.
Type and attributes are stored together in a SymbolAttributes
object, which is essentially a dict.
Note
Example: An array variable VAR
may be declared in Fortran as a subroutine
argument in the following way:
INTEGER(4), INTENT(INOUT) :: VAR(10)
This variable has type BasicType.INTEGER
and the following
additional attributes:
KIND=4
INTENT=INOUT
SHAPE=(10,)
The corresponding SymbolAttributes
object can be created as
SymbolAttributes(BasicType.INTEGER, kind=Literal(4), intent='inout', shape=(Literal(10),))
If the variable object is associated with a Scope
, then its
SymbolAttributes
object is stored in the relevant SymbolTable
.
From there, all expression nodes that represent use of the associated symbol
(i.e., the variable object and any others with the same name) query the type
information from there. This means, changing the declared attributes of a symbol
applies this change for all instances of this symbol.
If the variable is not associated with a Scope
, then its
SymbolAttributes
object is stored locally and not shared by any other
variable objects.
Warning
Loki allows to apply changes very freely, which means changing symbol attributes can lead to invalid states.
For example, removing the shape
property from the SymbolAttributes
object in a symbol table converts the corresponding Array
to
a Scalar
variable. But at this point all expression tree nodes will
still be Array
, possibly also with subscript operations (represented
by the dimensions
property).
For plain Array
nodes (without subscript), rebuilding the IR will
automatically take care of instantiating these objects as Scalar
but
removing dimensions
properties must be done explicitly.
Every object that defines a new scope (e.g., Subroutine
,
Module
, implementing Scope
) has an associated symbol table
(SymbolTable
). The SymbolAttributes
of a symbol declared or
imported in a scope are stored in the symbol table of that scope.
These symbol tables/scopes are organized in a hierarchical fashion, i.e., they
are aware of their enclosing scope and allow to recursively look-up entries.
The overall schematics of the scope and type representation are depicted in the following diagram:
Subroutine | Module | TypeDef | ...
\ | /
\ | / <is>
\ | /
Scope
|
| <has>
|
SymbolTable - - - - - - - - - - - - TypedSymbol
|
| <has entries>
|
SymbolAttributes
/ | | \
/ | | \ <has properties>
/ | | \
DataType | (kind) | (intent) | (...)
Available data types
The DataType
of a symbol can be one of
BasicType
: intrinsic types, such asINTEGER
,REAL
, etc.DerivedType
: derived types defined somewhereProcedureType
: any subroutines or functions declared or imported
Note that this is different from the understanding of types in the Fortran standard, where only intrinsic types and derived types are considered a type. Treating also procedures as types allows us to treat them uniformly when considering external subprograms, procedure pointers and type bound procedures.
BasicType | DerivedType | ProcedureType
\ | /
\ | / <implements>
\ | /
DataType
Derived types
Derived type definitions (via TypeDef
) create entries in the scope’s
symbol table in which they are defined to make the type definition available
to declarations.
Imports and deferred type
For imported symbols (via Import
) the source module may not be
available and thus no information about the symbol. This is indicated by
BasicType.DEFERRED
. This is also applied to any variable that is
instantiated without providing a type and where no type information can
be found in the scope’s symbol table (either because no information has
been provided previously or because no scope is attached).
|
Scoping object that manages type caching and derivation for typed symbols. |
|
Lookup table for symbol types that maps symbol names to |
|
Representation of a symbol's attributes, such as data type and declared properties |
Base class for data types a symbol may have |
|
|
Representation of intrinsic data types, names taken from the FORTRAN convention. |
|
Representation of derived data types that may have an associated |
|
Representation of a function or subroutine type definition. |