This introduces a new function `normalizeInductionVar()` to the static
loop utility code in `concretelang/Analysis/StaticLoops.h`, containing
the IV normalization code extracted from the batching code, and changes
the batching code to make use of the factored-out function.
For `ExtractSliceOp` and `InsertSliceOp`, the code performing the
hoisting of indexed operations in the batching pass derives the
indexes of the hoisted operation from the indexes provided by
`ExtractSliceOp::getOffsets()` and
`InsertSliceOp::getOffsets()`. However, these methods only return the
dynamic indexes, so that operations with mixed static and dynamic
offsets are hoisted incorrectly.
This patch replaces the invocations of `ExtractSliceOp::getOffsets()`
and `InsertSliceOp::getOffsets()` with invocations of
`ExtractSliceOp::getMixedOffsets()` and
`InsertSliceOp::getMixedOffsets()`, respectively, in order to take both
static and dynamic indexes into account.
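As a rough illustration, and not the actual pass code, the hoisting
logic can resolve each mixed offset to an index value, materializing
constants for the static entries:
```
// Hypothetical helper illustrating the use of getMixedOffsets(): each
// OpFoldResult is either a Value (dynamic offset) or an Attribute
// (static offset); getOffsets() would only have returned the dynamic
// values.
#include "mlir/Dialect/Arith/IR/Arith.h"
#include "mlir/Dialect/Tensor/IR/Tensor.h"
#include "mlir/IR/Builders.h"
#include "llvm/ADT/SmallVector.h"

static llvm::SmallVector<mlir::Value>
resolveOffsets(mlir::OpBuilder &builder, mlir::tensor::ExtractSliceOp op) {
  llvm::SmallVector<mlir::Value> offsets;

  for (mlir::OpFoldResult ofr : op.getMixedOffsets()) {
    if (auto val = ofr.dyn_cast<mlir::Value>()) {
      // Dynamic offset: already an SSA value.
      offsets.push_back(val);
    } else {
      // Static offset: materialize the attribute as an index constant.
      auto attr = ofr.get<mlir::Attribute>().cast<mlir::IntegerAttr>();
      offsets.push_back(builder.create<mlir::arith::ConstantIndexOp>(
          op.getLoc(), attr.getInt()));
    }
  }

  return offsets;
}
```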
The IR generated by the batching pattern introduces intermediate
tensor values that omit the batched dimensions of batched
operands. This happens unconditionally, leading to the generation of
`tensor.collapse_shape` operations with identical input and output
shapes. However, verification of such an operation fails, since the
verifier assumes that the rank of the resulting tensor is reduced by
at least one.
This commit modifies the check in `flattenTensor`, such that no
flattening operation is generated if the input and output shapes would
be identical.
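A minimal sketch of the adjusted guard, with an illustrative signature
rather than the actual one from the pass:
```
// Sketch (illustrative signature): only emit a tensor.collapse_shape
// if flattening actually changes the shape; a collapse_shape whose
// result shape equals its input shape is rejected by the verifier.
#include "mlir/Dialect/Tensor/IR/Tensor.h"
#include "mlir/Dialect/Utils/ReassociationIndices.h"
#include "mlir/IR/Builders.h"

static mlir::Value
flattenIfNeeded(mlir::OpBuilder &builder, mlir::Location loc,
                mlir::Value tensor, mlir::RankedTensorType flatType,
                llvm::ArrayRef<mlir::ReassociationIndices> reassociation) {
  auto srcType = tensor.getType().cast<mlir::RankedTensorType>();

  // The input already has the requested shape: flattening would be a
  // no-op, so return the value unchanged.
  if (srcType.getShape() == flatType.getShape())
    return tensor;

  return builder.create<mlir::tensor::CollapseShapeOp>(loc, flatType, tensor,
                                                       reassociation);
}
```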
Currently, the cleanup pattern matches sequences of `tensor.extract` /
`tensor.extract_slice`, `tensor.insert` / `tensor.insert_slice` and
`scf.yield` operations that are embedded into perfect loop nests whose
IVs are referenced in quasi-affine expressions with constant
steps. The pattern ensures that the extraction and insertion operations
use the same IVs for indexing, but does not check whether the IVs
appear in the same order.
However, the order in which IVs are used for indexing is crucial when
replacing the operations with `tensor.extract_slice` and
`tensor.insert_slice` operations in order to preserve the shape of the
slices and the order of elements.
For example, when the cleanup pattern is applied to the following IR:
```
%c0 = arith.constant 0 : index
%c1 = arith.constant 1 : index
%c2 = arith.constant 2 : index
%c3 = arith.constant 3 : index
scf.for %i = %c0 to %c3 step %c1 ... {
  scf.for %j = %c0 to %c2 step %c1 ... {
    %v = tensor.extract ... [%i, %j]
    %t = tensor.insert %v into ... [%j, %i]
    scf.yield %t
  }
  ...
}
```
The extracted slice has a shape of 3x2, while the insertion must be
performed with a slice of shape 2x3.
This commit adds an additional check to the cleanup pattern that
ensures that loop IVs are used for indexing in the same order and
appear the same number of times.
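The added check essentially amounts to an element-wise comparison of
the IV sequences used for indexing; a simplified sketch, not the actual
pattern code:
```
// Simplified sketch of the order check: the IVs referenced by the
// extraction indices must match those of the insertion indices
// position by position, which covers both the order and the number of
// occurrences.
#include <algorithm>

#include "mlir/IR/Value.h"
#include "llvm/ADT/ArrayRef.h"

static bool sameIVOrder(llvm::ArrayRef<mlir::Value> extractIVs,
                        llvm::ArrayRef<mlir::Value> insertIVs) {
  return extractIVs.size() == insertIVs.size() &&
         std::equal(extractIVs.begin(), extractIVs.end(), insertIVs.begin());
}
```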
Normalization of indexes in the batching pass currently
unconditionally emits arithmetic operations to shift and scale indexes
regardless of whether these indexes are already normalized. This leads
to unnecessary subtractions of 0 and divisions by 1.
This commit introduces two additional checks to the index
normalization code that prevent arithmetic operations from being
emitted for indexes that do not need to be shifted or scaled.
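Schematically, the guarded emission looks as follows; names and
signature are illustrative, assuming the lower bound and step have
already been determined as constants:
```
// Illustrative sketch: shift only for a non-zero lower bound and scale
// only for a non-unit step, so already-normalized indexes produce no
// extra arithmetic.
#include "mlir/Dialect/Arith/IR/Arith.h"
#include "mlir/IR/Builders.h"

static mlir::Value normalizeIndex(mlir::OpBuilder &builder, mlir::Location loc,
                                  mlir::Value idx, int64_t lowerBound,
                                  int64_t step) {
  mlir::Value result = idx;

  if (lowerBound != 0) {
    mlir::Value lb =
        builder.create<mlir::arith::ConstantIndexOp>(loc, lowerBound);
    result = builder.create<mlir::arith::SubIOp>(loc, result, lb);
  }

  if (step != 1) {
    mlir::Value st = builder.create<mlir::arith::ConstantIndexOp>(loc, step);
    result = builder.create<mlir::arith::DivUIOp>(loc, result, st);
  }

  return result;
}
```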
This adds a new pattern to the batching pass that folds operations on
tensors of constants into new tensors of constants. E.g.,
```
%cst = arith.constant dense<...> : tensor<Nxi9>
%res = scf.for %i = %c0 to %cN step %c1 ... {
  %cst_i9 = tensor.extract %cst[%i]
  %cst_i64 = arith.extui %cst_i9 : i9 to i64
  ...
}
```
becomes:
```
%cst = arith.constant dense<...> : tensor<Nxi64>
%res = scf.for %i = %c0 to %cN step %c1 ... {
  %cst_i64 = tensor.extract %cst[%i]
  ...
}
```
The pattern only applies to static loops, to indexes that are
quasi-affine expressions of a single loop induction variable with a
constant step size across iterations, and to foldable operations that
have a single result.
The current batching pass only supports batching of operations that
have a single batchable operand, that can only be batched in one way
and that operate on scalar values. However, this does not allow for
efficient batching of all arithmetic operations in TFHE, since these
are often applied to pairs of scalar values from tensors, to tensors
and scalars, or to tensors that can be grouped into higher-dimensional
tensors.
This commit introduces three new features for batching:
1. Support for multiple batchable operands
The operation interface for batching now allows for the
specification of multiple batchable operands. This set can be
composed of any subset of an operation's operands, i.e., it is
not limited to sets of operands with contiguous operand indexes.
2. Support for multiple batching variants
To account for multiple kinds of batching, the batching operation
interface `BatchableOpInterface` now supports variants. The
batching pass attempts to batch an operation by trying the
batching variants expressed via the interface in order until it
succeeds.
3. Support for batching of tensor values
Some operations that could be batched already operate on tensor
values. The new batching pass detects those patterns and groups
the batchable tensors' values into higher-dimensional tensors.
The batching pass erroneously assumes that any expression solely
composed of an induction variable has static bounds. This commit adds
a check for the lower bound, the upper bound and the step, verifying
that they are indeed static before attempting to determine their
static values.
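Schematically, the added guard for an `scf.for` loop could look like
this (an illustrative sketch, not the actual pass code):
```
// Illustrative sketch: verify that lower bound, upper bound and step
// are integer constants before querying their static values.
#include "mlir/Dialect/SCF/IR/SCF.h"
#include "mlir/IR/Matchers.h"
#include "llvm/ADT/APInt.h"

static bool hasStaticBounds(mlir::scf::ForOp forOp) {
  llvm::APInt lb, ub, step;
  // m_ConstantInt only matches values defined by integer constants, so
  // a bound given by a block argument or an arbitrary computation is
  // rejected.
  return mlir::matchPattern(forOp.getLowerBound(), mlir::m_ConstantInt(&lb)) &&
         mlir::matchPattern(forOp.getUpperBound(), mlir::m_ConstantInt(&ub)) &&
         mlir::matchPattern(forOp.getStep(), mlir::m_ConstantInt(&step));
}
```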
The batching pass only creates a batched version of a batchable
operation if all of its non-batchable operands are defined outside of
the outermost loop iterating over the values of the batchable operand.
This change also allows operations to be batched if their
non-batchable operands are generated by operations that are pure and
thus hoistable out of the outermost loop.
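Schematically, the relaxed condition could look as follows; this is a
rough sketch that assumes `mlir::isPure()` as the purity test and is
not the actual pass code:
```
// Illustrative sketch: a non-batchable operand is usable if it is
// defined outside the outermost loop, or if its defining op is pure
// and can therefore be hoisted out of the loop before batching.
#include "mlir/Dialect/SCF/IR/SCF.h"
#include "mlir/Interfaces/SideEffectInterfaces.h"

static bool isUsableNonBatchableOperand(mlir::Value operand,
                                        mlir::scf::ForOp outermostLoop) {
  mlir::Operation *def = operand.getDefiningOp();

  // Block arguments: check whether their owning region is inside the loop.
  if (!def)
    return !outermostLoop->isAncestor(operand.getParentBlock()->getParentOp());

  // Values defined outside the outermost loop need no hoisting at all.
  if (!outermostLoop->isAncestor(def))
    return true;

  // Values defined inside the loop are acceptable if their defining op
  // is pure and thus hoistable.
  return mlir::isPure(def);
}
```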
An early test for a batchable operation checks whether the batchable
operand is produced by a `tensor.extract` operation and bails out if
this is not the case. However, the use of `llvm::dyn_cast<T>()` directly
on the defining operation of the batchable operand causes an attempt
to cast a null value for operands that are not produced by an
operation (e.g., block arguments).
Using `llvm::dyn_cast_or_null<T>()` fixes this issue.
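In simplified form (the helper below is purely illustrative):
```
// Illustrative helper: getDefiningOp() returns nullptr for values that
// are not produced by an operation (e.g., block arguments). dyn_cast
// does not accept a null pointer, while dyn_cast_or_null simply yields
// a null op in that case.
#include "mlir/Dialect/Tensor/IR/Tensor.h"
#include "llvm/Support/Casting.h"

static mlir::tensor::ExtractOp
getProducingExtract(mlir::Value batchableOperand) {
  return llvm::dyn_cast_or_null<mlir::tensor::ExtractOp>(
      batchableOperand.getDefiningOp());
}
```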
This commit rebases the compiler onto commit 465ee9bfb26d from
llvm-project with locally maintained patches on top, i.e.:
* 5d8669d669ee: Fix the element alignment (size) for memrefCopy
* 4239163ea337: fix: Do not fold the memref.subview if the offset are
!= 0 and strides != 1
* 72c5decfcc21: remove github stuff from llvm
* 8d0ce8f9eca1: Support arbitrary element types in named operations
via attributes
* 94f64805c38c: Copy attributes of scf.for on bufferization and make
it an allocation hoisting barrier
Main upstream changes from llvm-project that required modification of
concretecompiler:
* Switch to C++17
* Various changes in the interfaces for linalg named operations
* Transition from `llvm::Optional` to `std::optional`
* Use of enums instead of string values for iterator types in linalg
* Changed default naming convention of getter methods in
ODS-generated operation classes from `some_value()` to
`getSomeValue()`
* Renaming of Arithmetic dialect to Arith
* Refactoring of side effect interfaces (i.e., renaming from
`NoSideEffect` to `Pure`)
* Re-design of the data flow analysis framework
* Refactoring of build targets for Python bindings
* Refactoring of array attributes with integer values
* Renaming of `linalg.init_tensor` to `tensor.empty`
* Emission of `linalg.map` operations in bufferization of the Tensor
dialect requiring another linalg conversion pass and registration
of the bufferization op interfaces for linalg operations
* Refactoring of the one-shot bufferizer
* Necessity to run the expand-strided-metadata, affine-to-std and
  finalize-memref-to-llvm passes before conversion to the LLVM dialect
* Renaming of `BlockAndValueMapping` to `IRMapping`
* Changes in the build function of `LLVM::CallOp`
* Refactoring of the construction of `llvm::ArrayRef` and
`llvm::MutableArrayRef` (direct invocation of constructor instead
of builder functions for some cases)
* New naming conventions for generated SSA values requiring rewrite
of some check tests
* Refactoring of `mlir::LLVM::lookupOrCreateMallocFn()`
* Interface changes in generated type parsers
* New dependencies on mlir_float16_utils and MLIRSparseTensorRuntime
  for the runtime
* Overhaul of MLIR-C, deleting `mlir-c/Registration.h`
* Deletion of library MLIRLinalgToSPIRV
* Deletion of library MLIRLinalgAnalysis
* Deletion of library MLIRMemRefUtils
* Deletion of library MLIRQuantTransforms
* Deletion of library MLIRVectorToROCDL