clearly differentiate between standard fp8e4 (which we'll stop
supporting on SM <= 89 because conversions are too expensive if we want
to handle the single NaN and clipping properly) and a software-optimized
fp8e4b15 format.
* [Dot] Support slicing in dot operands
This PR enables support of sliced tensors of dot operands.
One consequence is that we can re-enable the prefetch and software-pipelining optimizations.
* refactor to prepare for upcoming changes for swizzling support
* comment fixes for code review
Triton has supported different codegen backends for different devices,
so enabling the unit tests to support different devices also makes
sense. Otherwise, a third-party backend might have to intrusively
change the Triton test cases.
A third-party backend might install its Python package into the
`triton/third_party` package during the build process. But `build_py`
could be executed before `build_ext`, in which case `build_py` would
only copy the `packages` defined in `setup.py` without the third-party
packages, since the third-party backend (built by `build_ext`) would
not have been built yet. Therefore, this PR refines the build order a
little to ensure `build_ext` always happens before `build_py`.
It seems that patch #1773 introduced a bug, since the `lhs` object
doesn't necessarily have a `__name__` attribute.
I'm hitting this if I modify the matmul tutorial
(gflegar/triton@442b00f4d):
```
File "/home/gflegar/triton/python/triton/runtime/jit.py", line 74, in visit_Attribute
if lhs is None or lhs.__name__ == "triton":
AttributeError: 'Tensor' object has no attribute '__name__'
```
I think the idea of that patch was to remove the need to import triton
by replacing `lhs is triton` with `lhs.__name__ == "triton"`. This patch
should have the same behavior as the original code, but without failing
if `lhs` doesn't have a `__name__` attribute.
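The fix can be sketched with `getattr` (an illustrative stand-alone version, not the exact patch):

```python
import types

# Hypothetical stand-in for an object without a __name__ attribute,
# like the Tensor instance in the traceback above.
class Tensor:
    pass

def is_none_or_triton(lhs):
    # Mirrors `lhs is None or lhs.__name__ == "triton"`, but tolerates
    # objects (like Tensor) that lack a __name__ attribute: getattr
    # with a default never raises AttributeError.
    return lhs is None or getattr(lhs, "__name__", None) == "triton"
```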
This is a very quick change that allows the configs' pre-hooks to see
the values in the config itself. This is useful if we'd like to allocate
an intermediate tensor whose shape depends on the tile size.
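Sketched in plain Python (the hook name and kwargs keys are illustrative, not Triton's exact autotuner API):

```python
# A pre-hook that can now read the config's own values: here it derives
# the shape of a scratch buffer from the tile size this config selects.
def pre_hook(kwargs):
    block_m = kwargs["BLOCK_M"]  # tile size from the config itself
    # Describe (or allocate) an intermediate buffer sized by the tile.
    return {"scratch_shape": (block_m, block_m)}
```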
This is a revival of @gaxler's initial ahead-of-time compiler proposal.
Code was simplified and some constraints were relaxed (i.e., we now
execute the entire file provided vs just the kernel AST) to promote
maintainability. A basic unit test was added, though it does not test
specialization right now.
Co-authored-by: Gregory Axler, thexler <g.axler@gmail.com>
Example:
```
if static_a == 0 and static_b == 1:
    ...
```
The return value of `static_a == 0` should be `constexpr(True)` rather
than `True`; otherwise the plain bool object (`True`/`False`) doesn't
have the `logical_and` method.
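A minimal sketch (a toy class, not Triton's actual `constexpr`) of why the wrapper return type matters:

```python
class constexpr:
    """Toy stand-in for triton.language.constexpr."""
    def __init__(self, value):
        self.value = value
    def __eq__(self, other):
        # Returning a constexpr (not a bare bool) keeps methods such as
        # logical_and available on the comparison result.
        return constexpr(self.value == other.value)
    def logical_and(self, other):
        return constexpr(self.value and other.value)

static_a, static_b = constexpr(0), constexpr(1)
cond = (static_a == constexpr(0)).logical_and(static_b == constexpr(1))
assert cond.value is True  # a plain bool would have no logical_and
```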
* [MFMA] Activated Fused Attention Forward Pass
This patch contains the following changes:
1) make_range operator now works with MFMA layout.
2) Reduce operation is forced to run in block layout:
inputs converted to block layouts, outputs returned to MFMA layout
* Use a simple module walk instead of the pattern rewriter.
* Remove the pattern rewriter header.
* Enable basic reduce algorithm for MFMA layout
* Add TODO comment for fused attention backward pass
* Fix bug in fast codegen algorithm for reduce op
* Fix input type bug
* Increase block size to 128 since the out-of-memory issue is not seen on MI210
* Fix block_size error
* Add mfma support in DecomposeDotOperand pattern.
`tl.cat(tensor<64>, tensor<64>) -> tensor<128>`: because `cat`
concatenates elements into a single thread, with 128 threads each
thread should own at least 2 elements.
With this PR, we also disable rematerialization of the cat op in some cases.
Add a configurable parameter for the number of threads per warp for
other GPUs, such as Intel GPUs.
It defaults to 32 and does not change the code logic on CUDA/AMD GPUs.
Note: the Intel GPU GenX ISA is explicit SIMD and can support a variable
number of thread lanes per HW execution unit.
- Applied to `loadA()` the same fix as 2c88ed6aab9ace22ccde1f0e443a1579727ee501.
- Minor cleanup of `mfmaLayout.getWarpsPerCTA()` usage.
Partially fixes ROCmSoftwarePlatform/frameworks-internal#4545
Support the following three cases:
1. Operands of `load` depend on induction variables before `load`s.
2. Mixed use of induction variables and offset to update the `ptr`.
3. Cross iteration (>1) dependency values.
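A plain-Python analogue of cases 1 and 2 (an illustrative loop shape, not actual Triton IR): the load's operand depends on the induction variable, and the pointer update mixes the induction variable with a fixed offset.

```python
def strided_sum(data, base, stride, n):
    """Toy loop shaped like the pipelined kernels described above."""
    total = 0
    ptr = base
    for i in range(n):
        total += data[ptr + i]  # load operand depends on induction var i
        ptr += stride           # ptr also updated with a fixed offset
    return total
```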
Also catch out-of-bounds indices at construction and throw a proper error
in the frontend.
Finally, let's make the IR a bit prettier:
```
%0 = tt.get_program_id {axis = 0 : i32} : i32
```
becomes:
```
%0 = tt.get_program_id x : i32
```
Fixes #1718
When comparing to the expected slices, using the `==` operator will
dispatch to the components of the slice. If the user writes `a[10:20]`,
these components are `triton.constexpr` instances, and the `__eq__`
operator is implemented as `return constexpr(self.value == other.value)`.
At this point the access to `.value` on the provided `None` raises an
exception that isn't very friendly to the user.
I am not sure if the implementation of `constexpr` should be hardened
instead?
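If hardening were chosen, one illustrative approach (again a toy class, not Triton's actual implementation) is to unwrap `.value` only when the other operand is itself a `constexpr`:

```python
class constexpr:
    """Toy stand-in showing one possible hardening of __eq__."""
    def __init__(self, value):
        self.value = value
    def __eq__(self, other):
        # Unwrap .value only when the other operand is a constexpr, so
        # comparing against the bare None from a slice no longer raises.
        other = other.value if isinstance(other, constexpr) else other
        return constexpr(self.value == other)

start = constexpr(10)
assert (start == None).value is False      # no AttributeError anymore
assert (start == constexpr(10)).value is True
```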
Co-authored-by: Philippe Tillet <phil@openai.com>