By default Triton generates MLIR with f32 result of the tt.dot operation on f16
typed operands. So we have "tt.dot(f16,f16,f32)->f32" types in .ttgir. But
LLVM FMA instruction requires for the same type for all three operands. So first
two operands are implicitly casted f16->f32 as
"unrealized_conversion_cast struct{f16,f16,...}->struct{f32,f32}".
The change fixed incorrect implicit cast generation.
For the int8 typed operands result operand is also casted after performing dot.
As the next step to improve FMA based dot operation FMA on f16 and int8 target
specific intrinsics (e.g. fma(f16,f16,f16)->f16) could be used, perhaps as an
option.
Fix issue https://github.com/openai/triton/issues/244
Check `end` is greater than `start`.
Check if the range can fit in `int32`.
Check the number of elements less than or equal to
`TRITON_MAX_TENSOR_NUMEL = 131072`.
---------
Co-authored-by: Philippe Tillet <phil@openai.com>
This pull request addresses a crash that occurs when casting to a
tl.constexpr type in the frontend.
More info and repro code available in:
https://github.com/openai/triton/issues/1221
- **temporarily commenting assertion in `MemBar.cpp`. We need to fix
this! but for now the following patches will unblock a number of
users.**
- Fixed frontend codegen issue for If / For / While. Emit an error when
replaced values' type mismatch.
- Added "top level" codepath for if statements, which allows users to
write patterns to exit early from kernels (e.g., `if cond1: if cond2:
return else: ...`). Added associated codegen in TritonToTritonGPUPass
- Added basic control flow tests
- Pipeline pass is no longer activated when memory accesses can't be
vectorized
- Added missing magic methods to `constexpr`
- Fixed issue in random.py: bitcast some values to uint when they need
to be.
- Added support for `Not`
- Fixed nondeterministic compilation issue
Using range(len(...)) is not pythonic.
Python does not have not index-based loops. Instead, it uses collection
iterators. Python has a built-in method enumerate which adds a counter
to an iterable. Using this, you can access the counter and the value
from the iterable at the same time. It is therefore recommended to
replace range(len(...)) with enumerate(...).
for ex
5bcf60a5c0/python/triton/language/extern.py (L68)f62d556fff/python/triton/language/extern.py (L68)
Signed-off-by: GitHub <noreply@github.com>
Co-authored-by: Keren Zhou <kerenzhou@openai.com>
- fixed scalar atomic_rmw implementation for fmin/fmax for f32
- fixed tensor atomic_rmw
- added atomic_cas implementation.
TODO: fix atomic_rmw for f16, implement fmin/fmax for f32 with
native instructions (asm inline in case of LLVM 14) instead of
tweak used as for NV.
Most notably, this PR:
- changes the traits (and assembly format) of addptr so it can handle offsets that have arbitrary integer width.
- adds support for `cat`
- Fixed bugs on layout conversions for int1 data (we should use int8
internally for int1 data to prevent llvm from using vec<i1> which has
different semantics)
- Fixed semantics of some casts to bool in the frontend