This pull request addresses a crash that occurs when casting to a
tl.constexpr type in the frontend.
More info and repro code available in:
https://github.com/openai/triton/issues/1221
Make CMake happier: it doesn't like multiple target_link_libraries
definitions for the same name.
Use find_package instead on Windows for dlfcn-win32.
Set LLVM_SYS_PATH on Windows for python setup.
Debug build is almost working; an AlwaysCreate error is still thrown.
Minor bug: AutoTuner currently throws the following error when certain
configs go OutOfResources (e.g. the matmul example when testing on GPUs
with less shared memory).
This will remove unused private functions after we've inlined
everything. That's important because TritonToTritonGPU doesn't know how
to lower tensor arguments.
- Add text description and equations for the tutorial.
- Improve the code readability by changing variable names to align them
with the equation. The actual code logic is not changed.
This is a follow-up of #510. Let me know if a preview HTML is helpful
for the review, I can add a link to that too.
- cancels CI runs in progress when a PR is updated
- atomics tests now use small int values that can be represented exactly
- replaced some old-style formatting with f-strings
- Dependent CUDA files (ptxas, cuda.h, libdevice.bc.10) are now packaged in
`triton/third_party/cuda`. `ptxas` is downloaded from conda repo at
install time.
- Can now be built with an old glibc (such as the one used by manylinux2014)
The literal syntax can give minor performance gains compared to function
calls when creating a dict, list or tuple. With the call form, the name
dict must be looked up in the global scope in case it has been rebound.
The same goes for the other two types, list() and tuple().
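The difference can be seen in the bytecode. The following is a minimal sketch, not code from this change; the function names `with_call`, `with_literal` and `global_names_loaded` are made up for illustration.

```python
import dis


def with_call():
    # Each constructor name has to be resolved at call time.
    return dict(), list(), tuple()


def with_literal():
    # Literals compile to dedicated build/constant instructions.
    return {}, [], ()


def global_names_loaded(func):
    """Return the sorted names that func loads through LOAD_GLOBAL."""
    return sorted(
        {
            ins.argval
            for ins in dis.get_instructions(func)
            if ins.opname == "LOAD_GLOBAL"
        }
    )


if __name__ == "__main__":
    print(global_names_loaded(with_call))
    print(global_names_loaded(with_literal))
```

Both functions return equal values; only the call form needs name lookups at run time.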
Signed-off-by: nishantsikarwar <nsikarwar@ch.iitr.ac.in>
Co-authored-by: Philippe Tillet <phil@openai.com>
The previous https://github.com/openai/triton/pull/1113 forgot to consider
that a node may have multiple parents; visiting an instruction before
any of its parents violates the semantics of topological sort.
The fixed implementation exhaustively adds all operations to a
candidate subgraph and moves an operation to the "ready" queue once all
of its operands have been visited.
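The ready-queue idea can be sketched in plain Python as below. This is an illustration of the general technique (Kahn-style topological sort), not the C++ code in the pass; `topological_order` and the `operands` mapping are hypothetical names, and every operand is assumed to appear as a key in the mapping.

```python
from collections import deque


def topological_order(operands):
    """Order operations so that each one comes after all of its operands.

    operands maps an operation name to the list of operations it consumes.
    """
    # Number of distinct operands of each operation not yet visited.
    pending = {op: len(set(deps)) for op, deps in operands.items()}

    # Reverse edges: which operations use a given operation.
    users = {op: [] for op in operands}
    for op, deps in operands.items():
        for dep in set(deps):
            users[dep].append(op)

    # Operations without operands are ready immediately.
    ready = deque(op for op, count in pending.items() if count == 0)
    order = []
    while ready:
        op = ready.popleft()
        order.append(op)
        for user in users[op]:
            pending[user] -= 1
            # A user becomes ready only once ALL of its operands are visited.
            if pending[user] == 0:
                ready.append(user)

    if len(order) != len(operands):
        raise ValueError("cycle detected: no valid topological order")
    return order


if __name__ == "__main__":
    # "d" has two parents and must come after both "b" and "c".
    graph = {"a": [], "b": ["a"], "c": ["a"], "d": ["b", "c"]}
    print(topological_order(graph))
```

The per-operation counter is what handles multiple parents: an operation is queued when its last operand is visited, not its first.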
- **Temporarily commented out an assertion in `MemBar.cpp`. We need to fix
this, but for now the following patches will unblock a number of
users.**
- Fixed frontend codegen issue for If / For / While. Emit an error when
the types of replaced values mismatch.
- Added "top level" codepath for if statements, which allows users to
write patterns to exit early from kernels (e.g., `if cond1: if cond2:
return else: ...`). Added associated codegen in TritonToTritonGPUPass
- Added basic control flow tests
- Pipeline pass is no longer activated when memory accesses can't be
vectorized
- Added missing magic methods to `constexpr`
- Fixed issue in random.py: bitcast some values to uint when they need
to be.
- Added support for `Not`
- Fixed nondeterministic compilation issue
This fix enables support for sm_90 (otherwise it will crash).
Logs like
> 'sm_90' is not a recognized processor for this target (ignoring
processor)
can be ignored and should disappear once the LLVM NVPTX backend is
updated.
Using range(len(...)) is not pythonic.
Python does not have index-based loops. Instead, it uses collection
iterators. Python has a built-in function, enumerate, which adds a counter
to an iterable. Using this, you can access the counter and the value
from the iterable at the same time. It is therefore recommended to
replace range(len(...)) with enumerate(...).
For example:
5bcf60a5c0/python/triton/language/extern.py (L68)
f62d556fff/python/triton/language/extern.py (L68)
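A small generic illustration of the replacement follows. It is not taken from extern.py, and the function names and sample list are made up.

```python
def indexed_with_range(items):
    # Index-based style: look each element up again through its index.
    pairs = []
    for i in range(len(items)):
        pairs.append((i, items[i]))
    return pairs


def indexed_with_enumerate(items):
    # enumerate yields the counter and the value together.
    pairs = []
    for i, item in enumerate(items):
        pairs.append((i, item))
    return pairs


if __name__ == "__main__":
    names = ["add", "mul", "sub"]
    print(indexed_with_range(names))
    print(indexed_with_enumerate(names))
```

Both functions produce the same pairs. The enumerate version avoids the extra indexing and also works on iterables that do not support len().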
Signed-off-by: GitHub <noreply@github.com>
Co-authored-by: Keren Zhou <kerenzhou@openai.com>
Otherwise it fails with
```
File "setup.py", line 147, in build_extension
"-DLLVM_EXTERNAL_LIT=" + lit_dir,
TypeError: can only concatenate str (not "NoneType") to str
```
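The traceback comes from concatenating a str with None. As a minimal sketch of one way to guard against that, assuming the directory may be unset (the helper `lit_flag` is hypothetical and is not necessarily how setup.py was changed):

```python
def lit_flag(lit_dir):
    """Hypothetical helper: build the CMake flag only when lit_dir is set."""
    if lit_dir is None:
        # No lit directory known: emit no flag instead of failing.
        return []
    return ["-DLLVM_EXTERNAL_LIT=" + lit_dir]


def unguarded_flag(lit_dir):
    # Mirrors the failing expression from the traceback.
    return "-DLLVM_EXTERNAL_LIT=" + lit_dir


if __name__ == "__main__":
    print(lit_flag(None))
    print(lit_flag("/path/to/lit"))
    try:
        unguarded_flag(None)
    except TypeError as err:
        print("TypeError:", err)
```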
Signed-off-by: Edward Z. Yang <ezyang@meta.com>