Commit Graph

2386 Commits

Author SHA1 Message Date
Thomas Raoux
218492cd65 [BACKEND] Prevent double rounding when doing f32 -> fp8 (#2583) 2023-11-02 05:32:16 +00:00
Dongdong Li
d0098da7b1 [BACKEND] Add error reporting to report non-kernel-argument (#2552)
Co-authored-by: dongdongl <dongdongl@nvidia.com>
2023-11-01 20:22:10 -04:00
Vedant Roy
702cde0d6f [FRONTEND] Implement ternary operator for dynamic values (#2560) 2023-11-01 20:21:32 -04:00
Alexander Efimov
74c5fd46ee [RemoveLayoutConversions] Fix reduce failed infer type error (#377)
* [RemoveLayoutConversions] Fix reduce failed infer type error

This PR fixes the layout propagation algorithm in the RemoveLayoutConversions pass.
In some cases during the rewriteSlice process, a reduce operation with multiple outputs
has only one output layout rewritten, which breaks the assumption that both outputs should have the same layout.

This change is a minimal part of the https://github.com/openai/triton/pull/2331 change, plus a
small lit test for regression testing.

* fix combine test

* Fix issue with incorrect inference layout of make_range output result
2023-11-01 13:31:13 -05:00
Alexander Efimov
d62a3ffdbe [RemoveLayoutConversions] Remove PatternSharedInfo structure (#378)
This structure is no longer used after the massive refactoring
of the RemoveLayoutConversions pass in the September IFU.
2023-11-01 12:57:35 -05:00
Chenggang Zhao
e7fdfd76fb [FRONTEND] Add value restoration for autotuner (#2549)
For in-place kernels, neither `reset_to_zero` nor `Config.prehook`
provided in the autotuner can restore the values changed during the
tuning process, so I propose a recovery mechanism here.
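
The recovery idea can be sketched in plain Python. This is a minimal illustration, not the code from the PR: `tune_with_restore`, `restore_idx`, and `scale_in_place` are hypothetical names, and Python lists stand in for device buffers.

```python
import copy
import time


def tune_with_restore(kernel, configs, args, restore_idx):
    """Time each config, restoring the in-place arguments after every trial."""
    # Snapshot the arguments that the kernel mutates before any trial runs.
    saved = {i: copy.deepcopy(args[i]) for i in restore_idx}
    timings = []
    for cfg in configs:
        start = time.perf_counter()
        kernel(*args, **cfg)
        timings.append(time.perf_counter() - start)
        # Undo the side effects so the next trial (and the real launch)
        # sees the original input values.
        for i, value in saved.items():
            args[i][:] = copy.deepcopy(value)
    best = min(range(len(configs)), key=timings.__getitem__)
    return configs[best]


def scale_in_place(buf, factor, block=1):
    """Toy in-place 'kernel'; `block` is a dummy tuning parameter."""
    for i in range(len(buf)):
        buf[i] *= factor


buf = [1, 2, 3]
best = tune_with_restore(scale_in_place, [{"block": 1}, {"block": 2}], [buf, 2], [0])
```

Without the restore step, `buf` would be scaled once per trial, so the values seen by the final launch would depend on how many configs were tried.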

---------

Co-authored-by: Chenggang Zhao <chenggangz@deepseek.com>
Co-authored-by: Keren Zhou <kerenzhou@openai.com>
2023-10-31 21:37:44 -04:00
Zahi Moudallal
3650213218 [OPTIMIZER] Thread local reduction optimization (#2542)
Co-authored-by: Phil Tillet <phil@openai.com>
2023-10-31 16:13:36 -07:00
Justin Lebar
258399c114 Enable ruff linter instead of flake8 (#2574)
[FRONTEND] Enable ruff linter instead of flake8.
    
This fixes a few issues automatically, and also flagged two issues to
fix manually in test_core.py: We had two duplicate function names!  One
of these function bodies was a duplicate, so I deleted it.  The other
function body was not a duplicate, so I gave it a new name.

AIUI all of these errors should have been picked up by flake8.  I'm
confused why it wasn't working.  Anyway this is working, and it's faster
than flake8, so it seems like an improvement in all dimensions.
2023-10-31 21:28:24 +00:00
Goran Flegar
601b95cdbb [DEPS] bump LLVM version to llvm/llvm-project@49af650 (#2570)
Co-authored-by: Ashay Rane <ashay@users.noreply.github.com>
Co-authored-by: khasanovaa <khasanovaaliya19@gmail.com>
2023-10-31 12:06:25 -07:00
Zahi Moudallal
943330790a [FRONTEND] add do_not_specialize property back to JITFunction (#2573) 2023-10-31 12:02:45 -07:00
Nhat Nguyen
0cf3a67f04 [BUILD] Disable W503 in pyproject.toml (#2575)
This PR https://github.com/openai/triton/pull/2555 disabled `W503`
(means line breaks can now occur before a binary operator).

Surprisingly, the change neither took effect nor required any style
changes in the `triton` main `pre-commit` stage. But our `triton-shared`
[pipeline
run](https://github.com/microsoft/triton-shared/actions/runs/6710459100/job/18236352821)
(see `Check pre-commit` stage) picked this up correctly and complained
about formatting issues. I'm not entirely sure what could be the cause
for such difference, but if we also disable `W503` in `pyproject.toml`
then the rule is picked up correctly.
2023-10-31 11:57:02 -07:00
daemondzh
96cf8f979a [OPTIMIZER][BACKEND] Fix an issue in RewriteTensorPtr pass to enable TMA with 8-bit types (#2545)
Co-authored-by: Zhicheng Xiong <zhichengx@ipp2-0148.nvidia.com>
Co-authored-by: Zhicheng Xiong <zhichengx@dc7-sim-e12-203.nvidia.com>
Co-authored-by: Zhicheng Xiong <zhichengx@ipp2-1604.nvidia.com>
Co-authored-by: Zhicheng Xiong <zhichengx@ipp2-1608.nvidia.com>
Co-authored-by: goostavz <109190422+goostavz@users.noreply.github.com>
2023-10-31 02:28:27 +00:00
Justin Lebar
29a9245559 [BUILD] use clang+lld in CI builds. (#2564)
Use clang+lld in CI builds.

This is significantly faster.
2023-10-30 19:19:27 -07:00
Chris Jones
2398b82f18 [FRONTEND][BACKEND] Add memory synchronization scope parameter to atomic ops. (#2562)
Co-authored-by: Keren Zhou <kerenzhou@openai.com>
2023-10-30 19:18:27 -07:00
Keren Zhou
70fca00b67 [BACKEND] Fix device_print without arguments (#2566) 2023-10-30 20:04:44 -04:00
Keren Zhou
492886fcde [FRONTEND] Add reverse eq and ne (#2563) 2023-10-30 16:56:43 -04:00
Justin Lebar
12f906287f [FRONTEND] Refactor jit.py. (#2556)
[FRONTEND] Refactor jit.py.

The goal is to simplify the code and make it more flexible before we
change the kernel launch syntax to
`kernel[grid, compiler_flags(...)](...)`.

The main changes here are:

 - Get rid of the eval'ed code in make_launcher.  We can do everything
   using bind().
 - Add KernelParam and KernelArg classes, letting us get rid of the
   parallel arrays/dicts indexed by parameter index.
 - Get rid of duplicated kernel launch code in the cache-hit/cache-miss
   branches.
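
A rough sketch of the `bind()`-based approach, using only the standard library. The class names `KernelParam` and `KernelArg` come from the message above, but their fields and the helper `bind_kernel_args` are assumptions made for illustration and may not match the real jit.py.

```python
import inspect
from dataclasses import dataclass


@dataclass(frozen=True)
class KernelParam:
    # Hypothetical fields: position, name, and annotation of one parameter.
    index: int
    name: str
    annotation: object


@dataclass(frozen=True)
class KernelArg:
    # Pairs a parameter description with the value passed at launch time.
    param: KernelParam
    value: object


def bind_kernel_args(fn, *args, **kwargs):
    """Map call arguments to parameters without eval'ing generated code."""
    sig = inspect.signature(fn)
    bound = sig.bind(*args, **kwargs)
    bound.apply_defaults()  # fill in parameters the caller omitted
    params = [
        KernelParam(i, name, p.annotation)
        for i, (name, p) in enumerate(sig.parameters.items())
    ]
    return [KernelArg(p, bound.arguments[p.name]) for p in params]


def add_kernel(x_ptr, y_ptr, n, BLOCK=128):
    pass


kernel_args = bind_kernel_args(add_kernel, "x", "y", n=1024)
```

Keeping each value next to its parameter description is what removes the need for parallel arrays or dicts indexed by parameter position.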
2023-10-30 13:14:51 -07:00
Justin Lebar
f88b01f558 Apply ruff pre-commit to python/triton/runtime. (#2558)
We're in the process of incrementally converting from autopep8 + flake8
+ isort to ruff, on a directory-by-directory basis.

The motivation to switch away from autopep8 is that I can't get it to
wrap long lines, even with -aaa.  This seems to be a known problem,
https://github.com/hhatto/autopep8/issues/497.

See more details about alternatives tried in
https://github.com/openai/triton/pull/2557.
2023-10-30 11:06:44 -07:00
Justin Lebar
f7be5f8fa5 [BUILD] Disable W503 in flake8 config. (#2555)
It seems that by default, flake8 warns on both "linebreak occurred
before binary operator" (W503) and "linebreak occurred *after* binary
operator" (W504).  You...kind of have to pick one of these.  :)

According to the docs, W503 is deprecated, so we disable that one.
https://www.flake8rules.com/rules/W503.html
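
For illustration, the two styles the rules disagree about look like this (the variable names are made up for the example):

```python
price, tax, shipping = 100, 8, 5

# Line break before the operator: reported by W503, accepted by W504.
total_before = (price
                + tax
                + shipping)

# Line break after the operator: reported by W504, accepted by W503.
total_after = (price +
               tax +
               shipping)
```

Both expressions compute the same value; with W503 disabled, only the second form is still reported.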
2023-10-30 09:28:34 -07:00
Justin Lebar
1ea5c0f675 [DOCS] Add instrs for setting up C++ intellisense. (#2554) 2023-10-27 12:03:09 -07:00
Lixun Zhang
9517d4c256 Tweak matmul tutorial on MI2xx GPU (#376)
* Tweak matmul tutorial on MI2xx GPU

* Add config for 9728

---------

Co-authored-by: Shucai Xiao <shucai.xiao@amd.com>
2023-10-27 10:40:11 -05:00
Someone
cde42e6221 [BUILD] make cuda tools vendoring optional (#2546) 2023-10-26 23:16:41 -07:00
jayfurmanek
26debc92a0 Merge pull request #363 from ROCmSoftwarePlatform/post_ifu_rebase_empty_kernel_works
Third Party Backend Merge
2023-10-26 16:28:00 -05:00
Michael Melesse
1fd9b40f2f Works as StandAlone and Backend and also Perf is Good
This is a combination of 4 commits.

Works as StandAlone and Backend

Works as StandAlone and Backend

This is a combination of 13 commits.

Works StandAlone and as Backend

This is a combination of 7 commits.

backend set default dir with flag

move bitcode to backend dir

copy backend

save

empty test work in backendmode

enable backend mode when copying to upstream

clean up

fix failure

minimize diff

add skip function

fix bug with corrupted dwarf exp

match num_warps

fix multi threaded test issue

move bitcode file out of lib

move backend to python/triton/third_party/hip

move libhsa

backend works again

restart ci

clean upstream location first before copy

match scripts

fix new error

memoize backend stuff

fix bug
2023-10-26 14:27:18 -05:00
Dongdong Li
0469d5fccd [OPTIMIZER] Remove extra wgmma_wait_group in flash attention (#2399)
Co-authored-by: dongdongl <dongdongl@nvidia.com>
2023-10-26 16:35:36 +00:00
zhu jianjiang
cfae7e2a25 [BACKEND] Fix matmul downcast path (#2528)
For https://github.com/openai/triton/issues/2523, add regression test

---------

Co-authored-by: Jokeren <robinho364@gmail.com>
Co-authored-by: Keren Zhou <kerenzhou@openai.com>
2023-10-26 09:43:49 -04:00
Michael Melesse
09ba348f87 [ROCM] Core Functionality for AMD (#1983)
* this pr adds a third party backend for triton that works on AMD
* this exposes a lot of the work that has been done in our
[fork](https://github.com/ROCmSoftwarePlatform/triton)
* most unit tests on `test_core.py` pass
* it skips some unit tests for various reasons
* we plan to follow up with more prs improving Functionality and
Performance in the future

---------

Co-authored-by: Philippe Tillet <phil@openai.com>
2023-10-26 08:36:49 -05:00
Michael Melesse
833c9b985f Backend Dir with Empty Kernel Working
This is a combination of 9 commits.

Empty Kernel Works rebase

minimize diff: add libs

move to backend dir

match python

add includes

move everything to backend dir

match include and lib

create a backend build mode

simplify backend
2023-10-26 08:36:49 -05:00
Phil Tillet
07baf3a102 [CI] move llvm-build.yml to the top of workflow directory hierarchy 2023-10-26 02:08:56 -07:00
ian Bearman
3f95a6fb81 [DOCS] refactor meetup subdir; add triton-shared introduction slides
Adding `Introduction to Triton-Shared.pptx` as presented at the Oct. 25
Triton community meeting.

---------

Co-authored-by: Phil Tillet <phil@openai.com>
2023-10-26 02:02:02 -07:00
Eikan Wang
40766928f1 Add the update for Intel XPU Backend. (#2551)
- Intel XPU Backend Status Update
- GEMM Lowering for Intel GPU
- Questions to the Triton community
2023-10-26 08:33:37 +00:00
kshama-msft
746d411ead [DOCS] create 10-25-2023.md (#2548) 2023-10-26 01:32:09 -07:00
Keren Zhou
bc72294507 [CI] Reenable torchinductor workflow (#2527) 2023-10-25 23:44:02 -07:00
runseny
4c816c2f59 [OPS] enable flash_attention_v2 TMA (#2544) 2023-10-25 23:31:17 -07:00
Shucai Xiao
2729ae6c6f use different int8 mfma instructions on different GPUs. (#368)
* changes support to choose different int8 instructions

* rename an instruction name

Co-authored-by: Aleksandr Efimov <efimov.alexander@gmail.com>
2023-10-25 19:12:21 -05:00
Hongtao Yu
2323adb387 [BACKEND] Handle AtomicCASOp in GPU IR conversion (#2514)
Addressing https://github.com/openai/triton/issues/2011

Co-authored-by: Philippe Tillet <phil@openai.com>
Co-authored-by: Keren Zhou <kerenzhou@openai.com>
2023-10-25 15:20:07 -04:00
Alexander Efimov
5a86b46bb1 [MFMA] FP8 and BF8 support (#355)
* [MFMA] FP8 and BF8 support

This PR adds support for fp8 and bf8 in the AccelerateMatmul pass and
introduces generation of float8 mfma instructions in the ttg to llvm conversion.

* add tests

* fix tests

* review fix: fix variable naming and dot operand promotion.

* review comments fixes

---------

Co-authored-by: Shucai Xiao <shucai.xiao@amd.com>
2023-10-25 13:27:10 -05:00
Shucai Xiao
8547694665 set correct arch info for unit test (#370)
* set correct arch info for unit test

* address review comments
2023-10-25 13:06:45 -05:00
oplavsic
715a589ce3 [FA fwd D=128] Reduce LDS usage in epilogue (#340)
* rebase onto improve_fwd_fa

* Fixed a leftover from rebase

* rebase onto improve_fa_fwd

* Reduce tuning space

* Disable bwd with D=128

* Add test for d=128

* Fix an issue with get_best_config when there is only one config

* Added better configs for d=128

* Fix typos

---------

Co-authored-by: Lixun Zhang <lixun.zhang@amd.com>
2023-10-25 12:10:34 -05:00
Adnan Akhundov
7d55968fee [BACKEND] Dedup elementwise in LLVM IR based on constancy (#2512)
### Summary

When Triton GPU IR is lowered into LLVM IR, we can make use of the
constancy information about the result of the elementwise ops to
deduplicate otherwise redundant computation. That is the contribution of
this PR: the constancy is checked and, if possible, some of the values
in LLVM IR are reused multiple times instead of computing equal values
separately.

The change is beneficial for the PyTorch 2 / TorchInductor-generated
Triton code, as the leftmost sub-indices extracted from the flat index
by div / mod operations can be equal, given sufficiently large 2^n
factor in the rightmost dimension(s). This makes the
computation resulting in those sub-indices redundant. Consequently,
under the necessary constancy conditions, the redundant indexing
arithmetic can be deduplicated. We observe up to 29% decrease in the
latency of some of our jagged tensor kernels.
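
The effect can be sketched in plain Python. This is only a model of the idea, not the LLVM lowering from the PR: the function names are hypothetical, and the sketch assumes `start` is a multiple of `constancy` and `constancy` divides `inner`, so every group of `constancy` consecutive flat indices shares one quotient.

```python
def row_indices_naive(start, count, inner):
    """Compute flat_index // inner separately for every element."""
    return [(start + i) // inner for i in range(count)]


def row_indices_dedup(start, count, inner, constancy):
    """Compute the quotient once per group of `constancy` elements.

    Assumes `start % constancy == 0` and `inner % constancy == 0`.
    Returns the indices and the number of divisions performed.
    """
    out = []
    divisions = 0
    for base in range(0, count, constancy):
        row = (start + base) // inner  # one division for the whole group
        divisions += 1
        out.extend([row] * min(constancy, count - base))
    return out, divisions


naive = row_indices_naive(0, 32, 16)
dedup, divisions = row_indices_dedup(0, 32, 16, 8)
```

Here the naive version performs 32 divisions while the deduplicated one performs 4 and reuses each result 8 times.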
2023-10-25 11:25:29 -04:00
Justin Lebar
e70e11e834 [BACKEND] Improve printf. (#2532)
[BACKEND] Improve printf.

Previously, we printed all of a GPU thread's values in a single printf()
call, and this, plus the user-specified prefix, was all we printed.

This caused a few problems.

 - nvptx printf can only handle 32 arguments; if you pass more than
   that, it prints garbage.  So if a thread had more than 32 values, you
   couldn't print them, issue #2486.

 - The order of the values within the Triton program (GPU thread block)
   is an implementation detail -- it depends on the layout the compiler
   assigns to a tensor.  So this also prevented you from interpreting
   the printed output.

To address this, we now print the Triton pid and multi-dimensional
Tensor index for each value.  And each value gets its own line to avoid
passing too many args to printf.

Example output:

    ```
    pid (0, 1, 2) idx (36, 127) x: 42
    ```

If you want to observe all the values in a tensor in order, you can grep
and then sort the output.

We also make a UX enhancement to print: The printed label always ends
with ": "; you don't have to add it yourself.

Fixes #2486.
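
The per-value line format can be mimicked on the host as follows. This is a sketch of the output format only, not the backend implementation; `format_device_print` is a hypothetical helper, and it assumes the values are given as a flat list in row-major order.

```python
from itertools import product


def _fmt(index):
    # Render (36, 127) style tuples; also avoids Python's "(5,)" for 1-D.
    return "(" + ", ".join(str(i) for i in index) + ")"


def format_device_print(pid, prefix, values, shape):
    """Emit one line per value, tagged with the program id and tensor index."""
    lines = []
    indices = product(*(range(dim) for dim in shape))
    for flat, idx in enumerate(indices):
        # The ": " after the label is always added for the user.
        lines.append(f"pid {_fmt(pid)} idx {_fmt(idx)} {prefix}: {values[flat]}")
    return lines


lines = format_device_print((0, 1, 2), "x", [10, 11, 12, 13, 14, 15], (2, 3))
```

Because each line carries its own index, the output can be reordered after the fact regardless of which thread printed which value.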
2023-10-25 08:47:55 +00:00
Justin Lebar
2217bd2f5c [BACKEND] Delete dead vprintf and vprintf_array functions (#2531)
Delete dead vprintf and vprintf_array functions.

These were introduced in 88498d104a and
appear to have been dead at the time of introduction.
2023-10-25 00:22:53 +00:00
Philippe Tillet
31c76ddd05 [CI] revert recent changes (#2543) 2023-10-24 17:00:31 -07:00
Phil Tillet
5181d62b1b [CI] renamed third party test workflow 2023-10-24 12:12:52 -07:00
Justin Lebar
9b4d91b132 Add TRITON_BUILD_WITH_ASAN envvar. (#2537)
Note that asan doesn't work with programs that use the GPU, so this is
only useful for running tools like triton-opt.

I was not able to get msan working.  libstdc++'s std::string
implementation seems to use uninitialized memory in a way that seems
safe but triggers an msan error.  I tried and gave up on switching to
libc++ and teaching msan to ignore this error.
2023-10-24 10:30:30 -07:00
Philippe Tillet
3f2b7263e8 Revert "[OPTIMIZER] Tweak warpsPerCTA based on the shape of MMA output (#2485)" (#2541)
Reverts openai/triton#2525
2023-10-24 10:23:19 -07:00
Phil Tillet
96b04493f1 [CI] move workflow around 2023-10-24 04:01:53 -07:00
Sam Shleifer
12da43084b [TESTING] add diff column, option to return df in benchmark (#2469) 2023-10-24 05:17:00 +00:00
Philippe Tillet
8f467f1ea9 [OPTIMIZER] Tweak warpsPerCTA based on the shape of MMA output (#2485) (#2525)
Reverts openai/triton#2497
2023-10-23 21:50:58 -07:00
Adnan Akhundov
50add54334 [FRONTEND] Add input dtypes to autotuning key (#2534) 2023-10-24 03:29:30 +00:00