Commit Graph

2386 Commits

Author SHA1 Message Date
Thomas Raoux
5e6071254c [BACKEND] Use our internal slice implementation to avoid combinatoria… (#2535)
…l explosion
2023-10-24 03:06:34 +00:00
Phil Tillet
c65d2c2ed6 [CI] run wheels job on CPU worker 2023-10-23 20:04:37 -07:00
Thomas Raoux
cba7abd682 [BACKEND] Remove ttg.cmp and ttg.select and replace by arith op (#2526)
Now that the bug related to attribute is fixed in MLIR we can use arith
ops for cmp and select ops.
2023-10-23 19:35:46 -07:00
jayfurmanek
e74bdb1581 Always promote to int32 in commonShflSync (#369) 2023-10-23 12:27:11 -05:00
Zahi Moudallal
b0c166b9e3 [BACKEND] Fixing bug in elementwise conversion (#2517) 2023-10-20 09:11:15 -07:00
runseny
dc9e3063d7 [HOPPER] Move to tl.make_block_ptr in flash_attention backward scripts (#2395) 2023-10-20 11:06:48 +08:00
Justin Lebar
30186f401e Fix segfault in assertion test. (#2520)
<git-pr-chain>

#### Commits in this PR
1. Fix segfault in assertion test.
    
The issue here is that we were not checking the return values of the
CUDA API
calls we were making. We call one function and then use the data it
returns as
input to another call. Obviously this doesn't work if the first call
returns
    an error and doesn't actually return meaningful data.
    
I don't know why this was passing in CI, but it failed consistently for
me.

#### [PR chain](https://github.com/jlebar/git-pr-chain)
1. 👉 #2520 👈 **YOU ARE HERE**


</git-pr-chain>
2023-10-19 13:42:38 -07:00
Justin Lebar
bdf464e4a8 Make kernel_static_print test work when called twice. (#2518)
<git-pr-chain>

#### Commits in this PR
1. Make kernel_static_print test work when called twice.
    
This test is checking that a message is printed when the kernel is
compiled.
But the test had nothing to force the kernel to be compiled every time
you ran
the test. So after you ran it once, the test would fail every time until
you
    cleared the cache.

#### [PR chain](https://github.com/jlebar/git-pr-chain)
1. 👉 #2518 👈 **YOU ARE HERE**
1. #2520


</git-pr-chain>
2023-10-19 13:17:38 -07:00
ian Bearman
0d57820be9 update triton-shared ref (#2506) 2023-10-19 11:53:37 -07:00
Keren Zhou
be1de890e1 [BACKEND] Replace assert(0) with llvm::report_fatal_error (#2516)
Also add missing return statements
2023-10-19 11:53:09 -07:00
Lixun Zhang
f963c04034 Use the same heuristics for mfma type as PR#352 (#366) 2023-10-18 20:32:44 -05:00
Horace He
a4f373938c [RUNTIME] Filter out paths that don't exist in json group cache (#2511)
There's no guarantee that `/tmp/triton/*/*.json` existing means
that the corresponding `/tmp/triton/*/*.cubin` file also exists because the tmp directory doesn't guarantee file stability.
2023-10-18 16:44:34 -04:00
Zahi Moudallal
a980ec50f1 [BACKEND] Fixing f8e5m2 to bf16 conversion on A100 (#2508) 2023-10-18 17:23:39 +01:00
Thomas Raoux
e36d1665ca [BACKEND] Fix unsupported view op created during optimizations (#2510)
When propagating layout we were generating a view op with mismatching
total number of element per threads. Lowering such op would require
exchanging data across threads.
This change prevents the optimizer from generating such cases. This may
require further optimizations in the future.
2023-10-18 16:37:13 +01:00
Alexander Efimov
20f316b19a [MFMA] Switch between MFMA types (#352)
This PR introduces matrix_instr_nonkdim flag to switch
between MFMA 16 and MFMA 32.
2023-10-18 16:57:34 +02:00
ian Bearman
768fc1fcd9 [FRONTEND] change hash to not require ptxas (#2476)
I noticed that Triton is using the `ptxas` version as part of the
version hash even for non-CUDA targets. This is an attempt at fixing
this. Moving the version calculation to the back-end makes sense to me
from an architectural standpoint, so that's my approach here. I'm not as
confident in the implementation, so please if folks have any feedback
let me know.
2023-10-17 10:28:51 -07:00
Thomas Raoux
376acb610b [BUILD] Fix macos x86 build (#2505)
There was a mismatch in the llvm link name
2023-10-17 09:49:09 -07:00
Philippe Tillet
05dc28be0e [CI] refactor workflows (#2504)
no longer run third-party tests on every PR
2023-10-17 00:27:17 -07:00
Mehdi Amini
721897fcc4 upgrade llvm to b1115f8c (NFC) (#2403)
Co-authored-by: Thomas Raoux <thomas.raoux@openai.com>
Co-authored-by: Keren Zhou <kerenzhou@openai.com>
Co-authored-by: Phil Tillet <phil@openai.com>
2023-10-16 16:38:49 -07:00
Maksim Levental
87a223d76f bump triton_shared (#2501)
Bump to tip.

cc @manbearian @nhat-nguyen
2023-10-16 13:49:20 -07:00
Alexander Efimov
4d539d7dae Add licenses to AMD related files (#351) 2023-10-16 15:18:01 -05:00
Zahi Moudallal
726bdb984f [FRONTEND][BACKEND] Fix constexpr assignment ; revert #2430 (#2496)
Without this change, a constexpr assignment (ie. `A = B & C`, where `B`
and `C` are both constexpr) is getting assigned to a triton tensor,
which becomes an issue when `A` is used as the condition of an If
statement.
Note: I had to add `not isinstance(node.value, ast.Constant)` to the
condition because if we are assigning `x = 0` then the assigned value is
also a constexpr, but in this case we do want to assign a triton tensor
to `x` so that we can do `x.to(tl.int64)` for example, which cannot be
done on a constexpr.

---------

Co-authored-by: Philippe Tillet <phil@openai.com>
2023-10-16 12:35:19 -07:00
Lixun Zhang
1de859df32 [GEMM] [Tuning] Add waves_per_eu to gemm tuning (#362)
* Add waves_per_eu in the tuning space

* Do not allocate tensor on device during kernel compilation step

* Add breakdown elapsed time

* Parallelize the post-processing step

* Parallelize the profile step with --ngpus

* Better timing info printout
2023-10-16 13:50:03 -05:00
Stewart Hall
29828fe491 [FRONTEND] add option to disable fp mul/add fusion (#2495)
By default, ptxas will enable fusion of mul/add to fma instructions. The
backend was also being configured unconditionally to enable this on
conversion from LLVM IR to PTX. This commit adds an option which can be
used to disable the FP fusion behavior in both locations.
2023-10-14 12:23:30 -07:00
Philippe Tillet
3b6ec763d5 Revert "[BACKEND] Disable BreakPhiStruct pass (#2458)" (#2498)
This reverts commit b1bc9b20a0.
2023-10-14 10:40:49 -07:00
Philippe Tillet
8db4fac3b0 Revert "[OPTIMIZER] Tweak warpsPerCTA based on the shape of MMA output (#2485)" (#2497)
Reverts openai/triton#2485
2023-10-13 23:32:59 -07:00
Weixing Zhang
76858bd917 [OPTIMIZER] Tweak warpsPerCTA based on the shape of MMA output (#2485)
In current implementation, warpsPerCTA is always set to [numWarps, 1]
for 2 tt.dot fusion scenario. But, it is not optimal for cases such that
tt.dot doesn't have enough parallelism on row dimension but on column
dimension.
2023-10-12 22:25:42 -07:00
Thomas Raoux
a777e1d8db [OPTIMIZER] Propagate mma layout when the transitive use has dot_operand encoding (#2482) 2023-10-12 23:57:40 +00:00
lanzhiguanhuang
03af50b040 [BUILD] Makes the link_libraries(stdc++fs) global for all targets in the CMake project. (#2481)
### Problem
The previous change still didn't link libstdc++fs into libtriton.so,
which caused the runtime error: undefined symbol
_ZNKSt10filesystem7__cxx114path11parent_pathEv
`link_libraries(stdc++fs)` should be placed before `add_library`.

### What this PR does
This PR Makes the link_libraries(stdc++fs) global for all targets in the
CMake project. By doing so, we ensure that the stdc++fs library is
consistently linked to all targets, addressing potential build issues on
old linux OS like debian10 which uses GCC8.3.0 as the default C/C++
compiler.
2023-10-12 23:46:43 +00:00
Keren Zhou
f81d9d876f [FRONTEND] Fix math for constant values (#2472)
https://github.com/openai/triton/issues/2470
2023-10-12 12:11:42 -07:00
Thomas Raoux
cda298fae7 [Pipeliner] Allocate less shared memory when possible (#2466)
The pipeliner was overallocating shared memory for the inputs
for current schedule. This reduces the shared memory usage to only
what is needed.
Note that improving membar analysis could allow taking advantage of
allocating extra buffers to remove barriers.
2023-10-12 12:10:06 -07:00
Lixun Zhang
821e75a2b0 Improve FA fwd kernel with causal=True (#356)
* Attempt to absorb upstream's changes to improve causal=True

* Add autotuner

* Optimize for AMD MI250

- add pre_load_v as a tuning parameter
- do not define N_CTX as constexpr
- perform the second dot before sum
- remove qk_scale out of the inner loop
- add more configs in the autotuner

Note that bwd kernel is disabled for now. This is because we enabled
autotuning and grid becomes a function. So ctx.grid[0] no longer works.

* Enable bwd kernel
2023-10-12 12:34:27 -05:00
Jack Taylor
6f073a43f6 Remove old ROCM_LIBRARIES set (#360)
In the last PR I forgot to overwrite the initial setting of ROCM_LIBRARIES causing an error in the wheel building process
2023-10-12 16:01:39 +01:00
Jack Taylor
5d44c60d17 enforce cc=None on PyTorch ROCm (#296)
* enforce cc=None on ROCm

* Comment

* Update approach to ignore integer cc values

Co-authored-by: Alexander Efimov <efimov.alexander@gmail.com>

---------

Co-authored-by: Alexander Efimov <efimov.alexander@gmail.com>
2023-10-12 10:17:26 +01:00
Shucai Xiao
99fa2e4237 add tutorial group gemm example (#343)
* [DOCS] Add a tutorial example of grouped gemm (#2326)
Co-authored-by: Bin Fan <binf@nvidia.com>
2023-10-11 15:13:17 -05:00
Jack Taylor
47563240f8 PyTorch triton branch synchronisation (#354)
* Restructure ROCM Library Search
Currently there are a handful of ROCM dependant files which are required for
triton to run.  The linker(ld.lld), the include files, and multiple hip/hsa
shared objects.

This change will provide three search areas to find these files.  All in
the same order.

1. third_party/rocm.  This location is within the python/triton directory
   and is carried over when triton is built.  IF all necessary files
   are in this location there will be no need to have ROCM installed at
   all on the system.

2. $ROCM_PATH environmental variable.  If this exists it will override
   all other locations to find ROCM necessary files

3. /opt/rocm.  The default location for ROCm installations.  Finding one
   here will notify triton that ROCM is installed in this environment

To ease with step 3.  A new script scripts/amd/setup_rocm_libs.sh
has been added to the repo.  Executing this script will cause all necessary
ROCM files to be downloaded from their respective packages on repo.radeon.com
and installed in third_party/rocm.  Allowing for triton to run without installing
the full ROCM stack.  setup_rocm_libs.sh takes a env_var ROCM_VERSION if a user
wishes to install a ROCM version other than the default (currently 5.4.2)

When triton whls are built to support Pytorch, method 3 will be used to stay in
sync with PyTorch's approach of bringing along any libraries needed and not
requiring ROCM to be installed.

(cherry picked from commit e6aea90fb3e8218cb562e5d990719112d8282702)

* Fix default rocm path

Running into `fatal error: hip/hip_runtime.h: No such file or directory` with latest wheel due to incorrect directory for ROCm libs

(cherry picked from commit 292bae625b113eb65c66cfe4442da7a6456c988a)

* setup_rocm_libs.sh manylinux refactor

(cherry picked from commit f995f314ada4606cb78dc6233cd9c8effc356191)

* Set setup_rocm_libs.sh to be executable

(cherry picked from commit 05d67b9418cacda0d356c2102d7c1a887948b013)

* Revert to using numbered so files to fix upstream

(cherry picked from commit 34f8189eae57a23cc15b4b4f032fe25757e0db8e)

* Remove drm script

---------

Co-authored-by: Jeff Daily <jeff.daily@amd.com>
2023-10-11 15:30:39 +01:00
Thomas Raoux
6f46c93b9e [BACKEND] Add back dot.wait when generating async_dot (#2478)
Based on discussion this is needed to make sure there is no race
condition when reading shared memory.
2023-10-10 21:45:28 -07:00
Shucai Xiao
d6d1cf2859 add gfx942 to support matrix_core (#358) 2023-10-10 22:46:24 -05:00
Lixun Zhang
515525d068 [GEMM] Tuning script v2 (#350)
* [GEMM] Tuning script v2

* Extend tuning space to include BLOCK_SIZE = 256

Check LDS in a more smart way

* Added README

* Add git branch and commit to the default tuning result filename
2023-10-10 20:49:49 -05:00
Zahi Moudallal
4749072fbd [BACKEND] Allow reduce with sliced 3D layout as input (#2480) 2023-10-10 15:19:11 -07:00
Alexander Efimov
7e34c244c2 [Triton] Mfma16 support (#251)
* [MFAM] Support mfma with NM size 16

This PR code emitting of MFMA instructions with size 16.

* add control over mfma type with MFMA_TYPE=16 env var
2023-10-09 13:59:54 -05:00
Beal Wang
5812d970a8 [HOPPER][OPTIMIZER] remove divOp and remOp from gemm math loop (#2402)
This is just for Warp Specialization kernels on Hopper. Replace DivOp
and RemOp with SelectOp and AndOp/XorOp.
2023-10-09 14:42:06 +08:00
Philippe Tillet
424e67e727 [FRONTEND] improved while loop error messages (#2463) 2023-10-06 18:37:52 -07:00
Tori Baker
ab4549310b [OPTIMIZER] erase ops after use in iterator (#2455)
This seems to have worked fine in opt mode (although it may be producing
undefined behavior), but in debug mode on a newer version of llvm, it
segfaults without this PR as the iterators get invalidated.

This is also consistent with other places it is done in this file.
2023-10-06 18:02:56 -07:00
Thomas Raoux
b1bc9b20a0 [BACKEND] Disable BreakPhiStruct pass (#2458)
This is causing functional failures in pytorch workload. Disabling it
until I figure out the problem.
2023-10-06 17:59:53 -07:00
Thomas Raoux
a7061e19b2 [BACKEND] Fix multiple bugs in WGMMA (#2457)
Fix dependencies in wgmma_wait op to prevent the scheduler from moving
it past the uses of wgmma accumulator. We need to explicitly represent
the dependency between the wait and the accumulator uses otherwise LLVM
is free to re-order those.
This allows us to remove a workaround to prevent the re-ordering. We can
also remove the wait op added in the loop during pipelining.

Also fix the descritpor calcuation for wgmma, we should calculate the
same descriptor for the whole warpgroup.
Added a workaround for a bug that was exposed by different timing due to
those changes. We shouldn't insert operations between the loop and
async_wait or we may have race conditions.
2023-10-06 17:59:28 -07:00
Lixun Zhang
ded79e87ee [TUTORIALS] Enable causal=False in FA fwd kernel (#2459) 2023-10-06 17:54:45 -07:00
Sam Shleifer
fb3c2f3b2b [TUTORIALS] attention: support torch 2.1 (#2461) 2023-10-06 17:50:11 -07:00
Zahi Moudallal
be19cf3103 [BACKEND] Enable reduce with 3D tensors and added tests (#2460) 2023-10-06 15:08:22 -07:00
Keren Zhou
a42d517021 [FRONTEND] Better warning on nested jit functions (#2453) 2023-10-06 14:22:51 -07:00