Commit Graph

95 Commits

Author SHA1 Message Date
Chenggang Zhao
e7fdfd76fb [FRONTEND] Add value restoration for autotuner (#2549)
For in-place kernels, neither `reset_to_zero` nor `Config.prehook`
provided by the autotuner can restore the values changed during the
tuning process, so I propose a recovery mechanism here.

---------

Co-authored-by: Chenggang Zhao <chenggangz@deepseek.com>
Co-authored-by: Keren Zhou <kerenzhou@openai.com>
2023-10-31 21:37:44 -04:00
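To illustrate the problem and the kind of recovery mechanism proposed here, a minimal sketch of an in-place kernel under the autotuner follows. The `restore_value` keyword and all names are assumptions for illustration, not a statement of the exact interface added by this PR.

```
import triton
import triton.language as tl


@triton.autotune(
    configs=[
        triton.Config({'BLOCK_SIZE': 256}, num_warps=4),
        triton.Config({'BLOCK_SIZE': 1024}, num_warps=8),
    ],
    key=['n_elements'],
    restore_value=['x_ptr'],  # assumed keyword: snapshot x_ptr before tuning, restore it after each run
)
@triton.jit
def scale_inplace(x_ptr, scale, n_elements, BLOCK_SIZE: tl.constexpr):
    # In-place update: without restoration, every tuning run rescales x again.
    pid = tl.program_id(axis=0)
    offsets = pid * BLOCK_SIZE + tl.arange(0, BLOCK_SIZE)
    mask = offsets < n_elements
    x = tl.load(x_ptr + offsets, mask=mask)
    tl.store(x_ptr + offsets, x * scale, mask=mask)
```

The idea is that a launch like `scale_inplace[grid](x, 2.0, x.numel())` would then apply the scale exactly once, even though the autotuner benchmarks several configs.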
Justin Lebar
258399c114 Enable ruff linter instead of flake8 (#2574)
[FRONTEND] Enable ruff linter instead of flake8.
    
This fixed a few issues automatically and also flagged two issues to
fix manually in test_core.py: We had two duplicate function names! One
of the function bodies was a duplicate, so I deleted it; the other was
not, so I gave it a new name.

AIUI all of these errors should have been picked up by flake8, and I'm
not sure why they weren't. Anyway, ruff catches them and is faster
than flake8, so it seems like an improvement in all dimensions.
2023-10-31 21:28:24 +00:00
Zahi Moudallal
943330790a [FRONTEND] add do_not_specialize property back to JITFunction (#2573) 2023-10-31 12:02:45 -07:00
Justin Lebar
12f906287f [FRONTEND] Refactor jit.py. (#2556)
[FRONTEND] Refactor jit.py.

The goal is to simplify the code and make it more flexible before we
change the kernel launch syntax to
`kernel[grid, compiler_flags(...)](...)`.

The main changes here are:

 - Get rid of the eval'ed code in make_launcher.  We can do everything
   using bind().
 - Add KernelParam and KernelArg classes, letting us get rid of the
   parallel arrays/dicts indexed by parameter index.
 - Get rid of duplicated kernel launch code in the cache-hit/cache-miss
   branches.
2023-10-30 13:14:51 -07:00
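A rough, self-contained sketch (not the actual jit.py code) of how `inspect.Signature.bind` can replace eval'ed launcher code when mapping launch arguments onto kernel parameters; `fake_kernel` is just a stand-in signature.

```
import inspect


def fake_kernel(x_ptr, y_ptr, n_elements, BLOCK_SIZE=1024):
    """Stand-in for a @triton.jit function; only the signature matters here."""


sig = inspect.signature(fake_kernel)


def bind_args(*args, **kwargs):
    # bind() pairs launch arguments with parameter names and raises on mismatch,
    # removing the need for launcher code generated with exec()/eval().
    bound = sig.bind(*args, **kwargs)
    bound.apply_defaults()        # fills in defaults such as BLOCK_SIZE=1024
    return dict(bound.arguments)


print(bind_args(0x1000, 0x2000, n_elements=512))
# {'x_ptr': 4096, 'y_ptr': 8192, 'n_elements': 512, 'BLOCK_SIZE': 1024}
```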
Justin Lebar
f88b01f558 Apply ruff pre-commit to python/triton/runtime. (#2558)
We're in the process of incrementally converting from autopep8 + flake8
+ isort to ruff, on a directory-by-directory basis.

The motivation to switch away from autopep8 is that I can't get it to
wrap long lines, even with -aaa.  This seems to be a known problem,
https://github.com/hhatto/autopep8/issues/497.

See more details about alternatives tried in
https://github.com/openai/triton/pull/2557.
2023-10-30 11:06:44 -07:00
Adnan Akhundov
50add54334 [FRONTEND] Add input dtypes to autotuning key (#2534) 2023-10-24 03:29:30 +00:00
Justin Lebar
30186f401e Fix segfault in assertion test. (#2520)
The issue here is that we were not checking the return values of the
CUDA API calls we were making. We call one function and then use the
data it returns as input to another call. Obviously this doesn't work
if the first call returns an error and doesn't actually return
meaningful data.

I don't know why this was passing in CI, but it failed consistently for
me.

2023-10-19 13:42:38 -07:00
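The fix itself lives in the C/CUDA test code, but the check-before-use pattern it describes looks roughly like this Python/ctypes sketch (kept in Python for consistency with the other examples; the error-handling details are assumptions):

```
import ctypes

CUDA_SUCCESS = 0
libcuda = ctypes.CDLL("libcuda.so")  # CUDA driver API


def check(result, what):
    # Every driver API call returns a CUresult; using a call's outputs without
    # checking it is exactly the bug described above.
    if result != CUDA_SUCCESS:
        raise RuntimeError(f"{what} failed with CUresult {result}")


check(libcuda.cuInit(0), "cuInit")
device = ctypes.c_int()
check(libcuda.cuDeviceGet(ctypes.byref(device), 0), "cuDeviceGet")
```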
Horace He
a4f373938c [RUNTIME] Filter out paths that don't exist in json group cache (#2511)
There's no guarantee that `/tmp/triton/*/*.json` existing means
that the corresponding `/tmp/triton/*/*.cubin` file also exists because the tmp directory doesn't guarantee file stability.
2023-10-18 16:44:34 -04:00
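A minimal sketch of the kind of filtering described above; the manifest layout and names are assumptions for illustration.

```
import json
import os


def load_group_manifest(manifest_path):
    # Assumed layout: {"kernel_name": "/tmp/triton/<hash>/<name>.cubin", ...}.
    # Entries whose files were cleaned out of the temp directory are dropped
    # rather than trusted just because the .json manifest survived.
    with open(manifest_path) as f:
        paths = json.load(f)
    return {name: path for name, path in paths.items() if os.path.exists(path)}
```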
ian Bearman
768fc1fcd9 [FRONTEND] change hash to not require ptxas (#2476)
I noticed that Triton is using the `ptxas` version as part of the
version hash even for non-CUDA targets. This is an attempt at fixing
this. Moving the version calculation to the back-end makes sense to me
from an architectural standpoint, so that's my approach here. I'm not as
confident in the implementation, so if folks have any feedback, please
let me know.
2023-10-17 10:28:51 -07:00
Stewart Hall
29828fe491 [FRONTEND] add option to disable fp mul/add fusion (#2495)
By default, ptxas will enable fusion of mul/add to fma instructions. The
backend was also being configured unconditionally to enable this on
conversion from LLVM IR to PTX. This commit adds an option which can be
used to disable the FP fusion behavior in both locations.
2023-10-14 12:23:30 -07:00
edimetia3d
cb83b42ed6 [FRONTEND] using closure to create jit launcher (#2289)
Hi,

I'm adding some features to
`triton.runtime.jit.JITFunction._make_launcher` and found it hard to
debug:
1. The inlined Python code is hard to inspect in my editor.
2. My debugger fails to step into this inlined code.

In response, I've introduced some code to solve these issues. My
modifications include:
~~1. Refactoring the launcher's inline Python code, ensuring it only
relies on the "self" object.~~
~~2. Adding a utility method that generates a temporary file to create a
launcher when debugging a kernel in the main module.~~
Using a closure to hold the launcher's body.

Because this feature might be useful to others, I have opened this
pull request.

~~Tests are yet to be added; if this submission is accepted, I will
add them later.~~
Since this change is a refactor, no new test was added.
2023-09-22 17:01:54 -07:00
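A simplified sketch of the difference (not the real `_make_launcher` code): an exec'ed launcher body is opaque to editors and debuggers, while a closure is ordinary source that tools can step into.

```
def make_launcher_with_exec(kernel_fn):
    # Old style (illustrative): the launcher body only exists as a string,
    # so breakpoints and "go to definition" have nothing to attach to.
    src = "def launcher(*args, **kwargs):\n    return kernel_fn(*args, **kwargs)\n"
    scope = {"kernel_fn": kernel_fn}
    exec(src, scope)
    return scope["launcher"]


def make_launcher_with_closure(kernel_fn):
    # New style (illustrative): a plain closure over kernel_fn; stack traces
    # and debuggers point at real source lines.
    def launcher(*args, **kwargs):
        return kernel_fn(*args, **kwargs)
    return launcher
```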
Philippe Tillet
894fa9e943 [RUNTIME][INTERPRETER] now also override __str__ method for tensors (#2325) 2023-09-17 16:49:30 -07:00
Philippe Tillet
e686b4d6d4 [FRONTEND] interpreter rewrite (#2321)
This is a new interpreter mode that shares semantic analysis with the
JIT'ed codepath and that the Triton core team is committed to maintaining.
2023-09-17 14:58:50 -07:00
Thomas Raoux
b63e8f87fc [FRONTEND] Override prototype (#2214)
A low-tech but very useful way to override kernels on the fly. This can
be used for debugging functionality or performance problems: it lets
users dump, modify, and feed IR back into the JIT compiler.
2023-09-13 10:05:47 -07:00
Ying Zhang
37f12497b0 [FRONTEND] Add PyTorch fp8 dtypes to Triton (#2279)
Add PyTorch fp8 dtypes
(8025b193a9/torchgen/api/types/types.py (L50-L51))
to Triton.
2023-09-12 08:57:01 -07:00
Shintaro Iwasaki
8da27c1c95 [Build] Fix very minor compilation problems (#2277)
This PR fixes a few very minor compilation issues found in internal
deployment at Meta. It may look like nit-picking, but it would be really
appreciated if it could be addressed in OSS Triton (to reduce the
differences between our internal copy and OSS), and we believe these
changes are not bad in general. Neither performance nor functionality is
affected by this PR.

1. Type cast in `python/triton/runtime/backends/cuda.c`. Implicit `void
*` -> `cuuint{32,64}_t *` cast is not allowed by many compilers (with
certain flags). It'd be nice to add an explicit cast (like
`backends/hip.c`).

2. Inconsistent include path specification in
`lib/Conversion/TritonGPUToLLVM/DotOpToLLVM/WGMMA.cpp`. Unlike other
`DotOpToLLVM/*.cpp`, include paths used in `WGMMA.cpp` are not relative.
This is problematic in some compilation settings since a compiler
somehow needs to find headers in a parent directory. It'd be great to
use a relative path, like other source files in Triton.

cc: @yuguo68
2023-09-11 19:28:31 -07:00
Izzy Putterman
7d01c1852a Revert unintentional change (#2257)
This change seems to have been unintentionally reverted in the hopper
PR:
38d767ea93

Adding it back.
2023-09-07 10:48:12 -07:00
Keren Zhou
9e9fbe01f0 [FRONTEND] Fix specialization on triton integer types (#2236)
https://github.com/openai/triton/issues/2231
2023-09-03 23:57:08 -07:00
Shantanu
a4df60e20a [FRONTEND] Fix GIL handling in error conditions (#2225)
The use of the opaque GIL state APIs should mean that PyErr_SetString
is now safe regardless of whether the caller holds the GIL or not.
2023-09-01 13:30:42 -07:00
Michael Melesse
c6d33dcebf [ROCM] Core Functionality for AMD (#1983)
* This PR adds a third-party backend for Triton that works on AMD.
* It exposes a lot of the work that has been done in our
[fork](https://github.com/ROCmSoftwarePlatform/triton).
* Most unit tests in `test_core.py` pass.
* Some unit tests are skipped for various reasons.
* We plan to follow up with more PRs improving functionality and
performance in the future.

---------

Co-authored-by: Philippe Tillet <phil@openai.com>
2023-08-31 14:02:00 -07:00
jon-chuang
9af76e7d5a [RUNTIME] Fix cache dir (#2196)
---------

Co-authored-by: Keren Zhou <kerenzhou@openai.com>
2023-08-29 21:07:16 -04:00
Greg Brockman
ab3e8b0dad [FRONTEND] fix handling of do_not_specialize with interior constantexprs (#2188) 2023-08-26 09:19:34 -07:00
Mohammed Anany
ebfe0ffb29 [FRONTEND] fix for undefined dtypes in jit during loading defaults (#2114)
Co-authored-by: Keren Zhou <kerenzhou@openai.com>
2023-08-25 10:28:23 -07:00
Shantanu
7083dae4f2 [FRONTEND] drop the GIL around more CUDA ops (#2173) 2023-08-24 20:31:38 -07:00
chengjunlu
6cb67185f8 [FRONTEND] Use proper default num_warps and num_stages based on the device backend in JITFunction (#2130)
The default values used by JITFunction for num_warps and num_stages are
coupled to the Nvidia GPU architecture. We should use proper default
values based on the device backend that the kernel is compiled for.
1. Add two functions that return the default num_warps and num_stages
for a specific device backend.
2. Have JITFunction use the proper default num_warps and num_stages
based on the specific device backend.

Co-authored-by: Wang Weihan <eikan.wang@intel.com>
2023-08-24 21:58:18 +08:00
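A tiny sketch of the shape of this change; the function names, backend keys, and default values below are assumptions, not the actual Triton API.

```
def default_num_warps(backend: str) -> int:
    # Hypothetical helper: each backend reports its own sensible default.
    return {"cuda": 4, "hip": 4, "cpu": 1}.get(backend, 4)


def default_num_stages(backend: str) -> int:
    # Hypothetical helper: deep software pipelining mainly pays off on NVIDIA GPUs.
    return {"cuda": 3, "hip": 1, "cpu": 1}.get(backend, 2)
```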
Zahi Moudallal
23dd11d471 [BACKEND] Solidify f8e4m3 (#2105)
Co-authored-by: Philippe Tillet <phil@openai.com>
2023-08-18 19:12:09 -07:00
Izzy Putterman
fc667d1f8f [FRONTEND] fix new absolute imports (#2072)
Co-authored-by: Philippe Tillet <phil@openai.com>
2023-08-13 14:23:36 +00:00
Thomas
98372f46d3 [FRONTEND] Remove extra calls to _get_config causing runtime overhead (#2094) 2023-08-13 06:51:26 -07:00
Zahi Moudallal
a01c116f76 [FRONTEND/BACKEND] Revived Float8E4B15x4 (#2090) 2023-08-11 17:49:52 -07:00
Keren Zhou
382e8fb1fa [RUNTIME] Make apis compatible with cuda 11 drivers (#2081)
https://github.com/openai/triton/issues/2042
2023-08-11 17:46:56 -07:00
Shantanu
776b3784c2 [FRONTEND] further improve version_key speed (#2073)
Realised I could do this right after my first PR got merged. This saves
another 100 ms.
2023-08-09 22:29:36 +00:00
Shantanu
0e11257b8d [FRONTEND] improve speed of computing version_key (#2071)
libtriton.so is pretty large these days and hashing it is slow.
Switching the hash from md5 to sha1 shaves close to 300ms off the time
for me (as well as being a better hash, for whatever that's worth).

As far as I could tell, sha1 is the fastest stable hash in the Python
standard library, including things like zlib.crc32.
2023-08-09 21:44:10 +00:00
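A sketch of the kind of hashing involved (the real version_key also mixes in other inputs): hashing a multi-megabyte shared library in chunks with `hashlib.sha1`.

```
import hashlib


def hash_library(path, chunk_size=1 << 20):
    # Stream the file in 1 MiB chunks; sha1 is noticeably faster than md5
    # on large inputs such as libtriton.so.
    h = hashlib.sha1()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()
```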
Keren Zhou
30a331e628 [FRONTEND] Support jit functions without arguments (#2043)
Issue https://github.com/openai/triton/issues/1973

Co-authored-by: Philippe Tillet <phil@openai.com>
2023-08-07 19:05:56 -07:00
goostavz
f1512bded1 Initial code merge of Hopper support (#2036)
The initial code merge of Nvidia Hopper feature support. Please be
aware that the code merge is not finished yet and troubleshooting
is still ongoing. The new hardware features (GMMA, TMA, STMATRIX, etc.)
and automatic warp specialization are experimental for now and turned
off by default. We recommend giving them a try once version 3.0 is
released.

The work is contributed by:
ben-zhang-609, bealwang, donproc, qliu93, jsh20, allatit23, LyricZhao,
ivanyinwz, goostavz & yangjunpro
from Nvidia, in cooperation with:
ptillet, Jokeren, ThomasRaoux & zahimoud
from OpenAI.

Co-authored-by: Goostav Zhu <gzhu@nvidia.com>
2023-08-07 09:53:04 +08:00
Shantanu
4f1b2ea8d7 [FRONTEND] fix error with -> None return annotation (#1987)
None is not a type, so you get:
```
    self.constexprs = [self.arg_names.index(name) for name, ty in self.__annotations__.items() if 'constexpr' in ty]
E   TypeError: argument of type 'NoneType' is not iterable
```

Co-authored-by: Philippe Tillet <phil@openai.com>
2023-07-25 18:49:45 -07:00
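A small sketch of the failure mode and one possible guard (not necessarily the exact fix that landed): a `-> None` return annotation puts the value `None` into `__annotations__`, and `'constexpr' in None` raises the TypeError shown above.

```
def find_constexpr_params(annotations):
    # Coercing the annotation to a string (illustrative guard) tolerates the
    # None produced by a "-> None" return annotation.
    return [name for name, ty in annotations.items() if "constexpr" in str(ty)]


def kernel(x, BLOCK: "tl.constexpr") -> None:
    ...


print(find_constexpr_params(kernel.__annotations__))  # ['BLOCK']
```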
Philippe Tillet
3452615d79 [BUILD] Reverted ptxas change and fixed bug in cache key computation (#1971) 2023-07-19 20:58:24 -07:00
Alex Collins
80163a9c1e [FRONTEND] Add support for default args in kernel wrappers (#1943)
Fixes the case where setting default values for arguments in a kernel
function signature results in a generated kernel wrapper function
without these default values.

For example:
```
@triton.jit
def kernel(x, y, z=3):
    ...

...
kernel[grid](x,y)
```

Co-authored-by: Philippe Tillet <phil@openai.com>
2023-07-14 21:32:47 +00:00
Philippe Tillet
5a722b5f74 [OPS][TESTS] Added float8 support in triton.ops.matmul (#1918)
this also adds rather extensive testing for mixed precision mode,
including `float8e4b15 x float8e5` and `float8e5 x float16`
2023-07-10 09:31:12 -07:00
Natalia Gimelshein
778ed64a66 [BACKEND] make sure we always bind to primary context in loadBinary (#1912) 2023-07-07 14:28:03 -07:00
Bert Maher
38d767ea93 [FRONTEND] fix memory leak caused by retaining args to autotuned kernel (#1911) 2023-07-07 20:58:29 +00:00
Philippe Tillet
6d1285e1ae [FRONTEND][BACKEND] improved fp8 specs (#1906)
This un-reverts commit
d4c941177e.
2023-07-06 13:03:53 -07:00
Philippe Tillet
f77015967d Revert "[FRONTEND][BACKEND] improved fp8 specs (#1841)" (#1865)
This reverts commit d4c941177e.
2023-06-29 21:07:01 -04:00
Izzy Putterman
9961b5c7aa [TESTING] allow user to adjust warmup and repetition time for autotuning (#1850)
Adds an option to adjust the warmup and repetition time for autotuning.
It defaults to the old values and has no effect on current kernels.
This is useful for bigger kernels, where a single run may be a sizable
fraction of the 100 ms benchmarking window, leading to less warmup and
more variance during benchmarking.
2023-06-28 11:04:43 -07:00
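A hedged sketch of how such an option might be used; the `warmup`/`rep` keyword names (in milliseconds) are assumptions about the interface, and the kernel is purely illustrative.

```
import triton
import triton.language as tl


@triton.autotune(
    configs=[triton.Config({'BLOCK': 256}), triton.Config({'BLOCK': 1024})],
    key=['n'],
    warmup=50,   # assumed keyword: warmup time per config, in ms
    rep=500,     # assumed keyword: measurement time per config, in ms
)
@triton.jit
def add_one(x_ptr, n, BLOCK: tl.constexpr):
    pid = tl.program_id(axis=0)
    offs = pid * BLOCK + tl.arange(0, BLOCK)
    mask = offs < n
    tl.store(x_ptr + offs, tl.load(x_ptr + offs, mask=mask) + 1, mask=mask)
```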
Philippe Tillet
d4c941177e [FRONTEND][BACKEND] improved fp8 specs (#1841)
clearly differentiate between standard fp8e4 (which we'll stop
supporting on SM <= 89 because conversions are too expensive if we want
to handle the single NaN and clipping properly) and a software-optimized
fp8e4b15 format.
2023-06-26 16:19:03 -07:00
Izzy Putterman
3c400e7818 [FRONTEND] switch absolute imports to relative v2 (#1833) 2023-06-26 04:13:12 +00:00
Goran Flegar
8d566e4196 [FRONTEND] Fix missing attribute access in DependenciesFinder (#1820)
It seems that patch #1773 introduced a bug, since the `lhs` object
doesn't necessarily have a `__name__` attribute.

I'm hitting this if I modify the matmul tutorial
(gflegar/triton@442b00f4d):

```
File "/home/gflegar/triton/python/triton/runtime/jit.py", line 74, in visit_Attribute
  if lhs is None or lhs.__name__ == "triton":
AttributeError: 'Tensor' object has no attribute '__name__'
```

I think the idea of that patch was to remove the need to import triton
by replacing `lhs is triton` with `lhs.__name__ == "triton"`. This patch
should have the same behavior as the original code, but without failing
if `lhs` doesn't have a `__name__` attribute.
2023-06-22 13:30:25 -07:00
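A minimal sketch of the defensive check described above (the surrounding visitor code is omitted):

```
def is_triton_module(lhs):
    # getattr with a default tolerates objects such as a Tensor that have no
    # __name__ attribute, unlike a bare `lhs.__name__ == "triton"`.
    return getattr(lhs, "__name__", None) == "triton"
```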
Izzy Putterman
5686c51cdb [FRONTEND] allow pre-hook in autotuner configs to access config kwargs (#1814)
This is a very quick change that allows the configs' pre-hooks to see
the values in the config itself. This is useful if we'd like to allocate
an intermediate tensor whose shape depends on the tile size.
2023-06-22 05:40:48 -07:00
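A short sketch of how a config pre-hook might use this; the exact contents of the `args` dict passed to the hook are an assumption for illustration.

```
import triton


def prehook(args):
    # With this change, `args` (assumed layout) also exposes the config's own
    # kwargs, so intermediate buffers can be sized from the tile shape that is
    # about to be benchmarked.
    print("tuning candidate with BLOCK_M =", args["BLOCK_M"])


configs = [
    triton.Config({"BLOCK_M": 64}, pre_hook=prehook),
    triton.Config({"BLOCK_M": 128}, pre_hook=prehook),
]
```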
Philippe Tillet
9a2580de13 [CI] Added H100 node (#1779) 2023-06-15 14:21:47 -07:00
Philippe Tillet
b24dc19741 [FRONTEND] cleaned up symbol names (#1782) 2023-06-14 18:55:32 -07:00
Izzy Putterman
71e21f5797 [FRONTEND] switch absolute imports to relative imports in Triton (#1773) 2023-06-14 23:59:24 +00:00