Commit Graph

1164 Commits

Author SHA1 Message Date
Michael Melesse
1581a1da26 fix libdevice 2023-04-26 12:09:56 -05:00
Michael Melesse
2784b804d9 Merge remote-tracking branch 'upstream/main' into ifu_4_26_2023 2023-04-26 12:04:21 -05:00
Keren Zhou
8f7ec23401 [FRONTEND] Refine arithmetic checks and corresponding tests for extern_elementwise (#1577)
The current main would fail on `math.scalbn` because we implicitly cast
the first argument from `int32` to `float32`, while the function only
accepts `int32` as the first argument and `float32` as the second
argument.

So we update the type matching logic as follows:

1. Check if there's a type tuple that matches the types of the input
arguments
2. If yes, we don't allow arithmetic check.
3. If not, we will do arithmetic check to implicitly cast types among
arguments.
4. If we still don't find a corresponding function that accepts the
casted types, throw an error.
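The four steps above can be sketched in plain Python (names invented for illustration; the real logic lives in Triton's frontend semantic layer, not in this form):

```python
# Hypothetical sketch of the dispatch rules described above.
def promote(arg_types):
    # Toy arithmetic-promotion rule: any float32 argument promotes
    # all arguments to float32.
    if "float32" in arg_types:
        return tuple("float32" for _ in arg_types)
    return tuple(arg_types)

def dispatch(arg_types, overloads):
    """overloads: list of (accepted_type_tuple, symbol) pairs."""
    arg_types = tuple(arg_types)
    # Steps 1/2: exact signature match -- no implicit casts allowed.
    for sig, symbol in overloads:
        if sig == arg_types:
            return symbol, arg_types
    # Step 3: no exact match, so implicitly cast via arithmetic promotion.
    casted = promote(arg_types)
    for sig, symbol in overloads:
        if sig == casted:
            return symbol, casted
    # Step 4: still no match -> error.
    raise TypeError(f"no overload accepts {arg_types}")
```

With this rule, a `scalbn`-like overload taking `(int32, float32)` is matched exactly and no cast is applied, while a `pow`-like overload taking `(float32, float32)` is still reachable from mixed inputs through promotion.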

---------

Co-authored-by: Philippe Tillet <phil@openai.com>
2023-04-25 14:25:45 -07:00
Philippe Tillet
d9020179ee [FRONTEND] libdevice path no longer part of the runtime driver (#1580) 2023-04-25 13:44:08 -07:00
Zahi Moudallal
4963d67cd3 [FRONTEND] Use ttgir module num-warps instead of default value (#1576)
Use ttgir num-warps attribute instead of default value.
2023-04-25 08:22:49 -07:00
Natalia Gimelshein
d5969b81fe [FRONTEND] Test pow with mixed dtypes (#1575)
Also reverts #1541 that breaks this test.
2023-04-24 21:38:40 -04:00
Philippe Tillet
ec242430d1 [THIRD_PARTY] bumped ptxas version to 12.1.105 (#1574) 2023-04-24 16:49:31 -07:00
Himanshu Pathak
6d226431b1 [FRONTEND] do not run AccelerateMatmul on pre-Volta GPUs (#1505)
Related to #1271. I am currently working on adding support for
pre-Volta GPUs in Triton.

---------

Co-authored-by: Himanshu Pathak <himanshu@mtatva.com>
Co-authored-by: Philippe Tillet <phil@openai.com>
2023-04-24 15:53:02 -07:00
Philippe Tillet
a359b62ef3 [RUNTIME] Lazy driver initialization (#1571) 2023-04-24 15:16:09 -07:00
Ian O'Connell
cd096afa58 [FRONTEND] don't hold a file lock (#1569)
We have had complaints/issues where a zombie Python process randomly
holds this lock. We don't need it, since renames are atomic on POSIX,
so refactor this to make temp files unique and then use replace
(https://docs.python.org/3/library/os.html#os.replace).
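A minimal sketch of the lock-free pattern described here: write to a uniquely named temp file, then atomically rename it over the destination with `os.replace` (atomic on POSIX and Windows). The helper name is invented for illustration:

```python
import os
import tempfile

def atomic_write(path, data: bytes):
    """Write data to path without holding a lock, using atomic rename."""
    dirname = os.path.dirname(path) or "."
    # mkstemp yields a unique name, so concurrent writers never collide.
    fd, tmp = tempfile.mkstemp(dir=dirname, suffix=".tmp")
    try:
        with os.fdopen(fd, "wb") as f:
            f.write(data)
        # Atomic: readers see either the old or the new file, never a
        # partial one; with concurrent writers the last replace wins.
        os.replace(tmp, path)
    except BaseException:
        os.unlink(tmp)
        raise
```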
2023-04-24 12:50:24 -07:00
Michaël Benesty
7d2a4d95c2 [DOCS] fixed num warps / stages in matmul (#1561) 2023-04-21 12:57:26 -07:00
peterbell10
c71bf73f24 [BUILD] Use a persistent directory for cmake (#1548)
Fixes #1545

`build_temp` is a temporary directory which `distutils` used to keep in
the `./build` directory, but when `pyproject.toml` is present `pip` now
puts it in `/tmp` and removes it at the end of the build.

Instead, this creates a new permanent directory like
`python/build/cmake.linux_x86_64-cpython-3.8` (the old name but with
cmake instead of temp).
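A rough sketch of how such a persistent, platform-specific directory name can be derived (the exact string produced by setup.py may differ from this):

```python
import sys
import sysconfig

def cmake_build_dir():
    # e.g. "linux-x86_64" on a typical Linux build host
    plat = sysconfig.get_platform()
    py = f"cpython-{sys.version_info.major}.{sys.version_info.minor}"
    # Persistent location under the source tree, so incremental cmake
    # builds survive across pip invocations.
    return f"python/build/cmake.{plat}-{py}"
```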

While I was looking at the verbose pip output, I also noticed a bunch of
warnings like
```
Python recognizes 'triton/runtime.backends' as an importable package,
but it is not listed in the `packages` configuration of setuptools.

'triton/runtime.backends' has been automatically added to the distribution only
because it may contain data files, but this behavior is likely to change
in future versions of setuptools (and therefore is considered deprecated).
```

So I've also added these to the packages list.

---------

Co-authored-by: Keren Zhou <kerenzhou@openai.com>
2023-04-20 16:38:44 -07:00
cctry
3e213dccb1 [FRONTEND] Make lru_cache compatible for Python 3.7 or older (#1552)
Change the usage of LRU cache decorator from @functools.lru_cache to
@functools.lru_cache().
The former raises `TypeError('Expected maxsize to be an integer or
None')` on Python 3.7 or older.
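The compatible form simply calls the decorator factory with empty arguments; the bare `@functools.lru_cache` form only became legal in Python 3.8:

```python
import functools

# Works on Python 3.2+: lru_cache() with parentheses produces the
# decorator. On <=3.7, the bare form passes the function itself as
# maxsize and raises TypeError.
@functools.lru_cache()
def fib(n):
    return n if n < 2 else fib(n - 1) + fib(n - 2)
```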
2023-04-20 16:14:32 -07:00
Keren Zhou
fef8150b65 [FRONTEND] Remove debug print in code_gen (#1550) 2023-04-19 17:13:01 -07:00
Alexander Efimov
8b5b45fbf3 replace outdated allclose function, fix comments in test 2023-04-19 10:58:02 +00:00
Da Yan
b42e3d06d4 [FRONTEND] fix type checking in extern_elementwise (#1541)
Some math ops accept inputs of different types (e.g., tl.math.jn).
We don't want to cast the scalar types of input operands of those math
ops.
2023-04-18 16:59:21 -07:00
Daniil Fukalov
a90a2d864f [BUILD] Add ability to build with clang+lld. (#1544)
This reduces build time when LLVM is built with assertions enabled, and
dramatically speeds up Triton's build against a "debug" LLVM.

Co-authored-by: Philippe Tillet <phil@openai.com>
2023-04-18 21:20:12 +00:00
Alexander Efimov
fe612b1fc7 fix rebase issues 2023-04-18 18:16:59 +02:00
Alexander Efimov
9ca9f7a604 Update python/test/unit/language/test_core_amd.py 2023-04-18 18:13:58 +02:00
Aleksandr Efimov
d7dbe8f3a9 add test 2023-04-18 18:13:58 +02:00
Aleksandr Efimov
53f09da370 [Matmul] [Optimization] Disable pipeline pass for amd gpu
This PR temporarily disables the pipeline optimization pass for AMD GPUs to fix the matmul operation.
2023-04-18 18:13:56 +02:00
Natalia Gimelshein
7d1a95b046 [TESTS] Added test for avg_pool_bwd kernel (#1540)
This kernel was briefly broken on main; this test prevents future regressions.

---------

Co-authored-by: Keren Zhou <kerenzhou@openai.com>
2023-04-17 21:20:34 -07:00
peterbell10
a3c3e5a3a1 [TESTS][OPTIMIZER] enable tests for argmin/max and fix some bugs (#1537)
`argmin`/`argmax` are currently only tested in 1D; enabling the tests
for 2D reveals a few bugs.
2023-04-17 18:47:31 -07:00
Sharad Vikram
cf26e05a8f [FRONTEND] remove debug print (#1538) 2023-04-17 15:17:19 -07:00
rsanthanam-amd
a791028601 Merge pull request #193 from ROCmSoftwarePlatform/bc-patch
Move cuda2gcn.bc files to third_party
2023-04-17 14:59:51 -05:00
Michael Melesse
d211cd7750 skip bad test 2023-04-17 13:12:34 -05:00
Michael Melesse
705d47d0dd fix lit test issues
This is a combination of 6 commits.

install lit

fix lit test

fix lit test

fix aot lit issues

fix final lit tests

add lit tests
2023-04-17 11:46:37 -05:00
Philippe Tillet
608ec061c1 [TESTING] Added more tests for annotations and autotuner (#1533)
Essentially identical to #538, but it fails formatting tests and I don't
want to ping the author on a weekend.
2023-04-15 19:44:08 -07:00
Philippe Tillet
df6c2babbd [FRONTEND] Now using strings for annotations (#1529)
Works with `__future__` annotations and also avoids having to import
torch just for the sake of type annotations.
2023-04-15 15:32:22 -07:00
Philippe Tillet
f367647b38 [FRONTEND] Added tl.extra.cuda.smid (#1532) 2023-04-15 14:42:59 -07:00
Chenggang Zhao
c9311ef361 [TUTORIALS] Fix rendering issues in the block pointer tutorial (#1530)
Found some rendering issues here:
https://triton-lang.org/main/getting-started/tutorials/08-experimental-block-pointer.html,
sorry for not checking carefully in the last PR.
2023-04-15 14:27:14 -07:00
Philippe Tillet
e5c7d2a83c [FRONTEND] cleaned up language; added frontend function for globaltimer special register (#1525) 2023-04-14 15:29:27 -07:00
peterbell10
0d76c4ca95 [FRONTEND] Rename tl.reduction -> tl.reduce and improve testing (#1521)
`tl.reduction` is currently tested indirectly through the existing
reduction operators, but it's good to have a direct test for the
function itself.

---------

Co-authored-by: Philippe Tillet <phil@openai.com>
2023-04-14 14:35:31 -07:00
Bert Maher
bfd1f65ac7 [FRONTEND] cache path to ptxas (#1526)
When running python 3.8, I've found that process creation gets slower
over time (e.g. after creating a CUDA context, it can take 50-300ms per
subprocess.run), and we do one of these calls to `ptxas --version` for
every kernel, so a model with thousands of kernels can end up spending
substantial time just calling ptxas redundantly.
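The caching idea can be sketched as follows (generic helper name invented here; in Triton's case the tool would be `ptxas` and the cached call its version probe):

```python
import functools
import shutil

# Resolve a tool's path once per process instead of probing on every
# kernel compilation; lru_cache memoizes per argument.
@functools.lru_cache()
def tool_path(name):
    path = shutil.which(name)
    if path is None:
        raise RuntimeError(f"{name} not found on PATH")
    return path
```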

Co-authored-by: Philippe Tillet <phil@openai.com>
2023-04-14 17:01:42 +00:00
Chenggang Zhao
c624778e73 [TUTORIALS] Add tutorial for block pointers (#1519)
This PR contains:
- Several fixes for the matrix multiplication (M and N dimensions may
have out-of-bound access)
- A type check for block-based store
- The tutorial for block pointers
- Fix some formats
2023-04-14 00:40:41 -07:00
Keren Zhou
fdf1c1f2a1 [DOCS] Fix documentation workflow (#1520)
Co-authored-by: Phil Tillet <phil@openai.com>
2023-04-13 13:49:36 -07:00
Michael Melesse
3603483fc0 clean up previous platform functions 2023-04-13 13:20:08 -05:00
peterbell10
6550c528b7 [FRONTEND] don't call tl.view in arg{min,max} (#1518)
A small oversight in #1305: since `view` can rearrange elements, it
should be avoided here. Instead, I use indexing with `None` to create
new dimensions.
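For illustration, NumPy is used below as a stand-in for the analogous Triton indexing semantics: indexing with `None` inserts a new axis without moving any elements, whereas a view/reshape may remap how elements land in positions.

```python
import numpy as np

x = np.arange(6)
col = x[:, None]  # shape (6, 1): same element order, extra axis
row = x[None, :]  # shape (1, 6)
```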

Co-authored-by: Philippe Tillet <phil@openai.com>
2023-04-13 07:32:23 +00:00
Philippe Tillet
c0d86d3b04 [RUNTIME] refactor driver (#1515)
Improved separation between different backends
2023-04-12 23:50:44 -07:00
peterbell10
e152183570 [FRONTEND][BACKEND] ReduceOp to support arbitrary reduce operations (#1305)
Fixes #1285

This changes `tt.reduce` to replace `redOp` by a region containing
arbitrary code. For example, `tl.sum` is now lowered as:
```mlir
%res = "tt.reduce"(%arg0) ({
^bb0(%arg1: f32, %arg2: f32):
  %add = arith.addf %arg1, %arg2 : f32
  tt.reduce.return %add : f32
}) {axis = 1 : i32} : (tensor<128x128xf32>) -> tensor<128xf32>
```
Support for index reductions at the MLIR level is also dropped in favor
of simultaneous reductions over multiple tensors, which generalizes the
code without loss of performance. So, for example, `argmin` gets lowered
as:
```mlir
  %7 = tt.make_range {end = 256 : i32, start = 0 : i32} : tensor<256xi32>
  %8 = tt.view %7 : (tensor<256xi32>) -> tensor<1x256xi32>
  %9:2 = "tt.reduce"(%6, %8) ({
  ^bb0(%arg4: f32, %arg5: i32, %arg6: f32, %arg7: i32):
    %14 = arith.cmpf olt, %arg4, %arg6 : f32
    %15 = arith.cmpf ogt, %arg4, %arg6 : f32
    %16 = arith.cmpi slt, %arg5, %arg7 : i32
    %17 = arith.select %16, %arg5, %arg7 : i32
    %18 = arith.select %15, %arg7, %17 : i32
    %19 = arith.select %14, %arg5, %18 : i32
    %20 = arith.cmpf olt, %arg4, %arg6 : f32
    %21 = arith.select %20, %arg4, %arg6 : f32
    tt.reduce.return %21, %19 : f32, i32
  }) {axis = 1 : i32} : (tensor<1x256xf32>, tensor<1x256xi32>) -> (tensor<1xf32>, tensor<1xi32>)
```
2023-04-13 01:37:39 +00:00
Philippe Tillet
5b9119117b [CI] No longer install triton in editable mode to run tests (#1476) 2023-04-12 17:55:44 -07:00
root
a0a1c92622 Move cuda2gcn files to third party 2023-04-12 15:30:03 +00:00
root
4115fc71fd Revert "Append .bc files to package_data"
This reverts commit 455c29591c.
2023-04-12 14:54:55 +00:00
Jack Taylor
455c29591c Append .bc files to package_data
Additional context: https://github.com/ROCmSoftwarePlatform/frameworks-internal/issues/3367#issuecomment-1505072217

When triton is installed via `python setup.py install`, the required cuda2gcn.bc file is not copied over to the package location. This results in UT failures in pytorch: `Failed to load /opt/conda/envs/py_3.8/lib/python3.8/site-packages/triton/language/cuda2gcn.bc`, `Translate to LLVM IR failed`, `LLVM ERROR: Failed to translate TritonGPU to LLVM IR.`

To alleviate this issue, I propose adding the .bc file to the package_data of setup.py to ensure the file is copied over.

Reproducing torch UT:
`pytest test/inductor/test_torchinductor_dynamic_shapes.py -k "test_any_dynamic_shapes_cuda" --verbose`
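The proposed setup.py change amounts to a `package_data` entry along these lines (package list abridged and illustrative; the real setup.py carries many more arguments):

```python
from setuptools import setup

setup(
    name="triton",
    packages=["triton", "triton.language"],
    # Ship the .bc bitcode files (e.g. cuda2gcn.bc) with the
    # installed package so they are found at runtime.
    package_data={"triton.language": ["*.bc"]},
)
```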
2023-04-12 12:03:34 +01:00
Phil Tillet
9530d93504 [TESTING] change do_bench defaults 2023-04-11 22:03:52 -07:00
Phil Tillet
d7d62ddae9 Revert "[BUILD] Fixed typo in setup.py"
This reverts commit 2931bb8195.
2023-04-11 20:12:22 -07:00
Phil Tillet
2931bb8195 [BUILD] Fixed typo in setup.py 2023-04-11 20:09:09 -07:00
Philippe Tillet
02e3c18f04 [TESTING] clean up testing.do_bench (#1513) 2023-04-11 20:05:58 -07:00
zahimoud
fd34b20fba [BACKEND] Fixed bug in reduce; add tests 2023-04-11 18:09:18 -07:00
Phil Tillet
3e22e18295 [TESTING] do_bench now returns min time by default.
This is likely to be more stable in general for benchmarks with an L2
hit rate comparable to what is encountered in practice.
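A toy illustration of the min-time idea (not Triton's actual `do_bench`, which times GPU kernels with CUDA events): the minimum over repeated runs filters out one-off scheduler and cache noise better than the mean does.

```python
import time

def bench_min(fn, reps=10):
    """Return the fastest wall-clock time of reps calls to fn, in seconds."""
    times = []
    for _ in range(reps):
        start = time.perf_counter()
        fn()
        times.append(time.perf_counter() - start)
    return min(times)
```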
2023-04-11 17:18:01 -07:00