Commit Graph

765 Commits

Author SHA1 Message Date
Ian O'Connell
cd096afa58 [FRONTEND] don't hold a file lock (#1569)
We have occasionally had complaints/issues where a zombie Python process is
holding this lock. We don't need it, since renames are atomic on POSIX.
So refactor this to make temp files unique and then use `os.replace`
(https://docs.python.org/3/library/os.html#os.replace).
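The write-then-rename pattern the commit describes can be sketched as follows; `atomic_write` is a hypothetical helper illustrating the idea, not Triton's actual code:

```python
import os
import tempfile

def atomic_write(path, data):
    # Write to a uniquely named temp file in the same directory, then
    # atomically move it into place. On POSIX, os.replace is an atomic
    # rename, so readers never observe a partial file and no lock is
    # needed; unique temp names avoid writer/writer collisions.
    dirname = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=dirname)
    try:
        with os.fdopen(fd, "w") as f:
            f.write(data)
        os.replace(tmp_path, path)  # atomically overwrites any existing file
    except BaseException:
        os.unlink(tmp_path)
        raise
```

Because the destination only ever changes via a single atomic rename, concurrent readers see either the old contents or the new ones, never a mix.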
2023-04-24 12:50:24 -07:00
Michaël Benesty
7d2a4d95c2 [DOCS] fixed num warps / stages in matmul (#1561) 2023-04-21 12:57:26 -07:00
peterbell10
c71bf73f24 [BUILD] Use a persistent directory for cmake (#1548)
Fixes #1545

`build_temp` is a temporary directory which `distutils` used to keep in
the `./build` directory, but when `pyproject.toml` is present `pip` now
puts it in `/tmp` and removes it at the end of the build.

Instead, this creates a new permanent directory like
`python/build/cmake.linux_x86_64-cpython-3.8` (the old name but with
cmake instead of temp).

While I was looking at the verbose pip output, I also noticed a bunch of
warnings like
```
Python recognizes 'triton/runtime.backends' as an importable package,
but it is not listed in the `packages` configuration of setuptools.

'triton/runtime.backends' has been automatically added to the distribution only
because it may contain data files, but this behavior is likely to change
in future versions of setuptools (and therefore is considered deprecated).
```

So I've also added these to the packages list.

---------

Co-authored-by: Keren Zhou <kerenzhou@openai.com>
2023-04-20 16:38:44 -07:00
cctry
3e213dccb1 [FRONTEND] Make lru_cache compatible for Python 3.7 or older (#1552)
Change the usage of the LRU cache decorator from `@functools.lru_cache` to
`@functools.lru_cache()`.
The former raises `TypeError('Expected maxsize to be an integer or None')`
on Python 3.7 or older.
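For reference, the parenthesized form works on every Python version (the bare form only became valid in 3.8, when `lru_cache` gained support for being applied directly); `expensive` is an illustrative function, not code from the PR:

```python
import functools

# On Python 3.8+ a bare @functools.lru_cache is accepted, but on 3.7
# and older the decorator must be called with parentheses. The called
# form below is compatible with both.
@functools.lru_cache()
def expensive(x):
    return x * x
```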
2023-04-20 16:14:32 -07:00
Keren Zhou
fef8150b65 [FRONTEND] Remove debug print in code_gen (#1550) 2023-04-19 17:13:01 -07:00
Da Yan
b42e3d06d4 [FRONTEND] fix type checking in extern_elementwise (#1541)
Some math ops accept inputs of different types (e.g., tl.math.jn).
We don't want to cast the scalar types of input operands of those math
ops.
2023-04-18 16:59:21 -07:00
Daniil Fukalov
a90a2d864f [BUILD] Add ability to build with clang+lld. (#1544)
This reduces build time with an assertions-enabled LLVM and
dramatically speeds up Triton's build with a "debug" LLVM.

Co-authored-by: Philippe Tillet <phil@openai.com>
2023-04-18 21:20:12 +00:00
Natalia Gimelshein
7d1a95b046 [TESTS] Added test for avg_pool_bwd kernel (#1540)
This kernel was briefly broken on main; this test prevents future regressions.

---------

Co-authored-by: Keren Zhou <kerenzhou@openai.com>
2023-04-17 21:20:34 -07:00
peterbell10
a3c3e5a3a1 [TESTS][OPTIMIZER] enable tests for argmin/max and fix some bugs (#1537)
`argmin`/`argmax` is currently only tested in 1d; enabling the tests
for 2d revealed a few bugs.
2023-04-17 18:47:31 -07:00
Sharad Vikram
cf26e05a8f [FRONTEND] remove debug print (#1538) 2023-04-17 15:17:19 -07:00
Philippe Tillet
608ec061c1 [TESTING] Added more tests for annotations and autotuner (#1533)
Essentially identical to #538, but it fails formatting tests and I don't
want to ping the author on a weekend.
2023-04-15 19:44:08 -07:00
Philippe Tillet
df6c2babbd [FRONTEND] Now using strings for annotations (#1529)
Works with `__future__` annotations and also avoids having to import
torch just for the sake of type annotations.
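The mechanism can be illustrated with a plain string annotation; `f` is a hypothetical example, not actual Triton code:

```python
# Quoting the annotation stores it as a plain string, so it is never
# evaluated and torch does not need to be importable at definition
# time. A module-level `from __future__ import annotations` has the
# same effect for every annotation in the module (PEP 563).
def f(x: "torch.Tensor") -> "torch.Tensor":
    return x
```

Code that inspects annotations then compares strings (e.g. `f.__annotations__["x"] == "torch.Tensor"`) instead of comparing live type objects.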
2023-04-15 15:32:22 -07:00
Philippe Tillet
f367647b38 [FRONTEND] Added tl.extra.cuda.smid (#1532) 2023-04-15 14:42:59 -07:00
Chenggang Zhao
c9311ef361 [TUTORIALS] Fix rendering issues in the block pointer tutorial (#1530)
Found some rendering issues here:
https://triton-lang.org/main/getting-started/tutorials/08-experimental-block-pointer.html,
sorry for not checking carefully in the last PR.
2023-04-15 14:27:14 -07:00
Philippe Tillet
e5c7d2a83c [FRONTEND] cleaned up language; added frontend function for globaltimer special register (#1525) 2023-04-14 15:29:27 -07:00
peterbell10
0d76c4ca95 [FRONTEND] Rename tl.reduction -> tl.reduce and improve testing (#1521)
`tl.reduction` is currently tested indirectly through the existing
reduction operators, but it's good to have a direct test for the
function itself.

---------

Co-authored-by: Philippe Tillet <phil@openai.com>
2023-04-14 14:35:31 -07:00
Bert Maher
bfd1f65ac7 [FRONTEND] cache path to ptxas (#1526)
When running Python 3.8, I've found that process creation gets slower
over time (e.g. after creating a CUDA context, it can take 50-300ms per
`subprocess.run`), and we make one of these calls to `ptxas --version` for
every kernel, so a model with thousands of kernels can end up spending
substantial time just calling ptxas redundantly.
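The memoization can be sketched with `functools.lru_cache`; `ptxas_version` is an illustrative helper (the actual PR caches the path lookup as well):

```python
import functools
import subprocess

@functools.lru_cache()
def ptxas_version(ptxas="ptxas"):
    # Run `<ptxas> --version` once per distinct binary path and memoize
    # the result, instead of spawning a subprocess for every kernel
    # compilation.
    return subprocess.run([ptxas, "--version"],
                          capture_output=True, text=True).stdout
```

Repeated calls with the same argument return the cached string without launching a new process.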

Co-authored-by: Philippe Tillet <phil@openai.com>
2023-04-14 17:01:42 +00:00
Chenggang Zhao
c624778e73 [TUTORIALS] Add tutorial for block pointers (#1519)
This PR contains:
- Several fixes for the matrix multiplication (the M and N dimensions may
have out-of-bounds accesses)
- A type check for block-based stores
- The tutorial for block pointers
- Some formatting fixes
2023-04-14 00:40:41 -07:00
Keren Zhou
fdf1c1f2a1 [DOCS] Fix documentation workflow (#1520)
Co-authored-by: Phil Tillet <phil@openai.com>
2023-04-13 13:49:36 -07:00
peterbell10
6550c528b7 [FRONTEND] don't call tl.view in arg{min,max} (#1518)
A small oversight in #1305: since `view` can rearrange elements, it
should be avoided here. Instead, I use indexing with `None` to create new
dimensions.
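The same distinction exists in NumPy, which makes for a compact illustration of why `None` indexing is the safe choice here (this is an analogy, not the Triton code itself):

```python
import numpy as np

# reshape/view-style ops may change how elements map to positions,
# while indexing with None (np.newaxis) only inserts a size-1 axis
# and is guaranteed to preserve element order.
x = np.arange(6)
col = x[:, None]   # shape (6, 1): new trailing axis
row = x[None, :]   # shape (1, 6): new leading axis
```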

Co-authored-by: Philippe Tillet <phil@openai.com>
2023-04-13 07:32:23 +00:00
Philippe Tillet
c0d86d3b04 [RUNTIME] refactor driver (#1515)
Improved separation between different backends
2023-04-12 23:50:44 -07:00
peterbell10
e152183570 [FRONTEND][BACKEND] ReduceOp to support arbitrary reduce operations (#1305)
Fixes #1285

This changes `tt.reduce` to replace `redOp` by a region containing
arbitrary code. For example, `tl.sum` is now lowered as:
```mlir
%res = "tt.reduce"(%arg0) ({
^bb0(%arg1: f32, %arg2: f32):
  %add = arith.addf %arg1, %arg2 : f32
  tt.reduce.return %add : f32
}) {axis = 1 : i32} : (tensor<128x128xf32>) -> tensor<128xf32>
```
Support for index reductions at the MLIR level is also dropped in favor
of simultaneous reductions over multiple tensors, which generalizes the
code without loss of performance. So, for example, `argmin` gets lowered
as:
```mlir
  %7 = tt.make_range {end = 256 : i32, start = 0 : i32} : tensor<256xi32>
  %8 = tt.view %7 : (tensor<256xi32>) -> tensor<1x256xi32>
  %9:2 = "tt.reduce"(%6, %8) ({
  ^bb0(%arg4: f32, %arg5: i32, %arg6: f32, %arg7: i32):
    %14 = arith.cmpf olt, %arg4, %arg6 : f32
    %15 = arith.cmpf ogt, %arg4, %arg6 : f32
    %16 = arith.cmpi slt, %arg5, %arg7 : i32
    %17 = arith.select %16, %arg5, %arg7 : i32
    %18 = arith.select %15, %arg7, %17 : i32
    %19 = arith.select %14, %arg5, %18 : i32
    %20 = arith.cmpf olt, %arg4, %arg6 : f32
    %21 = arith.select %20, %arg4, %arg6 : f32
    tt.reduce.return %21, %19 : f32, i32
  }) {axis = 1 : i32} : (tensor<1x256xf32>, tensor<1x256xi32>) -> (tensor<1xf32>, tensor<1xi32>)
```
2023-04-13 01:37:39 +00:00
Philippe Tillet
5b9119117b [CI] No longer install triton in editable mode to run tests (#1476) 2023-04-12 17:55:44 -07:00
Phil Tillet
9530d93504 [TESTING] change do_bench defaults 2023-04-11 22:03:52 -07:00
Phil Tillet
d7d62ddae9 Revert "[BUILD] Fixed typo in setup.py"
This reverts commit 2931bb8195.
2023-04-11 20:12:22 -07:00
Phil Tillet
2931bb8195 [BUILD] Fixed typo in setup.py 2023-04-11 20:09:09 -07:00
Philippe Tillet
02e3c18f04 [TESTING] clean up testing.do_bench (#1513) 2023-04-11 20:05:58 -07:00
zahimoud
fd34b20fba [BACKEND] Fixed bug in reduce; add tests 2023-04-11 18:09:18 -07:00
Phil Tillet
3e22e18295 [TESTING] do_bench now returns min time by default.
This is likely to be more stable in general for benchmarks whose L2
hit rate is comparable to what is encountered in practice.
2023-04-11 17:18:01 -07:00
Philippe Tillet
0fedf6b79a [TESTS] disable launch latency test (#1510) 2023-04-11 10:31:16 -07:00
Philippe Tillet
e0d6f5f4f5 [BUILD] updated LLVM binaries (#1504)
Co-authored-by: Christian Sigg <csigg@google.com>
2023-04-11 00:14:00 -07:00
Keren Zhou
6d0ed41307 [BACKEND] Replace Func Dialect with custom triton ops (func, call, return) (#1502)
MLIR currently only supports a custom inlining interface per dialect, so
we cannot change the inlining decision of `func.func`.


https://discourse.llvm.org/t/avoid-inlining-some-functions-using-the-func-dialect/69830/3

We could revert this once they've designed a better inliner interface.

Inlining attributes will be implemented in the next PR since this PR is
already huge.
2023-04-10 21:08:40 -07:00
Zilin Zhu
19e424eb98 [ops/blocksparse] Fix grid shape for large lm (#1491)
When the language model grows really large, axis 1 of the original
grid shape (`c.shape[1]`, corresponding to the number of nonzero elements
in the layout) can exceed 65535, which exceeds the CUDA limit and
results in `[CUDA]: invalid argument`.

This PR moves axis 1 of the original grid to axis 0, as the limit
for axis 0 is 2^31 - 1.
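The axis swap can be sketched as follows; `make_grid` is an illustrative helper, not the actual kernel-launch code:

```python
# CUDA limits grid axes 1 and 2 to 65535 blocks each, while axis 0
# allows up to 2**31 - 1. Placing the potentially huge nonzero-block
# count on axis 0 keeps large blocksparse layouts within the limits.
CUDA_MAX_GRID_X = 2**31 - 1
CUDA_MAX_GRID_YZ = 65535

def make_grid(num_nonzero_blocks, batch):
    # before: (batch, num_nonzero_blocks) failed once the block count
    # exceeded 65535; swapping the axes lifts the ceiling to 2**31 - 1.
    grid = (num_nonzero_blocks, batch)
    assert grid[0] <= CUDA_MAX_GRID_X and grid[1] <= CUDA_MAX_GRID_YZ
    return grid
```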

Thank you for your time on reviewing this PR :)
2023-04-10 09:00:12 -07:00
mcskatkat
82ec1a89ea [FRONTEND] code_generator.py TODOs fixed & removed (#1484)
Handled TODOs that were waiting for the circular import issue to be
resolved
2023-04-07 22:05:46 -07:00
Ian O'Connell
bc0b007e4b [FRONTEND] Allow cache manager to be overridden, and tweak apis to easier work with remote caches (#1478)
The changes here come with a few separate bits:

- Allow replacing the cache manager via an ENV variable to make it
pluggable.
- Make the `make_path` API private, since it leaks some internal bits
of the cache and allows file access. Use a get operation instead.
- The `compile` operation produces several small files as part of a single
compile pipeline, which can be inefficient with remote caches. Also, some
operations such as `_triton.get_shared_memory_size` only work when either
everything is cached or nothing is (they segfault otherwise). Grouping
these files as a single entity avoids that.
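The env-var override can be sketched as below; the variable name and the `"module.path:ClassName"` format are assumptions for illustration, and `resolve_cache_manager` is not the PR's actual function:

```python
import importlib
import os

def resolve_cache_manager(default_cls):
    # If the env var holds "module.path:ClassName", import that module
    # and return the named class as the pluggable cache manager;
    # otherwise fall back to the built-in default.
    spec = os.environ.get("TRITON_CACHE_MANAGER")
    if spec is None:
        return default_cls
    module_name, _, cls_name = spec.partition(":")
    return getattr(importlib.import_module(module_name), cls_name)
```

A remote-cache implementation could then be selected without code changes, just by exporting the variable before launching the process.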
2023-04-07 13:38:28 -07:00
Keren Zhou
6743e42eb5 [FRONTEND] Data type specification for math functions (#1485) 2023-04-07 10:26:19 -07:00
Keren Zhou
7f3f58f332 [FRONTEND] Fix broadcast semantics (#1480)
https://github.com/openai/triton/pull/1183

---------

Co-authored-by: Yen-Chen Lin <yenchenlin1994@gmail.com>
2023-04-06 10:40:40 -07:00
Philippe Tillet
8cbf9b40a4 [TESTING] Minor fixes (#1479) 2023-04-06 00:48:33 -07:00
Phil Tillet
4c1d001ae4 [TESTING] Now using numpy instead of pytorch in triton.assert_close
More memory-efficient than PyTorch.
2023-04-04 23:57:12 -07:00
Eta
577cafff0a [BUILD] Add missing subpackages to build (#1475)
The `triton/compiler`, `triton/runtime/driver`, and `triton/third_party`
subpackages were missing from the distribution built with the old
`setup.py` after #1464, causing an immediate error upon importing Triton
with a non-editable installation. This change adds the missing Python
subpackages and moves `triton/third_party` inclusion to `MANIFEST.in`,
where it will automatically be included in wheels due to the existing
`include_package_data` setup flag.
2023-04-04 22:41:08 -07:00
Phil Tillet
0e11f1e167 [TESTING] Added triton.allclose wrapper around torch.testing.allclose.
This adds a convenience layer to test linear algebra kernels and their perf.
2023-04-04 21:53:36 -07:00
Keren Zhou
00a9143bb4 [FRONTEND] Expose Autotuner to users (#1473)
The Autotuner is a handy utility. By allowing external access to the
Autotuner, users can overwrite some functions (e.g., `run`) to
load/store the best configurations, initialize tensors based on
configuration values, and change the benchmarking standard (e.g., based on
bytes instead of time).
2023-04-04 17:12:00 -07:00
Philippe Tillet
053af4e9f8 [FRONTEND] Refactor file hierarchy (#1464)
The purpose of this PR is to remove some circular dependencies and
separate concerns better in the frontend. It's still not perfect --
`triton.compile` still includes a few runtime architecture-specific
components, but it is at least much better than before.

This PR still assumes that AMD only supports empty kernels right now.
Other PRs will follow to make the frontend support multiple devices in
a more modular way.
2023-04-02 12:07:08 -07:00
Keren Zhou
0855cacdd8 [BACKEND] Fix small matmul dot (#1463)
https://github.com/openai/triton/issues/1449

In theory, we might be able to support even 8x8 dot if we also wrap
around `cOff`.
2023-04-02 02:05:05 +00:00
Keren Zhou
859952a0aa [FRONTEND] Include the debug field when computing the kernel hash (#1458)
Co-authored-by: Philippe Tillet <phil@openai.com>
2023-04-01 00:52:51 +00:00
Da Yan
bf158bf01f [FRONTEND] kwargs as autotune key (#1457) 2023-03-31 17:09:14 -07:00
Kern Handa
2c0417da96 [DOCS] fixed typo triton.testing.allclose -> torch.allclose in MatMul tutorial (#1460) 2023-03-31 17:06:46 -07:00
Keren Zhou
cc4aa1ebbc [FRONTEND] Fix if-exp parsing for size-1 tensors (#1455) 2023-03-31 15:05:52 -07:00
Phil Tillet
966e5d955b [TEST] Increase `LATENCY_THRESHOLD_US` 2023-03-31 11:38:18 -07:00
Francisco Massa
c1b057eee9 [FRONTEND] Add option to specify number of compilation threads during Triton compilation (#1450)
On some machines, the amount of available RAM might not be enough to
compile Triton with `2 * num_cpus` parallelism. For example, CircleCI's
`large` instance can't handle Triton compilation as is due to
insufficient memory.

Instead, I propose to take PyTorch's approach where we can define a
[`MAX_JOBS` env
var](0e4ddc2b40/tools/setup_helpers/cmake.py (L366-L368))
that gives the user the possibility to reduce (or increase) the
parallelism during compilation.
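The convention can be sketched as follows; `compile_jobs` is an illustrative helper mirroring the described behavior, not Triton's actual API:

```python
import os

def compile_jobs(default=None):
    # MAX_JOBS, when set, overrides the default compilation
    # parallelism (here 2 * CPU count, matching the text above).
    max_jobs = os.environ.get("MAX_JOBS")
    if max_jobs is not None:
        return int(max_jobs)
    return default if default is not None else 2 * os.cpu_count()
```

On a memory-constrained machine, `MAX_JOBS=4 pip install -e .` would then cap the build at four parallel compile jobs.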

Co-authored-by: Philippe Tillet <phil@openai.com>
2023-03-31 11:34:18 -07:00