Users can pass custom arguments to benchmarks. For example, a user can pass
`dtype`, which will be used to create tensors in a benchmark.
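As an illustration, here is a hypothetical sketch of how such a custom argument could be forwarded through the `args` dict of `triton.testing.Benchmark`; the shapes, providers, and names below are made up for the example and are not part of this change.
```
import torch

import triton
import triton.testing


# Hypothetical sketch: a custom `dtype` argument is forwarded to the benchmark
# function through the `args` dict of triton.testing.Benchmark.
@triton.testing.perf_report(
    triton.testing.Benchmark(
        x_names=['N'],
        x_vals=[2**i for i in range(10, 14)],
        line_arg='provider',
        line_vals=['torch'],
        line_names=['torch'],
        plot_name='vector-add',
        args={'dtype': torch.float16},  # custom benchmark argument
    )
)
def benchmark(N, provider, dtype):
    # tensors are created with the user-supplied dtype
    x = torch.randn(N, device='cuda', dtype=dtype)
    y = torch.randn(N, device='cuda', dtype=dtype)
    return triton.testing.do_bench(lambda: x + y)


benchmark.run(print_data=True)
```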
Co-authored-by: Keren Zhou <kerenzhou@openai.com>
This is the initial code merge of Nvidia Hopper feature support. Please be
aware that the code merge is not finished yet and troubleshooting is
still ongoing. The new hardware features (GMMA, TMA, STMATRIX, etc.)
and automatic warp specialization are experimental for now and turned
off by default. We recommend trying them out once version 3.0 is
released.
The work is contributed by:
ben-zhang-609, bealwang, donproc, qliu93, jsh20, allatit23, LyricZhao,
ivanyinwz, goostavz & yangjunpro
from Nvidia, in cooperation with:
ptillet, Jokeren, ThomasRaoux & zahimoud
from OpenAI.
Co-authored-by: Goostav Zhu <gzhu@nvidia.com>
Right now, `do_bench` estimates the runtime of the kernel and then uses
that estimate to run it enough times to take approximately 100ms in
total (by default).
However, when actually running the kernel, it also issues a `zero_()`
call to clear the L2 cache. For small kernels, the `zero_()` kernel can
be slower than the actual kernel we're benchmarking, causing us to badly
overshoot our target latency.
This has the perverse effect that very small invocations may take much
longer to autotune than larger ones. By way of concrete example, before
this PR, I tested the wall-clock time for the first call to
`triton.ops.matmul(A, B.T)` in a process, on two `(N, N)` matrices in
float32. I found that a 4k x 4k x 4k matmul warmed up in about 2.5s, but
a 64 x 64 x 64 matmul took over 5 seconds!
This PR fixes this issue by including the same call to `zero_()` inside
our measurement loop.
With this change, I find that the 4kx4kx4k and 64x64x64 matmuls warm up
in very similar amounts of time, both around 2.5s.
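For illustration, a minimal sketch of the idea (this is not Triton's actual `do_bench` code; names are made up): the run-time estimate that determines the number of timed repetitions now accounts for the cache-clearing `zero_()` call as well.
```
import torch


def estimate_n_repeat(fn, cache, rep_ms=100):
    # Illustrative sketch only: time a few iterations *including* the
    # L2-flushing cache.zero_() call, so that the repetition count derived
    # from the estimate stays within the ~100ms budget even when zero_()
    # dominates the per-iteration cost.
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    start.record()
    for _ in range(5):
        cache.zero_()  # same cache-clearing call used in the benchmark loop
        fn()
    end.record()
    torch.cuda.synchronize()
    estimate_ms = start.elapsed_time(end) / 5
    return max(1, int(rep_ms / estimate_ms))
```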
I noticed this because we tend to run tests on very small models in CI
just to test code paths without regard to numerics, and found those
tests were perversely taking longer than "real" models in some cases. It
seems plausible that a better solution would be a pragma to disable
autotuning entirely for such tests, but I think this change is a clear
improvement as-is.
The purpose of this PR is to remove some circular dependencies and
separate concerns better in the frontend. It's still not perfect --
`triton.compile` still includes a few runtime architecture-specific
components, but it is at least much better than before.
This PR still assumes that AMD only supports empty kernels right now.
Other PRs will follow to make the frontend support multiple devices in
a more modular way.
`_triton.runtime.num_sm`, `_triton.runtime.clock_rate`, and
`_triton.runtime.cc` no longer seem to exist; use the corresponding
methods from `get_max_tensorcore_tflops` in the same file instead.
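For reference, one way to query the same properties without the removed attributes is sketched below; it is an assumption about a suitable replacement (torch for compute capability and SM count, nvidia-smi for the max SM clock), not necessarily identical to what `get_max_tensorcore_tflops` does internally.
```
import subprocess

import torch

# Assumed replacement queries (illustrative only).
props = torch.cuda.get_device_properties(0)
cc = props.major * 10 + props.minor   # was _triton.runtime.cc
num_sm = props.multi_processor_count  # was _triton.runtime.num_sm
# max SM clock in MHz, read from nvidia-smi
clock_rate_mhz = int(subprocess.check_output(
    ['nvidia-smi', '-i', '0', '--query-gpu=clocks.max.sm',
     '--format=csv,noheader,nounits']).decode().strip())
```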
For stupid reasons, ops on int8 are 3 times slower than on int, and for
another set of stupid reasons we are not using cudaMemset for `zero_`,
so using an `int8` buffer in `do_bench` makes it slow.
Co-authored-by: Philippe Tillet <phil@openai.com>
This PR completely rewrites the runtime of Triton to be leaner and to more
clearly separate the compilation step from the just-in-time caching logic.
This should substantially reduce launch overhead.
* dds layout now internally re-uses the dsd code path for increased code reuse
* at_mask and kp_mask related things are now dropped from the softmax API. I couldn't think of any case where it was needed beyond is_causal. And if there is any, we should probably find a way to get it implemented statically so that users don't have to materialize masks.
* fixed bug in blocksparse matmul that caused troubles when layout had a full row/col of zeros
* blocksparse softmax now no longer modifies any data in-place
* blocksparse softmax now takes an is_dense argument that provides better performance. Passing is_dense=True, is_causal=True is the best way to achieve triangular attention.
* unit tests now test backward pass
Run:
```
isort ./python
autopep8 -i --ignore E501,E701,E731 $(find ./python/ -name '*.py')
```
with an `.isort.cfg` and then clean up a few warts. This PR should be a no-op; the idea is that this is all boring whitespace changes, and any config file changes will be in a different change to make it easier to review.
Since numpy supports unsigned integers, and pytorch doesn't, this will make it easier to test unsigned integer support.
This adds an explicit requirement for numpy in tests, but we already required scipy, so it was already an implicit dependency.
Significantly improves the performance of `triton.ops.matmul` in memory-bound settings via the use of many more block configs coupled with a performance model to drive the auto-tuning process.
- Promote 16-bit floating-point `/` and `%` to 32-bit; we have to anyway.
- Do not force result of integer binary operations to be the LHS type. There used to be a bug in pytorch that did this, which Triton matched, but that bug is fixed now.
- When testing signed integer operations, use random numbers from the full range of the type.
- Add an optional `seed` argument to `triton.testing.random` so binary operations are not tested with both sides equal when the LHS and RHS have the same type.
- Fix a bad `CompilationError` invocation.
- Fix a warning suppression that causes tests to fail if you run them with `-W error` on python 3.8.
Improved codegen for Ampere GPUs.
* Make the layout pass recognize the multistage pipelined pattern.
* Now the pipeline pass can automate the multistage pipelining transformation.
* Remove extra barriers (from the prefetch pass & WAR) on Ampere.
* Update the code generator (generator.cc) to make Triton generate n-buffered shared memory loads/stores.
* Load libcuda.so.1 if libcuda.so is not there; error out if neither is
present.
* Support for multiple grad_to_none tensors in triton.testing.do_bench (see
the sketch after this list)
* Benchmark dataframe printed along with name
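A hedged usage example for the grad_to_none change (tensor names and sizes are made up): every tensor in the list has its .grad reset between timed runs, so gradient accumulation does not distort the measurement.
```
import torch

import triton.testing

# Illustrative only: both weights' .grad fields are reset to None between
# timed runs of the forward+backward pass.
w1 = torch.randn(512, 512, device='cuda', requires_grad=True)
w2 = torch.randn(512, 512, device='cuda', requires_grad=True)
x = torch.randn(64, 512, device='cuda')


def fwd_bwd():
    (x @ w1 @ w2).sum().backward()


ms = triton.testing.do_bench(fwd_bwd, grad_to_none=[w1, w2])
```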
This PR implements a major overhaul of the frontend for Triton, and replaces Triton-C with a pure Python API in which kernels are defined as @triton.jit-decorated functions. The documentation and tutorials have also been updated to accommodate these changes.
See the documentation for more information on the new API.
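For reference, a minimal kernel written in the style of the new @triton.jit API (this sketch uses the modern form of the API; the kernel and variable names are illustrative).
```
import torch

import triton
import triton.language as tl


@triton.jit
def add_kernel(x_ptr, y_ptr, out_ptr, n_elements, BLOCK_SIZE: tl.constexpr):
    # each program instance handles one BLOCK_SIZE-long chunk of the input
    pid = tl.program_id(axis=0)
    offsets = pid * BLOCK_SIZE + tl.arange(0, BLOCK_SIZE)
    mask = offsets < n_elements
    x = tl.load(x_ptr + offsets, mask=mask)
    y = tl.load(y_ptr + offsets, mask=mask)
    tl.store(out_ptr + offsets, x + y, mask=mask)


x = torch.randn(4096, device='cuda')
y = torch.randn(4096, device='cuda')
out = torch.empty_like(x)
grid = lambda meta: (triton.cdiv(x.numel(), meta['BLOCK_SIZE']),)
add_kernel[grid](x, y, out, x.numel(), BLOCK_SIZE=1024)
```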
Recently there have been more and more reports about installation issues:
- Installing Triton before upgrading pytorch can create some issues because Triton uses some torch headers
- llvm-10-dev not available on some platforms; llvm-11-dev not available on e.g. Ubuntu
- Absence of nightly builds
This PR should fix all these issues. Some CMake tricks are used to download and install LLVM at build time. The Triton Python bindings were modified to remove the dependence on pytorch ops. A midnight CI job was added to generate binary wheels for all Triton versions and upload them to PyPI's new triton-nightly project.
This PR will also make it very easy to use LLVM forks in the future for whatever needs we have.