Users can pass custom arguments to benchmarks. For example, a user can pass
`dtype`, which will be used to create tensors in a benchmark.
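As an illustration, here is a hypothetical sketch of how such a custom argument could be forwarded through the `args` dict of `triton.testing.Benchmark`; the shapes, providers, and names below are made up for the example and are not part of this change.
```
import torch

import triton
import triton.testing


# Hypothetical sketch: a custom `dtype` argument is forwarded to the benchmark
# function through the `args` dict of triton.testing.Benchmark.
@triton.testing.perf_report(
    triton.testing.Benchmark(
        x_names=['N'],
        x_vals=[2**i for i in range(10, 14)],
        line_arg='provider',
        line_vals=['torch'],
        line_names=['torch'],
        plot_name='vector-add',
        args={'dtype': torch.float16},  # custom benchmark argument
    )
)
def benchmark(N, provider, dtype):
    # tensors are created with the user-supplied dtype
    x = torch.randn(N, device='cuda', dtype=dtype)
    y = torch.randn(N, device='cuda', dtype=dtype)
    return triton.testing.do_bench(lambda: x + y)


benchmark.run(print_data=True)
```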
Co-authored-by: Keren Zhou <kerenzhou@openai.com>
This is the initial code merge of Nvidia Hopper feature support. Please be
aware that the code merge is not finished yet and troubleshooting is
still ongoing. The new hardware features (GMMA, TMA, STMATRIX, etc.)
and automatic warp specialization are experimental for now and turned
off by default. We recommend trying them out once version 3.0 is
released.
The work is contributed by:
ben-zhang-609, bealwang, donproc, qliu93, jsh20, allatit23, LyricZhao,
ivanyinwz, goostavz & yangjunpro
from Nvidia, in cooperation with:
ptillet, Jokeren, ThomasRaoux & zahimoud
from OpenAI.
Co-authored-by: Goostav Zhu <gzhu@nvidia.com>
Right now, `do_bench` estimates the runtime of the kernel and then uses
that estimate to run it enough times to take approximately 100ms in
total (by default).
However, when actually running the kernel, it also issues a `zero_()`
call to clear the L2 cache. For small kernels, the `zero_()` kernel can
be slower than the actual kernel we're benchmarking, causing us to badly
overshoot our target latency.
This has the perverse effect that very small invocations may take much
longer to autotune than larger ones. By way of concrete example, before
this PR, I tested the wall-clock time for the first call to
`triton.ops.matmul(A, B.T)` in a process, on two `(N, N)` matrices in
float32. I found that a 4k x 4k x 4k matmul warmed up in about 2.5s, but
a 64 x 64 x 64 matmul took over 5 seconds!
This PR fixes this issue by including the same call to `zero_()` inside
our measurement loop.
With this change, I find that the 4kx4kx4k and 64x64x64 matmuls warm up
in very similar amounts of time, both around 2.5s.
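For illustration, a minimal sketch of the idea (this is not Triton's actual `do_bench` code; names are made up): the run-time estimate that determines the number of timed repetitions now accounts for the cache-clearing `zero_()` call as well.
```
import torch


def estimate_n_repeat(fn, cache, rep_ms=100):
    # Illustrative sketch only: time a few iterations *including* the
    # L2-flushing cache.zero_() call, so that the repetition count derived
    # from the estimate stays within the ~100ms budget even when zero_()
    # dominates the per-iteration cost.
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    start.record()
    for _ in range(5):
        cache.zero_()  # same cache-clearing call used in the benchmark loop
        fn()
    end.record()
    torch.cuda.synchronize()
    estimate_ms = start.elapsed_time(end) / 5
    return max(1, int(rep_ms / estimate_ms))
```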
I noticed this because we tend to run tests on very small models in CI
just to test code paths without regard to numerics, and found those
tests were perversely taking longer than "real" models in some cases. It
seems plausible that a better solution would be a pragma to disable
autotuning entirely for such tests, but I think this change is a clear
improvement as-is.
The purpose of this PR is to remove some circular dependencies and
separate concerns better in the frontend. It's still not perfect --
`triton.compile` still includes a few runtime architecture-specific
components, but it is at least much better than before.
This PR still assumes that AMD only supports empty kernels right now.
Other PRs will follow to make the frontend support multiple devices in
a more modular way.
`_triton.runtime.num_sm`, `_triton.runtime.clock_rate`, and
`_triton.runtime.cc` no longer seem to exist; use the corresponding
methods from `get_max_tensorcore_tflops` in the same file instead.
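For reference, one way to query the same properties without the removed attributes is sketched below; it is an assumption about a suitable replacement (torch for compute capability and SM count, nvidia-smi for the max SM clock), not necessarily identical to what `get_max_tensorcore_tflops` does internally.
```
import subprocess

import torch

# Assumed replacement queries (illustrative only).
props = torch.cuda.get_device_properties(0)
cc = props.major * 10 + props.minor   # was _triton.runtime.cc
num_sm = props.multi_processor_count  # was _triton.runtime.num_sm
# max SM clock in MHz, read from nvidia-smi
clock_rate_mhz = int(subprocess.check_output(
    ['nvidia-smi', '-i', '0', '--query-gpu=clocks.max.sm',
     '--format=csv,noheader,nounits']).decode().strip())
```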
For stupid reasons, ops on int8 are 3 times slower than on int, and for
another set of stupid reasons we are not using cudaMemset for `zero_`,
so using an `int8` buffer in `do_bench` makes it slow.
Co-authored-by: Philippe Tillet <phil@openai.com>
This PR completely rewrites the runtime of Triton to be leaner and to more
clearly separate the compilation step from the just-in-time caching logic.
This should substantially reduce launch overhead.
* dds layout now internally re-uses the dsd code path for increased code reuse
* at_mask and kp_mask related things are now dropped from the softmax API. I couldn't think of any case where it was needed beyond is_causal. And if there is any, we should probably find a way to get it implemented statically so that users don't have to materialize masks.
* fixed bug in blocksparse matmul that caused troubles when layout had a full row/col of zeros
* blocksparse softmax now no longer modifies any data in-place
* blocksparse softmax now takes an is_dense argument that provides better performance. Passing is_dense=True, is_causal=True is the best way to achieve triangular attention.
* unit tests now test backward pass
Run:
```
isort ./python
autopep8 -i --ignore E501,E701,E731 $(find ./python/ -name '*.py')
```
with an `.isort.cfg` and then clean up a few warts. This PR should be a no-op; the idea is that this is all boring whitespace changes, and any config file changes will be in a different change to make it easier to review.
Since numpy supports unsigned integers, and pytorch doesn't, this will make it easier to test unsigned integer support.
This adds an explicit requirement for numpy in tests, but we already required scipy, so it was already an implicit dependency.
Significantly improves the performance of `triton.ops.matmul` in memory-bound settings via the use of many more block configs coupled with a performance model to drive the auto-tuning process.
- Promote 16-bit floating-point `/` and `%` to 32-bit; we have to anyway.
- Do not force result of integer binary operations to be the LHS type. There used to be a bug in pytorch that did this, which Triton matched, but that bug is fixed now.
- When testing signed integer operations, use random numbers from the full range of the type.
- Add an optional `seed` argument to `triton.testing.random` so binary operations are not tested with both sides equal when the LHS and RHS have the same type.
- Fix a bad `CompilationError` invocation.
- Fix a warning suppression that causes tests to fail if you run them with `-W error` on python 3.8.
Improved codegen for Ampere GPUs.
* Make the layout pass recognize the multistage pipelined pattern.
* Now the pipeline pass can automate the multistage pipelining transformation.
* Remove extra barriers (from the prefetch pass & WAR) on Ampere.
* Update the code generator (generator.cc) to make Triton generate n-buffered shared memory loads/stores.
* Load libcuda.so.1 if libcuda.so is not there; error out if neither is
present.
* Support for multiple grad_to_none tensors in triton.testing.do_bench (see
the sketch after this list)
* Benchmark dataframe printed along with name
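A hedged usage example for the grad_to_none change (tensor names and sizes are made up): every tensor in the list has its .grad reset between timed runs, so gradient accumulation does not distort the measurement.
```
import torch

import triton.testing

# Illustrative only: both weights' .grad fields are reset to None between
# timed runs of the forward+backward pass.
w1 = torch.randn(512, 512, device='cuda', requires_grad=True)
w2 = torch.randn(512, 512, device='cuda', requires_grad=True)
x = torch.randn(64, 512, device='cuda')


def fwd_bwd():
    (x @ w1 @ w2).sum().backward()


ms = triton.testing.do_bench(fwd_bwd, grad_to_none=[w1, w2])
```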
This PR implements a major overhaul of the frontend for Triton, and replaces Triton-C with a pure Python API in which kernels are defined as @triton.jit-decorated functions. The documentation and tutorials have also been updated to accommodate these changes.
See the documentation for more information on the new API.
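For reference, a minimal kernel written in the style of the new @triton.jit API (this sketch uses the modern form of the API; the kernel and variable names are illustrative).
```
import torch

import triton
import triton.language as tl


@triton.jit
def add_kernel(x_ptr, y_ptr, out_ptr, n_elements, BLOCK_SIZE: tl.constexpr):
    # each program instance handles one BLOCK_SIZE-long chunk of the input
    pid = tl.program_id(axis=0)
    offsets = pid * BLOCK_SIZE + tl.arange(0, BLOCK_SIZE)
    mask = offsets < n_elements
    x = tl.load(x_ptr + offsets, mask=mask)
    y = tl.load(y_ptr + offsets, mask=mask)
    tl.store(out_ptr + offsets, x + y, mask=mask)


x = torch.randn(4096, device='cuda')
y = torch.randn(4096, device='cuda')
out = torch.empty_like(x)
grid = lambda meta: (triton.cdiv(x.numel(), meta['BLOCK_SIZE']),)
add_kernel[grid](x, y, out, x.numel(), BLOCK_SIZE=1024)
```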
Recently there have been more and more reports about installation issues:
- Installing Triton before upgrading pytorch can create some issues because Triton uses some torch headers
- llvm-10-dev not available on some platforms; llvm-11-dev not available on e.g. Ubuntu
- Absence of nightly builds
This PR should fix all these issues. Some CMake tricks are used to download and install LLVM at build time. The Triton Python bindings were modified to remove the dependence on pytorch ops. A midnight CI job was added to generate binary wheels for all Triton versions and upload them to PyPI's new triton-nightly project.
This PR will also make it very easy to use LLVM forks in the future for whatever needs we have.