github/ROCm - ROCm - AtHeartEngineering

mirror of https://github.com/ROCm/ROCm.git synced 2026-02-21 03:00:39 -05:00

Author	SHA1	Message	Date
Weixing Zhang	34b89a1173	[OPTIMIZER] Tweak warpsPerCTA based on the shape of MMA output (#2581 ) In current implementation, warpsPerCTA is always set to [numWarps, 1] for 2 tt.dot fusion scenario. But, it is not optimal for cases such that tt.dot doesn't have enough parallelism on row dimension but on column dimension.	2023-11-03 16:40:03 -04:00
Justin Lebar	df08301e76	Reformat Python code with yapf. (#2589 ) I've add an option to yapf to do what we want for long lines, see https://github.com/google/yapf/pull/1177. We can now have a real Python formatter, yay! To make this PR, I ran my modified yapf over the repository, then looked over the full diff. Where yapf was mangling the param list of long function decls/calls (mostly kernels), I manually added `#` to put linebreaks where we want. I fixed up other formatting too -- mostly adding or removing a trailing comma from lists. Overall, trailing `#` was sufficient to get formatting similar to our current code. I didn't have to disable yapf anywhere. --------- Co-authored-by: Phil Tillet <phil@openai.com>	2023-11-02 20:44:17 -07:00
runseny	4c816c2f59	[OPS] enable flash_attention_v2 TMA (#2544 )	2023-10-25 23:31:17 -07:00
Philippe Tillet	3f2b7263e8	Revert "[OPTIMIZER] Tweak warpsPerCTA based on the shape of MMA output (#2485 )" (#2541 ) Reverts openai/triton#2525	2023-10-24 10:23:19 -07:00
Philippe Tillet	8f467f1ea9	[OPTIMIZER] Tweak warpsPerCTA based on the shape of MMA output (#2485 ) (#2525 ) Reverts openai/triton#2497	2023-10-23 21:50:58 -07:00
runseny	dc9e3063d7	[HOPPER] Move to tl.make_block_ptr in flash_attention backward scripts (#2395 )	2023-10-20 11:06:48 +08:00
Philippe Tillet	8db4fac3b0	Revert "[OPTIMIZER] Tweak warpsPerCTA based on the shape of MMA output (#2485 )" (#2497 ) Reverts openai/triton#2485	2023-10-13 23:32:59 -07:00
Weixing Zhang	76858bd917	[OPTIMIZER] Tweak warpsPerCTA based on the shape of MMA output (#2485 ) In current implementation, warpsPerCTA is always set to [numWarps, 1] for 2 tt.dot fusion scenario. But, it is not optimal for cases such that tt.dot doesn't have enough parallelism on row dimension but on column dimension.	2023-10-12 22:25:42 -07:00
Philippe Tillet	eea0718445	[TESTING] better cudagraph-based benchmarking (#2394 )	2023-09-25 21:41:26 -07:00
Zahi Moudallal	293b7fd592	[TESTING] cleanup (#2293 ) Co-authored-by: Philippe Tillet <phil@openai.com>	2023-09-22 05:37:14 +00:00
Thomas	387fc890a5	[FRONTEND][BACKEND] Add a performance test for reductions (#2125 ) Also stop promoting integer types as it doesn't give better perf this will allow more vectorization oportuinity in the future.	2023-08-17 16:30:33 -07:00
Phil Tillet	521cfae44d	[CI] disabled float32 perf regression tests	2023-08-07 12:43:16 -07:00
goostavz	f1512bded1	Initial code merge of Hopper support (#2036 ) The initial code merge of Nvidia Hopper features support. Please be aware that the code merge is not finished yet and the trouble-shooting is still ongoing. The new hardware features (GMMA, TMA, STMATRIX etc.) and automatic warp-specialization are experimental for now and turned off by default. It is recommended for a trial when version 3.0 is released. The work is contributed by: ben-zhang-609, bealwang, donproc, qliu93, jsh20, allatit23, LyricZhao, ivanyinwz, goostavz & yangjunpro from Nvidia, in cooperation with: ptillet, Jokeren, ThomasRaoux & zahimoud from OpenAI. Co-authored-by: Goostav Zhu <gzhu@nvidia.com>	2023-08-07 09:53:04 +08:00
Izzy Putterman	de6f053c0f	[TRITON][OPS] add Flash Attention v2 to Ops (#1970 ) I also dropped the do_scaled as it is no longer needed (no scaling done to the do in v2). --------- Co-authored-by: Philippe Tillet <phil@openai.com>	2023-07-23 14:07:15 -07:00
Philippe Tillet	bf5acf46e2	[OPS] improved pointer arithmetic in attention (#1926 ) this provides an additional 3-4% speed-up in non-causal attention, which now tops at 155TFLOPS	2023-07-11 12:04:00 -07:00
Philippe Tillet	8fe5524c75	[BACKEND] no longer uses shared mem or barriers for single-warp reductions (#1915 ) 0-bytes shared mem buffers don't materialize empty allocation buffers; this could lead to unnecessary barriers. note: reduceop code has become quite messy and will require some cleanup	2023-07-11 00:23:26 -07:00
Philippe Tillet	7e3ebbc4c8	[TESTING] now using cuda graphs for perf regression tests (#1925 )	2023-07-10 22:49:25 -07:00
Izzy Putterman	d39d78fa08	[OPS] Add more perf-tests, new features to FA (#1849 ) Adding new tests across the board for float32, bfloat16, non-powers-of-2 shapes (to test masks), and tests on sequence parallel for atomics. This also adds the sequence parallel features from https://github.com/HazyResearch/flash-attention/blob/main/flash_attn/flash_attn_triton.py. I am not sure about the best way to grab the baseline benchmarking numbers. I have access to V100s and A100s, but I saw on the tests it mentions " # A100 in the CI server is slow-ish for some reason. # On some other servers, we are getting about 90% peak for 8kx8x8k float16". Current plan is to run CI here and use those numbers for baseline, then match against my GPUs as a sanity check. --------- Co-authored-by: Phil Tillet <phil@openai.com>	2023-07-10 18:52:59 -07:00
Wang Weihan	b27a91a113	[FRONTEND] Enable triton to support register thirdparty backend at runtime (#1643 ) This PR intends to provide a mechanism to support a third-party backend at runtime to generate the backend-specific code. The mechanism provided a common class to abstract the third-party backend logic and two essential functions to register and get the third-party backend at runtime. - `BaseBackend`: A common class to abstract the third-party backend logic - `register_backend`: Register a third-party backend with a given device type - `get_backend`: Get the third-party backend with a given device type Generally, a third-party backend must inherit from `BaseBackend` and implement all the member functions according to the backend characteristics. As long as the backend implementation is ready, the third-party backend can invoke `register_backend` to register it under a given device. During the kernel compilation and execution, the mechanism will get the registered backend to generate the kernel and launcher code for a given device. This PR added a dummy backend to simulate a third-party backend and demonstrate the usage. - [test_device_backend.py](https://github.com/openai/triton/pull/1643/files#diff-bbe4d50624f2d11bf17c878a1ed4d422918c124c182cf9357b993240c385bea1): To define a third-party backend and register the backend - [ExtensionBackend](https://github.com/openai/triton/pull/1643/files#diff-bbe4d50624f2d11bf17c878a1ed4d422918c124c182cf9357b993240c385bea1R123): Inherit from the `BaseBackend` and implement some specific logic like [filter out some compile stages](https://github.com/openai/triton/pull/1643/files#diff-bbe4d50624f2d11bf17c878a1ed4d422918c124c182cf9357b993240c385bea1R129-R135) - [Register the `ExtensionBackend` for `CPU`](https://github.com/openai/triton/pull/1643/files#diff-bbe4d50624f2d11bf17c878a1ed4d422918c124c182cf9357b993240c385bea1R279) - [extension_backend.c](https://github.com/openai/triton/pull/1643/files#diff-169c1d08b3a0a7b343cfa3258fbc32b47e0f6c46305a112652fa1bdaaec89d29): To provide the utility function to load kernel binary and get the backend properties.	2023-06-09 09:09:59 -07:00
Philippe Tillet	5b9119117b	[CI] No longer install triton in editable mode to run tests (#1476 )	2023-04-12 17:55:44 -07:00
Philippe Tillet	02e3c18f04	[TESTING] clean up `testing.do_bench` (#1513 )	2023-04-11 20:05:58 -07:00
Philippe Tillet	46672772b4	[FORMAT] autopep8 now uses max-line-length=88 (#1410 )	2023-03-25 15:46:50 -07:00
Xuehai Pan	5b36cb48ad	[CI][TEST] update `pre-commit` hooks and use `pre-commit` for style tests in CI (#1409 ) Ref issue: - #1408 Changes: - Add `.editorconfig` - Add `pre-commit-hooks`: ```yaml - repo: https://github.com/pre-commit/pre-commit-hooks rev: v4.4.0 hooks: - id: check-symlinks - id: destroyed-symlinks - id: trailing-whitespace - id: end-of-file-fixer - id: check-yaml - id: check-toml - id: check-ast - id: check-added-large-files - id: check-merge-conflict - id: check-executables-have-shebangs - id: check-shebang-scripts-are-executable - id: detect-private-key - id: debug-statements ``` - Add `flake8` to `pre-commit` config and add `.flake8` file - Use `pre-commit` for style tests in CI - Run `pre-commit` and fix existing violations: - fix trailing spaces - fix end-of-files - fix mod file mode with `chmod -x` - run `autopep8` on existing code - fix `flake8` violations	2023-03-25 14:52:16 -07:00
Philippe Tillet	fc7c0b0e43	[FRONTEND] Removed torch dependency and cleaned up testing (#1394 ) `assert triton.testing.allclose` -> `torch.testing.assert_allclose` `triton.testing.assert_almost_equal` -> `torch.testing.assert_allclose`	2023-03-23 22:37:21 -07:00
Phil Tillet	e650d3708b	[FRONTEND] `dot` now uses `tl.float32` by default for `out_dtype`.	2023-03-19 21:58:46 -07:00
Philippe Tillet	b4decbe155	[BACKEND] Now using `call_once` to initialize LLVM target (#1373 )	2023-03-19 21:23:39 -07:00
rsanthanam-amd	c575911a01	[FRONTEND] Change libdevice to mathlib and fix abs (#1361 ) Co-authored-by: Phil Tillet <phil@openai.com>	2023-03-19 01:34:16 -07:00
Philippe Tillet	6a8634e2a7	[BACKEND] No longer running LLVM-IR optimizations after codegen. (#1338 ) This triggered some outrageous bugs. See #1337.	2023-03-13 22:50:15 -07:00
Philippe Tillet	ba0198326e	[TESTS] make performance regression testing less strict (#1231 )	2023-02-21 22:22:02 -08:00
Philippe Tillet	174f121c1c	[TESTS] Added attention regression tests (#1227 )	2023-02-21 20:22:36 -08:00
Philippe Tillet	307dde9cb5	[CI] revived regression tests (#1225 )	2023-02-21 16:33:03 -08:00
Natalia Gimelshein	0d7e753227	[TESTING] use torch.int for autotuning cache (#840 ) For stupid reasons, ops on int8 are 3 times slower than on int, and for another set of stupid reasons we are not using cudaMemset for `zero_`, so using `int8` buffer in `do_bench` makes it slow. Co-authored-by: Philippe Tillet <phil@openai.com>	2022-11-04 18:05:16 -07:00
Shintaro Iwasaki	c668d6596e	[DOCS] Fix spelling (#664 ) This PR applies minor spelling fix in comments and string literals to `master`. It shouldn't hurt anything.	2022-09-16 12:26:40 -07:00
Philippe Tillet	9f08ecd684	[FRONTEND] Semantic analysis refactor (#491 ) Moved dispatch.cc to semantic.py (@ptillet) Integer signedness analysis was moved from C++ to python (@daadaada) Cleaner frontend types (@daadaada) Moved SSA construction to a separate object (@ptillet) Co-authored-by: Yan Da <dyanab@connect.ust.hk>	2022-04-06 16:13:53 -07:00
Philippe Tillet	76a9ee50a8	Revert "[FRONTEND] Semantic analysis refactor (#473 )" (#483 ) This reverts commit `539961072c`.	2022-03-24 17:16:50 -07:00
daadaada	539961072c	[FRONTEND] Semantic analysis refactor (#473 ) Moved dispatch.cc to semantic.py Integer signedness now moved from C++ to python Cleaner frontend type Co-authored-by: Phil Tillet <phil@openai.com>	2022-03-16 21:25:30 -07:00
daadaada	a9dfdcaaa9	[FRONTEND] Make the performance model work for int8, tf32, and fp32 (#456 )	2022-02-11 22:34:42 -08:00
Madeleine Thompson	efdabe6073	[STYLE] check python with flake8 (#424 ) I've been using this locally to find errors without running tests, and now that we're using autopep8, it passes with minimal suppressions. This is also what turned up the issues with the tutorials, which were fixed in #422.	2022-01-07 15:28:36 -08:00
Madeleine Thompson	a70acfec77	[STYLE] add isort and autopep8 config files and check on CI (#423 ) Also a fix a few more style issues from the "aggressive" mode of autopep8.	2022-01-07 13:11:34 -08:00
Madeleine Thompson	9801aa7b56	[DOCS] fix tutorials for v2.0 (#422 ) - Fix meta-parameter usage on tutorials. - Install tutorial dependencies on CI. - Switch from `requirements-test.txt` to `extras_require` for test dependencies, and also use it for tutorial dependencies. - Make some performance tests deterministic.	2022-01-07 12:34:38 -08:00
Madeleine Thompson	8bf551ae7a	[STYLE] run autopep8 and isort (#421 ) Run: ``` isort ./python autopep8 -i --ignore E501,E701,E731 $(find ./python/ -name '*.py') ``` with an `.isort.cfg` and then clean up a few warts. This PR should be a no-op; the idea is that this is all boring whitespace changes, and any config file changes will be in a different change to make it easier to review.	2022-01-06 14:34:17 -08:00
daadaada	39d4bfed83	[OPS] Add performance model for gemm/gemv (#397 ) Significantly improves the performance of `triton.ops.matmul` in memory-bound settings via the use of many more block configs coupled with a performance model to drive the auto-tuning process.	2021-12-21 09:56:10 -08:00
daadaada	1296eb877b	[RUNTIME] Config hook v2.0 (#373 ) * Add pre_hook to triton.Config * Use argument names in triton.heuristics * Update base perf * Remove meta from heuristics	2021-11-21 11:20:59 -08:00
Philippe Tillet	2acaa4d0dd	[LANG] Added support for constexpr (#361 )	2021-10-30 00:32:58 -07:00
Philippe Tillet	a12827848d	[FRONTEND] Now using exist_ok=True when creating cache directories (#288 )	2021-09-18 23:44:21 -07:00
Philippe Tillet	313d6488f6	[CODEGEN] Fixed over-aggressive division handling in alignment pass (#280 )	2021-09-15 00:40:17 -07:00
Philippe Tillet	da5063d898	[TEST] Added performance regression tests (#283 )	2021-09-14 01:46:32 -07:00

47 Commits