github/ROCm - ROCm - AtHeartEngineering

mirror of https://github.com/ROCm/ROCm.git synced 2026-04-05 03:01:17 -04:00

Author	SHA1	Message	Date
Keren Zhou	adc4d25276	[BACKEND] A general interface for initializing destination operands in load/store operations (#1427 )	2023-03-27 22:13:01 -07:00
Chenggang Zhao	72b071253e	[FRONTEND] Support block pointer semantics (#1392 ) This PR introduces a new semantics: block pointer, which makes users easier & faster to load a block from a parent tensor. Below is a detailed API change by an example: ``` # Make a block pointer, which points to a block in the parent shape # `base`: the parent tensor # `shape`: the shape of the parent tensor # `strides`: the strides of the parent tensor # `offsets`: the offsets of the block in the parent tensor # `order`: the order of the data arrangement in memory # Below is an example loading a 2D column-major matrix block_ptr = tl.make_block_ptr(base=ptr, shape=(M, N), strides=(stride_m, stride_n), offsets=(0, 0), block_shape=(BLOCK_M, BLOCK_N), order=(1, 0)) # Advance the offsets; note that the striding information is already saved in `block_ptr` # `base`: the block pointer to be advanced # `offsets`: the offsets for each dimension block_ptr = tl.advance(base=block_ptr, offsets=(BLOCK_M, -BLOCK_N)) block_ptr = tl.advance(base=block_ptr, offsets=(-BLOCK_M, BLOCK_N)) # Load from a block pointer, the output type is the dereferenced type of `block_ptr`, e.g. ptr<tensor<32x32xf32>> -> tensor<32x32xf32> # `ptr`: the block pointer to be loaded # `boundary_check`: a tuple of dimensions to check the boundary # `padding`: padding strategy for elements out of bound val = tl.load(ptr=block_ptr, boundary_check=(0, 1), padding="zero") # Store by a block pointer, in which the pointer and the value tensor should have the same shape # `ptr`: the block pointer to be stored # `boundary_check`: a tuple of dimensions to check the boundary (no-write if out of bound) tl.store(ptr=block_ptr, value=val, boundary_check=(0, 1)) ``` --------- Co-authored-by: Philippe Tillet <phil@openai.com>	2023-03-27 16:46:49 -07:00
Xuehai Pan	5b36cb48ad	[CI][TEST] update `pre-commit` hooks and use `pre-commit` for style tests in CI (#1409 ) Ref issue: - #1408 Changes: - Add `.editorconfig` - Add `pre-commit-hooks`: ```yaml - repo: https://github.com/pre-commit/pre-commit-hooks rev: v4.4.0 hooks: - id: check-symlinks - id: destroyed-symlinks - id: trailing-whitespace - id: end-of-file-fixer - id: check-yaml - id: check-toml - id: check-ast - id: check-added-large-files - id: check-merge-conflict - id: check-executables-have-shebangs - id: check-shebang-scripts-are-executable - id: detect-private-key - id: debug-statements ``` - Add `flake8` to `pre-commit` config and add `.flake8` file - Use `pre-commit` for style tests in CI - Run `pre-commit` and fix existing violations: - fix trailing spaces - fix end-of-files - fix mod file mode with `chmod -x` - run `autopep8` on existing code - fix `flake8` violations	2023-03-25 14:52:16 -07:00
Michael Melesse	a9c87245b4	[ROCM] Enable ROCM Backend #1 : Empty Kernel (#1312 ) This PR is a first in a series of PRs to import the changes that we have made to enable ROCM on [our fork](https://github.com/ROCmSoftwarePlatform/triton) of triton. The PR contains the major changes to the python frontend and enough changes to the c++ backend to allow compilation and running of the empty kernel. We use the ROCM ci added a few weeks ago to verify things. --------- Co-authored-by: Ronan Keryell <ronan@keryell.fr>	2023-03-24 17:18:27 -07:00
Philippe Tillet	c34ceca741	[BACKEND] Remove DotOpHelpers (i.e., decouple ConvertLayoutOpToLLVM and DotOpToLLVM) (#1383 ) One long-standing issue in the backend has been the apparent complexity of the tensor core codegen. This complexity mostly stems from the existence of the DotOpHelpers` utilities, which have become over time a catch-all for all things related to MmaEncoding and DotOperandEncoding. The purpose of this PR is to decouple what should be decoupled, as a first step towards cleaning our tensor core codegen. Other, more more local PRs will follow.	2023-03-21 15:24:28 -07:00
Rohit Santhanam	a3f18dba5b	Merge remote-tracking branch 'upstream/main' into triton-mlir-IFU-03192023	2023-03-20 12:51:52 +00:00
Philippe Tillet	e4b2d1bc3d	[FRONTEND][BACKEND] no longer using indices for loops (#1370 )	2023-03-19 14:57:50 -07:00
Rohit Santhanam	a84b4883e6	Merge remote-tracking branch 'upstream/main' into triton-mlir-IFU-03192023	2023-03-19 13:46:50 +00:00
peterbell10	c9740f0870	[OPTIMIZER] Add canonicalize/fold for ExpandDimsOp, ViewOp and BroadcastOp (#1354 ) These eliminate no-op reshapes, and simplify some combinations of view ops into a single view. e.g. viewing a splat becomes a single splat.	2023-03-16 21:13:58 -07:00
Berke Kocaoğlu	ba91f39dbf	[DOC] Fix syntax errors, typos, formatting; increase consistency (#1357 ) This PR; - Fixes syntax errors like `.type values: dict[str, Callable[[list[Any]], Any]]` to `:type values: dict[str, Callable[[list[Any]], Any]]`, - Fixes typos, - Fixes formatting like `k ++` to ` k++`, - Increases consistency (e.g. by transforming the minority `cd dir/` to the majority `cd dir`).	2023-03-16 15:32:02 -07:00
Rohit Santhanam	77dcb667b5	Merge remote-tracking branch 'upstream/main' into triton-mlir-IFU-03162023	2023-03-16 13:21:15 +00:00
Philippe Tillet	082828af47	[OPTIMIZER] Fixed up divisibility analysis in div operation (#1341 )	2023-03-14 18:17:05 -07:00
Rohit Santhanam	6ff54b495c	Merge remote-tracking branch 'upstream/main' into triton-mlir-IFU-03122023	2023-03-13 18:09:12 +00:00
Philippe Tillet	9b7c65a3a9	[BACKEND][OPTIMIZER] Refactor MMAv1 codegen (#1322 ) - Significant simplification of the optimizer pipeline. Right mma version is now set directly after the coalescing pass. DotOperand layout no longer hold a state to `isRow` argument, and instead query it from their parent - Moved a bunch of things from TritonGPUToLLVM/DotOpHelpers to TritonGPUAttrDefs. All MMAv1 state is now queried from attributes. - logic for getELemsPerThread is no longer duplicated in TypeConverter	2023-03-12 19:54:38 -07:00
Christian Sigg	64fc0e23ce	[BACKEND] Fix `triton-convert-arith-to-index`. (#1310 ) The dialect of created ops needs to be part of dependent dialects.	2023-03-12 19:43:41 -07:00
Philippe Tillet	3fe3adbcde	[FRONTEND][BACKEND] Add support for float8e5m2 type (#1314 )	2023-03-10 19:14:47 -08:00
B1tway	b5dc18d7c9	Added missing #ifdef and fixed code style	2023-03-07 11:32:52 +00:00
Philippe Tillet	3db55c5f94	[OPTIMIZER]]BACKEND] Some backend and optimization passes clean-up (#1284 ) * Cleaned up pipeline pass. Now works when there are element-wise ops between the load and the dot * Made `splat` compatible with varibales that have DotOperandLayout * Moves rematerialization utils to separate Transforms/Utility.cpp file.	2023-03-06 17:17:59 -08:00
B1tway	693cf89777	Changed warpSize = 32 to warpSize = 64	2023-03-06 18:33:35 +00:00
Keren Zhou	d376020f90	[FRONTEND][BACKEND] Implement `tl.device_assert` and rename `tl.printf` to `tl.device_print` (#1143 ) Note that `tl.device_print` and `print` accepts different arguments than the normal `print`. The first argument must be a string, following by variables. Device side: - `tl.device_print` - `tl.device_assert` - `print` - `assert` Compilation time: - `tl.static_assert` - `tl.static_print` Usage example: 1. ```Python tl.device_assert(x == 0, "x != 0") ``` Output: ```Python ... python/test/unit/language/assert_helper.py:18: kernel: block: [0,0,0], thread: [33,0,0] Assertion `x != 0` failed. ... ``` 2. ```Python tl.device_print("hello ", x) ``` Output: ```Python ... hello 1 ... ``` The environment variable `TRITON_DEBUG` sets the default debugging flag; if it's true, `tl.device_assert` or `assert` will be skipped.	2023-03-04 08:08:29 -08:00
Philippe Tillet	fa0fbc937f	[FRONTEND][BACKEND][OPTIMIZER] Loops now use 64-bit indices when necessary (#1261 ) * Frontend: - `int` kernel arguments are always signed - Loop induction variable is now determine by integer promotion on lb/ub/step * Optimizer: - Added new ExtractSliceOp that enforces 32-bit offsets * Backend: - Use 64-bit indices when lowering functions and control flow - Removed `idx_val` macro and replaced it with `i32_val` - Cleaned up comments - Added new ArithToIndex pass to make sure operations on indices are done with the `index` dialect, that gets converted to LLVM separately using a 64-bit target	2023-03-01 23:09:48 -08:00
Rohit Santhanam	cd9ae1cd36	Merge remote-tracking branch 'upstream/main' into triton-mlir-IFU-02232023	2023-02-23 21:41:54 +00:00
Philippe Tillet	0ec277efc5	[OPTIMIZER] cleaned, renamed and simplified some optimization passes (#1232 ) This shouldn't actually change the behavior of Triton -- only clean things up.	2023-02-22 13:54:55 -08:00
Keren Zhou	123c687ed9	[BACKEND] Rewrite Membar to fit the CF dialect (#1213 )	2023-02-19 14:54:33 -08:00
Philippe Tillet	c1194bd237	[OPTIMIZER] Refined side-effect traits (#1216 )	2023-02-19 01:21:19 -08:00
Rohit Santhanam	841784d1e3	Merge remote-tracking branch 'upstream/main' into upgrade_triton_mlir_rocm_to_llvm_head	2023-02-18 09:25:20 +00:00
Christian Sigg	9ef4b5d773	Rebase to LLVM-head. (#1200 ) Rebase to `37b7a60cd7`	2023-02-17 13:16:11 -08:00
Christian Sigg	fc7a8e3581	Rebase Triton to LLVM-15. (#1070 ) This PR rebases Triton from LLVM-14 to LLVM-15. Most changes are mechanical, except for the analysis framework changes.	2023-02-16 06:40:53 -08:00
Philippe Tillet	8bca84ce3d	[OPTIMIZER] Bugfix in Combine.cpp ; Added `trans` support in Pipeline.cpp (#1174 )	2023-02-14 13:36:44 -08:00
rsanthanam-amd	2ec42ea37b	Merge pull request #117 from ROCmSoftwarePlatform/fix_sramecc_xnack_warnings_navi21 Fix warning on some amdgpu arch (i.e., navi21)	2023-02-14 07:37:42 -06:00
Chao Chen	c0a8c72343	update function to get full arch details and compile it with arch details instead of hardcode	2023-02-14 12:59:26 +00:00
Rohit Santhanam	a2416e0901	Merge remote-tracking branch 'upstream/main' into triton-mlir-IFU-02112023	2023-02-11 14:48:19 +00:00
Philippe Tillet	2aba985daa	[OPTIMIZER] Improved layout simplifications heuristics (#1168 )	2023-02-09 20:17:25 -08:00
Philippe Tillet	0cbe368fe5	[OPTIMIZER] Using new multiRootGetSlice utility in memory coalescing pass (#1169 )	2023-02-09 18:43:33 +00:00
Yu Guo	474ed978b9	[BUILD] Fixed typo in CMake type tablegen (#1124 )	2023-02-03 18:46:11 -08:00
Keren Zhou	bde52f9db2	[BACKEND] Fix alignment calculation (#1149 ) `getDivisibility` represents if the address in bytes is divisible by a certain number, so we should convert `#aligned bytes` to `#aligned elements`.	2023-02-03 17:20:23 -08:00
Rohit Santhanam	8cb6ab5b1a	Merge remote-tracking branch 'upstream/main' into triton_mlir_IFU_02022023	2023-02-02 22:54:53 +00:00
Keren Zhou	82befe32ad	[BACKEND] Improve torch inductor performance (#1108 ) - Rewrite the AxisInfo analysis to handle each op case by case. - Add bit shift, min max, div/rem, and select ops to AxisInfo. - Rematerialize across load/store ops in the following two cases: - A size 1 tensor is considered not expensive since all threads will load the same - the targeEncoding may expose more vectorization opportunities (more elements per thread on the first dim) _res2next_ benchmark GPU Kernel time comparison on A100. - Average kernel sum. Triton 16838630ns vs Triton-MLIR 17105166ns. 1.016x slowdown. - Total kernel sum. Triton 6511735460ns vs Triton-MLIR 6512370620ns.	2023-02-01 18:21:15 -08:00
Keren Zhou	71c6d56b9b	[BACKEND] Hotfix to remove SameOperandsAndResultEncoding for the trans operation (#1136 ) The order of the input encoding is permuted to form the output encoding. For example: ``` #A_SHARED = #triton_gpu.shared<{vec = 2, perPhase = 2, maxPhase = 4, order = [1, 0]}> #A_SHARED_T = #triton_gpu.shared<{vec = 2, perPhase = 2, maxPhase = 4, order = [0, 1]}> %b = tt.trans %tensor : (tensor<16x32xf16, #A_SHARED>) -> tensor<32x16xf16, #A_SHARED_T> ```	2023-02-01 10:03:38 -08:00
Keren Zhou	5dd8ce3745	[BACKEND] Fix topological sort and add new test cases (#1132 ) Previous https://github.com/openai/triton/pull/1113 forgot to consider that a node may have multiple parents, visiting the instruction before any parent violates the semantic of topological sort. The fixed implementation exhaustively add all operations into a candidate subgraph and move an operation to the "ready" queue once all of its operands have been visited.	2023-01-31 23:41:20 -08:00
Philippe Tillet	8fea1fb478	[FRONTEND] Adding static range (#1130 ) Included: Revert "[BACKEND] Replace `mlir::topologicalSort` with a custom implementation (#1113)"	2023-01-31 18:04:19 -08:00
Michael Melesse	a9f955f862	Merge remote-tracking branch 'upstream/main' into triton-mlir-IFU-2023-30-1	2023-01-30 14:04:01 -06:00
Keren Zhou	bc8a26d56f	[BACKEND] Replace `mlir::topologicalSort` with a custom implementation (#1113 ) `multiRootTopologicalSort` is faster than `mlir::topologicalSort` because it prunes nodes that have been visited before.	2023-01-29 18:57:21 -08:00
Keren Zhou	5bcf60a5c0	[BACKEND] Refactored the code to no longer include static functions in header files. (#1109 )	2023-01-28 14:58:28 -08:00
Keren Zhou	34ac01e597	[BACKEND] Change DenseMap to std::map in Allocation (#1092 ) Since we need to get the address/reference of each entry, it is not appropriate to use DenseMap here because rehashing on DenseMap will reallocate buckets and change addresses of entries. Unlike DenseMap, associative containers in STL do not change references to existing entries when other entries are inserted or deleted. https://stackoverflow.com/questions/61771352/will-it-change-the-address-of-a-existed-keys-value-when-inserting-new-keys	2023-01-25 05:18:20 -08:00
Rohit Santhanam	2d0ee0fa0f	Merge remote-tracking branch 'upstream/main' into triton-mlir-IFU-01232023	2023-01-24 03:59:17 +00:00
Keren Zhou	b5d32896b1	[BACKEND] Verify the same operand and result element type for convert_layout (#1081 ) And a hotfix for incorrect convert_layout construction in the GPU combine pass.	2023-01-22 16:59:24 +00:00
Yan Chunwei	88498d104a	[BACKEND] DotOp enable ld.v4 in MMAv1 (#1020 ) The existing convert distributed to distributed layouts logic is based on processing each MMA-block, this requires each MMA-block to share exactly the same fixed pattern(such as the one described in the [NV PTX doc](https://docs.nvidia.com/cuda/parallel-thread-execution/index.html#warp-level-matrix-fragment-mma-16816-float)). While for MMAv1, things are different, the MMA-block has variant patterns for different shapes and data layouts as below <img width="200" alt="image" src="https://user-images.githubusercontent.com/328693/213354941-731d7856-ad24-4f48-be0e-3cf41532cfa4.png"> This requires all the cell coordinates in DotOp output to be computed.	2023-01-19 09:42:33 -08:00
Philippe Tillet	408d1d7e87	[OPTIMIZER] Improved flash attention forward pass performance (#1075 ) - Fixed typo in instruction reordering pass - Minor additional optimizations for shared memory allocator - Optimized flash attention tutorial forward pass kernel	2023-01-19 06:46:01 +00:00
Philippe Tillet	660f2e8cce	[OPTIMIZER] pipeline and prefetch pass now use a more ptxas-friendly schedule (#1065 )	2023-01-17 15:21:19 -08:00

... 2 3 4 5 6 ...

487 Commits