Commit Graph

1255 Commits

Shucai Xiao
d6d1cf2859 add gfx942 to support matrix_core (#358) 2023-10-10 22:46:24 -05:00
Alexander Efimov
7e34c244c2 [Triton] Mfma16 support (#251)
* [MFMA] Support mfma with M/N size 16

This PR adds code for emitting MFMA instructions with size 16.

* add control over mfma type with MFMA_TYPE=16 env var
2023-10-09 13:59:54 -05:00
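
As a quick illustration of the env-var control added above: a minimal sketch, assuming MFMA_TYPE is read from the process environment at kernel-compile time (only the variable name and the value 16 come from the commit).

```python
import os

# Select 16x16 mfma instructions; MFMA_TYPE is the env var named in the
# commit above (handling of other values is an assumption).
os.environ["MFMA_TYPE"] = "16"

import triton  # set the variable before any kernels are compiled
```
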
oplavsic
e801638b40 Add waves_per_eu as kernel parameter (#319)
* Add waves_per_eu as kernel parameter

* Fix failing tests

* Add default value for waves_per_eu for ttgir_to_llir function

* Remove aot.py
2023-10-06 12:08:34 -05:00
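
A usage sketch for the waves_per_eu kernel parameter added above; the vector-add kernel is illustrative, and the parameter is assumed to be passed at launch alongside num_warps (it is an AMD-specific occupancy hint, waves per execution unit).

```python
import torch
import triton
import triton.language as tl

@triton.jit
def add_kernel(x_ptr, y_ptr, out_ptr, n_elements, BLOCK: tl.constexpr):
    offs = tl.program_id(0) * BLOCK + tl.arange(0, BLOCK)
    mask = offs < n_elements
    x = tl.load(x_ptr + offs, mask=mask)
    y = tl.load(y_ptr + offs, mask=mask)
    tl.store(out_ptr + offs, x + y, mask=mask)

x = torch.randn(4096, device="cuda")  # the "cuda" device string also maps to ROCm builds
y = torch.randn(4096, device="cuda")
out = torch.empty_like(x)
grid = (triton.cdiv(x.numel(), 1024),)
# waves_per_eu is passed at launch like num_warps/num_stages
add_kernel[grid](x, y, out, x.numel(), BLOCK=1024, num_warps=4, waves_per_eu=2)
```
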
oplavsic
6a173eab8a Remove redundant fp32->fp16 conversion in FA (#349) 2023-10-04 14:10:07 -05:00
Michael Melesse
31fe8aadc5 ROCM IFU: Fix minimize_alloc
ROCM IFU: Small fixes
2023-10-03 05:34:44 +00:00
Shucai Xiao
334c9b5aed ROCM IFU: Fix unit tests error related to fp8/fp16 mixed input 2023-10-03 04:30:44 +00:00
Lixun Zhang
a41f13adcd ROCM IFU: Extend input to 32-bit when necessary
Note: we'll need to check later whether we can use i8 for some
reduction operations
2023-10-03 04:30:37 +00:00
Michael Melesse
28c571ea43 ROCM IFU: Fix test_if 2023-10-03 04:30:22 +00:00
Aleksandr Efimov
8ccc4b0cce ROCM IFU: Fix layout formatting 2023-10-03 04:30:16 +00:00
wenchenvincent
42a5bf9c7c ROCM IFU: Enabled conversion between fp8e4m3b15x4 and fp16. Refactored conversion between fp8e4m3nv and fp16. (#335) 2023-10-03 04:30:01 +00:00
Jason Furmanek
e5d7bb4fae Initial commit to resolve merge conflicts
rename tl.float8e4 to tl.float8e4nv to align with upstream

ROCM IFU: Fix python arch issues

ROCM IFU: Fix kernel launcher

ROCM IFU: Fix merge conflicts

fix debug build

Set correct threadsPerCTA
2023-10-03 04:04:26 +00:00
Jason Furmanek
74fd8e9754 Merge commit '36fc54b6f28168d3644808bfe299f1ba06a36272' into ifu230908-2
Conflicts:
	.gitignore
	bin/triton-translate.cpp
	include/triton/Conversion/TritonGPUToLLVM/TritonGPUToLLVMPass.h
	include/triton/Dialect/TritonGPU/IR/TritonGPUAttrDefs.td
	include/triton/Dialect/TritonGPU/IR/TritonGPUDialect.td
	lib/Analysis/Utility.cpp
	lib/Conversion/TritonGPUToLLVM/ConvertLayoutOpToLLVM/SharedToDotOperandMMAv2.cpp
	lib/Conversion/TritonGPUToLLVM/DotOpToLLVM.cpp
	lib/Conversion/TritonGPUToLLVM/ElementwiseOpToLLVM.cpp
	lib/Conversion/TritonGPUToLLVM/ReduceOpToLLVM.cpp
	lib/Conversion/TritonGPUToLLVM/TritonGPUToLLVM.cpp
	lib/Conversion/TritonGPUToLLVM/TritonGPUToLLVMBase.h
	lib/Conversion/TritonGPUToLLVM/TritonGPUToLLVMPass.cpp
	lib/Conversion/TritonGPUToLLVM/Utility.h
	lib/Dialect/Triton/Transforms/RewriteTensorPointer.cpp
	lib/Dialect/TritonGPU/IR/Dialect.cpp
	lib/Dialect/TritonGPU/Transforms/AccelerateMatmul.cpp
	lib/Dialect/TritonGPU/Transforms/RemoveLayoutConversions.cpp
	lib/Target/LLVMIR/LLVMIRTranslation.cpp
	python/src/triton.cc
	python/test/unit/runtime/test_subproc.py
	python/triton/compiler/compiler.py
	python/triton/compiler/make_launcher.py
	python/triton/language/semantic.py
	python/triton/runtime/jit.py
	python/tutorials/06-fused-attention.py
	test/Conversion/triton_to_tritongpu.mlir
	test/Conversion/tritongpu_to_llvm.mlir
	test/TritonGPU/coalesce.mlir
	unittest/Conversion/TritonGPUToLLVM/CMakeLists.txt
2023-10-02 18:01:04 +00:00
SJW
287b0adcc2 [Stream] Fixed bug in stream-pipeline for FA (#345)
* [Stream] Fixed bug in stream-pipeline for FA
* updated gemm tutorial for num_stages=0

* updated configs
2023-09-29 20:20:55 -05:00
Shucai Xiao
6e82aa8dbc support gemm fp8/fp16 mixed input (#333)
* changes to support fp8/fp16 mixed inputs

* add unit test for fp8/fp16 mixed input for gemm
2023-09-27 08:00:31 -05:00
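
A minimal sketch of a mixed-input dot at the language level, assuming one operand is downcast to fp8 in-kernel; the tile shapes, the kernel itself, and the choice of tl.float8e4nv are illustrative, and only the fp8 x fp16 pairing in tl.dot comes from the PR.

```python
import triton
import triton.language as tl

@triton.jit
def mixed_dot_kernel(a_ptr, b_ptr, c_ptr,
                     M: tl.constexpr, N: tl.constexpr, K: tl.constexpr):
    offs_m = tl.arange(0, M)
    offs_n = tl.arange(0, N)
    offs_k = tl.arange(0, K)
    a = tl.load(a_ptr + offs_m[:, None] * K + offs_k[None, :])  # fp16 tile
    b = tl.load(b_ptr + offs_k[:, None] * N + offs_n[None, :])  # fp16 tile
    a_fp8 = a.to(tl.float8e4nv)   # one fp8 operand, one fp16 operand
    c = tl.dot(a_fp8, b)          # the mixed pair is what this PR enables
    tl.store(c_ptr + offs_m[:, None] * N + offs_n[None, :], c)
```
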
Aleksandr Efimov
d80cd2d374 [MFMA] Change kWidth parameter semantics
This PR changes kWidth semantics from "elements per instruction" to
"elements per thread per instruction" along the k axis.
2023-09-25 10:56:44 -05:00
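
A worked toy illustration of the change (the numbers below are assumed, not taken from the commit): if one mfma instruction consumes 8 operand elements along k and 2 threads share the k axis, kWidth drops from 8 to 4 under the new reading.

```python
# Assumed figures, for illustration only.
elements_per_instr_k = 8   # elements one mfma instruction consumes along k
threads_along_k = 2        # threads that share the operand's k axis

kwidth_old = elements_per_instr_k                     # "elements per instruction"
kwidth_new = elements_per_instr_k // threads_along_k  # "elements per thread per instruction"
assert (kwidth_old, kwidth_new) == (8, 4)
```
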
Shucai Xiao
10795d8fd3 Fixed a bug related to split_k and prune unnecessary tuning space (#332)
* refine the tuning script by adding prune_configs; also fixed a bug in generating tuning configs

* fixed a bug in returning the empty config
2023-09-21 23:47:14 -05:00
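
For context, pruning hooks of this kind plug into Triton's autotuner via prune_configs_by; the divisibility rule and the skeleton kernel below are hypothetical stand-ins for the tuning script's actual logic.

```python
import triton
import triton.language as tl

def early_config_prune(configs, named_args, **kwargs):
    # Hypothetical rule: keep SPLIT_K > 1 only when K splits evenly.
    K = named_args["K"]
    return [c for c in configs
            if c.kwargs["SPLIT_K"] == 1
            or K % (c.kwargs["BLOCK_K"] * c.kwargs["SPLIT_K"]) == 0]

@triton.autotune(
    configs=[
        triton.Config({"BLOCK_K": 32, "SPLIT_K": 1}, num_warps=4),
        triton.Config({"BLOCK_K": 64, "SPLIT_K": 2}, num_warps=4),
    ],
    key=["K"],
    prune_configs_by={"early_config_prune": early_config_prune},
)
@triton.jit
def k_loop_kernel(x_ptr, out_ptr, K, BLOCK_K: tl.constexpr, SPLIT_K: tl.constexpr):
    # Skeleton body, just enough to hang the decorators on.
    acc = tl.zeros([BLOCK_K], dtype=tl.float32)
    for k in range(0, K, BLOCK_K * SPLIT_K):
        acc += tl.load(x_ptr + k + tl.arange(0, BLOCK_K))
    tl.store(out_ptr + tl.arange(0, BLOCK_K), acc)
```
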
Ognjen Plavsic
a8574be74d [FA] Fix bug from IFU merge
This commit reverts num_warps back to 4 for
FA forward pass with d=128.
2023-09-21 10:48:41 -05:00
Justin Lebar
2a3746bac5 [BUILD] use ninja (#2318) 2023-09-18 15:30:06 -05:00
Alexander Efimov
b25557ad5e [FA] Upstream FA qk initialization (#328)
This PR replaces the qk matrix initialization with the
upstream version
2023-09-14 00:34:14 -05:00
Lixun Zhang
ea397b49aa Fix the issue when CTA coverage is larger than the tile 2023-09-12 10:16:44 -05:00
Alexander Efimov
a06072f8ff Fix dangling gpu_has_mfma use (#325)
* Fix dangling gpu_has_mfma use

This PR replaces gpu_has_mfma use with gpu_matrix_core_version

* add basic test
2023-09-11 12:31:48 -05:00
Alexander Efimov
6691de65db [MFMA] Support BFloat16 on MI100 (#295)
* [MFMA] Support BFloat16 on MI100

This PR makes use of the mfma_f32_32x32x4bf16 instruction, available on MI100.

* fix tests, fix mfma encoding comment, fix switch between mfma versions.

* replace kDim from mfma layout with kWidth from dotOp layout

* rebase fix

* fix mfma to dot op shortcut for bfloat16

* fix review comments
2023-09-08 15:08:34 -05:00
SJW
491eb9ddfe [MLIR] Added tritongpu-stream-pipeline pass (#305)
* [MLIR] Added tritongpu-stream-pipeline pass
     - Prologue: Hoist the pipelinable load operations and shared memory store
       for the ramp up stage
     - Pipelined Loop: Assemble the loop body minus last iteration
       - Prefetch next tile from global into regs (while computing from previous)
       - Non-load loop body
       - Store next tile into shared mem
     - Epilogue: Peeled non-load loop body for last iteration

* updated comment
2023-09-07 15:24:59 -05:00
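
A plain-Python toy of the staging described in the entry above; the real pass rewrites MLIR, so the variables here only mimic registers and shared memory.

```python
def pipelined_sum(tiles):
    # Prologue: hoist the first "global" load and "shared memory" store (ramp-up)
    regs = tiles[0]
    shared = regs
    acc = 0
    # Pipelined loop: the loop body minus its last iteration
    for i in range(1, len(tiles)):
        regs = tiles[i]   # prefetch next tile from global into regs
        acc += shared     # non-load loop body computes on the previous tile
        shared = regs     # store the next tile into shared mem
    # Epilogue: peeled non-load body for the last iteration
    acc += shared
    return acc

assert pipelined_sum([1, 2, 3, 4]) == 10
```
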
jayfurmanek
83a0958566 Merge pull request #322 from ROCmSoftwarePlatform/f8_and_bf16_conversions
Enable fp8 conversions and fix bf16 conversions
2023-09-07 14:38:16 -05:00
Shucai Xiao
fb3f2d6feb refine gemm tuning scripts (#309)
* refine the gemm tuning scripts to reduce the tuning space and get better perf numbers

* added code to support tuning in the full tuning space

* add a function to get the best tuning config

* refine the matmul tutorial example to print out the best tuning config for each input

* added even_k to the gemm kernel heuristic for better performance (see the sketch after this entry)

* address review comments
2023-09-07 08:09:11 -05:00
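
The even_k heuristic referenced above usually amounts to a compile-time flag that lets the k-loop skip bounds masking when K divides evenly; a sketch, with the exact predicate assumed.

```python
import triton
import triton.language as tl

@triton.heuristics({"EVEN_K": lambda args: args["K"] % args["BLOCK_K"] == 0})
@triton.jit
def k_sum_kernel(x_ptr, out_ptr, K, BLOCK_K: tl.constexpr, EVEN_K: tl.constexpr):
    acc = tl.zeros([BLOCK_K], dtype=tl.float32)
    for k in range(0, K, BLOCK_K):
        offs = k + tl.arange(0, BLOCK_K)
        if EVEN_K:
            x = tl.load(x_ptr + offs)                        # no mask needed
        else:
            x = tl.load(x_ptr + offs, mask=offs < K, other=0.0)
        acc += x
    tl.store(out_ptr + tl.arange(0, BLOCK_K), acc)
```
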
Wen Chen
ffc230ebfe [ROCM] Fixed implementation of fp32 to bf16 conversion on ROCm. 2023-09-06 18:10:54 -05:00
Wen Chen
2d3e38e182 [ROCM] Added ROCm support for int8 to bfloat16 conversion. 2023-09-06 18:10:54 -05:00
Wen Chen
59a40d3f72 [ROCM] Added ROCm support for the conversions of following data types:
[float8e4m3, float8e4m3b15, float8e5m2] <-> [float16, bfloat16]
2023-09-06 18:10:54 -05:00
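
A minimal sketch of the enabled casts at the tl level; the dtype spellings below follow the commit message and may not match the tl attribute names in a given Triton version.

```python
import triton
import triton.language as tl

@triton.jit
def fp8_roundtrip(x_ptr, y_ptr, n, BLOCK: tl.constexpr):
    offs = tl.program_id(0) * BLOCK + tl.arange(0, BLOCK)
    mask = offs < n
    x = tl.load(x_ptr + offs, mask=mask)   # fp16 input
    f8 = x.to(tl.float8e5m2)               # fp16 -> fp8 (name per the commit message)
    y = f8.to(tl.bfloat16)                 # fp8 -> bf16
    tl.store(y_ptr + offs, y, mask=mask)
```
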
Jason Furmanek
320b1029da Temporarily disable F8 tests on ROCm 2023-09-01 04:02:14 +00:00
Jason Furmanek
df5c263a19 Fix merge conflicts 2023-09-01 04:01:32 +00:00
Jason Furmanek
3eaeb89d18 Merge commit '5df904233c11a65bd131ead7268f84cca7804275' into ifu230810-2
Conflicts:
	include/triton/Dialect/Triton/Transforms/Passes.h
	include/triton/Dialect/TritonGPU/IR/Dialect.h
	include/triton/Dialect/TritonGPU/IR/TritonGPUAttrDefs.td
	lib/Analysis/Allocation.cpp
	lib/Analysis/Utility.cpp
	lib/Conversion/TritonGPUToLLVM/ElementwiseOpToLLVM.cpp
	lib/Conversion/TritonGPUToLLVM/ReduceOpToLLVM.cpp
	lib/Conversion/TritonGPUToLLVM/TritonGPUToLLVM.cpp
	lib/Conversion/TritonGPUToLLVM/TritonGPUToLLVMPass.cpp
	lib/Dialect/Triton/Transforms/RewriteTensorPointer.cpp
	lib/Dialect/TritonGPU/Transforms/RemoveLayoutConversions.cpp
	lib/Dialect/TritonGPU/Transforms/ReorderInstructions.cpp
	lib/Target/LLVMIR/LLVMIRTranslation.cpp
	python/src/triton.cc
	python/triton/compiler/compiler.py
	python/triton/ops/flash_attention.py
	python/triton/runtime/autotuner.py
	python/triton/runtime/jit.py
	python/triton/tools/aot.py
	python/tutorials/06-fused-attention.py
	test/Conversion/tritongpu_to_llvm.mlir
	test/Target/tritongpu_to_llvmir.mlir
	test/Target/tritongpu_to_llvmir_noinline.mlir
2023-09-01 03:25:33 +00:00
Philippe Tillet
ec51552fff [BACKEND] Lift restriction for float8e4b15 to only support row-col layout (#2212) 2023-08-30 14:06:31 -07:00
jon-chuang
9af76e7d5a [RUNTIME] Fix cache dir (#2196)

Co-authored-by: Keren Zhou <kerenzhou@openai.com>
2023-08-29 21:07:16 -04:00
Vinayak Gokhale
9cdf3a58c3 Enable split kernel in bwd pass (#303)
* Add fwd and bwd v2

Changes are largely from upstream.

* Split bwd kernel in dq and dk+dv

Only adds the split kernels. They are not enabled yet.

* Pull scalar multiplies out of the loop

* Enable split kernel for bwd pass

* Put back P_SEQ=128 in fwd test

Not used for bwd test

* Address review comments

* Address comments

Conditionally set causal/split-kernel to False for bwd.

* Add block pointer semantics to bwd pass

This significantly increases perf for bwd, similar to fwd.
2023-08-29 13:51:29 -05:00
Lixun Zhang
b834f42ae4 [autotuner] Add an option to print best_config for each key 2023-08-28 14:45:54 -05:00
goostavz
1465b573e8 [TESTS][HOPPER] Prune hopper tests to speedup CI (#2193)
Co-authored-by: Goostav Zhu <gzhu@nvidia.com>
2023-08-27 20:45:23 -07:00
Philippe Tillet
5f448b2f08 [FRONTEND] remove dead libhopper_helpers.bc file (#2190) 2023-08-26 12:17:17 -07:00
Greg Brockman
a9b8c8c37d [FRONTEND] drop GIL for launch, and set value=false upon pointer error (#2185) 2023-08-26 17:07:57 +00:00
Keren Zhou
6e4932cda8 [BACKEND] Fix fma mixed-precision (#2184)
and expose the allow_tf32 argument to the matmul op

@shunting314
2023-08-26 09:49:58 -07:00
Greg Brockman
ab3e8b0dad [FRONTEND] fix handling of do_not_specialize with interior constantexprs (#2188) 2023-08-26 09:19:34 -07:00
Mohammed Anany
ebfe0ffb29 [FRONTEND] fix for undefined dtypes in jit during loading defaults (#2114)
Co-authored-by: Keren Zhou <kerenzhou@openai.com>
2023-08-25 10:28:23 -07:00
Ethan Pronovost
56fee37a0d [FRONTEND] Fix benchmark plotting (#2177) 2023-08-24 20:34:04 -07:00
Greg Brockman
64d8df4c69 [FRONTEND] handle errors from launch_enter_hook (#2178) 2023-08-24 20:32:01 -07:00
Shantanu
7083dae4f2 [FRONTEND] drop the GIL around more CUDA ops (#2173) 2023-08-24 20:31:38 -07:00
Zahi Moudallal
120ce0a5bf [DOCS] Fixing docs (#2175) 2023-08-24 15:58:59 -07:00
jayfurmanek
ff7e707f87 Enable usage of block pointer semantics for AMD gpus (#301)
* Enable usage of block pointer semantics for AMD gpus

This commit enables usage of block pointer semantics by enabling
rewrite_tensor_pointer_pass that rewrites block pointer loads/stores
to legacy loads/stores.

* Update FA fwd in tutorial to use the block pointers

* use 90 compute capability for amd gpus in python/triton/compiler/compiler.py

Co-authored-by: Alexander Efimov <efimov.alexander@gmail.com>

Co-authored-by: Ognjen Plavsic <ognjen.plavsic@dxc.com>
Co-authored-by: Lixun Zhang <lixun.zhang@amd.com>
Co-authored-by: Aleksandr Efimov <130555951+alefimov-amd@users.noreply.github.com>
Co-authored-by: Alexander Efimov <efimov.alexander@gmail.com>
2023-08-24 13:05:12 -05:00
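
Block pointers use the standard Triton API, which the pass above makes usable on AMD GPUs by lowering them to legacy loads/stores; a small illustrative tile copy follows.

```python
import triton
import triton.language as tl

@triton.jit
def copy_tile(in_ptr, out_ptr, M, N, stride_m, stride_n,
              BLOCK_M: tl.constexpr, BLOCK_N: tl.constexpr):
    block_in = tl.make_block_ptr(base=in_ptr, shape=(M, N),
                                 strides=(stride_m, stride_n), offsets=(0, 0),
                                 block_shape=(BLOCK_M, BLOCK_N), order=(1, 0))
    block_out = tl.make_block_ptr(base=out_ptr, shape=(M, N),
                                  strides=(stride_m, stride_n), offsets=(0, 0),
                                  block_shape=(BLOCK_M, BLOCK_N), order=(1, 0))
    # On AMD, rewrite_tensor_pointer_pass rewrites these block-pointer
    # accesses into legacy loads/stores.
    tile = tl.load(block_in, boundary_check=(0, 1))
    tl.store(block_out, tile, boundary_check=(0, 1))
```
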
chengjunlu
6cb67185f8 [FRONTEND] Use proper default num_warps and num_stages based on the device backend in JITFunction (#2130)
The default values used by JITFunction for num_warps and num_stages are
coupled to the Nvidia GPU architecture. We should use proper default
values based on the device backend the kernel is compiled for.
1. Add two functions to return the default num_warps and num_stages for
the specific device backend.
2. JITFunction uses the proper default num_warps and num_stages based on
the specific device backend.

Co-authored-by: Wang Weihan <eikan.wang@intel.com>
2023-08-24 21:58:18 +08:00
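
A schematic of the two helpers the PR describes; the names, signatures, and the non-CUDA value below are assumptions, and only the idea of backend-dependent defaults comes from the PR.

```python
# Hypothetical helpers mirroring the PR's description, not its actual code.
def get_default_num_warps(backend: str) -> int:
    return 4  # a common default across backends

def get_default_num_stages(backend: str) -> int:
    # Pipelining depth is hardware-specific: e.g. 3 on NVIDIA GPUs,
    # while other backends default differently (assumed value below).
    return 3 if backend == "cuda" else 2
```
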
Bin Fan
dad83f9dcb [TOOLS] Add support for autotuning AOT kernel (#2123)
This PR makes the following changes to AOT kernels:

- Allow the client to generate AOT kernels with different sets of
constexprs and meta-parameters. Each combination of constexpr set and
meta-parameters is referred to as an "algo". Within an algo, the client
can still give different hints about integer arguments.
- Add an API int ${kernel_name}_get_num_algos() that returns the total
number of algos.
- Add an algo_id parameter to allow the client of the generated kernel to
select the algo.
- Remove gX, gY and gZ from the kernel parameter list. This is because
the launch grid usually differs across algos, and the client should not
need to care about how to compute the launch grid for each algo. Instead,
we ask the client to pass the expressions computing gX, gY and gZ to
compile.py (when AOT kernels are generated). The expressions can only use
kernel parameters or const values.
- We also change the testing flow. Now we first build the kernels into a
shared library libkernel.so, then the client test.c code is built and
linked against libkernel.so. This is closer to a typical AOT kernel usage
flow.
2023-08-23 09:38:29 -07:00
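
A hypothetical client-side flow against the generated library; libkernel.so comes from the testing-flow description above, while the matmul_fp16 entry points stand in for whatever names compile.py actually emits.

```python
import ctypes

# Hypothetical client flow; the matmul_fp16 symbols below are invented
# stand-ins for the generated entry points.
lib = ctypes.CDLL("./libkernel.so")

num_algos = lib.matmul_fp16_get_num_algos()   # API added by this PR
print(f"library exposes {num_algos} algos")

# gX/gY/gZ are gone from the parameter list: the generated launcher computes
# the grid from the kernel parameters, so the client only picks an algo_id:
# lib.matmul_fp16(stream, C, A, B, M, N, K, algo_id)
```
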
Zahi Moudallal
5282ed890d [CI] Add back pre-commit to nvidia CI job (#2159) 2023-08-23 01:11:03 +00:00
Keren Zhou
6a65c894fe [TUTORIALS] Skip running TMA tutorials on non-hopper architectures (#2153) 2023-08-22 18:02:26 -07:00