[BACKEND] Pipeliner refactoring (#2565)

Refactor the pipeliner pass to make it more generic. The main change is
that the pipeliner is now split into two pieces: one that computes a
modulo schedule and creates async ops based on the IR, and an expander
that generates the pipelined IR based on the modulo schedule.
Separating the two pieces lets us create different schedules without
having to change the expander, and it allows for more complex schedules.
For now, the schedule generated for the matmul case roughly matches the
schedule picked by the previous pipeliner in order to avoid changes.
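
The scheduler/expander split described above can be illustrated with a toy model. This is only a sketch under stated assumptions, not the pass's actual C++ implementation: the names `toy_schedule` and `expand`, the op names, and the rule "loads go in stage 0, everything else in the last stage" are all hypothetical. It shows that the expander only needs an op-to-stage mapping, so a different schedule can be plugged in without touching it.

```python
from typing import Dict, List, Tuple


def toy_schedule(ops: List[str], num_stages: int) -> Dict[str, int]:
    """Hypothetical scheduler: put loads in stage 0, all other ops in the last stage."""
    return {op: 0 if op.startswith("load") else num_stages - 1 for op in ops}


def expand(schedule: Dict[str, int], num_iters: int) -> List[List[Tuple[str, int]]]:
    """Hypothetical expander: turn a stage assignment into pipelined steps.

    An op assigned to stage s, for source-loop iteration i, executes in
    pipelined step i + s. Early steps form the prologue and late steps
    form the epilogue.
    """
    num_stages = max(schedule.values()) + 1
    steps = []
    for t in range(num_iters + num_stages - 1):
        step = []
        for op, stage in schedule.items():
            i = t - stage  # source iteration this op works on during step t
            if 0 <= i < num_iters:
                step.append((op, i))
        steps.append(step)
    return steps


if __name__ == "__main__":
    sched = toy_schedule(["load_a", "load_b", "dot"], num_stages=2)
    for t, step in enumerate(expand(sched, num_iters=3)):
        print(t, step)
```

With two stages, each step issues the loads for one iteration while computing the `dot` of the previous one. Swapping `toy_schedule` for another stage assignment changes the overlap without any change to `expand`.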

This also creates a different sequence of insert/extract slice ops for
the alloc. We should probably change shared alloc to use memory semantics.

Author: Thomas Raoux
Date: 2023-11-02 09:56:39 -07:00
Committed by: GitHub
Parent: 218492cd65
Commit: ca8f110617
14 changed files with 1901 additions and 2061 deletions

@@ -339,6 +339,9 @@ def test_gemm(BLOCK_M, BLOCK_N, BLOCK_K, NUM_WARPS, NUM_CTAS, M, N, K, TRANS_A,
            '16-32-64-8-2-256-256-256-True',
    ]:
        pytest.skip('Known legacy issue, ldmatrix can only support x4')
    enable_tma = os.environ.get('ENABLE_TMA', 'not found').lower()
    if NUM_CTAS > 1 and enable_tma in ["on", "true", "1"]:
        pytest.skip('multi-CTA with TMA not supported in MaterializeLoadStore')
    M = BLOCK_M if M is None else M
    N = BLOCK_N if N is None else N
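
The skip condition added in this hunk can be checked in isolation. The sketch below factors it into two helpers. `tma_enabled` and `should_skip_multi_cta` are hypothetical names that do not appear in the test file. Only the `ENABLE_TMA` variable, its accepted values, and the `NUM_CTAS > 1` check come from the diff.

```python
import os
from typing import Mapping


def tma_enabled(env: Mapping[str, str] = os.environ) -> bool:
    """Return True if ENABLE_TMA is set to "on", "true" or "1" (case-insensitive)."""
    return env.get('ENABLE_TMA', 'not found').lower() in ("on", "true", "1")


def should_skip_multi_cta(num_ctas: int, env: Mapping[str, str] = os.environ) -> bool:
    """Mirror the diff's condition: skip only for multi-CTA runs with TMA enabled."""
    return num_ctas > 1 and tma_enabled(env)
```

Passing the environment as a mapping keeps the check testable without mutating `os.environ`.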