mirror of
https://github.com/tinygrad/tinygrad.git
synced 2026-01-24 06:18:01 -05:00
70 lines
2.1 KiB
Plaintext
SuperScalar is hard
VLIW is scam but maybe okay if we remove V? (322-bit is only 41 bytes)
Google TPU is nice. Admit it!!!!! U just jealous U don't have one

TPUv1:
256x256 = 64K MACs

TPUv2:
MXU (128x128) supports BF16
compiler-controlled memory hierarchy
322-bit VLIW instructions from local memory

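The "41 bytes" figure for the 322-bit bundle is just ceiling division. A minimal sketch of that arithmetic follows. The helper name `bits_to_bytes` is made up for illustration, and the second number is only what "25% wider" (noted under TPUv4i) would imply if taken literally, not a published instruction width.

```python
import math

def bits_to_bytes(bits: float) -> int:
    """Smallest whole number of bytes that can hold `bits` bits."""
    return math.ceil(bits / 8)

TPUV2_BUNDLE_BITS = 322  # from the notes

# 322 bits do not fill a whole number of bytes, so round up.
print(bits_to_bytes(TPUV2_BUNDLE_BITS))

# Assumption: "25% wider" applied literally to 322 bits.
wider_bits = TPUV2_BUNDLE_BITS * 1.25
print(wider_bits, bits_to_bytes(wider_bits))
```

Either way the bundle stays well under a typical 64-byte cache line.
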
TPUv3:
2x 128x128 MXU
2x HBM capacity, 900 GB/s
"midlife kicker"
32MB on chip
Same 322-bit wide as TPUv2

TPUv4i:
7 nm
4x 128x128 MXU
Is it inference only just because they removed the interconnect?
144MB on chip
Still uses HBM off chip (on-chip SRAM is 20x more energy efficient than DRAM)
four-dimensional (triple-strided) tensor DMA architecture (DRAM -> SRAM)
This lessens the stride burden on the SRAM
Custom on-chip interconnect (OCI)
512B native access size (512 uint8; compare 64B cache lines)
The VLIW instruction needed extra fields to handle the four MXUs and the CMEM scratchpad memory, which were easy to add given no need for binary compatibility. The TPUv4i instruction is 25% wider than TPUv3
https://www.researchgate.net/figure/Architecture-of-a-systolic-array-based-DNN-accelerator-that-serves-as-a-baseline-for_fig1_325861610
TPUv4i first sums groups of four multiplication results together, using a custom four-input floating point adder that eliminates the rounding and normalization logic for the intermediate results.

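To see why skipping intermediate rounding matters, here is a rough numerical sketch of a four-input sum. It is not the TPUv4i datapath: it models "no intermediate rounding" as an exact sum followed by one final rounding to float32, which is an assumption made for illustration. The function names are hypothetical.

```python
import struct
from fractions import Fraction

def to_f32(x: float) -> float:
    """Round a Python float (float64) to the nearest float32 value."""
    return struct.unpack("<f", struct.pack("<f", x))[0]

def sum4_round_each(products):
    """Chain of two-input adds, rounding to float32 after every add."""
    acc = 0.0
    for p in products:
        acc = to_f32(acc + p)
    return acc

def sum4_round_once(products):
    """Model of a four-input adder: add exactly, round a single time."""
    exact = sum(Fraction(p) for p in products)
    return to_f32(float(exact))

# Four "multiplication results" chosen so the small terms get lost
# when every intermediate sum is rounded to float32.
products = [2.0**24, 1.0, 1.0, -(2.0**24)]
print(sum4_round_each(products))  # small terms are rounded away
print(sum4_round_once(products))  # small terms survive
```

The single-rounding version also needs less logic per add, which is presumably the hardware motivation; the sketch only shows the numerical side.
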
https://www.hotchips.org/assets/program/conference/day2/HotChips2020_ML_Training_Google_Norrie_Patil.v01.pdf
322b VLIW bundle

* 2 scalar slots
* 4 vector slots (2 for load/store, 2 for ALU?)
* 128 wide vector
* 2 matrix slots (push, pop)
* bfloat16 multiply {1,8,7} (sign, exponent, mantissa bits)
* float32 accumulation
* 1 misc slot
* 6 immediates

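The "bfloat16 multiply, float32 accumulation" pairing from the bundle list can be sketched in plain Python. This is an illustration under assumptions, not the MXU's actual arithmetic: inputs are rounded to bfloat16 with round-to-nearest-even, NaN and infinity are ignored, and the names `to_bf16` and `bf16_dot` are made up.

```python
import struct

def to_f32(x: float) -> float:
    """Round a Python float (float64) to the nearest float32 value."""
    return struct.unpack("<f", struct.pack("<f", x))[0]

def to_bf16(x: float) -> float:
    """Round to bfloat16 {1,8,7}: keep the top 16 bits of the float32 pattern."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    # Round to nearest, ties to even, on the 16 bits about to be dropped.
    bits += 0x7FFF + ((bits >> 16) & 1)
    bits &= 0xFFFF0000
    return struct.unpack("<f", struct.pack("<I", bits))[0]

def bf16_dot(a, b):
    """Dot product with bfloat16 inputs and a float32 accumulator."""
    acc = 0.0
    for x, y in zip(a, b):
        # Two 8-bit significands give a product of at most 16 bits,
        # so this multiply is exact in float64.
        prod = to_bf16(x) * to_bf16(y)
        acc = to_f32(acc + prod)  # rounded to float32 after every add
    return acc

print(to_bf16(3.14159))                    # only ~3 decimal digits survive
print(bf16_dot([1.0, 2.0], [3.0, 4.0]))
```

bfloat16 keeps float32's 8 exponent bits, so the conversion is just a rounding of the mantissa with no range rescaling.
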
Scalar Unit:
4Ki x 32b Scalar Memory
32 x 32b Scalar Reg File
2x ALUs

128x Vector Unit:
32Ki x 32b Vector (Lane) Memory (8 ports)
8x: Sublane
32 x 32b Vector (Lane) Reg File
2x ALUs on a float each, 2048 total
Connections to matrix unit
2048 Vector ALUs (128 lanes x 8 sublanes x 2 ALUs)

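The unit counts and memory sizes above multiply out as follows. A quick sanity-check sketch, assuming the 32Ki x 32b vector memory is per lane (the notes list it under the 128x heading) and that "32b" means 4-byte words. The constant names are made up.

```python
LANES = 128           # "128x Vector Unit"
SUBLANES = 8          # "8x: Sublane"
ALUS_PER_SUBLANE = 2  # "2x ALUs on a float each"
WORD_BYTES = 4        # 32b words

vector_alus = LANES * SUBLANES * ALUS_PER_SUBLANE

# Assumption: 32Ki x 32b is the memory of one lane.
lane_mem_bytes = 32 * 1024 * WORD_BYTES
total_vector_mem_bytes = LANES * lane_mem_bytes

# Scalar unit: 4Ki x 32b.
scalar_mem_bytes = 4 * 1024 * WORD_BYTES

print(vector_alus)
print(lane_mem_bytes // 1024, "KiB per lane")
print(total_vector_mem_bytes // (1024 * 1024), "MiB vector memory total")
print(scalar_mem_bytes // 1024, "KiB scalar memory")
```

Under that per-lane assumption the vector memory totals 16 MiB per core.
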
2x Matrix Multiply Unit
128 x 128 systolic array
Streaming LHS and results
Stationary RHS (w/ optional transpose)

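"Stationary RHS, streaming LHS and results" can be sketched as a small cycle-by-cycle simulation. This is a toy weight-stationary model under assumptions, not the real MXU: each cell (k, n) holds one RHS value, LHS rows enter skewed by one cycle per array row and column, and partial sums move down one cell per cycle. The function name and the skew scheme are illustrative.

```python
def systolic_matmul(lhs, rhs, transpose_rhs=False):
    """Compute lhs @ rhs on a toy K x N weight-stationary array."""
    if transpose_rhs:
        # Optional transpose applied while loading the stationary operand.
        rhs = [list(col) for col in zip(*rhs)]
    M, K, N = len(lhs), len(rhs), len(rhs[0])
    psum = [[0] * N for _ in range(K)]  # partial-sum register of each cell
    out = [[0] * N for _ in range(M)]
    for t in range(M + K + N - 2):      # cycles until the last result drains
        new_psum = [[0] * N for _ in range(K)]
        for k in range(K):
            for n in range(N):
                m = t - k - n           # LHS row reaching cell (k, n) now
                if 0 <= m < M:
                    above = psum[k - 1][n] if k > 0 else 0
                    new_psum[k][n] = above + lhs[m][k] * rhs[k][n]
                    if k == K - 1:      # bottom row streams the result out
                        out[m][n] = new_psum[k][n]
        psum = new_psum
    return out

print(systolic_matmul([[1, 2], [3, 4]], [[5, 6], [7, 8]]))
```

The RHS is loaded once and reused for every LHS row, which is why it is the operand that stays put.
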
Transpose / Permute Unit
Transpose, Reduction, Permutation
allow reshuffling of data across vector lanes
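
The three operations named for the transpose / permute unit, written as plain list functions. This only shows what each operation does to data laid out across lanes; it says nothing about how the hardware implements them, and the function names are made up.

```python
def permute_lanes(vec, perm):
    """Lane i of the output takes its value from lane perm[i] of the input."""
    return [vec[src] for src in perm]

def reduce_lanes(vec):
    """Cross-lane reduction: sum every lane into a single value."""
    total = 0
    for v in vec:
        total += v
    return total

def transpose(tile):
    """Swap rows and columns of a 2D tile (list of equal-length rows)."""
    return [list(col) for col in zip(*tile)]

lanes = [10, 20, 30, 40]
print(permute_lanes(lanes, [3, 2, 1, 0]))  # reverse the lane order
print(reduce_lanes(lanes))
print(transpose([[1, 2, 3], [4, 5, 6]]))
```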