tinygrad/accel/cherry/TPUNOTES

SuperScalar is hard
VLIW is a scam, but maybe okay if we remove the V? (a 322-bit bundle is only ~41 bytes)
Google TPU is nice. Admit it!!!!! U just jealous U don't have one
TPUv1:
  256x256 systolic array = 64K MACs (8-bit)
TPUv2:
  MXU (128x128) supports BF16 (format sketch below)
  compiler-controlled memory hierarchy
  322-bit VLIW instructions from local memory
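BF16 {1,8,7} is just the top half of an fp32 {1,8,23}, which is why it's cheap to support; a minimal numpy sketch of the conversion (truncation here, real hardware typically rounds to nearest even):

    import numpy as np

    def f32_to_bf16(x):
        # keep the top 16 bits of each float32: sign(1) exp(8) mantissa(7)
        u = np.asarray(x, np.float32).view(np.uint32)
        return ((u >> 16) << 16).view(np.float32)  # bf16 value, held in a f32

    print(f32_to_bf16([3.14159265, -0.1]))  # [ 3.125 -0.09960938] -- 7-bit mantissa, full f32 exponent range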
TPUv3:
  2x 128x128 MXU
  2x HBM capacity, 900 GB/s
  "midlife kicker"
  32MB on chip
  Same 322-bit wide VLIW as TPUv2
TPUv4i:
  7 nm
  4x 128x128 MXU
  Is it inference only just because they removed the interconnect?
  144MB on chip
  Still HBM (SRAM is 20x more energy efficient than DRAM)
  four-dimensional (triple-strided) tensor DMA architecture (DRAM -> SRAM); access-pattern sketch after this block
    This lessens the stride burden on the SRAM
  Custom on-chip interconnect (OCI)
  512B native access size (64 uint8)
  The VLIW instruction needed extra fields to handle the four MXUs and the CMEM scratchpad memory, which were easy to add given no need for binary compatibility. The TPUv4i instruction is 25% wider than TPUv3's.
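What "four-dimensional (triple-strided)" means in practice: three strided outer loops around a contiguous inner burst, so only the DRAM side does scattered reads while the SRAM fills unit-stride. A rough Python sketch (names and descriptor fields invented, not the real format):

    def dma_4d(src, dst, counts, strides, burst):
        # 3 strided outer dims + 1 contiguous inner dim = 4D / triple-strided
        d = 0
        for i in range(counts[0]):
            for j in range(counts[1]):
                for k in range(counts[2]):
                    s = i*strides[0] + j*strides[1] + k*strides[2]
                    dst[d:d+burst] = src[s:s+burst]  # gather from "DRAM", pack densely into "SRAM"
                    d += burst

    dram = list(range(4096))
    sram = [0] * (2*3*4*8)
    dma_4d(dram, sram, counts=(2,3,4), strides=(1024,128,16), burst=8)  # pull a 2x3x4 tile of 8-wide bursts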
https://www.researchgate.net/figure/Architecture-of-a-systolic-array-based-DNN-accelerator-that-serves-as-a-baseline-for_fig1_325861610
TPUv4i first sums groups of four multiplication results together using a custom four-input floating point adder that eliminates the rounding and normalization logic for the intermediate results.
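Rounding once per group of four is not the same as rounding after every add; a quick demonstration of the numerics (not the adder itself):

    import math
    import numpy as np

    # four products, all exactly representable (a bf16 x bf16 product fits easily in fp32)
    prods = [1.0, 2**-24, 2**-24, 2**-24]

    acc = np.float32(0.0)                   # sequential fp32 adds: each tiny term is a
    for p in prods:                         # round-to-even tie against 1.0 and gets dropped
        acc = np.float32(acc + np.float32(p))

    once = np.float32(math.fsum(prods))     # align + sum all four exactly, round once

    print(acc, once)  # 1.0 1.0000002 -- the grouped add keeps the low bits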
https://www.hotchips.org/assets/program/conference/day2/HotChips2020_ML_Training_Google_Norrie_Patil.v01.pdf
322b VLIW bundle (toy packing sketch after this list):
  * 2 scalar slots
  * 4 vector slots (2 for load/store, 2 for ALU?)
  * 128 wide vector
  * 2 matrix slots (push, pop)
  * bfloat16 multiply {1,8,7}
  * float32 accumulation
  * 1 misc slot
  * 6 immediates
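For scale, a toy packing of one 322b bundle; the slot widths below are invented so they sum to 322, the real field encoding isn't public:

    # invented widths: 2x24b scalar + 4x30b vector + 2x12b matrix + 10b misc + 6x20b imm = 322b
    SLOTS = [("scalar", 24)]*2 + [("vector", 30)]*4 + [("matrix", 12)]*2 + [("misc", 10)] + [("imm", 20)]*6

    def pack_bundle(vals):
        word, pos = 0, 0
        for (_, width), v in zip(SLOTS, vals):
            assert 0 <= v < (1 << width)
            word |= v << pos
            pos += width
        assert pos == 322   # ~41 bytes per instruction, hence the note up top
        return word

    bundle = pack_bundle([1] * len(SLOTS))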
Scalar Unit:
  4Ki x 32b Scalar Memory
  32 x 32b Scalar Reg File
  2x ALUs
128x Vector Unit:
  32Ki x 32b Vector (Lane) Memory (8 ports)
  8x Sublane:
    32 x 32b Vector (Lane) Reg File
    2x ALUs (one float op each)
  Connections to matrix unit
2048 Vector ALUs total (128 lanes x 8 sublanes x 2 ALUs)
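Those sizes as numpy shapes, just to make the capacities concrete (the layout is a mental model, not the microarchitecture):

    import numpy as np

    smem  = np.zeros((4 * 1024,), np.uint32)       # 4Ki x 32b scalar memory (16 KiB)
    sregs = np.zeros((32,), np.uint32)             # 32 x 32b scalar reg file
    vmem  = np.zeros((128, 32 * 1024), np.uint32)  # 32Ki x 32b per lane x 128 lanes
    vregs = np.zeros((128, 8, 32), np.uint32)      # 32 x 32b per sublane, 8 sublanes per lane

    # 16 MiB of vector memory per core -- two cores lines up with "32MB on chip" for TPUv3
    print(vmem.nbytes // 2**20, vregs.nbytes // 2**10)  # 16 (MiB vmem), 128 (KiB of vregs)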
2x Matrix Multiply Unit:
  128 x 128 systolic array
  Streaming LHS and results
  Stationary RHS (w/ optional transpose)
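A tiny cycle-level sim of that dataflow, shrunk from 128x128 so it's readable (a sketch of the idea, not Google's design): weights sit still in the PEs, rows of the LHS enter from the left skewed one cycle per array row, partial sums march down, and finished dot products fall out the bottom.

    import numpy as np

    def systolic_matmul(A, W):
        # PE(k, n) holds stationary weight W[k, n]; activation latches `a`
        # shift right one PE per cycle, partial-sum latches `p` shift down
        M, K = A.shape
        _, N = W.shape
        a = np.zeros((K, N), np.float32)
        p = np.zeros((K, N), np.float32)
        out = np.zeros((M, N), np.float32)
        for t in range(M + K + N - 2):
            a_next = np.zeros_like(a)
            a_next[:, 1:] = a[:, :-1]              # stream right
            for k in range(K):
                if 0 <= t - k < M:
                    a_next[k, 0] = A[t - k, k]     # row k of A enters skewed by k cycles
            p_next = np.zeros_like(p)
            p_next[0, :] = a_next[0, :] * W[0, :]
            p_next[1:, :] = p[:-1, :] + a_next[1:, :] * W[1:, :]
            for n in range(N):
                m = t - (K - 1) - n                # results exit the bottom, skewed by column
                if 0 <= m < M:
                    out[m, n] = p_next[K - 1, n]
            a, p = a_next, p_next
        return out

    A, W = np.random.randn(4, 3).astype(np.float32), np.random.randn(3, 5).astype(np.float32)
    assert np.allclose(systolic_matmul(A, W), A @ W, atol=1e-5)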
Transpose / Permute Unit:
  Transpose, Reduction, Permutation
  allows reshuffling of data across vector lanes
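In numpy terms, with a vreg as a (sublane, lane) array, these are exactly the ops the per-lane ALUs can't do on their own (each lane only sees its own slice):

    import numpy as np

    vreg = np.arange(8 * 128, dtype=np.float32).reshape(8, 128)  # 8 sublanes x 128 lanes

    rotated  = np.roll(vreg, 1, axis=1)              # shift data across lanes
    reduced  = vreg.sum(axis=1, keepdims=True)       # cross-lane reduction
    permuted = vreg[:, np.random.permutation(128)]   # arbitrary lane permutation
    blockT   = vreg[:, :8].T                         # transpose a (sublane, lane) block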