221 Commits

Author SHA1 Message Date
chenyu
b34c637767 support bfloat16 for CL (#14073) 2026-01-08 14:14:29 -05:00
chenyu
3caa1e2c98 fix cast HALF with PYTHON backend (#14058) 2026-01-07 16:52:05 -05:00
chenyu
5f1ede7f7e clean up test_dtype (#14055)
use less lambda
2026-01-07 15:45:42 -05:00
George Hotz
7abf4591ba use bitsize on dtype (#14011)
* use bitsize on dtype [pr]

* bitsize

* bitsize in js export, but might be wrong

* reverts

* revert that
2026-01-04 12:16:21 -08:00
Jakob Sachs
ab2220b834 Handle missing bfloat16 natives on CPU architectures (#13553)
* CPU: fix compiler-rt libcall by adding intermediate casts for bfloat16

* fix lint

* remove old manual bypass of bf16 for CPU tests, and add diversion converstion from bf16 to/from fp16

---------

Co-authored-by: Jakob Sachs <jakobs99@purelymail.com>
2025-12-11 15:38:43 -05:00
Christopher Milan
0aabc1e938 Mesa NIR backend (NAK/LLVMpipe) (#12089)
* nak works

* TestOps::test_add works

* testop has no crashes

* fix bool casts

* fix typo

* add disassemble

* RANGE and locals/regs

* simplify NAKCompiler

* disass cleanup

* cleanup nir codegen

* almost all tests passing

* cleanup notes in extra/

* old notes

* only import nak if NIR=1

* fix new SPECIAL syntax

* fix local/shared memory

* more tests passing

* add DEFINE_VAR support

* llvmpipe kinda works

* diskcache

* some mypy stuff

* lvp passing test_ops.py

* fix imports

* actually fix imports

* remove 'stdout'

* fix llvm import

* fix mypy issues

* nicer errors

* simpler test_dtype skips

* test lvp in CI

* fix github action syntax

* fix more actions typos

* switch to mesa 25.1.0

* diskcache_put

* better generation for lvp nir_options

* b64encode shader blobs

* Revert diskcache changes

This reverts commits 930fa3de8a and 8428c694b3.

* general cleanup

* better error messages

* fix llvm import

* fix windows tests

* link with libm and libgcc_s

* fix some errors

* dont check for 'float4'

* NIR uses pointer arithmetic

* use tinymesa

* bump tinymesa

* bump tinymesa again

* update lvp nir_options

* print nir shader with DEBUG

* simplify LVPCompiler

* more tests

* "gated" STORE

* NAK is cacheable

* more tests

* all tests pass locally for NAK

* test autogen in CI

* autogen deps

* more deps

* fix uop_gc

* fix macos

* mypy

* save 2 lines

* save two more lines

* save 1 line

* save 4 lines

* save more lines

* Revert "save more lines"

This reverts commit dd3a720c5a.

* save more lines

* fix LVP on windows

* refactor

* reorganize some code

* refactor lib_gpu

* move LVP check

* out of order loads

* remove support.mesa

* bump tinymesa version

* simplify LVP jit

* macos

* macos ci

* shell: bash

* testing

* more testing

* compute brew prefix

* stupid typo

* actually fix

* lib

* stdout on macos

* inline gallivm_compile_module

* Revert "inline gallivm_compile_module"

This reverts commit b65983b151.

* elf macos

* semicolon

* inherit from CPULLVMCompiler

* ruff

* disas test

* fix libm linking

* default is fine actually

* arm works

* add elf loader link test

* fix NAK beam

* pylint is too smart by half

---------

Co-authored-by: George Hotz <72895+geohot@users.noreply.github.com>
Co-authored-by: nimlgen <138685161+nimlgen@users.noreply.github.com>
2025-10-15 17:38:33 +08:00
chenyu
9f2b69b870 enable few tests for PTX test_dtype (#12445) 2025-10-03 08:56:30 -04:00
b1tg
54c15d74a4 python float8 support (#11960)
* basic support

* alu

* nan in exec_alu

* rand_for_dtype

* inf + 0.0

* finfo

* revert rand_for_dtype

* clean

* truncate fp8s inf

* spec ok

* float_to_fp8 nan/inf

* least_upper_dtype

* clean up

---------

Co-authored-by: b1tg <b1tg@users.noreply.github.com>
2025-09-18 09:17:09 -04:00
chenyu
98ecab7563 remove ml_dtypes (#12169) 2025-09-14 14:20:05 -04:00
chenyu
0e266f376c ops_gpu -> ops_cl (#12103) 2025-09-10 15:15:48 -04:00
nimlgen
551560b87c do not use getenv('PTX') in tests (#12095)
* test without ptx

* fix tests

* fix test

* linters
2025-09-10 14:04:07 +03:00
b1tg
58d13a6e3e remove redundant check (#12087) 2025-09-09 15:15:39 -04:00
b1tg
82e955fe79 fix inf bug in float_to_fp8 (#12085) 2025-09-09 12:02:56 -04:00
chenyu
677220ae7e test_tesnor_data to unit/ (#12013) 2025-09-04 19:58:27 -04:00
b1tg
a9f07c31bc fix amd llvm sqrt (#11936)
* fix amd llvm sqrt

* lint

---------

Co-authored-by: b1tg <b1tg@users.noreply.github.com>
Co-authored-by: chenyu <chenyu@fastmail.com>
2025-09-01 09:31:14 -04:00
b1tg
c1eeb3b99c only skip AMD_LLVM (#11934)
Co-authored-by: b1tg <b1tg@users.noreply.github.com>
2025-08-31 18:15:47 +03:00
b1tg
75d380a77c fix transcendentals in python renderer (#11932)
* fix transcendentals in python renderer

* add test

---------

Co-authored-by: b1tg <b1tg@users.noreply.github.com>
2025-08-31 09:37:17 -04:00
b1tg
b2cc06218a python bfloat16 (#11912)
* python bf16

* _to_torch_storage_type

---------

Co-authored-by: b1tg <b1tg@users.noreply.github.com>
2025-08-29 15:18:02 -04:00
chenyu
f28f613f85 improved float_to_bf16 (#11848)
round instead of truncate
2025-08-26 11:14:06 -04:00
chenyu
e22e5da9a5 move some test_dtype tests to unit (#11479) 2025-08-02 15:25:00 -04:00
chenyu
a41140241b truncate unsigned const in cstyle (#11318)
it can be a warning or a hard error in clang

PTX and PYTHON also need fix, skipping for now
2025-07-22 08:02:12 -04:00
George Hotz
a493eb396c fix view add 0 (#10840) 2025-06-16 16:46:12 -07:00
chenyu
8c28b5d833 move dtype spec tests into unit test (#10808)
* move dtype spec tests into unit test

can clean up more after the split

* skip CI test_backward_sum_acc_dtype
2025-06-13 22:21:22 -04:00
Sieds Lykles
0daa4c6ed0 Add DType.min and DType.max properties (#10749)
* add properties

* cleaner test

* remove added newline
2025-06-10 08:31:34 -07:00
George Hotz
81b9c04574 move high level stuff to unit tests [pr] (#10708)
* move high level stuff to unit tests [pr]

* process replay on unit tests

* fix pr, less compute

* set omp num threads

* set 200MB buffer size limit

* delete junk

* fix tests

* faster

* move test_indexing to unit

* faster
2025-06-08 14:05:56 -07:00
George Hotz
32e9949052 rename lazydata to uop (#10698) 2025-06-08 08:42:22 -07:00
pkotzbach
dbbd755cba FP8s truncate (#9937)
* truncate fp8

* fix

* maybe like that?

* fix linters

* ruff

* move from extra and add ml_types to tests

* minor changes

* str to dtypes and nan support

---------

Co-authored-by: pkotzbach <pawkotz@gmail.com>
2025-04-22 19:12:49 -04:00
pkotzbach
5849c43382 FP8s part 1 (#9887)
* fp8s part 1

* prettier

* fixes

* fixes

* remove stuff that should be in next pr

* revert

* add creation

---------

Co-authored-by: pkotzbach <pawkotz@gmail.com>
2025-04-15 11:20:02 -04:00
chenyu
ce454793e6 support specifying dtype for Tensor.linear (#9886) 2025-04-14 13:55:11 -04:00
George Hotz
78caf55154 Revert "FP8 support on NVIDIA (#8631)"
This reverts commit 2c8e4ea865.
2025-04-09 12:27:41 +08:00
pkotzbach
2c8e4ea865 FP8 support on NVIDIA (#8631)
* squashed fp8 commits

* tensorcore start

* minor changes

* pre-commit

* pylint

* Delete fp8mul.cu

* clean

* small bugfix

* fix test_dtype

* fix test_dtype_alu

* add EMULATE_CUDA_SM89

* fix ci

* fix test_linearizer

* fix test_linearizer

* fix swizzle

* add debug to simple_matmul

* fixed swizzle

* python emulator

* refactor python emulator

* setup fix

* numpy setup

* ml_dtypes only in emulate_cuda_sm89

* fix pylint

* fix tests

* fix mypy

* fix mypy

* fix ruff

* done python emulator

* add acc type

* tests

* mypy

* clean code

* add cuda tensor core tests to CI

* minor fix

* clean test_dtype.py

* clean cstyle.py

* clean test_ops.py

* fix test

* fix test

* whitespaces

* pylint

* pylint

* amd?

* amd?

* amd

* reduce lines

* mockgpu remove

* fix

* ruff

* ruff

* fix mypy

* ruff

* test only for cuda

* fixed formatting

* small fixes

* small fix

* least_upper_dtype if fp8s not supported

* log and reciprocal are supported for fp8s

* ops python fixes

* dtypes.fp8s use

* e4m3 + e5m2 result dtype test

* truncate linter fix

---------

Co-authored-by: pkotzbach <pawkotz@gmail.com>
Co-authored-by: George Hotz <72895+geohot@users.noreply.github.com>
Co-authored-by: chenyu <chenyu@fastmail.com>
2025-04-08 21:54:04 -04:00
chenyu
79145e3d40 cleanup truncate_bf16 [pr] (#9725)
use torch bfloat16 for groundtruth in test. also a TODO for discrepancy
2025-04-03 05:43:49 -04:00
qazal
e26caf4c3a hotfix: skip test_mean_half_precision_underflow on amd ci (#9476)
The global size is very large (781250 gidx) and the emulated version takes more than 1
minute to execute the kernel.
2025-03-17 16:47:48 +08:00
chenyu
01e8b60911 acc_dtype -> dtype (#9402)
matched numpy and torch
2025-03-10 16:05:30 -04:00
chenyu
3ae66e59a3 least_upper_float is at least default_float (#9303)
* least_upper_float is at least default_float

en route for div rounding mode. dtype of true int division would change from int32 to default_float, which matches torch too.

* fix bert acc
2025-02-28 10:41:56 -05:00
qazal
cbfe95d306 bring cast before view back (#9230)
* bring cast before view back

* tune it to only trigger on expands

---------

Co-authored-by: chenyu <chenyu@fastmail.com>
2025-02-25 01:50:39 +02:00
Ahmed Harmouche
59fe45f947 Solve get_grouped_dims does not split issue (#9085)
* Solve dims too large errors on webgpu

* Simplify divisor find

* Test square root divisor

* Fix lint

* Refactor into group_dims and split_dims

* Refactor

* Fix lint

* Add back max check in _group_dims

* Prefer grouping over split

---------

Co-authored-by: chenyu <chenyu@fastmail.com>
2025-02-16 19:57:29 -05:00
b1tg
1f1362fd27 add truncate_bf16 (#9078)
Co-authored-by: b1tg <b1tg@users.noreply.github.com>
2025-02-15 07:59:09 +08:00
Ahmed Harmouche
916d5e7f08 WebGPU f16 support (f16 bounty part 2) (#8653)
* WebGPU f16 support

* Don't enable f16 yet

* dtype tests passing after bitcast fix

* Maybe all WebGPU green?

* Require shader-f16 in examples

* Minor wgsl touchup

* 1 line shorter

* Simpler

* Add transcendetal support

* log2 nan location mismatch on Vulkan

* Nan skips
2025-02-12 19:46:53 +08:00
George Hotz
c1c5227acb preserve size in dtype ptr [pr] (#8898) 2025-02-05 14:38:57 +08:00
George Hotz
c85737c200 assert to prepare for grad uop [pr] (#8280)
* assert to prepare for grad uop [pr]

* fix test_nn

* fix most of test_tensor

* few more tests

* fix multi

* uniform gradient

* acc_dtype

* any for multi

* fix typing

* fix assert, CAST_BEFORE_VIEW is still the issue

* explict test for CAST_BEFORE_VIEW

---------

Co-authored-by: qazal <77887910+Qazalin@users.noreply.github.com>
2025-01-14 13:26:56 -08:00
qazal
ed618a72e7 do not use subbuffer for bitcast (#8514)
* do not use subbuffer for bitcast

* edit that test

* explicit test for ptx

* ptx
2025-01-06 18:40:46 +02:00
George Hotz
6608ba316d add size of the buffer to the ptr dtype (#8322) 2024-12-18 12:46:35 -08:00
uuuvn
da2245a458 Fix double => half cast on clang (#8265) 2024-12-15 11:24:05 -08:00
pkotzbach
c1b79c118f add unit tests for to_dtype (#8217)
* add unit test for to_dtype

* add unit test for to_dtype

---------

Co-authored-by: pkotzbach <pawkotz@gmail.com>
2024-12-13 16:21:02 -05:00
chenyu
40a4c603b9 remove more test skip for webgpu [pr] (#8192) 2024-12-12 14:06:35 -05:00
Ahmed Harmouche
db330a3110 Remove WebGL (#8012) 2024-12-03 16:02:53 +01:00
George Hotz
d53cd92364 fix tests for delete lazy [pr] (#7980) 2024-12-02 12:00:48 +08:00
Ahmed Harmouche
10618aba98 Bring back WebGPU (#7063)
* Start from andredaprato:webgpu-clean

* Fix infs

* inf wgsl function is not needed

* Emulated ulong for threefry, more tests passing

* Randomness tests passing

* Update model export to support new changes in webgpu, efficientnet export works again

* Simplify shift emulation in wgsl

* Delete test file

* Fix bigger than u32 u32 literal

* Why was skip copies added here?

* Python3.12 for webgpu tests

* Fix model export syntax error

* Get test ops passing with some skips

* Fix lint

* Much simpler shift

* Run more tests

* Timestamp queries are not supported in CI, so skip search tests

* All fancy indexing passing

* r is ctx

* Run more dtype tests by using is_dtype_supported

* Cleanup ulong shift rendering

* UPat -> Pat, UOps -> Ops

* Pat -> UPat

* Refactor render_ushift if-else

* Pattern to avoid ulong mul

* Remove vals_dtype

* is_nan trick + rewrite, test_isnan passing

* Rewrite a * select(1, nan, gate) -> select(a, nan, gate)

* No arg, just op

* Support char, uchar, short, ushort

* Run test_index_mnis now that we have uint8

* Fix pyling

* Save 3 lines by using base Compiler

* No more long emulation

* Remove fixup_binops

* No more external_local_bufx wgsl specific cstyle modif, use base extra_pm

* Simpler, faster copyin/out

* Skip some new tests that use long

* Fix typo

* copyout touchup

* Save lines by using render_cast

* WebGL is not supported in core, delete it from is_dtype_supported

* More narrow test skips for some unary tests

* TernaryOps, UnaryOps -> Ops

* TinyGrad supports WebGPU

* StableDiffusion demo: f16tof32 gpu is a lib, update UI

* Packed load/store, no more scale_size, no core tinygrad changes

* Rename copyin, copyout

* Device -> dev

* Fix lint

* Pattern matcher rule for packed load/store

* Refactor

* Shorter packed load/store

* this should fix lint

* Fix mypy

* SD compile script working

* New SD webgpu UI

* New default prompt

* New SD weights

* Fix title when webgpu not available

* Run symbolic tests, simplify is_nan, use round_up

* Show step time on UI

* Bump minimum wgpu version to v0.19

* Fix latent

---------

Co-authored-by: George Hotz <72895+geohot@users.noreply.github.com>
2024-11-26 12:26:40 +08:00
chenyu
40d7535eeb clean up DTYPES_DICT [pr] (#7845) 2024-11-22 10:01:34 -05:00