Claude Code Guide for tinygrad
Architecture Overview
tinygrad compiles tensor operations into optimized kernels. The pipeline:
- Tensor (`tensor.py`) - User-facing API, creates the UOp graph
- UOp (`uop/ops.py`) - Unified IR for all operations (both tensor and kernel level)
- Schedule (`engine/schedule.py`, `schedule/`) - Converts tensor UOps to kernel UOps
- Codegen (`codegen/`) - Converts kernel UOps to device code
- Runtime (`runtime/`) - Device-specific execution
Key Concepts
UOp (Universal Operation)
Everything is a UOp - tensors, operations, buffers, kernels. Key properties:
- `op`: The operation type (`Ops` enum)
- `dtype`: Data type
- `src`: Tuple of source UOps
- `arg`: Operation-specific argument
- `tag`: Optional tag for graph transformations
UOps are immutable and cached - creating the same UOp twice returns the same object (ucache).
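The caching behavior can be illustrated with a minimal hash-consing sketch (`make_uop` and `_ucache` here are hypothetical stand-ins, not tinygrad's real internals):

```python
# Toy hash-consing: constructing a node with identical fields returns the
# cached object, so identity (`is`) comparison works for structural equality.
_ucache = {}

def make_uop(op, dtype, src=(), arg=None):
  key = (op, dtype, src, arg)
  if key not in _ucache:
    _ucache[key] = ("UOp", op, dtype, src, arg)  # stand-in for an immutable UOp
  return _ucache[key]

a = make_uop("ADD", "int32", ("x", "y"))
b = make_uop("ADD", "int32", ("x", "y"))
assert a is b  # identical structure -> the same cached object
```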
PatternMatcher
Used extensively for graph transformations:
```python
pm = PatternMatcher([
  (UPat(Ops.ADD, src=(UPat.cvar("x"), UPat.cvar("x"))), lambda x: x * 2),
])
result = graph_rewrite(uop, pm)
```
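As a rough mental model (toy dicts, not the real `UPat`/`PatternMatcher` classes), matching binds named sub-patterns, and the first rule that matches has its callback build the replacement:

```python
# Toy pattern matcher for intuition only; dicts stand in for UOps/UPats.
def match(pattern, node, binds):
  if pattern["op"] != node["op"]: return None
  name = pattern.get("name")
  if name is not None:
    if name in binds and binds[name] is not node: return None  # same name must bind the same node
    binds[name] = node
  for p, n in zip(pattern.get("src", ()), node.get("src", ())):
    if match(p, n, binds) is None: return None
  return binds

def rewrite_once(rules, node):
  for pat, fn in rules:
    binds = match(pat, node, {})
    if binds is not None: return fn(**binds)
  return node  # no rule matched

x = {"op": "CONST", "src": ()}
add = {"op": "ADD", "src": (x, x)}
# x + x -> x * 2, analogous to the PatternMatcher example above
rules = [({"op": "ADD", "src": ({"op": "CONST", "name": "x"}, {"op": "CONST", "name": "x"})},
          lambda x: {"op": "MUL", "src": (x,), "arg": 2})]
out = rewrite_once(rules, add)
assert out["op"] == "MUL" and out["src"][0] is x
```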
Schedule Cache
Schedules are cached by graph structure. BIND nodes (variables with bound values) are unbound before cache key computation so different values hit the same cache.
Directory Structure
tinygrad/
├── tensor.py # Tensor class, user API
├── device.py # Buffer, device management
├── dtype.py # Data types
├── helpers.py # Utilities, environment vars
├── uop/
│ ├── ops.py # UOp class, Ops enum, PatternMatcher
│ ├── spec.py # UOp type verification
│ └── symbolic.py # Symbolic math simplification
├── engine/
│ ├── schedule.py # Schedule creation, caching
│ ├── realize.py # Tensor realization
│ ├── jit.py # JIT compilation
│ └── memory.py # Memory planning
├── schedule/
│ ├── rangeify.py # Convert movements to ranges
│ └── indexing.py # Index calculations
├── codegen/
│ ├── kernel.py # Kernel optimization
│ └── uopgraph.py # UOp graph transformations
├── renderer/ # Code generation (CUDA, Metal, etc.)
└── runtime/ # Device backends
Testing
```shell
# Run a specific test
python -m pytest test/unit/test_schedule_cache.py -xvs

# Run with a timeout
python -m pytest test/test_symbolic_ops.py -x --timeout=60

# Debug with prints
DEBUG=2 python -m pytest test/test_schedule.py::test_name -xvs

# Visualize UOp graphs
VIZ=1 python -c "from tinygrad import Tensor; Tensor.ones(10).sum().realize()"
```
Common Environment Variables
- `DEBUG=1-4` - Increasing verbosity
- `VIZ=1` - Enable graph visualization
- `SPEC=1` - Enable UOp spec verification
- `NOOPT=1` - Disable optimizations
- `DEVICE=CPU/CUDA/AMD/METAL` - Set the default device
Debugging Tips
- Print UOp graphs: `print(tensor.uop)` or `print(tensor.uop.sink())`
- Check the schedule: `tensor.schedule()` returns a list of ScheduleItems
- Trace graph rewrites: use `VIZ=1` or add prints in PatternMatcher callbacks
- Find UOps by type: `[u for u in uop.toposort() if u.op is Ops.SOMETHING]`
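For intuition, `toposort()` behaves like a deduplicated post-order walk; a toy version over hypothetical `(op, src)` tuples (not real UOps):

```python
# Toy post-order toposort over (op, src) tuples; each shared node appears once.
def toposort(node, seen=None, out=None):
  if seen is None: seen, out = set(), []
  if id(node) in seen: return out
  seen.add(id(node))
  for s in node[1]: toposort(s, seen, out)
  out.append(node)  # children first, then the node itself
  return out

c = ("CONST", ())
g = ("ADD", (c, ("MUL", (c, c))))
order = toposort(g)
# the shared node c is visited once, and parents come after their children
assert [n[0] for n in order] == ["CONST", "MUL", "ADD"]
```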
Workflow Rules
- NEVER commit without explicit user approval - always show the diff and wait for approval
- NEVER amend commits - always create a new commit instead
- Run `pre-commit run --all-files` before committing to catch linting/type errors
- Run tests before proposing commits
- Test with `SPEC=2` when modifying UOp-related code
Style Notes
- 2-space indentation, 150 char line limit
- PatternMatchers should be defined at module level (slow to construct)
- Prefer `graph_rewrite` over manual graph traversal
- UOp methods like `.replace()` preserve tags unless explicitly changed
- Use `.rtag(value)` to add tags to UOps
Lessons Learned
UOp ucache Behavior
UOps are cached by their contents - creating a UOp with identical (op, dtype, src, arg) returns the same object. This means:
- `uop.replace(tag=None)` on a tagged UOp returns the original untagged UOp if it exists in the cache
- Two UOps with the same structure are identical (`is` comparison works)
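The `.replace(tag=None)` round-trip follows from interning; a toy sketch (hypothetical `intern`, not the real ucache):

```python
# Toy interning: "replacing" a field with values that already exist in the
# cache hands back the existing object rather than a fresh copy.
_cache = {}

def intern(op, tag=None):
  return _cache.setdefault((op, tag), (op, tag))

plain = intern("CONST")
tagged = intern("CONST", tag="marked")
assert tagged is not plain
# the analogue of uop.replace(tag=None): interning the untagged form again
assert intern("CONST", tag=None) is plain
```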
Spec Validation
When adding new UOp patterns, update `tinygrad/uop/spec.py`. Test with:

```shell
SPEC=2 python3 test/unit/test_something.py
```

Spec issues appear as `RuntimeError: SPEC ISSUE None: UOp(...)`.
Schedule Cache Key Normalization
The schedule cache strips values from BIND nodes so different bound values (e.g., KV cache positions) hit the same cache entry:
- `pm_pre_sched_cache`: BIND(DEFINE_VAR, CONST) → BIND(DEFINE_VAR) for the cache key
- `pm_post_sched_cache`: restores the original BIND from context
- When accessing `bind.src[1]`, check `len(bind.src) > 1` first (might be stripped)
- Extract var_vals from the `input_buffers` dict after graph_rewrite (avoids an extra toposort)
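The normalization can be sketched over toy `(op, src)` tuples (`strip_bind` is a hypothetical helper, not the real `pm_pre_sched_cache` pattern):

```python
# Toy cache-key normalization: drop the bound CONST value from BIND nodes so
# graphs that differ only in variable values share a cache key.
def strip_bind(node):
  op, src = node
  if op == "BIND" and len(src) > 1:
    return ("BIND", src[:1])  # keep DEFINE_VAR, drop the CONST value
  return (op, tuple(strip_bind(s) if isinstance(s, tuple) else s for s in src))

g1 = ("ADD", (("BIND", (("DEFINE_VAR", ("i",)), ("CONST", (3,)))), ("CONST", (1,))))
g2 = ("ADD", (("BIND", (("DEFINE_VAR", ("i",)), ("CONST", (7,)))), ("CONST", (1,))))
assert strip_bind(g1) == strip_bind(g2)  # different bound values, same cache key
```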
Avoiding Extra Work
- Use ctx dict from graph_rewrite to collect info during traversal instead of separate toposort
- Only extract var_vals when schedule is non-empty (no kernels = no vars needed)
- PatternMatchers are slow to construct - define at module level, not in functions
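The ctx-dict idea in miniature (made-up helper names, toy `(op, src)` tuples):

```python
# Collect information in a ctx dict during the rewrite traversal itself,
# instead of running a separate toposort pass afterwards.
def rewrite(node, ctx):
  op, src = node
  new_src = tuple(rewrite(s, ctx) for s in src)
  if op == "VAR":
    ctx.setdefault("vars", []).append(node)  # record while we're already here
  return (op, new_src)

ctx = {}
rewrite(("ADD", (("VAR", ()), ("CONST", ()))), ctx)
assert len(ctx["vars"]) == 1  # one traversal, info collected as a side effect
```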
Testing LLM Changes
```shell
# Quick smoke test
echo "Hello" | DEBUG=1 python tinygrad/apps/llm.py --model "llama3.2:1b"

# Check cache hits (should see "cache hit" after warmup)
echo "Hello world" | DEBUG=1 python tinygrad/apps/llm.py --model "llama3.2:1b" 2>&1 | grep cache

# Test with beam search
echo "Hello" | BEAM=2 python tinygrad/apps/llm.py --model "llama3.2:1b"
```
Common Patterns
Graph Transformation
```python
def my_transform(ctx, x):
  # return a new UOp, or None to leave the node unchanged
  return x.replace(arg=new_arg)

pm = PatternMatcher([
  (UPat(Ops.SOMETHING, name="x"), my_transform),
])
result = graph_rewrite(input_uop, pm, ctx={})
```
Finding Variables
```python
# Get all variables in a UOp graph
variables = uop.variables()

# Get a bound variable's value
var, val = bind_uop.unbind()
```
Shape Handling
```python
# Shapes can be symbolic (contain UOps)
shape = tensor.shape  # tuple[sint, ...] where sint = int | UOp
```
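An op can tolerate symbolic sizes as long as the dimension it actually inspects is concrete; a sketch of the idea behind `split`/`chunk` (`Sym` and `split_sizes` are hypothetical, not tinygrad API):

```python
# Only the inspected dimension must be concrete; other dims may stay symbolic.
class Sym:  # stand-in for a symbolic (UOp) size
  def __init__(self, name): self.name = name

def split_sizes(shape, dim, split_size):
  dim = dim % len(shape)   # resolve negative dims before indexing
  dim_sz = shape[dim]      # read the size only after dim is resolved
  assert isinstance(dim_sz, int), "split dimension must be concrete"
  full, rem = divmod(dim_sz, split_size)
  return [split_size] * full + ([rem] if rem else [])

# a symbolic batch dim is fine because only shape[1] is inspected
assert split_sizes((Sym("b"), 10), dim=1, split_size=4) == [4, 4, 2]
assert split_sizes((Sym("b"), 10), dim=-1, split_size=5) == [5, 5]
```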
Performance Optimization
When optimizing tinygrad internals:
- **Measure wall time, not just call counts** - Reducing `graph_rewrite` calls doesn't always improve wall time. The overhead of conditional checks can exceed the cost of the operation being skipped.
- **Profile each optimization individually** - Run benchmarks with and without each change to measure actual impact. Use `test/external/external_benchmark_schedule.py` for schedule/rewrite timing.
- **Early exits in hot paths are effective** - Simple checks like `if self.op is Ops.CONST: return self` in `simplify()` can eliminate many unnecessary `graph_rewrite` calls.
- **`graph_rewrite` is expensive** - Each call has overhead even for small graphs. Avoid calling it when the result is trivially known (e.g., simplifying a CONST returns itself).
- **Beware iterator overhead** - Checks like `all(x.op is Ops.CONST for x in self.src)` can be slower than just running the operation, especially for small sequences.
- **Verify cache hit rates before adding/keeping caches** - Measure actual hit rates with real workloads. A cache with a 0% hit rate is pure overhead (e.g., `pm_cache` was removed because the algorithm guarantees each UOp is only passed to `pm_rewrite` once).
- **Use `TRACK_MATCH_STATS=2` to profile pattern matching** - This shows match rates and time per pattern. Look for patterns with a 0% match rate that still cost significant time - these are pure overhead for that workload.
- **Cached properties beat manual traversal** - `backward_slice` uses `@functools.cached_property`. A DFS with early exit sounds faster but is actually slower because it doesn't benefit from caching. The cache-hit benefit often outweighs algorithmic improvements.
- **Avoid creating intermediate objects in hot paths** - For example, `any(x.op in ops for x in self.backward_slice)` is faster than `any(x.op in ops for x in {self: None, **self.backward_slice})` because it avoids dict creation.
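The first point can be made concrete with a tiny wall-clock harness (a sketch using `time.perf_counter`; the workload is made up):

```python
import time

# Minimal wall-time harness: compare implementations by elapsed time,
# not by how many calls they avoid (toy workload, illustrative only).
def timed(fn, n=10000):
  t0 = time.perf_counter()
  for _ in range(n): fn()
  return time.perf_counter() - t0

nums = list(range(8))
guarded = lambda: all(isinstance(x, int) for x in nums) and sum(nums)
plain = lambda: sum(nums)

t_guard, t_plain = timed(guarded), timed(plain)
# the guard does strictly more work here: skipping "unnecessary" work
# via a check can cost more than the work itself
assert t_guard > 0 and t_plain > 0
```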
Pattern Matching Profiling
Use TRACK_MATCH_STATS=2 to identify expensive patterns:
```shell
TRACK_MATCH_STATS=2 PYTHONPATH="." python3 test/external/external_benchmark_schedule.py
```
Output format: `matches / attempts -- match_time / total_time ms -- location`
Key patterns to watch (from ResNet50 benchmark):
- `split_load_store`: ~146ms, 31% match rate - does real work
- `simplify_valid`: ~75ms, 0% match rate in this workload - checks AND ops for INDEX in the backward slice
- `vmin == vmax` folding: ~55ms, 0.33% match rate - checks 52K ops but rarely matches
Patterns with 0% match rate are workload-specific overhead. They may be useful in other workloads, so don't remove them without understanding their purpose.