openpilot kernel fix from 209 to 207 (#2006)

* Fix openpilot kernel from 209 to 206

1. Use push_movement_ops conditions in _movement_op. Don't push
PAD unless the ops are safe to be pushed with PAD

2. Don't push if all the op.buffers are realized

* change ALLOWED_KERNEL_COUNT to 206 for openpilot

* don't push through sourceless buffers

* change the tests to adjust kernel counts for new behaviour

* restore pushing of movement ops through childless buffer

* don't push EXPAND, causes OOM

* allow push of intermediate movement ops

* adding new test behaviour

* modifying external_test_opt for new behaviour

* restore old tests

* Reenable push of EXPAND and introduce new tests

I was wrong initially in thinking EXPAND can cause OOM, and hence I had
disabled it. Since it is 0 stride and doesn't allocate memory, it's fine

* Don't push EXPAND above LoadOps LB. This is causing OOM

* Push should be decided on movement root of bufs

To check if ast.op.buffers is sourceless/realized, go to the movement
root and then decide if pushing should be done or not

* refactor for readability

* use .base instead

* don't push expand, bad memory/compute consumption

* restrict push of reshape, seeing improvement

* push reshape if unary without further check

* disable PAD solves convnext kernel count increase

* reenable test_cache_binaryop_transpose

* small nit
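
The "0 stride" remark about EXPAND above can be illustrated with a minimal sketch. The `View` class here is hypothetical and is not tinygrad's actual implementation; it only assumes a flat buffer addressed through per-dimension strides. It shows why an expanded view allocates nothing by itself. It does not model the extra compute or downstream memory that later bullets cite as the reason to stop pushing EXPAND.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class View:
    """Hypothetical strided view over a flat buffer (illustration only)."""
    data: List[float]
    shape: Tuple[int, ...]
    strides: Tuple[int, ...]

    def expand(self, new_shape: Tuple[int, ...]) -> "View":
        # A size-1 dimension that grows gets stride 0, so every index along
        # it maps to the same element. The underlying buffer is shared.
        if len(new_shape) != len(self.shape):
            raise ValueError("expand must keep the number of dimensions")
        strides = []
        for old, new, stride in zip(self.shape, new_shape, self.strides):
            if old == new:
                strides.append(stride)
            elif old == 1:
                strides.append(0)
            else:
                raise ValueError(f"cannot expand dimension {old} to {new}")
        return View(self.data, tuple(new_shape), tuple(strides))

    def __getitem__(self, idx: Tuple[int, ...]) -> float:
        # Flat offset is the dot product of the index and the strides.
        offset = sum(i * s for i, s in zip(idx, self.strides))
        return self.data[offset]


row = View([1.0, 2.0, 3.0], shape=(1, 3), strides=(3, 1))
big = row.expand((1000, 3))  # logically 3000 elements, still 3 stored
```

In this sketch `big.strides` is `(0, 1)` and `big.data` is the same list object as `row.data`, so the expansion is free until a kernel has to materialize the result.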
Amrit Sahu
2023-10-14 00:29:15 +05:30
committed by GitHub
parent 90c777d815
commit 63869c62fc
3 changed files with 11 additions and 7 deletions

@@ -64,7 +64,7 @@ class TestInferenceMinKernels(unittest.TestCase):
for p in get_parameters(model): p.assign(np.zeros(p.shape, dtype=p.dtype.np))
img = Tensor.randn(1, 3, 224, 224)
# TODO: this seems very high
-with CLCache(116):
+with CLCache(115):
model.forward(img).realize()
def test_resnet(self):
@@ -78,7 +78,7 @@ class TestInferenceMinKernels(unittest.TestCase):
model = ViT(embed_dim=192, num_heads=3)
for p in get_parameters(model): p.assign(np.zeros(p.shape, dtype=p.dtype.np))
img = Tensor.randn(1, 3, 224, 224)
-with CLCache(223): # NOTE: this is way too high
+with CLCache(222): # NOTE: this is way too high
out = model.forward(img)
assert len(CacheCollector.cache) == 0, "ViT prerealized?"
out.realize()
@@ -88,7 +88,7 @@ class TestInferenceMinKernels(unittest.TestCase):
args_tiny = {"dim": 512, "multiple_of": 256, "n_heads": 8, "n_layers": 4, "norm_eps": 1e-05, "vocab_size": 1000}
model = Transformer(**args_tiny)
for p in get_parameters(model): p.assign(np.zeros(p.shape, dtype=p.dtype.np))
-with CLCache(94):
+with CLCache(85):
model(Tensor([[1,2,3,4]]), 0).realize()
@unittest.skipUnless(Device.DEFAULT == "GPU", "Not Implemented")