Match Torch speed for sum reduction on M1 (#1187)

* Add additional kernel when reducing multiple dimensions at once.

* Faster for smaller inputs

* Whitespace and naming

* Cleaner, guard for Metal only, and max 1 split rather than N

* Draft of different approach

* One additional kernel call for this test (as expected); see the sketch below.
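
The change appears to split a large sum into two kernel launches on Metal: a first pass produces partial sums per chunk, and a second, much smaller pass reduces those partials, which is why the test further down now expects one extra cached kernel (115 -> 116). The following is a minimal NumPy sketch of that idea under those assumptions, not tinygrad's actual kernel code; split_sum and its splits parameter are hypothetical names used only for illustration.

import numpy as np

def split_sum(x: np.ndarray, splits: int = 256) -> np.floating:
  """Illustrative only: sum x in two passes, mimicking a split reduction."""
  flat = x.reshape(-1).astype(np.float32)
  # Pad with zeros so the buffer divides evenly into `splits` chunks.
  flat = np.pad(flat, (0, (-flat.size) % splits))
  # Pass 1: one partial sum per chunk (conceptually, the first kernel launch).
  partials = flat.reshape(splits, -1).sum(axis=1)
  # Pass 2: reduce the small buffer of partials (the extra, second kernel).
  return partials.sum()

x = np.random.randn(1024, 1024).astype(np.float32)
assert np.isclose(split_sum(x), x.sum(), rtol=1e-4, atol=1e-2)

Capping the work at a single split (rather than N splits) keeps the overhead at exactly one extra small kernel per reduction, which is the trade-off the commit message describes for matching Torch on M1.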

Author: Alexander Edwards
Date: 2023-07-20 02:18:58 +10:00
Committed by: GitHub
parent fde9f0e60d
commit 59af9b81c5
2 changed files with 15 additions and 2 deletions

@@ -63,7 +63,7 @@ class TestInferenceMinKernels(unittest.TestCase):
     for p in get_parameters(model): p.assign(np.zeros(p.shape, dtype=p.dtype.np))
     img = Tensor.randn(1, 3, 224, 224)
     # TODO: this seems very high
-    with CLCache(115):
+    with CLCache(116):
       model.forward(img).realize()
   def test_resnet(self):