* Add additional kernel when reducing multiple dimensions at once
* Faster for smaller inputs
* Whitespace and naming
* Cleaner code, guard for Metal only, and at most 1 split rather than N
* Draft of different approach
* One additional kernel call for this test (as expected)
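
For readers unfamiliar with the idea, here is a minimal sketch of what splitting a multi-dimension reduction into two passes could look like. It is an illustration only, not the code from this change: the function names are hypothetical, plain Python lists stand in for tensors, and each function call stands in for one kernel launch. It assumes the "additional kernel" is a second pass over partial results.

```python
def reduce_all_at_once(x):
    # Single pass: sum over both axes of a 2-D nested list in one go.
    return sum(v for row in x for v in row)


def reduce_in_two_steps(x):
    # First pass: reduce the inner axis, producing one partial sum per row.
    partial = [sum(row) for row in x]
    # Second pass (standing in for the additional kernel): reduce the partial sums.
    return sum(partial)


x = [[1, 2, 3], [4, 5, 6]]
# Both strategies must agree on the result; only the number of passes differs.
assert reduce_all_at_once(x) == reduce_in_two_steps(x)
```

The split version does one extra pass, which matches the note above that the test sees one additional kernel call. Whether that extra pass is worthwhile depends on input size and backend.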