* clean up opt
* don't let global kernels get too small
* 8192 -> 1024
* disable local shape for clang
* fix can_merge
* unroll the 5x5 depthwise convs in op
* load float4 check
* remove reduceopop
* not float4 yet
* float4 acc works
* group_float4 on store