- Remove the 16-bit limitation on base_log.
- Optimize shared memory use: the dedicated decomposition buffers are no longer used; in the amortized PBS, the rotated buffers are reused as the state buffer for the decomposition.
- Add a private test for the CUDA PBS, as we already have for the FFT backend.
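
For reviewers, the idea behind the first two points can be sketched on the host side as below. This is a minimal illustration, not the kernel code from this change: the names `next_digit`, `decompose_in_place` and `recompose` are hypothetical, digits are assumed to be stored in a 64-bit signed type so that a base_log above 16 is representable, and the vector of digits is returned only so the result can be checked. The point is that the buffer holding the input coefficients doubles as the running decomposition state, so no separate decomposition buffer is required.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helper: extracts the next signed digit from `state` and
// updates `state` in place. Digits lie in [-B/2, B/2) with B = 2^base_log.
// Assumes 0 < base_log < 64.
inline int64_t next_digit(uint64_t &state, uint32_t base_log) {
    const uint64_t base = uint64_t{1} << base_log;
    int64_t digit = static_cast<int64_t>(state & (base - 1));
    state >>= base_log;
    if (digit >= static_cast<int64_t>(base >> 1)) {
        digit -= static_cast<int64_t>(base); // move digit to the signed range
        state += 1;                          // propagate the carry
    }
    return digit;
}

// Decomposes every coefficient of `buffer`, least significant level first.
// `buffer` itself is overwritten with the running state at each level.
// Digits are laid out level-major: digits[level * n + i].
std::vector<int64_t> decompose_in_place(std::vector<uint64_t> &buffer,
                                        uint32_t base_log,
                                        uint32_t level_count) {
    const std::size_t n = buffer.size();
    std::vector<int64_t> digits(n * level_count);
    for (uint32_t level = 0; level < level_count; ++level) {
        for (std::size_t i = 0; i < n; ++i) {
            digits[level * n + i] = next_digit(buffer[i], base_log);
        }
    }
    return digits;
}

// Rebuilds coefficient `i` from its digits, modulo 2^(base_log * level_count).
uint64_t recompose(const std::vector<int64_t> &digits, std::size_t i,
                   std::size_t n, uint32_t base_log, uint32_t level_count) {
    uint64_t sum = 0;
    for (uint32_t level = 0; level < level_count; ++level) {
        // Unsigned conversion wraps modulo 2^64, which preserves the result
        // modulo any smaller power of two.
        sum += static_cast<uint64_t>(digits[level * n + i])
               << (level * base_log);
    }
    const uint32_t total_bits = base_log * level_count;
    const uint64_t mask =
        total_bits >= 64 ? ~uint64_t{0} : (uint64_t{1} << total_bits) - 1;
    return sum & mask;
}
```

Calling `decompose_in_place(buf, 20, 2)` on coefficients below 2^40 yields two digits per coefficient, each in [-2^19, 2^19), and `recompose` returns the original coefficient for each index.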