This fixes a minor bug: with DEBUG>=1, the log said "remote has device" followed by an array of renderer properties instead of a real device name, which doesn't make sense:
```
127.0.0.1 - - [24/Apr/2025 16:50:44] "GET /properties HTTP/1.1" 200 -
remote has device ['tinygrad.renderer.cstyle', 'MetalRenderer', []]
opened device CLOUD from pid:20210
```
Now it actually prints the name of the device behind CLOUD:
```
127.0.0.1 - - [24/Apr/2025 16:56:29] "GET /properties HTTP/1.1" 200 -
remote has device METAL
opened device CLOUD from pid:20315
```
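A minimal sketch of how a device name could be derived from the serialized renderer spec shown in the old log. This assumes the spec is a `[module_path, class_name, args]` triple; the helper name is hypothetical, not the actual tinygrad code:

```python
def device_name_from_renderer(spec):
    """Derive a display name from a serialized renderer spec.

    spec is assumed to be a [module_path, class_name, args] triple,
    e.g. ['tinygrad.renderer.cstyle', 'MetalRenderer', []].
    """
    _module_path, class_name, _args = spec
    # strip the "Renderer" suffix and upper-case: MetalRenderer -> METAL
    return class_name.removesuffix("Renderer").upper()

print(device_name_from_renderer(['tinygrad.renderer.cstyle', 'MetalRenderer', []]))  # METAL
```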
`BS=96 BASEDIR="/raid/datasets/openimages" MODEL=retinanet python examples/mlperf/model_eval.py`
```
...
loaded dataset @ 8.64s
loaded initial data @ 12.57s
****** 619.97 ms to enqueue, 46042.13 ms to realize ( 116.22 ms fetching, 45399.58 ms postprocess_detections). 0.09 examples/sec. 0.83 TFLOPS @ 59.23s
****** 147.49 ms to enqueue, 37362.16 ms to realize ( 146.96 ms fetching, 36618.84 ms postprocess_detections). 0.11 examples/sec. 1.03 TFLOPS @ 96.74s
****** 152.85 ms to enqueue, 37244.08 ms to realize ( 120.67 ms fetching, 36235.19 ms postprocess_detections). 0.11 examples/sec. 1.04 TFLOPS @ 134.14s
****** 146.39 ms to enqueue, 37279.85 ms to realize ( 65.07 ms fetching, 36233.56 ms postprocess_detections). 0.11 examples/sec. 1.04 TFLOPS @ 171.56s
****** 152.41 ms to enqueue, 37264.04 ms to realize ( 127.08 ms fetching, 36196.10 ms postprocess_detections). 0.11 examples/sec. 1.04 TFLOPS @ 208.98s
****** 151.29 ms to enqueue, 36868.08 ms to realize ( 142.73 ms fetching, 36153.07 ms postprocess_detections). 0.11 examples/sec. 1.05 TFLOPS @ 246.00s
****** 136.41 ms to enqueue, 37325.04 ms to realize ( 90.29 ms fetching, 36573.38 ms postprocess_detections). 0.11 examples/sec. 1.04 TFLOPS @ 283.46s
```
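The steady-state lines above show that `postprocess_detections` dominates realize time. Using the numbers from one logged step:

```python
# share of realize time spent in postprocess_detections,
# taken from one steady-state log line above
realize_ms = 37264.04
postprocess_ms = 36196.10
share = postprocess_ms / realize_ms
print(f"postprocess_detections is {share:.1%} of realize time")  # ~97%
```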
`out.cast(it.dtype.fmt).tolist()` fails with `ValueError: memoryview: destination format must be a native single character format prefixed with an optional '@'`
* truncate fp8
* fix
* maybe like that?
* fix linters
* ruff
* move from extra and add ml_types to tests
* minor changes
* str to dtypes and nan support
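As an illustration of what "truncate fp8" can mean (a sketch of the general e5m2 technique, not the PR's actual code): fp8 e5m2 shares float16's 5-bit exponent, so a truncating conversion can simply keep the high byte of the float16 bit pattern:

```python
import struct

def truncate_to_e5m2(x: float) -> int:
    # fp8 e5m2 has the same 5-bit exponent as float16, so truncation
    # (round-toward-zero; real kernels usually round-to-nearest) just
    # keeps the high byte of the float16 bit pattern
    h = struct.unpack('<H', struct.pack('<e', x))[0]
    return h >> 8

print(hex(truncate_to_e5m2(1.0)))   # fp16 1.0 is 0x3C00 -> 0x3c
```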
---------
Co-authored-by: pkotzbach <pawkotz@gmail.com>
Eval goes from 35 sec to 20 sec: it was spending 13 seconds assembling the output tensor on the CPU backend. GPUS[0] seems to have enough memory; otherwise we can lower EVAL_BS.