Enzo Di Maria
ca2a79f1fb
refactor(gpu): Threshold for multi-GPU with Classical PBS
2025-12-18 09:27:09 +01:00
Agnes Leroy
cfa53682ae
fix(gpu): add missing sync before free in oprf
2025-12-16 09:42:11 +01:00
Agnes Leroy
006d6cc300
fix(gpu): fix some cpu memory leaks
2025-12-15 14:27:35 -03:00
Guillermo Oyarzun
11579bd3d0
feat(gpu): create noise and pfail tests for pbs + ks + ms
2025-12-12 09:41:11 +01:00
Enzo Di Maria
5273f61593
refactor(gpu): creating InternalCudaStreams to improve the management of multiple streams per GPU
2025-12-05 10:13:18 +01:00
Enzo Di Maria
a96d68323b
fix(gpu): No broadcast is needed because full prop is done on 1 single GPU
2025-12-04 09:17:40 +01:00
Enzo Di Maria
6752249f7f
refactor(gpu): moving vector_comparisons's functions to the backend
2025-12-03 09:11:00 +01:00
Enzo Di Maria
cc33161b97
refactor(gpu): moving cast_to_signed to the backend
2025-12-02 12:40:47 +01:00
Enzo Di Maria
3cfbaa40c3
refactor(gpu): unchecked_index_of_clear to backend
2025-11-28 14:34:53 +01:00
Andrei Stoian
e2063c8ef4
chore(gpu): bench KS latency batches
2025-11-27 17:32:44 +01:00
Enzo Di Maria
0aa0918fea
refactor(gpu): vector_find's functions to backend
2025-11-27 13:36:10 +01:00
Enzo Di Maria
b6fffa3d86
fix(gpu): wrong number of blocks in tmp_match_result
2025-11-25 10:50:30 +01:00
Enzo Di Maria
32b1a7ab1d
refactor(gpu): unchecked_match_value_or to backend
2025-11-24 13:50:15 +01:00
Guillermo Oyarzun
02312e23ea
fix(gpu): fix pbs128 selection for small num samples
2025-11-21 10:57:14 +01:00
Enzo Di Maria
c6709a82c0
refactor(gpu): match_value to backend with multiple streams
2025-11-20 17:11:15 +01:00
Beka Barbakadze
80cacbd079
feat(gpu): add boolean bitops in cuda backend
2025-11-20 14:56:21 +01:00
Agnes Leroy
4f9f4982f6
fix(gpu): fix memory leak in rerand
2025-11-14 14:00:01 +01:00
Enzo Di Maria
026cc376ed
refactor(gpu): multibit decompression
2025-10-30 08:59:10 +01:00
Pedro Alves
867f8fb579
feat(gpu): implement re-randomization
- exposed to integer and HL API
- test on the HL API
- benchmarks for GPU and CPU implementation
2025-10-29 17:55:45 -03:00
Guillermo Oyarzun
62780ac500
fix(gpu): fix decompression mem leak
2025-10-24 13:02:41 +02:00
Enzo Di Maria
126e779533
refactor(gpu): oprf_unsigned_custom_range + tests
2025-10-16 09:31:01 +02:00
Enzo Di Maria
353237c0d6
refactor(gpu): oprf_unsigned_custom_range
2025-10-16 09:31:01 +02:00
Agnes Leroy
c3ed1a7558
chore(gpu): internal renaming
2025-10-14 09:23:11 +02:00
Agnes Leroy
6347f25668
chore(gpu): synchronize after every release
2025-10-14 09:23:11 +02:00
Agnes Leroy
91b263d480
chore(gpu): split integer utilities file
2025-10-10 14:49:02 +02:00
Andrei Stoian
30938eec74
chore(gpu): use active streams in int_radix_lut
2025-10-09 21:59:15 +02:00
Andrei Stoian
0604d237eb
chore(gpu): multi-gpu debug target
2025-10-03 16:48:42 +02:00
Agnes Leroy
f9e876730a
chore(gpu): remove support for drift noise reduction
2025-10-03 09:45:20 +02:00
Agnes Leroy
71b45c14da
chore(gpu): refactor subset_first and subset
2025-09-30 12:21:39 +02:00
Beka Barbakadze
7549474aac
feat(gpu): Implements optimized division algorithm for message_2_carry_2, when 4 or more gpus are used
2025-09-29 15:16:34 +02:00
Agnes Leroy
23d46ba2bc
fix(gpu): fix oprf output degree
2025-09-29 08:33:25 +02:00
Agnes Leroy
daf0e79e4a
fix(gpu): fix get oprf size on gpu
2025-09-29 08:33:25 +02:00
Agnes Leroy
9aab79e23a
chore(gpu): fix compilation warning
2025-09-26 17:04:17 +02:00
Agnes Leroy
f53c75636d
chore(gpu): refactor oprf test, remove unused arg and fix multi-GPU for oprf
2025-09-26 13:19:34 +02:00
Agnes Leroy
4b0623da4a
chore(gpu): remove unused variable
2025-09-22 16:36:34 +02:00
Guillermo Oyarzun
022cb3b18a
fix(gpu): avoid out of memory when benchmarking throughput
2025-09-19 14:44:12 +02:00
Agnes Leroy
fe6e81ff78
chore(gpu): post hackathon cleanup
2025-09-18 16:30:45 +02:00
Andrei Stoian
1dcc3c8c89
chore(gpu): structure to encapsulate streams
2025-09-18 09:43:17 +02:00
Pedro Alves
c78cc2d2e9
chore(gpu): add a benchmark for 128-bit multi-bit noise squashing
- Also, remove the lut indexes concept from the 128-bit multi-bit pbs. The entire backend assumes it does not exist (as is the case for classical PBS), so keeping it here would be error-prone.
2025-09-09 07:51:35 -03:00
Agnes Leroy
5d70ae4232
fix(gpu): add missing broadcast lut
2025-09-09 08:47:53 +02:00
Guillermo Oyarzun
a3168eb1b5
feat(gpu): enable lut generation with preallocated buffers
2025-09-08 10:01:34 +02:00
Guillermo Oyarzun
eeccace7b3
fix(gpu): add missing syncs when releasing scalar ops and returning to old lut release
2025-09-05 09:53:00 +02:00
Guillermo Oyarzun
baad6a6b49
feat(gpu): change broadcast lut to communicate the minimum possible
2025-09-03 15:20:58 +02:00
Guillermo Oyarzun
88c3df8331
feat(gpu): improve communication scheme
2025-09-03 15:20:58 +02:00
Pedro Alves
57ea3e3e88
chore(gpu): refactor the entry points for PBS in the backend
2025-08-29 16:46:27 -03:00
Pedro Alves
cad4070ebe
fix(gpu): fix the decompression function signature in the backend
2025-08-29 21:09:40 +02:00
Pedro Alves
94d24e1f8b
feat(gpu): implement the centered modulus switch technique to classical PBS
2025-08-29 11:38:26 -03:00
Pedro Alves
9a1c0f48f4
feat(gpu): implement 128-bit compression and add it to the integer API
2025-08-29 11:26:07 -03:00
Andrei Stoian
c06b513182
chore(gpu): add valgrind and fix leaks
2025-08-28 14:21:57 +02:00
Enzo Di Maria
14063ca3b3
fix(gpu): fix perf of ilog2 backend
2025-08-26 14:53:08 +02:00