Commit Graph

143 Commits

Author SHA1 Message Date
Enzo Di Maria
ca2a79f1fb refactor(gpu): Threshold for multi-GPU with Classical PBS 2025-12-18 09:27:09 +01:00
Agnes Leroy
cfa53682ae fix(gpu): add missing sync before free in oprf 2025-12-16 09:42:11 +01:00
Agnes Leroy
006d6cc300 fix(gpu): fix some cpu memory leaks 2025-12-15 14:27:35 -03:00
Guillermo Oyarzun
11579bd3d0 feat(gpu): create noise and pfail tests for pbs + ks + ms 2025-12-12 09:41:11 +01:00
Enzo Di Maria
5273f61593 refactor(gpu): creating InternalCudaStreams to improve the management of multiple streams per GPU 2025-12-05 10:13:18 +01:00
Enzo Di Maria
a96d68323b fix(gpu): No broadcast is needed because full prop is done on 1 single GPU 2025-12-04 09:17:40 +01:00
Enzo Di Maria
6752249f7f refactor(gpu): moving vector_comparisons's functions to the backend 2025-12-03 09:11:00 +01:00
Enzo Di Maria
cc33161b97 refactor(gpu): moving cast_to_signed to the backend 2025-12-02 12:40:47 +01:00
Enzo Di Maria
3cfbaa40c3 refactor(gpu): unchecked_index_of_clear to backend 2025-11-28 14:34:53 +01:00
Andrei Stoian
e2063c8ef4 chore(gpu): bench KS latency batches 2025-11-27 17:32:44 +01:00
Enzo Di Maria
0aa0918fea refactor(gpu): vector_find's functions to backend 2025-11-27 13:36:10 +01:00
Enzo Di Maria
b6fffa3d86 fix(gpu): wrong number of blocks in tmp_match_result 2025-11-25 10:50:30 +01:00
Enzo Di Maria
32b1a7ab1d refactor(gpu): unchecked_match_value_or to backend 2025-11-24 13:50:15 +01:00
Guillermo Oyarzun
02312e23ea fix(gpu): fix pbs128 selection for small num samples 2025-11-21 10:57:14 +01:00
Enzo Di Maria
c6709a82c0 refactor(gpu): match_value to backend with multiple streams 2025-11-20 17:11:15 +01:00
Beka Barbakadze
80cacbd079 feat(gpu): add boolean bitops in cuda backend 2025-11-20 14:56:21 +01:00
Agnes Leroy
4f9f4982f6 fix(gpu): fix memory leak in rerand 2025-11-14 14:00:01 +01:00
Enzo Di Maria
026cc376ed refactor(gpu): multibit decompression 2025-10-30 08:59:10 +01:00
Pedro Alves
867f8fb579 feat(gpu): implement re-randomization
- exposed to integer and HL API
- test on the HL API
- benchmarks for GPU and CPU implementation
2025-10-29 17:55:45 -03:00
Guillermo Oyarzun
62780ac500 fix(gpu): fix decompression mem leak 2025-10-24 13:02:41 +02:00
Enzo Di Maria
126e779533 refactor(gpu): oprf_unsigned_custom_range + tests 2025-10-16 09:31:01 +02:00
Enzo Di Maria
353237c0d6 refactor(gpu): oprf_unsigned_custom_range 2025-10-16 09:31:01 +02:00
Agnes Leroy
c3ed1a7558 chore(gpu): internal renaming 2025-10-14 09:23:11 +02:00
Agnes Leroy
6347f25668 chore(gpu): synchronize after every release 2025-10-14 09:23:11 +02:00
Agnes Leroy
91b263d480 chore(gpu): split integer utilities file 2025-10-10 14:49:02 +02:00
Andrei Stoian
30938eec74 chore(gpu): use active streams in int_radix_lut 2025-10-09 21:59:15 +02:00
Andrei Stoian
0604d237eb chore(gpu): multi-gpu debug target 2025-10-03 16:48:42 +02:00
Agnes Leroy
f9e876730a chore(gpu): remove support for drift noise reduction 2025-10-03 09:45:20 +02:00
Agnes Leroy
71b45c14da chore(gpu): refactor subset_first and subset 2025-09-30 12:21:39 +02:00
Beka Barbakadze
7549474aac feat(gpu): Implements optimized division algorithm for message_2_carry_2, when 4 or more gpus are used 2025-09-29 15:16:34 +02:00
Agnes Leroy
23d46ba2bc fix(gpu): fix oprf output degree 2025-09-29 08:33:25 +02:00
Agnes Leroy
daf0e79e4a fix(gpu): fix get oprf size on gpu 2025-09-29 08:33:25 +02:00
Agnes Leroy
9aab79e23a chore(gpu): fix compilation warning 2025-09-26 17:04:17 +02:00
Agnes Leroy
f53c75636d chore(gpu): refactor oprf test, remove unused arg and fix multi-GPU for oprf 2025-09-26 13:19:34 +02:00
Agnes Leroy
4b0623da4a chore(gpu): remove unused variable 2025-09-22 16:36:34 +02:00
Guillermo Oyarzun
022cb3b18a fix(gpu): avoid out of memory when benchmarking throughput 2025-09-19 14:44:12 +02:00
Agnes Leroy
fe6e81ff78 chore(gpu): post hackathon cleanup 2025-09-18 16:30:45 +02:00
Andrei Stoian
1dcc3c8c89 chore(gpu): structure to encapsulate streams 2025-09-18 09:43:17 +02:00
Pedro Alves
c78cc2d2e9 chore(gpu): add a benchmark for 128-bit multi-bit noise squashing
- Also, remove the lut indexes concept from the 128-bit multi-bit pbs. It's assumed not to exist by the entire backend (as it doesn't for classical PBS). So to keep it here would be a bit error prone.
2025-09-09 07:51:35 -03:00
Agnes Leroy
5d70ae4232 fix(gpu): add missing broadcast lut 2025-09-09 08:47:53 +02:00
Guillermo Oyarzun
a3168eb1b5 feat(gpu): enable lut generation with preallocated buffers 2025-09-08 10:01:34 +02:00
Guillermo Oyarzun
eeccace7b3 fix(gpu): add missing syncs when releasing scalar ops and returning to old lut release 2025-09-05 09:53:00 +02:00
Guillermo Oyarzun
baad6a6b49 feat(gpu): change broadcast lut to communicate the minimum possible 2025-09-03 15:20:58 +02:00
Guillermo Oyarzun
88c3df8331 feat(gpu): improve communication scheme 2025-09-03 15:20:58 +02:00
Pedro Alves
57ea3e3e88 chore(gpu): refactor the entry points for PBS in the backend 2025-08-29 16:46:27 -03:00
Pedro Alves
cad4070ebe fix(gpu): fix the decompression function signature in the backend 2025-08-29 21:09:40 +02:00
Pedro Alves
94d24e1f8b feat(gpu): implement the centered modulus switch technique to classical PBS 2025-08-29 11:38:26 -03:00
Pedro Alves
9a1c0f48f4 feat(gpu): implement 128-bit compression and add it to the integer API 2025-08-29 11:26:07 -03:00
Andrei Stoian
c06b513182 chore(gpu): add valgrind and fix leaks 2025-08-28 14:21:57 +02:00
Enzo Di Maria
14063ca3b3 fix(gpu): fix perf of ilog2 backend 2025-08-26 14:53:08 +02:00