# ZKStats Library

## Overview

ZKStats Library is the core library of the ZKStats Platform. It generates zero-knowledge (ZK) proofs for statistical functions, leveraging PyTorch and powered by [EZKL](https://github.com/zkonduit/ezkl). The library allows data providers to share statistical results over their datasets while preserving privacy: users can be convinced of the correctness of a computation by verifying a ZK proof, without learning the underlying data.

## Supported Statistical Functions

ZKStats Library supports the same set of statistical functions as the [Python statistics module](https://docs.python.org/3/library/statistics.html#averages-and-measures-of-central-location): `mean`, `geometric_mean`, `harmonic_mean`, `median`, `mode`, `pstdev`, `pvariance`, `stdev`, `variance`, `covariance`, `correlation`, and `linear_regression`.

## Installation

Make sure you have Python 3.9 or later installed. You can install the ZKStats library using pip:

```bash
pip install zkstats
```

To hack on the library, you'll need to [install poetry](https://python-poetry.org/docs/#installing-with-pipx) to build the project:

```bash
git clone git@github.com:ZKStats/zk-stats-lib.git
cd zk-stats-lib
poetry install
```

## Getting Started

### Define Your Computation

A user computation must be defined as **a function** using ZKStats operations and PyTorch functions. The function signature must be `Callable[[State, Args], torch.Tensor]`:

- The first argument is a `State` object, which provides the statistical functions that ZKStats supports.
- The second argument is an `Args` object, a dictionary of PyTorch tensors holding the input data: `Args['column1']` is the first column, `Args['column2']` is the second column, and so on.

For example, given two columns of data, we can compute the mean of the medians of the two columns:

```python
def user_computation(s: State, data: Args) -> torch.Tensor:
    # Compute the median of the first column
    median1 = s.median(data['column1'])
    # Compute the median of the second column
    median2 = s.median(data['column2'])
    # Compute the mean of the medians
    return s.mean(torch.cat((median1.unsqueeze(0), median2.unsqueeze(0))).reshape(1, -1, 1))
```

> NOTE: The `reshape` is required for now since the input must be in shape `[1, data_size, 1]`. This should be addressed in the future; the same goes for `torch.cat()` and `unsqueeze()`, for which we will provide wrappers.

#### Torch Operations

Aside from the ZKStats operations, you can also use PyTorch functions such as `torch.abs`, `torch.max`, etc.

TODO: We should have a list of all supported PyTorch functions.

**Caveats**: Not all PyTorch functions are supported. For example, filtering data from a list with `X[X > 0]` is not supported, because the ZK circuit must have a predetermined size, so we cannot arbitrarily reshape `X` into a new shape based on a filter condition inside the circuit. To filter data based on a condition, use `s.where` as described below.

#### Data Filtering

Although we cannot filter data into an arbitrary shape using just condition + indexing (e.g. `X[X > 0]`), the `State.where` operation lets users filter data by a condition of their own choice:

```python
def user_computation(s: State, data: Args) -> torch.Tensor:
    x = data['x']
    # Conditions can be chained and can involve multiple variables if the
    # computation uses more than just x, e.g.
    # filter = torch.logical_and(x > 20, y < 2) in a regression setting.
    filter = torch.logical_and(x > 20, x < 50)
    # Call our where function
    filtered_x = s.where(filter, x)
    # Then use the stats operations as usual
    return s.mean(filtered_x)
```
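As a further illustration of `s.where`, the filter condition can combine multiple columns, as hinted in the comment above. The following is a minimal sketch (the column names `'x'` and `'y'` are hypothetical, and `State` and `Args` are the same types used in the examples above) that keeps only the rows where `x > 20` and `y < 2`, and returns the mean of the remaining `x` values:

```python
def filtered_mean(s: State, data: Args) -> torch.Tensor:
    x = data['x']
    y = data['y']
    # Combine conditions on both columns into a single boolean filter
    condition = torch.logical_and(x > 20, y < 2)
    # Keep only the x values whose rows satisfy the condition
    filtered_x = s.where(condition, x)
    # Any supported ZKStats operation can then be applied to the filtered data
    return s.mean(filtered_x)
```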
### Proof Generation and Verification

The flow between data providers and users is as follows:

![zkstats-lib-flow](./assets/zkstats-flow.png)

#### Data Provider: generate data commitments

Data providers should generate commitments for their dataset beforehand. For a dataset (e.g. a table in a SQL database), there should be a commitment for each column. These commitments are used by users later to verify the ZK proof and be convinced that the computation was done over the correct dataset.

```python
from zkstats.core import generate_data_commitment

data_path = "/path/to/your/data.json"
data_commitment_path = "/path/to/store/data_commitments.json"

# possible_scales is the list of scales the data may be encoded with.
# For example, here we use [0, 20) as the possible scales.
possible_scales = list(range(20))

# Data commitments are generated by data providers and shared with users.
generate_data_commitment(data_path, possible_scales, data_commitment_path)
```

When generating a proof, since the dataset might contain floating points, data providers need to specify a proper "scale" to encode and decode those floating points. The scale is chosen based on the value precision in the dataset and the type of computation. `possible_scales` should cover as many scales as possible, and data providers should always use scales within `possible_scales`, so that users can always obtain the corresponding commitments to verify the proofs.
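One way to read this constraint: the scale later passed when generating settings must be one of the scales the commitments were generated for. A minimal sketch of that invariant, where the concrete scale value is purely illustrative:

```python
possible_scales = list(range(20))

# Illustrative value only: pick the scale based on the precision of your data,
# but always from the set of scales the data commitments cover.
scale = 2
assert scale in possible_scales
```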
#### Both: derive the PyTorch model from the computation

When a user wants a data provider to generate a proof for their defined computation, the user must let the data provider know what the computation is. The data provider, who has the real dataset, then generates a model from the computation using the `computation_to_model()` method. Since we use the witness approach (described in the Note section below), the data provider is required to send the pre-calculated witness back to the verifier. The verifier then uses this pre-calculated witness to generate a model from the computation that is exactly the same as the prover's. Note that we could also let the prover generate the model and send it to the verifier directly. However, to make sure that the prover's model actually comes from the verifier's computation, it is better to have the verifier generate the model itself from its own computation, with only the help of the pre-calculated witness.

```python
from zkstats.core import computation_to_model

# For the prover: generate prover_model and write the precal_witness file
selected_columns, _, prover_model = computation_to_model(user_computation, precal_witness_path, data_shape, True, error)

# For the verifier: generate the verifier model (identical to prover_model) by reading the precal_witness file
selected_columns, _, verifier_model = computation_to_model(user_computation, precal_witness_path, data_shape, False, error)
```

#### Data Provider: generate settings

```python
prover_gen_settings(
    data_path,          # path to the dataset
    selected_columns,   # the column names used by the computation
    sel_data_path,      # path to the preprocessed dataset
    prover_model,       # the model generated from the computation
    prover_model_path,  # path to store the generated onnx format of the model
    scale,              # scale to encode and decode floating points
    mode,               # mode to generate settings
    settings_path,      # path to store the generated settings
)
```

#### Data Provider: get proving key

```python
setup(
    prover_model_path,           # path to the onnx format model
    prover_compiled_model_path,  # path to store the compiled model
    settings_path,               # path to the settings file
    vk_path,                     # path to store the generated verification key
    pk_path,                     # path to store the generated proving key
)
```

#### User: generate verification key

```python
verifier_define_calculation(
    dummy_data_path,      # path to the dummy data
    selected_columns,     # selected columns
    sel_dummy_data_path,  # path to store the selected dummy data
    verifier_model,       # the model generated from the computation
    verifier_model_path,  # path to store the generated onnx format of the model
)
```

```python
setup(
    verifier_model_path,           # path to the onnx format model
    verifier_compiled_model_path,  # path to store the compiled model
    settings_path,                 # path to the settings file
    vk_path,                       # path to store the generated verification key
    pk_path,                       # path to store the generated proving key
)
```

#### Data Provider: generate proof

```python
prover_gen_proof(
    prover_model_path,           # path to the onnx format model
    sel_data_path,               # path to the preprocessed dataset
    witness_path,                # path to store the generated witness file
    prover_compiled_model_path,  # path to store the generated compiled model
    settings_path,               # path to the settings file
    proof_path,                  # path to store the generated proof
    pk_path,                     # path to the proving key
)
```

#### User: verify proof and get the result

```python
res = verifier_verify(
    proof_path,            # path to the proof
    settings_path,         # path to the settings file
    vk_path,               # path to the verification key
    selected_columns,      # selected columns
    data_commitment_path,  # path to the data commitment
)
print("The result is", res)
```

- **Success**: The result is correct and the computation is verified.
- **Failure cases**:
  - The computation is not within the acceptable error margin.
  - Runtime errors, which should be reported for further investigation.
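For orientation, the steps above can be strung together as follows. This is a condensed, illustrative sketch rather than a fixed recipe: it assumes `user_computation` is defined as in the Getting Started section, that the remaining helpers are importable from `zkstats.core` like the two shown earlier, and that the file locations and the concrete values of `data_shape`, `scale`, `mode`, and `error` below are placeholders you would replace for your own dataset.

```python
from zkstats.core import (
    generate_data_commitment,
    computation_to_model,
    prover_gen_settings,
    setup,
    verifier_define_calculation,
    prover_gen_proof,
    verifier_verify,
)

# Placeholder file locations (prover and verifier each use their own copies).
data_path = "data.json"
data_commitment_path = "data_commitments.json"
precal_witness_path = "precal_witness.json"
sel_data_path = "sel_data.json"
dummy_data_path = "dummy_data.json"
sel_dummy_data_path = "sel_dummy_data.json"
prover_model_path = "prover_model.onnx"
verifier_model_path = "verifier_model.onnx"
prover_compiled_model_path = "prover_model.compiled"
verifier_compiled_model_path = "verifier_model.compiled"
settings_path = "settings.json"
witness_path = "witness.json"
proof_path = "proof.pf"
vk_path = "model.vk"
pk_path = "model.pk"

# Placeholder parameters -- adjust for your dataset and computation.
possible_scales = list(range(20))
data_shape = {"x": 100}   # assumption: column name -> number of rows
scale = 2                 # placeholder scale drawn from possible_scales
mode = "resources"        # placeholder settings-generation mode
error = 0.01              # placeholder acceptable error margin

# Data provider: commit to the dataset once and share the commitments with users.
generate_data_commitment(data_path, possible_scales, data_commitment_path)

# Prover side: derive the model, write the pre-calculated witness, and prove.
selected_columns, _, prover_model = computation_to_model(
    user_computation, precal_witness_path, data_shape, True, error
)
prover_gen_settings(data_path, selected_columns, sel_data_path, prover_model,
                    prover_model_path, scale, mode, settings_path)
setup(prover_model_path, prover_compiled_model_path, settings_path, vk_path, pk_path)
prover_gen_proof(prover_model_path, sel_data_path, witness_path,
                 prover_compiled_model_path, settings_path, proof_path, pk_path)

# Verifier side: rebuild the same model from the shared witness and verify.
selected_columns, _, verifier_model = computation_to_model(
    user_computation, precal_witness_path, data_shape, False, error
)
verifier_define_calculation(dummy_data_path, selected_columns,
                            sel_dummy_data_path, verifier_model, verifier_model_path)
setup(verifier_model_path, verifier_compiled_model_path, settings_path, vk_path, pk_path)
res = verifier_verify(proof_path, settings_path, vk_path,
                      selected_columns, data_commitment_path)
print("The result is", res)
```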
## Examples

See our Jupyter notebooks for [examples](./examples/).

## Benchmarks

TOFIX: Update the benchmark method. See more in issues.

See our Jupyter notebooks for [benchmarks](./benchmark/).

## Note

- We use the witness approach instead of directly calculating values in the circuit. This sometimes lets us avoid calculating operations like division or exponentiation, which would require a larger scale in the settings. (If we did not use a larger scale in those cases, the accuracy would be very bad.)
- The dummy data fed into the verifier onnx file needs to have the same shape as the private dataset, but can be filled with any values (we just randomize it uniformly between 1 and 10 with 1 decimal).
- For the `mode` function, if more than one value is possible, we output the first one encountered, conforming to the spec of [`statistics.mode`](https://docs.python.org/3.9/library/statistics.html#statistics.mode) in the Python standard library.

## Legacy

Not relevant after commit 48142b5b9ab6b577deadc27886018131008ebad5:

- For non-linear functions, a larger scale leads to a larger lookup table and hence a bigger circuit size. Compare `geomean_OG` (implemented in the traditional way, instead of the witness approach), which is non-linear (pretty bad with a larger scale), with `mean_OG`, which is linear (pretty fine with a larger scale). Hence, for linear functions like mean we could use the traditional way, while for non-linear functions like geomean we should use the witness approach. However, we abstract this away by using only the witness approach in our library (which makes sense!).