concrete/docs/user/advanced_examples/PoissonRegression.ipynb
Jeremy Bradley-Silverio Donato cb660d89f9 docs: Update PoissonRegression.ipynb
2022-01-06 19:14:10 +01:00

{
"cells": [
{
"cell_type": "markdown",
"id": "b760a0f6",
"metadata": {},
"source": [
"# Poisson Regression\n",
"\n",
"This tutorial shows how to train several Generalized Linear Models (GLM) with scikit-learn, quantize them, and run them under Fully Homomorphic Encryption (FHE) using Concrete Numpy. We use strong quantization to ensure that the accumulator of the linear part does not overflow when computing in FHE (7-bit accumulator). We show that conversion to FHE does not degrade performance with respect to the quantized model working on values in the clear."
]
},
{
"cell_type": "markdown",
"id": "253288cf",
"metadata": {},
"source": [
"### Import libraries\n",
"\n",
"We import scikit-learn libraries and Concrete Numpy quantization tools:"
]
},
{
"cell_type": "code",
"execution_count": 2,
"id": "6200ab62",
"metadata": {},
"outputs": [],
"source": [
"from copy import deepcopy\n",
"import numpy as np\n",
"\n",
"from sklearn.linear_model import PoissonRegressor\n",
"from sklearn.datasets import fetch_openml\n",
"from sklearn.model_selection import train_test_split\n",
"from sklearn.decomposition import PCA\n",
"from tqdm import tqdm\n",
"\n",
"from concrete.quantization import QuantizedLinear, QuantizedArray, QuantizedModule\n",
"from concrete.quantization.quantized_activations import QuantizedActivation\n"
]
},
{
"cell_type": "markdown",
"id": "f43e2387",
"metadata": {},
"source": [
"And finally we import some helpers for visualization:"
]
},
{
"cell_type": "code",
"execution_count": 3,
"id": "d104c8df",
"metadata": {},
"outputs": [],
"source": [
"%matplotlib inline\n",
"\n",
"import matplotlib.pyplot as plt\n",
"from IPython.display import display"
]
},
{
"cell_type": "markdown",
"id": "53e676b8",
"metadata": {},
"source": [
"### Insurance claims dataset\n",
"\n",
"In this tutorial, we show how to build a regression model that predicts the frequency of incidents in an insurance setting.\n",
"\n",
"We download a data set from OpenML that contains 670,000 examples giving the frequency of car accidents for drivers of various ages, past accident history, car type, car color, geographical region, etc. We take only the first 50,000 examples to speed up training.\n"
]
},
{
"cell_type": "code",
"execution_count": 4,
"id": "d451e829",
"metadata": {},
"outputs": [],
"source": [
"df = fetch_openml(data_id=41214, as_frame=True, cache=True, data_home=\"~/.cache/sklearn\").frame\n",
"df = df.head(50000)"
]
},
{
"cell_type": "markdown",
"id": "39a70df7",
"metadata": {},
"source": [
"The target variable is the number of claims per year, which is computed by the following formula:"
]
},
{
"cell_type": "code",
"execution_count": 5,
"id": "5e163891",
"metadata": {},
"outputs": [],
"source": [
"df[\"Frequency\"] = df[\"ClaimNb\"] / df[\"Exposure\"]"
]
},
{
"cell_type": "markdown",
"id": "75f4fdb7",
"metadata": {},
"source": [
"Let's visualize our data set to show that the target variable, \"Frequency\", has a Poisson distribution."
]
},
{
"cell_type": "code",
"execution_count": 6,
"id": "2a124a62",
"metadata": {},
"outputs": [
{
"data": {
"image/png": "",
"text/plain": [
"<Figure size 1080x504 with 2 Axes>"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"plt.ioff()\n",
"fig, ax = plt.subplots(1,2,figsize=(15,7))\n",
"fig.patch.set_facecolor('white')\n",
"ax[0].set_title(\"Frequency of claims vs. Driver Age\")\n",
"ax[0].set_xlabel(\"Driver Age\")\n",
"ax[0].set_ylabel(\"Frequency of claims\")\n",
"ax[0].scatter(df[\"DrivAge\"], df[\"Frequency\"], marker=\"o\", color=\"#ffb700\")\n",
"ax[1].set_title(\"Histogram of Frequency of claims\")\n",
"ax[1].set_xlabel(\"Frequency of claims\")\n",
"ax[1].set_ylabel(\"Count\")\n",
"df[\"Frequency\"].hist(bins=30, log=True, ax=ax[1], color=\"black\")\n",
"display(fig)"
]
},
{
"cell_type": "markdown",
"id": "5c8310ab",
"metadata": {},
"source": [
"We split the data into a training set and a test set, and we also keep a part of the data for calibration. With the 50,000 examples kept above, this gives 40,000 training examples, 9,900 calibration examples, and 100 test examples. The calibration set is used neither for training nor for testing the model, which helps the quantized model generalize better."
]
},
{
"cell_type": "code",
"execution_count": 7,
"id": "d81db277",
"metadata": {},
"outputs": [],
"source": [
"df_train, df_test = train_test_split(df, test_size=0.2, random_state=0)\n",
"df_calib, df_test = train_test_split(df_test, test_size=100, random_state=0)\n"
]
},
{
"cell_type": "markdown",
"id": "4690cc15",
"metadata": {},
"source": [
"## Simple single variable insurance incident frequency predictor\n",
"\n",
"Our initial example only uses a single predictor feature, so we can easily visualize results. "
]
},
{
"cell_type": "markdown",
"id": "faa5247c",
"metadata": {},
"source": [
"We first train the scikit-learn PoissonRegressor model:"
]
},
{
"cell_type": "code",
"execution_count": 8,
"id": "682fb2d8",
"metadata": {},
"outputs": [],
"source": [
"reg = PoissonRegressor(max_iter=300)\n",
"reg.fit(df_train[\"DrivAge\"].values.reshape(-1,1), df_train[\"Frequency\"]);"
]
},
{
"cell_type": "markdown",
"id": "084fb296",
"metadata": {},
"source": [
"We can now test this predictor on the test data:"
]
},
{
"cell_type": "code",
"execution_count": 9,
"id": "4953b03e",
"metadata": {},
"outputs": [],
"source": [
"test_data = np.sort(df_test[\"DrivAge\"].values).reshape(-1,1)\n",
"predictions = reg.predict(test_data)"
]
},
{
"cell_type": "markdown",
"id": "f28155cf",
"metadata": {},
"source": [
"Let's visualize our predictions to see how our model performs!"
]
},
{
"cell_type": "code",
"execution_count": 10,
"id": "111574ed",
"metadata": {},
"outputs": [
{
"data": {
"image/png": "",
"text/plain": [
"<Figure size 864x576 with 1 Axes>"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"plt.clf()\n",
"fig, ax = plt.subplots(1,figsize=(12,8))\n",
"fig.patch.set_facecolor('white')\n",
"ax.plot(test_data, predictions, color=\"black\", label=\"Float clear trend line\")\n",
"ax.scatter(df_test[\"DrivAge\"], df_test[\"Frequency\"], marker=\"o\", color=\"#ffb700\")\n",
"ax.set_xlabel(\"Driver Age\")\n",
"ax.set_ylim(0,10)\n",
"ax.set_title(\"Regression with sklearn\")\n",
"ax.set_ylabel(\"Frequency of claims\")\n",
"ax.legend(loc=\"upper right\")\n",
"display(fig)"
]
},
{
"cell_type": "markdown",
"id": "429d8cc8",
"metadata": {},
"source": [
"### Analysis\n",
"\n",
"The trend line obtained from the model suggests that incidents increase with driver age, but the data shows that incidents peak between 30 and 40 years of age and decrease afterwards. This simple model does not seem to be a good one. We still convert it to FHE, because a single-variable model lets us show some details of the conversion visually. In the second part of this example we train a more powerful model."
]
},
{
"cell_type": "markdown",
"id": "2d959640",
"metadata": {},
"source": [
"### FHE models need to be quantized, so let's define a **Quantized Poisson Regressor**\n",
"\n",
"We use the quantization primitives available in the Concrete library (QuantizedArray, QuantizedActivation, QuantizedLinear, and QuantizedModule) to define a Poisson Regressor, which is a Generalized Linear Model with a log link, that is, with an exponential inverse link."
]
},
{
"cell_type": "code",
"execution_count": 11,
"id": "9f5acbfe",
"metadata": {},
"outputs": [],
"source": [
"class QuantizedExp(QuantizedActivation):\n",
" \"\"\"Quantized exponential function.\"\"\"\n",
"\n",
" def calibrate(self, x: np.ndarray):\n",
" self.q_out = QuantizedArray(self.n_bits, np.exp(x))\n",
"\n",
" def __call__(self, q_input: QuantizedArray) -> QuantizedArray:\n",
" quant_exp = np.exp(self.dequant_input(q_input))\n",
" q_out = self.quant_output(quant_exp)\n",
" return q_out\n",
" \n",
"class QuantizedGLM(QuantizedModule):\n",
" def __init__(self, n_bits, sklearn_model, calibration_data) -> None:\n",
" # Create a QuantizedLinear layer\n",
" self.n_bits = n_bits\n",
"\n",
" self.q_calibration_data = QuantizedArray(n_bits, calibration_data)\n",
"\n",
" q_weights = QuantizedArray(2, np.expand_dims(sklearn_model.coef_,1), is_signed=False)\n",
" q_bias = QuantizedArray(1, sklearn_model.intercept_)\n",
" q_layer = QuantizedLinear(6, q_weights, q_bias)\n",
" quant_layers_dict = {}\n",
" # Calibrate and get new calibration_data for next layer/activation\n",
" calibration_data = self._calibrate_and_store_layers_activation(\n",
" \"linear\", q_layer, calibration_data, quant_layers_dict\n",
" )\n",
"\n",
" # Create a new quantized layer (based on type(layer))\n",
" q_exp = QuantizedExp(n_bits=7)\n",
" calibration_data = self._calibrate_and_store_layers_activation(\n",
" \"invlink\", q_exp, calibration_data, quant_layers_dict\n",
" )\n",
"\n",
" super().__init__(quant_layers_dict)\n",
"\n",
"\n",
" def _calibrate_and_store_layers_activation(self, name, q_function, calibration_data, quant_layers_dict):\n",
" # Calibrate the output of the layer\n",
" q_function.calibrate(calibration_data)\n",
" # Store the learned quantized layer\n",
" quant_layers_dict[name] = q_function\n",
" # Create new calibration data (output of the previous layer)\n",
" q_calibration_data = QuantizedArray(self.n_bits, calibration_data)\n",
" # Dequantize to have the value in clear and ready for next calibration\n",
" return q_function(q_calibration_data).dequant()\n",
"\n",
"\n",
" def quantize_input(self, x):\n",
" q_input_arr = deepcopy(self.q_calibration_data)\n",
" q_input_arr.update_values(x)\n",
" return q_input_arr"
]
},
{
"cell_type": "markdown",
"id": "ab82ae87",
"metadata": {},
"source": [
"### We can now convert the scikit-learn model to our quantized version\n",
"\n",
"First, we get the calibration data and run it through the non-quantized (float) model to determine all possible intermediate values. After each operation, these values are quantized, and the quantized versions of the operations are stored in the QuantizedGLM module."
]
},
{
"cell_type": "code",
"execution_count": 12,
"id": "09d12194",
"metadata": {},
"outputs": [],
"source": [
"calib_data = np.expand_dims(df_calib[\"DrivAge\"].values, 1)\n",
"n_bits = 5\n",
"q_glm = QuantizedGLM(n_bits, reg, calib_data)"
]
},
{
"cell_type": "markdown",
"id": "e2528092",
"metadata": {},
"source": [
"Once the model's parameters and input ranges are quantized, we can quantize our test data and perform quantized inference. "
]
},
{
"cell_type": "code",
"execution_count": 13,
"id": "f0f0699a",
"metadata": {},
"outputs": [],
"source": [
"q_test_data = q_glm.quantize_input(test_data)\n",
"y_pred = q_glm.forward_and_dequant(q_test_data)\n"
]
},
{
"cell_type": "markdown",
"id": "a5a50eb8",
"metadata": {},
"source": [
"Let's visualize the results of the quantized model. We can measure the goodness of fit on the test data using the Poisson deviance. We then plot the two trend lines (float value model and quantized model) to check for differences. "
]
},
{
"cell_type": "code",
"execution_count": 14,
"id": "04777aeb",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"mean Poisson deviance (float): 3.7115219475021872\n",
"mean Poisson deviance (quant): 3.716861851757367\n"
]
}
],
"source": [
"from sklearn.metrics import mean_poisson_deviance\n",
"\n",
"y_gt = df_test[\"Frequency\"]\n",
"gt_weight = df_test[\"Exposure\"]\n",
"\n",
"dev_real = mean_poisson_deviance(y_gt, predictions, sample_weight=gt_weight)\n",
"dev_q = mean_poisson_deviance(y_gt, y_pred, sample_weight=gt_weight)\n",
"\n",
"print(f\"mean Poisson deviance (float): {dev_real}\")\n",
"print(f\"mean Poisson deviance (quant): {dev_q}\")"
]
},
{
"cell_type": "code",
"execution_count": 15,
"id": "5fb15eb4",
"metadata": {},
"outputs": [
{
"data": {
"image/png": "",
"text/plain": [
"<Figure size 864x576 with 1 Axes>"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"plt.clf()\n",
"fig, ax = plt.subplots(1,figsize=(12,8))\n",
"fig.patch.set_facecolor('white')\n",
"#ax.set_yscale(\"log\")\n",
"ax.plot(test_data, predictions, color=\"black\", label=f\"Float clear trend line, d={dev_real:.3f}\")\n",
"ax.scatter(df_test[\"DrivAge\"], df_test[\"Frequency\"], marker=\"o\", color=\"gray\", label=\"Test data\")\n",
"ax.set_xlabel(\"Driver Age\")\n",
"ax.set_ylim(0,10)\n",
"ax.set_title(\"Poisson Regression, float in clear vs. quantized\")\n",
"ax.set_ylabel(\"Frequency of claims\")\n",
"ax.plot(test_data, y_pred, color=\"red\",label=f\"Quantized trend line, d={dev_q:.3f}\")\n",
"ax.legend(loc=\"upper left\")\n",
"ax.grid()\n",
"\n",
"# inset axes....\n",
"axins = ax.inset_axes([0.5, 0.5, 0.47, 0.47])\n",
"axins.plot(test_data, predictions, color=\"black\", label=f\"Float clear trend line, d={dev_real:.3f}\")\n",
"axins.plot(test_data, y_pred, color=\"red\",label=f\"Quantized trend line, d={dev_q:.3f}\")\n",
"# sub region of the original image\n",
"x1, x2, y1, y2 = 60, 65, 2.3, 2.7\n",
"axins.set_xlim(x1, x2)\n",
"axins.set_ylim(y1, y2)\n",
"#axins.set_xticklabels([])\n",
"#axins.set_yticklabels([])\n",
"axins.grid()\n",
"ax.indicate_inset_zoom(axins, edgecolor=\"black\")\n",
"\n",
"display(fig)"
]
},
{
"cell_type": "markdown",
"id": "aa8854b2",
"metadata": {},
"source": [
"### Analysis\n",
"\n",
"We see, in the graph above, that the trend line of the quantized model is more jagged and has a slightly higher deviance. The practitioner needs to weigh a better fit against compatibility with FHE compilation."
]
},
{
"cell_type": "markdown",
"id": "af6bc89e",
"metadata": {},
"source": [
"### Now it's time to make the inference homomorphic. Compiling a model to FHE is done with a single line of code\n"
]
},
{
"cell_type": "code",
"execution_count": 16,
"id": "fe9935bd",
"metadata": {},
"outputs": [],
"source": [
"engine = q_glm.compile(q_test_data)"
]
},
{
"cell_type": "markdown",
"id": "46753da7",
"metadata": {},
"source": [
"And now we can test the model on the test set in FHE:"
]
},
{
"cell_type": "code",
"execution_count": 17,
"id": "ca928b78",
"metadata": {},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"100%|██████████| 100/100 [00:21<00:00, 4.72it/s]\n"
]
}
],
"source": [
"y_pred_fhe = np.zeros((test_data.shape[0],), np.float32)\n",
"for i, test_sample in enumerate(tqdm(q_test_data.qvalues)):\n",
"    q_sample = np.expand_dims(test_sample, 1).transpose([1,0]).astype(np.uint8)\n",
"    q_pred_fhe = engine.run(q_sample)\n",
"    y_pred_fhe[i] = q_glm.dequantize_output(q_pred_fhe)"
]
},
{
"cell_type": "markdown",
"id": "68f67b3f",
"metadata": {},
"source": [
"Finally, we plot the trend lines to check whether the FHE predictions differ from those of the quantized model running on clear (non-encrypted) data. Sometimes, FHE noise can create minor artifacts."
]
},
{
"cell_type": "code",
"execution_count": 18,
"id": "92c7f2f5",
"metadata": {},
"outputs": [
{
"data": {
"image/png": "",
"text/plain": [
"<Figure size 864x576 with 1 Axes>"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"plt.clf()\n",
"fig, ax = plt.subplots(1,figsize=(12,8))\n",
"fig.patch.set_facecolor('white')\n",
"ax.plot(test_data, predictions, color=\"black\", label=f\"Float clear trend line, d={dev_real:.3f}\")\n",
"ax.plot(test_data, y_pred_fhe, color=\"blue\", label=\"FHE quantized trend line\")\n",
"ax.scatter(df_test[\"DrivAge\"], df_test[\"Frequency\"], marker=\"o\", color=\"gray\", label=\"Test data\")\n",
"ax.set_xlabel(\"Driver Age\")\n",
"ax.set_ylim(0,10)\n",
"ax.set_title(\"Poisson Regression, float in clear vs. quantized FHE encrypted\")\n",
"ax.set_ylabel(\"Frequency of claims\")\n",
"ax.plot(test_data, y_pred, color=\"red\",label=f\"Quantized trend line, d={dev_q:.3f}\")\n",
"ax.legend(loc=\"upper left\")\n",
"ax.grid()\n",
"\n",
"axins = ax.inset_axes([0.5, 0.5, 0.47, 0.47])\n",
"axins.plot(test_data, predictions, color=\"black\", label=f\"Float clear trend line, d={dev_real:.3f}\")\n",
"axins.plot(test_data, y_pred, color=\"red\",label=f\"Quantized trend line, d={dev_q:.3f}\")\n",
"axins.plot(test_data, y_pred_fhe, color=\"blue\", label=f\"FHE quantized trend line\")\n",
"x1, x2, y1, y2 = 60, 65, 2.3, 2.7\n",
"axins.set_xlim(x1, x2)\n",
"axins.set_ylim(y1, y2)\n",
"axins.grid()\n",
"ax.indicate_inset_zoom(axins, edgecolor=\"black\")\n",
"\n",
"display(fig)"
]
},
{
"cell_type": "markdown",
"id": "14394b94",
"metadata": {},
"source": [
"## A multi-variate model\n",
"\n",
"The simple single variable model does not achieve good results (age is not a good predictor for the number of claims). Let's train a model with all of our predictor variables. We first transform the raw features into ones that can be input to a regression model: the categorical features are one-hot encoded, and we reduce the resolution of the vehicle age and driver age by binning. Transforming the data this way, we end up with a total of 57 continuous features (instead of the initial 11).\n",
"\n",
"Here we encounter one of the limitations of our framework. The prediction performs a dot product in the QuantizedLinear class, but in our framework the maximum integer size is, for now, limited to 7 bits. Multiplying two n-bit integers produces a result of up to 2n bits, and accumulating the products can add further bits, so performing 57 multiplication-additions of integers to compute w.x would quickly overflow 7 bits.\n",
"\n",
"As a workaround to the limited accumulator resolution, we perform PCA to reduce dimensionality from 57 to 14 dimensions and train our multi-variate model in this reduced dimensionality space. However, we also train a reference model on all of the original features. "
]
},
{
"cell_type": "code",
"execution_count": 19,
"id": "759507c5",
"metadata": {},
"outputs": [],
"source": [
"from sklearn.metrics import mean_poisson_deviance\n",
"\n",
"from sklearn.compose import ColumnTransformer\n",
"from sklearn.pipeline import Pipeline, make_pipeline\n",
"from sklearn.preprocessing import (\n",
"    FunctionTransformer,\n",
"    KBinsDiscretizer,\n",
"    OneHotEncoder,\n",
"    StandardScaler,\n",
")\n",
"import warnings\n",
"warnings.filterwarnings('ignore')\n",
"\n",
"log_scale_transformer = make_pipeline(\n",
"    FunctionTransformer(np.log, validate=False), StandardScaler()\n",
");\n",
"\n",
"linear_model_preprocessor = ColumnTransformer(\n",
"    [\n",
"        (\"passthrough_numeric\", \"passthrough\", [\"BonusMalus\"]),\n",
"        (\"binned_numeric\", KBinsDiscretizer(n_bins=10), [\"VehAge\", \"DrivAge\"]),\n",
"        (\"log_scaled_numeric\", log_scale_transformer, [\"Density\"]),\n",
"        (\n",
"            \"onehot_categorical\",\n",
"            OneHotEncoder(sparse=False),\n",
"            [\"VehBrand\", \"VehPower\", \"VehGas\", \"Region\", \"Area\"],\n",
"        ),\n",
"    ],\n",
"    remainder=\"drop\",\n",
");\n",
"\n",
"poisson_glm = Pipeline(\n",
"    [\n",
"        (\"preprocessor\", linear_model_preprocessor),\n",
"        (\"regressor\", PoissonRegressor(alpha=1e-12, max_iter=300)),\n",
"    ]\n",
");\n",
"\n",
"poisson_glm_pca = Pipeline(\n",
"    [\n",
"        (\"preprocessor\", linear_model_preprocessor),\n",
"        (\"pca\", PCA(n_components=14, whiten=True)),\n",
"        (\"regressor\", PoissonRegressor(alpha=1e-12, max_iter=300)),\n",
"    ]\n",
");\n",
"\n",
"poisson_glm.fit(df_train, df_train[\"Frequency\"], regressor__sample_weight=df_train[\"Exposure\"])\n",
"\n",
"poisson_glm_pca.fit(\n",
"    df_train, df_train[\"Frequency\"], regressor__sample_weight=df_train[\"Exposure\"]\n",
");"
]
},
{
"cell_type": "markdown",
"id": "bfbd0ff1",
"metadata": {},
"source": [
"### Now we evaluate the new models"
]
},
{
"cell_type": "code",
"execution_count": 20,
"id": "0ffae598",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"PoissonRegressor evaluation: 1.3773\n",
"PoissonRegressor+PCA evaluation: 1.4399\n"
]
}
],
"source": [
"def score_estimator(y_pred, y_gt, gt_weight):\n",
"    \"\"\"Score an estimator on the test set.\"\"\"\n",
"    y_pred = np.squeeze(y_pred)\n",
"    dev = mean_poisson_deviance(y_gt, y_pred, sample_weight=gt_weight)\n",
"    return dev\n",
"\n",
"\n",
"def score_sklearn_estimator(estimator, df_test):\n",
"    \"\"\"A wrapper to score a sklearn pipeline on a dataframe\"\"\"\n",
"    return score_estimator(estimator.predict(df_test), df_test[\"Frequency\"], df_test[\"Exposure\"])\n",
"\n",
"\n",
"def score_concrete_glm_estimator(poisson_glm_pca, q_glm, df_test):\n",
"    \"\"\"A wrapper to score QuantizedGLM on a dataframe, transforming the dataframe using\n",
"    a sklearn pipeline\n",
"    \"\"\"\n",
"    test_data = poisson_glm_pca[\"pca\"].transform(poisson_glm_pca[\"preprocessor\"].transform(df_test))\n",
"    q_test_data = q_glm.quantize_input(test_data)\n",
"    y_pred = q_glm.forward_and_dequant(q_test_data)\n",
"    return score_estimator(y_pred, df_test[\"Frequency\"], df_test[\"Exposure\"])\n",
"\n",
"\n",
"print(f\"PoissonRegressor evaluation: {score_sklearn_estimator(poisson_glm, df_test):.4f}\")\n",
"print(f\"PoissonRegressor+PCA evaluation: {score_sklearn_estimator(poisson_glm_pca, df_test):.4f}\")\n"
]
},
{
"cell_type": "markdown",
"id": "de58b9eb",
"metadata": {},
"source": [
"### Test the multi-variate GLM with multiple quantization bit-widths"
]
},
{
"cell_type": "code",
"execution_count": 21,
"id": "bce8b011",
"metadata": {},
"outputs": [],
"source": [
"# Now, get calibration data from the held out set\n",
"calib_data = poisson_glm_pca[\"pca\"].transform(\n",
" poisson_glm_pca[\"preprocessor\"].transform(df_calib)\n",
")\n",
"\n",
"# Let's see how performance decreases with bit-depth.\n",
"# This is just a test of our quantized model, not in FHE\n",
"n_bits_test = np.asarray([28, 16, 6, 5, 4, 3, 2])\n",
"dev_bits_test = np.zeros_like(n_bits_test,dtype=np.float32)\n",
"for i, n_bits in enumerate(n_bits_test):\n",
" q_glm = QuantizedGLM(n_bits, poisson_glm_pca[\"regressor\"], calib_data)\n",
" dev_bits_test[i] = score_concrete_glm_estimator(poisson_glm_pca, q_glm, df_test)\n",
"\n"
]
},
{
"cell_type": "markdown",
"id": "6dcb5f7e",
"metadata": {},
"source": [
"We plot the Poisson deviance with respect to the quantized bit-width, to show how performance degrades with quantization:"
]
},
{
"cell_type": "code",
"execution_count": 22,
"id": "0e3c4858",
"metadata": {},
"outputs": [
{
"data": {
"image/png": "",
"text/plain": [
"<Figure size 864x576 with 1 Axes>"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"from matplotlib import pyplot as plt\n",
"plt.clf()\n",
"fig, ax = plt.subplots(1, figsize=(12,8))\n",
"fig.patch.set_facecolor(\"white\")\n",
"ax.plot(n_bits_test, dev_bits_test, label=\"Poisson deviance for quantized GLM (in the clear)\")\n",
"ax.set_xlim(2,28)\n",
"ax.invert_xaxis()\n",
"ax.set_xlabel(\"Number of bits\")\n",
"ax.set_ylabel(\"Poisson deviance\")\n",
"ax.set_xscale(\"log\")\n",
"ax.set_xticks(n_bits_test)\n",
"ax.set_xticklabels([str(k) for k in n_bits_test])\n",
"ax.grid()\n",
"ax.legend(loc=\"upper left\")\n",
"display(fig)"
]
},
{
"cell_type": "markdown",
"id": "43e6fd06",
"metadata": {},
"source": [
"### Analysis\n",
"\n",
"While the prediction quality is mostly stable down to 6 bits, we see a decrease in prediction performance at lower bit-widths. For 4 bits the performance seems to improve, but this is probably just a lucky sampling of the data, as this graph shows a single experiment. We expect a smooth increase of the deviance with lower bit-width when running the experiment multiple times.\n",
"\n",
"With 14 features, we can have weights and data in at most 2 bits. As a rough bound, the product of two unsigned 2-bit values is at most 3 x 3 = 9, so a sum of 14 such products is at most 126, which still fits in the 7-bit accumulator (maximum value 127)."
]
},
{
"cell_type": "markdown",
"id": "1ac216b1",
"metadata": {},
"source": [
"We now choose an operating point that is compatible with FHE: 2-bit quantization."
]
},
{
"cell_type": "code",
"execution_count": 23,
"id": "3c521ec8",
"metadata": {},
"outputs": [],
"source": [
"q_glm = QuantizedGLM(2, poisson_glm_pca[\"regressor\"], calib_data)\n",
"test_data = poisson_glm_pca[\"pca\"].transform(poisson_glm_pca[\"preprocessor\"].transform(df_test))\n",
"q_test_data = q_glm.quantize_input(test_data)"
]
},
{
"cell_type": "markdown",
"id": "a7f45c8c",
"metadata": {},
"source": [
"### Compile the multi-variate GLM to FHE. Again, with a single line of code we compile to FHE:"
]
},
{
"cell_type": "code",
"execution_count": 24,
"id": "f89eaa07",
"metadata": {},
"outputs": [],
"source": [
"engine = q_glm.compile(q_test_data)"
]
},
{
"cell_type": "markdown",
"id": "baa0667b",
"metadata": {},
"source": [
"Finally, we evaluate the model on encrypted data:"
]
},
{
"cell_type": "code",
"execution_count": 25,
"id": "f6fe2737",
"metadata": {},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"100%|██████████| 100/100 [00:40<00:00, 2.50it/s]"
]
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"PoissonRegressor evaluation: 1.3773\n",
"PoissonRegressor+PCA evaluation: 1.4399\n",
"FHE Quantized deviance: 1.6530\n"
]
},
{
"name": "stderr",
"output_type": "stream",
"text": [
"\n"
]
}
],
"source": [
"y_pred_fhe = np.zeros((test_data.shape[0],), np.float32)\n",
"for i, test_sample in enumerate(tqdm(q_test_data.qvalues)):\n",
"    q_sample = np.expand_dims(test_sample, 1).transpose([1, 0]).astype(np.uint8)\n",
"    q_pred_fhe = engine.run(q_sample)\n",
"    y_pred_fhe[i] = q_glm.dequantize_output(q_pred_fhe)\n",
"\n",
"dev_pca_quantized_fhe = score_estimator(y_pred_fhe, df_test[\"Frequency\"], df_test[\"Exposure\"])\n",
"\n",
"print(f\"PoissonRegressor evaluation: {score_sklearn_estimator(poisson_glm, df_test):.4f}\")\n",
"print(f\"PoissonRegressor+PCA evaluation: {score_sklearn_estimator(poisson_glm_pca, df_test):.4f}\")\n",
"print(f\"FHE Quantized deviance: {dev_pca_quantized_fhe:.4f}\")"
]
},
{
"cell_type": "markdown",
"id": "c18dbdd1",
"metadata": {},
"source": [
"### Conclusion\n",
"\n",
"In this tutorial, we have discussed how we can use Concrete Numpy to convert a scikit-learn based Poisson regression model to FHE. \n",
"\n",
"First of all, we have shown that with the proper choice of pipeline and parameters, we can do the conversion with little loss of precision. This decrease in the quality of prediction is due to quantization of model weights and input data, and some minor noise can appear due to FHE. This noise is visible on the single variable FHE trend line as minor deviations of the blue curve with respect to the red one. \n",
"\n",
"Finally, we have shown how conversion of a model to FHE can be done with a single line of code and how quantization is aided by the tools in Concrete Numpy. \n"
]
}
],
"metadata": {
"execution": {
"timeout": 10800
}
},
"nbformat": 4,
"nbformat_minor": 5
}