mirror of https://github.com/zama-ai/concrete.git synced 2026-02-12 13:45:08 -05:00

Benoit Chevallier-Mames 69ee148d97 fix(frontend): fixing an issue in the string generation

closes #819

2024-07-30 16:41:03 +02:00

Computing the Levenshtein distance in FHE

Levenshtein distance

The Levenshtein distance is a classical metric for comparing two strings. Write the strings a and b as vectors of characters, so that a[0] is the first character of a and a[1:] is the rest of the string. The Levenshtein distance is then defined as:

Levenshtein(a, b) :=
    length(a) if length(b) == 0, or
    length(b) if length(a) == 0, or
    Levenshtein(a[1:], b[1:]) if a[0] == b[0], or
    1 + min(Levenshtein(a[1:], b), Levenshtein(a, b[1:]), Levenshtein(a[1:], b[1:]))
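
The recursive definition above translates almost line for line into Python. The sketch below works on clear (unencrypted) strings. The name levenshtein_reference is chosen here for illustration and is not taken from levenshtein_distance.py; memoization is added only to keep the recursion fast.

```python
from functools import lru_cache


@lru_cache(maxsize=None)
def levenshtein_reference(a: str, b: str) -> int:
    """Levenshtein distance on clear strings, following the recursive definition."""
    if len(b) == 0:
        return len(a)
    if len(a) == 0:
        return len(b)
    if a[0] == b[0]:
        # First characters match: no edit needed at this position
        return levenshtein_reference(a[1:], b[1:])
    return 1 + min(
        levenshtein_reference(a[1:], b),      # delete a[0]
        levenshtein_reference(a, b[1:]),      # insert b[0]
        levenshtein_reference(a[1:], b[1:]),  # substitute a[0] with b[0]
    )


print(levenshtein_reference("Zama", "amazing"))  # 5
```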

More information can be found, for example, on the Wikipedia page for the Levenshtein distance.

Computing the distance in FHE

Computing this distance over encrypted data can be useful, for example in the banking sector. Our code shows how to do so simply, with our FHE modules.

Available options are:

usage: levenshtein_distance.py [-h] [--show_mlir] [--show_optimizer] [--autotest] [--autoperf] [--distance DISTANCE DISTANCE]
                               [--alphabet {string,STRING,StRiNg,ACTG}] [--max_string_length MAX_STRING_LENGTH]

Levenshtein distance in Concrete.

optional arguments:
  -h, --help            show this help message and exit
  --show_mlir           Show the MLIR
  --show_optimizer      Show the optimizer outputs
  --autotest            Run random tests
  --autoperf            Run benchmarks
  --distance DISTANCE DISTANCE
                        Compute a distance
  --alphabet {string,STRING,StRiNg,ACTG}
                        Setting the alphabet
  --max_string_length MAX_STRING_LENGTH
                        Setting the maximal size of strings

The different alphabets are:

string: lowercase letters, i.e. [a-z]*
STRING: uppercase letters, i.e. [A-Z]*
StRiNg: lowercase and uppercase letters, i.e. [a-zA-Z]*
ACTG: [ACTG]*, for DNA analysis

It is very easy to add a new alphabet in the code.
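
As a rough illustration of what an alphabet amounts to, the sketch below maps each letter to a small integer index, which is the kind of value an FHE circuit would compare. The registry layout, the letter ordering, the function name encode and the extra "digits" alphabet are all assumptions made for this example; they are not taken from levenshtein_distance.py.

```python
import string

# Hypothetical registry: alphabet name -> allowed letters.
# The four names come from the README; the ordering of letters is assumed.
ALPHABETS = {
    "string": string.ascii_lowercase,
    "STRING": string.ascii_uppercase,
    "StRiNg": string.ascii_lowercase + string.ascii_uppercase,
    "ACTG": "ACTG",
}

# Adding an alphabet is then a single new entry (hypothetical example).
ALPHABETS["digits"] = string.digits


def encode(text: str, alphabet_name: str) -> list[int]:
    """Map each character of text to its index in the chosen alphabet."""
    index = {letter: i for i, letter in enumerate(ALPHABETS[alphabet_name])}
    try:
        return [index[char] for char in text]
    except KeyError as error:
        raise ValueError(
            f"character {error.args[0]!r} is not in alphabet {alphabet_name}"
        ) from None


print(encode("GTCA", "ACTG"))  # [3, 2, 1, 0]
```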

The most important usages are:

python levenshtein_distance.py --distance Zama amazing --alphabet StRiNg --max_string_length 7: Compute the distance between the strings "Zama" and "amazing", using the characters of the StRiNg alphabet


Running distance between strings 'Zama' and 'amazing' for alphabet StRiNg:

    Computing Levenshtein between strings 'Zama' and 'amazing' - distance is 5, computed in 44.51 seconds

Successful end

python levenshtein_distance.py --autotest: Run random tests with the alphabet.

Making random tests with alphabet string
Letters are abcdefghijklmnopqrstuvwxyz

Computations in simulation

    Computing Levenshtein between strings '' and '' - OK
    Computing Levenshtein between strings '' and 'p' - OK
    Computing Levenshtein between strings '' and 'vv' - OK
    Computing Levenshtein between strings '' and 'mxg' - OK
    Computing Levenshtein between strings '' and 'iuxf' - OK
    Computing Levenshtein between strings 'k' and '' - OK
    Computing Levenshtein between strings 'p' and 'g' - OK
    Computing Levenshtein between strings 'v' and 'ky' - OK
    Computing Levenshtein between strings 'f' and 'uoq' - OK
    Computing Levenshtein between strings 'f' and 'kwfj' - OK
    Computing Levenshtein between strings 'ut' and '' - OK
    Computing Levenshtein between strings 'pa' and 'g' - OK
    Computing Levenshtein between strings 'bu' and 'sx' - OK
    Computing Levenshtein between strings 'is' and 'diy' - OK
    Computing Levenshtein between strings 'fz' and 'unda' - OK
    Computing Levenshtein between strings 'sem' and '' - OK
    Computing Levenshtein between strings 'dbr' and 'o' - OK
    Computing Levenshtein between strings 'dgj' and 'hk' - OK
    Computing Levenshtein between strings 'ejb' and 'tfo' - OK
    Computing Levenshtein between strings 'afa' and 'ygqo' - OK
    Computing Levenshtein between strings 'lhcc' and '' - OK
    Computing Levenshtein between strings 'uoiu' and 'u' - OK
    Computing Levenshtein between strings 'tztt' and 'xo' - OK
    Computing Levenshtein between strings 'ufsa' and 'mil' - OK
    Computing Levenshtein between strings 'uuzl' and 'dzkr' - OK

Computations in FHE

    Computing Levenshtein between strings '' and '' - OK in 1.29 seconds
    Computing Levenshtein between strings '' and 'p' - OK in 0.26 seconds
    Computing Levenshtein between strings '' and 'vv' - OK in 0.26 seconds
    Computing Levenshtein between strings '' and 'mxg' - OK in 0.22 seconds
    Computing Levenshtein between strings '' and 'iuxf' - OK in 0.22 seconds
    Computing Levenshtein between strings 'k' and '' - OK in 0.22 seconds
    Computing Levenshtein between strings 'p' and 'g' - OK in 1.09 seconds
    Computing Levenshtein between strings 'v' and 'ky' - OK in 1.93 seconds
    Computing Levenshtein between strings 'f' and 'uoq' - OK in 3.09 seconds
    Computing Levenshtein between strings 'f' and 'kwfj' - OK in 3.98 seconds
    Computing Levenshtein between strings 'ut' and '' - OK in 0.25 seconds
    Computing Levenshtein between strings 'pa' and 'g' - OK in 1.90 seconds
    Computing Levenshtein between strings 'bu' and 'sx' - OK in 3.52 seconds
    Computing Levenshtein between strings 'is' and 'diy' - OK in 5.04 seconds
    Computing Levenshtein between strings 'fz' and 'unda' - OK in 6.53 seconds
    Computing Levenshtein between strings 'sem' and '' - OK in 0.22 seconds
    Computing Levenshtein between strings 'dbr' and 'o' - OK in 2.78 seconds
    Computing Levenshtein between strings 'dgj' and 'hk' - OK in 4.92 seconds
    Computing Levenshtein between strings 'ejb' and 'tfo' - OK in 7.18 seconds
    Computing Levenshtein between strings 'afa' and 'ygqo' - OK in 9.25 seconds
    Computing Levenshtein between strings 'lhcc' and '' - OK in 0.22 seconds
    Computing Levenshtein between strings 'uoiu' and 'u' - OK in 3.52 seconds
    Computing Levenshtein between strings 'tztt' and 'xo' - OK in 6.45 seconds
    Computing Levenshtein between strings 'ufsa' and 'mil' - OK in 9.11 seconds
    Computing Levenshtein between strings 'uuzl' and 'dzkr' - OK in 12.01 seconds

Successful end

python levenshtein_distance.py --autoperf: Benchmark with random strings, for the different alphabets.

Typical performances for alphabet ACTG, with string of maximal length:

    Computing Levenshtein between strings 'GCGA' and 'GTCA' - OK in 6.04 seconds
    Computing Levenshtein between strings 'TCGA' and 'ACAA' - OK in 5.57 seconds
    Computing Levenshtein between strings 'CAGT' and 'CGTT' - OK in 5.63 seconds

Typical performances for alphabet string, with string of maximal length:

    Computing Levenshtein between strings 'ctow' and 'qtor' - OK in 17.54 seconds
    Computing Levenshtein between strings 'vwky' and 'enfh' - OK in 16.46 seconds
    Computing Levenshtein between strings 'dqse' and 'spps' - OK in 16.49 seconds

Typical performances for alphabet STRING, with string of maximal length:

    Computing Levenshtein between strings 'TQBW' and 'LKIZ' - OK in 16.62 seconds
    Computing Levenshtein between strings 'HANA' and 'CFVO' - OK in 16.32 seconds
    Computing Levenshtein between strings 'BEXY' and 'YAWM' - OK in 16.58 seconds

Typical performances for alphabet StRiNg, with string of maximal length:

    Computing Levenshtein between strings 'iYmH' and 'ONnz' - OK in 30.56 seconds
    Computing Levenshtein between strings 'hZyX' and 'vhHH' - OK in 30.11 seconds
    Computing Levenshtein between strings 'sJdj' and 'strn' - OK in 30.48 seconds

Successful end

Complexity analysis

Let's analyze the complexity of the function levenshtein_fhe in FHE. The function cannot branch with if statements as the clear function levenshtein_clear does: it has to compute both branches (the one for True and the one for False), and then select between the two possible values with fhe.if_then_else. This slowdown is not specific to Concrete; it is inherent to FHE, where an encrypted condition cannot be used to decide which branch to execute.
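
The effect of this constraint can be illustrated on clear data. In the sketch below, select is a plain-Python stand-in for an encrypted multiplexer such as fhe.if_then_else, and levenshtein_branch_free always evaluates both candidate values before selecting one. Both names are hypothetical, and treating the string lengths as public is an assumption of this example; it is not a description of how levenshtein_fhe is written.

```python
from functools import lru_cache


def select(condition: int, if_true: int, if_false: int) -> int:
    """Arithmetic multiplexer on clear integers; condition must be 0 or 1."""
    return condition * if_true + (1 - condition) * if_false


@lru_cache(maxsize=None)
def levenshtein_branch_free(a: str, b: str) -> int:
    # Branching on lengths is allowed here because lengths are assumed public
    if len(b) == 0:
        return len(a)
    if len(a) == 0:
        return len(b)

    # In FHE this bit would be encrypted, so it cannot drive an if statement
    equal = int(a[0] == b[0])

    # Both candidate values are always computed, whatever the value of equal
    value_if_equal = levenshtein_branch_free(a[1:], b[1:])
    value_if_different = 1 + min(
        levenshtein_branch_free(a[1:], b),
        levenshtein_branch_free(a, b[1:]),
        value_if_equal,
    )
    return select(equal, value_if_equal, value_if_different)


print(levenshtein_branch_free("Zama", "amazing"))  # 5
```

Compared with a version that branches on the comparison, every call pays for the three-way minimum even when the first characters match, which is the extra cost described above.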

Another interesting aspect is the impact of the choice of the alphabet: in run, we compare two chars of the alphabet and return an encrypted boolean that encodes whether they are equal. This is done with a single programmable bootstrapping (PBS) of w+1 bits, where w is the ceiling of log2 of the number of chars in the alphabet. For example, for the 'string' alphabet, which has 26 letters, w = 5, so we use a signed 6-bit value as input of a table lookup. For the larger 'StRiNg' alphabet, it is a signed 7-bit PBS. For the small DNA alphabet 'ACTG', it is only a signed 3-bit PBS.
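
The bit widths quoted above can be reproduced with a few lines of arithmetic. In this sketch, w is taken as the smallest number of bits that can hold every letter index of the alphabet, and one extra bit accounts for the signed input of the table lookup. The function names are chosen for illustration only.

```python
# Alphabet sizes as given in the README: 26 lowercase, 26 uppercase, 52 mixed, 4 DNA letters
ALPHABET_SIZES = {"string": 26, "STRING": 26, "StRiNg": 52, "ACTG": 4}


def index_bits(alphabet_size: int) -> int:
    """Smallest w such that the indices 0 .. alphabet_size - 1 fit in w bits."""
    return (alphabet_size - 1).bit_length()


def comparison_pbs_bits(alphabet_size: int) -> int:
    """Width of the signed table-lookup input: w bits plus one sign bit."""
    return index_bits(alphabet_size) + 1


for name, size in ALPHABET_SIZES.items():
    print(f"{name}: w = {index_bits(size)}, signed {comparison_pbs_bits(size)}-bit PBS")
```

This prints w = 5 (6-bit PBS) for string and STRING, w = 6 (7-bit PBS) for StRiNg, and w = 2 (3-bit PBS) for ACTG, in line with the figures in the text.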

Benchmarks on hpc7a

The benchmarks were run with Concrete 2.7 on an hpc7a machine on AWS, and give:

Typical performances for alphabet ACTG, with string of maximal length:

    Computing Levenshtein between strings 'AGTC' and 'TGGA' - OK in 6.00 seconds
    Computing Levenshtein between strings 'GTAA' and 'AGAC' - OK in 5.51 seconds
    Computing Levenshtein between strings 'TCTT' and 'CACG' - OK in 5.49 seconds

Typical performances for alphabet string, with string of maximal length:

    Computing Levenshtein between strings 'jqdk' and 'zqlf' - OK in 17.43 seconds
    Computing Levenshtein between strings 'uquc' and 'qvvp' - OK in 16.50 seconds
    Computing Levenshtein between strings 'vebm' and 'ybqo' - OK in 16.46 seconds

Typical performances for alphabet STRING, with string of maximal length:

    Computing Levenshtein between strings 'UQES' and 'NWXQ' - OK in 16.53 seconds
    Computing Levenshtein between strings 'LAJG' and 'NEGP' - OK in 16.26 seconds
    Computing Levenshtein between strings 'OSQG' and 'OTEH' - OK in 16.52 seconds

Typical performances for alphabet StRiNg, with string of maximal length:

    Computing Levenshtein between strings 'ixgu' and 'cOSy' - OK in 30.94 seconds
    Computing Levenshtein between strings 'QGCj' and 'Lknx' - OK in 29.82 seconds
    Computing Levenshtein between strings 'fKVC' and 'xqaI' - OK in 30.27 seconds

Successful end