mirror of
https://github.com/data61/MP-SPDZ.git
synced 2026-01-09 13:37:58 -05:00
149 lines
5.3 KiB
ReStructuredText
149 lines
5.3 KiB
ReStructuredText
Compiler Optimizations
|
|
======================
|
|
|
|
A core tenet of MP-SPDZ is to reduce the number of communication
|
|
rounds by parallelizing operations. The are two mechanisms for this,
|
|
called merging and CISC. Merging is the more basic mechanism, doing
|
|
just round reduction and nothing else. CISC in addition uses
|
|
vectorization, which reduces the cost during compilation and the
|
|
footprint. In the following, we will explain the two mechanisms using
|
|
examples.
|
|
|
|
|
|
Merging
|
|
-------
|
|
|
|
Merging is the original mechanism forming the basis for `Keller et
|
|
al. <https://eprint.iacr.org/2013/143>`_ It is based on extending the
|
|
common three-operand instruction design to any number of argument
|
|
tuples. Consider the following high-level code::
|
|
|
|
sint(1) * sint(2)
|
|
sint(3) * sint(4)
|
|
|
|
Putting this into ``Programs/Source/2mul.py`` and running
|
|
``./compile.py 2mul -a debug`` results in the following content in
|
|
``debug-2mul-0``::
|
|
|
|
# 2mul-0--0
|
|
ldsi s2, 1 # 0
|
|
ldsi s3, 2 # 1
|
|
ldsi s4, 3 # 2
|
|
ldsi s5, 4 # 3
|
|
muls 8, 1, s0, s2, s3, 1, s1, s4, s5 # 4
|
|
|
|
You can see all the inputs loaded in the first few lines appear in the
|
|
last one. The arguments are follows (see also
|
|
:py:class:`Compiler.instructions.muls`): 8 indicates the number of
|
|
arguments to follow, and every four arguments correspond to one
|
|
multiplication. The first number withing every four arguments is the
|
|
vector size, following by the result register and the input
|
|
registers.
|
|
|
|
On the backend side, the multiplication is implemented in the
|
|
:cpp:func:`muls` member function in
|
|
:download:`../Processor/Processor.hpp`. Notice that
|
|
``protocol.exchange()`` (where the communication happens) is only
|
|
called once. See :ref:`low-level` for more information.
|
|
|
|
Within the compiler, the optimization is executed in the
|
|
:py:class:`Merger` class in :download:`../Compiler/allocator.py`. The
|
|
:py:func:`dependency_graph` member function builds a dependency graph
|
|
for all instructions, and instructions are merged in
|
|
:py:func:`longest_path_merge`.
|
|
|
|
|
|
CISC
|
|
----
|
|
|
|
This mechanism takes its name from `Complex Instruction Set Computer
|
|
<https://en.wikipedia.org/wiki/Complex_instruction_set_computer>`_,
|
|
which refers to the fact that more complex operations are treated as
|
|
instructions internally before being merged and spelled in vectorized
|
|
instructions. For a simple example, consider the following high-level
|
|
code::
|
|
|
|
program.use_trunc_pr = True
|
|
sfix(1) * sfix(2)
|
|
sfix(3) * sfix(4)
|
|
|
|
The resulting file ``debug-2fmul-0`` starts similarly to the one
|
|
above::
|
|
|
|
# 2fmul-0-cisc-1
|
|
ldsi s8, 65536 # 0
|
|
ldsi s9, 131072 # 1
|
|
ldsi s10, 196608 # 2
|
|
ldsi s11, 262144 # 3
|
|
muls 8, 1, s0, s8, s9, 1, s7, s10, s11 # 4
|
|
|
|
This corresponds to the fact that fixed-point multiplication starts
|
|
with an integer multiplication before truncation. The file continues
|
|
as follows::
|
|
|
|
# 2fmul-0--2
|
|
concats 5, s3(2), 1, s0, 1, s7 # 5
|
|
vmovs 2, s3(2), s3(2) # 6
|
|
# 2fmul-0-update-3
|
|
jmp 11 # 7
|
|
|
|
The first instruction creates a vector of size from the multiplication
|
|
results, and the last instruction jumps over the next code block just
|
|
to end up here::
|
|
|
|
# 2fmul-0-end-TruncPr(2)_47_16-5
|
|
ldint ci0, 3 # 19
|
|
stmint ci0, 8192 # 20
|
|
jmp -14 # 21
|
|
|
|
This code prepares for a function call. Functions are used for code
|
|
reusability, i.e., the same code only has to be compiled once per tape
|
|
and vector size. The last instruction jumps to the start of the
|
|
function here (the code block jumped over above)::
|
|
|
|
# 2fmul-0-begin-TruncPr(2)_47_16-4
|
|
ldarg ci1 # 8
|
|
vldi 2, c0(2), 32768 # 9
|
|
vmulci 2, c2(2), c0(2), 2147483647 # 10
|
|
vaddci 2, c0(2), c2(2), 32768 # 11
|
|
vaddm 2, s5(2), s3(2), c0(2) # 12
|
|
vtrunc_pr 2, 4, s3(2), s5(2), 47, 16 # 13
|
|
vsubsi 2, s5(2), s3(2), 1073741824 # 14
|
|
vmovs 2, s3(2), s5(2) # 15
|
|
vmovs 2, s1(2), s3(2) # 16
|
|
ldmint ci1, 8192 # 17
|
|
jmpi ci1 # 18
|
|
|
|
The actual truncation happens in the vectorized instructions starting
|
|
with v. The only communication-relevant instruction is (v)trunc_pr,
|
|
where truncation by 16 bits is done via a protocol defined in the
|
|
virtual machine. For example, the implementation for Rep3 is found in
|
|
:download:`../Protocols/Replicated.hpp`. The other vectorized
|
|
instructions are required to turn negative values into positive ones,
|
|
which is a precondition for the protocol. See Protocol in 3.1 in
|
|
`Catrina and Saxena <https://www.ifca.ai/pub/fc10/31_47.pdf>`_ for an
|
|
explanation. Lastly, the last two instructions load where to jump back
|
|
to, which is here::
|
|
|
|
# 2fmul-0-call-TruncPr(2)_47_16-6
|
|
picks s0, s1(2), 0, 1 # 22
|
|
picks s0, s1(2), 1, 1 # 23
|
|
|
|
The two instructions extract the results from the vector, which makes
|
|
them available individually for further computation.
|
|
|
|
While this example saves relatively little by using the CISC
|
|
functionality, this isn't the case for other usages. For example,
|
|
removing ``program.use_trunc_pr`` results in several hundred
|
|
instructions, and more involved mathematical functions corresponds to
|
|
thousands of instructions.
|
|
|
|
Internally, the CISC functionality is in implemented in functions
|
|
decorators in :download:`../Compiler/instructions_base.py`. Our
|
|
example uses :py:func:`ret_cisc` on :py:func:`TruncPr` in
|
|
:download:`../Compiler/floatingpoint.py`. This is because
|
|
:py:func:`TruncPr` returns an :py:class:`sint`. Other decorators are
|
|
:py:func:`cisc` (the result is stored in the first argument, which
|
|
must be an :py:class:`sint`) and :py:func:`sfix_cisc` (the result and
|
|
all arguments are instances of :py:class:`sfix`).
|