SPEEDING UP MULTI-EXPONENTIATION ALGORITHM ON A MULTICORE SYSTEM

A public key cryptosystem is a basic tool for protecting data security. Most public key cryptosystem schemes include time-consuming operations such as modular multi-exponentiation. To address this problem, a new parallel algorithm for modular multi-exponentiation is introduced. The proposed algorithm is based on parallelizing the binary method. An experimental study on a multicore system shows that the running time of the proposed algorithm is smaller than that of the previous parallel algorithm for large data sizes under different numbers of processors. The improvement is up to 55% compared with the previous algorithm.

Keywords: public key cryptosystem, modular multi-exponentiation, parallel algorithm.
MSC: 11Y16, 68W10 and 11T71.


Introduction
Given n messages as integer numbers M_j, 0 ≤ j < n, and exponents in the binary representation E_j = (e_{j(l−1)} e_{j(l−2)} … e_{j1} e_{j0})_2, where l is the maximum length of the binary representation of the exponents, the modular multi-exponentiation problem is to compute ∏_{j=0}^{n−1} M_j^{E_j} mod N, where N is a large modulus. Modular multi-exponentiation is a fundamental operation in cryptography. Several cryptographic schemes and algorithms depend on multi-exponentiation: the RSA cryptosystem [1] and the Diffie-Hellman scheme [2] for single exponentiation; the DSA algorithm [3], the Schnorr identification and signature scheme [4] and the Brickell-McCurley scheme [5] for double exponentiation; and the ElGamal digital signature scheme [6] for triple exponentiation.
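As a concrete illustration, the problem can be stated in a few lines of Python (a naive sequential reference, not the paper's implementation; the function name multi_exp is ours):

```python
# Naive sequential reference for modular multi-exponentiation:
# compute prod_{j=0}^{n-1} M_j^{E_j} mod N.
def multi_exp(M, E, N):
    result = 1
    for m, e in zip(M, E):
        # pow(m, e, N) is the single modular exponentiation M_j^{E_j} mod N
        result = (result * pow(m, e, N)) % N
    return result

# Triple exponentiation (n = 3), as in ElGamal-style signature verification.
print(multi_exp([3, 5, 7], [10, 20, 30], 1009))
```

Each call to pow already costs O(l) modular multiplications with the binary method, which is why the fast algorithms below avoid computing the n exponentiations independently.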
There are two directions for speeding up the computation of modular multi-exponentiation. The first is to design a fast modular multiplication method; several papers have been presented to speed up the multiplication, for example the method of Yen and Laih [7]. The second direction is to design a fast exponentiation algorithm using methods such as the binary method, the 2^w-ary method and the sliding window method [8-10]. Dimitrov et al. [11] presented new algorithms for the computation of modular exponentiations based on complex arithmetic representations; their applicability depends on the assumption that the inverse elements can be precomputed in advance. Moller [9,12] presented two algorithms based on the sliding window method. For speeding up the computation of modular multi-exponentiation, Wu et al. [13] introduced a fast algorithm combining the complement recoding method and the minimal weight BSD representation technique. Sun et al. analyzed the computational efficiency of Wu et al.'s algorithm [13] by modeling it as a Markov chain [14]. Yen et al. [15] presented a series of algorithms for the computation of multi-exponentiation based on the binary greatest common divisor algorithm; these algorithms can evaluate double or multi-exponentiation with variable base numbers and exponents. In 2016, Wu et al. [16] proposed several batch multi-exponentiation algorithms that allow a large number of multi-base exponentiations to be processed in batch.
Several parallel works have been presented for the computation of the multi-exponentiation problem. Chiou [17] proposed a parallel algorithm based on the concurrent computation of the squaring and multiplication steps of the binary method. Chang and Lou [8] proposed two efficient parallel methods for the computation of modular multi-exponentiation based on the linear array model. Chang and Lou [18] also generalized Chiou's parallel algorithm [17] to compute the modular multi-exponentiation. Recently, Borges et al. [19] proposed a parallel algorithm to compute the modular multi-exponentiation that generalizes the work of Lara et al. [20].
In this work, we present a fast parallel algorithm to compute the modular multi-exponentiation based on the binary method and a parallel multiplication algorithm. We then study the efficiency of the proposed algorithm on a multicore system.
The rest of the paper is organized as follows. In Section 2, we discuss the parallel algorithm presented by Borges et al. [19]. The proposed parallel algorithm for the computation of the modular multi-exponentiation is given in Section 3. In Section 4, we study the proposed algorithm practically on a multicore system to measure the running time and the speedup; the comparison and discussion with the recent parallel algorithm are given in the same section. The conclusion of the presented work is given in Section 5.

Borges et al. Algorithm
In this section, we describe and analyze the recent parallel algorithm for the modular multi-exponentiation presented by Borges et al. [19], which we will call the BLP algorithm. The section is organized in two subsections. In the first subsection, we describe the main idea and the complete steps of the algorithm. In the second subsection, we give our observations on this algorithm.

The Algorithm
Borges et al. [19] introduced two parallel algorithms to compute the modular multi-exponentiation. In the first algorithm, the binary representation of the exponents, viewed as a two-dimensional array, is split among the available processors into equal-length partitions. Let {r_0 = 0, r_1, r_2, …, r_k} denote the points of partition. Using the points of partition, processor i computes a_i = (∏_{j=0}^{n−1} M_j^{E_{j,i}})^{2^{r_{i−1}}} mod N, where E_{j,i} = (e_{j(r_i−1)} e_{j(r_i−2)} … e_{j r_{i−1}})_2 is the ith partition of E_j. Finally, the algorithm computes the multiplication of the results of all processors. The second algorithm (BLP) was designed to balance the load equally among the processors to maximize the speedup. It is based on computing the optimal number of processors, k, and the optimal points of partition, {r_0, r_1, …, r_k}, to distribute the work equally among the k processors. The value of k is computed as follows [19].
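The partitioning idea of the first algorithm can be sketched as a sequential Python simulation (our own reading of the scheme; in the actual algorithm each partition i is handled by its own processor, and the variable names are ours). It relies on the identity E_j = Σ_i E_{j,i} · 2^{r_{i−1}}:

```python
def blp_first_algorithm(M, E, N, r):
    """Simulate the partitioned multi-exponentiation.
    r = [r_0 = 0, r_1, ..., r_k] are the points of partition, with r_k
    at least the maximum bit-length of the exponents."""
    k = len(r) - 1
    partials = []
    for i in range(1, k + 1):                      # partition i -> processor i
        lo, width = r[i - 1], r[i] - r[i - 1]
        a_i = 1
        for m, e in zip(M, E):
            e_ji = (e >> lo) & ((1 << width) - 1)  # E_{j,i}: bits r_{i-1}..r_i - 1
            a_i = (a_i * pow(m, e_ji, N)) % N
        partials.append(pow(a_i, 1 << lo, N))      # raise to 2^{r_{i-1}}
    result = 1
    for a in partials:                             # final multiplication step
        result = (result * a) % N
    return result
```

Since each processor only handles r_i − r_{i−1} bits of every exponent, balancing those widths (the goal of the second, BLP, algorithm) directly balances the squaring work.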
where [·] represents rounding to the nearest integer, M is the cost of the modular multiplication operation (equal to 1.14 from a practical point of view), and d is a chosen value (greater than or equal to 1) used to avoid the collision of r_i and r_{i+1} after truncation.
To compute the points of partition, they defined a value α in terms of l, x and k; the optimal points of partition r_i are then obtained from α as in Eq. (2) of [19]. The complete pseudocode for the algorithm is as follows.

Algorithm: BLP
Input: Messages as integer numbers M_j, 0 ≤ j < n, exponents in the binary representation E_j = (e_{j(l−1)} e_{j(l−2)} … e_{j1} e_{j0})_2 and modulus N, where l is the maximum length.
1. Compute the optimal number of processors k as in Eq. (1).

Analysis and Observations
In this subsection, we give two observations on Eq. (1) and Eq. (2), which are used to compute the optimal number of processors suggested by Borges et al. and to determine the positions of partition. In order to give our observations, we first calculate the optimal number of processors suggested by Borges et al. for different values of n and l, where n = 4, 8, 12, 16 and 20, and l = 2^10, 2^11, 2^12, 2^13, 2^14, 2^15, 2^16 and 2^17. Table 1 shows the optimal number of processors at which the maximum speed will occur according to Borges et al. From Table 1 we observed the following:
1. The optimal number of processors increases with increasing values of n and l. For large values of n and l, the optimal number of processors may not be available in real machines.
2. The formulas depend on the value of M, the cost of the modular multiplication operation, and this cost depends on the machine used. Also, the reduction of the formulas depends on assuming that the number of 1s in each column of the array of the binary representation of the exponents is n/2.
The question now is whether an algorithm can be designed to achieve the following goals:
1. The proposed algorithm does not depend on any parameters or assumptions.
2. The running time of the proposed algorithm is smaller than that of the BLP algorithm at the optimal number of processors k.
3. The running time of the proposed algorithm is smaller than that of the BLP algorithm at a fixed number of processors chosen according to the computing power of the machine, because many machines do not have the required number of processors k.
4. The behavior of the running time of the proposed algorithm at different numbers of processors is better than that of the BLP algorithm.
The Proposed Multi-Exponentiation Algorithm

In this section, we introduce a parallel algorithm to perform the modular multi-exponentiation that achieves the goals stated in the previous section.
Given the messages as an array of integers M_j, 0 ≤ j < n, and the binary representation of the exponents as a two-dimensional array E, a row E_j represents the binary representation of the (j + 1)th exponent, 0 ≤ j < n. The maximum length of the exponents will be denoted by l. First we define an array G of l elements as follows: g_i = ∏_{j: e_{ji} ≠ 0} M_j mod N, 0 ≤ i < l. The value of g_i can be considered as the value of the (i + 1)th column of the array E. Now the modular multi-exponentiation can be computed as ∏_{i=0}^{l−1} g_i^{2^i} mod N. The main idea of the proposed parallel algorithm consists of two stages. In the first stage, a processor is assigned to each column of the array E to compute g_i^{2^i} mod N, 0 ≤ i ≤ l − 1. This is achieved by first accumulating into g_i the product of the M_j mod N for which e_{ji} ≠ 0, and then squaring g_i i times. In the second stage, we apply the parallel multiplication algorithm [21] to multiply the values g_i, 0 ≤ i ≤ l − 1, and store the result in g_0.
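The column decomposition above can be checked with a short sequential Python simulation (our own sketch; the exponents are given here as integers, with bit i of E_j playing the role of e_{ji}, and the second stage's parallel multiplication shown as a plain loop):

```python
def fmme_sketch(M, E, N, l):
    """Sequential simulation of the two-stage column decomposition:
    the answer is prod_{i=0}^{l-1} g_i^{2^i} mod N."""
    g = []
    # Stage 1: one conceptual processor per column i.
    for i in range(l):
        g_i = 1
        for m, e in zip(M, E):
            if (e >> i) & 1:          # e_{ji} != 0
                g_i = (g_i * m) % N
        for _ in range(i):            # i squarings give g_i^{2^i} mod N
            g_i = (g_i * g_i) % N
        g.append(g_i)
    # Stage 2: multiply the g_i together (done with the parallel
    # multiplication algorithm [21] in the paper).
    result = 1
    for x in g:
        result = (result * x) % N
    return result
```

The key point is that the columns are completely independent in stage 1, so no inter-processor communication is needed until the final product.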
The main differences between the proposed algorithm and the BLP algorithm are as follows:
1. In the proposed algorithm, we do not compute an optimal number of processors; we use l processors theoretically.
In the experimental study we simulate the algorithm on p processors (the available number of processors).
2. In the proposed algorithm, each processor is assigned to a column of the binary representation array to compute the value of g_i, where 0 ≤ i < l. In the BLP algorithm, each processor is assigned to a two-dimensional block of the array of the binary representation to compute the value of a_i, where 0 ≤ i < k.
3. The final multiplication step in the BLP algorithm is executed sequentially, while in the proposed algorithm the multiplication step is executed in parallel.
The complete pseudocode for the fast parallel modular multi-exponentiation algorithm, FMME, is as follows: step 1 represents the first stage of the proposed algorithm, while steps 2-5 represent the second stage.

Algorithm: FMME
Input: Messages as an array of integers M of n elements, exponents in the binary representation as a two-dimensional array E of n rows and l columns, and modulus N.
1. For each column i, 0 ≤ i < l, in parallel:
   g_i = 1
   for j = 0 to n − 1 do
      if e_{ji} ≠ 0 then g_i = g_i × M_j mod N
   square g_i i times, so that g_i = g_i^{2^i} mod N
2-5. Apply the parallel multiplication algorithm [21] to multiply the values g_i, 0 ≤ i < l, and store the result in g_0.
Output: g_0
End
The running time complexity of the proposed algorithm can be analyzed as follows. In the first step, l − 1 squarings and n multiplications are performed. In steps 2-5, approximately t_FPM [21] multiplications are performed. The time of the proposed algorithm is T = (l − 1) × S + (n + t_FPM) × M, where S and M are the running times required for the modular squaring and the modular multiplication, respectively. The storage complexity of the proposed algorithm is O(l).
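To see where t_FPM comes from, the second stage can be viewed as a pairwise product tree: each round multiplies disjoint pairs concurrently, so roughly log2(l) rounds suffice. The following is our own illustrative sketch of this reduction pattern, not the pseudocode of [21]:

```python
def tree_product(vals, N):
    """Reduce a list of residues to their product mod N by pairwise rounds.
    All multiplications within one round are independent, so on enough
    processors each round costs one multiplication time."""
    vals = list(vals)
    rounds = 0
    while len(vals) > 1:
        nxt = [(vals[t] * vals[t + 1]) % N
               for t in range(0, len(vals) - 1, 2)]
        if len(vals) % 2:             # odd element carries over unchanged
            nxt.append(vals[-1])
        vals = nxt
        rounds += 1
    return vals[0], rounds
```

For l values the loop runs ceil(log2 l) times, which matches the logarithmic number of parallel multiplication steps attributed to t_FPM above.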

Experimental Results
In this section, we present an implementation of the FMME and BLP algorithms and then compare them with respect to running time and speedup.
The implementation is based on a multicore system from Google Cloud. The machine, n1-highmem-64, consists of 64 cores and 416 GB of memory. The algorithms were implemented in the C++ programming language, using the OpenMP library for parallelization and the GNU Multiple Precision (GMP) library for arbitrary precision arithmetic.
In the experiments, we ran the algorithms on different sizes and numbers of exponents E. In our experiments, N equals 2^l + 1, and E and M were chosen as random numbers of length l. The sizes of the exponents, l, range from 2^10 to 2^17. The numbers of exponents, n, are 4, 8, 12, 16 and 20. We ran the algorithms on a sample of 20 random numbers.
The analysis of the experimental results is described in three subsections. In the first subsection, we compare the FMME and BLP algorithms using the number of processors suggested by Borges et al. [19]. In the second subsection, we compare the FMME and BLP algorithms using a fixed number of processors. The speedup of the FMME and BLP algorithms is studied in the third subsection.

The Comparison at k Processors
In order to analyze the performance of the FMME algorithm, we compared the FMME and BLP algorithms using the optimal number of processors suggested by Borges et al. and shown in Table 1. In the cases n = 4, 8 and 12, we give the results using the required number of processors. Because the required number of processors in the cases n = 16 and 20 is greater than 64, we give the results for these cases using a simulation on a machine with 64 cores. An important note concerns the running time of the multiplication of the values g_i, 0 ≤ i < l, generated by the first step of the FMME algorithm: in some cases, this running time using the sequential multiplication algorithm is smaller than the time using the parallel multiplication algorithm [21]. The reason is that the modular squaring performed in the first step sometimes yields many small values for some g_i. The results of the running time of the two algorithms using different inputs are shown in Table 2.

The Comparison at a Fixed Number of Processors
In the cases of a large number of exponents n and a large exponent size l, the number of processors k suggested by Borges et al. is large. Therefore, a comparison between the FMME and BLP algorithms using a fixed number of processors, independent of n and l, is important to give an impression of the behavior of the running time of the two algorithms. Note that we ignored the running times in the cases l = 2^10, 2^11, 2^12 and 2^13, because all running times in these cases are smaller than 1 second. From Table 3, we observed the following:
1. The BLP algorithm is better than the FMME algorithm in the case n = 4 and l = 2^14 at p = 16 and p = 32. The percentage of improvement of the BLP algorithm is 4%.
2. The FMME algorithm is better than the BLP algorithm in all cases of n = 8, 12, 16 and 20 for all values of l at p = 16 and 32, and in the case n = 4 with l = 2^15, 2^16 and 2^17 at p = 16 and p = 32. We also observed that the percentage of improvement of the FMME algorithm increases with the length of the exponents, i.e., with increasing l. For example, the percentage of improvement of the FMME algorithm in the case n = 12 and l = 2^14 is 34% for p = 16, while in the case n = 12 and l = 2^17 it is 44% for p = 16.

The Speed up Measurement
The speedup of a parallel algorithm is defined as the ratio of the running time of the sequential version to the running time using p processors [22]. Table 4 gives the running time of the sequential algorithm [19], which is used to compute the speedup of the FMME and BLP algorithms. The implementation of the sequential algorithm is based on the same platform and input data.
To measure the speedup of the FMME and BLP algorithms, we computed the speedup of both algorithms at different numbers of processors and different data sizes. Table 5 gives the speedup comparison between the FMME and BLP algorithms at p = 16, 32 and at the optimal number of processors k suggested by Borges et al.

4. Multiply the k values of a_i and assign the result to C = ∏_{i=1}^{k} a_i mod N.
End

Table 2: The time comparison between FMME and BLP algorithms, where the running time is measured in seconds.

Table 3: The time comparison between FMME and BLP algorithms at p = 16 and 32.

Table 4: The running time of the sequential algorithm.

Table 5: The speedup comparison between FMME and BLP algorithms at p = 16, 32 and p = k.
From Table 5, we observed that the speedup of the FMME algorithm is better than the speedup of the BLP algorithm except