On the Performance of Network Parallel Training in Artificial Neural Networks
Ludvig Ericson, Rendani Mbuvha
School of Computer Science & Communication
KTH Royal Institute of Technology
Stockholm, Sweden
{ludv,rendani}@kth.se
Abstract—Artificial Neural Networks (ANNs) have received increasing attention in recent years, with applications that span a wide range of disciplines including vital domains such as medicine, network security and autonomous transportation. However, neural network architectures are becoming increasingly complex and, with an increasing need to obtain real-time results from such models, it has become pivotal to use parallelization as a mechanism for speeding up network training and deployment. In this work we propose an implementation of Network Parallel Training through Cannon's Algorithm for matrix multiplication. We show that increasing the number of processes speeds up training until the point where process communication costs become prohibitive; this point varies by network complexity. We also show through empirical efficiency calculations that the speedup obtained is superlinear.

I. INTRODUCTION

Artificial neural networks, or ANNs, are computational tools that are inspired by their biological namesake. ANNs are particularly useful when estimating functions of arbitrary complexity using given example data [1]. This in turn makes them well suited for pattern recognition tasks, where they have been successfully applied in dimensionality reduction, time series prediction, and data mining. While they have long been on the fringes of the research community since their inception in the '80s, recent advances in computer and material sciences have allowed for an unprecedented growth in computational power, and thus for the possibility of even more sophisticated networks. One of the drawbacks of modern ANNs is the time it takes to train such a network.

As society becomes increasingly dependent on machine learning applications, real-time output from on-line training of neural network systems is critical to positive user experiences and the usefulness of such systems. Speeding up neural network training with a parallel algorithm, as we propose, has therefore become essential to the efficient deployment of machine learning applications. The case for parallelization is also reinforced by the increased availability of multi-core CPUs and general-purpose graphics processing units in recent years.

Most early work on the parallelization of neural networks, such as [2], primarily suggests special customised hardware. This approach has certain limitations, as such hardware is costly and may display suboptimal performance when used for other tasks. Consequently, it can be impractical for modern scalability platforms such as HPC clusters and cloud computing infrastructure.

Dahl et al. [3] present a Pattern Parallel Training (PPT) algorithm for parallel ANN training. In PPT, multiple ANNs are duplicated across different processes. In each process an ANN is trained on a randomly selected subset of the training examples. At the end of an epoch each process broadcasts its weight updates, which are then applied to all other ANNs. This has the effect of reducing the training time through faster convergence.

Suri et al. [4] use a similar approach to [3]; however, at the end of each full training round they apply an evolutionary algorithm to select the best candidate networks for further training.

The approach most similar to our work is that of [5], who also use a node parallelization strategy while communicating the weights corresponding to “ghost” neurons. Their strategy requires that each process is fed all the inputs; it therefore follows that this becomes onerous in terms of the amount of memory required.

The contribution and novelty of our proposed approach centers on its memory efficiency: by using Cannon's algorithm it does not require all the inputs to be fed to each process, and no specialized hardware is required. The remainder of this paper is organised as follows: Section II describes artificial neural networks, backpropagation and our proposed implementation of Network Parallel Training. Section III sets out our experiments. Section IV discusses the experimental results. We conclude our paper in Section V.

II. METHODS

A. Artificial Neural Networks

An ANN is defined by a weighted, directed graph G = (V, E), where the vertices are likened to neurons and the edges to “synapses,” the connections between neurons. The weights are sensitivities, e.g. a highly-weighted connection will contribute more to the excitation of a neuron. We concern ourselves only with fully-connected feed-forward networks. A feed-forward network is one in which the graph is acyclic, and so the neurons can be subdivided into layers.
Being fully-connected, all layers are maximally connected to the next. The output $z_j$ of some neuron $j$ is thus a function of the output of the previous layer. We define it as

$$z_j = f_j(x) = h\Big(\sum_{\{i,j\} \in E} w_{ij} z_i + b_j\Big),$$

where $h$ is some non-linear activation function, $z_i$ is the output of the $i$th neuron in the previous layer, $w_{ij}$ is the weight for that connection, and $b_j$ is the neuron's bias.
A choice for the non-linearity function $h$ that has gained popularity in recent years is the ReLU, or rectified linear unit. We will use the leaky ReLU, defined as

$$h(x) = \begin{cases} x & \text{if } x > 0, \\ 0.01x & \text{otherwise.} \end{cases}$$

Figure 1. Illustration of the unit division in Network Parallel Training [3].

The activation can be described in terms of a matrix multiplication. Let $W_i$ be the weights of the $i$th layer, with each neuron as a column of weights. The activations $Z_i$ for that layer on a row-wise subset $X$ of the training data are then

$$Z_i = h(XW_i + b_i).$$
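For concreteness, a single layer's forward pass can be written as a plain serial C routine. The sketch below uses illustrative names (layer_forward, leaky_relu) and row-major storage, and is not our actual implementation; the parallel formulation we use is described in the following subsections.

/* Illustrative serial sketch of one layer's forward pass, Z = h(X W + b),
 * using the leaky ReLU defined above. X is batch x in, W is in x out,
 * b has out entries, Z is batch x out; all arrays are row-major. */
static double leaky_relu(double x)
{
    return x > 0.0 ? x : 0.01 * x;
}

static void layer_forward(const double *X, const double *W, const double *b,
                          double *Z, int batch, int in, int out)
{
    for (int s = 0; s < batch; ++s) {           /* each sample (row of X)  */
        for (int j = 0; j < out; ++j) {         /* each unit in the layer  */
            double acc = b[j];                  /* start from the bias b_j */
            for (int i = 0; i < in; ++i)
                acc += X[s * in + i] * W[i * out + j];
            Z[s * out + j] = leaky_relu(acc);   /* apply the activation h  */
        }
    }
}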
B. Training Neural Networks

Training a neural network is done with stochastic gradient descent, or SGD. In SGD, the gradient of the loss function $L$ is computed for a single sample. The network parameters are then updated using the expression $\theta' = \theta - \eta \nabla L$, where $0 < \eta < 1$ is the learning rate. This process is repeated until some convergence criterion is satisfied.

The loss function is typically the $L_2$ distance between the network output and the ground-truth label, plus some regularization, i.e. given ground-truth label $y$, network output $f_\theta$, and regularization strength $\lambda$, we have

$$L(x, y) = |y - f_\theta(x)|_2 + \lambda |\theta|_2,$$

where $|\cdot|_2$ is the Euclidean distance (i.e. the $L_2$ distance).

A common way to extend SGD is mini-batch processing, where a batch of samples is processed and the gradient update is computed as the mean of the per-sample gradients. A further improvement is momentum learning, where consecutive updates in the same direction contribute to “momentum” in that direction.
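The update rule can be sketched as a short C routine. The momentum formulation below, with a velocity buffer and coefficient mu, is one common variant shown only for illustration; it is an assumption of this sketch rather than a detail fixed above, and grad is assumed to already hold the mean gradient over the mini-batch.

/* Illustrative mini-batch SGD step with momentum: theta' = theta - eta * grad,
 * with a decaying running sum of past gradients ("momentum") folded in.
 * With mu = 0 this reduces to the plain SGD update given above. */
static void sgd_momentum_step(double *theta, double *velocity, const double *grad,
                              int n, double eta, double mu)
{
    for (int i = 0; i < n; ++i) {
        velocity[i] = mu * velocity[i] - eta * grad[i];  /* accumulate momentum */
        theta[i] += velocity[i];                         /* apply the update    */
    }
}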
C. Network Parallel Training

Network Parallel Training (NPT) [3] is sometimes referred to as unit parallelization. In NPT, one divides the network units within each layer across different processes. In its simplest form, each process is responsible for calculating the activation of a single unit in each layer [6]. In a more practical setup the units are divided in a linearly load-balanced manner. See Figure 1 for an example.

In our approach, the strategy is to compute the matrix-matrix multiplication $XW_i$ for each layer in parallel using Cannon's algorithm for matrix multiplication, and then to compute the activation output for that unit. The shifting of the local matrices within Cannon's algorithm, as defined in Algorithm 1, makes the partial products required for the calculation of the activations in each unit available within the corresponding process.

D. Cannon's Algorithm

Cannon's algorithm is a popular parallelization strategy for matrix multiplication [7]. The algorithm efficiently computes matrix products by shifting rows and columns of the candidate matrices between processes based on a defined set of rules. Algorithm 1 shows Cannon's algorithm for two candidate matrices $A$ ($M \times N$) and $B$ ($N \times K$) within a grid of $p \times q$ processes. In the context of backpropagation, Cannon's algorithm effectively allows communication of weights between nodes that are placed within different processes.

Algorithm 1: Cannon's Matrix Multiplication
  Data: Number of parts $N$, sub-matrices $A$, $B$
  Result: Local part $C$ of the global matrix multiplication
  begin
    Shift rows of $A$ such that the $m$th row of $A$ shifts $m-1$ positions to the left
    Shift columns of $B$ such that the $n$th column of $B$ shifts $n-1$ positions upwards
    for $i \in \{1, 2, \ldots, N\}$ do
      $C \leftarrow C + AB$
      Shift rows of $A$ one process to the left, and columns of $B$ one process upwards
    end
    Undo the initial shifts of $A$ and $B$
  end
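As a concrete illustration of Algorithm 1, the sketch below performs one local block product C += AB with MPI. It assumes a square, periodic (wrap-around) q x q Cartesian communicator (e.g. created with MPI_Cart_create, as described under Implementation below), square n x n row-major local blocks, and a zero-initialised C; the name cannon_multiply is illustrative and this is a simplified sketch rather than the exact routine used in our implementation.

#include <mpi.h>

/* Simplified sketch of Algorithm 1 on a square, periodic q x q process grid.
 * A, B and C are the n x n local blocks held by this process; A and B are
 * shifted during the computation and restored before returning. */
static void cannon_multiply(double *A, double *B, double *C, int n,
                            MPI_Comm grid, int q)
{
    int rank, coords[2], src, dst;
    MPI_Comm_rank(grid, &rank);
    MPI_Cart_coords(grid, rank, 2, coords);

    /* Initial skew: row i of A moves i steps left, column j of B moves j steps up. */
    MPI_Cart_shift(grid, 1, -coords[0], &src, &dst);
    MPI_Sendrecv_replace(A, n * n, MPI_DOUBLE, dst, 0, src, 0, grid, MPI_STATUS_IGNORE);
    MPI_Cart_shift(grid, 0, -coords[1], &src, &dst);
    MPI_Sendrecv_replace(B, n * n, MPI_DOUBLE, dst, 0, src, 0, grid, MPI_STATUS_IGNORE);

    for (int step = 0; step < q; ++step) {
        /* Local multiply-accumulate: C += A * B. */
        for (int i = 0; i < n; ++i)
            for (int k = 0; k < n; ++k)
                for (int j = 0; j < n; ++j)
                    C[i * n + j] += A[i * n + k] * B[k * n + j];

        /* Shift A one process to the left and B one process upwards. */
        MPI_Cart_shift(grid, 1, -1, &src, &dst);
        MPI_Sendrecv_replace(A, n * n, MPI_DOUBLE, dst, 0, src, 0, grid, MPI_STATUS_IGNORE);
        MPI_Cart_shift(grid, 0, -1, &src, &dst);
        MPI_Sendrecv_replace(B, n * n, MPI_DOUBLE, dst, 0, src, 0, grid, MPI_STATUS_IGNORE);
    }

    /* Undo the initial skews so A and B are left as they were found. */
    MPI_Cart_shift(grid, 1, coords[0], &src, &dst);
    MPI_Sendrecv_replace(A, n * n, MPI_DOUBLE, dst, 0, src, 0, grid, MPI_STATUS_IGNORE);
    MPI_Cart_shift(grid, 0, coords[1], &src, &dst);
    MPI_Sendrecv_replace(B, n * n, MPI_DOUBLE, dst, 0, src, 0, grid, MPI_STATUS_IGNORE);
}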
E. Implementation

We implement our neural network in C using the Message Passing Interface (MPI) [8]. We use the MPI_Cart_create function to create a virtual Cartesian grid topology that enables the parallelization depicted in Figure 1.
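A minimal sketch of this grid setup is shown below. It assumes a two-dimensional periodic (wrap-around) topology, as required by the shifts of Cannon's algorithm, and the helper name make_grid is illustrative rather than taken from our code.

#include <mpi.h>

/* Create a 2-D periodic Cartesian communicator over all available processes
 * and report this process's (row, column) position in the grid. */
static MPI_Comm make_grid(MPI_Comm world, int coords[2])
{
    int nprocs, rank;
    int dims[2] = {0, 0};      /* let MPI pick a balanced factorisation */
    int periods[2] = {1, 1};   /* wrap around in both dimensions        */
    MPI_Comm grid;

    MPI_Comm_size(world, &nprocs);
    MPI_Dims_create(nprocs, 2, dims);            /* e.g. 16 -> 4 x 4         */
    MPI_Cart_create(world, 2, dims, periods, 1,  /* 1: allow rank reordering */
                    &grid);
    MPI_Comm_rank(grid, &rank);
    MPI_Cart_coords(grid, rank, 2, coords);
    return grid;
}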
III. EXPERIMENTS

A. Training Data

The data was created artificially by imposing a linear relationship between the inputs $X_{ij}$ and the outputs $Y_{ik}$. The inputs are created by sampling random numbers uniformly between 0 and 200, with the bias input set to 1. The output vector is created using

$$Y_{ik} = \big((7(k+1) \bmod 1000) + (3k \bmod 100)\big)\, X_{ik}.$$

Interestingly, since 7, 3 and 1000 are all pairwise coprime, this forms a basis for all linear functions $y = kx + m$ with parameters $k, m \in [0, 1000)$.
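For illustration, this data generation can be written as the short C routine below; the names are hypothetical and the uniform sampling via rand() is only one possible choice, so the exact generator used in our experiments may differ.

#include <stdlib.h>

/* Illustrative generation of one training pair: inputs X[0..d-1] drawn
 * uniformly from [0, 200), bias input X[d] = 1 (X has d+1 entries), and
 * outputs Y[k] = ((7(k+1) mod 1000) + (3k mod 100)) * X[k], with out <= d. */
static void make_example(double *X, double *Y, int d, int out)
{
    for (int i = 0; i < d; ++i)
        X[i] = 200.0 * rand() / ((double)RAND_MAX + 1.0);
    X[d] = 1.0;                                   /* bias input */

    for (int k = 0; k < out; ++k)
        Y[k] = ((7 * (k + 1) % 1000) + (3 * k % 100)) * X[k];
}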
B. Experiment Setup

In order to evaluate the performance of our NPT algorithm we created the following networks of different sizes and complexity:

1) a one-hidden-layer network with the number of input and output units q = 128;
2) a two-hidden-layer network with the number of input and output units q = 128;
3) a two-hidden-layer network with the number of input and output units q = 256; we refer to this as the “big” network.

We used an MPI barrier to synchronize before starting the training routine, and immediately stored the wall-clock time. The same was done after training completed. The runtime data is presented as relative runtimes in Table I.

IV. EXPERIMENT RESULTS

The networks above were run on an HPC cluster with an increasing number of processors. The resulting runtimes in seconds are shown in Figure 2, together with the relative performance statistics in Table I. It can be seen that for the single-layer networks the runtimes decrease as processes increase until we reach the optimal point at 16 processes; after this point communication cost becomes prohibitive, resulting in increasing runtimes. As the number of nodes is increased from 1 to 4, the runtime also increases, which might imply an additional cost of redundant communication and startup related to the unused nodes.

Figure 2. Plot of the runtimes $T_P$.

Increasing the number of layers $l$ from 1 to 2 increases the optimal number of processes from 16 to 32, which gives empirical evidence that the runtimes are linear in the number of layers $l$. The runtimes for the “big” network increase quadratically relative to the two-layered network on the same number of nodes. Thus the optimal number of processes for the “big” network is 64.

It is rather surprising that the runtimes of the two-layered network are lower than those of the single-layered network after 32 processes by a significant margin on the same number of nodes. This is a point that will require further investigation.

The speedup curve in Figure 3 shows similar results to the runtime plot, with exactly the same optimal process parameters. The efficiency plot in Figure 4 shows that it is most efficient to use 8 processes; above this point the overall benefit (increase in speedup) of having additional processes decreases.

Figure 3. Plot of the speedup $S_P$. We define the speedup at $P$ processes as $S_P = T_S^* / T_P$, where $T_P$ is the execution time of our algorithm with $P$ processes and $T_S^*$ is the “fastest known sequential version”; since we do not know of any better algorithm, we let $T_S^* = T_1$.

Figure 4. Plot of the efficiency $\eta_P = S_P / P$.
Table I
Relative performance of the different workloads as measured in wall-clock time, with the baseline being a 1-layer network on a single node with 16 processes.

P     1 layer, 1 node   2 layers, 1 node   2 layers, 4 nodes   2 big layers, 4 nodes
1     25.69             51.46              51.00               215.88
2     8.73              17.04              17.07               71.98
4     3.47              6.60               6.85                24.78
8     1.71              3.21               3.13                11.33
16    1.00              1.87               1.96                5.80
32    1.16              2.22               1.87                3.87
64    2.54              4.89               2.30                2.80
128   2.94              10.80              3.28                5.34
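As an illustration of how Table I relates to the definitions in Figures 3 and 4, take the 1-layer, 1-node column: the relative runtimes give

$$S_{16} = T_1 / T_{16} = 25.69 / 1.00 \approx 25.7, \qquad \eta_{16} = S_{16} / 16 \approx 1.6 > 1,$$

i.e. a superlinear speedup for that configuration.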
An important observation from the efficiency plots is that the processor efficiency exceeds 1 in certain regions; this implies that the speedup is superlinear [9], meaning that adding additional processes, up to a certain point, increases the speedup more than proportionally, possibly due to a reduction in RAM access times.

V. CONCLUSIONS AND REMARKS

An implementation of Network Parallel Training using Cannon's algorithm was proposed. Running times, speedup and efficiency were evaluated on networks of varying complexity. Results show that when using this strategy significant speedup is obtained. This effect is even more pronounced for more complex networks. We have also shown that there are regions where such speedup is superlinear, thus increasing the efficiency of resource utilization.

It would be interesting to explore different means of parallelization other than a per-unit split of the weight matrices. The topologies tested here were of relatively limited size. Parallelized neural networks are mostly necessary once the networks are of unusually large size; otherwise communication time will far outweigh any benefits of parallel execution due to the many data dependencies.

The data model used was a simple linear function depending on only one variable. While it was not relevant for the scope of this work, implementing a more sophisticated means of data generation would be an interesting next step.

Another aspect to explore is how the local matrix multiplication can be improved. Perhaps offloading the matrix multiplication to a graphics processing unit would be beneficial.

ACKNOWLEDGEMENTS

The computations in this work were performed on resources provided by the Swedish National Infrastructure for Computing (SNIC) at the PDC Centre for High Performance Computing (PDC-HPC).

REFERENCES

[1] Marwala T. Economic Modeling Using Artificial Intelligence Methods. Springer Publishing Company, Incorporated; 2013.

[2] Farber P, Asanovic K. Parallel neural network training on Multi-Spert. In: Algorithms and Architectures for Parallel Processing (ICAPP 97), 3rd International Conference on; 1997. p. 659–666.

[3] Dahl G, McAvinney A, Newhall T. Parallelizing Neural Network Training for Cluster Systems. In: Proceedings of the IASTED International Conference on Parallel and Distributed Computing and Networks (PDCN ’08). Anaheim, CA, USA: ACTA Press; 2008. p. 220–225.

[4] Suri NNRR, Deodhare D, Nagabhushan P. Parallel Levenberg-Marquardt-based Neural Network Training on Linux Clusters - A Case Study. In: ICVGIP 2002, 3rd Indian Conference on Computer Vision, Graphics and Image Processing, Ahmadabad; 2002.

[5] Long LN, Gupta A. Scalable Massively Parallel Artificial Neural Networks. JACIC. 2008;5:3–15.

[6] Pethick M, Liddle M, Werstein P, Huang Z. Parallelization of a backpropagation neural network on a cluster computer. In: International Conference on Parallel and Distributed Computing and Systems (PDCS 2003); 2003.

[7] Cannon LE. A Cellular Computer to Implement the Kalman Filter Algorithm. Bozeman, MT, USA; 1969. AAI7010025.

[8] Snir M, Otto SW, Walker DW, Dongarra J, Huss-Lederman S. MPI: The Complete Reference. Cambridge, MA, USA: MIT Press; 1995.

[9] Wilkinson B, Allen M. Parallel Programming: Techniques and Applications Using Networked Workstations and Parallel Computers. Upper Saddle River, NJ, USA: Prentice-Hall, Inc.; 1999.