On the Performance of Network Parallel Training in Artificial Neural Networks

Ludvig Ericson, Rendani Mbuvha
School of Computer Science & Communication
KTH Royal Institute of Technology
Stockholm, Sweden
{ludv,rendani}@kth.se

arXiv:1701.05130v1 [cs.AI] 18 Jan 2017

Abstract—Artificial Neural Networks (ANNs) have received increasing attention in recent years, with applications that span a wide range of disciplines including vital domains such as medicine, network security and autonomous transportation. However, neural network architectures are becoming increasingly complex, and with the increasing need to obtain real-time results from such models it has become pivotal to use parallelization as a mechanism for speeding up network training and deployment. In this work we propose an implementation of Network Parallel Training through Cannon's algorithm for matrix multiplication. We show that increasing the number of processes speeds up training until the point where process communication costs become prohibitive; this point varies by network complexity. We also show through empirical efficiency calculations that the speedup obtained is superlinear.

I. INTRODUCTION

Artificial neural networks, or ANNs, are computational tools inspired by their biological namesake. ANNs are particularly useful when estimating functions of arbitrary complexity from given example data [1]. This in turn makes them well suited for pattern recognition tasks, where they have been successfully applied in dimensionality reduction, time series prediction, and data mining. While they have long been on the fringes of the research community since their inception in the '80s, recent advances in computer and material sciences have allowed for an unprecedented growth in computation power, thus allowing for even more sophisticated networks. One of the drawbacks of modern ANNs is the time it takes to train such a network.

As society becomes increasingly dependent on machine learning applications, real-time output from on-line training of neural network systems is critical to positive user experiences and the usefulness of such systems. Thus, speeding up neural network training with a parallel algorithm, as we propose, has become essential to the efficient deployment of machine learning applications. The case for parallelization is also reinforced by the increased availability of multi-core CPUs and general-purpose graphical processing units in recent years.

Most early work on the parallelization of neural networks, like [2], primarily suggests special customised hardware. This approach has certain limitations, as such hardware is costly and may display suboptimal performance when utilized for other tasks. Consequently, it can be impractical for modern scalability platforms such as HPC clusters and cloud computing infrastructure.

Dahl et al. [3] present a Pattern Parallel Training (PPT) algorithm for parallel ANN training. In PPT, multiple ANNs are duplicated in different processes. In each process an ANN is trained on a randomly selected subset of the training examples. At the end of an epoch each process broadcasts its weight updates, which are then applied to all other ANNs. This has the effect of reducing the training time through faster convergence.

Suri et al. [4] use a similar approach to [3]; however, at the end of each full training round they apply an evolutionary algorithm to select the best candidate networks for further training.

The approach most similar to our work is that of [5], who also use a node parallelization strategy while communicating the weights corresponding to "ghost" neurons. Their strategy requires that each process is fed with all the inputs, which becomes onerous in terms of the amount of memory required.

The contribution and novelty of our proposed approach centers on its memory efficiency: by using Cannon's algorithm, it does not require all the inputs to be fed to each process, and no specialized hardware is required. The remainder of this paper is organised as follows: Section II describes artificial neural networks, backpropagation and our proposed implementation of Network Parallel Training. Section III sets out our experiments. Section IV discusses the experimental results. We conclude our paper in Section V.
II. METHODS

A. Artificial Neural Networks

An ANN is defined by a weighted, directed graph G = (V, E), where the vertices are likened to neurons and the edges to "synapses," the connections between neurons. The weights are sensitivities, e.g. a highly-weighted connection will contribute more to the excitation of a neuron. We will concern ourselves only with fully-connected feed-forward networks. A feed-forward network is one in which the graph is acyclic, so that the neurons can be subdivided into layers. Being fully-connected, each layer is maximally connected to the next. The output z_j of some neuron j is thus a function of the output of the previous layer. We define it as

    z_j = f_j(x) = h\Big( \sum_{(i,j) \in E} w_{ij} z_i + b_j \Big),

where h is some non-linear activation function, z_i is the output of the i-th neuron in the previous layer, w_{ij} is the weight of that connection, and b_j is the neuron's bias.

A choice for the non-linearity h that has gained popularity in recent years is the ReLU, or rectified linear unit. We will use the leaky ReLU, defined as

    h(x) = \begin{cases} x & \text{if } x > 0, \\ 0.01x & \text{otherwise.} \end{cases}

The activation can be described in terms of a matrix multiplication. Let W_i be the weights of the i-th layer, with each neuron as a column of weights. The activations Z_i for that layer on a row-wise subset X of the training data are then

    Z_i = h(X W_i + b_i).
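Although the paper's implementation is written in C with MPI (see Section II-E below), the following serial sketch merely illustrates the per-layer computation Z_i = h(X W_i + b_i) with the leaky ReLU defined above. It is not the authors' code; the function names, row-major layout and dimensions are assumptions made for illustration.

```c
#include <stddef.h>

/* Leaky ReLU as defined above: h(x) = x for x > 0, 0.01x otherwise. */
static double leaky_relu(double x)
{
    return x > 0.0 ? x : 0.01 * x;
}

/*
 * Serial sketch of one layer's activation Z = h(X W + b).
 * X is n_rows x n_in (row-major), W is n_in x n_out with one neuron
 * per column, b has n_out entries, and Z is n_rows x n_out.
 */
static void layer_forward(const double *X, const double *W, const double *b,
                          double *Z, size_t n_rows, size_t n_in, size_t n_out)
{
    for (size_t r = 0; r < n_rows; ++r) {
        for (size_t j = 0; j < n_out; ++j) {
            double acc = b[j];
            for (size_t i = 0; i < n_in; ++i)
                acc += X[r * n_in + i] * W[i * n_out + j];
            Z[r * n_out + j] = leaky_relu(acc);
        }
    }
}
```

In the parallel setting described below, it is the X W_i product in the inner loops that is distributed across processes.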
B. Training Neural Networks

Training a neural network is done with stochastic gradient descent, or SGD. In SGD, the gradient of the loss function L is computed for a single sample. The network parameters are then updated using the expression θ' = θ − η∇L, where 0 < η < 1 is the learning rate. This process is repeated until some convergence criterion is satisfied.

The loss function is typically the L_2 distance between the network output and the ground truth label, plus some regularization; i.e. given ground truth label y, network output f_θ, and regularization strength λ, we have

    L(x, y) = |y - f_\theta(x)|_2 + \lambda |\theta|_2,

where |·|_2 is the Euclidean distance (i.e. the L_2 distance).

A common way to extend SGD is mini-batch processing, where a batch of samples is processed and the gradient update is computed as the mean of the per-sample gradients. A further improvement is momentum learning, where consecutive updates in the same direction contribute to "momentum" in that direction.

C. Network Parallel Training

Network Parallel Training (NPT) [3] is sometimes referred to as unit parallelization. In NPT, one divides the network units within each layer across different processes. In its simplest form, each process is responsible for calculating the activation of a single unit in each layer [6]. In a more practical setup the units are divided in a linearly load-balanced manner. See figure 1 for an example.

Figure 1. Illustration of the unit division in Network Parallel Training [3].

In our approach, the strategy is to compute the matrix-matrix multiplication X W_i for each layer in parallel using Cannon's algorithm for matrix multiplication, then compute the activation output for that unit. The shifting of the local matrices within Cannon's algorithm, as defined in Algorithm 1, allows the partial products required for the calculation of activations in each unit to be available within the corresponding process.

D. Cannon's Algorithm

Cannon's algorithm is a popular parallelization strategy for matrix multiplication [7]. The algorithm efficiently computes matrix multiplications by shifting rows and columns of the candidate matrices between processes based on a defined set of rules. Algorithm 1 shows Cannon's algorithm for two candidate matrices A (M×N) and B (N×K) within a grid of p×q processes. In the context of backpropagation, Cannon's algorithm effectively allows communication of weights between nodes that are placed within different processes.

Algorithm 1: Cannon's Matrix Multiplication
  Data: number of parts N, sub-matrices A, B
  Result: local part C of the global matrix multiplication
  begin
      Shift rows of A such that the m-th row of A shifts m−1 positions to the left
      Shift columns of B such that the n-th column of B shifts n−1 positions upwards
      for i ∈ {1, 2, ..., N} do
          C ← C + AB
          Shift rows of A one process to the left, and columns of B one process upwards
      end
      Undo initial shifts in A and B
  end

E. Implementation

We implement our neural network in C using the Message Passing Interface (MPI) [8]. We use the MPI grid topology creation function to create a virtual Cartesian grid that enables the parallelization depicted in Fig. 1.
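For concreteness, the sketch below shows one way Algorithm 1 could be realized in C with MPI on the virtual Cartesian grid mentioned above. It is a minimal illustration under simplifying assumptions, not the authors' implementation: the paper's p×q grid is reduced to a square q×q grid holding one n×n block of A and B per process, and all function names are made up.

```c
#include <mpi.h>
#include <string.h>

/* Naive local block product: C += A * B, all blocks n x n, row-major. */
static void local_matmul_add(const double *A, const double *B, double *C, int n)
{
    for (int i = 0; i < n; ++i)
        for (int k = 0; k < n; ++k)
            for (int j = 0; j < n; ++j)
                C[i * n + j] += A[i * n + k] * B[k * n + j];
}

/* Cannon's algorithm on a periodic q x q process grid (cf. Algorithm 1).
 * Each process holds one n x n block of A and B and accumulates its
 * n x n block of C. A and B are restored to their original placement. */
void cannon_multiply(double *A, double *B, double *C, int n, int q, MPI_Comm comm)
{
    int dims[2] = { q, q }, periods[2] = { 1, 1 }, coords[2];
    int rank, src, dst, left, right, up, down;
    MPI_Comm grid;

    MPI_Cart_create(comm, 2, dims, periods, 0, &grid);   /* virtual Cartesian grid */
    MPI_Comm_rank(grid, &rank);
    MPI_Cart_coords(grid, rank, 2, coords);
    memset(C, 0, (size_t)n * n * sizeof *C);

    /* Initial skew: block row i of A shifts i positions to the left,
     * block column j of B shifts j positions upwards. */
    MPI_Cart_shift(grid, 1, -coords[0], &src, &dst);
    MPI_Sendrecv_replace(A, n * n, MPI_DOUBLE, dst, 0, src, 0, grid, MPI_STATUS_IGNORE);
    MPI_Cart_shift(grid, 0, -coords[1], &src, &dst);
    MPI_Sendrecv_replace(B, n * n, MPI_DOUBLE, dst, 0, src, 0, grid, MPI_STATUS_IGNORE);

    /* q multiply-and-shift rounds: C <- C + AB, then shift A one
     * process to the left and B one process upwards. */
    MPI_Cart_shift(grid, 1, -1, &right, &left);
    MPI_Cart_shift(grid, 0, -1, &down, &up);
    for (int step = 0; step < q; ++step) {
        local_matmul_add(A, B, C, n);
        MPI_Sendrecv_replace(A, n * n, MPI_DOUBLE, left, 0, right, 0, grid, MPI_STATUS_IGNORE);
        MPI_Sendrecv_replace(B, n * n, MPI_DOUBLE, up, 0, down, 0, grid, MPI_STATUS_IGNORE);
    }

    /* Undo the initial skew so A and B end up where they started. */
    MPI_Cart_shift(grid, 1, coords[0], &src, &dst);
    MPI_Sendrecv_replace(A, n * n, MPI_DOUBLE, dst, 0, src, 0, grid, MPI_STATUS_IGNORE);
    MPI_Cart_shift(grid, 0, coords[1], &src, &dst);
    MPI_Sendrecv_replace(B, n * n, MPI_DOUBLE, dst, 0, src, 0, grid, MPI_STATUS_IGNORE);

    MPI_Comm_free(&grid);
}
```

In the NPT setting of Section II-C, A would correspond to a local block of the data matrix X and B to a local block of the layer weights W_i, with the leaky ReLU applied to the accumulated result afterwards.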
III. EXPERIMENTS

A. Training Data

The data was created artificially by imposing a linear relationship between the inputs X_{ij} and the outputs Y_{ik}. The inputs are created by sampling random numbers uniformly between 0 and 200, with the bias input set to 1. The output vector is created using

    Y_{ik} = ((7(k+1) \bmod 1000) + (3k \bmod 100)) X_{ik}.

Interestingly, since 7, 3 and 1000 are all pairwise coprime, this forms a basis for all linear functions y = kx + m with parameters k, m ∈ [0, 1000).

B. Experiment Setup

In order to evaluate the performance of our NPT algorithm we created the following networks of different sizes and complexity:

1) a one-hidden-layer network with the number of input and output units q = 128;
2) a two-hidden-layer network with the number of input and output units q = 128;
3) a two-hidden-layer network with the number of input and output units q = 256, which we refer to as the "big" network.

We used an MPI barrier to synchronize before starting the training routine, and immediately stored the wall-clock time. The same was done after training completed. The runtime data is presented as relative runtimes in Table I.

IV. EXPERIMENT RESULTS

The networks above were run on an HPC cluster with an increasing number of processors. The resulting runtimes in seconds are shown in figure 2 together with the relative performance statistics in Table I.

Figure 2. Plot of the runtimes T_P against the number of processes P for the four workloads (1 layer, 1 node; 1 layer, 4 nodes; 2 layers, 4 nodes; 2 big layers, 4 nodes).

It can be seen that for the single-layer networks the runtimes decrease as the number of processes increases, until we reach the optimal point with 16 processes; after this point communication cost becomes prohibitive, resulting in increasing runtimes. As the number of nodes is increased from 1 to 4, the runtime also increases, which might imply an additional cost of redundant communication and startup related to the unused nodes.

Increasing the number of layers l from 1 to 2 increases the optimal number of processes from 16 to 32, which gives empirical evidence that the runtimes are linear in the number of layers l. The runtimes for the "big" network increase quadratically relative to the two-layered network on the same number of nodes; thus the optimal number of processes for the "big" network is 64.

It is rather surprising that the runtimes of the two-layered network after 32 processes are lower than those of the single-layered network by a significant margin on the same number of nodes. This is a point that will require further investigation.

The speedup curve in figure 3 shows similar results to the runtime plot, with exactly the same optimal process counts. The efficiency plot in figure 4 shows that it is most processor-efficient to have 8 processes; above this point the overall benefit (increase in speedup) of having additional processes decreases.

Figure 3. Plot of the speedup S_P. We define the speedup at P processes as S_P = T_S*/T_P, where T_P is the execution time of our algorithm with P processes and T_S* is the fastest known sequential version; since we do not know of any better algorithm, we let T_S* = T_1.

Figure 4. Plot of the efficiency η_P = S_P / P.

Table I. Relative performance of the different workloads as measured in wall-clock time, with the baseline being a 1-layer network on a single node with 16 processes.

   P     1 layer, 1 node   2 layers, 1 node   2 layers, 4 nodes   2 big layers, 4 nodes
   1          25.69             51.46              51.00                215.88
   2           8.73             17.04              17.07                 71.98
   4           3.47              6.60               6.85                 24.78
   8           1.71              3.21               3.13                 11.33
   16          1.00              1.87               1.96                  5.80
   32          1.16              2.22               1.87                  3.87
   64          2.54              4.89               2.30                  2.80
   128         2.94             10.80               3.28                  5.34

An important observation from the efficiency plots is that the processor efficiency exceeds 1 in certain regions; this implies that the speedup is superlinear [9]. This means that adding additional processes, up to a certain point, actually increases the speedup beyond the linear rate, possibly due to a reduction in RAM access times.
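As an illustrative reading of Table I (the relative runtimes preserve the ratio T_1/T_P), consider the 1-layer, 1-node workload and take T_S* = T_1 as in figure 3:

    S_{16} = \frac{T_1}{T_{16}} = \frac{25.69}{1.00} \approx 25.7, \qquad \eta_{16} = \frac{S_{16}}{16} \approx 1.61 > 1,

which is consistent with the superlinear regions visible in figure 4.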
Results show that when using this strategy Indian Conference on Computer Vision, Graphics and Image significant speedup is obtained. This effect is even more Processing, Ahmadabad; 2002. . pronounced for more complex networks. We have also showed that there are regions where such speedup is super- [5] Long LN, Gupta A. Scalable Massively Parallel Artificial Neural Networks. JACIC. 2008;5:3–15. linear thus increasing the efficiency of resource utilization. It would be interesting to explore different means of [6] PethickM,LiddleM,WersteinP,HuangZ. Parallelizationof parallelizationotherthanaper-unitbasedsplitoftheweight a backpropagation neural network on a cluster computer. In: matrices. The topologies tested here were of relatively lim- Internationalconferenceonparallelanddistributedcomputing and systems (PDCS 2003); 2003. . ited size. Parallelized neural networks are mostly necessary once the networks are of unusually large size, otherwise [7] Cannon LE. A Cellular Computer to Implement the Kalman communication time is going to far outweigh any benefits Filter Algorithm. Bozeman, MT, USA; 1969. AAI7010025. of parallel execution due to the many data dependencies. The data model used was a simple linear function de- [8] SnirM,OttoSW,WalkerDW,DongarraJ,Huss-LedermanS. MPI: The Complete Reference. Cambridge, MA, USA: MIT pending only on one variable. While it was not relevant for Press; 1995. the scope of this work, implementing a more sophisticated meansofdatagenerationwouldbeaninterestingnextstep. [9] Wilkinson B, Allen M. Parallel Programming: Techniques Another aspect to explore is how the local matrix mul- andApplicationsUsingNetworkedWorkstationsandParallel tiplication can be improved. Perhaps offloading the matrix Computers.UpperSaddleRiver,NJ,USA:Prentice-Hall,Inc.; 1999. multiplication to a graphics processing unit would be ben- eficial. ACKNOWLEDGEMENTS The computations in this work were performed on re- sourcesprovidedbytheSwedishNationalInfrastructurefor Computing (SNIC) at PDC Centre for High Performance Computing (PDC-HPC). REFERENCES [1] MarwalaT. EconomicModelingUsingArtificialIntelligence Methods. Springer Publishing Company, Incorporated; 2013.
