An OpenCL Deep Learning Accelerator on Arria 10

Utku Aydonat, Shane O'Connell, Davor Capalija, Andrew C. Ling, Gordon R. Chiu
Intel Corporation, Toronto, Canada
utku.aydonat|shane.oconnell|davor.capalija|andrew.ling|[email protected]

ABSTRACT

Convolutional neural nets (CNNs) have become a practical means to perform vision tasks, particularly in the area of image classification. FPGAs are well known to be able to perform convolutions efficiently; however, most recent efforts to run CNNs on FPGAs have shown limited advantages over other devices such as GPUs. Previous approaches on FPGAs have often been memory bound due to the limited external memory bandwidth on the FPGA device. We show a novel architecture written in OpenCL, which we refer to as a Deep Learning Accelerator (DLA), that maximizes data reuse and minimizes external memory bandwidth. Furthermore, we show how we can use the Winograd transform to significantly boost the performance of the FPGA. As a result, when running our DLA on Intel's Arria 10 device we can achieve a performance of 1020 img/s, or 23 img/s/W, when running the AlexNet CNN benchmark. This comes to 1382 GFLOPS and is 10x faster with 8.4x more GFLOPS and 5.8x better efficiency than the state-of-the-art on FPGAs. Additionally, 23 img/s/W is competitive against the best publicly known implementation of AlexNet on nVidia's TitanX GPU.

Keywords

Deep Neural Network, Convolutional Neural Network

1. INTRODUCTION

Convolutional neural nets (CNNs) have become widely adopted in various computer vision applications including driver assist and image classification. More recently, FPGAs have shown promise in efficiently implementing CNNs [21, 16, 13, 20, 11, 2, 12, 14, 4]. Unfortunately, the vast majority of FPGA implementations of CNNs have only implemented the convolutional layers, limiting the benefit of the approach since other layers may quickly become the bottleneck of the neural net [20]. There has been work in implementing all the layers on the FPGA [20, 16, 13]; however, when compared to some of the best results publicly known for GPUs [3, 9], FPGA performance has fallen short significantly.

One of the reasons that FPGAs have not been able to achieve good performance against GPUs is due to their limited external memory bandwidth. CNNs are often solved using matrix-multiplication based approaches, which require large amounts of data to be moved between the compute units and external memory [16]. Additionally, previous FPGA architectures for CNNs have not been able to take advantage of the peak operations of the device, leading to low performance [21, 16, 13, 20].

To address the problems above, we introduce a novel architecture described in OpenCL and provide the following contributions:

• A methodology to minimize bandwidth of convolutional and fully-connected layers by caching all intermediate feature-maps on-chip in stream-buffers. In conjunction with batching images during fully-connected layers, which is similar to what is used in [20], we are able to reduce the external bandwidth requirements by an order-of-magnitude for both the convolutional and fully-connected layers.

• A design space exploration methodology that leverages analytical models for resource usage and throughput and is able to find the optimal architecture configuration, for a specific FPGA device and CNN, to get maximum throughput.

• An approach that leverages the Winograd transformation to reduce the multiply-accumulate operations of the convolutions [18].

Due to the contributions above we are able to implement all layers of AlexNet [7] on Intel's Arria 10 FPGA and achieve over 10x better throughput and 8.4x more GFLOPS than the state-of-the-art FPGA implementation of AlexNet [20]. Furthermore, we show that, to the best of our knowledge, this is the first FPGA implementation whose performance per watt is competitive against the same generation highly-optimized TitanX GPU results [3, 9, 10].

The rest of the paper is organized as follows. Section 2 has background on CNNs and related work. Section 3 describes the DLA architecture. Section 4 describes our analytical model for design space exploration. Finally, Sections 5 and 6 describe our results.

2. BACKGROUND

Deep neural networks are machine learning algorithms that are inspired by the structure and function of the human brain. They consist of several interconnected artificial neurons that are modeled after the neurons of the human nervous system. An artificial neuron accepts numerical input from other neurons, and produces an output. For DNNs, the output is computed as a dot-product of its inputs and its unique set of learnable weights. Subsequently, a non-linear activation function (e.g. tanh, ReLU, sigmoid) is applied to the dot-product result. This output is then used as input by other neurons. Neural networks have been used to solve many complex problems to which robust solutions cannot be designed by hand, such as image recognition; handwritten text, gesture, and speech recognition; game-playing and decision making (e.g. AlphaGo); face identification; and object detection.
2.1 Convolutional Neural Networks

Convolutional neural networks (CNNs) are neural networks that excel in classifying images and videos. They have garnered a considerable amount of attention in recent years due to their ability to achieve state-of-the-art results in image recognition and object detection. CNNs are neural nets that consist primarily of convolution layers in which each neuron is connected only to a small, nearby region of neurons in the previous layer. This local connectivity is intentionally designed into the network topology with the goal of exploiting the local correlation in the input data. This connectivity restriction, together with the additional property that groups of neurons within one convolution layer also share learnable weights, allows the outputs of neurons in the layer to be computed using 3-dimensional convolution.

Although a CNN can be described from a neuronal perspective, it is more instructive, for the discussion that follows, to view it as a directed graph of computational layers. Each node represents a layer that accepts one or more n-dimensional arrays as input, performs some computation, and produces one or more n-dimensional arrays as output. The edges of the graph represent the producer-consumer relationships between the layers of the network. The data arrays that layers within the network consume and produce are often referred to as feature maps.

In AlexNet, a convolution layer accepts a 3-dimensional array with depth C, height H, and width W as input, and produces a 3-dimensional array with depth K, height P, and width Q. The output feature map is computed by convolving the input feature map with K filters, and applying an activation function element-wise to the result. Each filter is also a 3-dimensional array, with depth C, height R, and width S, which consists of learnable weights. The convolution of the input feature map with one filter produces one 2-dimensional array referred to as a channel or plane of the output feature map. The entire output feature map is obtained by concatenating depth-wise the K channels produced by convolving each of the K filters with the input feature map. An illustration is shown in Figure 3.
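To make the dimension conventions above concrete, the following C++ loop nest is a minimal, unoptimized reference for a single stride-1 convolution layer with a ReLU activation. The function name, flat row-major indexing, and the no-padding assumption are illustrative choices for this sketch, not part of the DLA implementation.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Reference (unoptimized) convolution using the names from Section 2.1:
// input C x H x W, K filters of size C x R x S, output K x P x Q.
// Stride 1, no padding; layouts are illustrative assumptions.
void conv_layer_reference(const std::vector<float>& in,       // C*H*W
                          const std::vector<float>& filters,  // K*C*R*S
                          const std::vector<float>& bias,     // K
                          std::vector<float>& out,            // K*P*Q
                          int C, int H, int W, int K, int R, int S) {
  const int P = H - R + 1;  // output height
  const int Q = W - S + 1;  // output width
  out.assign(static_cast<std::size_t>(K) * P * Q, 0.0f);
  for (int k = 0; k < K; ++k)            // each filter -> one output channel
    for (int p = 0; p < P; ++p)
      for (int q = 0; q < Q; ++q) {
        float acc = bias[k];
        for (int c = 0; c < C; ++c)      // 3D convolution over depth ...
          for (int r = 0; r < R; ++r)    // ... filter height ...
            for (int s = 0; s < S; ++s)  // ... and filter width
              acc += in[(c * H + (p + r)) * W + (q + s)] *
                     filters[((k * C + c) * R + r) * S + s];
        out[(k * P + p) * Q + q] = std::max(0.0f, acc);  // ReLU activation
      }
}
```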
2.2 AlexNet

AlexNet [7] consists of the following layers:

• Convolution - The previous section describes the functionality of convolution layers. In AlexNet, all convolution layers use the ReLU or ramp function f(x) = max(0, x) as their activation function. In addition, each convolution layer also has K scalar bias terms that are added to the corresponding output feature map channels before applying the ReLU function.

• Cross-channel local response normalization - A normalization layer scales each element in its input feature map by a factor that is a function of the elements at the same location in adjacent channels as the element being normalized. The dimensions of the output and input feature maps are identical.

• Max pooling - A max pooling layer strides a two-dimensional window across each channel of the input feature map and propagates the element of maximum value in the window through to the output feature map. Compared to the input feature map, the output feature map has the same depth, smaller height, and smaller width.

• Fully-connected (dense) - A fully connected layer is a convolution layer in which H = R and W = S (which in turn implies that P = Q = 1). That is, the height and width of each filter are equal to the height and width of the input feature map. Described from a neuronal perspective, a fully-connected layer is one in which each neuron is connected to every neuron in the previous layer (hence the name fully-connected). Since the input feature map and each filter have the same dimensions, no striding occurs when computing the output feature map. As a result, it is more convenient to think of the output as a matrix-vector product vo = W vi, where vi is a flattened version of the input feature map containing ni = C × H × W elements, W is an no = K by ni matrix in which row k is a flattened version of the kth filter, and vo is the output feature map. It is possible to process a batch of b different input feature maps from b different images at once by replacing vi with an ni by b matrix Vi in which column k is the flattened input feature map corresponding to the kth image in the batch. The aforementioned equation then becomes Vo = W Vi, where Vo is an no by b matrix in which column k is the flattened output feature map corresponding to the kth image in the batch. This method of processing multiple images at once in a fully-connected layer will feature prominently in the upcoming discussion (a short sketch of this batched formulation follows at the end of this section).

• Softmax - A softmax layer normalizes the values in the input feature map by applying the softmax function to it. Consequently, the sum of the elements in the output feature map is unity.

At a high level, AlexNet consists of five convolution layers, followed by three fully-connected layers, and a softmax layer. There is a normalization layer after each of the first two convolution layers. Finally, there is a max-pooling layer after the two aforementioned normalization layers, and between the last convolution layer and the first fully-connected layer. The final softmax layer outputs a 1000-element vector containing probabilities that the input image belongs to each of the 1000 possible classes in the ImageNet Large Scale Visual Recognition Competition (ILSVRC [15]). More details regarding the structure and function of AlexNet can be found in [7].
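As a companion to the fully-connected description above, the sketch below spells out the batched formulation Vo = W Vi in plain C++, showing how each filter row is loaded once and reused across all b images in the batch. The row-major layouts and the function name are assumptions made for illustration.

```cpp
#include <cstddef>
#include <vector>

// Batched fully-connected layer: Vo = W * Vi, with W of size no x ni,
// Vi of size ni x b, Vo of size no x b (all row-major, an assumed layout).
void fully_connected_batched(const std::vector<float>& Wmat,  // no x ni
                             const std::vector<float>& Vi,    // ni x b
                             std::vector<float>& Vo,          // no x b
                             int no, int ni, int b) {
  Vo.assign(static_cast<std::size_t>(no) * b, 0.0f);
  for (int k = 0; k < no; ++k)            // one flattened filter (row of W) ...
    for (int i = 0; i < ni; ++i) {
      const float w = Wmat[static_cast<std::size_t>(k) * ni + i];
      for (int img = 0; img < b; ++img)   // ... reused across the whole batch
        Vo[static_cast<std::size_t>(k) * b + img] +=
            w * Vi[static_cast<std::size_t>(i) * b + img];
    }
}
```

The inner reuse of each weight w across the batch is exactly the property that later motivates batching images in the fully-connected layers: filter data is fetched once and amortized over many images.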
2.3 Related Work

FPGAs have been shown to be a practical means to solve CNNs [21, 16, 13, 20, 11, 2, 12, 14, 4]. In [16], the authors use a matrix-multiply approach to solve both convolutional and fully-connected layers, which is similar to GPU and CPU approaches that convert 3D convolutions into 2D matrix-multiplications. Written in OpenCL, they are able to run all layers on the FPGA but unfortunately end up being severely external memory bound, such that the average GOPS they achieve is relatively low. To solve the memory bottleneck, in [13] the authors introduce a singular value decomposition approach to significantly reduce the data required, and hence memory bandwidth, of the fully connected layers. They empirically show that this has approximately 1% impact on the overall accuracy of the neural network when applied to image classification.

Conversely, the work in [21, 20] uses a roofline model that allows users to maximize compute resources on the FPGA given the memory bandwidth constraints. In [20], the authors describe Caffeine, which is a runtime reconfigurable CNN FPGA accelerator. In Caffeine, the throughput is improved significantly over previous approaches by creating a model to realistically reflect DDR transfers and also providing a convolutional MM representation where they are able to maximize data reuse of weight filters by batching input feature maps of the fully-connected layers. They show that they are able to improve the performance of CNNs on FPGAs by 3x, and they are 1.5x more energy efficient than the K40 GPU. Unfortunately, when compared to nVidia's last generation TitanX GPU [3, 9], the power efficiency of the FPGA is still 5.8x worse. Additionally, the authors of [20] show that the GOPS of each layer is relatively low, where they are only able to achieve 14.7% of the GOPS of the KU060 device when running at 200MHz.

Our approach differs from the previous work as we significantly reduce memory bandwidth without loss of accuracy by caching all feature-maps on-chip. Additionally, we show that our architecture is compute-bound, such that we can efficiently use all the DSP resources and ensure that they are occupied (i.e. doing useful work) the majority of the time, and we leverage Winograd transforms to reduce the number of required operations. Finally, we show how we use a design space exploration methodology to find the optimal configuration of our architecture for a specific FPGA device and CNN. All of these factors lead to a performance efficiency that is competitive against nVidia's TitanX GPU.

2.4 Intel FPGA SDK for OpenCL

The Intel FPGA SDK for OpenCL allows users to program FPGAs with OpenCL. OpenCL is an open parallel programming language that is vendor agnostic and is supported by many vendors [6].

Currently, OpenCL uses a master-slave model where a master host device is used to control all memory transfers and execution of the kernels. A user is required to write a host program, which calls a predefined OpenCL API to control the accelerator device. On the device side, the user writes OpenCL kernel functions that are compiled to the accelerator. This model is illustrated in Figure 1.

Figure 1: Intel FPGA SDK for OpenCL Host-Device Setup and Flow.

One of the key challenges for using FPGAs is that they have traditionally required a hardware design methodology. Because of the reconfigurable nature of FPGAs, timing-sensitive components such as DDR memory controllers must be timing closed to ensure they work correctly. The Intel FPGA SDK for OpenCL avoids these problems by providing a pre-generated platform for the OpenCL programmer. An illustration of the platform on the FPGA device is shown in Figure 2. As illustrated, the platform has pre-placed components whose resources are reserved for the platform and cannot be used for the algorithmic portion of the OpenCL kernel code.

Figure 2: OpenCL FPGA Platform on an Intel Device.

Our DLA is written with OpenCL, where OpenCL kernels are used to define the DLA architecture, and the host runtime is used to coordinate the data transfers of the images with the kernel execution in an efficient manner.

3. DLA ARCHITECTURE

Our Deep Learning Accelerator (DLA) implements all layers of AlexNet on the FPGA and is defined using the Intel FPGA SDK for OpenCL.

3.1 Design Goals

Our DLA is targeted for high performance. In most CNN topologies, the total amount of floating-point computation is dominated by the convolution layers. For instance, in AlexNet, convolutions are 92% of the total floating point operations. Hence, the DLA hardware is optimized to maximize the throughput of the convolution layers by exploiting parallelism in computations. Each convolution layer consists of multiple nested loops that iterate over the dimensions of input features, filters, and output features. As shown in [21], it is possible to choose different combinations of these loops to vectorize in order to speed up the convolution operations. On the FPGA, vectorizing a loop means spatially distributing the computations of that loop across multiple DSP blocks that exist on the device. For maximum performance, our DLA vectorizes the loops that provide sufficient parallelism such that as many DSPs as possible are used every cycle for useful computations. Additionally, the DLA architecture ensures that the processing elements (PEs) are able to solve both the convolutional and fully-connected layers without sacrificing performance.

Our DLA is also aimed to be flexible and achieve good performance with other CNN topologies, besides AlexNet. Hence, convolution loops are chosen for vectorization such that enough parallelism exists not just in AlexNet but in a wide range of CNN topologies. Consequently, adapting our DLA for a different CNN topology will not require vectorizing different loops, but will just require changing the vectorization factors according to the dimensions of that topology. This is similar to what is claimed in [20] and is not discussed in detail in this work.
3.2 Convolution Layers

To improve throughput, parallelism is extracted from four dimensions of a convolution layer: output feature columns (Q), output feature maps (K), input feature maps (C), and input feature columns (W). The vectorization factors for each of these dimensions are respectively referred to as Qvec, Kvec, Cvec, and Wvec. Each cycle, Qvec horizontal output features in Kvec output feature maps are computed by convolving an input feature region Wvec wide and Cvec deep. This is illustrated in Figure 3, for Cvec > 1, Kvec = 3, Wvec = 2, and Qvec = 1. The relationship between Wvec and Qvec depends on the number of filter and feature pixels that are multiplied per output result. For example, in equation 1, for each output, a vector of three feature and filter pixels is used. In this case Wvec = Svec + Qvec - 1, where Svec is the size of the filter vector (e.g. Svec = 3 in equation 1). If a larger Wvec is desired, a larger Svec is required for each output computation.

Figure 3: The overview of Convolution Execution.

Vectorizing the W and Q dimensions is also useful for the arithmetic optimizations which will be discussed in Section 3.3. Because convolution layers usually process a large number of input and output feature maps, enough parallelism can be extracted in these dimensions to use all the DSPs by breaking up the convolution operations into individual dot-products that are processed by PEs.

The PEs act as dot-product solvers for the features and filter weights. Each PE receives the same input features, illustrated as a 1 x Wvec x Cvec stick in Figure 3, and convolves them with the filter weights, of size 1 x Svec x Cvec, to produce a vector of Qvec output features for one output feature map. Hence, at any given time Kvec PEs will be computing Kvec different output feature-maps. In other words, K output feature maps are computed in K/Kvec tiles. Convolution layers are mapped onto the architecture in Figure 3 in a time-multiplexed fashion, i.e. layers are executed one at a time in succession. This is possible because the sizes of Cvec, Kvec, Qvec and Wvec can be independent of the image size, filter size, and number of input maps and output maps; thus solving different convolution layers simply requires different sequences of 1 x Wvec x Cvec sticks of input feature maps and filter data to be read and sent to the PEs.

Our DLA takes advantage of the mega-bytes of on-chip storage available on the FPGA device by storing the features and the filter weights in on-chip RAMs. The features are stored in a double buffer and data is broadcast to PEs in a daisy-chain fashion every cycle. The daisy-chain structure is formed by the PEs, where each PE receives a stick of input feature data for processing and also passes the data to an adjacent PE (the PE daisy-chain arrangement is illustrated in Figure 7). This is much more efficient for placing the DLA on the FPGA since the FPGA is a 2D grid of logic. The outputs of the PEs are stored back into the double buffer. The filters are stored in caches inside the PEs. The purpose of the on-chip storage is to avoid unnecessary external memory accesses, because the amount of features and filter weights loaded every cycle depends on the vectorization parameters and can easily exceed the available external memory bandwidth. On-chip storage allows the re-use of data by taking advantage of the temporal data locality. More specifically, double buffers allow the re-use of the input feature maps because the same input features are convolved with different filters to compute different output feature maps. In addition, filter caches allow the re-use of the filter weights because the same filter weights are convolved with different input features to compute each output feature map.
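The following sketch models the per-cycle work implied by the vectorization factors above: Kvec PEs all receive the same 1 x Wvec x Cvec stick of features, and each produces Qvec partial outputs against its own 1 x Svec x Cvec filter slice. The constants and the software-style PE abstraction are illustrative assumptions; the real datapath is spatial hardware generated from OpenCL, not a C++ loop.

```cpp
#include <array>

// Illustrative vectorization constants (Wvec = Svec + Qvec - 1).
constexpr int Cvec = 8, Kvec = 48, Qvec = 4, Svec = 3;
constexpr int Wvec = Svec + Qvec - 1;  // = 6

using Stick  = std::array<std::array<float, Cvec>, Wvec>;  // 1 x Wvec x Cvec features
using Filter = std::array<std::array<float, Cvec>, Svec>;  // 1 x Svec x Cvec weights

// Partial results one PE adds to its accumulators in a single cycle:
// Qvec horizontal outputs, each a Svec x Cvec dot-product over the shared stick.
std::array<float, Qvec> pe_cycle(const Stick& feat, const Filter& filt) {
  std::array<float, Qvec> out{};
  for (int q = 0; q < Qvec; ++q)        // Qvec outputs computed in parallel
    for (int s = 0; s < Svec; ++s)      // sliding-window positions share features
      for (int c = 0; c < Cvec; ++c)
        out[q] += feat[q + s][c] * filt[s][c];
  return out;
}
// Across the array, Kvec PEs consume the same stick (daisy-chained in hardware)
// against Kvec different filters, yielding a Qvec x Kvec output region per cycle.
```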
3.3 Arithmetic Optimizations

In addition to providing parallelization, vectorizing on W and Q allows multiple multiply-accumulate operations to be simplified through Winograd transformations as described in [18]. Lavin et al. [8] showed that Winograd's minimal filtering algorithms [18] can be used to derive algorithms for CNNs. These algorithms can be applied when the convolution stride is 1 and can reduce the arithmetic complexity, resulting in faster execution. It has also been shown that the reduction in arithmetic complexity can exceed what can be achieved with other FFT-based methods for small filters. The AlexNet topology uses small 3x3 filters in most of its convolution layers and, hence, can take advantage of the Winograd transformations. Furthermore, because the current trend is towards deeper CNN topologies with small filters (e.g. GoogLeNet [17]), other CNN topologies can also take advantage of the Winograd transformations.

o0 = (f0, f1, f2) . (i0, i1, i2)
o1 = (f0, f1, f2) . (i1, i2, i3)
o2 = (f0, f1, f2) . (i2, i3, i4)
o3 = (f0, f1, f2) . (i3, i4, i5)        (1)

In our DLA, each PE generates four horizontal output pixels in parallel (i.e. Qvec = 4), where each output is formed by doing a dot-product between three filter weights and three inputs as shown in equation 1. In standard convolutions, this requires 12 multiplications and additions every cycle, which is shown in equation 1 where oi is an output pixel, fi is a filter weight, and ii is an input pixel. With the Winograd minimal filtering algorithms, we perform the four dot-products in equation 1 with only six multiplications and additions using techniques described in [18], denoted as F(4,3). All Winograd arithmetic transformations are done on-chip and the flow is illustrated in Figure 4, which shows how we transform three filter coefficients and six feature inputs into six Winograd filters and six Winograd input features (i.e. Wvec = 6). The six values are multiplied together to form six Winograd outputs, which then are transformed back to four output features.

Figure 4: Winograd Flow.
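To illustrate the minimal-filtering idea behind the F(4,3) transform, the sketch below implements the smaller and more commonly quoted F(2,3) variant, which produces 2 outputs of a 3-tap filter from 4 inputs using 4 multiplications instead of 6. The DLA's F(4,3) form uses different transform matrices (6 multiplications for 4 outputs), but the filter-transform / element-wise multiply / output-transform structure is the same.

```cpp
#include <array>

// Winograd minimal filtering, F(2,3): 2 outputs of a 3-tap filter from 4 inputs
// with 4 multiplies. A sketch of the principle only; the DLA uses F(4,3).
std::array<float, 2> winograd_f23(const std::array<float, 4>& d,   // inputs
                                  const std::array<float, 3>& g) { // filter taps
  // Filter transform (can be precomputed once per filter, as in the PE caches).
  const float g0 = g[0];
  const float g1 = 0.5f * (g[0] + g[1] + g[2]);
  const float g2 = 0.5f * (g[0] - g[1] + g[2]);
  const float g3 = g[2];
  // Input (feature) transform.
  const float d0 = d[0] - d[2];
  const float d1 = d[1] + d[2];
  const float d2 = d[2] - d[1];
  const float d3 = d[1] - d[3];
  // Element-wise multiplies: the only multiplications in the algorithm.
  const float m0 = d0 * g0, m1 = d1 * g1, m2 = d2 * g2, m3 = d3 * g3;
  // Output (inverse) transform.
  std::array<float, 2> y = {m0 + m1 + m2, m1 - m2 - m3};
  return y;  // y[0] = d0*g[0]+d1*g[1]+d2*g[2], y[1] = d1*g[0]+d2*g[1]+d3*g[2]
}
```

In the DLA, the filter transform is applied when filling the PE caches and the feature transform in the StreamBuffer unit, so only the element-wise multiplies land on the DSP blocks.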
3.4 PEs

Figure 5 shows an overview of a single PE's hardware. It consists of dot-product units, accumulators, and caches. Each dot-product unit multiplies and accumulates the Winograd-transformed input features and the filter weights. The vector size of the dot-product unit is determined by the Cvec parameter, as shown in Figure 5. Each PE contains Wvec of such dot-product units. Hence, a sub-region of size 1 x Wvec x Cvec is convolved every cycle. Once the total input feature region is convolved, Qvec output features are completed.

Figure 5: The overview of a PE.

Each dot-product unit takes as input Cvec x Wvec features, Cvec x Wvec transformed filter weights, and an init bus, as shown in Figure 5. To support Winograd, we take Cvec x Svec filters and convert them to Cvec x Wvec transformed filter weights. The init bus is set to zero when reset is set, which represents the start of a new output feature computation. If reset is not set, then init is set to the current accumulator value so that the accumulator is incremented by the dot-product result. If the done signal is set, the dot-product result is sent out. This happens when the very last dot-product is completed for an output feature. Otherwise, the result of the dot-product continues to be stored in the accumulator.

The accumulators are implemented as shift-registers. At any given cycle, each shift-register location contains the partial sum that belongs to a specific output feature. The size of this shift-register depends on the latency L of the dot-product unit. That is, the same shift-register value that is used as the init value in the dot-product will be updated with the result of the dot-product L cycles later. Hence, at any given cycle, each PE keeps L different partial sums that belong to L different output features for each dot-product unit. Because each dot-product unit is fully-pipelined, L different output computations are interleaved. That is, for L consecutive cycles, input features and filter weights for different output features will be fed into a dot-product unit in a sequence. In our implementation, we interleave both in the W (Lw) and H (Lh) direction.

Filter weights are stored in PE caches implemented in on-chip RAMs. Every cycle, Wvec x Cvec transformed filter weights are loaded from these caches and fed onto the dot-product units. Hence, Wvec x Cvec caches, or memory banks, are used in order to get the necessary on-chip memory read bandwidth. A single filter weight can be loaded from each cache every cycle.

Filter weights are stored in the caches before the corresponding convolution layer starts. To avoid idle computation cycles, the DLA uses double-buffering and overlaps convolutions with the PE cache updates. While filter weights are loaded from the caches for a particular convolution layer, filter weights for the next convolution layer are also prefetched onto the caches.

Every cycle, the Wvec outputs of each PE are sent to the ReLU unit for the Winograd output transform as explained in Section 3.3.
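The shift-register accumulator described in this section can be summarized with a small behavioral model: because the dot-product unit has a latency of L cycles, L independent partial sums are kept in flight, and each value comes back as the init input exactly when its own dot-product result is ready. The class below is a software sketch under that assumption, not RTL, and the names mirror the description rather than the actual hardware.

```cpp
#include <array>

// Behavioral sketch of one accumulator shift-register with L interleaved
// partial sums (L = dot-product pipeline latency, an assumed template value).
template <int L>
struct AccumulatorShiftReg {
  std::array<float, L> slots{};  // L in-flight partial sums, one per output

  // Called once per cycle with this cycle's dot-product result and control
  // bits. Returns a finished output feature when 'done' is set, 0 otherwise.
  float cycle(float dot_result, bool reset, bool done) {
    const float init = reset ? 0.0f : slots[L - 1];  // value written L cycles ago
    const float updated = init + dot_result;         // accumulate this output
    for (int i = L - 1; i > 0; --i) slots[i] = slots[i - 1];  // shift the rest
    slots[0] = done ? 0.0f : updated;                // retire or keep in flight
    return done ? updated : 0.0f;
  }
};
```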
3.5 Stream Buffers

Stream buffers, shown in Figure 6, are implemented in on-chip RAMs in order to store the feature data and to stream it to the PEs. Each stream buffer is double-buffered, similar to the filter caches. Before the first convolution layer starts, the images are loaded from the DDR and stored in the buffers. During convolution layer execution, while feature data for a convolution layer is being streamed into the PEs, the outputs of convolutions are simultaneously stored into the buffers. There are a total of Wvec x Cvec stream buffers. The width of the feature maps is divided into Wvec buffers, and the depth is divided into Cvec buffers. Hence, a total of Wvec x Cvec stream buffers provide sufficient on-chip bandwidth for streaming an input feature region of size 1 x Wvec x Cvec to the PEs every cycle. The output features, on the other hand, are generated in a different layout. More specifically, each PE generates Qvec features in a cycle; hence, a total region of 1 x Qvec x Kvec is generated in a cycle. A crossbar network is generated in order to store this region in 1 x Wvec x Cvec buffers.

Figure 6: Stream buffer hardware. (a) A single stream buffer. (b) The array of stream buffers.

3.6 Shared Exponent FP16

Using half-precision (FP16) instead of single-precision (FP32) floating point operations can significantly reduce the resource requirement of each PE. However, although FP32 is natively supported on Arria 10's DSP blocks, FP16 is not, which leads to additional logic use. To reduce this overhead, we use a shared exponent technique which allows us to perform the multiplications in fixed-point, which significantly reduces the overhead required to perform the FP16 dot-products. This technique works by leveraging the fact that Arria 10 DSP blocks can be fractured into two 18x18 integer multipliers [1]: before sending the feature and filter data into each PE, we transform all the values into 18-bit numbers, using the maximum exponent found in the group. Since the exponent matches for all the numbers, they can be treated as fixed-point numbers and can be sent directly to the 18x18 integer multipliers. After the dot-product is performed, the number is shifted back to 10 bits, and the exponent and sign bit are added back to the top 6 bits, reforming the 16-bit floating point value, which is stored back into the stream buffer. Note that the shared exponent transform is performed on the data prior to entering the PEs and thus only needs to be applied once and can be shared across all PEs.
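A minimal sketch of the shared-exponent idea is shown below: a group of values is aligned to the group's maximum exponent so that the mantissas can be multiplied as plain integers, mirroring how the 18x18 integer mode of the DSP blocks is used. The bit widths, scaling, and data types are illustrative assumptions and do not reproduce the exact hardware number format.

```cpp
#include <algorithm>
#include <cmath>
#include <cstdint>
#include <vector>

// A group of values aligned to one shared (maximum) exponent.
struct SharedExpGroup {
  int exp;                        // shared exponent of the group
  std::vector<int32_t> mantissa;  // fixed-point mantissas, ~10 fractional bits
};

SharedExpGroup to_shared_exponent(const std::vector<float>& vals) {
  SharedExpGroup g{INT32_MIN, {}};
  for (float v : vals) {
    int e = 0;
    std::frexp(v, &e);            // per-value exponent
    g.exp = std::max(g.exp, e);   // keep the maximum exponent in the group
  }
  for (float v : vals)            // align every mantissa to the shared exponent
    g.mantissa.push_back(
        static_cast<int32_t>(std::lround(std::ldexp(v, 10 - g.exp))));
  return g;
}

// Dot-product done entirely in integer arithmetic; exponents re-applied once.
float shared_exp_dot(const SharedExpGroup& a, const SharedExpGroup& b) {
  int64_t acc = 0;
  for (std::size_t i = 0; i < a.mantissa.size(); ++i)
    acc += static_cast<int64_t>(a.mantissa[i]) * b.mantissa[i];
  return std::ldexp(static_cast<float>(acc), a.exp + b.exp - 20);  // undo scaling
}
```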
3.7 Fully Connected Layers

The DLA executes the fully-connected layers on the same PEs described in Section 3.4. This approach makes the most efficient use of the dot-product units since these units are kept busy during both the convolution and the fully-connected layers.

Due to the different characteristics of computations in fully-connected and convolutional layers, PEs need to be configured differently. Specifically, the ratio of the total filter weights used in computations to the total amount of computation is significantly higher in fully-connected layers than in convolution layers. In other words, there is significantly less re-use of the fully-connected layer filter weights during the classification of a single image. Hence, storing these filters in PE caches does not give any benefits. Moreover, loading these filters from DDR uses significantly more bandwidth, which may become a performance bottleneck.

In order to alleviate the above issues, the DLA processes fully-connected layers in image batches. After all convolution layers are completed layer by layer for a single image, the last layer will dump the image back out to external memory to batch up the images. Once a large enough batch of images is available, the batch of images is processed together during each of the fully-connected layers. This allows sharing the fully-connected filter weights between the classification of different images and, hence, reduces the external memory bandwidth usage. In other words, filters are shared and the same filter weights are multiplied with different image features to produce the output features of different images. This is in contrast to the convolution layers, where features are shared and the same features are multiplied with different filter weights to produce different output feature maps. Hence, during the fully-connected layers, filter weights are streamed into the PEs and the PE caches store the features for different images that are pre-loaded before computation starts. The caches are sized to accommodate not only the convolution filters but also the batches of images that need to be processed in parallel during fully-connected layers.

The fully-connected layers are executed with the following configuration (summarized in Table 1).

1. No Winograd transformations are applied because features and filters are convolved to generate only a single output.
2. Before starting the compute, features are pre-loaded into the PE caches. For instance, if the image batch size is Sbatch, each PE will store N different image features, where N is equal to Sbatch/Kvec.
3. During fully-connected layer computation, the Wvec/N dot-product units in each PE are used to process one image.
4. Each cycle, F unique filter weights are loaded from the DDR and streamed into the PEs, where F is equal to (Wvec/N) x Cvec. Each PE receives the same filter weights and multiplies them with different image features.
5. Similar to the convolution configuration, L different output computations are interleaved.
6. Wvec/N partial sums in each PE are summed to produce N outputs from each PE.

Configuration            | Convolution | Fully-Connected
Winograd Transformation  | Yes         | No
Batch Size               | 1           | Sbatch
Streamed Data            | Features    | Filters
Cached Data              | Filters     | Features
Dot-Products per Image   | Wvec        | Wvec/N

Table 1: Configuration of PEs during convolution and fully-connected layers.

3.8 Overall Architecture

CNN algorithms often include other layers in addition to convolution and fully-connected layers. For instance, AlexNet contains normalization, max-pooling, and ReLU layers. Hence, our DLA contains additional hardware to support these different types of layers to enable the entire topology to be executed on the FPGA.

Figure 7: Overall DLA Architecture.

Figure 7 shows the DLA hardware support for all the AlexNet layers. The PEs, as discussed earlier, perform the dot-products for convolution and fully-connected layers. The StreamBuffer unit manages the stream buffers, applies the Winograd transformations to features, and streams the transformed features to the first PE. The StreamBuffer unit also fetches the filter data from DDR and sends it to the PEs. The features are forwarded through all the PEs via the daisy-chained input connections between them. The outputs of the PEs are sent to the ReLU unit, again via daisy-chained output connections. The ReLU unit applies the Winograd output transformations and non-linearity functions. The throughput of the ReLU unit and all the subsequent units is higher than the total throughput of the PEs in order to avoid stalls. The outputs of the ReLU unit are sent to the normalization unit, which applies the normalization formula across the feature maps. Because PEs compute feature maps in tiles, normalizing a tile requires buffering of convolution outputs from the previous tile. The outputs of the normalization unit are sent to the pooling unit, which computes the maximum value in a window. Because each feature map is pooled independently, no data buffering is necessary between the feature map tiles. If more convolution layers are to follow, the output of the pooling unit is stored back onto the stream buffer for further processing. Following the last convolution layer, the outputs of the pooling unit are stored to external memory. At the start of the fully-connected layers, these features are read back from external memory and loaded onto the PE caches as described earlier. Also, the ReLU output can be sent directly to DDR, without applying Norm or Pool.

The DLA includes several control signals, such as stream buffer read/write addresses, PE cache read/write addresses, PE done/reset signals, normalization/pooling bypass signals, etc. These signals are generated by the sequencer unit, which is configured according to the topology of the CNN algorithm being executed. The sizes of input/output images, intermediate features, filters, normalization/pooling windows, convolution strides, etc. are used in calculating the exact cycles at which certain actions are taken or the addresses that are accessed. Hence, executing a different CNN algorithm on the same hardware requires just changing the sequencer configuration.

In AlexNet, the normalization and pooling operations are not always performed after every convolution layer. Hence, these units can be by-passed depending on the topology that is being executed. This makes extending this architecture to support software configurability relatively straightforward, because the sequence of layers bypassed can be changed after the FPGA is programmed, and is similar to the work in [20].

All the units described above are written as OpenCL kernels. Each kernel executes independently and concurrently. The connections between the kernels are implemented as FIFOs using the Intel channel API.

4. DESIGN SPACE EXPLORATION AND ANALYTICAL MODELS

One of the benefits of our architecture is that the resource usage of the PE array, stream buffers, and filter caches can be analytically modeled using the Cvec, Kvec, Wvec, and Qvec parameters.

For 16-bit floating point precision, equation 2 models the DSP usage for all the PE elements, which assumes that each DSP block can perform two 16-bit floating point multiplies and two 16-bit floating point adds, and no Winograd. If Winograd is applied, we divide equation 2 by 2 and add on a constant factor of 200. The constant factor is an overestimate, and accounts for the on-chip Winograd transforms as shown in Figure 4; the value chosen is applicable to the F(4,3) Winograd transforms we use. The stream buffer and filter cache M20K usage can be modeled using equations 3 and 4 respectively, which assume a given M20K can store 1024 16-bit floating point values by forming a 2-word-wide by 512-deep memory [1]. Equation 3 models the number of M20Ks required to store the largest input and output feature map for any given layer (represented by MAX(Depth_in + Depth_out)); C is the number of input feature maps for a given layer, and H and W are the feature map height and width. Equation 4 models the number of M20Ks required to store all filter weights for a single output feature map. This is scaled up by the number of PEs (i.e. Kvec) since each PE processes one output feature map at any given time. Note that for the filter caches the depth is not considered, since the filters don't require the entire M20K depth (i.e. fewer than 512 words are needed).

N_dsps = (Wvec - Qvec + 1) x Qvec x Kvec x Cvec x 0.5        (2)

N_banks = Wvec x Cvec
Depth = C x W x H / N_banks
N_M20K = CEIL( MAX(Depth_in + Depth_out) / (512 x 2) ) x N_banks        (3)

N_banks = Wvec x Cvec
N_M20K = N_banks x Kvec / 2        (4)
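Equations 2-4 are simple enough to capture in a few helper functions; the sketch below does so in C++ with parameter names following the text. The structs and layer list are assumptions made for illustration, and the result is an estimate in the same spirit as the equations rather than a full compiler cost model.

```cpp
#include <algorithm>
#include <cmath>
#include <vector>

struct DlaConfig { int Wvec, Qvec, Kvec, Cvec; bool winograd; };
struct LayerDims { int C_in, H_in, W_in, C_out, H_out, W_out; };

int dsps_required(const DlaConfig& d) {                          // equation 2
  double n = (d.Wvec - d.Qvec + 1) * d.Qvec * d.Kvec * d.Cvec * 0.5;
  if (d.winograd) n = n / 2.0 + 200.0;  // constant overestimates the transforms
  return static_cast<int>(std::ceil(n));
}

int stream_buffer_m20ks(const DlaConfig& d,
                        const std::vector<LayerDims>& layers) {   // equation 3
  const double banks = d.Wvec * d.Cvec;                           // Wvec x Cvec banks
  double max_depth = 0.0;
  for (const LayerDims& l : layers) {
    const double depth_in  = l.C_in  * l.W_in  * l.H_in  / banks;
    const double depth_out = l.C_out * l.W_out * l.H_out / banks;
    max_depth = std::max(max_depth, depth_in + depth_out);        // largest layer
  }
  // Each M20K holds 2 half-precision words x 512 deep.
  return static_cast<int>(std::ceil(max_depth / (512.0 * 2.0)) * banks);
}

int filter_cache_m20ks(const DlaConfig& d) {                      // equation 4
  return (d.Wvec * d.Cvec) * d.Kvec / 2;                          // depth not limiting
}
```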
Expected throughput is modeled using the vector dimensions (Cvec, Kvec, Wvec, and Qvec), feature map sizes, output map sizes, filter sizes, and DDR bandwidth utilization. For a single convolutional layer, the number of cycles to process an image is shown in equation 5. Here, C is the number of input feature maps, K is the number of output feature maps, Q is the width of the output feature map, and P is the height of the output feature map. DSP_eff represents the efficiency of the DSPs and models any quantization issues due to Qvec and interleaving width-wise (Lw) and height-wise (Lh) as described in Section 3.4, where we ignore quantization effects on Cvec and Kvec for simplicity (e.g. if Cvec does not divide C evenly). For example, if the output image is 20 wide and Qvec = 3 with no interleaving (i.e. Lh = Lw = 1), then on the 7th cycle only the first two values of Qvec will have useful output, and the last value will be dropped. In this case, DSP_eff = 20/(CEIL(6.67) x 3) = 95%. N_cycles is the number of cycles required to generate all output feature-maps for the layer, assuming there are no memory bandwidth constraints. BYTE_req is the total bytes of prefetched filter weights required to be loaded from DDR during the convolution layer, where R_next and S_next are the filter dimensions of the next convolution layer, K_next is its number of output feature maps, and C_next is the number of feature map layers in the next convolutional layer. BYTE_ddr is the total number of bytes that can be transferred during the convolutional layer, assuming that there is one DDR memory interface that is 64 bytes wide. N_real is the estimated number of cycles required taking into account any DDR bandwidth limitations of the device.

DSP_eff = Q / (CEIL(Q/(Qvec x Lw)) x Qvec x Lw) x P / (CEIL(P/Lh) x Lh)
N_flops = 2 x K x C x Q x P x DSP_eff
N_cycles = N_flops / (N_dsps x 2)
BYTE_req = K_next x R_next x S_next x C_next x 2
BYTE_ddr = 64 x N_cycles
N_real = N_cycles x BYTE_req / BYTE_ddr        (5)

For fully-connected layers, the number of cycles is shown in equation 6, which calculates the cycles required for an entire batch of images. Here, C and K are the number of input and output feature maps for the fully-connected layer and Sbatch is the batch size used. For fully-connected layers, BYTE_req is the total bytes of the filter weights that need to be loaded. Also, we ignore any quantization effects for fully-connected layers, since empirically we show that DSP efficiency is close to 100% (shown later in Table 2).

Sbatch = Kvec x 2
N_flops = 2 x K x C x Sbatch
N_cycles = N_flops / (N_dsps x 2)
BYTE_req = C x K x 2
BYTE_ddr = 64 x N_cycles
N_real = N_cycles x BYTE_req / BYTE_ddr        (6)

To get the final throughput in terms of images per second, we divide the clock frequency of the design by the total cycles for all layers to process. For fully-connected layers, we have to normalize to one image, so we divide by the batch size Sbatch. We ignore the execution time of other layers, such as Norm and ReLU, since these are executed concurrently with the convolutional or fully-connected layers and have a negligible execution overhead.

T_all = fmax / ( SUM_conv(N_real) + SUM_fc(N_real / Sbatch) )        (7)

Using both the resource usage estimates and the throughput model, we can find the optimal Cvec and Kvec values for a given FPGA device, assuming all other values are set by the user (e.g. fmax, Wvec, etc.). A curve of this is shown in the results section in Figure 8.
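Similarly, the throughput model (equations 5-7) can be coded directly, which is what the design space exploration sweeps over candidate Cvec and Kvec values. The structures and parameter packing below are illustrative assumptions; the arithmetic follows the equations as printed above.

```cpp
#include <cmath>
#include <vector>

struct ConvShape { double C, K, P, Q, R_next, S_next, C_next, K_next; };
struct FcShape   { double C, K; };

// Equation 5: estimated cycles for one convolutional layer (one image).
double conv_cycles(const ConvShape& l, double n_dsps,
                   double Qvec, double Lw, double Lh) {
  const double dsp_eff = l.Q / (std::ceil(l.Q / (Qvec * Lw)) * Qvec * Lw) *
                         l.P / (std::ceil(l.P / Lh) * Lh);
  const double n_flops  = 2.0 * l.K * l.C * l.Q * l.P * dsp_eff;
  const double n_cycles = n_flops / (n_dsps * 2.0);
  const double byte_req = l.K_next * l.R_next * l.S_next * l.C_next * 2.0;
  const double byte_ddr = 64.0 * n_cycles;
  return n_cycles * byte_req / byte_ddr;                     // N_real
}

// Equation 6: estimated cycles for one fully-connected layer (whole batch).
double fc_cycles(const FcShape& l, double n_dsps, double s_batch) {
  const double n_cycles = (2.0 * l.K * l.C * s_batch) / (n_dsps * 2.0);
  const double byte_req = l.C * l.K * 2.0;
  const double byte_ddr = 64.0 * n_cycles;
  return n_cycles * byte_req / byte_ddr;                     // N_real
}

// Equation 7: final images-per-second estimate for one candidate configuration.
double images_per_sec(const std::vector<ConvShape>& convs,
                      const std::vector<FcShape>& fcs, double n_dsps,
                      double fmax, double Qvec, double s_batch) {
  double total = 0.0;
  for (const auto& c : convs) total += conv_cycles(c, n_dsps, Qvec, 1.0, 1.0);
  for (const auto& f : fcs)   total += fc_cycles(f, n_dsps, s_batch) / s_batch;
  return fmax / total;
}
```

Sweeping candidate (Cvec, Kvec) pairs through the resource helpers above and this throughput estimate is the kind of exploration that produces the surface shown later in Figure 8.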
5. EXPERIMENTAL EVALUATION

We evaluate our DLA by implementing the AlexNet topology on Intel's Arria 10 dev kit, which contains an A10-1150 device (20nm). We use a batch size of 1 for convolution layers, and 96 for the fully-connected layers, as described in Section 3.7. We use only one bank of DDR4x64 at 1200MHz with a total bandwidth of 17GB/s to reduce the power required for the FPGA. We compare against the work in [16] and [20]. Additionally, we compare against the best known results for nVidia's TitanX GPU (28nm) taken from [3]. Note that nVidia used 28nm for its last generation GPU and skipped the 20nm node, which is why TitanX is used in this comparison. When measuring throughput, we measure the total system throughput, which includes all the data transfers of the images to the FPGA using the ILSVRC data set [15]; these transfers would be incurred in a real application and are not included in [20] nor [3]. In order to hide the latency of the transfers, we pipeline the execution of the DLA with the image data transfers from host to FPGA DDR memory. Also note that the data precision varies between fixed and floating point across the studies in [16, 20, 3] and our work. Previous work [16, 5] has shown the limited impact of 16-bit fixed point when compared to 16-bit floating point, so this is not described here.
6. RESULTS

To illustrate the efficiency of our architecture, we show the GFLOPS of the DLA for each fully-connected and convolutional layer in Table 2, as well as the DSP efficiency. Here, we define DSP efficiency as the percentage of time the DSP is occupied and doing useful computation.

Layer | Eff. GFLOPS | Act. GFLOPS | Eff.
Conv1 | 2,308       | 1,154       | 82.9%
Conv2 | 1,740       | 870         | 62.5%
Conv3 | 1,960       | 980         | 72.4%
Conv4 | 1,960       | 980         | 72.4%
Conv5 | 1,743       | 871         | 62.6%
Fc6   | 1,389       | 1,389       | 99.8%
Fc7   | 1,386       | 1,386       | 99.6%
Fc8   | 1,378       | 1,378       | 99.0%

Table 2: The average GFLOPS achieved for convolutional and fully-connected layers and the DSP efficiency when using an 8 x 48 configuration. Shows both effective GFLOPS (Eff. GFLOPS) due to Winograd and actual GFLOPS (Act. GFLOPS).

It is clear that for most layers, DSP efficiency and GFLOPS are relatively high, which is required to be competitive against the GPU. The DSP efficiency differs between layers because the vectorization factors (Wvec, Qvec, Kvec, Cvec, Svec) lead to different quantization inefficiencies for different feature, filter and output dimensions, as described in equation 5. For instance, Conv2 has the lowest efficiency because it uses 5x5 filter weights, which are sub-optimally vectorized with the 1x3 tile sizes used. Moreover, the FC layers have close to ideal efficiency because the dimensions of their input features, filter weights, and output features are large with respect to the vectorization factors, i.e. how many features and filters are loaded and how many output features are computed every cycle, as discussed in Section 3.7.

We should note that for Conv1, we achieve a high efficiency even though the number of input feature maps for the first layer is three, which is not wide enough to fill up the vector-8 dot-product units in each PE (i.e. 3 is less than Cvec = 8). In order to get around this limitation, we fold the three input feature maps to create 48 sub-feature maps, such that we can saturate the dot-product width.

Figure 8 plots the achievable throughput for various Cvec and Kvec values, using the Arria 10 1150 device. Here, we assume fmax is 300MHz, Qvec = 4, and Wvec = 6, and we only explore positions where Kvec is an even multiple of Cvec (areas which are not even multiples are 0 in Figure 8), which leads to a more efficient memory structure for the stream buffers and filter cache. Note that the highlighted red circle is one of the peak throughput numbers, with Cvec = 8 and Kvec = 48. This is our final configuration, which achieves a throughput of 1020 img/s.

Figure 8: Plot of expected throughput for various Cvec and Kvec values.

To validate our analytical models presented in Section 4, we plot the predicted img/s given by our models and the measured performance, as shown in Figure 9. Note that in Figure 9 we scale down the model img/s predictions provided by equations 5 and 7 by 16% to account for any inefficiencies in the pipelined data transfers and the overhead of data movement between the host processor and the FPGA, which is included in the measured throughput values. 16% is used because this was measured as the average difference between the system-level throughput and the FPGA device throughput. As shown in the graph, our model throughput predictions match very closely to the actual measurements.

Figure 9: A comparison of empirical data against the analytical model for the A10-1150 device.
6.1 Resource Usage

To show the impact of the shared exponent floating point optimization described in Section 3.6, we show the resource usage of a single PE using true half-precision dot-products (Half-type) vs shared exponent dot-products in Table 3. The shared exponent significantly reduces resource usage since we can leverage the DSP fully, whereas when using the half-type, a lot of logic must be used to normalize and compute the 16-bit floating point multiplications and perform the dot-product. Also, it should be noted that no impact to accuracy was seen in the top-1 and top-5 error rates (56% and 79% respectively) between our shared exponent implementation and 32-bit floating point.

FP16 config     | ALMs  | Reg
Half-type       | 10.7K | 26K
Shared Exponent | 3.3K  | 10.6K

Table 3: Resource usage of a PE without shared exponent optimizations (Half-type) and with shared exponent optimizations.

Table 4 shows the final resource usage of an 8 x 48 (Cvec x Kvec) configuration running on the Arria 10 1150 device at 303MHz.

ALMs       | Reg  | M20K       | DSPs       | Freq.
246K (58%) | 681K | 2487 (92%) | 1476 (97%) | 303MHz

Table 4: Resource usage and clock frequency on the Arria 10 1150 device, for an 8 x 48 configuration running at 303MHz.

6.2 FPGA Comparisons to the State-of-the-Art

Table 5 and Table 6 show our comparisons against prior work on FPGAs and GPUs respectively. As Table 5 shows, we achieve 8.4x more GFLOPS when compared to the latest UltraScale (KU, 20nm [19]) result, which uses a batch size of 32 for the fully-connected layers, and 19x more GFLOPS than the latest Stratix V result, both running AlexNet. It is important to note that in [20] the authors are only able to use 50% of DSP resources and claim that this is due to a limitation in SDAccel when using partial reconfiguration. However, even if they were able to use 100% of the DSPs, the 8.4x gap would still not be closed, since they are only able to achieve a 14.7% efficiency of their DSPs, which assumes 1.1 TOPS for the KU060 device at 200MHz used in [20].

Stratix V (28nm) [16] | KU060 (20nm) [20] | DLA (20nm)
72.4 GOPS             | 165 GOPS          | 1382 GFLOPS

Table 5: A comparison of our DLA against [20, 16] for AlexNet. For the DLA, effective flops are shown due to Winograd.

We should note that in [20] and [16], they show better GOPS numbers for VGG of 266 GOPS and 118 GOPS respectively. Since our architecture is also applicable to VGG, which is based on convolutional and fully-connected layers as well, our performance will not be impacted negatively with the VGG topology. In fact, since VGG is more regular, DSP efficiency is improved in the previous work [20], and we believe that this should also benefit the DLA architecture.

Finally, in Table 6 we show a comparison of our work against the best known nVidia results for the TitanX and M4 running AlexNet with image sizes of 224x224 and batch size 128 (note that the M4 whitepaper doesn't specify batch size). The TitanX card has a peak 6.1 TFLOPS, compared to the 1150 Arria 10 device which has 1.3 TFLOPS, and as Table 6 shows, the TitanX is able to beat our work in terms of raw performance. However, when normalized against power consumption, we are competitive with both nVidia devices [3, 10]. Also note that the img/s/W numbers shown in Table 6 are 5.8x better than the img/s/W for AlexNet presented in [20].

              | img/s | Watts (Wbrd) | Peak Ops   | img/s/W
DLA (20nm)    | 1020  | 45           | 1.3 TFLOPS | 23
KU060 (20nm)  | 104   | 25           | 3.6 TOPS   | 4
TitanX (28nm) | 5120  | 227          | 6.1 TFLOPS | 23
M4 (28nm)     | 1150  | 58           | 2.2 TFLOPS | 20

Table 6: A comparison of the DLA at 303MHz against [20] and [3, 10]. KU060 peak operations are integer; the rest are 32-bit floating point.
6.2.1 Discussion on Performance Comparisons

There are several simplifications the authors make in [3] which can significantly boost the performance of the TitanX result shown in Table 6, including the removal of communication overhead and the use of random data instead of the ILSVRC database set. As such, we suspect that the raw performance numbers are overly optimistic and, unlike our throughput measurements, do not reflect the actual throughput of a production system. Additionally, the KU060 104 img/s is estimated using Figure 10d in [20], which assumes no execution overhead for data transfers and ignores the execution time of the non-linear layers (i.e. Pool, Norm, ReLU), which again is overly optimistic. Due to these simplifications, we suspect that the relative system performance benefit of our DLA is much larger than what is reported in Table 6.

7. CONCLUSIONS

We describe a novel architecture written in OpenCL, the DLA, targeted for computing CNNs on FPGAs. We demonstrate an approach that reduces the required memory bandwidth by an order-of-magnitude through the use of an on-chip stream buffer that efficiently stores input and output feature maps. Additionally, we demonstrate a vectorization approach that achieves over 60% DSP efficiency and uses the Winograd transform to significantly reduce the DSPs required to perform the convolution layers. Because of these improvements, we are able to achieve an overall system-level performance that is 10x faster than the state-of-the-art on FPGAs when running AlexNet, and that is competitive in energy efficiency with the best known results on nVidia's TitanX GPU at 23 img/s/W.

Future work includes mapping other CNNs such as GoogLeNet and VGG to our architecture, and exploring how run-time reconfigurability may impact the performance of our architecture.

8. ACKNOWLEDGEMENTS

We would like to thank Stephen Weston for his insightful comments and Kevin Jin for the experimental data.

9. REFERENCES

[1] Altera. Arria 10 Device Overview. White Paper A10-OVERVIEW. Altera Corporation, Jan. 2015.
[2] S. Cadambi, A. Majumdar, M. Becchi, S. Chakradhar, and H. P. Graf. A programmable parallel accelerator for learning and classification. In Proceedings of the 19th International Conference on Parallel Architectures and Compilation Techniques, PACT '10, pages 273-284, New York, NY, USA, 2010. ACM.
[3] S. Chintala. convnet-benchmarks, 2016.
[4] C. Farabet, C. Poulet, J. Y. Han, and Y. LeCun. CNP: An FPGA-based processor for convolutional networks. In 2009 International Conference on Field Programmable Logic and Applications, pages 32-37, Aug. 2009.
[5] S. Gupta, A. Agrawal, K. Gopalakrishnan, and P. Narayanan. Deep learning with limited numerical precision. In D. Blei and F. Bach, editors, Proceedings of the 32nd International Conference on Machine Learning (ICML-15), pages 1737-1746. JMLR Workshop and Conference Proceedings, 2015.
[6] Khronos. The open standard for parallel programming of heterogeneous systems, 2015.
[7] A. Krizhevsky, I. Sutskever, and G. E. Hinton. ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems, NIPS '12, pages 1097-1105, 2012.
[8] A. Lavin. Fast algorithms for convolutional neural networks. CoRR, abs/1509.09308, 2015.
[9] nVidia. GPU-Based Deep Learning Inference: A Performance and Power Analysis. November 2015.
[10] NVIDIA. NVIDIA Tesla M4 GPU Accelerator. Apr. 2016.
[11] K. Ovtcharov, O. Ruwase, J.-Y. Kim, J. Fowers, K. Strauss, and E. Chung. Accelerating deep convolutional neural networks using specialized hardware. February 2015.
[12] M. Peemen, A. A. A. Setio, B. Mesman, and H. Corporaal. Memory-centric accelerator design for convolutional neural networks. In 2013 IEEE 31st International Conference on Computer Design (ICCD), pages 13-19, Oct. 2013.
[13] J. Qiu, J. Wang, S. Yao, K. Guo, B. Li, E. Zhou, J. Yu, T. Tang, N. Xu, S. Song, Y. Wang, and H. Yang. Going deeper with embedded FPGA platform for convolutional neural network. In Proceedings of the 2016 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays, FPGA '16, pages 26-35, New York, NY, USA, 2016. ACM.
[14] M. Sankaradas, V. Jakkula, S. Cadambi, S. Chakradhar, I. Durdanovic, E. Cosatto, and H. P. Graf. A massively parallel coprocessor for convolutional neural networks. In Proceedings of the 2009 20th IEEE International Conference on Application-specific Systems, Architectures and Processors, ASAP '09, pages 53-60, Washington, DC, USA, 2009. IEEE Computer Society.
[15] Stanford Vision Lab. ImageNet Large Scale Visual Recognition Challenge (ILSVRC), 2015.
[16] N. Suda, V. Chandra, G. Dasika, A. Mohanty, Y. Ma, S. Vrudhula, J.-s. Seo, and Y. Cao. Throughput-optimized OpenCL-based FPGA accelerator for large-scale convolutional neural networks. In Proceedings of the 2016 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays, FPGA '16, pages 16-25, New York, NY, USA, 2016. ACM.
[17] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. E. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich. Going deeper with convolutions. CoRR, abs/1409.4842, 2014.
[18] S. Winograd. Arithmetic Complexity of Computations, volume 33. SIAM, 1980.
[19] Xilinx. UltraScale Architecture and Product Overview. Preliminary Product Specification. Xilinx Corporation, June 2016.
[20] C. Zhang, Z. Fang, P. Zhou, and J. Cong. Caffeine: Towards uniformed representation and acceleration for deep convolutional neural networks. In Proceedings of the 2016 International Conference on Computer-Aided Design, ICCAD '16, New York, NY, USA, 2016. ACM.
[21] C. Zhang, P. Li, G. Sun, Y. Guan, B. Xiao, and J. Cong. Optimizing FPGA-based accelerator design for deep convolutional neural networks. In Proceedings of the 2015 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays, FPGA '15, pages 161-170, New York, NY, USA, 2015. ACM.