International Journal of Reconfigurable Computing

High-Performance Reconfigurable Computing

Guest Editors: Khaled Benkrid, Esam El-Araby, Miaoqing Huang, Kentaro Sano, and Thomas Steinke

Copyright © 2012 Hindawi Publishing Corporation. All rights reserved.

This is a special issue published in "International Journal of Reconfigurable Computing." All articles are open access articles distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

Editorial Board

Cristinel Ababei, USA; Neil Bergmann, Australia; K. L. M. Bertels, The Netherlands; Christophe Bobda, Germany; Miodrag Bolic, Canada; João Cardoso, Portugal; Paul Chow, Canada; René Cumplido, Mexico; Aravind Dasu, USA; Claudia Feregrino, Mexico; Andres D. Garcia, Mexico; Soheil Ghiasi, USA; Diana Göhringer, Germany; Reiner Hartenstein, Germany; Scott Hauck, USA; Michael Hübner, Germany; John Kalomiros, Greece; Volodymyr Kindratenko, USA; Paris Kitsos, Greece; Chidamber Kulkarni, USA; Miriam Leeser, USA; Guy Lemieux, Canada; Heitor Silverio Lopes, Brazil; Martin Margala, USA; Liam Marnane, Ireland; Eduardo Marques, Brazil; Máire McLoone, UK; Seda Ogrenci Memik, USA; Gokhan Memik, USA; Daniel Mozos, Spain; Nadia Nedjah, Brazil; Nik Rumzi Nik Idris, Malaysia; José Nuñez-Yañez, UK; Fernando Pardo, Spain; Marco Platzner, Germany; Salvatore Pontarelli, Italy; Mario Porrmann, Germany; Viktor K. Prasanna, USA; Leonardo Reyneri, Italy; Teresa Riesgo, Spain; Marco D. Santambrogio, USA; Ron Sass, USA; Patrick R. Schaumont, USA; Andrzej Sluzek, Singapore; Walter Stechele, Germany; Todor Stefanov, The Netherlands; Gregory Steffan, Canada; Gustavo Sutter, Spain; Lionel Torres, France; Jim Torresen, Norway; W. Vanderbauwhede, UK; Müştak E. Yalçın, Turkey

Contents

High-Performance Reconfigurable Computing, Khaled Benkrid, Esam El-Araby, Miaoqing Huang, Kentaro Sano, and Thomas Steinke
Volume 2012, Article ID 104963, 2 pages

A Convolve-And-MErge Approach for Exact Computations on High-Performance Reconfigurable Computers, Esam El-Araby, Ivan Gonzalez, Sergio Lopez-Buedo, and Tarek El-Ghazawi
Volume 2012, Article ID 925864, 14 pages

High Performance Biological Pairwise Sequence Alignment: FPGA versus GPU versus Cell BE versus GPP, Khaled Benkrid, Ali Akoglu, Cheng Ling, Yang Song, Ying Liu, and Xiang Tian
Volume 2012, Article ID 752910, 15 pages

An Evaluation of an Integrated On-Chip/Off-Chip Network for High-Performance Reconfigurable Computing, Andrew G. Schmidt, William V. Kritikos, Shanyuan Gao, and Ron Sass
Volume 2012, Article ID 564704, 15 pages

A Coarse-Grained Reconfigurable Architecture with Compilation for High Performance, Lu Wan, Chen Dong, and Deming Chen
Volume 2012, Article ID 163542, 17 pages

A Protein Sequence Analysis Hardware Accelerator Based on Divergences, Juan Fernando Eusse, Nahri Moreano, Alba Cristina Magalhaes Alves de Melo, and Ricardo Pezzuol Jacobi
Volume 2012, Article ID 201378, 19 pages

Optimizing Investment Strategies with the Reconfigurable Hardware Platform RIVYERA, Christoph Starke, Vasco Grossmann, Lars Wienbrandt, Sven Koschnicke, John Carstens, and Manfred Schimmler
Volume 2012, Article ID 646984, 10 pages

The "Chimera": An Off-The-Shelf CPU/GPGPU/FPGA Hybrid Computing Platform, Ra Inta, David J. Bowman, and Susan M. Scott
Volume 2012, Article ID 241439, 10 pages

Throughput Analysis for a High-Performance FPGA-Accelerated Real-Time Search Application, Wim Vanderbauwhede, S. R. Chalamalasetti, and M. Margala
Volume 2012, Article ID 507173, 16 pages

Hindawi Publishing Corporation
International Journal of Reconfigurable Computing
Volume 2012, Article ID 104963, 2 pages
doi:10.1155/2012/104963

Editorial

High-Performance Reconfigurable Computing

Khaled Benkrid,1 Esam El-Araby,2 Miaoqing Huang,3 Kentaro Sano,4 and Thomas Steinke5

1 School of Engineering, The University of Edinburgh, Edinburgh EH9 3JL, UK
2 Electrical Engineering and Computer Science, The Catholic University of America, Washington, DC 20064, USA
3 Department of Computer Science and Computer Engineering, University of Arkansas, Fayetteville, AR 72701, USA
4 Graduate School of Information Sciences, Tohoku University, 6-6-01 Aramaki Aza Aoba, Sendai 980-8579, Japan
5 Zuse-Institut Berlin (ZIB), Takustraße 7, 14195 Berlin-Dahlem, Germany

Correspondence should be addressed to Khaled Benkrid, [email protected]

Received 28 February 2012; Accepted 28 February 2012

Copyright © 2012 Khaled Benkrid et al. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

This special issue presents some of the latest developments in the burgeoning area of high-performance reconfigurable computing (HPRC), which aims to harness the high performance and low power of reconfigurable hardware, in the form of field programmable gate arrays (FPGAs), in high-performance computing (HPC) applications.

The issue starts with three widely popular HPC applications, namely, financial computing, bioinformatics and computational biology, and high-throughput data search. First, Starke et al. from the University of Kiel in Germany present the use of a massively parallel FPGA platform, called RIVYERA, in the high-performance and low-power optimization of investment strategies. The authors demonstrate an FPGA-based implementation of an investment strategy algorithm that considerably outperforms a single CPU node in terms of raw processing power and energy efficiency. Furthermore, it is shown that the implemented optimized investment strategy outperforms a buy-and-hold strategy. Then, Vanderbauwhede et al. from Glasgow University and the University of Massachusetts Lowell propose a design for the scoring part of a high-throughput real-time search application on FPGAs. The authors use a low-latency Bloom filter to realize high-performance information filtering. An analytical model of the application throughput is built around the Bloom filter. The experimental results on the Novo-G reconfigurable supercomputer demonstrate a 20× speedup compared with a software implementation on a 3.4 GHz Intel Core i7 processor. After that, Eusse et al. from the University of Brasilia and the Federal University of Mato Grosso do Sul in Brazil present an FPGA-accelerated protein sequence analysis solution. The authors integrate the concept of divergence into the Viterbi algorithm used in the HMMER program suite for biological sequence alignment, in order to reduce the area of the score matrix in which the trace-back alignment is made. This technique leads to large speedups (182×) compared to nonaccelerated pure software processing.

The issue then presents a number of architectural concerns in the design of HPRC systems, namely, reconfigurable hardware architecture, communication network design, and arithmetic design. First, Wan et al. from the University of Illinois at Urbana Champaign and Magma Design Automation Inc. in the USA present a coarse-grained reconfigurable architecture (CGRA) with a Fast Data Relay (FDR) mechanism to enhance its performance. This is achieved through multicycle data transmission concurrent with computation, and effective reduction of communication traffic congestion. The authors also propose compiler techniques to efficiently utilize the FDR feature. The experimental results for various multimedia applications show that FDR combined with the new compiler delivers up to 29% and 21% higher performance than other CGRAs: ADRES and RCP, respectively. The following paper by Schmidt et al. from the University of Southern California and the University of North Carolina presents an integrated on-chip/off-chip network with MPI-style point-to-point messaging, implemented on an all-FPGA computing cluster. The most salient difference between the network architecture presented in this paper and state-of-the-art Network-on-Chip (NoC) architectures is the use of a single full-crossbar switch. The results differ from other efforts for several reasons. First, the implementation target is the programmable logic of an FPGA. Second, most NoCs assume that a full crossbar is too expensive in terms of resources, while within the context of an HPRC system the programmable logic resources are fungible. The authors justify their focus on the network performance by the fact that overall performance is limited by the bandwidth off the chip rather than by the mere number of compute cores on the chip. After that, El-Araby et al. from the Catholic University of America, Universidad Autonoma de Madrid, and the George Washington University present a technique for the acceleration of arbitrary-precision arithmetic on HPRC systems. Efficient support of arbitrary-precision arithmetic in very large science and engineering simulations is particularly important as numerical nonrobustness becomes an increasing issue in such applications. Today's solutions for arbitrary-precision arithmetic are usually implemented in software, and performance is significantly reduced as a result. In order to reduce this performance gap, the paper investigates the acceleration of arbitrary-precision arithmetic on HPRC systems.

The special issue ends with two papers which present reconfigurable hardware in HPC in the context of other computer technologies. First, Benkrid et al. from the University of Edinburgh, Scotland, and the University of Arizona, USA, present a comparative study of FPGAs, Graphics Processing Units (GPUs), IBM's Cell BE, and General Purpose Processors (GPPs) in the design and implementation of a biological sequence alignment application. Using speed and energy consumption, in addition to purchase and development costs, as comparison criteria, the authors argue that FPGAs are high-performance economic solutions for sequence alignment applications. In general, however, they argue that FPGAs need to achieve at least two orders of magnitude speedup compared to GPPs and one order of magnitude speedup compared to GPUs to justify their relatively longer development times and higher purchase costs. The following paper by Inta et al. from the Australian National University presents an off-the-shelf CPU, GPU, and FPGA heterogeneous computing platform, called Chimera, as a potential high-performance economic solution for certain HPC applications. Motivated by computational demands in the area of astronomy, the authors propose the Chimera platform as a viable alternative for many common computationally bound problems. Advantages and challenges of migrating applications to such heterogeneous platforms are discussed by using demonstrator applications such as Monte Carlo integration and normalized cross-correlation. The authors show that the most significant bottleneck in multidevice computational pipelines is the communications interconnect.

We hope that this special issue will serve as an introduction to those who have newly joined, or are interested in joining, the HPRC research community as well as provide specialists with a sample of the latest developments in this exciting research area.

Khaled Benkrid
Esam El-Araby
Miaoqing Huang
Kentaro Sano
Thomas Steinke

Hindawi Publishing Corporation
International Journal of Reconfigurable Computing
Volume 2012, Article ID 925864, 14 pages
doi:10.1155/2012/925864

Research Article

A Convolve-And-MErge Approach for Exact Computations on High-Performance Reconfigurable Computers

Esam El-Araby,1 Ivan Gonzalez,2 Sergio Lopez-Buedo,2 and Tarek El-Ghazawi3

1 Department of Electrical Engineering and Computer Science, The Catholic University of America, Washington, DC 20064, USA
2 Departments of Computer Engineering at Escuela Politecnica Superior of Universidad Autonoma de Madrid, 28049 Madrid, Spain
3 Department of Electrical and Computer Engineering, The George Washington University, Washington, DC 20052, USA

Correspondence should be addressed to Esam El-Araby, [email protected]

Received 31 October 2011; Revised 1 February 2012; Accepted 13 February 2012

Academic Editor: Thomas Steinke

Copyright © 2012 Esam El-Araby et al. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

This work presents an approach for accelerating arbitrary-precision arithmetic on high-performance reconfigurable computers (HPRCs). Although faster and smaller, fixed-precision arithmetic has inherent rounding and overflow problems that can cause errors in scientific or engineering applications. This recurring phenomenon is usually referred to as numerical nonrobustness.
Therefore, there is an increasing interest in the paradigm of exact computation, based on arbitrary-precision arithmetic. There are a number of libraries and/or languages supporting this paradigm, for example, the GNU multiprecision (GMP) library. However, the performance of computations is significantly reduced in comparison to that of fixed-precision arithmetic. In order to reduce this performance gap, this paper investigates the acceleration of arbitrary-precision arithmetic on HPRCs. A Convolve-And-MErge approach is proposed that implements virtual convolution schedules derived from the formal representation of the arbitrary-precision multiplication problem. Additionally, dynamic (nonlinear) pipeline techniques are exploited in order to achieve speedups ranging from 5x (addition) to 9x (multiplication), while keeping resource usage of the reconfigurable device low, ranging from 11% to 19%.

1. Introduction

Present-day computers built around fixed-precision components perform integer and/or floating point arithmetic operations using fixed-width operands, typically 32 and/or 64 bits wide. However, some applications require larger precision arithmetic. For example, operands in public-key cryptography algorithms are typically thousands of bits long. Arbitrary-precision arithmetic is also important for scientific and engineering computations where the roundoff errors arising from fixed-precision arithmetic cause convergence and stability problems. Although many applications can tolerate fixed-precision problems, there is a significant number of other applications, such as finance and banking, in which numerical overflow is intolerable. This recurring phenomenon is usually referred to as numerical nonrobustness [1]. In response to this problem, exact computation, based on exact/arbitrary-precision arithmetic, was first introduced in 1995 by Yap and Dube [2] as an emerging numerical computation paradigm. In arbitrary-precision arithmetic, also known as bignum arithmetic, the size of operands is only limited by the available memory of the host system [3, 4].

Among other fields, arbitrary-precision arithmetic is used, for example, in computational metrology and coordinate measuring machines (CMMs), computation of fundamental mathematical constants such as π to millions of digits, rendering fractal images, computational geometry, geometric editing and modeling, and constraint logic programming (CLP) languages [1–5].

In the earlier days of computers, there were some machines that supported arbitrary-precision arithmetic in hardware. Two examples of these machines were the IBM 1401 [6] and the Honeywell 200 Liberator [7] series. Nowadays, arbitrary-precision arithmetic is mostly implemented in software, perhaps embedded into a computer compiler. Over the last decade, a number of bignum software packages have been developed. These include the GNU multiprecision (GMP) library, CLN, LEDA, Java.math, BigFloat, BigDigits, and Crypto++ [4, 5]. In addition, there exist stand-alone application software/languages such as PARI/GP, Mathematica, Maple, Macsyma, dc programming language, and REXX programming language [4].

Arbitrary-precision numbers are often stored as large, variable-length arrays of digits in some base related to the system word-length. Because of this, arbitrary-precision arithmetic is slower than fixed-precision arithmetic, whose operand size is closely tied to the size of the processor's internal registers [2]. There have been some attempts at hardware implementations. However, those attempts usually amounted to specialized hardware for small-size discrete multiprecision and/or large-size fixed-precision [8–12] integer arithmetic rather than real arbitrary-precision arithmetic.

High-performance reconfigurable computers (HPRCs) have shown remarkable results in comparison to conventional processors in those problems requiring custom designs because of the mismatch with operand widths and/or operations of conventional ALUs. For example, speedups of up to 28,514 have been reported for cryptography applications [13, 14], up to 8,723 for bioinformatics sequence matching [13], and up to 32 for remote sensing and image processing [15, 16]. Therefore, arbitrary-precision arithmetic seemed to be a good candidate for acceleration on reconfigurable computers.

This work explores the use of HPRCs for arbitrary-precision arithmetic. We propose a hardware architecture that is able to implement addition, subtraction, and multiplication, as well as convolution, on arbitrary-length operands up to 128 exbibytes. The architecture is based on virtual convolution scheduling. It has been validated on a classic HPRC machine, the SRC-6 [17] from SRC Computers, showing speedups ranging from 2 to 9 in comparison to the portable version of the GMP library. This speedup is in part attained due to the dynamic (nonlinear) pipelining techniques that are used to eliminate the effects of deeply pipelined reduction operators.

The paper is organized as follows. Section 2 presents a short overview of HPRC machines. The problem is formulated in Section 3. Section 4 describes the proposed approach and architecture, augmented with a numerical example illustrating the details of the proposed approach. Section 5 shows the experimental work. Implementation details are also given in Section 5, as well as a performance comparison to the SW version of GMP. Finally, Section 6 presents the conclusions and future directions.

2. High Performance Reconfigurable Computing

In recent years, the concept of high-performance reconfigurable computing has emerged as a promising alternative to conventional processing in order to enhance the performance of computers. The idea is to accelerate a parallel computer with reconfigurable devices such as FPGAs, where a custom hardware implementation of the critical sections of the code is performed in the reconfigurable device. Although the clock frequency of the FPGA is typically one order of magnitude lower than that of high-end microprocessors, significant speedups are obtained due to the increased parallelism of hardware. This performance is especially important for those algorithms not matching the architecture of conventional microprocessors, because of either the operand lengths (e.g., bioinformatics) or the operations performed (e.g., cryptography). Moreover, the power consumption is reduced in comparison to conventional platforms, and the use of reconfigurable devices brings flexibility closer to that of SW, as opposed to other HW-accelerated solutions such as ASICs. In other words, the goal of HPRC machines is to achieve synergy between the low-level parallelism of hardware and the system-level parallelism of high-performance computing (HPC) machines.

In general, HPRCs can be classified as either nonuniform node uniform systems (NNUSs) or uniform node nonuniform systems (UNNSs) [13]. NNUSs consist of only one type of node. Nodes are heterogeneous, containing both FPGAs and microprocessors. FPGAs are connected directly to the microprocessors inside the node. On the other hand, UNNS nodes are homogeneous, containing either FPGAs or microprocessors, which are linked via an interconnection network. The platform used in this paper, SRC-6 [17], belongs to the second category.

The SRC-6 platform consists of one or more general-purpose microprocessor subsystems, one or more MAP reconfigurable processor subsystems, and global common memory (GCM) nodes of shared memory space [17]. These subsystems are interconnected through a Hi-Bar Switch communication layer; see Figure 1. Multiple tiers of the Hi-Bar Switch can be used to create large-node-count scalable systems. Each microprocessor board is based on 2.8 GHz Intel Xeon microprocessors. Microprocessor boards are connected to the MAP boards through the SNAP interconnect. The SNAP card plugs into the memory DIMM slot on the microprocessor motherboard to provide higher data transfer rates between the boards than the less efficient but common PCI solution. The peak transfer rate between a microprocessor board and the MAP board is 1600 MB/s. The hardware architecture of the SRC-6 MAP processor is shown in Figure 1. The MAP Series C board is composed of one control FPGA and two user FPGAs, all Xilinx Virtex II-6000-4. Additionally, each MAP unit contains six interleaved banks of on-board memory (OBM) with a total capacity of 24 MB. The maximum aggregate data transfer rate among all FPGAs and on-board memory is 4800 MB/s. The user FPGAs are configured in such a way that one is in the master mode and the other is in the slave mode. The two FPGAs of a MAP are directly connected using a bridge port. Furthermore, MAP processors can be chained together using a chain port to create an array of FPGAs.

3. Problem Formulation

Exact arithmetic uses the four basic arithmetic operations (+, −, ×, ÷) over the rational field Q to support exact computations [1–5]. Therefore, the problem of exact computation is reduced to implementing these four basic arithmetic operations with arbitrary-sized operands.
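As an illustration of what arbitrary-sized operands imply in practice, the following Python sketch stores non-negative integers as little-endian arrays of fixed-width digits and implements addition and multiplication in two phases: a digit-wise column computation (for multiplication, a convolution of the two digit arrays) followed by a carry-propagating merge. This is a software sketch under stated assumptions, not the hardware architecture proposed in the paper; the 16-bit digit width and the names `to_digits`, `from_digits`, `merge`, `add`, and `multiply` are hypothetical choices made for the example.

```python
from typing import List

BASE_BITS = 16          # assumed digit width; a hypothetical choice for this sketch
BASE = 1 << BASE_BITS   # each digit lies in [0, BASE)


def to_digits(value: int) -> List[int]:
    """Split a non-negative integer into little-endian base-BASE digits."""
    if value < 0:
        raise ValueError("this sketch handles non-negative integers only")
    digits = []
    while True:
        digits.append(value & (BASE - 1))
        value >>= BASE_BITS
        if value == 0:
            return digits


def from_digits(digits: List[int]) -> int:
    """Recombine little-endian base-BASE digits into a Python integer."""
    value = 0
    for digit in reversed(digits):
        value = (value << BASE_BITS) + digit
    return value


def merge(columns: List[int]) -> List[int]:
    """Propagate carries so that every output digit lies in [0, BASE)."""
    digits, carry = [], 0
    for column in columns:
        total = column + carry
        digits.append(total & (BASE - 1))
        carry = total >> BASE_BITS
    while carry:
        digits.append(carry & (BASE - 1))
        carry >>= BASE_BITS
    return digits


def add(a: List[int], b: List[int]) -> List[int]:
    """Digit-wise column sums followed by a carry merge."""
    n = max(len(a), len(b))
    columns = [(a[i] if i < len(a) else 0) + (b[i] if i < len(b) else 0)
               for i in range(n)]
    return merge(columns)


def multiply(a: List[int], b: List[int]) -> List[int]:
    """Convolve the digit arrays, then merge the column sums."""
    # columns[k] accumulates every product a[i] * b[j] with i + j == k.
    # Python integers do not overflow; fixed-width hardware would need
    # wider accumulators for these column sums.
    columns = [0] * (len(a) + len(b) - 1)
    for i, x in enumerate(a):
        for j, y in enumerate(b):
            columns[i + j] += x * y
    return merge(columns)


if __name__ == "__main__":
    x, y = 2**200 + 12345, 3**100
    assert from_digits(multiply(to_digits(x), to_digits(y))) == x * y
    assert from_digits(add(to_digits(x), to_digits(y))) == x + y
```

Separating the column computation from the carry merge keeps the inner products independent of one another, which is the property a parallel implementation would exploit; subtraction, division, and signed or rational operands are left out of this sketch.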