arXiv:hep-lat/9401030v1  26 Jan 1994

Computers for Lattice Field Theories

Y. Iwasaki^a

^a Institute of Physics, University of Tsukuba, Ibaraki 305, Japan

Parallel computers dedicated to lattice field theories are reviewed with emphasis on the three recent projects, the Teraflops project in the US, the CP-PACS project in Japan and the 0.5-Teraflops project in the US. Some new commercial parallel computers are also discussed. Recent development of semiconductor technologies is briefly surveyed in relation to possible approaches toward Teraflops computers.

1. Introduction

Numerical studies of lattice field theories have developed significantly in parallel with the development of computers during the past decade. Of particular importance in this regard has been the construction of dedicated QCD computers (see Table 1; for earlier reviews see Ref. [1]) and the move of commercial vendors toward parallel computers in recent years. Due to these developments we now have access to parallel computers which are capable of 5–10 Gflops of sustained speed.

However, a fully convincing numerical solution of many lattice field theory problems, in particular those of lattice QCD, requires much more speed. In fact, the typical number of floating point operations required in these problems, such as full QCD hadron mass spectrum calculations, often exceeds 10^18, which translates to 115 days of computing time at a sustained speed of 100 Gflops. Under these circumstances we really need computers with a sustained speed exceeding 100 Gflops.
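As a quick check of this estimate, the following minimal sketch works out the arithmetic, assuming exactly 10^18 operations and a sustained 100 Gflops, the two numbers quoted above.

    # Back-of-the-envelope check of the computing-time estimate in the text.
    total_flop = 1e18                      # operations for a full QCD calculation
    sustained  = 100e9                     # sustained speed in flop/s (100 Gflops)

    days = total_flop / sustained / (24 * 3600)
    print(f"{days:.0f} days")              # -> 116, i.e. the ~115 days quoted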
In this talk I review the present status of efforts toward the construction of dedicated parallel computers with peak speeds of 100–1000 Gflops. Of the six projects in this category (see Table 1), APE100 [2] is near completion and the upgraded ACPMAPS [3] is running now. Because they have already been reviewed previously [1], we shall only describe their most recent status. The three recent projects, the Teraflops project [4,5] in the United States, the CP-PACS project [6] in Japan and the 0.5-Teraflops project [7] in the United States, are at varying stages of development. I shall describe them in detail. Finally, APE1000 [8] is the future plan of the APE Collaboration, of which details are not yet available.

Table 1
List of dedicated QCD computers

    Project              Peak speed (Gflops)   Year
    Columbia (16 nodes)  0.25                  1985
    Columbia (64 nodes)  1.0                   1987
    Columbia (256 nodes) 16                    1989
    APE (4 nodes)        0.25                  1986
    APE (16 nodes)       1.0                   1988
    QCDPAX               14                    1990
    GF11                 11                    1991
    ACPMAPS              5                     1991
    APE100               6 (→100)              1992 (→1994)
    ACPMAPS upgraded     50                    1993
    Teraflops            1,600                 1996
    CP-PACS              ≥300                  1996
    0.5-Teraflops        800                   1995
    APE1000              ∼1000                 —

A key ingredient in the fast progress of parallel computers in recent years is the development of semiconductor technologies. Understanding this aspect is important when one considers possible approaches toward a Teraflops of speed. I shall therefore start this review with a brief reminder of the development of vector and parallel computers and the technological reasons why parallel computers have recently exceeded vector computers in computing capability (Sec. 2). The status of APE100 and of the upgraded ACPMAPS is summarized in Sec. 3. The US Teraflops, CP-PACS and 0.5-Teraflops projects are described in Sec. 4. Powerful parallel computers are also available from commercial vendors. In Sec. 5 I shall discuss two new computers, the Fujitsu VPP500 and the CRAY T3D. After these reviews I discuss several architectural issues for computers toward Teraflops in Sec. 6. A brief conclusion is given in Sec. 7.

2. Recent development of computers and semiconductor technology

In the upper part of Fig. 1 we show the progress of the peak speed of vector and parallel computers over the years. Small symbols correspond to the first shipping date of computers made by commercial vendors, with open ones for vector and filled ones for parallel type. Parallel computers dedicated to lattice QCD are plotted by large symbols. We clearly observe that the rate of progress for parallel computers is roughly double that of vector computers and that a crossover in peak speed has taken place from vector to parallel computers around 1991.

[Figure 1. Progress of theoretical peak speed. Upper panel: peak speed (Gflops) versus year for commercial vector machines (CRAY, CDC/ETA, Hitachi, Fujitsu, NEC), commercial parallel machines (TMC, nCUBE, Intel, Fujitsu AP1000, NWT) and dedicated QCD machines (Columbia, APE, QCDPAX, GF11 (IBM), FNAL, UKQCD). Lower panel: peak speed (Mflops) versus year for the Caltech machine and the PAX series (PAX-9, PAX-32, PAX-64J, PAX-128).]

The "linear fit" drawn in Fig. 1 for parallel computers can be extrapolated to the period prior to 1985. QCDPAX is the fifth-generation computer in the PAX series [9] and there are four earlier computers starting in 1978. In the lower part of Fig. 1 the peak speeds of these computers are plotted in units of Mflops, together with that of the Caltech computer described, for example, by Norman Christ at Fermilab in 1988 [1]. It is amusing to observe that the rapid increase of speed of parallel computers has been continuing for over a decade since the early days.

It is important to note that the first three PAX computers are limited to 8 bit arithmetic and the fourth one to 16 bit. We also recall that the first Columbia computer used 22 bit arithmetic. Thus not only the peak speed but also the precision of floating point numbers has increased significantly for parallel computers. Now 64 bit arithmetic is becoming standard.

To see more closely why the crossover happened, let us look at the development of semiconductor technology. In Fig. 2 we show how machine clocks have become faster in the case of ECL, which is utilized in vector-type supercomputers, as well as in the case of CMOS, which is used in personal computers and workstations. As we can see, the speed of CMOS is about 10-fold less than that of ECL. However, the power consumption and the heat output are much lower than those of ECL. Furthermore, the speed of CMOS itself has become comparable to that of ECL of the late 1980's.

[Figure 2. Machine clock (ns) of ECL and CMOS semiconductors versus year.]

The machine cycle of one nano-second is a kind of limit to reach. This is understandable because one nano-second is the time in which light travels 30 cm. In this time interval one has to load data from memory to a floating point operation unit, make a calculation and store the results to memory. Even in the ideal case of pipelined operations, one nano-second corresponds only to one Gflops. Usually a vector computer has a multiple operation unit which consists of, for example, 8 floating point operation units (FPUs). Because of this, the theoretical peak speed becomes 8 Gflops. Further, it has multiple sets of this kind of multiple FPUs; in the case of 4 sets the peak speed becomes 32 Gflops. This is the way a vector computer reaches a peak speed of order 10 Gflops. That is, recent vector computers are already parallel computers. However, it is rather difficult to proceed further in this approach because of the power consumption and the heat output.
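The peak-speed arithmetic in the last paragraph is easy to reproduce. The sketch below uses the illustrative numbers from the text (a 1 ns machine cycle, 8 FPUs per set, 4 sets); it is not a description of any particular machine.

    # Peak-speed arithmetic for a vector supercomputer, with the numbers above.
    cycle_s = 1e-9                         # one nano-second machine cycle
    n_fpu   = 8                            # pipelined FPUs in one operation unit
    n_sets  = 4                            # sets of such units

    per_fpu = 1.0 / cycle_s                # 1 Gflops per fully pipelined FPU
    peak    = per_fpu * n_fpu * n_sets
    print(f"{peak/1e9:.0f} Gflops")        # -> 32 Gflops, as quoted in the text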
On the other hand, the development of CMOS semiconductor technology, with its small size, high speed and low power consumption, has made it possible to construct a massively parallel computer which is composed of on the order of 1,000 nodes with a peak speed which exceeds that of vector-type supercomputers. This is the reason why the crossover occurred.

The speedup of CMOS has become possible due to the development of LSI technology. Figure 3 shows the development in terms of the minimum feature size or minimum spacing. Now the spacing has been reduced to 0.5 micron. This development has also led to a substantial increase of DRAM bit capacity, which has recently reached the level of 16 Mbit. The speed of transistors has also increased with the decrease of minimum spacing, because electrons can move through the minimum spacing in a shorter time. This is the reason why the machine clock has become faster.

[Figure 3. Development of the minimum spacing of LSI (microns) and of the capacity of DRAM (kbit) versus year.]

The packaging technique has also developed: Figure 4 shows the evolution of the number of pins of LSI packages.

[Figure 4. Evolution of the number of pins for plastic QFP and ceramic PGA packages at various pin pitches. (From "Nikkei Electronics", August 2, 1993.)]

Due to these developments, it is now not a dream to construct a 1 Tflops computer with 64 bit arithmetic with reasonable size and reasonable power consumption.

3. Past and present of dedicated computers

The computers of the first group in Table 1, the three computers of Columbia [10], two versions of APE [11], QCDPAX [12], GF11 [13] and ACPMAPS [14], were constructed some years ago and have been producing physics results. The characteristics of these computers are given in Table 2. These computers are already familiar to the lattice community. Therefore I refer to earlier reviews [1] for details and just emphasize that a number of interesting physics results have been produced. This fact shows that there is real benefit in constructing dedicated computers.

Table 2
Characteristics of dedicated QCD computers I

    Project          Columbia         APE              QCDPAX         GF11             ACPMAPS
    peak speed       16 Gflops        1 Gflops         14 Gflops      11 Gflops        5 Gflops
    processors       256              16               480            566              256
    network          2d torus         linear array     2d torus       Memphis switch   crossbar and
                                                                                       hypercube+
    architecture     MIMD             SIMD             MIMD           SIMD             MIMD
    CPU              80286 + 80287    —                68020          —                Weitek XL8032
    FPU              Weitek 3364 ×2   Weitek 1032 ×4   LSI Logic      Weitek 1032 ×2   chip set
                                      Weitek 1033 ×4   L64133         Weitek 1033 ×2
    SRAM             2MB              —                2MB            64KB             2MB
    DRAM             8MB              16MB             4MB            2MB              10MB
    speed/processor  64 Mflops        64 Mflops        32 Mflops      20 Mflops        20 Mflops
    host             VAX 11/780       µVAX             Sun 3/260      3090             µVAX

The computers of the second group in Table 1, the 6 Gflops version of APE100 and the upgraded ACPMAPS, have recently been completed. Both are now producing physics results, some of which have been reported at this conference. I list their characteristics in Table 3.

Table 3
Characteristics of dedicated QCD computers II

    Project          APE100          ACPMAPS
    processors       2048            612
    architecture     SIMD/MIMD       MIMD
    CPU              MAD (custom)    i860
    memory           4MB             32MB
    speed/processor  50 Mflops       80 Mflops
    network          3d torus        crossbar, hypercube+
    host             SUN WS          SGI
    peak speed       100 Gflops      50 Gflops
    arithmetic       32 bit          32 (64) bit

3.1. APE100

The architecture of APE100 [2] is a combination of SIMD and MIMD. The full machine consists of 2048 nodes with a peak speed of 100 Gflops. The network is a 3-dimensional torus. Each node has a custom-designed floating point chip called MAD. The chip contains a 32-bit adder and a multiplier with a 128-word register file. The memory size is 4 Mbytes/node of 1M×4 DRAM with 80 ns access time. The bandwidth between MAD and the memory is 50 Mbytes/sec, which corresponds to one word per 4 floating point operations. One board consists of 2×2×2 = 8 nodes with a commuter for data transfer. The communication rates on-node and inter-node are 50 Mbytes/sec and 12.5 Mbytes/sec, respectively. Each board has a controller which takes care of program flow control, address generation and memory control.
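The quoted balance of one word per 4 floating point operations follows directly from the MAD peak rate and the memory bandwidth. A minimal sketch of that arithmetic, assuming 4-byte words (the 32-bit arithmetic width quoted for APE100):

    # Memory-bandwidth balance of an APE100 node, from the numbers quoted above.
    peak_flops = 50e6          # MAD peak speed: 50 Mflops
    bandwidth  = 50e6          # MAD <-> memory bandwidth: 50 Mbytes/sec
    word_bytes = 4             # 32-bit single precision word (assumed 4 bytes)

    words_per_s    = bandwidth / word_bytes        # 12.5 Mwords/sec
    flops_per_word = peak_flops / words_per_s
    print(f"{flops_per_word:.0f} flops per word loaded")   # -> 4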
The 6 Gflops version of APE100, which is called TUBE, is running and producing physics results. A TUBE is composed of 128 nodes making a 32×2×2 torus with periodic boundary conditions. The naming originates from its topological shape. The memory size is 512 Mbytes. Four TUBEs have been completed.

The sustained speed of a TUBE is as follows. The link update with the Metropolis algorithm with 5 hits takes about 1.5 micro-seconds per link. The time for multiplication of the Wilson operator is 0.8 micro-seconds per site. These rates roughly correspond to 2.5 Gflops to 3 Gflops, which represents 40-50% of the peak speed. These figures show good efficiency.
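These figures can be cross-checked directly. The sketch below divides the quoted sustained rates by the 6 Gflops TUBE peak and also reports the work per site implied by the 0.8 micro-second Wilson-operator timing; both outputs are derived from the quoted numbers, not independent measurements.

    # Cross-check of the TUBE figures quoted above.
    peak_gflops = 6.0                      # peak speed of one TUBE
    sustained   = (2.5e9, 3.0e9)           # quoted sustained range, flop/s
    t_wilson    = 0.8e-6                   # Wilson-operator multiplication, s/site

    for s in sustained:
        print(f"{s/1e9:.1f} Gflops -> {100*s/(peak_gflops*1e9):.0f}% of peak, "
              f"~{s*t_wilson:.0f} flops/site implied")
    # -> 42% and 50% of peak (the quoted 40-50%), ~2000-2400 flops per site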
The physics subjects being studied on TUBE are the hadron spectrum and heavy quark physics, the results of which have been reported at this conference.

A Tower, which consists of 4 TUBEs with a peak speed of 25 Gflops, is being assembled now and should be working in the late fall of 1993. The full machine, which is composed of 4 Towers with a peak speed of 100 Gflops, is expected to be completed by the first quarter of 1994.

3.2. ACPMAPS Upgraded

This is an upgrade of the ACPMAPS replacing the processor boards without changing the communication backbone [3]. The ACPMAPS is a MIMD machine with distributed memory. On each node there are two Intel i860 microprocessors with a peak speed of 80 Mflops. The memory size is 32 Mbytes of DRAM for each node. The full machine consists of 612 i860 processors with a peak speed of 50 Gflops and has 20 Gbytes of memory.

The network has a cluster structure: one crate consists of 16 boards with a 16-way crossbar. A board can be either a processor node or a Bus Switch Interface board. The 16-way crossbars are connected in a complicated way which makes a hyper-cube and other extra connections. The throughput between nodes is 20 Mbytes/sec.

ACPMAPS has a strong distributed I/O system: there are 32 Exabyte tape drives and 20 Gbytes of disk space. This mass I/O subsystem is one of the characteristics of ACPMAPS.

The software package CANOPY, which has been well described several times [14,3], is very powerful for distributing physical variables to nodes without knowing the details of the hardware.

The ACPMAPS is running and doing calculations of the quenched hadron spectrum and heavy quark physics, the results of which have been reported at this conference. The sustained speeds measured on a 32^3 × 48 lattice are as follows. One link update by the heat-bath method takes 0.64 micro-seconds per link. One cycle of conjugate gradient inversion of the Wilson operator by the red-black method takes about 0.64 micro-seconds per site. The L inversion together with the U back-inversion in the ILUMR method takes 2.23 micro-seconds per site. These figures for the sustained speeds are about 10-20% of the peak speed. Therefore the efficiency is not so good compared to TUBE. However, there are several good characteristics. First, it supports both 64 and 32 bit arithmetic operations. The network is very flexible and the distributed I/O system is convenient for users.
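The "red-black method" mentioned above refers to the standard even-odd colouring of lattice sites. The sketch below is a generic textbook illustration of that decomposition, not code from ACPMAPS or CANOPY.

    # Generic illustration of red-black (even-odd) site ordering: sites are
    # coloured by the parity of their coordinates, and the Wilson hopping term
    # only connects sites of opposite colour, so each colour can be swept over
    # as an independent block.
    from itertools import product

    def checkerboard(dims):
        """Split the sites of a lattice with extents `dims` into even/odd lists."""
        even, odd = [], []
        for site in product(*(range(L) for L in dims)):
            (even if sum(site) % 2 == 0 else odd).append(site)
        return even, odd

    even, odd = checkerboard((4, 4, 4, 8))     # a small 4^3 x 8 example lattice
    print(len(even), len(odd))                 # -> 256 256: the two sublattices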
4. Projects under way and proposed

The three projects of the third group in Table 1, the Teraflops project, the CP-PACS project and the 0.5-Teraflops project, are well under way. The basic design targets are listed in Table 4.

Table 4
Characteristics of dedicated QCD computers III

    Project          Teraflops        CP-PACS            0.5 Tflops
    processors       8K               1–1.5K             16K
    architecture     MIMD             MIMD               —
    CPU              —                enhanced PA-RISC   DSP (TI)
    memory           32MB             64MB               2MB
    speed/processor  200–300 Mflops   200–300 Mflops     50 Mflops
    network          —                hypercrossbar      4d torus
    host             —                main frame         SUN WS
    peak speed       ≥1.6 Tflops      ≥300 Gflops        0.8 Tflops
    arithmetic       64 bit           64 bit             32 bit

4.1. Teraflops project

The Teraflops project [4] has changed significantly since last year. The new plan (Multidisciplinary Teraflops Project) [5] utilizes Thinking Machines' next-generation platform instead of the CM5 as originally planned. A floating point processing unit (FPU) called an arithmetic accelerator is to be constructed with a peak speed in the range of 200–300 Mflops. One node consists of 16 such FPUs plus one general processor, with a peak speed of more than 3.2 Gflops and 512 Mbytes of memory.

The full machine consists of 512 nodes with a peak speed of at least 1.6 Tflops with 64 bit arithmetic. The sustained speed is expected to be more than 1 Tflops. A preliminary estimate for the cost of the full machine is $20–25M. This project is a collaboration of the QCD Teraflops Collaboration [15], the MIT Laboratory for Computer Science, Lincoln Laboratory and TMC. Funding for the project began in the fall of 1992 with start-up funds provided by MIT. The proposal for the whole project will be submitted to NSF, DOE and ARPA this fall. The tentative schedule is to build a prototype node in 1994, a prototype system in 1995 and have the full system in operation in 1996.

4.2. CP-PACS project

We started the CP-PACS (Computational Physics by Parallel Array Computer Systems) project last year [6]. The CP-PACS collaboration currently consists of 22 members [16], half of them physicists and the other half computer scientists.

The architecture is MIMD with a 3-dimensional hypercrossbar which will be explained later. The target for the peak speed is currently at least 300 Gflops with 64 bit arithmetic. We are making a proposal for additional funds to increase this peak speed. The memory size is planned to be more than 48 Gbytes.

The processor is based on a Hewlett-Packard PA-RISC processor. This is a super-scalar processor which can perform two operations concurrently. We enhance the processor to support efficient vector calculations. The peak speed of one processor is 200–300 Mflops. The enhancement will be described in detail later. For memory we use synchronous DRAM, pipelined by multi-interleaving banks and a storage controller. The memory bandwidth is one word per machine cycle.

Now let me explain the vector enhancement of the processor. As is well known, the high performance of usual RISC processors like those of Intel, IBM, HP and DEC heavily depends on the existence of cache. However, when the data size exceeds the cache size, the effectiveness of the cache decreases. Figure 5 shows a typical example of the performance of a RISC processor. When the data size exceeds the size of the on-chip level-1 cache, the performance drops by about 50%. Furthermore, when it exceeds the size of the level-2 cache, the performance is of order 15% of the theoretical peak speed. This feature is very common to cache-based RISC processors.

[Figure 5. Performance of a RISC processor in a large scale scientific calculation: relative performance versus vector length, with drops at level-1 and level-2 cache overflow.]

To overcome this difficulty, our strategy is to increase the number of floating-point registers without serious changes in the instruction set architecture. This means upward compatibility. However, this is not straightforward because the register fields of instructions are limited; the number of registers is usually limited to 32. To resolve this problem we introduce slide windows as well as preload and poststore instructions [17]. We also pipeline the memory. Because of these features we are able to hide the long memory access latency and perform vector calculations efficiently.

Figure 6 is a schematic illustration of how slide-windowed registers work. Arithmetic instructions use the registers in the active window, which has 32 registers. The preload instruction can load data into registers of the next (or next-to-next) window and the poststore instruction stores data from registers of the previous window. The pitch of the window slide can be chosen by software. Due to the preload and poststore instructions we can use all of the m (m > 32) physical registers.

[Figure 6. Schematic graph of slide-windowed registers: m physical registers mapped onto logical (slide-windowed) registers through a window offset, with the active window, preload windows and data preload indicated.]
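To make the preload/poststore idea concrete, here is a conceptual sketch of the scheduling. The window size of 32 and the axpy kernel are illustrative choices, not the CP-PACS design, and in the real processor the three activities overlap in time; the sequential loop below only mimics how data rotates through the windows.

    # Conceptual sketch of the slide-window scheme: arithmetic works on the
    # active window while the next window is preloaded and results left in the
    # previous window are poststored.
    WINDOW = 32                                 # registers in the active window

    def axpy_slide_window(a, x, y):
        """Compute a*x + y window by window, with explicit preload/poststore."""
        blocks = [slice(i, min(i + WINDOW, len(x))) for i in range(0, len(x), WINDOW)]
        out = list(y)
        preload = (x[blocks[0]], y[blocks[0]])  # preload the first window
        poststore = None                        # results waiting to be stored
        for k, blk in enumerate(blocks):
            xs, ys = preload                    # operands already "in registers"
            if k + 1 < len(blocks):             # preload the *next* window
                nxt = blocks[k + 1]
                preload = (x[nxt], y[nxt])
            if poststore is not None:           # poststore the *previous* window
                pblk, res = poststore
                out[pblk] = res
            poststore = (blk, [a * xi + yi for xi, yi in zip(xs, ys)])  # compute
        pblk, res = poststore                   # drain the final window
        out[pblk] = res
        return out

    print(axpy_slide_window(2.0, list(range(70)), [1.0] * 70)[:5])
    # -> [1.0, 3.0, 5.0, 7.0, 9.0]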
Figure 7 is a comparison of the performance with and without slide windows for the Livermore Fortran Kernels: <Original> means the performance without slide windows, and <Perfect-Cache> represents a hypothetical case for comparison where the cache size is infinite and the data are all in cache. In the case of <Slide-Window>, the number of slide-windowed floating-point registers is assumed to be 64. Except for #14 of the Livermore Fortran Kernels, the performance with slide windows is almost equal to that of the perfect-cache case and is about 6 times higher than the original one.

[Figure 7. Comparison of performance, in floating operations per cycle (FLOPC), with and without slide windows for Livermore loops #1-#14: <Original>, <Perfect-Cache> and <Slide-Window>.]

Figure 8 shows the efficiency of the performance for the case of multiplication of the Wilson matrix. The dashed line corresponds to the efficiency of code optimized by hand without considering memory bank conflicts. The solid line is the result of a simulation for the realistic case where the effect of memory bank conflicts and the buffer size effect are taken into account. This shows that if the number of registers is larger than 100 the efficiency is more than 75%. We will develop a compiler for the enhanced RISC processor which will produce optimized code for the slide-window architecture.

[Figure 8. Performance for multiplication of the Wilson matrix: efficiency (%) versus number of registers, with and without bank-conflict effects.]

On each processing unit (PU) we place one enhanced PA-RISC processor, local storage (DRAM) and a storage controller (see Fig. 9). NIA stands for Network Interface Adapter and EX for exchanger. On an IO unit (IOU), in addition to the components on a PU, we place an IO bus to which disks are connected through IO adapters.

[Figure 9. Schematic configuration of the Processing Unit (PU) board and the IO Unit (IOU) board of CP-PACS.]

The network is a 3-dimensional hyper crossbar as shown in Fig. 10. It consists of x-direction crossbars as well as y- and z-direction crossbars. This hyper crossbar network is very flexible: from any node to another node data can be transferred through at most three switches. The data transfer is made by message passing with wormhole routing. The latency is expected to be of order of a few micro-seconds. A block-strided transfer is supported. We also have a global synchronization facility in addition to the hyper crossbar network.
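The "at most three switches" property can be made concrete with a small routing sketch: every row of nodes along x, y or z shares a full crossbar, so a message reaches any destination by correcting one coordinate per crossbar. The function below illustrates this idea only; it is not the CP-PACS routing hardware.

    # Routing on a 3-dimensional hyper crossbar: one crossbar traversal per
    # coordinate that differs between source and destination.
    def hypercrossbar_route(src, dst):
        """Return one hop per crossbar traversed on the way from src to dst."""
        hops, cur = [], list(src)
        for axis in range(3):                  # x-, then y-, then z-crossbar
            if cur[axis] != dst[axis]:
                cur[axis] = dst[axis]          # one traversal of that crossbar
                hops.append(tuple(cur))
        return hops

    route = hypercrossbar_route((0, 5, 2), (7, 1, 2))
    print(route, "->", len(route), "switches")
    # -> [(7, 5, 2), (7, 1, 2)] -> 2 switches (never more than three)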
The system configuration of the CP-PACS with distributed disks is depicted in Fig. 10. The disk space is more than 500 Gbytes in total. We use RAID5, which has extra parity bits. In general, when the number of disks is as large as in this case, the MTBF (mean time between failures) becomes of order of one month. With RAID, however, there is no such problem. The number of nodes, not fixed yet, is from 1000 to 1500.

[Figure 10. System configuration of CP-PACS: PUs and IOUs connected by x-, y- and z-crossbar switches, with disks attached to the IOUs.]

The host is a main frame computer with modifications for massive data transfer between the CP-PACS and the external disk storage.

A prototype with the PA-RISC without enhancement, which will be used mainly for tests of the network hardware, will be completed in early 1994, and the full scale machine with the newly developed processor is scheduled to be completed by spring 1996.

The project is being carried out in collaboration with Hitachi Ltd. A new center called the "Center for Computational Physics" was established at the University of Tsukuba for the development of CP-PACS. A new building for the center, where the new machine will be installed, was completed in the summer of 1993. The funding for the development of CP-PACS is about $14M.

4.3. 0.5-Teraflops project

This project started quite recently [7]. The project is a collaboration of theoretical physicists and experimental physicists [18]. The machine consists of 16K nodes making a 4-dimensional torus 16×16×16×4 with a peak speed of 0.8 Tflops with 32 bit arithmetic. It is expected that the sustained speed for QCD is about 0.4 Tflops.

The node architecture is depicted in Fig. 11. The processor is a DSP (Digital Signal Processor) by Texas Instruments. A 32 bit addition and multiplication can be performed concurrently with a 40 ns machine cycle. This leads to 50 Mflops for each node. It executes one word read per machine cycle and one word write per two machine cycles. The DSP has 2K words of memory on chip. The size is small (3.0 cm^2), the power consumption very low (less than 1 Watt) and the price is less than $50.

[Figure 11. Schematic diagram of one node of the 0.5-Teraflops machine.]

Each node has 2 Mbytes of DRAM. The maximum bandwidth between the processor and the memory is 25 Mwords/sec. The memory size is 32 Gbytes in total.

The node gate array (NGA), shown in Fig. 12, is to be newly developed; the design has been partly finished. It plays the roles of memory manager, network switch and specialized cache serving as a buffer. The buffer size is chosen in such a way that multiplications of 3×3 matrices on 3-vectors can be done efficiently.

[Figure 12. Schematic diagram of the NGA (node gate array) for the 0.5-Teraflops machine.]

The 4-dimensional network is connected by eight bi-directional lines of the NGA. Because the data transfer is made by handshaking, the latency is not low. To hide this latency, there is a mode called "store and pass through". In the calculation of the inner product of two vectors, which appears in the conjugate gradient method, the data transfer which takes 70% of the total time without this mode reduces to 28% with this mode. A block-strided transfer is supported.

The mechanical design of a mother board is shown in Fig. 13. On the mother board there are 2×2×4×4 = 64 daughter boards, with the last 4 making a loop. Each node has a SCSI port to which peripheral tape and disk drives are connected. One of the 256 boards of the full machine is connected to the host. The disk space is 48 Gbytes in total. The data transfer from disk to tape or vice versa can be done concurrently with physics calculations.

[Figure 13. Mechanical design of a mother board of the 0.5-Tflops project.]

The power consumption is expected to be about 50 kW, which is very low compared with the other projects. The test board will be completed by summer 1994 and the full machine by summer 1995. The funds for a 128-node machine with a peak speed of 6.4 Gflops are supported by DOE. The proposal for the full machine will be submitted in spring 1994.

4.4. APE1000

This is a successor of APE100 with a peak speed of 1 Tflops with 64 bit arithmetic [8]. The project will start by the end of 1994.

5. Commercial computers

I list the characteristics of the most powerful commercial computers in Table 5 and describe in some detail the two new ones below. For the other computers I refer to the earlier reviews [1].

5.1. VPP500

This is the latest machine from Fujitsu. Each node is a vector processor with the same architecture as the VP400, with a peak speed of 1.6 Gflops. Because of this, it is called a vector-parallel machine by Fujitsu. One node is a multi-chip module which consists of 121 LSIs, a part of which is composed of GaAs. Each node has 128 Kbytes of vector registers and 2 Kbytes of mask registers. The memory size is 256 Mbytes/node. The network is a complete crossbar connecting all nodes, which is very powerful for any application.
