Delft University of Technology
Electrical Engineering, Mathematics, and Computer Science
Computer Engineering
Feeding High-Bandwidth Streaming-Based FPGA Accelerators
Thesis by:
Yvo Thomas Bernard Mulder
Advisor:
Prof. Dr. H.P. Hofstee
Committee:
Chair:
Prof. Dr. H.P. Hofstee
Members:
Dr. Ir. Z. Al-Ars
Dr. Ir. R. van Leuken
Feeding High-Bandwidth Streaming-Based FPGA Accelerators
Thesis
submitted in partial fulfillment of the requirements for the degree of
Master of Science
in
Computer Engineering
by
Yvo Thomas Bernard Mulder
born in Utrecht, The Netherlands
to be defended publicly on January 29, 2018 at 15:00.
CE-MS-2018-05
ISBN: 978-94-6186-886-2
Computer Engineering
Faculty of Electrical Engineering, Mathematics, and Computer Science
Delft University of Technology
Mekelweg 4,
2628 CD, Delft
The Netherlands
Abstract
A new class of accelerator interfaces has significant implications for system architecture. An
order of magnitude more bandwidth forces us to reconsider FPGA design. OpenCAPI is a
new interconnect standard that enables attaching FPGAs coherently to a high-bandwidth, low-
latency interface. Keeping up with this bandwidth poses new challenges for the design of
accelerators and for the logic that feeds them.
This thesis was conducted as part of a group project, in which three other master students investigate
database operator accelerators. This thesis focuses on the logic that feeds the accelerators, by
designing a reconfigurable multi-stream buffer architecture. Generalizing across multiple
common streaming-like accelerator access patterns calls for an interface consisting of multiple read
ports with a smaller-than-cache-line granularity. At the same time, each read port is
allowed to request any stream, including reads that cross a cache line boundary.
The proposed architecture exploits the different memory primitives available on the latest genera-
tion of Xilinx FPGAs. By combining a traditional multi-read-port approach for data duplication
with a second level of buffering, a hierarchy typically found in caches, an architecture is pro-
posed that can supply data from 64 streams to eight read ports without any access-pattern
restrictions.
A correct-by-construction design methodology was used to simplify the validation of the design
and to speed up the implementation phase. The design methodology is documented and examples
are provided for ease of adoption. Using this methodology, the proposed architecture has been
implemented and is accompanied by a validation framework.
Various configurations of the multi-stream buffer have been tested. Configurations of up to 64
streams with four read ports meet timing with an AFU request-to-response latency of five
cycles. The largest configuration, with 64 streams and eight read ports, fails timing. The limiting
factor is the inherent architecture of FPGAs, where memories are physically located in specific
columns. This makes extracting data complex, especially at the target frequencies of 200 MHz
and 400 MHz. Wires are scattered across the FPGA and wire delay becomes dominant.
FPGA design at increasing bandwidths requires new design approaches. Synthesis results are
no guarantee for the implemented design and, depending on the design size, can suggest a
very optimistic operating frequency. Therefore, designing accelerators that keep up with an order
of magnitude more bandwidth than the current state of the art is complex, and requires
carefully thought-out accelerator cores combined with an interface capable of feeding them.
Preface
This thesis report marks the end of this project, on which I have worked for a year. The year
started at the IBM Austin Research Lab, after Peter Hofstee invited me to work with him on
emerging coherently attached FPGA accelerators. During my six months in Austin, Peter was
always ready to help and chat, and also in the period after Austin, Peter was always available.
Peter, it has been a very pleasant journey and I sincerely hope our paths cross again. I can
safely say that you not only have been a tremendous supervisor, but also a good friend.
Because the project is in the field of FPGAs, I had many interesting discussions with Andrew
Martin. Andrew is a research staff member and developed a ready-valid design methodology
for FPGA design. Andy, I would like to thank you for your support during the design and
implementation phase. Your methodology is the missing link for FPGA design.
Dorus, Eric, Jeremy, and Jinho, the three o’clock stretch session must live on, but in different
countries and time zones. I enjoyed my time in Austin very much and that is thanks to you.
I would like to thank the Universiteitsfonds for providing funding for the six months I spent in
Austin. Without this support, it would have been difficult to have had this experience.
I would like to thank my parents for their life-long support and faith in me.
Finally, I would like to thank Fjóla for always being there for me when I needed her the most
and making me a better person every day.
Contents
List of Figures 11
List of Tables 13
Listings 13
Revision Log 15
1 Introduction 17
1.1 Thesis Aim . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18
1.2 Thesis Contributions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18
1.3 Thesis Organization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19
2 Technology Trends 21
2.1 Acceleration in the Data Center . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22
2.1.1 Dennard Scaling . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22
2.1.2 Homogeneous Multi-Core Systems . . . . . . . . . . . . . . . . . . . . . . 22
2.1.3 Heterogeneous Multi-Core Systems . . . . . . . . . . . . . . . . . . . . . . 23
2.1.4 Application Specific Acceleration . . . . . . . . . . . . . . . . . . . . . . . 23
2.1.5 FPGA Adoption in the Data Center . . . . . . . . . . . . . . . . . . . . . 24
2.2 Interconnect Trends . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25
2.2.1 Attached Devices Push Bandwidth Requirements . . . . . . . . . . . . . . 25
2.2.2 Bandwidth Trends at Device-Level . . . . . . . . . . . . . . . . . . . . . . 25
2.2.3 Bandwidth Trends at System-Level . . . . . . . . . . . . . . . . . . . . . . 26
2.3 Current Interconnect Bottlenecks . . . . . . . . . . . . . . . . . . . . . . . . . . . 28
2.3.1 Traditional IO Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28
2.3.2 Communication and Synchronization Overhead . . . . . . . . . . . . . . . 29
2.3.3 Host Memory Access Congestion . . . . . . . . . . . . . . . . . . . . . . . 29
2.4 Interconnect Coherency and Shared Memory: A Necessity . . . . . . . . . . . . . 30
2.4.1 Coherent IO Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30
2.4.2 System-Wide Shared Memory Address Space . . . . . . . . . . . . . . . . 31
2.4.3 System-Wide Coherence . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32
2.4.4 Thread Synchronization . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33
2.5 Preliminary Concluding Remarks . . . . . . . . . . . . . . . . . . . . . . . . . . . 33
3 State-of-the-Art Interconnects 35
3.1 PCI Express . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35
3.1.1 Architecture . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35
3.1.2 PCI Express Gen 3 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37
3.1.3 PCI Express Gen 4 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37
3.1.4 PCI Express Gen 5 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37
3.2 CAPI . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38
3.2.1 Architecture . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38
3.2.2 CAPI 1.0 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39
3.2.3 CAPI 2.0 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40
3.3 OpenCAPI . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40
3.3.1 Architecture . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41
3.3.2 OpenCAPI 3.0 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41
3.3.3 OpenCAPI 4.0 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41
3.4 CCIX . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42
3.4.1 Architecture . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42
3.5 AMBA AXI . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43
3.5.1 Architecture . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43
3.5.2 Handshake Protocol . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43
3.5.3 AXI Protocol Derivatives . . . . . . . . . . . . . . . . . . . . . . . . . . . 43
3.5.4 AXI Coherence Extension . . . . . . . . . . . . . . . . . . . . . . . . . . . 43
3.6 Interconnect Comparison . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44
3.6.1 Bandwidth and Latency . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44
3.6.2 Address Space . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44
3.6.3 Coherence . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44
3.6.4 Synchronization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44
3.7 Preliminary Concluding Remarks . . . . . . . . . . . . . . . . . . . . . . . . . . . 45
4 OpenCAPI Characterization 47
4.1 POWER9 System Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47
4.2 OpenCAPI Architecture . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48
4.2.1 Protocol Stack . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48
4.2.2 Data Link Layer Frame Format . . . . . . . . . . . . . . . . . . . . . . . . 49
4.2.3 Transaction Layer Packets . . . . . . . . . . . . . . . . . . . . . . . . . . . 50
4.2.4 Coherent Accelerator Processor Proxy . . . . . . . . . . . . . . . . . . . . 52
4.2.5 OpenCAPI Attached Device . . . . . . . . . . . . . . . . . . . . . . . . . . 52
4.2.6 Address Spaces and Translation . . . . . . . . . . . . . . . . . . . . . . . . 54
4.3 Coherent Programming Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . 55
4.3.1 Coherent Shared Virtual Memory . . . . . . . . . . . . . . . . . . . . . . . 55
4.3.2 Accelerator Paradigms . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 56
4.4 FPGA Characterization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 57
4.4.1 FPGA Architecture . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 57
4.4.2 Typical Resources . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 58
4.4.3 Configurable Logic Blocks . . . . . . . . . . . . . . . . . . . . . . . . . . . 59
4.4.4 Memory Resources . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 59
4.4.5 DLX and TLX Reference Design . . . . . . . . . . . . . . . . . . . . . . . 61
4.5 Preliminary Concluding Remarks . . . . . . . . . . . . . . . . . . . . . . . . . . . 61
5 Requirements and Naive Designs 63
5.1 Accelerator Classification . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 63
5.2 Merge-Sort Accelerator Case Study . . . . . . . . . . . . . . . . . . . . . . . . . . 64
5.2.1 Naive Buffer Design . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 65
5.2.2 Crossing the Cache Line Boundary . . . . . . . . . . . . . . . . . . . . . . 66
5.3 Design Requirements . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 67
5.4 Naive Design Exploration . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 68