Table Of ContentFPGA-Based Soft Vector Processors
by
Peter Yiannacouras
A thesis submitted in conformity with the requirements
for the degree of Doctor of Philosophy
Graduate Department of Electrical and Computer Engineering
University of Toronto
Copyright (cid:2)c 2009 by Peter Yiannacouras
Abstract
FPGA-Based Soft Vector Processors
Peter Yiannacouras
Doctor of Philosophy
Graduate Department of Electrical and Computer Engineering
University of Toronto
2009
FPGAs are increasingly used to implement embedded digital systems because of their low
time-to-market and low costs compared to integrated circuit design, as well as their superior
performance and area over a general purpose microprocessor. However, the hardware design
necessary to achieve this superior performance and area is very difficult to perform causing
long design times and preventing wide-spread adoption of FPGA technology. The amount of
hardware design can be reduced by employing a microprocessor for less-critical computation in
the system. Often this microprocessor is implemented using the FPGA reprogrammable fabric
as a soft processor which can preserve the benefits of a single-chip FPGA solution without
specializing the device with dedicated hard processors. Current soft processors have simple
architectures that provide performance adequate for only the least-critical computations.
Our goal is to improve soft processors by scaling their performance and expanding their
suitability to more critical computation. To this end we focus on the data parallelism found
in many embedded applications and propose that soft processors be augmented with vector
extensions to exploit this parallelism. We support this proposal through experimentation with
a parameterized soft vector processor called VESPA (Vector Extended Soft Processor Archi-
tecture) which is designed, implemented, and evaluated on real FPGA hardware.
ThescalabilityofVESPAcombinedwithseveralotherarchitecturalparameterscanbeused
to finely span a large design space and derive a custom architecture for exactly matching the
ii
needs of an application. Such customization is a key advantage for soft processors since their
architectures can be easily reconfigured by the end-user. Specifically, customizations can be
made to the pipeline, functional units, and memory system within VESPA. In addition, general
purpose overheads can be automatically eliminated from VESPA.
Comparing VESPA to manual hardware design, we observe a 13x speed advantage for hard-
ware over our fastest VESPA, though this is significantly less than the 500x speed advantage
over scalar soft processors. The performance-per-area of VESPA is also observed to be sig-
nificantly higher than a scalar soft processor suggesting that the addition of vector extensions
makes more efficient use of silicon area for data parallel workloads.
iii
Acknowledgements
I would like to thank my advisors Professor Greg Steffan and Professor
Jonathan Rose for all their guidance throughout the years. Our weekly
meetings were key in directing and executing this research. Also count-
less writing edits and many dry runs helped improve my written and
oral communication. Thank you for all of that, your advice and insights
along the way, and also thank you for the opportunity to teach some
classes.
My committee including Professor Moshovos and Professor Abdelrah-
man, as well as Professor Enright Jerger provided useful suggestions and
commentary for filling out this work.
Many thanks to Professor Christos Kozyrakis for corresponding with
me and sending me the hand-vectorized benchmarks used throughout
this work. I would also like to acknowledge the useful input received
from Vaughn Betz, David Lewis, and James Ball which helped guide our
research direction.
ThroughoutmysixyearsinLP392andinthePaCRaTgroup, itwascer-
tainly a pleasure interacting with all my colleagues, thanks for technical
breadth, good fun, and stimulating discussions.
Thank you to my parents and brothers for raising me, taking care of me,
and looking out for me throughout my life.
To my wife Melinda, thank your for all your love and support, I’m glad
to have had you by my side for every step of the way.
iv
for Meli
Contents
List of Tables xi
List of Figures xiii
1 Introduction 1
1.1 Research Goals . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
1.2 Organization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
2 Background 6
2.1 Microprocessor Background . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
2.2 Vector Processors . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
2.2.1 Vector Instructions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
2.2.2 Vector Architecture . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
2.2.3 Vector Lanes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
2.2.4 Vector Chaining . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
2.2.5 The T0 Vector Processor . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
2.2.6 The VIRAM Vector Processor . . . . . . . . . . . . . . . . . . . . . . . . 12
2.2.7 SIMD Extensions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
2.3 Field-Programmable Gate Arrays (FPGAs) . . . . . . . . . . . . . . . . . . . . . 15
2.3.1 Block RAMs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16
2.3.2 Multiply-Accumulate blocks . . . . . . . . . . . . . . . . . . . . . . . . . . 16
2.3.3 Microprocessor Cores . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
2.4 FPGA Design . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
vi
2.4.1 Behavioural Synthesis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18
2.4.2 Extensible Processors . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20
2.5 Soft Processors and Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . 21
2.5.1 Soft Single-Issue In-Order Pipelines . . . . . . . . . . . . . . . . . . . . . 22
2.5.2 Soft Multi-Issue Pipelines . . . . . . . . . . . . . . . . . . . . . . . . . . . 22
2.5.3 Soft Multi-Threaded Pipelines . . . . . . . . . . . . . . . . . . . . . . . . 23
2.5.4 Soft Multiprocessors . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25
2.5.5 Soft Vector Processors . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25
3 Experimental Framework 27
3.1 Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27
3.2 Benchmarks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28
3.3 Software Compilation Framework . . . . . . . . . . . . . . . . . . . . . . . . . . . 30
3.4 FPGA CAD Software . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30
3.4.1 Measuring Area . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31
3.4.2 Measuring Clock Frequency . . . . . . . . . . . . . . . . . . . . . . . . . . 31
3.5 Hardware Platforms . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32
3.5.1 Transmogrifier-4 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32
3.5.2 Terasic DE3 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33
3.5.3 Measuring Wall Clock Time . . . . . . . . . . . . . . . . . . . . . . . . . . 33
3.6 Measurement Error . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34
3.7 Verification . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35
3.7.1 Instruction Set Simulation . . . . . . . . . . . . . . . . . . . . . . . . . . . 35
3.7.2 Register Transfer Level (RTL) Simulation . . . . . . . . . . . . . . . . . . 36
3.7.3 In-Hardware Debugging . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37
3.8 Advantages of Hardware Execution . . . . . . . . . . . . . . . . . . . . . . . . . . 37
3.9 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38
4 Performance Bottlenecks of Scalar Soft Processors 39
4.1 Integrating Scalar Soft Processors with Off-Chip Memory . . . . . . . . . . . . . 39
vii
4.1.1 Scalar Soft Processor Area Breakdown . . . . . . . . . . . . . . . . . . . . 41
4.1.2 Scalar Soft Processor Memory Latency . . . . . . . . . . . . . . . . . . . . 42
4.2 Scaling Soft Processor Caches . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44
4.3 Soft vs Hard Processor Comparison . . . . . . . . . . . . . . . . . . . . . . . . . . 46
4.4 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49
5 The VESPA Soft Vector Processor 50
5.1 Motivating Soft Vector Processors . . . . . . . . . . . . . . . . . . . . . . . . . . 50
5.2 VESPA Design Goals . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51
5.3 VESPA . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 53
5.3.1 MIPS-Based Scalar Processor . . . . . . . . . . . . . . . . . . . . . . . . . 54
5.3.2 VIRAM-Based Vector Instruction Set . . . . . . . . . . . . . . . . . . . . 55
5.3.3 Vector Memory Architecture . . . . . . . . . . . . . . . . . . . . . . . . . 57
5.3.4 VESPA Pipelines . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 59
5.4 Meeting the Design Goals . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 60
5.4.1 VESPA Flexibility . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 60
5.4.2 VESPA Portability . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 62
5.5 FPGA Influences on VESPA Architecture . . . . . . . . . . . . . . . . . . . . . . 63
5.6 Selecting a Maximum Vector Length (MVL) . . . . . . . . . . . . . . . . . . . . . 64
5.7 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 68
6 Scalability of the VESPA Soft Vector Processor 69
6.1 Initial Scalability (L) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 69
6.1.1 Analyzing the Initial Design . . . . . . . . . . . . . . . . . . . . . . . . . . 71
6.2 Improving the Memory System . . . . . . . . . . . . . . . . . . . . . . . . . . . . 72
6.2.1 Cache Design Trade-Offs (DD and DW) . . . . . . . . . . . . . . . . . . . 72
6.2.2 Impact of Data Prefetching (DPK and DPV) . . . . . . . . . . . . . . . . 77
6.2.3 Reduced Memory Bottleneck . . . . . . . . . . . . . . . . . . . . . . . . . 83
6.2.4 Impact of Instruction Cache (IW and ID) . . . . . . . . . . . . . . . . . . 84
6.3 Decoupling Vector and Control Pipelines . . . . . . . . . . . . . . . . . . . . . . . 85
viii
6.4 Improved VESPA Scalability . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 87
6.4.1 Cycle Performance . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 87
6.4.2 Clock Frequency . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 89
6.4.3 Area . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 90
6.5 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 91
7 Expanding and Exploring the VESPA Design Space 92
7.1 Heterogeneous Lanes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 93
7.1.1 Supporting Heterogeneous Lanes . . . . . . . . . . . . . . . . . . . . . . . 93
7.1.2 Impact of Multiplier Lanes (X) . . . . . . . . . . . . . . . . . . . . . . . . 94
7.1.3 Impact of Memory Crossbar (M) . . . . . . . . . . . . . . . . . . . . . . . 95
7.2 Vector Chaining in VESPA . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 98
7.2.1 Supporting Vector Chaining . . . . . . . . . . . . . . . . . . . . . . . . . . 99
7.2.2 Impact of Vector Chaining . . . . . . . . . . . . . . . . . . . . . . . . . . 101
7.2.3 Vector Lanes and Powers of Two . . . . . . . . . . . . . . . . . . . . . . . 105
7.3 Exploring the VESPA Design Space . . . . . . . . . . . . . . . . . . . . . . . . . 105
7.3.1 Selecting and Pruning the Design Space . . . . . . . . . . . . . . . . . . . 105
7.3.2 Exploring the Pruned Design Space . . . . . . . . . . . . . . . . . . . . . 108
7.3.3 Per-Application Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . 112
7.4 Eliminating Functionality . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 116
7.4.1 Hardware Elimination Opportunities . . . . . . . . . . . . . . . . . . . . . 116
7.4.2 Impact of Vector Datapath Width Reduction (W) . . . . . . . . . . . . . 118
7.4.3 Impact of Instruction Set Subsetting . . . . . . . . . . . . . . . . . . . . . 120
7.4.4 Impact of Combining Width Reduction and Instruction Set Subsetting . . 121
7.5 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 123
8 Soft Vector Processors vs Manual FPGA Hardware Design 125
8.1 Designing Custom Hardware Circuits . . . . . . . . . . . . . . . . . . . . . . . . . 126
8.1.1 System-Level Design Constraints . . . . . . . . . . . . . . . . . . . . . . . 126
8.1.2 Simplifying Hardware Design Optimistically . . . . . . . . . . . . . . . . . 127
ix
8.2 Evaluating Hardware Circuits . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 130
8.2.1 Area Measurement . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 131
8.2.2 Clock Frequency Measurement . . . . . . . . . . . . . . . . . . . . . . . . 131
8.2.3 Cycle Count Measurement . . . . . . . . . . . . . . . . . . . . . . . . . . . 131
8.2.4 Area-Delay Product . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 132
8.3 Implementing Hardware Circuits . . . . . . . . . . . . . . . . . . . . . . . . . . . 132
8.4 Comparing to Hardware . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 133
8.4.1 Software vs Hardware: Area . . . . . . . . . . . . . . . . . . . . . . . . . . 133
8.4.2 Software vs Hardware: Wall Clock Speed . . . . . . . . . . . . . . . . . . 137
8.4.3 Software vs Hardware: Area-Delay . . . . . . . . . . . . . . . . . . . . . . 142
8.5 Effect of Subsetting and Width Reduction . . . . . . . . . . . . . . . . . . . . . . 143
8.6 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 145
9 Conclusions 146
9.1 Contributions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 147
9.2 Future Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 150
A Measured Model Parameters 152
B Raw VESPA Data on DE3 Platform 155
C Instruction Disabling Using Verilog 168
Bibliography 171
x
Description:a parameterized soft vector processor called VESPA (Vector Extended Soft Processor Archi- tecture) which is designed, implemented, and evaluated