Computer Architecture and Design Methodologies
Ayesha Khalid
Goutam Paul
Anupam Chattopadhyay
Domain Specific High-Level Synthesis for Cryptographic Workloads
Computer Architecture and Design Methodologies
Series Editors
Anupam Chattopadhyay, Nanyang Technological University, Singapore, Singapore
Soumitra Kumar Nandy, Indian Institute of Science, Bangalore, India
Jürgen Teich, Friedrich-Alexander-Universität Erlangen-Nürnberg (FAU),
Erlangen, Bayern, Germany
Debdeep Mukhopadhyay, Indian Institute of Technology Kharagpur, Kharagpur,
West Bengal, India
Twilight zone of Moore’s law is affecting computer architecture design like never
before. The strongest impact on computer architecture is perhaps the move from
unicore to multicore architectures, represented by commodity architectures like
general purpose graphics processing units (GPGPUs). Besides that, the deep impact of
application-specific constraints from emerging embedded applications is presenting
designers with new, energy-efficient architectures like heterogeneous multi-core,
accelerator-rich System-on-Chip (SoC). These effects together with the security,
reliability, thermal and manufacturability challenges of nanoscale technologies are
forcing computing platforms to move towards innovative solutions. Finally, the
emergence of technologies beyond conventional charge-based computing has led to
a series of radical new architectures and design methodologies.
The aim of this book series is to capture these diverse, emerging architectural
innovations as well as the corresponding design methodologies. The scope covers
the following.
• Heterogeneous multi-core SoC and their design methodology
• Domain-specific architectures and their design methodology
• Novel technology constraints, such as security, fault-tolerance and their impact
  on architecture design
• Novel technologies, such as resistive memory, and their impact on architecture
  design
• Extremely parallel architectures
More information about this series at http://www.springer.com/series/15213
Ayesha Khalid • Goutam Paul • Anupam Chattopadhyay
Domain Specific High-Level Synthesis for Cryptographic Workloads
Ayesha Khalid
The Institute of Electronics, Communications and Information Technology
Queen’s University Belfast
Belfast, Ireland

Goutam Paul
Cryptology and Security Research Unit
R. C. Bose Centre for Cryptology and Security
Indian Statistical Institute
Kolkata, India

Anupam Chattopadhyay
School of Computer Engineering
Nanyang Technological University
Singapore, Singapore
ISSN 2367-3478 ISSN 2367-3486 (electronic)
Computer Architecture and Design Methodologies
ISBN 978-981-10-1069-9 ISBN 978-981-10-1070-5 (eBook)
https://doi.org/10.1007/978-981-10-1070-5
Library of Congress Control Number: 2019933716
© Springer Nature Singapore Pte Ltd. 2019
This work is subject to copyright. All rights are reserved by the Publisher, whether the whole or part
of the material is concerned, specifically the rights of translation, reprinting, reuse of illustrations,
recitation, broadcasting, reproduction on microfilms or in any other physical way, and transmission
or information storage and retrieval, electronic adaptation, computer software, or by similar or dissimilar
methodology now known or hereafter developed.
The use of general descriptive names, registered names, trademarks, service marks, etc. in this
publication does not imply, even in the absence of a specific statement, that such names are exempt from
the relevant protective laws and regulations and therefore free for general use.
The publisher, the authors and the editors are safe to assume that the advice and information in this
book are believed to be true and accurate at the date of publication. Neither the publisher nor the
authors or the editors give a warranty, expressed or implied, with respect to the material contained
herein or for any errors or omissions that may have been made. The publisher remains neutral with regard
to jurisdictional claims in published maps and institutional affiliations.
This Springer imprint is published by the registered company Springer Nature Singapore Pte Ltd.
The registered company address is: 152 Beach Road, #21-01/04 Gateway East, Singapore 189721,
Singapore
Acknowledgements
Research is hardly a day’s thought. A small germ of an idea takes shape over
numerous sessions of discussion, debate, programming, and deadlines, and finally sees
the light of day, often in a forum of like-minded researchers. That leads to
further synthesis and refinement of ideas. This work is no exception. While looking
into the sudden burst of activity around cryptographic accelerators, we found the
topic interesting and decided to take a shot. A few months later, it was all too
apparent that we were in a loop of reapplying the same design principles to achieve
faster and smaller circuits. We were lucky to have solid design automation
knowledge shared among us, and quickly figured out that this could be automated. The
rest was pure fun. We kept on defining the roadmap, charted the course, and
covered it, to our amazement sometimes matching designs that were carefully
handcrafted. In that sense, this is the story of this work, where we show that design
automation is an essential but oft-forgotten piece of technology. However,
automation is not the only thing that cryptographic accelerator designs need. To
chart the course of what to automate, we found that a whole range of trade-offs,
including programmability versus efficiency, needs to be accounted for in a design. We
report a few interesting design studies in that direction. Finally, cryptographic
accelerator designs sometimes play an important role in attempts to break
security. Design efficiency becomes important there as well. A preliminary design
study for that is reported in this book. Overall, we show that high-level synthesis
can bridge the gap between efficiency and convenience if narrowed down to a
specific application domain. A similar study could be replicated in other application
areas, and has in fact been well covered in the scope of signal processing systems.
Given the importance of security-related applications, the proposed studies
can grow in various directions, as outlined throughout this book. In this journey, we
got help from numerous students, collaborators, and colleagues. It would be hard to
name them all, and very hard to prioritize their contributions. Nevertheless, in the
standard art of acknowledgement, we would like to thank Subhamoy Maitra,
Sourav Sen Gupta, and Rishiraj Bhattacharyya, who ignited several initial ideas
captured in this book. We are thankful to Muhammad Hassan and Khawar Shahzad
for making particularly strong contributions in the programming aspects. We are
thankful to Zoltan Rakossy and Zheng Wang for being at the receiving end of our most
complicated technical arguments, and for patiently providing us with insightful
comments. We are thankful to all the researchers in the early era of high-level synthesis,
who kept the flame burning. We thank our families for bearing with us through the
long working hours that research often demands. It has been an eventful time, the kind that
makes the life of a researcher meaningful. We hope that readers enjoy the journey,
in this book and beyond.
November 2016 Ayesha Khalid
Goutam Paul
Anupam Chattopadhyay
Contents
1 Introduction
1.1 Motivation
1.2 Contributions and Structure of This Thesis
References
2 Background
2.1 High Level Synthesis
2.1.1 Motivation
2.1.2 Generation Through Automation
2.1.3 Steps of High Level Synthesis
2.1.4 HLS: A Brief Retrospection
2.1.5 The Current Generation of HLS
2.2 High Level Synthesis for Cryptographic Workloads
2.3 ASIC Design Flow Setup
2.3.1 The Standard Cell Digital Design Flow
2.3.2 ADL Based Design Flow
2.3.3 Metrics
2.4 Experimental Setup for CPU-GPGPUs Environment
2.5 Concluding Remarks
References
3 Dwarfs of Cryptography
3.1 Berkeley Dwarfs for Parallel Computing
3.2 Cryptology Background
3.2.1 Block Ciphers
3.2.2 Stream Ciphers
3.2.3 Hash Functions
3.3 Block Ciphers: Major Ingredient of Symmetric Key Cryptography
3.3.1 Transformations Under Modes Of Operation
3.3.2 Basic Building Blocks for Symmetric Key Cryptography
3.4 Cipher Algorithmic Configuration Space
3.4.1 Block Ciphers
3.4.2 Stream Ciphers
3.4.3 Hash Functions
3.5 Concluding Remarks
References
4 High Level Synthesis for Symmetric Key Cryptography
4.1 CRYKET (CRYptographic Kernels Toolkit)
4.2 RunFein
4.2.1 Design Specification Compilation
4.2.2 Specification Validation and Formal Model Creation
4.2.3 Software Generation Engine
4.2.4 Hardware Generation Engine
4.2.5 Results and Analysis: Software Efficiency
4.2.6 Results and Analysis: Hardware Efficiency
4.3 RunStream
4.3.1 Design Specification Compilation
4.3.2 Specification Validation and Formal Model Creation
4.3.3 Software Generation Engine
4.3.4 Hardware Generation Engine
4.3.5 Efficiency
4.3.6 Comparison with Manual Implementations
4.4 Concluding Remarks
References
5 Manual Optimizations for Efficient Designs
5.1 Optimization Strategies
5.1.1 Memory Bank Structure Optimizations
5.1.2 Unification of Multiple Cryptographic Proposals
5.2 Memory Bank Structure Optimizations
5.2.1 Reviewing Known Techniques
5.2.2 Optimized Memory Utilization for HC-128
5.2.3 Design Space Exploration of HC-128 Accelerator
5.2.4 State Split Optimizations for HC-128
5.2.5 Performance Evaluation
5.3 Integrated Implementation of Multiple Cryptographic Primitives
5.3.1 Motivation
5.3.2 Previous Work
5.3.3 Contribution: HiPAcc-LTE-Integrated Accelerator for SNOW 3G and ZUC
5.3.4 Structural Comparison
5.3.5 Integrating the Main LFSR
5.3.6 Integrating the FSM
5.3.7 ASIC Implementation of HiPAcc-LTE
5.4 Concluding Remarks
References
6 Study of Flexibility
6.1 Motivation
6.2 Contribution
6.3 CoARX: A Coprocessor for ARX-Based Cryptographic Algorithms
6.3.1 Related Work
6.3.2 Design Space Exploration
6.3.3 Mapping of the ARX Algorithms
6.3.4 Implementation and Benchmarking
6.4 RC4-AccSuite: A Hardware Acceleration Suite for RC4-like Stream Ciphers
6.4.1 RC4 Stream Cipher Algorithm
6.4.2 Variants of RC4
6.4.3 Contribution
6.4.4 High-Level Architecture of RC4-AccSuite
6.4.5 Performance Enhancement by Memory Replication Technique
6.4.6 Resource Economization in RC4-AccSuite
6.4.7 Implementation and Benchmarking
6.5 Concluding Remarks
References
7 Study of Scalability
7.1 Motivation
7.2 Major Contributions
7.3 The Compute Unified Device Architecture (CUDA) Overview
7.3.1 Kernel Execution Model
7.3.2 Memory Model
7.4 Block Ciphers Performance Acceleration on GPUs
7.5 Mapping Salsa20 Stream Cipher on GPUs
7.5.1 Analyzing Parallelism Opportunities of Salsa20
7.5.2 Batch Processing Framework
7.5.3 CUDA Coding Guidelines
7.5.4 Optimization for Salsa20
7.5.5 Autotuning for Throughput Optimization
7.5.6 Results and Analysis