Table Of ContentAssessment of Cache Coherence Protocols in Shared-memory
Multiprocessors
by
Alexander Grbic
A thesis submitted in conformity with the requirements
for the degree of Doctor of Philosophy
Graduate Department of Electrical and Computer Engineering
University of Toronto
Copyright c 2003 by Alexander Grbic
(cid:176)
Abstract
Assessment of Cache Coherence Protocols in Shared-memory Multiprocessors
Alexander Grbic
Doctor of Philosophy
Graduate Department of Electrical and Computer Engineering
University of Toronto
2003
The cache coherence protocol plays an important role in the performance of a distributed
shared-memory (DSM) multiprocessor. A variety of cache coherence protocols exist and difier
mainly in the scope of the sites that are updated by a write operation. These protocols can
be complex and their impact on the performance of a multiprocessor system is often di–cult
to assess. To obtain good performance, both architects and users must understand processor
communication, data locality, the properties of the interconnection network, and the nature of
the coherence protocols. Analyzing the processor data sharing behavior and determining its
efiect on cache coherence communication tra–c is the flrst step to a better understanding of
overall performance. Toward this goal, this dissertation provides a framework for evaluating
the coherence communication tra–c of difierent protocols and considers using more than one
protocol in a DSM multiprocessor.
The framework consists of a data access characterization and the application of assessment
rules. Its usefulness is demonstrated through an investigation into the performance of difierent
cache coherence protocols for a variety of systems and parameters. It is shown to be efiective
for determining the relative performance of protocols and the efiect of changes in system and
application parameters. The investigation also shows that no single protocol is best suited for
all communication patterns. Consequently, the dissertation also considers using more than one
cache coherence protocol in a DSM multiprocessor. The results show that the hybrid protocol
can signiflcantly reduce tra–c in all levels of the interconnection network with little efiect on
execution time.
ii
Acknowledgements
Iwouldliketothankmysupervisors,ProfessorsZvonkoVranesicandSinisaSrbljic,fortheir
suggestions, guidance and support throughout my thesis. Without their knowledge, experience
and time this work would not have been possible. I am grateful for their continued faith in me
in spite of my decisions to take on new challenges and responsibilities. In addition, I wish to
acknowledge useful discussions with Professor Michael Stumm and thank him for his help.
I cannot say enough to thank my wife Gordana and daughter Lidia for their love, patience
and understanding. Gordana, you gave me the support I needed to keep going, even when it
looked like there was no end in sight to my graduate work. Lidia, the moment you arrived you
brightened up my life, provided me with inspiration and taught me about the important things.
To both of you, my love.
I would like to thank my parents, brother and sister for their support, sacriflces and their
love. Tony and Vanda, thanks for being there for Gordana, Lidia and me whenever we needed
you. Tony, your dedication to research has motivated me in more ways than just making me
realize that you could flnish before me.
I must also thank my friends for the continued friendships. Even though I’ve gone largely
into seclusion in the last while, you’ve kept in touch and always made me feel welcome. I
express my thanks to the old Computer Group crowd and to the people at work for the friendly
and frequent reminders of my unflnished business.
IgratefullyacknowledgetheflnancialassistanceprovidedtomethroughOGSSTandNSERC
Scholarships as well as a UofT Open Fellowship.
iii
Contents
1 Introduction 1
1.1 Motivation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2
1.2 Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
2 Background 5
2.1 Cache Coherence . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
2.1.1 Type of Protocols . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
2.1.2 Implementing Protocols . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
2.2 Directory Protocols . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
2.3 Implementations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
2.4 Understanding Protocol Performance . . . . . . . . . . . . . . . . . . . . . . . . . 16
2.5 Hybrid Protocols . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
2.5.1 On-line Decision Function . . . . . . . . . . . . . . . . . . . . . . . . . . . 18
2.5.2 Ofi-line Decision Function . . . . . . . . . . . . . . . . . . . . . . . . . . . 20
2.6 The NUMAchine Multiprocessor - Evolution . . . . . . . . . . . . . . . . . . . . . 22
2.7 Memory Consistency models. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24
2.8 Remarks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25
3 The NUMAchine Cache Coherence Protocol 27
3.1 The NUMAchine Multiprocessor . . . . . . . . . . . . . . . . . . . . . . . . . . . 27
3.1.1 Architecture . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28
3.1.2 Interconnection Network . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29
iv
3.1.3 Communication Scheme . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30
3.1.4 Organization of the Network Cache . . . . . . . . . . . . . . . . . . . . . . 31
3.2 Protocol Features . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32
3.2.1 Processor Behavior . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32
3.2.2 Protocol Hierarchy . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33
3.2.3 Invalidations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33
3.2.4 Request and Response Forwarding . . . . . . . . . . . . . . . . . . . . . . 35
3.2.5 Negative Acknowledgments . . . . . . . . . . . . . . . . . . . . . . . . . . 36
3.2.6 Efiect of Network Cache Organization . . . . . . . . . . . . . . . . . . . . 36
3.3 Protocol Implementation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38
3.3.1 Directory Structure . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38
3.3.2 Protocol States . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39
3.4 Basic Operations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41
3.4.1 Local Write . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41
3.4.2 Local Read . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42
3.4.3 Remote Read . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43
3.4.4 Remote Write . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44
3.4.5 Remote Write-Backs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45
3.5 Preserving the Memory Consistency Model . . . . . . . . . . . . . . . . . . . . . 45
3.6 Remarks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46
4 Experimental Environment 48
4.1 Simulation Environment . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48
4.1.1 Mintsim Simulator . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48
4.1.2 Architectural Parameters . . . . . . . . . . . . . . . . . . . . . . . . . . . 49
4.2 Benchmarks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51
4.2.1 Description of Benchmarks . . . . . . . . . . . . . . . . . . . . . . . . . . 51
4.2.2 Rationale for Choices of Benchmarks . . . . . . . . . . . . . . . . . . . . . 52
4.3 Remarks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 53
v
5 Sharing Patterns and Tra–c 54
5.1 Data Access Characterization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 55
5.1.1 Data Access Patterns . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 56
5.1.2 Obtaining the Data Access Characterization. . . . . . . . . . . . . . . . . 57
5.2 Understanding Cache Coherence Protocols . . . . . . . . . . . . . . . . . . . . . . 58
5.2.1 Description of Protocols . . . . . . . . . . . . . . . . . . . . . . . . . . . . 59
5.2.2 Assumptions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 60
5.2.3 Assessment Rules . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 61
5.3 Choice of Characterization Interval . . . . . . . . . . . . . . . . . . . . . . . . . . 63
5.4 Conflrmation of Rule 3 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 65
5.4.1 Choosing Parameters. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 66
5.4.2 Comparison . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 67
5.5 Extending the Framework . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 71
5.6 Remarks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 73
6 Evaluation of Protocol Performance 74
6.1 The Update Protocol . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 75
6.1.1 The Update Protocol in a Distributed System . . . . . . . . . . . . . . . . 77
6.2 The Write-through Protocol . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 78
6.3 Uncached Operations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 79
6.4 Protocol Communication Costs . . . . . . . . . . . . . . . . . . . . . . . . . . . . 80
6.5 Study Considerations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 81
6.5.1 Applications . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 81
6.5.2 Page Placement . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 82
6.5.3 Interval Sizes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 83
6.6 Data Access Characterization of Benchmarks . . . . . . . . . . . . . . . . . . . . 83
6.7 Relative Performance of Difierent Protocols . . . . . . . . . . . . . . . . . . . . . 85
6.7.1 Applying the Assessment Rules . . . . . . . . . . . . . . . . . . . . . . . . 85
6.7.2 Verifying the Assessment Rules . . . . . . . . . . . . . . . . . . . . . . . . 89
vi
6.8 Explanation of Application Behavior . . . . . . . . . . . . . . . . . . . . . . . . . 91
6.9 Remarks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 98
7 Hybrid Cache Coherence Protocol 99
7.1 General Description . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 100
7.2 Processor Support . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 103
7.2.1 Base Support . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 104
7.2.2 Dirty Shared State Support . . . . . . . . . . . . . . . . . . . . . . . . . . 105
7.3 Directory Support . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 106
7.3.1 States . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 107
7.3.2 Commands . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 107
7.4 Transitions Between Protocols. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 108
7.4.1 Dealing with Additional States in the Update Protocol . . . . . . . . . . . 109
7.4.2 Network Cache Transitions . . . . . . . . . . . . . . . . . . . . . . . . . . 112
7.4.3 Cache Blocks in Transition . . . . . . . . . . . . . . . . . . . . . . . . . . 113
7.4.4 Transitions Between Protocols in the Processor Cache . . . . . . . . . . . 113
7.5 Experimental Methodology . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 114
7.5.1 Simulation Issues . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 115
7.5.2 Applications . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 116
7.5.3 Decision Function . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 117
7.6 Hybrid Protocol Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 118
7.7 Wrong Protocols for Intervals . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 127
7.8 Decision Functions and Hybrid Protocol Execution Time . . . . . . . . . . . . . . 129
7.8.1 Only the Tra–c-Based Decision Function Changes to Update (t2u) . . . . 132
7.8.2 Only the Tra–c-based Decision Function Changes to Invalidate (t2i) . . . 135
7.8.3 Only the Latency-based Decision Function Changes to Update (l2u) . . . 136
7.8.4 Only the Latency-based Decision Function Changes to Invalidate (l2i) . . 137
7.8.5 General Comments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 139
7.9 Latency-based Decision Function . . . . . . . . . . . . . . . . . . . . . . . . . . . 141
vii
7.10 Remarks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 146
8 Conclusion 147
8.1 Contributions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 148
8.2 Future Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 149
A NUMAchine Cache Coherence Protocol - Invalidate 151
A.1 Local System Events . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 151
A.2 Remote System Events . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 156
A.3 Special Cases . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 161
A.3.1 Negative Acknowledgments . . . . . . . . . . . . . . . . . . . . . . . . . . 161
A.3.2 Exclusive Reads and Upgrades . . . . . . . . . . . . . . . . . . . . . . . . 163
A.3.3 Non-inclusion of Network Cache, NOTIN Cases . . . . . . . . . . . . . . . 165
B System Events 168
Bibliography 174
viii
List of Tables
2.1 Experimental and commercial multiprocessor architectures. . . . . . . . . . . . . 13
2.2 Cache coherence in experimental and commercial multiprocessors. . . . . . . . . 15
3.1 States in memory and network cache directories. . . . . . . . . . . . . . . . . . . 40
4.1 Simulation parameters. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49
4.2 Access latencies. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 50
5.1 Values of parameters. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 67
6.1 Communicationcostsinnumbersofpacketsforinvalidate,update,write-through
and uncached operations. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 80
6.2 System data access characterization and percentage of writes. . . . . . . . . . . . 86
6.3 Data access characterization for the central ring. . . . . . . . . . . . . . . . . . . 87
6.4 Average number of packets per access for difierent cache coherence protocols on
a 4-processor system. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 90
6.5 Average number of packets per access for difierent cache coherence protocols on
a 64-processor system central ring. . . . . . . . . . . . . . . . . . . . . . . . . . . 91
7.1 Parallel e–ciency for SPLASH2 applications used in the hybrid protocol study. . 116
7.2 Examples of NUMAchine system event costs in terms of number of packets for
the invalidate and update protocols. . . . . . . . . . . . . . . . . . . . . . . . . . 117
7.3 Frequency of using incorrect protocols given in numbers of intervals. . . . . . . . 128
ix
7.4 Disagreements between the tra–c-based and latency-based decision functions
given in numbers of intervals. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 131
7.5 MRSW example for the case where only the tra–c decision function changes to
update (t2u). . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 132
7.6 SRMW example for the case where only the tra–c decision function changes to
update (t2u). . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 133
7.7 MRMW example for the case where only the tra–c decision function changes to
update (t2u). . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 134
7.8 MW example for the case where only the tra–c decision function changes to
update (t2u). . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 135
7.9 MRSW example for the case where only the tra–c decision function changes to
invalidate (t2i). . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 136
7.10 MRSW example for the case where only the latency decision function changes
to update (l2u). . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 137
7.11 MRMW example for the case where only the latency decision function changes
to update (l2u). . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 137
7.12 MRSW example for the case where only the latency decision function changes
to invalidate (l2i). . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 138
7.13 MRMW example for the case where only the latency decision function changes
to invalidate (l2i). . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 138
7.14 MW example for the case where only the latency decision function changes to
invalidate (l2i). . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 139
A.1 System events for local requests. . . . . . . . . . . . . . . . . . . . . . . . . . . . 152
A.2 System events for remote requests. . . . . . . . . . . . . . . . . . . . . . . . . . . 155
B.1 System event descriptions. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 170
B.2 System event details. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 171
B.3 Tra–c and latency costs for system events. . . . . . . . . . . . . . . . . . . . . . 172
B.4 System parameters that afiect tra–c. . . . . . . . . . . . . . . . . . . . . . . . . . 172
x
Description:then it can create cache pollution and increase traffic and contention of network resources. A number of dynamic hybrid cache coherence protocols have been proposed and imple- mented. They differ mainly in the implementation of the decision function and the amount of hardware support provided for