MATRIX COMPUTATIONS ON SYSTOLIC-TYPE ARRAYS

Jaime H. Moreno
Assistant Professor, Departamento de Ingeniería Eléctrica, Universidad de Concepción, Chile

Tomas Lang
Professor, Departament d'Arquitectura de Computadors, Universitat Politècnica de Catalunya, Barcelona, Spain

SPRINGER SCIENCE+BUSINESS MEDIA, LLC

Library of Congress Cataloging-in-Publication Data

Moreno, Jaime H., 1954-
Matrix computations on systolic-type arrays / Jaime H. Moreno, Tomas Lang.
p. cm. -- (The Kluwer international series in engineering and computer science ; SECS 0174)
Includes bibliographical references and index.
ISBN 978-1-4613-6604-1
ISBN 978-1-4615-3610-9 (eBook)
DOI 10.1007/978-1-4615-3610-9
1. Computer algorithms. 2. Systolic array circuits. I. Lang, Tomas. II. Title. III. Series.
QA76.9.A43M67 1992
512.9'434'0285435--dc20  92-9868 CIP

Copyright © 1992 by Springer Science+Business Media New York. Originally published by Kluwer Academic Publishers in 1992. Softcover reprint of the hardcover 1st edition 1992.

All rights reserved. No part of this publication may be reproduced, stored in a retrieval system or transmitted in any form or by any means, mechanical, photocopying, recording, or otherwise, without the prior written permission of the publisher, Springer Science+Business Media, LLC.

Printed on acid-free paper.

To Marisa, who has shared my endeavors with love, understanding and support.
To my parents, who instilled in me the desire to reach further.
Jaime Moreno

Contents

1 Introduction  1
  1.1 Matrix computations, algorithms, parallel architectures  1
  1.2 Summary of the book  8

2 Systolic-type arrays for matrix algorithms  15
  2.1 Realization and mapping of matrix algorithms  16
  2.2 Design space, performance and cost measures  19
  2.3 Architectural models of systolic-type arrays  24
  2.4 Models of computation in systolic-type arrays  27
  2.5 Size relation among problem and array  30
    2.5.1 Partitioning by cut-and-pile  31
    2.5.2 Partitioning by coalescing  32
    2.5.3 Other partitioning strategies  33
    2.5.4 Indirect and direct partitioning  33
  2.6 Tradeoffs in an implementation  36
    2.6.1 Nonpartitioned case  36
    2.6.2 Example of the tradeoffs  39
    2.6.3 Partitioned case  41
    2.6.4 Array topology for partitioned implementations  42
  2.7 Further readings  42

3 Regularization of matrix algorithms  45
  3.1 Stages in a design method  45
  3.2 Regularized representations  47
  3.3 The multimesh graph representation  51
  3.4 Class of admissible algorithms in the MMG method  54
  3.5 Regularization stage in the MMG method  58
    3.5.1 Obtaining the fully parallel graph  59
    3.5.2 Obtaining the multimesh graph  64
  3.6 Formal description of the regularizing transformations  68
    3.6.1 Eliminating data broadcasting  69
    3.6.2 Eliminating bidirectional dependencies  70
    3.6.3 Removing nonnearest-neighbor dependencies  72
  3.7 Deriving the multimesh graph of the triangularization algorithm  73
  3.8 Deriving the multimesh graph of the transitive closure algorithm  76
  3.9 Deriving the multimesh graph of the LU-decomposition algorithm  81
  3.10 Deriving the multimesh graph of the algorithm to compute BA^-1  85
  3.11 Summary  87

4 Realization of algorithm-specific fixed-size arrays  91
  4.1 Realization procedure  92
  4.2 Derivation of G-graphs: Grouping by prisms  94
  4.3 Schedule of nodes in a complete prism  96
  4.4 Prisms in a complete graph  98
  4.5 Direction of prisms  101
  4.6 Complete multimesh graph and the pseudosystolic model of computation  105
    4.6.1 Summary of performance and cost measures  107
  4.7 Cell architecture and control  108
    4.7.1 Functional unit  108
    4.7.2 Internal storage access and organization  109
    4.7.3 Control of the cell  112
    4.7.4 Systolic cells  114
    4.7.5 Nonpipelined cells  114
  4.8 Incomplete graphs and the pseudosystolic model  115
    4.8.1 Incompleteness of the graph  115
    4.8.2 Transmitted data and direction of prisms  117
    4.8.3 Performance of realizations from incomplete graphs  119
  4.9 Multimesh graphs with two flows of input data  122
  4.10 Example: Pseudosystolic arrays for matrix triangularization  124
  4.11 Example: Systolic-type arrays for computing BA^-1  130
  4.12 Summary  134

5 Partitioned realizations using cut-and-pile  135
  5.1 Model of partitioned execution using cut-and-pile  136
  5.2 Partitioning a multimesh graph using cut-and-pile  137
  5.3 Selection of G-sets  140
  5.4 Schedule of G-sets  141
  5.5 G-sets from a complete multimesh graph  142
    5.5.1 Throughput and computation time  142
    5.5.2 Array utilization  145
    5.5.3 External memory  145
    5.5.4 Array input/output bandwidth  146
  5.6 Incomplete MMGs and G-sets  150
    5.6.1 The selection of G-sets  150
    5.6.2 Performance measures  151
  5.7 Summary of performance measures  154
  5.8 Multimesh graphs with two flows of input data  154
  5.9 Cut-and-pile in LU-decomposition  156
    5.9.1 Linear array  158
    5.9.2 Two-dimensional array  159
    5.9.3 Performance measures for systolic arrays  160
    5.9.4 Performance measures for pseudosystolic arrays  164
  5.10 Tradeoffs among array topologies  167
  5.11 A canonical linear array for partitioned problems  169

6 Partitioned realizations using coalescing  171
  6.1 The model of computation  172
  6.2 The model of partitioned execution  173
  6.3 Partitioning the multimesh graph  174
  6.4 Coalescing the multimesh graph  176
  6.5 Schedule of nodes in a partition  178
  6.6 Cell architecture and control  180
  6.7 Coalescing incomplete MMGs  183
  6.8 Example: Local-access arrays for LU-decomposition  187
    6.8.1 Computing LU-decomposition in a single cell  190
    6.8.2 Realization as a linear array  191
    6.8.3 Realization as a two-dimensional array  195

7 Linear pseudosystolic array for matrix algorithms  199
  7.1 Architecture of the array  200
  7.2 Architecture of the cells  202
    7.2.1 The access unit  202
    7.2.2 The processing unit  206
  7.3 Code efficiency  211
  7.4 Executing LU-decomposition  214
    7.4.1 Computing the algorithm in a single cell  216
    7.4.2 Computing the algorithm with K cells  219
  7.5 Summary  221

8 Mapping matrix algorithms  225
  8.1 The regularization stage  226
  8.2 The mapping stage and the specific target architecture  227
  8.3 Example: Mapping onto a memory-linked array  229
    8.3.1 The target architecture  229
    8.3.2 Scheduling and allocation  232
    8.3.3 Computation in a single cell  235
    8.3.4 Computation with K cells  236
  8.4 Example: Mapping onto a digital signal processor  237
    8.4.1 Simplified model of a DSP  238