F. Thomson Leighton

Introduction to Parallel Algorithms and Architectures: Arrays · Trees · Hypercubes

Morgan Kaufmann Publishers, San Mateo, California

Sponsoring Editor: Bruce M. Spatz
Production Editor: Yonie Overton
Cover Designer: Victoria Ann Philp
Copyeditor: Bob Klingensmith

Morgan Kaufmann Publishers, Inc.
Editorial Office: 2929 Campus Drive, Suite 260, San Mateo, CA 94403

© 1992 by Morgan Kaufmann Publishers, Inc. All rights reserved.
Printed in the United States of America.

No part of this publication may be reproduced, stored in a retrieval system, or transmitted in any form or by any means—electronic, mechanical, photocopying, recording, or otherwise—without the prior written permission of the publisher.

94 93 92 91    5 4 3 2 1

Library of Congress Cataloging-in-Publication Data is available for this book.

Preface

This book is designed to serve as an introduction to the exciting and rapidly expanding field of parallel algorithms and architectures. The text is specifically directed towards parallel computation involving the most popular network architectures: arrays, trees, hypercubes, and some closely related networks. The text covers the structure and relationships between the dominant network architectures, as well as the fastest and most efficient parallel algorithms for a wide variety of problems. Throughout, emphasis is placed on fundamental results and techniques and on rigorous analysis of algorithmic performance. Most of the material covered in the text is directly applicable to many of the parallel machines that are now commercially available. Those portions of the text that are of primarily theoretical interest are identified as such and can be passed over without interrupting the flow of the text. The book is targeted for a reader with a general technical background, although some previous familiarity with algorithms or programming will prove helpful when reading the text.
No previous familiarity with parallel algorithms or networks is expected or assumed. Most of the text is written at a level that is suitable for undergraduates. Sections that involve more complicated material are denoted by a • following the section heading. A few highly advanced subsections in the text are denoted with a •• following the subsection heading. These subsections cover material that is meant for advanced researchers, although the introductions to these subsections are written so as to be accessible to all. Readers who wish to understand the more advanced sections of the text, but who find that they lack the necessary mathematical or computer science background, are referred to the text by Cormen, Leiserson, and Rivest [51] for an introduction to algorithms, the text by Graham, Knuth, and Patashnik [84] for an introduction to concrete mathematics (including combinatorics, probability, counting arguments, and asymptotic analysis), and the text by Maurer and Ralston [167] for an elementary introduction to both subjects.

Organization of the Material

The book is organized into three chapters according to network architecture. We begin with the simplest architectures (arrays and trees) in Chapter 1 and advance to more complicated architectures in Chapter 2 (meshes of trees) and Chapter 3 (hypercubes and related networks). Each chapter can be read independently; however, Section 1.1 and Subsection 1.2.2 provide important background material for all three chapters.

Within each chapter, the material is organized according to application domain. Throughout, we start with simple algorithms for simple problems and advance to more complicated algorithms for more complicated problems within each chapter and each section. Commonality between algorithms for the same problem on different networks and different problems on the same network is pointed out and emphasized where appropriate.
Particular emphasis is placed on the most basic paradigms and primitives for parallel algorithm design. These paradigms and primitives (which include prefix computation, divide and conquer, pointer jumping, Fourier transform, matrix multiplication, packet routing, and sorting) arise in all three chapters and provide threads that link the chapters together.

Of course, there are many other ways that one could organize the same material. We have chosen this particular organization for several reasons. First, algorithms designed for different problems on the same network tend to have more in common with each other than do algorithms designed for the same problem on different networks. For example, Chapter 1 contains optimal algorithms for Gaussian elimination and for finding minimum-weight spanning trees on an array. These algorithms have surprisingly similar structures. However, the minimum-weight spanning tree algorithm described in Chapter 1 is quite different from the minimum-weight spanning tree algorithm described in Chapter 2. This is because the optimal algorithm for finding a minimum-weight spanning tree on an array is quite different from the optimal algorithms for this problem on other networks. As a consequence, an organization of the material by network architecture allows for more cohesion than an organization by application domain.

Second, an organization by network architecture facilitates use by readers who are interested in only one particular architecture. For example, if you are programming one of the many array-based parallel machines, then you will want to focus your reading on Chapter 1.

Finally, it is easiest to learn the basic techniques of parallel algorithm design by studying them as they naturally arise in various problem domains.
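To give a flavor of the paradigms named above, here is a minimal sketch (mine, not from the text) of prefix computation by recursive doubling. The function name and the whole-array update style are illustrative assumptions; each list comprehension stands in for one synchronous parallel round, so n values yield all n prefixes in O(log n) rounds.

```python
# Sketch of the prefix-computation paradigm via recursive doubling.
# Each loop iteration mimics one synchronous parallel step in which
# every position i >= 2**k combines in the partial result that ends
# 2**k positions to its left.

def parallel_prefix(values, op=lambda a, b: a + b):
    """Return the prefixes values[0], op(values[0], values[1]), ...
    for any associative operation op."""
    x = list(values)
    n = len(x)
    step = 1
    while step < n:
        # All positions update "at once", as on a synchronous machine.
        x = [op(x[i - step], x[i]) if i >= step else x[i]
             for i in range(n)]
        step *= 2
    return x

print(parallel_prefix([3, 1, 4, 1, 5]))  # [3, 4, 8, 9, 14]
```

Because the update uses only an associative operation, the same skeleton computes prefix minima, prefix products, or the carry chains of an adder by swapping in a different `op`.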
Although the idea of organizing the material around basic techniques may seem appealing at first, such an organization suffers from a serious lack of cohesion caused by the fact that the basic paradigms and primitives arise in widely varying contexts. For example, a chapter on prefix computations would naturally include topics such as carry-lookahead addition, solution of tridiagonal systems of equations, indexing, data distribution, and certain circuit-switching algorithms, but it would likely not include other algorithms for these same problems. As a consequence, many significant educational opportunities would be lost by such an organization.

For the most part, the sections in each chapter are independent of each other, and the table of contents and index have been designed to accommodate readers who want to follow a different path through the book. If you are interested in specific problems (such as graph algorithms or linear algebra), then you can use the text by reading only those sections within each chapter. If you are interested only in the implementations and applications of certain basic techniques (such as prefix computation or matrix multiplication), then you can read the text selectively with the help of the table of contents and the index.

Teaching from the Text

This book is also designed to be used as a text for an introductory (late undergraduate or early graduate) course on parallel algorithms and architectures. Drafts of this material have been successfully used in numerous course settings during the past several years. Typically, a course on this subject will cover a large portion of the introductory material (i.e., the non-starred sections) from all three chapters.
For example, a one-semester course could consist of the material from Sections 1.1-1.5 (possibly excluding Subsections 1.3.3-1.3.5), a sampling of the non-starred material from Sections 1.6-1.8, Subsection 1.9.5 (and possibly 1.9.1 as well), Section 2.1, a sampling from Sections 2.2, 2.4, and 2.5 (possibly excluding Subsection 2.5.5), Sections 3.1-3.3 (excluding Subsections 3.1.4, 3.2.3, and 3.3.4), Section 3.4 (possibly excluding Subsections 3.4.6-3.4.8), and Subsections 3.5.1, 3.6.1, and 3.6.2. Material from Section 3.7 might also be included as time permits.

The book can also be used in courses devoted to specific architectures such as arrays or hypercube-related networks. An array-based course could include Chapter 1 in its entirety. For a course on hypercube-related architectures, it would be helpful to cover the material in Section 1.1 and Subsection 1.2.2 before proceeding to Chapter 3. Since all of the algorithms described in Chapters 1 and 2 can be implemented directly on a hypercube, it might also make sense to include most of the material from Sections 2.1, 2.2, 2.4, and 2.5 (excluding 2.5.5) in such a course. In addition, the material in Subsection 1.9.5 provides a worthwhile perspective for results in Chapter 3; Theorem 1.21 in Subsection 1.9.1 is used for proving lower bounds on the bisection width of the networks in Chapter 3; Theorem 1.16 in Subsection 1.7.5 is used in the proof of Theorem 3.12 in Subsection 3.2.2; and Corollary 1.19 from Subsection 1.7.5 is used to show that the hypercubic networks are universal in Subsections 3.2.2 and 3.3.3.

Finally, the text can be used as a supplement for courses on related subjects such as VLSI, graph theory, computer architecture, and algorithms.
Lecture notes and problem sets for the courses on this material that are taught at MIT can be purchased from the MIT Laboratory for Computer Science by sending a request for MIT/LCS/RSS10 (which is the most recent version of the notes available at the time of this printing) to

Publications Office
Laboratory for Computer Science
545 Technology Square
Cambridge, MA 02139.

Examples of the curricula based on this text that are used at other universities will be made available by Morgan Kaufmann Publishers.

Exercises and Bibliographic Notes

Particular emphasis has been placed on the selection and formulation of the more than 750 exercises that appear in the problem sections located near the end of each chapter. Many of these exercises have been tested in a wide variety of settings and have been solved by students with widely varying backgrounds and abilities.

The problems are divided into several categories. Problems without an asterisk are the easiest and should be solvable by the average reader within 5-50 minutes after reading the appropriate section of the text. Problems with a single asterisk (*) are harder and will take more advanced readers 10-100 minutes to solve, on average. Problems with two asterisks (**) are very challenging and can require several days of effort from the best students. Many of the harder problems introduce new material that is the subject of current research.

Problems marked with an R are research problems. Some of these problems are probably easy and some could be very hard. (Some might even have been solved already without my being aware of the fact.) Problems marked with an R* are more likely to be very challenging since they have been studied by several researchers. (Some of the problems marked with an R have not been studied by anyone, as far as I know.)

Unfortunately, 750+ problems can be overwhelming for the instructor who wants to select a few for homework or for the reader seeking content reinforcement.
Hence, I have emphasized the 250 or so most worthwhile problems by printing the problem numbers in boldface. As a consequence, there will be about one boldface problem for every three pages of reading.

All citations of results described in the text and all pointers to outside references are contained in the bibliographic notes at the end of each chapter. These notes are meant to be helpful but not exhaustive. The citations are included at the end of each chapter so that the reader can concentrate on understanding the technical material without getting bogged down in the sometimes messy business of assigning credit, and so that the reader can quickly locate pointers to references without having to wade through the technical material.

Errors

Despite the best efforts of many people, it is likely that the text contains numerous errors. If you find any, then please let me know. I can be reached by electronic mail at [email protected] or by sending hardcopy mail to MIT. A list of known errors will be compiled and made available by Morgan Kaufmann Publishers. These errors will be corrected in subsequent printings of the book.

Preview of Volume II

Readers who find this book useful may be interested to know that a related text is currently being developed. The second text will be titled Introduction to Parallel Algorithms and Architectures: Expanders · PRAMs · VLSI (referred to as Volume II herein) and will be coauthored by Bruce Maggs. We are currently projecting that Volume II will consist of five chapters numbered four through eight. The contents of these chapters are briefly described in what follows.

Chapter 4 will describe the expander family of networks, including the multibutterfly, the multi-Beneš network, and the AKS sorting circuit.
Although expander-based networks are not currently used in the design of parallel machines, recent work suggests that some of these networks may become important components in future high-performance architectures.

Chapter 5 is devoted to abstract models of parallelism such as the parallel random access machine (PRAM). The PRAM model unburdens the parallel algorithm designer from having to worry about wiring and memory organization issues, thereby allowing him or her to focus on abstract parallelism. We will describe a wide variety of PRAM algorithms in Chapter 5, and the chapter will be organized so that theoretically inclined readers can start there instead of in Chapter 1. We will then continue in Chapter 6 with a discussion of lower bound techniques and P-completeness.

In Chapter 7, we will return to more practical matters and discuss issues relating to the fabrication of large-scale parallel machines. Particular attention will be devoted to very large scale integration (VLSI) computation and design. Among other things, we will see in Chapter 7 why hypercubes are more costly to build than arrays and why area-universal networks such as the mesh of trees are particularly cost-effective.

We will conclude in Chapter 8 with a collection of important topics. Included will be a survey of state-of-the-art parallel computers, an introduction to parallel programming (with examples from the Connection Machine), a discussion of issues relating to fault tolerance, and a discussion of bus-based architectures.

We have already begun writing Volume II and we hope to have it completed and available from Morgan Kaufmann within the next few years. Much of the material in Volume II is covered in the lecture notes for the courses taught at MIT (e.g., MIT/LCS/RSS10) that were mentioned earlier.
In addition, some of this material can also be found in the following sources: the paper by Arora, Leighton, and Maggs [15] (for information on expanders, multibutterflies, and nonblocking networks), the papers by Ajtai, Komlós, and Szemerédi [5] and Paterson [194] (for information on the AKS sorting circuit), the survey paper by Karp and Ramachandran [113] and the text by Gibbons and Rytter [81] (for information on PRAM algorithms), the texts by Mead and Conway [168], Lengauer [155], Ullman [247], and Glasser and Dobberpuhl [82] (for more information on VLSI computation and design), and the text by Almasi and Gottlieb [7] (for more information on parallel programming and state-of-the-art parallel machines).

Acknowledgments

Many people have contributed substantially to the creation of this text. On the technical side, I am most indebted to Bruce Maggs and Charles Leiserson. Bruce spent countless hours reading drafts of the text and is directly responsible for improving the quality of the manuscript. In addition to catching some nasty bugs and suggesting simpler explanations for several results, Bruce also helped provide motivation for completing the text by commencing work on Volume II.

Charles also contributed substantially to the text, although in different ways. Charles and I have been co-teaching courses on parallel algorithms and architectures for nearly 10 years, and I have learned a great deal from him during this time. Many of the explanations and exercises presented in the text are due to Charles or were improved as a result of his influence.

Of course, many other people provided technical assistance with this work. I am particularly thankful to Al Borodin, Robert Fowler, Richard Karp, Arnold Rosenberg, Clark Thomborson, Les Valiant, and Vijay Vazirani for reviewing early drafts of the text and to Richard Anderson, Mikhail Atallah, and Franco Preparata for their thorough reviews of later drafts.
In addition, I would like to thank Bill Aiello, Bobby Blumofe, Lenore Cowen, Mike Ernst, Jose Fernandez, Nabil Kahale, Mike Klugerman, Manfred Kunde, Yuan Ma, Greg Plaxton, Eric Schwabe, Nick Trefethen, Jacob White, and David Williamson for reading sections of the text and for providing numerous helpful comments.

Special recognition also goes to Jon Buss, Ron Greenberg, Mark Hansen, Nabil Kahale, Joe Kilian, Mike Klugerman, Dina Kravets, Bruce Maggs, Marios Papaefthymiou, Serge Plotkin, Eric Schwabe, Peter Shor, and Joel Wein for their help as teaching assistants for this material during the past decade.

I would also like to thank the following people for numerous helpful discussions, suggestions, and pointers: Anant Agarwal, Alok Aggarwal, Sanjeev Arora, Arvind, Paul Beame, Bonnie Berger, Sandeep Bhatt, Gianfranco Bilardi, Fan Chung, Richard Cole, Bob Cypher, Bill Dally, Persi Diaconis, Shimon Even, Greg Frederickson, Ron Graham, David Greenberg, Torben Hagerup, Susanne Hambrusch, Johan Håstad, Dan Kleitman, Tom Knight, Richard Koch, Rao Kosaraju, Danny Krizanc, Clyde Kruskal, H. T. Kung, Thomas Lengauer, Fillia Makedon, Gary Miller, Mark Newman, Victor Pan, Michael Rabin, Abhiram Ranade, Satish Rao, John Reif, Sartaj Sahni, Jorge Sanz, Chuck Seitz, Adi Shamir, Alan Siegel, Burton Smith, Marc Snir, Larry Snyder, Quentin Stout, Hal Sudborough, Bob Tarjan, Thanasis Tsantilas, Eli Upfal, Uzi Vishkin, and David Wilson.

On the production side, I am most indebted to Martha Adams, Jose Fernandez, David Jones, and Tim Wright. Martha converted the text from handwritten scribbles to TeX and then from TeX to LaTeX. This was a difficult and (at times) frustrating task that spanned many years. Jose converted my crude sketches into the clear and artistic figures that appear in the text. Jose's unusual ability to express complicated technical material in easy-to-understand figures has substantially enhanced the quality of the text.
David entered tens of thousands of revisions into the text during countless late nights at LCS, and he performed the formatting for the final text. Text setting and figure placement in a text such as this is a tricky, time-consuming, and frustrating business, and I am very grateful to David for doing such a splendid job. Tim helped with many aspects of the preparation of the final manuscript, including revisions, hunting down