PREFACE
This book aims at giving an impression of the way current research in algorithms, architectures and compilation for parallel systems is evolving. It focuses especially on
domains where embedded systems are required, oriented either to application-specific or to
programmable realisations. These are crucial in domains such as audio, telecom, instrumentation, speech, robotics, medical and automotive processing, image and video processing,
TV, multimedia, radar and sonar. The domain of scientific and numerical computing is
also covered.
The material in the book is based on the author contributions presented at the 3rd International Workshop on Algorithms and Parallel VLSI Architectures, held in Leuven, August 29-31, 1994. This workshop was partly sponsored by EURASIP and the Belgian
NFWO (National Fund for Scientific Research), and organized in co-operation with the
IEEE Benelux Signal Processing Chapter, the IEEE Benelux Circuits and Systems Chapter, and INRIA, France. It was a continuation of two previous workshops of the same name,
held in Pont-à-Mousson, France, June 1990 [1], and Bonas, France, June 1991 [2].
All of these workshops have been organized in the framework of the EC Basic Research
Actions NANA and NANA2, Novel Parallel Algorithms for New Real-Time Architectures,
sponsored by the ESPRIT program of Directorate XIII of the European Commission. The
NANA contractors are IMEC, Leuven, Belgium (F. Catthoor), K.U. Leuven, Leuven, Belgium (J. Vandewalle), ENSL, Lyon, France (Y. Robert), TU Delft, Delft, The Netherlands
(P. Dewilde and E. Deprettere), and IRISA, Rennes, France (P. Quinton). The goal within
these projects has been to contribute algorithms suited for parallel architecture realisation
on the one hand, and on the other hand design methodologies and synthesis techniques
which address the design trajectory from real behaviour down to the parallel architecture
realisation of the system. As such, this clearly overlaps with the scope of the workshop
and the book.
An overview of the main results presented in the different chapters combined with an
attempt to structure all this information is available in the introductory chapter.
We expect this book to be of interest in academia, both for the detailed descriptions of
research results and for the overview of the field given here, with many important but less
widely known issues which must be addressed to arrive at practically relevant results. In
addition, many authors have considered applications and the book is intended to reflect
this fact. The real-life applications that have driven the research are described in several
contributions, and the impact of their characteristics on the methodologies is assessed. We
therefore believe that the book will be of interest also to senior design engineers and CAD
managers in industry, who wish either to anticipate the evolution of commercially available
design tools over the next few years, or to make use of the concepts in their own research
and development.
It has been a pleasure for us to organize the workshop and to work together with the
authors to assemble this book. We feel amply rewarded with the result of this co-operation,
and we want to thank all the authors here for their effort. We have spent significant
effort on delivering material that is as consistent as possible, through careful editing. The
international aspect has allowed us to group the results of many research groups with a
different background and "research culture," which is felt to be particularly enriching.
We would be remiss not to thank Prof. L. Thiele of Universität des Saarlandes, Saarbrücken,
Germany, who was an additional member of the workshop's organizing committee, and
F. Vanpoucke who was a perfect workshop managing director and also did a great job in
collecting and processing the contributions to this book.
We hope that the reader will find the book useful and enjoyable, and that the results
presented will contribute to the continued progress of the field of parallel algorithms, ar-
chitectures and compilation.
Leuven, October 1994
The editors
References
[1] E. Deprettere, A. Van der Veen (eds.), "Algorithms and Parallel VLSI Architectures",
Elsevier, Amsterdam, 1991.
[2] P. Quinton, Y. Robert (eds.), "Algorithms and Parallel VLSI Architectures II",
Elsevier, Amsterdam, 1992.
Algorithms and Parallel VLSI Architectures III
M. Moonen and F. Catthoor (Editors)
© 1995 Elsevier Science B.V. All rights reserved.
ALGORITHMS AND PARALLEL VLSI ARCHITECTURES
F. CATTHOOR
IMEC
Kapeldreef 57
3001 Leuven, Belgium
catthoor@imec.be
M. MOONEN
ESAT
Katholieke Universiteit Leuven
3001 Leuven, Belgium
Marc.Moonen@esat.kuleuven.ac.be
ABSTRACT. In this introductory chapter, we will summarize the main contributions of
the chapters collected in this book. Moreover, the topics addressed in these chapters will
be linked to the major research trends in the domain of parallel algorithms, architectures
and compilation.
1 STRUCTURE OF BOOK
The contributions to the workshop and the book can be classified in three categories:
1. Parallel Algorithms: The emphasis lies on the search for more efficient and inherently
parallelisable algorithms for particular computational kernels, mainly from linear algebra. The demand for fast matrix computations has arisen in a variety of fields,
such as speech and image processing, telecommunication, radar and sonar, biomedical signal processing, and so on. The work is motivated by the belief that preliminary
algorithmic manipulations largely determine the success of, e.g., a dedicated hardware
design, because radical algorithmic manipulations and engineering techniques are not
easily captured, e.g., in automatic synthesis tools. Most of the contributions here deal
with real-time signal processing applications, and in many cases, the research on these
algorithms is already tightly linked to the potential parallel realisation options to be
exploited in the architecture phase.
2. Parallel Architectures: Starting from an already parallelized algorithm or a group of
algorithms (a target application domain), the key issue here is to derive a particular
architecture which efficiently realizes the intended behaviour for a specific technology.
In this book, the target technology will be CMOS electronic circuitry. In order to
achieve this architecture realisation, the detailed implementation characteristics of
the building blocks - like registers/memories, arithmetic components, logic gates and
connection networks - have to be incorporated. The end result is an optimized netlist/layout of either primitive custom components or of programmable building blocks.
The trend of recent years is to mix both styles: more custom features are
embedded in the massively parallel programmable machines, especially in the storage
hierarchy and the network topologies. In addition, (much) more flexibility is built
into the custom architectures, sometimes leading to highly-flexible weakly-parallel
processors. The path followed to arrive at such architectures is the starting point for
the formalisation into reusable compilation methodologies.
3. Parallel Compilation: Most designs in industry suffer from increasing time pressure.
As a result, the methods to derive efficient architectures and implementations have to
become more efficient and less error-prone. For this purpose, an increasing amount of
research is spent on formalized methodologies to map specific classes of algorithms (ap-
plication domain) to selected architectural templates (target style). In addition, some
steps in these methodologies are becoming supported by interactive or automated
design techniques (architectural synthesis or compilation). In this book, the empha-
sis will be on modular algorithms with much inherent parallelism to be mapped on
(regular) parallel array styles. Both custom (application-specific) and programmable
(general-purpose) target styles will be considered.
These categories correspond to the different parts of the book. An outline of the main
contributions in each part is given next, along with an attempt to capture the key features
of the presented research.
2 PARALLEL ALGORITHMS
In recent years, it has become clear that for many advanced real-time signal processing and
adaptive systems and control applications the required level of computing power is well
beyond that available on present-day programmable signal processors. Linear algebra and
matrix computations play an increasingly prominent role here, and the demand for fast
matrix computations has arisen in a variety of fields, such as speech and image processing,
telecommunication, radar and sonar, biomedical signal processing, and so on. Dedicated
architectures then provide a means of achieving orders of magnitude improvement in per-
formance, consistent with the requirements.
However, past experience has shown that preliminary algorithmic manipulations largely
determine the success of such a design. This has led to a new research activity, aimed
at tailoring algorithmic design to architectural design and vice versa, or in other words
deriving numerically stable algorithms which are suitable for parallel computation. At
this stage, there is also interaction already with the parallel architecture designers who
have to evaluate the mapping possibilities onto parallel processing architectures, capable of
performing the computation efficiently at the required throughput rate.
In the first keynote contribution, CHAPTER 1 (Regalia), a tutorial overview is given of
so-called subspace methods, which have received increasing attention in signal processing
and control in recent years. Common features are extracted for two particular applications,
namely multivariable system identification and source localization. Although these application areas have different physical origins, the mathematical structures of the problems
they aim to solve are laced with parallels, so that, e.g., parallel and adaptive algorithms
in one area find an immediate range of applications in neighbouring areas. In particular,
both problems are based on finding spanning vectors for the null space of a spectral density
matrix characterizing the available data, which is usually expressed numerically in terms of
extremal singular vectors of a data matrix. Algorithmic aspects of such computations are
treated in subsequent chapters, namely CHAPTER 6 (Götze et al.) and CHAPTER 7 (Saxena
et al.), see below.
Linear least squares minimisation is no doubt one of the most widely used techniques
in digital signal processing. It finds applications in channel equalisation as well as system
identification and adaptive antenna array beamforming. At the same time, it is one of the
most intensively studied linear algebra techniques when it comes to parallel implementation. CHAPTERS 2 through 5 all deal with various aspects of this. Of the many alternative
algorithms that have been proposed over the years, one of the most attractive is the algo-
rithm based on QR decomposition. To circumvent pipelining problems with this algorithm,
several alternative algorithms have been developed, of which the covariance-type algorithm
with inverse updating is receiving a lot of attention now.
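The attraction of the QR route can be seen in a few lines. The following NumPy sketch is our own illustration, not code from any chapter: it solves an overdetermined least-squares problem through the orthogonal-triangular factorisation whose triangular factor the systolic RLS arrays maintain recursively.

```python
import numpy as np

# Illustrative sketch: min ||A x - b||_2 via QR decomposition.
# The triangular back-substitution step is what gives the method
# its numerical robustness and its regular array structure.
rng = np.random.default_rng(0)
A = rng.standard_normal((8, 3))
x_true = np.array([1.0, -2.0, 0.5])
b = A @ x_true                      # consistent right-hand side

Q, R = np.linalg.qr(A)              # A = Q R, Q orthonormal, R upper triangular
x = np.linalg.solve(R, Q.T @ b)     # solve the small triangular system

assert np.allclose(x, x_true)
```

Because the data enter only through orthogonal transformations, the conditioning of the problem is not worsened, which is the property the fixed-point analyses of CHAPTER 5 build on.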
In CHAPTER 2 (McWhirter et al.), a formal derivation is given of two previously developed
systolized versions of this algorithm. The derivation of these arrays is highly non-trivial
due to the presence of data contra-flow in the underlying signal flow graph, which would
normally prohibit pipelined processing. Algorithmic engineering techniques are applied to
overcome these problems. Similar algorithmic techniques are used in CHAPTER 3 (Brown et
al.), which is focused on covariance-type algorithms for the more general Kalman Filtering
problem. Here also, algorithmic engineering techniques are used to generate two systolic
architectures, put forward in earlier publications, from an initial three-dimensional hierarchical signal flow graph (or dependence graph). In CHAPTER 4 (Schier), it is shown how
the inverse updates algorithm and systolic array treated in CHAPTER 2 may be equipped
with a block-regularized exponential forgetting scheme. This makes it possible to overcome
numerical problems if the input data is not sufficiently informative. Finally, in CHAPTER 5
(Kadlec) the information-type RLS algorithm based on QR decomposition is reconsidered. A
normalized version of this algorithm is presented which has potential for efficient fixed-point
implementation. The main contribution here is a global probability analysis which gives an
understanding of the algorithm's numerical properties and makes it possible to formulate
probability statements about the number of bits actually used in the fixed-point representation.
A second popular linear algebra tool is the singular value decomposition (and the related
symmetric eigenvalue decomposition), which, e.g., finds applications in subspace techniques
as outlined in CHAPTER 1. The next two chapters deal with the parallel implementation of
such orthogonal decompositions. In CHAPTER 6 (Götze et al.), it is explained how Jacobi-
type methods may be speeded up through the use of so-called orthonormal p-rotations.
Such CORDIC-like rotations require a minimal number of shift-add operations, and can be
executed on a floating-point CORDIC architecture. Various methods for the construction
of such orthonormal p-rotations of increasing complexity are presented and analysed. An
alternative approach to developing parallel algorithms for the computation of eigenvalues
and eigenvectors is presented in CHAPTER 7 (Saxena et al.). It is based on isospectral
flows, that is matrix flows in which the eigenvalues of the matrix are preserved. Very few
researchers in the past have used the isospectral flow approach to implement the eigenvalue
problem in VLSI, even though, as explained in this chapter, it has several advantages from
the VLSI point of view, such as simplicity and scalability.
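As a hedged illustration of the Jacobi-type methods discussed above (using plain floating-point rotations, not the shift-add orthonormal p-rotations of CHAPTER 6), the sketch below runs cyclic sweeps of plane rotations over a symmetric matrix. Each rotation is a similarity transform, so the eigenvalues are preserved while the off-diagonal mass is driven to zero.

```python
import numpy as np

def jacobi_eigenvalues(A, sweeps=10):
    """Cyclic Jacobi sweeps for a symmetric matrix (illustrative sketch)."""
    A = A.astype(float).copy()
    n = A.shape[0]
    for _ in range(sweeps):
        for p in range(n - 1):
            for q in range(p + 1, n):
                if abs(A[p, q]) < 1e-15:
                    continue
                # angle that annihilates the (p, q) off-diagonal pair
                theta = 0.5 * np.arctan2(2 * A[p, q], A[p, p] - A[q, q])
                c, s = np.cos(theta), np.sin(theta)
                J = np.eye(n)
                J[p, p] = J[q, q] = c
                J[p, q], J[q, p] = -s, s
                A = J.T @ A @ J      # similarity transform: spectrum unchanged
    return np.sort(np.diag(A))

A = np.array([[4.0, 1.0, 0.0],
              [1.0, 3.0, 1.0],
              [0.0, 1.0, 2.0]])
assert np.allclose(jacobi_eigenvalues(A), np.sort(np.linalg.eigvalsh(A)))
```

Every rotation touches only two rows and two columns, which is why the method parallelises so naturally onto arrays of CORDIC-like rotation units.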
CHAPTER 8 (Arioli et al.) deals with block iterative methods for solving linear systems
of equations in heterogeneous computing environments. Three different strategies are pro-
posed for parallel distributed implementation of the Block Conjugate Gradient method,
differing in the amount of computation performed in parallel, the communication scheme,
and the distribution of tasks among processors. The best performing scheme is then used
to accelerate the convergence of the Block Cimmino method.
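For readers unfamiliar with the underlying kernel, a minimal unblocked Conjugate Gradient sketch follows; this is our own illustration, not the chapter's method. The Block CG variant studied there processes several right-hand sides at once, which is precisely what creates the work to distribute across processors.

```python
import numpy as np

def conjugate_gradient(A, b, tol=1e-10, max_iter=1000):
    """Plain CG for A x = b with A symmetric positive definite."""
    x = np.zeros_like(b)
    r = b - A @ x                    # initial residual
    p = r.copy()                     # initial search direction
    rs = r @ r
    for _ in range(max_iter):
        Ap = A @ p
        alpha = rs / (p @ Ap)        # step length along p
        x += alpha * p
        r -= alpha * Ap
        rs_new = r @ r
        if np.sqrt(rs_new) < tol:
            break
        p = r + (rs_new / rs) * p    # A-conjugate direction update
        rs = rs_new
    return x

A = np.array([[4.0, 1.0], [1.0, 3.0]])
b = np.array([1.0, 2.0])
x = conjugate_gradient(A, b)
assert np.allclose(A @ x, b)
```

The dominant costs per iteration, a matrix-vector product and two inner products, are exactly the operations whose distribution the three strategies of the chapter trade off against communication.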
Finally, CHAPTER 9 (Cardarilli et al.) deals with RNS-to-binary conversion. RNS
(Residue Number System) arithmetic is based on the decomposition of a number - represented by a large number of bits - into reduced wordlength residual numbers. It is a very
useful technique to reduce carry propagation delays and hence speed up signal processing
implementations. Here, a conversion method is presented which is based on a novel class of
coprime moduli and which is easily extended to a large number of moduli. In this way the
proposed method allows the implementation of very fast and low complexity architectures.
This paper already bridges the gap with the detailed architecture realisation, treated in
the second category of contributions.
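The carry-free character of RNS arithmetic is easy to demonstrate. The sketch below is an assumption-laden illustration: it uses small, arbitrary coprime moduli and textbook Chinese Remainder Theorem reconstruction, not the novel moduli class of the chapter.

```python
from math import prod

# Pairwise coprime example moduli; dynamic range M = 7 * 11 * 13 = 1001.
moduli = (7, 11, 13)
M = prod(moduli)

def to_rns(x):
    """Split a number into short, independent residues."""
    return tuple(x % m for m in moduli)

def from_rns(residues):
    """CRT reconstruction: x = sum r_i * M_i * (M_i^-1 mod m_i) mod M."""
    x = 0
    for r, m in zip(residues, moduli):
        Mi = M // m
        x += r * Mi * pow(Mi, -1, m)   # modular inverse (Python 3.8+)
    return x % M

a, b = 123, 456
# Digit-wise multiplication: no carries propagate between residue channels.
c = tuple((ra * rb) % m for ra, rb, m in zip(to_rns(a), to_rns(b), moduli))
assert from_rns(c) == (a * b) % M
```

Each residue channel is a short-wordlength unit operating independently; the conversion back to binary, the step the chapter accelerates, is the only place where the channels interact.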
3 PARALLEL ARCHITECTURES FOR HIGH-SPEED NUMERICAL AND
SIGNAL PROCESSING
Within this research topic, we have contributions on both customized and programmable
architectures. For the application-specific array architectures, the main trend is towards
more flexibility. This is visible for instance in the high degree of scalability and the different
modes/options offered by the different architectures.
We can make a further subdivision between the more "conventional" regular arrays with
only local communication and the arrays which are combined with other communication
support like tree networks to increase the speed of non-local dependencies. In the first
class, two representative designs are reported in this book. In CHAPTER 10 (Riem et al.), a
custom array architecture for long integer arithmetic computations is presented. It makes
use of redundant arithmetic for high-speed and is very scalable for word-length. Moreover,
several modes are available to perform various types of multiplication and division. The
emphasis in this paper lies on the interaction with the algorithmic transformations which
are needed to derive an optimized architecture and also on the methodology which is used
throughout the design trajectory.
Similarly, in CHAPTER 11 (Rosseel et al.), a regular array architecture for an image dif-
fusion algorithm is derived. The resulting design is easily cascadable and scalable and the
data-path supports many different interpolation functions. The extended formal methodology used to arrive at the end result - oriented to fixed-throughput applications - forms a
common thread running through the paper.
Within the class of arrays extended with non-local communication, two representative
designs are also reported, again including a high degree of scalability. The topic of CHAPTER
12 (Duboux et al.) is a parallel array augmented with a tree network for fast and efficient
dictionary manipulations. The memory and network organisation for handling the key-
record data are heavily tuned to obtain the final efficiency.
Also in CHAPTER 13 (Archambaud et al.), a basic systolic array is extended with an
arbitration tree to speed up the realisation of the application. In this case, it is oriented
to genetic sequence comparison including the presence of "holes". In order to achieve even
higher speed, a set-associative memory is included too.
For the class of programmable architectures, both massively and weakly parallel machines
are available. Apparently, their use depends on the application domain which is targeted.
For high-throughput real-time signal processing, e.g. in image and video processing, the
main trend nowadays is towards lower degrees of parallelism (4 to 16 processor elements) and
more customisation to support particular, frequently occurring operations and constructs.
The latter is especially apparent in the storage and communication organisation. The
reduced parallelism is motivated because the amount of available algorithmic parallelism
is not necessarily that big and because the speed of the basic processors has become high
enough to reduce the required parallelisation factor for the throughput to be obtained.
Within the programmable class, the main emphasis in the book lies on the evolution of
these novel, weakly parallel processor architectures for video and image processing type
applications.
In CHAPTER 14 (Vissers et al.), an overview is provided of the VSP2 architecture which
is mainly intended for video processing as in HDTV, video compression and the like. It
supports a highly flexible connection network (cross-bar) and a very distributed memory
organisation with dedicated register-banks and FIFO's. In CHAPTER 15 (Roenner et al.),
the emphasis lies on a programmable processor mainly targeted to image processing algo-
rithms. Here, the communication network is more restricted but the storage organisation is
more diversified, efficiently supporting in hardware both regular and data-dependent, and
both local and neighbourhood operations. The two processor architectures are however also
partly overlapping in target domain, and the future will have to show which of the options is
best suited for a particular application.
Using such video or image signal processors, it is possible to construct flexible higher-level
templates which are tuned to a particular class of applications. This has for instance been
achieved in CHAPTER 16 (De Greef et al.), where motion-estimation-like algorithms are
considered. A highly efficient communication and storage organisation is proposed which
makes it possible to reduce these overheads considerably for the targeted applications. Real-time
execution with limited board-space is obtained in this way for emulation and prototyping
purposes.
In addition, higher efficiency in the parallel execution within the data-path can potentially be obtained by giving up the fully synchronous operation. This is demonstrated in
CHAPTER 17 (Arvind et al.), where the interesting option of asynchronously communicat-
ing micro-agents is explored. It is shown that several alternative mechanisms to handle
dependencies and to distribute the control of the instruction ordering are feasible. Some of
these lead to a significant speed-up.
Finally, there is also a trend to simplify the processor data-path and to keep the instruction set as small as possible (RISC processor style). Within the class of weakly parallel
processors for image and video processing, this was already reflected in the previously
mentioned architectures. In CHAPTER 18 (Hall et al.) however, this is put even more to
the extreme by considering bit-serial processing elements which are communicating in an
SIMD array. The use of special instructions and a custom memory organisation make global
data-dependent operations possible though. This parallel programmable image processor
is mainly oriented to wood inspection applications.
Within the class of massively parallel machines, the main evolution is also towards more
customisation. The majority of the applications targeted to such machines appears to come
mainly from the scientific and numerical computing fields. In CHAPTER 19 (Vankats), a
new shared memory multi-processor based on hypercube connections is proposed. The
dedicated memory organisation with a directory based cache coherence scheme is the key
for improved speed.
An application of a fast DCT scheme mapped to such parallel machines is studied in
CHAPTER 20 (Christopoulos et al.). Here, the emphasis lies on the influence of the al-
gorithmic parameters and the load balancing on the efficiency of the parallel mapping.
Efficient massive parallelism is only achievable for large system parameters.
The power of a "general-purpose" array of processors realized on customizable field-
programmable gate arrays (FPGAs) is demonstrated in CHAPTER 21 (Champeau et al.).
This combination makes it possible to extend the customisation further without overly limiting the
functionality. An efficient realisation of parallel text matching is used as a test-case to show
the advantages of the approach.
Compiler support is a key issue for all of these parallel programmable machines, so all the
novel architectures have been developed with this in mind. Hence, each of the contributions
in CHAPTER 14 (Vissers et al.), CHAPTER 15 (Roenner et al.), CHAPTER 18 (Hall et
al.), CHAPTER 21 (Champeau et al.) and CHAPTER 19 (Vankats) devotes a section to the
compilation issues. Most of these compilers can however benefit from the novel insights and
techniques which are emerging in the compilation field, as addressed in section 4.
4 PARALLEL COMPILATION FOR APPLICATION-SPECIFIC AND
GENERAL-PURPOSE ARCHITECTURES
As already mentioned, the key drive for more automated and more effective methodologies
comes from the reduced design time available to system designers. In order to obtain these
characteristics, methodologies have to be generally targeted towards application domains
and target architecture styles. This is also true for the domain of parallel architectures.
Still, a number of basic steps do reoccur in the methodologies and an overview of the major
compilation steps in such a targeted methodology is provided in CHAPTER 22 (Feautrier). In
that contribution, the emphasis lies on array data-flow analysis, scheduling of the parallel
operations on the time axis, allocation to processors and processor code generation including
communication synthesis. Even though this survey is mainly oriented to the compilation on
programmable machines, most of the concepts recur for the field of custom array synthesis
(see also CHAPTER 10 (Riem et al.) and CHAPTER 11 (Rosseel et al.)). Still, the detailed
realisation of the algorithmic techniques used for the design automation typically differs
depending on the specific characteristics of the domain (see also below).
The other papers in the compilation category are addressing specific tasks in the global
methodology. Representative work in each of the different stages is collected in this book.
The order in which these tasks will be addressed here is not fully fixed, but still most
researchers converge on a methodology which is close to what is presented here.
The first step is of course the representation of the algorithm to be mapped in a formal
model, suitable for manipulation by the design automation techniques. The limitations of
this model to affine, manifest index functions have been partly removed in the past few
years. Important in this process is that the resulting models should still be amenable to the
vast amount of compilation/synthesis techniques which are operating on the affine model.
This also means that array data-flow analysis should remain feasible.
Interesting extensions to this "conventional" model which meet these requirements, are
proposed in CHAPTER 23 (Held et al.) and CHAPTER 24 (Rapanotti et al.). The restriction
to linear or affine index functions can be extended to piece-wise regular affine cases by
a normal form decomposition process. This allows to convert integer division, modulo,
ceiling and floor functions to the existing models, as illustrated in CHAPTER 23 (Held et
al.). Moreover, also so-called linearly bounded lattices can then be handled.
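The decomposition idea can be illustrated with a toy example of our own, not taken from the chapter: a floor-division index function is not affine in the loop iterator, but after splitting the iteration domain into residue classes, each piece is purely affine.

```python
def index_orig(i):
    """Non-affine index function: floor division appears in the access."""
    return i // 2

def index_piecewise(i):
    """Piece-wise affine normal form: split the domain by residue class.

    Writing i = 2*q + r with r in {0, 1}, the index on each residue
    class is simply the affine expression q (with coefficient 1, offset 0).
    """
    q, r = divmod(i, 2)
    if r == 0:      # even piece: i = 2q  ->  index = q
        return q
    else:           # odd piece:  i = 2q + 1  ->  index = q
        return q

# Both forms agree on the whole iteration domain.
assert all(index_orig(i) == index_piecewise(i) for i in range(100))
```

Once every access is affine on each piece, the existing array data-flow analysis and scheduling machinery can be applied piece by piece.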
The restrictions can be even further removed by considering the class of "integral" index
functions, as studied in CHAPTER 24 (Rapanotti et al.). This makes it possible to handle more
complicated cases as well, such as those occurring in the knapsack algorithm. By especially extending
the so-called uniformisation step in the design trajectory, it is still possible to arrive at
synthesizable descriptions. There is also hope to deal with part of the data-dependent
cases in this way.
Finally, it is also possible to consider the problem of modelling from another point of
view, namely as a matching between primitive operations for which efficient parallel im-
plementations are known, and the algorithm to be mapped. This approach is taken in
CHAPTER 25 (Rangaswami), where a functional programming style with recursion is advo-
cated. By providing a library of mappable functions, it is then possible to derive different
options for compiling higher-level functions and to characterize each of the alternatives in
terms of cost.
Once the initial algorithm has been brought into this manipulatable form, it is usually
necessary to apply a number of high-level algorithmic transformations to improve the efficiency
of the eventual architecture realisations (see also CHAPTER 10 (Riem et al.) and CHAPTER 11 (Rosseel et al.)). Support for these is considered in CHAPTER 26 (Durrieu et al.),
where provably correct small transformations allow the designer to interactively modify the
original algorithm into the desired form. Also the uniformisation transformation addressed
in CHAPTER 24 (Rapanotti et al.) falls in principle under this stage, but for that purpose
more automated techniques have also become available lately.
Now that the algorithm has a suitable form for the final mapping stages, it is usually
assumed that all index functions are uniform and manifest, and that the algorithm has
been broken up into several pure loop nests. For each of these, the scheduling, allocation
and code generation/communication synthesis steps then have to be performed. Within the
target domain of massively parallel machines (either custom or programmable), the notion
of affine mapping functions has been heavily exploited up to now (see also CHAPTER 22).
For instance, the work in CHAPTER 27 (Bouchitté et al.) considers the mapping of eval-
uation trees onto a parallel machine where communication and computation can coincide.
This assumption complicates the process a lot and heuristics are needed and proposed to
handle several practical cases within fine- and coarse-grain architectures.
It is however clear from several practical designs that purely affine mappings do not
always lead to optimal designs. This is clearly illustrated in CHAPTER 28 (Werth et
al.) for both scheduling and communication synthesis, for the test-case of the so-
called Lamport loop. Therefore, several researchers have started looking at extensions to
the conventional methods.
A non-unimodular mapping technique including extended scheduling/allocation and especially communication synthesis is proposed in CHAPTER 29 (Reffay et al.). For the
Cholesky factorisation kernel, it is shown that significantly increased efficiency can be ob-
tained, while still providing automatable methods.
Up to now, however, we have still restricted ourselves to mapping onto homogeneous,
locally connected parallel machines. As already demonstrated in section 3, the use of weakly
parallel and not necessarily homogeneous architectures is finding a large market in high-
throughput signal processing, as in video and image applications. As a result, much research
has been spent lately on improved compilation techniques for these architectures too. Most
of this work is originating from the vast amount of know-how which has been collected in
the high-level synthesis community on mapping irregular algorithms onto heterogeneous
single processor architectures. Several representative contributions in this area are taken
up in this book.
In CHAPTER 30 (Schwiegershausen et al.), the scheduling problem of coarse grain tasks
onto a heterogeneous multi-processor is considered. The assumption is that several pro-
cessor styles are available and that the mapping of the tasks on these styles has been
characterized already. Given that information, an integer programming problem can be
formulated which solves several practical applications in the image and
video processing domain.