
Data Access and Storage Management for Embedded Programmable Processors PDF

315 Pages·2002·10.758 MB·English

Preview Data Access and Storage Management for Embedded Programmable Processors

DATA ACCESS AND STORAGE MANAGEMENT FOR EMBEDDED PROGRAMMABLE PROCESSORS

by

Francky Catthoor, IMEC, Leuven, Belgium
Koen Danckaert, IMEC, Leuven, Belgium
Chidamber Kulkarni, IMEC, Leuven, Belgium
Erik Brockmeyer, IMEC, Leuven, Belgium
Per Gunnar Kjeldsberg, Norwegian Univ. of Sc. and Tech. (NTNU), Trondheim, Norway
Tanja Van Achteren, Katholieke Universiteit Leuven, Leuven, Belgium
and
Thierry Omnes, IMEC, Leuven, Belgium

SPRINGER SCIENCE+BUSINESS MEDIA, LLC

A C.I.P. Catalogue record for this book is available from the Library of Congress.

ISBN 978-1-4419-4952-3
ISBN 978-1-4757-4903-8 (eBook)
DOI 10.1007/978-1-4757-4903-8

Printed on acid-free paper

All Rights Reserved
© 2002 Springer Science+Business Media New York
Originally published by Kluwer Academic Publishers in 2002
Softcover reprint of the hardcover 1st edition 2002

No part of the material protected by this copyright notice may be reproduced or utilized in any form or by any means, electronic or mechanical, including photocopying, recording or by any information storage and retrieval system, without written permission from the copyright owner.

Preface

The main intention of this book is to give an impression of the state of the art in data transfer/access and storage exploration¹ related issues for complex data-dominated embedded real-time processing applications. The material is based on research at IMEC in this area in the period 1996-2001. It can be viewed as a follow-up of the earlier "Custom Memory Management Methodology" book that was published in 1998 based on the custom memory style related work at IMEC in the period 1988-1997. In order to deal with the stringent timing requirements, the cost-sensitivity and the data-dominated characteristics of our target domain, we have again adopted a target architecture style and a systematic methodology to make the exploration and optimization of such systems feasible.
But this time our target style is oriented mostly to (partly) predefined memory organisations as occurring e.g. in instruction-set processors, cores or "platforms". Our approach is also very heavily application-driven, which is illustrated by several realistic demonstrators, partly used as red-thread examples in the book. Moreover, the book addresses only the steps above the traditional compiler tasks, even prior to the parallelizing ones. The latter are mainly focussed on scalar or scalar stream operations and other data where the internal structure of the complex data types is not exploited, in contrast to the approaches discussed in this book. The proposed methodologies are nearly fully independent of the level of programmability in the data-path and controller, so they are valuable for the realisation of both instruction-set and custom processors, and even reconfigurable styles. Some of the steps do depend on the memory architecture instance and its restrictions though.

Our target domain consists of real-time processing systems which deal with large amounts of data. This happens in real-time multi-dimensional signal processing (RMSP) applications like video and image processing, which handle indexed array signals (usually in the context of loops), in wired and wireless terminals which handle less regular arrays of data, and in sophisticated communication network protocols, which handle large sets of records organized in tables and pointers. These classes include many important applications, such as video coding, medical image archival, advanced audio and speech coding, multi-media terminals, artificial vision, Orthogonal Frequency Division Multiplex (OFDM), turbo coding, Spatial Division Multiplex Access (SDMA), Asynchronous Transfer Mode (ATM) networks, Internet Protocol (IP) networks, and other Local/Wide Area Network (LAN/WAN) protocol modules.
For these applications, we believe (and we will demonstrate by real-life experiments) that the organisation of the global communication and data storage, together with the related algorithmic transformations, forms the dominating factor (both for system power and memory footprint) in the system-level design decisions, with special emphasis on source code transformations. Therefore, we have concentrated mainly on the effect of system-level decisions on the access to large (background) memories and caches, which require separate access cycles, and on the transfer of data over long "distances" (i.e. which have to pass between source and destination over long-term main memory storage). This is complementary to and not in competition with the existing compiler technology. Indeed, our approach should be considered as a precompilation stage, which precedes the more conventional compiler steps that are essential to handle aspects that are not decided in "our" stage yet. So we are dealing with "orthogonal" exploration choices where the concerns do not overlap but that are still partly dependent due to the weak phase coupling. From the precompilation stage, constraints are propagated to the compiler stage. The effectiveness of this will be demonstrated in the book.

¹ A more limited term used in the literature is memory management, but the scope of this material is much broader because it also includes data management and source code transformations.

The cost functions which we have incorporated for the storage and communication resources are both memory footprint and power oriented. Due to the real-time nature of the targeted applications, the throughput is normally a constraint. So performance is in itself not an optimisation criterion for us; it is mostly used to restrict the feasible exploration space.
The potential slack in performance is used to optimize the real costs like power consumption and (on- or off-chip) memory footprint.

The material in this book is partly based on work in the context of several European and national research projects. The ESPRIT program of Directorate XIII of the European Commission, through the ESD/LPD Project 25518 DAB-LP, has sponsored the SCBD step and the DAB application work. The Flemish IWT has sponsored some of the demonstrator work under the SUPERVISIE project, and part of the loop transformation and caching related work under the MEDEA SMT and SMT2 projects (System level Methods and Tools), including partial support of the industrial partners Alcatel Telecom, Philips, CoWare and Frontier Design. The Flemish FWO foundation and the Belgian interuniversity attraction pole have sponsored the data reuse related work under FWO-G.0036.99 resp. IUAP4/24. The FWO has also sponsored part of the loop transformation work in a Ph.D. fellowship. The Norwegian Research Council and the European Marie Curie Fellowship funding have sponsored part of the size estimation work through research project 131359 CoDeVer resp. MCFH-1999-00493.

A major goal of the system synthesis and compilation work within these projects has been to contribute systematic design methodologies and appropriate tool support techniques which address the design/compilation trajectory from real system behaviour down to the detailed platform architecture level of the system. In order to provide complete support for this design trajectory, many problems must be tackled. In order to be effective, we believe that the design/compilation methodology and the supporting techniques have to be (partly) domain-specific, i.e. targeted. This book illustrates this claim for a particular target application domain which is of great importance to the current industrial activities in the telecommunications and multi-media sectors: data-dominated real-time signal and data processing systems.
For this domain, the book describes an appropriate systematic methodology partly supported by efficient and realistic compilation techniques. The latter have been embedded in prototype tools to prove their feasibility. We do not claim to cover the complete system compilation path, but we do believe we have significantly contributed to the solution of some of the most crucial problems in this domain, namely those related to data transfer and storage exploration (DTSE).

We therefore expect this book to be of interest in academia, both for the overall description of the methodology and for the detailed descriptions of the compilation techniques and algorithms. The priority has been placed on issues that in our experience are crucial to arrive at industrially relevant results. All projects which have driven this research have also been application-driven from the start, and the book is intended to reflect this fact. The real-life applications are described, and the impact of their characteristics on the methodologies and techniques is assessed. We therefore believe that the book will be of interest as well to senior design engineers and compiler/system CAD managers in industry, who wish either to anticipate the evolution of commercially available design tools over the next few years, or to make use of the concepts in their own research and development.

It has been a pleasure for us to work in this research domain and to co-operate with our project partners and our colleagues in the system-level synthesis and embedded compiler community. Much of this work has been performed in tight co-operation with many university groups, mainly across Europe. This is reflected partly in the author list of this book, but especially in the common publications.
We want to especially acknowledge the valuable contributions and the excellent co-operation with: the ETRO group at the V.U.Brussels, KTH-Electrum (Kista), the ACCA and DTAI groups at K.U.Leuven (Leuven), INSA (Lyon), Patras Univ., Norwegian Univ. of Sc. and Tech. (Trondheim), Democritus University of Thrace (Xanthi).

In addition to learning many new things about system synthesis/compilation and related issues, we have also developed close connections with excellent people. Moreover, the pan-European aspect of the projects has allowed us to come in closer contact with research groups with a different background and "research culture", which has led to very enriching cross-fertilization.

We would like to use this opportunity to thank the many people who have helped us in realizing these results and who have provided contributions in the direct focus of this book, both in IMEC and at other locations. In particular, we wish to mention: Einar Aas, Javed Absar, Yiannis Andreopoulos, Ivo Bolsens, Jan Bormans, Henk Corporaal, Chantal Couvreur, Geert Deconinck, Eddy De Greef, Kristof Denolf, Hugo De Man, Jean-Philippe Diguet, Peeter Ellervee, Michel Eyckmans, Antoine Fraboulet, Frank Franssen, Thierry Franzetti, Cedric Ghez, Costas Goutis, Ahmed Hemani, Stefaan Himpe, Jos Huisken, Martin Janssen, Stefan Janssens, Rudy Lauwereins, Paul Lippens, Anne Mignotte, Miguel Miranda, Kostas Masselos, Lode Nachtergaele, Martin Palkovic, Rainer Schaffer, Peter Slock, Dimitrios Soudris, Arnout Vandecappelle, Tom Van der Aa, Tycho van Meeuwen, Michael van Swaaij, Diederik Verkest, Sven Wuytack.

We finally hope that the reader will find the book useful and enjoyable, and that the results presented will contribute to the continued progress of the field of system-level compilation and synthesis for both instruction-set and custom processors.

Francky Catthoor, Koen Danckaert, Chidamber Kulkarni, Erik Brockmeyer, Per Gunnar Kjeldsberg, Tanja Van Achteren, Thierry Omnes
October 2001.
Contents

1. DTSE IN PROGRAMMABLE ARCHITECTURES
  1.1 Problem context and motivation
  1.2 Target application domain
  1.3 Target architecture style
  1.4 Storage cost models
  1.5 Objectives and global approach
  1.6 Brief state of the art
  1.7 The data transfer and storage exploration approach at IMEC
    1.7.1 Reducing the required data bit-widths for storage
    1.7.2 Pruning and related preprocessing steps
    1.7.3 Global data flow trafo
    1.7.4 Global loop and control flow trafo
    1.7.5 Array-oriented memory size estimation
    1.7.6 Data reuse decision in a hierarchical memory context
    1.7.7 Storage cycle budget distribution
      1.7.7.1 Memory hierarchy layer assignment
      1.7.7.2 Loop transformations for SCBD
      1.7.7.3 BG structuring
      1.7.7.4 Storage bandwidth optimization
    1.7.8 Memory (bank) allocation and signal assignment
    1.7.9 Memory data layout optimization
    1.7.10 Other related methodologies and stages
  1.8 Overview of Book

2. RELATED COMPILER WORK ON DATA TRANSFER AND STORAGE MANAGEMENT
  2.1 Parallelism and memory optimizing loop and data flow trafo
    2.1.1 Interactive loop transformations
    2.1.2 Automated loop transformation steering
    2.1.3 Data-flow transformations
  2.2 MIMD processor mapping and parallel compilation approaches
    2.2.1 Task-level parallelism
    2.2.2 Data-level parallelism
    2.2.3 Instruction-level parallelism
  2.3 Memory management approaches in programmable processor context
    2.3.1 Memory organisation issues
    2.3.2 Data locality and cache organisation related issues
    2.3.3 Data storage organisation and memory reuse
  2.4 Summary

3. GLOBAL LOOP TRANSFORMATIONS
  3.1 Related work
    3.1.1 Loop transformation research
      3.1.1.1 The hyperplane method
      3.1.1.2 Dependency abstractions and dependency analysis
      3.1.1.3 Locality optimization
      3.1.1.4 Tiling the iteration/data space
      3.1.1.5 Communication-minimal loop partitionings
      3.1.1.6 Fine-grain scheduling
    3.1.2 Comparison with the state of the art
      3.1.2.1 DTSE context and global loop nest scope
      3.1.2.2 Exact modeling
      3.1.2.3 Separate phases
      3.1.2.4 Comparison with earlier work at IMEC
  3.2 Problem definition
    3.2.1 Input
    3.2.2 Output
  3.3 Overall loop transformation steering strategy
    3.3.1 Loop transformations in the PDG model
      3.3.1.1 Polytope placement and ordering
      3.3.1.2 Preliminary processor mapping
      3.3.1.3 Global and local ordering
      3.3.1.4 Estimations for data reuse, memory (hierarchy) assignment and in-place mapping
      3.3.1.5 Other estimations
    3.3.2 Example of optimizing an algorithm in the PDG model
      3.3.2.1 Placement phase
      3.3.2.2 Ordering phase
    3.3.3 Reasons for a multi-phase approach
  3.4 Polytope placement: cost functions and heuristics
    3.4.1 Cost functions for regularity
      3.4.1.1 The dependency cone
      3.4.1.2 The allowed ordering vector cone
      3.4.1.3 Construction of the cones
    3.4.2 Cost functions for locality
      3.4.2.1 Simple locality cost function
      3.4.2.2 Refined locality cost function
    3.4.3 Cost functions for data reuse
    3.4.4 Interaction with the ordering phase
      3.4.4.1 Cost functions and data reuse
      3.4.4.2 Loop tiling
    3.4.5 Estimations for in-place mapping
    3.4.6 Ways to reduce the complexity
  3.5 Automated constraint-based polytope placement
    3.5.1 Introduction
    3.5.2 Preliminaries
      3.5.2.1 Homogeneous coordinates
      3.5.2.2 A note on inverse mappings
    3.5.3 Splitting up the placement phase
    3.5.4 Constraint-based regularity optimization
      3.5.4.1 Constraints to make dependencies uniform
      3.5.4.2 Constraints to make dependencies regular
      3.5.4.3 Examples for individual loop transformations
      3.5.4.4 Example of placing an algorithm
    3.5.5 Automated exploration strategy
      3.5.5.1 Basics of the strategy
      3.5.5.2 Selection of the starting set of possible mappings
      3.5.5.3 Implementation
    3.5.6 Experiments on automated polytope placement
      3.5.6.1 Simple example
      3.5.6.2 Algebraic path problem
      3.5.6.3 Updating singular value decomposition
      3.5.6.4 MPEG-4 video motion estimation kernel
      3.5.6.5 Discussion
    3.5.7 Optimization of the translation component
  3.6 Summary

4. SYSTEM-LEVEL STORAGE REQUIREMENT ESTIMATION
  4.1 Context and motivation
  4.2 Previous Work
    4.2.1 Scalar-based estimation
    4.2.2 Estimation with fixed execution ordering
    4.2.3 Estimation without execution ordering
  4.3 Estimation with partially fixed execution ordering
    4.3.1 Motivation and context
    4.3.2 Concept definitions
      4.3.2.1 Dependency part
      4.3.2.2 Dependency vector polytope and its dimensions
      4.3.2.3 Orthogonalization
    4.3.3 Overall estimation strategy
    4.3.4 Generation of common iteration space
    4.3.5 Estimation of individual dependencies
    4.3.6 Estimation of global storage requirement
      4.3.6.1 Simultaneous aliveness for two dependencies
      4.3.6.2 Simultaneous aliveness for multiple dependencies
    4.3.7 Guiding principles
  4.4 Size estimation of individual dependencies
    4.4.1 General principles
    4.4.2 Automated estimation with a partially fixed ordering
    4.4.3 Estimation with partial ordering among dimensions
    4.4.4 Automated estimation on three dimensional code examples
  4.5 Estimation on real-life application demonstrators
    4.5.1 MPEG-4 motion estimation kernel
      4.5.1.1 Code description and external constraints
      4.5.1.2 Individual dependencies
      4.5.1.3 Simultaneously alive dependencies
    4.5.2 SVD updating algorithm
    4.5.3 Cavity detection algorithm
      4.5.3.1 Code description
      4.5.3.2 Storage requirement estimation and optimization
  4.6 Summary

5. AUTOMATED DATA REUSE EXPLORATION TECHNIQUES
  5.1 Context and motivation
  5.2 Related work
  5.3 Basic concepts
    5.3.1 Data reuse factor
    5.3.2 Assumptions
    5.3.3 Cost functions
  5.4 A hole in the picture
    5.4.1 A first simple example
    5.4.2 Definition of reuse factors and memory size
    5.4.3 Application to simple example
    5.4.4 Multiple levels of reuse
  5.5 Steering the global exploration of the search space
    5.5.1 Extension of Belady's MIN Algorithm to larger time-frames
    5.5.2 DRD, time-frame, offset: influence on relevant cost variables
    5.5.3 Heuristics to speed up the search
  5.6 Experimental results
    5.6.1 Binary Tree Predictive Coder
    5.6.2 MPEG4 Motion estimation
    5.6.3 Cavity detection
    5.6.4 SUSAN principle
  5.7 Summary

6. STORAGE CYCLE BUDGET DISTRIBUTION
  6.1 Summary of research on customized memory organisations
  6.2 Memory bandwidth and memory organisation principles
    6.2.1 Memory access ordering and conflict graphs
    6.2.2 Memory/Bank Allocation and Assignment
  6.3 How to meet real-time bandwidth constraints
    6.3.1 The cost of data transfer bandwidth
    6.3.2 Cost model for high-speed memory architectures
    6.3.3 Costs in highly parallel memory architectures
    6.3.4 Balance memory bandwidth
  6.4 System-wide energy cost versus cycle budget tradeoff
    6.4.1 Basic Trade-off Principles
    6.4.2 Binary Tree Predictive Coder Demonstrator
    6.4.3 Exploiting the use of the Pareto curves
      6.4.3.1 Trading off cycles between memory organisation and data-path
      6.4.3.2 Energy tradeoffs between concurrent tasks
    6.4.4 Demonstrator: Synchro core of DAB
  6.5 Storage cycle budget distribution techniques
    6.5.1 Summary of the flat flow graph technique
    6.5.2 Dealing with hierarchical graphs
    6.5.3 Cycle distribution across blocks
    6.5.4 Global optimum for ECG
    6.5.5 Incremental SCBD
    6.5.6 Experimental SCBD results
  6.6 Conflict Directed Ordering Techniques
    6.6.1 Basic principles
    6.6.2 Illustration and motivation
    6.6.3 Exploration of access orderings
      6.6.3.1 Problem formulation
      6.6.3.2 A 1D digital filter example
      6.6.3.3 Conflict-directed ordering (CDO)
      6.6.3.4 Localized Conflict Directed Ordering (LCDO)
      6.6.3.5 LCDO with global constraint propagation (GCP)
      6.6.3.6 LCDO with deterministic GCP
    6.6.4 Faster but still effective exploration
      6.6.4.1 Multi-level Local CDO (LCDO(k))
      6.6.4.2 (Multi-level) Generalized CDO (G-CDO(k))
      6.6.4.3 Five variants
    6.6.5 Experiments on real-life examples
      6.6.5.1 Telecom networks - Segment Protocol Processor (SPP)
      6.6.5.2 Speech processing - Voice Coder (VoCoder)
      6.6.5.3 Image and video processing - Binary Tree Predictive Coding (BTPC)
  6.7 Summary

7. CACHE OPTIMIZATION
  7.1 Related work in compiler optimizations for caches
    7.1.1 Loop fusion
    7.1.2 Loop tiling/blocking
    7.1.3 Multi-level blocking
    7.1.4 Array padding
    7.1.5 Combination of loop and data transformations
    7.1.6 Software prefetching
    7.1.7 Scratch pad versus cache memory
