ebook img

Neural Networks for Conditional Probability Estimation: Forecasting Beyond Point Predictions PDF

279 Pages·1999·23.805 MB·English
Save to my drive
Quick download
Download
Most books are stored in the elastic cloud where traffic is expensive. For this reason, we have a limit on daily download.

Preview Neural Networks for Conditional Probability Estimation: Forecasting Beyond Point Predictions

Perspectives in Neural Computing Springer London Berlin Heidelberg New¥ork Barcelona Hong Kong Milan Paris Santa Clara Singapore Tokyo Also in this series Maria Marinaro and Roberto Tagliaferri (Eds) Neural Nets -WIRNVIETRI-96 3-540-76099-7 Adrian Shepherd Second -Order Methods for Neural Networks 3-540-76100~4 Dimitris C. Dracopoulos Evolutionary Learning Algorithms for Neural Adaptive Control 3-540-76161-6 John A. Bullinaria, David W. Glasspool and George Houghton (Eds) 4th Neural Computation and Psychology Workshop, London, 9-11 April 1997: Connectionist Representations 3-540-76208-6 Maria Marinaro and Roberto Tagliaferri (Eds) Neural Nets -WIRN VIETRI-97 3-540-76157-8 Gustavo Deco and Dragan Obradovic An Information-Theoretic Approach to Neural Computing 0-387 -94666-7 Thomas Lindblad and Jason M. Kinser Image Processing using Pulse-Coupled Neural Networks 3-540-76264-7 L. Niklasson, M. Boden and T. Ziemke (Eds) ICANN98 3-540-76263-9 Maria Marinaro and Roberto Tagliaferri (Eds) Neural Nets -WIRN VIETRI-98 1-85233-051-1 Amanda J.C. Sharkey (Ed.) Combining Artificial Neural Nets 1-85233-004-X Dietmar Heinke, Glyn W. Humphreys and Andrew Olson (Eds) Connectionist Models in Cognitive Neuroscience The 5th Neural Computation and Psychology Workshop, Birmingham, 8-10 September 1998 1-85233-052-X Dirk Husmeier Neural Networks for Conditional Probability Estimation Forecasting Beyond Point Predictions " Springer Dirk Husmeier, PhD Neural Systems Group, Department of Electrical & Electronic Engineering, Imperial College, Exhibition Road, London. SW7. Series Editor J.G. Taylor, BA, BSc, MA, PhD, FlnstP Centre for Neural Networks, Department of Mathematics, King's College, Strand, London WC2R 2LS, UK ISB N-13:978-1-85233-095-8 Springer-Verlag London Berlin Heidelberg British Library Cataloguing in Publication Data Husmeier, Dirk Neural networks for conditional estimation: forecasting beyond point predictions. -(Perspectives in neural computing) I.Neurai networks (Computer science) 2.Prediction theory I.Title 006.3'2 ISBN-13:978-1-85233-095-8 Library of Congress Cataloging-in-Publication Data Husmeier, Dirk, 1964- Neural networks for conditional probability estimation: forecasting beyond point predictions I Dirk Husmeier. p. em. --(Perspectives in neural computing) ISBN-13:978-1-85233-095-8 e-ISBN-13:978-1-4471-0847-4 DOl: 10.1007/978-1-4471-0847-4 1. Neural networks (Computer science) 2. Distribution (probabilitytheory)--Dat;l processing. I. Title. II. Series QA76.87.H87 1999 98-48438 006.3'2--dc21 CIP Apart from any fair dealing for the purposes of research or private study, or criticism or review, as permitted under the Copyright, Designs and Patents Act 1988, this publication may only be reproduced, stored or transmitted. in any form or by any means, with the prior permission in writing of the publishers, or in the case of reprographic reproduction in accordance with the terms of licences issued by the Copyright Licensing Agency. Enquiries concerning reproduction outside those terms should be sent to the publishers. © Springer-Verlag London Limited 1999 The use of registered names, trademarks etc. in this publication does not imply, even in the absence of a specific statement, that such names are exempt from the relevant laws and regulations and therefore free for general use. The publisher makes no representation, express or implied. with regard to the accuracy of the information contained in this book and cannot accept any legal responsibility or liability for any errors or omissions that may be made. Typesetting: Camera-ready by the editor 3413830-543210 Printed on acid-free paper To Ulli vii Acknowledgements Major parts of this book are based on my PhD thesis, which I completed in the Department of Mathematics at King's College London and which was funded by a Postgraduate Knight Studentship from the University of Lon don. I would like to thank my supervisor, John Taylor, for his support, ad vice and insightful criticism. I am fortunate to have been part of the research group he has led and am grateful to Valeriu Beiu and fellow students Paulo Adeodato, Bart Krekelberg, Oury Monchi, and Rasmus Petersen for their methodological support and inspiring discussions. My thanks also go to Mike Reiss, David Lavis, Mike Freeman, Richard Everson, lead Rezek, Stephen Hunt, Jon Rogers, and Sreejith Das for various assistance in solving soft ware and hardware problems, as well as to Robert Dale, Rasmus Petersen, William Penny and Stephen Roberts for meticulous proofreading of parts of the manuscript. Although I have tried to develop a personal point of view on the top ics covered by this book, I recognise that lowe much to other authors. In particular my work profited greatly from the study of David MacKay's publi cations on the Bayesian evidence method, which were a pleasure to read and substantially inspired the work leading to Chapters 9-12. During the revision of the manuscript and the work on those chapters that were not part of my thesis, I was employed as a research associate in the Department of Electrical Engineering at Imperial College London. I would like to express .my gratitude to Stephen Roberts for allowing me to dedicate large parts of my working hours to the completion of this project. I would also like to acknowledge the support of the editors at Springer Ver lag, Rosie Kemp and Vicki Swallow. The interest they took in my work helped me considerably to keep on course, and I am grateful for all their assistance during the laborious process from the first manuscript to the camera-ready copy. Last but not least, my thanks go to my wife, Ulrike Husmeier. It is needless to point out that the process of writing this book demanded huge sacrifices of our common time, and yet she managed to help me find the necessary distraction and avoid the perils of overly grim determination. ix Preface Conventional applications of neural networks usually predict a single value as a function of given inputs. In forecasting, for example, a standard objective is to predict the future value of some entity of interest on the basis of a time series of past measurements or observations. Typical training schemes aim to minimise the sum of squared deviations between predicted and actual values (the 'targets'), by which, ideally, the network learns the conditional mean of the target given the input. If the underlying conditional distribution is Gaus sian or at least unimodal, this may be a satisfactory approach. However, for a multimodal distribution, the conditional mean does not capture the relevant features of the system, and the prediction performance will, in general, be very poor. This calls for a more powerful and sophisticated model, which can learn the whole conditional probability distribution. Chapter 1 demonstrates that even for a deterministic system and 'be nign' Gaussian observational noise, the conditional distribution of a future observation, conditional on a set of past observations, can become strongly skewed and multimodal. In Chapter 2, a general neural network structure for modelling conditional probability densities is derived, and it is shown that a universal approximator for this extended task requires at least two hidden layers. A training scheme is developed from a maximum likelihood approach in Chapter 3, and the performance ofthis method is demonstrated on three stochastic time series in chapters 4 and 5. Several extensions of this basic paradigm are studied in the following chapters, aiming at both an increased training speed· and a better generalisation performance. Chapter 7 shows that a straightforward application of the Expectation Maximisation (EM) al gorithm does not lead to any improvement in the training scheme, but that in combination with the random vector functional link (RVFL) net approach, reviewed in Chapter 6, the training process can be accelerated by about two orders of magnitude. An empirical corroboration for this 'speed-up' can be found in Chapter 8. Chapter 9 discusses a simple Bayesian approach to network training, where a conjugate prior distribution on the network pa rameters naturally results in a penalty term for regularisation. However, the hyperparameters still need to be set by intuition or cross-validation, so a con sequent extension is presented in chapters 10 and 11, where the Bayesian evidence scheme, introduced to the neural network community by MacKay for regularisation and model selection in the simple case of Gaussian homoscedas tic noise, is generalised to arbitrary conditional probability densities .. The Hessian matrix of the error function is calculated with an extended version of the EM algorithm. The resulting update equations for the hyperparameters and the expression for the model evidence are found to reduce to MacKay's results in the above limit of Gaussian noise and thus provide a consequent generalisation of these earlier results. An empirical test of the evidence-based x regularisation scheme, presented in Chapter 12, confirms that the problem of overf itting can be considerably reduced, and that the training process is stabilised with respect to changes in the length of training time. A further improvement of the generalisation performance can be achieved by employing network committees, for which two weighting schemes - based on either the evidence or the cross-validation performance - are derived in Chapter 13. Chapters 14 and 16 report the results of extensive simulations on a syn thetic and a real-world problem, where the intriguing observation is made that in network committees, overfitting of the individual models can be use ful and may lead to better prediction results than obtained with an ensemble of properly regularised networks. An explanation for this curiosity can be given in terms of a modified bias-variance dilemma, as expounded in Chap ter 13. The subject of Chapter 15 is the problem of feature selection and the identification of irrelevant inputs. To this end, the automatic relevance determination (ARD) scheme of MacKay and Neal is adapted to learning in comittees of probability-predicting RVFL networks. This method is applied in Chapter 16 to a real-world benchmark problem, where the objective is the prediction of housing prices in the Boston metropolitan area on the ba sis of various socio-economic explanatory variables. The book concludes in Chapter 17 with a brief summary. xi Nomenclature In this book, the following notation has been adopted. Lower-case boldface letters, for example x, are used to denote vectors, while upper-case boldface letters, such as H, denote matrices. The transpose of a vector or a matrix is denoted by a superscript t, for example, xt. In most cases, x denotes a column vector, whereas xt denotes the corresponding row vector. The upper case letter P is used to represent both a discrete probability and a probability density. When used with different arguments, P represents different functions. For example, P(x) denotes the distribution of a random variable x and P(y) the distribution of a random variable y, so that these distributions are de noted by the same symbol P even though they represent different functions. Moreover, the notation does not distinguish between a random variable, X, and its realisation, x. I hope these conventions simplify the notation and re duce unnecessary opacity rather than cause confusion. The symbols used for some frequently occuring quantities in this book are listed below: S-layer first hidden layer of sigmoidal nodes 9-layer second hidden layer of nodes with a bell-shaped transfer function m number of nodes in the input layer, also: dimension of the hyperparameter space H number of nodes in the S-layer f{ number of nodes in the 9-layer output weights or prior probabilities kernel width parameters determining the kernel width set of all network parameters vector of weights in the kth weight group, that is, weights feeding into the kth 9-node inverse variance of a Gaussian prior distribution on the weights in the kth weight group 'Yk number of well-determined parameters in the kth weight group [defined in (10.72)] S(.) sigmoidal transfer function, i.e. a function that is continuous, monotonically increasing, and bounded from above and below ,(.) sigmoid function, ,(x) = (1 + e-x)-l U EM error function, defined in (7.3) E maximum likelihood error function, defined in (3.8). E* total error function, defined in (10.30) xii Etrain same as E, the training 'error' cross-validation 'error' Eeross Egen empirical generalisation 'error' [estimated on the test set according to (8.4)] [. theoretical generalisation 'error' [defined in (3.7)] N, Ntrain number of exemplars in the training set number of exemplars in the cross-validation set Neross Ngen number of exemplars in the test set number of networks in a committee of networks NCom EgCeonm empirical generalisation 'error' of a network committee

See more

The list of books you might like

Most books are stored in the elastic cloud where traffic is expensive. For this reason, we have a limit on daily download.