On some extensions of Radial Basis Functions and their applications in Artificial Intelligence

Federico Girosi
Artificial Intelligence Laboratory
Massachusetts Institute of Technology
Cambridge, Massachusetts 02139, USA

1 Introduction

In recent years approximation theory has found interesting applications in the fields of Artificial Intelligence and Computer Science. For instance, a problem that fits very naturally in the framework of approximation theory is the problem of learning to perform a particular task from a set of examples. The examples are sparse data points in a multidimensional space, and learning means reconstructing a surface that fits the data. From this perspective, the popular Neural Networks approach to this problem (Rumelhart et al., 1986; Hertz et al., 1991) is nothing else than the implementation of a particular kind of nonlinear approximation scheme. However, despite their great popularity, very little is known about the properties of neural networks. For this reason we started considering the same class of problems, but in a more classical framework. In particular, since the problem of approximating a surface from sparse data points is ill-posed, regularization theory (Tikhonov, 1963; Tikhonov and Arsenin, 1977; Morozov, 1984; Bertero, 1986) seemed to be an ideal framework. Regularization theory leads naturally to the formulation of a variational principle, from which it is possible to derive some well known approximation schemes (Wahba, 1990), such as the multivariate splines previously introduced by Duchon (1977) and Meinguet (1979) and the more general Radial Basis Functions technique (Powell, 1987; Franke, 1982; Micchelli, 1986; Kansa, 1990a,b; Madych and Nelson, 1991; Dyn, 1991). Due to the characteristics of the problems of machine learning, Radial Basis Functions seemed to be a very appropriate technique. Moreover, this method has a simple interpretation in terms of a "network" whose architecture is similar to that of the multilayer perceptron, and therefore retains all the advantages of this architecture, such as the high degree of parallelizability. Unfortunately, in many practical cases the Radial Basis Functions method cannot be applied in a straightforward manner, because it does not take into account some features that are typical of problems in Artificial Intelligence. The goal of this paper is to review some of these aspects, show possible solutions in the framework of Radial Basis Functions, and point out some open problems. In the next section we briefly define what we mean by learning, and show how naturally it can be embedded in the framework of approximation theory.
In section 3 we review the classical variational approach to surface reconstruction, and in section 4 we show some extensions that are needed in order to cope with practical problems. In section 5 we present two examples, and in section 6 we draw some conclusions and indicate future directions of investigation. Other aspects and extensions of this approach to learning are discussed in (Poggio and Girosi, 1990).

2 Learning and Approximation Theory

The problem of learning is at the very core of the problem of intelligence. But learning is a very broad term, and there are many different forms of learning that must be distinguished. In this paper we will consider a specific form of the problem of learning from examples. This is clearly only one small part of the larger problem of learning, but, we believe, an interesting place from which to start a rigorous theory and from which to develop useful tools.

2.1 Learning as an Approximation Problem

If we look at many of the problems that have been considered in the field of machine learning, we notice that in all cases an instance of the problem is given by a set $D = \{(x_i, y_i) \in X \times Y\}_{i=1}^{N}$ of input-output pairs, belonging to some input and output spaces $X$ and $Y$ (see fig. 1).

Figure 1 about here

We assume that there is some relationship between the input and the output, and consider the pairs $(x_i, y_i)$ as examples of it. Therefore we assume that there exists a map $f : X \rightarrow Y$ with the property:

$$f(x_i) = y_i, \qquad i = 1, \dots, N.$$

Learning, or generalizing, means being able to estimate the function at points of its domain $X$ where no data are available. From a mathematical point of view this means estimating the function $f$ from the knowledge of a subset $D$ of its graph, that is, from a set of sparse data. Therefore, from this point of view, the problem of learning is equivalent to the problem of surface reconstruction. Of course learning and generalization are possible only under the assumption that the world in which we live is, at the appropriate level of description, redundant. In terms of surface approximation this means that the surface to be reconstructed has to be smooth: small changes in some input determine a correspondingly small change in the output (it may be necessary in some cases to accept piecewise smoothness). Generalization is not possible if the mapping is completely random. For instance, any number of examples for the mapping represented by a telephone directory (people's names into telephone numbers) does not help in estimating the telephone number corresponding to a new name.
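To fix ideas, the following minimal sketch (in Python; the one-dimensional target function, the sample points and the noise level are all invented, purely for illustration) builds a data set $D$ of sparse, noisy input-output pairs of the kind considered in this paper; the learning task is then to estimate the underlying function at inputs that do not appear in $D$.

    import numpy as np

    rng = np.random.default_rng(0)

    # A hypothetical "true" map f : X -> Y that the learner never sees directly.
    def f_true(x):
        return np.sin(2.0 * np.pi * x)

    # Sparse, noisy examples D = {(x_i, y_i)}, i = 1, ..., N, sampled from the graph of f.
    N = 20
    x_data = rng.uniform(0.0, 1.0, size=N)
    y_data = f_true(x_data) + 0.05 * rng.normal(size=N)

    # Learning/generalization: estimate f at a point where no datum is available.
    # A naive estimate (nearest neighbour) is used here; the schemes discussed in
    # the paper replace this step with a principled approximation of f.
    x_new = 0.5
    y_estimate = y_data[np.argmin(np.abs(x_data - x_new))]
    print(y_estimate, f_true(x_new))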
Some examples are in order at this point.

1. Learning to pronounce English words from their spelling. A well known neural network implementation has been claimed to solve the problem of converting strings of letters into strings of phonemes (Sejnowski and Rosenberg, 1987). The mapping to be learned was therefore

$$X = \{\mbox{English words}\} \rightarrow Y = \{\mbox{phonemes}\}.$$

The input string consisted of 7 letters, and the output was the phoneme corresponding to the letter in the middle of the string. Feeding English text in a window of 7 letters, a string of phonemes was produced and then processed by a digital speech synthesizer, therefore enabling the machine to read the text aloud. Because of the particular way of encoding letters and phonemes, the input consisted of a binary vector of 203 elements, and the output of a vector of 26 elements. Clearly in this case the smoothness assumption is often violated, since similar words do not always have similar pronunciations, and this is reflected in the fact that the percentage error on a test set of data was 78% for continuous informal speech.

2. Learning to recognize a 3D object from its 2D image. The input data set consists of 2D views of different objects in different poses. The output corresponding to a given input is 1 if the input is a 2D view of the object that has to be recognized, and 0 otherwise. A 2D view is a set of features of the 2D image of the object. In the simplest case a feature is the location, on the image plane, of some reference point, but it may be any parameter associated to the image. For example, if we are interested in recognizing faces, common features are nose and mouth width, eyebrow thickness, pupil to eyebrow separation, pupil to nose vertical distance, and mouth height. In general the mapping to be learned is:

$$X = \{\mbox{2D views}\} \rightarrow Y = \{0, 1\}.$$

In (Poggio and Edelman, 1990) the problem of recognizing simple wire-frame objects was considered. The input was the 12-dimensional space of the locations on the image plane of 6 feature points. Good results were obtained applying least squares Radial Basis Functions (see section 4.1) with 80-100 examples. This indicates that the underlying mapping is smooth almost everywhere (in fact, under particular conditions, it may be the characteristic function of a half-space). Good results have also been obtained, with real world images, on the similar problems of face and gender recognition, although in that case the map to be approximated may be much more complex (Brunelli and Poggio, 1991, 1991a).

3. Learning navigation tasks. An indoor robot is manually driven through a corridor, while its frontal camera records a sequence of frontal images. Can this sequence be used to train the robot to drive by itself, without crashing into the walls, by using the visual input, and to generalize to slightly different corridors and different positions within them? The map to be approximated is

$$X = \{\mbox{images}\} \rightarrow Y = \{\mbox{steering commands}\}.$$

D. Pomerleau (1989) has described a similar outdoor problem of driving the CMU Navlab, and considered an image as described by all its pixel grey values.
The corresponding approximation problem, solved with a neural network architecture, had therefore 900 variables, since images of 30 × 30 pixels have been used. A simpler possibility consists in coding an image by an appropriate set of features (such as the location or orientation of relevant edges of the image) (Aste and Caprile, 1991).

4. Learning motor control. Consider a multiple-joint robot arm that has to be controlled. One needs to solve the inverse dynamics problem: compute, from the positions, velocities and accelerations of the joints, the corresponding torques at the joints. The map to be learned is therefore the "inverse dynamics" map

$$X = \{\mbox{state space}\} \rightarrow Y = \{\mbox{torques}\},$$

where the state space is the set of all admissible positions, velocities and accelerations of the robot. For a two-joint arm the state space is six dimensional and there are two torques to be learned, one for each joint. Very good performances have been obtained in simulations run by Botros and Atkeson (1990) using the extensions of the Radial Basis Functions technique described in section 4.2.

5. Learning to predict time series. Suppose we observe and collect data about the temporal evolution of a multidimensional dynamical system from some time in the past up to now. Given the state of the system at some time $t$, we would like to be able to predict its state at a later time $t + T$. In practice we usually observe the temporal evolution of one of the variables of the system, say $x(t)$, and represent the state of the system as a vector of "delay variables" (Farmer and Sidorowich, 1987)[1]:

$$\mathbf{x}(t) = (x(t), x(t - \tau), \dots, x(t - (d-1)\tau)),$$

where $\tau$ is an appropriate delay time, and $d$ is larger than the dimension of the attractor of the dynamical system. The map that has to be learned is therefore

$$f_T : R^d \rightarrow R, \qquad f_T(\mathbf{x}(t)) = x(t + T).$$

Some authors (Broomhead et al., 1990; Casdagli, 1989; Moody et al., 1989) have already successfully applied Radial Basis Functions techniques to time series prediction, with interesting results. A minimal sketch of this delay-coordinate construction is given right after this list.

[1] It can be shown (Takens, 1981) that this representation preserves the topological properties of the attractor. However, this is not the only technique that can be used to represent the state of a dynamical system (Broomhead and King, 1986).
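The following sketch (Python; the scalar series, the delay $\tau$, the embedding dimension $d$ and the horizon $T$ are all invented for illustration) shows how the delay vectors $\mathbf{x}(t)$ and the targets $x(t+T)$ of the last example can be assembled into input-output pairs for any of the approximation schemes discussed below.

    import numpy as np

    def delay_embed(series, d, tau, T):
        """Build pairs (x(t), x(t+T)) with x(t) = (x(t), x(t-tau), ..., x(t-(d-1)*tau))."""
        X, y = [], []
        start = (d - 1) * tau                    # first index with a full history
        for t in range(start, len(series) - T):
            X.append([series[t - k * tau] for k in range(d)])
            y.append(series[t + T])
        return np.array(X), np.array(y)

    # Invented example: a noisy sine wave sampled at unit time steps.
    t = np.arange(0, 500)
    series = np.sin(0.07 * t) + 0.01 * np.random.default_rng(0).normal(size=t.size)

    X, y = delay_embed(series, d=4, tau=5, T=10)
    # Each row of X is a delay vector; y holds the values to be predicted.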
2.2 Learning from Examples and Radial Basis Functions

As we have seen, in many practical cases learning to perform some task is equivalent to recovering a function from a set of scattered data points. Many different techniques are currently available for surface approximation. However, the approximation problems arising from learning problems have some unusual features that make many of these techniques not very appropriate. Let us briefly see some of these features:

- large number of dimensions. This is the most striking characteristic of the problem of learning from examples. Usually a large set of numbers is needed in order to specify the input of the system. Consider for example all the problems related to vision, such as object or character recognition: if all the pixels of the image are used, the number of variables is enormous (of the order of 1000 for low resolution images of 30 × 30 pixels). Clearly, not all the information carried by the pixels may be necessary, and several techniques have been developed to extract, out of the thousands of pixel values, "small" sets of features, of the order of one hundred or less (16 features have been successfully used by Brunelli and Poggio (1991) for face and gender recognition). In speech recognition problems the number of dimensions is also easily of the order of hundreds (Lippmann, 1989). Time series prediction is a "simple" problem from this point of view. In fact the number of variables is of the order of the dimension of the attractor of the underlying dynamical system, which is typically smaller than 10;

- relatively small number of data points. In practical applications the number of available data points may vary from a few hundreds to several thousands, but usually never exceeds $10^4$. For such high dimensional problems, however, these numbers are small. The 30-dimensional hypercube, with $10^4$ data points in it, is empty, considering that the number of its vertices is $2^{30} \approx 10^9$. The emptiness of these high dimensional spaces is a manifestation of the so called "curse of dimensionality", and sets the limits of any approximation technique;

- noisy data. In every practical application data are noisy. For this reason approximation, instead of pure interpolation, is of interest.

It is natural to ask if, besides computational problems, it is meaningful to approximate a function of 30 variables given $10^4$ data points or less. The answer clearly depends on the properties of the functions we want to approximate, such as, for example, their degree of smoothness. However, there is another factor to be taken into account: the fact that often only a low degree of accuracy is required. In many cases, in fact, a relative error of $10^{-2}$-$10^{-3}$ may be more than satisfactory, as long as we can be confident that it is never exceeded. The extensive experimentation that has been carried out in the field of neural networks indicates that in practical cases these conditions are met, since good experimental results have been found in many cases.
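The sparsity claim made above can be checked numerically with a few lines of Python; the dimension and sample size are those quoted in the text, while the uniform sampling and the nearest-neighbour statistic are our own, purely illustrative choices.

    import numpy as np

    rng = np.random.default_rng(0)
    d, N = 30, 10_000

    print(2 ** d)          # 1073741824 vertices of the 30-dimensional unit hypercube
    print(2 ** d / N)      # roughly 10^5 vertices per available data point

    # Distance from a query point to its nearest neighbour among N uniform samples:
    # in 30 dimensions it exceeds the side length of the cube, i.e. the data
    # leave most of the space empty.
    points = rng.uniform(size=(N, d))
    query = rng.uniform(size=d)
    nearest = np.min(np.linalg.norm(points - query, axis=1))
    print(nearest, np.sqrt(d))   # nearest-neighbour distance vs. cube diagonal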
Once established that approximation theory may be useful in some practical learning problems, a specific technique has to be chosen. Given the characteristics of learning problems, Radial Basis Functions seem to be quite a natural choice. In fact, most of the other techniques, which may be very successful in 1 or 2 dimensions, run into problems when the number of dimensions increases. For example, in such large and empty spaces, techniques based on tensor products or triangulations do not seem appropriate. Moreover, since we can now take advantage of parallel machines, it is very convenient to deal with functions of the form

$$f(x) = \sum_{i=1}^{N} c_i \phi_i(x),$$

and Radial Basis Functions satisfy this requirement. As we will see in the next section, Radial Basis Functions can be derived in a variational framework, in which noisy data are naturally taken into account. However, it will turn out that a straightforward application of Radial Basis Functions is not always possible, because of the high computational complexity or because a departure from radiality is needed. Nevertheless, approximations of the original Radial Basis Functions expansion can be devised ("least squares Radial Basis Functions") that still maintain the good features of Radial Basis Functions, and non-radiality can also be taken into account, as will be shown in section 4.2.

Of course, techniques other than Radial Basis Functions could be used. For example, Multilayer Perceptrons (MLP) are very common in the Artificial Intelligence community. In its simplest version a Multilayer Perceptron, which is a particular case of "neural network", is a nonlinear approximation scheme based on the following parametric representation:

$$f(x) = \sum_{i=1}^{n} c_i \sigma(x \cdot w_i + \theta_i), \qquad (1)$$

where the parameters $c_i$, $w_i$ and $\theta_i$ have to be found by minimizing the least squares error on the data points. Not much is known about the properties of such an approximation scheme, except the fact that the set of functions of the type (1) is dense in the space of continuous functions on compact sets provided with the uniform norm (Cybenko, 1989; Funahashi, 1989). The huge literature in the field of neural networks shows that this approximation scheme performs well in many cases, but a solid analysis of its performance is still missing.
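A minimal sketch of the parametric form (1), in Python; the network size, the sigmoidal choice of $\sigma$ and all parameter values are invented for illustration, and no least squares fitting is performed here.

    import numpy as np

    def sigma(t):
        # A common sigmoidal choice; eq. (1) only requires some fixed nonlinearity.
        return 1.0 / (1.0 + np.exp(-t))

    def mlp(x, c, W, theta):
        """Evaluate f(x) = sum_i c_i * sigma(x . w_i + theta_i), cf. eq. (1)."""
        return np.sum(c * sigma(W @ x + theta))

    rng = np.random.default_rng(0)
    d, n = 5, 8                        # input dimension and number of units (arbitrary)
    c = rng.normal(size=n)             # output weights c_i
    W = rng.normal(size=(n, d))        # rows are the vectors w_i
    theta = rng.normal(size=n)         # offsets theta_i

    x = rng.normal(size=d)
    print(mlp(x, c, W, theta))

In the least squares approach mentioned above, the values of $c_i$, $w_i$ and $\theta_i$ would be adjusted so as to minimize the squared error between $f(x_i)$ and $y_i$ over the set of examples.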
For these reasons we focused our attention on the Radial Basis Functions technique, which is better understood and more amenable to theoretical analysis. The next section is therefore devoted to a review of the Radial Basis Functions method, and in particular to its variational formulation.

3 A Variational Approach to Surface Reconstruction

From the point of view of learning as approximation, the problem of learning a smooth mapping from examples is ill-posed (Hadamard, 1964; Tikhonov and Arsenin, 1977), in the sense that the information in the data is not sufficient to reconstruct the mapping uniquely in regions where data are not available. In addition, the data are usually noisy. Some a priori information about the mapping is needed in order to find a unique, physically meaningful, solution. Several techniques have been devised to embed a priori knowledge in the solution of ill-posed problems, and most of them have been unified in a very general theory, known as regularization theory (Tikhonov, 1963; Tikhonov and Arsenin, 1977; Morozov, 1984; Bertero, 1986).

The a priori information considered by regularization theory can be of several kinds. The most common concerns smoothness properties of the solution, but localization properties and upper/lower bounds on the solution and/or its derivatives can also be taken into account. We are mainly interested in smoothness properties. In fact, whenever we try to learn to perform some task from a set of examples, we are making the implicit assumption that we are dealing with a smooth map. If this is not true, that is, if small changes in the input do not determine small changes in the output, there is no hope to generalize, and therefore to learn. One of the ideas underlying regularization theory is that the solution of an ill-posed problem can be obtained from a variational principle that contains both the data and the a priori information. We now sketch the general form of this variational principle and of its solution, and show how it leads to the Radial Basis Functions method and to some possible generalizations of it.

3.1 A Variational Principle for Multivariate Approximation

Suppose that the set $D = \{(x_i, y_i) \in R^d \times R\}_{i=1}^{N}$ of data has been obtained by randomly sampling a function $f$, defined on $R^d$, in the presence of noise. We are interested in recovering the function $f$, or an estimate of it, from the set of data $D$. If the class of functions to which $f$ belongs is large enough, this problem clearly has an infinite number of solutions, and some a priori knowledge is needed in order to find a unique one. When data are not noisy we are looking for a strict interpolant, and therefore need to pick one among all the possible interpolants. If we know a priori that the original function has to satisfy some constraint, and if we can define a functional $\Phi(f)$ (usually called a stabilizer) that measures the deviation from this constraint, we can choose as a solution the interpolant that minimizes $\Phi(f)$. If the data are noisy we do not want to interpolate them, and a better choice consists in minimizing the following functional:
$$H[f] = \sum_{i=1}^{N} (y_i - f(x_i))^2 + \lambda \Phi(f), \qquad (2)$$

where $\lambda$ is a positive parameter. The first term measures the distance between the data and the desired solution $f$, the second term measures the cost associated with the deviation from the constraint, and $\lambda$, the regularization parameter, determines the trade-off between these two terms. This is, in essence, one of the ideas underlying regularization theory, applied to the approximation problem. As previously said, we are interested in the case in which the stabilizer $\Phi$ enforces some smoothness constraint, and therefore we face the problem of choosing a suitable class of stabilizers.

Functionals that can be used to measure the smoothness of a function have been studied in the past, and many of them have the property of being semi-norms in some Banach space. A common choice (Duchon, 1977; Meinguet, 1979, 1979a; Wahba, 1990 and references therein) is the following:

$$\Phi(f) = \sum_{|\alpha| = m} \| D^{\alpha} f \|^2, \qquad (3)$$

where $\alpha$ is a multi-index, $|\alpha| = \alpha_1 + \alpha_2 + \dots + \alpha_d$, $D^{\alpha}$ is the derivative of order $\alpha$, and $\| \cdot \|$ is the standard $L_2$ norm.

The functional of eq. (3) is not the only sensible choice. Madych and Nelson (1990) proved that any conditionally positive definite function $G$ can be used to define a semi-norm in a space $X_G \subset C[R^d]$ that generalizes the semi-norm (3) (see also (Dyn, 1987, 1991)), and therefore fits perfectly in the framework of regularization theory. In essence, to any conditionally positive definite function $G$ of order $m$ it is possible to associate the functional

$$\Phi(f) = \int_{R^d} ds \, \frac{|\tilde{f}(s)|^2}{\tilde{G}(s)}, \qquad (4)$$

where $\tilde{f}$ and $\tilde{G}$ are the generalized Fourier transforms of $f$ and $G$. It turns out that the functional (4) is a semi-norm, whose null space is the set of polynomials of degree at most $m$. If this functional is used as a stabilizer in eq. (2), the solution of the minimization problem is of the form

$$f(x) = \sum_{i=1}^{N} c_i G(x - x_i) + \sum_{\alpha=1}^{k} d_{\alpha} \pi_{\alpha}(x), \qquad (5)$$

where $\{\pi_{\alpha}\}_{\alpha=1}^{k}$ is a basis in the space of polynomials of degree at most $m$, and $c_i$ and $d_{\alpha}$ are coefficients that have to be determined. If expression (5) is substituted in the functional (2), $H[f]$ becomes a function $H(c, d)$ of the variables $c_i$ and $d_{\alpha}$. Therefore the coefficients $c_i$ and $d_{\alpha}$ of equation (5) can be obtained by minimizing $H(c, d)$ with respect to them, which leads to the following linear system:

$$(G + \lambda I) c + \Pi^T d = y, \qquad \Pi c = 0, \qquad (6)$$

where $I$ is the identity matrix, and we have defined

$$(y)_i = y_i, \quad (c)_i = c_i, \quad (d)_i = d_i, \quad (G)_{ij} = G(x_i - x_j), \quad (\Pi)_{\alpha i} = \pi_{\alpha}(x_i).$$

Clearly, in the limit $\lambda = 0$ these equations become the standard equations of the Radial Basis Functions technique, and the interpolation conditions $f(x_i) = y_i$ are satisfied. If the data are noisy, the value of $\lambda$ should be proportional to the amount of noise in the data. The optimal value of $\lambda$ can be found by means of techniques like Generalized Cross Validation (Wahba, 1990), but we do not discuss this here.
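As an illustration of how equations (5) and (6) are used in practice, here is a minimal Python sketch. It assumes a Gaussian basis function $G$, which is positive definite, so the polynomial term of eq. (5) can be dropped and the system (6) reduces to $(G + \lambda I)c = y$; the data, the width of the Gaussian and the value of $\lambda$ are all invented for illustration, and no Generalized Cross Validation is performed.

    import numpy as np

    def gaussian(r, width=1.0):
        # Radial basis function G; the Gaussian needs no polynomial term in eq. (5).
        return np.exp(-(r / width) ** 2)

    def fit_rbf(x_data, y_data, lam=1e-3, width=1.0):
        """Solve (G + lambda*I) c = y, cf. eq. (6) without the polynomial part."""
        r = np.abs(x_data[:, None] - x_data[None, :])       # pairwise distances
        G = gaussian(r, width)
        return np.linalg.solve(G + lam * np.eye(len(x_data)), y_data)

    def evaluate_rbf(x_new, x_data, c, width=1.0):
        """Evaluate f(x) = sum_i c_i G(x - x_i), cf. eq. (5)."""
        return gaussian(np.abs(x_new - x_data), width) @ c

    # Invented one-dimensional data set.
    rng = np.random.default_rng(0)
    x_data = np.sort(rng.uniform(0.0, 1.0, size=20))
    y_data = np.sin(2.0 * np.pi * x_data) + 0.05 * rng.normal(size=20)

    c = fit_rbf(x_data, y_data, lam=1e-3, width=0.2)
    print(evaluate_rbf(0.5, x_data, c, width=0.2))          # estimate at a new point

With lam set to zero the sketch would interpolate the data exactly, as noted in the text for the limit $\lambda = 0$.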
Of course the Radial Basis Functions approach to the problem of learning from examples has its drawbacks, and very often it cannot be applied in a straightforward way to specific problems. Some approximations and extensions of the original method have to be developed, and the next section is devoted to this topic.

4 Extending Radial Basis Functions

The Radial Basis Functions technique seems to be an ideal tool to approximate functions in multidimensional spaces. However, at least for the kind of applications we are interested in, it suffers from two main drawbacks:

1. the coefficients of the Radial Basis Functions expansion are computed by solving a linear system of $N$ equations, where $N$ is the number of data points. In typical applications $N$ is of the order of $10^4$, and, moreover, the matrices associated to these linear systems are usually ill-conditioned. Preconditioning and iterative techniques can be used in order to deal with numerical instabilities (Dyn et al., 1986), but when the number of data points is several thousands the whole method is not very practical;

2. the Radial Basis Functions technique is based on the standard definition of Euclidean distance. However, in many situations the function to be approximated is defined on a space in which a natural notion of distance does not exist, for example because different coordinates have different units of measurement. In this case it is crucial to define an alternative distance, whose particular form depends on the data and has to be computed as well as the Radial Basis Functions coefficients. A small sketch of such a weighted distance is given below.

We now proceed to show how to deal with these two drawbacks of the Radial Basis Functions technique.
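To make the second drawback concrete, the following minimal Python sketch replaces the Euclidean distance with a weighted distance $\|W(x - x_i)\|$, where $W$ is here simply a diagonal matrix of hand-chosen inverse scales; in the extensions discussed below such a matrix would instead have to be estimated from the data together with the coefficients $c_i$. The coordinate meanings and scale values are invented for illustration.

    import numpy as np

    # Two coordinates with different units, e.g. a length in metres and an angle in radians.
    x      = np.array([1.50, 0.20])
    center = np.array([1.45, 0.80])

    # The plain Euclidean distance mixes the units and is dominated by whichever
    # coordinate happens to have the larger numerical range.
    d_euclid = np.linalg.norm(x - center)

    # Weighted distance ||W (x - x_i)||, with a hand-chosen diagonal W that
    # rescales each coordinate by a characteristic scale.
    W = np.diag([1.0 / 0.5, 1.0 / 0.1])       # inverse scales of the two coordinates
    d_weighted = np.linalg.norm(W @ (x - center))

    # The weighted distance can then be used inside any radial function, e.g. a Gaussian.
    g = np.exp(-d_weighted ** 2)
    print(d_euclid, d_weighted, g)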