Machine Learning: A Bayesian and Optimization Perspective

Sergios Theodoridis

AMSTERDAM • BOSTON • HEIDELBERG • LONDON • NEW YORK • OXFORD • PARIS • SAN DIEGO • SAN FRANCISCO • SINGAPORE • SYDNEY • TOKYO

Academic Press is an imprint of Elsevier
125 London Wall, London, EC2Y 5AS, UK
525 B Street, Suite 1800, San Diego, CA 92101-4495, USA
225 Wyman Street, Waltham, MA 02451, USA
The Boulevard, Langford Lane, Kidlington, Oxford OX5 1GB, UK

Copyright © 2015 Elsevier Ltd. All rights reserved.

No part of this publication may be reproduced or transmitted in any form or by any means, electronic or mechanical, including photocopying, recording, or any information storage and retrieval system, without permission in writing from the publisher. Details on how to seek permission, further information about the Publisher's permissions policies and our arrangements with organizations such as the Copyright Clearance Center and the Copyright Licensing Agency, can be found at our website: www.elsevier.com/permissions.

This book and the individual contributions contained in it are protected under copyright by the Publisher (other than as may be noted herein).

Notices
Knowledge and best practice in this field are constantly changing. As new research and experience broaden our understanding, changes in research methods, professional practices, or medical treatments may become necessary. Practitioners and researchers must always rely on their own experience and knowledge in evaluating and using any information, methods, compounds, or experiments described herein. In using such information or methods they should be mindful of their own safety and the safety of others, including parties for whom they have a professional responsibility.
To the fullest extent of the law, neither the Publisher nor the authors, contributors, or editors assume any liability for any injury and/or damage to persons or property as a matter of products liability, negligence or otherwise, or from any use or operation of any methods, products, instructions, or ideas contained in the material herein.

British Library Cataloguing-in-Publication Data
A catalogue record for this book is available from the British Library

Library of Congress Cataloging-in-Publication Data
A catalog record for this book is available from the Library of Congress

ISBN: 978-0-12-801522-3

For information on all Academic Press publications visit our website at http://store.elsevier.com/

Publisher: Jonathan Simpson
Acquisition Editor: Tim Pitts
Editorial Project Manager: Charlie Kent
Production Project Manager: Susan Li
Designer: Greg Harris

Typeset by SPi Global, India
Printed and bound in The United States

Contents

Preface
Acknowledgments
Notation

CHAPTER 1 Introduction
1.1 What Machine Learning is About
  1.1.1 Classification
  1.1.2 Regression
1.2 Structure and a Road Map of the Book
References

CHAPTER 2 Probability and Stochastic Processes
2.1 Introduction
2.2 Probability and Random Variables
  2.2.1 Probability
  2.2.2 Discrete Random Variables
  2.2.3 Continuous Random Variables
  2.2.4 Mean and Variance
  2.2.5 Transformation of Random Variables
2.3 Examples of Distributions
  2.3.1 Discrete Variables
  2.3.2 Continuous Variables
2.4 Stochastic Processes
  2.4.1 First and Second Order Statistics
  2.4.2 Stationarity and Ergodicity
  2.4.3 Power Spectral Density
  2.4.4 Autoregressive Models
2.5 Information Theory
  2.5.1 Discrete Random Variables
  2.5.2 Continuous Random Variables
2.6 Stochastic Convergence
Problems
References

CHAPTER 3 Learning in Parametric Modeling: Basic Concepts and Directions
3.1 Introduction
3.2 Parameter Estimation: The Deterministic Point of View
3.3 Linear Regression
3.4 Classification
3.5 Biased Versus Unbiased Estimation
  3.5.1 Biased or Unbiased Estimation?
3.6 The Cramér-Rao Lower Bound
3.7 Sufficient Statistic
3.8 Regularization
3.9 The Bias-Variance Dilemma
  3.9.1 Mean-Square Error Estimation
  3.9.2 Bias-Variance Tradeoff
3.10 Maximum Likelihood Method
  3.10.1 Linear Regression: The Nonwhite Gaussian Noise Case
3.11 Bayesian Inference
  3.11.1 The Maximum a Posteriori Probability Estimation Method
3.12 Curse of Dimensionality
3.13 Validation
3.14 Expected and Empirical Loss Functions
3.15 Nonparametric Modeling and Estimation
Problems
References

CHAPTER 4 Mean-Square Error Linear Estimation
4.1 Introduction
4.2 Mean-Square Error Linear Estimation: The Normal Equations
  4.2.1 The Cost Function Surface
4.3 A Geometric Viewpoint: Orthogonality Condition
4.4 Extension to Complex-Valued Variables
  4.4.1 Widely Linear Complex-Valued Estimation
  4.4.2 Optimizing with Respect to Complex-Valued Variables: Wirtinger Calculus
4.5 Linear Filtering
4.6 MSE Linear Filtering: A Frequency Domain Point of View
4.7 Some Typical Applications
  4.7.1 Interference Cancellation
  4.7.2 System Identification
  4.7.3 Deconvolution: Channel Equalization
4.8 Algorithmic Aspects: The Levinson and the Lattice-Ladder Algorithms
  4.8.1 The Lattice-Ladder Scheme
4.9 Mean-Square Error Estimation of Linear Models
  4.9.1 The Gauss-Markov Theorem
  4.9.2 Constrained Linear Estimation: The Beamforming Case
4.10 Time-Varying Statistics: Kalman Filtering
Problems
References

CHAPTER 5 Stochastic Gradient Descent: The LMS Algorithm and its Family
5.1 Introduction
5.2 The Steepest Descent Method
5.3 Application to the Mean-Square Error Cost Function
  5.3.1 The Complex-Valued Case
5.4 Stochastic Approximation
5.5 The Least-Mean-Squares Adaptive Algorithm
  5.5.1 Convergence and Steady-State Performance of the LMS in Stationary Environments
  5.5.2 Cumulative Loss Bounds
5.6 The Affine Projection Algorithm
  5.6.1 The Normalized LMS
5.7 The Complex-Valued Case
5.8 Relatives of the LMS
5.9 Simulation Examples
5.10 Adaptive Decision Feedback Equalization
5.11 The Linearly Constrained LMS
5.12 Tracking Performance of the LMS in Nonstationary Environments
5.13 Distributed Learning: The Distributed LMS
  5.13.1 Cooperation Strategies
  5.13.2 The Diffusion LMS
  5.13.3 Convergence and Steady-State Performance: Some Highlights
  5.13.4 Consensus-Based Distributed Schemes
5.14 A Case Study: Target Localization
5.15 Some Concluding Remarks: Consensus Matrix
Problems
References

CHAPTER 6 The Least-Squares Family
6.1 Introduction
6.2 Least-Squares Linear Regression: A Geometric Perspective
6.3 Statistical Properties of the LS Estimator
6.4 Orthogonalizing the Column Space of X: The SVD Method
6.5 Ridge Regression
6.6 The Recursive Least-Squares Algorithm
6.7 Newton's Iterative Minimization Method
  6.7.1 RLS and Newton's Method
6.8 Steady-State Performance of the RLS
6.9 Complex-Valued Data: The Widely Linear RLS
6.10 Computational Aspects of the LS Solution
6.11 The Coordinate and Cyclic Coordinate Descent Methods
6.12 Simulation Examples
6.13 Total-Least-Squares
Problems
References

CHAPTER 7 Classification: A Tour of the Classics
7.1 Introduction
7.2 Bayesian Classification
  7.2.1 Average Risk
7.3 Decision (Hyper)Surfaces
  7.3.1 The Gaussian Distribution Case
7.4 The Naive Bayes Classifier
7.5 The Nearest Neighbor Rule
7.6 Logistic Regression
7.7 Fisher's Linear Discriminant
7.8 Classification Trees
7.9 Combining Classifiers
7.10 The Boosting Approach
7.11 Boosting Trees
7.12 A Case Study: Protein Folding Prediction
Problems
References

CHAPTER 8 Parameter Learning: A Convex Analytic Path
8.1 Introduction
8.2 Convex Sets and Functions
  8.2.1 Convex Sets
  8.2.2 Convex Functions
8.3 Projections onto Convex Sets
  8.3.1 Properties of Projections
8.4 Fundamental Theorem of Projections onto Convex Sets
8.5 A Parallel Version of POCS
8.6 From Convex Sets to Parameter Estimation and Machine Learning
  8.6.1 Regression
  8.6.2 Classification
8.7 Infinitely Many Closed Convex Sets: The Online Learning Case
  8.7.1 Convergence of APSM
8.8 Constrained Learning
8.9 The Distributed APSM
8.10 Optimizing Nonsmooth Convex Cost Functions
  8.10.1 Subgradients and Subdifferentials
  8.10.2 Minimizing Nonsmooth Continuous Convex Loss Functions: The Batch Learning Case
  8.10.3 Online Learning for Convex Optimization
8.11 Regret Analysis
8.12 Online Learning and Big Data Applications: A Discussion
8.13 Proximal Operators
  8.13.1 Properties of the Proximal Operator
  8.13.2 Proximal Minimization
8.14 Proximal Splitting Methods for Optimization
Problems
8.15 Appendix to Chapter 8
References

CHAPTER 9 Sparsity-Aware Learning: Concepts and Theoretical Foundations
9.1 Introduction
9.2 Searching for a Norm
9.3 The Least Absolute Shrinkage and Selection Operator (LASSO)
9.4 Sparse Signal Representation
9.5 In Search of the Sparsest Solution
9.6 Uniqueness of the ℓ0 Minimizer
  9.6.1 Mutual Coherence
9.7 Equivalence of ℓ0 and ℓ1 Minimizers: Sufficiency Conditions
  9.7.1 Condition Implied by the Mutual Coherence Number
  9.7.2 The Restricted Isometry Property (RIP)
9.8 Robust Sparse Signal Recovery from Noisy Measurements
9.9 Compressed Sensing: The Glory of Randomness
  9.9.1 Dimensionality Reduction and Stable Embeddings
  9.9.2 Sub-Nyquist Sampling: Analog-to-Information Conversion
9.10 A Case Study: Image De-Noising
Problems
References

CHAPTER 10 Sparsity-Aware Learning: Algorithms and Applications
10.1 Introduction
10.2 Sparsity-Promoting Algorithms