Contents

1 Math and Machine Learning Basics
  1.1 Linear Algebra (Quick Review) (Ch. 2)
    1.1.1 Example: Principal Component Analysis
  1.2 Probability & Information Theory (Quick Review) (Ch. 3)
  1.3 Numerical Computation (Ch. 4)
  1.4 Machine Learning Basics (Ch. 5)
    1.4.1 Estimators, Bias and Variance (5.4)
    1.4.2 Maximum Likelihood Estimation (5.5)
    1.4.3 Bayesian Statistics (5.6)
    1.4.4 Supervised Learning Algorithms (5.7)

2 Deep Networks: Modern Practices
  2.1 Deep Feedforward Networks (Ch. 6)
    2.1.1 Back-Propagation (6.5)
  2.2 Regularization for Deep Learning (Ch. 7)
  2.3 Optimization for Training Deep Models (Ch. 8)
  2.4 Convolutional Neural Networks (Ch. 9)
  2.5 Sequence Modeling (RNNs) (Ch. 10)
    2.5.1 Review: The Basics of RNNs
    2.5.2 RNNs as Directed Graphical Models
    2.5.3 Challenge of Long-Term Deps. (10.7)
    2.5.4 LSTMs and Other Gated RNNs (10.10)
  2.6 Applications (Ch. 12)
    2.6.1 Natural Language Processing (12.4)
    2.6.2 Neural Language Models (12.4.2)

3 Deep Learning Research
  3.1 Linear Factor Models (Ch. 13)
  3.2 Autoencoders (Ch. 14)
  3.3 Representation Learning (Ch. 15)
  3.4 Structured Probabilistic Models for DL (Ch. 16)
    3.4.1 Sampling from Graphical Models
    3.4.2 Inference and Approximate Inference
  3.5 Monte Carlo Methods (Ch. 17)
  3.6 Confronting the Partition Function (Ch. 18)
  3.7 Approximate Inference (Ch. 19)
  3.8 Deep Generative Models (Ch. 20)

4 Papers and Tutorials
  4.1 WaveNet
  4.2 Neural Style
  4.3 Neural Conversation Model
  4.4 NMT By Jointly Learning to Align & Translate
    4.4.1 Detailed Model Architecture
  4.5 Effective Approaches to Attention-Based NMT
  4.6 Using Large Vocabularies for NMT
  4.7 Candidate Sampling – TensorFlow
  4.8 Attention Terminology
  4.9 TextRank
    4.9.1 Keyword Extraction
    4.9.2 Sentence Extraction
  4.10 Simple Baseline for Sentence Embeddings
  4.11 Survey of Text Clustering Algorithms
    4.11.1 Distance-based Clustering Algorithms
    4.11.2 Probabilistic Document Clustering and Topic Models
    4.11.3 Online Clustering with Text Streams
  4.12 Deep Sentence Embedding Using LSTMs
  4.13 Clustering Massive Text Streams
  4.14 Supervised Universal Sentence Representations (InferSent)
  4.15 Dist. Rep. of Sentences from Unlabeled Data (FastSent)
  4.16 Latent Dirichlet Allocation
  4.17 Conditional Random Fields
  4.18 Attention Is All You Need
  4.19 Hierarchical Attention Networks
  4.20 Joint Event Extraction via RNNs
  4.21 Event Extraction via Bidi-LSTM Tensor NNs
  4.22 Reasoning with Neural Tensor Networks
  4.23 Language to Logical Form with Neural Attention
  4.24 Seq2SQL: Generating Structured Queries from NL using RL
  4.25 SLING: A Framework for Frame Semantic Parsing
  4.26 Poincaré Embeddings for Learning Hierarchical Representations
  4.27 Enriching Word Vectors with Subword Information (FastText)
  4.28 DeepWalk: Online Learning of Social Representations
  4.29 Review of Relational Machine Learning for Knowledge Graphs
  4.30 Fast Top-K Search in Knowledge Graphs
  4.31 Dynamic Recurrent Acyclic Graphical Neural Networks (DRAGNN)
    4.31.1 More Detail: Arc-Standard Transition System
  4.32 Neural Architecture Search with Reinforcement Learning
  4.33 Joint Extraction of Events and Entities within a Document Context
  4.34 Globally Normalized Transition-Based Neural Networks
  4.35 An Introduction to Conditional Random Fields
    4.35.1 Inference (Sec. 4)
    4.35.2 Parameter Estimation (Sec. 5)
    4.35.3 Related Work and Future Directions (Sec. 6)
  4.36 Co-sampling: Training Robust Networks for Extremely Noisy Supervision
  4.37 Hidden-Unit Conditional Random Fields
    4.37.1 Detailed Derivations
  4.38 Pre-training of Hidden-Unit CRFs
  4.39 Structured Attention Networks
  4.40 Neural Conditional Random Fields
  4.41 Bidirectional LSTM-CRF Models for Sequence Tagging
  4.42 Relation Extraction: A Survey
  4.43 Neural Relation Extraction with Selective Attention over Instances
  4.44 On Herding and the Perceptron Cycling Theorem
  4.45 Non-Convex Optimization for Machine Learning
    4.45.1 Non-Convex Projected Gradient Descent (3)
  4.46 Improving Language Understanding by Generative Pre-Training
  4.47 Deep Contextualized Word Representations
  4.48 Exploring the Limits of Language Modeling
  4.49 Connectionist Temporal Classification
  4.50 BERT
  4.51 Wasserstein is all you need
  4.52 Noise Contrastive Estimation
    4.52.1 Self-Normalized NCE
  4.53 Neural Ordinary Differential Equations
  4.54 On the Dimensionality of Word Embedding
  4.55 Generative Adversarial Nets
  4.56 A Framework for Intelligence and Cortical Function
  4.57 Large-Scale Study of Curiosity Driven Learning
  4.58 Universal Language Model Fine-Tuning for Text Classification
  4.59 The Marginal Value of Adaptive Gradient Methods in Machine Learning
  4.60 A Theoretically Grounded Application of Dropout in Recurrent Neural Networks
  4.61 Improving Neural Language Models with a Continuous Cache
  4.62 Protection Against Reconstruction and Its Applications in Private Federated Learning
  4.63 Context Dependent RNN Language Model
  4.64 Strategies for Training Large Vocabulary Neural Language Models
  4.65 Product quantization for nearest neighbor search
  4.66 Large Memory Layers with Product Keys
  4.67 Show, Ask, Attend, and Answer
  4.68 Did the Model Understand the Question?
  4.69 XLNet
  4.70 Transformer-XL
  4.71 Efficient Softmax Approximation for GPUs
  4.72 Adaptive Input Representations for Neural Language Modeling
  4.73 Neural Module Networks
  4.74 Learning to Compose Neural Networks for QA
  4.75 End-to-End Module Networks for VQA
  4.76 Fast Multi-language LSTM-based Online Handwriting Recognition
  4.77 Multi-Language Online Handwriting Recognition
  4.78 Modular Generative Adversarial Networks
  4.79 Transfer Learning from Speaker Verification to TTS

5 NLP with Deep Learning
  5.1 Word Vector Representations (Lec 2)
  5.2 GloVe (Lec 3)

6 Speech and Language Processing
  6.1 Introduction (Ch. 1, 2nd Ed.)
  6.2 Morphology (Ch. 3, 2nd Ed.)
  6.3 N-Grams (Ch. 6, 2nd Ed.)
  6.4 Naive Bayes and Sentiment (Ch. 6, 3rd Ed.)
  6.5 Hidden Markov Models (Ch. 9, 3rd Ed.)
  6.6 POS Tagging (Ch. 10, 3rd Ed.)
  6.7 Formal Grammars (Ch. 11, 3rd Ed.)
  6.8 Vector Semantics (Ch. 15)
  6.9 Semantics with Dense Vectors (Ch. 16)
  6.10 Information Extraction (Ch. 21, 3rd Ed.)

7 Probabilistic Graphical Models
  7.1 Foundations (Ch. 2)
    7.1.1 Appendix
    7.1.2 L-BFGS
    7.1.3 Exercises
  7.2 The Bayesian Network Representation (Ch. 3)
  7.3 Undirected Graphical Models (Ch. 4)
    7.3.1 Exercises
  7.4 Local Probabilistic Models (Ch. 5)
  7.5 Template-Based Representations (Ch. 6)
  7.6 Gaussian Network Models (Ch. 7)
  7.7 Variable Elimination (Ch. 9)
  7.8 Clique Trees (Ch. 10)
  7.9 Inference as Optimization (Ch. 11)
  7.10 Parameter Estimation (Ch. 17)
  7.11 Partially Observed Data (Ch. 19)

8 Information Theory, Inference, and Learning Algorithms
  8.1 Introduction to Information Theory (Ch. 1)
  8.2 Probability, Entropy, and Inference (Ch. 2)
    8.2.1 More About Inference (Ch. 3 Summary)
  8.3 The Source Coding Theorem (Ch. 4)
    8.3.1 Data Compression and Typicality
    8.3.2 Further Analysis and Q&A
  8.4 Monte Carlo Methods (Ch. 29)
  8.5 Variational Methods (Ch. 33)

9 Machine Learning: A Probabilistic Perspective
  9.1 Probability (Ch. 2)
    9.1.1 Exercises
  9.2 Generative Models for Discrete Data (Ch. 3)
    9.2.1 Exercises
  9.3 Gaussian Models (Ch. 4)
  9.4 Bayesian Statistics (Ch. 5)
  9.5 Linear Regression (Ch. 7)
  9.6 Generalized Linear Models and the Exponential Family (Ch. 9)
  9.7 Mixture Models and the EM Algorithm (Ch. 11)
  9.8 Latent Linear Models (Ch. 12)
  9.9 Markov and Hidden Markov Models (Ch. 17)
  9.10 Undirected Graphical Models (Ch. 19)

10 Convex Optimization
  10.1 Convex Sets (Ch. 2)

11 Bayesian Data Analysis
  11.1 Probability and Inference (Ch. 1)
  11.2 Single-Parameter Models (Ch. 2)
  11.3 Asymptotics and Connections to Non-Bayesian Approaches (Ch. 4)
  11.4 Gaussian Process Models (Ch. 21)

12 Gaussian Processes for Machine Learning
  12.1 Regression (Ch. 2)

13 Blogs
  13.1 ConvNets: A Modular Perspective
  13.2 Understanding Convolutions
  13.3 Deep Reinforcement Learning
  13.4 Deep Learning for Chatbots (WildML)
  13.5 Attentional Interfaces – Neural Perspective

14 Appendix
  14.1 Common Distributions and Models
  14.2 Math
  14.3 Matrix Cookbook
  14.4 Main Tasks in NLP
  14.5 Misc. Topics
    14.5.1 BLEU Score
    14.5.2 Connectionist Temporal Classification (CTC)
    14.5.3 Perplexity
    14.5.4 Byte Pair Encoding
    14.5.5 Grammars
    14.5.6 Bloom Filter
    14.5.7 Distributed Training
    14.5.8 Traditional Language Modeling

1 Math and Machine Learning Basics

1.1 Linear Algebra (Quick Review) (Ch. 2)

Written by Brandon McKinzie. January 23, 2017. Unless stated otherwise, assume A ∈ R^{m×n}.

• For A^{-1} to exist, Ax = b must have exactly one solution for every value of b. Determining whether a solution exists ∀ b ∈ R^m means requiring that the column space (range) of A be all of R^m. It is helpful to see Ax expanded out explicitly in this way:

    Ax = \sum_i x_i A_{:,i} = x_1 \begin{bmatrix} A_{1,1} \\ \vdots \\ A_{m,1} \end{bmatrix} + \cdots + x_n \begin{bmatrix} A_{1,n} \\ \vdots \\ A_{m,n} \end{bmatrix}    (2.27)

  → Necessary: A must have at least m columns (n ≥ m), i.e. A is "wide".
  → Necessary and sufficient: the matrix must contain at least one set of m linearly independent columns.
  → Invertibility: in addition to the above, the matrix must be square (i.e. at most m columns ∧ at least m columns).

• A square matrix with linearly dependent columns is known as singular. A (necessarily square) matrix is singular if and only if one or more of its eigenvalues are zero (see the NumPy sketch at the end of this section).
• A norm is any function f that satisfies the following properties:

    f(x) = 0 ⇒ x = 0
    f(x + y) ≤ f(x) + f(y)
    ∀α ∈ R, f(αx) = |α| f(x)

  (Side note: the max norm is ||x||_∞ = max_i |x_i|.)

• An orthogonal matrix is a square matrix whose rows are mutually orthonormal and whose columns are mutually orthonormal:

    A^T A = A A^T = I    (2.37)
    A^{-1} = A^T    (2.38)

  (Side note: orthonormal columns imply orthonormal rows, if the matrix is square. To prove this, consider the relationship between A^T A and A A^T.)

• Suppose a square matrix A ∈ R^{n×n} has n linearly independent eigenvectors {v^{(1)}, ..., v^{(n)}}. The eigendecomposition of A is then given by [1]

    A = V diag(λ) V^{-1}    (2.40)

  In the special case where A is real symmetric, A = Q Λ Q^T. (All real-symmetric A have an eigendecomposition, but it might not be unique.) Interpretation: Ax can be decomposed into the following three steps:
  1) Change of basis: the vector Q^T x can be thought of as how x would appear in the basis of eigenvectors of A.
  2) Scale: next, we scale each component (Q^T x)_i by an amount λ_i, yielding the new vector Λ(Q^T x).
  3) Change of basis: finally, we rotate this new vector back from the eigen-basis into its original basis, yielding the transformed result Q Λ Q^T x.
  (A common convention is to sort the entries of Λ in descending order.)

• Positive definite: all λ_i are positive; positive semidefinite: all λ_i are positive or zero.
  → PSD: ∀x, x^T A x ≥ 0.
  → PD: additionally, x^T A x = 0 ⇒ x = 0. [2]

• Any real matrix A ∈ R^{m×n} has a singular value decomposition of the form

    A = U D V^T,    with U ∈ R^{m×m}, D ∈ R^{m×n}, V ∈ R^{n×n},

  where both U and V are orthogonal matrices and D is diagonal.
  – The singular values are the diagonal entries D_{ii}.
  – The left (right) singular vectors are the columns of U (V).
  – Eigenvectors of A A^T are the left-singular vectors; eigenvectors of A^T A are the right-singular vectors. The eigenvalues of both A A^T and A^T A are given by the singular values squared.

• The Moore–Penrose pseudoinverse, denoted A^+, enables us to find an "inverse" of sorts for a (possibly) non-square matrix A. Most algorithms compute A^+ via

    A^+ = V D^+ U^T.

  (Side note: A^+ is useful, e.g., when we want to solve Ax = y by left-multiplying each side to obtain x = By. It is far more likely for solution(s) to exist when A is wider than it is tall.)

• The determinant of a matrix is det(A) = ∏_i λ_i. Conceptually, |det(A)| tells how much [multiplication by] A expands or contracts space. If det(A) = 1, the transformation preserves volume.

[1] This appears to imply that unless the columns of V are also normalized, we can't guarantee that its inverse equals its transpose (since that is the only difference between it and an orthogonal matrix).

[2] I proved this and it made me happy inside. Check it out. Let A be positive definite. Then

    x^T A x = x^T Q Λ Q^T x = \sum_i (Q^T x)_i λ_i (Q^T x)_i = \sum_i λ_i (Q^T x)_i^2.

Since all terms in the summation are non-negative and all λ_i > 0, we have that x^T A x = 0 if and only if (Q^T x)_i = 0 = q^{(i)} · x for all i. Since the set of eigenvectors {q^{(i)}} forms an orthonormal basis, x must be the zero vector.
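The bullets above translate directly into a few lines of NumPy. The sketch below is illustrative only (the example matrices and the use of numpy.linalg are my own choices, not from the notes): it checks singularity via the eigenvalues, verifies that the eigenvalues of A^T A are the squared singular values, and builds the Moore–Penrose pseudoinverse from the SVD.

```python
import numpy as np

rng = np.random.default_rng(0)

# --- Singularity check (square case): singular iff some eigenvalue is zero ---
B = np.array([[2.0, 1.0],
              [4.0, 2.0]])                 # second row is a multiple of the first
eigvals_B = np.linalg.eigvals(B)
print("B singular?", np.any(np.isclose(eigvals_B, 0.0)))          # True

# --- SVD of a (possibly) non-square matrix: A = U D V^T ---
A = rng.normal(size=(5, 3))                # an illustrative "tall" real matrix
U, s, Vt = np.linalg.svd(A)                # s = diagonal entries of D, descending

# Eigenvalues of A^T A equal the squared singular values.
evals = np.linalg.eigvalsh(A.T @ A)        # ascending order
print("s^2 == eig(A^T A)?", np.allclose(np.sort(s**2), evals))    # True

# --- Moore-Penrose pseudoinverse: A^+ = V D^+ U^T ---
D_plus = np.zeros((3, 5))                  # D^+ has D's shape transposed ...
D_plus[:3, :3] = np.diag(1.0 / s)          # ... with nonzero singular values inverted
A_plus = Vt.T @ D_plus @ U.T
print("matches np.linalg.pinv?", np.allclose(A_plus, np.linalg.pinv(A)))

# Using A^+ to "solve" Ax = y: x = A^+ y is the least-squares solution.
y = rng.normal(size=5)
x = A_plus @ y
print("least-squares residual:", np.linalg.norm(A @ x - y))
```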
1.1.1 Example: Principal Component Analysis

Task. Say we want to apply lossy compression (less memory, but we may lose precision) to a collection of m points {x^{(1)}, ..., x^{(m)}}. We will do this by converting each x^{(i)} ∈ R^n to some c^{(i)} ∈ R^l (l < n), i.e. finding functions f and g such that

    f(x) = c    and    x ≈ g(f(x)).

Decoding function (g). As is, we still have a rather general task to solve. PCA is defined by choosing g(c) = Dc, with D ∈ R^{n×l}, where all columns of D are both (1) mutually orthogonal and (2) unit norm.

Encoding function (f). Now we need a way of mapping x to c such that g(c) will give us back a vector optimally close to x. We've already defined g, so this amounts to finding the optimal c^* such that

    c^* = \arg\min_c ||x − g(c)||_2^2.

Expanding the objective,

    (x − g(c))^T (x − g(c)) = x^T x − 2 x^T g(c) + g(c)^T g(c),

then dropping the x^T x term (it does not depend on c) and using D^T D = I_l to simplify g(c)^T g(c) = c^T c,

    c^* = \arg\min_c [ −2 x^T D c + c^T c ].

Setting the gradient with respect to c to zero gives

    c^* = D^T x = f(x),

which means the PCA reconstruction operation is defined as r(x) = D D^T x.

Optimal D. It is important to notice that we've been able to determine the optimal c^* for each individual x because each x is allowed its own c^*. However, we use the same matrix D for all our samples x^{(i)}, and thus must optimize it over all points in our collection. With that out of the way, we just do what we always do: minimize the L2 distance between points and their reconstructions. Formally, we minimize the Frobenius norm of the matrix of errors:

    D^* = \arg\min_D \sqrt{ \sum_{i,j} ( x_j^{(i)} − r(x^{(i)})_j )^2 }    s.t.  D^T D = I_l.
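As a sanity check on these definitions, here is a small NumPy sketch. The toy data is an illustrative assumption, and the particular D used below (the l right-singular vectors of the data matrix with largest singular values, equivalently the top eigenvectors of X^T X) is the standard solution of the constrained problem above; in practice the data is usually mean-centered first, which the derivation here does not require.

```python
import numpy as np

rng = np.random.default_rng(1)

# Illustrative data (an assumption, not from the notes): m = 200 points in R^5
# that lie close to a 2-D subspace, stacked as the rows of X.
m, n, l = 200, 5, 2
X = rng.normal(size=(m, l)) @ rng.normal(size=(l, n)) + 0.05 * rng.normal(size=(m, n))

# Standard PCA choice of D: columns are the top-l right-singular vectors of X.
_, _, Vt = np.linalg.svd(X, full_matrices=False)
D = Vt[:l].T                               # shape (n, l), with D^T D = I_l


def f(x):
    """Encode: c = D^T x."""
    return D.T @ x


def g(c):
    """Decode: g(c) = D c."""
    return D @ c


def r(x):
    """Reconstruct: r(x) = D D^T x."""
    return g(f(x))


x = X[0]
print("code c        :", f(x))
print("recon. error  :", np.linalg.norm(x - r(x)))
print("D^T D = I_l ? :", np.allclose(D.T @ D, np.eye(l)))
```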