Contents

1 Math and Machine Learning Basics
  1.1 Linear Algebra (Quick Review) (Ch. 2)
    1.1.1 Example: Principal Component Analysis
  1.2 Probability & Information Theory (Quick Review) (Ch. 3)
  1.3 Numerical Computation (Ch. 4)
  1.4 Machine Learning Basics (Ch. 5)
    1.4.1 Estimators, Bias and Variance (5.4)
    1.4.2 Maximum Likelihood Estimation (5.5)
    1.4.3 Bayesian Statistics (5.6)
    1.4.4 Supervised Learning Algorithms (5.7)
2 Deep Networks: Modern Practices
  2.1 Deep Feedforward Networks (Ch. 6)
    2.1.1 Back-Propagation (6.5)
  2.2 Regularization for Deep Learning (Ch. 7)
  2.3 Optimization for Training Deep Models (Ch. 8)
  2.4 Convolutional Neural Networks (Ch. 9)
  2.5 Sequence Modeling (RNNs) (Ch. 10)
    2.5.1 Review: The Basics of RNNs
    2.5.2 RNNs as Directed Graphical Models
    2.5.3 Challenge of Long-Term Deps. (10.7)
    2.5.4 LSTMs and Other Gated RNNs (10.10)
  2.6 Applications (Ch. 12)
    2.6.1 Natural Language Processing (12.4)
    2.6.2 Neural Language Models (12.4.2)
3 Deep Learning Research
  3.1 Linear Factor Models (Ch. 13)
  3.2 Autoencoders (Ch. 14)
  3.3 Representation Learning (Ch. 15)
  3.4 Structured Probabilistic Models for DL (Ch. 16)
    3.4.1 Sampling from Graphical Models
    3.4.2 Inference and Approximate Inference
  3.5 Monte Carlo Methods (Ch. 17)
  3.6 Confronting the Partition Function (Ch. 18)
  3.7 Approximate Inference (Ch. 19)
  3.8 Deep Generative Models (Ch. 20)
4 Papers and Tutorials
  4.1 WaveNet
  4.2 Neural Style
  4.3 Neural Conversation Model
  4.4 NMT By Jointly Learning to Align & Translate
    4.4.1 Detailed Model Architecture
  4.5 Effective Approaches to Attention-Based NMT
  4.6 Using Large Vocabularies for NMT
  4.7 Candidate Sampling – TensorFlow
  4.8 Attention Terminology
  4.9 TextRank
    4.9.1 Keyword Extraction
    4.9.2 Sentence Extraction
  4.10 Simple Baseline for Sentence Embeddings
  4.11 Survey of Text Clustering Algorithms
    4.11.1 Distance-based Clustering Algorithms
    4.11.2 Probabilistic Document Clustering and Topic Models
    4.11.3 Online Clustering with Text Streams
  4.12 Deep Sentence Embedding Using LSTMs
  4.13 Clustering Massive Text Streams
  4.14 Supervised Universal Sentence Representations (InferSent)
  4.15 Dist. Rep. of Sentences from Unlabeled Data (FastSent)
  4.16 Latent Dirichlet Allocation
  4.17 Conditional Random Fields
  4.18 Attention Is All You Need
  4.19 Hierarchical Attention Networks
  4.20 Joint Event Extraction via RNNs
  4.21 Event Extraction via Bidi-LSTM Tensor NNs
  4.22 Reasoning with Neural Tensor Networks
  4.23 Language to Logical Form with Neural Attention
  4.24 Seq2SQL: Generating Structured Queries from NL using RL
  4.25 SLING: A Framework for Frame Semantic Parsing
  4.26 Poincaré Embeddings for Learning Hierarchical Representations
  4.27 Enriching Word Vectors with Subword Information (FastText)
  4.28 DeepWalk: Online Learning of Social Representations
  4.29 Review of Relational Machine Learning for Knowledge Graphs
  4.30 Fast Top-K Search in Knowledge Graphs
  4.31 Dynamic Recurrent Acyclic Graphical Neural Networks (DRAGNN)
    4.31.1 More Detail: Arc-Standard Transition System
  4.32 Neural Architecture Search with Reinforcement Learning
  4.33 Joint Extraction of Events and Entities within a Document Context
  4.34 Globally Normalized Transition-Based Neural Networks
  4.35 An Introduction to Conditional Random Fields
    4.35.1 Inference (Sec. 4)
    4.35.2 Parameter Estimation (Sec. 5)
    4.35.3 Related Work and Future Directions (Sec. 6)
  4.36 Co-sampling: Training Robust Networks for Extremely Noisy Supervision
  4.37 Hidden-Unit Conditional Random Fields
    4.37.1 Detailed Derivations
  4.38 Pre-training of Hidden-Unit CRFs
  4.39 Structured Attention Networks
  4.40 Neural Conditional Random Fields
  4.41 Bidirectional LSTM-CRF Models for Sequence Tagging
  4.42 Relation Extraction: A Survey
  4.43 Neural Relation Extraction with Selective Attention over Instances
  4.44 On Herding and the Perceptron Cycling Theorem
  4.45 Non-Convex Optimization for Machine Learning
    4.45.1 Non-Convex Projected Gradient Descent (3)
  4.46 Improving Language Understanding by Generative Pre-Training
  4.47 Deep Contextualized Word Representations
  4.48 Exploring the Limits of Language Modeling
  4.49 Connectionist Temporal Classification
  4.50 BERT
  4.51 Wasserstein is all you need
  4.52 Noise Contrastive Estimation
    4.52.1 Self-Normalized NCE
  4.53 Neural Ordinary Differential Equations
  4.54 On the Dimensionality of Word Embedding
  4.55 Generative Adversarial Nets
  4.56 A Framework for Intelligence and Cortical Function
  4.57 Large-Scale Study of Curiosity Driven Learning
  4.58 Universal Language Model Fine-Tuning for Text Classification
  4.59 The Marginal Value of Adaptive Gradient Methods in Machine Learning
  4.60 A Theoretically Grounded Application of Dropout in Recurrent Neural Networks
  4.61 Improving Neural Language Models with a Continuous Cache
  4.62 Protection Against Reconstruction and Its Applications in Private Federated Learning
  4.63 Context Dependent RNN Language Model
  4.64 Strategies for Training Large Vocabulary Neural Language Models
  4.65 Product quantization for nearest neighbor search
  4.66 Large Memory Layers with Product Keys
  4.67 Show, Ask, Attend, and Answer
  4.68 Did the Model Understand the Question?
  4.69 XLNet
  4.70 Transformer-XL
  4.71 Efficient Softmax Approximation for GPUs
  4.72 Adaptive Input Representations for Neural Language Modeling
  4.73 Neural Module Networks
  4.74 Learning to Compose Neural Networks for QA
  4.75 End-to-End Module Networks for VQA
  4.76 Fast Multi-language LSTM-based Online Handwriting Recognition
  4.77 Multi-Language Online Handwriting Recognition
  4.78 Modular Generative Adversarial Networks
  4.79 Transfer Learning from Speaker Verification to TTS
5 NLP with Deep Learning
  5.1 Word Vector Representations (Lec 2)
  5.2 GloVe (Lec 3)
6 Speech and Language Processing
  6.1 Introduction (Ch. 1, 2nd Ed.)
  6.2 Morphology (Ch. 3, 2nd Ed.)
  6.3 N-Grams (Ch. 6, 2nd Ed.)
  6.4 Naive Bayes and Sentiment (Ch. 6, 3rd Ed.)
  6.5 Hidden Markov Models (Ch. 9, 3rd Ed.)
  6.6 POS Tagging (Ch. 10, 3rd Ed.)
  6.7 Formal Grammars (Ch. 11, 3rd Ed.)
  6.8 Vector Semantics (Ch. 15)
  6.9 Semantics with Dense Vectors (Ch. 16)
  6.10 Information Extraction (Ch. 21, 3rd Ed.)
7 Probabilistic Graphical Models
  7.1 Foundations (Ch. 2)
    7.1.1 Appendix
    7.1.2 L-BFGS
    7.1.3 Exercises
  7.2 The Bayesian Network Representation (Ch. 3)
  7.3 Undirected Graphical Models (Ch. 4)
    7.3.1 Exercises
  7.4 Local Probabilistic Models (Ch. 5)
  7.5 Template-Based Representations (Ch. 6)
  7.6 Gaussian Network Models (Ch. 7)
  7.7 Variable Elimination (Ch. 9)
  7.8 Clique Trees (Ch. 10)
  7.9 Inference as Optimization (Ch. 11)
  7.10 Parameter Estimation (Ch. 17)
  7.11 Partially Observed Data (Ch. 19)
8 Information Theory, Inference, and Learning Algorithms
  8.1 Introduction to Information Theory (Ch. 1)
  8.2 Probability, Entropy, and Inference (Ch. 2)
    8.2.1 More About Inference (Ch. 3 Summary)
  8.3 The Source Coding Theorem (Ch. 4)
    8.3.1 Data Compression and Typicality
    8.3.2 Further Analysis and Q&A
  8.4 Monte Carlo Methods (Ch. 29)
  8.5 Variational Methods (Ch. 33)
9 Machine Learning: A Probabilistic Perspective
  9.1 Probability (Ch. 2)
    9.1.1 Exercises
  9.2 Generative Models for Discrete Data (Ch. 3)
    9.2.1 Exercises
  9.3 Gaussian Models (Ch. 4)
  9.4 Bayesian Statistics (Ch. 5)
  9.5 Linear Regression (Ch. 7)
  9.6 Generalized Linear Models and the Exponential Family (Ch. 9)
  9.7 Mixture Models and the EM Algorithm (Ch. 11)
  9.8 Latent Linear Models (Ch. 12)
  9.9 Markov and Hidden Markov Models (Ch. 17)
  9.10 Undirected Graphical Models (Ch. 19)
10 Convex Optimization
  10.1 Convex Sets (Ch. 2)
11 Bayesian Data Analysis
  11.1 Probability and Inference (Ch. 1)
  11.2 Single-Parameter Models (Ch. 2)
  11.3 Asymptotics and Connections to Non-Bayesian Approaches (Ch. 4)
  11.4 Gaussian Process Models (Ch. 21)
12 Gaussian Processes for Machine Learning
  12.1 Regression (Ch. 2)
13 Blogs
  13.1 ConvNets: A Modular Perspective
  13.2 Understanding Convolutions
  13.3 Deep Reinforcement Learning
  13.4 Deep Learning for Chatbots (WildML)
  13.5 Attentional Interfaces – Neural Perspective
14 Appendix
  14.1 Common Distributions and Models
  14.2 Math
  14.3 Matrix Cookbook
  14.4 Main Tasks in NLP
  14.5 Misc. Topics
    14.5.1 BLEU Score
    14.5.2 Connectionist Temporal Classification (CTC)
    14.5.3 Perplexity
    14.5.4 Byte Pair Encoding
    14.5.5 Grammars
    14.5.6 Bloom Filter
    14.5.7 Distributed Training
    14.5.8 Traditional Language Modeling


1 Math and Machine Learning Basics

1.1 Linear Algebra (Quick Review) (Ch. 2)
January 23, 2017
Written by Brandon McKinzie

• For A^{-1} to exist, Ax = b must have exactly one solution for every value of b. (Unless stated otherwise, assume A ∈ R^{m×n}.) Determining whether a solution exists ∀b ∈ R^m means requiring that the column space (range) of A be all of R^m. It is helpful to see Ax expanded out explicitly in this way:

    Ax = Σ_i x_i A_{:,i} = x_1 (A_{1,1}, ..., A_{m,1})^T + ··· + x_n (A_{1,n}, ..., A_{m,n})^T    (2.27)

  → Necessary: A must have at least m columns (n ≥ m), i.e. A is "wide".
  → Necessary and sufficient: the matrix must contain at least one set of m linearly independent columns.
  → Invertibility: in addition to the above, the matrix must be square (re: at most m columns ∧ at least m columns).

• A square matrix with linearly dependent columns is known as singular. A (necessarily square) matrix is singular if and only if one or more of its eigenvalues are zero.
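A quick numerical check of the singularity bullet (my own sketch, not from the notes; the example matrix is arbitrary): a square matrix built with linearly dependent columns has a numerically zero eigenvalue, a zero determinant, and NumPy refuses to invert it.

```python
import numpy as np

# 3x3 matrix whose third column is the sum of the first two,
# so the columns are linearly dependent and A is singular.
A = np.array([[1.0, 2.0, 3.0],
              [0.0, 1.0, 1.0],
              [4.0, 0.0, 4.0]])

print(np.linalg.eigvals(A))                 # one eigenvalue is (numerically) zero
print(np.isclose(np.linalg.det(A), 0.0))    # det(A) = product of eigenvalues = 0

try:
    np.linalg.inv(A)
except np.linalg.LinAlgError as e:
    print("not invertible:", e)             # "Singular matrix"
```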
• A norm is any function f that satisfies the following properties:

    f(x) = 0 ⇒ x = 0    (1)
    f(x + y) ≤ f(x) + f(y)    (2)
    ∀α ∈ R, f(αx) = |α| f(x)    (3)

  (Margin note: ||x||_∞ = max_i |x_i|.)

• An orthogonal matrix is a square matrix whose rows are mutually orthonormal and whose columns are mutually orthonormal:

    A^T A = A A^T = I    (2.37)
    A^{-1} = A^T    (2.38)

  (Margin note: orthonormal columns imply orthonormal rows, if square. To prove this, consider the relationship between A^T A and A A^T.)

• Suppose a square matrix A ∈ R^{n×n} has n linearly independent eigenvectors {v^(1), ..., v^(n)}. The eigendecomposition of A is then given by

    A = V diag(λ) V^{-1}    (2.40)

  (Footnote: this appears to imply that unless the columns of V are also normalized, we can't guarantee that its inverse equals its transpose, since that is the only difference between it and an orthogonal matrix.)

  In the special case where A is real-symmetric, A = Q Λ Q^T. (Margin note: all real-symmetric A have an eigendecomposition, but it might not be unique!) Interpretation: Ax can be decomposed into the following three steps:

  1) Change of basis: the vector Q^T x can be thought of as how x would appear in the basis of eigenvectors of A.
  2) Scale: next, we scale each component (Q^T x)_i by an amount λ_i, yielding the new vector Λ(Q^T x).
  3) Change of basis: finally, we rotate this new vector back from the eigen-basis into its original basis, yielding the transformed result Q Λ Q^T x.

  (Margin note: a common convention is to sort the entries of Λ in descending order.)

• Positive definite: all λ_i are positive; positive semidefinite: all λ_i are positive or zero.
  → PSD: ∀x, x^T A x ≥ 0.
  → PD: x^T A x = 0 ⇒ x = 0.

  Proof of the PD claim (footnote: "I proved this and it made me happy inside. Check it out."): let A be positive definite. Then

    x^T A x = x^T Q Λ Q^T x    (4)
            = Σ_i (Q^T x)_i λ_i (Q^T x)_i    (5)
            = Σ_i λ_i (Q^T x)_i^2    (6)

  Since all terms in the summation are non-negative and all λ_i > 0, we have x^T A x = 0 if and only if (Q^T x)_i = 0 = q^(i) · x for all i. Since the set of eigenvectors {q^(i)} forms an orthonormal basis, x must be the zero vector.

• Any real matrix A ∈ R^{m×n} has a singular value decomposition of the form

    A = U D V^T    (10)

  with U ∈ R^{m×m} (7), D ∈ R^{m×n} (8), and V ∈ R^{n×n} (9), where both U and V are orthogonal matrices and D is diagonal.
  – The singular values are the diagonal entries D_{ii}.
  – The left (right) singular vectors are the columns of U (V).
  – Eigenvectors of A A^T are the left-singular vectors; eigenvectors of A^T A are the right-singular vectors. The eigenvalues of both A A^T and A^T A are given by the singular values squared.

• The Moore-Penrose pseudoinverse, denoted A^+, enables us to find an "inverse" of sorts for a (possibly) non-square matrix A. Most algorithms compute A^+ via

    A^+ = V D^+ U^T    (11)

  (Margin note: A^+ is useful, e.g., when we want to solve Ax = y by left-multiplying each side to obtain x = By. It is far more likely for solution(s) to exist when A is wider than it is tall.)

• The determinant of a matrix is det(A) = Π_i λ_i. Conceptually, |det(A)| tells how much [multiplication by] A expands/contracts space. If det(A) = 1, the transformation preserves volume.
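To make the eigendecomposition and positive-definiteness bullets concrete, here is a small NumPy sketch (my own illustration, not part of the notes; the matrix A is built to be symmetric positive definite by construction): np.linalg.eigh returns orthonormal eigenvectors Q and eigenvalues λ, and applying A to a vector is exactly rotate into the eigenbasis, scale, rotate back.

```python
import numpy as np

rng = np.random.default_rng(0)
B = rng.normal(size=(4, 4))
A = B @ B.T + 4 * np.eye(4)      # symmetric and positive definite by construction

# eigh handles symmetric matrices: eigenvalues ascending, eigenvectors orthonormal
lam, Q = np.linalg.eigh(A)

assert np.allclose(A, Q @ np.diag(lam) @ Q.T)   # A = Q diag(lam) Q^T  (eq. 2.40, V = Q)
assert np.allclose(Q.T @ Q, np.eye(4))          # orthonormal eigenvectors

# Interpretation of Ax: change of basis, scale, change of basis back
x = rng.normal(size=4)
x_eig  = Q.T @ x                 # 1) x expressed in the eigenbasis
scaled = lam * x_eig             # 2) scale component i by lambda_i
assert np.allclose(Q @ scaled, A @ x)           # 3) rotate back: identical to A x

# Positive definite: all eigenvalues > 0, so x^T A x > 0 for any nonzero x
assert np.all(lam > 0) and x @ A @ x > 0
```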
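A companion sketch (again mine, using an arbitrary tall matrix A that almost surely has full column rank) connects the SVD and pseudoinverse bullets: the squared singular values are the eigenvalues of A^T A, and A^+ = V D^+ U^T matches NumPy's pinv and yields the least-squares solution of Ax = y.

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.normal(size=(5, 3))            # tall: m = 5, n = 3

U, s, Vt = np.linalg.svd(A)            # A = U D V^T, s holds the singular values

# Squared singular values = eigenvalues of A^T A (sorted descending to match s)
evals = np.linalg.eigvalsh(A.T @ A)[::-1]
assert np.allclose(evals, s**2)

# Pseudoinverse A^+ = V D^+ U^T, where D^+ inverts the nonzero singular values
D_plus = np.zeros((3, 5))
D_plus[:3, :3] = np.diag(1.0 / s)      # assumes full column rank (all s > 0)
A_plus = Vt.T @ D_plus @ U.T
assert np.allclose(A_plus, np.linalg.pinv(A))

# For an overdetermined system Ax = y, x = A^+ y is the least-squares solution
y = rng.normal(size=5)
assert np.allclose(A_plus @ y, np.linalg.lstsq(A, y, rcond=None)[0])
```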
1.1.1 Example: Principal Component Analysis

Task. Say we want to apply lossy compression (less memory, but may lose precision) to a collection of m points {x^(1), ..., x^(m)}. We will do this by converting each x^(i) ∈ R^n to some c^(i) ∈ R^l (l < n), i.e. finding functions f and g such that:

    f(x) = c  and  x ≈ g(f(x))    (12)

Decoding function (g). As is, we still have a rather general task to solve. PCA is defined by choosing g(c) = Dc, with D ∈ R^{n×l}, where all columns of D are both (1) orthogonal and (2) unit norm.

Encoding function (f). Now we need a way of mapping x to c such that g(c) will give us back a vector optimally close to x. We've already defined g, so this amounts to finding the optimal c* such that:

    c* = argmin_c ||x − g(c)||_2^2    (13)

Expanding the objective,

    (x − g(c))^T (x − g(c)) = x^T x − 2 x^T g(c) + g(c)^T g(c)    (14)

so, using D^T D = I and dropping the x^T x term (it does not depend on c),

    c* = argmin_c [ −2 x^T D c + c^T c ]    (15)
       = D^T x = f(x)    (16)

where the last line follows from setting the gradient with respect to c, namely −2 D^T x + 2c, to zero. This means the PCA reconstruction operation is defined as r(x) = D D^T x.

Optimal D. It is important to notice that we've been able to determine, e.g., the optimal c* for some x because each x is allowed a different c*. However, we use the same matrix D for all our samples x^(i), and thus must optimize it over all points in our collection. With that out of the way, we just do what we always do: minimize the L2 distance between points and their reconstructions. Formally, we minimize the Frobenius norm of the matrix of errors:

    D* = argmin_D sqrt( Σ_{i,j} ( x_j^(i) − r(x^(i))_j )^2 )    s.t. D^T D = I    (17)
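Below is a minimal end-to-end sketch of this encode/decode pipeline (my own illustration, not from the notes). It assumes the data are centered, and it uses the standard result, which this excerpt stops just short of deriving, that the optimal D stacks the l eigenvectors of X^T X with the largest eigenvalues.

```python
import numpy as np

rng = np.random.default_rng(2)

# m points in R^n, centered (the derivation above ignores the mean)
m, n, l = 200, 5, 2
X = rng.normal(size=(m, n)) @ rng.normal(size=(n, n))
X -= X.mean(axis=0)

# Standard result (not derived in this excerpt): the optimal D consists of the
# l eigenvectors of X^T X with the largest eigenvalues (unit-norm, orthogonal).
evals, evecs = np.linalg.eigh(X.T @ X)       # ascending order
D = evecs[:, ::-1][:, :l]                    # top-l eigenvectors, shape n x l
assert np.allclose(D.T @ D, np.eye(l))       # the constraint D^T D = I in eq. (17)

# Encoder f(x) = D^T x, decoder g(c) = D c, reconstruction r(x) = D D^T x
C = X @ D              # codes c^(i) in R^l, one per row
X_rec = C @ D.T        # reconstructions r(x^(i))

# Mean squared reconstruction error, i.e. the objective minimized in eq. (17)
err = np.mean(np.sum((X - X_rec) ** 2, axis=1))
print(f"mean squared reconstruction error with l={l}: {err:.4f}")
```

Increasing l toward n drives the error to zero, since D D^T then approaches the identity (D becomes a full orthonormal basis).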
