MASTER’S THESIS | LUND UNIVERSITY 2016

Analysis of Financial Transactions using Machine Learning
(An Application to Compute the Socio-ecological Impact of Consumer Spending)

Adam Wamai Egesa
[email protected]

Department of Computer Science
Faculty of Engineering LTH
ISSN 1650-2884
LU-CS-EX 2016-05

February 8, 2016

Master’s thesis work carried out at the Department of Computer Science, Lund University.

Supervisors: Pierre Nugues, [email protected]
             Marcus Klang, [email protected]
Examiner: Jacek Malec, [email protected]

Abstract

Many people want to know the socio-ecological impact of the goods they purchase. In this thesis, we describe a system that computes the socio-ecological impact of those goods by analyzing uncategorized financial transactions. The computation is made possible by extending a system that can compute socio-ecological impact from categorized transactions. The extension further includes visualizations on the system’s web GUI using AngularJS and an extension of the system’s Node.js API.

To compute the socio-ecological impact, the report describes a categorization service. To connect the service to the core system, a RabbitMQ message queue was used. The service trained supervised machine learning models using Apache Spark’s machine learning library (MLlib) on a dataset containing about 2.4 million categorized transactions. This achieved a categorization accuracy of 82.9%.

The main focus for future work is to increase accuracy by using named-entity recognition and splitting up the categorization into two steps using multiple categorizers.

Keywords: machine learning, apache spark, mllib, mcc, financial transactions

Acknowledgements

Special thanks to my supervisor Pierre Nugues, assistant supervisor Marcus Klang and Dennis Medved at the Computer Science Faculty in Lund for many valuable insights and help throughout the thesis’ duration.
I would also like to thank my colleagues Christopher Olsson, Robin Undall-Behrend, Mikael Karlsson and Kristian Rönn at Meta Mind for their feedback and collaboration. In particular, thanks to Kristian Rönn for providing the data needed for the thesis and collaboration on the first version of the Scikit-learn implementation. Further thanks to the corporation Meta Mind for financial support of the thesis.

Contents

1 Introduction
  1.1 Background
  1.2 Related Work
  1.3 Contributions

2 Data Sources and Data Sets
  2.1 Category Taxonomies
    2.1.1 UNSPSC
    2.1.2 ProClass
    2.1.3 Merchant Category Code (MCC)
  2.2 Data
  2.3 Data on the Marginal Impact of Expenses

3 Algorithms and Tools
  3.1 Machine Learning
  3.2 Message Queue – RabbitMQ
  3.3 Data Analytics Tools
    3.3.1 Scikit-learn
    3.3.2 Apache Spark and MLLib
  3.4 Server – Node.js
  3.5 Web Client – AngularJS Framework
  3.6 Database – MongoDB

4 Implementation
  4.1 System Architecture
  4.2 Web Dashboards
    4.2.1 Initialization
    4.2.2 User input
    4.2.3 Components
  4.3 App Server
    4.3.1 Initialization
    4.3.2 API
    4.3.3 Modules
  4.4 Classification Server
    4.4.1 Training workflow
    4.4.2 Testing workflow
    4.4.3 Prediction workflow
    4.4.4 MLlib Wrapper Classes
  4.5 Scikit-learn implementation

5 Results
  5.1 Scikit-learn results
    5.1.1 Vectorizers
    5.1.2 Stopwords
    5.1.3 Lowercase
    5.1.4 n-grams
    5.1.5 Classifier options
  5.2 MLlib results
    5.2.1 Experimental setup
    5.2.2 Data
    5.2.3 Feature Selection
    5.2.4 Results from entire dataset
    5.2.5 Results from 10% sample
  5.3 Computation Results

6 Conclusions
  6.1 MLlib Results
  6.2 Time and memory complexity
  6.3 Impact Computation
  6.4 Future Work

Bibliography

Appendix A Metrics