APIC: A Method for Automated Pattern Identification and Classification

by Ryan Gavin Goss

Dissertation submitted in fulfilment of the requirements for the degree Doctor of Philosophy in Computer Science at the University of Cape Town

Supervisor: Dr. Geoff S. Nitschke

March 2017

The copyright of this thesis vests in the author. No quotation from it or information derived from it is to be published without full acknowledgement of the source. The thesis is to be used for private study or non-commercial research purposes only. Published by the University of Cape Town (UCT) in terms of the non-exclusive license granted to UCT by the author.

Acknowledgements

My grateful thanks go to the following people and organisations, without whom this dissertation would not have been possible: First and foremost, I would like to thank the Lord Jesus Christ, who gave me the strength to persevere, even through the hard times. I would like to thank my wife, Kelly, and our two children, Rylee and Kayley, for showering me with endless love, support and encouragement. To my parents, Melvin and Julia, for all the sacrifices they made for me and for demonstrating that we can achieve anything we set our minds to. I would like to thank my brother, Robert, for continued encouragement and for leading the way forward, by example, in academia. To my sister, Kym, for her support and assistance in rendering many of the figures present in this text. My sincerest thanks go out to my supervisor, Geoff Nitschke, who has helped to expand my writing skills over the past few years. Neither this dissertation, nor my previous publications would have been possible without his constant feedback and tireless efforts.
A big thank you also to my copy-editor, Tanya Wyatt, who worked through this dissertation with me, recommending changes to improve the overall read. To my friends and colleagues at both my current and previous employers, for providing insights into the real-life case studies described in this dissertation. Thank you to the developers of the scikit-learn project (Pedregosa et al., 2011) and Google TensorFlow (Abadi et al., 2015), whose libraries were used extensively in this study. Finally, I would like to thank the National Research Foundation (NRF) for funding many aspects of this dissertation.

Abstract

Machine Learning (ML) is a transformative technology at the forefront of many modern research endeavours. The technology is generating a tremendous amount of attention from researchers and practitioners, providing new approaches to solving complex classification and regression tasks. While concepts such as Deep Learning have existed for many years, the computational power for realising the utility of these algorithms in real-world applications has only recently become available. This dissertation investigated the efficacy of a novel, general method for deploying ML in a variety of complex tasks, where best feature selection, data-set labelling, model definition and training processes were determined automatically. Models were developed in an iterative fashion, evaluated using both training and validation data sets. The proposed method was evaluated using three distinct case studies, describing complex classification tasks often requiring significant input from human experts. The results achieved demonstrate that the proposed method compares with, and often outperforms, less general, comparable methods designed specifically for each task.
Feature selection, data-set annotation, model design and training processes were optimised by the method, where less complex, comparatively accurate classifiers with lower dependency on computational power and human expert intervention were produced. In chapter 4, the proposed method demonstrated improved efficacy over comparable systems, automatically identifying and classifying complex application protocols traversing IP networks. In chapter 5, the proposed method was able to discriminate between normal and anomalous traffic, maintaining accuracy in excess of 99%, while reducing false alarms to a mere 0.08%. Finally, in chapter 6, the proposed method discovered more optimal classifiers than those implemented by comparable methods, with classification scores rivalling those achieved by state-of-the-art systems. The findings of this research concluded that developing a fully automated, general method, exhibiting efficacy in a wide variety of complex classification tasks with minimal expert intervention, was possible. The method and various artefacts produced in each case study of this dissertation are thus significant contributions to the field of ML.

Contents

Acknowledgements
Abstract
1 Introduction
1.1 Motivation
1.2 Research Problem
1.3 Research Objectives
1.4 Methods
1.5 Contributions
1.6 Overview of Dissertation
1.7 Assumptions and Delineations
2 Foundations
2.1 Supervised Learning
2.1.1 IP Traffic Classification
2.1.2 Anomaly Detection on IP networks
2.1.3 Handwritten Digit Recognition
2.2 Unsupervised Learning
2.3 Evolutionary Algorithms
2.3.1 Biological Inspiration
2.3.2 Overview of an Evolutionary Algorithm
2.3.3 Feature Set and Hyper-Parameter Optimisation
2.4 Conclusion
3 Automated Pattern Identification and Classification (APIC)
3.1 Feature Selection
3.2 Pattern Discovery
3.3 Classifier Production
3.4 Conclusion
4 IP Traffic Classification
4.1 IP Traffic Classification
4.1.1 Classic Port Matching
4.1.2 Deep Packet Inspection
4.1.3 Statistical Analysis
4.1.4 Machine Learning
4.2 Task Description
4.3 Feature Selection
4.4 Distinguishing Application Protocols
4.5 Verification by Visualising Application Protocols
4.6 TWEANN Classifier Development
4.7 Accuracy Comparison
4.8 Portability Comparison
4.9 Automation Comparison
4.10 Discussion
4.10.1 Completeness
4.10.2 Accuracy
4.11 Conclusion
5 Network Anomaly Detection
5.1 Information Security and Anomaly Detection
5.2 Distributed Denial of Service Attack (DDoS)
5.2.1 Classes of DDoS Attack
5.2.2 Identifying DDoS Attacks
5.2.3 Generic Architecture of DDoS Defence Systems
5.3 Statistical and Machine Learning-based DDoS Detection
5.4 ISCX 2012 IDS Experiment
5.4.1 Experimental Data Sets
5.4.2 Feature Selection
5.4.3 Behavioural Profiling of IP Flow Summary Data
5.4.4 IP Traffic Profile Classification
5.4.5 Classifier Evaluation
5.4.6 Discussion