Computer Science / Computer Engineering / Computing

Data Mining Tools for Malware Detection
Mehedy Masud, Latifur Khan, and Bhavani Thuraisingham

Although the use of data mining for security and malware detection is quickly on the rise, most books on the subject provide high-level theoretical discussions to the near exclusion of the practical aspects. Breaking the mold, Data Mining Tools for Malware Detection provides a step-by-step breakdown of how to develop data mining tools for malware detection. Integrating theory with practical techniques and experimental results, it focuses on malware detection applications for email worms, malicious code, remote exploits, and botnets.

The authors describe the systems they have designed and developed: email worm detection using data mining, a scalable multi-level feature extraction technique to detect malicious executables, detecting remote exploits using data mining, and flow-based identification of botnet traffic by mining multiple log files. For each of these tools, they detail the system architecture, algorithms, performance results, and limitations.

• Discusses data mining for emerging applications, including adaptable malware detection, insider threat detection, firewall policy analysis, and real-time data mining
• Includes four appendices that provide a firm foundation in data management, secure systems, and the semantic web
• Describes the authors’ tools for stream data mining

From algorithms to experimental results, this is one of the few books that will be equally valuable to those in industry, government, and academia. It will help technologists decide which tools to select for specific applications, managers will learn how to determine whether or not to proceed with a data mining project, and developers will find innovative alternative designs for a range of applications.
K12530
ISBN: 978-1-4398-5454-9
www.crcpress.com
www.auerbach-publications.com

IT MANAGEMENT TITLES FROM AUERBACH PUBLICATIONS AND CRC PRESS

.Net 4 for Enterprise Architects and Developers
Sudhanshu Hate and Suchi Paharia
ISBN 978-1-4398-6293-3

A Tale of Two Transformations: Bringing Lean and Agile Software Development to Life
Michael K. Levine
ISBN 978-1-4398-7975-7

Antipatterns: Managing Software Organizations and People, Second Edition
Colin J. Neill, Philip A. Laplante, and Joanna F. DeFranco
ISBN 978-1-4398-6186-8

Asset Protection through Security Awareness
Tyler Justin Speed
ISBN 978-1-4398-0982-2

Beyond Knowledge Management: What Every Leader Should Know
Edited by Jay Liebowitz
ISBN 978-1-4398-6250-6

CISO’s Guide to Penetration Testing: A Framework to Plan, Manage, and Maximize Benefits
James S. Tiller
ISBN 978-1-4398-8027-2

Cybersecurity: Public Sector Threats and Responses
Edited by Kim J. Andreasson
ISBN 978-1-4398-4663-6

Cybersecurity for Industrial Control Systems: SCADA, DCS, PLC, HMI, and SIS
Tyson Macaulay and Bryan Singer
ISBN 978-1-4398-0196-3

Data Warehouse Designs: Achieving ROI with Market Basket Analysis and Time Variance
Fon Silvers
ISBN 978-1-4398-7076-1

Emerging Wireless Networks: Concepts, Techniques and Applications
Edited by Christian Makaya and Samuel Pierre
ISBN 978-1-4398-2135-0

Information and Communication Technologies in Healthcare
Edited by Stephan Jones and Frank M. Groom
ISBN 978-1-4398-5413-6

Information Security Governance Simplified: From the Boardroom to the Keyboard
Todd Fitzgerald
ISBN 978-1-4398-1163-4

IP Telephony Interconnection Reference: Challenges, Models, and Engineering
Mohamed Boucadair, Isabel Borges, Pedro Miguel Neves, and Olafur Pall Einarsson
ISBN 978-1-4398-5178-4

IT’s All about the People: Technology Management That Overcomes Disaffected People, Stupid Processes, and Deranged Corporate Cultures
Stephen J. Andriole
ISBN 978-1-4398-7658-9

IT Best Practices: Management, Teams, Quality, Performance, and Projects
Tom C. Witt
ISBN 978-1-4398-6854-6

Maximizing Benefits from IT Project Management: From Requirements to Value Delivery
José López Soriano
ISBN 978-1-4398-4156-3

Secure and Resilient Software: Requirements, Test Cases, and Testing Methods
Mark S. Merkow and Lakshmikanth Raghavan
ISBN 978-1-4398-6621-4

Security De-engineering: Solving the Problems in Information Risk Management
Ian Tibble
ISBN 978-1-4398-6834-8

Software Maintenance Success Recipes
Donald J. Reifer
ISBN 978-1-4398-5166-1

Software Project Management: A Process-Driven Approach
Ashfaque Ahmed
ISBN 978-1-4398-4655-1

Web-Based and Traditional Outsourcing
Vivek Sharma, Varun Sharma, and K.S. Rajasekaran, Infosys Technologies Ltd., Bangalore, India
ISBN 978-1-4398-1055-2

Data Mining Tools for Malware Detection
Mehedy Masud, Latifur Khan, and Bhavani Thuraisingham

CRC Press
Taylor & Francis Group
6000 Broken Sound Parkway NW, Suite 300
Boca Raton, FL 33487-2742

© 2012 by Taylor & Francis Group, LLC
CRC Press is an imprint of Taylor & Francis Group, an Informa business

No claim to original U.S. Government works

Version Date: 20111020

International Standard Book Number-13: 978-1-4398-5455-6 (eBook - PDF)

This book contains information obtained from authentic and highly regarded sources. Reasonable efforts have been made to publish reliable data and information, but the author and publisher cannot assume responsibility for the validity of all materials or the consequences of their use. The authors and publishers have attempted to trace the copyright holders of all material reproduced in this publication and apologize to copyright holders if permission to publish in this form has not been obtained.
If any copyright material has not been acknowledged please write and let us know so we may rectify in any future reprint.

Except as permitted under U.S. Copyright Law, no part of this book may be reprinted, reproduced, transmitted, or utilized in any form by any electronic, mechanical, or other means, now known or hereafter invented, including photocopying, microfilming, and recording, or in any information storage or retrieval system, without written permission from the publishers.

For permission to photocopy or use material electronically from this work, please access www.copyright.com (http://www.copyright.com/) or contact the Copyright Clearance Center, Inc. (CCC), 222 Rosewood Drive, Danvers, MA 01923, 978-750-8400. CCC is a not-for-profit organization that provides licenses and registration for a variety of users. For organizations that have been granted a photocopy license by the CCC, a separate system of payment has been arranged.

Trademark Notice: Product or corporate names may be trademarks or registered trademarks, and are used only for identification and explanation without intent to infringe.

Visit the Taylor & Francis Web site at http://www.taylorandfrancis.com
and the CRC Press Web site at http://www.crcpress.com

Dedication

We dedicate this book to our respective families for their support that enabled us to write this book.
Contents

Preface
    Introductory Remarks
    Background on Data Mining
    Data Mining for Cyber Security
    Organization of This Book
    Concluding Remarks
Acknowledgments
The Authors
Copyright Permissions

Chapter 1 Introduction
    1.1 Trends
    1.2 Data Mining and Security Technologies
    1.3 Data Mining for Email Worm Detection
    1.4 Data Mining for Malicious Code Detection
    1.5 Data Mining for Detecting Remote Exploits
    1.6 Data Mining for Botnet Detection
    1.7 Stream Data Mining
    1.8 Emerging Data Mining Tools for Cyber Security Applications
    1.9 Organization of This Book
    1.10 Next Steps

Part I Data Mining and Security

Introduction to Part I: Data Mining and Security

Chapter 2 Data Mining Techniques
    2.1 Introduction
    2.2 Overview of Data Mining Tasks and Techniques
    2.3 Artificial Neural Network
    2.4 Support Vector Machines
    2.5 Markov Model
    2.6 Association Rule Mining (ARM)
    2.7 Multi-Class Problem
        2.7.1 One-vs-One
        2.7.2 One-vs-All
    2.8 Image Mining
        2.8.1 Feature Selection
        2.8.2 Automatic Image Annotation
        2.8.3 Image Classification
    2.9 Summary
    References

Chapter 3 Malware
    3.1 Introduction
    3.2 Viruses
    3.3 Worms
    3.4 Trojan Horses
    3.5 Time and Logic Bombs
    3.6 Botnet
    3.7 Spyware
    3.8 Summary
    References

Chapter 4 Data Mining for Security Applications
    4.1 Introduction
    4.2 Data Mining for Cyber Security
        4.2.1 Overview
        4.2.2 Cyber-Terrorism, Insider Threats, and External Attacks
        4.2.3 Malicious Intrusions
        4.2.4 Credit Card Fraud and Identity Theft
        4.2.5 Attacks on Critical Infrastructures
        4.2.6 Data Mining for Cyber Security
    4.3 Current Research and Development
    4.4 Summary
    References

Chapter 5 Design and Implementation of Data Mining Tools
    5.1 Introduction
    5.2 Intrusion Detection
    5.3 Web Page Surfing Prediction
    5.4 Image Classification
    5.5 Summary
    References

Conclusion to Part I

Part II Data Mining for Email Worm Detection

Introduction to Part II

Chapter 6 Email Worm Detection
    6.1 Introduction
    6.2 Architecture
    6.3 Related Work
    6.4 Overview of Our Approach
    6.5 Summary
    References

Chapter 7 Design of the Data Mining Tool
    7.1 Introduction
    7.2 Architecture
    7.3 Feature Description
        7.3.1 Per-Email Features
        7.3.2 Per-Window Features
    7.4 Feature Reduction Techniques
        7.4.1 Dimension Reduction
        7.4.2 Two-Phase Feature Selection (TPS)
            7.4.2.1 Phase I
            7.4.2.2 Phase II
    7.5 Classification Techniques
    7.6 Summary
    References

Chapter 8 Evaluation and Results
    8.1 Introduction
    8.2 Dataset
    8.3 Experimental Setup
    8.4 Results
        8.4.1 Results from Unreduced Data
        8.4.2 Results from PCA-Reduced Data
        8.4.3 Results from Two-Phase Selection
    8.5 Summary
    References

Conclusion to Part II

Part III Data Mining for Detecting Malicious Executables

Introduction to Part III