SEARCH ENGINE ENHANCEMENT BY EXTRACTING HIDDEN AJAX CONTENT IN WEB APPLICATIONS

by

PAUL SUGANTHAN G C 20084053
MUTHUKUMAR V 20084041
NANDHAKUMAR B 20084043

A project report submitted to the FACULTY OF INFORMATION AND COMMUNICATION ENGINEERING in partial fulfillment of the requirements for the award of the degree of BACHELOR OF ENGINEERING in COMPUTER SCIENCE AND ENGINEERING

DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING
ANNA UNIVERSITY CHENNAI
CHENNAI - 600025
MAY 2012

CERTIFICATE

Certified that this project report titled “SEARCH ENGINE ENHANCEMENT BY EXTRACTING HIDDEN AJAX CONTENT IN WEB APPLICATIONS” is the bona fide work of PAUL SUGANTHAN G C (20084053), MUTHUKUMAR V (20084041), NANDHAKUMAR B (20084043), who carried out the project work under my supervision, for the fulfillment of the requirements for the award of the degree of Bachelor of Engineering in Computer Science and Engineering. Certified further that to the best of my knowledge, the work reported herein does not form part of any other thesis or dissertation on the basis of which a degree or an award was conferred on an earlier occasion on these or any other candidates.

Place: Chennai
Date:

Dr. V Vetriselvi
Project Guide, Designation,
Department of Computer Science and Engineering,
Anna University Chennai, Chennai - 600025

COUNTERSIGNED

Head of the Department,
Department of Computer Science and Engineering,
Anna University Chennai, Chennai - 600025

ACKNOWLEDGEMENTS

We express our deep gratitude to our guide, Dr. V VETRISELVI, for guiding us through every phase of the project. We appreciate her thoroughness, tolerance and ability to share her knowledge with us. We thank her for being easily approachable and quite thoughtful. Apart from adding her own input, she has encouraged us to think on our own and give form to our thoughts. We owe her for harnessing our potential and bringing out the best in us. Without her immense support through every step of the way, we could never have made it to this extent.

We are extremely grateful to Dr. K.S. EASWARAKUMAR, Head of the Department of Computer Science and Engineering, Anna University, Chennai 600025, for extending the facilities of the Department towards our project and for his unstinting support. We express our thanks to the panel of reviewers, Dr. ARUL SIROMONEY, Dr. A.P. SHANTHI and Dr. MADHAN KARKY, for their valuable suggestions and critical reviews throughout the course of our project.

We thank our parents, family, and friends for bearing with us throughout the course of our project and for the opportunity they provided us in undergoing this course in such a prestigious institution.

Paul Suganthan G C
Muthukumar V
Nandhakumar B

ABSTRACT

Current search engines such as Google and Yahoo! are prevalent for searching the Web. Search on dynamic client-side Web pages is, however, either nonexistent or far from perfect, and is not addressed by existing work, for example on the Deep Web. This is a real impediment, since AJAX and Rich Internet Applications are already very common on the Web. AJAX applications are composed of states that can be seen by the user but not by the search engine, and that the user changes through client-side events. Current search engines either ignore AJAX applications or produce false negatives. The reason is that crawling client-side code is a difficult problem that cannot be solved naively by invoking user events.

This project proposes a solution for crawling and extracting hidden AJAX content, thereby enabling search engines to improve the quality of their search results by indexing dynamic AJAX content. Although AJAX content can be crawled by manually invoking client-side events in a browser, enhancing a search engine to crawl AJAX content automatically, as it does for traditional web applications, has not been achieved. The project describes the design and implementation of an AJAX Crawler and then enables a search engine to index the crawled states of an AJAX page.
The performance of the AJAX Crawler is evaluated and compared with that of a traditional crawler. The possible issues in crawling AJAX content and future optimizations are also analysed.

ABSTRACT (TAMIL)

None of the current search engines search web pages on the Internet that contain frequently changing text. As a result, much of the text on the Internet remains unknown to people. The aim of this project is to make this hidden text visible to search engines. Even major search engines such as Google and Yahoo fail to notice much of this text. Through this project, much of the hidden text on the Internet will therefore be discovered by search engines, and the amount of hidden text on the Internet will decrease. The project will thus increase the capability of search engines.

Contents

CERTIFICATE
ACKNOWLEDGEMENTS
ABSTRACT (ENGLISH)
ABSTRACT (TAMIL)
LIST OF FIGURES
LIST OF TABLES
LIST OF ABBREVIATIONS

1 INTRODUCTION
    1.1 AJAX
    1.2 Crawler
    1.3 Problem Definition
    1.4 Scope of the Project
    1.5 Organisation of this Report

2 RELATED WORK
    2.1 Crawling AJAX
    2.2 Finite State Machine
    2.3 Google’s AJAX Crawling Scheme

3 REQUIREMENTS ANALYSIS
    3.1 Functional Requirements
    3.2 Non-Functional Requirements
        3.2.1 User Interface
        3.2.2 Hardware Considerations
        3.2.3 Performance Characteristics
        3.2.4 Security Issues
        3.2.5 Safety Issues
    3.3 Constraints
    3.4 Assumptions

4 SYSTEM DESIGN
    4.1 System Architecture
        4.1.1 Architecture Diagram
    4.2 Module Descriptions
        4.2.1 Identification of Clickables
        4.2.2 Event Invocation
        4.2.3 State Machine representation of AJAX website
            4.2.3.1 Visualizing the State Machine
        4.2.4 Indexing
        4.2.5 Searching
        4.2.6 Reconstruction of state
    4.3 User Interface Design
    4.4 UseCase Model
        4.4.1 UseCase Diagram
    4.5 System Sequence Diagram
        4.5.1 Event Invocation
        4.5.2 Searching
    4.6 Data Flow Model
        4.6.1 Data Flow Diagram

5 SYSTEM DEVELOPMENT
    5.1 Implementation
        5.1.1 Tools Used
        5.1.2 Implementation Description
            5.1.2.1 Ajax Crawling Algorithm
            5.1.2.2 State Machine
            5.1.2.3 Indexing
            5.1.2.4 Searching
            5.1.2.5 Reconstruction of a particular state after crawling

6 RESULTS AND DISCUSSION
    6.1 Results
    6.2 Performance Evaluation
        6.2.1 Crawling Time
            6.2.1.1 Number of States Vs Crawling Time
        6.2.2 Clickable Selection Policy
            6.2.2.1 Number of AJAX Requests Vs Probable Clickables
            6.2.2.2 Probable Clickables Vs Detected Clickables
        6.2.3 Clickable Selection Ratio Vs Crawling Time
    6.3 Search Result Quality
    6.4 Observations

7 CONCLUSIONS
    7.1 Contributions
    7.2 Future Work

A Snapshots
    A.1 Search Interface
    A.2 Google Bot and AJAX Crawler

B DOM
    B.1 DOM - Document Object Model
    B.2 DOM Tree Representation

References

List of Figures

1.1 Crawler Architecture
2.1 AJAX Crawling Scheme
2.2 Control Flow
4.1 Architecture Diagram
4.2 Visualizing State Machine
4.3 UseCase Diagram
4.4 Sequence Diagram - Event Invocation
4.5 Sequence Diagram - Searching
4.6 Level 0 Data Flow Diagram
4.7 Level 1 Data Flow Diagram
4.8 Level 1 Data Flow Diagram
6.1 Number of States Vs Crawling Time (in minutes)
6.2 Number of AJAX Requests Vs Probable Clickables
6.3 Probable Clickables Vs Detected Clickables
6.4 Clickable Selection Ratio Vs Crawling time per state (in minutes)
A.1 Interface I
A.2 Interface II
A.3 Fetched By Google Bot
A.4 Fetched By Google Bot
A.5 Fetched By AJAX Crawler
A.6 Fetched By AJAX Crawler
B.1 DOM Tree

List of Tables

5.1 Tools Used
6.1 Test Cases
6.2 Experimental Results
6.3 Crawling Time
6.4 Clickable Selection Policy