USE OF WEB PAGE CREDIBILITY INFORMATION IN INCREASING THE ACCURACY OF WEB-BASED QUESTION ANSWERING SYSTEMS

ASAD ALI SHAH

THESIS SUBMITTED IN FULFILMENT OF THE REQUIREMENT FOR THE DEGREE OF DOCTOR OF PHILOSOPHY

FACULTY OF COMPUTER SCIENCE AND INFORMATION TECHNOLOGY
UNIVERSITY OF MALAYA
KUALA LUMPUR

2017

UNIVERSITY OF MALAYA
ORIGINAL LITERARY WORK DECLARATION

Name of Candidate: Asad Ali Shah
Registration/Matric No: WHA120030
Name of Degree: Doctor of Philosophy in Computer Science
Title of Project Paper/Research Report/Dissertation/Thesis (“this Work”): Use of Web Page Credibility Information in Increasing the Accuracy of Web-Based Question Answering Systems
Field of Study: Information Systems

I do solemnly and sincerely declare that:
(1) I am the sole author/writer of this Work;
(2) This Work is original;
(3) Any use of any work in which copyright exists was done by way of fair dealing and for permitted purposes and any excerpt or extract from, or reference to or reproduction of any copyright work has been disclosed expressly and sufficiently and the title of the Work and its authorship have been acknowledged in this Work;
(4) I do not have any actual knowledge nor do I ought reasonably to know that the making of this work constitutes an infringement of any copyright work;
(5) I hereby assign all and every rights in the copyright to this Work to the University of Malaya (“UM”), who henceforth shall be owner of the copyright in this Work and that any reproduction or use in any form or by any means whatsoever is prohibited without the written consent of UM having been first had and obtained;
(6) I am fully aware that if in the course of making this Work I have infringed any copyright whether intentionally or otherwise, I may be subject to legal action or any other action as may be determined by UM.
Candidate’s Signature
Date: 3rd Aug 2017

Subscribed and solemnly declared before,

Witness’s Signature
Date:
Name:
Designation:

ABSTRACT

Question Answering (QA) systems offer an efficient way of providing precise answers to questions asked in natural language. In Web-based QA systems, the answers are extracted from information sources such as Web pages. These systems are effective at finding relevant Web pages, but they either do not evaluate the credibility of those pages or evaluate only two or three of the seven credibility categories. Unfortunately, much of the information available on the Web is biased, false or fabricated. Extracting answers from such Web pages leads to incorrect answers, which decreases the accuracy of Web-based QA systems and of other systems that rely on Web pages. Most previous and recent studies on Web-based QA systems focus primarily on improving Natural Language Processing and Information Retrieval techniques for scoring answers, without assessing the credibility of Web pages. This research proposes a credibility assessment algorithm that evaluates Web pages and uses their credibility scores to rank answers in Web-based QA systems. The proposed algorithm scores credibility using seven categories: correctness, authority, currency, professionalism, popularity, impartiality and quality, where each category consists of one or more credibility factors. This research attempts to improve accuracy in Web-based QA systems in two steps: first by developing a prototype Web-based QA system, named the Optimal Methods QA (OMQA) system, which uses the methods that produce the highest answer accuracy, and then by adding a credibility assessment module to it, yielding the Credibility-based OMQA (CredOMQA) system.
Both the OMQA and CredOMQA systems have been evaluated with respect to answer accuracy using two quantitative evaluation metrics: 1) percentage of queries correctly answered and 2) Mean Reciprocal Rank. Extensive quantitative experiments and analyses have been conducted on 211 factoid questions taken from the TREC QA track from 1999, 2000 and 2011 and on a random sample of 21 questions from the CLEF QA track for comparison and conclusions. Results from the evaluation of methods and techniques show that some techniques improved the accuracy of retrieved answers more than others performing the same function. In some cases, a combination of different techniques produced higher answer accuracy than using them individually. The inclusion of Web page credibility scores significantly improved the accuracy of the system. Among the seven credibility categories, four (correctness, professionalism, impartiality and quality) had a major impact on answer accuracy, whereas authority, currency and popularity played a minor role. The results conclusively establish that the proposed CredOMQA performs better than other Web-based QA systems. It also outperforms other credibility-based QA systems, which employ credibility assessment only partially. It is expected that these results will help researchers and experts select Web-based QA methods and techniques that produce higher answer accuracy, and evaluate the credibility of sources using the credibility assessment module to improve the accuracy of existing and future information systems. The proposed algorithm can also help in designing credibility-based information systems in areas that require accurate and credible information, such as education, health, stocks, networking and media, and would help enforce new Web-publishing standards, thus enhancing the overall Web experience.
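As an illustration of the two evaluation metrics named in the abstract, the following is a minimal sketch of how percentage of queries correctly answered and Mean Reciprocal Rank are commonly computed from ranked answer lists. It is not the thesis's implementation: the function names, the case-insensitive exact string match, and the `top_n` cut-off are assumptions made for this example.

```python
from typing import List, Sequence, Set, Tuple

# One evaluation item: the system's ranked answers and the set of accepted answers.
# (Hypothetical structure, chosen for this sketch only.)
Run = Tuple[Sequence[str], Set[str]]


def first_correct_rank(ranked: Sequence[str], gold: Set[str]) -> int:
    """Return the 1-based rank of the first correct answer, or 0 if none matches."""
    # Assumption: an answer is correct if it matches an accepted answer
    # after trimming whitespace and lowercasing.
    accepted = {g.strip().lower() for g in gold}
    for rank, answer in enumerate(ranked, start=1):
        if answer.strip().lower() in accepted:
            return rank
    return 0


def percent_correct(runs: List[Run], top_n: int = 1) -> float:
    """Percentage of questions with a correct answer within the first top_n ranks."""
    hits = sum(
        1 for ranked, gold in runs
        if 0 < first_correct_rank(ranked, gold) <= top_n
    )
    return 100.0 * hits / len(runs)  # assumes at least one question


def mean_reciprocal_rank(runs: List[Run]) -> float:
    """Average of 1/rank of the first correct answer; unanswered questions add 0."""
    total = 0.0
    for ranked, gold in runs:
        rank = first_correct_rank(ranked, gold)
        if rank:
            total += 1.0 / rank
    return total / len(runs)  # assumes at least one question


if __name__ == "__main__":
    # Toy data invented for this example, not taken from TREC or CLEF.
    toy_runs: List[Run] = [
        (["Paris", "Lyon"], {"paris"}),                       # correct at rank 1
        (["1970", "1969"], {"1969"}),                         # correct at rank 2
        (["Everest"], {"k2"}),                                # no correct answer
        (["Nile", "Amazon", "Congo", "Yangtze"], {"yangtze"}),  # correct at rank 4
    ]
    print(percent_correct(toy_runs, top_n=1))  # 25.0
    print(mean_reciprocal_rank(toy_runs))      # 0.4375
```

On the toy data, one of four questions is answered correctly at rank 1, and the reciprocal ranks 1, 1/2, 0 and 1/4 average to 0.4375. Mean Reciprocal Rank therefore rewards a system for moving correct answers up the ranking even when they do not reach the top position.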
ABSTRAK

Question answering (QA) systems offer an efficient way of giving precise answers to questions asked in natural language. In Web-based QA systems, answers are taken from information sources such as Web pages. These Web-based QA systems are effective at finding relevant Web pages, but they either do not evaluate the credibility of those pages or evaluate only two to three of the seven credibility categories. Unfortunately, much of the information provided through Web pages is biased, false or fabricated. Answers extracted by such Web-based QA systems are less accurate, which in turn reduces the accuracy of Web-based QA systems and of other systems that depend on Web pages. Most past and recent studies of Web-based QA systems concentrate on improving natural language processing and information retrieval techniques for answer scoring, without assessing the credibility of Web pages. This study proposes a credibility assessment algorithm that evaluates Web pages and uses the credibility score to rank answers in Web-based QA systems. The proposed credibility assessment model uses seven categories to score credibility, namely correctness, authority, currency, professionalism, popularity, impartiality and quality, where each category consists of one or more credibility factors. This study attempts to improve accuracy in Web-based QA systems by developing a prototype Web-based QA system named Optimal Methods QA (OMQA), which uses the methods that produce the highest answer accuracy, and improving it by adding a credibility assessment module, yielding the Credibility-based OMQA (CredOMQA) system.
Both the OMQA and CredOMQA systems were evaluated in terms of answer accuracy using two quantitative evaluation metrics: 1) percentage of queries answered correctly and 2) Mean Reciprocal Rank. Extensive quantitative experiments and analyses were carried out on 211 factoid questions from the TREC QA track of 1999, 2000 and 2011 and on a random sample of 21 questions from the CLEF QA track for comparison and conclusions. The results of the evaluation of methods and techniques show that some techniques improve answer accuracy more than other techniques that perform the same function. In some cases, a combination of different techniques produced higher answer accuracy than using them individually. Including the Web page credibility score significantly improved the accuracy of the system. Among the seven credibility categories, four categories, namely correctness, professionalism, impartiality and quality, had a large effect on answer accuracy, whereas authority, popularity and currency played a minor role. The results conclusively establish that the proposed CredOMQA is more effective than other Web-based QA systems. It also outperforms credibility-based QA systems that apply credibility assessment only partially. These results are expected to help researchers and experts select Web-based QA methods and techniques that produce higher accuracy in answer extraction, and evaluate the credibility of sources using the credibility assessment algorithm to improve the accuracy of existing and future information systems. The proposed model can also help in designing credibility-based information systems in areas including education, health, stocks, networking and media, which require accurate and trustworthy information, and can help enforce new Web-publishing standards, thereby improving the overall Web experience.
ACKNOWLEDGEMENTS

First and foremost, thanks to Allah for bestowing knowledge upon me and guiding me in pursuing my Ph.D. Accomplishing anything requires both moral and technical guidance. For technical guidance, I would like to thank my supervisor, Dr. Sri Devi Ravana, for always being cooperative and providing the necessary assistance whenever it was required. I would also like to thank my co-supervisors, Dr. Suraya Hamid and Dr. Maizatul Akmar Binti Ismail, for their advice on improving my work. A man can achieve only a little without moral support; for that, all credit goes to my better half, my wife Arooj, who has always encouraged me to give my best and has supported me whenever I needed it most. My daughter has also been a blessing to me during my PhD; every time I looked at her I knew what needed to be done, and that kept me pushing forward. Lastly, I thank my parents, in-laws and family members back home, who have supported and guided me throughout my research.

TABLE OF CONTENTS

Abstract
Abstrak
Acknowledgements
Table of Contents
List of Figures
List of Tables
List of Symbols and Abbreviations
CHAPTER 1: INTRODUCTION
  1.1 Motivation
    1.1.1 Web-based QA systems methods and techniques
    1.1.2 Credibility assessment
  1.2 Research questions
  1.3 Research objectives
  1.4 Contributions
  1.5 Overview of research
  1.6 Structure of the thesis

CHAPTER 2: LITERATURE REVIEW
  2.1 Web-based QA systems
    2.1.1 QA systems types and characterization
    2.1.2 Web-based QA systems vs state-of-the-art QA systems
    2.1.3 Web-based QA system model
    2.1.4 Methods and techniques in Web-based QA systems
      2.1.4.1 Question analysis
      2.1.4.2 Answer extraction
      2.1.4.3 Answer scoring
      2.1.4.4 Answer aggregation
    2.1.5 Web-based QA systems summary
  2.2 Web credibility
    2.2.1 Defining credibility
    2.2.2 Perceiving Web credibility and difficulties faced
    2.2.3 Credibility categories
      2.2.3.1 Correctness
      2.2.3.2 Authority
      2.2.3.3 Currency
      2.2.3.4 Professionalism
      2.2.3.5 Popularity
      2.2.3.6 Impartiality
      2.2.3.7 Quality
      2.2.3.8 Credibility categories summary
    2.2.4 Web credibility evaluation
      2.2.4.1 Evaluation techniques by humans
      2.2.4.2 Evaluation techniques using computers
      2.2.4.3 Issues in the existing Web credibility evaluation approaches
  2.3 Credibility assessment in Web-based QA systems
  2.4 Research gap

CHAPTER 3: RESEARCH METHODOLOGY
  3.1 Research flow
    3.1.1 Web credibility assessment
    3.1.2 Develop a Web-based QA system