USE OF WEB PAGE CREDIBILITY INFORMATION IN
INCREASING THE ACCURACY OF WEB-BASED
QUESTION ANSWERING SYSTEMS
ASAD ALI SHAH
THESIS SUBMITTED IN FULFILMENT OF THE
REQUIREMENT FOR THE DEGREE OF DOCTOR OF
PHILOSOPHY
FACULTY OF COMPUTER SCIENCE AND
INFORMATION TECHNOLOGY
UNIVERSITY OF MALAYA
KUALA LUMPUR
2017
UNIVERSITY OF MALAYA
ORIGINAL LITERARY WORK DECLARATION
Name of Candidate: Asad Ali Shah
Registration/Matric No: WHA120030
Name of Degree: Doctor of Philosophy in Computer Science
Title of Project Paper/Research Report/Dissertation/Thesis (“this Work”):
Use of Web Page Credibility Information in Increasing the Accuracy of Web-Based
Question Answering Systems
Field of Study: Information Systems
I do solemnly and sincerely declare that:
(1) I am the sole author/writer of this Work;
(2) This Work is original;
(3) Any use of any work in which copyright exists was done by way of fair dealing
and for permitted purposes and any excerpt or extract from, or reference to or
reproduction of any copyright work has been disclosed expressly and
sufficiently and the title of the Work and its authorship have been
acknowledged in this Work;
(4) I do not have any actual knowledge nor do I ought reasonably to know that the
making of this work constitutes an infringement of any copyright work;
(5) I hereby assign all and every rights in the copyright to this Work to the
University of Malaya (“UM”), who henceforth shall be owner of the copyright
in this Work and that any reproduction or use in any form or by any means
whatsoever is prohibited without the written consent of UM having been first
had and obtained;
(6) I am fully aware that if in the course of making this Work I have infringed any
copyright whether intentionally or otherwise, I may be subject to legal action
or any other action as may be determined by UM.
Candidate’s Signature Date: 3rd Aug 2017
Subscribed and solemnly declared before,
Witness’s Signature Date:
Name:
Designation:
ABSTRACT
Question Answering (QA) systems offer an efficient way of providing precise answers to
questions asked in natural language. In the case of Web-based QA systems, the answers
are extracted from information sources such as Web pages. These Web-based QA systems
are effective in finding relevant Web pages, but they either do not evaluate the
credibility of Web pages or evaluate only two to three of the seven credibility categories.
Unfortunately, much of the information available on the Web is biased, false or fabricated.
Extracting answers from such Web pages leads to incorrect answers, thus decreasing the
accuracy of Web-based QA systems and other systems relying on Web pages. Most
previous and recent studies on Web-based QA systems focus primarily on improving
Natural Language Processing and Information Retrieval techniques for scoring answers,
without assessing the credibility of Web pages.
This research proposes a credibility assessment algorithm for evaluating Web pages
and using their credibility scores for ranking answers in Web-based QA systems. The
proposed credibility assessment algorithm uses seven categories for scoring credibility:
correctness, authority, currency, professionalism, popularity, impartiality and
quality, where each category consists of one or more credibility factors. This research
attempts to improve accuracy in Web-based QA systems in two steps. First, it develops a
prototype Web-based QA system, named the Optimal Methods QA (OMQA) system, which
uses the methods producing the highest accuracy of answers. Second, it extends that
prototype with a credibility assessment module, yielding the Credibility-based OMQA
(CredOMQA) system. Both the OMQA and CredOMQA systems have been evaluated with
respect to accuracy of answers, using two quantitative evaluation metrics: 1) percentage
of queries correctly answered and 2) Mean Reciprocal Rank. Extensive quantitative
experiments and analyses have been conducted on 211 factoid questions taken from the
TREC QA track from 1999, 2000 and 2011 and a random sample of 21 questions from the
CLEF QA track for comparison and conclusions.
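As an illustration of the two metrics named above, the following is a minimal sketch, not the evaluation code used in this thesis. The function names, the `top_n` cutoff, the lower-case matching rule and the three sample questions are all assumptions made for the example; only the standard definition of Mean Reciprocal Rank (the mean of 1/rank of the first correct answer, counted as 0 when no correct answer is returned) is taken as given.

```python
from typing import List, Sequence, Set, Tuple

# One run = (ranked answers returned by a system, set of accepted answers).
# Accepted answers are assumed to be stored in lower case (illustrative choice).
Run = Tuple[Sequence[str], Set[str]]


def reciprocal_rank(ranked: Sequence[str], accepted: Set[str]) -> float:
    """Return 1/rank of the first accepted answer, or 0.0 if none is found."""
    for rank, answer in enumerate(ranked, start=1):
        if answer.strip().lower() in accepted:
            return 1.0 / rank
    return 0.0


def mean_reciprocal_rank(runs: Sequence[Run]) -> float:
    """Average the reciprocal rank over all questions."""
    if not runs:
        return 0.0
    return sum(reciprocal_rank(r, a) for r, a in runs) / len(runs)


def percent_correct(runs: Sequence[Run], top_n: int = 1) -> float:
    """Percentage of questions with an accepted answer within the top_n ranks."""
    if not runs:
        return 0.0
    hits = sum(1 for r, a in runs if reciprocal_rank(r[:top_n], a) > 0.0)
    return 100.0 * hits / len(runs)


# Hypothetical data: three questions with made-up ranked answers.
example_runs: List[Run] = [
    (["Paris", "Lyon"], {"paris"}),   # correct at rank 1
    (["1970", "1969"], {"1969"}),     # correct at rank 2
    (["Mars"], {"venus"}),            # no correct answer returned
]

print("MRR:", mean_reciprocal_rank(example_runs))
print("Correct at top 1 (%):", percent_correct(example_runs, top_n=1))
print("Correct at top 2 (%):", percent_correct(example_runs, top_n=2))
```

For the sample data the reciprocal ranks are 1, 1/2 and 0, giving an MRR of 0.5. One of the three questions is answered correctly at rank 1, and two of the three within the top two ranks.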
Results from the evaluation of methods and techniques show that some techniques improved
the accuracy of retrieved answers more than others performing the same function. In some
cases, a combination of different techniques produced higher accuracy of retrieved answers
than using them individually.
The inclusion of Web page credibility scores significantly improved the accuracy of the
system. Among the seven credibility categories, four (correctness, professionalism,
impartiality and quality) had a major impact on answer accuracy, whereas authority,
currency and popularity played a minor role. The results conclusively establish that the
proposed CredOMQA system performs better than other Web-based QA systems. It also
outperforms other credibility-based QA systems, which employ credibility assessment
only partially.
It is expected that these results will help researchers and experts select Web-based
QA methods and techniques that produce higher accuracy of retrieved answers, and evaluate
the credibility of sources using the credibility assessment module to improve the accuracy
of existing and future information systems. The proposed algorithm can also help in
designing credibility-based information systems in areas that require accurate and
credible information, such as education, health, stocks, networking and media, and would
help enforce new Web-publishing standards, thus enhancing the overall Web experience.
ABSTRAK
Question answering (QA) systems offer an efficient way of giving precise answers to
questions asked in natural language. In the case of Web-based QA systems, answers are
taken from information sources such as Web pages. These Web-based QA systems are
effective in finding relevant Web pages, but they either do not evaluate the credibility
of those pages or evaluate only two to three of the seven credibility categories.
Unfortunately, much of the information provided through Web pages is biased, false or
fabricated. Extracting answers from such sources yields inaccurate answers, which in
turn reduces the accuracy of Web-based QA systems and other systems that depend on Web
pages. Most past and recent studies of Web-based QA systems focus mainly on improving
natural language processing and information retrieval techniques for answer scoring,
without assessing the credibility of Web pages.
This study proposes a credibility assessment algorithm for evaluating Web pages and
using their credibility scores to rank answers in Web-based QA systems. The proposed
credibility assessment model uses seven categories to score credibility: correctness,
authority, currency, professionalism, popularity, impartiality and quality, where each
category consists of one or more credibility factors. This study attempts to improve
accuracy in Web-based QA systems by building a prototype Web-based QA system named
Optimal Methods QA (OMQA), which uses the methods that produce the highest answer
accuracy, and by improving it with the addition of a credibility assessment module,
called the Credibility-based OMQA (CredOMQA) system. Both the OMQA and CredOMQA
systems were evaluated in terms of answer accuracy using two quantitative evaluation
metrics: 1) percentage of queries answered correctly and 2) Mean Reciprocal Rank.
Extensive quantitative experiments and analyses were carried out on 211 factoid
questions from the TREC QA track of 1999, 2000 and 2011 and a random sample of 21
questions from the CLEF QA track for comparison and conclusions.
Results from the evaluation of methods and techniques show that some techniques
improve answer accuracy more than other techniques performing the same function. In
some cases, combining different techniques produced higher answer accuracy than using
them individually.
The inclusion of Web page credibility scores improved system accuracy significantly.
Among the seven credibility categories, four (correctness, professionalism,
impartiality and quality) had a large effect on answer accuracy, while authority,
popularity and currency played a minor role. The results conclusively show that the
proposed CredOMQA is more effective than other Web-based QA systems. It also
outperforms credibility-based QA systems that use only partial credibility assessment.
It is expected that these results will help researchers and experts in selecting
Web-based QA methods and techniques that produce higher accuracy in answer extraction,
and in evaluating the credibility of sources using the credibility assessment algorithm
to improve the accuracy of existing and future information systems.
The proposed model can also help in designing credibility-based information systems in
fields that require accurate and trustworthy information, including education, health,
stocks, networking and media, and can help enforce new Web-publishing standards, thereby
improving the overall Web experience.
ACKNOWLEDGEMENTS
First and foremost, thanks to Allah for bestowing knowledge upon me and guiding me in
pursuing my Ph.D. Accomplishing anything requires both moral and technical guidance. For
technical guidance, I would like to thank my supervisor, Dr. Sri Devi Ravana, for always
being cooperative and providing the necessary assistance whenever it was required. I
would also like to thank my co-supervisors, Dr. Suraya Hamid and Dr. Maizatul Akmar
Binti Ismail, for their advice on improving my work. A man can achieve only a
little without moral support; for that, all credit goes to my better half, my wife Arooj,
who has always encouraged me to give my best and supported me
whenever I needed it the most. My daughter has also been a blessing for me during my
Ph.D.; every time I looked at her I knew what needed to be done, and that kept me pushing
forward. Lastly, I thank my parents, in-laws and family members back home, who have
supported and guided me throughout my research.
TABLE OF CONTENTS
Abstract ............................................................................................................................ iii
Abstrak .............................................................................................................................. v
Acknowledgements ........................................................................................................ viii
Table of Contents ............................................................................................................. ix
List of Figures ................................................................................................................ xiv
List of Tables................................................................................................................. xvii
List of Symbols and Abbreviations ................................................................................ xxi
CHAPTER 1: INTRODUCTION .................................................................................. 1
Motivation................................................................................................................ 3
1.1.1 Web-based QA systems methods and techniques ...................................... 8
1.1.2 Credibility assessment ................................................................................ 9
Research questions................................................................................................. 11
Research objectives ............................................................................................... 11
Contributions ......................................................................................................... 12
Overview of research ............................................................................................. 13
Structure of the thesis ............................................................................................ 15
CHAPTER 2: LITERATURE REVIEW .................................................................... 17
Web-based QA systems ......................................................................................... 17
2.1.1 QA systems types and characterization .................................................... 17
2.1.2 Web-based QA systems vs state-of-the-art QA systems .......................... 21
2.1.3 Web-based QA system model .................................................................. 22
2.1.4 Methods and techniques in Web-based QA systems ................................ 23
2.1.4.1 Question analysis ....................................................................... 29
2.1.4.2 Answer extraction ..................................................................... 31
2.1.4.3 Answer scoring .......................................................................... 38
2.1.4.4 Answer aggregation ................................................................... 41
2.1.5 Web-based QA systems summary ............................................................ 42
Web credibility ...................................................................................................... 43
2.2.1 Defining credibility .................................................................................. 43
2.2.2 Perceiving Web credibility and difficulties faced .................................... 44
2.2.3 Credibility categories ............................................................................... 49
2.2.3.1 Correctness ................................................................................ 50
2.2.3.2 Authority ................................................................................... 51
2.2.3.3 Currency .................................................................................... 52
2.2.3.4 Professionalism ......................................................................... 54
2.2.3.5 Popularity .................................................................................. 56
2.2.3.6 Impartiality ................................................................................ 57
2.2.3.7 Quality ....................................................................................... 58
2.2.3.8 Credibility categories-summary ................................................ 60
2.2.4 Web credibility evaluation ....................................................................... 61
2.2.4.1 Evaluation techniques by humans ............................................. 62
2.2.4.2 Evaluation techniques using computers .................................... 76
2.2.4.3 Issues in the existing Web credibility evaluation approaches ... 98
Credibility assessment in Web-based QA systems ................................................ 99
Research gap ........................................................................................................ 105
CHAPTER 3: RESEARCH METHODOLOGY ............................................................ 109
Research flow ...................................................................................................... 109
3.1.1 Web credibility assessment .................................................................... 109
3.1.2 Develop a Web-based QA system .......................................................... 110