Experiment and Evaluation in Information Retrieval Models
K. Latha
CRC Press
Taylor & Francis Group
6000 Broken Sound Parkway NW, Suite 300
Boca Raton, FL 33487-2742
© 2018 by Taylor & Francis Group, LLC
CRC Press is an imprint of Taylor & Francis Group, an Informa business
International Standard Book Number-13: 978-1-138-03231-6 (Hardback)
Library of Congress Cataloging–in–Publication Data
Names: Latha, K., author.
Title: Experiment and evaluation in information retrieval models / K. Latha.
Description: Boca Raton : CRC Press, Taylor & Francis Group, [2016] |
Includes bibliographical references and index.
Identifiers: LCCN 2017004392| ISBN 9781138032316 (hardback : alk. paper) |
ISBN 9781315392622 (ebook) | ISBN 9781315392615 (ebook) |
ISBN 9781315392608 (ebook) | ISBN 9781315392592 (ebook)
Subjects: LCSH: Data mining. | Querying (Computer science) | Big data. |
Information retrieval--Experiments. | Information storage and retrieval
systems--Evaluation.
Classification: LCC QA76.9.D343 L384 2016 | DDC 006.3/12--dc23
LC record available at https://lccn.loc.gov/2017004392
Visit the Taylor & Francis Web site at
http://www.taylorandfrancis.com
and the CRC Press Web site at
http://www.crcpress.com
Contents
Preface
Acknowledgments
About the Author
Section I Foundations
1 Introduction
1.1 Motivation
1.1.1 Web Search
1.2 Evolutionary Search and IR
1.3 Applications of IR
1.3.1 Other Search Applications
Section II Preliminaries
2 Preliminaries
2.1 Information Retrieval
2.2 Information Retrieval versus Data Retrieval
2.3 Information Retrieval (IR) versus Information Extraction (IE)
2.4 Components of an Information Retrieval System
2.4.1 Document Processing
2.4.2 Query Processing
2.4.3 Retrieval and Feedback Generation Component
3 Contextual and Conceptual Information Retrieval
3.1 Context Search
3.1.1 Need for Contextual Search
3.1.2 Graphical Representation of Context-Based Search
3.1.3 Architecture of Context-Based Indexing
3.1.4 Approaches for Context Search
3.1.4.1 Searching Based on Explicitly Specifying User Context
3.1.4.2 Searching Based on Automatically Derived Context
3.1.5 Traditional Method for Context-Based Search: User Profile-Based Context Search
3.2 Conceptual Search
3.2.1 The Semantic Web
3.2.2 Ontology
3.2.3 Approaches to Conceptual Search
3.2.4 Types of Conceptual Structures
3.2.5 Features of Conceptual Structures
3.2.6 Framework for Concept-Based Search
3.2.7 Concept Chain Graphs
4 Information Retrieval Models
4.1 Boolean Model
4.2 Vector Model
4.2.1 The Vector Space Model
4.2.2 Similarity Measures
4.2.2.1 Cosine Similarity
4.2.2.2 Jaccard Coefficient
4.2.2.3 Dice Coefficient
4.3 Fixing the Term Weights
4.3.1 Term Frequency
4.3.2 Inverse Document Frequency
4.3.3 tf-idf
4.4 Probabilistic Models
4.4.1 Probabilistic Ranking Principle (PRP)
4.4.2 Binary Independence Retrieval (BIR) Model
4.4.3 The Probabilistic Indexing Model
4.5 Language Model
4.5.1 Multinomial Distributions Model
4.5.2 The Query Likelihood Model
4.5.3 Extended Language Modeling Approaches
4.5.4 Translation Model
4.5.5 Comparisons with Traditional Probabilistic IR Approaches
5 Evaluation of Information Retrieval Systems
5.1 Ranked and Unranked Results
5.1.1 Relevance
5.2 Unranked Retrieval System
5.2.1 Precision
5.2.2 Recall
5.2.3 Accuracy
5.2.4 F-Measure
5.2.5 G-Measure
5.2.6 Prevalence
5.2.7 Error Rate
5.2.8 Fallout
5.2.9 Miss Rate
5.3 Ranked Retrieval System
5.3.1 Precision and Recall Curves
5.3.2 Average Precision
5.3.3 Precision at k
5.3.4 R-Precision
5.3.5 Mean Average Precision (MAP)
5.3.6 Breakeven Point
5.3.7 ROC Curve
5.3.7.1 Relationship between PR and ROC Curves
6 Fundamentals of Evolutionary Algorithms
6.1 Combinatorial Optimization Problems
6.1.1 Heuristics
6.1.2 Metaheuristics
6.1.3 Case-Based Reasoning (CBR)
6.2 Evolutionary Programming
6.3 Evolutionary Computation
6.3.1 Single-Objective Optimization
6.3.2 Multi-Objective Optimization
6.4 Role of Evolutionary Algorithms in Information Retrieval
6.5 Evolutionary Algorithms
6.5.1 Firefly Algorithm
6.5.2 Particle Swarm Optimization
6.5.3 Genetic Algorithms
6.5.4 Genetic Programming
6.5.5 Applications of Genetic Programming
6.5.6 Simulated Annealing
6.5.7 Harmony Search
6.5.8 Differential Evolution
6.5.9 Tabulated Search
Section III Demand of Evolutionary Algorithms in IR
7 Demand of Evolutionary Algorithms in Information Retrieval
7.1 Document Ranking
7.1.1 Retrieval Effectiveness
7.2 Relevance Feedback Approach
7.2.1 Relevance Feedback in Text IR
7.2.1.1 Query Expansion
7.2.2 Relevance Feedback in Content-Based Image Retrieval
7.2.3 Relevance Feedback in Region-Based Image Retrieval
7.3 Term-Weighting Approaches
7.3.1 Term Frequency
7.3.2 Inverse Document Frequency
7.4 Document Retrieval
7.5 Feature Selection Approach
7.5.1 Filter Method for Feature Selection
7.5.2 Wrapper Method for Feature Selection
7.5.3 Embedded Method for Feature Selection
7.6 Image Retrieval
7.6.1 Content-Based Image Retrieval
7.6.1.1 Feature Extraction
7.6.1.2 Color Descriptor
7.6.1.3 Texture Descriptor
7.6.1.4 Shape Descriptor
7.6.1.5 Similarity Measure
7.6.2 Region-Based Image Retrieval
7.6.2.1 Image Segmentation
7.6.2.2 Similarity Measure
7.6.3 Image Summarization
7.6.3.1 Multimodal Image Collection Summarization
7.6.3.2 Bag of Words
7.6.3.3 Dictionary Learning for Calculating Sparse Approximations
7.7 Web-Based Recommendation System
7.8 Web Page Classification
7.9 Facet Generation
7.10 Duplicate Detection System
7.11 Improvisation of Seeker Satisfaction in Community Question Answering Systems
7.12 Abstract Generation
Section IV Model Formulations of Information Retrieval Techniques
8 TABU Annealing: An Efficient and Scalable Strategy for Document Retrieval
8.1 Simulated Annealing
8.1.1 The Simulated Annealing Algorithm
8.1.2 Cooling Schedules
8.2 TABU Annealing Algorithm
8.3 Empirical Results and Discussion
9 Efficient Latent Semantic Indexing-Based Information Retrieval Framework Using Particle Swarm Optimization and Simulated Annealing
9.1 Architecture of Proposed Information Retrieval System
9.2 Methodology and Solutions
9.2.1 Text Preprocessing
9.2.2 Dimensionality Reduction
9.2.2.1 Dimensionality Reduction Using Latent Semantic Indexing
9.2.2.2 Query Conversion Using LSI
9.2.3 Clustering of Dimensionally Reduced Documents
9.2.3.1 Background of Particle Swarm Optimization (PSO) Algorithm
9.2.3.2 Background of K-Means
9.2.3.3 Hybrid PSO + K-Means Algorithm
9.2.4 Simulated Annealing for Document Retrieval
9.3 Experimental Results and Discussion
9.3.1 Performance Evaluation for Clustering
9.3.2 Performance Evaluation for Document Retrieval
10 Music-Inspired Optimization Algorithm: Harmony-TABU for Document Retrieval Using Rhetorical Relations and Relevance Feedback
10.1 The Basic Harmony Search Clustering Algorithm
10.1.1 Basic Structure of Harmony Search Algorithm
10.1.2 Representation of Documents and Queries
10.1.3 Representation of Solutions
10.1.4 Features of Harmony Search
10.1.5 Initialize the Problem and HS Parameters
10.1.6 Harmony Memory Initialization
10.1.7 New Harmony Improvisation
10.1.8 Hybridization
10.1.9 Evaluation of Solutions
10.2 Harmony-TABU Algorithm
10.3 Relevance Feedback and Query Expansion in IR
10.3.1 Presentation Term Selection
10.3.2 Direct Term Feedback (TFB)
10.3.3 Cluster Feedback (CFB)
10.3.4 Term-Cluster Feedback (TCFB)
10.4 Empirical Results and Discussion
10.4.1 Document Collections
10.4.2 Experimental Setup
10.5 Rhetorical Structure
10.6 Abstract Generation
11 Evaluation of Light Inspired Optimization Algorithm-Based Image Retrieval
11.1 Query Selection and Distance Calculation
11.2 Optimization Using a Stochastic Firefly Algorithm
11.2.1 Agents Initialization and Fitness Evaluation
11.2.2 Variation in Brightness of Firefly
11.2.3 Strategy for Searching New Swarms
11.3 Experimental Setup
11.4 Visual Signature
11.5 Performance Measures
11.6 Parameter Settings of Firefly Algorithm
11.7 Performance Evaluation
12 An Evolutionary Approach for Optimizing Content-Based Image Retrieval Using Support Vector Machine
12.1 Relevance Feedback Learning via Support Vector Machine
12.2 Optimization Using a Stochastic Firefly Algorithm
12.3 Image Database
12.4 Baselines
12.5 Comparison Methods
13 An Application of Firefly Algorithm to Region-Based Image Retrieval
13.1 Image Retrieval
13.1.1 Image Segmentation
13.1.2 Image Representation
13.1.3 Similarity Measure
13.2 Optimization Using a Stochastic Firefly Algorithm
13.2.1 Firefly Agent’s Initialization and Fitness Evaluation
13.2.2 Attraction toward New Firefly
13.2.3 Movement of Fireflies
13.3 Image Databases
13.4 Performance Evaluation
14 An Evolutionary Approach for Optimizing Region-Based Image Retrieval Using Support Vector Machine
14.1 Region-Based Image Retrieval
14.2 Behavior of Fireflies
14.3 Why Is the Firefly Algorithm So Efficient?
14.4 Machine Learning
14.5 Support Vector Machines
14.6 Optimization of SVM by PSO
14.6.1 SVM-Based RF
14.7 Optimization Using a Stochastic Firefly Algorithm
14.8 Image Databases
14.8.1 COIL Database
14.8.2 The Corel Database
14.9 Baselines
14.9.1 The Proposed SVM: FA Approach
14.10 Discussion
14.10.1 Comparison of FA with PSO and GA
15 Optimization of Sparse Dictionary Model for Multimodal Image Summarization Using Firefly Algorithm
15.1 Image Representation
15.2 Problem Formulation
15.3 Optimization of Dictionary Learning
15.4 Sparse Coding
15.5 Iterative Dictionary Selection Stage
15.6 Performance Analysis
15.6.1 Experiment Setup
15.6.2 Experimental Specification
15.6.3 Baseline Algorithms
15.6.4 Mean Square Error Performance
Section V Algorithmic Solutions to the Problems in Advanced IR Concepts
16 A Dynamic Feature Selection Method for Document Ranking with Relevance Feedback Approach
16.1 Overview
16.2 Feature Selection Procedures
16.2.1 Markov Random Field (MRF) Model for Feature Selection
16.2.2 Correlation-Based Feature Selection
16.2.3 Count Difference-Based Feature Selection
16.3 Proposed Approach for Feature Selection
16.3.1 Feature Generalization with Association Rule Induction
16.3.2 Ranking
16.3.2.1 Document Ranking Using BM25 Weighting Function
16.3.2.2 Expectation Maximization for Relevance Feedback
16.4 Empirical Results and Discussion
16.4.1 Dataset Used for Feature Selection
16.4.2 n-Gram Generation
16.4.3 Evaluation
17 TDCCREC: An Efficient and Scalable Web-Based Recommendation System
17.1 Recommendation Methodologies
17.1.1 Learning Automata (LA)
17.1.2 Weighted Association Rule
17.1.3 Content-Based Recommendation
17.1.4 Collaborative Filtering-Based Recommendation
17.2 Proposed Approach: Truth Discovery-Based Content and Collaborative Recommender System (TDCCREC)
17.3 Empirical Results and Discussion
18 An Automatic Facet Generation Framework for Document Retrieval
18.1 Baseline Approach
18.1.1 Drawbacks
18.2 Greedy Algorithm
18.2.1 Drawbacks
18.3 Feedback Language Model
18.4 Proposed Method: Automatic Facet Generation Framework (AFGF)
18.5 Empirical Results and Discussion
19 ASPDD: An Efficient and Scalable Framework for Duplication Detection
19.1 Duplication Detection Techniques
19.1.1 Prior Work
19.1.1.1 Similarity Measures
19.1.1.2 Shingling Techniques
19.1.2 Proposed Approach (ASPDD)
19.2 Empirical Results and Discussion
20 Improvisation of Seeker Satisfaction in Yahoo! Community Question Answering Using Automatic Ranking, Abstract Generation, and History Updation
20.1 The Asker Satisfaction Problem
20.2 Community Question Answering Problems
20.3 Methodologies
20.4 Experimental Setup
20.5 Empirical Results and Discussion
Section VI Findings and Summary
21 Findings and Summary of Text Information Retrieval Chapters
21.1 Findings and Summary
21.2 Future Directions
22 Findings and Summary of Image Retrieval and Assessment of Image Mining Systems Chapters
22.1 Experimental Setup
22.2 Results and Discussions
22.3 Findings 1: Average Precision-Recall Curves of Proposed Image Retrieval Systems for Pascal Database
22.4 Findings 2: Average Precision and Average Recall of Proposed Methods for Different Semantic Classes
22.5 Findings 3: Average Precision and Average Recall of Top-Ranked Results after the Ninth Feedback for Corel Database
22.6 Findings 4: Average Precision of Top-Ranked Results after the Ninth Feedback for IR with Summarization and IR without Summarization
22.7 Findings 5: Average Execution Time of Proposed Methods
22.8 Findings 6: Performance Analysis of Top Retrieval Results Obtained with the Proposed Image Retrieval Systems
22.9 Summary
22.10 Future Scope
Appendix: Abbreviations, Acronyms and Symbols
Bibliography
Index