Computational Techniques for Text Summarization based on Cognitive Intelligence

The book is concerned with contemporary methodologies used for automatic text summarization. It proposes approaches to well-known problems in text summarization using computational intelligence (CI) techniques, including cognitive approaches. A better understanding of the cognitive basis of the summarization task is still an open research issue; the extent of its use in text summarization is highlighted for further exploration. With the ever-growing volume of text, researchers have little time to spare for extensive reading, and summarized information helps them understand the context in a shorter time.

This book helps students and researchers to summarize text documents automatically in an efficient and effective way. The computational approaches and research techniques presented guide readers toward achieving text summarization with ease. The summarized text generated helps readers learn the context or the domain at a quicker pace. The book includes a reasonable number of illustrations and examples that readers can follow and adapt for their own use. Its aim is not merely to explain what text summarization is, but to enable readers to perform text summarization using various approaches. It also describes measures that help to evaluate, determine, and explore the best possibilities for text summarization for any specific purpose. The illustrations are drawn from the social media and healthcare domains, which shows that the approaches can be applied to other domains as well. A new approach to text summarization based on cognitive intelligence is presented for further exploration in the field.

Computational Techniques for Text Summarization based on Cognitive Intelligence
V. Priya and K. Umamaheswari

First edition published 2023
by CRC Press
6000 Broken Sound Parkway NW, Suite 300, Boca Raton, FL 33487-2742
and by CRC Press
4 Park Square, Milton Park, Abingdon, Oxon, OX14 4RN

CRC Press is an imprint of Taylor & Francis Group, LLC

© 2023 V. Priya and K. Umamaheswari

Reasonable efforts have been made to publish reliable data and information, but the author and publisher cannot assume responsibility for the validity of all materials or the consequences of their use. The authors and publishers have attempted to trace the copyright holders of all material reproduced in this publication and apologize to copyright holders if permission to publish in this form has not been obtained. If any copyright material has not been acknowledged please write and let us know so we may rectify in any future reprint.

Except as permitted under U.S. Copyright Law, no part of this book may be reprinted, reproduced, transmitted, or utilized in any form by any electronic, mechanical, or other means, now known or hereafter invented, including photocopying, microfilming, and recording, or in any information storage or retrieval system, without written permission from the publishers.

For permission to photocopy or use material electronically from this work, access www.copyright.com or contact the Copyright Clearance Center, Inc. (CCC), 222 Rosewood Drive, Danvers, MA 01923, 978-750-8400. For works that are not available on CCC please contact mpkbookspermissions@tandf.co.uk

Trademark notice: Product or corporate names may be trademarks or registered trademarks and are used only for identification and explanation without intent to infringe.
ISBN: 9781032392820 (hbk)
ISBN: 9781032442471 (pbk)
ISBN: 9781003371199 (ebk)

DOI: 10.1201/9781003371199

Typeset in Times by Deanta Global Publishing Services, Chennai, India

Contents

Preface
About This Book

Chapter 1 Concepts of Text Summarization
  1.1 Introduction
  1.2 Need for Text Summarization
  1.3 Approaches to Text Summarization
    1.3.1 Extractive Summarization
    1.3.2 Abstractive Summarization
  1.4 Text Modeling for Extractive Summarization
    1.4.1 Bag-of-Words Model
    1.4.2 Vector Space Model
    1.4.3 Topic Representation Schemes
    1.4.4 Real-Valued Model
  1.5 Preprocessing for Extractive Summarization
  1.6 Emerging Techniques for Summarization
  1.7 Scope of the Book
  References
  Sample Code
    Sample Screenshots

Chapter 2 Large-Scale Summarization Using Machine Learning Approach
  2.1 Scaling to Summarize Large Text
  2.2 Machine Learning Approaches
    2.2.1 Different Approaches for Modeling Text Summarization Problem
    2.2.2 Classification as Text Summarization
      2.2.2.1 Data Representation
      2.2.2.2 Text Feature Extraction
      2.2.2.3 Classification Techniques
    2.2.3 Clustering as Text Summarization
    2.2.4 Deep Learning Approach for Text Summarization
  References
  Sample Code

Chapter 3 Sentiment Analysis Approach to Text Summarization
  3.1 Introduction
  3.2 Sentiment Analysis: Overview
    3.2.1 Sentiment Extraction and Summarization
      3.2.1.1 Sentiment Extraction from Text
      3.2.1.2 Classification
      3.2.1.3 Score Computation
      3.2.1.4 Summary Generation
    3.2.2 Sentiment Summarization: An Illustration
      Summarized Output
    3.2.3 Methodologies for Sentiment Summarization
  3.3 Implications of Sentiments in Text Summarization
    Cognition-Based Sentiment Analysis and Summarization
  3.4 Summary
  Practical Examples
    Example 1
    Example 2
    Sample Code (Run Using GraphLab)
    Example 3
  References
  Sample Code

Chapter 4 Text Summarization Using Parallel Processing Approach
  4.1 Introduction
    Parallelizing Computational Tasks
    Parallelizing for Distributed Data
  4.2 Parallel Processing Approaches
    4.2.1 Parallel Algorithms for Text Summarization
    4.2.2 Parallel Bisection k-Means Method
  4.3 Parallel Data Processing Algorithms for Large-Scale Summarization
    4.3.1 Designing MapReduce Algorithm for Text Summarization
    4.3.2 Key Concepts in Mapper
    4.3.3 Key Concepts in Reducer
    4.3.4 Summary Generation
    An Illustrative Example for MapReduce
      Good Time: Movie Review
  4.4 Other MR-Based Methods
  4.5 Summary
  4.6 Examples
    K-Means Clustering Using MapReduce
    Parallel LDA Example (Using Gensim Package)
    Sample Code: (Using Gensim Package)
    Example: Creating an Inverted Index
    Example: Relational Algebra (Table JOIN)
  References
  Sample Code

Chapter 5 Optimization Approaches for Text Summarization
  5.1 Introduction
  5.2 Optimization for Summarization
    5.2.1 Modeling Text Summarization as Optimization Problem
    5.2.2 Various Approaches for Optimization
  5.3 Formulation of Various Approaches
    5.3.1 Sentence Ranking Approach
      5.3.1.1 Stages and Illustration
    5.3.2 Evolutionary Approaches
      5.3.2.1 Stages
      5.3.2.2 Demonstration
    5.3.3 MapReduce-Based Approach
      5.3.3.1 In-Node Optimization Illustration
    5.3.4 Multi-objective-Based Approach
  Summary
  Exercises
  References
  Sample Code

Chapter 6 Performance Evaluation of Large-Scale Summarization Systems
  6.1 Evaluation of Summaries
    6.1.1 CNN Dataset
    6.1.2 Daily Mail Dataset
    6.1.3 Description
  6.2 Methodologies
    6.2.1 Intrinsic Methods
    6.2.2 Extrinsic Methods
  6.3 Intrinsic Methods
    6.3.1 Text Quality Measures
      6.3.1.1 Grammaticality
      6.3.1.2 Non-redundancy
      6.3.1.3 Referential Clarity
      6.3.1.4 Structure and Coherence
    6.3.2 Co-selection-Based Methods
      6.3.2.1 Precision, Recall, and F-score
      6.3.2.2 Relative Utility
    6.3.3 Content-Based Methods
      6.3.3.1 Content-Based Measures
      6.3.3.2 Cosine Similarity
      6.3.3.3 Unit Overlap
      6.3.3.4 Longest Common Subsequence
      6.3.3.5 N-Gram Co-occurrence Statistics: ROUGE
      6.3.3.6 Pyramids
      6.3.3.7 LSA-Based Measure
      6.3.3.8 Main Topic Similarity
      6.3.3.9 Term Significance Similarity
  6.4 Extrinsic Methods
    6.4.1 Document Categorization
      6.4.1.1 Information Retrieval
      6.4.1.2 Question Answering
    6.4.2 Summary
    6.4.3 Examples
  Bibliography

Chapter 7 Applications and Future Directions
  7.1 Possible Directions in Modeling Text Summarization
  7.2 Scope of Summarization Systems in Different Applications
  7.3 Healthcare Domain
    Future Directions for Medical Document Summarization
  7.4 Social Media
    Challenges in Social Media Text Summarization
      Domain Knowledge and Transfer Learning
    Online Learning
    Information Credibility
    Applications of Deep Learning
    Implicit and Explicit Information for Actionable Insights
  7.5 Research Directions for Text Summarization
  7.6 Further Scope of Research on Large-Scale Summarization
  Conclusion
  References

Appendix A: Python Projects and Useful Links on Text Summarization
Appendix B: Solutions to Selected Exercises
Index

Preface

People have traditionally used written documents to convey important facts, viewpoints, and feelings. New technologies have caused an exponential rise in the volume of documents produced. In social networks, markets, production platforms, and websites, a tremendous volume of messages, product reviews, news pieces, and scientific documents is created and published every day. Although often too verbose for readers, this unstructured material can be quite helpful. Summaries present the most pertinent material succinctly and expose the reader to the key ideas. The new field of automatic text summarization was made possible by advances in text mining, machine learning, and natural language processing. These methods allow for the automatic production of summaries that typically contain either the most pertinent sentences or the most salient keywords from the document or collection. For readers to become familiar with the content of interest rapidly, it is essential to extract a brief but informative description of a single document and/or a collection.

For instance, the summary of a group of news articles on the same subject may provide a synthesized overview of the most important aspects of the news. Likewise, the summary of social network data can help with the discovery of pertinent details about a particular event and the deduction of user and community interests and viewpoints. Several automatic summarization techniques have been put forth in recent years; they are broadly categorized into extractive and abstractive summarization techniques.
This book offers a thorough examination of state-of-the-art methods for text summarization. For both extractive and abstractive summarization tasks, the reader will find in-depth treatment of several methodologies that use machine learning, natural language processing, and data mining techniques. Additionally, it is shown how summarization methodologies can be used in a variety of applications, including the healthcare and social media domains, along with possible research directions and future scope.

The book comprises seven chapters and is organized as follows.

Chapter 1 ‘Concepts of Text Summarization’ gives a basic but detailed account of text representation based on the ideas and principles of text summarization. A detailed discussion of these ideas and practical examples are included for clear understanding. Some exercises related to text representation models are provided for practitioners in the domain.

Chapter 2 ‘Large-Scale Summarization Using Machine Learning Approach’ covers how text summarization can be framed as machine learning problems such as classification, clustering, deep learning, and others. It also examines the complexities and challenges encountered while using machine learning in the domain of text summarization.

Chapter 3 ‘Sentiment Analysis Approach to Text Summarization’ addresses sentiment-based text summarization. Sentiment extraction and summarization