
Document Processing Using Machine Learning

Edited by Sk Md Obaidullah, KC Santosh, Teresa Gonçalves, Nibaran Das and Kaushik Roy

CRC Press
Taylor & Francis Group
52 Vanderbilt Avenue, New York, NY 10017

© 2020 by Taylor & Francis Group, LLC
CRC Press is an imprint of Taylor & Francis Group, an Informa business

No claim to original U.S. Government works

Printed on acid-free paper

International Standard Book Number-13: 978-0-367-21847-8 (Hardback)

This book contains information obtained from authentic and highly regarded sources. Reasonable efforts have been made to publish reliable data and information, but the author and publisher cannot assume responsibility for the validity of all materials or the consequences of their use. The authors and publishers have attempted to trace the copyright holders of all material reproduced in this publication and apologize to copyright holders if permission to publish in this form has not been obtained. If any copyright material has not been acknowledged please write and let us know so we may rectify in any future reprint.

Except as permitted under U.S. Copyright Law, no part of this book may be reprinted, reproduced, transmitted, or utilized in any form by any electronic, mechanical, or other means, now known or hereafter invented, including photocopying, microfilming, and recording, or in any information storage or retrieval system, without written permission from the publishers.

For permission to photocopy or use material electronically from this work, please access www.copyright.com (http://www.copyright.com/) or contact the Copyright Clearance Center, Inc. (CCC), 222 Rosewood Drive, Danvers, MA 01923, 978-750-8400. CCC is a not-for-profit organization that provides licenses and registration for a variety of users. For organizations that have been granted a photocopy license by the CCC, a separate system of payment has been arranged.

Trademark Notice: Product or corporate names may be trademarks or registered trademarks, and are used only for identification and explanation without intent to infringe.

Visit the Taylor & Francis Web site at http://www.taylorandfrancis.com and the CRC Press Web site at http://www.crcpress.com

Contents

Preface
Editors
Contributors

1. Artificial Intelligence for Document Image Analysis
   Himadri Mukherjee, Payel Rakshit, Ankita Dhar, Sk Md Obaidullah, KC Santosh, Santanu Phadikar and Kaushik Roy
2. An Approach toward Character Recognition of Bangla Handwritten Isolated Characters
   Payel Rakshit, Chayan Halder and Kaushik Roy
3. Artistic Multi-Character Script Identification
   Mridul Ghosh, Himadri Mukherjee, Sk Md Obaidullah, KC Santosh, Nibaran Das and Kaushik Roy
4. A Study on the Extreme Learning Machine and Its Applications
   Himadri Mukherjee, Sahana Das, Subhashmita Ghosh, Sk Md Obaidullah, KC Santosh, Nibaran Das and Kaushik Roy
5. A Graph-Based Text Classification Model for Web Text Documents
   Ankita Dhar, Niladri Sekhar Dash and Kaushik Roy
6. A Study of Distance Metrics in Document Classification
   Ankita Dhar, Niladri Sekhar Dash and Kaushik Roy
7. A Study of Proximity of Domains for Text Categorization
   Ankita Dhar, Niladri Sekhar Dash and Kaushik Roy
8. Supervised Learning for Aggression Identification and Author Profiling over Twitter Dataset
   Kashyap Raiyani and Roy Bayot
9. The Effect of Using Features Computed from Generated Offline Images for Online Bangla Handwritten Character Recognition
   Shibaprasad Sen, Ankan Bhattacharyya and Kaushik Roy
10. Handwritten Character Recognition for Palm-Leaf Manuscripts
    Papangkorn Inkeaw, Jeerayut Chaijaruwanich and Jakramate Bootkrajang

Index

Preface

We are surrounded by huge volumes of data in various categories. In the context of big data, automated and faster tools are in great demand. Often, data exist on scanned documents and can be either machine-printed or handwritten. These include, but are not limited to, letters, checks, payment slips, income tax forms and business forms numbering in the billions. Manual digitization is impossible, time consuming, expensive and vulnerable to errors. Automation can solve such problems.

Compared to machine-printed documents, handwritten document processing can be more challenging and, of course, opens a wide range of issues to consider. Writer identification via handwriting can be one of the important applications. Further, we have experienced that both machine-printed and handwritten texts can go together in one document. Can we treat/process them separately? For this, one needs to be able to extract meaningful information; it can be from either machine-printed or handwritten texts.
This can open up ideas of document information extraction and understanding, such as information retrieval and natural language processing.

Advances in document processing can only be possible by considering appropriate machine learning algorithms that help build data-driven models or predictions. The process includes classification, clustering, regression, association rules and many more elements. How can we forget deep learning-based approaches? In this book, advanced document processing techniques using up-to-date and well-trained machine learning algorithms are presented.

Chapter 1 discusses the role of AI for different document analysis problems, such as optical character recognition and web text categorization, where real-world issues are considered.

Chapter 2 discusses multiple methods for optical character recognition on handwritten isolated Bangla characters. It provides results on a database of 89,000 plus handwritten characters in addition to two publicly available benchmark databases, namely the ISI handwritten character database and CMATERdb 3.1.2.

Chapter 3 presents a script identification study on multi-script artistic documents where real-world problems are taken into consideration. The primary motivation behind this work is that OCR tools are script-dependent. In the authors’ work, a semi-automated segmentation algorithm has been used for character separation within words followed by a thinning procedure and a structural feature; Gabor filters are used for feature extraction. The results were reported using several different machine learning classifiers.

Chapter 4 discusses the use of extreme learning machines (ELMs), as they suppress the pitfalls of neural networks like slow learning rates and parameter tuning. The authors provide an idea of how document analysis can be done using an ELM.

Internet technology has brought about substantial changes to our day-to-day lives. Nowadays, we need several digital means to automatically manage, filter, store and retrieve information from text documents. Automated text document analysis and mining are becoming an essential part of computer applications and thus various classification and clustering approaches are required for carrying out these tasks. The classification of documents needs to be performed in the training dataset, which is further used to train the model classifier to classify the text documents into their respective text categories or domains. Thus, text analysis becomes one of the major aspects of text data mining. Under this scope, Chapters 5, 6 and 7 provide various graph-based models for different web text classification problems.

Chapter 8 discusses two essential aspects of an author, namely author aggression and author profile. The identification of aggression is classified into three classes: overtly, covertly and non-aggressive. On the other hand, profiling identifies the two properties of age and gender. The chapter also explains various machine learning concepts incorporated into the deep neural network. It also provides insight into the previous work done on author aggression and profiling.

Document image analysis problems broadly fall under two categories: offline and online. In Chapter 9, the authors discuss online handwritten Bangla character recognition using features computed from generated offline images. Interestingly, this chapter shows an integrated version/concept of how offline and online domains can be merged to advance recognition performance.

Chapter 10 discusses handwritten character recognition for palm-leaf manuscripts. Palm-leaf manuscripts are ancient documents primarily from South and Southeast Asia. Over hundreds of years, the manuscripts have become damaged. The chapter discusses how to transform the palm-leaf images into machine-encoded texts so that the document can be edited, retrieved, accessed and processed.
The issue of applying optical handwritten character recognition for the said problem, with associated challenges, is covered in detail.

We hope this book will help researchers in document analysis and machine learning as different real-life problems are discussed with experimental outcomes. Additionally, undergraduate or postgraduate scholars who wish to carry out their research-based projects or theses on document analysis problems can get help from this book.

Sk Md Obaidullah
KC Santosh
Teresa Gonçalves
Nibaran Das
Kaushik Roy

Editors

Sk Md Obaidullah earned a PhD (Engg.) from Jadavpur University, an MTech in Computer Science and Application from the University of Calcutta and a BE in Computer Science and Engineering from Vidyasagar University in 2017, 2009 and 2004, respectively. He was an Erasmus post-doctoral fellow funded by the European Commission at the University of Évora, Portugal, from November 2017 to September 2018. He has more than 11 years of professional experience including two years in industry and nine years in academia, out of which five years were spent on research. Presently, he is working as an Associate Professor in the Department of Computer Science and Engineering, Aliah University, Kolkata. He has published more than 65 research articles in renowned journals and reputed national/international conferences. He is an active researcher in the fields of document image processing, medical image analysis, pattern recognition and machine learning.

KC Santosh (Senior Member, IEEE) is an Assistant Professor and Graduate Program Director for the Department of Computer Science at the University of South Dakota (USD). Dr. Santosh also serves the School of Computing and IT, Taylor's University, as a Visiting Associate Professor. Before joining USD, Dr. Santosh worked as a research fellow at the U.S. National Library of Medicine (NLM), National Institutes of Health (NIH). He worked as a postdoctoral research scientist at the LORIA research centre, Université de Lorraine, in direct collaboration with industrial partner ITESOFT in France. He also worked as a research scientist at the INRIA Nancy Grand Est research centre, France, where he completed his PhD in Computer Science. Before that, he worked as a graduate research scholar at SIIT, Thammasat University, Thailand. He has published more than 120 peer-reviewed research articles, authored two books (Springer) and edited ten books, journal issues and conference proceedings. Dr. Santosh serves as an associate editor for the International Journal of Machine Learning & Cybernetics. Dr. Santosh demonstrates expertise in artificial intelligence, machine learning, pattern recognition, computer vision, image processing, data mining and big data with various application domains, such as healthcare and medical imaging, document information content exploitation, biometrics, forensics, speech/audio analysis, satellite imaging, robotics and the Internet of Things.

