Optical Character Recognition of Amharic Documents

Million Meshesha, C. V. Jawahar
Center for Visual Information Technology
International Institute of Information Technology
Hyderabad - 500 032, India
[email protected], [email protected]

Abstract— Around 2,500 languages are spoken in Africa, and some of them have their own indigenous scripts. Accordingly, a bulk of printed documents is available in libraries, information centers, museums and offices. Digitizing these documents makes it possible to harness already available information technologies for local information needs and development. This paper presents an Optical Character Recognition (OCR) system for converting digitized documents in local languages. An extensive literature survey reveals that this is the first attempt to report the challenges in recognizing indigenous African scripts, together with a possible solution for the Amharic script. Research in the recognition of indigenous African scripts faces major challenges due to (i) the large number of characters used in the writing system and (ii) the existence of a large set of visually similar characters. We propose a novel feature extraction scheme using principal component analysis and linear discriminant analysis, followed by a support vector machine classifier organized as a decision directed acyclic graph. Recognition results are presented on real-life degraded documents, such as books, magazines and newspapers, to demonstrate the performance of the recognizer.

Index Terms— Optical Character Recognition; African Scripts; Feature Extraction; Classification; Amharic Documents.

I. INTRODUCTION

Nowadays it is becoming increasingly important to have information available in digital format for efficient data storage and retrieval, and optical character recognition (OCR) is known as one of the most valuable input technologies in this respect [1], [2]. This trend is reinforced by the emergence of large digital libraries and the advancement of information technology. Digital libraries such as the Universal Digital Library and the Digital Library of India were established to digitize paper documents and make them available to users via the Internet. Advances in information technology provide faster, more powerful processors and high-speed output devices (printers and others) that generate more information at an ever faster rate. The availability of relatively inexpensive document scanners and optical character recognition software has made OCR an attractively priced data entry methodology [1].

Optical character recognition technology originated in the early 1800s, when it was patented as a reading aid for the blind. In 1870, C. R. Carey patented an image transmission system using photocells, and in 1890 P. G. Nipkow invented sequential scanning. Practical OCR technology for reading characters was introduced in the early 1950s as a replacement for keypunching; shortly thereafter, D. H. Shepard developed the first commercial OCR system for typewritten data. The 1980s saw the emergence of OCR systems intended for use with personal computers [3]. Today it is common to find commercially available PC-based OCR systems; however, most of these systems are developed to work with Latin-based scripts [4].
Optical character recognition converts scanned images of printed, typewritten or handwritten documents into a computer-readable format (such as ASCII or Unicode) so as to ease online data processing. The potential of OCR for data entry applications is obvious: it offers a faster, more automated and presumably less expensive alternative to manual data entry, thereby improving the accuracy and speed with which data is transcribed into the computer system. Consequently, it increases efficiency and effectiveness (by reducing cost, time and labor) in information storage and retrieval.

Major applications of OCR include: (i) library and office automation, (ii) form and bank check processing, (iii) document reader systems for the visually impaired, (iv) postal automation, and (v) database and corpus development for language modeling, text mining and information retrieval [2], [5].

While OCR systems are well developed for most languages of the world that use either Latin or non-Latin scripts [6], an extensive literature survey reveals that only a few conference papers are available on the indigenous scripts of African languages. Lately, some research reports have been published on Amharic OCR. Amharic character recognition is discussed in [7], with emphasis on designing a suitable feature extraction scheme for script representation. Recognition using the direction field tensor as a tool for Amharic character segmentation is also reported [8]. Worku and Fuchs [9] present handwritten Amharic bank check recognition. In contrast, no similar works are found for other indigenous African scripts. Therefore, much effort is still needed to come up with better, workable OCR technologies for African scripts in order to satisfy the need for digitized information processing in local languages.

II. AMHARIC SCRIPTS

In Africa, more than 2,500 languages, including regional dialects, are spoken. Some are indigenous languages, while others were introduced by past conquerors. English, French, Portuguese, Spanish and Arabic are official languages of many African countries. As a result, most African languages with a writing system use a modification of the Latin or Arabic script. There are also many languages with their own indigenous scripts and writing systems. These scripts include the Amharic script (Ethiopia), Vai script (West Africa), Hieroglyphic script (Egypt), Bassa script (Liberia), Mende script (Sierra Leone), Nsibidi/Nsibiri script (Nigeria and Cameroon) and Meroitic script (Sudan) [10].

Amharic, which belongs to the Semitic language family, became a dominant language in Ethiopia long ago. It is the official working language of Ethiopia and the most commonly learnt language next to English throughout the country. Accordingly, a bulk of information is available in printed form that needs to be converted into electronic form for easy searching and retrieval according to users' needs. Suffice it to mention the huge number of documents piled high in information centers, libraries, museums, and government and private offices in the form of correspondence, magazines, newspapers, pamphlets, books, etc. Converting these documents into electronic format is a must in order to (i) preserve historical documents, (ii) save storage space, and (iii) enhance retrieval of relevant information via the Internet.
This enables existing information technologies to be harnessed for local information needs and development. African languages that use a modified version of the Latin or Arabic script can be integrated fairly easily into existing Latin and Arabic OCR technologies. It is worth mentioning some of the related efforts, at home and in the diaspora, to preserve African languages digitally (predominantly languages that use Latin scripts): corpora projects in Swahili (East Africa), open source software for languages such as Zulu, Sepedi and Afrikaans (South Africa), an indigenous web browser in Luganda (Uganda), improvised keyboard layouts for some African languages, the growing presence of African languages on the Internet, etc. [11]. Indigenous African scripts, therefore, deserve more emphasis, and this is the motivation behind the present report. To the best of our knowledge, this is the first work that reports a robust Amharic character recognizer for converting printed document images of varying fonts, sizes, styles and quality.

Amharic is written in the unique and ancient Ethiopic script (inherited from Geez, a Semitic language), now effectively a syllabary requiring over 300 glyph shapes. As shown in Fig. 1, the Amharic script has 33 core characters, each of which occurs in seven orders: one basic form and six non-basic forms consisting of a consonant and a following vowel. The vowel forms are derived from the basic characters by modifications such as attaching strokes at the middle or end of the base character, or elongating one of its legs. Other symbols represent labialization, numerals and punctuation marks. Together these bring the total number of characters in the script to 310, as summarized in Table 1.

Fig. 1: Amharic alphabet (FIDEL) with its seven orders arranged row-wise. The 2nd column shows the basic characters; the others are the vowel forms, each derived from the basic symbol.

    No.  Type of Amharic characters   Number of symbols
    1    Core characters              231
    2    Labialized characters         51
    3    Numerals                      20
    4    Punctuation marks              8
         Total                        310

Table 1: Summary of the number of characters in the Amharic script (FIDEL).

2.1 Challenges in Building an OCR for African Scripts

Character recognition from document images printed in African scripts, including Amharic documents, is a challenging task. To develop a successful character recognition system for these scripts, we need to address the following issues.

A. Degradation of documents

Document images from printed sources such as books, magazines and newspapers are often extremely poor in quality. Common artifacts in printed document images include:
(a) excessive dusty noise,
(b) large ink blobs joining disjoint characters or components,
(c) vertical cuts due to folding of the paper,
(d) cuts in arbitrary directions due to paper quality or foreign material,
(e) degradation of the printed text due to poor quality of paper and ink, and
(f) ink floating over from facing pages.
We need to carefully design an appropriate representational scheme and classification method that accommodate the effects of degradation.

B. Printing variations

Printed documents vary in fonts, sizes and styles, which makes building a character recognition system challenging. For example, fonts commonly used in Amharic printed documents include `PowerGeez', `VisualGeez', `Alphas', `Agafari', etc. Each of these fonts offers several stylistic variants (normal, bold, italic, etc.) and is set in different point sizes (10, 12, 14, etc.). These fonts, styles and sizes produce texts that vary greatly in appearance (size, shape, quality, etc.) within printed documents. We need to standardize the variation in size by applying normalization techniques, and to extract features that are invariant to printing variations.

C. Large number of characters in the script

The Amharic script contains more than three hundred characters. Such a large character set is a great challenge for the development of an Amharic character recognizer: memory and computational requirements become very intensive. We need a mechanism to compress the dimensionality of the character representation so as to obtain computationally efficient recognizers. This is the main issue on which much character recognition research, even for Latin and non-Latin scripts, fails.

D. Visual similarity of many characters in the script

A number of characters in the Amharic script are so similar that even humans find it difficult to distinguish them (examples are presented in Fig. 2). Robust discriminant features need to be extracted to classify each character into its proper category or class.

Fig. 2: Samples of visually similar characters in the Amharic writing system.

E. Language-related issues

African indigenous languages pose many additional challenges, among them: (i) lack of a standard representation for fonts and encodings, (ii) lack of support from operating systems, browsers and keyboards, and (iii) lack of language processing routines. These issues add to the complexity of designing and implementing an optical character recognition system.
III. RECOGNITION OF AMHARIC SCRIPT

We built our Amharic OCR on top of an earlier work for Indian languages [12] and reported a preliminary result of this work in [13]. Fig. 3 shows the general framework of the Amharic OCR design.

Fig. 3: Overview of the Amharic OCR design: raw scanned images are preprocessed, segmented into components, passed through feature extraction to produce feature vectors, classified by the recognizer model, and finally post-processed to yield the corrected document text.

The system accepts scanned document images as input. The resolution, in dots per inch (dpi), produced by the scanner determines how closely the reproduced image corresponds to the original: the higher the resolution, the more information is recorded and, consequently, the larger the file size, and vice versa. The resolution most suitable for character recognition ranges from 200 dpi to 600 dpi, depending on document quality, and most systems recommend scanning at 300 dpi for optimum OCR accuracy. We also use this resolution to digitize documents for our experiments.

Once images are captured, they are preprocessed before individual components are extracted. Preprocessing is necessary for efficient recovery of the textual information from the scanned image. The preprocessing algorithms applied to a scanned image depend on the paper quality, the resolution of the scanned image, the amount of skew in the image, the format and layout of the images and text, and so on [2], [5]. In this work we apply binarization, noise removal, skew correction and normalization.

Scanned documents often contain noise arising mainly from the printer, the scanner, the paper quality and the age of the document. We use binarization to convert the grayscale image (with 256 possible shades of gray from black to white) into binary colors (just black and white pixels) to ease image processing; binarization itself removes some of the noise in the images. However, because of the level of degradation in digitized images of magazines, books and newspapers, noise filters must also be applied to reduce the effect of degradation during recognition. We use Gaussian filtering, which smoothes an image by computing weighted averages within a filter window [2].

During the scanning operation, a certain amount of image skew is unavoidable, so skew detection and correction is an important preprocessing step. In our work, projection profiles are used for skew correction in the range of ±20%. Further detail is available in [12].
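To make these preprocessing steps concrete, the sketch below shows one plausible NumPy/SciPy implementation of Gaussian smoothing, binarization and projection-profile skew estimation. It is a minimal sketch, not the authors' code: the fixed threshold of 128, the smoothing sigma, and the ±10° search grid at 0.5° steps are illustrative assumptions, since the paper does not report its exact parameter choices or thresholding rule.

```python
# Hypothetical preprocessing sketch (not the paper's exact pipeline):
# Gaussian smoothing, global-threshold binarization, and
# projection-profile skew estimation as outlined in Section III.
import numpy as np
from scipy.ndimage import gaussian_filter, rotate

def denoise(gray, sigma=1.0):
    """Gaussian filtering: smooth by weighted averages in a local window."""
    return gaussian_filter(gray.astype(np.float32), sigma=sigma)

def binarize(gray, threshold=128):
    """Map a grayscale image (0-255) to binary: 1 = ink, 0 = background.
    The fixed threshold is an assumption; any thresholding rule works here."""
    return (gray < threshold).astype(np.uint8)

def estimate_skew(binary, max_angle=10.0, step=0.5):
    """Try candidate rotations; the correct one gives the 'peakiest'
    horizontal projection profile (text lines aligned with image rows)."""
    best_angle, best_score = 0.0, -1.0
    for angle in np.arange(-max_angle, max_angle + step, step):
        rotated = rotate(binary, angle, reshape=False, order=0)
        profile = rotated.sum(axis=1)   # ink pixels per row
        score = profile.var()           # sharp line peaks -> high variance
        if score > best_score:
            best_angle, best_score = angle, score
    return best_angle

# Usage: smooth, binarize, then deskew a scanned page.
# page = ...  # 2-D uint8 grayscale array scanned at 300 dpi
# binary = binarize(denoise(page))
# deskewed = rotate(binary, estimate_skew(binary), reshape=False, order=0)
```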
Once pages are preprocessed, they are segmented into individual components. Segmentation is an operation that seeks to decompose an image into sub-images of individual symbols [14]. The page segmentation algorithm follows a top-down approach, first identifying text blocks in the document pages and then performing line, word and character segmentation.

In our implementation, lines, words and characters are segmented as follows. Lines are detected within text blocks using the horizontal projection profile: a text line lies between two consecutive boundary rows where the height of the projection profile is lowest. Identified text lines are then segmented into their constituent words using the vertical projection profile; the valleys of this profile mark the boundaries between words in the line. Next, character boundaries are identified, again by vertical projection: each word is scanned column by column, and if no black pixel is encountered down to the base line, that scan marks a demarcation line between characters.

Once character boundaries have been detected, it is useful to extract the character components. We use connected component analysis to identify the connected segments of a character. The final result is a set of character components with their bounding boxes, scaled to a standard size of 20×20. This dataset is used as input for feature extraction and for training and testing the recognition engine.
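The following sketch illustrates the projection-profile segmentation and 20×20 normalization just described, under stated assumptions: the split criterion (runs of all-zero profile bins) is a simplification of the base-line rule in the text, characters are split within a word by the same vertical-projection pass, and scipy.ndimage is assumed for connected component analysis.

```python
# Sketch of projection-profile segmentation and 20x20 normalization
# (Section III); the zero-run splitting criterion is an illustrative
# simplification of the paper's base-line rule.
import numpy as np
from scipy.ndimage import label, find_objects, zoom

def split_on_gaps(profile):
    """Return (start, end) pairs of runs where the profile is non-zero,
    i.e. segments separated by empty valleys."""
    segments, start = [], None
    for i, on in enumerate(profile > 0):
        if on and start is None:
            start = i
        elif not on and start is not None:
            segments.append((start, i)); start = None
    if start is not None:
        segments.append((start, len(profile)))
    return segments

def segment_page(binary):
    """Lines via horizontal projection, words via vertical projection.
    Characters within a word are found by the same vertical-profile pass."""
    words = []
    for top, bottom in split_on_gaps(binary.sum(axis=1)):    # text lines
        line = binary[top:bottom]
        for left, right in split_on_gaps(line.sum(axis=0)):  # words
            words.append(line[:, left:right])
    return words

def components_20x20(word):
    """Connected components of a word image, each scaled to 20x20."""
    labeled, n = label(word)
    out = []
    for i, sl in enumerate(find_objects(labeled)):
        comp = (labeled[sl] == i + 1).astype(np.float32)     # one component
        zy, zx = 20.0 / comp.shape[0], 20.0 / comp.shape[1]
        out.append(zoom(comp, (zy, zx), order=0))            # 20x20 output
    return out
```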
IV. FEATURE EXTRACTION

Feature extraction is the problem of identifying, in the raw data, the relevant information that characterizes each component image distinctly. Many feature extraction methods are available, and the choice among them is probably the single most important factor in achieving high recognition performance in character recognition systems. A survey of feature extraction schemes for character recognition is available in [15]. Different feature extraction methods are designed for different representations of the characters: some use profiles, structural descriptors or transform-domain representations [2], [16]. Alternatively, one can consider the entire image as the feature. The former methods are highly language specific and become very complex when they must represent all the characters in the script; the latter scheme gives excellent results for printed character recognition.

We therefore extract features from the entire image by concatenating all its rows into a single contiguous vector. This feature vector consists of zeros and ones representing background and foreground pixels, respectively. With such a representation, however, memory and computational requirements are very intensive for a language like Amharic with its large number of characters, so we need to transform the features into a lower-dimensional representation. Various dimensionality reduction methods are employed in the pattern recognition literature [17]; in the present work we consider Principal Component Analysis (PCA) and Linear Discriminant Analysis (LDA).

4.1 Principal Component Analysis

Principal component analysis (PCA) can identify new features, in a lower-dimensional subspace, that are most useful for representation [17], without losing valuable information. Principal components can give superior performance for font-independent OCRs, easy adaptation across languages, and scope for extension to handwritten documents.

Consider the ith image sample represented as an M-dimensional column vector x_i, where M depends on the image size. From a large training set x_1, ..., x_N we compute the covariance matrix as

    \Sigma = \frac{1}{N} \sum_{i=1}^{N} (x_i - \mu)(x_i - \mu)^T    (1)

We then identify the minimal dimension K such that

    \frac{\sum_{i=1}^{K} \lambda_i}{\mathrm{Trace}(\Sigma)} \geq \alpha    (2)

where \lambda_i is the ith largest eigenvalue of the covariance matrix \Sigma and \alpha is a limiting value in percent.

The eigenvectors corresponding to the K largest eigenvalues are the directions of greatest variance: the kth eigenvector is the direction of greatest variation perpendicular to the first through (k-1)st eigenvectors. These eigenvectors are arranged as the rows of a matrix A, and this transformation matrix is used to compute the new feature vector by the projection Y_i = A x_i. With this we get an optimal lower-dimensional representation of the component images with a reduced feature size.
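A minimal NumPy sketch of Eqs. (1) and (2): estimate the covariance of the training vectors, keep the smallest K whose leading eigenvalues capture a fraction \alpha of the total variance, and project. The value \alpha = 0.95 and the mean-centering of inputs before projection are assumptions for illustration; the paper does not state the threshold it used.

```python
# Minimal PCA sketch for Eqs. (1)-(2); alpha = 0.95 is an assumed value.
import numpy as np

def pca_fit(X, alpha=0.95):
    """X: N x M matrix of binary feature vectors (one image per row).
    Returns the mean and the K x M projection matrix A."""
    mu = X.mean(axis=0)
    Xc = X - mu
    cov = (Xc.T @ Xc) / X.shape[0]               # Eq. (1)
    evals, evecs = np.linalg.eigh(cov)           # ascending eigenvalues
    evals, evecs = evals[::-1], evecs[:, ::-1]   # sort descending
    ratio = np.cumsum(evals) / np.trace(cov)
    K = int(np.searchsorted(ratio, alpha)) + 1   # minimal K meeting Eq. (2)
    A = evecs[:, :K].T                           # eigenvectors as rows of A
    return mu, A

def pca_project(x, mu, A):
    """New feature vector Y = A x (computed here on the centered input)."""
    return A @ (x - mu)

# Usage with the 20x20 components, so M = 400:
# mu, A = pca_fit(np.stack([c.ravel() for c in components]))
# y = pca_project(components[0].ravel(), mu, A)
```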
However, PCA yields projection directions that maximize the total scatter across all classes. In choosing the projection that maximizes total scatter, PCA retains not only the between-class scatter that is useful for classification, but also the within-class scatter that is unwanted for classification purposes. Much of the variation seen among document images is due to printing variations and degradations; if PCA is applied to such images, the transformation matrix will contain principal components that retain these variations in the projected feature space. Consequently, the points in the projected space will not be well separated, and the classes may be smeared together. Thus, while the PCA projections are optimal for representation in a low-dimensional space, they may not be optimal from a discrimination standpoint. Hence, in our work we further apply linear discriminant analysis to extract useful features from the PCA-reduced space.

4.2 Linear Discriminant Analysis

Linear discriminant analysis (LDA) selects features based entirely on their discriminatory potential. Let the projection be y = Dx, where x is the input image and D is the transformation matrix. The objective of LDA is to find a projection that maximizes the ratio of the between-class scatter to the within-class scatter [15]. We apply the algorithm proposed by Foley and Sammon [18], which extracts a set of optimal discriminant features for a two-class problem and therefore suits the pair-wise Support Vector Machine (SVM) classifier we use for classification.

Consider the ith image sample represented as an M-dimensional column vector x_i, where M is now the reduced dimension obtained with PCA. From the given training components x_1, ..., x_N we compute the within-class scatter matrix W_i and the between-class difference \Delta as

    W_i = \sum_{j=1}^{N_i} (x_{ij} - \mu_i)(x_{ij} - \mu_i)^t    (3)

    \Delta = \mu_1 - \mu_2    (4)

where x_{ij} is the jth sample of the ith class. The within-class scatter matrices are combined as

    A = c W_1 + (1 - c) W_2,  0 \leq c \leq 1    (5)

and the scatter space is computed using

    s_{ij} = d_i^t A^{-1} d_j    (6)

The L best discriminant feature vectors are computed as follows:

    d_1 = \alpha_1 A^{-1} \Delta
    S_1^{-1} = 1 / s_{11}
    for n = 2 to K:
        d_n = \alpha_n A^{-1} \left\{ \Delta - [d_1 \cdots d_{n-1}] \, S_{n-1}^{-1} \, [1/\alpha_1\; 0 \cdots 0]' \right\}
        \omega_n = (d_n^t \Delta)^2 / (d_n^t A d_n)
        S_n^{-1} = \frac{1}{c_n} \begin{bmatrix} c_n S_{n-1}^{-1} + S_{n-1}^{-1} y_n y_n^t S_{n-1}^{-1} & -S_{n-1}^{-1} y_n \\ -y_n^t S_{n-1}^{-1} & 1 \end{bmatrix}

Fig. 4: Algorithm for computing the L best discriminant feature vectors. The scatter matrix S_n^{-1} is partitioned into a 2×2 block matrix.

The algorithm in Fig. 4 is called recursively to extract an optimal set of discriminant vectors d_n corresponding to the L highest discriminant values, with \omega_1 \geq \omega_2 \geq \cdots \geq \omega_L \geq 0 [18]. Here K is the number of iterations for computing the discriminant vectors and values, and \alpha_n is chosen such that

    d_n^t d_n = 1    (7)

The quantities y_n and c_n are determined as

    y_n = [s_{1n} \cdots s_{(n-1)n}]^t,  c_n = s_{nn} - y_n^t S_{n-1}^{-1} y_n    (8)
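As a concrete illustration of the first step of the Foley-Sammon recursion, the sketch below computes the leading discriminant direction d_1 and its discriminant value \omega_1 for one class pair, following Eqs. (3)-(7). The ridge term added before inversion is an assumption to keep A numerically invertible; it is not part of the published algorithm.

```python
# Sketch of the first Foley-Sammon discriminant vector (Eqs. 3-7)
# for one pair of classes; 'eps' is an assumed regularization term
# for numerical stability, not part of the paper's algorithm.
import numpy as np

def foley_sammon_d1(X1, X2, c=0.5, eps=1e-6):
    """X1, X2: N_i x M arrays of PCA-reduced vectors for classes 1 and 2.
    Returns the unit discriminant vector d1 and its value omega1."""
    mu1, mu2 = X1.mean(axis=0), X2.mean(axis=0)
    W1 = (X1 - mu1).T @ (X1 - mu1)               # Eq. (3), class 1
    W2 = (X2 - mu2).T @ (X2 - mu2)               # Eq. (3), class 2
    delta = mu1 - mu2                            # Eq. (4)
    A = c * W1 + (1 - c) * W2                    # Eq. (5)
    A += eps * np.eye(A.shape[0])                # assumed ridge term
    d1 = np.linalg.solve(A, delta)               # d1 proportional to A^{-1} Delta
    d1 /= np.linalg.norm(d1)                     # Eq. (7): d^t d = 1
    omega1 = (d1 @ delta) ** 2 / (d1 @ A @ d1)   # discriminant value
    return d1, omega1
```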
Since the principal purpose of classification is discrimination, the discriminant vector subspace offers considerable potential as a feature extraction transformation, and using linear discriminant features for each pair of classes is an attractive way to design better classifiers. The problem with this approach, however, is that classification becomes slower, because LDA extracts features for every class pair and therefore requires storage for N(N-1)/2 transformation matrices. For instance, for a dataset with 200 classes, LDA generates around twenty thousand pair-wise transformation matrices, which is very expensive for large-class problems. Hence, we propose a two-stage feature extraction scheme, PCA followed by LDA, for optimal discriminant feature extraction: we first reduce the feature dimension using PCA and then run LDA in the reduced space to extract the most discriminant feature vectors. This greatly reduces the storage and computational cost while enhancing the performance of the SVM-based decision directed acyclic graph (DDAG) classifier.

V. CLASSIFICATION

Training and testing are the two basic phases of any pattern classification problem. During the training phase, the classifier learns the association between samples and their labels from labeled samples; the testing phase analyzes the errors made in classifying unlabeled samples in order to evaluate the classifier's performance. In general, a classifier with minimal test error is desirable.

We use the Support Vector Machine (SVM) for the classification task [19]. SVMs are pair-wise discriminating classifiers that identify the decision boundary with maximal margin, and a maximal margin results in better generalization [20], a highly desirable property for performing well on novel data. Support vector machines are comparatively simple and perform better (lower actual error) with limited training data.

Identifying the optimal separating hyperplane involves maximizing an appropriate objective function. The result of the training phase is a set of labeled support vectors x_i and a set of coefficients \alpha_i. Support vectors are the samples near the decision boundary, with class labels y_i \in \{+1, -1\}. The decision is made from

    f(x) = \mathrm{sgn}\left( \sum_{i=1}^{l} \alpha_i y_i K(x_i, x) \right)    (9)

where K is the kernel function, defined by

    K(x, y) = \phi(x) \cdot \phi(y)    (10)

and \phi : R^d \to H maps the data points from the lower-dimensional input space to a higher-dimensional space H.

Binary classifiers like the SVM are designed for two-class problems, but because any script contains many characters, optical character recognition is inherently a multi-class problem. The field of binary classification is mature and provides a variety of approaches to multi-class classification [21]. Most existing multi-class algorithms address the problem by first dividing it into smaller pair-of-classes problems and then combining the results of the binary classifiers using a suitable voting method, such as majority or weighted majority voting [21], [22].

5.1 Multi-class SVMs

Multi-class SVMs are usually implemented as combinations of two-class solutions with majority voting. It has been shown that integrating pairwise classifiers in a decision directed acyclic graph (DDAG) performs better than other popular techniques such as decision trees, Bayesian classifiers, etc. [19]. We construct a DDAG in which each node corresponds to a two-class classifier for a pair of classes; for an N-class problem, N(N-1)/2 binary classifiers are built. The multi-class classifier built using a DDAG for an eight-class classification problem is shown in Fig. 5.

Fig. 5: A rooted binary decision directed acyclic graph (DDAG) multi-class classifier for an eight-class classification problem.

The input vector is presented at the root node of the DDAG and moves through the graph until it reaches a leaf node, where the output (class label) is obtained. At each node, a decision is made concerning which class the input does not belong to. For example, at the root node of Fig. 5, a decision is made between class 1 and class 8: if the input does not belong to class 1, it moves to the left child; otherwise, if it does not belong to class 8, it moves to the right child. This process is repeated at each of the remaining nodes until a leaf node is reached and the input is classified as one of the eight classes.
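To make the DDAG evaluation rule concrete, here is a small sketch of the class-elimination traversal: a list of candidate classes shrinks by one per node until a single label remains, so an N-class decision needs only N-1 pair-wise evaluations. The predict(a, b, x) callback is a placeholder for any trained pair-wise SVM returning the winning class of the pair; it is not an interface from the paper.

```python
# Sketch of DDAG evaluation (Section 5.1). 'predict' is a placeholder
# for a trained pair-wise SVM: predict(a, b, x) returns a or b.
from typing import Callable, List

def ddag_classify(classes: List[int],
                  predict: Callable[[int, int, object], int],
                  x: object) -> int:
    """Walk the DDAG: each pair-wise test eliminates one candidate
    class, so N-1 evaluations suffice for an N-class decision."""
    first, last = 0, len(classes) - 1
    while first < last:
        winner = predict(classes[first], classes[last], x)
        if winner == classes[first]:
            last -= 1     # input is not the last class; drop it
        else:
            first += 1    # input is not the first class; drop it
    return classes[first]

# Usage: for the eight-class DDAG of Fig. 5, the root tests 1 vs 8.
# label = ddag_classify(list(range(1, 9)), pairwise_svm_predict, x)
```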
