Effective Databases for Text & Document Management

Shirley A. Becker
Northern Arizona University, USA

IRM Press
Publisher of innovative scholarly and professional information technology titles in the cyberage
Hershey • London • Melbourne • Singapore • Beijing

Acquisitions Editor: Mehdi Khosrow-Pour
Senior Managing Editor: Jan Travers
Managing Editor: Amanda Appicello
Development Editor: Michele Rossi
Copy Editor: Maria Boyer
Typesetter: Jennifer Wetzel
Cover Design: Kory Gongloff
Printed at: Integrated Book Technology

Published in the United States of America by
IRM Press (an imprint of Idea Group Inc.)
1331 E. Chocolate Avenue, Suite 200
Hershey PA 17033-1117
Tel: 717-533-8845
Fax: 717-533-8661
E-mail: [email protected]
Web site: http://www.irm-press.com

and in the United Kingdom by
IRM Press (an imprint of Idea Group Inc.)
3 Henrietta Street
Covent Garden
London WC2E 8LU
Tel: 44 20 7240 0856
Fax: 44 20 7379 3313
Web site: http://www.eurospan.co.uk

Copyright © 2003 by IRM Press. All rights reserved. No part of this book may be reproduced in any form or by any means, electronic or mechanical, including photocopying, without written permission from the publisher.

Library of Congress Cataloging-in-Publication Data

Becker, Shirley A., 1956-
Effective databases for text & document management / Shirley A. Becker.
p. cm.
Includes bibliographical references and index.
ISBN 1-931777-47-0 (softcover) -- ISBN 1-931777-63-2 (e-book)
1. Business--Databases. 2. Database management. I. Title: Effective databases for text and document management. II. Title.
HD30.2.B44 2003
005.74--dc21 2002156233

British Cataloguing in Publication Data
A Cataloguing in Publication record for this book is available from the British Library.
Table of Contents

Preface
    Shirley A. Becker, Northern Arizona University, USA

Section I: Information Extraction and Retrieval in Web-Based Systems

Chapter I. System of Information Retrieval in XML Documents
    Saliha Smadhi, Université de Pau, France

Chapter II. Information Extraction from Free-Text Business Documents
    Witold Abramowicz, The Poznan University of Economics, Poland
    Jakub Piskorski, German Research Center for Artificial Intelligence in Saarbruecken, Germany

Chapter III. Interactive Indexing of Documents with a Multilingual Thesaurus
    Ulrich Schiel, Universidade Federal de Campina Grande, Brazil
    Ianna M.S.F. de Sousa, Universidade Federal de Campina Grande, Brazil

Chapter IV. Managing Document Taxonomies in Relational Databases
    Ido Millet, Penn State Erie, USA

Chapter V. Building Signature-Trees on Path Signatures in Document Databases
    Yangjun Chen, University of Winnipeg, Canada
    Gerald Huck, IPSI Institute, Germany

Chapter VI. Keyword-Based Queries Over Web Databases
    Altigran S. da Silva, Universidade Federal do Amazonas, Brazil
    Pável Calado, Universidade Federal de Minas Gerais, Brazil
    Rodrigo C. Vieira, Universidade Federal de Minas Gerais, Brazil
    Alberto H.F. Laender, Universidade Federal de Minas Gerais, Brazil
    Berthier A. Ribeiro-Neto, Universidade Federal de Minas Gerais, Brazil

Chapter VII. Unifying Access to Heterogeneous Document Databases Through Contextual Metadata
    Virpi Lyytikäinen, University of Jyväskylä, Finland
    Pasi Tiitinen, University of Jyväskylä, Finland
    Airi Salminen, University of Jyväskylä, Finland

Section II: Data Management and Web Technologies

Chapter VIII. Database Management Issues in the Web Environment
    J.F. Aldana Montes, Universidad de Málaga, Spain
    A.C. Gómez Lora, Universidad de Málaga, Spain
    N. Moreno Vergara, Universidad de Málaga, Spain
    M.M. Roldán García, Universidad de Málaga, Spain

Chapter IX. Applying JAVA-Triggers for X-Link Management in the Industrial Framework
    Abraham Alvarez, Laboratoire d’Ingénierie des Systèmes d’Information, INSA de Lyon, France
    Y. Amghar, Laboratoire d’Ingénierie des Systèmes d’Information, INSA de Lyon, France

Section III: Advances in Database and Supporting Technologies

Chapter X. Metrics for Data Warehouse Quality
    Manuel Serrano, University of Castilla-La Mancha, Spain
    Coral Calero, University of Castilla-La Mancha, Spain
    Mario Piattini, University of Castilla-La Mancha, Spain

Chapter XI. Novel Indexing Method of Relations Between Salient Objects
    R. Chbeir, Laboratoire Electronique Informatique et Image, Université de Bourgogne, France
    Y. Amghar, Laboratoire d’Ingénierie des Systèmes d’Information, INSA de Lyon, France
    A. Flory, Laboratoire d’Ingénierie des Systèmes d’Information, INSA de Lyon, France

Chapter XII. A Taxonomy for Object-Relational Queries
    David Taniar, Monash University, Australia
    Johanna Wenny Rahayu, La Trobe University, Australia
    Prakash Gaurav Srivastava, La Trobe University, Australia

Chapter XIII. Re-Engineering and Automation of Business Processes: Criteria for Selecting Supporting Tools
    Aphrodite Tsalgatidou, University of Athens, Greece
    Mara Nikolaidou, University of Athens, Greece

Chapter XIV. Active Rules and Active Databases: Concepts and Applications
    Juan M. Ale, Universidad de Buenos Aires, Argentina
    Mauricio Minuto Espil, Universidad de Buenos Aires, Argentina

Section IV: Advances in Relational Database Theory, Methods and Practices

Chapter XV. On the Computation of Recursion in Relational Databases
    Yangjun Chen, University of Winnipeg, Canada

Chapter XVI. Understanding Functional Dependency
    Robert A. Schultz, Woodbury University, USA

Chapter XVII. Dealing with Relationship Cardinality Constraints in Relational Database Design
    Dolores Cuadra Fernández, Universidad Carlos III de Madrid, Spain
    Paloma Martínez Fernández, Universidad Carlos III de Madrid, Spain
    Elena Castro Galán, Universidad Carlos III de Madrid, Spain

Chapter XVIII. Repairing and Querying Inconsistent Databases
    Gianluigi Greco, Università della Calabria, Italy
    Sergio Greco, Università della Calabria, Italy
    Ester Zumpano, Università della Calabria, Italy

About the Authors

Index

Preface

The focus of this book is effective databases for text and document management, inclusive of new and enhanced techniques, methods, theories and practices. The research contained in these chapters is of particular significance to researchers and practitioners alike because of the rapid pace at which the Internet and related technologies are changing our world. Already there is a vast amount of data stored in local databases and Web pages (HTML, DHTML, XML and other markup language documents). In order to take advantage of this wealth of knowledge, we need to develop effective ways of extracting, retrieving and managing the data. In addition, advances in both database and Web technologies require innovative ways of dealing with data in terms of syntactic and semantic representation, integrity, consistency, performance and security.

One of the objectives of this book is to disseminate research that is based on existing Web and database technologies for improved information extraction and retrieval capabilities.
Another important objective is the compilation of international efforts in database systems and in text and document management, in order to share the innovations and research advances being made at a global level.

The book is organized into four sections, each of which contains chapters that focus on similar research in the database and Web technology areas. In the section entitled, Information Extraction and Retrieval in Web-Based Systems, Web and database theories, methods and technologies are shown to be efficient at extracting and retrieving information from Web-based documents. In the first chapter, “System of Information Retrieval in XML Documents,” Saliha Smadhi introduces a process for retrieving relevant information from XML documents. Smadhi’s approach supports keyword-based searching, and ranks the retrieved information by its similarity to the user’s query. In “Information Extraction from Free-Text Business Documents,” Witold Abramowicz and Jakub Piskorski investigate the applicability of information extraction techniques to free-text documents typically retrieved from Web-based systems. They also demonstrate the indexing potential of lightweight linguistic text processing techniques in order to process large amounts of textual data.

In the next chapter, “Interactive Indexing of Documents with a Multilingual Thesaurus,” Ulrich Schiel and Ianna M.S.F. de Sousa present a method for semi-automatic indexing of electronic documents and construction of a multilingual thesaurus. This method can be used for query formulation and information retrieval. Then in the next chapter, “Managing Document Taxonomies in Relational Databases,” Ido Millet addresses the challenge of applying relational technologies in managing taxonomies used to classify documents, knowledge and websites into topic hierarchies. Millet explains how denormalization of the data model facilitates data retrieval from these topic hierarchies.
Millet also describes the use of database triggers to solve the data maintenance difficulties that arise once the data model has been denormalized.

Yangjun Chen and Gerald Huck, in “Building Signature-Trees on Path Signatures in Document Databases,” introduce PDOM (persistent DOM) to accommodate documents as permanent object sets. They propose a new indexing technique in combination with signature-trees to accelerate the evaluation of path-oriented queries against document object sets and to expedite scanning of signatures stored in a physical file. In the chapter, “Keyword-Based Queries Over Web Databases,” Altigran S. da Silva, Pável Calado, Rodrigo C. Vieira, Alberto H.F. Laender and Berthier A. Ribeiro-Neto describe the use of keyword-based querying as a suitable alternative to the use of Web interfaces based on multiple forms. They show how to rank the possibly large number of answers returned by a query according to relevance criteria, as is typically done by Web search engines. Virpi Lyytikäinen, Pasi Tiitinen and Airi Salminen, in “Unifying Access to Heterogeneous Document Databases Through Contextual Metadata,” introduce a method for collecting contextual metadata and representing metadata to users via graphical models. The authors demonstrate their proposed solution with a case study in which information is retrieved from distributed European database systems.

In the next section entitled, Data Management and Web Technologies, research efforts in data management and Web technologies are discussed. In the first chapter, “Database Management Issues in the Web Environment,” J.F. Aldana Montes, A.C. Gómez Lora, N. Moreno Vergara and M.M. Roldán García address relevant issues in Web technology, including semi-structured data and XML, data integrity, query optimization issues and data integration issues. In the next chapter, “Applying JAVA-Triggers for X-Link Management in the Industrial Framework,” Abraham Alvarez and Y.
Amghar provide a generic relationship validation mechanism by combining the XLL (X-link and X-pointer) specification for integrity management with Java-triggers as an alert mechanism.

The third section is entitled, Advances in Database and Supporting Technologies. This section encompasses research in relational and object databases, and it also presents ongoing research in related technologies. In this section’s first chapter, “Metrics for Data Warehouse Quality,” Manuel Serrano, Coral Calero and Mario Piattini propose a set of metrics that has been formally and empirically validated for assessing the quality of data warehouses. The overall objective of their research is to provide a practical means of assessing alternative data warehouse designs. R. Chbeir, Y. Amghar and A. Flory identify the importance of new management methods in image retrieval in their chapter, “Novel Indexing Method of Relations Between Salient Objects.” The authors propose a novel method for identifying and indexing several types of relations between salient objects. Spatial relations are used to show how the authors’ method can give relations greater expressive power than traditional methods.

In the next chapter, “A Taxonomy for Object-Relational Queries,” David Taniar, Johanna Wenny Rahayu and Prakash Gaurav Srivastava classify object-relational queries into REF, aggregate and inheritance queries. The authors have done this in order to provide an understanding of the full capability of the object-relational query language in terms of query processing and optimization.
Aphrodite Tsalgatidou and Mara Nikolaidou describe a criteria set for selecting appropriate Business Process Modeling Tools (BPMTs) and Workflow Management Systems (WFMSs) in “Re-Engineering and Automation of Business Processes: Criteria for Selecting Supporting Tools.” This criteria set supports managers and engineers in selecting a toolset that would allow them to successfully manage the business process transformation. In the last chapter of this section, “Active Rules and Active Databases: Concepts and Applications,” Juan M. Ale and Mauricio Minuto Espil analyze concepts related to active rules and active databases. In particular, they focus on database triggers from the point of view of the SQL-1999 standard committee. They also discuss the interaction between active rules and declarative database constraints from both static and dynamic perspectives.

The final section of the book is entitled, Advances in Relational Database Theory, Methods and Practices. This section includes research efforts focused on advancements in relational database theory, methods and practices. In the chapter, “On the Computation of Recursion in Relational Databases,” Yangjun Chen presents an encoding method to support the efficient computation of recursion. A linear time algorithm has also been devised to identify a sequence of reachable trees covering all the edges of a directed acyclic graph. Together, the encoding method and algorithm allow for the computation of recursion. The author proposes that this is especially suitable for a relational database environment. Robert A. Schultz, in the chapter “Understanding Functional Dependency,” examines whether functional dependency in a database system can be considered solely on an extensional basis in terms of patterns of data repetition. He illustrates the mix of both intensional and extensional elements of functional dependency, as found in popular textbook definitions.
In the next chapter, “Dealing with Relationship Cardinality Constraints in Relational Database Design,” Dolores Cuadra Fernández, Paloma Martínez Fernández and Elena Castro Galán propose to clarify the meaning of the features of conceptual data models. They describe the disagreements between the main conceptual models, the confusion in the use of their constructs and the open problems associated with these models. The authors offer solutions that clarify the relationship construct and extend the cardinality constraint concept to ternary relationships. In the final chapter, “Repairing and Querying Inconsistent Databases,” Gianluigi Greco, Sergio Greco and Ester Zumpano discuss the integration of knowledge from multiple data sources and its importance in constructing integrated systems. The authors illustrate techniques for repairing and querying databases that are inconsistent in terms of data integrity constraints.

In summary, this book offers a breadth of knowledge in database and Web technologies, primarily as they relate to the extraction, retrieval and management of text documents. The authors have provided insight into theory, methods, technologies and practices that are sure to be of great value to both researchers and practitioners in terms of effective databases for text and document management.