
18] G. Salton. Automatic Text Processing. Addison-Wesley, USA, 1989. 19] G. Salton and C ... PDF

13 Pages·2007·0.19 MB·English


[18] G. Salton. Automatic Text Processing. Addison-Wesley, USA, 1989.
[19] G. Salton and C. Buckley. On the automatic generation of content links in hypertext. Technical Report 89-993, Cornell University, April 1989.
[20] F. Sarre and U. Güntzer. Automatic transformation of linear text into hypertext. In International Symposium on Database Systems for Advanced Applications, pages 498-506, Tokyo, Japan, 1991.
[21] J. A. Thom, A. J. Kent, and R. Sacks-Davis. TQL: Tutorial and user manual. Technical Report 18, Key Centre for Knowledge Based Systems, Departments of Computer Science, RMIT and the University of Melbourne, 1990.
[22] J. A. Thom, A. J. Kent, and R. Sacks-Davis. TQL: a nested relational query language. Australian Computer Journal, 23(2):53-65, May 1991.
[23] J. Zobel, R. Wilkinson, E. Mackie, J. Thom, R. Sacks-Davis, A. Kent, and M. Fuller. An architecture for hyperbase systems. In 1st Australian Multi-Media Communications and Applications and Technology Workshop, Sydney, Australia, July 1-2 1991. Also available as Technical Report 42, Key Centre for Knowledge Based Systems, RMIT and the University of Melbourne.
[3] B. Campbell and J. M. Goodman. HAM: a general purpose hypertext abstract machine. Communications of the ACM, 31(7):856-861, 1988.
[4] K. Chattrasophon. Enhanced access mechanisms for document retrieval system. Master's thesis, Department of Computer Science, Royal Melbourne Institute of Technology, Melbourne, 1987.
[5] J. Conklin. Hypertext: An introduction and survey. IEEE Computer, 20(9):17-41, September 1987.
[6] Software Exoterica Corporation. XGML Translator 1.0, 1990.
[7] C. J. Date. An Introduction to Database Systems, 4th edition. Addison-Wesley Publishers, USA, 1986.
[8] M. E. Frisse. Searching for information in a hypertext medical handbook. In Hypertext '87 Papers, pages 57-66, Chapel Hill, North Carolina, 1987.
[9] M. Fuller, A. Kent, R. Sacks-Davis, J. A. Thom, R. Wilkinson, and J. Zobel. Querying in a large hyperbase. In Second International Conference on Database and Expert Systems Applications, Berlin, Germany, August 21-23 1991.
[10] G. W. Furnas. Generalized fisheye views. In CHI'86 Proceedings, pages 16-23, April 1986.
[11] P. K. Garg. Abstraction mechanisms in hypertext. Communications of the ACM, 31(7):862-870, July 1988.
[12] U. Hahn and U. Reimer. Automatic generation of hypertext knowledge bases. In Proceedings ACM Conference on Office Information Systems, pages 182-188, Palo Alto, California, 1988.
[13] E. Mackie and J. Zobel. Retrieval of tree-structured data from disc. In Proceedings of the Third Australian Database Conference, Melbourne, Australia, 1992.
[14] W. Merkl, S. Vieweg, and A. Karapetjan. KELP: A hypertext oriented user-interface for an intelligent legal fulltext information retrieval system. In International Conference on Database and Expert System Applications (DEXA 90), pages 399-404, Vienna, Austria, 1990.
[15] C. D. Paice. Constructing literature abstracts by computer: Techniques and prospects. Information Processing and Management, 26(1):171-186, 1990.
[16] D. R. Raymond. lector: an interactive formatter for tagged text. Technical Report OED-90-02, Centre for the New Oxford Dictionary and Text Research, University of Waterloo, 1990.
[17] B. N. Rossiter, T. J. Sillitoe, and M. A. Heather. Database support for very large hypertexts. Electronic Publishing, 3(3):141-154, August 1990.

The second database was a technical paper written in SGML. This document had a deep hierarchical structure. This database consisted of only 3805 nodes and 292 links, but demonstrated the ability to store complex structures, and to display them in various ways, including the implementation of a fish-eye view of the table of contents.

The next database will provide an example of a large hyperbase with a fairly complex structure. Again, such a hyperbase would be prohibitively difficult to construct manually.
It consists of a large portion of the Commonwealth of Australia's Acts of Parliament, and has a fairly complex structure. It is estimated that this database will consist of 1 million nodes and 10 million links.

The cost of installing a new database is worth mentioning. If the text is not already in SGML format, a grammar must be devised to mark up the text in the appropriate way. Next, a grammar needs to be written to allow node creation based on the text structure. The rest of the system is generic, dependent entirely upon the SGML markup, so that enormous savings accrue in the creation of large hypertext databases.

8 Conclusions

We have described a system that provides an automatic method for generating richly structured document hyperbases. These hyperbases support both browsing and ranked natural language querying. This has occurred through the successful integration of SGML, hypertext browsing capabilities, sophisticated information retrieval querying techniques, and a nested relational database system. Key benefits are that this system requires minimal manual assistance, and is generically applicable.

The use of SGML is a central factor in this. It offers a suitable representation for document structure that can be used in the conversion of source text into hypertext. By preserving the SGML markup within nodes, re-assembly of document fragments for presentation is easy. The nested relational model proved to be a suitable database engine for such a system since it provides both the capabilities and speed required.

Three immediate areas of challenge are hypertext node generation, high-level querying, and query ranking. Node-node and node-document relationships within source text need to be recognised and converted to explicit links. It remains to incorporate link information into querying, so that use can be made of inter- and intra-document relationships and database and document structure.
Also, devising ranking algorithms that take into account the structure and fragmentation of documents would be of immediate benefit.

References

[1] ISO 8613. Information Processing - Text and Office systems - Office Document Architecture (ODA) and Interchange Format, 1989.
[2] ISO 8879. Information Processing - Text and Office systems - Standard Generalized Markup Language (SGML), 1986.

A user can move from one node to another either by selecting a new point in the table of contents, by following a link, or by issuing a query. In the last case, the controlling unit calls a text query processor which creates a window into which the query can be entered. The query processor then identifies and ranks relevant documents. The controlling unit can then present that list of documents, along with some summary information about each potential destination, via another lector process. In the other cases, having determined the appropriate destination, the relevant nodes are retrieved from the Titan+ database, and reformed into tagged text. This can then be passed to a lector process for display. (See Figure 2.)

7.2 Sample Hyperbases

Two databases have been created using the tools described earlier. These demonstrate the ability to import document structure into a large hyperbase. One database consists of 611 documents, albeit with little structure, and the other is a single technical paper with a complex structure.

Figure 3: The User View

The first of the two is sufficiently large that it would have been prohibitively difficult to generate manually. It consisted of Victorian Government press releases, where each release had a title and a body. As well as structural links, a set of links between references to significant entities in Victorian politics was created. This resulted in a database with 8,657 nodes and 20,598 links. Figure 3 shows what the user sees when browsing this hyperbase.
Figure 2: Hyperbase Implementation (diagram labels: menu window, table of contents window, text window, query window; three lector processes and a query processor; node manager; Titan+ hypertext database)

that displays text in windows and allows interactive behaviour by reporting whenever and where in a window a mouse button click occurs. Lector has been used as a text previewer, database browser, code pretty printer, menu utility and iconic interface. We use it to display nodes, to display a table of contents, to present menus, and to allow links and buttons to be activated.

The particular advantage of lector, which is derived from its SGML-based nature, is that the same data may be presented in substantially different ways. It uses simple specification files to determine how it is to interpret the SGML tags it finds, and can switch between specification files as instructed. One specification file might instruct the lector process to present the text in two columns, with references in bold, section headings omitted, and links highlighted in reverse video. Another might cause only titles to be displayed, creating a table of contents. A third might instruct lector to output raw text, tags and all. This gives us significant scope in providing an appropriate user interface to the hypertext database.

The components of each layer are then related in the following way. A controlling unit creates several lector processes to display nodes, menus, and a table of contents.

that area. This type of aid can be extremely beneficial to users browsing a hyperbase, and is appropriate for displaying tables of contents.

Movement within the database can be done by following links within a document, selecting different points within the table of contents, or by querying. Querying involves the creation of a query interface into which a user inputs his or her query.
After the appropriate nodes are identified, they are collated into a query node containing the query text, summary information about the documents satisfying the query, and links to the potential destinations.

7 A Prototype Hyperbase

7.1 System Overview

Before text enters our hyperbase, it passes through a number of processes, as previously described. Raw, un-tagged text enters the system via XGML [6], an SGML parser/validator. By applying some simple rules, unmarked ASCII text can be converted to SGML text. For example, a particularly simple rule might be that a newline followed by whitespace represents a new paragraph.

The next phase involves passing SGML tagged text through XGML. In this pass, XGML is used to split up text into nodes and links that can be directly inserted into a hyperbase. XGML also supports grammar-based treatment of the SGML parse tree. The actions that are taken are dependent on the grammar described in the document DTD. Current context and the status of other sections of the parse tree can be taken into account, allowing very complex rules to be used. A simple example of a rule would be to create a node for every paragraph, or to insert the SGML generic identifiers as node types.

Prior to insertion into the hyperbase, term frequency information is collected and further link generation occurs. At present, this additional link generation is of the simplistic nature described in Section 3. Finally, the data can be inserted into the hyperbase.

The hyperbase architecture on which our prototype is based is the three-layer model described by Zobel et al. [23]. For the database layer, we use Titan+, as discussed earlier. Of the layers present in the model, it is the middle, or node layer, that is most under-developed in our prototype.
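The simple conversion rule mentioned in Section 7.1 (a newline followed by whitespace represents a new paragraph) can be illustrated with a short Python sketch. It is not the XGML mechanism: the function name tag_paragraphs and the <p> element are assumptions chosen for the example, and in practice the document DTD would dictate the element names.

```python
import re
from xml.sax.saxutils import escape


def tag_paragraphs(raw: str) -> str:
    """Wrap each paragraph of unmarked text in <p>...</p> (hypothetical tag).

    Rule taken from the text: a newline followed by whitespace starts a
    new paragraph. A newline followed directly by text is an ordinary
    line wrap inside the current paragraph.
    """
    # Split wherever a newline is immediately followed by whitespace
    # (an indented line or a blank line).
    chunks = re.split(r"\n\s+", raw.strip())
    paragraphs = []
    for chunk in chunks:
        # Collapse the remaining line wraps into single spaces.
        text = " ".join(chunk.split())
        if text:
            # Escape characters that would otherwise look like markup.
            paragraphs.append(f"<p>{escape(text)}</p>")
    return "\n".join(paragraphs)


if __name__ == "__main__":
    sample = "First line\ncontinues here.\n  Second paragraph\nwraps.\n\nThird."
    print(tag_paragraphs(sample))
```

A rule this coarse would only be a first pass; the grammar-based pass through XGML described above is what splits the tagged text into nodes and links.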
At present, its tasks are to coordinate activity between the database layer and the display layer, determine appropriate database queries, reconstruct fragmented text, and monitor browse and query window requests. It currently lacks both the node query language and the formal node organisation description present in the model. It is expected that the DTDs of documents present in the hyperbase would be used to assist in the creation of a node organisation description for that hyperbase.

For the top, or interface, layer, we are currently using an SGML-based display tool, lector [16]. Lector is a display utility that has a simple model of text and is designed to display text that has been marked up with SGML tags. It is an X.11 application

• the number of documents in which a given term appears (the database frequency), and
• the number of times a term appears within each document (the term frequency)

be known [18, 4]. The term frequencies are determined at the time of creation of the database, and are stored with the document in a nested table of (term, frequency) pairs. The presence of this information separate from the body of the data facilitates indexing. The database frequency information is stored in a separate table in the database, and is derived from the within-document frequencies after the database is created.

This implementation of ranking is not problem-free. Ranking performance drops when documents are fragmented into relatively small sections. Two major options are apparent. One is that outlined by Frisse [8]. His suggestion was that node ranks could be calculated recursively. The ranked value of a node was determined from its own similarity to the query, in combination with the combined weight of each of its children's similarities divided by the number of children. An alternative to this is to create an abstract description summarising document topic at the time of node generation.
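One possible reading of the recursive scheme attributed to Frisse [8] above is sketched below in Python. The Node class, its field names, and the choice to average the children's recursively computed ranks (rather than their raw similarities) are assumptions made for illustration; the similarity values are taken as given and would in practice come from the cosine measure.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    # The node's own similarity to the query (assumed precomputed).
    similarity: float
    # Child nodes in the document tree.
    children: list["Node"] = field(default_factory=list)


def rank(node: Node) -> float:
    """Rank = own similarity + (sum of children's ranks) / (number of children)."""
    if not node.children:
        # A leaf is ranked by its own similarity alone.
        return node.similarity
    child_total = sum(rank(child) for child in node.children)
    return node.similarity + child_total / len(node.children)


if __name__ == "__main__":
    section = Node(0.1, [Node(0.2), Node(0.6)])
    document = Node(0.0, [section])
    print(rank(section), rank(document))
```

Under this reading, a section whose own text matches poorly can still rank well when its fragments match, which is the situation the fragmentation problem describes.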
A bonus of carrying out this 'abstracting' at the same time as the input text is parsed is that the full document structure is available. This structure may well provide useful information for any abstracting operation [15].

The availability of a document abstract could also be useful in other ways. As previously discussed, such an abstract might be used during the link generation phase to help determine appropriate inter-document and inter-node links. It could also be informative to a user browsing or querying the database, who could use it, say, to help determine which of the nodes that satisfy a query are actually of interest. Alternatively, it might be useful if there is a request for a summary of a node, or document, or group of nodes or documents.

6 Hypertext Display

Because SGML DTDs do not actually say anything about how to interpret or present the text that they define, it is possible to take the tagged text and display it in whatever way is most appropriate given the constraints of window size, user preference, and data content and structure. For example, a large document could be presented both as full text in a scrollable window, and as a table of contents in another window.

We can dynamically embed in the text tags and data representing available hypertext links, which can then be displayed appropriately. The level of information available and the way in which it is presented can be varied, perhaps at a user's request, merely by treating those tags in a different way.

In the second case (the table of contents), the SGML would be interpreted to display only the various section titles, perhaps indented according to nesting. In fact, it would be possible to choose to display only certain instances of section headings, within a particular sub-tree, and to vary which instances and what depth. This allows us to easily present a fish-eye view [10] of the document.
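As a rough illustration of varying which headings are shown and to what depth, the Python sketch below expands a table of contents only along the path to a focus section. It is a deliberate simplification, neither the degree-of-interest function of Furnas [10] nor the way lector is configured; the Section class and the fisheye function are hypothetical names introduced for the example.

```python
from dataclasses import dataclass, field


@dataclass
class Section:
    title: str
    children: list["Section"] = field(default_factory=list)


def contains(section: Section, focus: Section) -> bool:
    """True if focus is this section or lies somewhere beneath it."""
    return section is focus or any(contains(c, focus) for c in section.children)


def fisheye(root: Section, focus: Section) -> list[str]:
    """Return indented headings: full detail near focus, coarse elsewhere."""
    lines: list[str] = []

    def visit(section: Section, depth: int) -> None:
        lines.append("  " * depth + section.title)
        # Expand children only for sections on the path to the focus;
        # every other subtree is shown as a single heading.
        if contains(section, focus):
            for child in section.children:
                visit(child, depth + 1)

    visit(root, 0)
    return lines


if __name__ == "__main__":
    schema = Section("4.1 Schema")
    toc = Section("Paper", [
        Section("3 Links", [Section("3.1 Daisy chaining")]),
        Section("4 Storage", [schema, Section("4.2 Link tables")]),
    ])
    print("\n".join(fisheye(toc, schema)))
```

Moving the focus changes only which subtrees are expanded, so the same node data can serve both the text window and the table of contents window.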
Such a view provides fine detail within a small area, but only coarse structure further away from

Typically, a user viewing a section of the hyperbase will see a link that he or she wishes to traverse. The user then activates that link, presumably by selecting it with a mouse. Let us assume that the node id of the destination of the link is '1234', and that the node is the start of a document section. This requires the reconstruction of an entire subtree. A simple approach, a recursive traversal of the subtree, would result in many database queries for what is a common operation. A more efficient approach is to retrieve the whole subtree in one operation. In our example, this is achieved by a single TQL query that retrieves all nodes that have an ancestor link to node 1234. The tree can then be recreated by inspecting each retrieved node's parent id and sibling order. From the tree, the SGML tagged text can be re-created by making an inorder traversal. Any embedded links associated with the retrieved nodes can be displayed by inserting appropriate tags and data into the text, which can then be presented via the user interface. For further discussion of tree retrieval, see Mackie and Zobel [13].

5.2 Querying

The second method available for accessing the database is the issue of sophisticated high-level queries. Queries can involve specification or constraints on content, structure, and links. The first of these is found in normal information retrieval systems. Examples of these types include:

content: Find nodes containing the words ('computer' or 'system') and 'parallel'
structure: Find documents whose titles contain 'database systems' and that have diagrams
links: Find documents that cite this document and that are about conceptual clustering

In order to support these complex queries, a natural language understanding engine could be used to identify the references to document structure and link type.
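The subtree reconstruction described under browsing above (one query returning every node beneath node 1234, followed by re-assembly from parent id and sibling order) can be sketched in Python as follows. The sketch starts from rows that have already been retrieved and does not show the TQL query itself. The dictionary keys follow the fields of Figure 1 with underscores assumed; the function name rebuild_sgml, the inclusion of the root row in the result set, and the generation of tags from node_type are assumptions for illustration, since the prototype preserves the SGML markup within the nodes.

```python
from collections import defaultdict


def rebuild_sgml(rows: list[dict], root_id: int) -> str:
    """Re-create tagged text for the subtree rooted at root_id.

    Each row is a dict with the keys node_id, node_type, parent_id,
    sib_order and data, assumed to come from a single database query.
    """
    by_id = {row["node_id"]: row for row in rows}

    # Group rows under their parent, then order siblings by sib_order.
    children = defaultdict(list)
    for row in rows:
        children[row["parent_id"]].append(row)
    for siblings in children.values():
        siblings.sort(key=lambda row: row["sib_order"])

    def emit(row: dict) -> str:
        # Depth-first traversal in sibling order: a node's own data
        # comes first, followed by the text of each child.
        tag = row["node_type"]
        inner = row["data"] + "".join(
            emit(child) for child in children.get(row["node_id"], [])
        )
        return f"<{tag}>{inner}</{tag}>"

    return emit(by_id[root_id])


if __name__ == "__main__":
    rows = [
        {"node_id": 1234, "node_type": "section", "parent_id": 1, "sib_order": 2, "data": ""},
        {"node_id": 1236, "node_type": "para", "parent_id": 1234, "sib_order": 2, "data": "Second."},
        {"node_id": 1235, "node_type": "title", "parent_id": 1234, "sib_order": 1, "data": "Browsing"},
    ]
    print(rebuild_sgml(rows, 1234))
```

The rows may arrive in any order, because the tree is rebuilt entirely in memory after the single retrieval.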
Alternatively, a form-based or graphical query interface might be used.

There are problems involved in querying aggregate structures such as 'documents' and querying based on relationship. A document may be broken into fragments that together satisfy a query but not separately. As well, efficient query formulation strategies need to be devised to handle multiple-stage queries.

The task of handling these queries would be eased by the presence of a node query language that better reflects the typical high-level queries that are likely to be encountered. A node organisation description that provides a formal view of the document and hyperbase structures would also be helpful, both as a guide to the formulation of user queries, and in the task of mapping those queries into low-level database queries. A description of a node query language can be found in Fuller et al. [9], and the need for such a node organisation description in a formal hyperbase architecture is discussed by Zobel et al. [23].

In order to provide ranked querying, certain information is required. The cosine measure and its accompanying term-weighting formulae require that

    HyperText[
        node_id INTEGER NOT NULL,
        node_type TEXT,
        parent_id INTEGER,
        sib_order INTEGER,
        links[
            dest_id INTEGER,
            link_type TEXT,
            ...
        ],
        data TEXT,
        ...
    ]

Figure 1: Basic Schema

makes it possible to retrieve an entire subtree of the parse tree with one query, and greatly simplifies the task of displaying documents.

The type of each link is noted in the field link_type and can be useful in many hypertext operations. This includes the placement of constraints on hypertext queries, the provision of additional information as to how a link should be displayed, and helping to determine any action to be taken when a link is traversed.

Two alternatives existed for management of links.
These were to store all link information in a separate link table (which would be the only choice with a relational database), or to store all link information in the nodes in which the links are anchored. For efficient database queries, especially those relating to movement around a document, we chose to implement links as a nested table in the node table.

5 Browsing and Querying

When the hyperbase is first accessed, the nodes at the top of the database are displayed. From that point, there are two ways of exploring the hyperbase. One of these is by making high-level queries, and the other is by browsing.

5.1 Browsing

Browsing can be done by selecting a table of contents entry and moving to the subtree rooted at the individual node having that title, by scrolling through the current document, or by following a link. In each case the effect is to select the nodes from the database that allow the creation of a partial document tree. A portion of the document is reconstructed and passed to the display manager. At the same time, the table of contents is updated in just the same way, possibly using the same data (discussed in Section 6).

nodes together. This linking is via circles of nodes, or 'daisy chaining.' A set of words in the vocabulary of the database is selected as being interesting. This can be done using term weighting functions based on term frequency information [18].

4 Hypertext Storage

Most hypertext research has involved purpose-built hypertext software without an underlying theory, and has focused on the mechanics of hypertext and principles of human-computer interaction [5]. Moreover, most hypertext systems have not been designed to cater for large collections of data.
Although some work has been directed towards a theory of hypertext (for example, the abstract algebra of hypertext operations described by Garg [11], and the abstract hypertext engine described by Campbell and Goodman [3]), hypertext has not generally been formally associated with data storage and retrieval techniques. In contrast, database systems are very much concerned with storage and retrieval of data. They provide techniques for organising data, managing data, retrieving data, and such facilities as transaction and concurrency management. The data is generally organised with regard to a formal model. The existence of this model permits formal definition of query and update languages [7]. The importance of incorporating such database technology in hypertext systems is discussed by Rossiter et al. [17].

There are several requirements that a database system must satisfy in order to be usable as the database component of a sophisticated hyperbase system. Firstly, and most importantly, the database system must have support for retrieval of text based on its content. Secondly, the actual database system either should have embedded ranking techniques, or should be able to easily provide the information needed. Thirdly, because of the nature of text, the database system should be able to store variable-sized entities. These questions are addressed by Zobel et al. [23].

We have chosen to use a nested relational database system, Titan+ [21], that has been designed to support text-based applications. The nested relational model directly supports the hierarchical data structures required to represent documents, and a number of text operators and access methods have been included in Titan+ to support retrieval from very large text databases. The model allows links and nodes to be stored together in the one tuple. A query and update language, TQL [22], provides a command language interface to Titan+.
TQL is an extension of the standard relational language SQL, designed to support nested relations.

We need to store the nodes, their types, and the location of each node with respect to other nodes. Figure 1 presents a TQL schema containing the information felt to be necessary to maintain a hyperbase. The node type may reflect its position in the tree structure: whether it is the top of a document, or an important document sub-tree, such as a section or chapter, or a paragraph (which could be a leaf node). Alternatively, the type may indicate that it is a user's annotation node, or a node formed as the result of a query. The location of the node within the tree is given by parent_id and sib_order. The nested table of links contains links to all related nodes and all ancestors of the node. This
