On the Use of Linguistic Ontologies for Accessing and Indexing Distributed Digital Libraries

Sujata Banerjee[1] and Vibhu O. Mittal[2]

[1] Department of Information Science,

[2] Department of Computer Science, University of Pittsburgh, Pittsburgh, PA 15260

Abstract

This paper advocates the use of large lexical resources to generate synonyms of user-supplied terms in keyword-based searches. The use of one such resource -- WordNet [2,10] -- allows the system to identify the role of terms in a phrase (adjective/noun) and to attempt further matches by either (i) generalizing appropriate terms, or (ii) generating a set of more specific terms that can be used instead. Such an approach can be extended easily to new terms and can also be used in a highly distributed system.

Keywords: Ontologies, WordNet, information retrieval.

Introduction

Digital libraries containing vast amounts of information are expected to be readily available over wide-area distributed networks in the near future. As the amount of available data increases, so will the problems of indexing the resources efficiently and accessing them easily. Several approaches have been proposed previously: these range from simple keyword retrieval (as in DIALOG and MEDLINE) to sophisticated approaches that analyze the internal document structure and attempt a semantic and syntactic analysis to find suitable matches [14]. The primary characteristics of these libraries will be:

* A very large repository of information distributed across a wide area network.

* A very large range of users (ranging from naive to expert).

* A significant rate of change in the data leading to high update rates.

These characteristics require that an efficient indexing scheme be developed that allows easy retrieval across a distributed representation. In this paper, we review some of the previous approaches and then present our own, based on the use of a large and sophisticated linguistic ontology to generate suitable keyword matches. We compare our approach to the previous ones and discuss the advantages and disadvantages of each in the context of large, full-text digital libraries distributed across a wide-area network. We conclude by discussing directions for future work.

Background Work

Conventional automated library catalogues are typically based on a simple keyword pattern-matching strategy, augmented by Boolean retrieval (along with some positional operators such as ADJ, NEAR, and SAME). This approach suffers from at least two problems: (1) not all related information is retrieved, and (2) these searches often retrieve irrelevant material. The underlying assumption is that keywords, which typically occur in the titles and abstracts, are sufficient to represent the topics in the text. However, the success of this approach depends crucially on the searcher's ability to guess the keywords used by the original author. In a recent study of journal articles in the pure and social sciences, it was found that few articles that shared common references (and hence presumably dealt with the same or a related subject) shared common keywords [1]. Another study, in applied biomaterials, reported the difficulty of choosing keywords for a literature search owing to the use of multiple terms for the same concepts [4]. A third study reported that nearly 11% of the searches at a particular library retrieved 50 or more articles, while 10% of the searches retrieved no articles at all [15]. Both figures indicate poorly chosen keywords connected by inappropriate Boolean operators. The authors encountered further examples of inefficient searches with queries executed on several existing on-line library catalogues.

In order to alleviate some of the problems with purely keyword-based systems, a number of approaches have attempted to extend the paradigm by incorporating strategies such as statistical and probabilistic models, fuzzy logic, and several different linguistic techniques, e.g., [3,8,16]. There have also been a number of studies investigating the use of AI techniques, such as rule bases, spreading activation, semantic networks, search expertise and user models, in information retrieval [5,7,9,13]. However, these studies were conducted in limited, specialized domains and did not consider issues of scalability to general, multi-domain searches. Finally, there have been a few attempts at characterizing the meta-structure of the text in order to facilitate focused searching based on the role of the search terms in the text, e.g., [9,14]. While such approaches represent a powerful searching strategy in many cases, their strength lies in the ability to distinguish the different roles that the same keyword can play in the text. These strategies can thus prove helpful in domains such as medicine, where the same drug can appear in different ways (for instance, as a variable in an experiment, as a condition, or as a measure in an observation). But, as Rama and Srinivasan [14] point out, such interfaces are difficult to use even for expert searchers. In addition, any approach based on a full-text analysis would be difficult to integrate into a distributed system, where different sites can store data in different formats with different indices. Other problems with sophisticated full-text approaches are:

* they are typically not real-time, especially as the amount of data grows

* these are not scalable to distributed systems: since each site may have different requirements, the storage and indexing of documents at each site will be different

* approaches that compile their indices in advance cannot handle changes to the indexing mechanism gracefully -- for instance, the text grammar approach [14] would require that the documents be reprocessed if the meta-structure were modified even slightly. Flexibility is an important criterion because of the dynamic nature of the data and the constantly evolving nature of information terminology.

Our Approach: Using a Large Ontology

Our approach to designing an indexing scheme was motivated by the need to make the scheme scalable to very large amounts of data distributed across a very large number of sites. Since very large amounts of data are involved, the scheme had to be fast; since terminology evolves dynamically, it had to be flexible; and since it had to scale to distributed networks with autonomous sites, it had to minimize the amount of information transmitted across the network: it could not, for example, transmit the complete text of hundreds of thousands of documents in response to a search query. Thus, full-text analyses, such as the meta-structure methods of Rama and Srinivasan [14] or Liddy [9], were not considered feasible. Keyword approaches are simple, fast and scalable, and transmitting keywords, in addition to titles, across network connections would not unduly slow down the searches. However, pure keyword matching is brittle to changes in terminology, and keyword extensions, such as thesaurus-style synonym lists, had been investigated previously. These earlier synonym-list approaches did not succeed because of three problems:

* the keyword lists were neither large nor general enough

* they were implemented in very small and specialized domains

* they did not attempt any level of word-role assignment: if a phrase such as 'large aircraft' was used as a search pattern, the lists could not take into account the fact that 'large' is an adjective, so that other possible patterns could be 'big aircraft', 'large planes', and so on (a sketch of such role-sensitive substitution follows this list)
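To make the point concrete, the following is a minimal sketch of role-sensitive substitution, written against the NLTK interface to WordNet purely as an illustration; the helper name substitutes is ours, and the approach itself does not prescribe any particular implementation. It looks up role-appropriate alternatives for each slot of the phrase 'large aircraft'.

    # Minimal sketch: role-sensitive synonym substitution via WordNet (NLTK).
    # Assumes the WordNet data have been installed, e.g. nltk.download('wordnet').
    from nltk.corpus import wordnet as wn

    def substitutes(word, pos):
        """Alternative word forms for `word`, restricted to one part of speech."""
        forms = {l.replace('_', ' ') for s in wn.synsets(word, pos=pos)
                 for l in s.lemma_names()}
        forms.discard(word)
        return sorted(forms)

    adj_alternatives = substitutes('large', wn.ADJ)       # adjective slot, e.g. 'big'
    noun_alternatives = substitutes('aircraft', wn.NOUN)  # noun slot
    print([f"{a} aircraft" for a in adj_alternatives] +
          [f"large {n}" for n in noun_alternatives])

The exact alternatives returned depend on the WordNet release, but treating the two slots differently is precisely what a flat synonym list cannot do.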

Our approach makes use of a very large, well-developed linguistic ontology to generate alternative search patterns, such as the ones discussed above -- an approach that addresses all of the issues raised earlier. The ontology used is WordNet [2,10], a very large lexical database consisting of approximately 100,000 word forms organized into 70,000 sets of synonyms. The database, which is organized according to psycholinguistic principles, divides the lexicon into five categories: nouns, verbs, adjectives, adverbs and function words. Nouns are organized as topical hierarchies, verbs are organized by a variety of entailment relations, and adjectives and adverbs are organized as N-dimensional hyperspaces. Each of these lexical organizations reflects a different way of categorizing experience and the psychological complexity of lexical knowledge.
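The noun hierarchy, in particular, can be traversed directly to obtain more general or more specific terms. The fragment below is again only a sketch against the NLTK interface; the sense 'dog.n.01' is chosen purely for illustration.

    # Sketch: walking WordNet's noun hierarchy (hypernyms = more general,
    # hyponyms = more specific).
    from nltk.corpus import wordnet as wn

    dog = wn.synset('dog.n.01')                    # one sense of the noun 'dog'
    print([s.name() for s in dog.hypernyms()])     # generalizations, e.g. 'canine.n.02'
    print([s.name() for s in dog.hyponyms()][:5])  # a few specializations (breeds, 'puppy', ...)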

Mappings between words and their meanings are many-to-many: some words have several different meanings (polysemy), and some meanings can be expressed by several different words (synonymy). In an information retrieval system, it is essential that the system be able to deal with both these aspects of form and meaning.
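For example, the noun 'house' illustrates both phenomena: it maps to several distinct senses, and each sense groups several word forms. A small sketch, again using the NLTK interface only for illustration:

    # Sketch: polysemy (one form, many senses) and synonymy (one sense, many forms).
    from nltk.corpus import wordnet as wn

    for sense in wn.synsets('house', pos=wn.NOUN):
        forms = [l.replace('_', ' ') for l in sense.lemma_names()]
        print(sense.name(), '-', sense.definition())
        print('   forms:', forms)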

-----------------------------------------------------------------------

For each query, until (user satisfied) do:

* give results of an exact match

* if insufficient matches, prompt user for other options by generating new queries:

      by substituting adjective synonyms
      by substituting noun synonyms
      by dropping adjectives
      by generalizing nouns
      by generating multiple queries from possible disjoint coverings of the nouns and adjectives in the original query

Figure 1: Query algorithm used by the system.

-----------------------------------------------------------------------

In our approach, the system takes the user's query, which consists of a set of keywords or phrases, analyses it locally and (possibly) reformulates it as shown in Figure 1. The system first attempts an exact match (as in traditional systems) and, failing that, generates successive queries by analyzing the terms and their roles; a search for 'house pet', for example, will generate successive queries as follows:

* family pet, housing pet, lodging pet, business pet

* house favorite, house animal, house fauna

* pet (and its synonyms)

* house animal, house life form, house entity

* (house dog, house cat)

At each step, the system presents the reformulated query to the user for approval and possible modification. This allows the user to select among the different senses of a word to be used in the retrieval process. This approach shifts the burden of reasoning onto the site making the request: sites responding to the request need only be able to retrieve exact matches of the words specified in the request. Thus, each site may upgrade or refine its query generation and keyword mapping algorithms locally. This also allows each site to extend its lexical resources appropriately, depending upon the range of its users and their application domains; for instance, sites interested in medical applications could add the UMLS [6,11] hierarchy to this organization and modify it accordingly.
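The following sketch shows how the candidate-generation steps of Figure 1 can be realized over WordNet. It uses the NLTK interface and treats the modifier 'house' as playing the adjective role of Figure 1; all function names are ours and are meant only as an illustration of the scheme, not as a description of a deployed system.

    # Sketch of the reformulation steps in Figure 1 for a modifier-noun query
    # such as 'house pet'.  The user-approval loop and the exact-match step
    # are omitted; WordNet is accessed through NLTK.
    from nltk.corpus import wordnet as wn

    def word_forms(word, pos):
        """Synonymous word forms of `word` for one part of speech."""
        forms = {l.replace('_', ' ') for s in wn.synsets(word, pos=pos)
                 for l in s.lemma_names()}
        forms.discard(word)
        return sorted(forms)

    def related(noun, relation):
        """Word forms of synsets related to `noun` (e.g. hypernyms or hyponyms)."""
        return sorted({l.replace('_', ' ')
                       for s in wn.synsets(noun, pos=wn.NOUN)
                       for r in relation(s)
                       for l in r.lemma_names()})

    def candidate_queries(modifier, head):
        queries = [f"{m} {head}" for m in word_forms(modifier, wn.NOUN)]   # modifier synonyms
        queries += [f"{modifier} {h}" for h in word_forms(head, wn.NOUN)]  # head-noun synonyms
        queries.append(head)                                               # drop the modifier
        queries += [f"{modifier} {g}"                                      # generalize the head noun
                    for g in related(head, lambda s: s.hypernyms())]
        queries += [f"{modifier} {sp}"                                     # cover with more specific nouns
                    for sp in related(head, lambda s: s.hyponyms())]
        return queries

    for q in candidate_queries('house', 'pet'):
        print(q)

The queries actually produced depend on the WordNet release; in practice each batch would be shown to the user for approval, as described above.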

Conclusions and Future Work

Achieving the objective of digital libraries, that of bringing information to the common man, will depend largely on how easy it is to obtain the information. If the onus of finding the exact keywords for each query rests on the user, repeated failures to obtain relevant data will discourage even experienced users, let alone naive ones. Hence an effective resource/knowledge discovery mechanism is key to the success of digital libraries.

The approach we have advocated here represents an attempt to integrate developments in computational linguistics and keyword based database indexing. While similar approaches have been investigated previously, none of these attempts used psycholinguistic organizations of lexical resources that were comparable to WordNet in scope or size. Since the success or failure of this approach depends upon the range of alternatives that the system can generate to use as keywords in finding matches, the quality of the lexical resources used is crucial.

This paper did not investigate the issue of indexing other media, such as audio and video, both of which are expected to form a significant portion of on-line resources. However, we expect that this approach can be used to index sound and video embedded in documents by extending the matcher to operate on their captions. We intend to investigate this aspect in future work.

An important issue in accessing resources will be the indexing of distributed library collections. This will require the construction of local, generalized indices, similar to the construction and maintenance of gigantic filespaces in the Prospero system [12]. Each site in a network will be relatively autonomous and cannot assume that other sites will maintain similarly complicated indices, or semantic or syntactic links to portions of the documents. However, getting each site to maintain and advertise keywords in addition to titles is a reasonable extension, and it allows each site to select and refine its own matching algorithms. Retrieval approaches that depend upon indexing the complete document cannot, in general, make such assumptions.
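To make this division of labour concrete, a responding site needs little more than an exact-match index from its advertised keywords to its titles; everything else happens at the requesting site. The sketch below is our own illustration of such an index (the class name and sample data are invented), not a description of any existing system.

    # Sketch: a responding site only needs exact-match lookup over the
    # keywords it advertises; query reformulation stays at the requesting site.
    from collections import defaultdict

    class SiteIndex:
        def __init__(self):
            self._titles_by_keyword = defaultdict(set)   # keyword -> titles

        def add(self, title, keywords):
            for kw in keywords:
                self._titles_by_keyword[kw.lower()].add(title)

        def exact_match(self, query):
            """Titles whose advertised keywords contain every term of the query."""
            terms = [t.lower() for t in query.split()]
            if not terms:
                return set()
            hits = set(self._titles_by_keyword[terms[0]])
            for t in terms[1:]:
                hits &= self._titles_by_keyword[t]
            return hits

    site = SiteIndex()
    site.add('Caring for Domestic Cats', ['house', 'cat', 'pet'])
    site.add('Aviation Safety Review', ['aircraft', 'safety'])
    print(site.exact_match('house cat'))   # reformulated queries arrive from the requester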

References

[1] Ali, S.N. Subject Relationship between Articles Determined by Cooccurrence of Keywords in Citing and Cited Titles. Journal of Information Science 19, 3 (1993), 225-231.

[2] Beckwith, R., Fellbaum, C., Gross, D., and Miller, G.A. WordNet: A Lexical Database Organized on Psycholinguistic Principles. In Using On-line Resources to Build a Lexicon, U. Zernik, Ed., Lawrence Erlbaum, Hillsdale, NJ (in press).

[3] Bookstein, A. Probability and Fuzzy Set Applications to Information Retrieval. Annual Review of Information Science and Technology 20 (1985), 117-151.

[4] Bush, R.B. A Bibliography of Monographic Works on Biomaterials and Biocompatibility. Journal of Applied Biomaterials 4, 2 (1993), 195-209.

[5] Carbonell, J.G., Evans, D., Scott, D.S., and Thomason, R.H. On the Design of Bio-Medical Knowledge Bases. In MEDINFO (Amsterdam, 1986), R. Salamon, B. Blum, and M. Jorgenson, Eds., Elsevier Science, pp. 285-303.

[6] Cimino, J.J., Hripcsak, G., Johnson, S.B., Fink, C.F.D.F., and Clayton, P.D. UMLS as knowledge base -- a rule-based expert system approach to controlled medical vocabulary management. In Proceedings of the Fourteenth Annual Symposium on Computer Applications in Medical Care (Washington, DC, 1990), pp. 175-179.

[7] Cohen, P., and Kjeldsen, R. Information retrieval by constrained spreading activation in semantic networks. Information Processing & Management 23, 4 (1987), 255-268.

[8] Evens, M., Vandendorpe, J., and Wang, Y.C. Lexical-Semantic Relations in Information Retrieval. In Humans and Machines: Proceedings of the Fourth Delaware Symposium on Language Studies (Norwood, NJ, 1985), Ablex, pp. 73-100.

[9] Liddy, E.D. Structure of information in full text abstracts. In Proceedings of the RIAO Conference (1988), pp. 183-195.

[10] Miller, G.A. WordNet: A Dictionary Browser. In Information in Data: Proceedings of the First Conference of the UW Center for the New Oxford Dictionary (Waterloo, Canada, 1985), University of Waterloo.

[11] Nelson, S.J. The semantic structure of the UMLS Metathesaurus. In Proceedings of the Sixteenth Annual Symposium on Computer Applications in Medical Care (Baltimore, MD, 1992), pp. 649-653.

[12] Neuman, B.C. The Prospero File System: A Global File System Based on the Virtual System Model. Computing Systems 5, 4 (1992), 407-432.

[13] Pollitt, A.S. A rule-based system as an intermediary for searching cancer therapy literature on MEDLINE. In Intelligent Information Systems: Progress and Prospects (Chichester, 1986), R. Davies, Ed., Ellis Horwood, pp. 82-126.

[14] Rama, D.V., and Srinivasan, P. An Investigation of Content Representation using Text Grammars. ACM Transactions on Information Systems 11, 1 (January 1993), 51-75.

[15] Wallace, P.M. How Do Patrons Search the Online Catalog When No One's Looking? Transaction Log Analysis and Implications for Bibliographic Instruction and System Design. RQ 33, 2 (1993), 239-252.

[16] Wong, S.K.M., Ziarko, W., Raghavan, V.V., and Wong, P.C.N. On Extending the Vector Space Model for Boolean Query Processing. In Proceedings of the ACM Conference on Research and Development in Information Retrieval (Pisa, 1986), F. Rabitti, Ed., pp. 175-185.