An Interoperable Architecture for Digital Information Repositories

S. Shen, R. Mukkamala, A. Wadaa, C. Zhang, H. Abdel-Wahab, K. Maly, A. Liu, and M. Yuan

Department of Computer Science, Old Dominion University, Norfolk, VA 23529-0162
email: shen@cs.odu.edu

Abstract

During the past decade we have seen a remarkable expansion of the Internet, which connects millions of users at academic, industrial, and government institutions nationwide and worldwide. This expansion has caused rapid growth in the amount of available information and in the variety of forms in which it is represented. Numerous Internet resource discovery systems have come into existence to help users locate this information. The interoperability of these systems, and their efficiency as components of an integrated digital library system, remain to be improved. We propose a three-layer interoperability architecture that alleviates this problem. It supports a large, distributed user base, autonomy of local systems, ease of integration of heterogeneous systems, and efficient retrieval of information. We have investigated the implementation issues of our architecture and believe that it is feasible and effective.

Keywords: Interoperability, resource discovery systems, digital libraries.

1. Introduction

The retrieval of information by computer has become very popular in recent years. Free access to some Internet resources and the rapid expansion of the Internet, particularly in the last year or two, have made the retrieval of remotely located information a common activity. Free access to systems like the World-Wide Web (WWW) even allows users to incorporate their own relatively easy-to-implement hypermedia systems into WWW, so that they become part of an integrated Internet resource system [2]. Several different Internet resource systems have been linked with each other in reasonable, though not ideal, ways. A practical digital library system[1] is becoming a reality. An example of a subsystem implementation is WATERS, the Wide Area TEchnical Report Service, jointly developed by ODU, UVA, SUNY Buffalo, and VPI [6]. In that system WAIS is combined with WWW, allowing anybody with access to the Internet to search a distributed database of computer science technical reports by subject keywords, authors, university names, and issue dates, among others. The user does not have to worry about where a particular technical report is stored, and retrieval is optimal in the sense that technical reports are stored locally under the control of the authors and sent directly by WWW to the requester once WAIS has identified the location.

However, for a true digital library system, the resource space is huge: for virtually any arbitrary query, the actual solution process and the potential solution space can easily make the task infeasible in terms of the computational and network resources required. In addition, the many available subsystems will require local autonomy, are not compatible with each other, and can hardly work together smoothly. To resolve this problem, the overall system must be able to use some resource partitioning scheme, and for each query it must be able to search the partitioned space intelligently. We propose a three-layer interoperability architecture that supports a large, distributed user base and varying degrees of local resource system autonomy. Users at local resources will continue to have their customary user interfaces; individual heterogeneous resource systems can easily participate in the overall system; and information search over the integrated system can be more efficient.

This paper is organized as follows. In Section 2 we summarize the features of some existing information retrieval systems and discuss the deficiencies of each. We then describe the proposed architecture and its salient features in Section 3. The implementation issues associated with this architecture are explored in Section 4. Finally, Section 5 concludes the paper.

2. Background

To a large extent, the Internet[2] collects and organizes information around administrative units. This is not a surprise. Many organizations that have benefited from the Internet also want to share their knowledge with others. They put the information into repositories and try to make it generally available. Because the administration of the Internet's networks is decentralized, no single organization has a clear idea of which resources are available on the Internet to its end-users or how to guide them to the information of interest. The large volume of resources and the lack of guidance for accessing them pose great challenges for users who want to find, acquire, organize, retrieve, and use the information.

Besides other technical aspects, a central problem for designers of document and information retrieval systems often relates to the contents of the documents rather than their organization. The contents, or semantics, of documents are not well represented by surface features alone, such as individual words taken from abstracts, titles, or even the entire documents. The problem is complicated further when the information is image, audio, video, or other media, for which even the surface features are not generally available.

We see here a major conflict between the user's information needs and the Internet's information organization principles: while the user wishes to locate information based on the knowledge domain(s) it pertains to, the Internet generally organizes information around administrative units. The conflict has had profound effects on the design of Internet information discovery systems and their search mechanisms.

Archie addresses the problem of locating resources by filename on the Internet [4]. Archie servers centralize indexing information on filenames, which is collected periodically from known public Internet file archive sites. Users can query Archie's index to locate files that are available from public Internet file archive sites, also known as anonymous FTP sites. As the Internet grows, however, maintaining a global resource directory becomes infeasible. Furthermore, a filename alone often fails to reflect the underlying information content.

Gopher organizes information into a directed graph in which intermediate nodes are servers, directories, or indexes, and leaf nodes are documents [7]. The structure is basically administration-centered. Although it simplifies the registration and management of servers and documents, this structure leads to a complicated mechanism for information search and retrieval. In order to obtain the desired documents, the user may need to manually investigate a number of information servers and issue the query. In the worst case, the user may need to select a specific network to continue the search, a detail of which the user should not need to be aware.

WAIS uses a centralized directory of services and divides its indexes among the servers [10]. Though it enables keyword search among the servers and the administrative units are not visible to the user at the beginning, the result of the initial query is a list of potential servers rather than a list of relevant documents. The user is then required to select a few administration-centered servers from the list to continue the search. Thus the user is forced early on to give up on completeness ("recall") of the search; precision is also limited to a great extent because most WAIS databases use only very limited amounts of information to rank the list of "hit" documents.

WWW organizes data into a distributed hypertext in which nodes are either full-text objects, directory objects called cover pages, or indexes [2]. Hypertext offers great flexibility in organizing and browsing information. Cover pages, if properly compiled, provide a good overview of or reference to the underlying data. The sheer volume of Internet information, however, makes it especially difficult for end-users to locate information of interest. They often get lost in the information space. The manual hypertext compilation process also places a great burden on hypertext designers, administrators, and publishers, who must track and maintain the links in the dynamically changing Internet environment as well as maintain the documents themselves.

Distributed information retrieval systems are emerging from the research/development phase into the experimental deployment phase [1,8,9,11]. However, the existing systems (Archie, Gopher, WAIS and WWW) are still largely inadequate in connecting the information needs of the end-user to the vast Internet resources, given the diversity of user classes and resource representations, the constraints on autonomous administration of the resources and on the heterogeneity of information systems, applications, and user interfaces. Internet information discovery tools need non-trivial approaches to map the massive administrative-centered resources into a kind of conceptual information space that will closely reflect the user's information needs.

In this paper, we put forth a vision for a digitized information architecture that meets the above requirements. The architecture integrates different information services, resources, applications, user interfaces, and end-users into a common framework. The framework is itself decentralized so that it can adapt to the rapid growth of the available resources.

3. Proposed Architecture

An important long-term goal of the Digital Library is to serve a massively large number of heterogeneous classes of users (user groups), offering them access to a massively large number of distributed autonomous resource repositories in ways that are seamless, timely, and economical. Here, we propose an architecture to achieve these goals. Before describing the architecture, we make the following assumptions about two important entities of the system: users and resources.

A1: Autonomous resource management: The physical resource space would consist of a multitude of autonomous, geographically distributed Published Resource Repositories (PRRs). For each PRR there would be an owning entity that manages the repository autonomously. Hence, for each PRR a management system, called the PRR subsystem, is assumed. Also, for each PRR, a Published Access Scheme (PAS) is defined in the framework of the PRR subsystem and is supported by the owning entity as the scheme offered to potential users for accessing that PRR. For political, performance, and other reasons, it can be foreseen that entities offering to share their resources would continue to exercise autonomy over their PRRs as described above. Owing to this autonomy, several types of heterogeneity could exist in resource representation, storage, and access schemes.

A2: Multiplicity of user interfaces: Different classes of users, running different classes of applications, would be concurrently accessing the underlying PRRs. The user-resource interface is a function of the class of users and the class of applications at hand, as well as the characteristics of the resources being accessed. Since the Digital Library environment is of such a scale that the combinations of users, applications, and resources cannot be accurately identified a priori, it is simply impractical to have a single user interface for accessing the underlying PRRs. Thus there is a need to support multiple customized (user-, application-, and resource-dependent) user interfaces.

Notice that the necessity for multiple user interfaces expressed by the second assumption is orthogonal to the first assumption of PRR autonomy. Supporting uniform integrated access implies that the design of the user interface(s) would not be affected (dictated) by the constraints of the Published Access Schemes of different PRRs. We now present our architecture, which attempts to satisfy the above assumptions.

Our architecture has three distinct layers (see Figure 1): an interoperability layer (IL) managed by an interoperability protocol suite; a resource repository layer (RRL) containing the different participating PRR subsystems; and a user interface layer (UIL). We now describe each of these layers.

User Interface Layer (UIL)

Our top layer (UIL) may contain any number of User Interface Systems (UISs), each customized to the particular specification of a (user, application, resource) situation. A UIS is correct, i.e., supported in the UIL, if and only if it relies on a set (library) of access primitives (APs) defined and supported by the Interoperability Layer (see below). The set of APs forms the interface between the UIL and the interoperability layer, and is designed to enable the UISs to view and access a virtual, integrated, and structured resource space. A given user may use any number of UISs. A user accessing resources by way of UISs is called a Digital Library User (DLU), since that access is managed by the digital library system. We distinguish a DLU from a user accessing a particular repository using the associated PAS, since in that case access activity is managed directly by the corresponding PRR subsystem.

Interoperability Layer (IL)

The IL is the second and middle layer defined by our architecture. It is managed by an interoperability protocol suite. The major role of this protocol suite is to define and support an appropriate set of APs. It effectively performs a two-way mapping between the actual distributed physical resource space, which contains the separate repositories, and the virtual space represented by the APs interface. Additional functions include bookkeeping for repositories joining or leaving the system, tracking the addition and deletion of resources in a repository, administering accounting, and the like.

The interoperability protocol suite integrates a multitude of algorithms and tools for the organization, structure evolution, presentation, and manipulation of both the virtual and actual resource spaces described above. Some examples are hypermedia tools, collaboration tools, and search algorithms.
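The two-way mapping and the join/leave bookkeeping described above can be illustrated with a minimal Python sketch. It is not part of the proposed protocol suite: the class name InteroperabilityLayer, the methods join, leave, search, and locate, and the "dl:" identifier format are all assumptions made for illustration, and each repository is modeled simply as a function from a keyword to local identifiers.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Tuple

# Hypothetical model: a repository is a function from a keyword to the
# local identifiers of its matching resources.
LocalSearch = Callable[[str], List[str]]


@dataclass
class InteroperabilityLayer:
    """Toy IL: tracks member repositories and maps virtual ids to physical ones."""

    repositories: Dict[str, LocalSearch] = field(default_factory=dict)
    # virtual id -> (repository name, local id)
    locations: Dict[str, Tuple[str, str]] = field(default_factory=dict)

    def join(self, name: str, search: LocalSearch) -> None:
        """Record that a repository has joined the digital library."""
        self.repositories[name] = search

    def leave(self, name: str) -> None:
        """Record that a repository has withdrawn and drop its mappings."""
        self.repositories.pop(name, None)
        self.locations = {
            vid: loc for vid, loc in self.locations.items() if loc[0] != name
        }

    def search(self, keyword: str) -> List[str]:
        """Access primitive: query every member and return virtual ids."""
        hits: List[str] = []
        for name, local_search in sorted(self.repositories.items()):
            for local_id in local_search(keyword):
                virtual_id = f"dl:{name}/{local_id}"
                self.locations[virtual_id] = (name, local_id)
                hits.append(virtual_id)
        return hits

    def locate(self, virtual_id: str) -> Tuple[str, str]:
        """Map a virtual id back to its repository and local id."""
        return self.locations[virtual_id]
```

A UIS built on this sketch would see only the virtual identifiers returned by search; the physical location is resolved inside the IL by locate, so the UIS is unaffected when a repository joins or leaves.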

Resource Repository Layer (RRL)

The bottom layer defined by our architecture is the resource repository layer (RRL). This layer simply contains all PRRs and their associated PRR subsystems. Interoperability across heterogeneous PRRs in the RRL is accomplished by a contractual commitment of each PRR to support a set of resource repository primitives (RRPs). Specifically, the protocol suite of the IL defines two sets of primitives at the interface between the IL and the RRL. The function of these primitives is to enable the IL protocol suite to view and uniformly access a distributed physical resource space. The uniformity results from limiting the IL protocol suite to accessing resources in any contracted PRR through a standard set of primitives, as discussed below. In essence, the role of the standard primitives at the IL/RRL interface is to hide the inherent heterogeneities of PASs across different PRRs by offering a uniform resource repository interface that is minimal and extensible.

The remainder of this section elaborates on the properties and usage of the two types of RRPs, defined by the IL protocol suite.

* Standard RRPs. These can be defined as the minimum set of functions that a published resource repository must support in order to join the Digital Library as a resource donor. The objective of the standard RRPs is to secure a minimally sufficient set of primitive functions for supporting a uniform, practical resource repository interface. Some examples of standard RRPs for a particular PRR can be described. One example is TableOfContents(), which returns a list of Digital Library descriptors for the atoms of the resource space (e.g., books, articles, and software library programs) of the PRR at hand. A second example is GranularityStructure(), which returns a table containing the type names of the different types of atoms recognized in the particular resource space, together with their granularity hierarchy. For example, a repository might recognize proceedings, panel report, and technical paper as distinct addressable atoms and define the granularity hierarchy to have {proceedings} as root and both {panel report} and {technical paper} as children. A third example is SearchDimensions(), which returns a list of the type names and type constructors of each dimension (the simplest form being keywords and character strings) recognized by the local subsystem for searching the resource space.

* Non-standard (specialized) RRPs. Owing to a special characteristic of a resource, its corresponding subsystem can offer additional primitives to enhance access to that resource. For example, assume a NASA repository of timestamped states of the earth's atmosphere over a period of time, each state being defined in terms of a set of quantitative measurements. The local subsystem can offer a special primitive called ComputeAtmosphere(InitialAtmosphere, Pressure+/-, Temperature+/-) that applies a proprietary atmosphere behavior model and returns the new measurements (state) of the atmosphere after the pressure and temperature changes passed in the call are applied to the atmosphere defined by InitialAtmosphere. Our IL protocol can automatically offer a user a menu of the special primitives offered by a particular subsystem if the user identifies part of the TableOfContents() of that subsystem as relevant.
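The two kinds of RRPs can be pictured as an abstract interface that every participating repository implements. The following Python sketch is only one possible rendering: the three standard primitive names come from the text, but the return types, the special_primitives() hook, the ProceedingsRepository class, its sample descriptors, and the roots() helper are hypothetical.

```python
from abc import ABC, abstractmethod
from typing import Dict, List


class StandardRRPs(ABC):
    """Minimum primitive set a repository supports in order to join."""

    @abstractmethod
    def table_of_contents(self) -> List[str]:
        """TableOfContents(): descriptors for the atoms of this repository."""

    @abstractmethod
    def granularity_structure(self) -> Dict[str, List[str]]:
        """GranularityStructure(): atom type name -> child atom type names."""

    @abstractmethod
    def search_dimensions(self) -> Dict[str, str]:
        """SearchDimensions(): dimension name -> type name."""

    def special_primitives(self) -> List[str]:
        """Names of non-standard RRPs; empty unless a repository adds some."""
        return []


class ProceedingsRepository(StandardRRPs):
    """Hypothetical repository mirroring the proceedings example in the text."""

    def table_of_contents(self) -> List[str]:
        # Made-up sample descriptors, for illustration only.
        return ["proceedings:sample-1", "technical paper:sample-2"]

    def granularity_structure(self) -> Dict[str, List[str]]:
        return {
            "proceedings": ["panel report", "technical paper"],
            "panel report": [],
            "technical paper": [],
        }

    def search_dimensions(self) -> Dict[str, str]:
        return {"keyword": "string", "author": "string"}


def roots(structure: Dict[str, List[str]]) -> List[str]:
    """Atom types that are not a child of any other type."""
    children = {child for kids in structure.values() for child in kids}
    return sorted(t for t in structure if t not in children)
```

Under these assumptions, a repository offering something like ComputeAtmosphere would override special_primitives() to advertise it, which is the information the IL would need to build the menu of special primitives.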

We now turn to some of the salient characteristics of our architecture.

* Satisfying the user and resource needs: Both of our basic assumptions are satisfied under this architecture. Interoperability essentially enables sharing resources across PRRs, under constraints of full local autonomy of PRR subsystems; thus satisfying our first assumption. Note that our second assumption is also satisfied in this architecture. This is achieved by enforcing a separation between functions attributed to user interface specifications, given a particular user, resource being accessed, and application situation, and functions attributed to accessing the resource from the underlying PRRs until it is passed to the user interface. Accordingly, our architecture allows for multiple customized user interfaces to be built on top of a virtual access layer.

* Encapsulation of Resource Sharing Services: As a result of restricting the UIL/IL interface to the set of APs, the algorithms of the interoperability protocol suite are hidden from the UISs. Therefore, autonomous maintenance to enhance the performance, efficiency, economy, and functionality of service in the IL would not affect any of the functioning UISs. Hence, our architectural framework allows any user interface system to be created in the user interface layer so long as its access to the interoperability layer is limited to the APs interface.

* Incremental Extensibility of Services: Our architectural framework allows the APs interface to be incrementally enhanced and enables different levels of utilization. If APs in general are designed to be independent, then adding new primitives does not affect old ones. A given UIS might be built around a minimal subset of fairly simple APs, whereas advanced UISs can utilize a larger subset of APs that incorporates sophisticated primitives, e.g., Collaborative Access Support and Parallel Searching.
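The extensibility property can be made concrete with a small Python sketch of an APs registry. The names used here (AccessPrimitives, register, supports, call) are illustrative assumptions rather than part of the proposed interface. The sketch treats a UIS as supported exactly when every primitive it relies on is offered, and it forbids redefining an existing primitive, so adding new ones cannot affect old ones.

```python
from typing import Callable, Dict, FrozenSet


class AccessPrimitives:
    """Toy registry standing in for the APs interface between UIL and IL."""

    def __init__(self) -> None:
        self._primitives: Dict[str, Callable[..., object]] = {}

    def register(self, name: str, fn: Callable[..., object]) -> None:
        """Add a new primitive; existing primitives cannot be redefined."""
        if name in self._primitives:
            raise ValueError(f"primitive {name!r} is already defined")
        self._primitives[name] = fn

    def supports(self, required: FrozenSet[str]) -> bool:
        """True if every primitive a UIS relies on is offered."""
        return required.issubset(self._primitives)

    def call(self, name: str, *args: object) -> object:
        """Invoke a primitive by name on behalf of a UIS."""
        return self._primitives[name](*args)
```

In this sketch a simple UIS that requires only a basic search primitive remains supported, unchanged, after a more sophisticated primitive such as parallel searching is registered for advanced UISs.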

The next section describes a number of implementation issues of our architecture.

4. Implementation Issues

In proposing the new digital information architecture, one of our prime considerations has been the ease and flexibility of its implementation. In particular, we are concerned with the heterogeneity of users' hardware, the heterogeneities in the underlying networks (e.g., differences in speeds), and the autonomous nature of individual repositories. These considerations are reflected in the architecture as well as in the (proposed) implementation, as shown below. The implementation issues are discussed in terms of the three layers.

User Interface Layer

This is an important component of the proposed system. Since the digital library system is designed to attract a variety of users, it is almost impossible to think of developing a uniform interface to meet all needs. Instead, each community (e.g., educational institutions, physicians, service organizations) will develop its own interfaces, most suitable to its own members. In fact, we foresee the development and availability of a variety of commercially developed interfaces to meet these needs. Some of the major issues in developing such interfaces are:

* The level of transparency offered to the user in terms of the locality of resources, the cost of access, the boundaries of search domain, etc.

* Accuracy of the retrieved information: Often users want a quick response to a query, even if approximate, rather than wait a long time for the most accurate answer. For example, an economist studying economic trends over the last 100 years does not care whether the data is accurate up to yesterday or up to last week. The interface should offer an option for the user to specify the required accuracy. The system can in turn use this information to choose an appropriate copy.

* Complexity of the interface primitives: One of the key design and implementation decisions at this layer is the nature of the primitives. One approach (e.g., RISC-like) is to offer simple and efficient primitives to the user interface so that customized UISs can be built using them. In this case, the cost of building a UIS will be high.

* Representational issues: Since this layer has to handle all types of data, including textual, pictorial, and audio (data, voice, and video in network terminology), a decision has to be made as to the role of this layer in handling the data. If a uniform access approach is adopted, then it could handle all the data as just one kind: digital. In this case, the user interface will be responsible for recognizing the differences in the data types and handling them accordingly. Some of the issues that pose implementation difficulties are the management of buffers, the management of computation and communication resources, and dealing with any synchronization requirements. For example, if the information retrieved from the underlying system consists of independent audio and video sources that are to be presented to the user in a synchronized manner, additional effort may be necessary at the user interface layer. It is also possible to transfer this responsibility to the UIS.
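The accuracy option discussed in the list above can be sketched as a copy-selection rule: among the copies that are fresh enough for the user, pick the one expected to respond fastest. The Python sketch below rests on assumptions not made in the text, namely that accuracy is expressed as a maximum tolerated age in days and that each copy carries an estimated fetch delay; the Copy type and the choose_copy function are hypothetical names.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Copy:
    site: str
    age_days: float          # how far this copy lags behind the primary
    expected_delay_s: float  # estimated time to fetch from this site


def choose_copy(copies: List[Copy], max_age_days: float) -> Optional[Copy]:
    """Return the quickest copy that is fresh enough, or None if none is."""
    fresh_enough = [c for c in copies if c.age_days <= max_age_days]
    return min(fresh_enough, key=lambda c: c.expected_delay_s, default=None)
```

With this rule, the economist in the example above, who tolerates data that is a week old, could be served from a nearby but slightly stale copy, while a user who needs current data would be directed to the primary copy at the cost of a longer wait.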

Interoperability Layer

This layer interfaces with a variety of underlying resources and offers a uniform interface to the user interface layer. Some of the key implementation issues to be dealt with at this layer are:

* Tools: In order to implement a variety of search/retrieve/combine operations over a large domain and heterogeneous resources, we need to integrate several existing commercial products. We certainly foresee the need for hypermedia tools, collaboration tools, large-space search techniques, etc. Selecting the tools as well as integrating them to achieve the desired task may be quite a challenge.

* Interface routines: The IL protocol has to deal with a slew of resource clusters. Once again, depending on the uniformity (or non-uniformity) offered by the interoperability layer, the routines could become quite complex. Problems of heterogeneity, problems of size (a large set of repositories to deal with), and the complexity of the services offered to the user interface call for innovative design and implementation. In particular, we should be concerned about the end-to-end delay expected by the users. The implementation choices will be dictated largely by the performance requirements of the end-user.

* Dealing with autonomous repositories: Since each repository is autonomous, there is no standard format for the data that it stores or the primitives that it offers to the rest of the system. This creates problems in the implementation of standard RRPs.
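One common way to cope with the lack of a standard data format is to wrap each repository in an adapter that translates its native format into a common descriptor. The Python sketch below shows this under stated assumptions: the two native formats, the adapter classes, and the title/author descriptor are all invented for illustration, and only the idea of a table-of-contents primitive is taken from the architecture.

```python
from typing import Dict, List, Protocol


class Repository(Protocol):
    """Anything that can produce a table of contents in the common format."""

    def table_of_contents(self) -> List[Dict[str, str]]: ...


class DelimitedCatalogAdapter:
    """Wraps a hypothetical repository whose catalog is 'title;author' lines."""

    def __init__(self, lines: List[str]) -> None:
        self._lines = lines

    def table_of_contents(self) -> List[Dict[str, str]]:
        entries = []
        for line in self._lines:
            title, author = line.split(";", 1)
            entries.append({"title": title.strip(), "author": author.strip()})
        return entries


class RecordStoreAdapter:
    """Wraps a hypothetical repository that stores records keyed 'T' and 'A'."""

    def __init__(self, records: List[Dict[str, str]]) -> None:
        self._records = records

    def table_of_contents(self) -> List[Dict[str, str]]:
        return [{"title": r["T"], "author": r["A"]} for r in self._records]


def merged_contents(repos: List[Repository]) -> List[Dict[str, str]]:
    """IL-side view: one uniform listing across heterogeneous repositories."""
    return [entry for repo in repos for entry in repo.table_of_contents()]
```

Each adapter would be written by the repository maintainer, so the repository keeps its native storage format while the IL sees only the common descriptor.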

Resource Repository Layer

This is the layer over which we have the least control, since retaining individual repository autonomy is our objective. However, to facilitate its integration with the rest of the system, we propose a flexible interfacing system. The interface will be developed by the repository maintainer. However, the digital library administrators will provide guidelines as to the minimal services expected from each repository. Whether the semantics (and syntax) of these minimal services are to be standardized and set by the digital library system administrators is not clear at this point. Such standardization, at least of the minimal subset, would ease the development of the interoperability layer. In addition, it would guarantee some services from each repository. The issue of charging a fee for access is at the discretion of individual repositories. We do not intend to look into the issues of privacy and charging by the local sites.

5. Conclusion

We have seen in the past decade a remarkable expansion of the Internet, which connects millions of users at academic, industrial, and government institutions nationwide and worldwide. This expansion has caused rapid growth in the amount of available information and in the variety of forms in which the information is represented. Numerous Internet resource discovery systems have come into existence. However, for a true digital library system, given the huge resource space, the actual solution process and the potential solution space for an arbitrary query can easily make the task infeasible in terms of the computational and network resources required. In addition, the many available subsystems will require local autonomy, are not compatible with each other, and can hardly work together smoothly.

To resolve this problem, the overall system must be able to use some resource partitioning scheme, and for each query it must be able to search the partitioned space intelligently. We have proposed a three-layer interoperability architecture that alleviates this problem. It supports a large, distributed user base, autonomy of local systems, ease of integration of heterogeneous systems, and efficient retrieval of information. The implementation issues of our architecture have been investigated, and we believe that it is both feasible and effective.

References

[1] Andreessen, M. 1993. NCSA Mosaic for the X-Window System. Available via anonymous FTP from ftp.ncsa.uiuc.edu:/Web/xmosaic, Software Development Group, National Center for Supercomputing Applications, University of Illinois at Urbana-Champaign.

[2] Berners-Lee, T., Cailliau, R., Groff, J., and Pollermann, B. 1992. World-Wide Web: The Information Universe. In Electronic Networking: Research, Applications and Policy, 1(2), Meckler Publications, (Spring).

[3] Bowman, C. M., Danzig, P. B., and Schwartz, M. F. 1993. Research Problems for Scalable Internet Resource Discovery. Department of Computer Science Technical Report, University of Colorado at Boulder, (Boulder, CO, March).

[4] Emtage, A., and Deutsch, P. 1992. Archie-An Electronic Directory Service for the Internet. In Proceedings of the 1992 USENIX Technical Conference, (January), 93-110.

[5] Litwin, W., Mark, L., and Roussopoulos, N. 1990. Interoperability of Multiple Autonomous Databases. In ACM Computing Surveys, 22(3), 267-293.

[6] Maly, K., Fox, E., French, J., and Selman, A. 1992. Wide Area Technical Report Service. Department of Computer Science Technical Report No. TR-92-44, Old Dominion University, Norfolk, Virginia.

[7] McCahill, M. 1992. The Internet Gopher: A Distributed Server Information System. In ConneXions - The Interoperability Report, Interop Inc., 6(7), 10-14.

[8] Schwartz, M. F., Emtage, A., Kahle, B., and Neuman, B. C. 1992. A Comparison of Internet Resource Discovery Approaches. In Computing Systems, 5(4), 461-493, University of California Press.

[9] Source Book on Digital Libraries, Fox, E. A. (Ed.), Virginia Polytechnic Institute and State University, Blacksburg, VA, December 1993.

[10] Thinking Machines Corp. 1993. WAIS Source Distribution, version 8-b5. Available via anonymous FTP from think.com:/wais, Thinking Machines Corp., Cambridge, MA.

[11] Weider, C., and Deutsch, P. 1993. A Vision of an Integrated Internet Information Service. Available via anonymous FTP from ietf.cnri.reston.va.us:/internet-drafts/draft-ietf-iiir-vision-0.0.txt.

[1] In this paper the terms digital library and digital information repository are used synonymously.

[2] In this paper, the Internet is used simply as an example framework that facilitates resource sharing. The architecture and the discussions in this paper remain relevant when a much broader framework is considered.