Digital Library Infrastructure for a University Engineering Community

Bruce Schatz[1,3], Ann Bishop[1], William Mischo[2], and Joseph Hardin[3]

[1] Graduate School of Library and Information Science,

[2] University Library

[3] National Center for Supercomputing Applications
University of Illinois at Urbana-Champaign

contact: Bruce Schatz, NCSA, Beckman Institute, 405 N. Mathews Ave, Urbana, IL 61801

emails: bschatz@ncsa.uiuc.edu, bishop@alexia.lis.uiuc.edu, mischo1@vmd.cso.uiuc.edu, hardin@ncsa.uiuc.edu

Abstract

In the world of the near future, the Internet of today will evolve into the Interspace of tomorrow. The international network will evolve from distributed computer nodes supporting file transfer to distributed information sources supporting object interaction. Users will browse the Net by searching digital libraries and navigating relationship links, as well as share new information within the Net by composing and publishing new objects and links. The Net will thus appear as interconnected spaces of information objects, the Interspace.

We propose two concurrent and complementary activities that will accelerate progress towards building the Interspace. These together construct a model large-scale digital library and investigate how it can scale up to the National Information Infrastructure.

* Construction of a digital library testbed for a major university engineering community, in which a large digital collection of interlinked documents and databases will be maintained, software to browse and share within this library developed, and usage patterns of thousands of users spread across the Net evaluated.

* Investigation of fundamental research issues in information systems, information science, computer science, sociology and economics that will address the scalable organization of a large digital collection to provide transparent access for a broad spectrum of users across national networks. Our analysis will center on the testbed experiment and will form the basis for future system design.

Keywords: Digital libraries, National Information Infrastructure, information spaces, network information systems, Interspace

Introduction

The Illinois Digital Library project is constructing a large-scale digital library for engineering documents and databases. This project consists of two inter-related parts. The first is building a testbed of materials obtained from professional and commercial publishers, with software that will be used by an engineering community of thousands of users. The second is performing research in technology and in sociology to understand how to scale the testbed model to the National Information Infrastructure.

The project is a joint effort. The testbed side is run by the University Library (UL) and the National Center for Supercomputing Applications (NCSA), and the research side by the Graduate School of Library and Information Science (GSLIS) and the Department of Computer Science (CS). The heads of these organizations form an executive committee to coordinate and support the project: Robert Wedgeworth (UL), Larry Smarr (NCSA), Leigh Estabrook (GSLIS), Duncan Lawrie (CS). Many other faculty and staff besides the authors participate in the project, e.g. Hsinchun Chen (University of Arizona), Roy Campbell (CS), Leigh Star (Sociology), Larry DeBrock (Economics), Charles Catlett (NCSA), Michael Folk (NCSA), David Stern (UL), Pauline Cochrane (GSLIS).

This paper is the Executive Summary of a proposal submitted to the NSF/ARPA/NASA Digital Library Initiative in February 1994.

Digital Library Testbed

The testbed centers around the new Grainger Engineering Library Information Center at the University of Illinois in Urbana-Champaign (UIUC). This $26M Center is intended as a showcase for state-of-the-art digital libraries and electronic information distribution. The University Library at UIUC is one of the nation's best and largest university libraries. The Engineering College at UIUC is one of the nation's best and largest engineering colleges.

Construction of this national digital library testbed is possible through the active participation of two major institutions at the University of Illinois, the University Library and the National Center for Supercomputing Applications (NCSA). The former has extensive experience with maintaining large digital collections and supporting large user populations within the campus university community. The latter has extensive experience with developing generic computing software and supporting large hardware configurations within the national scientific community. Together, they provide the institutional infrastructure which enables substantial research to be undertaken within the testbed, and each is committed to this project as crucial to its future activities. Each also currently supports a major, heavily used service for accessing network information sources: the Engineering Library on-line system and the NCSA Mosaic software, respectively.

Operation of the testbed will be supervised by the Director of the Grainger Center, co-PI Mischo. He will be advised by the Associate Director for Software Development at NCSA, co-PI Hardin, and by representatives from the faculty of the Graduate School of Library and Information Science specializing in the design and in the analysis of information systems, respectively PI Schatz and co-PI Bishop. Other investigators include faculty from the Departments of Computer Science, Management Information Systems, Sociology, and Economics, in addition to others from the Library and NCSA. An Executive Committee, consisting of the heads of all the major participating organizations, will help ensure institutional support.

The Engineering Library currently processes a million queries a month to its expanded on-line catalog including a digital collection of journal citations. Through a variety of collaborations, the user population for the testbed will expand beyond the Engineering College to the University community as a whole (including the Chicago and regional campuses served by the University Library) to the entire Midwest regional university system (via the CIC network) to the national scientific community (via the NCSA metacomputer center). This provides a national testbed across the Internet of over 100,000 university-level users.

The digital library itself will be centered around a collection of engineering journals and magazines, obtained through collaboration with a range of major professional and commercial publishers. The intention is to attract a broad range of usage from a broad range of users. All documents will be structured and complete, that is, encoded in SGML and containing all pictorial material. The documents will include general engineering magazines (e.g. computer science from IEEE), specific engineering journals (e.g. aeronautical engineering from AIAA), and specific scientific journals (e.g. physics from APS). Finally, articles from commercial engineering publishers (e.g. Wiley & Sons) will be collected for use in our economics trials.

We plan to gather a significant new digital collection of structured documents in the engineering literature and combine this with existing sources available from our front and back end software (see below). For example, these full-text materials will be integrated into an expanded on-line catalog including access to major periodical indexes in science and engineering (Current Contents, Engineering Compendex, INSPEC) which will be linked to the SGML documents. Collections on the Internet will also be made transparently available, e.g. the physics preprints at Los Alamos, the Unified Computer Science Technical Reports at Indiana University, and the international collection of on-line library catalogs.

In addition to the document collections, a number of databases will be gathered into the digital library and cross-linked to the documents where possible. These include significant databases generated by other NSF-funded projects, e.g. the BIMA Grand Challenge database in radio astronomy and the WCS National Collaboratory database in molecular biology. Associated GIS satellite image databases include the NASA-funded data supported by the NCSA HDF project. These projects are local to the University of Illinois and supervised by collaborators on the digital library project.

The testbed software will go through two primary phases within the proposal period of four years. The goal of version 1 is to leverage off our substantial existing resources to build a functional digital library with a large collection used by a substantial user population. Concurrently during this period, the technology research will be developing significant new functionality and the sociology research will be observing the significant usage patterns of the existing functionality. Together, these efforts will enable us to develop and deploy scalable digital library technology on a national testbed. The goal of version 2 is to demonstrate the technical feasibility of a fully functional Interspace system and test its sociological utility on a segment of our user population.

The version 1 software will evolve from two of our existing projects. The first is the existing information retrieval system in the current Engineering Library developed by co-PI Mischo. This is based around a PC front end to a full-text retrieval search from the major commercial vendor BRS. It currently serves the base user population with an on-line catalog connected to a large collection of engineering journal citations. This is in production use with 1 million search queries issued and 3 million items displayed monthly, and is a supported product of the University Library.

The front end to this back end will be the NCSA Mosaic software developed under the supervision of co-PI Hardin. This is one of the most widely used information services currently in the Internet, with a user base of nearly 1 million sites. The NCSA server where the Mosaic Home Page resides is now processing 1 million connections a week. The software provides an easy-to-use interface, on the three major current user platforms, for transparently retrieving documents across the Net. It supports display for the HTML subset of SGML, and for pictorial displays including embedding of images within text. This software is a supported product of NCSA and is being rapidly enhanced.

Together, these two software systems, with enhancements, will provide a search and display capability for full-text documents with pictures. This will be a representative system of the large-scale functionality available today. The Mosaic software will serve as the interface and gateway for two database search engines that will support the structured full-text and image documents and databases. BRS Search, which is widely used in libraries to provide full-text retrieval, will be interfaced via the standard protocol Z39.50. Microsoft Server, which is widely used to provide simple access to non-textual materials such as images, video, and sound, will be interfaced via the standard protocol SQL. The gathered collection of documents (and databases) will be transformed and indexed within this search system, then displayed using the internal and external viewers provided by Mosaic for the user's local platform.
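The gateway role described above can be pictured as a small routing step: a request is sent to the full-text engine or to the non-textual engine depending on the kind of material requested. The following is only an illustrative sketch. The function names, the material kinds, and the stub backends are hypothetical, and no real Z39.50 or SQL client is invoked.

```python
# Hypothetical routing sketch: names and categories are assumptions,
# not part of the proposal or of any real Mosaic, BRS, or SQL API.
TEXT_KINDS = {"article", "citation"}
MEDIA_KINDS = {"image", "video", "sound"}


def search_full_text(query):
    """Stub standing in for the full-text engine (reached via Z39.50)."""
    return f"full-text backend <- {query}"


def search_media(query):
    """Stub standing in for the non-textual engine (reached via SQL)."""
    return f"media backend <- {query}"


def route(kind, query):
    """Send a query to the backend responsible for this kind of material."""
    if kind in TEXT_KINDS:
        return search_full_text(query)
    if kind in MEDIA_KINDS:
        return search_media(query)
    raise ValueError(f"no backend registered for kind {kind!r}")


print(route("article", "boundary layer"))
print(route("image", "wind tunnel"))
```

Keeping the routing decision in one place means a further search engine could be added without changing the interface seen by the user.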

Given the large-scale user population and the significant digital collection, we will be able to evaluate the nature of usage of a digital library. The evaluation effort will cover a broad range of methodologies and usages with the goal of answering a broad range of research questions. This information on effective and non-effective usage patterns will be fed back into the future system design. Methodologies will include ethnographic observation and interviews, controlled experiments and surveys, and system instrumentation and transaction logs. Both individual behavior, via interviews, and group statistics, via surveys, will be observed. Different samples of the broad user population will be used as appropriate for these studies.

The version 2 software will incorporate the research discussed below to demonstrate a large-scale example of the digital library functionality available tomorrow. It will be a new system, designed from scratch for this application, using the experience from the testbed and other projects of the investigators to provide a scalable architecture for digital library infrastructure. We plan to implement this architecture and gradually introduce it into the testbed. The degree to which the new software ends up being adopted is a key question for the technology and the sociology research.

Version 2 will demonstrate the range of functionality possible in the Interspace. It will be built upon information spaces and support archive browsing and community sharing of objects within these spaces. The design will be generalized from the Worm Community System (WCS) developed by PI Schatz, which supports this range of functionality in a small specialized scientific domain. WCS was developed under one of the main project grants of the previous NSF IRIS-CISE information systems program in National Collaboratories, and has been featured as a national model for science information systems in National Academy of Sciences reports and lead news articles in Science magazine. Other major inputs will come from the research projects summarized below and from the IETF (Internet Engineering Task Force) efforts, in which NCSA is participating, to evolve existing architectures.

The primary goal of version 2 is to deepen the level of interaction and of integration. For documents, search will support semantic retrieval with concept matching and display will be comparable to printed journals or magazines. There will be user profiles supporting customized retrieval, where virtual magazines are delivered containing sets of desired articles displayed with good layout. For databases, there will be live manipulation comparable to direct use without the system, plus links to related items and to documents. There will also be communications support for messages and annotations linked to the documents and databases. The system will be symmetric so that any type of object or link which can be retrieved can also be added by users. In this sense, the system becomes a dynamic library which supports a complete publishing cycle for the Net.

Digital Library Research

The complement of the testbed in our project is the research effort. Each research component is strong enough to stand on its own, while producing results relevant to the critical system issues. Projects were selected which could make good use of the library testbed as an experimental vehicle and which had the potential of generating results which could be used in version 2 of the testbed or in subsequent projects which built on the testbed foundations. The goal was to build a group of collaborators, who professionally would span the range of topics necessary for the complete infrastructure and who personally were willing to actively participate in the project as a whole. Research components include: information systems, information science, computer science, sociology, and economics. As opposed to the testbed efforts which are carried out primarily by professional programmers and librarians, the research efforts are carried out primarily by academic faculty and students.

The Information Systems Research centers around designing an architecture for the Interspace, supervised by PI Schatz. This architecture will consist of an information space environment along with protocols for plugging objects into the space. The information space representation is a schema for federating heterogeneous objects distributed across a network via the use of relationship links. The protocols include support for information search and display (object typing), interconnection forging and following (object linking), and publishing control and communications (object distributing). With the protocols, it is possible to add new documents and databases to an information space with full interactive capability, and to communicate with other users via messages and links to any other objects. In this project, the architecture will be implemented and used as the basis for version 2 of the digital library testbed. When combined with the other technology research, it should provide a much deeper level of interaction and of integration than in version 1.
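One way to picture an information space is as a registry of typed objects joined by named relationship links. The following is a minimal sketch under that assumption. The class names, field names, and relation labels are hypothetical and are not taken from the proposed architecture or from WCS.

```python
from dataclasses import dataclass, field


@dataclass
class InfoObject:
    object_id: str
    kind: str    # object typing, e.g. "document", "database", "message"
    source: str  # repository that holds the object (illustrative only)


@dataclass
class InformationSpace:
    objects: dict = field(default_factory=dict)
    links: list = field(default_factory=list)  # (from_id, relation, to_id)

    def publish(self, obj):
        """Add an object to the space (object distributing)."""
        self.objects[obj.object_id] = obj

    def link(self, from_id, relation, to_id):
        """Forge a relationship link between two published objects."""
        if from_id not in self.objects or to_id not in self.objects:
            raise KeyError("both endpoints must be published first")
        self.links.append((from_id, relation, to_id))

    def follow(self, from_id, relation=None):
        """Return the objects reachable from from_id, optionally by relation."""
        return [self.objects[t] for f, r, t in self.links
                if f == from_id and (relation is None or r == relation)]


space = InformationSpace()
space.publish(InfoObject("art-1", "document", "journal-archive"))
space.publish(InfoObject("db-7", "database", "lab-server"))
space.publish(InfoObject("note-3", "message", "user-workstation"))
space.link("art-1", "cites-data", "db-7")
space.link("note-3", "annotates", "art-1")
print([o.object_id for o in space.follow("art-1")])
```

In this sketch a user annotation is just another object with a link, which mirrors the symmetry between retrieving and adding objects described for version 2.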

The Information Science Research centers around semantic retrieval and user customization, supervised by co-PI Chen. The semantic retrieval supports a higher level of abstraction in user search which can help overcome the vocabulary problem for information retrieval. Rather than searching for words within the object space, the search is for terms within a concept space. A concept space is a graph of terms occurring within the objects linked to each other by the frequency with which they occur together. This graph can be used to suggest alternative ("related") terms that a user may wish to search for. Co-occurrence graphs seem to provide good suggestive power in specialized domains, such as biology. The research questions revolve around their effectiveness in the more general domains considered here. Using the same sort of statistical methods, it is possible to infer terms of interest to the users from the objects that have been retrieved. These techniques will be used to provide a form of customized retrieval, where a user profile consisting of terms and demographics specified by the users orients the search matching towards more preferred objects. In this project, the semantic retrieval and user customization will be used to supplement the full-text search in the testbed.
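The concept space described above can be sketched as a co-occurrence graph built from the terms of each object. The example below is a simplified illustration that assumes each document is already reduced to a list of index terms and that raw co-occurrence counts are used as link weights. The proposal does not specify the term extraction or weighting method, and the sample terms are invented.

```python
from collections import Counter, defaultdict
from itertools import combinations


def build_concept_space(documents):
    """Build a graph mapping each term to a Counter of co-occurring terms."""
    graph = defaultdict(Counter)
    for terms in documents:
        # Count each unordered pair of distinct terms once per document.
        for a, b in combinations(sorted(set(terms)), 2):
            graph[a][b] += 1
            graph[b][a] += 1
    return graph


def related_terms(graph, term, limit=3):
    """Suggest the terms that co-occur most often with the given term."""
    neighbours = graph.get(term, {})
    # Highest count first; ties are broken alphabetically.
    ranked = sorted(neighbours.items(), key=lambda kv: (-kv[1], kv[0]))
    return [name for name, _ in ranked[:limit]]


docs = [
    ["turbulence", "airfoil", "drag"],
    ["airfoil", "drag", "lift"],
    ["turbulence", "drag"],
]
concept_space = build_concept_space(docs)
print(related_terms(concept_space, "drag"))  # ['airfoil', 'turbulence', 'lift']
```

A user searching for "drag" would be offered the related terms as alternatives, which is the suggestive use of the graph described above.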

The Computer Science Research centers around operating systems for network information services, supervised by co-PI Campbell, in collaboration with co-PI Catlett from NCSA. This concentrates on the physical performance of the objects rather than on the logical functionality as with the information science (and information system) research. Investigation will be made of the bottlenecks occurring within the information space and how operating system solutions can alleviate them. Measurements will be first made on Mosaic in the Internet, then on the Testbed Facility. Although the latter is also based upon Mosaic, the traffic pattern will likely differ, because the most common interaction will be search rather than navigation. Issues revolve around both the retrieval fetching itself (caching across the network memory hierarchy) and the link following used to navigate to other objects in the Net (name resolution within a large distributed system). In this project, the measurements will be used to guide both short-term solutions for the testbed and long-term solutions with new object-oriented operating systems for supporting information space architectures.
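To make the caching issue concrete, the sketch below places a small least-recently-used (LRU) cache in front of a slow fetch and counts hits and misses for a given request sequence. LRU is only one possible replacement policy and is an assumption here. The class name, the capacity, and the request sequence are hypothetical.

```python
from collections import OrderedDict


class LRUCache:
    """Keep the most recently used objects; fetch the rest on demand."""

    def __init__(self, capacity, fetch):
        self.capacity = capacity
        self.fetch = fetch          # function called on a cache miss
        self.store = OrderedDict()  # ordered from oldest to newest use
        self.hits = 0
        self.misses = 0

    def get(self, name):
        if name in self.store:
            self.store.move_to_end(name)  # mark as most recently used
            self.hits += 1
            return self.store[name]
        self.misses += 1
        value = self.fetch(name)
        self.store[name] = value
        if len(self.store) > self.capacity:
            self.store.popitem(last=False)  # evict the least recently used
        return value


cache = LRUCache(2, lambda name: f"contents of {name}")
for name in ["a", "b", "a", "c", "b", "a"]:
    cache.get(name)
print(cache.hits, cache.misses)  # 1 5
```

Replaying a navigation-heavy trace and a search-heavy trace through such a cache is one simple way to compare how the two traffic patterns stress it.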

The Sociology Research centers around user behavior studies, supervised by co-PI Star, in collaboration with co-PI Bishop. This research is the evaluation component of the testbed discussed above. Part will concentrate on ethnography (Star), seeking descriptions of the conceptual structures needed for users to effectively interact with a digital library, with the goal of influencing the design of future systems. Part will concentrate on user-based methods (Bishop), such as surveys and interviews, seeking group-level statistical information about patterns of usage. These two approaches will be performed on a range of different groups of users, to get a broad and detailed picture of the digital library. In addition, a methodological investigation will be done to attempt nethnography (net-ethnography), where the facilities of the system itself are also used to observe user behavior remotely across the network. A success at this new methodology would enable user studies to be done on much larger national systems in the future.

The Economics Research centers around charging schemes for network access, supervised by co-PI DeBrock. The sociology studies will discover patterns of usage in a large testbed when there are no limits on user access. But in the real world, cost plays a significant role in determining usage. In some sense, the main testbed will be studying a flat-rate fee, which is being absorbed in the experimental phases by our project. But many NII applications will require per-use charges for economic viability. In this project, different fee schemes will be applied to selected portions of the user population, e.g. a small subset of remote users on the commercial materials. This will investigate how actual costs affect actual usage. These economics experiments will give an indication of what people might be willing to pay in a national digital library, just as the sociology experiments will give an indication of what people might be able to do.
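The contrast between flat-rate and per-use charging reduces to simple arithmetic, sketched below. All fee values are invented for illustration and do not come from the proposal.

```python
def monthly_cost(retrievals, flat_fee=0.0, per_use_fee=0.0):
    """Total monthly charge for a user under a given fee scheme."""
    return flat_fee + per_use_fee * retrievals


def break_even_retrievals(flat_fee, per_use_fee):
    """Number of retrievals at which both schemes cost the same."""
    if per_use_fee <= 0:
        raise ValueError("per_use_fee must be positive")
    return flat_fee / per_use_fee


# Hypothetical fees: 20.00 per month flat, or 0.50 per retrieval.
print(monthly_cost(10, per_use_fee=0.5))  # 5.0
print(monthly_cost(10, flat_fee=20.0))    # 20.0
print(break_even_retrievals(20.0, 0.5))   # 40.0
```

Under these assumed fees, a user making fewer than 40 retrievals a month pays less per use, so light users could be expected to respond differently to the two schemes than heavy users.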

Conclusion

Together, the large-scale testbed and the broad-spectrum research will provide a significant demonstration of a model information system for a national digital library, along with an analysis of the requirements and a design of a system that can scale up to the National Information Infrastructure.
