The JANUS Digital Library

Kathleen McKeown[1], David Millman[2], Brian Donnelly[3], James Hoover[3], Robert McClintock[4], Willem Scholten[6], Dimitris Anastassiou[5], Shih-Fu Chang[5], Alan Crosswell[2], Mukesh Dalal[1], Steven Feiner[1], Paul Kantor[7], Judith Klavans[1], Craig Stanfill[8], and Mischa Schwartz[5]

[1] Department of Computer Science,

[2] Academic Information Systems,

[3] Columbia Law School,

[4] Institute for Learning Technologies,

[5] Department of Electrical Engineering, Columbia University

[6] Future InfoSystems, Inc.,

[7] Tantalus, Inc.,

[8] Thinking Machines Corporation

Authors' addresses:

Kathleen McKeown, 450 Computer Science, Columbia University, N.Y., NY, kathy@cs.columbia.edu

David Millman, 603 Watson Labs, 612 West 115 St., N.Y., NY, dsm@columbia.edu

Abstract

The digital library represents a paradigm shift in how we conceptualize libraries. In removing geographic and temporal boundaries, current technology now offers an unprecedented opportunity to bring vast research collections to every constituency in our society, from the patron of the local public library, to the fifth-grade student, to the university scholar. In this paper, we provide an overview of our research towards developing a system, the JANUS Digital Library, which can provide seamless access to massive amounts of information, regardless of physical location, meeting the needs of this wide variety of potential users. Unique features of our work include fully integrated search and representation of multiple media, including text, images and video; the ability to provide automatically generated natural language summaries and graphical abstractions of retrieved documents; and full participatory design, involving early evaluation of the system by users. Our effort will bring together a wide range of information consumers, a research team including engineers, computer scientists, legal scholars and social scientists, and a group of information providers representing legal, commercial and social interests.

Keywords: user interfaces, search and retrieval, multimedia, summarization, representation, intellectual property rights, participatory design

1. Introduction

The digital library of the future must serve the entire spectrum of library users, from elementary and secondary students to university scholars, from the general reading public to the technical specialist, providing seamless access to all fields of knowledge. For example, the fully digitized library should allow a fifth grader who has just watched the movie Raiders of the Lost Ark to search for and retrieve information, appropriate to his or her level of learning, about the Ark of the Covenant, while enabling a biblical scholar to search and retrieve the latest exegesis on this same subject. It must provide users with easy electronic access to the complete range of books, articles, films, sound recordings and other media currently housed in physical library settings. And it must make this information available in a cohesive, comprehensive and comprehensible form.

Our goal is the development of a digital library that meets these criteria; we aim at making the full range of information currently available in today's libraries easily accessible to a wide range of users of different ages and backgrounds. Developing such a system is a highly complex process and requires simultaneous advances in many different domains of technical inquiry including user interfaces, search and retrieval techniques, representation of information, and management of intellectual property. It requires combining very large-scale networks with very large-scale file storage and creating digital collections of sufficient depth and breadth to be of compelling interest to working user groups. Our project brings together a broad coalition of experts to address these domains of technical inquiry while grounding our research agenda in past and ongoing experience with a prototype library at Columbia University, begun in 1990 and given the name Project JANUS. For example, the initiative will draw on the prototype's coordinated use of imaging technology and full text searching to provide access to fragile documents, including in some cases the marks by censors and the marginal notes by authors as well as the text itself. However, it will dramatically extend the prototype's ability to provide coherent access to multiple document types as it scales up the testbed and user pool.

In this paper, we provide an overview of Columbia University's initiative in Digital Libraries, focusing on our plans for providing coherent access to multiple document types, including text, images, video and combinations. In the following sections, we first provide a broad overview of the research we will carry out to meet this goal. We then demonstrate how each of the discrete research areas contributes to our goal of providing coherent access for a broad range of users, providing a scenario of planned system interaction. We conclude with a discussion of the roles of our partners and a summary of our contributions.

2. Research Overview

Our initiative includes research in user interfaces, search and retrieval of text and images, representation of both text and images to facilitate search and delivery, advanced multimedia networking protocols, and research on representation and reasoning for different models of intellectual property rights. Parallel to each of these research efforts, research on evaluation with users will provide early feedback to each component, shaping the research design, and will ultimately quantify successes in each field. An overview of our planned system is shown below.

Figure 1. System Overview.

A critical contribution of our work is the integration of text and images at all levels of the system. The JANUS user interface will integrate text, image, and other media for both query formulation and response, including the ability to automatically generate summaries of the retrieved documents that coordinate natural language and graphics. The testbed will include text, images, and video, as well as documents that integrate media, such as annotated legal documents, scientific documents containing photographs or diagrams, and humanities documents containing artwork. Our aim is to develop integrated search and retrieval of the multimedia testbed, providing integrated indexing of text and images to allow retrieval of either a textual document, an image, or both in response to a single query. In order to support effective search and retrieval as well as summary generation, our system must provide information about the semantics of document segments in both texts and images. For example, the system must represent that a particular document segment is a photograph that is referred to from specific lines in the text, or that two paragraphs are very closely related in meaning based on similarities in the words used.
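The kind of segment-level information described above can be pictured with a small data structure. The following is only a sketch under our own assumptions: the class `Segment`, its field names, and the helper `segments_cited_on` are hypothetical and do not describe the representation the system will actually use.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class Segment:
    """One segment of a document (hypothetical illustrative structure)."""

    segment_id: str
    kind: str  # e.g. "paragraph", "photograph", "diagram"
    # Line numbers in the text that refer to this segment.
    referenced_from: list[int] = field(default_factory=list)
    # Other segment ids mapped to a word-overlap similarity score.
    related_to: dict[str, float] = field(default_factory=dict)


def segments_cited_on(line: int, segments: list[Segment]) -> list[str]:
    """Return ids of the segments that the text refers to on the given line."""
    return [s.segment_id for s in segments if line in s.referenced_from]


if __name__ == "__main__":
    photo = Segment("fig-1", "photograph", referenced_from=[12, 40])
    para = Segment("p-3", "paragraph", related_to={"p-7": 0.82})
    print(segments_cited_on(12, [photo, para]))
```

A record of this shape covers both examples in the text: a photograph linked to the lines that mention it, and a paragraph linked to another paragraph with similar wording.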

Key to our effort is the need to examine the emerging digital library from the user's point of view. We will have two parallel efforts at formative evaluation, one centering on user interface issues and the other on the performance of the retrieval engine. The Institute for Learning Technologies, an affiliate of Columbia University, will conduct a close study of user experience with initial configurations of digital resources to provide the research and development teams with design guidance about features that users are likely to find helpful or problematic. While the JANUS digital library search and retrieval engine will extend the existing retrieval engine of the JANUS prototype, the user interface must be redesigned to incorporate natural language, graphics, and image features, all appropriately laid out on the display. To incorporate tools and features from this variety of disciplines, we will test simulations and preliminary prototypes of potential user-interface features so that development resources can be allocated efficiently. For example, we will study the effect of different summary content on reformulation of queries and ease in finding the desired documents. Formative evaluation of the retrieval engine will also start at a very early stage in the project, documenting performance characteristics relative to user needs and preferences. In parallel with formative evaluation by observation, we will conduct user studies at multiple sites to assess the impact of specific design features on the use and usability of the system. We expect both evaluation efforts to participate in a tight feedback relation with the development teams, making possible numerous iterations of user-needs analysis, system design, and formative evaluation. Through these efforts, we will also provide the cumulative evaluative research on which to base summative studies of the likely costs, performance characteristics, scaling problems, and usage levels of full digital library systems.

To support extensive interaction among the research and evaluation teams, it will be important to nurture strong user groups in diverse settings whose experience will provide the empirical feedback for our work. To ensure a critical mass of users, we will evaluate the system in domains where we know we have the ability to collect an adequate base of source material. Initially, we will focus on providing access to the legal field in the context of professional legal education and research. From there, we will move to math and science educational materials, including the earth sciences, involving users from the fifth grade through college seniors. Next, we will expand to include a large body of literature on medical and health sciences. Finally, we will move to collections centered around undergraduate core curricula in the social sciences and humanities and to business and technological developments.

3. Providing Coherent Access

Given the massive increase in both the number of documents in the testbed and the number of users, a key problem for the development of our system is finding and presenting information in a comprehensible way. Thus, research on the user interface will drive our project.

Our work builds on the hypothesis that no single form of user interface can satisfy all users; different users will find different forms of input requests and/or presentation of results more effective. Thus, our interface will feature a range of query formulation and reformulation techniques in addition to standard Boolean keyword retrieval. For example, we will provide natural language free-form queries, the ability to select image features (e.g., texture or portion of an image) to search for similar images (Smith and Chang 1994; Chang 1989), as well as direct manipulation of the presentation of results (Chang and Messerschmitt 1994). A graphical history that users can edit (Kurlander and Feiner 1990) will allow users to easily locate and revise previous queries. We will use natural language techniques such as automatic identification of collocations or synonyms (Smadja and McKeown 1991) or dictionaries and thesauri (Klavans 1990; Moholt 1990) to augment the search and provide feedback to the user on how to formulate a search request. In order to facilitate use of the interface, these different input modalities must be appropriately presented to the user so that modalities of preference are easy to use for different forms of searches.
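To make the idea of augmenting a search with synonyms concrete, the sketch below expands each query term from a small hand-made table. The table `THESAURUS` and the function `expand_query` are hypothetical names introduced for illustration. The resources we plan to use (collocations, dictionaries, and thesauri) are far richer than this toy lookup.

```python
from __future__ import annotations

# Hypothetical toy thesaurus, invented for this example only.
THESAURUS = {
    "ark": ["chest", "vessel"],
    "covenant": ["pact", "agreement"],
}


def expand_query(query: str, thesaurus: dict[str, list[str]]) -> list[str]:
    """Return the query terms, each followed by its synonyms, without duplicates."""
    expanded: list[str] = []
    for term in query.lower().split():
        for candidate in [term, *thesaurus.get(term, [])]:
            if candidate not in expanded:
                expanded.append(candidate)
    return expanded


if __name__ == "__main__":
    print(expand_query("Ark of the Covenant", THESAURUS))
```

The expanded term list could be shown back to the user as feedback on how the request was interpreted, which is one way the interface might help with query reformulation.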

Presentation of results also will use multiple media, all automatically generated and tailored to the needs of particular users. A unique feature of our work will be the automatic generation of both natural language summaries using text generation techniques (McKeown 1985; Robin and McKeown 1993) and knowledge based generation of graphical abstractions (Feiner 1985; Mackinlay 1986; Roth and Mattis 1990; Beshers and Feiner 1993) of the retrieved documents. Given the difficulty users have in formulating precise queries (Dumais and Schmitt 1991; Greene et al. 1990), they are likely to be inundated with more information than they can understand. Summaries will provide textual and graphical descriptions of the retrieved documents, classifying them by document type, by topic, and by date, noting similarities and differences between sections of the documents, and using comparisons of repeated phrasings between the documents to make contrasts. Techniques for coordinating multiple media (Feiner and McKeown 1991) will be used to relate the textual summary to the accompanying graphical presentation. The user can browse the documents (text and images) and reformulate queries by directly modifying and manipulating the presentation.

To avoid swamping the user with large quantities of retrieved documents, our research will evaluate and develop a variety of search and retrieval models that combine Boolean query formulation, associative retrieval models and relevancy judgments in different ways, for different users, scaling up current techniques (Stanfill 1993) to address terabyte databases. We will also develop techniques to facilitate browsing, an activity that is likely to grow as the number of young or unsophisticated users increases. Our approach will include development of new evaluation criteria to study a variety of methods based on the vector model, such as using a parallel computer to identify a set of maximally dissimilar documents from a set of relevant items. For indexing and searching images, we will investigate non-traditional approaches for visual feature-based query, which allow users to search through millions of images and video clips by using fundamental feature sets derived from shape, texture, color, size, sketches, video scene descriptions, and video scene analysis, thus minimizing prior knowledge about image content when deriving the signal features. We will use a feature-based segmentation approach for automatic image indexing (Smith and Chang 1994), thus obviating the need for manual association of textual keys with images and their segments.
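As an illustration of selecting dissimilar documents under the vector model, the sketch below uses a greedy heuristic over simple term-frequency vectors: at each step it adds the document whose highest cosine similarity to the already chosen documents is lowest. This is a serial approximation chosen for brevity, not the parallel method we intend to study, and the function names `tf_vector`, `cosine`, and `pick_dissimilar` are our own assumptions.

```python
from __future__ import annotations

import math
from collections import Counter


def tf_vector(text: str) -> Counter:
    """Bag-of-words term-frequency vector for one document."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse term vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def pick_dissimilar(docs: list[str], k: int) -> list[int]:
    """Greedily choose up to k document indices that are mutually dissimilar."""
    if not docs or k <= 0:
        return []
    vectors = [tf_vector(d) for d in docs]
    chosen = [0]  # assumption: begin with the top-ranked document
    while len(chosen) < min(k, len(docs)):
        remaining = [i for i in range(len(docs)) if i not in chosen]
        # Pick the document least similar to its closest chosen neighbour.
        best = min(
            remaining,
            key=lambda i: max(cosine(vectors[i], vectors[c]) for c in chosen),
        )
        chosen.append(best)
    return chosen


if __name__ == "__main__":
    docs = [
        "spanish stucco walls",
        "spanish stucco walls and roofs",
        "gothic cathedral arches",
        "spanish tile roofs",
    ]
    print(pick_dissimilar(docs, 3))
```

Presenting such a subset first gives a browsing user a spread of the relevant material rather than several near-duplicates.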

Providing differing perspectives and searching on associative models will require the development of methods for effectively representing the document segments both for text and for images. We will use data extracted from large machine readable dictionaries (MRDs) (Klavans et al. 1993; Klavans 1988), thesauri, and other on-line reference material (e.g., WordNet) combined with co-occurrence data extracted from large on-line corpora (Smadja and McKeown 1990; Hatzivassiloglou and McKeown 1993) to build a large semantic network that can be used to provide a hierarchical representation of text data. For images, we will use sophisticated algorithms to achieve data reduction, exploiting the inherent redundancy of images, and also the temporal redundancy of video sequences (Nettravali and Haskell 1988; Anastassiou 1992), and segmenting image information into regions of different nature (e.g., text characters, keywords, drawings, halftone photographs, continuous tone images), coding each segment in a different way. In addition, we will represent the needs and expertise of different users, and develop efficient algorithms for reasoning about the user (Dalal and Etherington 1992). Based on this information, both the interface and the search engines can tailor their results to specific users.
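The corpus-derived data mentioned above can be illustrated with a minimal count of word pairs that occur near each other. This sketch assumes whitespace tokenization and a fixed window, and the function name `cooccurrence_counts` is hypothetical. The cited collocation work applies statistical filtering that is not reproduced here.

```python
from __future__ import annotations

from collections import Counter


def cooccurrence_counts(sentences: list[str], window: int = 2) -> Counter:
    """Count unordered word pairs appearing within `window` tokens of each other."""
    counts: Counter = Counter()
    for sentence in sentences:
        tokens = sentence.lower().split()
        for i, left in enumerate(tokens):
            # Look only ahead so that each pair is counted once per occurrence.
            for right in tokens[i + 1 : i + 1 + window]:
                if left != right:
                    counts[tuple(sorted((left, right)))] += 1
    return counts


if __name__ == "__main__":
    corpus = ["stucco walls need repair", "old stucco walls"]
    for pair, n in cooccurrence_counts(corpus).most_common(3):
        print(pair, n)
```

Pairs with high counts are candidate links between word nodes in a semantic network of the kind described above.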

In addition to considering end users of the digital library, we will address the needs of document providers. We will develop different models of intellectual property rights and billing access, and implement them as part of the system, experimenting with the tradeoff between expressiveness of different languages and efficiency of inference engines (Levesque 1985), and various approximation techniques (Dalal and Etherington 1992). Evaluations of the models will aim at identifying approaches that satisfy both publishers and end users.

Finally, we will also address the networking and file storage requirements needed to support the JANUS user interface. Given the need to provide transparent access to documents regardless of physical location, the JANUS digital library will rely on standard networking protocols to allow access to distributed textual documents. However, these standards are inadequate for access to multimedia documents. We will design a transport protocol or set of protocols capable of supporting multimedia traffic, each medium with its own quality of service (QOS), between one or more (possibly communicating) library databases and the library user over a variety of networks (La Porta and Schwartz 1993).

4. An Example of Planned System Operation

The following example illustrates how interaction with the JANUS Digital Library might proceed. We show the different modalities users can use to make requests, the different document types a user might receive and how a response may be presented to the user.

In an architecture course, students may use image manipulation tools to select image segments by representative features (e.g., texture and color of wall materials, shape and structure of pillars) to search for images with similar features. Thus, from a menu of textures a student might select a texture similar to stucco to search for images of buildings of Spanish architecture. This search will be done by measuring feature similarity with features extracted directly from the compressed format representing image/video data. From perhaps 40 returned images, they may find that several typical types of roofs are used. Using the same image manipulation tools, they may formulate another level of search functions by combining roof features with similar or dissimilar wall materials, thus further restricting the set of images. Alternatively, they may combine their results with textual queries (e.g., the name of an architect) to refine their search.
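A feature-similarity search of this kind can be sketched as ranking stored feature vectors against the features of the selected texture. The example assumes that each image is already described by a normalised histogram and uses histogram intersection as the similarity measure. Both are simplifying assumptions on our part, as are the names `histogram_intersection` and `rank_images`. Extraction of features from the compressed representation is not shown.

```python
from __future__ import annotations


def histogram_intersection(a: list[float], b: list[float]) -> float:
    """Similarity between two histograms that are each normalised to sum to 1."""
    return sum(min(x, y) for x, y in zip(a, b))


def rank_images(
    query: list[float], library: dict[str, list[float]], top_n: int = 3
) -> list[str]:
    """Return image names ordered from most to least similar to the query."""
    ranked = sorted(
        library,
        key=lambda name: histogram_intersection(query, library[name]),
        reverse=True,
    )
    return ranked[:top_n]


if __name__ == "__main__":
    # Hypothetical three-bin texture histograms for three images.
    library = {
        "villa": [0.7, 0.2, 0.1],
        "tower": [0.1, 0.1, 0.8],
        "mission": [0.6, 0.3, 0.1],
    }
    print(rank_images([0.7, 0.2, 0.1], library, top_n=2))
```

A second round of search, for example combining roof and wall features, could reuse the same ranking with a longer feature vector.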

Since text and image searches will be integrated, this query will also return any textual documents describing the types of buildings returned using a keyword-feature association index. Such an index, which we will automatically enrich through learning and knowledge based methods, will link the low level features specified above (e.g. texture) to associated textual keywords (e.g. "stucco"). A textual search will be initiated using these keywords.
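The keyword-feature association index can be pictured as a mapping from low-level feature labels to textual keywords. The sketch below is a hand-populated toy: the class `KeywordFeatureIndex`, its methods, and the feature labels are hypothetical, and the automatic enrichment through learning and knowledge based methods is not modelled.

```python
from __future__ import annotations

from collections import defaultdict


class KeywordFeatureIndex:
    """Toy index from low-level feature labels to associated keywords."""

    def __init__(self) -> None:
        self._keywords_by_feature: dict[str, set[str]] = defaultdict(set)

    def associate(self, feature: str, keyword: str) -> None:
        """Record that a feature label is associated with a keyword."""
        self._keywords_by_feature[feature].add(keyword)

    def keywords_for(self, features: list[str]) -> list[str]:
        """Collect the keywords linked to any of the given features, sorted."""
        found: set[str] = set()
        for feature in features:
            found |= self._keywords_by_feature.get(feature, set())
        return sorted(found)


if __name__ == "__main__":
    index = KeywordFeatureIndex()
    index.associate("texture:rough-plaster", "stucco")
    index.associate("texture:rough-plaster", "adobe")
    index.associate("shape:round-arch", "arcade")
    # The keywords found here would seed the textual search.
    print(index.keywords_for(["texture:rough-plaster"]))
```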

Automatically generated natural language and graphics will be used to summarize the documents returned. The multimedia summary would indicate the number of textual documents versus number of images, might further categorize the documents by topic (e.g., noting which documents portray or discuss different types of Spanish architecture) and could use dates, either of the document or within the document, to describe the architectural periods included. Further analysis of word and phrase repetitions within different textual documents could provide further contrasts within the summary (e.g., classifying articles that critique particular architects or that define architectural styles). The graphical summary would use icons and color to code the different categories of documents. By panning over and zooming into the different portions of the graphic as the textual summary is displayed, the system will provide an animated multimedia tour of the information space, which can be customized to the individual user's needs. By manipulating the graphic and selecting restricted portions, the user can generate refined queries that will further restrict the results, generating new summaries providing more detail on the smaller set of documents. Alternatively, the user could select documents to browse by manipulating the graphic.
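The content of such a summary (counts by medium, grouping by topic, and the range of dates) can be sketched with a simple template. This is only an illustration of what information the summary carries. It uses fixed string templates rather than the text generation techniques we plan to apply, and the names `Retrieved` and `summarize` are hypothetical.

```python
from __future__ import annotations

from collections import Counter
from dataclasses import dataclass


@dataclass
class Retrieved:
    """One retrieved item (hypothetical illustrative record)."""

    title: str
    media: str  # "text" or "image"
    topic: str
    year: int


def summarize(results: list[Retrieved]) -> str:
    """Produce a one-sentence plain-text summary of a result set."""
    if not results:
        return "No documents were retrieved."
    media = Counter(r.media for r in results)
    topics = Counter(r.topic for r in results)
    years = [r.year for r in results]
    topic_part = ", ".join(f"{n} on {t}" for t, n in topics.most_common())
    return (
        f"Retrieved {len(results)} items "
        f"({media['text']} text, {media['image']} image), "
        f"dated {min(years)} to {max(years)}: {topic_part}."
    )


if __name__ == "__main__":
    results = [
        Retrieved("A", "text", "Moorish", 1920),
        Retrieved("B", "image", "Moorish", 1975),
        Retrieved("C", "text", "Mission", 1991),
    ]
    print(summarize(results))
```

The same counts could drive the graphical summary, with one icon or color per category.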

Once the user has selected images and documents of interest, the intellectual property rights inference system will determine the rights of the current user and the different options available for obtaining copies of the document (e.g., free or pay-per-use, where use can include different charges for printing or online browsing).
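The decision made at this step can be illustrated with a very small rule: users in a group with free access pay nothing, and all others are charged per use, with separate charges for browsing and printing. The class `LicenseTerms`, the function `access_options`, and the fee values are hypothetical. The planned inference system will reason over much richer models of rights than this two-case rule.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class LicenseTerms:
    """Terms a provider attaches to a document (illustrative assumption)."""

    free_for: frozenset  # user groups granted free access
    browse_fee: float  # per-use charge for online browsing
    print_fee: float  # per-use charge for printing


def access_options(user_group: str, terms: LicenseTerms) -> dict[str, float]:
    """Map each permitted action to its charge for this user (0.0 means free)."""
    if user_group in terms.free_for:
        return {"browse": 0.0, "print": 0.0}
    return {"browse": terms.browse_fee, "print": terms.print_fee}


if __name__ == "__main__":
    terms = LicenseTerms(frozenset({"student"}), browse_fee=0.25, print_fee=1.0)
    print(access_options("student", terms))
    print(access_options("public", terms))
```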

5. Cooperating Entities and Their Roles

Columbia University has established partnerships with a variety of institutions and corporations who will each contribute to the development of the JANUS Digital Library in one of three main roles.

Figure 2. Testbed Collection.

One group of partners will work toward developing the testbed collection by providing digitized materials in a variety of fields. The second group of partners will aid in evaluation of the system, making the JANUS Digital Library available to their patrons. The third group of partners will provide technical expertise in the form of equipment, software, or services for different areas of research.

Through digitized materials furnished by Columbia University Libraries, Yale University Libraries, and partnership with a variety of publishers, the JANUS Digital Library will provide access to substantial collections in the focused fields. In addition, we will draw on existing collections available at other units within Columbia University, such as the Lamont-Doherty Earth Observatory and the Columbia Presbyterian Medical Center. Figure 2 shows how we will assemble the testbed collection from material provided by coalition members. We will continue to build on this set of providers throughout the course of the project, in part by drawing on collections made public over the Internet.

We have developed strong partnerships with both public and university libraries, along with primary and secondary schools. The JANUS Digital Library will be tested with both students and advanced scholars at Columbia University and Yale University Libraries. In addition, we have developed collaborations with the Seattle Public Library and the New York Public Library, where the system will be made available to and tested with the general public. These settings will necessarily involve casual users, who may use the system only once during the course of evaluation. Finally, through arrangement with the Institute for Learning Technologies and various departments at Columbia University, we will evaluate situations where secondary-school and college students interact with the system as part of their regular curriculum. Such arrangements enable our research agenda in participatory design; through early evaluation of user satisfaction and system functionality, the system will be tuned to the needs of a wide variety of users. These arrangements will also allow the initiative to lead in making electronic access to online collections available to the general public.

Our third group of partners will provide technical expertise. These include Thinking Machines Corporation, who will provide access to their most advanced parallel computer during the course of development as well as researchers with expertise in information retrieval; Future InfoSystems, Inc. (FIS), who will provide information retrieval software tools as well as researchers skilled in information retrieval; Eastman Kodak Company, who will loan image digitizing and massive storage equipment; General Electric Company, who will contribute natural language software for information extraction; Tantalus, Inc., who will conduct quantitative and qualitative evaluations of system performance; and National Storage Laboratory/IBM, who will contribute their new High Performance Storage System (HPSS).

Figure 3. Testbed Facility.

6. Conclusions

Our work will culminate in a digital library testbed that integrates advanced research prototypes. It will feature a sophisticated and responsive user interface and an efficient search and retrieval facility, incorporating integrated processing of text, images and video at all levels of the system. Through the combined strength of Columbia and Yale University Libraries, and partnership with a variety of publishers, the JANUS Digital Library can provide access to comprehensive collections in focused fields. Our strong collaboration with public and university libraries, along with educational settings associated with the Institute for Learning Technologies, means our work will be at the forefront of participatory design; through early evaluation of user satisfaction and system functionality, we can tune the system to the needs of a wide variety of users. Performing user needs analysis for a variety of user groups means system design must adapt to user needs, as opposed to the other way around. These arrangements also allow Columbia University and its partners to lead in making electronic access to online collections available to the general public, as well as identifying additional research issues that should be addressed to refine digital library models.

Our cross-disciplinary team structure facilitates significant research advances by bringing together individuals who might not otherwise interact. For example, our work aims at advances in the integration of text, images, and video in search and retrieval, in the integration of multiple media for access and presentation of data, in the integration of techniques from information retrieval and natural language processing for search and retrieval, and in the integration of different representation techniques, each of which is best suited to the medium to which it is applied. Through advances in networking for multimedia transport, we will enable connections between outside researchers (and users) and our testbed, and between our digital library system and other remote testbeds. Finally, our work also features a unique collaboration between legal scholars, publishers and computer scientists to develop an intellectual property rights model that incorporates insights from law, the needs of document providers, and an advanced representation and reasoning facility to handle subtle differences between cases.

Our objective of providing coherent access will be met in multiple ways. We will develop a novel user interface that incorporates multiple means for requesting information, allowing the user to freely move between different modes of communication as desired, providing a coherent medium through which data can be accessed. By making possible integrated search and retrieval of text and images, we will give users coherent access to multiple media. Our work on representation of text and images is aimed at supporting sophisticated retrieval methods that allow improvement of precision and recall. Through automatically generated multimedia summaries of the information retrieved, the user can gain a coherent view of massive amounts of data in order to determine which material is relevant. Manipulation of the graphical summaries and images returned, along with session histories, will provide coherent methods for the user to revise requests and further focus attention on a smaller subset of material just retrieved. Through development of networking for multimedia transport, we aim to provide coherent access to information stored at multiple sites in a transparent way.

References

Anastassiou, D., Scalability for HDTV, Invited Paper, Proceedings, International Workshop on HDTV'92, Kawasaki, Japan, November 18-20, 1992. (This paper is reprinted in the book: Signal Processing of HDTV, IV, E. Dubois and L. Chiariglione, editors, Elsevier, 1993.)

Beshers, C. and Feiner, S. AutoVisual: Rule-Based Design of Interactive Multivariate Visualizations. IEEE Computer Graphics and Applications, 13(4):41-49, July 1993.

Chang, S.-K., Principles of Pictorial Information Systems Design, Englewood Cliffs, NJ, Prentice-Hall, 1989.

Chang, S.-F. and Messerschmitt, D.G. Manipulation and Compositing of MC-DCT Compressed Video. To appear in IEEE Journal on Selected Areas in Communications, 1994.

Dalal, M. and Etherington, D.W., Tractable approximate deduction using limited vocabularies. Proceedings Ninth Canadian Conference on Artificial Intelligence, 206-212, Vancouver, Canada, 1992.

Dumais, S. T. and Schmitt, D. G. Iterative searching in an online database. In Proceedings of Human Factors Society 35th Annual Meeting, 398-402, 1991.

Feiner, S., APEX: An Experiment in the Automated Creation of Pictorial Explanations, in IEEE Computer Graphics and Applications, 5(11):29-37, November 1985.

Feiner, S. and McKeown, K.R., Automating the Generation of Coordinated Multimedia Explanations, IEEE Computer, 24(10):33-41, October 1991.

Greene, S., Devlin, S., Cannata, P. and Gomez, L., No IFs, ANDs, or ORs: A Study of Database Querying. International Journal of Man-Machine Studies, 32(3):303-326, 1990.

Hatzivassiloglou, V. and McKeown, K.R., Towards the Automatic Identification of Adjectival Scales: Clustering Adjectives According to Meaning, in Proceedings of the 31st Annual Meeting of the Association for Computational Linguistics, Columbus, Ohio, June 1993.

Klavans, Judith L., Braden-Harder, L., Yoon, M., and Zadrozny, W., Patent on Semantic Taxonomies and Multimedia Indexing and Retrieval, 1993.

Klavans, Judith L., and Tzoukermann, E., The BICORD System: Combining Lexical Information from Bilingual Corpora and Machine Readable Dictionaries, in Proceedings of the 13th International Conference on Computational Linguistics, Helsinki, Finland, 1990.

Klavans, Judith L., Building a Computational Lexicon using Machine Readable Dictionaries, in Proceedings of the Third International Congress of the European Association for Lexicography, Budapest, Hungary, 1988.

Kurlander, D., and Feiner, S., A visual language for browsing, undoing and redoing graphical interface commands. In Visual Languages and Visual Programming, ed. Chang, S.K. New York: Plenum Press. 1990.

La Porta, T.F., and Schwartz, M., The Multistream Protocol: A Highly Flexible High-Speed Transport Protocol, IEEE J. on Selected Areas in Comm., 11(4):519-530, May 1993.

Levesque, H.J., and Brachman, R.J., A fundamental tradeoff in knowledge representation and reasoning (revised version), in Readings in Knowledge Representation, R.J. Brachman and H.J. Levesque, eds., Morgan Kaufmann, Los Altos, California, 41-70, 1985.

Mackinlay, J. Automating the Design of Graphical Presentations of Relational Information, ACM Transactions on Graphics, 5(2):110-141, April 1986.

McKeown, K.R., Text Generation: Using Discourse Strategies and Focus Constraints to Generate Natural Language Text, Cambridge University Press: Cambridge, England, 1985.

Moholt, P. and Goldbogen, G., The Use of Inter-Concept Relationships for the Enhancement of Semantic networks and Hierarchically Structured Vocabularies, Proceedings of the Sixth Annual Conference of the University of Waterloo Centre for the New Oxford English Dictionary and Text Research, Electronic Text Research, 1990.

Nettravali, A.N. and Haskell, N., Digital Pictures: Representation and Compression, Plenum, 1988.

Robin, J. and McKeown, K.R., Corpus analysis for revision-based generation of complex sentences, in Proceedings of the National Conference on Artificial Intelligence, Washington, D.C., July 1993.

Roth, S. and Mattis, J. Data Characterization for Intelligent Graphics Presentation, Proc. CHI '90, Seattle WA, April 1-5, 1990, New York: ACM, 1990, 193-200.

Smadja, F. and McKeown, K.R., Automatically Extracting and Representing Collocations for Language Generation, Proceedings of the XXVII Annual Meeting of the Association for Computational Linguistics, 252-259, June 1990.

Smadja, F. and McKeown, K.R., Using Collocations for Language Generation, Computational Intelligence, 7(4), December 1991.

Smith, John, and Shih-Fu Chang, Quad-Tree Segmentation for Texture-Based Image Query, submitted to ACM 2nd International Multimedia Conference, 1994.

Stanfill, C., Parallel Information Retrieval Algorithms, in Information Retrieval: Data Structures and Algorithms, W. Frakes and R. Baeza-Yates, eds., Prentice Hall: Englewood Cliffs, N.J., 1992.

Stanfill, C. and Linhoff, G., Compression of Indexes with Full Positional Information in Very Large Text Databases, Proceedings SIGIR, Pittsburgh, Pa., 88-95, 1993.
