Translating Data to Knowledge in Digital Libraries

Gordon K. Springer1 and Timothy B. Patrick2

1Department of Computer Science, 2Medical Informatics Group, School of Medicine, University of Missouri-Columbia, Columbia, Missouri, USA, 65211, {springer, patrick}@condor.cs.missouri.edu

1. Introduction

For the first time in more than 1000 years an opportunity exists to change the nature of the way libraries are organized and the way that the data contained within them are accessed. In the classical sense a library is an organized collection of data or artifacts. The collection is organized such that a user of the library has a procedure or method for identifying a desired item and being able to extract that item from the collection for perusal or use. This implies the existence of a classification scheme which can be used to store and retrieve items in the library or collection.

The classification schemes, the physical organization and the methods of access in traditional libraries are bounded by the fact that these procedures are focused on storing or extracting documents from a finite dimensional space. The dimension is usually three. Books, journals and the like are stored on the shelves of a library, and the method of access is to go to the point in 3-space where the particular book or document can be found. Similarly, the classification schemes utilized are broadly divided into author, title and subject headings. A fourth division, keywords, attempts to characterize the content of a document. Even so, the classification schemes simply mirror the physical organization of the collection. In doing so, they limit the ability of users to extract information or knowledge in an intelligent way. This is not to belittle the user or the library, but to point out the shortcomings of a classical library, which was designed to store and retrieve documents, not information or knowledge.

With the evolution of the digital library, the traditional limitations of collecting, organizing and retrieving items from a finite dimensional space are no longer present. The opportunity, and the challenge, is to take advantage of this freedom to focus on extracting information and knowledge from the digital collections. The dimensionality of the information space is increased only if analysis tools or filters are utilized as an inherent part of the process of searching the library collections, so that information rather than data is extracted from them. It would be a serious injustice to continue extracting only documents. An excellent overview of these challenges is presented in [1].

2. Discussion

In order to translate data to knowledge, access to large quantities of data is necessary, and information must be extracted from these data. Digital libraries provide access to these data much more readily than is possible in the traditional library. To be truly useful, the classification schemes used to locate data in the massive, distributed collections must be extensive and fine-grained. In addition, it is necessary to be able to quickly and precisely locate the digital collections needed to satisfy an information request. Thus an organized methodology must be in place both to improve the finding of data pertinent to a request and to limit the number of discrete data collections that must be accessed to extract the desired information.

The problem of extracting information from data is not addressed simply by developing better classification schemes, by organizing data collections using newer and better database schemas, or by making the data accessible to the entire world by quickly transporting it across the evolving computer networks or data highways. Filters are needed that can derive information or knowledge from the massive collections of data stored in digital form. Moreover, information extracted from one source should be usable as input for extracting additional information from another source. Thus, it is not a single, general-purpose filter that is needed, but a very large number of filters that are discipline-specific and user-specific. The challenge is to make it possible for a wide variety of filters to be utilized, when appropriate, to process the available data and information and to extract the desired information.
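The chaining idea in the preceding paragraph, in which the output of one filter becomes the input of another, can be illustrated with a minimal sketch. Python is used purely for illustration. The filter names (`keep_dna`, `gc_fraction`, `gc_rich`), the record layout and the threshold are hypothetical and are not taken from the system described in this paper.

```python
from functools import reduce
from typing import Callable

# Assumption: a "filter" is modeled as any function from one value to a derived value.
Filter = Callable[[object], object]

def chain(*filters: Filter) -> Filter:
    """Compose filters left to right: the output of one feeds the next."""
    def run(data):
        return reduce(lambda value, f: f(value), filters, data)
    return run

# Hypothetical discipline-specific filters for sequence records.
def keep_dna(records):
    """Keep non-empty records whose sequence uses only the letters A, C, G, T."""
    return [r for r in records if r["seq"] and set(r["seq"]) <= set("ACGT")]

def gc_fraction(records):
    """Derive a per-record GC fraction from the raw sequence data."""
    return {
        r["id"]: (r["seq"].count("G") + r["seq"].count("C")) / len(r["seq"])
        for r in records
    }

def gc_rich(fractions, threshold=0.5):
    """Return the sorted ids whose GC fraction exceeds the threshold."""
    return sorted(name for name, frac in fractions.items() if frac > threshold)

# One user-specific pipeline assembled from the general-purpose pieces.
find_gc_rich = chain(keep_dna, gc_fraction, gc_rich)

records = [
    {"id": "a", "seq": "GGCC"},
    {"id": "b", "seq": "ATAT"},
    {"id": "c", "seq": "MKV"},   # not DNA, dropped by the first filter
    {"id": "d", "seq": "GCGA"},
]
result = find_gc_rich(records)
```

Because each filter only depends on the shape of its input, a different discipline or user could swap in other filters without changing `chain`.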

We are developing a system that is based upon the need to provide the user with information rather than just data. It involves the integration of autonomous programs and analysis tools, which can be viewed as filters, to extract the maximum amount of information that can be obtained about genetic sequence data in the biomedical sciences. This system utilizes servers that are based upon open-system, distributed computing concepts. These servers offer various kinds of services to users with information needs [2]. Integral to this system is the ability of a given server to advertise its services so that they can be quickly and efficiently utilized by prospective users. The user need not know where the services are located or what is entailed in accessing the servers. What the users do know is that they receive information, not data, in response to their queries. A discussion of the mechanisms used in this system can be found elsewhere [3].
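The notion of servers advertising services that users reach without knowing their location can be sketched as a small directory. This is an in-process illustration only, and it is not the mechanism of the system described here (for that, see [3]). The class `ServiceDirectory`, its methods and the `reverse-complement` service name are all hypothetical.

```python
from typing import Callable, Dict, List

class ServiceDirectory:
    """Hypothetical stand-in for a directory in which servers advertise services."""

    def __init__(self) -> None:
        self._providers: Dict[str, List[Callable[[str], str]]] = {}

    def advertise(self, service: str, handler: Callable[[str], str]) -> None:
        """Called by a server to announce that it offers the named service."""
        self._providers.setdefault(service, []).append(handler)

    def request(self, service: str, query: str) -> str:
        """Called by a user, who names the service but never the server."""
        providers = self._providers.get(service)
        if not providers:
            raise LookupError(f"no server advertises {service!r}")
        # Assumption: the first advertiser is used; a real system might
        # choose among providers by load, proximity, or cost.
        return providers[0](query)

# A hypothetical analysis tool offered by some server.
def reverse_complement(seq: str) -> str:
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]

directory = ServiceDirectory()
directory.advertise("reverse-complement", reverse_complement)

answer = directory.request("reverse-complement", "AAGC")
```

The caller passes only a service name and a query. Where the handler runs is hidden behind the directory, which is the location transparency the paragraph above describes.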

3. Summary

The digital library brings with it the need to break with the traditions of the classical library. We need to seek out better ways to increase the dimensionality of the information space to provide a wider variety of pathways to the information contained within the space. This necessitates the use of analysis tools or filters to process the data contained in the search space so that information rather than documents is returned to the user.

The National Information Infrastructure is going to be saturated with data flowing from one location to another if the process continues to focus on document retrieval. Without the use of analysis tools or filters to translate the data into information as an integral part of the process, we will continue to be buried in a sea of data. With the use of these filters, we will be able to take full advantage of the digital library technology and provide users with the information they need to effectively carry out their desired activities.

Acknowledgments

This work was supported in part by grants LM07089 and LM05513 from the National Library of Medicine, and also by Pittsburgh Supercomputing Center grant number NCR930001P from the NIH National Center for Research Resources. Its contents are solely the responsibility of the authors and do not necessarily represent the official views of the National Library of Medicine or the Pittsburgh Supercomputing Center.

References

[1] Garrett, J. R., 1993. Digital Libraries, The Grand Challenges, EDUCOM Review, July-August 1993, pp. 17-21.

[2] Springer, G. K., 1994. A National Scientific Computing Environment for the Biological Sciences, Proceedings of the 27th Annual Hawaii International Conference on System Sciences, IEEE Computer Society Press, Los Alamitos, CA, 1994, Volume V, pp. 87-88.

[3] Patrick, T. B., Springer, G. K., Sista, S. M., and Davison, S., 1994. Methods for Shared Access to Medical Internet Information Sources, Proceedings of the American Medical Informatics Association Spring 1994 Congress, American Medical Informatics Association, Bethesda, MD, p. 123.
