Automatic Extraction of Hypermedia Bundles from the Digital Library

Hugh Davis

Multimedia Research Group
Electronics and Computer Science
The University of Southampton
Southampton, SO17 1BJ, UK.
E-mail: hcd@ecs.soton.ac.uk

Jessie Hey

Multimedia Research Group
Electronics and Computer Science
The University of Southampton
Southampton, SO17 1BJ, UK.
E-mail: jmnh94r@ecs.soton.ac.uk

ABSTRACT

This paper describes experiments to extract a set of multimedia documents from a digital library in response to a user query, and then to present these documents as a hypermedia application called a "bundle". A visual interface to a query engine with multiple query tools, using successive refinement, is described. The principle of automatic generation of hypertexts using the structure inherent in the library catalogue is explained. Issues arising from these experiments include the content of digital library catalogues and the ownership and regulation of these catalogue entries. The paper explores these issues, and examines the possibility of using the resulting system as a workbench for investigating agent technology.

KEYWORDS: hypermedia, hypertext, hypermedia library systems, digital libraries, Microcosm, World Wide Web, library catalogues, information retrieval, generic links, publication on demand.

INTRODUCTION

The advent of the digital library presents librarians and computer system builders with new challenges and opportunities.

On a national level in the UK, the Joint Funding Councils, under the chairmanship of Sir Brian Follett, produced a significant report [13] in 1993 which recommended a considerable investment in IT in British university libraries. The Multimedia Research Group at the University of Southampton, UK, has had a proposal accepted by the Follett Implementation Group on Information Technology (FIGIT) to establish a standard framework for integrating journals with other networked journals and information resources. The other partners are the Cognitive Sciences Centre at Southampton, headed by Prof. Stevan Harnad, who is the founding editor of Psycoloquy, the first peer-reviewed electronic journal on the Internet; the University of Nottingham, home of the CAJUN (CD-ROM Acrobat Journals Using Networks) project; the Company of Biologists; and the British Computer Society, which is Europe's largest professional computing society. The end product will be an Open Journal Framework: a combination of document server and hypermedia client technologies which allow customised access to a range of secondary information sources from a central primary source.

While academic libraries in the UK are planning ahead to combat ever decreasing funds with electronic solutions, similar pressures are being investigated in the public domain. The most extensive review of public libraries in the UK since 1942 is just being completed. The recommendation "Infrastructure investments" of the Review of the Public Library Service in England and Wales [2] puts great emphasis on connecting to the information superhighway, and suggests the establishment of 5 or 6 `hyperlibraries' to incorporate decentralised collections in specialised subject areas from the British Library. This would allow the sharing of resources and relieve pressure on large central area libraries. In these `hyperlibraries' some local collections could also be developed to the full extent of their national or international appeal. Improving remote access and, where feasible, producing digital libraries would then dramatically increase the number of users able to exploit these collections. Extracting customised and manageable subsets from such digital libraries is an important issue, and the principal subject of this paper.

One of the immediately apparent advantages of maintaining resources digitally is the ease with which one may make a query, and then retrieve the documents identified. Another advantage is the ease with which one can browse the materials, quickly skipping from one document to another. The success of the World Wide Web is testimony to the importance of these features.

If digital libraries are to be more than computerised search engines, which merely identify the location of the paper document, or allow the user to view or print an electronic copy of the document, then it is essential that the digital library adds value to what is currently available. Gladney et al. [9], in defining a digital library, state that "A full service digital library must accomplish all essential services of traditional libraries and also exploit the well-known advantages of digital storage, searching, and communication". However, the result of a computerised query is generally nothing more than a list of documents. All one can do is to traverse the list searching for an appropriate document. What we would like to be able to do is to make a query, and have the system deliver an article or book, of exactly the correct length, that was specially written in response to the query and to exactly the required conceptual depth. Of course, one of the reasons readers browse at libraries and bookshops is to evaluate books to see if they deliver their subject at the correct level.

A second advantage of the digital library is that we may store and present multimedia resources. We believe that multimedia presentations can make very powerful learning aids, and it would be a very useful facility if we could build multimedia presentations from the stored information. These presentations might be requested by some client, built at the library from the latest materials available, and then delivered to the site of learning. But who is to build these presentations? Librarians are certainly overworked, and anyway manual authoring of these presentations is not practical, except in special cases where the end value of the product is intrinsically high.

Articles, books and multimedia presentations built in response to user queries would be good examples of the sort of added value that digital libraries can provide. We refer to them as bundles, and this paper describes various experiments we are conducting on extracting these bundles from large catalogued collections of digital resources.

CREATING MANAGEABLE BUNDLES FROM CATALOGUES

The objective of our experiments is to provide a system which will produce an appropriate and manageable hypermedia presentation in response to a user's queries to an interactive library catalogue. The resulting presentation is what we refer to as a bundle, being a set of multimedia documents, relevant to a particular topic, that the user has selected as being of interest and that have been automatically linked together as a hypertext.

There are therefore two stages to the creation of a bundle: the first involves the user interacting with the library catalogue in order to specify the set of documents to bundle, and the second, the creation of the hypertext, is undertaken automatically. These two processes may be seen as simple cases of the mediator and collection-interface agents under development in the University of Michigan Digital Library [3]. The following subsections examine these processes.

The Query Engine

In order to help locate suitable documents from the resource catalogue we are prototyping three tools, as shown in figure 1. The first of these tools is the classification tool, which allows the user to view the classification hierarchy (e.g. Dewey Decimal) as a tree, and to select that sub-tree or branch which contains the set of documents to be placed in the intermediate results list.

The second tool is the attribute tool, a form-based query tool which displays all the document attributes from the catalogue, and allows the user to specify the attributes of the set of documents to be placed in the intermediate results list. These attributes are such items as the author, the title and keywords. The classification tool and the attribute tool are intended to provide similar functionality to Bellcore's Hierarchical On-line Public Access Catalogue (OPAC) [1]; in their system, the intermediate results list is known as the bookshelf.

The third tool is an information retrieval tool which allows the user to enter a free text query. The tool then locates that set of documents having the best match to the query, using an algorithm developed by Li [17] based upon that suggested by Salton [20]. The tool compares the frequency of terms in the query with the frequency of terms in the documents, as held in pre-prepared indexes. At present this tool only works with text-based documents, but we envisage that at a later date we will incorporate the results of work in the Multimedia Research Group in the area of media-based content retrieval [16].
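The comparison of term frequencies can be illustrated with a small sketch. It is not the algorithm of Li [17], whose details are not given here; it is a plain cosine similarity over raw term counts, in the spirit of the vector space model [20]. The function names, the tokeniser and the sample documents are all assumptions made for illustration.

```python
import math
import re
from collections import Counter


def term_frequencies(text):
    """Count lower-cased alphabetic terms in a piece of text."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine_similarity(query_tf, doc_tf):
    """Cosine of the angle between two term-frequency vectors."""
    shared = set(query_tf) & set(doc_tf)
    dot = sum(query_tf[t] * doc_tf[t] for t in shared)
    norm_q = math.sqrt(sum(v * v for v in query_tf.values()))
    norm_d = math.sqrt(sum(v * v for v in doc_tf.values()))
    if norm_q == 0 or norm_d == 0:
        return 0.0
    return dot / (norm_q * norm_d)


def best_matches(query, index, limit=3):
    """Rank documents in a pre-prepared index (doc id -> Counter) against a query."""
    query_tf = term_frequencies(query)
    scored = [(cosine_similarity(query_tf, tf), doc_id) for doc_id, tf in index.items()]
    scored = [(score, doc_id) for score, doc_id in scored if score > 0]
    # Highest score first; ties are broken by document id for a stable order.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [doc_id for _, doc_id in scored[:limit]]


# Hypothetical documents; the index is built once, ahead of any query.
documents = {
    "doc-geology": "Geology is the study of rocks and the history of the earth.",
    "doc-hypertext": "Hypertext links connect documents in a digital library.",
    "doc-library": "A library catalogue records the author, title and keywords.",
}
index = {doc_id: term_frequencies(text) for doc_id, text in documents.items()}
```

Calling `best_matches("rocks and earth history", index)` ranks the geology document first. A practical engine would also weight terms by their rarity and remove stop words such as "and", which this sketch does not attempt.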

Figure 1: Querying the Library Catalogue.

Once the initial query has been made, a list of suitable documents will appear in the intermediate results list box. From this list, the user may elect to keep all, or a selection, of the documents retrieved. Documents that are marked for keeping will appear in the final list of results, regardless of any subsequent queries. The user may now elect to refine the query, or add further results to the list. This is achieved by running any one of the query tools again, with a new query. The results of each iteration may be ORed with the previous results, so that they are added to the previous results, expanding the set of results in the intermediate results list. Alternatively the new results may be ANDed with the previous results, so that the new list of results satisfies all previous queries, thus successively refining the query, until a suitably small set of results has been collected.
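The expanding and refining steps map naturally onto set union and set intersection. The following is a minimal sketch under that reading; the class and method names are hypothetical, and the assumption that the final list holds exactly the kept documents is ours rather than a description of the prototype.

```python
class ResultsList:
    """Intermediate results list plus the documents the user has marked to keep."""

    def __init__(self):
        self.intermediate = set()
        self.kept = set()

    def expand(self, new_results):
        """Add the new results to the current list (set union)."""
        self.intermediate |= set(new_results)

    def refine(self, new_results):
        """Retain only documents that also satisfy the new query (set intersection)."""
        self.intermediate &= set(new_results)

    def keep(self, doc_ids):
        """Mark listed documents so that they survive any subsequent query."""
        self.kept |= set(doc_ids) & self.intermediate

    def final(self):
        """Assumed here: the final list is exactly the set of kept documents."""
        return sorted(self.kept)


results = ResultsList()
results.expand(["d1", "d2", "d3"])  # first query
results.keep(["d1"])                # d1 is now safe from later refinement
results.refine(["d2", "d4"])        # intermediate list shrinks to d2
results.expand(["d5"])              # intermediate list grows to d2, d5
results.keep(["d5"])
```

After these steps `results.final()` contains `d1` and `d5`, even though `d1` was removed from the intermediate list by the refinement.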

Because we do not yet have a suitably large digital library of multi-subject on-line resources, for the purpose of prototyping the query tool we created a catalogue, in MS-Access, of every document we have ever included in a Microcosm project within the Multimedia Laboratory. This catalogue refers to around 5000 multimedia documents from 25 different subject areas. The catalogue we created contained all the usual fields that would be expected in a library catalogue. However, we have found two further attribute fields to be useful.

We have added a field for document length. Expressing the length of a multimedia article is quite difficult. The purpose of keeping this attribute is to allow the user to estimate the time that will be required to view this document. "Pages" might have been a suitable unit of length for text documents, but since many documents are held in media other than text, we decided to adopt the "minute" as an experimental unit of length, being the approximate time that might be required to peruse and understand the contents. This is inevitably a subjective unit, but it does have the advantage that the user may refine queries, for example, by asking only for "documents with length less than 5 minutes".

We have also added a field called reader level. This field is intended to indicate the type of readership that is intended. We have restricted the allowable entries to a very short list, including secondary school, university teaching, research level, and review article.

The purpose of both the above fields is to represent a primitive form of user profile. User profiling by mediation agents is an active research topic [21].
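As an illustration of how these two fields might be used to narrow a result set, the sketch below models a catalogue record and a simple filter. The record layout, field names and sample entries are assumptions; the reader levels are the four named above, and the real list may contain others.

```python
from dataclasses import dataclass, field

# The four levels named in the text; the actual list may be longer.
READER_LEVELS = (
    "secondary school",
    "university teaching",
    "research level",
    "review article",
)


@dataclass
class CatalogueRecord:
    doc_id: str
    title: str
    length_minutes: float  # approximate time needed to peruse the document
    reader_level: str
    keywords: list = field(default_factory=list)

    def __post_init__(self):
        if self.reader_level not in READER_LEVELS:
            raise ValueError(f"unknown reader level: {self.reader_level!r}")


def select(records, max_minutes=None, reader_level=None):
    """Return records shorter than max_minutes and matching reader_level, if given."""
    chosen = []
    for record in records:
        if max_minutes is not None and not record.length_minutes < max_minutes:
            continue
        if reader_level is not None and record.reader_level != reader_level:
            continue
        chosen.append(record)
    return chosen


# Hypothetical catalogue entries.
catalogue = [
    CatalogueRecord("d1", "What geologists do", 3, "secondary school", ["geology"]),
    CatalogueRecord("d2", "Plate tectonics video", 12, "university teaching", ["geology"]),
    CatalogueRecord("d3", "Rock types", 4, "university teaching", ["rocks"]),
]
```

A request for "documents with length less than 5 minutes" then becomes `select(catalogue, max_minutes=5)`, and adding `reader_level="secondary school"` narrows the result to the first record.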

The Hypermedia Presentation

Automatic construction of hypertexts from a collection of linear documents is one of the holy grails of the hypermedia research community. However, when a document (or document set) has explicit structure, it is usually possible to produce usable hypertexts automatically. Most work in the area [18, 8, 14] has concentrated on using features such as tables of contents and indexes to construct the hypermedia links. In this case we have a much coarser grained table of contents, namely the library catalogue, and in place of indexes, we have attributes such as keywords. However, our experiments have indicated that it is still feasible to produce a usable hypertext bundle, given a set of documents and their library catalogue records.

The hypertext system that we have used for these experiments is Microcosm [6,5], which has certain key advantages as an experimental workbench: it was developed at Southampton, so we have access to all its APIs, and it supports multiple methods of locating information. Microcosm produces hypertext applications which have a bias towards querying and information retrieval, rather than the simple "button pushing" hypertext that we have come to expect from some of the more popular authoring packages. This quality of hypertext lends itself to this application.

The methods Microcosm supports for locating information are represented in figure 2. The first method of access to the information uses the classification information. This is the information that is held in the Document Management System (DMS) and is almost exactly equivalent to the information that is held in a library catalogue. It contains the position(s) within the document hierarchy at which the file will be located, and it contains all the attributes, such as title, author, date of creation, physical media type and keywords. This information is of very high quality, as it provides specific information about documents, and Microcosm provides tools for traversing the subject hierarchy, and for querying the documents by attributes such as keywords.

We were able to create the records for the document management system by writing a few simple macros to export the information from the MS-Access version of the library catalogue. Microcosm supports its own "logical hierarchy", which is similar to a folder structure, and documents may be placed in one or more branches within this structure.

Our initial reaction was to mirror the librarian's classification hierarchy onto the Microcosm DMS. However, observation of the ways in which users of Microsoft's Encarta tend to make queries, for example by asking to see all videos about some topic, led us to believe that categorisation by physical type is just as important as classification by subject, so we have implemented two hierarchies: one for subject and one for physical type.

The second method of access is hypermedia link following. It is generally supposed that hypermedia links are manually authored in order to indicate some relationship between the two items at each end of the link. Such links are of very high value, but creating them requires considerable manual effort. However, the Multimedia Research Group at Southampton always considered that the reduction of this effort, and the automation of link creation, was an important research issue. One of the earliest results of this research was the introduction of generic links [7], which are links with a fixed end point, but which may start at any point where the source object is located. Typically this means any place where a particular text string is located.

We have taken all the keywords from the library catalogue, and automatically generated generic links to the top of each document for which this keyword was used. The result, from the user's point of view is that all occurrences of keywords appear as buttons within the text, and can be used to navigate to documents sharing this keyword.
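The generation step amounts to inverting the catalogue's keyword field. The sketch below shows one way this might be done; the function name and data layout are hypothetical, and it produces a plain table of keyword to destination documents rather than Microcosm's actual link format.

```python
from collections import defaultdict


def build_generic_links(catalogue):
    """Map each keyword to the documents catalogued under it.

    `catalogue` maps a document id to its list of keywords. Each entry in the
    result stands for generic links from any occurrence of the keyword to the
    top of each listed document.
    """
    links = defaultdict(list)
    for doc_id, keywords in sorted(catalogue.items()):
        for keyword in keywords:
            key = keyword.lower()  # treat "Geology" and "geology" as one source
            if doc_id not in links[key]:
                links[key].append(doc_id)
    return dict(links)


# Hypothetical catalogue extract.
keyword_catalogue = {
    "doc-a": ["Geology", "rocks"],
    "doc-b": ["geology"],
}
generic_links = build_generic_links(keyword_catalogue)
```

Here `generic_links["geology"]` lists both documents, so an occurrence of "geology" anywhere in the bundle would offer a choice of two destinations.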

The third method of data access is text based information retrieval. The interface to this functionality requires the user to make a selection of text from some document, or else to type in some text, and then choose "compute links" from the menu. The information retrieval engine will consult its indexes, and return a list of documents with the most similar vocabulary to the query. This is exactly the same engine as is used for our information retrieval tool in the catalogue query engine.

Results

We are in the process of producing our first prototype for the query engine, and the automatic Microcosm hypertext construction engine, by integrating various tools that have already been produced. We are testing it using a "digital library" consisting of a large number of on-line documents available within the Multimedia Laboratory. The bundles that the system produces clearly provide for a greater ease of navigation and information location than the simple raw set of documents would have provided, and subjectively we would claim that the clearly marked boundaries to the bundle give the user a greater sense of the scope of the information than would be provided by a number of references into a very large collection of data.

This system was built as a workbench for experimenting with various ideas within the digital libraries domain. It is clear that there is much room for improving our system, and we hope that over time various additions will be made. We would like to add a synonym generator. This would create further terms from the document keywords, and create further generic links on these terms. We would also like to integrate media based content retrieval and navigation [12, 16].

On a more complex level, we see the system as an ideal workbench for the development and testing of intelligent agent technology. There are many ways that such technology could be usefully deployed in this environment. Search agents could track users' interactions and attempt to mine for other relevant topics using background information retrieval, and link creation agents could attempt to create further links and trails through the bundle. We are already working on an agent to produce a "front-end document" as a kind of hypertext overview [15] of the bundle.

ISSUES CONCERNING CATALOGUES AND KEYWORDS

Academic libraries, in particular, are now generally highly computerised in their basic housekeeping processes and in presenting themselves via on-line public access catalogues (OPACs).

One of the strengths of traditional libraries is that many large collections of books and materials in other media are catalogued with great care and attention to bibliographic detail and accuracy, using standard cataloguing rules such as the Anglo-American Cataloguing Rules (AACR2). As Howard Rheingold points out, `Librarians and other specialists have a toolkit and syntax for dealing with well-known problems that people encounter in trying to make sense of large bodies of information' [19]. However, even key libraries such as the Library of Congress are constantly having to review their strategies and approaches to alternative methods of cataloguing, such as copy cataloguing and minimal cataloguing, in order to dramatically bring down their backlogs.

On-line journal abstracting databases also traditionally provide very thorough bibliographic detail created by professional graduate level staff. INSPEC, for example, can help you refine your search strategy with classification terms such as C7210L (INSPEC's classification for "library automation"), thesaurus terms such as "document image processing" and free terms such as "multimedia databases".

We wish to help maximise the return from this labour intensive work using our tools for producing and viewing bundles. The combination of a variety of materials from different sources may also provide a variety of classification schemes and types of subject headings. These can be added to the bundle's hierarchies. Although at the document level the classification may not easily convey a sufficient level of separation, nevertheless it can be used in the way a library user traditionally browses a library shelf. On the World Wide Web we see subject catalogues such as the WWW Virtual Library being developed to give a useful alternative way to access a huge amount of information. At the same time Internet Yellow Pages directories [10], while they become obsolete fairly quickly, still sell because they provide an overview of accessible items which is not yet so easily scanned from today's computer screen.

The options for locating information are varied, and different ones will suit different users at different times and also the same users at different levels of understanding. The level of an item (who it is intended for) has less frequently been emphasised in the past. It may perhaps be highlighted by a pointer such as a treatment code or a readership category, but this may become more important in the future with the enormous volumes generated by both paper and electronic publication. This possibility has been initially addressed by our reader level field.

The catalogue entries produced by the information specialist may give a degree of rigour to the information retrieval process. However, a multimedia bundle will frequently contain documents produced by and often further modified by the authors, particularly in the academic environment, which do not have the advantage of such in-depth indexing. One way to compensate for this might simply be to provide or point to a glossary or dictionary which helps the searchers choose their terms. The linking mechanisms of Microcosm do, in any case, give considerable help in retrieving relevant items. Free text terms used by the author are often more up to date in style and therefore more akin to the searcher's own vocabulary. The art is in balancing the effort required versus the result, particularly as terminology can change dramatically as one moves forwards or backwards in time. The effort also depends on how diffuse the intended audience is - a multimedia database with entirely local users need not be so concerned with the difference in terminology and synonyms between Britain and America, for example.

THE DELIVERY MECHANISMS

An important feature of the approach described in this paper, is that knowledge may be extracted from a library and handed to a user in some manageable piece of information: the user should be assured that the material is all relevant, that the subject matter is at an appropriate level and, most importantly, the quantity of information is manageable. If a ten year old school child has made a query on the term "geology" we do not wish to hand them the entire British Library entry on this topic - we probably need to explain the term, and give a few examples of the sort of work undertaken by geologists.

One way to help users to comprehend the boundaries of a set of materials is to extract the relevant materials, or copies of them, and present them to the user in some form that is manageable and familiar. This is what a book is. In attempting to simulate this effect within the digital library, we feel that it is important that the users can visualise the boundaries of their newly created hypertext material. This is why we produce bundles. In our experiments these bundles have been created in Microcosm, which provides an ideal environment for the production of a self-contained hypermedia application.

However, there is nothing that makes Microcosm essential to this application. The essential features are mechanisms for allowing users to search a hierarchy, search attributes, follow hypermedia links and carry out information retrieval. Microcosm's generic links may be simulated within a finite document set by scanning the source text for each occurrence of a string that is the source of a generic link, and marking it as a link in the format used for the host hypertext system.
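The scanning approach can be sketched as follows. HTML anchors are used here purely as an example host format, and the function name, link table and destination file names are hypothetical.

```python
import html
import re


def mark_generic_links(text, links):
    """Wrap each occurrence of a link source string in an HTML anchor.

    `links` maps a source string to a destination URL.
    """
    if not links:
        return html.escape(text)
    # Try longer strings first so "digital library" wins over "library".
    sources = sorted(links, key=len, reverse=True)
    pattern = re.compile(
        r"\b(" + "|".join(re.escape(s) for s in sources) + r")\b",
        re.IGNORECASE,
    )
    lookup = {source.lower(): url for source, url in links.items()}

    pieces = []
    last = 0
    for match in pattern.finditer(text):
        # Copy the unlinked text before this occurrence, then the anchor itself.
        pieces.append(html.escape(text[last:match.start()]))
        target = lookup[match.group(0).lower()]
        pieces.append(
            '<a href="{}">{}</a>'.format(html.escape(target), html.escape(match.group(0)))
        )
        last = match.end()
    pieces.append(html.escape(text[last:]))
    return "".join(pieces)


marked = mark_generic_links(
    "The digital library holds a library catalogue.",
    {"library": "lib.html", "digital library": "dl.html"},
)
```

Because the document set is finite, this pass can be run once over every document when the bundle is built, at the cost of having to rerun it whenever a new generic link is added.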

It would be perfectly feasible to produce a World Wide Web (html) version of this application. There might be some problems in converting data formats, but this route would have the advantage that its network delivery would be superior to the current version of Microcosm. However, making new links in html is rather more difficult than using Microcosm, and links cannot be made to objects within other media, so if the user is expecting to personalise and extend the delivered bundle, then perhaps this route is not so suitable. The development of a tool for building html documents from a Microcosm application is described in Hill et al. [11].

Another option might be to compile the resulting bundle into Microsoft's Multimedia Viewer. Again, although there might be some problems automatically converting data formats, and end user link making will not be possible, the compensation would be a superior front end to the delivery engine. The Multimedia Viewer will no doubt emerge as an industry standard for such presentations, particularly for the less computer literate user.

There are two issues which we have deliberately ignored in creating this experiment, but which should nevertheless be mentioned.

We have assumed that there is no copyright problem, and that any user may take an electronic copy of any set of documents away in a bundle. We recognise that there are problems with this approach, but others are addressing this issue [4] and, as part of a wider discussion on charging policy, we are intending to address the issue in our "Open Journal Framework" project for FIGIT. Until such time as the position is clearer, we do not feel that such matters should be a restriction to research into creating usable library technology. In the meantime, it is possible to imagine scenarios where references to the documents in the bundle are created in a private workspace within the domain of the digital library server, so that viewing the documents in the bundle is actually no different (legally or technologically) from viewing them using whatever software was provided by the server.

The second issue we have chosen to ignore is the price (in terms of both network transport time and financial cost) of getting a copy of a document. Paul Evan Peters, in his keynote address to Digital Libraries `94, advised that we should assume that the cost of networking is free. We have taken his advice and, being academics, we are, of course, sheltered from knowing the financial cost of anything. For use in a real library, it would be necessary to make these matters explicit to users, so that they were aware of the time it would be likely to take to download all the documents in their bundle, and the cost of the documents specified.

CONCLUSIONS

Initial experiments with our system have indicated that bundles are a manageable and appropriate method of delivering information from the digital library. This approach has the advantages that:

Downloading the resources from the server to the bundle is a one-off effort, thus minimising the strain on the server and communications network.

The hypermedia links and information retrieval enable better browsing than can be achieved within a simple list of documents.

The boundaries of the information are known to the user, and the information contained within the bundle has been selected as being of an appropriate level.

The approach need not be tied to any particular software or hardware delivery platform.

We have attempted to show that such bundles may be created by using the information that would be available from a standard OPAC. However, we have found that the documents described in such catalogues tend to be too large, such as whole books, and it is necessary to describe smaller chunks of documents such as chapters and pictures. This has knock on effects on the schemes for classifying and keywording information, that we find need to be finer grained and more diverse than is generally available using standard schemes. We have found user defined classifications and keywords to be helpful at the level of the bundle, but acknowledge that there would be difficulties if such anarchy was allowed at the digital library level.

The framework for creating and delivering bundles is a suitable workbench for investigating the behaviour of intelligent agents, both in the field of locating suitable information for placing in the bundle, and for creating links within the completed bundle.

The Authors

Hugh Davis is a lecturer in Computer Science at the University of Southampton, UK, and was a founder member of the multimedia research group. He was one of the inventors of the Microcosm open hypermedia system, and is manager of the Microcosm research laboratory. His research interests include data integrity in open hypermedia systems and the application of multimedia information retrieval techniques to corporate information systems and to digital libraries.

Jessie Hey is a chartered librarian/information specialist and qualified teacher who has worked in a variety of library/information roles at California Institute of Technology, CERN and Southampton Institute of Higher Education. This was followed by 12 years at IBM's UK Development Laboratory where her jobs included managing the technical and business information services and setting up an interactive learning centre. She is now pursuing postgraduate research with the Multimedia Research Group at the University of Southampton.

Acknowledgements

We would like to thank Professor Wendy Hall, the director of the Multimedia Research Group, for her inputs into this project, and the other members of the group, too numerous to mention, but especially Les Carr, for their help and ideas.

References

[1] Allen, R.B. Navigating and Searching in Hierarchical Digital Library Catalogs. In: Schnase, J.L., Leggett, J.J., Furuta, R.K. & Metcalfe, T. (eds.). The Proceedings of Digital Libraries '94. Texas A&M University, June 1994.
[2] Aslib. The Review of the Public Library Service in England and Wales for the Department of National Heritage. Final Report. The Association of Information Management, May 1995.
[3] Birmingham, W.P., Drabenstott, K.M., Frost, C.O., Warner, A.J. & Willis, K. The University of Michigan Digital Library: This Is Not Your Father's Library. In: Schnase, J.L., Leggett, J.J., Furuta, R.K. & Metcalfe, T. (eds.). The Proceedings of Digital Libraries '94. Texas A&M University, June 1994.
[4] Cornish, G.P. Electrocopying: Problems and Needs. In: Libraries and IT. Working Papers of the Information Technology Sub-committee of the HEFC's Libraries Review. UKOLN, 1993.
[5] Davis, H.C. Using Microcosm to access digital libraries. In: Schnase, J.L., Leggett, J.J., Furuta, R.K. & Metcalfe, T. (eds.). The Proceedings of Digital Libraries '94. Texas A&M University, June 1994.
[6] Davis, H.C., Hall, W., Heath, I., Hill, G. & Wilkins, R. Towards an Integrated Information Environment with Open Hypermedia Systems. In: Lucarella, D., Nanard, J., Nanard, M. & Paolini, P. (eds.). The Proceedings of the ACM Conference on Hypertext, ECHT '92 Milano, pp 181-190. ACM Press, 1992.
[7] Fountain, A.M., Hall, W., Heath, I. & Davis, H.C. MICROCOSM: An Open Model for Hypermedia With Dynamic Linking. In: Rizk, A., Streitz, N. & Andre, J. (eds.). Hypertext: Concepts, Systems and Applications. The Proceedings of The European Conference on Hypertext, INRIA, France. Cambridge University Press, 1990.
[8] Furuta, R., Plaisant, C. & Shneiderman, B. A Spectrum of Automatic Hypertext Constructions. Hypermedia 1(2), pp 179-195, 1989.
[9] Gladney, H.M., Fox, E.A., Ahmed, Z., Ashany, R., Belkin, N.J. & Zemankova, M. Digital Library: Gross Structure and Requirements: Report from March 1994 Workshop. In: Schnase, J.L., Leggett, J.J., Furuta, R.K. & Metcalfe, T. (eds.). The Proceedings of Digital Libraries '94. Texas A&M University, June 1994.
[10] Hahn, H. & Stout, R. The Internet Golden Directory, 2nd ed. Berkeley, Osborne McGraw-Hill, 1995.
[11] Hill, G.J., Hall, W., De Roure, D.C. & Carr, L.A. Applying Open Hypertext Principles to the World Wide Web. To be published in the Proceedings of the International Workshop on Hypermedia Design `95. Montpellier. Available from the authors as a Computer Science Technical Report. University of Southampton, 1995.
[12] Hirata, K., Hara, Y., Shibata, N. & Hirabayashi, F. Media-based navigation for hypermedia systems. In: The Fifth ACM Conference on Hypertext Proceedings '93, Seattle, Washington, pp 159-173. ACM, 1993.
[13] Joint Funding Councils' Library Review Group. JFC Library Review Group: Report. Bristol: HEFCE, 1993.
[14] Kahn, P. Linking Together Books: Experiments in Adapting Published Material into Intermedia Documents. Hypermedia 1(2), pp 111-145, 1989.
[15] Landow, G.P. & Kahn, P. Where's the Hypertext? The Dickens Web as a System-Independent Hypertext. In: Lucarella, D., Nanard, J., Nanard, M. & Paolini, P. (eds.). The Proceedings of the ACM Conference on Hypertext, ECHT '92 Milano, pp 149-160. ACM Press, 1992.
[16] Lewis, P.H., Davis, H.C., Griffiths, S., Hall, W. & Wilkins, R.J. Content Based Retrieval and Navigation with Images in the Microcosm Model. In: The Proceedings of MediaComm '95, Southampton, April 1995.
[17] Li, Z. Information Retrieval for Automatic Link Creation in Hypertext Systems. PhD Thesis, The University of Southampton, U.K., October 1993.
[18] Rahtz, S.P.Q., Carr, L.A. & Hall, W. Creating Multimedia Documents: hypertext processing. In: McAleese, R. & Green, C. (eds.). Hypertext: state of the art. intellect, 1990.
[19] Rheingold, H. The virtual community: homesteading on the electronic frontier. Reading, Mass.: Addison-Wesley, 1993.
[20] Salton, G., Yang, C.S. & Wong, A. A Vector Space Model for Automatic Indexing. Comm. ACM 18(11), pp 613-620, Nov. 1975.
[21] Stix, G. The Speed of Write. Scientific American 271(6), pp 72-77, Dec. 1994.