Mixed-Media Access

Francine Chen, Marti Hearst, Julian Kupiec, Jan Pedersen, and Lynn Wilcox
Xerox Palo Alto Research Center, 3333 Coyote Hill Road, Palo Alto, CA 94304
{fchen,hearst,kupiec,pedersen,wilcox}@parc.xerox.com


Effective information access is crucial for any digital library: rich online document repositories are of little use without a methodology for finding items of interest. In our view, an effective information access system enables one to search for and present information in a variety of ways, making use of conventional search and browsing methods as well as less conventional methods involving automated highlighting, emphasis detection, thematic thread-following, and summarization. Moreover, these multiple access mechanisms should operate seamlessly and robustly over multimedia document types. Our work to date integrates access to information in a wide variety of digital media, including scanned text, scanned images, digitized audio, and digitized video, as well as traditional plain-text collections. Below we outline recent work that illustrates our approach to multimedia access.

Mixed-Media Keyword Search

Keyword search and similarity search, currently used to access textual information, can be directly extended to other media types. Plain text is usually input with a keyboard and retrieved via a query posed using the same device. Analogously, digitized spoken text can be queried by spoken keyword input; [WB92] describes work in this vein. Extending the analogy, scanned textual images can be searched by selecting a region of the image containing the desired keyword; [KB90] describes how this can be accomplished.

We are also developing retrieval techniques that cross media. In our word-image spotting system, partially specified keywords or phrases typed by the user are detected and located in scanned images, implementing a kind of image "grep" [CWB93]. We have also built a system that enables a user to retrieve documents from a large plain-text corpus by speaking the words of a query. The system exploits the fact that word alternatives arising from recognition errors are unlikely to be semantically correlated, whereas the words the speaker intended are semantically related and generally occur close together in text [KKB94].
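The co-occurrence filtering step can be sketched in a few lines. Everything below (the tiny corpus, the alternative lists, the substring matching) is an invented toy standing in for the actual system of [KKB94], which works over a large corpus and real recognizer output:

```python
from itertools import product

# Invented toy corpus; the real system searched a large plain-text collection.
docs = [
    "the patient showed symptoms of cardiac arrest",
    "stock prices fell after the market opened",
    "cardiac surgery requires a skilled team",
]

# Hypothetical recognizer output: each spoken query word yields a list
# of alternatives, only one of which is the word the speaker intended.
alternatives = [
    ["cardiac", "carted"],   # spoken: "cardiac"
    ["arrest", "a rest"],    # spoken: "arrest"
]

def cooccurrence_score(words, docs):
    """Count documents containing every word in the combination."""
    return sum(all(w in d for w in words) for d in docs)

# The intended words co-occur in documents; recognition errors rarely do,
# so the combination with the best corpus support is chosen as the query.
best = max(product(*alternatives), key=lambda ws: cooccurrence_score(ws, docs))
print(best)  # ('cardiac', 'arrest')
```

Here the erroneous alternatives "carted" and "a rest" find no joint corpus support, so the intended query survives the filter.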

Information Threads

Rather than viewing information as an uninterrupted sequence, we are developing methods to identify information threads in multiple media. For example, audio may comprise multiple sound sources, as in a conversation between two or more people. We have developed a method for segmenting an audio stream based on speaker identification [WCKB94]. Similarly, video contains scene changes that may signal a change of topic or speaker, and we are developing a method for identifying scene changes from the video signal. We are also developing techniques for identifying changes of topic in plain text [hearst94]; these methods may also be applicable to scanned images of text and to conversations.
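A minimal sketch of topic segmentation via lexical cohesion, in the spirit of [hearst94]: adjacent text blocks that share vocabulary likely belong to the same topic, and a boundary is hypothesized where similarity dips. The sentences below are invented, and the real method compares multi-sentence blocks with smoothing rather than single sentences:

```python
import math
from collections import Counter

# Invented four-sentence "document" with one topic shift.
sentences = [
    "whales are large marine mammals",
    "whales migrate across entire oceans",
    "compilers translate source code",
    "a compiler emits machine code",
]

def cosine(a, b):
    """Cosine similarity between two word-count vectors."""
    num = sum(a[w] * b[w] for w in set(a) & set(b))
    den = (math.sqrt(sum(v * v for v in a.values()))
           * math.sqrt(sum(v * v for v in b.values())))
    return num / den if den else 0.0

vecs = [Counter(s.split()) for s in sentences]
sims = [cosine(vecs[i], vecs[i + 1]) for i in range(len(vecs) - 1)]

# Hypothesize a topic boundary at the deepest similarity valley.
boundary = sims.index(min(sims)) + 1
print(boundary)  # 2: the topic shifts before the third sentence
```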

Mixed-Media Browsing and Summarization

The user's goals should determine what kind of information is displayed after a search. When scanning for specific information, presenting a portion of a document may be all that is needed. When trying to get an overview of a sample of documents, a summary of the material may be more appropriate. When trying to determine what a corpus contains, tools for browsing may be desired. We are developing such tools for a variety of media.

Browsing provides a way to view the contents of a text collection without requiring the user to input search terms. We have developed Scatter/Gather, an unsupervised method for organizing the contents of very large text collections [CKPT92]. The method scatters the collection into semantically coherent clusters that can be browsed; the user gathers a subset of interest, and the system re-scatters that subset into new clusters, so browsing can proceed to successively finer views of the collection.
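One scatter/gather iteration can be sketched as below. The nearest-seed assignment is a crude stand-in for the actual clustering algorithms of [CKPT92], and the documents, seed choice, and similarity measure are all invented for illustration:

```python
import math
from collections import Counter

docs = [
    "jazz music saxophone",
    "rock music guitar",
    "python code programming",
    "java code programming",
]

def vec(d):
    return Counter(d.split())

def cosine(a, b):
    num = sum(a[w] * b[w] for w in set(a) & set(b))
    den = (math.sqrt(sum(v * v for v in a.values()))
           * math.sqrt(sum(v * v for v in b.values())))
    return num / den if den else 0.0

def scatter(docs, k):
    """Partition docs into k clusters by similarity to k evenly spaced
    seed documents (a toy stand-in for real clustering)."""
    seeds = docs[::max(1, len(docs) // k)][:k]
    clusters = [[] for _ in range(k)]
    for d in docs:
        best = max(range(k), key=lambda i: cosine(vec(d), vec(seeds[i])))
        clusters[best].append(d)
    return clusters

# Scatter the collection, "gather" the cluster the user selects,
# then re-scatter that subset for a finer-grained view.
clusters = scatter(docs, 2)
gathered = clusters[0]          # suppose the user picks the music cluster
subclusters = scatter(gathered, 2)
```

The loop structure, not the toy clustering, is the point: each gather step narrows the collection, and each scatter step re-exposes its internal structure.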

We have also developed methods for selecting excerpts to create summaries of documents. One method summarizes plain-text documents and has been extended to scanned images of text. We have also developed a summarizer for audio that uses prosodic cues to detect emphatic (and hence likely important) excerpts, which are then combined into a summary [CW92]. In addition, we have developed methods to automatically partition an audio signal by speaker identity to enable quick scanning and browsing [WCKB94].
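The flavor of excerpt-based summarization can be shown with a simple frequency-scored sentence selector; this toy (including the sample text and stopword list) is invented and much cruder than the methods above, but it illustrates selecting the most informative excerpts and presenting them in document order:

```python
import re
from collections import Counter

# Invented sample text; real summarizers operate on whole documents.
text = ("Information access is central to digital libraries. "
        "We study access to text, audio, and video. "
        "The weather was pleasant that spring. "
        "Summaries combine informative excerpts of a document.")

STOP = {"is", "to", "the", "a", "of", "we", "that", "was", "and"}

sentences = re.split(r"(?<=\.)\s+", text)
words = [w for s in sentences
         for w in re.findall(r"[a-z]+", s.lower()) if w not in STOP]
freq = Counter(words)

def score(sent):
    """Sum the corpus frequencies of a sentence's content words."""
    return sum(freq[w] for w in re.findall(r"[a-z]+", sent.lower())
               if w not in STOP)

# Keep the two highest-scoring sentences, restored to document order.
top = sorted(sorted(sentences, key=score, reverse=True)[:2],
             key=sentences.index)
summary = " ".join(top)
print(summary)
```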


We have developed many of the components of a system that will provide multiple search and viewing techniques for multimedia information. Our approach is to integrate access methods across various media types and to provide mixed-media access when appropriate. In the future we plan to determine how best to combine various search and display techniques to create a seamless interface to multimedia information.


References

[CWB93] Francine R. Chen, Lynn D. Wilcox, and Dan S. Bloomberg. Detecting and locating partially specified keywords in scanned images using hidden Markov models. In Proceedings of the International Conference on Document Analysis and Recognition, Tsukuba Science City, Japan, October 1993.

[CW92] Francine R. Chen and Margaret M. Withgott. The use of emphasis to automatically summarize a spoken discourse. In Proceedings of the International Conference on Acoustics, Speech and Signal Processing, San Francisco, CA, March 1992.

[CKPT92] D. R. Cutting, D. R. Karger, J. O. Pedersen, and J. W. Tukey. Scatter/gather: A cluster-based approach to browsing large document collections. In Proceedings of SIGIR'92, Copenhagen, Denmark, June 1992. Also available as Xerox PARC technical report SSL-92-02.

[hearst94] Marti A. Hearst. Multi-paragraph segmentation of expository text. In Proceedings of the 32nd Annual Meeting of the Association for Computational Linguistics, Las Cruces, NM, 1994. To appear.

[KB90] Gary Kopec and Steve Bagley. Editing images of text. In Proceedings of Electronic Publishing '90, Cambridge, England, 1990. Cambridge University Press.

[KKB94] Julian Kupiec, Don Kimber, and Vijay Balasubramanian. Speech-based retrieval using semantic co-occurrence filtering. In Proceedings of the ARPA Human Language Technology Workshop, Plainsboro, NJ, March 1994.

[WB92] Lynn D. Wilcox and Marcia A. Bush. Training and search algorithms for an interactive wordspotting system. In Proceedings of the International Conference on Acoustics, Speech and Signal Processing, San Francisco, CA, March 1992.

[WCKB94] Lynn D. Wilcox, Francine R. Chen, Don Kimber, and Vijay Balasubramanian. Segmentation of speech using speaker identification. In Proceedings of the International Conference on Acoustics, Speech and Signal Processing, Adelaide, Australia, April 1994.
