Data Retrieval and the Realities of Document Conversion
William B. Cavnar and Andrew M. Gillies
Environmental Research Institute of Michigan P.O. Box 134001, Ann Arbor MI 48113-4001, email@example.com
of the main thrusts of the Digital Libraries Initiative is the conversion of
large numbers of paper and microfilm documents into electronic form using
scanning coupled with optical character recognition OCR. There are commercial
ventures attempting similar projects, such as producing CD-ROMs having both
document images and text derived from OCR of the images. Because of the
inevitable problems with the recognition, these efforts require expensive
manual cleanup and indexing of the generated text in order to provide some
reasonable access to the documents.
These problems will not be trivial to solve. Electronic documents produced by
scanning and OCR, even under ideal circumstances, can contain ten to twenty
recognition errors per page of text, and significantly more if the document
image is in some way degraded. OCR errors are especially prevalent in documents
having some image quality problems, such as discolored backgrounds and faded
ink due to aging in paper documents, and noise and low contrast in microfilm
documents. Those particular documents may then become inaccessible using
conventional retrieval on their OCR results.
Another important area that needs considerable attention is that of converting
non-text portions of documents, e.g., images, graphs, diagrams and tables, into
retrievable electronic form. Conventional OCR is of no use in accessing text in
these portions of documents. Retrieval methods are also required for
A third area that has gotten relatively little attention is the conversion of
hand-written documents. Although these are obviously much harder than
machine-printed documents, they are still important, especially in dealing with
Our research group at the Environmental Research Institute of Michigan ERIM has
been pursuing a number of different research efforts that address these data
conversion problems. All of these efforts have in common an emphasis on inexact
matching. In other words, we build all our databases and indexes under the
assumption that there will be errors that prevent perfect matches, and that our
systems need to deal with this reality. We feel that application of inexact
matching techniques is crucial to accessing the products of large-scale
document conversion projects.
2.0. N-Gram-Based Inexact Matching
approach to text retrieval that overcomes some of these problem with electronic
text generated by a conventional OCR is to use various forms of
high-performance N-gram analysis on the documents. In this context, an N-gram
is an N-character slice of a character string. By treating a word not as a
single unit but as a set of overlapping N-grams, this approach can partially
overcome the problems mentioned above. The natural redundancy present in the
overlapping N-grams provides a means for accessing text that has these kinds of
errors. Suppose a word in a document, RETRIEVAL, has been misrecognized
by an OCR system in a given passage of text as a second word, RETBIEUAL.
A conventional word-based indexing would have missed a passage containing this
misrecognized form of the word. However, by breaking the text passage into its
constituent N-grams, we find that the two words share 6 out of 10 possible
bi-grams. Another advantage is that N-gram analysis essentially provides word
stemming for free. For example, RETRIEVAL and RETRIEVING share 7
out of 10 bi-grams. This 'free' word stemming becomes even more attractive in
a multi-lingual environment since the system can perform this kind of analysis
without any language-specific knowledge at all. Both aspects, retrievability
and word-stemming, are completely automatic, and require no manual cleanup of
We have been working with several N-gram-based representations of documents
which have advantages in different cases. One method uses super-imposed coding
techniques to produce compact indexes for fast retrieval of sub-documents given
a query string [Cavnar92], [Cavnar93]. Another approach uses N-gram frequency
statistics to characterize a document for matching purposes [Cavnar94]. Both
methods are active research areas at this time.
3.0. Oversegmentation-Based Word Matching
approach we have been pursuing is recognizing and matching characters in the
context of whole words. ERIM has developed an extremely successful approach to
word recognition based on text oversegmentation. In this approach, words are
segmented into primitive elements which are character-sized or smaller. These
elements are then reassembled into characters and words using feedback from a
neural-network-based recognition system. The method makes use of domain
specific lexicons, that is, word lists, to constrain the recognition process.
Using these techniques, a practical system for handwritten mail sorting was
developed [Gillies92], [Ganz92]. An early prototype of the system was able to
sort 34% of the hand-written mail with a 3.4% error rate.
In February, 1994, a more advanced version of this technology was used in a
competitive test of handwritten OCR systems conducted by NIST for the Census
Bureau. In the test, 9000 occupation fields from 1990 census forms were
processed. The system used a lexicon of 8000 occupation-related words and
phrases containing the correct response for 62% of the 9000 fields. The
remaining fields contained responses which were not present in the lexicon.
Using a confidence-based rejection criterion, the ERIM system was able to
recognize 40% of the fields with a 3.6% error rate. An error was counted if the
response differed from the correct response in any fashion. The average field
contained less than one character error. ERIM's system was the most accurate of
the eight systems tested, with error rates significantly lower than the nearest
More recently, ERIM has demonstrated a system for keyword-based retrieval of
documents from a handwritten document database. The system makes use of the
oversegmentation technology to prepare documents for entry into the database.
The operator can query the database for documents containing key words or
phrases. The system presents the operator with an image of the retrieved
documents, allowing for verification and perusal of document contents. This
technique is also applicable to recognizing machine-printed text. Currently, we
are employing the oversegmentation-based matching in recognizing
machine-printed Arabic words. This work is in progress, but is already showing
very promising results.
4.0. Image-Based Retrieval
another class of approaches that we have recently begun investigating bypasses
OCR altogether by attempting to perform matching in the image domain. One early
system we developed retrieved technical abstracts by converting query words
into image surrogates, and then matching them against image surrogates of the
words in the abstract. We are also in the preliminary stages of development of
several systems that query image databases using image queries for medical and
radar imagery. These systems allow the user to retrieve documents based on
their containing sub-images like those provided in the query. Image-based
retrieval must provide inexact matching to have any hope of success.
[Cavnar92] Cavnar, William B. and Vayda, Alan J., "Using Superimposed Coding of
N-gram Lists For Efficient Inexact Matching", Proceedings of the Fifth USPS
Advanced Technology Conference, Washington D.C., 1992, pp. 253-267.
[Cavnar93] Cavnar, William B., "N-Gram-Based Text Filtering For TREC-2," to
appear in the proceedings of The Second Text REtrieval Conference TREC-2, ed.
by, Harman, D.K., NIST, Gaithersburg, Maryland, 1993.
[Cavnar94] Cavnar, William B., and Trenkle, John M., "N-Gram-Based Text
Categorization", to appear in the Proceedings of the 1994 Symposium On Document
Analysis and Information Retrieval, University of Nevada, Las Vegas, April
[Ganz92] Ganzberger, Margaret J., Rovner, Richard M., Hepp, Daniel J.,
Gillies, Andrew M., Lake, CharlesD., and Wood, Cyrus R., "A System For
Handwritten Address Interpretation",Proceedings of the Fifth USPS Advanced
Technology Conference, Washington D.C., 1992, pp. 337-351.
[Gillies92] Gillies, Andrew M., "Cursive Word Recognition Using Hidden Markov
Models," Proceedings of the Fifth USPS Advanced Technology Conference,
Washington D.C., 1992, pp. 557-562.