Optical Character Recognition (OCR)
Optical Character Recognition (OCR) is the automated process of converting the words in an image into machine-readable text that can be searched, displayed online, and indexed. When applied to historical newspapers, OCR can be exceptionally challenging.
OCR Makes History Searchable
Advantage offers an affordable solution to create a high volume of accessible digital images using an automated OCR process after scanning your community’s newspapers. Libraries and local communities can afford to procure access to decades of content as an enhancement to the preservation of their papers on microfilm. In addition to being keyword searchable, the digital archive can be indexed by City, County, State, Country, Title, Institution, Date, Page and more. As a result, the user has a searchable collection and content-browsing tool that is more practical, more accessible to the community, and far more efficient than film readers.
Although digitization coupled with automated OCR makes for a fantastic research tool, it must not be mistaken for a preservation tool. We are a preservation company first and foremost; everything begins and ends with the microfilming process. We transfer the fragile paper to film for long-term preservation and scan the microfilm to create digital files. These digital files are the fourth generation of the images. Three of the four generations of a newspaper page (printing, filming, and digitizing) can be separated by years, decades, or, in rare instances, centuries. The newspaper industry in the United States has evolved considerably over the last 300 years. Each development in typesetting methods, printing processes, and paper stock created unique challenges in adapting the digital image of old newspapers to a searchable format. In short, the OCR quality (and therefore the searchability of the newspaper) is solely dependent on the condition of the source material. This does not reflect a technology problem, nor is it a problem with. If the words on a page are not recognized by the software, it is most likely the result of a series of problems that began 300 years before the first computer was even invented.
One method of measuring OCR efficiency is gauging how accurately it determines which words are on the printed page. This is normally expressed as the percentage of words on the page that are accurately “read” by the software. Of course, “reading” a word entails piecing it together letter by letter; therefore, OCR accuracy is sometimes measured as the percentage of letters that are accurately “read.” It is important to note, however, that these two measures of accuracy are fundamentally different. Word accuracy is necessarily lower than letter accuracy, because a word is read correctly only if every one of its letters is read correctly. It is effectively the joint accuracy, or joint probability, of the letters in the word. For example, OCR accuracy at the letter level for a document may be 98 percent. If errors on individual letters are treated as independent, the accuracy for a five-letter word in that same document is 0.98 to the fifth power, or about 90 percent.
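The relationship between the two measures can be sketched in a few lines of Python. This is an illustration only, not a description of any particular OCR product: the function name `word_accuracy` is hypothetical, and the calculation assumes that errors on individual letters are independent of one another, which real scans only approximate.

```python
def word_accuracy(letter_accuracy: float, word_length: int) -> float:
    """Estimate the chance that every letter in a word is read correctly.

    Assumes each letter is read correctly with the same probability and
    that errors on different letters are independent (a simplification).
    """
    if not 0.0 <= letter_accuracy <= 1.0:
        raise ValueError("letter_accuracy must be between 0 and 1")
    if word_length < 1:
        raise ValueError("word_length must be at least 1")
    return letter_accuracy ** word_length


if __name__ == "__main__":
    # Letter accuracy of 98 percent, as in the example above.
    for length in (1, 5, 10):
        estimate = word_accuracy(0.98, length)
        print(f"{length:2d}-letter word: {estimate:.1%}")
```

Under this assumption, longer words are penalized more heavily: the same 98 percent letter accuracy gives roughly 90 percent for a five-letter word and roughly 82 percent for a ten-letter word.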
Most OCR software tools do not necessarily follow the logical arrangement of a newspaper’s multi-column, multi-sectioned layout. The software does, however, endeavor to identify zones with possible text so that OCR may be applied to these zones. There are two types of zones: graphic and text. Typically, the text zones are very accurately represented. At Advantage, we use 100 percent automation for both zoning and capture. The quality of the OCR is completely dependent upon the quality of the medium that is scanned, as is the case in every step preceding OCR.
The quality of the original image has several implications throughout an automated process. If the second- or third-generation images on the microfilm have deteriorated to any degree, the imperfections and poor image quality will interfere dramatically with the OCR process. Areas of text may be read as a graphic, and the spacing of columns or even letters may lack consistency, which leads to a “ballooning” of the number of terms that are submitted for a search. For example, a simple imperfection can close a “C,” transforming a word like “cat” into “oat”.
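The “ballooning” effect can be illustrated with a small Python sketch. The sample lines below are made up for the illustration and do not come from a real scan, and the helper name `distinct_terms` is hypothetical. The sketch simply counts the distinct whitespace-separated terms in a clean transcription and in a damaged OCR reading of the same two lines.

```python
def distinct_terms(lines):
    """Return the set of lowercase, whitespace-separated terms in the lines."""
    terms = set()
    for line in lines:
        terms.update(line.lower().split())
    return terms


# Made-up example data: what was printed versus what a damaged scan yields.
clean_lines = ["the cat sat", "the cat ran"]
ocr_lines = ["the oat sat", "t he cat r an"]  # closed "c", stray spaces

clean_terms = distinct_terms(clean_lines)
ocr_terms = distinct_terms(ocr_lines)

# Terms that exist only because of OCR errors.
spurious_terms = ocr_terms - clean_terms

if __name__ == "__main__":
    print("clean terms:", len(clean_terms))
    print("OCR terms:", len(ocr_terms))
    print("spurious terms:", sorted(spurious_terms))
```

In this toy case the term list grows from four entries to eight, five of which are spurious, while the real word “ran” is no longer findable at all.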
When these problems are combined with unusual fonts, faded printing, shaded backgrounds, fragmented letters, skewed text, curved lines and bleed-through on the originals, OCR accuracy will be far less than 50 percent on most historical documents. We are able to manually correct OCR, but this would be a cost-prohibitive process for libraries given their budget constraints.
Despite these challenges, the benefits of OCR far outweigh the benefits of using microfilm for research or content-based inquiries. Microfilm provides many advantages for long-term archiving and preservation of content, but a quick search to find information, such as a name, is not among them. OCR eliminates the need to tirelessly inspect each page, and scour each word, to locate a mere tidbit of information within a microfilm reel. Furthermore, with microfilm, one must already know the City, State, Title, and Date of the information sought in order to locate the reel in the first place. If the user is unable to find an item by conducting a search due to poor OCR returns, the digital images are still indexed by City, State, Title and Date. As a result, the user still has a content-browsing tool, and the process operates more efficiently than film readers.