This excerpt from an article posted by The Rolfe Library makes some very good points about OCR and the challenges of searching for specific content in historical newspapers found online:
It is important to note that optical character recognition (OCR) software used to scan and convert hard-copy text to a searchable electronic format cannot accurately recognize every text character. For example, if a font utilized in a newspaper is not standard but instead is rather stylized, it likely will not be recognized by the software. Likewise, if an area of a page is smudged or otherwise damaged, the character recognition software will not recognize words in that poor quality area of a page. In such cases, a search will not return a result for a term in which such a character was not correctly recognized.
You can read the article here: Rolfe Library
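The point made in the excerpt can be illustrated with a minimal sketch. It assumes a simple case-insensitive exact-match search, which is not necessarily how any particular newspaper archive searches its text. The function name `search_pages` and the sample page text are hypothetical, made up for illustration only.

```python
def search_pages(pages, term):
    """Return the indices of pages whose OCR text contains the term.

    This is a plain case-insensitive substring match, used here only
    as an assumed stand-in for a real archive's search.
    """
    needle = term.lower()
    return [i for i, text in enumerate(pages) if needle in text.lower()]


# Hypothetical OCR output for two pages that both printed the word "Harvest".
pages = [
    "County Harvest Festival opens Saturday",  # recognized correctly
    "County Harve5t Festival draws crowds",    # smudged "s" misread as "5"
]

hits = search_pages(pages, "harvest")
print(hits)  # only the first page (index 0) is found
```

The second page printed the same word, but because one character was misread, the search term no longer matches it and the page is silently left out of the results.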
Because the source material can range from a few years to a few hundred years old, newspapers have always presented a unique challenge when it comes to making their content widely accessible via the web or through other digital portals. The original quality of the newspaper at the time of filming, and how that microfilm was stored over the years, play a large part in how it will translate digitally. This is true not only of image quality, but also of searchability.
Advantage Preservation’s process involves scanning the newspaper microfilm with the emphasis placed on the text. Because of its high contrast, the bi-tonal scan may not create images as “pretty” as some may expect (most noticeably in photos), but it helps to some degree in the OCR process.
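As a rough illustration of what a bi-tonal scan does, the sketch below reduces grayscale pixel values to pure black or pure white. It assumes a single fixed threshold of 128. The actual settings used in Advantage Preservation’s process are not described here, and the function name `to_bitonal` and the sample pixel values are hypothetical.

```python
def to_bitonal(gray_rows, threshold=128):
    """Map each grayscale pixel (0 = black, 255 = white) to pure black or white.

    The fixed threshold is an assumption made for illustration only.
    """
    return [[255 if px >= threshold else 0 for px in row] for row in gray_rows]


# Hypothetical pixel rows: dark ink on light paper, and a photo's smooth gradient.
text_row = [240, 30, 25, 235, 40, 245]
photo_row = [60, 90, 120, 150, 180, 210]

print(to_bitonal([text_row]))   # ink and paper stay clearly separated
print(to_bitonal([photo_row]))  # the gradient collapses to a hard edge
```

Text survives this kind of conversion well because ink and paper are already far apart in brightness, while the middle tones of a photograph are lost. That is why photos look harsher in the scanned image even though the characters remain crisp.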
Digitization is a process used to enhance the usability of an institution’s microfilm collection by allowing the content to be searched. The goal is to balance quality, quantity and value to make this process budget-friendly.
Visit the archive at: Rolfe Library