Making Old Newspapers Searchable: The Beginning
The newspaper industry in the United States has evolved considerably over the last 300 years. Each development in the typesetting methods, printing process, and paper stock created unique challenges in adapting the digital image of old newspapers to a searchable format.
Digitization Of Historical Newspapers & OCR (Optical Character Recognition)
“What history buffs want is easy, comprehensive digital searching from anywhere. Access in person in Des Moines or Iowa City to the state’s collection can’t compete with that modern standard, no matter how convenient the hours of operation.”
-Kyle Munson, The Des Moines Register.
Mr. Munson is right. We understand the appeal and importance of digitization, which will supplement the microfilming as a means to offer more practical access to members of our community. Our view is twofold: Microfilming ensures preservation; digitization ensures access and outreach. We will explore ways to facilitate access to the entire digital collection in every library and school in the state of Iowa.
Mr. Munson further addresses concerns associated with the digitization process.
“The digitization process doesn’t capture every single word on every page, depending on image quality. Kiley estimated that perhaps half the content ends up searchable by keyword, to say nothing of photos. So there could be more mentions of my mom and grandparents in the Stuart newspaper that I would find only by manually poring over each and every page.”
We are a preservation company first and foremost; everything begins and ends with the microfilming process. We transfer the fragile paper to film for long-term preservation, then scan the microfilm to create digital files. These digital files are the fourth generation of the images. Three of the four steps (printing, filming, and digitizing) can be separated by years, decades, or, in rare instances, centuries.
Mr. Munson’s concern is understandable. However, for the sake of context, it is important to understand the nature of the materials and each of the three generations of images we work with, as well as the challenges associated with Optical Character Recognition, especially for old newspapers. To fully comprehend these challenges, one must engage in a fairly extensive history lesson. It is also critical to understand digital image creation, particularly what the process entails to bring printed words on a 200-year-old piece of paper to your computer screen.
In short, this is not a technology problem, nor is it a problem with OCR. Poor searchability is the result of a series of problems that began 300 years before the first computer was even invented.
The process of creating, preserving and ensuring access to newspapers entails four phases. Each phase, or step, in the newspaper’s lifecycle represents another step away from the original copy:
- 1st Generation: Publishing (the typesetting, layout and methods of creation)
- 2nd Generation: Printing (the production of the publication onto paper)
- 3rd Generation: Microfilm (the preservation phase)
- 4th Generation: Digital Image (the scanning of the microfilm)
Before we can even discuss the digital image, we have to go all the way back to the original source material. To understand the source material, we must first understand a little bit of American newspaper history.
In 1689, newspapers called broadsides were published in the English colonies, but the first multi-page newspaper, the Publick Occurrences, was published in Boston in 1690. Because it was published without authority, the Colonial government arrested the publisher and ordered all copies of the paper destroyed. A copy was discovered in 1845 in the British Library and is the only known copy.
The available images of this paper are remarkable, considering that there were 155 years between its publication and discovery date. Since microfilming was not used for newspaper preservation until the late 1930’s, it is safe to assume that from the time this paper was published to the time it was captured to microfilm, over 240 years had passed. Another 75 years likely passed before the image was digitized. This paper is a “best case scenario” and an extraordinary example of preservation of the hardcopy newspaper by the British Library.
Most papers printed since the late 1800’s were not preserved nearly as well. One reason for the fine quality of the Publick Occurrences is that the only medium used up until the 19th century was “rag” paper. Rags were beaten to pulp using mortar and pestle. This handmade paper is often found in good condition due to the neutral pH content of cotton and linen fibers. Documents written on rag paper were significantly more stable than the wood pulp paper stock that would be used in later years. Cotton fiber is still used today, but only for specialty applications such as currency.
This page is a pristine specimen, but it is also an example of why OCR is such a challenge for newspapers.
The Publick Occurrences and newspapers of the following 200 years often used typefaces that were not uniform, and even mixed different scripts within the same sentence.
Another problem is what we refer to as the “S/F” issue. It is caused by the “medial s” (also called the long s, ſ), a letterform that looks similar to an f. This was commonly used in newspapers until nearly 1900. The “medial s” usually appears at the beginning or in the middle of a word, while the “terminal s” was used to finish one off. In practice, this rule was not strictly adhered to, which makes the S/F issue even harder for a computer to resolve.
This portion of the Publick Occurrences illustrates the issue quite nicely. The following hand-corrected excerpt, which demonstrates the S/F issue, contains a total of 313 words:
IT is deſigned, that the Countrey ſhall be furniſhed once a moneth ( or if any Glut of Occurrences happen, oftener, ) with an Account of ſuch conſiderable things as have arrived unto our Notice. In order hereunto, the Publiſher will take what pains he can to obtain a Faithful Relation of all ſuch things; and will particularly make himſelf beholden to ſuch Perſons in Boſton whom he Knows to have been for their own uſe the diligent Obſervers of ſuch matters. That which is herein propoſed, is, Firſt, That Memorable Occurrents of Divine providence may not be neglected or forgotten, as they too often are. Secondly, That people every where may better underſtand the Circumstances of Publique Affairs, both abroad and at home; which may not only direct their Thoughts at all times, but at ſome times alſo to to aſſiſt their Buſineſſes and Negotiations. Thirdly, That ſome thing may be done towards the Curing, or at leaſt the Charming, of that Spirit of Lying, which prevails amongſt us wherefore nothing ſhall be entered, but what we have reaſon to believe is true, repairing to the beſt fountains for our Information. And when there appears any material miſtake in any thing that is collected, it ſhall be corrected in the next. Moreoverthe Publiſher of theſe Occurrences is willing to engage, that whereas, there are many Falſe Reports, maliciouſly made, and ſpread among us, if any well-minded perſon will be at the pains to trace any such false Report ſo far as to find out and Convict the Firſt Raiſer of it, he will in this Paper ( unleſs juſt Advice be given to the contrary ) expoſe the Name of ſuch perſon, as A malicious Raiſer of a falſe Report. It is ſuppos’d that none will dislike this Propoſal, but ſuch as intend to be guilty of ſo villainous a Crime
If we remove the highlighted “non-words,” the word count drops to 262. That is a loss of indexable words of about 16 percent (51 of 313), due solely to the S/F issue. Even if the technology had existed in 1690 to digitally scan the original newspaper on the day it was printed, these factors would still make OCR difficult.
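To make the counting concrete, here is a minimal Python sketch of how the S/F loss could be measured on a hand-corrected transcription. It is an illustration only, not the process we use: the function names and the short sample sentence (taken from the excerpt above) are chosen for the example, and the tokenizer simply splits on whitespace and trims punctuation.

```python
import re

LONG_S = "\u017f"  # Unicode LATIN SMALL LETTER LONG S, 'ſ'


def tokenize(text):
    """Split on whitespace and trim leading/trailing punctuation from each token."""
    trimmed = (re.sub(r"^\W+|\W+$", "", tok) for tok in text.split())
    return [tok for tok in trimmed if tok]


def long_s_stats(text):
    """Return (total tokens, tokens containing a long s)."""
    tokens = tokenize(text)
    affected = [tok for tok in tokens if LONG_S in tok]
    return len(tokens), len(affected)


def normalize_long_s(text):
    """Replace every long s with a modern 's' so the words become searchable."""
    return text.replace(LONG_S, "s")


# Short sample from the hand-corrected excerpt (illustrative only).
sample = ("the Publiſher will take what pains he can to obtain "
          "a Faithful Relation of all ſuch things")

total, affected = long_s_stats(sample)
print(f"{affected} of {total} tokens contain a long s")
print(normalize_long_s(sample))

# The same arithmetic applied to the counts quoted in this article.
article_loss = (313 - 262) / 313
print(f"share of words lost to the S/F issue: {article_loss:.1%}")
```

On a clean transcription the fix is a one-character substitution. The difficulty described in this article arises because OCR does not emit the long s at all; it usually guesses "f" or something worse.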
The following is the actual OCR output of the same excerpt of the Publick Occurrences without manual corrections:
is iefipKi^ that the Countrty JItMS be fur* nifhed 9fiee m mrpt€th ()nr if any Glut o/0c« curtcnccs happen^ ofccncr, ) vi(h an Ac* count of fkch confiderahU things as have aT” ysvnd unto otsr Notice. . In order hereunto^ the Puhlifljer wiH takf v>hat fMws he can to obtAin a Faithful Rebtion of all ftich things • and mil particularly make himfclf htholden to fuchPerfons i> Bofton xcbom be Knows 10 have been for their own afe the diligent Obfer* fiCrs of fuch flatters. That whkh is i)erein fropofed^ is^ Firft, That Memorable Occurrtnti ofDivine Providence rmy not he ntglc^td or forgotten at they too ofttn ate. Secondly, Thatpeo^Uevtryvphere mayhv” ter uffderftand the CircHnifiofices of Fubltque Af* 0frs^ both nhr(}4d und at home ‘j vhith rmty not enly direU their Thou^hts\« all. tinsts^ butut fcrrtje; simet alfo to ajpfi their BuCaefles and Negotiations* Thirdljr, Thitt fome thing may be done towards the Curing, or at leafi the Charming oftb^t S\ i- rix of l.ying, rohtch prevails: amongfi us^ whe s* fore nothing jh4l ki thered, hnt what m have reafon tohelteve is true^ repairing to the l^tflfonn. tJUhs for our Information. And when then af^ pears any material miftake in any thing that ts colleHed^ it fnMl be corrcded j« the next. Moreover^ the PubUjker, of theft Occurrences iswilling to eugage^t.that whereas^ there are ma* t!jVz\k Kt\>oti%^ malicioufly inade^ and fpread 4moiig Hs^ tfany well- minded pexfon will beat the fains to trace any fkch f^lfe Report fofar as to find oHt and CenvtStjhe Firft Raifer 0/ if, he “pill in this Paper ( nnlefs jufl Advice be given to iothe contrary ) expofe the Name of fuch perfon^ AS- Amal i^iious Raifer of a falfe Report. Is it Jfippos^d that none wili dislike this Propofa!^ k»tt fidch
The number of “words” that would be indexed on the page is 300; however, none of these “words” would likely lead to a search result. If you remove the “non-words” and the S/F words from the OCR output, you are left with:
is that the be if any Glut happen an count of things as have unto Notice. In order the he can to a Faithful of all things and particularly make to be Knows 10 have been for their own the diligent of That is is That Memorable Providence not he or forgotten at they too ate. Secondly, the of both und at home not their all. to their and Negotiations thing may be done towards the Curing, or at the Charming of prevails: us fore nothing what s true repairing to for our Information. And when then pears any material in any thing that it be the next. Moreover the of theft Occurrences to whereas there are and well- minded will beat the to trace any Report as to find and if, he “pill in this Paper Advice be given to the Name of of a Report. Is it that none dislike this a
The actual number of words remaining is 154, about 50 percent of the original 313 words. While the page is legible to the human eye, the varied typesetting, blotchy ink, and imperfections in the paper cause considerable loss in the OCR process.
Over the next few days we will be discussing the source material and how paper, typesetting, and the printing process have evolved over the last two and a half centuries. The evolution of this industry creates a unique set of challenges for the OCR process.
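The filtering step described above can be sketched in a few lines of Python. This is a hedged illustration under stated assumptions, not our production workflow: `VOCAB` is a tiny hypothetical word list standing in for a full dictionary, and the repair strategy (trying "s" in place of each "f") is one simple way to recover S/F words, named here only for the example.

```python
import re
from itertools import product

# Hypothetical mini word list; a real index would use a full,
# period-appropriate dictionary.
VOCAB = {"such", "persons", "first", "false", "report",
         "the", "of", "a", "raiser", "himself"}


def clean(token):
    """Trim surrounding punctuation and lowercase the token."""
    return re.sub(r"^\W+|\W+$", "", token).lower()


def s_f_candidates(word):
    """Yield every spelling made by swapping any subset of 'f' letters for 's'.

    The number of candidates doubles with each 'f', which is acceptable
    for ordinary word lengths.
    """
    positions = [i for i, ch in enumerate(word) if ch == "f"]
    for choice in product("fs", repeat=len(positions)):
        chars = list(word)
        for pos, ch in zip(positions, choice):
            chars[pos] = ch
        yield "".join(chars)


def repair(token, vocab=VOCAB):
    """Return a dictionary word for the token, or None if it is a non-word."""
    word = clean(token)
    if word in vocab:
        return word
    for candidate in s_f_candidates(word):
        if candidate in vocab:
            return candidate
    return None


def indexable_ratio(tokens, vocab=VOCAB):
    """Return (tokens that map to a dictionary word, total tokens)."""
    hits = [t for t in tokens if repair(t, vocab) is not None]
    return len(hits), len(tokens)


# A few tokens in the style of the raw OCR output above.
ocr = "the Firft Raifer of a falfe Report fuch fidch".split()
print([repair(t) for t in ocr])
print(indexable_ratio(ocr))
```

A substitution like this only helps when the OCR engine misreads the long s as a clean "f". Tokens mangled by blotchy ink or damaged paper, such as "fidch" in the sample, still fall out of the index.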