Planning To Digitize? Use A "3 Legged Stool"
By: Jeffrey Kiley - Advantage Archives
When evaluating how to approach making your microfilm or other historical printed content available digitally, plan for the best balance of quality and quantity that allows you to maximize the available funds.
We all know that preserving our cultural heritage often falls to local communities and groups, with small staffs and even smaller budgets. Many institutions and government agencies face funding challenges and restrictive budgets, and with no intent to monetize their collections, they need to define a plan to make those limited funds as impactful as possible. It takes careful planning.
You must define your vision for the digital collection and how your patrons will interact with it. Spend time exploring the digital archives made available online from other institutions, and make notes of what you like and what you might do differently. There are some really well-done digital collections available for free online, and I am a huge fan.
You just need to keep in mind that not all digital archives are created equal. The very large institutions, whose digitization projects are often funded through federal grants, are rarely burdened with the budgetary constraints that most local organizations face every day. For example, the digital archives created through the Library of Congress and through grants from the National Endowment for the Humanities are fantastic projects and should be celebrated, but these archives are built on a strict set of standards and guidelines to provide the highest quality image, the most robust metadata, and several file derivatives that are highly adaptable to most use cases.
The images can range between 400 and 600 dpi grayscale, at 8-bit or 16-bit depth, oftentimes with hand-corrected OCR. The images are beautiful, and the research and effort put into building the collections result in a very valuable cultural asset. A very expensive one too.
These large-scale projects are really, really fantastic... but come with a really, really big commitment of resources and money.
Similarly, the “pay for access” solutions, both subscription-based consumer websites and the large content aggregators that service libraries and educational institutions, can monetize (and re-monetize) their archives, allowing them the luxury of spending a bit more on the production of their digital images.
Since funding is sometimes the biggest perceived obstacle to embarking on a project to digitize your collection, you need to focus on building a plan to make those limited funds as impactful as possible. I believe there IS a solution for any budget, and that access to local history should not only be available to the largest of institutions, but also to small and underserved communities.
So you need to ask yourself, “What do I want to accomplish, that is both realistic and affordable, YET it gives me all the features that I envision my users need for a fulfilling experience?”, then try to figure out what “realistic” and “affordable” options are available, and start to develop a budget. But, how do you know how to set a project budget, let alone maximize one? I encourage our partners to start by using
“The Good 'ol 3 Legged Stool”
You have likely heard some variation of the “stool with three legs”. The metaphor is often attributed to President Franklin Roosevelt and the rollout of the Social Security program in the mid-1930s, as a way of describing the proper balance of Social Security, pensions, and personal savings. I have seen the “3 legged stool” analogy used to explain everything from insurance, to marketing, to wedding rings... and everyone’s “3 legs” represent something different.
When using the “'ol 3 legged stool” to plan your digitization project, the three legs on your stool are Quality, Quantity, and Budget.
We need a plan that gives the best possible balance of image quality and volume while maximizing the available funds. When any single factor is given more priority than the other two, the “'ol three-legged stool” can become wobbly and tip over unless there is careful planning around the overall goal of your digitization process.
The starting point in evaluating image requirements for your archive is deciding whether you want black and white images or would rather spend a little bit more to have the images scanned in grayscale. The deciding factor usually comes down to the pictures.
When I speak to “quality” I am speaking specifically to image requirements. Is a photorealistic recreation of the newspaper the intent of the digitization, or are you working towards a keyword-searchable archive with less emphasis on the images on the page? The majority of the institutions we partner with have characterized their projects as a complement to, or enhancement of, their current microfilm or physical newspaper collection, focusing on creating a text-searchable archive. Using digitization as a supplement (not a replacement) to your long-term archival strategy opens up a very real way for the members of your community to connect with their history.
When scanning your collection “bitonally”, the images are black and white with no halftones. Bitonal image capture is basically like a light switch: on or off. White or black with nothing in-between. This is ideal for text... much less ideal for photos. If the goal of the digitization project is to develop a fully keyword-searchable database with a high return rate on keywords and a browsable index, and you are looking to maximize the funds available to you, then bitonal images may be the best fit. Just keep in mind that the reproduction of photographs and accurate tonality is not the intent of this type of digitization project.
I often recommend bitonal image capture when the institution’s primary focus is on making the text discoverable through search, effectively making their microfilm into a practical research tool. This method of scanning reproduces the image of the newspaper in a way where the text is clear, legible, and laid out the way it was printed, in context with the page. The black-only text on a white background makes for very favorable OCR results and is more “budget-friendly” than its grayscale counterpart.
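The “light switch” behavior of bitonal capture can be sketched in a few lines. This is only an illustration, assuming 8-bit grayscale pixel values (0 is black, 255 is white) and a fixed cutoff of 128; the function name `to_bitonal`, the cutoff, and the sample values are all made up for the example, and real scanning software typically chooses its threshold more carefully.

```python
def to_bitonal(gray_rows, threshold=128):
    """Map 8-bit grayscale values to 0 (black) or 1 (white).

    Every pixel at or above the threshold becomes white; everything
    below it becomes black. There is nothing in between.
    """
    return [[1 if value >= threshold else 0 for value in row]
            for row in gray_rows]


# A tiny made-up patch of pixels: dark ink, mid-gray photo tones, light paper.
patch = [
    [10, 200, 120, 250],
    [30, 140, 90, 245],
    [5, 128, 127, 255],
]

for row in to_bitonal(patch):
    print(row)
```

Notice that the mid-gray values (90, 120, 127, 128, 140) are forced to one side or the other. That is harmless for crisp text, but it is exactly where the tonal detail in a photograph is lost.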
If the goal of the project is in fact to provide something closer to a photorealistic representation of the page, grayscale is a more accurate fit. Instead of the proverbial “on or off” switch for the color black that a bitonal image offers, grayscale images are captured at 8-bit or 16-bit depth, which allows for 256 or 65,536 shades of gray between the color black and the color white. These halftones allow for photos and finer visual details of the page to be recreated more accurately. It is slightly more expensive, but there are still ways to make the cost of a grayscale image lower than you might expect. The difference in cost between a bitonal image and a grayscale image boils down to file size. Grayscale images are larger than bitonal images. Grayscale images take a bit longer to scan and process. The images also require more storage space and use more bandwidth to deliver.
Some of the “cost savings” to be had on grayscale (or bitonal for that matter) come from making smart decisions on the “extras” that can add a lot of “budget bloat” if you aren’t careful. Don’t “over engineer” your project with requirements that don’t fit the overall plan for your archive. I put these “upgrades” in the “Quality” bucket, because decisions on things like scanning resolution make a significant difference (do you want to scan them at 300 dpi, 400 dpi, or 600 dpi? Do you want to pay for the higher resolution, or is it overkill for your intended use?).
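To get a feel for how bit depth and scanning resolution drive file size, here is a rough arithmetic sketch. It assumes a hypothetical 12 x 22 inch page and computes raw, uncompressed sizes only; the page dimensions and the function name `uncompressed_bytes` are assumptions for illustration. Delivered files are normally compressed and will be much smaller, so read the output as a relative comparison rather than a storage estimate.

```python
def uncompressed_bytes(width_in, height_in, dpi, bits_per_pixel):
    """Raw (uncompressed) size in bytes of one scanned page."""
    pixels = round(width_in * dpi) * round(height_in * dpi)
    # Round up to whole bytes (matters only for 1-bit images).
    return (pixels * bits_per_pixel + 7) // 8


# Hypothetical page dimensions in inches; substitute your own.
PAGE_W, PAGE_H = 12, 22

for dpi in (300, 400, 600):
    for label, bits in (("bitonal", 1), ("8-bit gray", 8), ("16-bit gray", 16)):
        size_mb = uncompressed_bytes(PAGE_W, PAGE_H, dpi, bits) / 1_000_000
        print(f"{dpi} dpi, {label}: {size_mb:.1f} MB")
```

Two ratios fall straight out of the arithmetic: before compression, an 8-bit grayscale page holds eight times the data of a bitonal page at the same resolution, and doubling the resolution from 300 dpi to 600 dpi quadruples the pixel count.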
“Quality” also speaks to the quality of the data and how it will be organized and utilized: things like the amount of metadata required, the use of “articlization” or zoning of the images, or manual corrections to the OCR. “Quality” can also speak to the quality of the user experience. How will people access it? What features will the platform offer? How much software development is required? What are the hardware specifications? Will there be other delivery methods or hosting options? A lot to consider...
Although grayscale is presumed by many to increase OCR accuracy, the differences in output between OCR applied to a grayscale image and OCR applied to a bitonal image are negligible. Studies show that, in general, bitonal digitization surprisingly has a slight edge when compared to a grayscale image of the same page. With many shades of gray used to recreate the image, words can get lost in the “noise” of the page, and the halftones can create pale or blurry text and poor contrast. Grayscale does do better with poorly printed materials, though, catching some words bitonal misses. On the other side of the coin, bitonal seems to catch “more” real words but does miss some that grayscale catches. It is six of one, half a dozen of the other.
Sometimes it is simply not practical (or possible) to embark on a project to digitize your entire collection. It becomes necessary to prioritize what content will be digitized first, and if necessary, possibly break the digitization project into multiple “phases”.
I would encourage you to always prioritize any at-risk materials first. If there are no originals considered “at-risk” in your collection, you may want to consider starting with the earliest content first. However, it is not the only choice. An equally effective approach can be to release content by eras or periods of interest. You may have valuable content from 1910 – 1920 that provides a unique perspective on the Great War, or newspapers from 1930 – 1950 covering the events leading up to the Second World War and its impact on the years after. Maybe you could start with releasing images from 1950 – 1970 as a dataset for researching the Civil Rights Movement.
Consider prioritizing the digitization of papers and documents that hold historical significance to your community.
That includes content that is unique to your community and may not be available elsewhere. There is a point in the ’70s and ’80s where the local news began to be overshadowed by wire stories that could be found in papers across the country. The later papers are still a treasure trove of local history and filled with invaluable information, but page-for-page they may not be as “dense” with unique content as when the papers were primarily created from the local newsroom. You may be digitizing many pages full of wire stories that were also published verbatim in other communities.
Some institutions choose to start by doing one publication or title at a time. If there were multiple newspapers published in your community or the paper changed titles throughout the life of the publication, breaking it out a title at a time often serves as a way to create logical breaks when laying out a project.
As a side note, a “phased” approach to releasing new digital content is actually a very effective engagement tool. It allows you an ongoing opportunity to talk about the collection through social media and other communication methods. Rolling out new content periodically, and taking the time to highlight that content, continues to generate interest in and use of the archive by the community.
In the end, you will have to decide if your Budget will be used to determine the Quality and the Quantity for your project, or if the Quality and Quantity will determine the Budget. There is no wrong answer.
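Both directions of that decision are simple arithmetic, sketched below. The per-page costs are placeholders invented purely for illustration, not real pricing from any provider, and the function names are hypothetical; plug in the quotes you actually receive. Amounts are kept in whole cents to avoid rounding surprises.

```python
def pages_for_budget(budget_cents, cost_per_page_cents):
    """Budget decides Quantity: how many whole pages the budget covers."""
    return budget_cents // cost_per_page_cents


def budget_for_pages(pages, cost_per_page_cents):
    """Quality and Quantity decide Budget: what a fixed page count costs."""
    return pages * cost_per_page_cents


# Placeholder per-page costs in cents. These are NOT real prices.
COST_CENTS = {"bitonal": 10, "grayscale": 15}

budget_cents = 500_000  # a hypothetical $5,000 budget
for quality, cost in COST_CENTS.items():
    pages = pages_for_budget(budget_cents, cost)
    print(f"{quality}: {pages:,} pages for ${budget_cents / 100:,.2f}")

needed_cents = budget_for_pages(40_000, COST_CENTS["grayscale"])
print(f"40,000 grayscale pages would need ${needed_cents / 100:,.2f}")
```

Changing any one leg (the per-page cost that comes with a quality choice, the page count, or the budget) immediately shows what has to give in the other two.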
The “'ol 3 legged stool” is just a tool to help you visualize the options you have in building an archive that meets your needs (and available funding). There are numerous approaches that you can take, and I truly believe there is a solution for every use case AND every budget. Do your homework and seek input from other institutions that have embarked on similar digital projects. The people who have “been there, and done that” will be your best source of information and advice.
If funding remains the only real obstacle preventing you from creating a digital archive, I would encourage you to look outside of your budget. Reach out to the community and ask them to embrace the idea that preserving and providing access to history can be a shared responsibility.
Encourage partnerships with local community publishers, libraries, and other like-minded individuals to make local content more accessible through donations, sponsorships, and gifts. Leverage the donor networks already in place for not only your institution, but also through other community foundations and networks as well. Apply for any grant opportunity available to you. Consider a “go fund me” type model. Think about an “adopt a year” campaign. The possibilities are only limited by your imagination.
Once you have located the funds, do what you can to stretch them, giving you more to work with in terms of your Quality & Quantity criteria. Evaluate executing a portion of the project in-house using internal resources, or find a service provider to partner with that can tailor a solution to your needs.
At Advantage we have built a model that relies on the strength of our partnerships. Our team works with institutions across the country, standing side-by-side with them through every step of their digitization efforts, and helping them create their own Community History Archive.
The Community History Archive platform is intended to serve as a “portal to the past”, allowing those primary source documents to give an accounting of history as told by the individuals that witnessed it. The pages in an archive, when stitched together, tell the story of the people, places, and events that shaped the community. We feel strongly that making this content practically accessible (and free) is important. It is core to our mission, and we partner with others that feel the same.
If you are interested in exploring the possibility of creating a Community History Archive for your community, we would love to chat. We can discuss finding funding, best approaches, ensuring all copyright laws are being followed, making sure the content is properly preserved, and ways to best use the archive as an outreach and engagement resource. We want to help you bring YOUR community’s past into the present!