Complex Conversion Effort Puts Historically Significant Archive at Fingertips of Students and Researchers
When ProQuest acquired the rights to the microfilm archives of two of America's most prestigious newspapers – The Washington Post and The New York Times – it knew it was sitting on a potential goldmine.
But extracting that gold – a century and a half of news articles, editorials and photos capturing the history of the United States – would not be easy. All told, the two archives combined for nearly 5.6 million newspaper pages, more than one million articles and over 100,000 individual newspaper editions. Moreover, newspapers are complex to digitize, given the large page format, page jumps and multiple photos and graphics.
ProQuest also needed to convert the archives into a format that could be preserved indefinitely, in order to maximize their potential for reuse and repurposing. If the digitization was not cost-effective and efficient, ProQuest would struggle to obtain an adequate return on its investment.
To convert the microfilm to an electronic format economically, ProQuest turned to Innodata Isogen, which has orchestrated and carried out some of the world’s largest data conversion efforts.
Innodata Isogen's expertise in converting documents to XML was a strong fit for this project. Moreover, Innodata Isogen had extensive experience scanning microfilm source material and converting it to electronic formats such as XML. Just as important, the company’s project teams could perform that task with a high degree of accuracy.
Newspapers present unique challenges for data conversions. With multiple page layouts, articles of varying lengths, page jumps, photos and artwork, a newspaper can look drastically different from one day to the next.
To meet this challenge, Innodata Isogen and ProQuest pioneered new digitization techniques based on zoning and threading – identifying areas of relevant text and relating them to each other – along with enhancement of image quality. The end result is a fully searchable file that allows users to view articles in their original context.
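The idea behind zoning and threading can be illustrated with a minimal Python sketch. It is not the production workflow used by Innodata Isogen or ProQuest, whose details are not described here. The Zone record, its field names and the thread_zones helper are all hypothetical, and the sketch assumes that each zone has already been assigned to an article and given a reading order.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class Zone:
    """One rectangular region of relevant text on a scanned page (hypothetical)."""
    page: int                         # page number within the edition
    bbox: Tuple[int, int, int, int]   # left, top, right, bottom, in pixels
    text: str                         # text recognized inside the region
    article_id: str                   # article this zone belongs to (assumed known)
    order: int                        # reading order within that article


def thread_zones(zones: List[Zone]) -> Dict[str, List[Zone]]:
    """Group zones by article and sort each group into reading order."""
    by_article: Dict[str, List[Zone]] = defaultdict(list)
    for zone in zones:
        by_article[zone.article_id].append(zone)
    return {
        article_id: sorted(group, key=lambda z: z.order)
        for article_id, group in by_article.items()
    }


def article_text(thread: List[Zone]) -> str:
    """Join the text of a threaded article, following it across page jumps."""
    return " ".join(zone.text for zone in thread)


def article_pages(thread: List[Zone]) -> List[int]:
    """List the pages an article touches, so it can be shown in page context."""
    return sorted({zone.page for zone in thread})
```

Keeping the page number and bounding box on every zone is what would let a viewer highlight an article on the original page image, while the threaded text supports full-text search.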
For this large-scale conversion, Innodata Isogen deployed a broad team of digitization experts, who developed a plan to meet the customer’s specifications within the aggressive timetable. The team was also able to work with microfilm of varying quality and convert it to a format usable by ProQuest’s Learning systems.
The finished archives for the Post and the Times allow researchers to use basic keyword, advanced, guided, and relevancy search techniques to locate information. Researchers can also browse through issues page by page, as if they were reading a printed edition. Search result lists provide bibliographic information, including the date, issue, article headline, page number, and byline. And in any issue, users may choose to display the full page image.
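As a rough illustration of the bibliographic fields listed above, the following Python sketch models one search result and a basic keyword match. The ResultRecord class, its field names, and the keyword_search and citation helpers are assumptions made for this example; they do not reflect ProQuest's actual schema or search engine.

```python
from dataclasses import dataclass
from datetime import date
from typing import List


@dataclass
class ResultRecord:
    """One entry in a search result list (hypothetical field names)."""
    published: date    # date of the edition
    issue: str         # issue or edition label
    headline: str      # article headline
    page_number: int   # page on which the article starts
    byline: str        # author credit
    full_text: str     # searchable article text


def keyword_search(records: List[ResultRecord], keyword: str) -> List[ResultRecord]:
    """Return records whose headline or text contains the keyword, ignoring case."""
    needle = keyword.lower()
    return [
        record for record in records
        if needle in record.headline.lower() or needle in record.full_text.lower()
    ]


def citation(record: ResultRecord) -> str:
    """Format the bibliographic fields of a record as a single line."""
    return (
        f"{record.headline}. {record.byline}. {record.issue}, "
        f"{record.published.isoformat()}, p. {record.page_number}."
    )
```

A plain substring match stands in here only for the basic keyword case; advanced, guided and relevancy searches would need an indexed search engine.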
ProQuest was hailed by a number of leading publications, including Information Week, for launching such a historically important newspaper archive. Even more important, the archive helped ProQuest reaffirm its standing as a leading information provider for university and college libraries. One librarian noted that providing the complete back-file electronically for both newspapers opened new doors for students and researchers.
By strengthening ProQuest’s competitive advantage over other database companies, the project has also made a strong contribution to the publisher’s bottom line. ProQuest’s subscription renewal rate has grown steadily since the archive was launched, and revenues from its microfilm archives have increased 24 percent.