Migration of Millions of Complex Documents Every Day Requires Data Transformation on a Massive Scale
Our client, a leading provider of information and workflow solutions to legal and other professionals, had been using a legacy publishing environment that was limited in its ability to support search, and was cumbersome and costly to update. This impeded market responsiveness and revenue growth.
To compete more effectively, our client needed to upgrade its core publishing platform, moving from a proprietary information schema to an XML-based platform. The upgrade could not interrupt existing operations.
Massive amounts of original source content needed continuous updates. To support this, the disparate existing source formats would be converted into a new XML schema suitable for a new consolidated platform. The client also needed to operate without interruption during the development and migration phases. As part of this platform modernization, the old editorial systems had to remain the sole source for content while new platforms were built and introduced gradually over a period of two years.
Updates to the content over the two-year period would be pushed to the existing product platforms and, in parallel through the Innodata Isogen solution, to the new platform. The system was required to be fully operational within six months for over 1,200 different incoming content types, and then to expand to handle an additional 500 incoming content formats and related conversions. The project promised to be challenging because of the large number of unique and complex conversions required and the tight service-level turnaround times for large volumes of content.
Our client asked Innodata to handle this effort’s complex requirements. We chose to partner with MarkLogic, a leading provider of infrastructure software for information-centric applications. Since 2004, the two companies have teamed up to help their clients unlock the full value of their content, from assessment to implementation and from data modeling to full transformation.
Innodata deployed the MarkLogic Server and the XQuery language to cut across all content types and formats in the data repository and perform complex migration activities. With full-text and XML indexes now accessible, the MarkLogic Server provided fine-grained search and retrieval based on document type, content structure, occurrence and attributes, which allowed the development of complex transformation routines.
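To make the idea of selecting content by document type, structure and attributes concrete, here is a minimal sketch in Python. It is not MarkLogic's API and not the project's code: the production routines were written in XQuery against the server's indexes, whereas this example scans a small in-memory set of documents. The sample documents, the `select` function and its parameters are all hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample repository: document id -> XML source.
DOCS = {
    "c1": '<case jurisdiction="NY"><headnote>Summary</headnote>'
          '<opinion>Text</opinion></case>',
    "c2": '<case jurisdiction="CA"><opinion>Text</opinion></case>',
    "s1": '<statute jurisdiction="NY"><section>Text</section></statute>',
}


def select(docs, doc_type=None, contains=None, attrs=None):
    """Return the ids of documents that satisfy every given criterion.

    doc_type: required name of the root element (document type).
    contains: ElementTree path that must match at least one element
              (content structure / occurrence).
    attrs:    attribute values required on the root element.
    """
    matches = []
    for doc_id, xml_text in docs.items():
        root = ET.fromstring(xml_text)
        if doc_type is not None and root.tag != doc_type:
            continue
        if contains is not None and root.find(contains) is None:
            continue
        if attrs and any(root.get(k) != v for k, v in attrs.items()):
            continue
        matches.append(doc_id)
    return matches


if __name__ == "__main__":
    # Cases from New York that contain a headnote.
    print(select(DOCS, doc_type="case", contains=".//headnote",
                 attrs={"jurisdiction": "NY"}))
```

A linear scan like this would not scale to millions of documents a day; the point is only to show the kinds of predicates that an indexed XML repository lets a conversion routine combine.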
Innodata Isogen defined the conversion instructions and built the related MarkLogic XQuery conversion routines. We also developed supporting tools to integrate production workflow and quality activities with MarkLogic’s conversion routines and XML repository.
The project was broken into three components: definition of conversion rules and instructions, build of the conversion rules, and quality assurance checking. One set of conversion instructions was developed for each variation in incoming content type and format, and passed to the conversion team for development in XQuery. The conversions were staged to first normalize the incoming content and then transform the content to the final format.
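The two-stage approach (normalize first, then transform) can be sketched as follows. This is an illustrative Python example under stated assumptions, not the XQuery routines used on the project: the element names, the `TAG_MAP` mapping and the function names are hypothetical, and a real channel would have far richer rules.

```python
import xml.etree.ElementTree as ET

# Hypothetical mapping from normalized legacy element names to the target
# schema. In practice this would come from a channel's conversion instructions.
TAG_MAP = {"doc": "document", "hd": "title", "p": "para"}


def normalize(raw_xml: str) -> ET.Element:
    """Stage 1: parse the source and bring it into a predictable shape."""
    root = ET.fromstring(raw_xml)
    for elem in root.iter():
        # Lowercase element names so later rules match consistently.
        elem.tag = elem.tag.lower()
        # Collapse runs of whitespace inside text content.
        if elem.text:
            elem.text = " ".join(elem.text.split())
        # Drop whitespace-only text between elements.
        if elem.tail and not elem.tail.strip():
            elem.tail = None
    return root


def transform(normalized: ET.Element) -> ET.Element:
    """Stage 2: rename normalized elements to the target schema."""
    for elem in normalized.iter():
        elem.tag = TAG_MAP.get(elem.tag, elem.tag)
    return normalized


def convert(raw_xml: str) -> str:
    """Run both stages and serialize the result."""
    return ET.tostring(transform(normalize(raw_xml)), encoding="unicode")


if __name__ == "__main__":
    print(convert("<DOC>\n  <HD>  Case   Law </HD>\n  <P>Text</P>\n</DOC>"))
```

Separating the stages means the transform rules only ever see one consistent input shape, so variations between incoming formats are absorbed in normalization rather than multiplied across every transformation rule.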
Additionally, an application was developed to provide a dashboard for managing and tracking the production workflow of the conversion activity through channels (each unique incoming content format was classed as a channel). This allowed the Quality Assurance team to selectively review the conformity of the output content. The platform was operational within eight weeks for the first channel of content.
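The per-channel tracking and selective review described above could be modeled along these lines. This is a hypothetical Python sketch, not the dashboard application itself: the `Channel` class, its fields and the random-sampling review strategy are assumptions made for illustration.

```python
import random
from collections import Counter
from dataclasses import dataclass, field


@dataclass
class Channel:
    """Tracks conversion outcomes for one incoming content format."""
    name: str
    status_counts: Counter = field(default_factory=Counter)
    converted_ids: list = field(default_factory=list)

    def record(self, doc_id: str, ok: bool) -> None:
        """Record the outcome of converting a single document."""
        self.status_counts["converted" if ok else "failed"] += 1
        if ok:
            self.converted_ids.append(doc_id)

    def sample_for_review(self, k: int, seed: int = 0) -> list:
        """Pick up to k converted documents for a QA conformity check."""
        rng = random.Random(seed)  # seeded so the selection is repeatable
        k = min(k, len(self.converted_ids))
        return rng.sample(self.converted_ids, k)


if __name__ == "__main__":
    channel = Channel("legacy-format-a")  # hypothetical channel name
    for i in range(5):
        channel.record(f"doc-{i}", ok=(i != 3))
    print(dict(channel.status_counts))
    print(channel.sample_for_review(2))
```

Sampling per channel, rather than across the whole stream, keeps low-volume formats from being crowded out of review by high-volume ones.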
Our client was able to upgrade its core publishing platform, moving from proprietary formats to a single-source, XML-based platform.
While systems development and content migration (including the conversion of millions of complex documents to XML daily) proceeded simultaneously over a span of two years, our client continued to operate without interruption. The new system enables our client to develop, manufacture and market new products more rapidly and at lower cost.