How Innodata Extracts Data from Complex Documents
As companies look for ways to reap the benefits of artificial intelligence and machine learning, they need as much data as possible to train and improve their models. The challenge lies in obtaining the right data, in the right formats and systems, and in the right quantity. After all, AI and machine learning models are only as good as the data used to train them.
PDF documents like contracts and privacy policies are rich in data points. Whether they record detailed communications or transactions, these documents contain valuable information that can yield better insights. Think about the volume and variety of documents businesses create and share every day.
Documents Rich in Data
- Press Releases
- Research Reports
- Purchase Orders
- Shipping Information
- Investor Summaries
- Privacy Policies
For example, in the financial services industry, there are massive amounts of data generated every time a contract is created or amended. This information is always in flux, rapidly evolving during the typical lifespan of the agreement. Financial institutions must keep up with the changes to understand how this information may affect them. Therefore, they need to be able to seamlessly extract, analyze and manage critical data points within the document.
What is Zoning?
Harnessing the information stored in PDFs is no easy task, especially when that information must be made accessible to machine learning systems. Zoning is the first step in the PDF conversion process. It aims to automate the work of recognizing and classifying sequences and blocks of information within a document and mapping each one to a predefined “zone” category. This sorts the content of the PDF into content types such as abstract, title, image, references, and authors. After categorization, these content blocks can be processed further by ML models that perform sequence labeling on them. Zoning also makes more complex tasks possible, such as extracting text from an image.
Zoning is specifically designed to turn unstructured information typically stored in documents like contracts and purchase orders into readily accessible smaller blocks of data that can be used in machine learning environments to drive smarter business outcomes.
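To make the idea of mapping blocks to zone categories concrete, here is a minimal sketch in Python. It is not Innodata's implementation: the `Block` fields, the `classify_block` function, the zone names, and the thresholds are all hypothetical, and simple hand-written rules stand in for the trained models a production system would likely use.

```python
from dataclasses import dataclass


@dataclass
class Block:
    """A text block already pulled from a PDF page (hypothetical structure)."""
    text: str
    font_size: float
    page: int          # 1-based page number
    y_fraction: float  # 0.0 = top of the page, 1.0 = bottom


def classify_block(block: Block, body_font_size: float = 10.0) -> str:
    """Map one block to a zone using toy rules (assumed, not a real model)."""
    lowered = block.text.strip().lower()
    # Large text near the top of the first page is treated as the title.
    if (block.page == 1 and block.y_fraction < 0.2
            and block.font_size >= 1.5 * body_font_size):
        return "title"
    if lowered.startswith("abstract"):
        return "abstract"
    if lowered.startswith("references"):
        return "references"
    # Anything the rules do not recognize falls back to body text.
    return "body"


def zone_document(blocks: list[Block]) -> list[tuple[str, str]]:
    """Return (zone, text) pairs for every block, in reading order."""
    return [(classify_block(b), b.text) for b in blocks]


if __name__ == "__main__":
    sample = [
        Block("Master Services Agreement", 18.0, 1, 0.05),
        Block("The parties agree to the following terms.", 10.0, 2, 0.5),
    ]
    for zone, text in zone_document(sample):
        print(zone, "|", text)
```

Each block ends up with exactly one zone label, which is the property that lets later steps, such as sequence labeling, treat a title differently from a reference list.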
Innodata has been building a cognitive extraction and structuring engine to address the need to classify complex documents and extract information from them. The platform accepts a PDF document as input, extracts information from the document, performs transformations and generates an annotated/tagged XML as output. We encountered several challenges while engineering our zoning platform, but we learned from those obstacles and built a zoning solution that works quickly on complex documents and delivers the quality results our clients demand for their training data.
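As a rough illustration of the final step, generating tagged XML from classified content, the sketch below uses Python's standard `xml.etree.ElementTree` module. The element and attribute names (`document`, `zone`, `type`) and the `zones_to_xml` function are assumptions made for this example. The actual schema the platform produces is not described here.

```python
import xml.etree.ElementTree as ET


def zones_to_xml(zoned_blocks: list[tuple[str, str]]) -> str:
    """Serialize (zone, text) pairs as tagged XML (hypothetical schema)."""
    root = ET.Element("document")
    for zone, text in zoned_blocks:
        # One <zone> element per block, with the category as an attribute.
        element = ET.SubElement(root, "zone", {"type": zone})
        element.text = text
    # encoding="unicode" returns a str rather than bytes.
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    print(zones_to_xml([
        ("title", "Master Services Agreement"),
        ("body", "The parties agree to the following terms."),
    ]))
```

Using a standard XML library rather than string concatenation means special characters in the extracted text, such as `&` or `<`, are escaped automatically.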
Check out our whitepaper on zoning to learn more about why zoning content within PDFs is an essential endeavor in AI and what you can do to employ a similar document strategy.