Understanding the Role of Taxonomies, Ontologies, Schemas and Knowledge Graphs

The Often-Forgotten but Critical Step in Scaling AI and Machine Learning

When most people think of artificial intelligence (AI), they conjure up notions of advanced machine learning algorithms, deep neural networks or computational cybernetics. You know, the sexy, futuristic-sounding concepts that are having an impact on the world around us. What doesn’t come to mind are taxonomies, ontologies and schemas; they are not as sexy, but they are equally important, if not more so, in bringing AI to life.

AI and machine learning require structured information to train machines to learn and replicate human behavior. That process starts with turning unstructured data into clean, structured data that AI and ML processes can use. Taxonomies give machines a way to understand hierarchies in the information. Ontologies specify the concepts and relationships of a domain. Schemas make clear how the data is structured. Before we proceed, let’s break down what each term really means.

What is a Taxonomy?

A data taxonomy is the classification of data into categories and sub-categories. It provides a unified view of the data in a system and introduces common terminology and semantics across multiple systems. Taxonomies represent the formal structure of classes or types of objects within a domain. A taxonomy is relatively static: its structure stays fixed until it is deliberately revised.
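To make the idea of categories and sub-categories concrete, here is a minimal sketch of a taxonomy stored as a nested dictionary. The category names and the find_path helper are illustrative assumptions, not an industry standard or any particular tool's format.

```python
# Hypothetical taxonomy for contract documents. Each key is a category and
# each value holds its sub-categories (an empty dict marks a leaf).
TAXONOMY = {
    "Contract": {
        "Financial Contract": {
            "Derivative Agreement": {},
            "Loan Agreement": {},
        },
        "Commercial Contract": {
            "Supplier Contract": {},
            "Rights Agreement": {},
        },
    }
}


def find_path(tree, term, trail=()):
    """Return the chain of categories from the root down to `term`, or None."""
    for name, children in tree.items():
        current = trail + (name,)
        if name == term:
            return list(current)
        found = find_path(children, term, current)
        if found is not None:
            return found
    return None


print(find_path(TAXONOMY, "Supplier Contract"))
```

Because every term has exactly one parent here, a machine can place any classified document in the hierarchy simply by walking up the path.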

What is an Ontology?

An ontology is a formal naming convention and the definition of the types, properties, and inter-relationships of the entities that really or fundamentally exist for a particular domain of discourse. An ontology is dynamic and domain-centric.
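As a rough illustration of "types, properties, and inter-relationships", the sketch below models a tiny ontology as a set of entity types plus the relationships allowed between them. The entity types, relation names and the is_valid helper are all hypothetical, and real ontologies are usually written in dedicated languages rather than plain Python.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Relation:
    """A named relationship allowed between two entity types."""
    name: str
    subject_type: str
    object_type: str


# Hypothetical mini-ontology for a contracts domain (illustrative only).
ENTITY_TYPES = {"Contract", "Party", "Jurisdiction"}
RELATIONS = {
    relation.name: relation
    for relation in (
        Relation("signed_by", "Contract", "Party"),
        Relation("governed_by", "Contract", "Jurisdiction"),
    )
}


def is_valid(subject_type, relation_name, object_type):
    """Check whether the ontology permits this kind of statement."""
    relation = RELATIONS.get(relation_name)
    if relation is None:
        return False
    return (
        relation.subject_type == subject_type
        and relation.object_type == object_type
    )


print(is_valid("Contract", "signed_by", "Party"))
print(is_valid("Party", "signed_by", "Contract"))
```

Unlike the hierarchy of a taxonomy, the relations here are typed and directional, which is what lets a machine reject a statement that makes no sense in the domain.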

What is a Schema?

In computer programming, a schema is the organization or structure of a database. In artificial intelligence, the term can also refer to a formal expression of an inference rule.
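In the database sense, a schema can be pictured as a list of expected fields and their types. The following is a minimal sketch under that assumption; the field names and the validate helper are invented for illustration and do not correspond to any specific database or library.

```python
# Hypothetical schema for a contract record: field name -> required type.
CONTRACT_SCHEMA = {
    "contract_id": str,
    "contract_type": str,
    "notional_amount": float,
    "parties": list,
}


def validate(record, schema):
    """Return a list of problems; an empty list means the record conforms."""
    problems = []
    for field, expected in schema.items():
        if field not in record:
            problems.append(f"missing field: {field}")
        elif not isinstance(record[field], expected):
            problems.append(f"{field} should be {expected.__name__}")
    for field in record:
        if field not in schema:
            problems.append(f"unexpected field: {field}")
    return problems


record = {
    "contract_id": "C-001",
    "contract_type": "Derivative Agreement",
    "notional_amount": "1000000",  # wrong type on purpose
}
print(validate(record, CONTRACT_SCHEMA))
```

Checking records against a schema before training is one simple way to keep malformed data out of an AI or ML pipeline.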

What is a Knowledge Graph?

A knowledge graph is a model of a knowledge domain created by subject matter experts with the help of intelligent machine learning algorithms. It provides a structure and common interface for all of your data and enables the creation of smart multilateral relations throughout your databases. Structured as an additional virtual data layer, the knowledge graph sits on top of existing databases or data sets and links all of your data together at scale, whether it is structured or unstructured.
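One common way to picture a knowledge graph is as a collection of (subject, predicate, object) statements that can be traversed to find connections. The sketch below assumes that representation; the KnowledgeGraph class, its method names and the sample facts are hypothetical and far simpler than a production graph store.

```python
from collections import defaultdict, deque


class KnowledgeGraph:
    """A tiny in-memory graph of (subject, predicate, object) statements."""

    def __init__(self):
        self._edges = defaultdict(list)

    def add(self, subject, predicate, obj):
        self._edges[subject].append((predicate, obj))

    def objects(self, subject, predicate):
        """Return everything linked to `subject` through `predicate`."""
        return [o for p, o in self._edges.get(subject, []) if p == predicate]

    def reachable(self, start):
        """Return every node connected to `start` by following edges."""
        seen = {start}
        queue = deque([start])
        while queue:
            node = queue.popleft()
            for _, neighbor in self._edges.get(node, []):
                if neighbor not in seen:
                    seen.add(neighbor)
                    queue.append(neighbor)
        seen.discard(start)
        return seen


kg = KnowledgeGraph()
# Facts that might come from separate systems, linked in one layer.
kg.add("contract-001", "signed_by", "Acme Corp")
kg.add("contract-001", "governed_by", "New York")
kg.add("Acme Corp", "subsidiary_of", "Acme Holdings")

print(kg.objects("contract-001", "signed_by"))
print(sorted(kg.reachable("contract-001")))
```

The traversal shows the "linking" role of the graph: a contract record and a corporate-structure record become connected even though they could originate in different databases.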

With these building blocks in place, information can be structured automatically into meaningful representations. This becomes the basis for machines to understand information the way humans do, recognize patterns (cognitive intelligence) and predict relationships. When this intelligence is combined with tools like knowledge graphs, machines can read, predict and interpret the connections and inter-relations in the data. Knowledge graphs provide the interface for visualizing those relations across large datasets, which lets humans make sense of large, unconnected datasets and use the machines’ predictions to make data-driven decisions. When humans check the predictions and correct anomalies in them, the supervised learning cycle kicks in, improving the intelligence and driving higher prediction accuracy. These technologies are used extensively in AI- and ML-enabled recommendation engines, fraud detection, anti-terrorism, risk management and much more.

The Key Differences

The difference between an ontology and a taxonomy is one of scope: a taxonomy captures only hierarchical relationships, so it is commonly regarded as a simpler, restricted form of ontology. A taxonomy formalizes the hierarchical relationships among concepts and specifies the term to be used to refer to each; it prescribes structure and terminology. An ontology identifies and distinguishes concepts and their relationships based on a domain; it describes content and relationships in the context of a specific domain.

For example, let’s say a taxonomy has been created for contracts management. It would contain the terms and relations for contract documents. If this taxonomy is applied to information extraction from OTC (over-the-counter) derivative contracts such as ISDA/GIMRA, the taxonomy alone would prove inadequate, because industry-specific contracts like ISDA, GIMRA, etc., have their own domain-specific ontologies. Similarly, contracts for rights management, supplier contracts, etc., each have a domain-specific ontology that provides the best reference for the terms applicable in that domain. Therefore, the right choice of taxonomies and ontologies is crucial for AI and ML applications to work successfully for information extraction.

While slightly different, they are all related to metadata, information organization and knowledge representation, although each one has a specific role to play in representing information.

How to Get Started

The process starts with the creation of structured data from unstructured information. The first step is to acquire clean data and define a taxonomy, ontology and schema upfront. This can be done either by starting from existing taxonomies and ontologies or by developing them from scratch. Subject knowledge and domain expertise are crucial to building these correctly. This is where Innodata applies our internal subject matter expertise across domains. Once these are defined, the raw data is annotated by applying the taxonomy, ontology and/or schema as needed.
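The annotation step can be sketched in a very simplified form as matching taxonomy terms against raw text. The example below is only an assumption about how such a step might look: the trigger phrases, category names and annotate helper are invented, and real annotation typically relies on subject matter experts and far richer tooling than keyword matching.

```python
import re

# Hypothetical mapping from taxonomy category to trigger phrases.
CATEGORY_TERMS = {
    "Derivative Agreement": ["swap", "derivative"],
    "Supplier Contract": ["supplier", "purchase order"],
}


def annotate(text, category_terms):
    """Return sorted (start, end, category) spans for every matched phrase."""
    spans = []
    for category, terms in category_terms.items():
        for term in terms:
            # Whole-word, case-insensitive match of the trigger phrase.
            pattern = r"\b" + re.escape(term) + r"\b"
            for match in re.finditer(pattern, text, flags=re.IGNORECASE):
                spans.append((match.start(), match.end(), category))
    return sorted(spans)


raw_text = "The supplier signed a swap agreement."
for start, end, category in annotate(raw_text, CATEGORY_TERMS):
    print(raw_text[start:end], "->", category)
```

Each span records where in the raw text a category applies, which is the kind of structured output that downstream AI and ML processes can train on.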

AI and ML technologies work best when the base dataset is clean and well-structured and the taxonomies and ontologies are accurate and appropriate to the context. Subject matter expertise and domain knowledge are key ingredients for success. Unfortunately, finding the right combination of all the above in one place is quite a task in itself. Open-source taxonomies and ontologies could be too generic and might not be the best choice. In-house SMEs could be tied up with day-to-day functions and unavailable for crucial projects. With the high velocity and variety of data flowing in, taxonomies need to be updated and renewed constantly for organizations to remain relevant and sustain AI performance. Innodata has a full spectrum of solutions to help: building clean data, creating base taxonomies from scratch, providing SMEs across a large variety of domains for ontology and schema development, and supplying cutting-edge tools to build and update taxonomies and ontologies, annotate raw data, and customize, test and validate in an agile process, which could accelerate time to market by 40-50%.

At the end of the day, machines cannot read, interpret, or make sense of data without structure. A well-designed taxonomy, ontology and schema are fundamental to teaching machines to understand patterns the way humans do, and to long-term AI and ML success.

Meg Farrell, VP of Healthcare Data Services

(NASDAQ: INOD) Innodata is a leading data engineering company. Prestigious companies across the globe turn to Innodata for help with their biggest data challenges. By combining advanced machine learning and artificial intelligence (ML/AI) technologies, a global workforce of over 3,000 subject matter experts, and a high-security infrastructure, we’re helping usher in the promise of digital data and ubiquitous AI.