
How Much Do Today's Large Language Models Really Know?

Imagine all of human knowledge as a football. 

Now, consider how much of that knowledge is actually captured in today’s leading AI models—barely a dime’s worth. 

That’s about 0.2% of the football’s surface. 
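For readers who want to sanity-check that figure, here is a minimal sketch. The dimensions are assumptions rather than numbers from this article: a US dime of about 17.9 mm across, and an American football roughly 28 cm long with a 53 cm girth, approximated as a prolate spheroid.

```python
import math

# Back-of-the-envelope check of the "football to dime" comparison.
# All dimensions below are assumptions, not figures from the article.
DIME_DIAMETER_CM = 1.79       # a US dime is about 17.9 mm across
FOOTBALL_LENGTH_CM = 28.0     # approximate length of an American football
FOOTBALL_GIRTH_CM = 53.0      # approximate circumference at the middle

# Face of the dime: a flat circle.
dime_area = math.pi * (DIME_DIAMETER_CM / 2) ** 2

# Model the football as a prolate spheroid: polar semi-axis from its length,
# equatorial semi-axis from its girth.
a = FOOTBALL_LENGTH_CM / 2
c = FOOTBALL_GIRTH_CM / (2 * math.pi)
e = math.sqrt(1 - (c / a) ** 2)                      # eccentricity
football_area = 2 * math.pi * c**2 * (1 + (a / (c * e)) * math.asin(e))

print(f"dime face:        {dime_area:.1f} cm^2")             # ~2.5 cm^2
print(f"football surface: {football_area:.0f} cm^2")         # ~1,300 cm^2
print(f"ratio:            {dime_area / football_area:.2%}")  # ~0.2%
```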

Why is this gap so vast? Because much of the knowledge AI still needs doesn’t yet exist in structured, machine-readable formats. 

The Challenge of Knowledge Representation

AI models today struggle with: 

  • Complex, multi-step reasoning across domains. 
  • Specialized, domain-specific problem-solving. 
  • A deep, contextual understanding of the world beyond text and images.

This aligns with expert insights: 

  • Theoretical linguistics (Chomsky): Human language is infinitely generative, meaning no finite dataset can capture its full range. 
  • The Knowledge Doubling Curve (Buckminster Fuller): Human knowledge used to double every century, then every 25 years by 1945. Today, in some fields, it doubles every 12 hours. 
  • Data scale limitations: While leading LLMs are trained on roughly 100 TB of data, enterprise and private datasets likely exceed 100 exabytes, meaning today’s training corpora amount to roughly one millionth of the data that exists (a quick unit check follows this list). 
  • Industry insights: Sam Altman of OpenAI recently noted that current AI systems handle only a “single-digit percentage of all economically valuable tasks,” reinforcing the scale of the challenge. 
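
To put the data-scale bullet in concrete terms, here is a minimal check that converts both figures to the same unit. The 100 TB and 100 exabyte numbers are the rough estimates quoted above; the only assumption added here is the use of decimal units (1 EB = 1,000,000 TB).

```python
# Rough unit check for the data-scale bullet above (decimal units assumed).
TRAINING_DATA_TB = 100            # approximate LLM training corpus size
ENTERPRISE_DATA_EB = 100          # estimated enterprise and private data
TB_PER_EB = 1_000_000             # 1 exabyte = 1,000,000 terabytes

ratio = TRAINING_DATA_TB / (ENTERPRISE_DATA_EB * TB_PER_EB)
print(f"{ratio:.0e}")   # 1e-06 -> about one millionth of the estimated total
print(f"{ratio:.4%}")   # 0.0001%
```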

The Road to AGI: Beyond Compute, Toward Better Data

The path to more advanced AI isn’t just about increasing computational power. It’s about capturing, structuring, and utilizing the right data: 

  • Multi-lingual and multi-modal data to enhance AI’s global adaptability.
  • Data for safety, alignment, and reasoning to improve decision-making.
  • Meta-learning and reasoning data to enable AI to generalize beyond training examples.
  • Operator and agentic data to refine AI’s ability to act in real-world environments. 
  • World modeling and simulation data to create richer, more contextual AI models.

Innodata’s Perspective: The Data Imperative

At Innodata, we recognize that the most pressing challenge in AI today is not just compute power; it is access to, and the structuring of, high-quality data. During our fourth quarter and fiscal year 2024 earnings call, CEO Jack Abuhoff used the “football to dime” analogy to highlight the vast untapped potential of unstructured data. Jack said: 

“An industry analogy to explain where we are in capturing data is to imagine the realm of all useful data to be the size of a football. By comparison, today’s best-performing LLMs have been trained with data sets that are probably the size of a dime. What’s even more interesting is that much of this uncaptured but useful data does not even exist explicitly today, such as how to execute a multi-step process using a series of websites or how to reason through complex, domain-specific problems. We believe this likely means an even greater need for investment in our services that will be necessary to achieve the goal of AGI. We intend for Innodata to be at the forefront of providing these services.” 

Innodata helps bridge this gap, providing the essential data layer that turns raw information into powerful inputs for AI training. Our expertise spans the entire model lifecycle, helping organizations structure and enrich data at scale, from data collection and annotation to red teaming, human preference optimization, and beyond. 
