Junk Food for Algorithms: Why Large Language Models Need a Data Detox
BIGPURPLECLOUDS PUBLICATIONS
Introduction
Artificial intelligence (AI) has seen massive advances in recent years, largely driven by a class of systems called large language models (LLMs). From chatbots to search engines to auto-complete suggestions, LLMs now power many of the AI applications we interact with daily.
But how do these seemingly intelligent models actually learn? Like humans, AI systems require vast amounts of data and experience to acquire knowledge and skills. The training data used is key to shaping what LLMs know and how they behave.
In this post, we explore the origins of the data that nourishes modern AI, and why this matters.
A Hungry Mind: LLMs Require Massive Datasets
LLMs like OpenAI's GPT-3, Google's Bard and DeepMind's Gopher have billions of parameters, loosely analogous to the synapses in a brain. To set these parameters to useful values, the models must digest huge datasets during training.
For example, GPT-3 is estimated to have been trained on around 570 gigabytes of text drawn from websites, books and online forums, which is equivalent to over 1 million physical books! Without exposure to large volumes of data, LLMs would remain clueless about language and the world.
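The "over 1 million books" comparison is easy to sanity-check with back-of-the-envelope arithmetic. The sketch below assumes (these figures are illustrative, not from the original post) an average book of roughly 80,000 words and about 6 bytes of plain text per word, including spaces:

```python
# Rough estimate: how many books does 570 GB of plain text represent?
# Assumed figures (not from the post): ~80,000 words per book,
# ~6 bytes of UTF-8 text per English word including whitespace.
BYTES_PER_WORD = 6
WORDS_PER_BOOK = 80_000
BYTES_PER_BOOK = BYTES_PER_WORD * WORDS_PER_BOOK  # ~480 KB per book

training_corpus_bytes = 570 * 10**9  # 570 GB, the figure cited for GPT-3

equivalent_books = training_corpus_bytes / BYTES_PER_BOOK
print(f"~{equivalent_books:,.0f} books")  # → ~1,187,500 books
```

Under these assumptions the corpus works out to a little under 1.2 million books, consistent with the "over 1 million" figure.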
The training data acts like food, providing the nourishment for an LLM's "brain" to grow and develop its capabilities. With more data, abilities like language understanding, reasoning and generation of text improve.
But the quality of the training diet also matters tremendously. Bad data leads to a junk food diet that causes models to learn harmful biases and behaviours.
Crawling the Web: Origins of Training Data
Where exactly does this massive training data come from? It is drawn from a variety of places, but some primary sources are: