Junk Food for Algorithms: Why Large Language Models Need a Data Detox
BIGPURPLECLOUDS PUBLICATIONS
Introduction
Artificial intelligence (AI) has seen massive advances in recent years, largely driven by a class of systems called large language models (LLMs). From chatbots to search engines to auto-complete suggestions, LLMs now power many of the AI applications we interact with daily.
But how do these seemingly intelligent models actually learn? Like humans, AI systems require vast amounts of data and experience to acquire knowledge and skills. The training data used is key to shaping what LLMs know and how they behave.
In this post, we explore the origins of the data that nourishes modern AI, and why this matters.
A Hungry Mind: LLMs Require Massive Datasets
LLMs like OpenAI's GPT-3, Google's Bard and DeepMind's Gopher have billions of parameters, loosely analogous to the synapses in a brain. To set these parameters to useful values, the models must digest huge datasets during training.
For example, GPT-3 is estimated to have been trained on around 570 gigabytes of text drawn from websites, books and online forums, which is equivalent to over 1 million physical books! Without exposure to large volumes of data, LLMs would remain clueless about language and the world.
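The "over 1 million books" comparison is easy to sanity-check with back-of-the-envelope arithmetic. The sketch below assumes (these figures are illustrative, not from the original post) an average book of roughly 80,000 words and about 6 bytes of plain text per word, including spaces:

```python
# Rough estimate: how many books does 570 GB of plain text represent?
# Assumed figures (not from the post): ~80,000 words per book,
# ~6 bytes of UTF-8 text per English word including whitespace.
BYTES_PER_WORD = 6
WORDS_PER_BOOK = 80_000
BYTES_PER_BOOK = BYTES_PER_WORD * WORDS_PER_BOOK  # ~480 KB per book

training_corpus_bytes = 570 * 10**9  # 570 GB, the figure cited for GPT-3

equivalent_books = training_corpus_bytes / BYTES_PER_BOOK
print(f"~{equivalent_books:,.0f} books")  # → ~1,187,500 books
```

Under these assumptions the corpus works out to a little under 1.2 million books, consistent with the "over 1 million" figure.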
The training data acts like food, providing the nourishment for an LLM's "brain" to grow and develop its capabilities. With more data, abilities like language understanding, reasoning and generation of text improve.
But the quality of the training diet also matters tremendously. Bad data leads to a junk food diet that causes models to learn harmful biases and behaviours.
Crawling the Web: Origins of Training Data
Where exactly does this massive training data come from? It is drawn from a variety of places, but some primary sources are: