LAION-5B

LAION-5B is an open-source foundation dataset used to train AI models such as Stable Diffusion. It contains 5.8 billion image and text pairs—a size too large to make sense of. In this visual investigation, we follow the construction of the dataset to better understand its contents, implications and entanglements.

Publications

Background

The AI field’s goal is nothing less than to transform the whole world. But what are the foundations upon which this transformation is built? In this investigation, Jer Thorp and I looked at LAION-5B, the only open-source foundation dataset currently available.

How do you investigate a dataset containing 5.8 billion text-image pairs – a size too large to make sense of? In this visual story by Knowing Machines, we follow the construction of the dataset to better understand its contents, implications, and entanglements.

Datasets are curated through automated processes and statistics to meet the scale required by today’s AI. At every point in this curatorial process, another model, another dataset, and another benchmark make decisions, defining the worldview an AI is exposed to during training.

It’s models all the way down. However, not just the biases of models come to light; structural biases of the AI ecosystem and their data processes are becoming visible. Transforming the world is the goal of the field, but not everyone is equally included in this transformation.