The 100-billion webpage dataset that powers AI
A conversation about the Common Crawl with Stefan Baack, PhD
This week I spoke to Stefan Baack from the Mozilla Foundation about a recent research article he authored on the Common Crawl. Common Crawl is the name of both a non-profit open-data organization founded in 2008 by Gil Elbaz and of the associated dataset. The Common Crawl is one of the most important datasets in the Generative AI ecosystem and has been used to train dozens of large language models.
To give a sense of just how large Common Crawl is: every month it collects 3 to 5 billion webpages, 500 times more webpages than there are articles on Wikipedia. Each of these monthly datasets is around 90 terabytes compressed (or 400 terabytes uncompressed), 4,000 times as large as all of the text on Wikipedia. Over its 17-year history, Common Crawl has collected more than 250 billion webpages.
Stefan is a researcher and data analyst on the Mozilla Foundation’s Insights Team. He completed his PhD at the Research Center for Media and Journalism Studies at the University of Groningen, where he wrote a dissertation about the relationship between data journalism and civic tech.
Stefan and I spoke about how Common Crawl decides what webpages to collect, about its founder Gil Elbaz and his philosophy of building neutral data companies, about how AI builders utilize and filter Common Crawl, and about how pre-training influences large language model behavior and biases.
The transcript below has been lightly edited for clarity.
Stefan Baack, welcome to the podcast.
Thank you. Happy to be here.
So we're going to be talking all about the Common Crawl today, which is this incredible dataset that's been built up over a number of years. And I'm really excited to talk to you about all the details of that.
So the Common Crawl data set is most relevant for the pre-training phase of large language model development. The other major phase is called fine-tuning. And for listeners who are curious about the fine-tuning phase, I would recommend listening to my previous episode with Shayne Longpre and Robert Mahari of the Data Provenance Initiative.
But give us a quick refresher on those two major phases of model training, the pre-training phase and the fine-tuning phase and how they differ.
Yeah, sure. I mean, pre-training is usually about creating a base model, or what some also call a foundation model, which is basically just a large language model that is really good at predicting the next token in a sequence.
And a token can be the next word in a sentence, or part of a word, or the next pixel in an image, or something like that. And to make these large language models really good at predicting this next token, they are trained on very large amounts of data, usually too large to really carefully look at all the contents of the data that you train the model with. So AI builders in this phase usually rely on techniques that can be automated and scaled to collect the data, to filter it, et cetera.
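To make next-token prediction concrete, here is a toy sketch in Python that picks the most likely next word from simple bigram counts over an invented snippet of text. Real large language models learn the same kind of conditional distribution, but with neural networks over subword tokens and vastly more data.

```python
from collections import Counter, defaultdict

# A toy "training corpus", invented purely for illustration.
corpus = "the cat sat on the mat . the dog sat on the rug .".split()

# Count how often each token follows each preceding token (a bigram model).
following = defaultdict(Counter)
for prev_tok, next_tok in zip(corpus, corpus[1:]):
    following[prev_tok][next_tok] += 1

def predict_next(token: str) -> str:
    """Return the most likely next token given the previous one."""
    candidates = following.get(token)
    return candidates.most_common(1)[0][0] if candidates else "."

print(predict_next("sat"))  # -> "on"
print(predict_next("the"))  # -> "cat" (ties are broken by first occurrence)
```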
Then the pre-trained model itself, or the foundational model itself, is usually difficult to use because it doesn't reliably produce the outputs that are useful to you. Like when you ask a pre-trained model a question, it doesn't necessarily provide an answer. So you need to give it more training, and that's the fine-tuning that you do.
So you basically provide it with additional training to make it behave in more predictable and useful ways. So, for example, for ChatGPT, OpenAI first created this base model called GPT, and to fine-tune it, they, among other things, generated multiple answers to the same prompts. And then they had human moderators rate the responses from best to worst, and optimized the model to produce answers that are rated highly, to make the model more useful as a chatbot, essentially.
So this fine-tuning requires less data than the pre-training, but more curated and hands-on data, sometimes even produced by data workers.
And roughly how much more data is needed in the pre-training phase than the fine-tuning phase? Is it ten times more, 100 times more, a million times more?
Oh, that depends a lot. I guess the thing is, I don't have a good idea about exactly how much fine-tuning data is usually needed. Because when you think of ChatGPT, I don't think OpenAI discloses that, and fine-tuning is a constant effort, right? I mean, in a way it's constantly growing, I would assume. But at least initially, usually the pre-training data is just a lot more. Like, one of the most popular training datasets for pre-training is called The Pile, from EleutherAI. And that's like 800 gigabytes of just text data.
And I would assume that the fine-tuning phase uses a lot less data. It's still a lot when you look at it individually, but compared to the pre-training it’s a lot less.
And how are AI model builders actually collecting data for that pre-training phase? What's the process and what are their goals of that phase of training?
I mean, usually when you want to compile data for pre-training you have different goals, and it's not always easy to align those goals. On the one hand you want data that is high quality, and you want data that is very diverse, to teach the language model a lot of different styles of language. But then you also want to have a lot of it, as I said, too much to really look carefully into it.
So where do they get this data? They usually combine two types of data, I would say on a very high level.
First, you have a bunch of different datasets that come from different platforms with user-generated content, or just archives of particular types of content. So you have stuff like Wikipedia, you have arXiv for scientific text, you have GitHub for source code, you have Project Gutenberg for books, you have EuroParl, which is like the proceedings of the European Parliament in various languages. Or often you also have pirated materials, like shadow libraries, to have even more books.
So these are sources where you have a better idea what this data is. And AI builders use it because they consider it good quality and diverse enough. The thing is, if you only use these sources, the amount of training data is still not considered large enough by most AI builders to make the model perform well. So the second type of data is usually this web crawl data. They basically scale up the size of their overall pre-training data by adding like a ton of web crawl data, like basically HTML texts from websites from all over the Internet.
And when you are Google or OpenAI or Microsoft, you have your in-house crawlers and you can basically collect this data yourself. If you're not one of these big companies, as far as I'm aware, almost everybody is relying on Common Crawl, which offers this kind of data for free.
Common Crawl's founder, Gil Elbaz, has a pretty unique backstory, as does Common Crawl itself. Tell us a little bit about the history of Gil and the Common Crawl project.
Sure. Yeah. So, I mean, Common Crawl's founder that we just mentioned, Gil Elbaz, was one of the co-founders of Applied Semantics in the late nineties.
And this company invented AdSense, which later became Google AdSense because Google acquired the company in the early 2000s. Gil Elbaz then worked for Google until 2007. And in several interviews that he gave some years after his departure, he explained that he left Google because he was worried that Google was becoming too powerful, because it had these giant amounts of data it could work with. And in his view, data is the key driver of all kinds of innovation. So, as he put it, Google was becoming a monopoly of innovation.
And to counter that, he wanted to found what he called “neutral data companies,” companies whose primary purpose is to just provide data to other companies. And Common Crawl was one of these neutral data companies that was meant to be like a neutral nonprofit infrastructure that should imitate the way Google crawled the web for its search engine and then make that data available to anyone for free in order to level the playing field of technology development and enable others to compete with Google, if you will. And I think understanding this history is really important when we want to understand Common Crawl's role in Generative AI, because it shows that providing AI training data was never Common Crawl’s primary purpose.
AI builders have always been part of its user group, but providing AI training data has never been the sole purpose of Common Crawl.
And another incredible thing that you mentioned in your article is the small size of Common Crawl. It's like less than five people, right, who were doing this whole operation?
Yeah. When I actually did the interview, they had three employees, and when OpenAI published its GPT-3 paper, I think they had one. So it was a tiny project for quite some time. I'm not sure how many people are working there now, but it’s a lot more.
We'll talk in a moment about the size of Common Crawl. But it's very large. It can't be cheap to store all of that data and to crawl the web and to gather all of that data. How is the project financed?
Oh yeah, that's a good question. I mean, partly it's possible because they get support from Amazon. They can host their data for free on Amazon Web Services because Amazon has, like, I forgot what they call it, but some sort of philanthropic open data initiative. And Common Crawl is part of that. So that saves a lot of money, obviously.
And then Gil Elbaz has basically financed this operation for most of its history. And I mean, he is like a multi-millionaire. Like, when he left Google, he was already a multi-millionaire, so he was able to finance this operation. But I mean, it's still quite impressive, because it was always a very small team, and they managed to build, over time and through a lot of iterations since they were founded, this giant archive that just keeps growing.
So, yeah, it was a mix of having this support in the background, but also a lot of experience that was being gathered by the people working there, even though it was such a small group.
All right, let's talk about the juicy stuff. So lay it on us. How big is the Common Crawl?
When I interviewed people at Common Crawl, like in mid-2023, the number that they mentioned in these interviews was 9.5 petabytes in total of the entire archive going back until like, I think the first one came out in 2008 or so. And I mean, this was mid-2023.
And this archive is growing every month by roughly 400 terabytes, because every month Common Crawl publishes new crawling data. And each of these individual crawls contains between 3 and 5 billion URLs, which is roughly equivalent to 400 terabytes. So it's very large and it keeps growing.
Yeah. And so the Internet is very big, and Common Crawl is a snapshot of the Internet. It's not the entire thing. So that means that Common Crawl has to determine somehow which websites to crawl, how often. So talk a little bit about what that process looks like and how they determine what websites to include in the Common Crawl every month and whether they like, include duplicates from month to month or filter them out. How does that work?
There are a lot of repeats, actually.
Maybe on a high level. Common Crawl is always trying to strike a balance. Like on the one hand, it wants to enable this large scale cross domain analysis of web data, but on the other hand, it is also careful to stay within this U.S. fair use regulation for copyrighted material.
And that means that in most cases, they only collect the HTML code of websites. But more importantly for your question, it also means that they don't collect full copies of web domains. Like, there's not a full copy of Wikipedia in Common Crawl, for example; they only take some, but not all, of the pages of the domains that they encounter.
Okay, so how do they decide what to crawl? Basically, since roughly 2017, they calculate each domain's harmonic centrality. Think of it as an alternative to PageRank: a mathematical way to determine the relevance of a node in a network. To put it very simply, the more often a domain is directly or indirectly linked to, the higher its harmonic centrality score, with more direct links contributing more. So if a page directly links to a Wikipedia article, that contributes more to Wikipedia’s harmonic centrality than if a page links to another page that in turn links to Wikipedia.
So you can think of harmonic centrality as a way to capture how accessible a website or domain is, in the sense that you can hop to it from other pages. And importantly, Common Crawl uses harmonic centrality not just to decide which domains to include, but also how many pages from those domains to include. Wikipedia, for example, is a very important domain and always has a very high harmonic centrality.
So you always have a good amount of pages from Wikipedia in each of these crawls. But pages or domains that have lower or varying scores may or may not be included in a crawl, and even if they are included, they are represented with fewer pages.
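To make harmonic centrality a bit more concrete, here is a minimal sketch using the networkx library on a tiny invented link graph. The domains and links are made up purely for illustration; Common Crawl's real link graph spans billions of pages.

```python
import networkx as nx  # pip install networkx

# A tiny invented link graph: an edge A -> B means "a page on domain A links to domain B".
G = nx.DiGraph([
    ("blog.example", "wikipedia.org"),
    ("news.example", "wikipedia.org"),
    ("forum.example", "news.example"),
    ("forum.example", "blog.example"),
    ("shop.example", "blog.example"),
])

# Harmonic centrality sums 1/distance over the shortest paths *into* each node,
# so domains that many other domains link to (directly or closely) score highest.
scores = nx.harmonic_centrality(G)
for domain, score in sorted(scores.items(), key=lambda item: -item[1]):
    print(f"{domain:15} {score:.2f}")
# wikipedia.org ends up with the highest score in this toy graph.
```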
So the way that it works in more concrete terms is that internally Common Crawl has a database called the CrawlDB. And when I did the interview in August last year, it contained, like, 25 billion URLs. So by now it's more, I assume.
And for each of these URLs in this internal database, they record the harmonic centrality and when it was last fetched successfully. When they initiate a new crawl, they take the harmonic centrality score and add or subtract points depending on when the page was last fetched, with the goal of including more pages that have not been included before, or that haven't been fetched for a while, to prevent the same pages from being included over and over again. That said, in the interview the main crawl engineer said that about 50% of the pages in each crawl have been crawled at some point before.
As far as I'm aware Common Crawl de-duplicates pages in the individual crawls that they publish monthly, but they don't go back to their older crawls and say, like, “Okay, remove all the URLs that have already been crawled in previous months’ crawls.”
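Common Crawl has not published its exact scoring rules, but as a rough illustration of the recency adjustment described above, a selection heuristic might look something like this sketch. The field names, bonuses, and thresholds are all invented.

```python
from datetime import datetime, timedelta

# Hypothetical CrawlDB-style records. The values below are invented for
# illustration; Common Crawl's actual scoring is not public.
pages = [
    {"url": "https://en.wikipedia.org/wiki/Data", "centrality": 9.1, "last_fetched": datetime(2024, 5, 1)},
    {"url": "https://example.com/old-post", "centrality": 2.3, "last_fetched": datetime(2022, 1, 15)},
    {"url": "https://example.com/never-seen", "centrality": 1.8, "last_fetched": None},
]

def crawl_priority(page, now=datetime(2024, 6, 1)):
    """Start from harmonic centrality, boost pages that were never fetched
    or not fetched recently, and slightly penalize recently fetched ones."""
    score = page["centrality"]
    if page["last_fetched"] is None:
        score += 3.0
    elif now - page["last_fetched"] > timedelta(days=365):
        score += 1.5
    else:
        score -= 1.0
    return score

# The pages with the highest priority would be selected for the next monthly crawl.
for page in sorted(pages, key=crawl_priority, reverse=True):
    print(f"{crawl_priority(page):5.1f}  {page['url']}")
```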
And does Common Crawl only collect text data, or do they collect other kinds of data too?
There's also other kinds of data. I mean, it sometimes collects images, it sometimes collects PDFs. There was actually a project that only uses the PDFs in Common Crawl, because even though they're a very small percentage, Common Crawl is so large that it's still a significant amount. But yeah, almost 90% of it is just HTML text.
Okay, so we have this harmonic centrality score, but what kind of pages is Common Crawl actually targeting? You mentioned earlier, Wikipedia is an important website for Common Crawl, and there are many Wikipedia pages within the Common Crawl, which makes sense. Wikipedia is a very important website on the Internet. So what about other important Internet websites? So news websites like the New York Times or the Washington Post or sites like Reddit? Does Common Crawl kind of emphasize those larger important websites?
Because obviously there's a very long tail of web pages on the Internet. You know, mom and pop bakeries have web pages. People have LiveJournal blogs from 2003 that two or three people per year read, and so on. So there's this very long tail.
How does Common Crawl kind of make a trade-off between targeting the long tail versus targeting some of these very important, central web pages on the Internet?
I would say their approach is to be wide and to capture a lot, and I mean, basically, harmonic centrality is their primary way to determine what to include and what not to include. So, I mean, they do have a separate crawl for news websites only. I haven't looked into that more deeply. But if you talk about the main crawl that I just described, they just try to be very wide and capture a lot.
But they do care about relevance. They want to include pages that are important in some ways and relevant. They don't want to, for example, have a lot of spam in their crawl.
One thing that you emphasize in your article is that Common Crawl is not the entire Internet; it's a small portion of the Internet. Sometimes people mistakenly say, and I've said it myself, that Common Crawl and other datasets are large portions of the Internet and that AI models are trained on large portions of the Internet.
So I think that's an important corrective and we can talk more about it later. But do we know, I guess, how big the Internet is and what percentage of the Internet Common Crawl represents?
I mean, I think there are estimates about how big the Internet is, but I'm not sure how accurate they are. That was something that was striking to me in the interviews that I had with the people working at Common Crawl, because I asked them how representative Common Crawl is and how much of the web they thought they were covering. They very openly acknowledged, “We don't know.” They argued that the Internet, or the web, is practically infinite because it's a moving target, right? I mean, there are constantly new pages being added, and there are constantly pages being removed from the web or succumbing to link rot.
So it's almost impossible to really capture everything. And because they're also not sure how large the web is in total, they are hesitant to make estimates about how much they cover. And they very openly acknowledge the limitations of the data they are collecting, and that they want to address them.
You mentioned earlier spam and junk web pages. How does Common Crawl define what spam or junk web pages are? And what, if anything, do they do to try to handle that or filter that kind of data out of Common Crawl?
I mean, in terms of how they define it, they are mostly interested in removing link spam, where pages basically send the crawler from one interconnected spam pool to another. And that's also the only instance where Common Crawl manually tries to intervene in the crawling process, because if they don't do that, they run into the problem that their crawler might get stuck in these spam pools. And then you look at the data and most of it is just the spam stuff, and they don't want that. So this is, I think, their definition of junk, primarily.
To widen your question a little bit, when it comes to junk, in the eyes of AI builders that want to train models on Common Crawl, Common Crawl also contains a lot of stuff like boilerplate text, like the names of menu items on HTML websites, or like error messages or SEO optimization text, or just duplicates. And AI builders, most of them don't want to have this data included when they train their models. So this would be the other category of junk that at least AI builders consider.
And you talk in your article a little bit about what that filtering process looks like for AI model builders, so talk a little bit about what that process is.
Okay. Yeah, I mean, when we talk more broadly about this filtering process that they do, then I should also add another thing. I'm not sure if it falls under the junk question directly, but Common Crawl also deliberately does not curate its data in any way. So Common Crawl does not, for example, remove hate speech from its data, because it wants this data to be useful for researchers studying hate speech.
And so it does not want to do that. This stance actually makes sense when you are aware of Common Crawl’s origins with its founder, Gil Elbaz, who thinks that data is the main driver of innovation and that Common Crawl should be a neutral data company, right? The emphasis being on neutral. So it makes sense from that perspective.
But when you are an AI builder and you want to train a large language model, you do not want — usually at least — you do not want to train your model on this kind of data, because then your model will also produce harmful outputs.
So when it comes to the filtering that AI builders do to Common Crawl, before the training, it has to take both of these into account, right? I mean, on the one hand the junk and boilerplate and duplicates and so on, but then also removing all that harmful content that Common Crawl deliberately includes.
And I mean, what types of filtering do they do? There are like a couple of broad techniques. The most obvious ones are of course just deduplication and language filtering. Then, when it comes to removing harmful content or just boilerplate text, you can use keywords or just very simple heuristics. Like, you can, for example, make a list of keywords that you consider harmful, and then if a page contains any of these words, you just remove it. Or you say, within these pages, only retain lines that end with a punctuation mark. Or you can use AI classifiers, which is something that OpenAI, for example, did.
Like, OpenAI trained a classifier on what they considered a high-quality reference dataset, and in that case they used Reddit for that. This was before Reddit closed off access to its API and everything. But basically they said, give me the texts of all the URLs that are upvoted on Reddit at least three times, and then make an AI classifier that only keeps pages in Common Crawl that are similar to those pages that were extracted from Reddit. So that would be the AI classifier approach. You use a high-quality reference and use that, basically, to filter Common Crawl.
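As a rough illustration of the heuristic style of filtering described above, the sketch below combines a keyword blocklist with the "keep only lines that end in punctuation" rule. The blocklist and the sample page are invented, and real pipelines such as C4 use far longer lists and additional rules; the classifier approach replaces these hand-written rules with a model trained on a high-quality reference corpus.

```python
# Tiny invented stand-ins for the real blocklists and corpora.
BLOCKED_WORDS = {"badword", "spamword"}
TERMINAL_PUNCTUATION = (".", "!", "?", '"')

def filter_page(text: str):
    """Drop a page entirely if it contains a blocked word; otherwise keep only
    the lines that end with terminal punctuation (a C4-style heuristic that
    tends to remove menus, error messages, and other boilerplate)."""
    lowered = text.lower()
    if any(word in lowered for word in BLOCKED_WORDS):
        return None  # the whole page is removed from the training set
    kept = [line for line in text.splitlines()
            if line.strip().endswith(TERMINAL_PUNCTUATION)]
    return "\n".join(kept) or None

page = "Home | About | Contact\nCommon Crawl publishes new crawl data every month.\nClick here to subscribe"
print(filter_page(page))  # -> "Common Crawl publishes new crawl data every month."
```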
So AI model builders are using these various techniques, heuristics, keyword searches, AI classifiers, to try and clean up the data. But Common Crawl is so big that it can't be inspected manually. And the filtering techniques these AI builders are using are not 100% effective. So some hate speech and other toxic content does eventually end up in the datasets used to train large language models, is that right?
Yes. I mean, I should say, first of all, there are a handful of filtered Common Crawl versions that are extremely popular and that are being reused over and over. One of those is called C4, which was created by researchers at Google in, I think, 2019. C4 stands for, like, what was it? Colossal Clean Crawled Corpus, if I remember correctly. And that just used a very simple keyword filtering. I mean, it's like a crowdsourced list on GitHub that is called the “list of naughty and dirty words” or whatever.
And, I mean, this is very problematic, actually, because this list mostly contains words related to sex and pornography, which means if you rely only on that list to filter Common Crawl, you leave other types of harmful content that is, for example, racist, mostly untouched, and you might also remove non-toxic content from LGBTQ communities and stuff like that.
So if you only rely on these simple filtering techniques, you usually end up with, like, a lot of toxic data that is still contained in the data that you train your models with.
Yeah. And are there any solutions that have been proposed that would, I guess, number one, offer better filtering of not just pornography words, as you were mentioning, but other kinds of hate speech, like, you know, racial hate speech and other toxic content, and number two, techniques that would conversely keep the other kinds of content we were mentioning in the dataset, like, you know, “bad words” that are used by members of certain communities in actually an empowering way, or maybe just as slang and are not harmful, but appear harmful if, you know, overly simplistic filtering solutions are used? So are there any, you know, papers or articles or any other solutions you've heard about that would be better than the status quo in terms of filtering these very large datasets to prepare them for large language models?
That's a really good question, because, I mean, my impression is there is this unresolved conflict that is at the heart of this question, because, on the one hand, AI builders say we need to train on these gigantic amounts of data to have the right performance, but at the same time, they haven't dealt with this question thoroughly enough, I would say.
I mean, it seems to me at least an implicitly accepted practice to train models on versions of Common Crawl that are not filtered super thoroughly and deal with the problems that emerge from that in the fine-tuning phase. That's at least my impression. And, I mean, how to solve this, I don't think there's like one silver bullet method that will solve this.
I think the problem is mostly that these popular filtered Common Crawl versions are mostly created by the AI builders themselves. They are built by the people that want to train a model with that stuff, and creating the data is just a stepping stone in their project, even though these are versions that end up being used for years.
Like, as I said, C4 was created in 2019 and is still used today. It was never updated after the original publication to take criticism and feedback about this filtering into account. So, I mean, I would argue that what is needed would actually be something like dedicated filtering intermediaries, for example: organizations whose primary task is to constantly work on filtering this kind of data, in transparent and accountable ways, of course, so we know how they do it.
But I think it's just, like, a lot of effort, and there is no easy shortcut to doing that properly, I think.
On the flip side, some people have argued that these new large language models are too woke, like their output is bland or too politically correct, or, you know, just uninteresting. And it can be hard to have, you know, a conversation that represents the full expanse of human experience, or even just a conversation that represents certain political views, typically right leaning views, the critics would argue.
And as I understand it, a lot of that kind of output is because of the fine-tuning phase of model training that we were mentioning at the top of our conversation. And I wonder if that fine-tuning is really acting as a kind of cleanup and having to do extra heavy lifting, so to speak, because the large datasets we've been talking about, like Common Crawl used in the pre-training phase, are imperfect because of this filtering we've been mentioning. You know, the filtering is leaving in some content it shouldn't, it's removing other content that should be left in.
And so fine-tuning has to be a little bit heavy handed to try and correct that and make the model give particular kinds of responses. So do you think if the datasets used in pre-training were better filtered or better curated, it would lead to a more enjoyable end product for the general public? Because the fine-tuning phase would be able to evolve in a way that is not as heavy handed, and the experience for the eventual end users of large language models would be more enjoyable, and the models would be, I guess, representing a wider selection of views in a way that's still kind of safe and responsible.
Oh, that is a really good question. I mean, I'm not sure if it would solve the wokeness issue that you described, because I think at the heart of that problem is also the fact that the companies that produce these general-purpose Generative AI products, like ChatGPT and Google Gemini or whatever, try so hard to avoid making any kind of political statements or producing anything that anyone might consider offensive. I'm not sure if that problem could be solved by more pre-training curation. Maybe, I'm not sure.
But I mean, my case for why I think more effort on data curation in the pre-training phase is worthwhile, and why we should rely less on dealing with toxic content in the fine-tuning phase, would be more that we have more and more of these models running, even on laptops nowadays. We have ways to basically remove the restrictions that are built in through fine-tuning. We have, for example, uncensored versions of Meta's Llama models, etcetera. And if base models are less toxic to begin with, that would just be a huge step forward in making Generative AI safer.
Generally, if you only rely on the fine-tuning to make them safe, what happens then to these models that are just used by people without the fine-tuning? I think that's something that is important to consider.
Yeah, I do believe it is also important, especially if these models more and more become gateways to how we experience the Internet, like when they are built into search engines, for example, or if we now use our smartphones more and more through Generative AI products. I do believe it is important also that more viewpoints are represented in the pre-training data.
And I'm not sure if you can just rely on fine-tuning to balance viewpoints. I think, from an ethical standpoint, we should also have these base models be more representative, again because people might use these models without the fine-tuning of these big companies. And especially, I mean, the interesting thing related to that is that Microsoft and others are now promoting this AI-as-a-platform idea. Like, hey, you can get API access and then customize it, and then you can do your own stuff. At the very least, these base models, these foundation models that are the platform, essentially, should already be safe and representative and fair. You should not just put this burden on basically everyone that wants to use these foundation models for anything.
And what's the language breakdown of Common Crawl in terms of English versus non-English?
I mean, most of the content in Common Crawl is English and most of the domains are .com domains. To some extent that also reflects the inequalities of global Internet usage. The web as a whole is also dominated by English; there are, I think, various estimates that around 50% of it is English as well. But it is also because Common Crawl’s infrastructure is based in the U.S., and that influences the crawls toward English content. For example, if you have a page that provides multiple languages, it will default to the English version if you access it from the U.S.
And in terms of regional coverage, it is also mostly, I would say, the global north, if you will. It is a bit uneven; it's not a representative view of everything. And when I interviewed the people working at Common Crawl in the middle of last year, this was one of the first things they said they want to work on when they get more resources: better regional and language coverage. This is something they are working on a lot.
Nice. And so that's something they're actively working on right now, is improving the language coverage?
I mean, I assume they do. I haven't talked with them in a while. I mean, I've seen that the percentage of English went down in more recent crawls, and they are fundraising to get more resources and have hired a lot more staff. So I'm assuming that this is something they're working on right now.
Let's discuss Common Crawl's importance to Generative AI and to large language models, specifically. You mentioned earlier that when the Common Crawl project was founded, it wasn't aimed at providing training data specifically for large language models, but it has become quite central to that ecosystem and to Generative AI more broadly. Talk a little bit about the importance there and really how central the Common Crawl dataset is.
Sure. I would go so far as to say that without Common Crawl, we might not even have the Generative AI hype we're seeing right now. Because especially in the early days, around the time OpenAI published GPT-3, Common Crawl was such an important source that everyone relied on. I mean, for GPT-3, which still powers the free version of ChatGPT today, roughly 80% of the tokens of this model were based on Common Crawl’s data. And in my research, I looked at text generators like ChatGPT, so I didn't look at image generators or other things. And I collected 47 of these text generators published since 2019, which is roughly when the first of these large language models came out. And at least 64% of those have been using Common Crawl, and very often they used it to a significant degree. There were, I don't know, 10 out of 47 models or so that just did not provide enough information to be able to determine if they used Common Crawl, but I'm pretty certain that at least some of them did.
Like, for example, Facebook's Llama 2: they don't tell you what they used, but I wouldn't be surprised if they used Common Crawl, because Llama 1 used Common Crawl very heavily. So at least 64% of these models used it. But also for image generators: some of the most popular training datasets for those are the LAION datasets, and those consist of image/alt-text pairs which are parsed from Common Crawl. So Common Crawl is also really important for these image generators, even though I haven't looked more deeply into it.
And I mean, maybe just to share an anecdote about one of the most striking things in my research. There was this BigScience workshop from 2021 to 2022 that was about making a more open and transparent large language model compared to what the leading AI companies were doing at the time. And they also published their own dataset for training their model. And in the paper describing their dataset, they said, “We included a version of Common Crawl as well, because if we didn't, we would invalidate comparisons with other large language models that have been published previously.” That was striking to me because it indicated just how much Common Crawl data has shaped the expectations of AI builders for how their models behave.
You mentioned something interesting there when you talked about Common Crawl being used for text-to-image generation technologies by creating pairs of text and images that can be associated for model training. This is the so-called alt text, which is text used to aid assistive technologies for people with vision impairments. Website builders add this alt text to describe what the images depict, so when assistive technologies come upon it, they can describe the image to the person with a vision impairment. But earlier you mentioned that Common Crawl actually did not have very many images in it and was mostly only text. So explain that discrepancy a little bit more.
My understanding is they don't use the images in Common Crawl, but they have the HTML code in Common Crawl, and they use that to find the images, and they use the alt text descriptions of those images to help their models understand, if I type, like, “generate a funny rabbit” or whatever, what that should look like.
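As a minimal sketch of what that looks like for a single page, the snippet below pulls (image URL, alt text) pairs out of some invented HTML using the BeautifulSoup library. The real LAION pipeline runs over Common Crawl's archives at a vastly larger scale and applies additional filtering.

```python
from bs4 import BeautifulSoup  # pip install beautifulsoup4

# An invented HTML snippet standing in for one crawled page.
html = """
<html><body>
  <img src="https://example.com/rabbit.jpg" alt="A funny rabbit wearing a hat">
  <img src="https://example.com/logo.png">
</body></html>
"""

# Collect (image URL, alt text) pairs; images without alt text are skipped
# because there is no caption to pair them with.
soup = BeautifulSoup(html, "html.parser")
pairs = [(img["src"], img["alt"])
         for img in soup.find_all("img")
         if img.get("src") and img.get("alt")]
print(pairs)
# -> [('https://example.com/rabbit.jpg', 'A funny rabbit wearing a hat')]
```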
Earlier you said that Common Crawl was a big part of OpenAI's GPT-3 model, which was not OpenAI's first model, but the first model that really took on a life of its own with the public and kind of jump-started this Generative AI moment we're in. Do we know if, for current versions of OpenAI's models, like the more recent GPT-4 model and the GPT-5 model they're currently training, Common Crawl continues to be an important part of the training? I know OpenAI is a little bit opaque about what's going on. So do we know anything there or not?
We don't really know because they don't provide any information about that. I would assume that they don't use Common Crawl anymore because they have their own crawler and they have more control over what they want to collect or not. But we really don't know.
Also, as I said, it's very likely in my opinion that Llama 2 uses Common Crawl, but Facebook just doesn't tell us, so we cannot know for sure. Same with Google. I mean, with Google I'm pretty sure they don't, because they probably have more web crawl data than Common Crawl has.
But yeah, the short answer is we don't know, because they don't disclose that information.
Copyright law and Generative AI is a hot topic right now. Did your research uncover anything about Common Crawl in terms of the copyright law space that you think is worth mentioning?
I mean, what I mentioned before, they always have been trying to be within fair use by not having full copies of domains and by mostly just collecting HTML code.
That said, I think right now they are watching very closely all these legal cases that challenge whether training an AI model falls under fair use or not. And they got caught up in that. I mean, The New York Times recently sued Microsoft and OpenAI, and they mention Common Crawl explicitly in their complaint, because they make the argument, like, “Hey, around the time OpenAI trained GPT-3 in 2020, there was a lot of our content in Common Crawl.” That content has been removed since then, but The NYT cited a study from 2020 or so that analyzed what is in C4, saying, like, “Hey, look, there's a lot of our content in there.” And it also means that Common Crawl is now being blocked by more and more websites, simply because they don't want to give away their data for free for AI training.
So I would say, regardless of what comes out of these court cases, even if the courts were to decide, “Hey, it's all fair use, it's all perfectly legal,” I think Common Crawl is still being challenged to maybe change the way it operates.
And I don't know how it will change. But I mean, Common Crawl is not interested in having more and more websites blocking its crawler because people don't want their data to be used for AI training. So I think Common Crawl is going to be interested in finding ways, regardless of the legality, that enable people to have more say or more control over how their data is being used. So I think there will be some changes coming in the future. What those changes will be like, I don't know.
As part of your research, you were able to conduct interviews with Common Crawl’s director and main crawl engineer. What are some of the things that stood out to you from those conversations?
I mean, I think we mentioned it before: I was struck by how reflective they are about their work. My impression is that many AI builders are less reflective about their usage of Common Crawl than Common Crawl is about its own work. They very openly acknowledge their limitations.
I mean, what also struck me is that they make arguments that are very attractive to me as a researcher. When they say, “Hey, we don't want to remove hate speech because we want to enable researchers to use our data to study hate speech,” that's something that I can sympathize with as a researcher, because researchers have tried to get access to platform data for so many years now to study, for example, how hate speech spreads on Facebook or X or whatever. Their stance of, “Hey, we want to enable researchers to look at this data, and we don't want to be the authority that basically shields this information from researchers,” was also something that was attractive to me, and it wasn't an argument that I saw coming when I went into these interviews.
And yeah, I mean, just generally speaking, like, just how few people worked on the Common Crawl project was just the most striking thing to me. I did not expect that it would be so small.
We're almost out of time. Is there anything we haven't touched on that you want to mention before we close?
Maybe one thing that I didn't mention just now about the interviews is how much Common Crawl insists, “Hey, what we have is not the entire Internet.” And the lesson that I take out of that is also that you cannot have a copy of the entire Internet, because it is a moving target.
It strikes me that when AI builders claim that their models have been trained on the sum of human knowledge or the entire Internet, this is just a way to avoid the responsibility of doing proper dataset curation, because they just assume, like, “Hey, we have everything. Why do we need to filter? We have everything. Everything is represented in some way.”
But that's never really true. And even if they had the entire Internet, it ignores that the Internet itself is not representative of the global population. I think 30% of the world's population is not even online yet.
So something I've become very allergic to are these claims like, “Hey, it's the entire Internet and it's the sum of human knowledge.” Those claims get thrown around everywhere these days. And I think that's something I'm more inclined to push back against because of this research.
Stefan Baack, thanks for being on the podcast.
Thank you so much.