Estimating how much text on the internet is generated

Researchers analyzed newly published websites from 2022 through mid-2025 to estimate what percentage used generated text and how this might affect future information online.

The proliferation of AI-generated and AI-assisted text on the internet is feared to contribute to a degradation in semantic and stylistic diversity, factual accuracy, and other negative developments. We find that by mid-2025, roughly 35% of newly published websites were classified as AI-generated or AI-assisted, up from zero before ChatGPT’s launch in late 2022. We also find evidence suggesting that increases in AI-generated text on the internet bring about a decrease in semantic diversity and an increase in positive sentiment. We do not, however, find statistically significant evidence supporting the hypothesis that an increased rate of AI-generated text on the internet decreases factual accuracy or stylistic diversity. Notably, our findings diverge from public perception of AI’s impact on the internet.

So about a third of new sites now use AI-generated or AI-assisted text. That seems like a lot?

I’m more surprised that there didn’t appear to be a significant change in fake information or a convergence in style.

My theory is that most people putting up these generated sites are either experimenting or trying to make a quick buck. Either way, they take whatever a probabilistic model gives them and forget about it. They don’t care what the words say or how they say it, as long as it fills space. So the output defaults to mostly correct statements.
