🤖 AI Summary
Increasingly restrictive web crawling policies threaten the representativeness and quality of AI training data, potentially exacerbating systemic biases. Method: We conduct a large-scale empirical analysis of robots.txt directives across the top one million most-visited websites since 2023, augmented with annotated political leanings and fact-checking ratings for news domains. Contribution/Results: We find that 25% of the top 1,000 websites restrict AI crawlers, a share that exceeds 50% among leading news publishers. Notably, politically neutral news sources impose such restrictions at a rate of 58% and high-factual sources at 55%, whereas only 4.1% of right-leaning outlets block OpenAI's crawler. This is empirical evidence that crawler restrictions are positively associated with political neutrality and factual accuracy: the more neutral and factual an outlet, the more likely it is to block. Consequently, current data acquisition practices may systematically under-sample high-quality, balanced content while over-representing low-quality or ideologically extreme material, which could undermine model fairness, robustness, and reliability.
📝 Abstract
Large language models rely on web-scraped text for training; concurrently, content creators are increasingly blocking AI crawlers to retain control over their data. We analyze crawler restrictions across the top one million most-visited websites since 2023 and examine their potential downstream effects on training data composition. Our analysis reveals growing restrictions, with blocking patterns varying by website popularity and content type. A quarter of the top thousand websites restrict AI crawlers, decreasing to one-tenth across the broader top million. Content type matters significantly: 34.2% of news outlets disallow OpenAI's GPTBot, rising to 55% for outlets with high factual reporting. Additionally, outlets with neutral political positions impose the strongest restrictions (58%), whereas hyperpartisan websites and those with low factual reporting impose fewer restrictions: only 4.1% of right-leaning outlets block access to OpenAI. Our findings suggest that heterogeneous blocking patterns may skew training datasets toward low-quality or polarized content, potentially affecting the capabilities of models served by prominent AI-as-a-Service providers.
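
To make the measurement concrete, the following is a minimal sketch of how a robots.txt file could be checked for a site-wide block on a given crawler, using Python's standard `urllib.robotparser`. It is not the authors' pipeline: the helper names `blocks_agent` and `blocking_rate`, the sample file, and the choice to test only the site root are assumptions made for illustration.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, for illustration only.
SAMPLE_ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /

User-agent: *
Allow: /
"""


def blocks_agent(robots_txt: str, user_agent: str,
                 url: str = "https://example.com/") -> bool:
    """Return True if robots_txt forbids user_agent from fetching url.

    Assumption: a site counts as "blocking" when its root URL is disallowed.
    The paper may use a different definition.
    """
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return not parser.can_fetch(user_agent, url)


def blocking_rate(robots_files: list[str], user_agent: str) -> float:
    """Fraction of robots.txt files that block user_agent at the site root."""
    if not robots_files:
        return 0.0
    blocked = sum(blocks_agent(text, user_agent) for text in robots_files)
    return blocked / len(robots_files)


if __name__ == "__main__":
    print(blocks_agent(SAMPLE_ROBOTS_TXT, "GPTBot"))
    print(blocks_agent(SAMPLE_ROBOTS_TXT, "Googlebot"))
```

Applying `blocking_rate` separately to groups of outlets (for example, by factual-reporting rating) would yield per-group percentages of the kind reported above, under the stated assumption about what counts as a block.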