FSNN | Free Speech News Network
Friday, February 13

News Publishers Are Now Blocking The Internet Archive, And We May All Regret It

By News Room

from the our-digital-history dept

Last fall, I wrote about how the fear of AI was leading us to wall off the open internet in ways that would hurt everyone. At the time, I was worried about how companies were conflating legitimate concerns about bulk AI training with basic web accessibility. Not surprisingly, the situation has gotten worse. Now major news publishers are actively blocking the Internet Archive—one of the most important cultural preservation projects on the internet—because they’re worried AI companies might use it as a sneaky “backdoor” to access their content.

This is a mistake we’re going to regret for generations.

Nieman Lab reports that The Guardian, The New York Times, and others are now limiting what the Internet Archive can crawl and preserve:

When The Guardian took a look at who was trying to extract its content, access logs revealed that the Internet Archive was a frequent crawler, said Robert Hahn, head of business affairs and licensing. The publisher decided to limit the Internet Archive’s access to published articles, minimizing the chance that AI companies might scrape its content via the nonprofit’s repository of over one trillion webpage snapshots.

Specifically, Hahn said The Guardian has taken steps to exclude itself from the Internet Archive’s APIs and filter out its article pages from the Wayback Machine’s URLs interface. The Guardian’s regional homepages, topic pages, and other landing pages will continue to appear in the Wayback Machine.

The Times has gone even further:

The New York Times confirmed to Nieman Lab that it’s actively “hard blocking” the Internet Archive’s crawlers. At the end of 2025, the Times also added one of those crawlers — archive.org_bot — to its robots.txt file, disallowing access to its content.

“We believe in the value of The New York Times’s human-led journalism and always want to ensure that our IP is being accessed and used lawfully,” said a Times spokesperson. “We are blocking the Internet Archive’s bot from accessing the Times because the Wayback Machine provides unfettered access to Times content — including by AI companies — without authorization.”
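For readers unfamiliar with the mechanism, a robots.txt rule of the kind described above can be sketched as follows. This is a minimal illustration, not the Times's actual file: the robots.txt text, the example.com URL, the "SomeOtherBot" name, and the `is_allowed` helper are all assumptions made up for the example. Only the crawler name `archive.org_bot` comes from the reporting. The check uses Python's standard `urllib.robotparser` module.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, written for illustration only.
# It is NOT copied from any publisher's site.
ROBOTS_TXT = """\
User-agent: archive.org_bot
Disallow: /

User-agent: *
Allow: /
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if robots_txt permits user_agent to fetch url."""
    parser = RobotFileParser()
    # parse() takes an iterable of lines, so no network request is made.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    article = "https://example.com/2025/some-article.html"  # placeholder URL
    for agent in ("archive.org_bot", "SomeOtherBot"):
        print(agent, "allowed:", is_allowed(ROBOTS_TXT, agent, article))
```

Note that robots.txt is only a request that well-behaved crawlers choose to honor. That is different from the "hard blocking" the Times describes, which refuses the requests outright.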

I understand the concern here. I really do. News publishers are struggling, and watching AI companies hoover up their content to train models that might then, in some ways, compete with them for readers is genuinely frustrating. I run a publication myself, remember.

But blocking the Internet Archive isn’t going to stop AI training. What it will do is ensure that significant chunks of our journalistic record and historical cultural context simply… disappear.

And that’s bad.

The Internet Archive is the most famous nonprofit digital library, and has been operating for nearly three decades. It isn’t some fly-by-night operation looking to profit off publisher content. It’s trying to preserve the historical record of the internet—which is way more fragile than most people comprehend. When websites disappear—and they disappear constantly—the Wayback Machine is often the only place that content still exists. Researchers, historians, journalists, and ordinary citizens rely on it to understand what actually happened, what was actually said, what the world actually looked like at a given moment.

In a digital era when few things end up printed on paper, the Internet Archive’s efforts to permanently preserve our digital culture are essential infrastructure for anyone who cares about historical memory.

And now we’re telling them they can’t preserve the work of our most trusted publications.

Think about what this could mean in practice. Future historians trying to understand 2025 will have access to archived versions of random blogs, sketchy content farms, and conspiracy sites—but not The New York Times. Not The Guardian. Not the publications that we consider the most reliable record of what’s happening in the world. We’re creating a historical record that’s systematically biased against quality journalism.

Yes, I’m sure some will argue that the NY Times and The Guardian will never go away. Tell that to the readers of the Rocky Mountain News, which published for 150 years before shutting down in 2009, or to the 2,100+ newspapers that have closed since 2004. Institutions—even big, prominent, established ones—don’t necessarily last.

As one computer scientist quoted in the Nieman piece put it:

“Common Crawl and Internet Archive are widely considered to be the ‘good guys’ and are used by ‘the bad guys’ like OpenAI,” said Michael Nelson, a computer scientist and professor at Old Dominion University. “In everyone’s aversion to not be controlled by LLMs, I think the good guys are collateral damage.”

That’s exactly right. In our rush to punish AI companies, we’re destroying public goods that serve everyone.

The most frustrating bit of all of this: The Guardian admits they haven’t actually documented AI companies scraping their content through the Wayback Machine. This is purely precautionary and theoretical. They’re breaking historical preservation based on a hypothetical threat:

The Guardian hasn’t documented specific instances of its webpages being scraped by AI companies via the Wayback Machine. Instead, it’s taking these measures proactively and is working directly with the Internet Archive to implement the changes.

And, of course, as one of the “good guys” of the internet, the Internet Archive is willing to do exactly what these publishers want. They’ve always been good about removing content or not scraping content that people don’t want in the archive. Sometimes to a fault. But you can never (legitimately) accuse them of malicious archiving (even if music labels and book publishers have).

Either way, we’re sacrificing the historical record not because of proven harm, but because publishers are worried about what might happen. That’s a hell of a tradeoff.

This isn’t even new, of course. Last year, Reddit announced it would block the Internet Archive from archiving its forums—decades of human conversation and cultural history—because Reddit wanted to monetize that content through AI licensing deals. The reasoning was the same: can’t let the Wayback Machine become a backdoor for AI companies to access content Reddit is now selling. But once you start going down that path, it leads to bad places.

The Nieman piece notes that, in the case of USA Today/Gannett, it appears that there was a company-wide decision to tell the Internet Archive to get lost:

In total, 241 news sites from nine countries explicitly disallow at least one out of the four Internet Archive crawling bots.

Most of those sites (87%) are owned by USA Today Co., the largest newspaper conglomerate in the United States formerly known as Gannett. (Gannett sites only make up 18% of Welsh’s original publishers list.) Each Gannett-owned outlet in our dataset disallows the same two bots: “archive.org_bot” and “ia_archiver-web.archive.org”. These bots were added to the robots.txt files of Gannett-owned publications in 2025.

Some Gannett sites have also taken stronger measures to guard their contents from Internet Archive crawlers. URL searches for the Des Moines Register in the Wayback Machine return a message that says, “Sorry. This URL has been excluded from the Wayback Machine.”
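A tally like the one quoted above could, in principle, be reproduced by checking each site's robots.txt against the crawler names. The sketch below is an assumption about how such a survey might work, not Nieman Lab's actual method. The sample robots.txt texts, the `*.example` site names, and the helper functions are hypothetical. Only the two bot names quoted above are used, since the article does not name the other two.

```python
from urllib.robotparser import RobotFileParser

# The two crawler names quoted in the article. The dataset covers four bots,
# but the other two are not named, so they are left out here.
IA_BOTS = ("archive.org_bot", "ia_archiver-web.archive.org")


def blocked_bots(robots_txt, bots=IA_BOTS, probe_url="https://example.com/"):
    """List the bots that robots_txt disallows from fetching probe_url.

    Simplification (an assumption of this sketch): only the site root is
    probed, so rules that block just a subdirectory are not detected.
    """
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return [bot for bot in bots if not parser.can_fetch(bot, probe_url)]


def sites_blocking_any(robots_by_site):
    """Map each site that blocks at least one bot to the bots it blocks."""
    result = {}
    for site, text in robots_by_site.items():
        bots = blocked_bots(text)
        if bots:
            result[site] = bots
    return result


# Hypothetical robots.txt files for three made-up sites.
SAMPLE = {
    "paper-a.example": (
        "User-agent: archive.org_bot\nDisallow: /\n\n"
        "User-agent: ia_archiver-web.archive.org\nDisallow: /\n"
    ),
    "paper-b.example": "User-agent: *\nDisallow: /private/\n",
    "paper-c.example": "User-agent: archive.org_bot\nDisallow: /\n",
}

if __name__ == "__main__":
    blocking = sites_blocking_any(SAMPLE)
    print(f"{len(blocking)} of {len(SAMPLE)} sample sites block an archive bot")
    for site, bots in blocking.items():
        print(site, "->", ", ".join(bots))
```

A check like this only sees what a site declares in robots.txt. It would not detect server-side blocking or the Wayback Machine exclusions described for the Des Moines Register.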

A Gannett spokesperson told Nieman Lab that it was about “safeguarding our intellectual property,” but that’s nonsense. The whole point of libraries and archives is to preserve such content, and they’ve always preserved materials that were protected by copyright law. The claim that they have to be blocked to safeguard such content is both technologically and historically illiterate.

And here’s the extra irony: blocking these crawlers may not even serve publishers’ long-term interests. As I noted in my earlier piece, as more search becomes AI-mediated (whether you like it or not), being absent from training datasets increasingly means being absent from results. It’s a bit crazy to think about how much effort publishers put into “search engine optimization” over the years, only to now block the crawlers that feed the systems a growing number of people are using for search. Publishers blocking archival crawlers aren’t just sacrificing the historical record—they may be making themselves invisible in the systems that increasingly determine how people discover content in the first place.

The Internet Archive’s founder, Brewster Kahle, has been trying to sound the alarm:

“If publishers limit libraries, like the Internet Archive, then the public will have less access to the historical record.”

But that warning doesn’t seem to be getting through. The panic about AI has become so intense that people are willing to sacrifice core internet infrastructure to address it.

What makes this particularly frustrating is that the internet’s openness was never supposed to have asterisks. The fundamental promise wasn’t “publish something and it’s accessible to all, except for technologies we decide we don’t like.” It was just… open. You put something on the public web, people can access it. That simplicity is what made the web transformative.

Now we’re carving out exceptions based on who might access content and what they might do with it. And once you start making those exceptions, where do they end? If the Internet Archive can be blocked because AI companies might use it, what about research databases? What about accessibility tools that help visually impaired users? What about the next technology we haven’t invented yet?

This is a real concern. People say “oh well, blocking machines is different from blocking humans,” but that’s exactly why I mention assistive tech for the visually impaired. Machines accessing content are frequently tools that help humans—including me. I use an AI tool to help fact check my articles, and part of that process involves feeding it the source links. But increasingly, the tool tells me it can’t access those articles to verify whether my coverage accurately reflects them.

I don’t have a clean answer here. Publishers genuinely need to find sustainable business models, and watching their work get ingested by AI systems without compensation is a legitimate grievance—especially when you see how much traffic some of these (usually less scrupulous) crawlers dump on sites. But the solution can’t be to break the historical record of the internet. It can’t be to ensure that our most trusted sources of information are the ones that disappear from archives while the least trustworthy ones remain.

We need to find ways to address AI training concerns that don’t require us to abandon the principle of an open, preservable web. Because right now, we’re building a future where historians, researchers, and citizens can’t access the journalism that documented our era. And that’s not a tradeoff any of us should be comfortable with.

Filed Under: ai, archives, culture, libraries, scanning, scraping

Companies: internet archive, ny times, the guardian, usa today

© 2026 GlobalBoost Media. All Rights Reserved.