from the the-knowledge-base-powering-ai dept
Wikipedia celebrated its 25th birthday last month. Given the centrality of Wikipedia to so much activity online, it is hard to remember (or to imagine, for those who are younger) a time without Wikipedia. The latest statistics are impressive:
- Wikipedia is viewed nearly 15 billion times every month.
- Wikipedia contains over 65 million articles across more than 300 languages.
- Wikipedia is edited by nearly 250,000 editors around the world every month. An editor here means anyone with a registered username who makes at least one edit in a given month.
- Wikipedia is accessed by over 1.5 billion unique devices every month.
That’s testimony to the global nature of Wikipedia. But there’s something else, not mentioned there, that is of great relevance to this blog: the fact that every one of those 65 million articles is made available under a generous license – the Creative Commons Attribution-ShareAlike 4.0 license, to be precise. That means sharing and re-use are encouraged, in contrast to most material online, where copyright is fiercely enforced. Wikipedia is living proof that giving things away – relying on volunteers and donations, the “true fans” approach – works, and on a massive scale. Anil Dash puts it well in a post celebrating Wikipedia’s 25th anniversary:
Whenever I worry about where the Internet is headed, I remember that this example of the collective generosity and goodness of people still exists. There are so many folks just working away, every day, to make something good and valuable for strangers out there, simply from the goodness of their hearts. They have no way of ever knowing who they’ve helped. But they believe in the simple power of doing a little bit of good using some of the most basic technologies of the internet. Twenty-five years later, all of the evidence has shown that they really have changed the world.
However, Wikipedia is today facing perhaps its greatest challenge, which comes from the new generation of AI services. They are problematic for Wikipedia in two main ways. The first, ironically, arises because Wikipedia’s holdings are widely recognized as some of the highest-quality training material available. In a post explaining why, “in the AI era, Wikipedia has never been more valuable”, the Wikimedia Foundation writes:
AI cannot exist without the human effort that goes into building open and nonprofit information sources like Wikipedia. That’s why Wikipedia is one of the highest-quality datasets in the world for training AI, and when AI developers try to omit it, the resulting answers are significantly less accurate, less diverse, and less verifiable.
That recognition is welcome, but it comes at a price. It means that every AI company, as a matter of course, wants to download the entire Wikipedia corpus for training its models. That has led to irresponsible behavior by some companies, whose scraping tools download pages from Wikipedia with no consideration for the resources they are consuming for free, or for the collateral damage – slower responses – they inflict on other users.
Trying to stop companies drawing on this unique resource is futile; recognizing this, the Wikimedia Foundation has come up with an alternative approach: Wikimedia Enterprise, “a first-of-its-kind commercial product designed for companies that reuse and source Wikipedia and Wikimedia projects at a high volume”. In 2022, its first customers were Google and the Internet Archive, and last month, Wikimedia Enterprise announced that Amazon, Meta, Microsoft, Mistral AI, and Perplexity have also signed up. That’s important for a couple of reasons. It means that many of the biggest AI players will download Wikipedia articles more efficiently. It also means that the Wikipedia project will receive funding for its work.
This new money is crucial if Wikipedia is to remain a high-quality resource. And that is precisely why every generative AI company that uses Wikipedia articles for training should – if only out of self-interest – pay to do so. What is happening here echoes something this blog suggested back in May 2024: that AI companies should pay artists to create new works, and give away the results, because fresh training material is vital. Helping to pay for Wikipedia to create more high-quality articles that are freely available to all is a variation on that theme.
The other problem that generative AI causes Wikipedia is more subtle. The Wikimedia Foundation explains that alongside financial support, the project needs proper attribution:
Attribution means that generative AI gives credit to the human contributions that it uses to create its outputs. This maintains a virtuous cycle that continues those human contributions that create the training data that these new technologies rely on. For people to trust information shared on the internet, platforms should make it clear where the information is sourced from and elevate opportunities to visit and participate in those sources. With fewer visits to Wikipedia, fewer volunteers may grow and enrich the content, and fewer individual donors may support this work.
Without fresh volunteers, Wikipedia will wither and become less valuable. That’s terrible for the world, but it is also bad for generative AI companies. So, again, it makes sense for them to provide proper attribution in their outputs. That requirement has become even more pressing in the light of a new development. According to tests carried out by the Guardian:
The latest model of ChatGPT has begun to cite Elon Musk’s Grokipedia as a source on a wide range of queries, including on Iranian conglomerates and Holocaust deniers, raising concerns about misinformation on the platform.
That’s potentially problematic because of how Grokipedia creates its entries. Research last year found that:
Grokipedia articles are substantially longer and contain significantly fewer references per word. Moreover, Grokipedia’s content divides into two distinct groups: one that remains semantically and stylistically aligned with Wikipedia, and another that diverges sharply. Among the dissimilar articles, we observe a systematic rightward shift in the political bias of cited news sources, concentrated primarily in entries related to politics, history, and religion. These findings suggest that AI-generated encyclopedic content diverges from established editorial norms – favouring narrative expansion over citation-based verification.
If leading chatbots start drawing on Grokipedia routinely for their answers, it becomes less likely that independent sources exist against which the information can be checked – something generally possible with Wikipedia. It therefore becomes even more urgent for generative AI systems to provide attribution, so that at least users know where information is coming from, and whether there are likely to be further resources that confirm a chatbot’s claims. Not every user will want to follow up on those sources, but it is important to offer the option.
Wikipedia at 25 is an amazing achievement in multiple ways, not least as a demonstration that material can be given away for free, supported directly by users, and on a global scale. It would be a tragedy if the current enthusiasm for generative AI systems led to that resource being harmed or even destroyed. A world without Wikipedia would be a poorer world indeed.
Follow me @glynmoody on Mastodon and on Bluesky. Republished from Walled Culture.
Filed Under: ai, attribution, grokipedia, scraping, wikipedia
Companies: amazon, google, internet archive, meta, microsoft, mistral, perplexity, wikipedia