By Michael Borella --
The Internet has gone through many revolutions, technical and otherwise. Each time it has emerged stronger and more robust than before. One can trace the origins of the Internet to the connection of four computers in 1969. The advent of email a few years later, the standardization of TCP/IP as its communication protocol in the mid-1980s, commercial dialup service providers not long after that, the popularization of the World Wide Web in 1993, and the immediately following rise of ecommerce gave us the main aspects of what we currently think of as the Internet. Inexpensive broadband, the search engine, online gaming, social media, widespread mobile access, and videoconferencing bring us to today.
The next Internet revolution is being driven by generative AI, and it is happening right now. While we cannot say what is going to be in place on the other side of this inflection point, we do know that our online experience will be different in a few years.
Consider just the web. It relies on an economic model that has been largely static for over two decades. But generative AI is already beginning to disrupt this model. Today, publishers and search engines have a mostly symbiotic relationship.[1] They need each other. But what happens when AI is used to subsume publisher content?
Publishers (including small and independent publishers, blogs, businesses, and so on) seek to be highly ranked in search engines, especially for search terms that are relevant to their content. For example, an independent newsletter that reviews cars would want to appear as high as possible in the search results (preferably in the top ten) for queries involving words such as "car", "automobile", "motor", and so on. This will drive traffic to the publisher's newsletter, allowing the publisher to sell ads and/or subscriptions. With the resulting revenue, the publisher can grow its newsletter by hiring more writers and editors, who in turn produce more content. This content, if of sufficient quality, makes it more likely that the publisher will be highly ranked in search results -- a virtuous cycle.
Search engines (or more correctly, search engine providers) crawl the web and use sophisticated algorithms to rank publishers in terms of their sites' relevance to search terms. A successful search engine may receive billions of search requests per day from users who want to be referred to relevant publishers. Each page of search results may be an ordered list of links to publishers, perhaps with a short description or summary of the content that can be found by following each link. Search engines make money by displaying advertisements as sponsored links. They also allow publishers to bid on search terms in a form of auction. The publisher who bids the highest for a given search term often ends up at the top of the search results as a sponsored link, assuming that their content is indeed relevant to the search term.
While this symbiosis has not yet been broken, it is getting wobbly. It is no coincidence that the largest investors in generative AI are the companies that own and operate the largest search engines. OpenAI, which is nearly half-owned by Microsoft, has admitted to training its GPT series of models on content that it absorbed from publishers on the Internet, often without their permission. Some would contend that the models that underlie the popular ChatGPT generative AI tool have been trained on the Internet as a whole.[2] Indeed, The New York Times and other publishers have sued OpenAI and Microsoft for allegedly violating their copyrights and (at least in the case of the Times) allegedly producing near-verbatim copies thereof in response to large language model (LLM) prompts.[3]
Search engine results recently changed from being a series of links to publisher sites (with ad-sponsored results on top and clearly flagged) to leading with an AI-generated overview. This overview seeks to answer the searcher's question or address their need without requiring them to click on links and visit publisher web sites. So far, AI overview sections have also been displaying what the underlying AI model infers to be the most relevant publisher links and/or those used as source material for the overview.
In isolation, AI-generated search results are neither here nor there. Users might appreciate having them summarize publisher content to provide results in a more convenient and readily consumable form. Of course, generating and serving these results is still clearly experimental and subject to the same incorrectness, hallucinations, and bias that plague LLMs. But users may further appreciate not having to navigate to certain publisher sites that display annoying ads or are sophisticated clickbait.
At this point it seems safe to assume that the quality of these AI summaries will continue to improve. If so, at what point does this impact the publishers? The news industry -- one of the biggest and most important publishing sectors -- has lost one-third of its newspapers and two-thirds of its news journalists in the last two decades, mostly in the area of local news.[4] AI is likely to accelerate this decline, with search engines absorbing user traffic rather than directing it to publisher sites. As a result, publishers are likely to have less ad revenue and fewer subscriptions, putting their businesses at risk.
Given that a sea change toward AI-generated search results is playing out in real time, it is difficult to predict what economic model the web will be based on five years from now. However, we can game out a few possible scenarios.
• Search engines use AI to subsume the publishing industry. Publishers go out of business en masse, leaving only the LLMs that generate news-like search results. Many view this possible future as dystopian and potentially dangerous for a society without the fourth estate, as there would be fewer incentives to subsidize investigative journalism that seeks to root out corporate and governmental corruption. In a world without news organizations, it would be extraordinarily difficult to determine "the truth" as we would rely mostly on second- and third-hand accounts and other forms of hearsay. Ironically, this would also reduce the utility of search engines, likely resulting in a revenue drop for those companies as well.
• In a slightly less dystopian future, publishing does not die but instead gets sorted into two camps. The first consists of traditional publishers with large audiences, like the Times, that survive due to their name recognition and quality journalism. The second consists of small independent publishers with niche audiences that survive on subscriptions. Both have the ability to be a source of real-time "hot" news that is too fresh for the AI models to ingest. They also can serve as destinations for users (e.g., the users check their apps or bookmarked links frequently), allowing the publishers to flourish even without referrals from search engines. This has been referred to as the "barbell" strategy, where the sweet spots for publishers are at either end of the spectrum in terms of size (the very large media companies or the very small independents) with very little in between.
• Another dark scenario is that the search engines buy up most or all major publishers, reorienting the publishing business from serving the public directly toward generating training data for AI models. While the incentives here are complex, it is likely that the search engine companies would have at least some control over reporting. This could lead to less reporting on any negative externalities associated with these companies or -- even worse -- different "facts" being used to train different search engines. If this notion seems far-fetched, keep in mind that the Washington Post is owned by Amazon founder Jeff Bezos[5] and that Meta was considering the purchase of Simon & Schuster to train its models.[6]
• A more likely path forward is that the search engine providers (and AI companies in general) sign licensing deals with major publishers in order to access and use their content. This is already happening, as the Financial Times, News Corp., Axel Springer, Le Monde, Prisa Media, and The Associated Press have all agreed to license their content to OpenAI.
• Another possibility is that search engine providers are legally prohibited from ingesting publisher content without permission, which makes the licensing discussed above even more appealing. This might be a consequence of a victory for the Times in its dispute with OpenAI.
Right now, it is too early to tell which of these scenarios may play out, or whether a different one will arise. If anything is certain, it is that the business model of the Internet will be different in five years or less. Search engines will return AI-generated answers with less emphasis on links to source material, publishers will need to engage more directly with their audience and supply a continuous stream of high-quality content, and search-engine optimization may become obsolete.
Like all disruptions, AI search comes with opportunities as well. As the technological and legal landscape evolves, we should not be surprised if new models of publishing emerge, adapted to this evolution.
[1] This is admittedly a simple overview of the relationship. There is quite a bit more nuance, but this level of detail is sufficient for the discussion that follows.
[2] https://futurism.com/the-byte/ai-training-data-shortage.
[3] https://www.patentdocs.org/2024/02/the-new-york-times-case-against-openai-is-different-heres-why.html.
[4] https://localnewsinitiative.northwestern.edu/posts/2024/05/08/ai-local-news-report/index.html.
[5] https://www.theguardian.com/media/2023/jan/29/tech-moguls-media-jeff-bezos-washington-post.
[6] https://www.theguardian.com/books/2024/apr/09/meta-discussed-buying-publisher-simon-schuster-to-train-ai.