Amazon, Quora and Indeed are three major websites that prohibit ChatGPT's bot.
By Rick Richardson
Technology This Week
According to recent data from the AI-content detector Originality.ai, nearly 20 percent of the world's top 1,000 websites are restricting crawler bots that collect web data for AI services.
Websites big and small are taking matters into their own hands because there are no clear legal or regulatory guidelines limiting AI's use of intellectual property.
Early in August, OpenAI unveiled its GPTBot crawler, claiming that the information obtained "might improve future models," assuring publishers that paywalled content would be omitted and providing instructions for blocking the crawler on their websites.
Several well-known news outlets, including The New York Times, Reuters and CNN, started blocking GPTBot shortly after, and many more have subsequently done the same.
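Per OpenAI's published guidance, turning GPTBot away comes down to a two-line rule in a site's robots.txt file. A site that wants to refuse the crawler entirely serves something like:

```
User-agent: GPTBot
Disallow: /
```

Narrower rules are also possible, such as disallowing only certain paths while leaving the rest of the site open to crawling.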
According to Originality, the share of the top 1,000 most popular websites blocking OpenAI's GPTBot climbed from 9.1 percent on August 22 to 12 percent on August 29.
The analysis shows that larger websites are more likely to have already blocked AI bots. Among the top 1,000 websites, the Common Crawl bot – another crawler that regularly collects web data used by some AI services – is blocked 6.77 percent of the time.
Here's how it works: any webpage that can be accessed by a web browser can also be "scraped" by a crawler, which behaves much like a browser but stores the content in a database rather than displaying it to a user. That is how search engines like Google gather their information.
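As an illustration, here is a minimal sketch in Python of that store-rather-than-display step. The fetch is stubbed out so the example is self-contained (a real crawler would download the page over the network); the `TextExtractor` and `scrape` names are hypothetical, not any particular crawler's API.

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collects a page's visible text, the way a crawler stores content."""
    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        text = data.strip()
        if text:
            self.chunks.append(text)

def scrape(url, html):
    # A real crawler would fetch the markup itself, e.g. with
    # urllib.request.urlopen(url); here it is passed in directly
    # so the sketch runs without network access.
    parser = TextExtractor()
    parser.feed(html)
    # Store text keyed by URL instead of rendering it to a user.
    return {url: " ".join(parser.chunks)}

index = scrape("https://example.com",
               "<html><body><h1>Hello</h1><p>World</p></body></html>")
print(index)  # {'https://example.com': 'Hello World'}
```

A search engine repeats this over billions of pages and builds an index over the stored text.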
Site owners have long been able to publish instructions telling these crawlers to stay away, but compliance is entirely voluntary, and malicious operators can simply ignore them.
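Python's standard library ships a parser for exactly these instructions, and the sketch below shows why compliance is voluntary: the check is something a well-behaved bot runs on itself before fetching. The robots.txt content here is a hypothetical example that turns away GPTBot only.

```python
import urllib.robotparser

# Hypothetical robots.txt for a site that refuses GPTBot
# but admits every other crawler.
robots_txt = """\
User-agent: GPTBot
Disallow: /
"""

rp = urllib.robotparser.RobotFileParser()
rp.parse(robots_txt.splitlines())

# A polite crawler asks before fetching; nothing forces it to.
print(rp.can_fetch("GPTBot", "https://example.com/article"))        # False
print(rp.can_fetch("SomeOtherBot", "https://example.com/article"))  # True
```

A crawler that skips the `can_fetch` call faces no technical barrier, which is why publishers who want a hard guarantee turn to paywalls or legal pressure instead.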
Although many publishers and intellectual-property owners have long objected, Google and other web companies regard the activity of their data crawlers as fair use, and the practice has drawn the company into many legal battles. As generative AI and large language models gain popularity, the issue has resurfaced as AI businesses send out crawlers to gather data to feed their chatbots and train their models.
Because Google and other search engines directed consumers to publishers' ad-supported websites, some publishers found at least some value in allowing search crawlers access. In the age of AI, however, publishers are rejecting crawlers more adamantly because there is no comparable benefit to handing their data to AI firms.
Many media businesses are currently negotiating with AI companies over fees to license their data, but those discussions remain in the early stages. Meanwhile, some websites and intellectual-property owners are suing, or considering suing, AI businesses that may have misused their data.
The increasing commercialization of AI services like OpenAI's is being viewed with anger and a "we won't get fooled again" attitude by media organizations that feel Google duped them over the past 20 years. According to The Information, OpenAI is expected to earn more than $1 billion in revenue over the coming year.
News organizations in particular are struggling to strike the right balance between embracing AI and resisting it. On the one hand, the sector is desperately searching for new ways to improve margins in its labor-intensive operations. On the other, integrating AI into a newsroom's workflow raises difficult ethical questions at a time when public confidence in media organizations is at an all-time low.
If too much of the web bans AI crawlers, the owners of those crawlers may find it harder to update and improve their AI products; good training data is already getting tougher to find.