Amazon, Quora and Indeed are three major websites that prohibit ChatGPT's bot.
By Rick Richardson
Technology This Week
According to recent data from the AI-content detector Originality.ai, nearly 20 percent of the world's top 1,000 websites are restricting crawler bots that collect web data for AI services.
Websites big and small are taking matters into their own hands because there are no clear legal or regulatory guidelines limiting AI's use of intellectual property.
Early in August, OpenAI unveiled its GPTBot crawler, claiming that the information obtained "might improve future models," assuring publishers that paywalled content would be omitted and providing instructions for blocking the crawler on their websites.
Several well-known news outlets, including The New York Times, Reuters and CNN, started blocking GPTBot shortly after, and many more have subsequently done the same.
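Per OpenAI's published guidance, turning GPTBot away comes down to a two-line rule in a site's robots.txt file. A site that wants to refuse the crawler entirely serves something like:

```
User-agent: GPTBot
Disallow: /
```

Narrower rules are also possible, such as disallowing only certain paths while leaving the rest of the site open to crawling.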
According to Originality, the share of the top 1,000 most popular websites blocking OpenAI's GPTBot climbed from 9.1 percent on August 22 to 12 percent on August 29.
The analysis shows that larger websites are more likely to have already blocked AI bots. Among the top 1,000 websites, the Common Crawl bot – another crawler that regularly collects web data used by some AI services – is blocked 6.77 percent of the time.
Here's how it works: any webpage that can be accessed by a web browser can also be "scraped" by a crawler, which behaves much like a browser but stores the content in a database rather than displaying it to a user. That is how search engines like Google gather their information.
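As an illustration, here is a minimal sketch in Python of that store-rather-than-display step. The fetch is stubbed out so the example is self-contained (a real crawler would download the page over the network); the `TextExtractor` and `scrape` names are hypothetical, not any particular crawler's API.

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collects a page's visible text, the way a crawler stores content."""
    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        text = data.strip()
        if text:
            self.chunks.append(text)

def scrape(url, html):
    # A real crawler would fetch the markup itself, e.g. with
    # urllib.request.urlopen(url); here it is passed in directly
    # so the sketch runs without network access.
    parser = TextExtractor()
    parser.feed(html)
    # Store text keyed by URL instead of rendering it to a user.
    return {url: " ".join(parser.chunks)}

index = scrape("https://example.com",
               "<html><body><h1>Hello</h1><p>World</p></body></html>")
print(index)  # {'https://example.com': 'Hello World'}
```

A search engine repeats this over billions of pages and builds an index over the stored text.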
Site owners have long been able to publish instructions telling these crawlers to stay away, but compliance is entirely voluntary, and malicious operators can simply ignore them.
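Python's standard library ships a parser for exactly these instructions, and the sketch below shows why compliance is voluntary: the check is something a well-behaved bot runs on itself before fetching. The robots.txt content here is a hypothetical example that turns away GPTBot only.

```python
import urllib.robotparser

# Hypothetical robots.txt for a site that refuses GPTBot
# but admits every other crawler.
robots_txt = """\
User-agent: GPTBot
Disallow: /
"""

rp = urllib.robotparser.RobotFileParser()
rp.parse(robots_txt.splitlines())

# A polite crawler asks before fetching; nothing forces it to.
print(rp.can_fetch("GPTBot", "https://example.com/article"))        # False
print(rp.can_fetch("SomeOtherBot", "https://example.com/article"))  # True
```

A crawler that skips the `can_fetch` call faces no technical barrier, which is why publishers who want a hard guarantee turn to paywalls or legal pressure instead.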
Although many publishers and intellectual-property owners have long objected, Google and other web companies regard the activity of their data crawlers as fair use, and the practice has drawn the company into many legal battles. As generative AI and large language models gain popularity, the issue has resurfaced as AI businesses send out crawlers to gather data to feed their chatbots and train their models.
Because Google and other search engines directed consumers to publishers' ad-supported websites, some publishers found at least some value in allowing search crawlers access. In the age of AI, however, publishers are rejecting crawlers more adamantly because there is no comparable benefit to handing their data to AI firms.
Many media businesses are currently negotiating with AI companies over fees to license their data, but those discussions remain in the early stages. Meanwhile, some websites and intellectual-property owners are suing, or considering suing, AI businesses that may have misused their data.
The increasing commercialization of AI services like OpenAI's is being viewed with anger and a "we won't get fooled again" attitude by media organizations that feel Google duped them over the past 20 years. According to The Information, OpenAI is expected to earn more than $1 billion in revenue over the coming year.
News organizations in particular are struggling to strike the right balance between embracing AI and resisting it. On the one hand, the sector is desperately searching for new ways to improve margins in its labor-intensive operations. On the other, integrating AI into a newsroom's workflow raises difficult ethical questions at a time when public confidence in media organizations is at an all-time low.
If too much of the web bans AI crawlers, the owners of those crawlers may find it harder to update and improve their AI products; good training data is already getting tougher to find.