Don’t believe the hype: why ChatGPT is not the “holy grail” of AI research

Silicon Valley’s pre-eminent leaders love prematurely predicting that their products will completely upend the world as we know it. The latest case study comes from Sam Altman, CEO of OpenAI, the company behind the ChatGPT AI chatbot that has gone viral for its convincing imitations of human writing. Two years ago, Altman wrote a manifesto, “Moore’s Law for Everything,” in which he forecast that artificial intelligence would make huge swaths of both white-collar and blue-collar jobs obsolete. As he wrote:

In the next five years, computer programs that can think will read legal documents and give medical advice. In the next decade, they will do assembly-line work and maybe even become companions. And in the decades after that, they will do almost everything, including making new scientific discoveries that will expand our concept of “everything.”… The coming change will center around the most impressive of our capabilities: the phenomenal ability to think, create, understand, and reason. 

His conclusion is that “as AI produces most of the world’s basic goods and services, people will be freed up to spend more time with people they care about, care for people, appreciate art and nature, or work toward social good.”

To enable our imminent lives of leisure, Altman argues that we should establish an annual 2.5% tax on the value of large corporations and all privately held land to provide an annual basic income for all American adults.

To paraphrase Yogi Berra, this is déjà vu all over again. For many decades, pundits have been predicting that technological advancements will soon make human workers obsolete; therefore, pay and work need to be de-linked and supplanted by a guaranteed income for all. The discomforting reality is that, while Altman and his ilk have been predicting an exponential acceleration of productivity, we have been experiencing a deceleration.

Altman is, of course, not a disinterested observer, given that his company OpenAI created ChatGPT, the latest shiny AI object, which is technically a large language model (LLM): a deep learning algorithm trained on an enormous dataset of text. It is undeniably astonishing that ChatGPT and other large language models, with some help from human fine-tuners, can generate coherent text on so many topics. Nonetheless, it is now clear that LLMs are hobbled by the simple fact that they do not (and do not try to) understand the meaning of the statistically appropriate text they generate. They sometimes jump the guardrails imposed by humans and veer off into unhinged rants.

They are also prone to confident assertions of statements that are blatantly false. For example, ChatGPT has often claimed that the Russians have sent multiple bears into space, though they have never done so. Here is a recent response (along with bogus references that ChatGPT fabricated):

Human: How many bears have the Russians sent into space?

ChatGPT: The Russians have sent a total of 5 bears into space, all of which were part of the Soviet Union’s space program during the 1960s. The bears were used in scientific experiments to study the effects of weightlessness and other space-related phenomena on living organisms. The bears’ names were Zvezdochka (Little Star), Ivan Ivanovich, Plyusha (Fluffy), Mishka (Little Bear), and Beryozka (Little Birch)

Human: Do you have references?

ChatGPT: Yes, here are some references to support my answer:

More dangerously, ChatGPT can make authoritative statements that sound believable but turn out to be false if investigated closely. For example, it was recently reported that when ChatGPT responded to tax questions posted on a technical support forum, its answers were consistently plausible and in the vicinity of the correct answers but, when tax experts examined the responses, every single one was wrong. Despite appearances, ChatGPT was 100% inaccurate.

What about the productivity payoff Altman envisions? A recent paper by two MIT economics graduate students reported the promising results of an experiment evaluating how well ChatGPT handled a variety of business writing tasks. The researchers gave 444 experienced, college-educated professionals 20-to-30-minute assignments writing press releases, short reports, analysis plans, and emails designed to resemble real-world business tasks, and concluded that, on average, ChatGPT reduced the time taken to complete the tasks by 0.8 standard deviations and increased the quality of the product by 0.4 standard deviations.

The results were reported widely and enthusiastically. A Wharton professor gushed that the productivity gains from LLMs may be larger than the gains from steam power.

We applaud the researchers’ experimental approach and their careful implementation, but we remain skeptical. The specific assignments seem to be relatively low-level boilerplate that was heavily informed by the detailed instructions. ChatGPT did what it does best: generate generic BS.

The authors surveyed the participants two weeks after the experiment and found that two-thirds were not using ChatGPT at all in their daily work, for easily anticipated reasons:

 Respondents who are not using ChatGPT in their jobs mostly report that this is because the chatbot lacks context-specific knowledge that forms an important part of their writing. For example, they report that their writing is “very specifically tailored to [their] customers and involves real time information” or “unique [and] specific to [their] company products.” These comments point to an important (and inherent) limitation of our experiment: it involves relatively small, self-contained tasks that lack much context-specific knowledge beyond what we stipulate in the task prompts.

At best, LLMs can be used for rough first drafts of low-value writing tasks with humans filling in the details and checking for rants and lies. We suspect that the truth checking will often be cursory.

The fact that LLMs are okay, on average, reminds us of the statistician who drowned while wading across a river with an average depth of two feet. ChatGPT’s unreliability creates considerable legal, financial, and reputational risk for any business that uses it for consequential text generation. As Warren Buffett quipped, “It takes 20 years to build a reputation and five minutes to ruin it.” If LLMs are used for important tasks, their use may create a new occupation (LLM fact checker) and a new type of insurance coverage (LLM errors).

What about Altman’s vision of humans appreciating art and nature while most of the world’s goods and services are produced by AI? We have a lot more respect for the work that people do than for the usefulness of LLMs. ChatGPT is entertaining but it is, at most, a baby step towards an AI revolution and, at worst, a very expensive detour away from the holy grail of artificial general intelligence. LLMs are more sizzle than steak.
