Last weekend, Blake Lemoine, a Google engineer, was suspended by Google for disclosing a series of conversations he had with LaMDA, Google’s impressive large model, in violation of his nondisclosure agreement. Lemoine’s claim that LaMDA is “sentient” has been widely publicized, and criticized by nearly every AI expert. This comes just two weeks after Nando de Freitas, tweeting about DeepMind’s new Gato model, claimed that artificial general intelligence is only a matter of scale. I’m with the experts. I think Lemoine was taken in by his own desire to believe, and I think de Freitas is also wrong about general intelligence. But I also think “sentience” and “general intelligence” are not the questions we should be discussing.
The latest generation of models is good enough to convince some people that they are intelligent, and whether or not those people are deluding themselves is beside the point. What we need to talk about is the public responsibility of the researchers who build those models. I acknowledge Google’s right to require employees to sign a nondisclosure agreement; but when a technology has implications as far-reaching as general intelligence, is the company right to keep it secret? Or, looking at the question from the other direction, does developing that technology in public breed misconceptions and panic where none is warranted?
Google is one of the three major players pushing AI forward, along with OpenAI and Facebook. The three have shown different attitudes toward openness. Google communicates largely through academic papers and press releases; we see flashy announcements of its accomplishments, but the number of people who can actually try out its models is very small. OpenAI is much the same, although it has also made it possible to test-drive models like GPT-2 and GPT-3, and to build new products on top of its APIs; GitHub Copilot is just one example. Facebook has open sourced its largest model, OPT-175B, along with several smaller prebuilt models and a voluminous set of notes describing how OPT-175B was trained.
I want to look at these different versions of “openness” through the lens of the scientific method. (I realize that this research is really a matter of engineering, not science.) In general, we ask three things of any new scientific advance:
- It can reproduce past results. It is not clear what this criterion means in this context; we don’t want an AI to reproduce the poems of Keats, for example. We would want a newer model to perform at least as well as an older model.
- It can predict future phenomena. I interpret this as the ability to produce new texts that are (at a minimum) convincing and readable. It is clear that many AI models can achieve this.
- It is reproducible. Someone else can do the same experiment and get the same result. Cold fusion fails this test badly. What about large language models?
Because of their scale, large language models have a significant problem with reproducibility. You can download the source code for Facebook’s OPT-175B, but you won’t be able to train it yourself on any hardware you have access to. It is too big even for universities and other research institutions. You still have to take Facebook’s word that it does what it says it does.
This is not just a problem for artificial intelligence. One of our authors from the 1990s went from graduate school to a professorship at Harvard University, where he did research on large-scale distributed computing. A few years after taking the position, he left Harvard to join Google Research. Shortly after arriving at Google, he wrote on his blog that he was working on bigger and more interesting problems than he could work on at any university. This raises an important question: what can academic research mean when it cannot scale to the size of industrial operations? Who will have the ability to replicate research results at that scale? Nor is this just a problem for computer science; many modern experiments in high-energy physics require energies that can only be reached at the Large Hadron Collider (LHC). Would we trust the results if there were only one laboratory in the world where they could be reproduced?
This is exactly the problem we have with large language models. OPT-175B cannot be reproduced at Harvard or MIT. It may not even be reproducible by Google or OpenAI, even though they have sufficient computing resources. I would bet that OPT-175B is too closely tied to Facebook’s infrastructure (including custom hardware) to be reproduced on Google’s infrastructure. I would bet the same goes for LaMDA, GPT-3, and other very large models once you take them out of the environment in which they were built. If Google released the source code for LaMDA, Facebook would have trouble running it on its own infrastructure, and the same would be true of GPT-3.
So what can “reproducibility” mean in a world where the infrastructure needed to reproduce important experiments cannot itself be reproduced? The answer is to provide free access to outside researchers and early adopters, so they can ask their own questions and see the wide range of results. Since these models can only run on the infrastructure on which they were built, this access will have to be through public APIs.
There are plenty of impressive examples of text produced by large language models. LaMDA’s are the best I’ve seen. But we also know that, for the most part, these examples are heavily cherry-picked. And there are many examples of failures, which are certainly cherry-picked too. I would argue that if we want to build safe and usable systems, it is more important to pay attention to the failures (cherry-picked or not) than to applaud the successes. Consciously or not, we care more about a self-driving car crashing than about it navigating the streets of San Francisco safely at rush hour. This is not only our (emotional) penchant for drama; if you are the one involved, a single accident can ruin your day. If a natural language model has been trained not to produce racist output (and this is still largely a topic of research), its failures are more important than its successes.
With this in mind, OpenAI has done well by allowing others to use GPT-3: initially through a limited free trial, and now as a commercial product that customers access through APIs. While we may legitimately be concerned about GPT-3’s ability to generate pitches for conspiracy theories (or just plain marketing), at least we know the risks. For all the useful output that GPT-3 generates (deceptive or not), we have also seen its errors. Nobody claims that GPT-3 is sentient; we understand that its output is a function of its input, and that if you point it in a certain direction, that is the direction it takes. When GitHub Copilot (built on OpenAI Codex, which is itself built on GPT-3) was first released, I saw a lot of speculation that it would cause programmers to lose their jobs. Now that we have seen Copilot, we understand that it is a useful tool within its limitations, and discussions of job loss have dried up.
Google has not offered this kind of visibility into LaMDA. It doesn’t matter whether the company is concerned about intellectual property, liability for misuse, or public fear of AI. Without public experimentation with LaMDA, our attitudes toward its output, whether fearful or ecstatic, are based at least as much on fantasy as on reality. Whether or not we put the proper safeguards in place, research done in the open, and the ability to play with (and even build products from) systems like GPT-3, have made us aware of the consequences of “deepfakes”. Those are realistic fears and concerns. With LaMDA, we cannot have realistic fears and concerns. We can only have imaginary ones, which are inevitably worse. In an area where reproducibility and experimentation are limited, letting outsiders experiment may be the best we can do.