GPT-3 and verification

Summary: This post talks about GPT-3, a new Machine Learning (ML) system currently making waves in the ML community. It explains why GPT-3 is a big deal, and then considers the verification implications of such systems. One way to look at GPT-3 (and the even-bigger GPT-4, GPT-5 etc. which are sure to follow) is as a sequence of increasingly-powerful machines which come with no clear specification of what they can do. This will probably shift the verification-effort-to-design-effort ratio much more towards verification.

Note that this is one of those more speculative results-of-automation posts, so treat it accordingly. I’ll get back to the mainstream topics of this blog (Autonomous Vehicles verification, scenario languages and all that) in the next post.

What is GPT-3

Did you hear about GPT-3 yet? Some of the people I am following / talking to have never heard of it, while others can’t stop discussing it.

GPT-3 is a huge, 175B parameters ML “language model” created by OpenAI. It was trained on a significant piece of the text on the internet (including all of English Wikipedia). I was impressed by the 1.5B-paraemter GPT-2 (January 2019), and then when GPT-3 arrived (June 2020) it was clear that (a) this is a big step forward and (b) there will be more coming.

Here are some links: A summary of why people find it is amazing, the original OpenAI paper, and a summary by @gwern, who got an early access to the GPT-3 API, and wrote:

GPT-3’s samples are not just close to human level: they are creative, witty, deep, meta, and often beautiful. They demonstrate an ability to handle abstractions, like style parodies, I have not seen in GPT-2 at all. Chatting with GPT-3 feels uncannily like chatting with a human. I was impressed by the results reported in the GPT-3 paper, and after spending a week trying it out, I remain impressed.

People are just starting to discover what you can do with it:

@gwern and others are experimenting with “prompt programming” (i.e. customizing GPT-3 by feeding it some initial text to “prime” it for the job at hand)
@sharifshameem created a prototype system which builds a web application automatically based on a textual description of what the application should do, and is now building a company around it
People were impressed with GPT-3’s answers to common-sense questions, but remarked that it also gives (bad) answers to nonsense questions
Then @paraschopra discovered that you can use “prompt programming” to make it reject nonsense questions
And so on

One of the hot discussion topics around GPT-3 is what it does to our estimates of when Artificial General Intelligence (AGI) will happen (see this old post about AGI and its dangers). GPT-3, of course, is not AGI. Nevertheless, some people whose judgement I respect are busy updating their expected-AGI-arrival-time to be significantly sooner.

The basic rationale seems to be as follows: They all assume that AGI will need both bigger models and new ideas, but GPT-3 seems to show that you can achieve an awful lot by just going for bigger models (using an existing popular architectures, namely transformers), and that we are far from the point of diminishing returns.

Leaving AGI aside, one of my GPT-3 impressions is that the balance between “creating the functionality” and “verifying the functionality” seems to be shifting. Let me explain what I mean by that:

Creating (and verifying) products based on GPT-3 / advanced-ML

Consider ML research as a somewhat-alien technology source, which spits out a stream of increasingly-capable machines. Each of these machines is very impressive, but come with no clear specification of what it can and cannot do. Thus, one of the first tasks is to see what it is good for. We make educated guesses about why it does some jobs well and some poorly, and try to understand its limits.

Then we try to create systems / products based on that machine. Some of those systems will be embodied (e.g. autonomous vehicles, robots, drones), and some not (e.g. medical / financial / law / media applications).

Usually those systems will not be ML-only. Consider AVs: They are made of ML, millions of lines of “plain software”, mechanics, electronics and much else. But the ML part keeps growing, and some companies (e.g. comma.ai) are going ML-only for almost the full flow from sensor input to steering/gas/brake.

And some of these systems will have safety-critical implications, in the sense that if they behave badly then the results could be pretty bad (people dying, huge monetary losses, huge lawsuits etc.).

Are we going to see GPT-*-inspired safety-critical systems? Probably not very soon: Most people tinkering with GPT-3 are not looking into that, and these systems may be too unpredictable, too brittle, or too power-hungry. But in the slightly longer term (say 3..10 years) I’d be surprised if none appeared, much like I’d be surprised by somebody not using ML today for their AV perception: ML systems are simply too good at this to be ignored.

Note that because ML-based systems are somewhat unpredictable, safety-critical systems typically employ various non-ML “guard” schemes around the ML-based part (see figure 2 here). And then you must verify the whole, composite system.

Constructing safety-critical systems involves serious verification. This is (obviously) far-removed from the current stage of GPT-3 tinkering, which often sounds more like observing wildlife (“Hey, I just saw it do X – I bet it can even do Y”).

The verification ratio

In electronic chip design, the verification ratio (i.e. the fraction of the total engineering effort which is spent on verification) is above 50%, because the cost of failure is so high. That ratio is much lower for typical web software, and probably much higher for, say, the lunar lander software.

Now consider the class of future ML-based / GPT-*-inspired safety-critical systems (whether embodied or not). I claim their verification ratio is going to be significantly higher than 50%: In a sense, per unit-of-complexity, design effort may shrink, while verification effort will stay unchanged (or even grow).

I use “verification” loosely here to mean the entire range of activities of finding the boundaries of functionality, looking for spec bugs (yes, those systems will have a spec, or at least a set of requirements), working through many scenarios and looking for corner cases, doing user-centered validation and so on.

Can we delegate verification itself to an ML-based system? I posed that question in the post titled Where Machine Learning meets rule-based verification:

So, can we assume that as ML improves, all verification will be done via ML, or is there some unavoidable part which will forever have to be specified in a rule-based, “logical” way (using data structures, assertions and so on)? The history of ML reads like a sequence of battles where the rule-based alternative fought ML and lost, so it is tempting to claim that an all-ML solution will eventually win.

My answer was “no”:

Regulators (and common sense) will insist that verification will be based upon human-supplied requirements, consisting of scenarios, coverage definitions, checks and so on – let’s call them all “rules” for short. So some rule-based specifications will always be here.

Overall I expect this golden age of advanced-ML to coincide with a golden age for verification: We are going to need the best, most efficient verification tools / methodologies we can find. This includes dynamic verification (sometimes known as “testing”) as well as formal verification (though I suspect it may be less applicable here than many would hope).

I expect scenario-based, Coverage-Driven Verification (CDV) to play a big role (e.g. see this 15-minute video of me talking about verifying ML-based systems). Fuzzing is another rapidly-advancing family of technologies which started in security but has since been generalized and is moving closer and closer to CDV.

We do live in interesting times.

Notes

I’d like to thank Gil Amid, Rahul Razdan, Amiram Yehudai, Kerstin Eder and Ziv Binyamini for commenting on earlier drafts of this post.

One thought on “GPT-3 and verification”

Erik Panu says:

July 25, 2020 at 3:53 pm

So..this topic reminds me of:

1) ..the late 1990s interest by some of us in applying person supplied rules to verifying whether or not a rendered image “passed” or “failed”. A person could immediately spot a problem in a picture that took too many rules or too many dependent constraints (considering other things to detect), to correctly specify or resolve each and every one, much less how to define what the picture looked like so, people did A/B comparisons instead. And..
2) …the other day, a Customer says to us, “Oh, your DL software works great removing typing noise but those mouse clicks are another thing all together.” Well, kind sir, first of all, 42 other people think it sounds fine for mouse clicks. Second, you use a 2002 Kensington mouse, and no one in their right mind would ever have a 2002 Kensington mouse any more. And third, for heaven’s sake, you are a major city symphony conductor – c’mon man!

As a sweeping generalization, it feels like what is right or wrong with an ML result will be subjective or, at least, will have subjective elements. A combination of clearly specified rules as well as mechanisms for applying and “soft” verifying policies seems necessary. Leaving a train horn in conference call audio is probably not acceptable. Having a little bit of barely noticeable noise indicating presence on the call? That’s OK and maybe even welcomed.

Loading...

	Coverage-driven alig… on It’s the spec bugs that kill y…
	https://otomotif71.w… on Stuttgart impressions: Scenari…
	Daan van der Keur on About “The coming AI hackers”…
	Mariah Jackson on M-SDL, the autonomous vehicles…
	sakhokhar on Machine Learning for Coverage…
	hongseoklee on How to write AV scenarios (and…
	Erik Panu on GPT-3 and verification
	Yoav Hollander on Autonomy markets and their pot…
	Nakkeeran Kumaraswam… on Autonomy markets and their pot…

The Foretellix CTO Blog – AI safety

Now focusing on AI safety (autonomy-related posts go to the company blog)