Vincent A Saulys' Blog
GPT-3 & the Universal Algorithm
Tags: ml
May 01, 2020

If you haven't already seen this GPT-3 demo, feast your eyes:

I just built a functioning React app by describing what I wanted to GPT-3.

I'm still in awe.

--- Sharif Shameem (@sharifshameem) July 17, 2020

Neural networks are powerful stuff. Exactly why is debatable, but I'd bet it has to do with a little-mentioned result called the Universal Approximation Theorem.

Simply stated: a neural network with a wide enough hidden layer can approximate any continuous function to arbitrary precision.

It turns out many things can be represented as mathematical functions. This includes a lot of tasks with a clearly defined input and output. Think of identifying photos: the input is the photo and the output is what is in the picture. If you can generate enough data for it, a neural network can probably "learn" it. The theorem is one of the few results in machine learning that has actually been proven mathematically.
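To make that concrete, here's a minimal sketch of the constructive idea behind the theorem (my own illustration, not part of the formal proof): a single hidden layer of steep sigmoid units can build a staircase that shadows any continuous function, and more units means a finer staircase.

```python
import numpy as np

def sigmoid(z):
    # Clip to avoid overflow warnings from very steep units.
    return 1.0 / (1.0 + np.exp(-np.clip(z, -500, 500)))

def one_layer_approx(f, x, n_units=100, lo=0.0, hi=2 * np.pi, steepness=200.0):
    """Approximate f on [lo, hi] with a single hidden layer of n_units
    sigmoid neurons -- each neuron contributes one 'step' of a staircase."""
    knots = np.linspace(lo, hi, n_units + 1)
    out = np.full_like(x, f(knots[0]))
    for i in range(1, n_units + 1):
        step_height = f(knots[i]) - f(knots[i - 1])
        # A steep sigmoid acts like a step function switching on at knots[i].
        out += step_height * sigmoid(steepness * (x - knots[i]))
    return out

# The 100-unit staircase tracks sin(x) to within the grid spacing.
x = np.linspace(0.0, 2 * np.pi, 1000)
max_err = np.max(np.abs(one_layer_approx(np.sin, x) - np.sin(x)))
print(f"max error with 100 hidden units: {max_err:.4f}")
```

Doubling the number of units roughly halves the error, which is the theorem's "wide enough" in action.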

You can see the universal approximation theorem play out in my favorite neural network tutorial. The author builds a very simple network that replicates an arbitrary boolean function. First he synthesizes the data, feeding inputs of 1s and 0s into the function to calculate the outputs. Then he feeds that data into a network. After some backpropagation, it's mimicking the boolean function.
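A minimal sketch in the same spirit (not the tutorial's exact code; the architecture and hyperparameters here are my own choices): synthesize every input/output pair of a boolean function, then backpropagate until the network mimics it. XOR is the classic example, since it needs a hidden layer at all.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthesize the data: every input/output pair of the XOR boolean function.
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([[0], [1], [1], [0]], dtype=float)

# One hidden layer of 8 tanh units, linear output.
W1 = rng.normal(size=(2, 8))
b1 = np.zeros(8)
W2 = rng.normal(size=(8, 1))
b2 = np.zeros(1)

lr = 0.1
for _ in range(20000):
    # Forward pass.
    h = np.tanh(X @ W1 + b1)
    out = h @ W2 + b2

    # Backpropagation of the mean-squared-error gradient.
    d_out = 2.0 * (out - y) / len(X)
    d_h = (d_out @ W2.T) * (1.0 - h ** 2)

    # Gradient descent step.
    W2 -= lr * h.T @ d_out
    b2 -= lr * d_out.sum(axis=0)
    W1 -= lr * X.T @ d_h
    b1 -= lr * d_h.sum(axis=0)

# Threshold the outputs to read off the learned truth table.
preds = (out > 0.5).astype(int).ravel()
print("learned truth table:", preds)
```

After training, the thresholded outputs reproduce XOR's truth table, which is the whole trick: the network has become a stand-in for the function that generated its data.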

While this comes up often in textbooks and papers -- the mathematical notation often starts by assuming a function exists at all -- I never see it in lay articles. I think that's a mistake.

Because whenever you see an impressive result from a neural network, you need to think about what function it's replicating and what data it's using to do so.

Something that goes unstated is what was used to train this model. The published paper, "Language Models are Few-Shot Learners," states that the training data is the Common Crawl dataset, a corpus of scraped websites.

You can conveniently see which sites were scraped for it with their tool. I checked a couple of months between the years mentioned (2016-2019). Lo and behold, Stack Overflow comes up.

What's going on?

GPT-3 seems to have learned programming JSX because it saw a lot of prior examples.

Think about the questions being asked in the tweeted app. They're notable for being specific and the answers being rote. He's asking it to build a specific type of button or to mimic a specific layout (e.g. "build me a button that looks like a watermelon"). He's not asking it to build a website that matches a particular aesthetic. That task would require reasoning. Building components is rote because there are only so many ways to do it. The real skill in design comes in how you combine these elements. You see this with the multitude of component libraries being offered for free. Nobody cares if you "steal" them for your websites or apps.

This isn't to totally take away from the achievement. I liked the watermelon example in particular. It shows an incredible ability to combine context across questions (e.g. "watermelons and JSX"). I don't believe anybody has pulled that off before.

But it is to say that the model has limits. In fact, it may be the end point for brute-force statistical methods. The cost to train it was an estimated $12 million. Money is rarely an obstacle if you're solving a worthy enough problem, but then comes the issue of finding a large enough dataset.

The Common Crawl, used by GPT-3, is massive: the size is in the terabytes and the page count in the billions for just one month. The only other source of data that comes reasonably close would be games, which can be played an unlimited number of times, generating data with each play. You can see models like AlphaGo taking advantage of this fact. Images are another space where big datasets abound (please let me know if you have any other examples of problems with massive datasets!).

GPT-3 is powerful, but its practical domain is likely to be small. I saw some people whipping up prototypes around generating text variants for things like landing pages. That's probably a good use case: there's no long narrative that needs to be maintained, so it doesn't devolve into nonsense. (On a side note, doesn't this generated text make people's skin crawl? Like some sort of uncanny valley effect?)

Cognitive reasoning this is not. But it is an important achievement for the tools of the future.
