tl;dr: One of the tricks to using GPT-4 well is understanding how it sees text and instructions
When we started testing GPT-4 internally last summer, one of the first tests people would try was getting it to play the game Wordle. While guessing a five-letter word should be simple for an advanced language model, it’s actually trickier than it seems. These models don’t see text and characters the same way we do. If I give GPT-4 the word “MAGIC” it understands it as “45820, 2149” while lowercase “magic” is “32707”. Although it doesn’t “see” the letters, the model is smart enough to know what letters are in each word, but this involves extra steps and can sometimes get confusing. For example, GPT-4 can summarize Cinderella with a sentence in alphabetical order while GPT-3.5 struggles.
I’ve seen some public examples of people trying to get GPT-4 to play Wordle with limited or no success. One solution is to just have GPT-4 write code to solve Wordle, but you can actually get GPT-4 to do it through prompting alone if you know how to ask it correctly, based on how it sees text.
Eventually models shouldn’t require tricks, and “prompt engineering” (this was my first job at OpenAI back in the ancient days of 2020) will be as normal as typing on a keyboard (we’re all typists now). Meanwhile, trying to get GPT-4 to solve a problem it should be able to solve is a good way to understand how it actually thinks, and that has applications far outside of playing Wordle.
While a number of potential solutions seem like they should work (replacing letters with ‘#’ or other symbols), they often don’t. This is because GPT-4 doesn’t see the text the way we do. It’s a language model – not a character model.
You need to show GPT-4 how to represent a word in a way that makes clear each character is separate. Because of its extensive knowledge of programming, GPT-4 really gets the idea of brackets. When it sees something inside of them, it knows it’s a discrete item and doesn’t try to compress it.
The words “MAGIC” and “magic” have similar patterns when you put brackets around them:
[m][a][g][i][c] = 58, 76, 7131, 64, 7131, 70, 7131, 40, 7131, 66, 60
[M][A][G][I][C] = 58, 44, 7131, 32, 7131, 38, 7131, 40, 7131, 34, 60
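In code, this bracketed form is just each character wrapped individually (a minimal Python sketch; the helper name `bracketize` is mine, not from the original post):

```python
def bracketize(word: str) -> str:
    """Wrap each character in its own brackets so every letter
    becomes a separate, clearly delimited item for the model."""
    return "".join(f"[{ch}]" for ch in word)

print(bracketize("MAGIC"))  # [M][A][G][I][C]
print(bracketize("magic"))  # [m][a][g][i][c]
```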
If we confine each letter to a bracket, and then use that same bracket to describe whether the letter is in the correct position, in the word, or not in the word at all, it becomes much easier for GPT-4 to see what we’re seeing and play the game correctly.
Using these rules you can get GPT-4 to play Wordle using this prompt:
Let’s play a version of Wordle. I’m thinking of a five-letter word. You have to guess what the word is. Each time you guess a word, I’ll tell you whether each letter of your guess is in the word in the right position, in the wrong spot, or not in the word at all. Each of your guesses has to be an actual word. We start like this: [-][-][-][-][-] You might guess (using an actual word): [A][P][P][L][E] If I was thinking of “MAGIC” I’d respond: [A is in the incorrect position][P is not in the word][P is not in the word][L is not in the word][E is not in the word] Got it?
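If you want to script your side of the game, the feedback format in the prompt is straightforward to generate (a sketch; `wordle_feedback` is a hypothetical helper of mine, and it follows the prompt’s simplified rules rather than real Wordle’s duplicate-letter handling):

```python
def wordle_feedback(guess: str, answer: str) -> str:
    """Produce bracketed per-letter feedback in the same format as the prompt."""
    guess, answer = guess.upper(), answer.upper()
    parts = []
    for i, letter in enumerate(guess):
        if answer[i] == letter:
            parts.append(f"[{letter} is in the correct position]")
        elif letter in answer:
            parts.append(f"[{letter} is in the incorrect position]")
        else:
            parts.append(f"[{letter} is not in the word]")
    return "".join(parts)

print(wordle_feedback("APPLE", "MAGIC"))
# [A is in the incorrect position][P is not in the word][P is not in the word][L is not in the word][E is not in the word]
```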
In my experience GPT-4 isn’t great at the game with this prompt, but it can clearly abide by the rules and make good guesses with the right instructions. GPT-4 is just one development on the road to more capable systems and has many limitations. However, some of those limitations can be worked around if you try to communicate with it in a way it really understands.
Here’s an alternate solution by Andrea Bizzotto that uses a more verbose prompt, but a simpler annotation style to show the model what’s correct and what’s not. This is a great illustration that there’s more than one way to solve the problem. Andrea’s solution shows that providing explicit instructions – even when the model claims to know the rules – reinforces what it needs to focus on.
When you encounter a task that GPT-4 seems like it should be capable of accomplishing, break it down in a way that leans into its language capabilities. Sometimes asking the model will put you on the right track. Other times, paying attention to how it failed can shed light on what’s going on.