I’ve finally jumped on the Wordle bandwagon, and enjoy tinkering away every day. Best yet has been two guesses, due to a spectacularly good first guess. Worst was six, which I put down to Covid-19 brain fog.
Just in case anyone doesn’t know, Wordle is a really simple “Mastermind”-like game, where you have to guess a five letter word in six guesses or fewer. For each guess you make (which itself has to be a valid five letter word), you get an indication for each letter of whether it is the right letter in the right place, a right letter but in the wrong place, or a letter that doesn’t appear in the word at all.
Obvious questions are whether it is theoretically always solvable, and then what is the best word to start with? Before we dive into this, a quick shout out to Hannah Dee, whose Linux refresher through the medium of Wordle didn’t exactly plant the seed in my mind, but did water it.
Back to the question. If the answer didn’t have to be a valid word, the answer would be “no” straight away. There are 26x26x26x26x26 (over 11 million) permutations possible, and you have six guesses. Admittedly you can rule out a lot of these permutations after every guess – if you happen to choose 5 different letters, none of which appear, then that brings it down to 21x21x21x21x21 possibilities (a mere 4 million). However, some basic arithmetic soon exposes the problem. We guess 5 letters each turn, so after 5 guesses, we have tried a maximum of 25 different letters. There are 26 to choose from. So by the time we get to guess six – our final guess – we can be certain which unique letters are in the solution, but can’t be sure of their position, or which have been repeated.
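As a quick sanity check on those two figures, here is a minimal sketch using plain shell arithmetic. The variable names are just for illustration.

```shell
# Count the possible five-letter strings using shell arithmetic.
all=$((26 * 26 * 26 * 26 * 26))      # any of 26 letters in each of 5 slots
reduced=$((21 * 21 * 21 * 21 * 21))  # after ruling out 5 distinct letters
echo "all strings:    $all"          # 11881376
echo "after one miss: $reduced"      # 4084101
```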
However the fact it must be a valid word helps enormously. Taking Linux’s “huge” American English dictionary, I reckon there are 11,302 five letter words, from “aahed” to “zymix”. In all honesty we can probably rule out a lot of these, but let’s continue with this list, as the worst case. Like Hannah, I’ll be using Linux command line tools along the way to show how I got my stats.
Bit of housekeeping to start with – there’s a few diacritics and apostrophes in the source word list, so we’ll filter out anything which isn’t pure ASCII, and then filter it to just words with five letters from a to z in.
    $ grep -P "^[[:ascii:]]*$" /usr/share/dict/american-english-huge | \
        grep -P "^[a-z]{5}$" > wordle-words.txt
    $ wc -l wordle-words.txt
    11302 wordle-words.txt
Intuitively we’re trying to gather as much information as possible as early as possible, in terms of presence and position. Therefore we want to start with the letters which are most likely to occur. This way if they are in the word it might not reduce the list much, but we get positional information, and if they are not in the word it reduces the list a lot.
So, let’s count how many times each letter appears in a word.
    $ grep -o "." ./wordle-words.txt | sort | uniq -c | sort -b -g -r | column
       5807 s     2828 l     1637 c     1194 k      228 j
       5723 e     2722 t     1628 m      869 f       92 q
       5242 a     2414 n     1531 y      856 w
       3832 o     2013 d     1387 h      593 v
       3636 r     1898 u     1334 g      340 z
       2901 i     1661 p     1307 b      237 x
If we take the 5 most common letters across those 11,000 words, these are “s”, “e”, “a”, “o”, and “r”. Sounds like “AROSE” is a good first guess! If we count all the words which contain one or more of these letters, it’s an amazing 10,782. We can be almost certain of a hit, and if not we’ve only 520 candidate words left to search.
    $ grep "[arose]" wordle-words.txt | wc -l
    10782
If on the other hand we used the least popular letters we could (actually quite tricky to find words using them – but “BUZZY” is pretty bad), we only cover 4,665 words. Our odds of a hit have gone down from ~95% to less than 50%. What’s worse, if we don’t match any, we are still left with 6,637 words to search. Not good.
First word | No. matches | No. non-matches |
---|---|---|
AROSE | 10782 | 520 |
BUZZY | 4665 | 6637 |
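The two rows above can be produced for any opening word with a small helper. This is a rough sketch: `coverage` is a hypothetical name, and it assumes the guess is plain lowercase letters and the word list has one word per line.

```shell
# coverage WORD FILE: print WORD, the number of words in FILE sharing at
# least one letter with it, and the number sharing none.
# (Hypothetical helper; assumes WORD contains only lowercase a-z.)
coverage() {
    word=$1
    file=$2
    # grep -c exits non-zero when the count is 0, hence the "|| true".
    hits=$(grep -c "[$word]" "$file" || true)
    misses=$(grep -vc "[$word]" "$file" || true)
    echo "$word $hits $misses"
}
# Example: coverage arose wordle-words.txt
```

Using a bracket expression built from the guess means repeated letters (as in “BUZZY”) are harmless.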
Let’s ignore the position information for just a moment, and see what happens across the possible hits. The huge advantage we have is that if we only hit “A”, we know we didn’t hit any of “ROSE”, which also tells us a lot.
    $ grep "[arose]" wordle-words.txt | \
        grep "a" | \
        grep -v "[rose]" | wc -l
    797
    $ grep "[arose]" wordle-words.txt | \
        grep "a" | \
        grep "r" | \
        grep -v "[ose]" | wc -l
    431
Do the above for all the possible combinations and you get:
Letters matched | No. words |
---|---|
A | 797 |
R | 189 |
O | 531 |
S | 667 |
E | 789 |
AR | 431 |
AO | 323 |
AS | 894 |
AE | 561 |
RO | 249 |
RS | 234 |
RE | 522 |
OS | 659 |
OE | 417 |
SE | 823 |
ARO | 153 |
ARS | 315 |
ARE | 380 |
AOS | 181 |
AOE | 47 |
ASE | 409 |
ROS | 186 |
ROE | 211 |
RSE | 283 |
OSE | 255 |
ROSE | 85 |
AOSE | 10 |
ARSE | 123 |
AROE | 12 |
AROS | 45 |
AROSE | 1 |
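Rather than running a grep pipeline by hand for each of the 31 combinations, the whole table can be tallied in one pass. This is a sketch under assumptions, not the method used above: it swaps the chained greps for awk, and `count_exact` is a hypothetical name. For each word it builds a key from whichever of “a”, “r”, “o”, “s”, “e” the word contains, then counts the words per key.

```shell
# count_exact FILE: group the words in FILE by exactly which letters of
# "arose" they contain, and print one "letters count" line per group.
# Words containing none of the five letters are skipped.
count_exact() {
    awk '{
        key = ""
        n = split("a r o s e", letters, " ")
        for (i = 1; i <= n; i++)
            if (index($0, letters[i]) > 0)
                key = key letters[i]
        if (key != "")
            count[key]++
    }
    END { for (k in count) print k, count[k] }' "$1" | sort
}
# Example: count_exact wordle-words.txt
```

Combinations that no word matches simply don’t appear in the output.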
So, even without considering positional information, we have reduced the word list from 10,782 to a maximum of 894. If we take the position information into account, we reduce it even further.
    $ grep "[arose]" wordle-words.txt | \
        grep "a" | \
        grep -v "[rose]" | \
        grep "a...." | wc -l
    123
    $ grep "[arose]" wordle-words.txt | \
        grep "a" | \
        grep -v "[rose]" | \
        grep -v "a...." | wc -l
    674
Letter | position | Candidate word count |
---|---|---|
A | right | 123 |
A | wrong | 674 |
R | right | 77 |
R | wrong | 112 |
O | right | 140 |
O | wrong | 391 |
S | right | 54 |
S | wrong | 613 |
E | right | 252 |
E | wrong | 537 |
Doing this for all the combinations is pretty tedious, but here is a dig into AS, which is our least favourable combination:
| | S Right | S Wrong |
---|---|---|
A Right | 17 | 68 |
A Wrong | 59 | 750 |
So it seems reasonable to estimate our maximum new candidate list is 750 words. Not bad for one guess.
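A two-by-two split like the AS one can be tallied along these lines. This is only a sketch: `as_split` is a hypothetical name, and it treats “right” as meaning the letter sits in the same slot as in AROSE (slot 1 for “a”, slot 4 for “s”), which ignores the subtleties of how Wordle scores repeated letters.

```shell
# as_split FILE: among words containing "a" and "s" but none of "r", "o",
# "e", count each combination of "a" in slot 1 (or not) and "s" in slot 4
# (or not), i.e. the slots those letters occupy in AROSE.
as_split() {
    grep "a" "$1" | grep "s" | grep -v "[roe]" | awk '{
        a = (substr($0, 1, 1) == "a") ? "A right" : "A wrong"
        s = (substr($0, 4, 1) == "s") ? "S right" : "S wrong"
        count[a ", " s]++
    }
    END { for (k in count) print k ": " count[k] }'
}
# Example: as_split wordle-words.txt
```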
For the next guess, we want to choose a distinct set of letters to maximise our coverage. Five of the next six letters on the list (skipping “d”) are “i”, “l”, “t”, “n”, “u” – “UNTIL”. Our coverage with UNTIL against the original list is still pretty high at 8,690 matches vs. 2,612 non-matches. Of course if you are doing this specifically, rather than generically, you would use the actual candidate word list after the first guess. It also turns out there is only one five letter word which does not have any letters from “AROSEUNTIL” in it, and that word is … “PYGMY”.
So after two guesses, we know we have either hit at least one letter, or we know the answer is pygmy. We also know there are at most 894 possible words it could be (although this estimate is too high), and we’ve still got four guesses to go.
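Finding the words that dodge both guesses is a single inverted grep over the combined letters. A minimal sketch, with `untouched` as a hypothetical helper name and the same lowercase, one-word-per-line assumptions as before:

```shell
# untouched LETTERS FILE: list the words in FILE that contain none of the
# letters in LETTERS.
untouched() {
    grep -v "[$1]" "$2"
}
# Example: untouched aroseuntil wordle-words.txt
```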
For completeness, what happens when “AROSE” doesn’t match at all? Let’s rinse and repeat with the 520 words we had left when we didn’t match any:
    $ grep -v "[arose]" ./wordle-words.txt | grep -o "." | sort | uniq -c | sort -b -g -r | column
        403 i      188 n      100 d       35 z        7 q
        260 y      157 c       95 g       30 w
        255 u      138 p       89 f       21 v
        200 l      134 h       83 b       15 j
        190 t      114 m       73 k       13 x
This time we can’t make a word from the top 5. Best option is something like “UNITY”, which it turns out hits every one of our remaining 520 options. So in this scenario we’ve got 4 guesses left, we know at least 1 letter, and something about its position.
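One way to hunt for a follow-up like “UNITY” without eyeballing the letter counts is to score every remaining word by how many of the remaining words it touches. This is a brute-force sketch, not how the word was picked above: `best_cover` is a hypothetical name, and it runs one grep per candidate, which is fine for a few hundred words but slow for the full list.

```shell
# best_cover FILE: for each word in FILE, count how many words in FILE share
# at least one letter with it, then print the five highest scores.
best_cover() {
    while read -r guess; do
        echo "$(grep -c "[$guess]" "$1") $guess"
    done < "$1" | sort -rn | head -n 5
}
# Example:
#   grep -v "[arose]" wordle-words.txt > remaining.txt
#   best_cover remaining.txt
```

Every word matches itself, so the lowest possible score is 1.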
Whatever happens, after two guesses we know at least 1 letter (or the answer), our candidate list is reduced to <10% of the original word list, and we’ve still got four more guesses.
So I would feel pretty confident saying “yes” – it is always solvable in 6 guesses. I know I haven’t proved it by a long stretch, but I’m happy enough.
The only trouble with this is it does kind of take the fun out of it a bit, and you lose the sense of closing in on the solution. You also know that you will never score a “2”, unless the second word just happens to be “UNTIL”.
But I’ll give it a whirl for a few days – solving Wordle by making the first two guesses “AROSE” and “UNTIL”! My prediction is that I will usually only need 3 guesses.