This Computer Won The 2021 American Crossword Puzzle Tournament
In 2012, a computer program named Dr. Fill placed 141st out of some 660 entries in that year’s American Crossword Puzzle Tournament, a competition for elite crossword puzzle solvers. This year, the algorithm beat the human competition, completing the final playoff puzzle in just 49 seconds.
The A.I. relies on a collection of different techniques to make sense of a puzzle. Sometimes, a simple fact is needed—who was the First Lady before Eleanor Roosevelt? (Lou Henry Hoover.) More often, however, crossword puzzle solutions rely not just on factual knowledge, but also on an ability to recognize themes that puzzle constructors have embedded in the crosswords, along with an understanding of puns, homonyms, and word play. (Think: Five letters, “dining table leaves”—SALAD!) The program makes a series of statistical calculations about likely answers, then tries to fit those possibilities into the puzzle squares.
This year, researchers from the Berkeley Natural Language Processing group added their expertise to Dr. Fill’s algorithms—a contribution that may have helped push Dr. Fill to its crowning victory.
But the program isn’t infallible. This year, it made three mistakes solving puzzles during the tournament, while some human solvers completed the puzzles perfectly. And it remains prone to errors on any novel puzzle form it has never seen before.
Matt Ginsberg, the computer programmer behind Dr. Fill, joins Ira to talk about the competition and the advances his program has made over the years.
Matt Ginsberg is a computer scientist, crossword constructor, and co-founder of Connected Signals in Eugene, Oregon.
IRA FLATOW: For the rest of the hour, an update on a story we first told you about back in 2014– a computer program called Dr. Fill that could solve crossword puzzles. Last month, for the first time, the program unofficially beat human competition in the American Crossword Puzzle Tournament, solving the playoff puzzle in just– get this– 49 seconds. Oh. Joining me now is Matt Ginsberg, the computer programmer who developed Dr. Fill. Welcome back to Science Friday.
MATT GINSBERG: Thanks, Ira.
IRA FLATOW: I want to tell everybody that we have a video demo of how the program tackled some of this year’s competition puzzles up on our website, sciencefriday.com/crossword. OK, first of all, describe the format of the competition for us. What do you have to do?
MATT GINSBERG: You have to solve seven crosswords. Six of them are on Saturday, and one is Sunday morning. And then there’s a championship puzzle on Sunday afternoon for the people who did the best.
It’s timed. The puzzles vary in difficulty. The fifth puzzle is actually far and away the hardest. The first is the easiest, to sort of give you a feeling of confidence going in. The sixth is easy so you don’t feel so bad about what happened to you on the fifth puzzle.
And the seventh on Sunday is– it’s like a Sunday-size puzzle. It’s not overwhelmingly difficult. And people solve them flawlessly in a few minutes.
IRA FLATOW: Wow. And how well did your program do in this recent tournament?
MATT GINSBERG: Dr. Fill made three mistakes, but it was so much faster than the humans– typically solving the puzzles in less than a minute– that based on the scoring system in use, it came out a tiny bit ahead of the top human.
IRA FLATOW: Did it rattle the humans there?
MATT GINSBERG: It was a virtual tournament this year, so it couldn’t do anything. But when the tournament is live, Will Shortz, who runs it, typically reports Dr. Fill’s results as he goes. So when he starts puzzle 2, he says, on puzzle 1, Dr. Fill did thus and so. And when Dr. Fill does poorly, everybody applauds. And when Dr. Fill does well, everybody boos. But it’s sort of good-natured competition between man and machine.
IRA FLATOW: Well, we actually reached out to puzzle master Will Shortz for his thoughts, and he said, when it comes to something new and never seen before, humans still have the advantage in figuring it out. Oh, fighting words.
MATT GINSBERG: No, it’s absolutely true. One of the puzzles this year was very clever. One of the clues, for example, was “crazed,” and the answer was “mannequin.” And the next clue over was “deduced,” and the answer was “fur.” And that makes no sense at all until you realize that if you say it—“mannequin fur”—you’re actually saying “manic” for “crazed” and “infer” for “deduced.”
Dr. Fill didn’t understand it at all. It still managed to do pretty well on that puzzle by solving from the down clues, though it made one mistake. But it had no clue what was going on.
I think next year– I’m excited about next year. We’re going to work very hard on Dr. Fill. And by we, I mean me and the Berkeley Natural Language Processing group, who helped this year. We’re going to work very hard on making Dr. Fill better. And I think the constructors are going to work very hard on making Dr. Fill worse. So we’ll see next year who wins that little battle.
IRA FLATOW: Let’s talk a bit about how it solves a puzzle. What are the steps it works through?
MATT GINSBERG: So it doesn’t solve the puzzle like we do. So when I solve a puzzle– when I try. I’m terrible. When I try and solve a puzzle, I say, oh, here’s a clue. My level of clue is, “Scooby blank,” three letters. So I say, oh, that’s “doo,” and I put it in.
When Dr. Fill solves a puzzle, it actually doesn’t write anything in. It makes giant lists of every possible word in every possible slot and how good it feels about putting that word in that position. And then, armed with those lists, it looks at all these combinations. Well, I can put this word here and that word there, and how do I feel about the combination? And it’s doing a ton of search over possible ways to fill in the puzzle. And it has algorithms that are designed to help it find the overall fill that it feels the best about.
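The approach Ginsberg describes can be sketched in miniature: score every candidate word for every slot, then search the combinations for the overall fill the solver “feels best about.” This is a toy illustration only; the slot names, word lists, scores, and crossing rule are all invented, and the real program uses far more sophisticated search than brute-force enumeration.

```python
from itertools import product

# Invented candidate lists: each slot maps to (word, confidence) pairs.
candidates = {
    "1-Across": [("DOO", 0.9), ("DOG", 0.2)],
    "1-Down":   [("DAB", 0.6), ("DUB", 0.5)],
}

def compatible(across, down):
    # Toy crossing constraint: both entries start in the same square,
    # so their first letters must agree.
    return across[0] == down[0]

def best_fill(candidates):
    """Search all combinations, keeping the compatible fill with the
    highest total confidence score."""
    best, best_score = None, float("-inf")
    slots = list(candidates)
    for combo in product(*candidates.values()):
        words = [w for w, _ in combo]
        score = sum(s for _, s in combo)
        if compatible(words[0], words[1]) and score > best_score:
            best, best_score = dict(zip(slots, words)), score
    return best, best_score

fill, score = best_fill(candidates)
```

Here the search is exhaustive because the grid is tiny; on a real puzzle, with thousands of candidates per slot, this is where the clever search algorithms Ginsberg mentions come in.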
And what the Berkeley guys did is they brought in work that made the program better calibrated: when it thought an answer was correct, it more often actually was correct. So that made it much more likely, when it was done with the puzzle, that it actually had gotten all the right answers in all the various places.
IRA FLATOW: Puzzle constructors sometimes reuse clues. Does your program have a database of past answers that it tries?
MATT GINSBERG: It does. It’s very happy when it finds a clue that’s been used before. So when I make a puzzle, for example, I like looking for clues that have been used before to clue a different word of that length, because then you’re throwing a little trick at the solver. And Dr. Fill knows about that. And it says, you know, I’ve seen this before, so probably it’s the same answer as before, but not for sure. But I’ll be pretty happy if I can put that answer in.
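The reuse of past clues Ginsberg describes amounts to indexing previous (clue, answer-length) pairs and boosting, rather than locking in, a previously seen answer. A minimal sketch, with an invented clue database, scoring function, and boost value:

```python
# Hypothetical database of previously seen clues, keyed by (clue, length).
past_clues = {
    ("Scooby ___", 3): "DOO",
    ("Crazed", 5): "MANIC",
}

def score_candidate(clue, answer, base_score):
    """Boost a candidate's score if this clue has produced this answer
    before at the same length -- probably, but not certainly, right."""
    seen = past_clues.get((clue, len(answer)))
    if seen == answer:
        return base_score + 0.5  # invented boost for illustration
    return base_score
```

The key design point matches what Ginsberg says: a match raises confidence but never makes the answer certain, since constructors deliberately reuse old clues for new answers.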
IRA FLATOW: You know, to be successful, you need to know a lot of random facts, like who were the Academy Award winners in a certain year. Where is it getting its information? Is it doing live look-ups of Wikipedia for some of the clues as it’s solving the puzzle?
MATT GINSBERG: It can. So one of the rules that Will Shortz and I agreed on was that Dr. Fill was not allowed to access the internet. So it can’t do a Google search, for example. But it’s a computer, right? So I do have a downloaded copy of Wikipedia that it can look in.
One of the things Will’s done for crosswords is that they’re not nearly as fact-based as they used to be. So yeah, occasionally you do see, you know, Eleanor’s predecessor as first lady, something along those lines. And there, you would like to be able to look it up.
But most of the crosswords these days are about common sense knowledge put in interesting ways. And there, it’s really a matter of having some understanding, you know, that “crazed” and “manic” mean the same thing. And there, Dr. Fill has a thesaurus. It has a dictionary. It’s got all these resources that it uses to try and figure out what the various words mean.
And again, this is where the Berkeley collaboration has been so great, because they are using a system that is not totally unlike when you talk to Siri. There’s a system that’s figuring out what you probably need and generating a useful response. And they’re using a system similar, at a high level, to those other systems. It’s designed specifically for crosswords, but it’s using all the work on machine learning and natural language and so on to help Dr. Fill understand a bit better what the right answer to any particular clue is likely to be.
IRA FLATOW: This is Science Friday from WNYC Studios. How well would Dr. Fill work against IBM’s Watson?
MATT GINSBERG: It’s really different domains. Jeopardy really is about facts. When the answer comes up on the Jeopardy board, it doesn’t say, 13th president of the United States, eight letters. When Dr. Fill gets a query, it knows exactly how long the answer is. It makes enormous use of the crossing words to say, oh, eight letters, and the second one is an “I.” That’s essential to what Dr. Fill is doing, and Watson has no capability like that at all.
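The use of length and crossing letters described here is essentially pattern matching over a word list. A toy sketch, assuming a small invented candidate list, where `.` stands for an unknown square:

```python
import re

def matches(pattern, words):
    """Keep only the words fitting a crossword pattern, e.g.
    '.I......' means eight letters with 'I' in the second square."""
    rx = re.compile(pattern + "$")
    return [w for w in words if rx.match(w)]

# Invented candidate list for illustration.
presidents = ["FILLMORE", "MCKINLEY", "HARRISON", "COOLIDGE"]
```

A single known crossing letter can collapse a long candidate list to one or two options, which is why Dr. Fill leans so heavily on the grid structure that Watson never had.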
On the other hand, Dr. Fill is dealing with situations that are deliberately vague, deliberately confusing. For example, in crosswords you may see a clue like “Nice flower.” And the answer is actually the river that goes through the French city of Nice. You’re supposed to parse it as “Nice ‘flow-er.'” They would never do that to you in Jeopardy, whereas crossword constructors do it to the solvers all the time.
So very, very different problem. Hard for different reasons. Easy for different reasons. And you’ve got two pieces of software that really are just good at what they do.
I mean, Dr. Fill isn’t going to do anything other than solve crosswords. It’s not designed to do anything else. And Watson’s forte is playing Jeopardy. And they feel like they’re the same, but they’re really quite different.
IRA FLATOW: Interesting. I know that crossword puzzle constructors are tricky. And in more difficult puzzles, they can build in added layers of complexity like you’re talking about, like having answers that skip a certain letter, for instance, or having answers that read backwards. How well does Dr. Fill do with these types of challenges?
MATT GINSBERG: It’s terrible.
IRA FLATOW: [LAUGHS]
MATT GINSBERG: So there was a puzzle at the tournament one year where every clue was a spoonerism. Dr. Fill can’t figure that out. It doesn’t understand. It has no way to sort of get started.
And if you’ve got these overarching themes– I made a puzzle once where every word was a homonym. But what you were supposed to enter into the grid wasn’t the actual word. It was the homonym of that word. Dr. Fill can’t do that. It just doesn’t understand; none of its rules will help it.
It does understand simple themes. So you might have a puzzle where the long entries add an “E” to a common phrase to get some wacky phrase. It’ll understand about that, and it’ll say, oh, look, the theme is add an “E.”
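A simple theme like “add an E” can be sketched as a check against a phrase list: does deleting some single E from a theme entry leave a known common phrase? Everything here, the phrase set and the entries, is invented for illustration:

```python
# Hypothetical list of known common phrases.
common_phrases = {"BARNONE", "TOPDOG"}

def is_add_an_e(entry):
    """True if removing some single 'E' from the entry yields a
    known common phrase -- i.e., the entry fits an 'add an E' theme."""
    return any(
        entry[:i] + entry[i + 1:] in common_phrases
        for i, ch in enumerate(entry)
        if ch == "E"
    )
```

So “BEARNONE” (a wacky take on “bar none”) fits the theme, while an entry with no spare E does not. Recognizing the pattern across several long entries is what lets the program say, as Ginsberg puts it, “oh, look, the theme is add an ‘E.'”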
But these overarching themes are too hard for it, because it doesn’t really know what it’s doing, right? It’s just following rules to fill letters into a grid. And when Will says that on something truly imaginative, humans have the edge, he is absolutely right.
IRA FLATOW: I’d like to thank my guest this hour, Matt Ginsberg, computer programmer. His program Dr. Fill unofficially won the most recent American Crossword Puzzle Tournament. Thank you, Matt, for taking time to be with us today.
MATT GINSBERG: Thank you very much.
IRA FLATOW: Matt is also author of the book Factor Man.