Google had its LLM murder itself in Werewolf to test its AI smarts

A slide during the GDC 2024 panel "Simulacra and Subterfuge: Building Agentic 'Werewolf'". It shows the text "Generative Werewolf" and avatars for LLM bots.
(Image credit: Michael Hicks / Android Central)

At GDC 2024, Google AI senior engineers Jane Friedhoff (UX) and Feiyang Chen (Software) showed off the results of their Werewolf AI experiment, in which all the innocent villagers and devious, murdering wolves are Large Language Models (LLMs).

Friedhoff and Chen trained each LLM chatbot to generate dialogue with unique personalities, strategize gambits based on their roles, reason out what other players (AI or human) are hiding, and then vote for the most suspicious person (or the werewolf's scapegoat). 

They then set the Google AI bots loose, testing how good they were at spotting lies and how susceptible they were to gaslighting. They also removed specific capabilities, such as memory or deductive reasoning, to see how the results changed.
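
To make the idea of switching off capabilities concrete, here is a minimal Python sketch of that kind of comparison. It is not the Google team's code: the `Capabilities` switches, the `villager_win_rate` helper, and the `placeholder_game` stub are all hypothetical names invented for this example, and the stub stands in for a real LLM-driven game.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Capabilities:
    """Hypothetical switches for what a bot is allowed to use."""
    memory: bool = True
    reasoning: bool = True


def villager_win_rate(play_game: Callable[[Capabilities], bool],
                      caps: Capabilities, n_games: int) -> float:
    """Play n_games with the given capabilities; return the share the villagers won."""
    wins = sum(1 for _ in range(n_games) if play_game(caps))
    return wins / n_games


def placeholder_game(caps: Capabilities) -> bool:
    # Stand-in only: a real harness would run a full LLM-driven game here.
    # This stub reports a villager win only when both capabilities are enabled.
    return caps.memory and caps.reasoning


# Same harness, two configurations: everything on versus memory removed.
full = villager_win_rate(placeholder_game, Capabilities(), n_games=10)
no_memory = villager_win_rate(placeholder_game, Capabilities(memory=False), n_games=10)
```

The point of the structure is that the game loop stays identical across runs and only the capability switches change, so any difference in the win rate can be attributed to the capability that was removed.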

A slide during the GDC 2024 panel "Simulacra and Subterfuge: Building Agentic 'Werewolf'". It shows an example of the Generative Werewolf game in action, with bots attempting to deceive villagers or sniff out werewolves.

(Image credit: Michael Hicks / Android Central)

The Google engineering team was frank about the experiment's successes and shortcomings. In ideal situations, the villagers came to the right conclusion nine times out of 10; without proper reasoning and memory, the results fell to three out of 10. In that weakened state, the bots were too cagey to reveal useful information and too skeptical of any claims, leading to random dogpiling on unlucky targets.

Even at full mental capacity, though, these bots tended to be too skeptical of anyone (like seers) who made bold claims early on. The engineers tracked each bot's intended end-of-round vote after every line of dialogue and found that opinions rarely changed after those initial suspicions, regardless of what was said.
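
As a rough illustration of that kind of vote tracking, here is a minimal Python sketch. It is not the team's actual tooling: the log format (one snapshot of intended votes after each line of dialogue), the bot names, and the `vote_change_rate` helper are assumptions made up for this example.

```python
from typing import Dict, List


def vote_change_rate(vote_log: List[Dict[str, str]]) -> float:
    """Share of (bot, dialogue line) steps at which a bot's intended vote changed."""
    changes = 0
    steps = 0
    # Compare each snapshot with the one recorded after the previous line.
    for prev, curr in zip(vote_log, vote_log[1:]):
        for bot, target in curr.items():
            if bot in prev:
                steps += 1
                if prev[bot] != target:
                    changes += 1
    return changes / steps if steps else 0.0


# Hypothetical log: each dict maps a bot to the player it currently intends to vote for.
example_log = [
    {"bot_a": "bot_b", "bot_b": "bot_a", "bot_c": "bot_a"},
    {"bot_a": "bot_b", "bot_b": "bot_a", "bot_c": "bot_a"},
    {"bot_a": "bot_b", "bot_b": "bot_a", "bot_c": "bot_b"},
]
rate = vote_change_rate(example_log)
```

A rate close to zero would match what the panel described: bots settling on a suspect early and sticking with that choice no matter what is said afterward.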

Google's human testers, despite saying it was a blast to play Werewolf with AI bots, rated them 2/5 or 3/5 for reasoning and found that the best strategy for winning was to stay silent and let certain bots take the fall.

As Friedhoff explained, staying silent is a legitimate strategy for a werewolf, but it isn't necessarily a fun one or the point of the game. The players had more fun messing with the bots' personalities; in one example, they told the bots to talk like pirates for the rest of the game, and the bots obliged while also getting suspicious, asking, "Why ye be doing such a thing?"

A slide during the GDC 2024 panel "Simulacra and Subterfuge: Building Agentic 'Werewolf'". It shows an example of a werewolf tricking the villagers using a fake accusation and the bots hallucinating that the accusation was true.

(Image credit: Michael Hicks / Android Central)

That aside, the test showed the limits of the bots' reasoning. The engineers gave some bots distinct personalities, such as a paranoid bot suspicious of everyone or a theatrical bot that spoke like a Shakespearean actor, and the other bots reacted to those personalities without any context. They found the theatrical bot suspicious because it was so wordy and roundabout, even though that was its default personality.

In real-life Werewolf, the goal is to catch people speaking or behaving differently than usual. That's where these LLMs fall short. 

Friedhoff also provided a hilarious example of a bot hallucination leading the villagers astray. When Isaac (the seer bot) accused Scott (the werewolf bot) of being suspicious, Scott responded that Isaac had accused the innocent "Liam" of being a werewolf and gotten him unfairly exiled. Isaac responded defensively, and suspicion turned to him — even though Liam didn't exist and the scenario was made up. 


Google's AI efforts, like Gemini, have become smarter over time. Another GDC panel showcased Google's vision of generative AI video games that auto-respond to player feedback in real time and have "hundreds of thousands" of LLM-backed NPCs that remember player interactions and respond organically to their questions.

Experiments like this, though, look past Google execs' bold plans and show how far artificial intelligence has to go before it's ready to replace actual written dialogue or real-life players. 

Chen and Friedhoff managed to imitate the complexity of dialogue, memory, and reasoning that goes into a party game like Werewolf, and that's genuinely impressive! But these LLM bots need to go back to school before they're consumer-ready. 

In the meantime, Friedhoff says that these kinds of LLM experiments are a great way for game developers to "contribute to machine learning research through games," and that the experiment shows players are more excited about building and teaching LLM personalities than about playing with them.

Still, the idea of mobile games with text-based characters that respond organically to your text responses is intriguing, especially for interactive fiction, which typically requires hundreds of thousands of words of dialogue to give players enough choices.

If the best Android phones with NPUs capable of AI processing could deliver speedy LLM responses for organic games, that could be truly transformative for gaming. This Generative Werewolf experiment is a good reminder that this future is a ways off, however.

Michael L Hicks
Senior Editor, VR/AR and fitness
