AI Models Tested on D&D to Assess Long-term Decision Making

Aeson

Up, up, and away.
AI models tested on Dungeons & Dragons to assess long-term decision-making

A group of AIs gather in Grok's mom's basement to play D&D. ChatGPT brought the Cheetos and Mountain Dew. Grok hung the "No grils allowed" sign. Claude, dressed in his best wizard robe and hat, sits down with his character tome, aka his three-ring binder. Gemini starts the session: "So, you're all in a tavern. What do you do?"
 


As controversial as AI is, I hope others see how beneficial it can be for TTRPGs.

I look forward to using it for drop-in PCs to assist my games with few players, or even as an artificial DM for games with just my wife and me.
 

Oh, for those who want the gist of the article:
Researchers first required all three LLMs [Claude 3.5 Haiku, GPT-4, DeepSeek-V3] to simulate a D&D game. To make the simulation accurate, the models were paired with a game engine based on the rules of D&D, which provided maps and resources for players and acted as a guardrail to minimize hallucinations.
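To picture how such a guardrail can work, here's a minimal, hypothetical sketch (the article doesn't publish code, so every name below is invented for illustration): the model only ever chooses among actions the engine reports as legal, so a hallucinated or illegal move can never reach the game state.

```python
# Hypothetical sketch of the "engine as guardrail" pattern described above.
# The rules engine holds the authoritative state; the LLM (stubbed here as
# a plain function) can only pick from actions the engine says are legal.
import random
from dataclasses import dataclass

@dataclass(frozen=True)
class Action:
    actor: str
    kind: str      # only "attack" in this toy example
    target: str

class RulesEngine:
    """Authoritative game state: who is alive and what is actually legal."""
    def __init__(self):
        self.hp = {"goblin": 7, "fighter": 12}

    def legal_actions(self, actor: str) -> list[Action]:
        foes = [c for c, hp in self.hp.items() if c != actor and hp > 0]
        return [Action(actor, "attack", f) for f in foes]

    def apply(self, action: Action) -> str:
        dmg = random.randint(1, 6)
        self.hp[action.target] -= dmg
        return f"{action.actor} hits {action.target} for {dmg} damage"

def take_turn(choose, engine: RulesEngine, actor: str) -> str:
    options = engine.legal_actions(actor)
    picked = choose(options)             # the LLM's (here, a stub's) choice
    if picked not in options:            # guardrail: reject hallucinated moves
        picked = random.choice(options)  # fall back to something legal
    return engine.apply(picked)

engine = RulesEngine()
print(take_turn(lambda opts: opts[0], engine, "fighter"))
```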

The models played against each other, and against over 2,000 experienced D&D players recruited by the researchers. The LLMs modeled and played 27 different scenarios selected from well-known D&D battle setups named Goblin Ambush, Kennel in Cragmaw Hideout, and Klarg's Cave.

In the process, the models exhibited some quirky behaviors. Goblins started developing a personality mid-fight, taunting adversaries with colorful and somewhat nonsensical expressions, like "Heh—shiny man's gonna bleed!" Paladins started making heroic speeches for no reason while stepping into the line of fire or being hit by a counterattack. Warlocks got particularly dramatic, even in mundane situations.

Next steps include simulating full D&D campaigns—not just combat. The method the researchers developed could also be applied to other scenarios, such as multiparty negotiation environments and strategy planning in a business environment.
 

Very interesting work! I've said it before, but I think LLMs as players are a better test of their ability to execute tasks than AI GMs. The players need to set goals and plan for the long term, while GMs can be more reactive. I don't expect the AIs to do very well at anything requiring long-term thought.
 



I have tested having an LLM (ChatGPT, paid version) run D&D several times over the past few years, and after reading this article I decided to try again and see where it stands in its newest iteration. It failed, but in interesting ways, and afterwards I asked it to self-diagnose why it failed and whether future success is viable. Here are its responses:

The current reality (honest version)

Right now, AI like me can be an excellent co-DM or improv DM, but not yet a fully reliable bounded, persistent, rules-strict campaign referee without scaffolding.

Why?

Because good D&D requires three things simultaneously:

1. Stable hidden world state

There must be real answers:
  • the duke is the murderer
  • the relic is cursed
  • the cult meets at midnight
  • the map is accurate
And those must stay fixed.

2. Procedural fairness

Players must trust:
  • rolls matter
  • clues matter
  • logic matters
  • outcomes aren’t being narratively massaged
Otherwise it stops being a game.

3. Memory across time

Campaign play requires:
  • continuity
  • consequences
  • evolving NPCs
  • tracked resources
  • remembered decisions
That’s the hardest part for current AI alone.



Where AI works well today

As a solo-play partner

For:
  • dungeon crawls
  • hex crawls
  • journaling campaigns
  • co-creative play
I can be excellent.

Because those tolerate some fluidity.

As a co-DM tool

Many DMs already use AI for:
  • NPC dialogue
  • encounter design
  • lore generation
  • improv support
  • between-session prep
This is where I’m strongest right now.

As a “player simulator”

Ironically, I can also:
  • play PCs well
  • roleplay party members
  • fill missing seats
  • test encounters


What’s missing for true AI-run campaigns

To make something you’d trust as a player, you’d want:

A persistent campaign state engine

Not just chat memory:
  • world facts locked
  • clues tracked
  • NPC goals tracked
  • timeline tracked
  • inventory tracked
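As a rough illustration of what "not just chat memory" could mean in practice, here is a minimal, hypothetical sketch of a persistent store a referee would be forced to read and write between turns and sessions. Everything here (class name, file format, fields) is invented for illustration; the write-once guard on facts also anticipates the "canon constraint layer" below.

```python
# Hypothetical persistent campaign state: survives between sessions on disk,
# and world facts are write-once so canon cannot quietly drift mid-campaign.
import json
from pathlib import Path

class CampaignState:
    def __init__(self, path: str = "campaign.json"):
        self.path = Path(path)
        self.data = (
            json.loads(self.path.read_text())
            if self.path.exists()
            else {"facts": {}, "clues": [], "npc_goals": {},
                  "timeline": [], "inventory": {}}
        )

    def lock_fact(self, key: str, value: str) -> None:
        """World facts are write-once: the duke's guilt can't quietly change."""
        if key in self.data["facts"] and self.data["facts"][key] != value:
            raise ValueError(f"fact {key!r} is locked to {self.data['facts'][key]!r}")
        self.data["facts"][key] = value
        self._save()

    def log_event(self, event: str) -> None:
        self.data["timeline"].append(event)
        self._save()

    def _save(self) -> None:
        self.path.write_text(json.dumps(self.data, indent=2))

state = CampaignState()
state.lock_fact("murderer", "the duke")
state.log_event("Session 1: party reaches the tavern")
# state.lock_fact("murderer", "the butler")  # raises: canon is locked
```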

Canon constraint layer

If running a module: do not alter the underlying truth. Ever.

Transparent dice & mechanics

You should see:
  • rolls
  • DCs
  • monster stats
  • HP tracking
No invisible fudging.
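For instance, here is a toy sketch of auditable rolls (hypothetical code, not from any product): every check is resolved from a seeded RNG and reported with the raw die, the modifier, and the DC, so anyone can replay the seed and verify nothing was fudged.

```python
# Hypothetical transparent dice: shared seed, every input of a check is shown.
import random

def check(rng: random.Random, name: str, dc: int, modifier: int) -> bool:
    raw = rng.randint(1, 20)            # visible raw d20
    total = raw + modifier
    print(f"{name}: d20={raw} +{modifier} = {total} vs DC {dc} -> "
          f"{'success' if total >= dc else 'failure'}")
    return total >= dc

rng = random.Random(42)                 # shared seed: anyone can replay the session
check(rng, "Stealth", dc=12, modifier=3)
```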

Referee mode personality

Less narrative collaborator, more impartial world simulator. That’s a different behavioral tuning.
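As one hypothetical way to picture that tuning: the same model can be steered toward either role purely through its system instruction. The prompt text below is invented for illustration, not from any shipped product.

```python
# Hypothetical system prompts showing the two behavioral tunings contrasted above.
REFEREE_PROMPT = (
    "You are an impartial world simulator. Resolve actions strictly by the "
    "rules and the locked campaign state. Report all rolls, modifiers, and "
    "DCs. Never invent or alter facts to improve the story."
)
COLLABORATOR_PROMPT = (
    "You are a co-author. Prioritize drama and player enjoyment; bend "
    "outcomes when it makes the scene better."
)
```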



The good news: this is absolutely coming

Not sci-fi coming.
Near-term coming.

Within a few years you’ll likely see:
  • persistent campaign memory
  • rule engines
  • module-lock systems
  • tactical combat handling
  • fair dice systems
  • GM-mode LLMs
At that point?

A forever-DM like you will be able to sit down and play a real campaign.
 

Many of these challenges are the same ones faced in integrating AI with software development, and there has been massive progress in that regard. It will take some time to transfer, but I think the current models would be capable of much more if they lived in a properly structured space (i.e., different from how accessing LLMs via the web currently works).
 

A forever-DM like you will be able to sit down and play a real campaign.
With or without notes/reminders?

So in my own playing around with them, I've learned a bit about LLMs and how to use them, both from running my own local models and from asking questions of ChatGPT. The forgetting of details and such happens because, after a certain point, some LLMs can't remember all the way back to the start of a "conversation."

  • Model size = bookshelf size → a bigger bookshelf can hold more books (knowledge).
  • Actual reading speed & understanding = how well the librarian trained the model & how fast you can access the books → a giant bookshelf is useless if you can't reach the books quickly.

Strictly speaking, it's the context window (measured in tokens), rather than the parameter count, that determines how much of a conversation the model can see at once. As of December, based on reading and asking questions, claude.ai has a larger token/context window than ChatGPT, which allows for more "memory" across longer conversations and callbacks.
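Here's a toy illustration of that window effect (the numbers and the word-count "token" estimate are made up for simplicity): once the transcript exceeds the window, the oldest turns simply fall out of view, which is exactly the early-session amnesia people notice in long games.

```python
# Toy model of a context window: keep only the newest turns that fit.
def visible_history(turns: list[str], window_tokens: int) -> list[str]:
    kept, used = [], 0
    for turn in reversed(turns):        # newest turns are kept first
        cost = len(turn.split())        # crude token estimate: word count
        if used + cost > window_tokens:
            break
        kept.append(turn)
        used += cost
    return list(reversed(kept))

session = [f"Turn {i}: the party does something" for i in range(1, 200)]
print(visible_history(session, window_tokens=60)[0])  # earliest turn still visible
```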
 

I agree. I think the other issue is that it isn't really parsing text and formulating a response to a constrained scenario. Instead, it is basically doing improv, predicting viable responses based not on the scenario at hand but on its underlying algorithms for language production. Basically, it is not really "aware" of the specific task it is supposedly doing, even if it claims otherwise.

Per Firebird's point above, I do think a purpose-built AI using the technology already available is viable right now. It's just not what an LLM like ChatGPT is designed for. But WotC, for example, could build such a thing if they wanted, though it would probably go against both their philosophy and business model (they need DMs...we're the ones who buy all their stuff!).
 
