LLMs as a GM

Thanks, Scott. I think we’re very much aligned on what the tools can do—though I’d caution you on one point. Summaries aren’t a solution to the memory problem. They’re a workaround—and only a partial one.

Adding more detail to summaries doesn’t preserve more information. It just increases the cognitive and processing burden the LLM has to carry in-session. The model doesn’t “retain” the summary—it reinterprets it at runtime, consuming tokens and model capacity to make sense of it on the fly. The more verbose or intricate the input, the more pressure it puts on that process. Something has to give.

The trick isn’t more detail. It’s compression with clarity. Summaries need to be:
  • High-signal, low-noise
  • Prioritized by what the model actually needs to re-reference mid-play
  • Consistently reinforced through structured prompts or schema
Otherwise, you’re just shuffling memory around and hoping the system doesn’t drop anything important. It will. LLMs don’t have a memory problem—they have a token economy. You’re renting attention with every message. Spend wisely.
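
To make that concrete, here’s a rough sketch of the kind of structured summary I mean. The schema and field names are purely illustrative, not any particular tool’s format:

```python
# A minimal sketch of a high-signal session summary: a few short, stable
# fields the model actually needs mid-play, rendered as one compact block.
# Field names and structure here are illustrative, not any standard schema.
from dataclasses import dataclass, field

@dataclass
class SessionSummary:
    active_goals: list[str] = field(default_factory=list)     # e.g. "Rescue the miller's son"
    hard_facts: list[str] = field(default_factory=list)       # e.g. "Party owes the guild 50 gp"
    npc_states: dict[str, str] = field(default_factory=dict)  # e.g. {"Brother Aldric": "hostile"}
    open_threads: list[str] = field(default_factory=list)     # unresolved hooks; keep to 3-5

    def to_prompt(self) -> str:
        """Render the summary as a short block to prepend to each request."""
        lines = ["## Campaign state (authoritative; do not contradict)"]
        lines += [f"- GOAL: {g}" for g in self.active_goals]
        lines += [f"- FACT: {f}" for f in self.hard_facts]
        lines += [f"- NPC {name}: {state}" for name, state in self.npc_states.items()]
        lines += [f"- THREAD: {t}" for t in self.open_threads]
        return "\n".join(lines)
```

The point isn’t this exact shape; it’s that every field earns its tokens, and the model sees the same structure every time.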

That said, all is not lost. One of an LLM’s best traits is its flexibility: when it forgets, you can just remind it. It doesn’t argue. It doesn’t resist. It adapts in real time and moves forward. You can always steer the conversation and reinforce what matters, and it will follow. That kind of responsiveness isn’t perfect, but it’s powerful, and more than enough to make it work.
In other words, they are still too much damn work.
From the amount of effort described so far, I cannot see myself using them. They need to be more computationally efficient, and there needs to be some way for them to colour outside the lines, so to speak.
 


That's a great encapsulation. And additionally, there's an expectation that computers act the way we've always expected them to, so they don't 'get the math wrong'; but with LLMs that expectation should really be more aligned with how we might think of a person: they do in fact mis-remember, mis-understand, and get the math wrong.

I've definitely found that I shouldn't necessarily count on the LLM, even provided eg. the PDF, to get the rules right, and I try to either check myself, or ask the kinds of questions that might surface "oh, wait a minute..." from it.

When you're using the ChatGPT (or Claude or Gemini) app, you're bound to its behavior, but if you build your own you have a lot more avenues for helping it with specific prompting, tools and context sources (although fair warning, it gets complicated fast 😄).
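
To give a feel for what "building your own" means at its simplest, here's a bare-bones sketch using the OpenAI Python SDK; the model name and prompt wording are placeholders, and the same shape works with the Claude or Gemini APIs:

```python
# Bare-bones custom GM loop: a fixed system prompt is sent with every turn,
# so the role boundary never scrolls out of the context window.
# Model name and prompt wording are placeholders; adjust to taste.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

SYSTEM_PROMPT = (
    "You are the Game Master. You control NPCs, the world, and rulings. "
    "You never narrate the player character's actions, thoughts, or dialogue. "
    "If the player is vague, ask for clarification instead of deciding for them."
)

history = [{"role": "system", "content": SYSTEM_PROMPT}]

while True:
    history.append({"role": "user", "content": input("> ")})
    response = client.chat.completions.create(
        model="gpt-4o",  # placeholder; use whatever model you have access to
        messages=history,
    )
    reply = response.choices[0].message.content
    history.append({"role": "assistant", "content": reply})
    print(reply)
```

From there you can bolt on tool calls, retrieval, dice rollers, whatever; that's where the "it gets complicated fast" part kicks in.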

I find that at the moment it's still best as a player, because it more naturally lets you interject in the same way you might with players who aren't fully up on the rules, or who are a little 'forgetful' 😊

I actually find one of the biggest problems I had to work around was a sort of 'unlicensed co-creation', where the AI doesn't quite understand the distinction between the player/GM roles and ends up eagerly stepping over that boundary. Reprimanding helps for a bit, but context length kills it over time. The best fixes I've found are a clear system prompt and, in particular, model choice.
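
For that specific problem, one pattern that's helped me is periodically re-injecting the boundary instruction so it stays recent in the context. A sketch; the interval and wording are arbitrary:

```python
# Sketch of periodic boundary reinforcement: every Nth turn, append a short
# reminder message so the instruction stays recent in the context window.
# The interval and wording are arbitrary; tune to your model and game.
BOUNDARY_REMINDER = {
    "role": "system",
    "content": "Reminder: you are the GM only. Never act or speak for the player.",
}

def with_reminder(history: list[dict], turn: int, every: int = 8) -> list[dict]:
    """Return the messages to send, adding the reminder on every Nth turn."""
    if turn > 0 and turn % every == 0:
        return history + [BOUNDARY_REMINDER]
    return history
```
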
That aligns with something I’ve seen too—especially the part about not expecting the LLM to "just know" the rules, even if you provide the PDF or source material. It’s a common assumption: that having access to the rules means it can retrieve and apply them like a structured system. But that’s not really how it works.

Even with the full rules in context, the model still has to read, interpret, and synthesize meaning from the material every time it responds. It’s not pulling from a stable rule engine—it’s rebuilding its understanding on the fly, each time, through probabilistic reasoning. That introduces variability.

I tend to think of it like this: memory is the pot that cooks the soup. You can keep adding ingredients (text, documents, clarifications), but at some point, the earlier flavors fade. You don’t get more coherence just by adding more content. You need to control the temperature, the order, the timing—which is why I focus more on building stable context windows and scaffolds than just loading up resources.
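
To make “scaffold” a little more concrete: what I mean is keeping the context in a fixed shape every turn, rather than letting material pile up in arrival order. A rough sketch, with section names that are mine, not any standard:

```python
# Sketch of a fixed context scaffold: same sections, same order, every turn.
# Stable structure lets the model re-find what matters; only the tail of the
# transcript is included verbatim, the rest lives in the compact summary.
def build_context(table_rules: str, campaign_state: str, recent_turns: list[str]) -> str:
    return "\n\n".join([
        "## Table rules\n" + table_rules,                    # never changes
        "## Campaign state\n" + campaign_state,              # short, curated summary
        "## Recent play\n" + "\n".join(recent_turns[-10:]),  # last few turns only
    ])
```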

That said, I’ve also found it much easier to go with the flow than to force strict expectations. If you focus on the experience rather than the rules, LLMs have room to surprise you—especially when you let them lean into inference and improvisation. Once the boundaries are clear, the freedom inside them can produce some unexpectedly good moments.

For example, I asked the model to run Keep on the Borderlands several times to observe how it would approach the same prompt with different variations. It’s a classic, widely discussed module, and the model already had a strong sense of the tone, structure, and general content—just from exposure to the vast amount of material written about it online.

Even without the actual module in context, it was able to generate believable versions of the adventure. The details varied, but the atmosphere, encounter structure, and thematic beats remained consistent enough to feel intentional. It wasn’t exact—but it felt right.

That’s where expectations can get misaligned. We assume that if the LLM has the PDF or module file, it should know exact details—what’s in room 13, how much copper is hidden, the NPC names, etc. But that’s not how LLMs work. They don’t retrieve text from a file like a database. They read, interpret, synthesize, and generate a response based on patterns—not memory.

So while giving it a file might help guide its inference, what you’re getting is still a reconstruction, not a citation. That’s why I’ve found it more effective to focus on the experience rather than expecting perfect recall. If you let the model lean into what it does well—tone, structure, improvisation—it can often surprise you in a good way.
 

SMLs are just LLMs but smaller, no? And (generally speaking) their size is mirrored in their performance, no?
They're related but more precise in purpose and tend to actually be vastly more performant for the specific and specialized tasks they're used for.


What’s ironic is that this human feedback loop—the way narratives form, get repeated, and reinforce themselves—mirrors many of the concerns people raise about AI: that it recycles dominant patterns, amplifies bias, resists nuance, and flattens complexity. We talk about LLMs doing this, but public discourse does it just as reliably. And without realizing it, that cycle ends up training expectations more than the systems do.
That's a rather sophomoric comparison, I'd suggest, however appealing that kind of oversimplification might be to people. Also you're conflating "the media/social media" with "human feedback", which are two very separate things. That social media (and much of the rest of the media) is almost completely owned by people investing billions in LLMs informs the conversation far more than any other factor.
 

That's a rather sophomoric comparison, I'd suggest, however appealing that kind of oversimplification might be to people. Also you're conflating "the media/social media" with "human feedback", which are two very separate things. That social media (and much of the rest of the media) is almost completely owned by people investing billions in LLMs informs the conversation far more than any other factor.
You're free to disagree with the comparison, but calling it “sophomoric” doesn’t address the point—it just tries to discredit it. That’s not engagement; it’s dismissal.

What I actually said is that public discourse—how people talk, repeat, and reinforce shared narratives—often mirrors the same dynamics people criticize in AI: pattern reinforcement, loss of nuance, amplification of dominant frames. That includes everything from casual conversation to outrage cycles to social media feedback loops. It’s not just “media,” and it’s not just algorithms. It’s us.

And ironically, the narrative you're falling back on—about billionaire control and media distortion—is itself one of those widely reinforced loops. It's familiar, emotionally charged, and often repeated without deeper inspection. That doesn’t make it false—but it does make it a perfect example of exactly what I was talking about.
 

I’ve also found it much easier to go with the flow than to force strict expectations. If you focus on the experience rather than the rules, LLMs have room to surprise you—especially when you let them lean into inference and improvisation. Once the boundaries are clear, the freedom inside them can produce some unexpectedly good moments.
That’s how I feel; it’s really all about setting the right expectations for yourself. A year or two ago I think the capabilities weren’t quite there, and it would lose the plot within twenty minutes or so, but now it can go much longer. And between longer contexts, RAG, and various memory techniques, I think it can go much further than that.
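
For anyone unfamiliar, the RAG idea here is just: score your past session notes against the current input and include only the best matches, instead of stuffing the whole history into the prompt. A toy sketch (with made-up log entries) using plain word overlap where a real system would use vector embeddings:

```python
# Toy illustration of retrieval for session memory: rank past log entries by
# how many words they share with the current input, and keep only the top k.
# Real RAG uses embeddings and a vector store; overlap keeps this dependency-free.
def retrieve(query: str, log: list[str], k: int = 1) -> list[str]:
    """Return the k log entries sharing the most words with the query."""
    q_words = set(query.lower().split())
    return sorted(
        log,
        key=lambda entry: len(q_words & set(entry.lower().split())),
        reverse=True,
    )[:k]

session_log = [
    "Session 2: The party angered the hobgoblin chief by freeing his prisoners.",
    "Session 4: Aldo the merchant promised a reward for clearing the owlbear cave.",
    "Session 5: The cleric was cursed at the shrine of chaos.",
]

# Ranks the session 2 entry first; it then gets prepended to the next prompt.
print(retrieve("We go back to find the hobgoblin chief", session_log))
```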

And honestly, I've had plenty of games with friends where no one can remember what happened last session, let alone ten sessions ago 😄. I think some of the logging and timeline stuff I've seen people play around with in this space is promising as well.

Likewise on rules, I’m actually someone who’s happy to halt the game for five minutes to figure out the rules properly (I also have fond memories of Rolemaster, so draw your own conclusions). But guesstimating rules is what we do around the table most of the time anyway. “Eh, I don’t remember, but let’s just call it that and I’ll look it up for next time.”

They're related but more precise in purpose and tend to actually be vastly more performant for the specific and specialized tasks they're used for.
No shade on SMLs, but an SML won’t out-reason eg. o3 pro across the range of random tasks that TTRPGs amount to, though it might out-perform it on a specific task space it was made for, doable at that scale. I’m excited for small purpose-built models running locally, for all the right reasons, including energy use, privacy, security and the neat fit of a purpose-built anything.

But not to drag this out, you said: "There are other forms of AI with a lot more potential, frankly (many of them older than LLMs)". SMLs are the same technology, and for the purposes of TTRPGs they would not be as capable as today’s LLMs. There are other forms of AI research, but none of them have much in hand today, so what good are they for playing RPGs now? Moreover, LLMs have proven exceptionally capable across a vast swath of areas including text, image, video, voice, music and, as it turns out, robotics, and probably more. The underlying technology ain’t nothin’.
 