LLMs as a GM

UngainlyTitan · Jul 5, 2025

Jacob Lewis said:
Thanks, Scott. I think we’re very much aligned on what the tools can do—though I’d caution you on one point. Summaries aren’t a solution to the memory problem. They’re a workaround—and only a partial one.

Adding more detail to summaries doesn’t preserve more information. It just increases the cognitive and processing burden the LLM has to carry in-session. The model doesn’t “retain” the summary—it reinterprets it at runtime, consuming tokens and model capacity to make sense of it on the fly. The more verbose or intricate the input, the more pressure it puts on that process. Something has to give.

The trick isn’t more detail. It’s compression with clarity. Summaries need to be:

High-signal, low-noise

Prioritized by what the model actually needs to re-reference mid-play

Consistently reinforced through structured prompts or schema

Otherwise, you’re just shuffling memory around and hoping the system doesn’t drop anything important. It will. LLMs don’t have a memory problem—they have a token economy. You’re renting attention with every message. Spend wisely.

That said, all is not lost. One of their best traits is its flexibility—when it forgets, you can just remind it. It doesn’t argue. It doesn’t resist. It adapts in real time and moves forward. You can always steer the conversation and reinforce what matters, and it will follow. That kind of responsiveness isn’t perfect, but it’s powerful—and more than enough to make it work.

In other words they are still too much damn work.
From the amount of effort described so far, I cannot see myself using them. They need to be more computationally efficient and some way that they do colour outside the lines, so to speak.

Jacob Lewis · Jul 5, 2025

Heilemann said:
That's a great encapsulation. An additionally there's an expectation that computers act the way we've always expected them to; so they don't 'get the math wrong', but with LLMs that expectation should really be more aligned with how we might think of a person, they in fact do mis-remember, mis-understand, and get the math wrong.

I've definitely found that I shouldn't necessarily count on the LLM, even provided eg. the PDF, to get the rules right, and I try to either check myself, or ask the kinds of questions that might surface "oh, wait a minute..." from it.

When you're using the ChatGPT (or Claude or Gemini) app, you're bound to its behavior, but if you build your own you have a lot more avenues for helping it with specific prompting, tools and context sources (although fair warning, it gets complicated fast ).

I find that at the moment it's still best as a player, because it more naturally lets you interject in the same way you might with players who aren't fully up on the rules, or who are a little 'forgetful'

I actually find one of the biggest problems I had to work around was sort of 'unlicensed co-creation' where the AI doesn't quite understand the distinction between the player/gm roles and ends up eagerly stepping over that boundary. Reprimanding helps for a bit, but context length kills it over time. The best fix I've found is clear system prompting and model choice in particular.

That aligns with something I’ve seen too—especially the part about not expecting the LLM to "just know" the rules, even if you provide the PDF or source material. It’s a common assumption: that having access to the rules means it can retrieve and apply them like a structured system. But that’s not really how it works.

Even with the full rules in context, the model still has to read, interpret, and synthesize meaning from the material every time it responds. It’s not pulling from a stable rule engine—it’s rebuilding its understanding on the fly, each time, through probabilistic reasoning. That introduces variability.

I tend to think of it like this: memory is the pot that cooks the soup. You can keep adding ingredients (text, documents, clarifications), but at some point, the earlier flavors fade. You don’t get more coherence just by adding more content. You need to control the temperature, the order, the timing—which is why I focus more on building stable context windows and scaffolds than just loading up resources.

That said, I’ve also found it much easier to go with the flow than to force strict expectations. If you focus on the experience rather than the rules, LLMs have room to surprise you—especially when you let them lean into inference and improvisation. Once the boundaries are clear, the freedom inside them can produce some unexpectedly good moments.

For example, I asked the model to run Keep on the Borderlands several times to observe how it would approach the same prompt with different variations. It’s a classic, widely discussed module, and the model already had a strong sense of the tone, structure, and general content—just from exposure to the vast amount of material written about it online.

Even without the actual module in context, it was able to generate believable versions of the adventure. The details varied, but the atmosphere, encounter structure, and thematic beats remained consistent enough to feel intentional. It wasn’t exact—but it felt right.

That’s where expectations can get misaligned. We assume that if the LLM has the PDF or module file, it should know exact details—what’s in room 13, how much copper is hidden, the NPC names, etc. But that’s not how LLMs work. They don’t retrieve text from a file like a database. They read, interpret, synthesize, and generate a response based on patterns—not memory.

So while giving it a file might help guide its inference, what you’re getting is still a reconstruction, not a citation. That’s why I’ve found it more effective to focus on the experience rather than expecting perfect recall. If you let the model lean into what it does well—tone, structure, improvisation—it can often surprise you in a good way.

Ruin Explorer · Jul 5, 2025

Heilemann said:
SMLs are just LLMs but smaller, no? And (generally speaking) their size mirrors in their performance, no?

They're related but more precise in purpose and tend to actually be vastly more performant for the specific and specialized tasks they're used for.

What are Small Language Models (SLMs)?

Small Language Models (SLMs) are efficient AI models built for low compute use, domain-specific tasks, and secure, accurate responses.

aisera.com

Jacob Lewis said:
What’s ironic is that this human feedback loop—the way narratives form, get repeated, and reinforce themselves—mirrors many of the concerns people raise about AI: that it recycles dominant patterns, amplifies bias, resists nuance, and flattens complexity. We talk about LLMs doing this, but public discourse does it just as reliably. And without realizing it, that cycle ends up training expectations more than the systems do.

That's a rather sophomoric comparison, I'd suggest, however appealing that kind of oversimplification might be to people. Also you're conflating "the media/social media" with "human feedback", which are two very separate things. That social media (and much of the rest of the media) is almost completely owned by people investing billions in LLMs informs the conversation far more than any other factor.

Jacob Lewis · Jul 5, 2025

Ruin Explorer said:
That's a rather sophomoric comparison, I'd suggest, however appealing that kind of oversimplification might be to people. Also you're conflating "the media/social media" with "human feedback", which are two very separate things. That social media (and much of the rest of the media) is almost completely owned by people investing billions in LLMs informs the conversation far more than any other factor.

You're free to disagree with the comparison, but calling it “sophomoric” doesn’t address the point—it just tries to discredit it. That’s not engagement; it’s dismissal.

What I actually said is that public discourse—how people talk, repeat, and reinforce shared narratives—often mirrors the same dynamics people criticize in AI: pattern reinforcement, loss of nuance, amplification of dominant frames. That includes everything from casual conversation to outrage cycles to social media feedback loops. It’s not just “media,” and it’s not just algorithms. It’s us.

And ironically, the narrative you're falling back on—about billionaire control and media distortion—is itself one of those widely reinforced loops. It's familiar, emotionally charged, and often repeated without deeper inspection. That doesn’t make it false—but it does make it a perfect example of exactly what I was talking about.

Heilemann · Jul 6, 2025

Jacob Lewis said:
I’ve also found it much easier to go with the flow than to force strict expectations. If you focus on the experience rather than the rules, LLMs have room to surprise you—especially when you let them lean into inference and improvisation. Once the boundaries are clear, the freedom inside them can produce some unexpectedly good moments.

That's how I feel; it's really all about setting the right expectations for yourself. A year or two ago I think the capabilities weren't quite there for it to not lose the plot in a matter of twenty minutes or so, but now it can go well longer. And I think in between longer contexts, RAG, and various memory techniques it can go much further than that.

And honestly, I've had plenty of games with friends where no one can remember what happened last session, let alone ten sessions ago

. I think some of the logging and timeline stuff I've seen people play around with in this space is promising as well.

Likewise on rules, I'm actually someone who's happy to halt the game for five minutes to figure out the rules properly (I also have fond memories of Rolemaster, so draw your own conclusions). But guesstimating rules is what we do around the table most of the times anyway. "Eh, I don't remember, but let's just call it that and I'll look it up for next time."

Ruin Explorer said:
They're related but more precise in purpose and tend to actually be vastly more performant for the specific and specialized tasks they're used for.

No shade on SMLs, but an SML won't out-reason eg. o3 pro across a range of random tasks (which is what TTRPGs amount too), but it might out-perform it on a specific task space it was made for, doable at that scale. I'm excited for small purpose-built models running locally for all the right reasons including energy use, privacy, security and the neat fit of a purpose-built anything.

But not to drag this out, you said: "There are other forms of AI with a lot more potential, frankly (many of them older than LLMs)". SMLs are the same technology, and for the purposes of TTRPGs they would not be as capable as today's LLMs. There are other forms of AI research, but none of them are doing much/in hand? So what good are they for playing RPGs today? Moreover, LLMs have proven exceptionally capable across a vast swath of area including text, image, video, voice, music and as it turns out, robotics and probably more. The underlying technology ain't nothin'.

Scott Christian · Jul 6, 2025

Heilemann said:
I actually find one of the biggest problems I had to work around was sort of 'unlicensed co-creation' where the AI doesn't quite understand the distinction between the player/gm roles and ends up eagerly stepping over that boundary. Reprimanding helps for a bit, but context length kills it over time. The best fix I've found is clear system prompting and model choice in particular.

This, one thousand percent. Every session I have completed (seven now, each about two hours), has shown this to be a difficulty. It often comes in some general form of: "What do Elle and Elias find as they open the door?" I am not sure if it is the way I am responding (in the same narrative driven style as Gemini), or if it is just confused. It has happened enough that I have a direct quote I use for it:
(Out of Character: Remember, the GM controls the environment, and the NPCs' and creatures' actions. The player (me) controls only the character's actions.)

I will sometimes phrase the above in question form, just to reinforce an idea. That might look like:
(Out of Character: Who controls the environment? Should I answer what's behind the door or you, the gamemaster?)

Once corrected using either method, it always gets it right. But an hour later it might happen again. I do find this happens more often with the environment (What's in the chest? What's behind the door? What do you see next?) than actions involving NPCs or monsters.

Scott Christian · Jul 6, 2025

Jacob Lewis said:
Even with the full rules in context, the model still has to read, interpret, and synthesize meaning from the material every time it responds. It’s not pulling from a stable rule engine—it’s rebuilding its understanding on the fly, each time, through probabilistic reasoning. That introduces variability.

Jacob Lewis said:
That’s where expectations can get misaligned. We assume that if the LLM has the PDF or module file, it should know exact details—what’s in room 13, how much copper is hidden, the NPC names, etc. But that’s not how LLMs work. They don’t retrieve text from a file like a database. They read, interpret, synthesize, and generate a response based on patterns—not memory.

This is great stuff. Very enlightening, and it sets a nice picture for how to interact with the model, especially on a longer-term basis. Thanks for teaching me something today - it is appreciated.

Scott Christian · Jul 6, 2025

Heilemann said:
And honestly, I've had plenty of games with friends where no one can remember what happened last session, let alone ten sessions ago . I think some of the logging and timeline stuff I've seen people play around with in this space is promising as well.

Likewise on rules, I'm actually someone who's happy to halt the game for five minutes to figure out the rules properly (I also have fond memories of Rolemaster, so draw your own conclusions). But guesstimating rules is what we do around the table most of the times anyway. "Eh, I don't remember, but let's just call it that and I'll look it up for next time."

I share a similar feeling. Remembering last session is easy. Remembering three sessions ago, the one where you got the plot hook and the quest, that gets very difficult. Remembering ten sessions back, outside of an event or awkward mental reconstruction of tone, is insanely difficult.

And ah... Rolemaster. Game of a thousand, font-size 8, printed tables. Good times.

Jfdlsjfd · Jul 6, 2025

Scott Christian said:
And ah... Rolemaster. Game of a thousand, font-size 8, printed tables. Good times.

Your whole body is now encased in an icy coffin. All your internal organs are congealed. You're dead.
Also, frostbite will reduce your fine manipulation skills by 1% for the next two days.

Somehow, the stacking criticals are what I remember mostly.

Jacob Lewis · Jul 7, 2025

Scott Christian said:
This is great stuff. Very enlightening, and it sets a nice picture for how to interact with the model, especially on a longer-term basis. Thanks for teaching me something today - it is appreciated.

Happy to help, if I can.

Going back to your original post, I noticed you mentioned using Gemini with a document on Google Drive as part of your setup. That’s helpful context—and it brings up a broader point I think is worth flagging for anyone following the discussion:

Not all LLMs operate the same way.

Most of what I’ve shared is based on my experience with ChatGPT, which has its own quirks around memory, document handling, and response behavior. I’m also using a Plus account, which gives access to better tools and options than the free tier. Gemini, from what I understand, integrates more directly with tools like Docs and Drive, and may handle file access a bit differently—possibly reading and applying document context more seamlessly than a typical external upload or injected prompt.

That said, the challenges we’re talking about—misaligned expectations, reliance on implicit recall, blurred boundaries between roles—aren’t model-specific. They show up across systems when we assume the tool “knows” more than it actually does, or when we forget that inference is not the same as fact recall.

If anything, the differences between models just highlight how important it is to understand the principles behind how these systems operate—not just the surface behavior. Because whether it’s ChatGPT, Gemini, Claude, or anything else, the user still plays a huge role in shaping the outcome.

So without knowing first-hand how Gemini handles context or instruction, I can’t say with confidence that my experience with ChatGPT will fully translate—but I hope some of the broader insights are still helpful.

LLMs as a GM

UngainlyTitan

Legend

Jacob Lewis

Ye Olde GM

Ruin Explorer

Legend

What are Small Language Models (SLMs)?

Jacob Lewis

Ye Olde GM

Heilemann

Explorer

Scott Christian

Hero

Scott Christian

Hero

Scott Christian

Hero

Jfdlsjfd

.

Jacob Lewis

Ye Olde GM

Similar Threads

Pets & Sidekicks