At a certain point, you have to ask, "What is understanding?" If an AI can understand natural language, can respond in natural language, and can create art, then how is that materially different from what we do?
I mean, what this kind of AI is going to do is make it increasingly obvious that there is a material difference. It's already started.
That's one of the upsides here. The AIs we're seeing absolutely cannot genuinely understand anything that's being said to them. They're merely reacting using a logic-based language model. That's why they fail in the peculiar ways that they do, and until a fundamentally different approach to AI is taken, they'll continue to fail in those ways. Humans will carefully guard them, prune them, constrain them, and limit them in ways that hide these fundamental failings, but the failings will be present.
Take double-checking, for example - something all these AIs are terrible at. Humans know to check things. That's not a language-logic-level response, it's below that, I'd suggest. Humans exist in the world and are aware of the world and know how to figure things out in ways that don't just involve logic based on language. With these kinds of AIs, that's not possible - you have to essentially cheat, and bolt on more basic computer functions, like, if someone is talking about the date, then go check the date at some authoritative source. A human doesn't need to be told to do that - they can figure it out - but a language-logic-based AI will never in a million years figure it out.
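To make the "bolt on" idea concrete, here's a rough sketch of the kind of cheat I mean. Everything in it is made up for illustration: `language_model_reply` is a stand-in for whatever the model would say by itself, the keyword pattern is a crude guess at "someone is talking about the date", and the system clock plays the part of the authoritative source.

```python
import re
from datetime import date

# Hypothetical trigger: a crude keyword check for date-related questions.
DATE_PATTERN = re.compile(r"\b(today|date|what day)\b", re.IGNORECASE)


def language_model_reply(prompt: str) -> str:
    """Stand-in for the language model; it only produces plausible text."""
    return "It's probably some time in the spring."


def answer(prompt: str, today=None) -> str:
    """Route date questions to a hard-coded lookup, everything else to the model."""
    if DATE_PATTERN.search(prompt):
        # The bolted-on rule: a human wrote this check, the model did not
        # decide to do it. The system clock is the assumed authoritative source.
        current = today if today is not None else date.today()
        return f"Today is {current.isoformat()}."
    return language_model_reply(prompt)
```

The point is that the checking lives in the `if` a human wrote, not in anything the model worked out for itself.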
This is a false dawn.
We will see "true" AI eventually - i.e. self-aware and able to genuinely figure stuff out, not merely respond to prompts - but that's not what this generation is. People are very impressed because it's basically Turing-test compliant, but as has been pointed out for decades - almost since Turing suggested it - that's a godawful test for whether something is intelligent. The Chinese room argument is both correct and incorrect - a machine could be and undoubtedly will be made that is intelligent - but what we have here, right now, are mere Chinese rooms. The full philosophical argument is rather fatuous and anthropocentric, but the specific thing that's described is essentially what we have.
So ... I am less confident than you are. I would be shocked if we don't have AI-capable DMs within five years, even if that isn't the use case for them. And that's not a statement I would have made a year ago.
We could have language-logic-based AI DMs right now if someone just wanted to build them, and had a good enough data set. There's a sort of text-adventure AI tool, the name of which escapes me, that's somewhat similar.
The big problem, though, is the data set. Almost all DMing is live, and unrecorded. It is lost like tears in the rain. Over the last few years, we've had a lot of podcasts and streams which are recorded. However, most of them are edited down rather than presented in full, and they tend to represent a peculiar, showy branch of DMing, rather than a more typical approach. They're also very time-bound, and the majority of them are somewhat similar in tone, so it's not a huge data set. The players are also highly atypical - far less argumentative and far better at improv than 95% of tabletop groups.
It'll also have peculiarities and freak-outs where a real DM never would. Depending on the way it's modelled/built too, it could have a peculiar approach to the rules.
But I agree that we'll see it - you don't need anything beyond a language-model to build one that's basically functional in the same way that other DM replacements are (i.e. like a flashier version of Ironsworn's Oracle), from a technological perspective.
To get one that could understand maps, write and map coherent adventures which weren't dungeon crawls/railroads and so on, you'd need a bit more complexity - to pair a language-model with something funkier. But if you just want an "ask it what happens", that's pretty doable.
Oh, there's another major difficulty too - keeping track of the fiction. In a one-on-one environment, where the only interaction method is text, this is simple. But once you get an entire party of PCs involved, and they're talking rather than writing, it's going to be pretty hard for the AI to keep track of the fiction/fictional positioning, where it'd be intuitive for a human. So five years may actually be optimistic unless language-model AIs become better at dealing with multiple different people talking to them about the same thing.