My problem with that is that modern AI is like nothing in a fantasy world.
In real life, the ability to understand human language seems to imply a solution to the hard AI problem; the ability to understand and follow commands in an arbitrary form implies near-human intelligence. INT 3, in D&D terms, which the golem doesn't have.
Video game AI is not the same as true AI.
What magic in D&D enables in a golem is greater functionality than we have even in today's robotic technology.
Socom on my PS2 had voice recognition. I could order my men to defend, follow me, etc. My Motorola Razr had voice recognition too (worked pretty well).
Neither of those devices has any intelligence, that is, the ability to dynamically solve a situational problem. My dog can crawl out from under a blanket; if my PS2 had legs, it could not.
It doesn't take intelligence to run a patrol path for an enemy mob. It's a simple script defining its route: on each loop, check for line of sight (LOS) to the player, or for a gunshot within range with an open doorway path to it. If either check fires, move to intercept and attack. That is the gist of all video game AI.
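To make that concrete, here is a rough sketch of that kind of patrol script. Everything in it is made up for illustration: the grid map, the `Guard` class, the `has_los` helper and the hearing range are my own assumptions, and I've left out the "open doorway path" check to keep it short.

```python
from dataclasses import dataclass

# Hypothetical map: '#' is a wall, '.' is open floor. Positions are (x, y).
GRID = [
    "..........",
    "....#.....",
    "....#.....",
    "..........",
]


def has_los(a, b, grid=GRID):
    """True if no wall cell lies on the straight line from a to b.

    Crude sampling along the segment, which is about the level of
    sophistication a simple enemy script needs.
    """
    (ax, ay), (bx, by) = a, b
    steps = max(abs(bx - ax), abs(by - ay)) * 2 or 1
    for i in range(steps + 1):
        t = i / steps
        x = round(ax + (bx - ax) * t)
        y = round(ay + (by - ay) * t)
        if grid[y][x] == "#":
            return False
    return True


@dataclass
class Guard:
    route: list           # fixed waypoints, walked in a loop
    index: int = 0
    mode: str = "patrol"

    def tick(self, player_pos, gunshot_pos=None, hearing_range=5):
        """One pass of the loop: check the triggers, otherwise keep walking."""
        pos = self.route[self.index]
        heard = gunshot_pos is not None and (
            abs(gunshot_pos[0] - pos[0]) + abs(gunshot_pos[1] - pos[1])
            <= hearing_range
        )
        if has_los(pos, player_pos) or heard:
            self.mode = "attack"   # move to intercept and attack
        else:
            self.index = (self.index + 1) % len(self.route)
        return self.mode
```

Nothing in there reasons about anything. It is two checks and a counter that wraps around.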
Magic enables the same effect, but better and more easily (the player is not required to actually handle any of that complexity).
The point is, magic hand-waves the complexity of what's going on. I'm saying that primitive technologies demonstrate the same principles without requiring the entity to be sentient or intelligent.
Case in point: one robot idea a co-worker and I had was to build a robot with a good GPS in it. Put the robot into a real-world environment that we've also mapped out in the game space. Every physical movement the robot makes is mirrored as an update to the robot's location in the game space.
So mentally, the robot is playing a video game with its physical self mirroring the same action.
So when the robot needs to move 10' north in the game space, that command is translated into a physical movement command.
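As a sketch of that translation step, assuming a simple wheeled robot: the function below turns "go from here to there in the game space" into a turn angle and a number of wheel revolutions. The function name, the coordinate convention (+y is north, heading measured clockwise from north) and the wheel size are all assumptions of mine, not anything we actually built.

```python
import math

# Assumed for illustration: one wheel revolution moves the robot 1 foot.
WHEEL_CIRCUMFERENCE_FT = 1.0


def to_motor_command(current, target, heading_deg):
    """Translate a game-space move into (turn_degrees, wheel_revolutions).

    Coordinates are in feet with +y as north. heading_deg is the robot's
    current facing, measured clockwise from north. A positive turn is
    clockwise (to the right), a negative turn is to the left.
    """
    dx = target[0] - current[0]
    dy = target[1] - current[1]
    bearing = math.degrees(math.atan2(dx, dy)) % 360
    # Pick the shortest turn, in the range -180..180 degrees.
    turn = (bearing - heading_deg + 180) % 360 - 180
    revolutions = math.hypot(dx, dy) / WHEEL_CIRCUMFERENCE_FT
    return turn, revolutions
```

So "move 10' north" while already facing north comes out as no turn and ten revolutions, and the game-space position gets updated by the same 10 feet once the wheels report back.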
If every entity had these GPS trackers, then every entity could be rendered in the game space, and the robot could actually track and attack them.
Now, actual GPS units don't work to the accuracy needed, but assume a warehouse test maze with some beacon system so we can get the x,y position of all entities. That part is quite feasible, and basic video game AI logic can handle it.
The trickier part is object recognition: getting the AI to tell a carrot on the ground apart from you wandering around the maze without a GPS tracker. We need it to ignore the carrot but pay attention to you.
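Once the beacons hand us x,y positions, picking a target is about as basic as game logic gets. A minimal sketch, where the beacon data is just a made-up dictionary of names to positions:

```python
import math


def nearest_target(robot_pos, beacons):
    """Return (name, distance) of the closest tracked entity, or None.

    beacons is a hypothetical dict of entity name -> (x, y), standing in
    for whatever the warehouse beacon system would report.
    """
    if not beacons:
        return None
    name = min(beacons, key=lambda n: math.dist(robot_pos, beacons[n]))
    return name, math.dist(robot_pos, beacons[name])
```

For example, `nearest_target((0, 0), {"alice": (3, 4), "bob": (10, 0)})` picks alice at a distance of 5.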
That's trickier, but then, Kinect gives us that solution. It has the functionality to identify humans, so we can use that with its camera to flag that it sees a human (let's assume we hate humans and the goal is to kill all humans in the maze). So we can hook up API calls so that when the camera flags a "human detected" event, we switch from patrol mode (circumnavigating the maze) to attack mode (keep the human in the crosshairs and shoot him until he's flat and does not move for 5 minutes).
Once again, none of this takes neural networks or machine learning. It's just simple logic once you've flagged objects of interest in the game space.
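The mode switch itself is a two-state machine. Here is a sketch of it; the callback names are hypothetical stand-ins for whatever events the camera software would actually raise, not real Kinect API calls, and times are plain seconds.

```python
STILL_SECONDS = 5 * 60   # "does not move for 5 minutes"


class Sentry:
    def __init__(self):
        self.mode = "patrol"
        self.last_motion = None   # time the target was last seen moving

    def on_human_detected(self, now):
        """Hypothetical event hook: the camera flagged a human."""
        self.mode = "attack"
        self.last_motion = now

    def on_target_moved(self, now):
        """Hypothetical event hook: the tracked human moved."""
        if self.mode == "attack":
            self.last_motion = now

    def tick(self, now):
        """Called every loop; returns the action to carry out."""
        if self.mode == "attack" and now - self.last_motion >= STILL_SECONDS:
            self.mode = "patrol"   # target has been still long enough
        return "circle_maze" if self.mode == "patrol" else "aim_and_fire"
```

All of the hard work is hidden inside whatever raises the "human detected" event. The part that decides what to do about it is a handful of if statements.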
So casting the Make Golem spell handles all this programming for you, and probably a few more bells and whistles than a basic video game enemy script.
None of it implies any reasoning ability on the part of the golem, because the same principles can be applied in a video game, where there is no thinking occurring in the computer either.