I think many in the gaming community oversell "theater of the mind" - as if their players are sitting on the edge of their seats, with bated breath, as nefarious shadows are described lurking just out of sight, taking in the sounds and smells with their imaginary senses. But be honest - most groups are there eating Doritos, swigging Mountain Dew, checking their cellphones, and quoting Spaceballs. It's socialization time. It's a game. It's not high art. It's not like attending the symphony.
Especially when you're playing online, you need visuals to keep the players focused: maps, NPC portraits, etc. Having just a blank screen there to stare at isn't going to hold their attention.
Most of my games FTF have used TotM as the primary mode, with tokens and maps only occasionally needed.
Some games, however...
D&D 3/4/5, T2K 2e & 4e, and Stargate (a 5E variant)... all have really tactical rules... and really benefit from gridded movement.
In L5R, players benefited more from the maps themselves than from the use of markers upon them...
Edit: completing the thought:
TotM isn't the barrier for VTT for me; for many it shouldn't be. There are those for whom it never works, and others for whom it works, but not well. I know one gal for whom it never works... she has aphantasia. She mentioned that, in college, D&D only worked when the minis hit the table OR people were fully in character voice, and best when actually nearly LARPing. There is a wide spectrum. For most, it works at least passably much of the time with the right games and a decent GM.
It's the lack of the physical and visual feedback I get from observing people in person, which I can't get online, not even with cameras. That doesn't prevent me from having a weekly game (one where most of us log in and chat for the half hour prior to start time, and sometimes for an hour or two after). Once the game starts, it's game until something is misplaced or a rule needs to be adjudicated carefully (especially when playtesting).
Players nervously leaning in over the map in T2K is a sign of engagement. If they're looking back and forth between two threats, that's a different cue.