Well, for some people and some games it can be both:
* They "see" (imagine) an Ogre. These are the clouds that Vincent Baker is referring to here.
* They "see" (actually interact with) a collection of numbers, keywords, currencies, resolution procedures, incentive structures, clocks/timers, inventory/loadout schemes, relationship values, and moves (or powers or knacks or whatever they're called in a given game). Collectively, these serve as the game-layer language that represents said Ogre, so that actual people in meat space can (a) correctly orient to said "clouds" (elements of the imagined space) and (b) play the game in front of them, which entails composing (if you're a GM) and managing (if you're a player) a compelling, game layer-related decision-space. These are the boxes that Vincent Baker is referring to here.
When I run Dogs in the Vineyard, I don't "see" dice pools and "raises" and "sees" (etc.). I compose situations that provoke the judgement or mercy of young priests who are trying to manage the stewardship role of their faith and all the imaginings that entails. Same goes for Blades in the Dark or Torchbearer or Stonetop or The Between or D&D 4e or Mouse Guard or whatever. No one at the table is just "seeing" a collection of numbers or throws of dice or keywords or Resistance Rolls rather than Ogres, the faithful falling to Sin and Sorcery, corrupt Bluecoats shaking down a corner store, the stink of soot, machine shavings, dense smoke, and showers of sparks in Coalridge, or x, y, z imaginings.