🤖 AI Summary
Existing generative vision models struggle with precise spatial control because they cannot directly map numerical coordinates onto the image canvas. This work proposes MetaPoint, a method that encodes continuous 2D coordinates as composable spatial primitive tokens and relies on the model's inherent positional encoding mechanism to achieve pixel-level localization: a single token specifies an object's location, and a pair of tokens defines a bounding box. Without modifying the underlying model architecture, MetaPoint bridges the gap between textual instructions and pixel coordinates, so that agents can carry out high-level tasks by composing spatial primitives in sequence. According to the authors, experiments show that MetaPoint improves the accuracy, interactivity, and compositional generalization of generative agents in complex layout tasks.
📝 Abstract
Generative visual models fundamentally struggle with precise spatial control. This arises from a core disconnect: models can process textual descriptions of space but cannot directly map numerical coordinates onto the 2D image canvas. We introduce MetaPoint, a method that bridges this gap by representing a continuous 2D coordinate as a single, special token. Crucially, MetaPoint requires no new architectural components; it directly leverages the model's inherent positional encoding schemes to interpret these coordinates, treating our token as a virtual point on the canvas. This lightweight approach enables pixel-level control of an object's position with one token, or of its bounding box with two, without bespoke attention masking. The MetaPoint tokens are designed to be compositional, serving as spatial primitives. This allows a planner agent to decompose a high-level user request into a structured sequence of primitives for the generator. By providing a simple, precise, and scalable building block for spatial control, MetaPoint unlocks more powerful compositional generative agents and enables intuitive, interactive editing systems.
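To make the "virtual point on the canvas" idea concrete, here is a minimal sketch of how a continuous coordinate could be given a position code in the same space as image patches. The abstract does not say which positional encoding the underlying model uses, so this assumes a plain 2D sinusoidal encoding; the function names (`axis_encoding`, `patch_position_code`, `point_position_code`, `box_position_codes`) and the normalized `[0, 1]` coordinate convention are hypothetical and are not taken from the paper.

```python
import math
from typing import List, Tuple


def axis_encoding(pos: float, dim: int, base: float = 10000.0) -> List[float]:
    """Sinusoidal encoding of one (possibly fractional) position along an axis."""
    if dim <= 0 or dim % 2 != 0:
        raise ValueError("dim must be a positive even number")
    out: List[float] = []
    for i in range(dim // 2):
        freq = base ** (-2.0 * i / dim)
        out.append(math.sin(pos * freq))
        out.append(math.cos(pos * freq))
    return out


def patch_position_code(col: int, row: int, dim: int = 8) -> List[float]:
    """Position code of an ordinary image patch at integer grid index (col, row)."""
    return axis_encoding(float(col), dim // 2) + axis_encoding(float(row), dim // 2)


def point_position_code(
    x: float, y: float, grid_w: int, grid_h: int, dim: int = 8
) -> List[float]:
    """Position code of a virtual point at normalized (x, y), each in [0, 1].

    Assumption: the point is placed in the same index space as the patch grid,
    so a point at a patch centre receives exactly that patch's code.
    """
    if not (0.0 <= x <= 1.0 and 0.0 <= y <= 1.0):
        raise ValueError("x and y must lie in [0, 1]")
    col = x * grid_w - 0.5  # patch c has its centre at (c + 0.5) / grid_w
    row = y * grid_h - 0.5
    return axis_encoding(col, dim // 2) + axis_encoding(row, dim // 2)


def box_position_codes(
    x0: float, y0: float, x1: float, y1: float,
    grid_w: int, grid_h: int, dim: int = 8,
) -> Tuple[List[float], List[float]]:
    """Two point codes (top-left, bottom-right) that together describe a box."""
    return (
        point_position_code(x0, y0, grid_w, grid_h, dim),
        point_position_code(x1, y1, grid_w, grid_h, dim),
    )
```

In this sketch, a point placed at the centre of patch `(2, 1)` on a 4×4 grid, that is `point_position_code(0.625, 0.375, 4, 4)`, gets the same code as `patch_position_code(2, 1)`, while off-centre points fall smoothly between neighbouring patch codes. A model with rotary or learned positional embeddings would need a different construction; the sketch only illustrates why no new architectural component is strictly required.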