🤖 AI Summary
This work addresses the problem of verifying whether a spatial layout, such as a 3D room or a 2D poster, conforms to a natural language description, a task on which directly prompting large language models (LLMs) as judges performs poorly. The authors use an LLM to generate multiple weak verifiers written in a domain-specific language (DSL), then aggregate them into a strong verifier using weak supervision techniques. Learning the aggregation requires only about ten human-labeled example layouts. Compared with LLM judges, the strong verifiers raise F1 scores by up to 7× across 3D room layout and 2D poster design tasks. When their natural language feedback is used to guide a base layout generator, layout quality improves by up to 66.2% according to a human evaluator, indicating that useful feedback can be obtained at low annotation cost.
📝 Abstract
We present a pipeline for building and aggregating task-specific, LLM-generated weak (imperfect) verifiers into a strong verifier for spatial layout domains. Given a task description, our pipeline asks an LLM to synthesize a collection of verifier programs using a layout verification DSL. Each individual LLM-generated verifier usually provides only an imperfect check of whether the layout matches the corresponding task description. We show that by aggregating the responses of many such verifiers we can produce a stronger verifier. Moreover, by applying techniques from weak learning, our pipeline can learn how to aggregate the weak verifiers from a very sparse set of human-labeled example layouts (about 10). We find that the strong verifiers produced by our pipeline outperform the status-quo approach of using a set of LLM judges to directly check whether a layout matches a task description, raising F1 scores by up to 7× across a variety of 3D room layout and 2D poster design tasks. We also demonstrate that verifier-guided layout generation using natural language feedback from our strong verifiers improves the layout quality of a base layout generator by up to 66.2% according to a human evaluator.
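To make the aggregation idea concrete, here is a minimal sketch of how several imperfect verifiers might be combined using about ten labeled layouts. It is an illustration under stated assumptions, not the authors' method: the verifiers are plain Python functions standing in for DSL programs, the `Layout` type and the three example checks are hypothetical, and the aggregation rule (a vote weighted by the log-odds of each verifier's smoothed accuracy) is a simple stand-in for whatever weak-learning technique the pipeline actually uses.

```python
import math
from typing import Callable, Dict, List, Tuple

# Hypothetical layout representation: object name -> (x, y) position.
Layout = Dict[str, Tuple[float, float]]
Verifier = Callable[[Layout], bool]

# Three weak verifiers for the (assumed) task "the lamp is left of the sofa".
def left_of(layout: Layout) -> bool:
    return layout["lamp"][0] < layout["sofa"][0]

def near(layout: Layout) -> bool:
    return math.dist(layout["lamp"], layout["sofa"]) < 3.0

def always_true(layout: Layout) -> bool:
    return True

def fit_weights(verifiers: List[Verifier],
                labeled: List[Tuple[Layout, bool]],
                smoothing: float = 1.0) -> List[float]:
    """Weight each verifier by the log-odds of its smoothed accuracy."""
    weights = []
    for verify in verifiers:
        correct = sum(verify(layout) == label for layout, label in labeled)
        accuracy = (correct + smoothing) / (len(labeled) + 2 * smoothing)
        weights.append(math.log(accuracy / (1 - accuracy)))
    return weights

def strong_verify(verifiers: List[Verifier],
                  weights: List[float],
                  layout: Layout) -> bool:
    """Weighted vote: each verifier contributes +weight or -weight."""
    score = sum(w * (1 if v(layout) else -1)
                for v, w in zip(verifiers, weights))
    return score > 0

def make(lamp: Tuple[float, float]) -> Layout:
    # The sofa is fixed at (5, 0) in this toy data set.
    return {"lamp": lamp, "sofa": (5.0, 0.0)}

# Ten hand-labeled toy layouts (True = matches the task description).
labeled = [
    (make((1.0, 0.0)), True), (make((3.0, 0.0)), True),
    (make((4.0, 1.0)), True), (make((2.5, 0.0)), True),
    (make((0.0, 0.0)), True), (make((6.0, 0.0)), False),
    (make((9.0, 0.0)), False), (make((7.0, 1.0)), False),
    (make((8.5, 0.0)), False), (make((5.5, 2.0)), False),
]

verifiers = [left_of, near, always_true]
weights = fit_weights(verifiers, labeled)

# A layout where the lamp sits just to the right of the sofa.
candidate = make((6.0, 0.0))
print(strong_verify(verifiers, weights, candidate))
```

In this toy data, `near` and `always_true` are each right only half the time, so their weights drop to zero and the decision follows `left_of`. An unweighted majority vote would accept `candidate` because two of the three verifiers say yes, which is the kind of error that learned aggregation is meant to avoid.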