🤖 AI Summary
Existing approaches rely on handcrafted dictionaries to map objects in 3D simulation scenes to formal ontologies; these dictionaries are brittle and generalize poorly across assets. This work proposes the first method to leverage large language models (LLMs) for zero-shot ontology grounding of objects in Universal Scene Description (USD) scenes, establishing a semantic bridge from 3D environments to knowledge graphs without any training. Context-enhanced prompting, which adds sibling names and parent paths from the scene graph to the prompt, elicits the LLM’s semantic reasoning capabilities. The approach is evaluated against dictionary and embedding baselines and analyzed through feature ablation. On a kitchen scene containing 125 objects, the method achieves 90–96% exact-match grounding accuracy with descriptive names, substantially outperforming conventional dictionary- and embedding-based techniques; even under fully opaque naming, it recovers up to 48% accuracy by exploiting contextual cues.
📝 Abstract
Constructing knowledge graphs from 3D simulation scenes is essential for robot task reasoning, but the key bottleneck, grounding scene objects to formal ontology classes, still relies on manually curated dictionaries that are brittle and do not generalize across assets. We investigate whether large language models (LLMs) can automate this grounding step for Universal Scene Description (USD) scenes as a zero-shot, training-free alternative. On a kitchen scene (125 objects) with the SOMA-HOME Ontology, LLMs achieve 90-96% exact-match accuracy with descriptive names and 49-89% with abbreviated names, substantially outperforming dictionary and embedding baselines. Under fully opaque names, context-augmented prompting recovers up to 48%. Feature ablation reveals that LLMs primarily exploit semantic cues in the scene graph (sibling names and parent paths); anonymizing these cues reduces accuracy to 0-6%, while geometry alone yields only 4-17%.
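To make the idea of context-augmented prompting concrete, the following is a minimal sketch of how sibling names and the parent path might be gathered from a scene graph and placed in a prompt, together with an exact-match accuracy score. It is an illustration under stated assumptions, not the authors' implementation: the `Prim` class, the function names, the prompt wording, the prim paths, and the candidate class names are all hypothetical, and prims are represented as plain path strings rather than read through a USD library.

```python
from __future__ import annotations

from dataclasses import dataclass
from posixpath import basename, dirname


@dataclass(frozen=True)
class Prim:
    """Hypothetical stand-in for a USD prim, identified only by its path."""
    path: str  # e.g. "/Kitchen/Counter/Mug_01"


def siblings(target: Prim, scene: list[Prim]) -> list[str]:
    """Names of the other prims that share the target's parent path."""
    parent = dirname(target.path)
    return sorted(
        basename(p.path)
        for p in scene
        if dirname(p.path) == parent and p.path != target.path
    )


def build_prompt(target: Prim, scene: list[Prim], ontology_classes: list[str]) -> str:
    """Assemble a zero-shot prompt with scene-graph context (assumed wording)."""
    sibling_names = ", ".join(siblings(target, scene)) or "(none)"
    return (
        "Assign the object to exactly one ontology class.\n"
        f"Object name: {basename(target.path)}\n"
        f"Parent path: {dirname(target.path)}\n"
        f"Sibling names: {sibling_names}\n"
        f"Candidate classes: {', '.join(ontology_classes)}\n"
        "Answer with the class name only."
    )


def exact_match_accuracy(predicted: dict[str, str], gold: dict[str, str]) -> float:
    """Fraction of gold-labelled objects whose predicted class matches exactly."""
    if not gold:
        raise ValueError("gold labels must not be empty")
    hits = sum(predicted.get(path) == cls for path, cls in gold.items())
    return hits / len(gold)


# Hypothetical scene fragment and candidate classes, for illustration only.
scene = [
    Prim("/Kitchen/Counter/Mug_01"),
    Prim("/Kitchen/Counter/Spoon_02"),
    Prim("/Kitchen/Floor/Chair_01"),
]
print(build_prompt(scene[0], scene, ["Cup", "Spoon", "Chair"]))
```

The prompt string would then be sent to an LLM of choice and the returned class name compared against the gold label. Replacing the object name, parent path, and sibling names with opaque identifiers in `build_prompt` is one way to imitate the anonymization condition described in the ablation.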