🤖 AI Summary
This work addresses the limited generalization of existing object counting methods across categories, visual domains, object scales, and density distributions. The authors propose Count Anything, a text-guided counting model that takes an image and a natural-language query and returns a set of discrete instance points whose cardinality gives the count. The main contributions are CLOC, a large-scale cross-domain counting benchmark spanning six visual domains; a dual-granularity point-based counting mechanism that combines region-level sparse and pixel-level dense predictions; and a point-centric supervision strategy paired with a parameter-free Complementary Count Fusion step. Experiments show that the method outperforms existing open-world counting approaches, with strong counting accuracy and multi-domain generalization.
📝 Abstract
Object counting remains fragmented across domain-specific datasets and task formulations, despite rapid progress in generalist vision models. Existing counting models are often tailored to scenarios such as crowds, vehicles, cells, crops, or remote-sensing objects, and thus struggle to generalize across categories, visual domains, object scales, and density distributions. In this paper, we study text-guided object counting across domains, where a model takes an image and a natural-language query as input and returns an instance-grounded set of target points whose cardinality gives the count. This formulation unifies category-conditioned counting with interpretable spatial localization. To support this setting, we construct CLOC, a Cross-domain Large-scale Object Counting dataset that reorganizes diverse public data sources into a unified benchmark. CLOC covers six visual domains: General Scene, Remote Sensing, Histopathology, Cellular Microscopy, Agriculture, and Microbiology, with about 220K images, 619 categories, and 15M object instances. Based on CLOC, we propose Count Anything, a generalist model for text-guided object counting. Unlike density-map-based methods, which dominate counting models, Count Anything adopts discrete instance points and performs dual-granularity instance enumeration. A Region-level Sparse Counter provides object-level anchors for large and sparse targets, while a Pixel-level Dense Counter handles small, crowded, and weakly bounded targets via dense point prediction. A point-centric supervision strategy enables learning from heterogeneous annotations, and Complementary Count Fusion combines both counters in a parameter-free manner. Extensive experiments show that Count Anything achieves strong accuracy and multi-domain generalization, outperforming existing open-world counting methods. Code is available at: https://github.com/Mengqi-Lei/count-anything.
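The point-set formulation described above can be illustrated with a minimal sketch: the count is simply the number of returned points, and two point sets can be merged so that dense points only fill in where no sparse point already sits. Everything below is an assumption made for illustration. The abstract does not specify the rule used by Complementary Count Fusion, so `fuse_points` and its `radius` threshold are hypothetical and are not the authors' method.

```python
from math import hypot
from typing import List, Tuple

Point = Tuple[float, float]  # (x, y) in pixel coordinates


def count_from_points(points: List[Point]) -> int:
    """The count is the cardinality of the predicted point set."""
    return len(points)


def fuse_points(sparse: List[Point], dense: List[Point], radius: float) -> List[Point]:
    """Hypothetical fusion rule (NOT the paper's Complementary Count Fusion).

    Keeps every sparse point, then adds each dense point that lies farther
    than `radius` from all sparse points. `radius` is an illustrative
    threshold introduced here, not something the abstract describes.
    """
    fused = list(sparse)
    for dx, dy in dense:
        if all(hypot(dx - sx, dy - sy) > radius for sx, sy in sparse):
            fused.append((dx, dy))
    return fused


# Toy example: two region-level points and three pixel-level points.
sparse_points = [(10.0, 10.0), (50.0, 50.0)]
dense_points = [(11.0, 10.0), (100.0, 100.0), (50.0, 52.0)]

fused_points = fuse_points(sparse_points, dense_points, radius=5.0)
print(count_from_points(fused_points))
```

In this toy case two of the dense points fall within the threshold of a sparse point and are treated as duplicates, so only the isolated dense point is added to the final set.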