AI Summary
This work addresses the scarcity of high-quality, large-scale annotated data in cervical cytopathology, which has hindered the application of multimodal foundation models. To overcome this limitation, we propose Singpath-VL, the first vision-language foundation model tailored for cervical cytology. We construct a million-scale image-text dataset through a three-stage synthetic pipeline that integrates multi-model weak annotations, consensus fusion, and expert knowledge injection. Building upon Qwen3-VL-4B, the model undergoes multi-stage fine-tuning and demonstrates superior performance in fine-grained cellular morphology understanding and diagnostic classification tasks. To foster community progress, we will open-source a portion of the synthesized data along with a standardized evaluation benchmark.
Abstract
We present Singpath-VL, a vision-language large model that addresses the lack of an AI assistant for cervical cytology. Recent advances in multimodal large language models (MLLMs) have significantly advanced computational pathology. However, their application to cytopathology, and to cervical cytology in particular, remains underexplored, primarily because large-scale, high-quality annotated datasets are scarce. To bridge this gap, we first develop a novel three-stage pipeline to synthesize a million-scale image-description dataset. The pipeline uses multiple general-purpose MLLMs as weak annotators, refines their outputs through consensus fusion and expert knowledge injection, and produces high-fidelity descriptions of cell morphology. Using this dataset, we then fine-tune the Qwen3-VL-4B model with a multi-stage strategy to create a specialized cytopathology MLLM. The resulting model, named Singpath-VL, demonstrates superior performance in fine-grained morphological perception and cell-level diagnostic classification. To advance the field, we will open-source a portion of the synthetic dataset and the benchmark.
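The abstract does not specify how consensus fusion or expert knowledge injection is implemented. The following is only a toy sketch of the general idea, assuming each weak annotator returns a dictionary of morphology attributes, that fusion is a per-attribute majority vote, and that expert knowledge takes the form of a small term glossary. The function names, attribute names, and glossary are all hypothetical and are not taken from the paper.

```python
from collections import Counter


def fuse_by_consensus(annotations, min_votes=2):
    """Keep, per attribute, the top-voted value if at least `min_votes` annotators agree."""
    fused = {}
    attributes = {key for ann in annotations for key in ann}
    for attr in sorted(attributes):
        votes = Counter(ann[attr] for ann in annotations if attr in ann)
        value, count = votes.most_common(1)[0]
        if count >= min_votes:
            fused[attr] = value  # attributes without enough agreement are dropped
    return fused


def inject_expert_terms(fused, term_map):
    """Replace informal wording with the expert-preferred term when one is listed."""
    return {attr: term_map.get(value, value) for attr, value in fused.items()}


# Hypothetical outputs from three weak annotators for a single cell image.
weak_annotations = [
    {"nucleus_size": "enlarged", "chromatin": "dark", "cytoplasm": "scant"},
    {"nucleus_size": "enlarged", "chromatin": "dark", "cytoplasm": "abundant"},
    {"nucleus_size": "enlarged", "chromatin": "granular"},
]
expert_terms = {"dark": "hyperchromatic"}  # hypothetical glossary entry

fused = fuse_by_consensus(weak_annotations)
description = inject_expert_terms(fused, expert_terms)
print(description)
```

In this sketch the annotators disagree on the cytoplasm attribute, so it is dropped and not guessed, while the agreed chromatin value is rewritten using the glossary. A real pipeline operating on free-text descriptions would need a much richer fusion step than exact-match voting.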