🤖 AI Summary
This work addresses the challenge of evaluating large language models' (LLMs) domain-specific competence in financial accounting, statutory law, and quantitative reasoning—core competencies required for India's Chartered Accountancy (CA) examination. To this end, we introduce CA-Ben, a comprehensive, stage-aligned benchmark spanning the foundational, intermediate, and final stages of the CA curriculum. CA-Ben pairs a structured question-answer dataset, derived from examinations conducted by the Institute of Chartered Accountants of India (ICAI), with a standardized evaluation protocol, enabling systematic assessment of six prominent models, including GPT-4o, LLaMA 3.3 70B, and Claude 3.5 Sonnet. Results show that Claude 3.5 Sonnet and GPT-4o achieve the strongest performance, particularly in conceptual understanding and legal reasoning, yet all models exhibit notable limitations in numerical computation and deep statutory interpretation. Beyond establishing the benchmark itself, the analysis suggests that hybrid reasoning architectures and retrieval-augmented generation (RAG) are promising directions for improving model robustness and fidelity in professional accounting contexts.
📝 Abstract
Advanced intelligent systems, particularly Large Language Models (LLMs), are significantly reshaping financial practices through advancements in Natural Language Processing (NLP). However, the extent to which these models effectively capture and apply domain-specific financial knowledge remains uncertain. Addressing a critical gap in the expansive Indian financial context, this paper introduces CA-Ben, a Chartered Accountancy benchmark specifically designed to evaluate the financial, legal, and quantitative reasoning capabilities of LLMs. CA-Ben comprises structured question-answer datasets derived from the rigorous examinations conducted by the Institute of Chartered Accountants of India (ICAI), spanning the foundational, intermediate, and advanced stages of the CA curriculum. Six prominent LLMs—GPT-4o, LLaMA 3.3 70B, LLaMA 3.1 405B, Mistral Large, Claude 3.5 Sonnet, and Microsoft Phi-4—were evaluated using standardized protocols. Results indicate clear variations in performance, with Claude 3.5 Sonnet and GPT-4o outperforming the others, especially in conceptual and legal reasoning, while notable challenges emerged in numerical computation and statutory interpretation. The findings highlight the strengths and limitations of current LLMs and suggest future improvements through hybrid reasoning and retrieval-augmented generation, particularly for quantitative analysis and accurate legal interpretation.