🤖 AI Summary
Problem: Large language models (LLMs) exhibit low accuracy, weak curriculum alignment, and poor pedagogical relevance when applied to India's NCERT educational context, particularly for Grades 6–8 English and Science.
Method: We introduce NCERT-QA, the first structured bilingual (English–Hindi) QA dataset explicitly aligned with NCERT curricula, covering factual, inferential, and evaluative reasoning questions. We systematically evaluate meta-prompting, few-shot prompting, and chain-of-thought prompting across open-source (Gemma, Llama) and commercial LLMs.
Contribution/Results: Curriculum alignment significantly improves answer accuracy and instructional utility; specific prompt-model combinations effectively mitigate hallucination and enhance reasoning consistency. This work establishes a reproducible data benchmark, an empirical evaluation framework, and actionable optimization strategies for LLM adaptation in education—addressing a critical gap in regionally grounded, curriculum-driven AI education research.
📝 Abstract
Large Language Models (LLMs) have demonstrated remarkable capabilities in understanding and generating human-like content, revolutionizing sectors such as healthcare, software development, and education. In education, LLMs offer the potential for personalized and interactive learning experiences, especially in regions with limited teaching resources. However, adapting these models effectively to curriculum-specific content, such as the National Council of Educational Research and Training (NCERT) syllabus in India, presents unique challenges in terms of accuracy, alignment, and pedagogical relevance. In this paper, we present the framework "PustakAI" (pustak means 'book' in many Indian languages) for the design and evaluation of a novel question-answering dataset, "NCERT-QA", aligned with the NCERT curriculum for English and Science subjects of grades 6 to 8. We classify the curated QA pairs as Factoid, Inferential, and Others (evaluative and reasoning). We evaluate the dataset with various prompting techniques, such as meta-prompt, few-shot, and CoT-style prompting, using diverse evaluation metrics to understand which approach aligns more efficiently with the structure and demands of the curriculum. Beyond the usability of the dataset, we analyze the strengths and limitations of current open-source LLMs (Gemma3:1b, Llama3.2:3b, and Nemotron-mini:4b) and high-end LLMs (Llama-4-Scout-17B and Deepseek-r1-70B) as AI-based learning tools in formal education systems.
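To make the three prompting styles named above concrete, here is a minimal sketch of how meta-prompt, few-shot, and CoT-style prompts might be constructed for a curriculum QA item. The function names, instruction wording, and the example QA pair are illustrative assumptions, not taken from the NCERT-QA release or the paper's actual prompts.

```python
# Illustrative sketch of the three prompting styles evaluated in the paper.
# All prompt wording below is hypothetical, not the authors' actual templates.

def meta_prompt(question: str) -> str:
    """Meta-prompt: prefix the question with a role/task description."""
    return (
        "You are a tutor answering questions from the NCERT Grade 6-8 "
        "Science curriculum. Answer concisely and factually.\n\n"
        f"Question: {question}\nAnswer:"
    )

def few_shot_prompt(question: str, examples: list[tuple[str, str]]) -> str:
    """Few-shot: prepend worked QA examples from the same curriculum."""
    shots = "\n\n".join(f"Question: {q}\nAnswer: {a}" for q, a in examples)
    return f"{shots}\n\nQuestion: {question}\nAnswer:"

def cot_prompt(question: str) -> str:
    """CoT-style: ask the model to reason step by step before answering."""
    return (
        f"Question: {question}\n"
        "Let's think step by step, then state the final answer."
    )

# Hypothetical in-context example and target question.
examples = [("What is photosynthesis?",
             "The process by which green plants make food using sunlight.")]
print(few_shot_prompt("Why do leaves appear green?", examples))
```

Each constructed prompt would then be sent to the models under evaluation (e.g. Gemma3:1b or Llama-4-Scout-17B) and the responses scored with the paper's evaluation metrics.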