GhazalBench: Usage-Grounded Evaluation of LLMs on Persian Ghazals

📅 2026-02-06

📈 Citations: 2

✨ Influential: 0

🤖 AI Summary

Current large language models generally capture the meaning of Persian ghazals but struggle to reproduce their canonical surface form in completion-based settings. This work proposes GhazalBench, a benchmark that treats access to culturally entrenched verse form as a core assessment dimension alongside meaning. The benchmark covers two abilities: poem-to-prose comprehension (producing faithful prose paraphrases of couplets) and cue-guided access to canonical verses, complemented by a parallel evaluation on English sonnets. Across proprietary and open-weight multilingual models, semantic understanding is generally strong while exact verse recall is weak; recognition-based tasks substantially narrow this gap. Because recall is markedly higher on English sonnets, the authors attribute the limitation to differences in training exposure rather than inherent architectural constraints.

📝 Abstract

Persian poetry plays an active role in Iranian cultural practice, where verses by canonical poets such as Hafez are frequently quoted, paraphrased, or completed from partial cues. Supporting such interactions requires language models to engage not only with poetic meaning but also with culturally entrenched surface form. We introduce GhazalBench, a benchmark for evaluating how large language models (LLMs) interact with Persian ghazals under usage-grounded conditions. GhazalBench assesses two complementary abilities: producing faithful prose paraphrases of couplets and accessing canonical verses under varying semantic and formal cues. Across several proprietary and open-weight multilingual LLMs, we observe a consistent dissociation: models generally capture poetic meaning but struggle with exact verse recall in completion-based settings, while recognition-based tasks substantially reduce this gap. A parallel evaluation on English sonnets shows markedly higher recall performance, suggesting that these limitations are tied to differences in training exposure rather than inherent architectural constraints. Our findings highlight the need for evaluation frameworks that jointly assess meaning, form, and cue-dependent access to culturally significant texts. GhazalBench is available at https://github.com/kalhorghazal/GhazalBench.
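The contrast between completion-based recall and recognition-based access can be illustrated with a minimal scoring sketch. The abstract does not specify the benchmark's metrics or text normalization, so everything below is an assumption made for illustration: the function names (`normalize`, `completion_recall`, `recognition_accuracy`), the choice of exact match after normalization, and the specific normalization steps are hypothetical and are not taken from the GhazalBench repository.

```python
import unicodedata


def normalize(verse: str) -> str:
    """Normalize a verse before comparison (assumed steps, not the paper's)."""
    text = unicodedata.normalize("NFKC", verse)
    # Map Arabic yeh and kaf to their Persian counterparts.
    text = text.replace("\u064a", "\u06cc").replace("\u0643", "\u06a9")
    # Treat the zero-width non-joiner as a plain space.
    text = text.replace("\u200c", " ")
    # Drop combining marks such as short-vowel diacritics.
    text = "".join(ch for ch in text if not unicodedata.combining(ch))
    # Collapse runs of whitespace.
    return " ".join(text.split())


def completion_recall(predictions: list[str], references: list[str]) -> float:
    """Fraction of generated verses that exactly match the canonical verse."""
    hits = sum(normalize(p) == normalize(r) for p, r in zip(predictions, references))
    return hits / len(references)


def recognition_accuracy(chosen: list[int], correct: list[int]) -> float:
    """Fraction of multiple-choice items where the canonical verse was picked."""
    hits = sum(c == g for c, g in zip(chosen, correct))
    return hits / len(correct)
```

Under this sketch, a model that recognizes the canonical verse among candidates but cannot regenerate it verbatim would score high on `recognition_accuracy` and low on `completion_recall`, which is the kind of dissociation the abstract describes.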

Problem

Research questions and friction points this paper is trying to address.

Persian ghazals

large language models

canonical surface-form

cultural grounding

poetic understanding

Innovation

Methods, ideas, or system contributions that make the work stand out.

GhazalBench

canonical surface-form access

poem-to-prose understanding