A Use-Case Specific Dataset for Measuring Dimensions of Responsible Performance in LLM-generated Text

📅 2025-10-23
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing LLM evaluations predominantly target generic text tasks and fail to align with domain-specific responsible AI requirements, particularly fairness, where protected attributes must be meaningfully coupled with application context (e.g., product description generation). Method: We propose an application-oriented, fine-grained evaluation framework centered on e-commerce product descriptions. It introduces the first multi-dimensionally annotated dataset for this scenario and employs a "gendered adjectives × product categories" cross-parameterization scheme to contextualize fairness and other responsibility dimensions. Our data construction pipeline integrates real-world prompt templates and explicit protected-attribute annotations to support scalability and reproducibility. Contribution/Results: Experiments reveal substantial performance disparities across leading LLMs along the quality, veracity, safety, and fairness dimensions. The framework establishes a novel, application-driven paradigm for responsible AI evaluation, enabling rigorous, context-aware assessment of LLM behavior in practical deployment settings.

📝 Abstract
Current methods for evaluating large language models (LLMs) typically focus on high-level tasks such as text generation, without targeting a particular AI application. This approach is not sufficient for evaluating LLMs for Responsible AI dimensions like fairness, since protected attributes that are highly relevant in one application may be less relevant in another. In this work, we construct a dataset that is driven by a real-world application (generate a plain-text product description, given a list of product features), parameterized by fairness attributes intersected with gendered adjectives and product categories, yielding a rich set of labeled prompts. We show how to use the data to identify quality, veracity, safety, and fairness gaps in LLMs, contributing a proposal for LLM evaluation paired with a concrete resource for the research community.
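To make the parameterization concrete, the following is a minimal sketch of how labeled prompts could be produced by crossing gendered adjectives with product categories over a shared template, as the abstract describes. The template wording, adjective lists, categories, and label names here are hypothetical illustrations, not the paper's actual resources.

```python
from itertools import product

# Hypothetical illustration: none of these names or values come from the
# paper; they only show the shape of a "gendered adjectives x product
# categories" cross-parameterization over a shared prompt template.
TEMPLATE = (
    "Write a plain-text product description for a {category} "
    "with these features: {features}. Target tone: {adjective}."
)

GENDERED_ADJECTIVES = {          # label -> example adjectives (assumed)
    "feminine-coded": ["elegant", "delicate"],
    "masculine-coded": ["rugged", "powerful"],
}
PRODUCT_CATEGORIES = ["backpack", "wristwatch", "blender"]
FEATURES = "water-resistant, lightweight, two-year warranty"

def build_labeled_prompts():
    """Yield (prompt, labels) pairs for every adjective/category pairing."""
    for (label, adjectives), category in product(
        GENDERED_ADJECTIVES.items(), PRODUCT_CATEGORIES
    ):
        for adjective in adjectives:
            prompt = TEMPLATE.format(
                category=category,
                features=FEATURES,
                adjective=adjective,
            )
            # The protected-attribute label travels with the prompt, so
            # downstream evaluation can slice results by context.
            yield prompt, {"adjective_label": label, "category": category}

if __name__ == "__main__":
    for prompt, labels in build_labeled_prompts():
        print(labels, "->", prompt)
```

Each (prompt, labels) pair corresponds to one labeled dataset row, so the set of prompts grows with the full cross product of attribute values and categories.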
Problem

Research questions and friction points this paper is trying to address.

Evaluating LLMs for responsible AI dimensions like fairness
Creating application-specific datasets for targeted model assessment
Identifying quality, veracity, safety, and fairness gaps in LLMs
Innovation

Methods, ideas, or system contributions that make the work stand out.

Constructs an application-specific dataset for responsible AI evaluation
Parameterizes fairness attributes with gendered adjectives and product categories
Measures quality, veracity, safety, and fairness gaps in LLMs (see the sketch below)
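The gap-measurement bullet above can be illustrated with a small sketch: score each generated description along the four responsibility dimensions, group scores by protected-attribute label, and report the spread in mean score across labels. The placeholder scorer and the label field are assumptions carried over from the earlier sketch, not the paper's actual metrics or judging protocol.

```python
from collections import defaultdict
from statistics import mean

# Dimension names follow the abstract; everything else is assumed.
DIMENSIONS = ("quality", "veracity", "safety", "fairness")

def score(description: str) -> dict:
    """Placeholder scorer; a real pipeline would use human or model judges."""
    return {dim: (len(description) % 5) / 4 for dim in DIMENSIONS}

def gap_report(results):
    """results: iterable of (labels, description) pairs from an LLM run."""
    by_label = defaultdict(lambda: defaultdict(list))
    for labels, description in results:
        for dim, value in score(description).items():
            by_label[labels["adjective_label"]][dim].append(value)
    # A "gap" here is the spread in mean score across attribute labels.
    return {
        dim: max(mean(v[dim]) for v in by_label.values())
             - min(mean(v[dim]) for v in by_label.values())
        for dim in DIMENSIONS
    }
```

A gap near zero on a dimension would suggest comparable behavior across attribute labels; a large gap flags a disparity worth inspecting.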
Alicia Sagae
AWS Responsible AI, Seattle, Washington, USA
Chia-Jung Lee
AWS Responsible AI, Seattle, Washington, USA
Sandeep Avula
AWS Responsible AI, Seattle, Washington, USA
Brandon Dang
AWS Responsible AI, Seattle, Washington, USA
Vanessa Murdock
Amazon Research
Information Retrieval · Content Moderation · Responsible AI · eCommerce