🤖 AI Summary
This study investigates the controllability and human alignment of large language models (LLMs)—specifically GPT-4 and LLaMA-3—in keyword-driven sentence generation with respect to emotional semantics. To address this, we systematically compare four emotion representations—emotion words, VAD numerical scores, VAD lexicalized terms, and emojis—integrating prompt engineering, multidimensional emotion annotation (Valence-Arousal-Dominance), and human evaluation. Our key findings are: (1) emotion-word representations significantly outperform numerical VAD in both accuracy and naturalness, better aligning with human judgments; (2) we propose a novel VAD-to-lexical mapping method that substantially improves human–model consistency; and (3) representation efficacy is highly contingent on model architecture, emotion category, and representation format. Collectively, these results establish an interpretable, lightweight, and deployable representational optimization paradigm for controllable affective text generation.
📝 Abstract
In controlled text generation with large language models (LLMs), gaps arise between the model's interpretation of a control signal and human expectations. We study the problem of controlling emotion in keyword-based sentence generation for both GPT-4 and LLaMA-3. We compare four emotion representations: emotion words, Valence-Arousal-Dominance (VAD) dimensions expressed in both lexical and numeric forms, and emojis. Our human evaluation measures human–LLM alignment for each representation, as well as the accuracy and realism of the generated sentences. Although representations like VAD decompose emotions into easy-to-compute components, our findings show that people agree more with what LLMs generate when conditioned on English words (e.g., "angry") rather than on VAD scales. This gap is especially pronounced for numeric VAD. However, converting the originally numeric VAD scales into lexical scales (e.g., +4.0 becomes "High") dramatically improved agreement. Furthermore, how strongly a generated sentence is perceived to convey an emotion depends heavily on the LLM, the representation type, and the specific emotion.
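The numeric-to-lexical conversion described above can be sketched as a simple binning function. This is a minimal illustration, not the paper's actual method: the score range, the cutoff values, and the three-level label set (`Low` / `Medium` / `High`) are all assumptions made for the example.

```python
def vad_to_lexical(score: float, low_cut: float = -1.5, high_cut: float = 1.5) -> str:
    """Bin a numeric VAD score into a lexical level.

    Assumes a symmetric score range (here roughly [-5, +5]) and
    hypothetical cutoffs; the paper's exact mapping may differ.
    """
    if score <= low_cut:
        return "Low"
    if score >= high_cut:
        return "High"
    return "Medium"


# Example: turn a numeric VAD triple into a lexical description
# suitable for a generation prompt ("+4.0 becomes High").
vad = {"Valence": -2.5, "Arousal": 4.0, "Dominance": 0.3}
lexical = {dim: vad_to_lexical(s) for dim, s in vad.items()}
print(lexical)  # {'Valence': 'Low', 'Arousal': 'High', 'Dominance': 'Medium'}
```

Conditioning the LLM on the resulting labels ("Valence: Low, Arousal: High, ...") rather than raw numbers is the change the abstract reports as dramatically improving human–model agreement.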