On the Impact of Language Nuances on Sentiment Analysis with Large Language Models: Paraphrasing, Sarcasm, and Emojis

📅 2025-04-08
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the limited robustness of large language models (LLMs) in sentiment analysis—particularly toward pragmatic phenomena such as irony, emojis, and fragmented language—and their poor generalization to domain-specific corpora (e.g., nuclear energy). We propose a synergistic optimization framework integrating textual paraphrasing, irony detection and removal, adversarial augmentation, and domain-adaptive fine-tuning. We construct a high-quality, manually annotated dataset of ironic tweets. We introduce the novel joint strategy of “irony removal + topic-agnostic pretraining” and empirically validate the critical contribution of general-domain corpora to irony comprehension. Experiments show that irony removal improves sentiment accuracy by 21 percentage points (to 51%); fine-tuning on general-domain data achieves 60% irony detection accuracy; adversarial augmentation yields 85% robustness against perturbations; and paraphrasing upgrades 40% of low-confidence predictions, boosting overall sentiment accuracy by 6%.

📝 Abstract
Large Language Models (LLMs) have demonstrated impressive performance across various tasks, including sentiment analysis. However, data quality, particularly when sourced from social media, can significantly impact their accuracy. This research explores how textual nuances, including emojis and sarcasm, affect sentiment analysis, with a particular focus on improving data quality through text paraphrasing techniques. To address the lack of labeled sarcasm data, the authors created a human-labeled dataset of 5,929 tweets that enabled the assessment of LLMs in various sarcasm contexts. The results show that when topic-specific datasets, such as those related to nuclear power, are used to fine-tune LLMs, the models fail to recover accurate sentiment in the presence of sarcasm because the text is insufficiently diverse, requiring external interventions such as sarcasm removal to boost accuracy. Sarcasm removal led to up to a 21% improvement in sentiment accuracy: LLMs trained on nuclear power-related content struggled with sarcastic tweets, achieving only 30% accuracy, whereas LLMs trained on general tweet datasets covering a broader range of topics performed considerably better on sarcastic tweets (60% accuracy), indicating that incorporating general text data can enhance sarcasm detection. The study also applied adversarial text augmentation, showing that creating synthetic text variants through minor changes significantly increased model robustness and accuracy on sarcastic tweets (approximately 85%). Additionally, paraphrasing tweets with fragmented language converted around 40% of low-confidence labels into high-confidence ones, improving the LLMs' sentiment analysis accuracy by 6%.
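The adversarial text augmentation described in the abstract can be sketched as simple label-preserving perturbations of the training tweets. The function names and the specific perturbation (swapping two adjacent characters in a word) below are illustrative assumptions, not the authors' implementation:

```python
import random

def perturb(text: str, rng: random.Random) -> str:
    """Create a synthetic variant of `text` by one minor character-level edit
    (swap two adjacent characters in a random word), leaving meaning intact."""
    words = text.split()
    # only perturb words long enough that the swap is a plausible typo
    candidates = [i for i, w in enumerate(words) if len(w) >= 4]
    if not candidates:
        return text
    i = rng.choice(candidates)
    w = words[i]
    j = rng.randrange(len(w) - 1)
    words[i] = w[:j] + w[j + 1] + w[j] + w[j + 2:]
    return " ".join(words)

def augment(tweets, n_variants=3, seed=0):
    """Return each (text, label) tweet plus `n_variants` perturbed copies
    that share its sentiment label."""
    rng = random.Random(seed)
    out = []
    for text, label in tweets:
        out.append((text, label))
        out.extend((perturb(text, rng), label) for _ in range(n_variants))
    return out
```

Fine-tuning on the originals together with such variants is the kind of augmentation the paper reports as raising robustness on sarcastic tweets to roughly 85%.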
Problem

Research questions and friction points this paper is trying to address.

Investigates how emojis and sarcasm affect sentiment analysis accuracy
Addresses lack of labeled sarcasm data with a human-labeled tweet dataset
Explores text paraphrasing to improve low-confidence sentiment labels
Innovation

Methods, ideas, or system contributions that make the work stand out.

Human-labeled sarcasm dataset enhances LLM assessment
Adversarial text augmentation boosts model robustness
Text paraphrasing improves sentiment analysis accuracy
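The paraphrase-and-relabel step the paper describes for fragmented tweets could take the following shape. `classify` and `paraphrase` are placeholders for whatever LLM calls are actually used, and the 0.7 threshold is an assumed parameter; the reported figures (about 40% of tweets upgraded, +6% accuracy) come from the paper, not from this sketch:

```python
def relabel_low_confidence(tweets, classify, paraphrase, threshold=0.7):
    """For tweets whose sentiment prediction falls below `threshold`,
    paraphrase the fragmented text and re-classify, keeping whichever
    result has higher confidence. `classify` returns (label, confidence);
    `paraphrase` returns cleaned-up text."""
    results = []
    for text in tweets:
        label, conf = classify(text)
        if conf < threshold:
            new_label, new_conf = classify(paraphrase(text))
            if new_conf > conf:
                label, conf = new_label, new_conf
        results.append((label, conf))
    return results
```

High-confidence predictions pass through untouched, so the intervention only spends paraphrasing effort on the tweets the model is unsure about.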
Naman Bhargava
Department of Statistics, University of Michigan, Ann Arbor, MI 48109, United States
Mohammed I. Radaideh
University of Michigan
O Hwang Kwon
Department of Nuclear Engineering and Radiological Sciences, University of Michigan, Ann Arbor, MI 48109, United States
Aditi Verma
Department of Nuclear Engineering and Radiological Sciences, University of Michigan, Ann Arbor, MI 48109, United States
Majdi I. Radaideh
Department of Nuclear Engineering and Radiological Sciences, University of Michigan, Ann Arbor, MI 48109, United States