How Effectively Do LLMs Extract Feature-Sentiment Pairs from App Reviews?

📅 2024-09-11
📈 Citations: 0
Influential: 0
🤖 AI Summary
This study addresses fine-grained sentiment analysis of app reviews, specifically zero-shot and few-shot feature-sentiment pair extraction, where models must identify functional app features and classify the sentiment polarity (positive, negative, or neutral) expressed toward each without domain-specific training. Method: The authors systematically evaluate GPT-4, ChatGPT, and Llama-2-chat variants against the rule-based SAFE and the fine-tuned RE-BERT on a unified benchmark covering both feature extraction and sentiment polarity prediction. Contribution/Results: GPT-4 achieves 76% F1 in zero-shot feature extraction, outperforming SAFE by 17 percentage points, and improves to 82% with only five in-context examples; however, the fine-tuned RE-BERT still exceeds GPT-4 by 6 percentage points. For sentiment prediction, GPT-4's zero-shot F1 scores are 76% (positive) and 45% (neutral), rising to 83% and 68% with five examples. These results suggest that large language models can approach supervised-model performance without fine-tuning, making them a practical option for low-resource, fine-grained user-feedback analysis.

📝 Abstract
Automatic analysis of user reviews to understand user sentiments toward app functionality (i.e. app features) helps align development efforts with user expectations and needs. Recent advances in Large Language Models (LLMs) such as ChatGPT have shown impressive performance on several new tasks without updating the model's parameters i.e. using zero or a few labeled examples, but the capabilities of LLMs are yet unexplored for feature-specific sentiment analysis. The goal of our study is to explore the capabilities of LLMs to perform feature-specific sentiment analysis of user reviews. This study compares the performance of state-of-the-art LLMs, including GPT-4, ChatGPT, and different variants of Llama-2 chat, against previous approaches for extracting app features and associated sentiments in zero-shot, 1-shot, and 5-shot scenarios. The results indicate that GPT-4 outperforms the rule-based SAFE by 17% in f1-score for extracting app features in the zero-shot scenario, with 5-shot further improving it by 6%. However, the fine-tuned RE-BERT exceeds GPT-4 by 6% in f1-score. For predicting positive and neutral sentiments, GPT-4 achieves f1-scores of 76% and 45% in the zero-shot setting, which improve by 7% and 23% in the 5-shot setting, respectively. Our study conducts a thorough evaluation of both proprietary and open-source LLMs to provide an objective assessment of their performance in extracting feature-sentiment pairs.
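To make the evaluated setup concrete, the sketch below illustrates how zero-shot and few-shot feature-sentiment pair extraction with an LLM might be wired up. The prompt template, the `feature -> sentiment` output format, and the function names are illustrative assumptions, not the paper's actual prompts; the model call itself is omitted and a simulated reply is parsed instead.

```python
# Hypothetical sketch of zero-/few-shot feature-sentiment extraction prompting.
# The prompt wording and output format here are assumptions for illustration,
# not the prompts used in the paper.

def build_prompt(review, examples=()):
    """Build a zero-shot (no examples) or few-shot prompt for an LLM."""
    lines = [
        "Extract the app features mentioned in the review and the sentiment "
        "(positive, negative, or neutral) expressed toward each.",
        "Answer with one 'feature -> sentiment' pair per line.",
    ]
    # Few-shot: prepend labeled in-context examples (1-shot, 5-shot, ...).
    for ex_review, ex_pairs in examples:
        lines.append(f"Review: {ex_review}")
        lines.extend(f"{feat} -> {sent}" for feat, sent in ex_pairs)
    lines.append(f"Review: {review}")
    return "\n".join(lines)

def parse_pairs(model_output):
    """Parse 'feature -> sentiment' lines from a model's reply."""
    pairs = []
    for line in model_output.splitlines():
        if "->" in line:
            feature, sentiment = (part.strip() for part in line.split("->", 1))
            pairs.append((feature, sentiment.lower()))
    return pairs

# Simulated model reply (no API call is made in this sketch).
reply = "offline mode -> positive\nlogin screen -> negative"
print(parse_pairs(reply))  # [('offline mode', 'positive'), ('login screen', 'negative')]
```

In the paper's terms, passing an empty `examples` tuple corresponds to the zero-shot scenario, while passing one or five labeled reviews corresponds to the 1-shot and 5-shot scenarios.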
Problem

Research questions and friction points this paper is trying to address.

Can LLMs perform feature-specific sentiment analysis of app reviews without domain-specific training?
How do LLMs compare to rule-based and fine-tuned approaches in zero-shot and few-shot scenarios?
How accurately do LLMs extract feature-sentiment pairs from user reviews?
Innovation

Methods, ideas, or system contributions that make the work stand out.

Evaluation of LLMs for feature-specific sentiment analysis of app reviews
Systematic comparison of GPT-4, ChatGPT, and Llama-2-chat variants against rule-based SAFE and fine-tuned RE-BERT
Zero-shot, 1-shot, and 5-shot scenarios evaluated on a unified benchmark