An evaluation of LLMs for generating movie reviews: GPT-4o, Gemini-2.0 and DeepSeek-V3

📅 2025-05-30
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Assessing the capacity of large language models (LLMs) to generate human-like movie reviews remains challenging, particularly regarding emotional depth and stylistic consistency. Method: This study systematically evaluates GPT-4o, Gemini 2.0, and DeepSeek-V3 on review generation from film subtitles and scripts, benchmarking against authentic IMDb user reviews across four quantitative dimensions (lexical diversity, sentiment polarity, BERTScore semantic similarity, and topic coherence) and corroborating the findings with a blind human evaluation. Contribution/Results: It presents the first cross-model comparison of leading proprietary and open-weight LLMs on cinematic review generation. All models exhibit grammatical fluency but underperform humans in affective richness and stylistic stability. DeepSeek-V3 achieves the highest overall fidelity to IMDb review style. Human evaluators misclassified ~68% of LLM-generated reviews as human-written, indicating strong surface-level plausibility yet revealing persistent gaps in deep affective modeling and individual stylistic preservation.

📝 Abstract
Large language models (LLMs) have been prominent in various tasks, including text generation and summarisation. The applicability of LLMs to the generation of product reviews is gaining momentum, paving the way for the generation of movie reviews. In this study, we propose a framework that generates movie reviews using three LLMs (GPT-4o, DeepSeek-V3, and Gemini-2.0) and evaluate their performance by comparing the generated outputs with IMDb user reviews. We use movie subtitles and screenplays as input to the LLMs and investigate how they affect the quality of the reviews generated. We assess the LLM-generated movie reviews in terms of vocabulary, sentiment polarity, similarity, and thematic consistency against IMDb user reviews. The results demonstrate that LLMs are capable of generating syntactically fluent and structurally complete movie reviews. Nevertheless, there is still a noticeable gap in emotional richness and stylistic coherence between LLM-generated and IMDb reviews, suggesting that further refinement is needed to improve the overall quality of movie review generation. We also provide a survey-based analysis in which participants were asked to distinguish between LLM-generated and IMDb user reviews; the results show that LLM-generated reviews are difficult to distinguish from IMDb user reviews. We found that DeepSeek-V3 produced the most balanced reviews, closely matching IMDb reviews. GPT-4o overemphasised positive emotions, while Gemini-2.0 captured negative emotions better but showed excessive emotional intensity.
Problem

Research questions and friction points this paper is trying to address.

Evaluating LLMs for generating movie reviews compared to IMDb
Assessing impact of subtitles and screenplays on review quality
Analyzing emotional and stylistic gaps in LLM-generated reviews
Innovation

Methods, ideas, or system contributions that make the work stand out.

Uses GPT-4o, Gemini-2.0, DeepSeek-V3 for reviews
Compares LLM outputs with IMDb user reviews
Analyzes vocabulary, sentiment, similarity, thematic consistency
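Two of the evaluation dimensions listed above (vocabulary richness and sentiment polarity) can be sketched in pure Python. This is an illustrative sketch only: the tokenizer and the tiny sentiment lexicon below are hypothetical placeholders, not the paper's actual pipeline, and the real study additionally uses BERTScore and topic coherence, which require external models.

```python
# Illustrative sketch (not the paper's code): two of the four evaluation
# dimensions, computed with the standard library only. The word lists in
# POSITIVE/NEGATIVE are made-up examples standing in for a real lexicon.
import re

def lexical_diversity(text: str) -> float:
    """Type-token ratio: distinct words / total words."""
    tokens = re.findall(r"[a-z']+", text.lower())
    return len(set(tokens)) / len(tokens) if tokens else 0.0

POSITIVE = {"great", "brilliant", "moving", "masterpiece", "enjoyed"}
NEGATIVE = {"dull", "boring", "weak", "disappointing", "flat"}

def sentiment_polarity(text: str) -> float:
    """Naive lexicon score in [-1, 1]: (pos - neg) / (pos + neg)."""
    tokens = re.findall(r"[a-z']+", text.lower())
    pos = sum(t in POSITIVE for t in tokens)
    neg = sum(t in NEGATIVE for t in tokens)
    return (pos - neg) / (pos + neg) if (pos + neg) else 0.0

review = "A brilliant, moving film. The film felt dull at times, the pacing dull."
print(round(lexical_diversity(review), 2))   # 10 distinct / 13 total -> 0.77
print(round(sentiment_polarity(review), 2))  # 2 positive, 2 negative -> 0.0
```

Scores like these, computed for both LLM-generated and IMDb reviews, allow the kind of side-by-side comparison the paper reports (e.g. whether a model's polarity distribution skews more positive than human reviews).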
Brendan Sands
Transitional Artificial Intelligence Research Group, School of Mathematics and Statistics, UNSW Sydney, Sydney, Australia
Yining Wang
Transitional Artificial Intelligence Research Group, School of Mathematics and Statistics, UNSW Sydney, Sydney, Australia
Chenhao Xu
Victoria University
Deep Learning · Edge Computing · Blockchain
Yuxuan Zhou
Transitional Artificial Intelligence Research Group, School of Mathematics and Statistics, UNSW Sydney, Sydney, Australia
Lai Wei
Transitional Artificial Intelligence Research Group, School of Mathematics and Statistics, UNSW Sydney, Sydney, Australia
Rohitash Chandra
UNSW
Bayesian deep learning · Neuroevolution · Climate Extremes · Language Models · Comparative Religion