Evaluating LLM-Based Mobile App Recommendations: An Empirical Study

📅 2025-10-21
📈 Citations: 0
Influential: 0
🤖 AI Summary
This study systematically evaluates large language models (LLMs) for mobile app recommendation, focusing on recommendation consistency, interpretability, and alignment with traditional App Store Optimization (ASO) metrics.
Method: We propose a taxonomy of 16 general ranking criteria and construct the first multidimensional evaluation framework tailored to conversational app discovery, enabling quantitative analysis of explicit instruction following and cross-query consistency. Experiments compare outputs from leading general-purpose LLMs across these dimensions; all data and evaluation code are open-sourced for reproducibility.
Results: LLM recommendation logic is broad yet fragmented, only partially aligned with ASO indicators. Top-ranked recommendations are highly consistent, but consistency degrades significantly with increasing rank depth and query specificity, and sensitivity to explicit instructions varies markedly across models. The core contribution is the first standardized evaluation paradigm and benchmark resource for LLM-driven mobile app recommendation.

📝 Abstract
Large Language Models (LLMs) are increasingly used to recommend mobile applications through natural language prompts, offering a flexible alternative to keyword-based app store search. Yet the reasoning behind these recommendations remains opaque, raising questions about their consistency, explainability, and alignment with traditional App Store Optimization (ASO) metrics. In this paper, we present an empirical analysis of how widely used general-purpose LLMs generate, justify, and rank mobile app recommendations. Our contributions are: (i) a taxonomy of 16 generalizable ranking criteria elicited from LLM outputs; (ii) a systematic evaluation framework to analyse recommendation consistency and responsiveness to explicit ranking instructions; and (iii) a replication package to support reproducibility and future research on AI-based recommendation systems. Our findings reveal that LLMs rely on a broad yet fragmented set of ranking criteria that is only partially aligned with standard ASO metrics. While top-ranked apps tend to be consistent across runs, variability increases with ranking depth and search specificity. LLMs exhibit varying sensitivity to explicit ranking instructions, ranging from substantial adaptations to near-identical outputs, highlighting their complex reasoning dynamics in conversational app discovery. Our results aim to support end users, app developers, and recommender-systems researchers in navigating the emerging landscape of conversational app discovery.
Problem

Research questions and friction points this paper is trying to address.

Evaluating reasoning consistency of LLM-based mobile app recommendations
Assessing alignment between LLM recommendations and ASO metrics
Analyzing LLM sensitivity to explicit ranking instructions
Innovation

Methods, ideas, or system contributions that make the work stand out.

Developed taxonomy of 16 ranking criteria from LLM outputs
Created systematic framework for evaluating recommendation consistency (see the metric sketch after this list)
Analyzed LLM sensitivity to explicit ranking instructions
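
Both the consistency and the instruction-sensitivity analyses come down to comparing ranked lists. The following Python sketch is illustrative only, not the authors' replication package: it assumes each LLM run yields an ordered list of app identifiers, and the app names, the jaccard_at_k and kendall_tau_distance helpers, and the k=5 cutoff are all invented for this example.

# Minimal sketch, assuming each LLM run returns an ordered list of
# app identifiers for the same natural-language query. App names and
# helper names are invented for illustration.
from itertools import combinations

def jaccard_at_k(run_a, run_b, k=10):
    """Top-k overlap between two recommendation runs of the same query."""
    top_a, top_b = set(run_a[:k]), set(run_b[:k])
    return len(top_a & top_b) / len(top_a | top_b)

def kendall_tau_distance(rank_a, rank_b):
    """Fraction of discordant pairs among apps shared by both rankings:
    0.0 means identical order, 1.0 means fully reversed."""
    shared = [app for app in rank_a if app in rank_b]
    pos_b = {app: i for i, app in enumerate(rank_b)}
    app_pairs = list(combinations(shared, 2))
    if not app_pairs:
        return 0.0
    discordant = sum(1 for x, y in app_pairs if pos_b[x] > pos_b[y])
    return discordant / len(app_pairs)

# Cross-run consistency: mean pairwise top-k overlap over repeated runs.
runs = [
    ["whatsapp", "telegram", "signal", "viber", "line"],
    ["whatsapp", "signal", "telegram", "discord", "line"],
    ["whatsapp", "telegram", "signal", "line", "wechat"],
]
run_pairs = list(combinations(runs, 2))
consistency = sum(jaccard_at_k(a, b, k=5) for a, b in run_pairs) / len(run_pairs)
print(f"mean top-5 consistency: {consistency:.2f}")

# Instruction sensitivity: rank distance between a baseline ranking and
# one produced under an explicit instruction (e.g., "rank by privacy").
baseline = ["whatsapp", "telegram", "signal", "viber", "line"]
instructed = ["signal", "whatsapp", "telegram", "line", "viber"]
print(f"Kendall tau distance: {kendall_tau_distance(baseline, instructed):.2f}")

Recomputing the overlap with a larger k would reflect the paper's observation that consistency degrades with rank depth, since deeper list positions vary more across runs.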
Quim Motger
Dept. of Service and Information System Engineering, Universitat Politècnica de Catalunya, Spain
Xavier Franch
Dept. of Service and Information System Engineering, Universitat Politècnica de Catalunya, Spain
Vincenzo Gervasi
Dept. of Computer Science, University of Pisa, Italy
Jordi Marco
Associate Professor, Universitat Politècnica de Catalunya
Service Oriented Computing, Non-Functional Requirements, Software Engineering, Computer Graphics