Assessing REST API Test Generation Strategies with Log Coverage

📅 2026-04-08
📈 Citations: 0
Influential: 0
🤖 AI Summary
This study addresses the challenge of evaluating test effectiveness in black-box REST API testing, where the absence of source-code coverage metrics and the heterogeneity of polyglot technology stacks hinder assessment. To this end, it introduces three log-coverage-based quantitative metrics (average, minimum, and maximum) to measure differences in runtime behavior among testing strategies over repeated runs. An empirical evaluation was conducted on the Light-OAuth2 authorization microservice system, comparing EvoMaster v5.0.2, Claude Opus 4.6, GPT-5.2-Codex, and manually crafted Locust scripts. Results show that Claude Opus 4.6 tests cover, on average, 28.4% more unique log templates than the human-written tests. Moreover, combining strategies substantially extends coverage: pooling Claude Opus 4.6 with the manual tests increases total observed log coverage by 78.4% over the human-written baseline and by 38.9% over Claude's own, revealing complementary behavioral strengths across methodologies.
📝 Abstract
Assessing the effectiveness of REST API tests in black-box settings can be challenging due to the lack of access to source code coverage metrics and the polyglot tech stacks involved. We propose three metrics capturing average, minimum, and maximum log coverage to handle the diverse test generation results and runtime behaviors over multiple runs. Using log coverage, we empirically evaluate three REST API test generation strategies, evolutionary computing (EvoMaster v5.0.2), LLMs (Claude Opus 4.6 and GPT-5.2-Codex), and human-written Locust load tests, on the Light-OAuth2 authorization microservice system. On average, Claude Opus 4.6 tests uncover 28.4% more unique log templates than human-written tests, whereas EvoMaster and GPT-5.2-Codex find 26.1% and 38.6% fewer, respectively. Next, we analyze combined log coverage to assess complementarity between strategies. Combining human-written tests with Claude Opus 4.6 tests increases total observed log coverage by 78.4% and 38.9% relative to the human-written and Claude baselines, respectively. When combining the Locust tests with EvoMaster, the same increases are 30.7% and 76.9%; with GPT-5.2-Codex, they are 26.1% and 105.6%. This means that the generation strategies exercise largely distinct runtime behaviors. Our future work includes extending our study to multiple systems.
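The metrics described above can be sketched in a few lines. The snippet below is a minimal illustration, not the paper's implementation: it assumes each test run has already been reduced to a set of unique log-template identifiers (the template-mining step itself is out of scope here), computes the average/minimum/maximum coverage over repeated runs of one strategy, and measures complementarity as the relative gain in total observed coverage when two strategies' templates are pooled.

```python
def coverage_metrics(runs: list[set[str]]) -> dict[str, float]:
    """Average, minimum, and maximum count of unique log templates
    observed across repeated runs of a single test strategy."""
    sizes = [len(r) for r in runs]
    return {
        "avg": sum(sizes) / len(sizes),
        "min": min(sizes),
        "max": max(sizes),
    }

def combined_increase(a: set[str], b: set[str]) -> tuple[float, float]:
    """Relative gain in total observed coverage when pooling two
    strategies, measured against each strategy's own baseline."""
    union = len(a | b)
    return (union - len(a)) / len(a), (union - len(b)) / len(b)

# Toy data with hypothetical template IDs (not taken from the paper):
human = {"t1", "t2", "t3"}
llm = {"t2", "t3", "t4", "t5"}

print(coverage_metrics([human, llm]))   # per-run coverage statistics
print(combined_increase(human, llm))    # gain over each baseline
```

With the toy sets above, the union holds five templates, so pooling yields a 66.7% gain over the three-template baseline and a 25% gain over the four-template one, mirroring how the paper's pairwise combination percentages are asymmetric.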
Problem

Research questions and friction points this paper is trying to address.

REST API testing
black-box testing
test effectiveness
log coverage
polyglot tech stack
Innovation

Methods, ideas, or system contributions that make the work stand out.

log coverage
REST API testing
test generation
black-box evaluation
complementarity analysis