CAKE: Cloud Architecture Knowledge Evaluation of Large Language Models

📅 2026-04-07

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses the lack of a systematic benchmark for evaluating how well large language models (LLMs) understand cloud-native software architecture. It introduces CAKE, a benchmark of 188 expert-validated questions spanning four cognitive levels of Bloom's revised taxonomy and five cloud-native topics. The study evaluates 22 model configurations in both multiple-choice and free-response formats, using three-run majority voting for multiple-choice questions and LLM-as-a-judge scoring for free responses, and it includes +think and +tool augmentation strategies in ablation experiments. Results show that multiple-choice accuracy plateaus above 3B parameters (the best model reaches 99.2%), whereas free-response scores continue to differentiate models. Furthermore, +think improves free-response quality, while +tool degrades performance for small models, indicating that the effect of an augmentation strategy depends on model scale.

📝 Abstract

In today's software architecture practice, large language models (LLMs) serve as co-pilots. However, no benchmark currently exists to evaluate LLMs' actual understanding of cloud-native software architecture. For this reason we present a benchmark called CAKE, which consists of 188 expert-validated questions covering four cognitive levels of Bloom's revised taxonomy (recall, analyze, design, and implement) and five cloud-native topics. Evaluation is conducted on 22 model configurations (0.5B to 70B parameters) across four LLM families, using three-run majority voting for multiple-choice questions (MCQs) and LLM-as-a-judge scoring for free-response (FR) questions. Based on this evaluation, four notable findings were identified. First, MCQ accuracy plateaus above 3B parameters, with the best model reaching 99.2%. Second, free-response scores scale steadily across all cognitive levels. Third, the two formats capture different facets of knowledge: MCQ accuracy approaches a ceiling while free-response scores continue to differentiate models. Finally, reasoning augmentation (+think) improves free-response quality, while tool augmentation (+tool) degrades performance for small models. These results suggest that the evaluation format fundamentally shapes how we measure architectural knowledge in LLMs.
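The three-run majority voting used for MCQs can be illustrated with a minimal Python sketch. This is not the authors' implementation: the function names `majority_vote` and `mcq_accuracy`, the sample answers, and the rule that a question with no strict majority counts as incorrect are all assumptions made for illustration, since the abstract does not specify how ties are handled.

```python
from collections import Counter
from typing import Optional, Sequence


def majority_vote(answers: Sequence[str]) -> Optional[str]:
    """Return the option chosen by a strict majority of runs, or None if there is none."""
    if not answers:
        return None
    label, count = Counter(answers).most_common(1)[0]
    # Assumption: without a strict majority (e.g. A, B, C), no answer is selected.
    return label if count > len(answers) / 2 else None


def mcq_accuracy(runs_per_question: Sequence[Sequence[str]], gold: Sequence[str]) -> float:
    """Fraction of questions whose majority-voted answer matches the gold answer."""
    if len(runs_per_question) != len(gold):
        raise ValueError("each question needs exactly one gold answer")
    correct = sum(
        majority_vote(runs) == expected
        for runs, expected in zip(runs_per_question, gold)
    )
    return correct / len(gold)


# Hypothetical data: three runs per question, made up for this example.
sample_runs = [["A", "A", "B"], ["C", "D", "C"], ["A", "B", "C"]]
sample_gold = ["A", "D", "B"]
sample_accuracy = mcq_accuracy(sample_runs, sample_gold)
print(f"MCQ accuracy: {sample_accuracy:.3f}")
```

In the sample data, only the first question is scored as correct: the second question's majority answer differs from the gold answer, and the third has no majority. A different tie-breaking rule (for example, taking the first run's answer) would change the score, which is why the rule is flagged as an assumption here.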

Problem

Research questions and friction points this paper is trying to address.

cloud-native

large language models

software architecture

benchmark

knowledge evaluation

Innovation

Methods, ideas, or system contributions that make the work stand out.

Cloud-native architecture

Large Language Models

Benchmarking