🤖 AI Summary
This work addresses the lack of systematic benchmarks for evaluating how well large language models (LLMs) understand cloud-native software architecture. It introduces CAKE, a benchmark of 188 expert-validated questions spanning four cognitive levels of Bloom’s revised taxonomy and five cloud-native topics. The study evaluates 22 model configurations in both multiple-choice and free-response formats, using three-run majority voting for multiple-choice questions and LLM-as-a-judge scoring for free responses, and includes +think and +tool augmentation as ablations. Results show that multiple-choice accuracy plateaus above 3B parameters (peaking at 99.2%), whereas free-response scores continue to differentiate models. Furthermore, +think improves free-response quality, while +tool degrades performance for small models, indicating that the effect of an augmentation strategy depends on model scale.
📝 Abstract
Large language models (LLMs) now serve as co-pilots for software architecture. However, no benchmark currently exists to evaluate their actual understanding of cloud-native software architecture. We therefore present CAKE, a benchmark of 188 expert-validated questions covering four cognitive levels of Bloom's revised taxonomy (recall, analyze, design, and implement) and five cloud-native topics. We evaluate 22 model configurations (0.5B to 70B parameters) across four LLM families, using three-run majority voting for multiple-choice questions (MCQs) and LLM-as-a-judge scoring for free-response (FR) questions. The evaluation yields four notable findings. First, MCQ accuracy plateaus above 3B parameters, with the best model reaching 99.2%. Second, free-response scores scale steadily across all cognitive levels. Third, the two formats capture different facets of knowledge: MCQ accuracy approaches a ceiling while free-response scores continue to differentiate models. Finally, reasoning augmentation (+think) improves free-response quality, while tool augmentation (+tool) degrades performance for small models. These results suggest that the evaluation format fundamentally shapes how we measure architectural knowledge in LLMs.
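The three-run majority voting used for MCQs can be illustrated with a minimal Python sketch. This is not the authors' implementation: the function names `majority_vote` and `mcq_accuracy`, the answer normalization, and the choice to count a question with no majority option as incorrect are all assumptions made for illustration.

```python
from collections import Counter
from typing import Optional, Sequence


def majority_vote(answers: Sequence[str]) -> Optional[str]:
    """Return the option picked by a strict majority of runs, else None.

    Assumption: answers are single option labels such as "A" or "b";
    they are normalized by stripping whitespace and upper-casing.
    """
    if not answers:
        return None
    counts = Counter(a.strip().upper() for a in answers)
    option, votes = counts.most_common(1)[0]
    # Require more than half of the runs to agree (2 of 3 for three runs).
    return option if votes > len(answers) / 2 else None


def mcq_accuracy(
    runs_per_question: Sequence[Sequence[str]],
    answer_key: Sequence[str],
) -> float:
    """Fraction of questions whose majority answer matches the key.

    Assumption: a question with no majority answer counts as incorrect.
    """
    if len(runs_per_question) != len(answer_key):
        raise ValueError("runs and answer key must have the same length")
    if not answer_key:
        raise ValueError("answer key must not be empty")
    correct = sum(
        1
        for runs, gold in zip(runs_per_question, answer_key)
        if majority_vote(runs) == gold.strip().upper()
    )
    return correct / len(answer_key)


if __name__ == "__main__":
    # Hypothetical model outputs: three runs for each of two questions.
    runs = [["A", "a", "B"], ["B", "C", "D"]]
    key = ["A", "B"]
    print(mcq_accuracy(runs, key))
```

With three runs, a strict majority means at least two runs agree, so a model that answers a question differently on each run gets no credit for it under this sketch.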