🤖 AI Summary
This study examines runtime faults in Model Context Protocol (MCP) servers, which had not previously been studied empirically. One example is a configuration parameter that a server accepts but does not enforce, so the server silently falls back to its default behavior. The authors manually analyzed 837 runtime fault threads from 473 actively maintained MCP server repositories on GitHub and used a bottom-up open coding procedure to build the first empirical taxonomy of MCP runtime faults, covering areas such as protocol interactions, tool invocations, and state management. The taxonomy comprises 11 top-level categories, 27 subcategories, and 73 leaf fault types. In a survey of 55 MCP server developers, respondents reported experiencing an average of 20 of the 27 subcategories, and no category went unobserved, which suggests that the taxonomy reflects widely encountered runtime failures.
📝 Abstract
MCP (Model Context Protocol) enables LLMs (Large Language Models) to interact with external tools and data sources via a standardized protocol. Its rapid adoption in tool-augmented Artificial Intelligence (AI) workflows has introduced new reliability challenges, such as configuration parameters that are accepted but not enforced at runtime, leading to unintended default behavior. Despite this, the runtime fault characteristics of MCP servers remain empirically unexamined. We present the first empirical taxonomy of runtime faults in MCP servers. We manually analyzed 837 MCP-specific runtime fault threads from 473 actively maintained MCP server GitHub repositories and derived a taxonomy using a bottom-up open coding procedure. The taxonomy comprises 11 top-level categories and 27 subcategories (73 leaf fault types), covering recurrent failures across protocol interactions, tool invocations, schema enforcement, state management, model-provider integration, security validation, and timeouts or explicit cancellations of in-progress operations. To assess the taxonomy's external validity, we surveyed 55 MCP server developers. Respondents reported experiencing an average of 20 of the 27 fault subcategories, and no category remained unobserved. These results indicate that the taxonomy reflects widely observed runtime failures in MCP-based systems and can support the maintenance and evolution of AI software.
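To make the "accepted but not enforced" fault concrete, here is a minimal Python sketch of the pattern the abstract describes. It is an illustration only: `ServerConfig`, `request_timeout_s`, and both functions are hypothetical names invented for this example, and nothing here is taken from the MCP SDK or from the repositories studied.

```python
from dataclasses import dataclass

# Hypothetical built-in default for this illustration.
DEFAULT_TIMEOUT_S = 30.0


@dataclass
class ServerConfig:
    """Hypothetical server configuration; not part of any real MCP SDK."""
    request_timeout_s: float = DEFAULT_TIMEOUT_S


def effective_timeout_buggy(config: ServerConfig) -> float:
    # Fault pattern: the config object is accepted without complaint,
    # but its value is never read, so the default is always used.
    return DEFAULT_TIMEOUT_S


def effective_timeout_fixed(config: ServerConfig) -> float:
    # One possible fix: validate the parameter and actually apply it.
    if config.request_timeout_s <= 0:
        raise ValueError("request_timeout_s must be positive")
    return config.request_timeout_s


if __name__ == "__main__":
    config = ServerConfig(request_timeout_s=5.0)
    print(effective_timeout_buggy(config))  # 30.0: user setting silently ignored
    print(effective_timeout_fixed(config))  # 5.0: user setting enforced
```

In the buggy variant no error is ever raised, which is why this kind of fault tends to show up only as unexpected runtime behavior rather than at startup.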