🤖 AI Summary
This paper addresses the challenge of detecting latent weaknesses in Dafny formal specifications, that is, defects that go unnoticed because the program still passes formal verification. We propose the first mutation testing framework tailored for Dafny. Methodologically, we systematically construct 32 Dafny-applicable mutation operators, 14 of which are derived from real-world bug-fix commits in GitHub-hosted Dafny projects, ensuring high semantic relevance. We also design an automated mechanism that flags potentially weak specifications: those under which a mutated program still verifies. The framework is evaluated on 794 real-world Dafny programs; manual analysis of a subset of the undetected mutants reveals five weak specifications that would benefit from strengthening (on average, one per 241 lines of code), exposing behavioral deviations that formal verification alone does not catch. Our core contributions are (i) the first comprehensive, Dafny-specific mutation operator taxonomy, and (ii) an empirically grounded approach to weak-specification detection.
📝 Abstract
This paper explores the use of mutation testing to reveal weaknesses in formal specifications written in Dafny. In verification-aware programming languages such as Dafny, specifications play a critical role, yet they are as prone to errors as implementations. Flaws in specifications can result in formally verified programs that deviate from the intended behavior. We present MutDafny, a tool that increases the reliability of Dafny specifications by automatically signaling potential weaknesses. Using a mutation testing approach, we introduce faults (mutations) into the code and rely on the formal specifications to detect them. If a mutated program still verifies, this may indicate a weakness in the specification. We extensively analyze mutation operators from popular tools, identifying the ones applicable to Dafny. In addition, we synthesize new operators tailored for Dafny from bug-fix commits in publicly available Dafny projects on GitHub. Drawing from both, we equip our tool with a total of 32 mutation operators. We evaluate MutDafny's effectiveness and efficiency on a dataset of 794 real-world Dafny programs and manually analyze a subset of the resulting undetected mutants, identifying five weak real-world specifications (on average, one every 241 lines of code) that would benefit from strengthening.
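
To make the core idea concrete (a mutant that still verifies hints at a weak specification), here is a minimal Python sketch. It is not MutDafny's implementation and does not call Dafny: the "verifier" is an exhaustive check of a postcondition over a small integer range, used only as a stand-in for real formal verification. All names (`abs_original`, `abs_mutant`, `weak_spec`, `strong_spec`, `verifies`, `mutant_survives`) and the chosen mutation (replacing a variable with the literal `0`) are hypothetical and chosen for illustration.

```python
from typing import Callable, Iterable

Impl = Callable[[int], int]        # implementation under "verification"
Spec = Callable[[int, int], bool]  # postcondition over (input, output)


def abs_original(x: int) -> int:
    """Reference implementation of absolute value."""
    return -x if x < 0 else x


def abs_mutant(x: int) -> int:
    """Mutant: the variable in the else branch is replaced by the literal 0."""
    return -x if x < 0 else 0


def weak_spec(x: int, y: int) -> bool:
    """Weak postcondition: only requires a non-negative result."""
    return y >= 0


def strong_spec(x: int, y: int) -> bool:
    """Stronger postcondition: also ties the result to the input."""
    return y >= 0 and (y == x or y == -x)


def verifies(impl: Impl, spec: Spec, domain: Iterable[int]) -> bool:
    """Stand-in for a verifier: check the postcondition on every input
    in a bounded domain (an assumption; a real verifier proves it for all inputs)."""
    return all(spec(x, impl(x)) for x in domain)


def mutant_survives(original: Impl, mutant: Impl, spec: Spec,
                    domain: Iterable[int]) -> bool:
    """A mutant survives when both the original and the mutant satisfy the spec."""
    domain = list(domain)
    return verifies(original, spec, domain) and verifies(mutant, spec, domain)


DOMAIN = range(-50, 51)

if __name__ == "__main__":
    # Under the weak spec the mutant goes undetected, signaling a possible weakness.
    print("survives weak spec:  ", mutant_survives(abs_original, abs_mutant, weak_spec, DOMAIN))
    # Under the strengthened spec the mutant is rejected.
    print("survives strong spec:", mutant_survives(abs_original, abs_mutant, strong_spec, DOMAIN))
```

In this toy setting the mutant survives the weak postcondition and is rejected by the strengthened one, which mirrors the kind of signal the abstract describes: an undetected mutant is a prompt to inspect the specification, and it may or may not turn out to be a genuine weakness.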