🤖 AI Summary
This study addresses widespread compliance issues in GitHub Actions workflows, such as over-privileged permissions and weak secrets management. It proposes a documentation-grounded compliance checklist of 30 criteria, spanning four workflow sections and eight themes, and evaluates a hybrid auditing pipeline that combines large language models (LLMs) with expert oversight. The authors assess 95 real-world Java workflows (2,850 assessments) with four open-weight LLMs, use GPT-5 to arbitrate conflicts between models, and reserve manual review for a targeted subset of cases. Overall compliance is only 28%, falling to 4% for permission controls. The multi-tier approach reduces manual verification effort by 81% while retaining 87% agreement with expert judgment.
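The tiered routing described above can be illustrated with a minimal sketch. It is not the authors' implementation: the label strings, the `adjudicate` function, and the `arbitrate` callable (a stand-in for the GPT-5 arbitrator) are hypothetical, and the rule that an abstaining arbitrator triggers manual review is an assumption, since the summary does not say how cases are selected for expert review.

```python
from typing import Callable, Optional, Sequence, Tuple

# Hypothetical label names; the paper's actual label set is not given here.
COMPLIANT = "compliant"
NON_COMPLIANT = "non_compliant"


def adjudicate(
    votes: Sequence[str],
    arbitrate: Callable[[Sequence[str]], Optional[str]],
) -> Tuple[str, Optional[str]]:
    """Route one (workflow, criterion) assessment through three tiers.

    Tier 1: if every model gives the same label, accept it.
    Tier 2: otherwise ask the arbitrator (assumed to return a label or None).
    Tier 3: if the arbitrator abstains, queue the case for manual review.
    """
    if not votes:
        raise ValueError("at least one model vote is required")
    if len(set(votes)) == 1:
        return ("consensus", votes[0])
    verdict = arbitrate(votes)
    if verdict is not None:
        return ("arbitrated", verdict)
    return ("manual_review", None)


def majority_or_abstain(votes: Sequence[str]) -> Optional[str]:
    """Toy arbitrator for demonstration only: strict majority, else abstain."""
    for label in set(votes):
        if votes.count(label) * 2 > len(votes):
            return label
    return None


if __name__ == "__main__":
    print(adjudicate([COMPLIANT] * 4, majority_or_abstain))
    print(adjudicate([COMPLIANT, COMPLIANT, COMPLIANT, NON_COMPLIANT], majority_or_abstain))
    print(adjudicate([COMPLIANT, COMPLIANT, NON_COMPLIANT, NON_COMPLIANT], majority_or_abstain))
```

Keeping the arbitrator behind a plain callable means the routing logic can be tested without calling any model; a real arbitrator would replace `majority_or_abstain`.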
📝 Abstract
GitHub Actions (GHA) CI workflows are critical infrastructure, but current tooling offers only syntactic or heuristic checks and does not enforce documented best practices for security, maintainability, or performance. Consequently, issues such as over-privileged permissions, weak secrets management, and missing failure notifications remain undetected in real-world pipelines. This paper proposes a novel, documentation-grounded GHA compliance checklist with 30 criteria spanning four workflow sections and eight themes, and assesses Large Language Models (LLMs) for scalable compliance auditing. On 95 real-world Java workflows (2,850 assessments) using four open-weight LLMs, we find only fair agreement (Fleiss' kappa = 0.28), with systematic disagreement on structural reasoning and security-sensitive judgments. To address this, we introduce a multi-tier adjudication framework in which GPT-5 resolves model conflicts before targeted manual review, reducing verification effort by 81% while retaining 87% agreement with expert judgment. At scale, the framework reveals major compliance gaps: overall compliance is 28%, dropping to 4% for permission controls; Security (26%) lags far behind Clarity (68%). Our results show that LLMs enable scalable compliance measurement but cannot replace experts, highlighting the need for hybrid human-AI auditing and providing empirical benchmarks and guidance for defensible GHA workflow audits.
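For readers unfamiliar with the agreement statistic, the following sketch shows how Fleiss' kappa is computed from per-item labels. In the setting above there would be 95 × 30 = 2,850 items, each rated by four models. The function name `fleiss_kappa` and the label strings are illustrative assumptions; the toy data is not from the paper.

```python
from collections import Counter
from typing import List, Sequence


def fleiss_kappa(ratings: Sequence[Sequence[str]]) -> float:
    """Fleiss' kappa for items that each received the same number of ratings.

    ratings[i] holds the labels assigned to item i, one per rater.
    """
    n_items = len(ratings)
    if n_items == 0:
        raise ValueError("no items to rate")
    n_raters = len(ratings[0])
    if n_raters < 2 or any(len(item) != n_raters for item in ratings):
        raise ValueError("every item needs the same number (>= 2) of ratings")

    counts: List[Counter] = [Counter(item) for item in ratings]
    categories = sorted({label for item in ratings for label in item})

    # Observed agreement: mean over items of the share of agreeing rater pairs.
    per_item = [
        (sum(c[cat] ** 2 for cat in categories) - n_raters)
        / (n_raters * (n_raters - 1))
        for c in counts
    ]
    p_observed = sum(per_item) / n_items

    # Chance agreement: sum of squared overall category proportions.
    total = n_items * n_raters
    p_expected = sum((sum(c[cat] for c in counts) / total) ** 2 for cat in categories)

    if p_expected == 1.0:
        raise ValueError("kappa is undefined when only one category is ever used")
    return (p_observed - p_expected) / (1.0 - p_expected)


if __name__ == "__main__":
    # Toy data with hypothetical labels: three items, four raters each.
    toy = [
        ["compliant", "compliant", "compliant", "compliant"],
        ["compliant", "non_compliant", "non_compliant", "non_compliant"],
        ["non_compliant", "non_compliant", "compliant", "compliant"],
    ]
    print(round(fleiss_kappa(toy), 3))
```

On the commonly used Landis and Koch scale, values between 0.21 and 0.40 are labeled "fair", which is consistent with how the abstract characterizes 0.28.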