Assessing Large Language Models in Comprehending and Verifying Concurrent Programs across Memory Models

📅 2025-01-24
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work systematically evaluates the capability of large language models (LLMs) to detect and verify data races and deadlocks in concurrent programs under relaxed memory models. Using the SV-COMP pthread and ARM Litmus benchmarks, we assess state-of-the-art models—including GPT-3.5, GPT-4, GPT-4o, GPT-4o-mini, and Mistral-Large2—across three memory models: sequential consistency (SC), total store ordering (TSO), and partial store ordering (PSO). The evaluation reveals a fundamental limitation: while LLMs achieve reasonable accuracy in SC environments for canonical concurrency bugs, their performance degrades significantly under TSO and PSO because they fail to faithfully model non-SC memory ordering constraints. This study establishes the first empirical boundary on LLMs' capacity for memory-model-aware concurrency verification, and it provides a benchmark and empirical foundation for future research on memory-model-sensitive code understanding and formal verification.

📝 Abstract
As concurrent programming becomes increasingly prevalent, effectively identifying and addressing concurrency issues such as data races and deadlocks is critical. This study evaluates the performance of several leading large language models (LLMs), including GPT-3.5-turbo, GPT-4, GPT-4o, GPT-4o-mini, and Mistral-AI's Large2, in understanding and analyzing concurrency issues within software programs. Given that relaxed memory models, such as Total Store Order (TSO) and Partial Store Order (PSO), are widely implemented and adopted in modern systems, supported even by commodity architectures like ARM and x86, our evaluation covers not only sequentially consistent memory models but also these relaxed memory models. Specifically, we assess two main aspects: the models' capacity to detect concurrency problems under a sequentially consistent memory model, and their ability to verify the correctness conditions of concurrent programs across both sequentially consistent and relaxed memory models. To do this, we leverage SV-COMP's pthread tests and 25 ARM Litmus tests designed to evaluate the TSO and PSO memory models. The experimental results reveal that GPT-4, GPT-4o, and Mistral-AI's Large2 demonstrate a robust understanding of concurrency issues, effectively identifying data races and deadlocks when assessed under a sequentially consistent memory model. However, despite this strong performance under sequential consistency, all selected LLMs face significant challenges in verifying program correctness under relaxed memory models: they fail to accurately capture memory ordering constraints, and their current capabilities fall short of verifying even small programs in these scenarios.
Problem

Research questions and friction points this paper is trying to address.

Large Language Models
Memory Management Strategies
Concurrency Issues Detection
Innovation

Methods, ideas, or system contributions that make the work stand out.

Advanced Language Models
Multithreaded Program Analysis
Modern Memory Models