🤖 AI Summary
Problem: Closed-class words (e.g., prepositions, conjunctions, articles) are grammatically essential in natural language, yet their use in source code identifiers is rarely studied in software engineering research and lacks both empirical characterization and theoretical grounding.
Method: We construct CCID, a new manually annotated dataset of 1,275 identifiers containing closed-class words, drawn from 30 open-source systems. We combine extended grammar-pattern (part-of-speech sequence) analysis, grounded theory coding, and statistical analysis to uncover how such words encode control flow, data transformation, temporal logic, and behavioral roles.
Contribution/Results: We propose a grammar-pattern-based framework for analyzing identifier semantics and show that recurring closed-class patterns are associated with specific kinds of program behavior. The work provides an empirical foundation for studying closed-class words in identifiers, a previously underexplored dimension of identifier semantics, with implications for identifier naming assistance, code comprehension, and programming pedagogy.
📝 Abstract
Identifier names are crucial components of code, serving as primary clues for developers to understand program behavior. This paper investigates the linguistic structure of identifier names by extending the concept of grammar patterns: representations of the part-of-speech (PoS) sequences that underlie identifier phrases. The specific focus is on closed syntactic categories (e.g., prepositions, conjunctions, determiners), which are rarely studied in software engineering despite their central role in natural language. The paper presents the Closed Category Identifier Dataset (CCID), a new manually annotated dataset of 1,275 identifiers drawn from 30 open-source systems. The relationship between closed-category grammar patterns and program behavior is analyzed using grounded theory coding, statistical analysis, and pattern analysis. The results reveal recurring structures that developers use to express control flow, data transformation, temporal reasoning, and behavioral roles through naming. This study contributes an empirical foundation for understanding how developers adapt linguistic resources to encode behavior in source code. By analyzing closed-category terms and their associated grammar patterns, the work highlights a previously underexplored dimension of identifier semantics and identifies promising directions for future research in naming support, comprehension, and education.
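To make the idea of a grammar pattern concrete, the following minimal Python sketch splits an identifier into words and marks which of them are closed-class. It is an illustration only, not the annotation procedure used for CCID: the tiny `CLOSED_CLASS` lexicon, the tag labels (`P`, `CJ`, `DT`, `OPEN`), and the function names are all assumptions introduced here. CCID was annotated manually, whereas a context-free lookup like this one cannot resolve ambiguous words.

```python
import re

# Hypothetical, deliberately tiny lexicon of closed-class words.
# The tag labels are illustrative and are not taken from CCID.
CLOSED_CLASS = {
    "to": "P", "from": "P", "in": "P", "with": "P", "for": "P",  # prepositions
    "and": "CJ", "or": "CJ",                                      # conjunctions
    "the": "DT", "a": "DT", "an": "DT", "all": "DT", "each": "DT",  # determiners
}


def split_identifier(name):
    """Split a snake_case or camelCase identifier into lowercase words."""
    words = []
    for chunk in re.split(r"[_\W]+", name):
        # Acronym runs (XML), capitalized or lowercase words, and digit runs.
        words.extend(re.findall(r"[A-Z]+(?![a-z])|[A-Z]?[a-z]+|\d+", chunk))
    return [w.lower() for w in words]


def grammar_pattern(name):
    """Return a coarse pattern: closed-class tags, with OPEN for all other words."""
    return " ".join(CLOSED_CLASS.get(w, "OPEN") for w in split_identifier(name))


if __name__ == "__main__":
    for ident in ["convertToString", "parseXMLFromFile", "remove_all_items"]:
        print(ident, "->", grammar_pattern(ident))
```

Under these assumptions, `convertToString` maps to `OPEN P OPEN`: the preposition in the middle is the kind of closed-class marker that often signals a data transformation. A full grammar pattern would also assign PoS tags to the open-class words, which this sketch leaves out.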