Sequence Graphs Realizations and Ambiguity in Language Models

📅 2020-03-04
🏛️ International Computing and Combinatorics Conference
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work investigates the ambiguity in sequence graph representations induced by the bag-of-words assumption in language models. Specifically, given a window size w, a directed or undirected structure, and edge multiplicities (weights), we study two fundamental questions: (i) realizability, i.e., whether a given sequence graph corresponds to at least one valid sequence; and (ii) enumerability, i.e., how many distinct sequences map to the same graph. We establish the first systematic theoretical framework for sequence graph realizability, introducing a three-level generalized model that jointly accounts for window size, edge directionality, and edge weights. We design exact dynamic programming algorithms for counting and enumerating preimages. We prove that even small windows (e.g., w = 2) induce exponential ambiguity: semantically divergent sentences share identical sequence graphs. Furthermore, we identify several core combinatorial problems whose computational complexity remains open. Our results demonstrate that bag-of-words compression fundamentally undermines representation uniqueness, posing intrinsic challenges to model interpretability and robustness.
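
To make the notion of a sequence graph concrete, here is a minimal Python sketch of how such a graph could be built from a token sequence. It is an illustration, not the paper's code: the function name `sequence_graph`, its parameters, and the convention that a window of size w links positions at distance 1 to w - 1 are all assumptions made for this example.

```python
from collections import Counter

def sequence_graph(seq, w=2, directed=True, weighted=True):
    """Build a sequence graph from `seq` (hypothetical helper).

    Assumed convention: two positions are linked when their distance
    is between 1 and w - 1, so w = 2 links consecutive tokens only.
    """
    edges = Counter()
    for i, u in enumerate(seq):
        # Look ahead at most w - 1 positions inside the window.
        for j in range(i + 1, min(i + w, len(seq))):
            v = seq[j]
            # Undirected edges are stored with endpoints in sorted order.
            key = (u, v) if directed else tuple(sorted((u, v)))
            edges[key] += 1
    # Unweighted graphs keep only the edge set, not the multiplicities.
    return dict(edges) if weighted else set(edges)

# Two different sentences over the same vocabulary.
s1 = ["dog", "bites", "man", "bites", "dog"]
s2 = ["man", "bites", "dog", "bites", "man"]

g1 = sequence_graph(s1, w=2)
g2 = sequence_graph(s2, w=2)
print(g1 == g2)
```

Under the stated convention, both sentences produce the same directed, weighted graph even though they differ as sequences, which is the kind of ambiguity the summary describes.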

📝 Abstract
Several natural language models rely on an assumption that models each word's context as a bag of words. We study the combinatorial implications of this assumption for the corresponding word or sentence representations. In particular, we present theoretical results concerning the family of sequence graphs, for which realizations yield equivalent representations given this assumption. Several combinatorial problems are presented, depending on three levels of generalisation (window size, graph orientation, and weights), and whether some of these are NP-complete is left open. Based on these results, we also establish different algorithms, including a dynamic programming formulation, to count and enumerate the different realizations of a sequence graph. This allows us to show that the bag-of-words assumption can cause a large number of sentences to have the same representation, even for relatively short context window sizes.
Problem

Research questions and friction points this paper is trying to address.

Study realizability and ambiguity of sequence graphs in language models
Analyze combinatorial and algorithmic aspects of sequence graph realizations
Investigate polynomial-time algorithms and hardness results for various graph settings
Innovation

Methods, ideas, or system contributions that make the work stand out.

Sequence graphs model word co-occurrence in windows
Polynomial-time algorithms for realizability at window size 2
Dynamic programming for enumerating realizations of moderate-size instances
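
As a point of comparison for the enumeration task, the sketch below lists all realizations of a small graph by brute force. It is not the paper's dynamic programming formulation: it simply tries every sequence of a given length, so it is only usable on tiny instances. The helper names `sequence_graph` and `realizations`, and the convention that a window of size w links positions at distance 1 to w - 1, are assumptions made for this example.

```python
from collections import Counter
from itertools import product

def sequence_graph(seq, w=2):
    """Directed, weighted sequence graph (assumed convention:
    positions at distance 1 .. w - 1 are linked)."""
    edges = Counter()
    for i, u in enumerate(seq):
        for j in range(i + 1, min(i + w, len(seq))):
            edges[(u, seq[j])] += 1
    return dict(edges)

def realizations(target, vertices, length, w=2):
    """Naive enumeration: keep every sequence whose graph equals `target`.

    Runs in time exponential in `length`; a stand-in for illustration,
    not the dynamic programming approach mentioned above.
    """
    return [
        seq
        for seq in product(vertices, repeat=length)
        if sequence_graph(seq, w) == target
    ]

sentence = ("dog", "bites", "man", "bites", "dog")
target = sequence_graph(sentence, w=2)
found = realizations(target, sorted(set(sentence)), len(sentence), w=2)
for seq in found:
    print(" ".join(seq))
```

For this example graph, the search finds four sequences, including the original sentence and its reversal-like counterpart "man bites dog bites man". With w = 2 in the directed, weighted setting, each realization is a walk that uses every edge exactly as often as its weight.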