🤖 AI Summary
This paper addresses the Dorst-Smeulders (DS) coding problem for arbitrary binary words: decompose a given binary word into the minimum number of Sturmian factors and output the DS coding of each factor. We present a linear-time algorithm that identifies the longest Sturmian prefix of the input and computes its DS coding; applying the same step greedily to the remaining suffix yields the full decomposition. The greedy choice is optimal because every factor of a Sturmian word is itself Sturmian, so taking the longest possible prefix at each step never increases the number of factors. The approach relies on fundamental properties of Sturmian words, including balance, rational approximation of slopes, and modular arithmetic structure, and combines a single-pass scan of the input with this greedy strategy. To our knowledge, this is the first work extending DS coding from Sturmian words to arbitrary binary words, and the resulting coding could serve as a building block for compressing binary data that contains long Sturmian segments.
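The greedy decomposition described above can be illustrated with a deliberately naive sketch. It is not the paper's linear-time algorithm: the balance test below is a brute-force check over all factor lengths, and the function names `is_balanced`, `longest_balanced_prefix`, and `greedy_decomposition` are hypothetical, chosen only for this illustration. The input is assumed to be a string over the letters `0` and `1`.

```python
def is_balanced(w: str) -> bool:
    """Brute-force balance test: any two factors of equal length
    differ by at most 1 in their number of '1' letters."""
    n = len(w)
    # prefix[i] = number of '1' letters in w[:i]
    prefix = [0]
    for ch in w:
        prefix.append(prefix[-1] + (ch == "1"))
    for length in range(1, n + 1):
        counts = [prefix[i + length] - prefix[i] for i in range(n - length + 1)]
        if max(counts) - min(counts) > 1:
            return False
    return True


def longest_balanced_prefix(w: str) -> int:
    """Length of the longest balanced prefix of w (naive, not linear time).
    Prefixes of a balanced word are balanced, so we can stop at the first failure."""
    k = 0
    while k < len(w) and is_balanced(w[: k + 1]):
        k += 1
    return k


def greedy_decomposition(w: str) -> list[str]:
    """Repeatedly cut off the longest balanced prefix."""
    factors = []
    while w:
        k = longest_balanced_prefix(w)  # k >= 1, since a single letter is balanced
        factors.append(w[:k])
        w = w[k:]
    return factors


print(greedy_decomposition("0011"))  # ['001', '1']
```

In this sketch, `0011` is split because its factors `00` and `11` have the same length but differ by 2 in their number of `1` letters. The paper's contribution is to find each prefix, together with its DS coding, in linear time rather than by repeated brute-force checks.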
📝 Abstract
A binary word is Sturmian if the occurrences of each letter are balanced, in the sense that in any two factors of the same length, the difference between the number of occurrences of the same letter is at most 1. In digital geometry, Sturmian words correspond to discrete approximations of straight line segments in the Euclidean plane. The Dorst-Smeulders coding, introduced in 1984, is a 4-tuple of integers that uniquely represents a Sturmian word $w$, enabling its reconstruction using $|w|$ modular operations, making it highly efficient in practice. In this paper, we present a linear-time algorithm that, given a binary input word $w$, computes the Dorst-Smeulders coding of its longest Sturmian prefix. This forms the basis for computing the Dorst-Smeulders coding of an arbitrary binary word $w$, which is a minimal decomposition (in terms of the number of factors) of $w$ into Sturmian words, each represented by its Dorst-Smeulders coding. This coding could be leveraged in compression schemes where the input is transformed into a binary word composed of long Sturmian segments. Although the algorithm is conceptually simple and can be implemented in just a few lines of code, it is grounded in a deep analysis of the structural properties of Sturmian words.
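To make the reconstruction from a 4-tuple concrete, here is a minimal sketch under a stated assumption. The abstract does not spell out the meaning or order of the four integers, so the code assumes the classical digital-line parameterization $(n, q, p, s)$ with letters $c_i = \lfloor p(i-s)/q \rfloor - \lfloor p(i-s-1)/q \rfloor$ for $i = 1, \dots, n$, where $0 \le p \le q$. The function name `ds_decode` is hypothetical, and the paper's exact convention may differ.

```python
def ds_decode(n: int, q: int, p: int, s: int) -> str:
    """Rebuild an n-letter binary word from an assumed tuple (n, q, p, s).

    Assumed convention (not taken from the abstract):
        c_i = floor(p*(i - s)/q) - floor(p*(i - s - 1)/q),  i = 1..n,
    with slope p/q and shift s. One modular update is used per letter.
    """
    if q < 1 or not (0 <= p <= q):
        raise ValueError("expected q >= 1 and 0 <= p <= q")
    # r holds p*(i - s - 1) mod q; for i = 1 this is (-p*s) mod q.
    r = (-p * s) % q
    letters = []
    for _ in range(n):
        # The floor increases by 1 exactly when adding p crosses a multiple of q.
        letters.append("1" if r + p >= q else "0")
        r = (r + p) % q
    return "".join(letters)


print(ds_decode(5, 5, 2, 0))  # 00101
print(ds_decode(5, 5, 2, 1))  # 10010
```

The loop performs one modular update per output letter, matching the $|w|$ modular operations mentioned in the abstract. Changing `s` selects a different factor of the same periodic word.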