🤖 AI Summary
This work addresses the challenges of high annotation costs, subjectivity, and poor cross-domain generalization in text segmentation by proposing a training-free, unsupervised method. The approach maps sentences into embedding vectors and applies kernel change point detection (KCPD) with a penalty term to identify segment boundaries. Its key innovation lies in establishing, for the first time, a dependency-aware theoretical framework for KCPD under $m$-dependent sequences, which accounts for the short-range dependencies inherent in natural language and provides provable upper bounds on total risk together with change point localization guarantees. A controllable synthetic data validation framework based on large language models is also introduced. Experiments demonstrate that the method outperforms strong unsupervised baselines on standard benchmarks, and a case study on Taylor Swift's tweets illustrates its practical efficacy alongside its theoretical soundness.
📝 Abstract
Unsupervised text segmentation is crucial because boundary labels are expensive, subjective, and often fail to transfer across domains and granularity choices. We propose Embed-KCPD, a training-free method that represents sentences as embedding vectors and estimates boundaries by minimizing a penalized KCPD objective. Beyond the algorithmic instantiation, we develop, to our knowledge, the first dependence-aware theory for KCPD under $m$-dependent sequences, a finite-memory abstraction of short-range dependence common in language. We prove an oracle inequality for the population penalized risk and a localization guarantee showing that each true change point is recovered within a window that is small relative to segment length. To connect theory to practice, we introduce an LLM-based simulation framework that generates synthetic documents with controlled finite-memory dependence and known boundaries, validating the predicted scaling behavior. Across standard segmentation benchmarks, Embed-KCPD often outperforms strong unsupervised baselines. A case study on Taylor Swift's tweets illustrates that Embed-KCPD combines strong theoretical guarantees, simulated reliability, and practical effectiveness for text segmentation.
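To make the pipeline described above concrete, here is a minimal, self-contained sketch of penalized kernel change point detection over sentence embeddings. This is not the authors' implementation: the function names (`segment_cost`, `kcpd_segment`) are illustrative, a linear kernel is assumed for simplicity (the within-segment scatter in embedding space), and boundaries are found by exact dynamic programming over a penalized objective, with a constant penalty charged per segment.

```python
import numpy as np

def segment_cost(X, a, b):
    # Linear-kernel KCPD cost of segment X[a:b): within-segment scatter,
    # i.e. sum_i ||x_i - mean||^2 expanded into kernel-sum form.
    seg = X[a:b]
    return float((seg ** 2).sum() - (seg.sum(axis=0) ** 2).sum() / (b - a))

def kcpd_segment(X, pen):
    # Exact DP: best[t] = min over s<t of best[s] + cost(s, t) + pen,
    # so each additional segment pays one penalty `pen`.
    n = len(X)
    best = np.full(n + 1, np.inf)
    best[0] = -pen  # offset so a k-segment solution pays exactly k * pen
    prev = np.zeros(n + 1, dtype=int)
    for t in range(1, n + 1):
        for s in range(t):
            c = best[s] + segment_cost(X, s, t) + pen
            if c < best[t]:
                best[t], prev[t] = c, s
    # Backtrack the chosen boundaries (the final index n marks the end).
    bounds, t = [], n
    while t > 0:
        bounds.append(t)
        t = prev[t]
    return sorted(bounds)

# Toy "sentence embeddings": two topical blocks with a mean shift at index 20.
X = np.vstack([np.zeros((20, 4)), 3.0 * np.ones((20, 4))])
print(kcpd_segment(X, pen=50.0))  # → [20, 40]
```

The penalty plays the role of the model-selection term in the paper's penalized objective: a larger `pen` yields coarser segmentations, and in practice an RBF or cosine kernel over real sentence embeddings would replace the linear kernel used here.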