Unsupervised Text Segmentation via Kernel Change-Point Detection on Sentence Embeddings

📅 2026-01-26
📈 Citations: 1
Influential: 0
🤖 AI Summary
This work addresses the challenges of high annotation costs, subjectivity, and poor cross-domain generalization in unsupervised text segmentation by proposing a training-free automatic segmentation method. The approach maps sentences into embedding vectors and applies kernel change-point detection (KCPD) with a penalty term to identify segment boundaries. Its key innovation lies in establishing, for the first time, a dependence-aware theoretical framework for KCPD under m-dependent sequences that accounts for the short-range dependencies inherent in natural language, providing provable upper bounds on total risk and guarantees for change-point localization. A controllable synthetic-data validation framework based on large language models is also introduced. Experiments demonstrate that the method outperforms strong unsupervised baselines on standard benchmarks, and a case study on Taylor Swift's tweets validates its theoretical soundness and practical efficacy.

📝 Abstract
Unsupervised text segmentation is crucial because boundary labels are expensive, subjective, and often fail to transfer across domains and granularity choices. We propose Embed-KCPD, a training-free method that represents sentences as embedding vectors and estimates boundaries by minimizing a penalized KCPD objective. Beyond the algorithmic instantiation, we develop, to our knowledge, the first dependence-aware theory for KCPD under $m$-dependent sequences, a finite-memory abstraction of short-range dependence common in language. We prove an oracle inequality for the population penalized risk and a localization guarantee showing that each true change point is recovered within a window that is small relative to segment length. To connect theory to practice, we introduce an LLM-based simulation framework that generates synthetic documents with controlled finite-memory dependence and known boundaries, validating the predicted scaling behavior. Across standard segmentation benchmarks, Embed-KCPD often outperforms strong unsupervised baselines. A case study on Taylor Swift's tweets illustrates that Embed-KCPD combines strong theoretical guarantees, simulated reliability, and practical effectiveness for text segmentation.
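The abstract's penalized KCPD objective can be illustrated with a minimal NumPy sketch: random vectors stand in for sentence embeddings, an RBF Gram matrix supplies within-segment costs via the kernel trick, and dynamic programming minimizes total segment cost plus a per-boundary penalty. This is an illustrative sketch, not the authors' Embed-KCPD implementation; the kernel, penalty value, and synthetic data below are assumptions.

```python
import numpy as np

def rbf_gram(X, gamma=1.0):
    # Gram matrix K[i, j] = exp(-gamma * ||x_i - x_j||^2)
    sq = np.sum(X**2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * X @ X.T
    return np.exp(-gamma * np.maximum(d2, 0.0))

def kcpd(X, pen, gamma=1.0, min_size=2):
    """Penalized kernel change-point detection by dynamic programming.

    Minimizes the sum of within-segment kernel costs plus `pen` per
    boundary. Returns the sorted list of boundary indices (segment
    start positions, excluding 0).
    """
    n = len(X)
    K = rbf_gram(X, gamma)
    # Prefix sums give O(1) segment-cost queries.
    cum_diag = np.concatenate([[0.0], np.cumsum(np.diag(K))])
    cum2 = np.zeros((n + 1, n + 1))
    cum2[1:, 1:] = K.cumsum(axis=0).cumsum(axis=1)

    def seg_cost(a, b):  # kernel cost of segment [a, b)
        block = cum2[b, b] - cum2[a, b] - cum2[b, a] + cum2[a, a]
        return (cum_diag[b] - cum_diag[a]) - block / (b - a)

    best = np.full(n + 1, np.inf)
    best[0] = -pen          # offset: first segment incurs no penalty
    prev = np.zeros(n + 1, dtype=int)
    for t in range(min_size, n + 1):
        for s in range(0, t - min_size + 1):
            if not np.isfinite(best[s]):
                continue
            c = best[s] + seg_cost(s, t) + pen
            if c < best[t]:
                best[t], prev[t] = c, s
    # Backtrack the optimal partition.
    bounds, t = [], n
    while t > 0:
        t = prev[t]
        if t > 0:
            bounds.append(t)
    return sorted(bounds)

# Demo on synthetic "embeddings": two well-separated 4-d clusters
# with a single change at index 10.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 1.0, (10, 4)),
               rng.normal(5.0, 1.0, (10, 4))])
bounds = kcpd(X, pen=1.0, gamma=0.1)
```

On this toy input the recovered boundary list contains the true change point at index 10; the penalty term is what keeps the minimizer from over-segmenting the homogeneous stretches.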
Problem

Research questions and friction points this paper is trying to address.

Unsupervised Text Segmentation
Change-Point Detection
Sentence Embeddings
Boundary Detection
Short-Range Dependence
Innovation

Methods, ideas, or system contributions that make the work stand out.

Kernel Change-Point Detection
Unsupervised Text Segmentation
m-dependent sequences
Sentence Embeddings
LLM-based Simulation
Mumin Jia
Department of Mathematics and Statistics, York University, Toronto, Canada
Jairo Diaz-Rodriguez
Assistant professor, York University
Data Science, High-Dimensional Statistics, Machine Learning, Inverse Problems