StyleDistance: Stronger Content-Independent Style Embeddings with Synthetic Parallel Examples

📅 2024-10-16
🏛️ arXiv.org
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing style embedding methods suffer from content leakage because their contrastive triplets differ along both style and content dimensions, undermining the purity of the learned style representations. To address this, the authors propose a fine-grained style contrastive learning framework built on synthetically generated parallel examples: using a large language model, they construct high-fidelity, content-preserving parallel texts spanning 40 stylistic features, and style–content disentanglement is then encouraged through a triplet or NT-Xent contrastive objective. The resulting embeddings show markedly stronger content invariance, generalize to real-world benchmarks, and outperform leading prior style representations in downstream applications such as style classification and style transfer. The pre-trained model is publicly released on Hugging Face.
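The summary mentions training with a triplet objective over parallel examples: the anchor and positive share a style but differ in content, while the negative is a content-matched paraphrase in a different style. A minimal NumPy sketch of that margin-based triplet loss (the embeddings and values here are illustrative toys, not the paper's implementation):

```python
import numpy as np

def triplet_loss(anchor, positive, negative, margin=1.0):
    """Margin-based triplet loss: pull the anchor toward the positive
    (same style, different content) and push it away from the negative
    (different style, same content)."""
    d_pos = np.linalg.norm(anchor - positive)
    d_neg = np.linalg.norm(anchor - negative)
    return max(0.0, d_pos - d_neg + margin)

# Toy 4-d "style embeddings" (a real encoder would produce these).
anchor   = np.array([1.0, 0.0, 0.0, 0.0])  # style A, content X
positive = np.array([0.9, 0.1, 0.0, 0.0])  # style A, new content Y
negative = np.array([0.5, 0.5, 0.0, 0.0])  # style B, paraphrase of X

loss = triplet_loss(anchor, positive, negative)
print(round(loss, 4))  # → 0.4343
```

Because the negative is a near-exact paraphrase of the anchor, the only signal separating them is style, which is what keeps content from leaking into the representation.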

📝 Abstract
Style representations aim to embed texts with similar writing styles closely and texts with different styles far apart, regardless of content. However, the contrastive triplets often used for training these representations may vary in both style and content, leading to potential content leakage in the representations. We introduce StyleDistance, a novel approach to training stronger content-independent style embeddings. We use a large language model to create a synthetic dataset of near-exact paraphrases with controlled style variations, and produce positive and negative examples across 40 distinct style features for precise contrastive learning. We assess the quality of our synthetic data and embeddings through human and automatic evaluations. StyleDistance enhances the content-independence of style embeddings, which generalize to real-world benchmarks and outperform leading style representations in downstream applications. Our model can be found at https://huggingface.co/StyleDistance/styledistance.
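The other contrastive objective named in the summary, NT-Xent, scores each parallel pair against all other examples in the batch as in-batch negatives. A self-contained NumPy sketch under the assumption of paired batches of style embeddings (illustrative only, not the released training code):

```python
import numpy as np

def nt_xent(z_i, z_j, temperature=0.5):
    """NT-Xent (normalized temperature-scaled cross-entropy) for a batch
    of paired style embeddings: row k of z_i and row k of z_j form a
    positive pair; every other row in the batch acts as a negative."""
    z = np.vstack([z_i, z_j])
    z = z / np.linalg.norm(z, axis=1, keepdims=True)  # cosine similarity space
    sim = z @ z.T / temperature                       # scaled pairwise similarities
    np.fill_diagonal(sim, -np.inf)                    # exclude self-similarity
    n = len(z_i)
    # Each row's target is its parallel partner in the other half.
    targets = np.concatenate([np.arange(n, 2 * n), np.arange(n)])
    log_prob = sim - np.log(np.exp(sim).sum(axis=1, keepdims=True))
    return -log_prob[np.arange(2 * n), targets].mean()

# Toy paired batch: z_j rows are noisy copies of z_i rows, standing in
# for embeddings of same-style, different-content parallel texts.
rng = np.random.default_rng(0)
z_i = rng.normal(size=(4, 8))
z_j = z_i + 0.1 * rng.normal(size=(4, 8))
loss = nt_xent(z_i, z_j)
```

Lower loss means each embedding ranks its parallel partner above every other text in the batch, which is exactly the "precise contrastive learning" the abstract describes.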
Problem

Research questions and friction points this paper is trying to address.

Enhance content-independent style embeddings
Generate synthetic parallel examples for training
Improve style representation generalization and performance
Innovation

Methods, ideas, or system contributions that make the work stand out.

Synthetic dataset creation
Controlled style variations
Precise contrastive learning