Temporal Tokenization Strategies for Event Sequence Modeling with Large Language Models

📅 2025-12-15
📈 Citations: 0
Influential: 0
🤖 AI Summary
This paper addresses the challenge of temporal representation in large language models (LLMs) for continuous-time event sequence modeling. It systematically evaluates five time tokenization strategies (byte encoding, adaptive residual scalar quantization, calendar-semantic formatting, uniform binning, and raw numeric string encoding) across diverse real-world event-time distributions, e.g., log-normal and discrete spiky patterns. To the authors' knowledge, this is the first empirical study of time tokenization for LLMs. Experiments conducted within an LLM fine-tuning framework on multiple real-world event datasets support a strategy–distribution alignment principle: log-based tokenization performs best on skewed distributions, while calendar-semantic (human-centric) formatting is most robust on multimodal and mixed distributions. The work establishes a reproducible, distribution-aware methodology for temporal representation in event sequence modeling.

📝 Abstract
Representing continuous time is a critical and under-explored challenge in modeling temporal event sequences with large language models (LLMs). Various strategies, such as byte-level representations or calendar tokens, have been proposed, but the optimal approach remains unclear, especially given the diverse statistical distributions of real-world event data, which range from smooth log-normal to discrete, spiky patterns. This paper presents the first empirical study of temporal tokenization for event sequences, comparing five distinct encoding strategies: naive numeric strings, high-precision byte-level representations, human-semantic calendar tokens, classic uniform binning, and adaptive residual scalar quantization. We evaluate these strategies by fine-tuning LLMs on real-world datasets that exemplify these diverse distributions. Our analysis reveals that no single strategy is universally superior; instead, prediction performance depends heavily on aligning the tokenizer with the data's statistical properties, with log-based strategies excelling on skewed distributions and human-centric formats proving robust for mixed modalities.
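To make two of the compared strategies concrete, here is a minimal sketch of what log-space uniform binning and calendar-semantic tokenization could look like for inter-event gaps. The bin count, clipping range, and token format are illustrative assumptions, not the paper's actual settings.

```python
import datetime
import math

def log_bin_tokenize(dt_seconds, n_bins=32, min_dt=1e-3, max_dt=86_400.0):
    """Map an inter-event gap (seconds) to a discrete token id by uniform
    binning in log space, so skewed gap distributions get even coverage.
    Bin count and clipping range are arbitrary illustrative choices."""
    dt = min(max(dt_seconds, min_dt), max_dt)            # clip to a sane range
    frac = math.log(dt / min_dt) / math.log(max_dt / min_dt)  # 0..1 in log space
    return min(int(frac * n_bins), n_bins - 1)

def calendar_tokenize(ts):
    """Render a POSIX timestamp as human-semantic calendar tokens
    (hypothetical token vocabulary, shown here for illustration)."""
    d = datetime.datetime.fromtimestamp(ts, tz=datetime.timezone.utc)
    return [f"<month_{d.month}>", f"<day_{d.day}>", f"<hour_{d.hour}>"]
```

The log transform is what lets a fixed vocabulary resolve both millisecond-scale and day-scale gaps; a plain uniform binning over the raw values would spend nearly all bins on the long tail.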
Problem

Research questions and friction points this paper is trying to address.

Evaluates temporal tokenization strategies for event sequence modeling
Compares encoding methods for representing continuous time in LLMs
Identifies optimal tokenization based on data's statistical properties
Innovation

Methods, ideas, or system contributions that make the work stand out.

Adaptive residual scalar quantization for time encoding
Log-based tokenization strategies for skewed distributions
Human-centric calendar tokens for mixed modality robustness
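The residual scalar quantization idea listed above can be sketched as a two-stage quantizer: a coarse code plus a fine code over the leftover residual. The fixed step sizes here are illustrative assumptions; the paper's adaptive variant would fit them to the observed gap distribution.

```python
import math

def residual_scalar_quantize(x, step1=1.0, step2=0.1):
    """Two-stage residual scalar quantization: quantize x coarsely,
    then quantize the remaining residual at a finer step.
    Step sizes are illustrative, not the paper's fitted values."""
    coarse = math.floor(x / step1)
    residual = x - coarse * step1
    fine = math.floor(residual / step2)
    return coarse, fine

def dequantize(coarse, fine, step1=1.0, step2=0.1):
    # Reconstruct the midpoint of the selected fine cell.
    return coarse * step1 + (fine + 0.5) * step2
```

Emitting the two codes as separate tokens keeps the vocabulary small (|coarse| + |fine| entries rather than their product) while still bounding the reconstruction error by half the fine step.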
Zefang Liu
Capital One, USA
Nam Nguyen
Capital One, USA
Yinzhu Quan
Georgia Institute of Technology
Austin Zhang
Capital One, USA