Temporal Tokenization Strategies for Event Sequence Modeling with Large Language Models

📅 2025-12-15
📈 Citations: 0
Influential: 0
🤖 AI Summary
This paper addresses the challenge of temporal representation in large language models (LLMs) for continuous-time event sequence modeling. It systematically evaluates five time tokenization strategies (byte encoding, adaptive residual scalar quantization, calendar-semantic formatting, uniform binning, and raw numeric string encoding) across diverse real-world event-time distributions, e.g., log-normal and discrete spiky patterns. To the authors' knowledge, this is the first empirical study of time tokenization for LLMs. Experiments conducted within an LLM fine-tuning framework on multiple real-world event datasets support a strategy–distribution alignment principle: log-based tokenization performs best on skewed distributions, while calendar-semantic (human-centric) formatting is most robust on multimodal and mixed distributions. The work establishes a reproducible, distribution-aware methodology for temporal representation in event sequence modeling.

📝 Abstract
Representing continuous time is a critical and under-explored challenge in modeling temporal event sequences with large language models (LLMs). Various strategies, such as byte-level representations or calendar tokens, have been proposed, but the optimal approach remains unclear, especially given the diverse statistical distributions of real-world event data, which range from smooth log-normal to discrete, spiky patterns. This paper presents the first empirical study of temporal tokenization for event sequences, comparing five distinct encoding strategies: naive numeric strings, high-precision byte-level representations, human-semantic calendar tokens, classic uniform binning, and adaptive residual scalar quantization. We evaluate these strategies by fine-tuning LLMs on real-world datasets that exemplify these diverse distributions. Our analysis reveals that no single strategy is universally superior; instead, prediction performance depends heavily on aligning the tokenizer with the data's statistical properties, with log-based strategies excelling on skewed distributions and human-centric formats proving robust for mixed modalities.
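To make two of the compared strategies concrete, here is a minimal sketch of what log-space uniform binning and calendar-semantic tokenization could look like for inter-event gaps. The bin count, clipping range, and token format are illustrative assumptions, not the paper's actual settings.

```python
import datetime
import math

def log_bin_tokenize(dt_seconds, n_bins=32, min_dt=1e-3, max_dt=86_400.0):
    """Map an inter-event gap (seconds) to a discrete token id by uniform
    binning in log space, so skewed gap distributions get even coverage.
    Bin count and clipping range are arbitrary illustrative choices."""
    dt = min(max(dt_seconds, min_dt), max_dt)            # clip to a sane range
    frac = math.log(dt / min_dt) / math.log(max_dt / min_dt)  # 0..1 in log space
    return min(int(frac * n_bins), n_bins - 1)

def calendar_tokenize(ts):
    """Render a POSIX timestamp as human-semantic calendar tokens
    (hypothetical token vocabulary, shown here for illustration)."""
    d = datetime.datetime.fromtimestamp(ts, tz=datetime.timezone.utc)
    return [f"<month_{d.month}>", f"<day_{d.day}>", f"<hour_{d.hour}>"]
```

The log transform is what lets a fixed vocabulary resolve both millisecond-scale and day-scale gaps; a plain uniform binning over the raw values would spend nearly all bins on the long tail.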
Problem

Research questions and friction points this paper is trying to address.

Evaluates temporal tokenization strategies for event sequence modeling
Compares encoding methods for representing continuous time in LLMs
Identifies optimal tokenization based on data's statistical properties
Innovation

Methods, ideas, or system contributions that make the work stand out.

Adaptive residual scalar quantization for time encoding
Log-based tokenization strategies for skewed distributions
Human-centric calendar tokens for mixed modality robustness
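The residual scalar quantization idea listed above can be sketched as a two-stage quantizer: a coarse code plus a fine code over the leftover residual. The fixed step sizes here are illustrative assumptions; the paper's adaptive variant would fit them to the observed gap distribution.

```python
import math

def residual_scalar_quantize(x, step1=1.0, step2=0.1):
    """Two-stage residual scalar quantization: quantize x coarsely,
    then quantize the remaining residual at a finer step.
    Step sizes are illustrative, not the paper's fitted values."""
    coarse = math.floor(x / step1)
    residual = x - coarse * step1
    fine = math.floor(residual / step2)
    return coarse, fine

def dequantize(coarse, fine, step1=1.0, step2=0.1):
    # Reconstruct the midpoint of the selected fine cell.
    return coarse * step1 + (fine + 0.5) * step2
```

Emitting the two codes as separate tokens keeps the vocabulary small (|coarse| + |fine| entries rather than their product) while still bounding the reconstruction error by half the fine step.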
Zefang Liu
Capital One, USA
Nam Nguyen
Capital One, USA
Yinzhu Quan
Georgia Institute of Technology
Austin Zhang
Capital One, USA