🤖 AI Summary
Existing FOLTR benchmarks predominantly rely on static datasets with random splits and synchronous assumptions, failing to capture the dynamic and asynchronous nature of real-world search scenarios. To address this, we introduce AOL4FOLTR, the first large-scale, realistic search dataset tailored for Federated Online Learning to Rank (FOLTR). Built from 2.6 million queries and click logs of 10,000 users, it preserves timestamps and user identifiers, enabling fine-grained user partitioning and asynchronous federated training. We further propose an end-to-end FOLTR framework integrating realistic log preprocessing, sequential user behavior modeling, and privacy-preserving distributed model aggregation. Our approach jointly ensures modeling fidelity and data privacy. AOL4FOLTR and the accompanying framework significantly enhance experimental realism and reproducibility in FOLTR research, establishing critical infrastructure for privacy-aware search ranking.
📝 Abstract
The centralized collection of search interaction logs for training ranking models raises significant privacy concerns. Federated Online Learning to Rank (FOLTR) offers a privacy-preserving alternative by enabling collaborative model training without sharing raw user data. However, benchmarks in FOLTR are largely based on random partitioning of classical learning-to-rank datasets, simulated user clicks, and the assumption of synchronous client participation. This oversimplifies real-world dynamics and undermines the realism of experimental results. We present AOL4FOLTR, a large-scale web search dataset with 2.6 million queries from 10,000 users. Our dataset addresses key limitations of existing benchmarks by including user identifiers, real click data, and query timestamps, enabling realistic user partitioning, behavior modeling, and asynchronous federated learning scenarios.
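To make the partitioning idea concrete, the sketch below shows how a click log carrying user identifiers and timestamps can be split into per-user streams (one federated client per user) and merged into a global, timestamp-ordered arrival schedule for simulating asynchronous participation. This is a minimal illustration, not the paper's implementation; the record layout (`user_id`, `timestamp`, `query`, clicked document ids) is a hypothetical schema, not the actual AOL4FOLTR format.

```python
from collections import defaultdict

# Illustrative log records: (user_id, timestamp, query, clicked_doc_ids).
# Field layout is an assumption for this sketch, not the dataset's schema.
logs = [
    ("u1", 5, "weather", [2]),
    ("u2", 1, "news", [0]),
    ("u1", 3, "maps", [1]),
    ("u2", 9, "sports", []),
]

def partition_by_user(records):
    """Group records per user and sort each user's stream by timestamp,
    so each federated client replays its own queries in order."""
    clients = defaultdict(list)
    for rec in records:
        clients[rec[0]].append(rec)
    for uid in clients:
        clients[uid].sort(key=lambda r: r[1])
    return dict(clients)

def async_schedule(clients):
    """Merge all client streams into one global timeline: a client
    contributes an update whenever its next query fires, rather than
    in synchronized rounds."""
    events = [rec for recs in clients.values() for rec in recs]
    events.sort(key=lambda r: r[1])
    return [(r[1], r[0]) for r in events]  # (timestamp, user_id) arrivals
```

Under this scheme, random splits are replaced by identity-based partitioning, and the merged timeline drives when each client's local update reaches the aggregator.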