🤖 AI Summary
This study addresses a limitation of existing methods for aligning financial text with time-series data: they rely predominantly on keyword matching and fail to capture the multi-level influences on stock prices, which span macroeconomic conditions, industry dynamics, peer companies, and the target firm itself. To overcome this, the authors propose a pairing framework that combines semantic matching with a four-level news classification scheme. Specifically, they extract contextual information about the target firm from SEC filings, retrieve semantically relevant news articles using embedding-based matching, and use a large language model to assign each article to one of four levels (macro, sector, related company, or target company). Applying the framework to publicly available news yields a large-scale dataset, FinTexTS, and experiments on it show that the semantic, multi-level pairing strategy is effective for stock price forecasting. Applying the same method to proprietary, carefully curated news sources yields higher-quality paired data and further improves forecasting performance.
📝 Abstract
The financial domain involves a variety of important time-series problems. Recently, time-series analysis methods that jointly leverage textual and numerical information have gained increasing attention. Accordingly, numerous efforts have been made to construct text-paired time-series datasets in the financial domain. However, financial markets are characterized by complex interdependencies, in which a company's stock price is influenced not only by company-specific events but also by events at other companies and by broader macroeconomic factors. Existing approaches that pair text with financial time-series data through simple keyword matching often fail to capture such relationships. To address this limitation, we propose a semantic-based, multi-level pairing framework. Specifically, we extract company-specific context for the target company from SEC filings and apply an embedding-based matching mechanism to retrieve news articles that are semantically relevant to this context. Furthermore, we classify news articles into four levels (macro level, sector level, related-company level, and target-company level) using large language models (LLMs), enabling multi-level pairing of news articles with the target company. Applying this framework to publicly available news datasets, we construct FinTexTS, a new large-scale text-paired stock price dataset. Experimental results on FinTexTS demonstrate the effectiveness of our semantic-based, multi-level pairing strategy in stock price forecasting. Beyond the publicly available news underlying FinTexTS, we show that applying our method to proprietary yet carefully curated news sources leads to higher-quality paired data and improved stock price forecasting performance.
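The pairing steps described in the abstract (retrieve articles that are semantically close to the company context, then assign each one a level) can be illustrated with a minimal sketch. Everything here is an assumption for illustration rather than the authors' implementation: `embed` is a toy bag-of-words stand-in for a real embedding model, the `classify` argument is a placeholder for an LLM call, and the names `Level`, `retrieve`, and `pair_news` are hypothetical.

```python
import math
import re
from collections import Counter
from enum import Enum


class Level(Enum):
    """The four levels named in the abstract (identifiers are hypothetical)."""
    MACRO = "macro"
    SECTOR = "sector"
    RELATED_COMPANY = "related_company"
    TARGET_COMPANY = "target_company"


def embed(text):
    """Toy stand-in for an embedding model: lowercase word counts."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def retrieve(context, articles, top_k=2):
    """Return the top_k (score, article) pairs most similar to the context."""
    query = embed(context)
    scored = [(cosine(query, embed(article)), article) for article in articles]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:top_k]


def pair_news(context, articles, classify, top_k=2):
    """Pair each retrieved article with a Level chosen by `classify`.

    `classify(context, article)` is a placeholder for an LLM-based classifier.
    """
    return [
        (article, classify(context, article))
        for _, article in retrieve(context, articles, top_k)
    ]
```

In practice the company context would come from SEC filings and `embed` would be replaced by a learned text-embedding model; the sketch only shows how retrieval and level assignment compose.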