An Index-based Approach for Efficient and Effective Web Content Extraction

📅 2025-12-06
🤖 AI Summary
Existing web content extraction methods suffer from three key limitations at scale: low efficiency (high latency of generative models), poor adaptability (weak generalization of rule-based approaches), and structural neglect (chunking and re-ranking discard the semantics carried by HTML structure). To address these, the paper proposes an index-based paradigm that reformulates content extraction as a structure-aware, discriminative index prediction task, shifting from generation to localization. The method introduces an HTML-structure-aware segmentation mechanism and an addressable fragment indexing scheme, so that lightweight discriminative models can directly predict the positions of query-relevant fragments. This decouples extraction latency from webpage length. According to the summary, this is the first work to make this shift. Experiments report state-of-the-art performance across three tasks (RAG-based QA, main-content extraction, and query-relevant extraction), with higher accuracy (improved match rates), lower latency (faster inference), and stronger robustness.

📝 Abstract
As web agents (e.g., Deep Research) routinely consume massive volumes of web pages to gather and analyze information, LLM context management, which must cope with large token budgets and low signal density, emerges as a foundational and technically challenging problem for agentic and RAG pipelines. Existing solutions for extracting relevant content are inadequate: generative extraction models suffer from high latency, rule-based heuristics lack adaptability, and chunk-and-rerank methods are blind to webpage structure. To overcome these issues, we introduce Index-based Web Content Extraction, which reframes extraction from slow, token-by-token generation into an efficient, discriminative index prediction task, achieving both effectiveness and efficiency. We partition HTML into structure-aware, addressable segments and extract only the positional indices of content relevant to a given query. This decouples extraction latency from content length, enabling rapid, query-relevant extraction. We first evaluate our method as a post-retrieval processing component within a RAG QA system and find that it improves QA accuracy. We then directly measure its match rate against the target content in two scenarios: main content extraction (ME) and query-relevant extraction (QE). Experimental results show that our method outperforms existing work in both accuracy and speed, effectively bridging the gap between LLMs and the vast body of web pages.
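
The pipeline described in the abstract (segment the HTML, predict indices, look the segments up) can be sketched as follows. This is a minimal illustration, not the paper's implementation: the choice of `BLOCK_TAGS`, the names `SegmentParser`, `segment_html`, `predict_indices`, and `extract`, and the keyword-overlap scorer standing in for the discriminative model are all assumptions made for the example. It also assumes well-formed HTML in which block elements are explicitly closed.

```python
from html.parser import HTMLParser

# Block-level tags treated as segment boundaries (an illustrative choice,
# not the segmentation rule used in the paper).
BLOCK_TAGS = {"p", "li", "h1", "h2", "h3", "td", "pre", "blockquote"}


class SegmentParser(HTMLParser):
    """Collect the text of each outermost block element as one segment."""

    def __init__(self):
        super().__init__()
        self.segments = []  # list position doubles as the segment's index
        self._buffer = []
        self._depth = 0     # > 0 while inside a block-level element

    def handle_starttag(self, tag, attrs):
        if tag in BLOCK_TAGS:
            self._depth += 1

    def handle_endtag(self, tag):
        if tag in BLOCK_TAGS and self._depth > 0:
            self._depth -= 1
            if self._depth == 0:
                # Normalize whitespace and store non-empty text.
                text = " ".join("".join(self._buffer).split())
                if text:
                    self.segments.append(text)
                self._buffer = []

    def handle_data(self, data):
        if self._depth > 0:
            self._buffer.append(data)


def segment_html(html):
    """Partition HTML into addressable text segments."""
    parser = SegmentParser()
    parser.feed(html)
    parser.close()
    return parser.segments


def predict_indices(query, segments):
    """Hypothetical stand-in for the index-prediction model.

    Uses plain keyword overlap; the paper uses a learned discriminative model.
    """
    terms = set(query.lower().split())
    return [i for i, seg in enumerate(segments)
            if terms & set(seg.lower().split())]


def extract(segments, indices):
    """Recover content by index lookup; no text is generated."""
    return [segments[i] for i in indices]
```

Given a page, `segment_html` yields the addressable segments, `predict_indices` returns only positions, and `extract` maps those positions back to text, so the model's output never has to reproduce the content itself.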

Problem

Research questions and friction points this paper is trying to address.

Extracting relevant web content for LLM agents efficiently
Keeping extraction latency independent of content length
Improving both the accuracy and the speed of web content extraction

Innovation

Methods, ideas, or system contributions that make the work stand out.

Index-based extraction reframes the process as discriminative index prediction
Partitions HTML into structure-aware, addressable segments for efficiency
Decouples extraction latency from content length, enabling rapid query-relevant extraction
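
One way to see the decoupling claimed in the last point is to count what each approach must emit. The sketch below is an illustration under stated assumptions, not a measurement from the paper: whitespace-separated words serve as a rough proxy for output tokens, input encoding cost is ignored, and the function names and toy pages are hypothetical.

```python
def generative_output_size(segments, indices):
    """Words a generative extractor emits when it copies relevant text out."""
    return sum(len(segments[i].split()) for i in indices)


def index_output_size(indices):
    """Items an index-based extractor emits: one per selected segment."""
    return len(indices)


# Two toy pages with the same number of relevant segments but very
# different segment lengths.
short_page = ["alpha beta", "gamma delta"]
long_page = [seg + " filler" * 200 for seg in short_page]
selected = [0, 1]

for name, page in (("short", short_page), ("long", long_page)):
    print(name,
          generative_output_size(page, selected),
          index_output_size(selected))
```

Under this proxy, the generative output grows with the length of the relevant content, while the index output depends only on how many segments are selected.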