An Index-based Approach for Efficient and Effective Web Content Extraction
By: Yihan Chen, Benfeng Xu, Xiaorui Wang and more
Potential Business Impact:
Finds important web info super fast.
As web agents (e.g., Deep Research) routinely consume massive volumes of web pages to gather and analyze information, LLM context management, under large token budgets and low signal density, emerges as a foundational, high-importance, and technically challenging problem for agentic and RAG pipelines. Existing solutions for extracting relevant content are inadequate: generative extraction models suffer from high latency, rule-based heuristics lack adaptability, and chunk-and-rerank methods are blind to webpage structure. To overcome these issues, we introduce Index-based Web Content Extraction, which reframes extraction from slow, token-by-token generation into an efficient, discriminative task of index prediction. We partition HTML into structure-aware, addressable segments and extract only the positional indices of content relevant to a given query. This decouples extraction latency from content length, enabling rapid, query-relevant extraction. We first evaluate our method as a post-retrieval processing component within a RAG QA system and find that it improves QA accuracy. We then directly measure its match rate with the target content in two scenarios: main content extraction (ME) and query-relevant extraction (QE). Experimental results show that our method outperforms existing work in both accuracy and speed, effectively bridging the gap between LLMs and the vast body of web pages.
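The index-prediction idea described in the abstract can be illustrated with a minimal Python sketch. Everything named here is an assumption made for illustration: the set of block tags used as segment boundaries, the function names, and above all the keyword-overlap scorer, which is only a stand-in for the trained model and is not the authors' segmentation scheme or predictor. The sketch also assumes well-formed HTML with closed tags.

```python
from html.parser import HTMLParser

# Block-level tags treated as segment boundaries (an assumption of this sketch).
BLOCK_TAGS = {"p", "li", "h1", "h2", "h3", "td", "pre"}


class SegmentParser(HTMLParser):
    """Collects the text of each outermost block element as one segment."""

    def __init__(self):
        super().__init__()
        self.segments = []  # segment texts; the list position is the index
        self._buffer = []   # text pieces of the block currently open
        self._depth = 0     # nesting depth inside block tags

    def handle_starttag(self, tag, attrs):
        if tag in BLOCK_TAGS:
            self._depth += 1

    def handle_endtag(self, tag):
        if tag in BLOCK_TAGS and self._depth > 0:
            self._depth -= 1
            if self._depth == 0:
                # Normalize whitespace and store the finished segment.
                text = " ".join("".join(self._buffer).split())
                if text:
                    self.segments.append(text)
                self._buffer = []

    def handle_data(self, data):
        if self._depth > 0:
            self._buffer.append(data)


def partition(html):
    """Split HTML into addressable text segments."""
    parser = SegmentParser()
    parser.feed(html)
    parser.close()
    return parser.segments


def predict_indices(segments, query):
    """Hypothetical placeholder for the index predictor: keyword overlap."""
    terms = set(query.lower().split())
    return [i for i, seg in enumerate(segments)
            if terms & set(seg.lower().split())]


def extract(segments, indices):
    """Rebuild the extracted content by looking up the predicted indices."""
    return "\n".join(segments[i] for i in indices)


page = (
    "<h1>Solar panels</h1>"
    "<p>Panels convert sunlight into electricity.</p>"
    "<p>Subscribe to our newsletter.</p>"
    "<ul><li>Typical efficiency is near 20 percent.</li></ul>"
)
segments = partition(page)
indices = predict_indices(segments, "panels efficiency")  # [0, 1, 3]
print(extract(segments, indices))
```

The point of the sketch is the shape of the output: the predictor returns a short list of integers rather than regenerated text, so the amount of output does not grow with the length of the selected segments, and the content itself is recovered by a plain lookup.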