Web Page Classification using LLMs for Crawling Support
By: Yuichi Sasazawa, Yasuhiro Sogawa
Potential Business Impact:
Finds new web pages faster by classifying them.
A web crawler is a system that collects web pages, and crawling new pages efficiently requires an appropriate strategy. While website features such as XML sitemaps and the frequency of past page updates provide important clues for reaching new pages, they cannot be applied universally across diverse site conditions. In this study, we propose a method to efficiently collect new pages by classifying web pages into two types, "Index Pages" and "Content Pages," using a large language model (LLM), and leveraging the classification results to select index pages as starting points for reaching new pages. We construct a dataset with automatically annotated web page types and evaluate our approach from two perspectives: page type classification performance and the coverage of new pages. Experimental results demonstrate that the LLM-based method outperforms the baseline methods on both evaluation metrics.
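As a rough illustration of the approach described in the abstract, the sketch below classifies a fetched page as an index page or a content page via an LLM prompt and keeps only index pages as crawl seeds. The prompt wording, the truncation length, the helper names (`classify_page`, `select_crawl_seeds`), and the use of the OpenAI chat completions API are assumptions for illustration, not the authors' implementation.

```python
# Minimal sketch, assuming an OpenAI-style chat API; not the paper's actual pipeline.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

PROMPT = (
    "Classify the following web page as 'index' (a page that mainly links to "
    "other pages, e.g. a category or archive page) or 'content' (a page whose "
    "main purpose is its own article or body text). Answer with one word.\n\n"
    "URL: {url}\n\nPage text (truncated):\n{text}"
)

def classify_page(url: str, text: str, model: str = "gpt-4o-mini") -> str:
    """Return 'index' or 'content' for a single page (prompt and model are illustrative)."""
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": PROMPT.format(url=url, text=text[:4000])}],
        temperature=0,
    )
    answer = response.choices[0].message.content.strip().lower()
    return "index" if "index" in answer else "content"

def select_crawl_seeds(pages: dict[str, str]) -> list[str]:
    """Keep only pages classified as index pages to use as starting points."""
    return [url for url, text in pages.items() if classify_page(url, text) == "index"]
```

In a crawler loop, the selected index pages would then be revisited periodically, since the abstract's premise is that new content pages are most easily reached from such pages.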
Similar Papers
Craw4LLM: Efficient Web Crawling for LLM Pretraining
Computation and Language
Finds better web pages for AI learning.
Preprint: Did I Just Browse A Website Written by LLMs?
Networking and Internet Architecture
Finds websites made mostly by AI.
Leveraging LLMs to Create Content Corpora for Niche Domains
Computation and Language
Builds helpful lists from internet information.