The Growing Gains and Pains of Iterative Web Corpora Crawling: Insights from South Slavic CLASSLA-web 2.0 Corpora

Published: January 16, 2026 | arXiv ID: 2601.11170v1

By: Taja Kuzman Pungeršek, Peter Rupnik, Vít Suchomel and more

Potential Business Impact:

Collects more words from the internet.

Business Areas:

Semantic Web, Internet Services

Crawling national top-level domains has proven highly effective for collecting texts in less-resourced languages. This approach was recently applied to South Slavic languages and produced the largest general corpora for this language group: the CLASSLA-web 1.0 corpora. Building on this success, we established a continuous crawling infrastructure for iterative national top-level domain crawling across South Slavic and related webs. We present the first outcome of this infrastructure, the CLASSLA-web 2.0 corpus collection, with substantially larger web corpora containing 17.0 billion words in 38.1 million texts in seven languages: Bosnian, Bulgarian, Croatian, Macedonian, Montenegrin, Serbian, and Slovenian. In addition to genre categories, the new version is also automatically annotated with topic labels. Comparing CLASSLA-web 2.0 with its predecessor reveals that only one-fifth of the texts overlap, showing that re-crawling after just two years yields largely new content. However, while the new web crawls bring growing gains, we also notice growing pains: a manual inspection of top domains reveals a visible degradation of web content, as machine-generated sites now contribute a significant portion of texts.
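The abstract does not say how the overlap between the two corpus versions was measured. As a rough illustration only, the sketch below shows one simple way such a share could be estimated: hash each text after light normalization and count how many texts in the newer crawl also occur in the older one. The function names (`fingerprint`, `overlap_share`), the normalization steps, and the toy data are all assumptions made for this example, not the authors' method.

```python
import hashlib


def fingerprint(text: str) -> str:
    """Hash a text after lowercasing and collapsing whitespace (assumed normalization)."""
    normalized = " ".join(text.lower().split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def overlap_share(old_texts, new_texts) -> float:
    """Return the fraction of new-crawl texts whose fingerprint also occurs in the old crawl."""
    old_prints = {fingerprint(t) for t in old_texts}
    new_list = list(new_texts)
    if not new_list:
        return 0.0
    shared = sum(1 for t in new_list if fingerprint(t) in old_prints)
    return shared / len(new_list)


# Toy data (hypothetical): one of five new texts repeats an old text,
# differing only in whitespace.
old_crawl = ["Dobar dan, svijete.", "Vijesti iz Zagreba.", "Recept za burek."]
new_crawl = [
    "Dobar  dan,  svijete.",
    "Nova objava.",
    "Još jedna objava.",
    "Sport danas.",
    "Vrijeme sutra.",
]

share = overlap_share(old_crawl, new_crawl)
print(f"{share:.0%} of new texts also appear in the old crawl")  # prints "20% ..."
```

Exact-match hashing only catches identical texts. A real comparison would more likely match on URLs or use near-duplicate detection, which would change the resulting share.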

Web Page Classification using LLMs for Crawling Support

Information Retrieval

Finds new web pages faster by sorting them.

11 May 2025 1

86%

Going over Fine Web with a Fine-Tooth Comb: Technical Report of Indexing Fine Web for Problematic Content Search and Retrieval

Computation and Language

Finds bad stuff in AI training data fast.

29 Aug 2025 1

86%

Multilingual corpora for the study of new concepts in the social sciences and humanities:

Computation and Language

Helps computers understand new ideas from company websites.

8 Dec 2025 0

Country of Origin

🇨🇿 Czech Republic

Page Count

10 pages
