20min-XD: A Comparable Corpus of Swiss News Articles
By: Michelle Wastl, Jannis Vamvas, Selena Calleri and more
Potential Business Impact:
Provides aligned French-German news article pairs for building and evaluating cross-lingual NLP systems.
We present 20min-XD (20 Minuten cross-lingual document-level), a French-German, document-level comparable corpus of news articles, sourced from the Swiss online news outlet 20 Minuten/20 minutes. Our dataset comprises around 15,000 article pairs spanning 2015 to 2024, automatically aligned based on semantic similarity. We detail the data collection process and alignment methodology. Furthermore, we provide a qualitative and quantitative analysis of the corpus. The resulting dataset exhibits a broad spectrum of cross-lingual similarity, ranging from near-translations to loosely related articles, making it valuable for various NLP applications and broad linguistically motivated studies. We publicly release the dataset in document- and sentence-aligned versions and code for the described experiments.
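The abstract says only that article pairs were aligned automatically by semantic similarity. As a rough illustration of that idea, the sketch below pairs documents by cosine similarity over embedding vectors and keeps mutual best matches above a threshold. It is not the authors' pipeline: the toy vectors, the function names `cosine` and `align_documents`, the mutual-best-match rule, and the threshold value are all assumptions made for this example.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def align_documents(de_vecs, fr_vecs, threshold=0.5):
    """Return (de_index, fr_index, similarity) for mutual best matches.

    Hypothetical helper: a pair is kept only if each document is the other's
    nearest neighbour and their similarity reaches the threshold.
    """
    if not de_vecs or not fr_vecs:
        return []
    sims = [[cosine(d, f) for f in fr_vecs] for d in de_vecs]
    pairs = []
    for i, row in enumerate(sims):
        # Best French candidate for German document i.
        j = max(range(len(row)), key=row.__getitem__)
        # Best German candidate for that French document.
        best_i = max(range(len(de_vecs)), key=lambda k: sims[k][j])
        if best_i == i and row[j] >= threshold:
            pairs.append((i, j, row[j]))
    return pairs


# Toy 3-dimensional "embeddings" standing in for real document vectors.
de = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.2], [0.0, 0.0, 1.0]]
fr = [[0.1, 0.9, 0.1], [0.9, 0.1, 0.0], [0.5, 0.5, 0.5]]

pairs = align_documents(de, fr, threshold=0.8)
for de_idx, fr_idx, score in pairs:
    print(f"de[{de_idx}] <-> fr[{fr_idx}]  similarity={score:.3f}")
```

In this toy run the third German vector stays unpaired, because its closest French vector is more similar to another German document. In practice the vectors would come from a multilingual embedding model, and lowering the threshold would admit more loosely related pairs.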
Similar Papers
CrossNews-UA: A Cross-lingual News Semantic Similarity Benchmark for Ukrainian, Polish, Russian, and English
Computation and Language
Finds fake news by comparing stories in different languages.
SwissGov-RSD: A Human-annotated, Cross-lingual Benchmark for Token-level Recognition of Semantic Differences Between Related Documents
Computation and Language
Helps computers find differences between texts in different languages.
taz2024full: Analysing German Newspapers for Gender Bias and Discrimination across Decades
Computation and Language
Helps understand German language and fairness in news.