Corpus of Cross-lingual Dialogues with Minutes and Detection of Misunderstandings

Published: December 23, 2025 | arXiv ID: 2512.20204v1

By: Marko Čechovič, Natália Komorníková, Dominik Macháček and more

Potential Business Impact:

Helps people talk across languages and finds misunderstandings.

Business Areas:
Natural Language Processing, Artificial Intelligence, Data and Analytics, Software

Speech processing and translation technology have the potential to facilitate meetings of individuals who do not share a common language. Evaluating automatic systems for this task requires a versatile and realistic evaluation corpus. We therefore create and present a corpus of cross-lingual dialogues between individuals without a common language, whose conversations were facilitated by automatic simultaneous speech translation. The corpus consists of 5 hours of speech recordings with ASR and gold transcripts in 12 original languages, together with automatic and corrected translations into English. To support research into cross-lingual summarization, the corpus also includes written summaries (minutes) of the meetings. Moreover, we propose the task of automatic detection of misunderstandings. To give an overview of this task and its complexity, we attempt to quantify misunderstandings in cross-lingual meetings: we annotate misunderstandings manually and also test the ability of current large language models to detect them automatically. The results show that the Gemini model is able to identify text spans with misunderstandings with a recall of 77% and a precision of 47%.
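
To make the reported span-level figures concrete, here is a minimal sketch of how precision and recall over detected text spans could be computed. The abstract does not state the matching criterion, so the overlap rule below (a predicted span counts as correct if it shares at least one position with any gold span) is an assumption, as are the names `overlaps`, `span_precision_recall`, and `f1`.

```python
from typing import List, Tuple

# A span is a half-open interval [start, end) of offsets into a transcript.
Span = Tuple[int, int]


def overlaps(a: Span, b: Span) -> bool:
    """Return True if two half-open spans share at least one position."""
    return a[0] < b[1] and b[0] < a[1]


def span_precision_recall(predicted: List[Span], gold: List[Span]) -> Tuple[float, float]:
    """Overlap-based span scoring (an assumed criterion, not the paper's).

    Precision: fraction of predicted spans that overlap some gold span.
    Recall: fraction of gold spans overlapped by some predicted span.
    """
    hit_pred = sum(any(overlaps(p, g) for g in gold) for p in predicted)
    hit_gold = sum(any(overlaps(g, p) for p in predicted) for g in gold)
    precision = hit_pred / len(predicted) if predicted else 0.0
    recall = hit_gold / len(gold) if gold else 0.0
    return precision, recall


def f1(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    # Toy spans invented for illustration; they are not from the corpus.
    gold = [(0, 10), (20, 30), (40, 50)]
    predicted = [(5, 12), (22, 25), (60, 70), (80, 90)]
    p, r = span_precision_recall(predicted, gold)
    print(f"precision={p:.2f} recall={r:.2f} f1={f1(p, r):.2f}")
```

Under the same formula, the reported recall of 77% and precision of 47% correspond to a harmonic mean of roughly 0.58. This F1 value is derived here for orientation and is not reported in the abstract.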

Country of Origin
🇨🇿 Czech Republic

Page Count
12 pages

Category
Computer Science:
Computation and Language