Segment First or Comprehend First? Explore the Limit of Unsupervised Word Segmentation with Large Language Models

Published: May 26, 2025 | arXiv ID: 2505.19631v1

By: Zihong Zhang, Liqi He, Zuchao Li and more

Potential Business Impact:

Helps computers understand words in any language.

Business Areas:

Natural Language Processing, Artificial Intelligence, Data and Analytics, Software

Word segmentation is a cornerstone of Natural Language Processing (NLP). Based on the concept of "comprehend first, segment later", we propose a new framework to explore the limit of unsupervised word segmentation with Large Language Models (LLMs) and to evaluate the semantic understanding capabilities of LLMs through word segmentation. We use current mainstream LLMs to perform word segmentation across multiple languages to assess their "comprehension". Our findings show that LLMs can follow simple prompts to segment raw text into words, and there is a trend suggesting that models with more parameters tend to perform better across multiple languages. Additionally, we introduce a novel unsupervised method, termed LLACA (Large Language Model-Inspired Aho-Corasick Automaton). LLACA combines the pattern-matching capabilities of Aho-Corasick automata with the knowledge of well-pretrained LLMs. This approach enables the construction of a dynamic n-gram model that adjusts based on contextual information while integrating the nuanced understanding of LLMs, offering significant improvements over traditional methods. Our source code is available at https://github.com/hkr04/LLACA
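
To illustrate the general idea of pairing an Aho-Corasick automaton with word scores, here is a minimal sketch. It is not the authors' LLACA implementation (see the repository above for that). The class `AhoCorasick`, the function `segment`, the `weights` dictionary, and the `unknown_cost` penalty are all hypothetical names chosen for this example. The sketch assumes the word weights have already been obtained elsewhere, for instance from counts over LLM-produced segmentations, and it uses a simple static unigram cost rather than the context-dependent n-gram model the abstract describes.

```python
import math
from collections import deque


class AhoCorasick:
    """Trie with failure links that finds all vocabulary words in one pass."""

    def __init__(self, words):
        self.children = [{}]  # per-node map: character -> child node id
        self.fail = [0]       # per-node failure link
        self.out = [[]]       # per-node lengths of words ending at this node
        for word in words:
            if word:
                self._insert(word)
        self._build_links()

    def _insert(self, word):
        node = 0
        for ch in word:
            if ch not in self.children[node]:
                self.children.append({})
                self.fail.append(0)
                self.out.append([])
                self.children[node][ch] = len(self.children) - 1
            node = self.children[node][ch]
        self.out[node].append(len(word))

    def _build_links(self):
        # Breadth-first traversal so shallower nodes are finalized first.
        queue = deque(self.children[0].values())
        while queue:
            node = queue.popleft()
            for ch, nxt in self.children[node].items():
                f = self.fail[node]
                while f and ch not in self.children[f]:
                    f = self.fail[f]
                self.fail[nxt] = self.children[f].get(ch, 0)
                # Inherit matches that end at the failure target.
                self.out[nxt] = self.out[nxt] + self.out[self.fail[nxt]]
                queue.append(nxt)

    def matches(self, text):
        """Yield (start, end) spans of every vocabulary word found in text."""
        node = 0
        for i, ch in enumerate(text):
            while node and ch not in self.children[node]:
                node = self.fail[node]
            node = self.children[node].get(ch, 0)
            for length in self.out[node]:
                yield (i + 1 - length, i + 1)


def segment(text, weights, unknown_cost=10.0):
    """Return the lowest-cost segmentation of text.

    weights maps word -> probability-like score in (0, 1] (assumed given).
    Characters not covered by any word become single-character tokens.
    """
    automaton = AhoCorasick(weights)
    n = len(text)
    starts_by_end = [[] for _ in range(n + 1)]
    for start, end in automaton.matches(text):
        starts_by_end[end].append(start)

    best = [0.0] + [math.inf] * n  # best[i]: minimal cost of text[:i]
    back = [0] * (n + 1)           # back[i]: start of the last token in text[:i]
    for end in range(1, n + 1):
        # Fallback: treat the previous character as an unknown token.
        best[end] = best[end - 1] + unknown_cost
        back[end] = end - 1
        for start in starts_by_end[end]:
            cost = best[start] - math.log(weights[text[start:end]])
            if cost < best[end]:
                best[end], back[end] = cost, start

    tokens = []
    end = n
    while end > 0:
        tokens.append(text[back[end]:end])
        end = back[end]
    return tokens[::-1]


if __name__ == "__main__":
    toy_weights = {"the": 0.3, "then": 0.05, "he": 0.05,
                   "cat": 0.2, "a": 0.1, "at": 0.1}
    print(segment("thecat", toy_weights))
```

The automaton enumerates every candidate word span in a single left-to-right scan, and the dynamic program then picks the combination of spans with the lowest total negative log-weight. The toy vocabulary is English only for readability; the same code works on any unsegmented character sequence.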

BabyLM's First Words: Word Segmentation as a Phonological Probing Task

Computation and Language

Teaches computers to understand word sounds in many languages.

4 Apr 2025

87%

Cross-Domain Semantic Segmentation with Large Language Model-Assisted Descriptor Generation

CV and Pattern Recognition

Helps computers see and name objects in pictures.

27 Jan 2025

87%

Large Language Models Struggle to Describe the Haystack without Human Help: Human-in-the-loop Evaluation of Topic Models

Computation and Language

Computers struggle to understand big document piles.

20 Feb 2025

Country of Origin

🇨🇳 China

Page Count

18 pages
