Beyond Token Limits: Assessing Language Model Performance on Long Text Classification
By: Miklós Sebők, Viktor Kovács, Martin Bánóczy and more
Potential Business Impact:
Helps computers understand very long texts, like laws.
The most widely used large language models in the social sciences (such as BERT and its derivatives, e.g. RoBERTa) limit the length of the input text they can process to produce predictions. This is particularly pressing for classification tasks that must handle long input texts. One such area deals with laws and draft laws (bills), which can run to several hundred pages and are therefore not amenable to processing with models that can handle only, e.g., 512 tokens. In this paper, we show results from experiments covering 5 languages with XLM-RoBERTa, Longformer, GPT-3.5, and GPT-4 models for the multiclass classification task of the Comparative Agendas Project, which has a codebook of 21 policy topic labels, from education to health care. Results show no particular advantage for the Longformer model, which was pre-trained specifically to handle long inputs. The comparison between the GPT variants and the best-performing open model yielded an edge for the latter. An analysis of class-level factors points to the importance of support and substance overlaps between specific categories when it comes to performance on long text inputs.
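To make the 512-token constraint concrete, the sketch below shows one common workaround for long documents: split the token sequence into overlapping windows and take a majority vote over per-window labels. This is an illustration only and is not claimed to be the procedure used in the paper. The names `split_into_windows`, `classify_long_text`, and `classify_window`, as well as the stride of 256, are hypothetical choices made for this example, and plain string tokens stand in for a real subword tokenizer.

```python
from collections import Counter
from typing import Callable, List, Sequence


def split_into_windows(
    tokens: Sequence[str], max_len: int = 512, stride: int = 256
) -> List[List[str]]:
    """Split a token sequence into overlapping windows of at most max_len tokens.

    Hypothetical helper: max_len mirrors the 512-token limit mentioned in the
    abstract; the stride (how far each window start advances) is an assumption.
    """
    if max_len <= 0 or stride <= 0:
        raise ValueError("max_len and stride must be positive")
    windows = []
    start = 0
    while True:
        windows.append(list(tokens[start:start + max_len]))
        # Stop once the current window reaches the end of the document.
        if start + max_len >= len(tokens):
            break
        start += stride
    return windows


def classify_long_text(
    tokens: Sequence[str],
    classify_window: Callable[[List[str]], str],
    max_len: int = 512,
    stride: int = 256,
) -> str:
    """Label a long document by majority vote over its windows.

    classify_window is a placeholder for any model that maps one window of
    tokens to a single topic label (an assumption of this sketch).
    """
    votes = Counter(
        classify_window(window)
        for window in split_into_windows(tokens, max_len, stride)
    )
    # On a tie, Counter.most_common returns the label that was seen first.
    return votes.most_common(1)[0][0]
```

Majority voting is only one possible aggregation rule; averaging per-class probabilities across windows is another option when the underlying model exposes them.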
Similar Papers
Long Context Automated Essay Scoring with Language Models
Computation and Language
Lets computers grade long essays completely.
Advancing Text Classification with Large Language Models and Neural Attention Mechanisms
Computation and Language
Helps computers understand and sort text better.