Label Forensics: Interpreting Hard Labels in Black-Box Text Classifiers
By: Mengyao Du, Gang Yang, Han Fang, and others
Potential Business Impact:
Reveals what the labels of an undocumented text classifier actually mean.
The widespread adoption of natural language processing techniques has led to an unprecedented growth of text classifiers across the modern web. Yet many of these models circulate with their internal semantics undocumented or even intentionally withheld. Such opaque classifiers, which may expose only hard-label outputs, can operate in unregulated web environments or be repurposed for unknown intents, raising legitimate forensic and auditing concerns. In this paper, we take the role of investigators and aim to infer the semantic concept each label encodes in an undocumented black-box classifier. Specifically, we introduce label forensics, a black-box framework that reconstructs a label's semantic meaning. Concretely, we represent a label by a sentence-embedding distribution from which any sample reliably reflects the concept the classifier has implicitly learned for that label. This distribution should satisfy two key properties: it must be precise, with samples consistently classified into the target label, and general, covering the label's broad semantic space. To realize this, we design a semantic neighborhood sampler and an iterative optimization procedure that selects representative seed sentences to jointly maximize label consistency and distributional coverage. The final output, an optimized seed-sentence set combined with the sampler, constitutes an empirical distribution representing the label's semantics. Experiments on multiple black-box classifiers achieve an average label consistency of around 92.24 percent, demonstrating that the recovered embedding regions accurately capture each classifier's label semantics. We further validate our framework on an undocumented HuggingFace classifier, enabling fine-grained label interpretation and supporting responsible AI auditing.
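The abstract describes two pieces: a neighborhood sampler that generates variants around a seed sentence, and a selection step that keeps seeds whose neighborhoods are consistently mapped to the target label. The sketch below is a minimal toy illustration of that loop, not the paper's actual method: the keyword classifier, the synonym-swap sampler, and all function names (`black_box_classify`, `sample_neighborhood`, `label_consistency`, `select_seeds`) are hypothetical stand-ins for the real black-box model and embedding-space sampler.

```python
import random

# Hypothetical black-box classifier: only hard labels are observable.
# A toy keyword rule stands in for the real undocumented model.
SPORTS_WORDS = {"game", "match", "team", "squad", "score"}

def black_box_classify(sentence):
    return "sports" if any(w in sentence.split() for w in SPORTS_WORDS) else "other"

# Hypothetical neighborhood sampler: perturb a seed sentence by swapping
# in near-synonyms, loosely mimicking sampling from a region around the
# seed in sentence-embedding space.
SWAPS = {"game": ["game", "match"], "team": ["team", "squad"], "won": ["won", "beat"]}

def sample_neighborhood(seed, n=20, rng=random):
    samples = []
    for _ in range(n):
        words = [rng.choice(SWAPS.get(w, [w])) for w in seed.split()]
        samples.append(" ".join(words))
    return samples

def label_consistency(seed, target, n=20, rng=random):
    # Fraction of neighborhood samples the black box maps to the target label
    # (the "precise" property from the abstract).
    samples = sample_neighborhood(seed, n, rng)
    return sum(black_box_classify(s) == target for s in samples) / n

def select_seeds(candidates, target, threshold=0.9):
    # Keep seeds whose neighborhoods are consistently labeled as the target.
    # The paper additionally optimizes distributional coverage (the "general"
    # property), which this toy version omits.
    return [s for s in candidates if label_consistency(s, target) >= threshold]

candidates = ["the team won the game", "the weather is nice today"]
seeds = select_seeds(candidates, "sports")
```

Here the first candidate survives because every synonym swap stays inside the classifier's "sports" region, while the second is filtered out; the real framework plays the same game in a learned sentence-embedding space rather than over keyword swaps.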
Similar Papers
Soft-Label Training Preserves Epistemic Uncertainty
Machine Learning (CS)
Teaches computers to understand when things are unclear.
Fuzzy Label: From Concept to Its Application in Label Learning
Machine Learning (CS)
Teaches computers to understand fuzzy, uncertain labels.
Semantically Guided Adversarial Testing of Vision Models Using Language Models
CV and Pattern Recognition
Probes how easily vision models can be fooled.