Preprint: Did I Just Browse A Website Written by LLMs?
By: Sichang "Steven" He, Ramesh Govindan, Harsha V. Madhyastha
Potential Business Impact:
Finds websites made mostly by AI.
Increasingly, web content is automatically generated by large language models (LLMs) with little human input. We call this "LLM-dominant" content. Since LLMs plagiarize and hallucinate, LLM-dominant content can be unreliable and unethical. Yet websites rarely disclose such content, and human readers struggle to distinguish it. Thus, we must develop reliable detectors for LLM-dominant content. However, state-of-the-art LLM detectors are insufficient, because they perform well mainly on clean, prose-like text, while web content has complex markup and diverse genres. We propose a highly reliable, scalable pipeline that classifies entire websites. Instead of naively classifying text extracted from each page, we classify each site based on an LLM text detector's outputs on multiple prose-like pages. We train and evaluate our detector by collecting two distinct ground-truth datasets totaling 120 sites, and achieve 100% accuracy when testing across them. In the wild, we detect a sizable portion of sites as LLM-dominant among 10k sites in search engine results and 10k in Common Crawl archives. We find that LLM-dominant sites are growing in prevalence and rank highly in search results, raising questions about their impact on end users and the overall Web ecosystem.
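The abstract's core idea of aggregating per-page detector outputs into a site-level verdict can be sketched as below. This is a minimal illustration, not the authors' implementation: the function name, thresholds, and the majority-vote aggregation are all assumptions for demonstration.

```python
# Hypothetical sketch: turn per-page LLM-detector scores into a site-level
# "LLM-dominant" decision. Thresholds and the voting rule are illustrative
# assumptions, not taken from the paper.

def classify_site(page_scores, prose_mask,
                  page_threshold=0.5, site_threshold=0.5):
    """Classify a site as LLM-dominant from per-page detector scores.

    page_scores: floats in [0, 1], the detector's estimate that each
                 page's text is LLM-generated.
    prose_mask:  bools, True where the extracted page text is prose-like
                 enough for the detector to be trusted.
    Returns True/False, or None when no usable pages remain.
    """
    # Only score pages with clean, prose-like text; LLM detectors are
    # unreliable on markup-heavy or non-prose pages.
    usable = [s for s, ok in zip(page_scores, prose_mask) if ok]
    if not usable:
        return None  # not enough evidence to decide either way
    # Majority vote: fraction of usable pages flagged as LLM-generated.
    flagged_fraction = sum(s >= page_threshold for s in usable) / len(usable)
    return flagged_fraction >= site_threshold
```

Aggregating over many pages is what makes the site-level decision robust: an occasional noisy per-page score is outvoted by the rest of the site.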
Similar Papers
Detecting LLM-Generated Text with Performance Guarantees
Computation and Language
Finds fake writing made by computers.
Leveraging LLMs to Create Content Corpora for Niche Domains
Computation and Language
Builds helpful lists from internet information.
Web Page Classification using LLMs for Crawling Support
Information Retrieval
Finds new web pages faster by sorting them.