An emerging narrative around ChatGPT and other Large Language Models (LLMs) is that because their output is so convincing and human-like, we cannot hope to tell the difference between the two and avoid being deceived. In fact, recent research[i] found something eye-opening: participants in the study could distinguish between human and AI text with only 50-52% accuracy, about the same random chance as a coin flip. That seems like a big problem. Yet it turns out that multiple computer scientists[ii],[iii] have created models that make detecting LLM-generated content easy. Here's how one of them, DetectGPT, does it.
How does DetectGPT work?
DetectGPT is a zero-shot model for detecting LLM-generated text. What does "zero-shot" mean? It means that rather than using machine learning to train a second deep network to detect machine-generated text, an approach that tends to overfit the texts it was trained on, the original source model is used to detect its own samples. Further, the source model is used without fine-tuning or adaptation of any kind. Of course, this is super clever: use LLMs to identify themselves. How does this work?
One of the things researchers have noticed is that LLMs tend to generate answers to prompts such as "What is a quick vegan recipe I can cook for a party this weekend?" by presenting language that has the maximum average per-token logarithmic probability. What does that mean?
Before explaining the method, we need to understand two key terms: tokens and thresholds. What are tokens? Typically, when a text is assessed quantitatively, the words need to be categorized in some fashion and then turned into a numerical representation. Text may be divided into individual words, parts of words, or combinations of words. Tokens are these subdivisions of an underlying text.
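As a rough illustration, here is a toy word-level tokenizer in Python. Real LLMs use subword tokenizers (such as byte-pair encoding), so this is only a sketch of the idea, and the `tokenize` helper is hypothetical:

```python
# A minimal sketch of tokenization: splitting text into units ("tokens").
# Real LLMs use subword tokenizers (e.g. byte-pair encoding); a simple
# word-level split just illustrates the concept.
def tokenize(text):
    # Lowercase, split on whitespace, and strip basic punctuation.
    return [w.strip(".,?!\"'") for w in text.lower().split()]

tokens = tokenize("What is a quick vegan recipe I can cook?")
print(tokens)
# -> ['what', 'is', 'a', 'quick', 'vegan', 'recipe', 'i', 'can', 'cook']
```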
Once a text has been tokenized, the tokens are quantified by turning them into numerical vectors. A vector summarizes a word's associations with other words. For example, the word 'mother' frequently appears in texts alongside 'woman,' so given how often they occur together, a probability may be assigned to that association. Other parts of the vector for 'woman' might capture its associations with 'female,' 'feminine,' 'girl,' and so on.
In other words, the vector for a very common word such as 'woman' may be composed of thousands of numbers that together capture the associations of that word with all other words, across many texts. The number of entries in the vector is referred to as its dimensionality. Converting words into vectors helps computers begin to understand language computationally.
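To make this concrete, here is a toy sketch with made-up four-dimensional vectors (real embeddings have hundreds or thousands of dimensions). Cosine similarity, a standard measure, shows that related words point in similar directions:

```python
import math

# Illustrative word vectors: tiny, invented 4-dimensional examples.
# Real embeddings have hundreds or thousands of dimensions.
vectors = {
    "woman":  [0.9, 0.8, 0.1, 0.3],
    "mother": [0.8, 0.9, 0.2, 0.4],
    "table":  [0.1, 0.0, 0.9, 0.7],
}

def cosine_similarity(a, b):
    # Dot product of the vectors divided by the product of their lengths.
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b))
    return dot / norm

# Related words score higher than unrelated ones.
print(cosine_similarity(vectors["woman"], vectors["mother"]))  # close to 1
print(cosine_similarity(vectors["woman"], vectors["table"]))   # much lower
```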
Probabilities, of course, range between 0% and 100%, and if we wanted to understand the word associations in a text we could arbitrarily set a threshold of, say, 75% to see how frequently different word pairings occur together. If one token has a 95% probability of being associated with another word, while a second token's probability is 5%, we can infer that the words with the higher probabilities are likely related to one another.
We can use statistics to establish appropriate thresholds for understanding, or they can be set arbitrarily. Either way, thresholds are important for understanding how DetectGPT works.
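A minimal sketch of how such a threshold might be applied, using invented association probabilities:

```python
# Sketch: applying a threshold to word-pair association probabilities.
# The numbers here are made up purely for illustration.
pair_probs = {
    ("woman", "mother"): 0.95,
    ("woman", "carburetor"): 0.05,
    ("vegan", "recipe"): 0.80,
}

THRESHOLD = 0.75  # set arbitrarily, as in the text above

# Keep only the pairs whose association probability clears the threshold.
related = {pair for pair, p in pair_probs.items() if p >= THRESHOLD}
print(related)
# -> {('woman', 'mother'), ('vegan', 'recipe')}
```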
Now you can understand how LLMs work. They tend to present answers where the language used has the maximum average per-token logarithmic probability. In other words, the answer has the highest probability of being associated with the question. This makes sense, because LLMs, like almost all quantitative methods, are solving either a maximization or a minimization problem.
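The average per-token log probability is just the mean of the logarithms of each token's probability. A minimal sketch, with made-up probabilities standing in for what would actually come from the model:

```python
import math

# Sketch: average per-token log probability of a piece of text.
# These per-token probabilities are invented; in practice each one is
# the probability the language model assigned to that token.
token_probs = [0.31, 0.22, 0.41, 0.18, 0.27]

avg_log_prob = sum(math.log(p) for p in token_probs) / len(token_probs)
print(avg_log_prob)  # a negative number; closer to 0 means "more likely"
```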
Thus, if we rewrite (perturb) model-generated text and re-evaluate the average probability of its component tokens, we get a fascinating result: rewrites of the LLM's output almost always have a lower average per-token logarithmic probability than the original.
By contrast, when people rewrite a text, the result may have either a higher or a lower log probability than the original. Put more simply, an LLM gives you its very best response the first time, because that is what it is designed to do with the information available to it.
People rewriting a text, however, are not engaged in a quantitatively driven maximization problem. They may be trying to maximize the factual accuracy of an answer; maximize the readability of a passage; maximize its humor; or even minimize the offense a reader may take; and so on.
Thus, if multiple perturbations of a text consistently result in a lower log probability, we can conclude with high probability that an LLM generated the original text.
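The paper's actual statistic also normalizes this gap by the standard deviation of the perturbed scores, but the core decision rule can be sketched as follows. The `log_prob` and `perturb` helpers are hypothetical stand-ins: in DetectGPT, scoring comes from the source model itself and perturbations come from a mask-filling model such as T5.

```python
# Sketch of DetectGPT-style detection, under two assumed helpers:
#   log_prob(text) - average per-token log probability under the source model
#   perturb(text)  - a meaning-preserving rewrite of the text
def detect(text, log_prob, perturb, n_perturbations=20, threshold=0.1):
    original = log_prob(text)
    perturbed = [log_prob(perturb(text)) for _ in range(n_perturbations)]
    # If the original sits well above the average of its perturbations,
    # the text likely came from the model.
    drop = original - sum(perturbed) / len(perturbed)
    return drop > threshold  # True => likely machine-generated

# Toy demonstration with fake, deterministic scoring functions:
machine_score = lambda t: -1.0 if t == "orig" else -1.5  # rewrites drop
human_score = lambda t: -2.0                             # no consistent drop
rewrite = lambda t: "rewrite"

print(detect("orig", machine_score, rewrite))  # -> True
print(detect("orig", human_score, rewrite))    # -> False
```

The threshold plays the same role as the thresholds discussed earlier: it is a cutoff we choose (statistically or arbitrarily) for how large the probability drop must be before we call the text machine-generated.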
What is DetectGPT’s success rate?
DetectGPT's success rate is 85% across a range of samples, despite being a general-purpose LLM detection tool. By contrast, LLM detectors trained to identify machine-generated content in specific kinds of texts do perform better. But when they are presented with new texts, they significantly underperform. Overall, the success rate of these trained models is also 85%, but the standard deviation of their success is much higher. That is, DetectGPT performs consistently well, while trained models only perform well on data like their training sets.
Conclusion
Detecting whether or not a document was authored by an LLM is very difficult for people. But it is not a difficult problem for zero-shot models. The reason is that LLMs seek to provide their best answer on the first attempt. Asking one to rewrite an answer almost always results in a lower-probability output. By contrast, rewrites done by people can have a higher or lower average token log probability.
[i] Jakesch, Maurice, Jeffrey T. Hancock, and Mor Naaman. "Human heuristics for AI-generated language are flawed." PNAS (March 7, 2023) Vol. 120, No. 11
[ii] Mitchell, Eric, Yoonho Lee, Alexander Khazatsky, Christopher D. Manning, and Chelsea Finn. "DetectGPT: Zero-Shot Machine-Generated Text Detection using Probability Curvature." (26 January 2023) arXiv:2301.11305v1
[iii] Mok, Kimberley. "GPTZero: An App to Detect AI Authorship." The New Stack (1 February 2023)