Detecting AI-Generated Text Through Verb Frequency Analysis

Leonardo
6 min read · Jun 6, 2024

In the rapidly evolving field of artificial intelligence, the sophistication of text generation by models such as ChatGPT has reached new heights. These advancements have democratized content creation and ignited discussions about authenticity and reliability. As AI text generators become more prevalent, determining whether a text was written by a human or by an AI has become a pressing challenge. This article presents a study that uses verb frequency and verb forms to detect text generated by ChatGPT (GPT-3.5); noun frequency analysis, by contrast, provided no significant insights, owing to the variability in topics of the generated texts.

The Inception of the Study

The study was initiated based on the observation that different writers, be they human or machine, exhibit distinct linguistic fingerprints. For humans, these fingerprints are shaped by individual experiences, education, and emotional depth. In contrast, AI models are influenced by their training data and underlying algorithms. Detecting these patterns could help determine whether a text is AI-generated or human-written.

Methodology: Harnessing Python and NLTK

The core of this research was conducted using Python and its Natural Language Toolkit (NLTK), which provided essential tools for tokenizing text, removing stopwords, and conducting part-of-speech tagging. Here’s an overview of the Python script used:

import nltk
from nltk import pos_tag
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from collections import Counter

# Necessary NLTK resources
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')
nltk.download('stopwords')

def get_word_frequencies(text):
    # Split the text into individual words and punctuation tokens
    tokens = word_tokenize(text)
    # Drop common stopwords ("the", "and", ...) before tagging
    stop_words = set(stopwords.words('english'))
    tokens = [token for token in tokens if token.lower() not in stop_words]
    # Tag each remaining token with its part of speech
    tagged_tokens = pos_tag(tokens)
    # Keep verbs only: Penn Treebank verb tags all start with 'VB'
    verbs = [word for word, tag in tagged_tokens if tag.startswith('VB')]
    return Counter(verbs)

The focus was on extracting verbs because they significantly reflect an author’s style. The hypothesis was that AI-generated texts might prefer specific verb forms or frequencies, distinguishing them from human writing.
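As a quick illustration, calling the function on a short sample sentence might look like this (the exact tags, and therefore the verbs returned, can vary with the NLTK tagger version):

sample = "The model analyzes the input and generates a response, leading to coherent text."
print(get_word_frequencies(sample))
# e.g. Counter({'analyzes': 1, 'generates': 1, 'leading': 1})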

Data Collection: Varied Prompts

To gather diverse data, the study employed prompts that generated text on a wide array of subjects, from technology to philosophy. The responses from ChatGPT were then analyzed to chart verb usage patterns. The prompts were selected to cover a broad range of topics, mitigating the thematic biases that could affect noun frequency data. Here are some examples:

Discuss the psychological impact of social media on self-esteem in a comprehensive text, with at least 2000 words
write a scientific article about quantum computing
write a detailed tutorial about how to write a scientific paper
write a comprehensive blog article about Aristotle's philosophy

Analysis: Verb Frequencies as Distinguishing Factors

The analysis centered on the frequency and forms of verbs in the generated texts. Notably, verbs associated with cognitive and creative processes, such as “analyze,” “describe,” and “discuss,” appeared with remarkable consistency across different texts. This uniformity of verb usage across topics contrasts starkly with human writing, which typically shows a wider diversity of verbs shaped by subject matter and personal style.

The Shortcomings of Noun Frequency Analysis

The initial plan included analyzing noun frequencies, but this approach failed to produce meaningful results. The challenge was the diversity of topics covered by the prompts: a text about quantum computing yields topic-specific nouns like “qubit” and “entanglement,” vastly different from the nouns in a text about philosophy or climate change. This variability made it impossible to detect consistent noun-usage patterns across AI-generated texts.

Statistical Insights and Graphical Representations

The study processed 60 different prompts and generated a total of 62,357 words. The most frequently used verbs were the following:

  1. “including” (0.1203%)
  2. “leading” (0.0834%)
  3. “making” (0.0641%)
  4. “shaping” (0.0609%)
  5. “enduring” (0.0593%)
  6. “provide” (0.0561%)
  7. “promoting” (0.0545%)
  8. “delve” (0.0529%)
  9. “characterized” (0.0513%)
  10. “used” (0.0481%)

Graphical representations created with matplotlib provided a clear comparative view of verb usage, emphasizing that certain verbs are disproportionately favored by the AI text generator.
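The article does not reproduce the plotting code, but a minimal matplotlib sketch using the top-ten figures listed above could look like this:

import matplotlib.pyplot as plt

# Top verbs and their relative frequencies (%) as reported in the study
verbs = ["including", "leading", "making", "shaping", "enduring",
         "provide", "promoting", "delve", "characterized", "used"]
freqs = [0.1203, 0.0834, 0.0641, 0.0609, 0.0593,
         0.0561, 0.0545, 0.0529, 0.0513, 0.0481]

plt.figure(figsize=(10, 5))
plt.bar(verbs, freqs)
plt.ylabel("Relative frequency (%)")
plt.title("Most frequent verbs in ChatGPT-generated text")
plt.xticks(rotation=45, ha="right")
plt.tight_layout()
plt.show()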

Implications and Future Research Directions

The findings of this study are crucial for enhancing the credibility of digital content and guarding against misinformation. For AI developers, these insights could help refine models to produce texts that more closely mimic human diversity and subtlety.

Future research might explore deeper linguistic features like sentence structure complexity, the use of passive voice, and syntactic variability. Integrating machine learning models to predict the likelihood of a text being AI-generated based on these features could further streamline and automate detection methods.

Constructing an AI Detector Using Verb Frequency Analysis

Building an AI detector that leverages verb frequency analysis involves a methodical approach, where the frequency and forms of verbs used in a text are compared against a benchmark established by research, such as this study. The following steps outline how one might develop such a detector:

Step 1: Data Collection

The first step in constructing an AI detector is to gather a comprehensive dataset of text. This dataset should include a balanced mix of AI-generated texts and human-written texts. For AI-generated texts, using a variety of prompts that produce content on diverse subjects will help in capturing a broad range of verb usage patterns.

Step 2: Preprocessing and Tokenization

Once the dataset is ready, the text needs to be preprocessed. This involves converting the text to a uniform case (usually lowercase), removing punctuation, and potentially correcting spelling errors. The text is then tokenized, meaning it is split into individual words or tokens. Tools like NLTK’s `word_tokenize` function can be effectively used for this purpose.
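A minimal preprocessing sketch along these lines (the helper name and the choice to drop punctuation-only tokens are illustrative, not prescribed by the study):

from nltk.tokenize import word_tokenize

def preprocess(text):
    # Lowercase so "Including" and "including" count as one type
    tokens = word_tokenize(text.lower())
    # Keep only tokens containing at least one letter or digit,
    # which drops punctuation-only tokens
    return [t for t in tokens if any(c.isalnum() for c in t)]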

Step 3: Part-of-Speech Tagging

After tokenization, each word in the dataset is tagged with its corresponding part of speech, focusing particularly on verbs. This can be achieved using NLTK’s `pos_tag` function. It’s crucial to accurately identify the different forms of verbs, as these are the primary features used to distinguish AI-generated text from human text.
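NLTK’s default tagger uses the Penn Treebank tag set, in which every verb tag begins with “VB” (VB, VBD, VBG, VBN, VBP, VBZ); this is exactly why the earlier script filters with tag.startswith('VB'). A quick illustration:

from nltk import pos_tag
from nltk.tokenize import word_tokenize

tagged = pos_tag(word_tokenize("She writes, wrote, and is writing articles."))
print(tagged)
# e.g. [('She', 'PRP'), ('writes', 'VBZ'), (',', ','), ('wrote', 'VBD'),
#       ('and', 'CC'), ('is', 'VBZ'), ('writing', 'VBG'),
#       ('articles', 'NNS'), ('.', '.')]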

Step 4: Frequency Analysis

Analyze the frequency of each verb form in the dataset. The frequency distribution of verbs in human-written texts is likely to show greater variation and contextual dependency compared to the more uniform verb usage in AI-generated texts.
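One way to turn those counts into model features is a fixed-length vector of relative frequencies over a chosen verb vocabulary. The helper below is an illustrative sketch, not code from the study:

from collections import Counter

from nltk.tokenize import word_tokenize

def verb_feature_vector(text, verb_vocabulary):
    # Relative frequency of each benchmark verb, normalized by text length
    tokens = word_tokenize(text.lower())
    counts = Counter(tokens)
    total = max(len(tokens), 1)  # guard against empty input
    return [counts[v] / total for v in verb_vocabulary]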

Step 5: Benchmarking and Model Training

Use the findings from this study as a benchmark for typical verb usage in AI-generated texts. For example, the high frequency of verbs like “including” and “leading” in AI texts can be used as indicators. Train a machine learning model (such as a logistic regression or a decision tree classifier) to recognize patterns in verb usage. This model will learn from the differences in verb frequency distributions between human and AI texts.
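A sketch of this step with scikit-learn’s LogisticRegression, assuming the verb_feature_vector helper from Step 4 and two placeholder corpora, ai_texts and human_texts, collected in Step 1:

from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Benchmark vocabulary drawn from the study's most frequent AI verbs
VERB_VOCAB = ["including", "leading", "making", "shaping", "enduring",
              "provide", "promoting", "delve", "characterized", "used"]

# ai_texts and human_texts are placeholders for the corpora from Step 1
X = [verb_feature_vector(t, VERB_VOCAB) for t in ai_texts + human_texts]
y = [1] * len(ai_texts) + [0] * len(human_texts)  # 1 = AI-generated

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y)

clf = LogisticRegression(max_iter=1000)
clf.fit(X_train, y_train)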

Step 6: Validation and Testing

Validate the effectiveness of the AI detector by testing it on a new set of texts not previously seen by the model. The success of the model should be evaluated based on its accuracy, precision, and recall in correctly identifying AI-generated and human-written texts.
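With scikit-learn, those metrics can be computed on the held-out split from the previous step in a few lines:

from sklearn.metrics import accuracy_score, precision_score, recall_score

y_pred = clf.predict(X_test)
print(f"Accuracy:  {accuracy_score(y_test, y_pred):.3f}")
print(f"Precision: {precision_score(y_test, y_pred):.3f}")
print(f"Recall:    {recall_score(y_test, y_pred):.3f}")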

Step 7: Continuous Learning

Since AI text-generation technologies evolve rapidly, continuously updating the dataset and retraining the model with new texts will help maintain the accuracy and relevance of the AI detector. Incorporating feedback mechanisms where incorrect predictions lead to further model training can enhance performance over time.

Concluding Thoughts

This study into verb frequency and forms in AI-generated text opens new research pathways in linguistics and AI. It underscores the nuanced differences between human and machine writing, offering insights into the subtle complexities of language that AI is yet to fully master. As AI continues to advance, our methods for understanding and interacting with this powerful technology will evolve correspondingly, continually challenging the boundaries of what machines can achieve.
