ChatGPT Memorizes Copyrighted Poems, Raising Ethical Concerns: Cornell Study

By: Science Desk

Kochi | January 11, 2024 16:14 IST

If you have ever asked ChatGPT, the language model-based chatbot developed by OpenAI, to recite a well-known poem, you may have been surprised to find it delivering the entire text verbatim, regardless of copyright. A recent study by Cornell researchers shows that ChatGPT can memorize and reproduce poems, raising ethical questions about training AI models on data scraped from the internet.

The study, presented at the Computational Humanities Research Conference on Saturday, focused on exploring how large language models, like ChatGPT, interact with and reproduce copyrighted material, specifically poems that are commonly available online.

First author Lyra D’Souza, a computer science major and summer research assistant at Cornell, highlighted concerns about privacy and the potential use of proprietary models on private data. “It’s generally not good for large language models to memorize large chunks of text, in part because it’s a privacy concern. We don’t know what they’re trained on, and a lot of times, private companies can train proprietary models on our private data,” mentioned D’Souza in a press statement.

Poems were chosen for the study because their short length makes them easy for language models to process in full, and because their copyright status is often complex. While many of the poems examined were technically under copyright, they were widely accessible online from reputable sources such as the Poetry Foundation.

Large language models, including ChatGPT, are trained to generate text by predicting the most likely next word based on their training data, primarily consisting of webpages. When the training data includes duplicated passages, these models can inadvertently memorize specific sequences of words.
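The memorization effect described above can be illustrated with a deliberately tiny sketch: a bigram model that, at each step, emits the most frequent next word seen in training. The model, corpus, and poem line below are illustrative assumptions, not part of the study, but they show how duplicated passages in training data lead to verbatim reproduction.

```python
from collections import Counter, defaultdict

def train_bigram(corpus):
    """Count, for each word, which words follow it in the corpus."""
    counts = defaultdict(Counter)
    words = corpus.split()
    for prev, nxt in zip(words, words[1:]):
        counts[prev][nxt] += 1
    return counts

def generate(counts, start, n_words):
    """Greedily emit the most likely next word at each step."""
    out = [start]
    for _ in range(n_words):
        followers = counts.get(out[-1])
        if not followers:
            break
        out.append(followers.most_common(1)[0][0])
    return " ".join(out)

# A tiny "training set" in which one passage is duplicated,
# mimicking a famous poem that appears on many webpages.
poem = "so much depends upon a red wheel barrow"
corpus = " ".join([poem] * 3 + ["a quiet day passed"])

model = train_bigram(corpus)
# The duplicated passage dominates the counts, so greedy
# generation reproduces it word for word.
print(generate(model, "so", 7))
```

Real language models predict tokens with neural networks rather than raw counts, but the same statistical pressure applies: sequences repeated across many webpages become the model's most likely continuations.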

The researchers compared the poem reproduction capabilities of ChatGPT with three other large language models—PaLM from Google, Pythia from the non-profit AI research institute EleutherAI, and GPT-2, an earlier version of the model underpinning ChatGPT (GPT-4). The models were prompted with poems from 60 American poets, representing diverse backgrounds in terms of time periods, races, genders, and levels of fame.

ChatGPT successfully retrieved 72 of the 240 poems, far outperforming PaLM, which produced only 10. Pythia and GPT-2 failed to retrieve any complete poems: Pythia tended to repeat the same phrase, while GPT-2 generated nonsensical text.
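Scoring whether a model "retrieved" a poem requires comparing its output against the reference text. The study's exact scoring rule is not described here, so the function below is only a plausible sketch: it normalizes punctuation and case, then checks what fraction of the poem's lines appear unchanged in the model's response. The threshold and line-level matching are assumptions for illustration.

```python
import re

def is_verbatim_retrieval(model_output, poem_text, threshold=0.95):
    """Judge whether a model's output reproduces a poem (nearly) verbatim.

    Illustrative sketch: strips punctuation, lowercases, and counts
    how many of the poem's lines occur verbatim in the output.
    """
    def norm(s):
        return re.sub(r"[^\w\s]", "", s).lower().strip()

    out_lines = {norm(line) for line in model_output.splitlines() if line.strip()}
    poem_lines = [norm(line) for line in poem_text.splitlines() if line.strip()]
    matched = sum(1 for line in poem_lines if line in out_lines)
    return matched / len(poem_lines) >= threshold

# Public-domain example (Emily Dickinson) with a typical chatbot preamble.
poem = "Hope is the thing with feathers\nThat perches in the soul"
output = "Sure! Here it is:\nHope is the thing with feathers\nThat perches in the soul"
print(is_verbatim_retrieval(output, poem))  # → True
```

A line-level check like this tolerates chatbot preambles ("Sure! Here it is:") while still demanding that the poem itself appear word for word.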

The study comes at a challenging time for OpenAI, as the company faces lawsuits filed by fiction and nonfiction writers over alleged unauthorized use of their work to train AI programs. The findings underscore the importance of addressing privacy and copyright concerns in the development and deployment of large language models.
