ChatGPT Displays Memorization of Copyrighted Poems, Raises Ethical Concerns: Cornell Study

By: Science Desk

Kochi | January 11, 2024 16:14 IST

If you have ever asked ChatGPT, the language model-based chatbot developed by OpenAI, to recite a well-known poem, you might be surprised to find it delivering the entire text verbatim, regardless of copyright law. A recent study by Cornell researchers finds that ChatGPT can memorize and reproduce poems, raising ethical questions about training AI models on data sourced from the internet.

The study, presented at the Computational Humanities Research Conference on Saturday, explored how large language models such as ChatGPT interact with and reproduce copyrighted material, specifically poems that are commonly available online.

First author Lyra D’Souza, a computer science major and summer research assistant at Cornell, highlighted concerns about privacy and the potential training of proprietary models on private data. “It’s generally not good for large language models to memorize large chunks of text, in part because it’s a privacy concern. We don’t know what they’re trained on, and a lot of times, private companies can train proprietary models on our private data,” said D’Souza in a press statement.

Poems were chosen for the study because they are short enough for language models to process in full and because their copyright status is complex. While many of the poems examined were technically under copyright, they were widely accessible online from reputable sources such as the Poetry Foundation.

Large language models, including ChatGPT, are trained to generate text by predicting the most likely next word based on their training data, primarily consisting of webpages. When the training data includes duplicated passages, these models can inadvertently memorize specific sequences of words.
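The link between duplicated training text and memorization can be illustrated with a toy example. The sketch below is not the method used in the study and is far simpler than a real large language model: it assumes a tiny bigram model that only counts which word follows which, and an invented one-line "poem" (all names here, such as train_bigram and generate, are hypothetical). When the line appears several times in the training data, always picking the most likely next word reproduces it verbatim; when other text dominates, the same prompt drifts elsewhere.

```python
from collections import Counter, defaultdict

def train_bigram(corpus):
    """Count, for each word, how often each next word follows it."""
    counts = defaultdict(Counter)
    for text in corpus:
        words = text.split()
        for current, nxt in zip(words, words[1:]):
            counts[current][nxt] += 1
    return counts

def generate(counts, prompt, max_words=20):
    """Extend the prompt by repeatedly choosing the most frequent next word."""
    words = prompt.split()
    for _ in range(max_words):
        followers = counts.get(words[-1])
        if not followers:
            break  # the last word was never followed by anything in training
        words.append(followers.most_common(1)[0][0])
    return " ".join(words)

# Invented line standing in for a poem (not taken from any real work).
passage = "a grey heron waits beside the silent river"

# Case 1: the passage is duplicated across the training "webpages".
duplicated_corpus = [passage] * 3 + ["the heron flew away", "a small boat drifts north"]
duplicated_model = train_bigram(duplicated_corpus)
memorized = generate(duplicated_model, "a grey")

# Case 2: the passage appears once and competing text is more frequent.
single_corpus = [passage] + ["the heron flew away"] * 3
single_model = train_bigram(single_corpus)
not_memorized = generate(single_model, "a grey")

print(memorized)      # a grey heron waits beside the silent river
print(not_memorized)  # a grey heron flew away
```

Real models use far richer context than a single preceding word, so this only shows the general tendency: sequences seen repeatedly become the most likely continuation.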

The researchers compared the poem reproduction capabilities of ChatGPT with three other large language models: PaLM from Google, Pythia from the non-profit AI research institute EleutherAI, and GPT-2, an earlier version of GPT-4, the model underpinning ChatGPT. The models were prompted with poems from 60 American poets, representing diverse backgrounds in terms of time periods, races, genders, and levels of fame.

ChatGPT successfully retrieved 72 out of 240 poems, outperforming PaLM, which managed to produce only 10. Pythia and GPT-2, however, failed to retrieve complete poems, with Pythia repeating the same phrase and GPT-2 generating nonsensical text.
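To put those counts in proportion, the short calculation below converts them to retrieval rates. It uses only the figures reported above; the variable and function names are illustrative, and Pythia and GPT-2 are left out because the article gives no count for them beyond saying they retrieved no complete poems.

```python
TOTAL_POEMS = 240

# Poems retrieved, as reported in the article.
retrieved = {"ChatGPT": 72, "PaLM": 10}

def retrieval_rate(count, total=TOTAL_POEMS):
    """Fraction of the prompted poems that a model reproduced."""
    return count / total

for model, count in retrieved.items():
    print(f"{model}: {count}/{TOTAL_POEMS} = {retrieval_rate(count):.1%}")
# ChatGPT: 72/240 = 30.0%
# PaLM: 10/240 = 4.2%
```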

The study comes at a challenging time for OpenAI, as the company faces lawsuits filed by fiction and nonfiction writers over alleged unauthorized use of their work to train AI programs. The findings underscore the importance of addressing privacy and copyright concerns in the development and deployment of large language models.
