
An investigation by Proof News has revealed that major tech companies, including Apple, Nvidia, and Anthropic, have been using subtitles from thousands of YouTube videos to train their artificial intelligence (AI) models without the creators’ consent. This unauthorized use has sparked significant backlash from content creators who feel their work has been exploited.
The investigation found that these companies used the YouTube Subtitles dataset, which includes transcripts from 173,536 videos drawn from more than 48,000 channels. Affected channels include educational sources such as Khan Academy and MIT, as well as prominent media outlets such as NPR and the BBC.
High-profile YouTubers like MrBeast, Marques Brownlee, Jacksepticeye, and PewDiePie had their content used without permission. David Pakman, whose channel The David Pakman Show had nearly 160 videos included, voiced his concerns: “This is my livelihood, and I put time, resources, money, and staff time into creating this content.”
Dave Wiskus, CEO of Nebula, a creator-owned streaming service, criticized the practice, calling it “theft” and “disrespectful.”
EleutherAI, the creator of the dataset, did not respond to requests for comment. The dataset, which consists of plain-text subtitles and translations from YouTube videos, appears to violate YouTube’s terms of service. Nevertheless, tech giants including Apple, Nvidia, and Salesforce have acknowledged using the dataset as part of the Pile, a larger compilation of data.
Jennifer Martinez of Anthropic confirmed the company’s use of the dataset but said that, under the company’s interpretation, YouTube’s terms allowed for indirect use.
The controversy highlights the ethical and legal challenges of using digital content for AI training. As AI technology continues to evolve, there is a growing need for regulations that protect content creators and ensure they are fairly compensated for their work, along with clearer guidelines on intellectual property rights in the tech industry.