A magistrate judge in the ongoing copyright suit against OpenAI has ordered the company to turn over a sample of 20 million private chat logs from ChatGPT users to the plaintiffs, reasoning that the protective order barring lawyers from leaking the contents, plus OpenAI's own "de-identification" of the logs, should be sufficient to protect those users' privacy. As Techdirt points out, the idea that you can keep communications private by simply redacting names and other identifying terms is a pernicious myth that has been debunked over and over and over.

And those previous examples are tiny compared to 20 million chat logs. How much deeply personal information lies in that trove, to say nothing of material whose disclosure would mean HIPAA violations, leaked trade secrets, classified information, and privileged communications? Sure, we all know we're not supposed to paste that kind of thing into a chatbot, but you know it happens... a lot. Honestly, I'm a little surprised the plaintiffs' lawyers want all that data, because if any of it somehow winds up on the dark web there will be a whole lot of very pissed-off people looking for someone to blame.
I take away two main lessons from this order:
- Far too many people think that simply redacting a few key names from logs makes the privacy concerns go away, and some of those misguided people are in positions of power. It probably doesn't help that the tech industry claims the very same magical ability when explaining why it's OK to hold on to your data. (The first sketch after this list shows how little redaction actually buys you.)
- There's a big social need for LLMs and other agents that can run on hardware you control rather than in the cloud, or at least for a guarantee that cloud-hosted chats are deleted soon after they're processed. Unfortunately, the financial incentives point the other way: AI companies protect their tech from competitors by keeping it on their own servers, and those chat logs are the secret sauce that lets an early mover stay ahead of the pack. (The second sketch below shows how little code local inference takes today.)
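To make the first point concrete, here is a minimal sketch of a classic linkage attack. Everything in it is made up for illustration: the log text, the "public profiles," and the matching rule are hypothetical stand-ins for the real auxiliary data (social media bios, data-broker records, breach dumps) an attacker would actually use. The point is that stripping the name does nothing when the surrounding details uniquely identify the author.

```python
# Toy linkage attack on a "de-identified" chat log. The name is redacted,
# but quasi-identifiers left in the text (city, employer, medical detail)
# can be joined against an auxiliary dataset to re-identify the author.
# All data below is fabricated for illustration.

redacted_logs = [
    {
        "log_id": 101,
        "text": "[NAME] asks: my cardiologist at Mercy Hospital in Boise says "
                "my arrhythmia is getting worse; draft a note to my boss at Acme Corp",
    },
]

# Public "auxiliary" data an attacker might already hold.
public_profiles = [
    {"name": "Jane Doe", "city": "Boise", "employer": "Acme Corp"},
    {"name": "John Roe", "city": "Austin", "employer": "Initech"},
]

def candidate_authors(log, profiles):
    """Return every profile whose quasi-identifiers all appear in the log text."""
    return [
        p for p in profiles
        if p["city"] in log["text"] and p["employer"] in log["text"]
    ]

for log in redacted_logs:
    matches = candidate_authors(log, public_profiles)
    if len(matches) == 1:
        # Redaction removed the name, but the combination of rare details
        # narrowed the author down to exactly one person anyway.
        print(f"log {log['log_id']} was likely written by {matches[0]['name']}")
```

Real attacks are fuzzier than this exact-substring toy, of course, but they scale the same way: the more text per user in the trove, the more quasi-identifiers there are to match on.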
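And to the second point, running a model on your own hardware is already pretty painless. Here is a hedged sketch using the open-source Ollama runtime and its Python client; it assumes you've installed Ollama, have its local server running, and have pulled a model (the model name here is just an example).

```python
# Minimal local-inference sketch using the Ollama Python client
# (pip install ollama; assumes the Ollama server is running locally
# and a model has been pulled, e.g. `ollama pull llama3`).
import ollama

response = ollama.chat(
    model="llama3",  # example model name; substitute whatever you've pulled
    messages=[{"role": "user", "content": "Summarize this contract clause ..."}],
)

# The prompt and the reply never leave your machine, so there is no
# vendor-side chat log for a court to order turned over in the first place.
print(response["message"]["content"])
```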