What Happened: A security researcher has discovered that over 143,000 user conversations with GenAI chatbots—including Claude, Copilot, and ChatGPT—are publicly accessible on Archive.org. This discovery follows closely on the heels of recent disclosures that ChatGPT queries were being indexed by search engines like Google. Incidents like these continue to raise broader concerns about data leakage and inadvertent exposure of sensitive information through AI platforms.
Incident in-Depth: In a recent blog post, security researcher dead1nfluence outlines how they uncovered shareable links from platforms including Grok, ChatGPT, Copilot, Claude, Mistral, and Qwen.
- The researcher first identified that archived ChatGPT links were accessible through a public API provided by the Wayback Machine
- By querying this endpoint, they retrieved a comprehensive list of shared LLM conversation URLs
- Using basic command-line tools (such as wget and grep), they downloaded and categorized the links by provider (see the first sketch after this list)
- Each LLM platform used a different mechanism for rendering shared content, ranging from direct public access to requiring backend API calls
- Once appropriate endpoints were identified, the URLs were programmatically prepared for bulk collection
- In total, 143,000 publicly accessible LLM conversations were recovered across multiple providers
- The researcher also investigated whether any of the exposed content could be useful to an attacker, ultimately uncovering AWS Access Key IDs, a Replicate API token, and more (see the second sketch after this list)
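The enumeration step can be reproduced conceptually with the Wayback Machine's public CDX API. The sketch below is a minimal illustration, not the researcher's actual tooling: the share-URL prefixes are assumptions for illustration only, and a real run would need pagination, rate limiting, and error handling.

```python
# Minimal sketch: list archived share-link captures via the Wayback Machine
# CDX API and bucket them by provider. The SHARE_PREFIXES values are
# illustrative assumptions, not a definitive list of what was queried.
import json
import urllib.parse
import urllib.request
from collections import defaultdict

CDX_ENDPOINT = "https://web.archive.org/cdx/search/cdx"

SHARE_PREFIXES = {
    "chatgpt": "chatgpt.com/share/",
    "claude": "claude.ai/share/",
    "grok": "grok.com/share/",
}

def fetch_archived_urls(prefix: str, limit: int = 1000) -> list[str]:
    """Return original URLs the Wayback Machine has captured under a prefix."""
    params = urllib.parse.urlencode({
        "url": prefix,
        "matchType": "prefix",
        "output": "json",
        "fl": "original",
        "collapse": "urlkey",  # de-duplicate repeated captures of the same URL
        "limit": limit,
    })
    with urllib.request.urlopen(f"{CDX_ENDPOINT}?{params}") as resp:
        rows = json.loads(resp.read().decode())
    return [row[0] for row in rows[1:]]  # rows[0] is the field-name header

if __name__ == "__main__":
    results = defaultdict(list)
    for provider, prefix in SHARE_PREFIXES.items():
        results[provider] = fetch_archived_urls(prefix)
        print(f"{provider}: {len(results[provider])} archived share links")
```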
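The secret hunting the researcher describes amounts to grep-style pattern matching over the downloaded transcripts. The toy version below relies on well-known prefixes (AWS Access Key IDs begin with AKIA/ASIA; Replicate tokens are documented as starting with r8_); the exact regexes are simplifying assumptions, not a production secret scanner.

```python
# Sketch: flag high-signal credential patterns in downloaded conversation text.
# Patterns are simplified public heuristics, not an exhaustive secret scanner.
import re

SECRET_PATTERNS = {
    "aws_access_key_id": re.compile(r"\b(?:AKIA|ASIA)[0-9A-Z]{16}\b"),
    "replicate_api_token": re.compile(r"\br8_[A-Za-z0-9]{20,}\b"),  # assumed length
}

def find_secrets(text: str) -> dict[str, list[str]]:
    """Return candidate secrets found in a conversation transcript."""
    hits: dict[str, list[str]] = {}
    for name, pattern in SECRET_PATTERNS.items():
        matches = pattern.findall(text)
        if matches:
            hits[name] = sorted(set(matches))
    return hits

# Example usage against a downloaded transcript:
# print(find_secrets(open("conversation_12345.txt").read()))
```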
Why It Matters: Security research like this demonstrates the serious data-privacy risks of AI adoption, especially for organizations using LLMs in sensitive workflows. Even when users have no malicious intent, prompts often contain proprietary business context, PII, and intellectual property. If these conversations are archived and made publicly accessible, that information is exposed to anyone who goes looking for it.
Taking a Step Back: Generative AI tools enable users to move quickly, but also introduce new, often hidden risks around data exposure.
- Users may not fully understand where their data goes after interacting with an AI system. In this case, something as simple as sharing a chatbot conversation via link can result in that content being permanently archived or indexed by search engines, even if it was only intended for review by a colleague.
- Companies often lack a clear picture of how and where their data flows once it reaches third-party AI platforms. Without proper guardrails, sensitive business information can easily leave the organization’s control. This creates exposure to reputational harm, regulatory violations, cybercriminal activity, and the long-term loss of control over sensitive business information.
General Security Strategies:
- Define and enforce AI usage policies:
  - Clearly outline what types of data can and cannot be entered into GenAI tools
  - Restrict the use of public LLM platforms for sensitive, confidential, or regulated data
  - Establish an approval process for adopting new AI tools internally
- Ensure visibility into shadow AI tools (those not known to IT or security)
- Monitor and audit AI usage:
  - Track who is using AI tools and for what purpose
  - Log and analyze AI-related activity
- Educate users on the risks of using AI tools
For Obsidian Customers:
- Inventory all AI usage, including shadow AI
- For each AI app, track and manage adoption, understand usage patterns, and evaluate risk levels
- Restrict access to unapproved AI platforms, preventing employees from using high-risk or unofficial tools
- Implement Obsidian’s prompt security controls:
  - Intercept prompts containing sensitive data before they are sent to third-party AI tools
  - Block submissions that contain classified or proprietary information, even when GenAI tools are accessed via unmanaged or personal accounts (a generic sketch of this kind of screening follows)
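To make the interception idea concrete, here is a generic, pattern-based “screen before send” check. This is a conceptual sketch only, using placeholder patterns and a simplistic allow/block policy; it is not Obsidian’s implementation or detection logic.

```python
# Conceptual sketch only: pattern-based screening of a prompt before it is
# sent to a third-party GenAI tool. Patterns and policy are placeholder
# assumptions, not Obsidian's detection logic.
import re

BLOCKLIST_PATTERNS = [
    re.compile(r"\b(?:AKIA|ASIA)[0-9A-Z]{16}\b"),           # AWS Access Key ID
    re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),                    # US SSN-style PII
    re.compile(r"(?i)\b(?:confidential|internal only)\b"),   # document markings
]

def screen_prompt(prompt: str) -> bool:
    """Return True if the prompt may be sent, False if it should be blocked."""
    return not any(p.search(prompt) for p in BLOCKLIST_PATTERNS)

# Example: public text passes, a prompt carrying a credential is blocked.
assert screen_prompt("Summarize this public blog post") is True
assert screen_prompt("Debug my AWS creds: AKIAABCDEFGHIJKLMNOP") is False
```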