Full Report
Close to 12,000 valid secrets that include API keys and passwords have been found in the Common Crawl dataset used for training multiple artificial intelligence models. [...]
Analysis Summary
# Vulnerability: Sensitive API Keys and Credentials Exposed in AI Training Dataset
## CVE Details
- CVE ID: N/A (This is a data exposure incident, not a traditional software vulnerability with a published CVE.)
- CVSS Score: N/A (Severity reflects potential impact of exposure rather than exploitation of a specific software flaw.)
- CWE: CWE-200: Exposure of Sensitive Information to an Unauthorized Actor
## Affected Systems
- Products: The Common Crawl dataset used for LLM training, and by extension the services whose credentials were leaked in it.
- Versions: Not applicable (Relates to the data used for training, not software versions).
- Configurations: Developers hardcoding secrets (API keys, passwords) directly into front-end HTML and JavaScript instead of using secure configuration methods like server-side environment variables.
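The secure alternative named in the configuration note, reading secrets from server-side environment variables, can be sketched roughly as follows. This is a minimal illustration, not a prescribed implementation; the function name `load_api_key` and the variable name `MAILCHIMP_API_KEY` are hypothetical and do not come from the report.

```python
import os


def load_api_key(name: str = "MAILCHIMP_API_KEY") -> str:
    """Read a secret from the server-side environment.

    The variable name is a hypothetical example. The key never appears
    in HTML or JavaScript shipped to the browser; only server code that
    calls the third-party API on the client's behalf sees it.
    """
    value = os.environ.get(name)
    if not value:
        # Fail loudly instead of falling back to a hardcoded default.
        raise RuntimeError(f"Missing required environment variable: {name}")
    return value
```

With this pattern, the front end calls the organization's own back-end endpoint, and only the back end attaches the key when it contacts the external service.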
## Vulnerability Description
Truffle Security researchers analyzing the Common Crawl dataset, which is used to train AI models, discovered approximately 12,000 instances of exposed sensitive information, including API keys and passwords. The primary cause was developers embedding these secrets directly in client-side code (HTML and JavaScript) rather than keeping them server-side. The exposed secrets included credentials for Amazon Web Services (AWS), MailChimp (nearly 1,500 unique keys found in HTML and JavaScript), and WalkScore. One WalkScore API key appeared 57,029 times across 1,871 subdomains. In addition, 17 unique, live Slack webhook URLs were found on a single webpage.
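Reuse figures such as "57,029 times across 1,871 subdomains" distinguish the total number of occurrences of a key from the number of distinct hosts it appears on. The sketch below shows one way such counts could be derived from a list of findings. It is an illustration under assumptions, not the researchers' method: the function `summarize_reuse` and the `(secret, page_url)` input shape are hypothetical.

```python
from collections import defaultdict
from urllib.parse import urlsplit


def summarize_reuse(findings):
    """Summarize how often each secret recurs.

    findings: iterable of (secret, page_url) pairs (hypothetical shape).
    Returns {secret: (occurrence_count, distinct_host_count)}.
    """
    counts = defaultdict(int)
    hosts = defaultdict(set)
    for secret, page_url in findings:
        counts[secret] += 1
        host = urlsplit(page_url).hostname
        if host:
            hosts[secret].add(host)
    return {secret: (counts[secret], len(hosts[secret])) for secret in counts}
```

A single key with a high occurrence count spread over many hosts usually points to one template or widget being copied widely, so rotating that one key addresses every occurrence.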
## Exploitation
- Status: The credentials were *exposed* in the training data. Specific instances of exploitation of these keys are not detailed, but the risk is high because the keys were live.
- Complexity: Low (If an attacker obtains the dataset, they can directly use the live keys/secrets).
- Attack Vector: Network (The leaked secrets allow for unauthorized network access to the respective services).
## Impact
- Confidentiality: High (Direct access to cloud resources on AWS, email marketing platforms such as MailChimp, and internal communication channels such as Slack).
- Integrity: High (Ability to manipulate data or infrastructure via the exposed keys).
- Availability: Medium (Potential for service disruption or resource exhaustion if keys are abused).
## Remediation
### Patches
- N/A (No specific software patch; remediation focuses on secret rotation and secure coding practices).
### Workarounds
- Organizations whose secrets were identified must immediately **rotate/revoke all exposed API keys, passwords, and webhook URLs.**
- Review and sanitize all public-facing code repositories and datasets used for training to ensure no secrets exist.
## Detection
- Indicators of Compromise: Unusual API activity, unauthorized resource creation or deletion on AWS, unexpected messages in Slack channels.
- Detection methods and tools: Run secret scanning tools (such as TruffleHog or Gitleaks) in the CI/CD pipeline and against code repositories to prevent future exposure. Review API usage logs for activity tied to the exposed keys.
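To illustrate what pattern-based secret scanning does, here is a minimal Python sketch that searches HTML or JavaScript text for three kinds of secrets named in this report. The regular expressions are simplified approximations of commonly documented key formats and are assumptions, not the rules used by TruffleHog or Gitleaks. Those tools use larger rule sets, and TruffleHog can also verify whether a match is live.

```python
import re

# Approximate, illustrative patterns (assumptions, not any tool's real rules).
PATTERNS = {
    "aws_access_key_id": re.compile(r"\b(?:AKIA|ASIA)[0-9A-Z]{16}\b"),
    "mailchimp_api_key": re.compile(r"\b[0-9a-f]{32}-us[0-9]{1,2}\b"),
    "slack_webhook_url": re.compile(
        r"https://hooks\.slack\.com/services/T[A-Z0-9]+/B[A-Z0-9]+/[A-Za-z0-9]+"
    ),
}


def scan_text(text):
    """Return a list of (pattern_label, matched_string) for every match."""
    hits = []
    for label, pattern in PATTERNS.items():
        for match in pattern.finditer(text):
            hits.append((label, match.group(0)))
    return hits
```

A regex match only shows that a string looks like a secret. Confirming that it is valid requires a separate check against the issuing service, which this sketch does not attempt.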
## References
- Vendor advisories: Affected vendors (AWS, MailChimp, Slack) were contacted by Truffle Security to assist with key revocation, resulting in thousands of keys being rotated.
- Relevant links:
  - hxxps://trufflesecurity.com/blog/research-finds-12-000-live-api-keys-and-passwords-in-deepseek-s-training-data