Full Report
Data exposed even briefly can live on in generative AI chatbots long after the data is made private.
Analysis Summary
# Vulnerability: Data Persistence in Copilot from Previously Public GitHub Repositories
## CVE Details
- CVE ID: Not provided in the source text. This appears to be a configuration/caching issue rather than a traditional CVE-tracked vulnerability.
- CVSS Score: Not provided.
- CWE: Not assigned in the source. Given the nature of the issue (an external system retaining data that was once public), the closest candidates are CWE-200 (Exposure of Sensitive Information to an Unauthorized Actor) and CWE-524 (Use of Cache Containing Sensitive Information).
## Affected Systems
- Products: Microsoft Copilot (and underlying Microsoft Bing indexing/caching mechanism).
- Versions: Not tied to a specific Copilot version. The issue applies whenever the underlying Bing cache still holds data from GitHub repositories that have since been made private or deleted.
- Configurations: Any GitHub repository that was public at any point in 2024 and was subsequently deleted or set to private.
## Vulnerability Description
The issue arises because data from GitHub repositories that were public, even briefly, was indexed and cached by the Microsoft Bing search engine. The cached data remains retrievable through Microsoft Copilot after the original repository has been set to private or deleted from GitHub, which effectively bypasses the intended access restrictions. As a result, Copilot can return intellectual property, confidential corporate data, access keys, and tokens that were present in the public snapshots of those repositories.
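The mechanism described above can be illustrated with a toy model. This is a minimal sketch, not a description of how Bing or Copilot is actually implemented: the `Repo` and `SearchCache` classes, the repository name, and the placeholder secret are all hypothetical. It shows only the general pattern of a cache that stores a snapshot while content is public and never re-checks visibility when serving it.

```python
from dataclasses import dataclass, field


@dataclass
class Repo:
    name: str
    content: str
    public: bool = False


@dataclass
class SearchCache:
    """Toy stand-in for a search-engine cache (hypothetical, not Bing's design)."""
    snapshots: dict = field(default_factory=dict)

    def crawl(self, repo: Repo) -> None:
        # Only public repositories are crawled, but the stored snapshot
        # keeps no link to the repository's later visibility.
        if repo.public:
            self.snapshots[repo.name] = repo.content

    def lookup(self, name: str):
        # No re-check of current visibility: this is the gap being modeled.
        return self.snapshots.get(name)


repo = Repo("acme/internal-tools", "API_KEY=example-not-a-real-key", public=True)
cache = SearchCache()
cache.crawl(repo)      # indexed while briefly public
repo.public = False    # owner sets the repository back to private
cache.crawl(repo)      # later crawls skip it, but nothing is purged
leaked = cache.lookup(repo.name)
```

In this model the second crawl does nothing, so the snapshot taken during the public window is still served. Closing the gap would require either purging entries when visibility changes or checking visibility at lookup time.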
## Exploitation
- Status: Demonstrated by researchers at Lasso; the report indicates that more than 20,000 repositories were identified. The source does not report exploitation in the wild beyond the researchers' own testing, but the data is accessible to anyone who prompts Copilot appropriately.
- Complexity: Low (requires only querying Copilot with the "right question").
- Attack Vector: Network (via interaction with the Copilot service).
## Impact
- Confidentiality: High (Exposure of IP, access keys, tokens).
- Integrity: Medium (indirect: exposed keys or tokens could be used to modify or corrupt data in the systems they grant access to).
- Availability: Low (The primary impact is information disclosure, not service disruption).
## Remediation
### Patches
- No specific patch versions are mentioned. Remediation relies on Microsoft adjusting Bing's indexing/caching policy for private repositories and purging existing sensitive entries from the cache.
### Workarounds
- **For Data Owners:** Ensure that sensitive code and secrets are never exposed publicly, even momentarily, and audit repository visibility settings regularly. Treat any key or token that was ever present in a public repository as compromised, and rotate or revoke it. (Note: Once data is in the Bing cache and reachable through Copilot, the owner has limited recourse other than reporting it and requesting removal.)
- **For Copilot Users:** Exercise caution when prompting Copilot for specific repository content, as historical public data may be returned.
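As one way to support the data-owner guidance above, repository contents can be scanned for secret-like strings before a repository's visibility is changed. The following is a minimal sketch under stated assumptions: the three patterns are illustrative only (dedicated secret scanners use far larger rule sets), and the function and variable names are hypothetical rather than taken from the source.

```python
import re

# Illustrative patterns only; a real scanner covers many more credential types.
SECRET_PATTERNS = {
    "aws_access_key_id": re.compile(r"\bAKIA[0-9A-Z]{16}\b"),
    "github_pat": re.compile(r"\bghp_[A-Za-z0-9]{36}\b"),
    "private_key_header": re.compile(r"-----BEGIN [A-Z ]*PRIVATE KEY-----"),
}


def find_secrets(text: str) -> list[tuple[int, str]]:
    """Return (line_number, rule_name) for each line that matches a pattern."""
    hits = []
    for lineno, line in enumerate(text.splitlines(), start=1):
        for rule, pattern in SECRET_PATTERNS.items():
            if pattern.search(line):
                hits.append((lineno, rule))
    return hits


sample = "\n".join([
    "region = us-east-1",
    "aws_key = AKIAIOSFODNN7EXAMPLE",  # AWS's documented example key ID
    "-----BEGIN RSA PRIVATE KEY-----",
])
findings = find_secrets(sample)
```

A check like this only reduces the chance of accidental exposure. Secrets that have already been public still need to be rotated or revoked, because removing them from the repository does not remove cached copies.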
## Detection
- Indicators of Compromise: Unexpected retrieval of private or deleted GitHub repository content, confidential keys, or proprietary code snippets when prompting Copilot.
- Detection Methods and Tools: Lasso, a security company, investigated by taking a historical list of public repositories and querying Copilot for their contents. Organizations that suspect leakage should compare their own history of public exposure against what AI tools currently return.
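The first step of such an audit, finding repositories that were once public but no longer are, can be sketched as follows. This is not Lasso's tooling, which the source does not describe in detail. It assumes the organization already has a historical list of its public repositories, and it relies on the GitHub REST API returning 404 for unauthenticated requests to private or deleted repositories. The repository names and the stubbed lookup are hypothetical.

```python
import urllib.error
import urllib.request


def classify(status_code: int) -> str:
    """Map an unauthenticated GitHub REST API status to a visibility label."""
    if status_code == 200:
        return "public"
    if status_code == 404:
        # Unauthenticated requests get 404 for both private and deleted repos.
        return "private_or_deleted"
    return "unknown"  # e.g. 403 or 429 when rate limited


def fetch_status(full_name: str) -> int:
    """Return the HTTP status for GET /repos/{owner}/{repo} (network call)."""
    url = f"https://api.github.com/repos/{full_name}"
    try:
        with urllib.request.urlopen(url, timeout=10) as resp:
            return resp.status
    except urllib.error.HTTPError as err:
        return err.code


def no_longer_public(historical, status_of=fetch_status):
    """Return repositories from the historical list that are not public now."""
    return [name for name in historical
            if classify(status_of(name)) == "private_or_deleted"]


# Offline demonstration with a stubbed status lookup (hypothetical names).
stub = {"acme/website": 200, "acme/internal-tools": 404, "acme/old-demo": 404}
candidates = no_longer_public(list(stub), status_of=stub.get)
```

The resulting candidates are the repositories worth checking manually against AI assistant output. Passing `status_of` as a parameter keeps the logic testable without network access; omitting it uses the live API, which is rate limited for unauthenticated clients.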
## References
- Vendor Advisories: None cited directly, as the article reports on newly discovered findings from Lasso.
- Relevant Links:
  - hXXps://techcrunch.com/2025/02/26/thousands-of-exposed-github-repos-now-private-can-still-be-accessed-through-copilot/