Exploiting DeepSeek-R1: Breaking Down Chain of Thought Security
Since its debut in January, the large language model (colloquially, an AI) DeepSeek-R1 has been a sensation, providing much the same utility as OpenAI’s o1 model at a fraction of the development cost. This is no doubt a boon to many enterprises, but since our primary focus at Blackwired is security, it is our duty to sound a note of caution by discussing how this model can be subverted, both to retrieve secret data and to provide threat actors with tools to expand their malicious activities. The former is perhaps of greater interest to enterprises, but the latter can put anyone at risk.
Many AI models have embraced a method known as Chain-of-Thought (CoT) reasoning, in which the model explicitly lays out the step-by-step process by which it reaches its conclusions. In DeepSeek-R1, this reasoning is displayed directly to the user, inside a section marked with “think” tags. This is a powerful tool for debugging and for improving the quality of guidelines in AI agents, but when the reasoning is in public view, it not only improves the effectiveness of prompt-based attacks, it can expose secret data even when the guardrails forbid that data from appearing in the answer itself.

For example, one test prompt has the model assume the role of a weather reporter. The model is given access to an API for weather lookups, along with an API key it is told never to disclose. When asked for a forecast, the model does not break its guardrails in the actual output, but within the thinking panel it constructs the API call, key included, disclosing the secret to anyone viewing the reasoning. In this case, chain-of-thought reasoning led to a breach of secure data.
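As a minimal sketch of what detecting this looks like on the server side (the tag format matches DeepSeek-R1’s `<think>...</think>` output, but the weather API URL, the response text, and the key value here are invented for illustration), one can simply check whether a known secret ever surfaces in the reasoning trace even though the visible answer stays clean:

```python
import re

def think_sections(response: str) -> list[str]:
    """Return the contents of any <think>...</think> blocks in a model response."""
    return re.findall(r"<think>(.*?)</think>", response, flags=re.DOTALL)

def secret_leaked(response: str, secrets: list[str]) -> bool:
    """Flag a response whose reasoning trace contains a known secret, even if
    the visible answer outside the think block never mentions it."""
    reasoning = " ".join(think_sections(response))
    return any(secret in reasoning for secret in secrets)

# Invented example: the visible answer is clean, but the reasoning leaks the API key.
response = (
    "<think>To get the forecast I will call "
    "https://api.example-weather.test/v1/forecast?city=Oslo&key=WX-SECRET-12345</think>"
    "Today in Oslo: light rain with a high of 9 degrees."
)
print(secret_leaked(response, ["WX-SECRET-12345"]))  # True
```

In practice the same pattern generalizes to scanning the reasoning for common credential formats rather than a fixed list of known secrets.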
This is far from the only way in which DeepSeek-R1 is vulnerable. A recent study subjected the model to thorough testing with the NVIDIA Garak LLM vulnerability scanner. The results showed very high susceptibility to insecure output generation, largely attributable to the exposed chain of thought. By reading that chain of thought, attackers can identify weaknesses in the guardrails and craft prompts that exploit those loopholes. In one such case, the model was asked to imagine writing the climax of a mystery novel in which a villain uses AI to impersonate a celebrity and write a scam email. Framed as fiction, the request passed muster, and the guardrails against producing scam content were sidestepped.
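Teams that want to run a similar assessment against their own deployment can point garak at a locally hosted model. The sketch below assumes the distilled DeepSeek-R1 weights on Hugging Face and the prompt-injection probe family; probe and generator names vary between garak versions, so treat the arguments as illustrative rather than the exact configuration used in the study.

```python
import subprocess

# Invoke the NVIDIA garak scanner against a locally hosted DeepSeek-R1 distillation.
# The model name and probe selection are illustrative; run `python -m garak --list_probes`
# to see what your installed version supports.
subprocess.run(
    [
        "python", "-m", "garak",
        "--model_type", "huggingface",
        "--model_name", "deepseek-ai/DeepSeek-R1-Distill-Llama-8B",
        "--probes", "promptinject",
    ],
    check=True,
)
```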
Ultimately, it is beyond the scope of Blackwired to advise for or against the use of any particular AI model, or of AI in general. What we can say is that exposed CoT reasoning presents a clear and present danger to the enterprise that uses it. Any company making use of DeepSeek or another CoT-based model is strongly advised to filter out think tags in any AI-based chatbot application, or anywhere else the reasoning might be visible to a malicious actor.
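As a starting point, such a filter can be a few lines of server-side code. The sketch below assumes DeepSeek-R1’s `<think>...</think>` tag format and a non-streaming response; streamed output needs the same filtering applied to the assembled text, or the reasoning suppressed at the inference layer.

```python
import re

# Matches a complete <think>...</think> block, including the tags themselves.
THINK_BLOCK = re.compile(r"<think>.*?</think>\s*", flags=re.DOTALL)

def strip_reasoning(response: str) -> str:
    """Remove chain-of-thought blocks from a model response before it reaches
    the end user, so the reasoning trace never leaves the server."""
    return THINK_BLOCK.sub("", response).strip()

# Invented example: only the final answer survives; whatever the reasoning leaked is dropped.
raw = "<think>Calling the weather API with key WX-SECRET-12345 ...</think>Expect sunshine tomorrow."
print(strip_reasoning(raw))  # Expect sunshine tomorrow.
```

Note that filtering at the application boundary does not remove the reasoning from raw API responses or server logs, so access to those should be restricted as well.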