How Did DeepSeek Catch Up at a Fraction of the Cost?
Like everyone else, I was more than a little surprised by DeepSeek. I’d been plugging along using the other AI models and tools for writing, planning, and crafting. I also still think the best writing tool is my brain and my hands, but there you have it.
Still, I teach and explore the world of AI, and I was fascinated by what just happened in the world of Large Language Models: the boom-snap of DeepSeek and its ability to tank Nvidia’s stock overnight. So, as one of the small pool of tech writers paying attention to how these things unfold, I figured I’d give a brief explanation and share a few thoughts here, with a nod to the folks who taught me a thing or two along the way.
DeepSeek’s leap comes down to four major innovations (and some smaller ones). Here’s the lowdown:
- They Distilled From a Leading Model, for Starters
DeepSeek likely distilled its model from an existing one, most likely Meta’s Llama 3, though they could have accessed OpenAI’s GPT-4 or Anthropic’s Claude. Distillation means training a new model on the outputs of an existing one; OpenAI took a similar approach with GPT-4 Turbo, which provides solid performance at lower cost by leveraging GPT-4 as a teacher.
This approach slashes the time and cost of creating a training set, but it has limits. Since your starting point is always someone else’s previous release, leapfrogging the competition becomes far more challenging. If DeepSeek used OpenAI’s or Anthropic’s models, it would violate their terms of service, but proving that is notoriously difficult.
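To make the mechanics concrete, here is a minimal sketch of classic logit distillation in PyTorch. It assumes Hugging Face-style `teacher` and `student` causal LMs that share a tokenizer; the function names are my own, and this illustrates the general technique, not DeepSeek’s actual pipeline. (If the teacher is only reachable through an API, you never see its logits, so distillation there would mean training on the teacher’s generated text instead.)

```python
# Sketch of logit distillation (Hinton-style soft targets).
# Assumes `teacher` and `student` share a vocabulary; names are illustrative.
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """Push the student's next-token distribution toward the teacher's."""
    t = temperature
    teacher_probs = F.softmax(teacher_logits / t, dim=-1)
    student_log_probs = F.log_softmax(student_logits / t, dim=-1)
    # KL(teacher || student); t^2 keeps gradient scale comparable across temperatures
    return F.kl_div(student_log_probs, teacher_probs,
                    reduction="batchmean") * (t * t)

def train_step(student, teacher, input_ids, optimizer):
    with torch.no_grad():
        teacher_logits = teacher(input_ids).logits  # teacher stays frozen
    student_logits = student(input_ids).logits
    loss = distillation_loss(student_logits, teacher_logits)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

The payoff is in the first line of `train_step`: the expensive part of dataset creation, deciding what a good answer looks like, is outsourced to a model someone else already paid to train.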
- Inference via Cache Compression
DeepSeek slashed the cost of inference (the process of generating responses) by compressing the key-value cache the model uses to make predictions. The breakthrough was clever, but it wasn’t entirely unexpected; it’s a technique others would probably have figured out before long. More importantly, DeepSeek published the method openly, so the entire industry can now benefit from their efforts.
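A quick note on why cache compression matters: during generation, a transformer keeps a key and value vector for every previous token, and at long context lengths that cache, not the weights, often dominates memory. Below is a hedged sketch of low-rank KV compression in the spirit of DeepSeek’s Multi-head Latent Attention (MLA); the dimensions and class names are my assumptions, not DeepSeek’s code.

```python
# Sketch of low-rank KV-cache compression (single layer, heads ignored).
# Idea: cache a small latent per token; rebuild full keys/values on demand.
import torch
import torch.nn as nn

class CompressedKVCache(nn.Module):
    def __init__(self, d_model=4096, d_latent=512):
        super().__init__()
        self.down = nn.Linear(d_model, d_latent, bias=False)  # compress
        self.up_k = nn.Linear(d_latent, d_model, bias=False)  # rebuild keys
        self.up_v = nn.Linear(d_latent, d_model, bias=False)  # rebuild values
        self.cache = []  # d_latent floats per token instead of 2 * d_model

    def append(self, hidden_state):
        self.cache.append(self.down(hidden_state))

    def keys_values(self):
        latents = torch.stack(self.cache)  # (tokens, d_latent)
        return self.up_k(latents), self.up_v(latents)
```

With these illustrative sizes, each cached token costs 512 floats instead of 8,192, roughly a 16x reduction, and that saving translates directly into cheaper inference at serving time.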
- Mixture of Experts Architecture
Unlike traditional LLMs, which activate the entire model during training and inference, DeepSeek adopted a “Mixture of Experts” architecture, in which a small learned router activates only the parts of the model needed for each token.
DeepSeek needs 95% fewer GPUs than Meta because, for each token, they train only about 5% of the parameters. This innovation radically lowers costs and makes the model far more efficient.
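Here is a hedged sketch of the core mechanism, top-k expert routing. The expert count, sizes, and the naive loop are illustrative, not DeepSeek’s implementation (production systems batch tokens by expert rather than looping).

```python
# Sketch of a Mixture of Experts layer with top-k routing.
import torch
import torch.nn as nn
import torch.nn.functional as F

class MoELayer(nn.Module):
    def __init__(self, d_model=1024, n_experts=64, k=4):
        super().__init__()
        self.router = nn.Linear(d_model, n_experts)  # learned gate
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(d_model, 4 * d_model),
                          nn.GELU(),
                          nn.Linear(4 * d_model, d_model))
            for _ in range(n_experts)])
        self.k = k

    def forward(self, x):                      # x: (tokens, d_model)
        scores = self.router(x)                # (tokens, n_experts)
        weights, idx = scores.topk(self.k, dim=-1)
        weights = F.softmax(weights, dim=-1)   # mixing weights per token
        out = torch.zeros_like(x)
        for t in range(x.size(0)):             # each token visits only k experts
            for slot in range(self.k):
                e = int(idx[t, slot])
                out[t] += weights[t, slot] * self.experts[e](x[t])
        return out
```

With 64 experts and k=4, each token touches roughly 6% of the layer’s expert parameters, which is exactly how total capacity can grow enormously while per-token compute, and thus the GPU bill, barely moves.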
- Reasoning Model Without Human Supervision
DeepSeek didn’t just match leading LLMs like OpenAI’s GPT-4 in raw capability; they also developed a reasoning model on par with OpenAI’s o1. Reasoning models combine LLMs with techniques like chain-of-thought (CoT) prompting, enabling them to correct errors and make logical inferences, qualities that purely predictive models lack.
OpenAI’s approach relied on reinforcement learning guided by human feedback. DeepSeek instead trained their model on math, code, and logic problems, using two reward functions: one for correct answers and one for answers that show a clear thought process. Instead of supervising every step, they encouraged the model to try multiple approaches and grade itself. This method allowed the model to develop reasoning capabilities independently.
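Here is a minimal sketch of what those two rewards could look like, assuming the model wraps its reasoning and final answer in tags. DeepSeek-R1 does use rule-based accuracy and format rewards, but the exact tag convention and scoring here are my illustration, not their published code.

```python
# Sketch of rule-based rewards: one for correctness, one for showing work.
import re

def accuracy_reward(completion: str, expected: str) -> float:
    """Reward 1: did the model land on the right final answer?"""
    m = re.search(r"<answer>(.*?)</answer>", completion, re.DOTALL)
    return 1.0 if m and m.group(1).strip() == expected.strip() else 0.0

def format_reward(completion: str) -> float:
    """Reward 2: did the model lay out its reasoning before answering?"""
    thought = re.search(r"<think>.+?</think>", completion, re.DOTALL)
    answer = re.search(r"<answer>.+?</answer>", completion, re.DOTALL)
    return 1.0 if thought and answer else 0.0

def total_reward(completion: str, expected: str) -> float:
    # Many sampled attempts are scored this way, and the policy is nudged
    # toward the high-reward ones -- no human grader in the loop.
    return accuracy_reward(completion, expected) + format_reward(completion)
```

Because both rewards can be checked mechanically, the training loop scales to millions of problems without the army of human labelers that reinforcement learning from human feedback requires.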
The TL;DR
- Innovation Beats Regulation
DeepSeek’s success is a reminder that competition should focus on technological progress, not regulatory barriers. Innovation wins.
- Lower Costs All Around
DeepSeek’s architecture dramatically reduces the cost of running advanced AI models, in both dollars and energy consumption. Their open-source, permissively licensed approach means the whole industry, including competitors, benefits.
- Models Are Becoming Commodities
As models become cheaper and more widely available, the value will shift toward building smarter applications rather than simply integrating a flashy “AI button.” Product managers who can creatively apply AI will drive the next wave of real-world impact.
- NVIDIA’s Moat Is Shrinking
DeepSeek’s GPU efficiency challenges assumptions about NVIDIA’s dominance. If others adopt similar techniques, the grip of chipmakers on the AI economy may loosen significantly.
*For some DeepSeek FAQs from folks who know a lot more than me, go here.
