DeepSeek V4: The Frontier Model That Costs Almost Nothing

DeepSeek just released their V4 models, and the numbers are kind of absurd. We’re talking about a 1.6 trillion parameter model that costs $1.74 per million input tokens. For context, GPT-5.4 charges $10 per million tokens. That’s not a typo.

The Chinese AI lab dropped two models this week: DeepSeek-V4-Pro and DeepSeek-V4-Flash. Both are Mixture of Experts architectures with 1 million token context windows, released under the MIT license. Pro clocks in at 1.6T total parameters with 49B active, making it the largest open weights model available right now. Flash is smaller at 284B total with 13B active.
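The "active" number is the one that matters for per-token compute: in a Mixture of Experts model only the routed experts run for each token, so a 1.6T-parameter model can cost roughly what a ~49B dense model costs to serve. Here's a minimal sketch of that arithmetic, using the standard ~2 × active-parameters FLOPs-per-token rule of thumb rather than anything from DeepSeek's paper:

```python
# Rough per-token inference compute for an MoE model: only the routed experts
# run per token, so the "active" parameter count is what drives cost. The
# ~2 * active_params FLOPs-per-token figure is the usual rule of thumb,
# not a number from DeepSeek's paper.

def approx_flops_per_token(active_params):
    return 2 * active_params

for name, total, active in [("V4-Pro", 1.6e12, 49e9), ("V4-Flash", 284e9, 13e9)]:
    print(f"{name}: {active / total:.1%} of weights active, "
          f"~{approx_flops_per_token(active) / 1e9:.0f} GFLOPs per token")
```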

I’ve been following DeepSeek’s releases since their V3 series, and this feels like a different kind of jump. Not just in model size, but in how they’re thinking about efficiency and deployment costs.

The Efficiency Story Nobody’s Talking About

The real story here isn’t just that these models are cheap. It’s why they’re cheap. DeepSeek’s paper includes this detail that made me stop and reread: in a 1M token context scenario, V4-Pro uses only 27% of the FLOPs and 10% of the KV cache size compared to their own V3.2 from December.

Flash goes even further. It hits 10% of the FLOPs and 7% of the KV cache size. These aren’t incremental improvements. They’re architectural decisions that fundamentally change the economics of running these models.

This matters because it means cheaper inference isn’t coming from cutting corners. It’s coming from better engineering. The models are genuinely more efficient at processing long contexts, which is exactly where costs spiral out of control with traditional transformer architectures.
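To see why long contexts are where the money goes, remember that the KV cache in a conventional transformer grows linearly with sequence length: roughly seq_len × layers × 2 × kv_heads × head_dim × bytes per element. Here's a quick sketch with placeholder architecture numbers (these are illustrative, not DeepSeek's actual config):

```python
# Back-of-envelope KV cache size for a conventional transformer at long
# context. The layer/head/dim numbers are illustrative placeholders,
# not DeepSeek's architecture.

def kv_cache_bytes(seq_len, n_layers, n_kv_heads, head_dim, bytes_per_elem=2):
    # K and V each store n_kv_heads * head_dim values per layer, per token
    return seq_len * n_layers * 2 * n_kv_heads * head_dim * bytes_per_elem

full = kv_cache_bytes(seq_len=1_000_000, n_layers=60, n_kv_heads=8,
                      head_dim=128, bytes_per_elem=2)  # fp16/bf16 cache
print(f"1M-token KV cache: {full / 1e9:.0f} GB")              # ~246 GB
print(f"At 10% of that (V4-Pro's claimed ratio): {full * 0.10 / 1e9:.0f} GB")
```

Cutting that by 10x is the difference between a cache that needs its own hardware and one that fits alongside the weights.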

Running This Thing Locally

Pro is 865GB on Hugging Face. Flash is 160GB. I have a 128GB M5 MacBook Pro sitting here, and I’m actually optimistic about running Flash with some light quantization. The Pro model is more ambitious, but with smart expert loading from disk, it might be possible.
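A quick sanity check on whether that's realistic: quantized weight footprint is roughly parameters × bits ÷ 8, ignoring the KV cache, activations, and runtime overhead. This is a naive estimate only:

```python
# Naive quantized-weight footprint: params * bits / 8. Ignores embeddings,
# KV cache, activations, and any layers kept at higher precision.

def weight_gb(params_billion, bits):
    return params_billion * bits / 8  # 1B params at 8 bits = 1 GB

for bits in (8, 4, 3):
    print(f"Flash (284B) at {bits}-bit: ~{weight_gb(284, bits):.0f} GB")
    print(f"Pro (1.6T)   at {bits}-bit: ~{weight_gb(1600, bits):.0f} GB")
```

That puts a 3-to-4-bit Flash build right at the edge of 128GB of unified memory before you leave any room for the KV cache, which is why V4's cache efficiency matters for local use too.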

I tested both models through OpenRouter using llm-openrouter, and the quality is solid. I ran my usual pelican drawing test (because apparently that’s what I do now), and V4 holds up well compared to the V3 series. But honestly, the quality isn’t the headline.
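If you want to reproduce that, here's the shape of the call through the llm Python API with the llm-openrouter plugin installed (`llm install llm-openrouter`). The model id is my guess at how the plugin will expose V4, so check `llm models list` first:

```python
# Querying DeepSeek V4 Flash via OpenRouter through the llm library.
# The model id below is a guess, not a confirmed identifier.
import llm

model = llm.get_model("openrouter/deepseek/deepseek-v4-flash")
response = model.prompt("Generate an SVG of a pelican riding a bicycle")
print(response.text())
```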

The headline is that I might be able to run a model that competes with OpenAI’s best work on my laptop. Not a demo. Not a toy. A full frontier-class model with a million token context window.

The Pricing Reality Check

Let’s put some actual numbers on this. DeepSeek V4-Flash costs $0.14 per million input tokens and $0.28 per million output tokens. GPT-5.4 Nano, OpenAI’s small model, costs $0.20 input and $0.40 output. Flash is cheaper, despite being a bigger model with more parameters and a longer context window.

For the Pro model, you’re paying $1.74 input and $3.48 output. Gemini 3.1 Pro charges $15 and $30. Claude Sonnet 4.5 is $12 and $30. GPT-5.4 is $10 and $20. DeepSeek is between 3x and 10x cheaper than everything else at this capability level.
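To make that gap concrete, here's an arbitrary workload (500 million input tokens and 100 million output tokens a month) priced at the per-million rates quoted above; only the workload size is invented:

```python
# Monthly cost at the quoted per-million-token rates for a made-up workload
# of 500M input and 100M output tokens per month.

def monthly_cost(in_rate, out_rate, in_tokens_m=500, out_tokens_m=100):
    return in_rate * in_tokens_m + out_rate * out_tokens_m

rates = {
    "DeepSeek V4-Pro": (1.74, 3.48),
    "GPT-5.4": (10.00, 20.00),
    "Claude Sonnet 4.5": (12.00, 30.00),
    "Gemini 3.1 Pro": (15.00, 30.00),
}
for name, (inp, out) in rates.items():
    print(f"{name}: ${monthly_cost(inp, out):,.0f}/month")
```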

This isn’t loss-leader pricing to win market share. This is what efficient architecture looks like when you ship it.

What The Benchmarks Actually Say

DeepSeek’s own paper is pretty honest about where they stand. With reasoning tokens enabled, their Pro-Max variant competes with GPT-5.2 and Gemini 3.0 Pro. But it falls short of GPT-5.4 and Gemini 3.1 Pro. Their assessment? They’re about 3 to 6 months behind the state of the art.

That’s a hell of a position to be in when you’re charging one-tenth the price and releasing open source weights. Being six months behind the frontier while costing pennies on the dollar changes the calculation for basically every production deployment.

For most applications, you don’t need the absolute bleeding edge. You need something good enough that won’t bankrupt you when traffic spikes.

The Unsloth Factor

I’m watching the Unsloth team’s Hugging Face page pretty closely right now. They’re usually fast with quantized versions of major releases, and Flash seems like the perfect candidate for aggressive quantization. If they can get it down to a size that fits in my machine’s unified memory without destroying performance, this becomes the new default local model for serious work.

The idea that I could have a model this capable running entirely offline, with no API costs and no latency, feels like it should be bigger news than it is. Maybe everyone’s just tired of hearing about new model releases. But this one actually moves the needle on what’s practical to self-host.

DeepSeek keeps pushing these efficiency boundaries while everyone else is focused on raw capability metrics, and at some point that strategy is going to force the whole industry to catch up or explain why their models cost 10x more to run.
