
Gemma 4: Google's Most Capable Open Model Runs on Your Hardware

Google released Gemma 4 under Apache 2.0. The 31B model ranks #3 among open models on the Arena AI text leaderboard, the 26B MoE activates just 3.8B params, and the smallest variants run on phones.


TL;DR: Google released Gemma 4 under Apache 2.0, in four sizes from phone-sized to frontier-class. The 31B Dense ranks #3 among open models on the Arena AI text leaderboard. The 26B MoE activates only 3.8B parameters per token and outperforms models up to 20x its size. The smallest variants run offline on phones and Raspberry Pi with native audio, video, and image understanding. Up to 256K context. 140+ languages. Built from the same research as Gemini 3, then given away.


What is Gemma 4?

Gemma 4 is Google DeepMind's latest open model family, released April 2, 2026.  [1] Built from the same research that powers Gemini 3 (Google's proprietary model), it ships in four sizes under Apache 2.0.

The headline: the 31B Dense model ranks #3 among open models on the Arena AI text leaderboard. The 26B MoE sits at #6. Both outperform models up to 20x larger.  [1]

This isn't a research preview. The Gemma family already has 400 million+ downloads and 100,000+ community fine-tuned variants.  [1] Gemma 4 is the next generation of that ecosystem.

Figure: Arena AI Elo score versus total model size. Both the 31B and 26B Gemma 4 models sit in the top-left corner, matching or beating models 20x their size.

What are the four models?

31B Dense — the largest and most capable. Maximum quality, best foundation for fine-tuning. Unquantized bfloat16 weights fit on a single 80GB NVIDIA H100. Quantized versions run on consumer GPUs.  [1]
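The single-GPU claim can be checked with back-of-envelope arithmetic. The sketch below counts memory for the weights alone, assuming exactly 31 billion parameters and decimal gigabytes. It ignores the KV cache, activations, and the small per-block overhead that real quantization formats add, so treat the results as lower bounds.

```python
def weight_memory_gb(num_params: float, bits_per_param: int) -> float:
    """Approximate memory for the weights alone, in decimal gigabytes."""
    bytes_total = num_params * bits_per_param / 8
    return bytes_total / 1e9


PARAMS_31B = 31e9  # parameter count as stated in the article

for label, bits in [("bfloat16", 16), ("8-bit", 8), ("4-bit", 4)]:
    print(f"{label:>8}: {weight_memory_gb(PARAMS_31B, bits):.1f} GB")

# bfloat16: 62.0 GB
#    8-bit: 31.0 GB
#    4-bit: 15.5 GB
```

At 16 bits the weights take about 62 GB, which leaves some headroom on an 80GB card. At 4 bits they drop to roughly 15.5 GB, which is within range of high-end consumer GPUs.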

26B Mixture of Experts (MoE) — optimized for speed. It has 26 billion total parameters but activates only 3.8 billion during inference. That's the efficiency trick: frontier-level reasoning at a fraction of the compute.  [1]
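To make "activates only 3.8 billion" concrete, here is a toy sketch of top-k expert routing, the general idea behind mixture-of-experts layers. The expert count, the value of k, and the scalar "experts" are made up for illustration and are not Gemma 4's actual configuration. The router scores are random stand-ins for what a trained gating network would produce.

```python
import math
import random


def top_k_route(gate_logits, k):
    """Return (expert_index, weight) pairs for the k highest-scoring experts.

    Weights are a softmax over the selected logits only, so they sum to 1.
    """
    order = sorted(range(len(gate_logits)), key=lambda i: gate_logits[i], reverse=True)
    chosen = order[:k]
    peak = max(gate_logits[i] for i in chosen)  # subtract the max for numerical stability
    exps = [math.exp(gate_logits[i] - peak) for i in chosen]
    total = sum(exps)
    return [(i, e / total) for i, e in zip(chosen, exps)]


def moe_forward(x, experts, gate_logits, k):
    """Run only the routed experts and mix their outputs by routing weight."""
    routes = top_k_route(gate_logits, k)
    output = sum(weight * experts[i](x) for i, weight in routes)
    return output, [i for i, _ in routes]


# Hypothetical toy setup: 8 scalar "experts", 2 active per token.
NUM_EXPERTS, TOP_K = 8, 2
experts = [lambda x, scale=scale: scale * x for scale in range(1, NUM_EXPERTS + 1)]

rng = random.Random(0)
gate_logits = [rng.gauss(0, 1) for _ in range(NUM_EXPERTS)]  # stand-in for a learned router
y, used = moe_forward(1.0, experts, gate_logits, TOP_K)
print(f"experts used: {sorted(used)} of {NUM_EXPERTS}")

# Ratio from the article's figures: 3.8B active out of 26B total.
print(f"active fraction: {3.8 / 26:.1%}")  # active fraction: 14.6%
```

The other six experts are never evaluated for this token, which is where the compute saving comes from. All expert weights still have to be stored, so memory scales with the 26B total rather than the 3.8B active figure.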

E4B and E2B — engineered for mobile, IoT, and edge devices. These activate an effective 4 billion and 2 billion parameter footprint during inference. They run completely offline on phones, Raspberry Pi, and NVIDIA Jetson Orin Nano with near-zero latency.  [1]

The larger models support 256K context and the edge models support 128K. All four handle native video and image processing and are trained on 140+ languages. The edge models also handle native audio input.  [1]

Why does the Apache 2.0 license matter?

Previous Gemma releases used Google's custom license with usage restrictions. Gemma 4 switches to Apache 2.0 — the same license used by projects like Kubernetes and Apache Kafka.  [1]

That means: full commercial rights, no usage restrictions, complete control over your data, infrastructure, and model weights. Fine-tune it. Deploy it anywhere. No phone calls to Google's legal team.

Clément Delangue, CEO of Hugging Face, called it "a huge milestone."  [1] It's hard to disagree. Google went from "open weights with asterisks" to actual open source.

What's built for agents?

Gemma 4 isn't positioned as a chatbot. The design is explicitly agentic:  [1]

  • Function calling — models can invoke tools and APIs natively
  • Structured JSON output — reliable machine-readable responses
  • Native system instructions — persistent behavior configuration
  • Code generation — designed to power local-first AI coding assistants

This is the pattern the industry is converging on. Models that can plan, call tools, check results, and iterate — not just generate text. Gemma 4 bakes that into the architecture rather than bolting it on.
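The "call tools, check results" step can be sketched on the application side as follows. This is a minimal illustration, not Gemma 4's actual tool-call format: the JSON shape (`tool` plus `arguments`), the `get_weather` tool, and the `dispatch` helper are all hypothetical, and the model output is a hard-coded string so the example runs without any model.

```python
import json


def get_weather(city: str) -> dict:
    """Hypothetical tool; a real one would call a weather API."""
    return {"city": city, "temp_c": 21}


TOOLS = {"get_weather": get_weather}


def dispatch(model_output: str) -> dict:
    """Parse a JSON tool call and run the matching tool.

    Assumed call shape: {"tool": <name>, "arguments": {...}}.
    Errors come back as data so they can be fed to the model for a retry.
    """
    try:
        call = json.loads(model_output)
    except json.JSONDecodeError as exc:
        return {"error": f"invalid JSON: {exc.msg}"}
    if not isinstance(call, dict):
        return {"error": "tool call must be a JSON object"}
    name = call.get("tool")
    tool = TOOLS.get(name) if isinstance(name, str) else None
    if tool is None:
        return {"error": f"unknown tool: {name!r}"}
    try:
        return {"result": tool(**call.get("arguments", {}))}
    except TypeError as exc:
        return {"error": f"bad arguments: {exc}"}


# Stand-in for text a model might emit; no model is invoked here.
fake_model_output = '{"tool": "get_weather", "arguments": {"city": "Lisbon"}}'
print(dispatch(fake_model_output))
# {'result': {'city': 'Lisbon', 'temp_c': 21}}
```

Returning errors as values rather than raising is a design choice: an agent loop can append the error to the conversation and let the model correct its own call.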

Where can you run it?

The range is deliberately wide:  [1]

  • Google AI Studio — instant access to 31B and 26B
  • Google AI Edge Gallery — E4B and E2B for mobile development
  • Hugging Face, Ollama, vLLM, llama.cpp, MLX, LM Studio — day-one support
  • NVIDIA NIM and NeMo — enterprise deployment
  • Google Cloud (Vertex AI, Cloud Run, GKE) — production scale
  • Android Studio Agent Mode — direct integration for app developers

Quantized versions mean the 31B model runs on hardware most developers already own. The edge models run on hardware most people carry in their pockets.
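For a local runtime such as Ollama, talking to the model is a plain HTTP call. The sketch below assumes an Ollama server on its default port and uses its `/api/generate` endpoint in non-streaming mode. The model tag is a placeholder, since the article does not give one; substitute whatever name your runtime lists for the Gemma 4 variant you pulled.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"  # Ollama's default local endpoint
MODEL_TAG = "your-gemma-4-tag"  # placeholder, not a real tag: check your runtime's model list


def build_generate_request(model: str, prompt: str) -> urllib.request.Request:
    """Build a non-streaming generate request without sending it."""
    payload = {"model": model, "prompt": prompt, "stream": False}
    return urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def generate(model: str, prompt: str, timeout: float = 120.0) -> str:
    """Send the request to a running server and return the generated text."""
    request = build_generate_request(model, prompt)
    with urllib.request.urlopen(request, timeout=timeout) as resp:
        return json.loads(resp.read())["response"]
```

With a server running and a model pulled, `generate(MODEL_TAG, "Summarize Apache 2.0 in one sentence.")` would return the completion as a string. Nothing in the block touches the network until `generate` is called.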

What does this mean?

The open model landscape just got more competitive. Google is shipping Gemini 3-class technology as a free, Apache 2.0 download that runs on a consumer GPU.

For developers building AI products, the practical implication is clear: frontier-level reasoning is no longer locked behind API calls. You can fine-tune Gemma 4 on your data, run it on your infrastructure, and ship it in your product, with no usage restrictions and no per-token API fees. What you pay for is your own hardware.

The constraint now isn't access. It's imagination.

Key takeaways

  • 31B Dense = #3 open model on the Arena AI text leaderboard, 26B MoE = #6. Both outperform models up to 20x their size.
  • 26B MoE activates just 3.8B parameters per token — frontier intelligence at a fraction of the compute.
  • E2B and E4B run on phones and Raspberry Pi — offline, with audio, video, and image understanding.
  • Apache 2.0 license — full open source, no restrictions. A first for Google's Gemma family.
  • Agentic by design — function calling, structured JSON, system instructions baked in.
  • 256K context on larger models, 128K on edge. 140+ languages.



Footnotes

  1. Gemma 4: Byte for byte, the most capable open models (Google Blog)
