Gemma 3n vs Gemma 4: A Real-World Benchmark Guide on the Galaxy S25 Ultra
- TecAce Software
Every time Google ships a new generation of the Gemma series, the question we ask first is simple: how much faster does it actually run on real hardware? To answer that directly, TecAce ran a head-to-head benchmark between Gemma 3n and Gemma 4 on the Samsung Galaxy S25 Ultra under identical conditions.
We tested four model configurations using the llama.cpp CPU inference engine: the previous-generation Gemma 3n E2B Q8_0 as our baseline, and three quantization variants of Gemma 4 E2B (Q3_K_M, Q4_K_M, Q8_0). Beyond raw speed, we measured task-specific latency and accuracy across real-world scenarios including summarization, structured output, code generation, and math.
The short answer: Gemma 4 delivers clear generational progress across most tasks. But there's a significant catch — choosing the wrong quantization for math-heavy applications can make things substantially worse. This report covers both sides of that story.
Test Environment
All tests were conducted on the same device under identical inference engine settings.
| Item | Value |
| --- | --- |
| Device | Samsung SM-S942U (Galaxy S25 Ultra) |
| SoC | QTI SM8850 (Snapdragon 8 Elite) · 8 Cores |
| OS | Android 16 (SDK 36) |
| Inference Engine | llama.cpp (CPU only) |
| Test Date | April 8, 2026 |
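As a rough sketch of how throughput numbers like these can be collected with llama.cpp's bundled `llama-bench` tool: the GGUF filenames and the prompt/generation sizes below are hypothetical, since the article does not publish its exact invocation.

```python
# Sketch: build llama-bench invocations for each test configuration.
# The GGUF filenames are hypothetical; the article does not list its
# exact file names or benchmark arguments.
MODELS = {
    "G3n Q8_0": "gemma-3n-e2b-q8_0.gguf",
    "G4 Q3_K_M": "gemma-4-e2b-q3_k_m.gguf",
    "G4 Q4_K_M": "gemma-4-e2b-q4_k_m.gguf",
    "G4 Q8_0": "gemma-4-e2b-q8_0.gguf",
}

def bench_cmd(gguf_path: str, threads: int = 8) -> list[str]:
    """Build a llama-bench command measuring prefill (-p) and decode (-n)."""
    return [
        "llama-bench",
        "-m", gguf_path,
        "-p", "512",   # prompt-processing (prefill) test size
        "-n", "128",   # token-generation (decode) test size
        "-t", str(threads),
    ]

for name, path in MODELS.items():
    print(name, "->", " ".join(bench_cmd(path)))
```

Running one such command per model on the same device, back to back, is what makes the prefill and decode TPS columns below directly comparable.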
Test Model Configurations
| Generation | Quantization | Note | Chart Color |
| --- | --- | --- | --- |
| Gemma 3n | Q8_0 | Baseline (prev gen) | Blue — Reference |
| Gemma 4 | Q3_K_M | 3-bit lightweight | Purple |
| Gemma 4 | Q4_K_M | 4-bit balanced | Magenta |
| Gemma 4 | Q8_0 ★ | 8-bit high precision (recommended) | Green — Best performer |
Overall Performance: 37% Faster with 2× Context Processing
Comparing at identical Q8_0 quantization, Gemma 4 shows clear improvements over the previous generation across all key metrics.
| Model | Avg Latency | Decode TPS | Prefill TPS | Memory | Pass Rate |
| --- | --- | --- | --- | --- | --- |
| G3n Q8_0 (baseline) | 38.4s | 13.71 | 19.41 | 779 MB | 72% |
| G4 Q3_K_M | 26.2s | 13.18 (-4%) | 27.62 (+42%) | 707 MB | 60% |
| G4 Q4_K_M | 26.3s | 17.20 (+25%) ★ | 34.75 (+79%) | 708 MB | 64% |
| G4 Q8_0 ★ | 24.2s (-37%) | 16.75 (+22%) | 40.15 (+107%) ★ | 708 MB | 72% ★ |
The most striking metric is Prefill TPS. G4 Q8_0 processes 40.15 tokens per second during prefill — a 2.1× improvement over Gemma 3n (19.41). Since prefill handles the initial processing of user input, this improvement directly translates to faster perceived response times in applications with long conversation histories, RAG pipelines, and document processing workflows.
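To see why prefill throughput dominates perceived latency on long inputs, here is a back-of-the-envelope time-to-first-token estimate from the measured rates. This is a simplification that ignores tokenization, sampling, and scheduling overhead.

```python
# Rough time-to-first-token (TTFT): prompt_tokens / prefill_TPS.
# Simplification: ignores tokenizer, sampling, and scheduling overhead.
PREFILL_TPS = {"G3n Q8_0": 19.41, "G4 Q8_0": 40.15}

def ttft_seconds(prompt_tokens: int, model: str) -> float:
    return prompt_tokens / PREFILL_TPS[model]

# A 2,048-token context, e.g. a long chat history or a RAG prompt:
for model in PREFILL_TPS:
    print(f"{model}: ~{ttft_seconds(2048, model):.0f}s to first token")
```

For a 2,048-token prompt the gap is roughly a minute versus half a minute before the first generated token appears, which is the difference users actually feel.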
Memory consumption also improved. All Gemma 4 variants settled around 708 MB, approximately 70 MB less than Gemma 3n (779 MB) — a meaningful reduction given the models share the same parameter count.
Task-by-Task Performance Breakdown
The aggregate averages tell only part of the story: variance across task types is dramatic. Structured output and summarization accelerate massively, while math regresses significantly in one variant.
| Task | G3n Q8_0 | G4 Q3_K_M | G4 Q4_K_M | G4 Q8_0 | Best Speedup |
| --- | --- | --- | --- | --- | --- |
| Brief answers | 0.5s | 0.5s | 0.4s | 0.4s | 1.42× |
| Creative writing | 9.2s | 6.2s | 4.6s | 3.8s | 2.40× |
| Reasoning | 4.0s | 2.0s | 8.0s | 10.2s | 1.99× |
| Summarization | 13.5s | 6.4s | 3.5s | 2.9s | 4.72× ★ |
| Structured output (JSON) | 40.6s | 12.3s | 6.6s | 9.5s | 6.19× ★★ |
| Knowledge explanation | 38.3s | 19.6s | 18.1s | 18.2s | 2.12× |
| Math | 9.1s | 26.5s | 50.0s ⚠ | 21.6s | 0.42× ⚠ Regression! |
| Code generation | 93.3s | 48.6s | 38.6s | 51.0s | 2.42× |
| Long-form content | 104.5s | 92.4s | 87.0s | 82.6s | 1.27× |
The 6.19× speedup for structured output (JSON, markdown tables) delivered by G4 Q4_K_M is the standout result of this benchmark. Applications dealing with API response formatting or data pipeline generation will see substantial real-world impact.
Summarization also shows a compelling 4.72× improvement with G4 Q8_0 (13.5s → 2.9s), making a strong case for adopting Gemma 4 in document-based chatbots and news digest applications.
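The per-task speedups can be recomputed from the rounded table values; small discrepancies against the printed figures come from the article computing them on unrounded measurements.

```python
# Best speedup per task = baseline latency / fastest Gemma 4 latency.
# Values are the rounded seconds from the task table above.
TASKS = {
    "summarization": (13.5, [6.4, 3.5, 2.9]),
    "structured_output": (40.6, [12.3, 6.6, 9.5]),
    "math": (9.1, [26.5, 50.0, 21.6]),
}

def best_speedup(baseline: float, g4_times: list[float]) -> float:
    return baseline / min(g4_times)

for task, (base, g4) in TASKS.items():
    print(f"{task}: {best_speedup(base, g4):.2f}x")
```

Note that for math the "best" Gemma 4 time is still slower than the baseline, so the ratio drops below 1.0, which is exactly the regression flagged below.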
Critical Warning: Q4_K_M Math Regression
Not all metrics point upward. We observed a severe latency regression in math tasks under Gemma 4 Q4_K_M — jumping from 9.1s to 50.0s, a 5.5× slowdown.
Root cause: at 4-bit quantization, the model falls into excessive chain-of-thought generation, producing unnecessarily long reasoning chains when solving mathematical problems. The behavior does not appear in Q3_K_M (26.5s) or Q8_0 (21.6s); it is specific to the Q4 quantization level.
For applications where math or logical reasoning is core functionality:
• Calculators, math tutors, coding assistants → Use G4 Q8_0 or G3n Q8_0
• Exclude G4 Q4_K_M from any math-centric workflow
• If battery savings are the top constraint, G4 Q3_K_M is faster than Q4_K_M for math (26.5s vs 50.0s)
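One way to encode these recommendations in an app is a small routing function that never sends math-heavy requests to the Q4_K_M build. The task labels and the function itself are illustrative helpers, not part of the article.

```python
# Illustrative quantization router based on the benchmark findings.
# The task labels and function name are hypothetical, not from the article.
def pick_variant(task: str, battery_first: bool = False) -> str:
    if task in ("math", "logic"):
        return "G4 Q8_0"        # avoid the Q4_K_M math regression
    if task == "structured_output":
        return "G4 Q4_K_M"      # largest speedup on JSON / tables
    if battery_first:
        return "G4 Q3_K_M"      # lowest energy per run
    return "G4 Q8_0"            # best overall balance

print(pick_variant("math"))
print(pick_variant("chat", battery_first=True))
```

Centralizing the choice in one function also makes it trivial to update the routing table when the next model generation shifts these trade-offs.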
System Impact: Battery, Thermals & Memory
Beyond raw inference speed, battery drain and thermal behavior determine whether a model is viable for sustained mobile usage. Here's how the four variants compare.
| Model | Memory | Battery/Run | Thermal Delta | Init Time | Pass Rate |
| --- | --- | --- | --- | --- | --- |
| G3n Q8_0 | 779.0 MB | 0.240% | 6.0 ★ | 542ms | 72.0% |
| G4 Q3_K_M | 706.8 MB ★ | 0.160% ★ | 10.32 | 381ms ★ | 60.0% |
| G4 Q4_K_M | 707.6 MB | 0.240% | 12.92 | 436ms | 64.0% |
| G4 Q8_0 ★ | 707.7 MB | 0.320% | 15.20 | 523ms | 72.0% ★ |
G4 Q3_K_M achieves the lowest battery consumption at 0.160% per run and the fastest initialization at 381ms — making it ideal for IoT devices or low-power scenarios where energy budget is the primary constraint. The trade-off is a 60% pass rate, which limits its applicability for accuracy-critical applications.
G4 Q8_0 consumes the most battery at 0.320% per run but matches G3n's 72% pass rate while running 37% faster. For most production applications, it offers the best overall balance.
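In concrete terms, the per-run battery figures translate into a rough runs-per-full-charge budget. This is a simplification that assumes drain scales linearly and nothing else uses the battery.

```python
# Rough runs per full charge = 100% / battery% per run.
# Simplification: assumes linear drain and no other battery consumers.
BATTERY_PER_RUN = {"G4 Q3_K_M": 0.160, "G3n Q8_0": 0.240, "G4 Q8_0": 0.320}

def runs_per_charge(pct_per_run: float) -> float:
    return 100.0 / pct_per_run

for model, pct in BATTERY_PER_RUN.items():
    print(f"{model}: ~{runs_per_charge(pct):.0f} runs per charge")
```

By this estimate Q3_K_M stretches a charge to roughly twice as many runs as Q8_0, which is the trade-off against its lower pass rate.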
Conclusion & Model Selection Guide
Here are three key insights and scenario-based recommendations drawn from the benchmark results.
Key Insights
• G4 Q8_0 achieves 1.6× speed improvement over G3n at identical quantization, while simultaneously reducing memory by 70 MB.
• The 2× Prefill TPS improvement (19.41 → 40.15) will have the greatest impact in RAG systems, long conversation contexts, and document processing pipelines.
• The Q4_K_M math regression (9.1s → 50.0s) is a clear reminder that quantization selection carries quality risks beyond speed trade-offs.
Recommended Model by Use Case
| Scenario | Recommended | Rationale |
| --- | --- | --- |
| Maximum Performance (Speed + Quality) | G4 Q8_0 ★ | 1.6× faster than G3n, reduced memory, same pass rate |
| Streaming Conversation (Best Decode) | G4 Q4_K_M ★ | Decode TPS 17.2 — 25% improvement |
| Battery-First Priority | G4 Q3_K_M | Lowest battery consumption at 0.160% per run |
| Math & Logic Apps | G4 Q8_0 or G3n | Must avoid Q4_K_M math regression |
| JSON & Structured Output | G4 Q4_K_M ★ | 6.2× speedup over G3n — best-in-class performance |
Gemma 4 E2B demonstrates clear generational progress at the same parameter scale. The 37% latency reduction and 2× Prefill gain at Q8_0 are meaningful for real-app deployment. However, the Q4_K_M math anomaly requires careful model selection — which is exactly why we run these benchmarks before committing to a deployment decision.
You can find further details at the link.