Gemma 3n vs Gemma 4: A Real-World Benchmark Guide on the Galaxy S25 Ultra
- TecAce Software
Every time Google ships a new generation of the Gemma series, the question we ask first is simple: how much faster does it actually run on real hardware? To answer that directly, TecAce ran a head-to-head benchmark between Gemma 3n and Gemma 4 on the Samsung Galaxy S25 Ultra under identical conditions.
We tested four model configurations using the llama.cpp CPU inference engine: the previous-generation Gemma 3n E2B Q8_0 as our baseline, and three quantization variants of Gemma 4 E2B (Q3_K_M, Q4_K_M, Q8_0). Beyond raw speed, we measured task-specific latency and accuracy across real-world scenarios including summarization, structured output, code generation, and math.
The short answer: Gemma 4 delivers clear generational progress across most tasks. But there's a significant catch — choosing the wrong quantization for math-heavy applications can make things substantially worse. This report covers both sides of that story.
Test Environment
All tests were conducted on the same device under identical inference engine settings.
| Item | Value |
| --- | --- |
| Device | Samsung SM-S942U (Galaxy S25 Ultra) |
| SoC | QTI SM8850 (Snapdragon 8 Elite) · 8 Cores |
| OS | Android 16 (SDK 36) |
| Inference Engine | llama.cpp (CPU only) |
| Test Date | April 8, 2026 |
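As a rough sketch of how throughput numbers like these can be collected with llama.cpp's bundled `llama-bench` tool: the GGUF filenames and the prompt/generation sizes below are hypothetical, since the article does not publish its exact invocation.

```python
# Sketch: build llama-bench invocations for each test configuration.
# The GGUF filenames are hypothetical; the article does not list its
# exact file names or benchmark arguments.
MODELS = {
    "G3n Q8_0": "gemma-3n-e2b-q8_0.gguf",
    "G4 Q3_K_M": "gemma-4-e2b-q3_k_m.gguf",
    "G4 Q4_K_M": "gemma-4-e2b-q4_k_m.gguf",
    "G4 Q8_0": "gemma-4-e2b-q8_0.gguf",
}

def bench_cmd(gguf_path: str, threads: int = 8) -> list[str]:
    """Build a llama-bench command measuring prefill (-p) and decode (-n)."""
    return [
        "llama-bench",
        "-m", gguf_path,
        "-p", "512",   # prompt-processing (prefill) test size
        "-n", "128",   # token-generation (decode) test size
        "-t", str(threads),
    ]

for name, path in MODELS.items():
    print(name, "->", " ".join(bench_cmd(path)))
```

Running one such command per model on the same device, back to back, is what makes the prefill and decode TPS columns below directly comparable.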
Test Model Configurations
| Generation | Quantization | Note | Chart Color |
| --- | --- | --- | --- |
| Gemma 3n | Q8_0 | Baseline (prev gen) | Blue — Reference |
| Gemma 4 | Q3_K_M | 3-bit lightweight | Purple |
| Gemma 4 | Q4_K_M | 4-bit balanced | Magenta |
| Gemma 4 | Q8_0 ★ | 8-bit high precision (recommended) | Green — Best performer |
Overall Performance: 37% Faster with 2× Context Processing
Comparing at identical Q8_0 quantization, Gemma 4 shows clear improvements over the previous generation across all key metrics.
| Model | Avg Latency | Decode TPS | Prefill TPS | Memory | Pass Rate |
| --- | --- | --- | --- | --- | --- |
| G3n Q8_0 (baseline) | 38.4s | 13.71 | 19.41 | 779 MB | 72% |
| G4 Q3_K_M | 26.2s | 13.18 (-4%) | 27.62 (+42%) | 707 MB | 60% |
| G4 Q4_K_M | 26.3s | 17.20 (+25%) ★ | 34.75 (+79%) | 708 MB | 64% |
| G4 Q8_0 ★ | 24.2s (-37%) | 16.75 (+22%) | 40.15 (+107%) ★ | 708 MB | 72% ★ |
The most striking metric is Prefill TPS. G4 Q8_0 processes 40.15 tokens per second during prefill — a 2.1× improvement over Gemma 3n (19.41). Since prefill handles the initial processing of user input, this improvement directly translates to faster perceived response times in applications with long conversation histories, RAG pipelines, and document processing workflows.
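To see why prefill throughput dominates perceived latency on long inputs, here is a back-of-the-envelope time-to-first-token estimate from the measured rates. This is a simplification that ignores tokenization, sampling, and scheduling overhead.

```python
# Rough time-to-first-token (TTFT): prompt_tokens / prefill_TPS.
# Simplification: ignores tokenizer, sampling, and scheduling overhead.
PREFILL_TPS = {"G3n Q8_0": 19.41, "G4 Q8_0": 40.15}

def ttft_seconds(prompt_tokens: int, model: str) -> float:
    return prompt_tokens / PREFILL_TPS[model]

# A 2,048-token context, e.g. a long chat history or a RAG prompt:
for model in PREFILL_TPS:
    print(f"{model}: ~{ttft_seconds(2048, model):.0f}s to first token")
```

For a 2,048-token prompt the gap is roughly a minute versus half a minute before the first generated token appears, which is the difference users actually feel.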
Memory consumption also improved. All Gemma 4 variants settled around 708 MB, approximately 70 MB less than Gemma 3n (779 MB) — a meaningful reduction given the models share the same parameter count.
Task-by-Task Performance Breakdown
The aggregate averages tell only part of the story: variance across task types is dramatic. Structured output and summarization accelerate massively, while math regresses significantly in one variant.
| Task | G3n Q8_0 | G4 Q3_K_M | G4 Q4_K_M | G4 Q8_0 | Best Speedup |
| --- | --- | --- | --- | --- | --- |
| Brief answers | 0.5s | 0.5s | 0.4s | 0.4s | 1.42× |
| Creative writing | 9.2s | 6.2s | 4.6s | 3.8s | 2.40× |
| Reasoning | 4.0s | 2.0s | 8.0s | 10.2s | 1.99× |
| Summarization | 13.5s | 6.4s | 3.5s | 2.9s | 4.72× ★ |
| Structured output (JSON) | 40.6s | 12.3s | 6.6s | 9.5s | 6.19× ★★ |
| Knowledge explanation | 38.3s | 19.6s | 18.1s | 18.2s | 2.12× |
| Math | 9.1s | 26.5s | 50.0s ⚠ | 21.6s | 0.42× ⚠ Regression! |
| Code generation | 93.3s | 48.6s | 38.6s | 51.0s | 2.42× |
| Long-form content | 104.5s | 92.4s | 87.0s | 82.6s | 1.27× |
The 6.19× speedup for structured output (JSON, markdown tables) delivered by G4 Q4_K_M is the standout result of this benchmark. Applications dealing with API response formatting or data pipeline generation will see substantial real-world impact.
Summarization also shows a compelling 4.72× improvement with G4 Q8_0 (13.5s → 2.9s), making a strong case for adopting Gemma 4 in document-based chatbots and news digest applications.
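The per-task speedups can be recomputed from the rounded table values; small discrepancies against the printed figures come from the article computing them on unrounded measurements.

```python
# Best speedup per task = baseline latency / fastest Gemma 4 latency.
# Values are the rounded seconds from the task table above.
TASKS = {
    "summarization": (13.5, [6.4, 3.5, 2.9]),
    "structured_output": (40.6, [12.3, 6.6, 9.5]),
    "math": (9.1, [26.5, 50.0, 21.6]),
}

def best_speedup(baseline: float, g4_times: list[float]) -> float:
    return baseline / min(g4_times)

for task, (base, g4) in TASKS.items():
    print(f"{task}: {best_speedup(base, g4):.2f}x")
```

Note that for math the "best" Gemma 4 time is still slower than the baseline, so the ratio drops below 1.0, which is exactly the regression flagged below.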
Critical Warning: Q4_K_M Math Regression
Not all metrics point upward. We observed a severe latency regression in math tasks under Gemma 4 Q4_K_M — jumping from 9.1s to 50.0s, a 5.5× slowdown.
Root cause: at 4-bit quantization, the model falls into excessive chain-of-thought generation, producing unnecessarily long reasoning chains when solving mathematical problems. The behavior does not appear in Q3_K_M (26.5s) or Q8_0 (21.6s); it is specific to the Q4 quantization level.
For applications where math or logical reasoning is core functionality:
• Calculators, math tutors, coding assistants → Use G4 Q8_0 or G3n Q8_0
• Exclude G4 Q4_K_M from any math-centric workflow
• If battery savings are the top constraint, G4 Q3_K_M is faster than Q4_K_M for math (26.5s vs 50.0s)
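One way to encode these recommendations in an app is a small routing function that never sends math-heavy requests to the Q4_K_M build. The task labels and the function itself are illustrative helpers, not part of the article.

```python
# Illustrative quantization router based on the benchmark findings.
# The task labels and function name are hypothetical, not from the article.
def pick_variant(task: str, battery_first: bool = False) -> str:
    if task in ("math", "logic"):
        return "G4 Q8_0"        # avoid the Q4_K_M math regression
    if task == "structured_output":
        return "G4 Q4_K_M"      # largest speedup on JSON / tables
    if battery_first:
        return "G4 Q3_K_M"      # lowest energy per run
    return "G4 Q8_0"            # best overall balance

print(pick_variant("math"))
print(pick_variant("chat", battery_first=True))
```

Centralizing the choice in one function also makes it trivial to update the routing table when the next model generation shifts these trade-offs.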
System Impact: Battery, Thermals & Memory
Beyond raw inference speed, battery drain and thermal behavior determine whether a model is viable for sustained mobile usage. Here's how the four variants compare.
| Model | Memory | Battery/Run | Thermal Delta | Init Time | Pass Rate |
| --- | --- | --- | --- | --- | --- |
| G3n Q8_0 | 779.0 MB | 0.240% | 6.0 ★ | 542ms | 72.0% |
| G4 Q3_K_M | 706.8 MB ★ | 0.160% ★ | 10.32 | 381ms ★ | 60.0% |
| G4 Q4_K_M | 707.6 MB | 0.240% | 12.92 | 436ms | 64.0% |
| G4 Q8_0 ★ | 707.7 MB | 0.320% | 15.20 | 523ms | 72.0% ★ |
G4 Q3_K_M achieves the lowest battery consumption at 0.160% per run and the fastest initialization at 381ms — making it ideal for IoT devices or low-power scenarios where energy budget is the primary constraint. The trade-off is a 60% pass rate, which limits its applicability for accuracy-critical applications.
G4 Q8_0 consumes the most battery at 0.320% per run but matches G3n's 72% pass rate while running 37% faster. For most production applications, it offers the best overall balance.
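In concrete terms, the per-run battery figures translate into a rough runs-per-full-charge budget. This is a simplification that assumes drain scales linearly and nothing else uses the battery.

```python
# Rough runs per full charge = 100% / battery% per run.
# Simplification: assumes linear drain and no other battery consumers.
BATTERY_PER_RUN = {"G4 Q3_K_M": 0.160, "G3n Q8_0": 0.240, "G4 Q8_0": 0.320}

def runs_per_charge(pct_per_run: float) -> float:
    return 100.0 / pct_per_run

for model, pct in BATTERY_PER_RUN.items():
    print(f"{model}: ~{runs_per_charge(pct):.0f} runs per charge")
```

By this estimate Q3_K_M stretches a charge to roughly twice as many runs as Q8_0, which is the trade-off against its lower pass rate.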
Conclusion & Model Selection Guide
Here are three key insights and scenario-based recommendations drawn from the benchmark results.
Key Insights
• G4 Q8_0 achieves 1.6× speed improvement over G3n at identical quantization, while simultaneously reducing memory by 70 MB.
• The 2× Prefill TPS improvement (19.41 → 40.15) will have the greatest impact in RAG systems, long conversation contexts, and document processing pipelines.
• The Q4_K_M math regression (9.1s → 50.0s) is a clear reminder that quantization selection carries quality risks beyond speed trade-offs.
Recommended Model by Use Case
| Scenario | Recommended | Rationale |
| --- | --- | --- |
| Maximum Performance (Speed + Quality) | G4 Q8_0 ★ | 1.6× faster than G3n, reduced memory, same pass rate |
| Streaming Conversation (Best Decode) | G4 Q4_K_M ★ | Decode TPS 17.2 — 25% improvement |
| Battery-First Priority | G4 Q3_K_M | Lowest battery consumption at 0.160% per run |
| Math & Logic Apps | G4 Q8_0 or G3n | Must avoid Q4_K_M math regression |
| JSON & Structured Output | G4 Q4_K_M ★ | 6.2× speedup over G3n — best-in-class performance |
Gemma 4 E2B demonstrates clear generational progress at the same parameter scale. The 37% latency reduction and 2× Prefill gain at Q8_0 are meaningful for real-app deployment. However, the Q4_K_M math anomaly requires careful model selection — which is exactly why we run these benchmarks before committing to a deployment decision.
You can find further details at the link.