
Gemma 3n vs Gemma 4: A Real-World Benchmark Guide on the Galaxy S25 Ultra


Click the image to view the report.

Every time Google ships a new generation of the Gemma series, the question we ask first is simple: how much faster does it actually run on real hardware? To answer that directly, TecAce ran a head-to-head benchmark between Gemma 3n and Gemma 4 on the Samsung Galaxy S25 Ultra under identical conditions.

We tested four model configurations using the llama.cpp CPU inference engine: the previous-generation Gemma 3n E2B Q8_0 as our baseline, and three quantization variants of Gemma 4 E2B (Q3_K_M, Q4_K_M, Q8_0). Beyond raw speed, we measured task-specific latency and accuracy across real-world scenarios including summarization, structured output, code generation, and math.

The short answer: Gemma 4 delivers clear generational progress across most tasks. But there's a significant catch — choosing the wrong quantization for math-heavy applications can make things substantially worse. This report covers both sides of that story.

 

Test Environment

All tests were conducted on the same device under identical inference engine settings.

| Item | Detail |
| --- | --- |
| Device | Samsung SM-S942U (Galaxy S25 Ultra) |
| SoC | QTI SM8850 (Snapdragon 8 Elite) · 8 Cores |
| OS | Android 16 (SDK 36) |
| Inference Engine | llama.cpp (CPU only) |
| Test Date | April 8, 2026 |

 

Test Model Configurations

| Generation | Quantization | Note | Chart Color |
| --- | --- | --- | --- |
| Gemma 3n | Q8_0 | Baseline (prev gen) | Blue — Reference |
| Gemma 4 | Q3_K_M | 3-bit lightweight | Purple |
| Gemma 4 | Q4_K_M | 4-bit balanced | Magenta |
| Gemma 4 | Q8_0 ★ | 8-bit high precision (recommended) | Green — Best performer |
 

 

Overall Performance: 37% Faster with 2× Context Processing

Comparing at identical Q8_0 quantization, Gemma 4 shows clear improvements over the previous generation across all key metrics.

| Model | Avg Latency | Decode TPS | Prefill TPS | Memory | Pass Rate |
| --- | --- | --- | --- | --- | --- |
| G3n Q8_0 (baseline) | 38.4s | 13.71 | 19.41 | 779 MB | 72% |
| G4 Q3_K_M | 26.2s | 13.18 (-4%) | 27.62 (+42%) | 707 MB | 60% |
| G4 Q4_K_M | 26.3s | 17.20 (+25%) ★ | 34.75 (+79%) | 708 MB | 64% |
| G4 Q8_0 ★ | 24.2s (-37%) | 16.75 (+22%) | 40.15 (+107%) ★ | 708 MB | 72% ★ |
 

The most striking metric is Prefill TPS. G4 Q8_0 processes 40.15 tokens per second during prefill — a 2.1× improvement over Gemma 3n (19.41). Since prefill handles the initial processing of user input, this improvement directly translates to faster perceived response times in applications with long conversation histories, RAG pipelines, and document processing workflows.
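The impact scales with prompt length. As a rough, illustrative estimate (the 2,000-token prompt is hypothetical, not part of the benchmark), time-to-first-token can be approximated as prompt tokens divided by prefill TPS:

```python
def prefill_seconds(prompt_tokens: int, prefill_tps: float) -> float:
    """Approximate wall time to ingest the prompt before the first output token."""
    return prompt_tokens / prefill_tps

# Hypothetical 2,000-token RAG prompt:
print(round(prefill_seconds(2000, 19.41), 1))  # G3n Q8_0 -> 103.0 s
print(round(prefill_seconds(2000, 40.15), 1))  # G4 Q8_0  -> 49.8 s
```

At that prompt size, the prefill gain alone cuts perceived wait time roughly in half before the first token even appears.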

Memory consumption also improved. All Gemma 4 variants settled around 708 MB, approximately 70 MB less than Gemma 3n (779 MB) — a meaningful reduction given the models share the same parameter count.

 

Task-by-Task Performance Breakdown

The aggregate averages tell only part of the story: variance across task types is dramatic, with structured output and summarization accelerating massively while math regresses sharply in one variant.

| Task | G3n Q8_0 | G4 Q3_K_M | G4 Q4_K_M | G4 Q8_0 | Best Speedup |
| --- | --- | --- | --- | --- | --- |
| Brief answers | 0.5s | 0.5s | 0.4s | 0.4s | 1.42× |
| Creative writing | 9.2s | 6.2s | 4.6s | 3.8s | 2.40× |
| Reasoning | 4.0s | 2.0s | 8.0s | 10.2s | 1.99× |
| Summarization | 13.5s | 6.4s | 3.5s | 2.9s | 4.72× ★ |
| Structured output (JSON) | 40.6s | 12.3s | 6.6s | 9.5s | 6.19× ★★ |
| Knowledge explanation | 38.3s | 19.6s | 18.1s | 18.2s | 2.12× |
| Math | 9.1s | 26.5s | 50.0s ⚠ | 21.6s | 0.42× ⚠ Regression! |
| Code generation | 93.3s | 48.6s | 38.6s | 51.0s | 2.42× |
| Long-form content | 104.5s | 92.4s | 87.0s | 82.6s | 1.27× |
 

The 6.19× speedup for structured output (JSON, markdown tables) delivered by G4 Q4_K_M is the standout result of this benchmark. Applications dealing with API response formatting or data pipeline generation will see substantial real-world impact.

Summarization also shows a compelling 4.72× improvement with G4 Q8_0 (13.5s → 2.9s), making a strong case for adopting Gemma 4 in document-based chatbots and news digest applications.

 

Critical Warning: Q4_K_M Math Regression

Not all metrics point upward. We observed a severe latency regression in math tasks under Gemma 4 Q4_K_M — jumping from 9.1s to 50.0s, a 5.5× slowdown.

Root cause: at 4-bit quantization the model falls into excessive chain-of-thought generation, producing unnecessarily long reasoning chains on mathematical problems. This behavior does not appear in Q3_K_M (26.5s) or Q8_0 (21.6s); it is specific to the Q4 quantization level.
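A back-of-envelope check is consistent with this explanation. Assuming math latency is dominated by decoding (ignoring prefill time, which is a simplification), the decode rates from the overall table imply very different output lengths:

```python
def implied_output_tokens(latency_s: float, decode_tps: float) -> int:
    """Rough output-length estimate: assumes latency is dominated by decoding."""
    return round(latency_s * decode_tps)

print(implied_output_tokens(50.0, 17.20))  # G4 Q4_K_M math -> ~860 tokens
print(implied_output_tokens(21.6, 16.75))  # G4 Q8_0  math -> ~362 tokens
```

Roughly 2.4× more tokens generated for the same problems points to runaway reasoning output, not slower decoding.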

For applications where math or logical reasoning is core functionality:

• Calculators, math tutors, coding assistants → Use G4 Q8_0 or G3n Q8_0

• Exclude G4 Q4_K_M from any math-centric workflow

• If battery savings are the top constraint, G4 Q3_K_M is faster than Q4_K_M for math (26.5s vs 50.0s)

 

 

System Impact: Battery, Thermals & Memory

Beyond raw inference speed, battery drain and thermal behavior determine whether a model is viable for sustained mobile usage. Here's how the four variants compare.

| Model | Memory | Battery/Run | Thermal Delta | Init Time | Pass Rate |
| --- | --- | --- | --- | --- | --- |
| G3n Q8_0 | 779.0 MB | 0.240% | 6.0 ★ | 542ms | 72.0% |
| G4 Q3_K_M | 706.8 MB ★ | 0.160% ★ | 10.32 | 381ms ★ | 60.0% |
| G4 Q4_K_M | 707.6 MB | 0.240% | 12.92 | 436ms | 64.0% |
| G4 Q8_0 ★ | 707.7 MB | 0.320% | 15.20 | 523ms | 72.0% ★ |
 

G4 Q3_K_M achieves the lowest battery consumption at 0.160% per run and the fastest initialization at 381ms — making it ideal for IoT devices or low-power scenarios where energy budget is the primary constraint. The trade-off is a 60% pass rate, which limits its applicability for accuracy-critical applications.

G4 Q8_0 consumes the most battery at 0.320% per run but matches G3n's 72% pass rate while running 37% faster. For most production applications, it offers the best overall balance.
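To put the per-run drain in everyday terms, here is a simple estimate of runs per full charge. It ignores idle drain, screen power, and thermal throttling, so treat it as an upper bound:

```python
def runs_per_full_charge(pct_per_run: float) -> float:
    """How many benchmark-sized runs fit in 100% of battery, all else ignored."""
    return 100.0 / pct_per_run

print(round(runs_per_full_charge(0.160), 1))  # G4 Q3_K_M -> 625.0 runs
print(round(runs_per_full_charge(0.320), 1))  # G4 Q8_0  -> 312.5 runs
```

Even the hungriest variant sustains hundreds of runs per charge; battery only becomes the deciding factor in always-on or high-volume workloads.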

 

Conclusion & Model Selection Guide

Here are three key insights and scenario-based recommendations drawn from the benchmark results.

Key Insights

• G4 Q8_0 achieves 1.6× speed improvement over G3n at identical quantization, while simultaneously reducing memory by 70 MB.

• The 2× Prefill TPS improvement (19.41 → 40.15) will have the greatest impact in RAG systems, long conversation contexts, and document processing pipelines.

• The Q4_K_M math regression (9.1s → 50.0s) is a clear reminder that quantization selection carries quality risks beyond speed trade-offs.

 

Recommended Model by Use Case

| Scenario | Recommended | Rationale |
| --- | --- | --- |
| Maximum Performance (Speed + Quality) | G4 Q8_0 ★ | 1.6× faster than G3n, reduced memory, same pass rate |
| Streaming Conversation (Best Decode) | G4 Q4_K_M ★ | Decode TPS 17.2 — 25% improvement |
| Battery-First Priority | G4 Q3_K_M | Lowest battery consumption at 0.160% per run |
| Math & Logic Apps | G4 Q8_0 or G3n | Must avoid Q4_K_M math regression |
| JSON & Structured Output | G4 Q4_K_M ★ | 6.2× speedup over G3n — best-in-class performance |
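In app code, the selection guide can be encoded as a small lookup with a guard for the math regression. This is an illustrative sketch; the scenario keys and function name are our own, not part of any Gemma or llama.cpp API:

```python
# Illustrative mapping of the scenarios above to their benchmark winners.
RECOMMENDED = {
    "max_performance": "G4 Q8_0",
    "streaming_chat": "G4 Q4_K_M",
    "battery_first": "G4 Q3_K_M",
    "structured_output": "G4 Q4_K_M",
}

def pick_model(scenario: str, math_heavy: bool = False) -> str:
    """Return the recommended variant, defaulting to the all-round G4 Q8_0."""
    model = RECOMMENDED.get(scenario, "G4 Q8_0")
    # Guard: the Q4_K_M math regression (9.1s -> 50.0s) rules it out
    # for math- or logic-centric workloads.
    if math_heavy and model == "G4 Q4_K_M":
        model = "G4 Q8_0"
    return model
```

A guard like this keeps the fast Q4_K_M path for JSON and chat workloads while never routing math-heavy requests to the regressed variant.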

 

Gemma 4 E2B demonstrates clear generational progress at the same parameter scale. The 37% latency reduction and 2× Prefill gain at Q8_0 are meaningful for real-app deployment. However, the Q4_K_M math anomaly requires careful model selection — which is exactly why we run these benchmarks before committing to a deployment decision.

 

You can find further details at the link.


