

# Qwen 3.5 vs Gemma 4: the benchmark-by-size comparison


April 2, 2026

Google DeepMind just shipped [Gemma 4](https://ai.google.dev/gemma/docs/core), its newest open-weight family in `E2B`, `E4B`, `26B A4B`, and `31B` sizes. Alibaba's [Qwen 3.5](https://github.com/QwenLM/Qwen3.5) is already one of the strongest open families on the market, spanning `2B`, `4B`, `9B`, `27B`, `35B-A3B`, `122B-A10B`, and `397B-A17B`.

If you are choosing an open model for local agents, laptop inference, or self-hosted production, the useful question is not "which family has the biggest headline model?" It is **which family wins at the size you can actually run**.

There is not yet a single **Vals-style third-party table** covering _every_ `Gemma 4` and `Qwen 3.5` size. So the most reliable comparison today has to separate **two different kinds of evidence**:

1.  **Third-party chat-preference evidence**, where Google cites [Arena AI's open-source text leaderboard](https://arena.ai/leaderboard/text?license=open-source).
2.  **Official model-card overlap**, where we match models by **deployment class** and compare only the benchmark rows both families actually publish.

**Short version:** the evidence is **mixed, not contradictory**. On the currently published **official benchmark overlap**, `Qwen 3.5` wins more rows in the `2B`, `4B`, and `mid-size MoE` classes, while `Gemma 4` is most competitive in the **dense ~30B** class and has a better story for **audio at the edge**, **multilingual**, and some **multimodal** workloads. On **Arena AI**, though, `Gemma 4 31B` and `Gemma 4 26B A4B` both rank above the comparable open `Qwen 3.5` large models in chat preference.

## Quick verdict

| Size class | Gemma 4 counterpart | Qwen 3.5 counterpart | Current best reading |
| --- | --- | --- | --- |
| Edge / mobile | Gemma 4 E2B | Qwen3.5-2B | No good third-party leaderboard yet; official overlap favors **Qwen** |
| 4B class | Gemma 4 E4B | Qwen3.5-4B | No good third-party leaderboard yet; official overlap favors **Qwen** |
| Dense workstation | Gemma 4 31B | Qwen3.5-27B | Official overlap is **split**; Arena AI chat preference favors **Gemma 4 31B** |
| Efficient MoE | Gemma 4 26B A4B | Qwen3.5-35B-A3B | Official overlap favors **Qwen**; Arena AI chat preference favors **Gemma 4 26B A4B** |

Qwen also has an **upper tier that Gemma 4 does not match directly**: `9B`, `122B-A10B`, and `397B-A17B`.

* * *

## Why Google's claim can still be true

If you only read the model-card tables below, it is easy to conclude that `Qwen 3.5` broadly beats `Gemma 4`. That is **too strong**. Google's launch post is pointing at a **different kind of evidence**.

On the [Arena AI open-source text leaderboard](https://arena.ai/leaderboard/text?license=open-source) page dated **March 31, 2026**:

-   `Gemma 4 31B` is **#3 open model** at **1452 ± 9**
-   `Qwen3.5-397B-A17B` is **#4** at **1449 ± 6**
-   `Gemma 4 26B A4B` is **#6** at **1441 ± 9**
-   `Qwen3.5-122B-A10B` is **1416 ± 6**
-   `Qwen3.5-27B` is **1404 ± 6**
-   `Qwen3.5-35B-A3B` is **1400 ± 6**

That is real **third-party** evidence in Gemma's favor, especially for Google's `byte for byte` and `intelligence-per-parameter` framing. It does **not** mean Gemma 4 beats Qwen 3.5 on every benchmark. It means that on a large-scale **chat-preference leaderboard**, the two bigger Gemma 4 models are currently placed above the main open Qwen 3.5 models.
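To put those rating gaps in scale: Arena-style leaderboards publish Elo-style ratings, and a rating difference maps to an expected head-to-head preference rate via the standard Elo formula. A quick sketch (the leaderboard's exact aggregation may differ):

```python
# Expected head-to-head preference rate implied by an Elo-style rating gap:
# P(A preferred) = 1 / (1 + 10 ** (-(R_A - R_B) / 400))
def win_prob(rating_a: float, rating_b: float) -> float:
    return 1.0 / (1.0 + 10 ** (-(rating_a - rating_b) / 400))

# Gemma 4 31B (1452) vs Qwen3.5-397B-A17B (1449): ~0.504, a near coin flip.
print(f"{win_prob(1452, 1449):.3f}")
# Gemma 4 31B (1452) vs Qwen3.5-27B (1404): ~0.569, a modest but real gap.
print(f"{win_prob(1452, 1404):.3f}")
```

The 3-point gap at the very top sits well inside the published **± 9** and **± 6** error bars, so the #3-versus-#4 ordering is best read as a statistical tie, while the 40-to-50-point gaps to the mid-table Qwen models are more meaningful.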

The honest reading is:

-   **Arena AI** currently leans **Gemma 4** for large-model assistant quality.
-   **Official model-card overlap** leans **Qwen 3.5** on many static reasoning, coding, and agent rows.
-   **Small-model third-party evidence is still thin**, so the `2B` and `4B` conclusions remain more provisional than the large-model ones.

* * *

## How I matched the models

Two caveats matter before looking at the tables:

1.  **Gemma's small models use effective parameters.** `Gemma 4 E2B` is **2.3B effective / 5.1B loaded with embeddings**, and `Gemma 4 E4B` is **4.5B effective / 8B loaded**. These are best understood as **deployment-class matches** to Qwen's `2B` and `4B`, not exact raw-weight matches.
2.  **Qwen publishes different benchmark modes by size.** `Qwen3.5-4B`, `Qwen3.5-27B`, and `Qwen3.5-35B-A3B` run in **thinking mode by default** on their model cards. `Qwen3.5-2B` publishes separate **thinking** and **non-thinking** scores; this post uses the **thinking** number whenever the card shows `thinking / non-thinking`. The sketch below shows how that toggle is typically switched at inference time.
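For concreteness, here is a minimal sketch of the thinking toggle. It assumes `Qwen 3.5` keeps Qwen 3's `enable_thinking` chat-template switch; the Qwen 3.5 cards are not quoted on this here, so treat the exact interface as an assumption:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Assumption: Qwen3.5 exposes the same enable_thinking chat-template flag
# as Qwen3. The model name matches the cards cited in the source note.
name = "Qwen/Qwen3.5-4B"
tok = AutoTokenizer.from_pretrained(name)
model = AutoModelForCausalLM.from_pretrained(name, device_map="auto")

messages = [{"role": "user", "content": "What is 17 * 24?"}]
prompt = tok.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
    enable_thinking=True,  # the default mode used on the model cards
)
inputs = tok(prompt, return_tensors="pt").to(model.device)
out = model.generate(**inputs, max_new_tokens=512)
print(tok.decode(out[0][inputs["input_ids"].shape[-1]:], skip_special_tokens=True))
```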

I also exclude benchmarks that are too methodology-sensitive to treat as clean head-to-head rows here, such as:

-   `AIME 2026 no tools`, because Gemma publishes it but the matched Qwen sizes do not.
-   `SWE-bench Verified`, because Qwen publishes it for larger models but Gemma 4 does not.
-   `CodeForces`, because Qwen footnotes that its `CodeForces` result is measured on **its own query set**, making it a poor direct comparison to Google's published `Codeforces ELO`.

That means the tables below should be read as **reliable but narrow**: they are good for comparing the published overlap, but they are **not** the full story of overall assistant quality.

Across the tables below, read the rows as:

-   `MMLU-Pro`: general knowledge and reasoning
-   `GPQA Diamond`: expert science reasoning
-   `LiveCodeBench v6`: coding
-   `Tau2 / TAU2-Bench`: agentic/tool-use behavior
-   `MMMLU`: multilingual reasoning
-   `MMMU-Pro`: multimodal reasoning

### Size matching

| Deployment class | Gemma 4 | Qwen 3.5 | Why this is the right pairing |
| --- | --- | --- | --- |
| Edge / mobile | E2B | 2B | Smallest local models |
| 4B class | E4B | 4B | Small-laptop / edge-plus tier |
| Dense workstation | 31B dense | 27B dense | Largest dense model in each family |
| Efficient MoE | 26B A4B | 35B-A3B | Closest mid-size MoE class, with ~4B vs ~3B active parameters |

* * *

## Edge / mobile class

This is the closest comparison for **phone, browser, Raspberry Pi, and lightweight local assistant** deployments.

| Benchmark | Gemma 4 E2B | Qwen3.5-2B | Winner |
| --- | --- | --- | --- |
| MMLU-Pro | 60.0 | **66.5** | Qwen |
| Tau2 / TAU2-Bench\* | 24.5 | **48.8** | Qwen |
| MMMLU | **67.4** | 63.1 | Gemma |
| MMMU-Pro | 44.2 | **50.3** | Qwen |

`Qwen3.5-2B` wins **3 of the 4 overlap rows**, and the `Tau2` gap is large enough to matter if you care about **tool-using assistants** or other structured workflows. On the **currently published overlap**, Qwen is the stronger small text-and-agent model.
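If the `Tau2` gap is what would decide your pick, it is cheap to sanity-check it on your own stack before committing. A minimal tool-call smoke test, assuming the model is served behind an OpenAI-compatible endpoint (for example vLLM or Ollama); the URL, model ID, and the `get_weather` tool are placeholders:

```python
from openai import OpenAI

# Smoke test: does the model emit a well-formed tool call for an obvious
# tool-use prompt? Endpoint URL and model ID are placeholders.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="unused")

tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",  # hypothetical tool, defined only for the test
        "description": "Get the current weather for a city.",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}]

resp = client.chat.completions.create(
    model="Qwen/Qwen3.5-2B",  # swap in the Gemma 4 E2B ID for the A/B run
    messages=[{"role": "user", "content": "What's the weather in Oslo?"}],
    tools=tools,
)
print(resp.choices[0].message.tool_calls)
```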

`Gemma 4 E2B` still has a real edge case, though. It leads on `MMMLU`, supports **native audio** on the small-model tier, and is designed around Google's mobile and Android ecosystem. If your edge workload is **voice + vision + lightweight reasoning**, Gemma is not just a consolation prize.

The deployment tradeoff is also different: `Gemma 4 E2B` gives you **128K** context and native audio, while `Qwen3.5-2B` gives you **262K native** context. If your workload involves long docs, repo snippets, or multi-turn tool traces, Qwen's context advantage may matter more than a few benchmark points.
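That context advantage has a direct memory cost worth budgeting before you pick the longer window for a phone or Raspberry Pi. A rough back-of-envelope sketch; the layer, head, and dimension numbers below are illustrative guesses for a ~2B-class model, not published `Qwen3.5-2B` or `Gemma 4 E2B` architecture details:

```python
# Rough per-sequence KV-cache size:
#   2 (K and V) * layers * kv_heads * head_dim * context_len * bytes_per_value
def kv_cache_gib(layers: int, kv_heads: int, head_dim: int,
                 context_len: int, dtype_bytes: int = 2) -> float:
    return 2 * layers * kv_heads * head_dim * context_len * dtype_bytes / 2**30

# Illustrative ~2B-class shape: 28 layers, 8 KV heads, head_dim 128, bf16.
print(f"128K: {kv_cache_gib(28, 8, 128, 128 * 1024):.1f} GiB")  # 14.0 GiB
print(f"262K: {kv_cache_gib(28, 8, 128, 262144):.1f} GiB")      # 28.0 GiB
```

At bf16 those caches can dwarf the weights of a 2B model, which is why long-context edge deployments usually lean on KV-cache quantization or shorter effective windows in practice.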

* * *

## 4B class

This is the class most teams actually evaluate for **single-user local copilots**, **low-cost API inference**, and **small-GPU agents**.

| Benchmark | Gemma 4 E4B | Qwen3.5-4B | Winner |
| --- | --- | --- | --- |
| MMLU-Pro | 69.4 | **79.1** | Qwen |
| GPQA Diamond | 58.6 | **76.2** | Qwen |
| LiveCodeBench v6 | 52.0 | **55.8** | Qwen |
| Tau2 / TAU2-Bench\* | 42.2 | **79.9** | Qwen |
| MMMLU | **76.6** | 76.1 | Gemma |
| MMMU-Pro | 52.6 | **66.3** | Qwen |

This is the **clearest result in the official overlap tables**. `Qwen3.5-4B` is ahead on nearly every row that matters for **reasoning**, **science**, **coding**, **agents**, and **multimodal reasoning**. Gemma only nudges ahead on `MMMLU`, and even there the gap is just **0.5 points**.

If you want the **strongest small open model** on the published benchmark tables, `Qwen3.5-4B` is the more impressive release. The `Tau2` margin is especially notable: it suggests that Qwen's reinforcement-learning and agent training stack is showing up in behavior, not just in static knowledge benchmarks.

`Gemma 4 E4B` still keeps the same two structural advantages as `E2B`: **native audio** and tighter **Google edge tooling**. But on pure benchmark output, the `4B` class is currently **Qwen's strongest win**.

* * *

## Dense ~30B class

This is the most interesting matchup in the post because it is the one where the **official overlap** and the **third-party chat leaderboard** pull in slightly different directions.

| Benchmark | Gemma 4 31B | Qwen3.5-27B | Winner |
| --- | --- | --- | --- |
| MMLU-Pro | 85.2 | **86.1** | Qwen |
| GPQA Diamond | 84.3 | **85.5** | Qwen |
| LiveCodeBench v6 | 80.0 | **80.7** | Qwen |
| Tau2 / TAU2-Bench\* | 76.9 | **79.0** | Qwen |
| MMMLU | **88.4** | 85.9 | Gemma |
| MMMU-Pro | **76.9** | 75.0 | Gemma |

By row count, `Qwen3.5-27B` wins the text-heavy side of the table: `MMLU-Pro`, `GPQA Diamond`, `LiveCodeBench`, and `TAU2`. But the margins on the first three are **small**, while `Gemma 4 31B` puts up the better numbers on `MMMLU` and `MMMU-Pro`.

That makes this the most **balanced** size class on the official overlap:

-   If you care most about **text reasoning**, **general problem solving**, and **agentic behavior**, `Qwen3.5-27B` gets the nod.
-   If you want a dense workstation model with stronger **multilingual** and **multimodal** behavior, `Gemma 4 31B` has the better argument than a quick winner-count suggests.

This is also the size class where Google's launch framing has the strongest independent support. On Arena AI, `Gemma 4 31B` at **1452 ± 9** sits above even `Qwen3.5-397B-A17B` at **1449 ± 6**, which is exactly the kind of result Google is pointing to with `byte for byte`.

So the reliable conclusion here is not "Qwen wins" or "Gemma wins." It is: **static benchmark overlap slightly favors Qwen on text-heavy rows, while third-party chat preference favors Gemma 4 31B overall.**

* * *

## MoE ~4B-active class

This is the right comparison for teams that want **mid-size MoE efficiency** without jumping all the way to Qwen's upper-tier `122B-A10B` and `397B-A17B` models.

| Benchmark | Gemma 4 26B A4B | Qwen3.5-35B-A3B | Winner |
| --- | --- | --- | --- |
| MMLU-Pro | 82.6 | **85.3** | Qwen |
| GPQA Diamond | 82.3 | **84.2** | Qwen |
| LiveCodeBench v6 | **77.1** | 74.6 | Gemma |
| Tau2 / TAU2-Bench\* | 68.2 | **81.2** | Qwen |
| MMMLU | **86.3** | 85.2 | Gemma |
| MMMU-Pro | 73.8 | **75.1** | Qwen |

`Qwen3.5-35B-A3B` is the stronger **all-around MoE on the published overlap**: better text reasoning, better expert-science reasoning, a large `Tau2` lead, and a small edge on `MMMU-Pro`.

`Gemma 4 26B A4B` is not just a speed-oriented compromise, though. It still wins `LiveCodeBench v6`, and it also beats Qwen on `MMMLU`. So if your MoE workload is mostly **coding** plus **multilingual** inference, Gemma remains worth a real A/B test.

For mixed workloads, the official overlap still favors `Qwen3.5-35B-A3B`. But again, the external chat-preference signal points the other way: on Arena AI, `Gemma 4 26B A4B` scores **1441 ± 9** versus **1400 ± 6** for `Qwen3.5-35B-A3B`.

That suggests Gemma's assistant-style tuning may be stronger than the static overlap rows alone would imply.

* * *

## What the benchmark pattern says

-   **On official model-card overlap, Qwen 3.5 wins more rows.** That is most obvious in the `4B` class, and still directionally true in `2B` and the MoE matchup.
-   **On Arena AI, Gemma 4's big models currently look stronger.** `Gemma 4 31B` and `26B A4B` outrank the comparable open `Qwen 3.5` models on third-party chat preference.
-   **The dense ~30B class is the real battleground.** `Gemma 4 31B` is where the two narratives meet: static overlap is split, but third-party assistant preference currently leans Gemma.
-   **Qwen's lineup is broader.** There is no direct Gemma 4 answer to `Qwen3.5-9B`, `Qwen3.5-122B-A10B`, or `Qwen3.5-397B-A17B`.
-   **Qwen has the long-context advantage.** Qwen's model cards advertise **262K native context** across the family, with support for extending toward **~1.01M** via scaling (see the config sketch after this list). Gemma 4 gives you **128K** on `E2B/E4B` and **256K** on `26B/31B`.
-   **The small-model verdicts are less settled than the big-model ones.** There is not yet a strong third-party leaderboard covering `Gemma 4 E2B/E4B` versus `Qwen3.5-2B/4B`, so those sections rely mostly on vendor-published benchmark overlap.
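On the long-context point, here is a minimal sketch of what that extension typically looks like in practice. It assumes `Qwen 3.5` follows Qwen 3's YaRN-style `rope_scaling` override; the scaling factor and field values are assumptions, not numbers from the Qwen 3.5 cards:

```python
from transformers import AutoConfig, AutoModelForCausalLM

# Assumption: Qwen3.5 supports a Qwen3-style YaRN rope_scaling override.
name = "Qwen/Qwen3.5-27B"
cfg = AutoConfig.from_pretrained(name)
cfg.rope_scaling = {
    "rope_type": "yarn",
    "factor": 4.0,  # assumed: 262144 * 4 ~= 1.05M, near the advertised ~1.01M
    "original_max_position_embeddings": 262144,  # the 262K native window
}
model = AutoModelForCausalLM.from_pretrained(name, config=cfg)
```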

* * *

## Which one would I pick?

**For `2B` edge deployments:** `Qwen3.5-2B` if you care about text quality, tool use, and long context; `Gemma 4 E2B` if **audio** and **Google-edge integration** matter more.

**For `4B`:** `Qwen3.5-4B`. This is the easiest call in the post.

**For dense workstation models:** `Qwen3.5-27B` for text-first assistants and agents, `Gemma 4 31B` for more balanced multilingual and multimodal local workloads.

**For mid-size MoE:** `Qwen3.5-35B-A3B` unless your core KPI is coding-heavy multilingual work and you specifically want to test `Gemma 4 26B A4B`.

The important part is that the family-level prior is now clear enough to guide a shortlist. But the prior is **not one-dimensional**: if you care about **static task benchmarks**, the published overlap often points to `Qwen 3.5`; if you care about **assistant-style chat quality**, today's strongest third-party signal points to `Gemma 4` at the top end.

The posterior still comes from **your prompts**, **your context length**, **your tool calls**, and **your latency budget**. That is exactly why real deployment teams should treat public benchmarks as a filter, not as the final answer.
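A minimal version of that filter-then-verify loop, assuming both shortlisted models sit behind OpenAI-compatible endpoints (for example vLLM); the URLs, ports, prompts, and served model IDs are placeholders:

```python
from openai import OpenAI

# Side-by-side harness: run your own prompts through both candidates and
# dump the transcripts for manual (or LLM-judge) review.
CANDIDATES = {  # label -> (endpoint, served model ID); all placeholders
    "qwen3.5-27b": ("http://localhost:8000/v1", "Qwen/Qwen3.5-27B"),
    "gemma-4-31b": ("http://localhost:8001/v1", "google/gemma-4-31b"),
}
PROMPTS = [  # replace with your real workload
    "Summarize this multi-turn tool trace: ...",
    "Refactor this function and explain the change: ...",
]

for prompt in PROMPTS:
    for label, (url, model_id) in CANDIDATES.items():
        client = OpenAI(base_url=url, api_key="unused")
        resp = client.chat.completions.create(
            model=model_id,
            messages=[{"role": "user", "content": prompt}],
        )
        print(f"--- {label} ---\n{resp.choices[0].message.content}\n")
```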

* * *

_Source note: `Gemma 4` rows come from Google's [Gemma 4 model card](https://ai.google.dev/gemma/docs/core/model_card_4) and [launch post](https://blog.google/innovation-and-ai/technology/developers-tools/gemma-4/). `Qwen 3.5` rows come from the official model cards for [Qwen3.5-2B](https://huggingface.co/Qwen/Qwen3.5-2B), [Qwen3.5-4B](https://huggingface.co/Qwen/Qwen3.5-4B), [Qwen3.5-27B](https://huggingface.co/Qwen/Qwen3.5-27B), and [Qwen3.5-35B-A3B](https://huggingface.co/Qwen/Qwen3.5-35B-A3B), accessed April 2, 2026. Arena AI numbers come from the [open-source text leaderboard](https://arena.ai/leaderboard/text?license=open-source), page dated **March 31, 2026**, accessed April 2, 2026. `Qwen3.5-2B` publishes some vision rows as `thinking / non-thinking`; this post uses the thinking score. `Tau2 / TAU2-Bench` is included as a directional agent row, but Google reports `Tau2 (average over 3)` while Qwen reports `TAU2-Bench` with the airline-domain fix noted in its model card._

---

*Maniac: high-throughput background agents. Opus-quality outputs at 1/50 of the cost. Learn more at [maniac.ai](https://www.maniac.ai).*