Interactively Visualizing the Qwen3 MoE Architecture



In Fall 2025, I am taking CS 182: Deep Neural Networks, a course at UC Berkeley taught by Professors Anant Sahai and Gireeja Ranade. One assignment challenges us to create AI-enhanced learning tools for individual concepts from the course. For this assignment, I've created an interactive diagram to visualize the Qwen3 Mixture-of-Experts (MoE) architecture. Check it out below! For comparison, I also include the non-MoE architecture, Qwen3 Dense.

Beneath the visualizations are explanations of key concepts that might be unfamiliar to readers: Grouped Query Attention, QK-Norm, and Rotary Positional Embeddings (RoPE). Note that a full explanation of transformers is out of scope for this post. For that, I recommend the Wikipedia article on Transformers. I would also recommend checking out Sebastian Raschka's LLM Architecture Comparison, which is what inspired this post.


Qwen3 (Dense)

[Interactive diagram: a single Qwen3 Dense transformer block, read bottom to top: tokenized text, token embedding layer, RMSNorm 1, masked grouped-query attention (with RoPE and QK-Norm), residual add, RMSNorm 2, feed forward, residual add, final RMSNorm, linear output layer.]

Qwen3 (MoE)

[Interactive diagram: a single Qwen3 MoE transformer block. It is identical to the dense block except that the feed-forward module is replaced by an MoE router, which sends each token through a subset of the experts (shown as Expert 1, Expert 2, ..., Expert 4).]

Hover over each module to explore how the Qwen3 architecture works. The diagrams show a single transformer block; the full models stack many copies of it, with the number of layers depending on model size.
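
To make the data flow concrete, here is a minimal pre-norm residual block in PyTorch that mirrors the diagrams. This is an illustrative sketch rather than Qwen3's implementation: QwenStyleBlock and its stand-in sublayers are hypothetical, and nn.RMSNorm requires PyTorch 2.4 or newer.

```python
import torch
import torch.nn as nn

class QwenStyleBlock(nn.Module):
    """Pre-norm residual block following the diagrams (illustrative, not Qwen3's code)."""

    def __init__(self, hidden_dim, attention, feed_forward):
        super().__init__()
        self.norm1 = nn.RMSNorm(hidden_dim)   # "RMSNorm 1"
        self.attention = attention            # grouped-query attention (with RoPE and QK-Norm)
        self.norm2 = nn.RMSNorm(hidden_dim)   # "RMSNorm 2"
        self.feed_forward = feed_forward      # dense feed-forward, or MoE router + experts

    def forward(self, x):
        x = x + self.attention(self.norm1(x))      # attention sublayer, then residual "+"
        x = x + self.feed_forward(self.norm2(x))   # feed-forward sublayer, then residual "+"
        return x

# Toy usage with linear stand-ins for the real sublayers; a full model stacks many such blocks.
block = QwenStyleBlock(64, attention=nn.Linear(64, 64), feed_forward=nn.Linear(64, 64))
out = block(torch.randn(2, 16, 64))  # (batch, seq, hidden_dim)
```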


Architectural Concepts

Grouped Query Attention

Query heads are partitioned into groups, and each group shares a single key/value head, so group \(g\) computes \(\mathrm{softmax}\!\left(\frac{Q_g K_g^{\top}}{\sqrt{d}}\right) V_g\). Reusing keys and values across the query heads in a group shrinks the KV cache and cuts memory bandwidth compared with standard multi-head attention. The per-head outputs are then concatenated and projected back into the full hidden dimension before the residual connection.
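
To make the grouping concrete, here is a minimal PyTorch sketch. It is not Qwen3's code: the linear projections, causal mask, RoPE, and QK-Norm are omitted, and the function name and toy shapes are my own.

```python
import torch
import torch.nn.functional as F

def grouped_query_attention(q, k, v):
    """Attention where groups of query heads share one key/value head.

    q: (batch, n_q_heads, seq, head_dim); k, v: (batch, n_kv_heads, seq, head_dim).
    Omits projections, causal masking, RoPE, and QK-Norm.
    """
    n_q_heads, head_dim = q.shape[1], q.shape[-1]
    group_size = n_q_heads // k.shape[1]          # query heads per key/value head

    # Repeat each K/V head so it lines up with its group of query heads.
    k = k.repeat_interleave(group_size, dim=1)    # (batch, n_q_heads, seq, head_dim)
    v = v.repeat_interleave(group_size, dim=1)

    scores = q @ k.transpose(-2, -1) / head_dim**0.5   # (batch, n_q_heads, seq, seq)
    return F.softmax(scores, dim=-1) @ v                # (batch, n_q_heads, seq, head_dim)

# Toy example: 8 query heads sharing 2 key/value heads, so the KV cache is 4x smaller
# than with standard multi-head attention, while the output shape is unchanged.
q = torch.randn(1, 8, 16, 64)
k = torch.randn(1, 2, 16, 64)
v = torch.randn(1, 2, 16, 64)
out = grouped_query_attention(q, k, v)  # (1, 8, 16, 64)
```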

QK-Norm

Before the attention logits are formed, queries and keys are rescaled to a fixed RMS magnitude: \(\hat{Q} = Q / \mathrm{RMS}(Q)\) and \(\hat{K} = K / \mathrm{RMS}(K)\). The RMS is taken across the feature dimension, so the normalization is applied per token. This keeps the logits balanced across layers and the softmax outputs well-conditioned, even for very long contexts.
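
Here is a rough sketch of that per-token normalization. Qwen3's actual QK-Norm is a full RMSNorm applied per attention head, including a learnable scale; the scale is omitted here and the function name is my own.

```python
import torch

def qk_rmsnorm(x, eps=1e-6):
    """RMS-normalize over the last (feature) dimension, i.e. per token and per head."""
    rms = torch.sqrt(x.pow(2).mean(dim=-1, keepdim=True) + eps)
    return x / rms

# Queries and keys shaped (batch, n_heads, seq, head_dim), normalized before Q @ K^T.
q_hat = qk_rmsnorm(torch.randn(1, 8, 16, 64))
k_hat = qk_rmsnorm(torch.randn(1, 2, 16, 64))
```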

Rotary Positional Embeddings (RoPE)

RoPE rotates each two-dimensional slice of the query and key vectors in the complex plane by an angle proportional to the token index. For a 2-D head slice \([\mathbf{u}, \mathbf{v}]\), the rotated version is given by $$[\mathbf{u}', \mathbf{v}'] = [\mathbf{u}, \mathbf{v}]\, R(\theta),$$ where \(R(\theta)\) is the standard 2-by-2 rotation matrix and \(\theta = n / \omega\), with \(n\) the token index and \(\omega\) a wavelength that differs across dimension pairs. The resulting phase difference between any two tokens depends only on their relative distance, which is what lets the model generalize to sequence lengths beyond those seen in training.
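
And here is an illustrative PyTorch version of the rotation. It pairs dimension \(i\) with dimension \(i + d/2\) (one common convention; the original RoPE paper interleaves adjacent dimensions) and uses a base wavelength of 10000; Qwen3's actual configuration may use a different base, and apply_rope is my own name.

```python
import torch

def apply_rope(x, base=10000.0):
    """Rotate each 2-D slice of x by an angle proportional to its token index.

    x: (batch, n_heads, seq, head_dim) with head_dim even. Pair i is rotated by
    theta = n / base**(i / (head_dim / 2)) at token index n, so low-index pairs
    spin quickly and high-index pairs encode coarser positions.
    """
    seq, head_dim = x.shape[-2], x.shape[-1]
    half = head_dim // 2

    inv_freq = base ** (-torch.arange(half, dtype=torch.float32) / half)  # (half,)
    theta = torch.arange(seq, dtype=torch.float32)[:, None] * inv_freq    # (seq, half)
    cos, sin = theta.cos(), theta.sin()

    u, v = x[..., :half], x[..., half:]   # the two halves of each head slice
    return torch.cat([u * cos - v * sin,  # standard 2-D rotation applied pairwise
                      u * sin + v * cos], dim=-1)

q_rot = apply_rope(torch.randn(1, 8, 16, 64))  # same shape, positions now encoded
```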