🔴 Dense (Transformer Standard)
All parameters are active for every token. Basis: Vaswani et al., "Attention Is All You Need" (2017).
Parameters
All active
Inference Speed
Slow (every parameter runs per token)
Training
Simple, stable
Context
Up to ~200K
Models: Llama 3.1 405B, Claude 3.5, GPT-4 (architecture unconfirmed)
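A dense model's serving cost at long context is dominated by the KV cache, which is why the ~200K ceiling matters. A back-of-envelope sketch; the Llama-3.1-405B-like config below (126 layers, 8 grouped KV heads, head dim 128) and the fp16 assumption are illustrative approximations, not official serving numbers:

```python
# Back-of-envelope KV-cache size for a dense transformer.
# Config values are approximations for illustration only.

def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len, bytes_per_elem=2):
    """Memory for keys + values across all layers (fp16 by default)."""
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_elem

# Llama-3.1-405B-like config with GQA, at 128K context:
gb = kv_cache_bytes(126, 8, 128, 128_000) / 1e9
print(f"~{gb:.1f} GB of KV cache per sequence")  # ~66.1 GB
```

Even with grouped-query attention, a single long sequence costs tens of gigabytes of cache, which is the pressure MoE and hybrid designs respond to.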
🟢 Sparse MoE (Mixture of Experts)
Only top-k experts active per token. Scales parameters at same compute.
Parameters
Most inactive
Inference Speed
Fast (top-k)
Training
More complex (load balancing)
Context
Standard
Models: Mixtral 8×7B, DeepSeek V3
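The "top-k experts per token" idea can be sketched in a few lines. This is a minimal illustrative router, not Mixtral's actual implementation (which adds load-balancing losses and fused kernels):

```python
import numpy as np

# Minimal top-k MoE layer sketch (illustrative; dimensions are toy-sized).
rng = np.random.default_rng(0)
d, n_experts, k = 16, 8, 2

experts = [  # each "expert" here is a tiny linear map standing in for an FFN
    rng.standard_normal((d, d)) / np.sqrt(d) for _ in range(n_experts)
]
router_w = rng.standard_normal((d, n_experts)) / np.sqrt(d)

def moe_layer(x):
    logits = x @ router_w                      # one score per expert
    top = np.argsort(logits)[-k:]              # indices of the top-k experts
    weights = np.exp(logits[top])
    weights /= weights.sum()                   # softmax over the chosen k only
    # Only k of n_experts experts execute -> compute scales with k, not n_experts.
    return sum(w * (x @ experts[i]) for w, i in zip(weights, top))

y = moe_layer(rng.standard_normal(d))
print(y.shape)  # (16,)
```

The key property: parameter count grows with `n_experts`, but per-token compute grows only with `k`.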
🔵 Hybrid (Modern Innovations)
Combines attention with linear RNNs or state-space models, reducing the O(n²) attention cost to (near-)linear.
Parameters
Efficient
Inference Speed
Very fast
Training
New, still under active research
Context
Theoretically unbounded
Models: Mamba, Hydra, RWKV
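The core mechanism shared by SSM/RWKV-style layers is a linear recurrence: a fixed-size state is updated once per token, so time is O(n) and state memory is O(1) in sequence length. A scalar sketch (real models use learned, input-dependent matrices instead of the fixed `a`, `b` below):

```python
import numpy as np

# Sketch of a linear recurrence, the core of SSM/linear-RNN layers:
#   h_t = a * h_{t-1} + b * x_t
# One pass over the sequence (O(n) time), one scalar of state (O(1) memory).

def linear_scan(x, a=0.9, b=1.0):
    h, out = 0.0, []
    for x_t in x:
        h = a * h + b * x_t     # constant-size state summarizes all history
        out.append(h)
    return np.array(out)

print(linear_scan(np.ones(4)))
```

Because there is no pairwise token interaction, nothing in this layer grows quadratically with context length, which is where the "unbounded context" claim comes from.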
Property | Dense | MoE | Hybrid
Memory complexity | O(n) KV cache + O(d²) FFN | O(n) router state + O(d²) per active expert | O(1) or O(n), depending on design
Trainability | Simple, converges well | Less stable, needs load balancing | Still under research
Inference latency | O(n) per token with KV cache | O(1) router + top-k experts per token | O(1) per token (ideal)
Long context | Flash Attention → ~200K | Same as Dense | Theoretically unbounded
Deployment | Standard, many optimizations | Complex routing | Not yet mainstream
Production readiness | 100% proven | 95% (Mixtral, DeepSeek) | 50% (research)
📊 Dense Dominates Production
Claude 3.5 and Llama 3.1 are dense (GPT-4's architecture is unconfirmed). Simplicity in training and deployment beats MoE's efficiency gains; KV cache plus Flash Attention is often enough.
MoE = Efficiency Multiplier
Mixtral 8×7B: ~13B parameters active per token out of ~47B total. Saves compute while keeping a large parameter budget, but router overhead and load balancing add complexity.
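The 13B-active / 47B-total split follows from top-2 routing over 8 experts. The arithmetic below is derived from the rounded public figures; the shared/expert breakdown it yields is a back-of-envelope estimate, not an official one:

```python
# Why Mixtral 8x7B runs ~13B of its ~47B parameters per token.
# Rounded public figures; the derived split is an estimate, not official.
total_params  = 46.7e9    # all 8 experts counted per layer
active_params = 12.9e9    # only the top-2 experts run per token
n_experts, k = 8, 2

# total = shared + 8*E and active = shared + 2*E  ->  solve for E and shared:
expert_block = (total_params - active_params) / (n_experts - k)
shared = total_params - n_experts * expert_block
print(f"~{expert_block/1e9:.1f}B per expert slot, ~{shared/1e9:.1f}B shared")
```

The shared part (attention, embeddings) is small; nearly all of the parameter budget sits in the experts, of which only a quarter run per token.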
🚀 Hybrid Frontier 2025+
Mamba and other state-space models scale as O(n) instead of O(n²), making theoretically unbounded context possible. But reasoning quality is not yet at the level of full attention.
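The O(n) vs O(n²) claim is easiest to see as a cost ratio. The per-layer cost model below uses deliberately crude, illustrative constants (attention as n²·d, an SSM scan as n·d²):

```python
# Rough per-layer, per-sequence cost model (constants are illustrative):
#   attention ~ O(n^2 * d), SSM/linear-RNN scan ~ O(n * d^2).
def attention_cost(n, d):
    return n * n * d          # every token interacts with every other token

def ssm_cost(n, d):
    return n * d * d          # fixed-size state update per token

d = 4096
for n in (1_000, 100_000):
    ratio = attention_cost(n, d) / ssm_cost(n, d)
    print(f"n={n}: attention costs {ratio:.2f}x the scan")
```

Under this model the ratio is simply n/d: attention is cheaper below n ≈ d and grows linearly more expensive beyond it, which is why hybrids target long sequences specifically.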
💡 Choice Depends On:
Latency requirements: Hybrid. Accuracy priority: Dense. Cost efficiency: MoE. Unlimited context: Hybrid. Production: Dense.
📈 Scaling Laws Differ
Dense: power-law improvement with parameter count. MoE: sub-linear gains (router overhead). Hybrid: unknown, not yet calibrated.
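For the dense case, "power-law with parameters" means a curve of the form L(N) = a·N^(−α) + c. The sketch below uses loosely Chinchilla-shaped but assumed constants, purely to show the shape of the curve, not fitted values for any real model:

```python
# Toy dense scaling curve L(N) = a * N^(-alpha) + c.
# Constants are assumed for illustration, not fitted to any real model.
def loss(n_params, a=400.0, alpha=0.34, c=1.7):
    return a * n_params ** (-alpha) + c

for n in (1e9, 1e10, 1e11):
    print(f"{n:.0e} params -> loss ~{loss(n):.2f}")
```

The irreducible term `c` is why each 10x in parameters buys a smaller absolute loss drop; the document's point is that no comparably calibrated curve exists yet for hybrids.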
🔮 Future: Hybrid + Dense Mix
Probably not hybrid alone: hybrid layers for long sequences, dense attention for reasoning, or a router that selects dense blocks.