Architecture Parameters

Layers (n_layers): number of Transformer blocks in the stack
Hidden size (d_model): hidden state dimension
Attention heads (h): number of Multi-Head Attention query heads
KV heads (n_kv_heads): MHA: same as query heads; GQA: fewer; MQA: 1
FFN size (d_ff): feed-forward intermediate dimension
Vocabulary size (vocab_size): number of distinct tokens
Experts (n_experts): with MoE, the FFN is replicated into N experts (see the config sketch below)
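
A minimal sketch of these inputs as a Python config. The field names (n_layers, d_model, n_kv_heads, ...) are assumptions chosen to match the formulas further down, not an official schema; the example values are the published Llama 2 70B settings.

```python
from dataclasses import dataclass

@dataclass
class ModelConfig:
    n_layers: int              # Transformer blocks in the stack
    d_model: int               # hidden state dimension
    n_heads: int               # Multi-Head Attention query heads
    n_kv_heads: int            # MHA: = n_heads, GQA: fewer, MQA: 1
    d_ff: int                  # feed-forward intermediate dimension
    vocab_size: int            # number of distinct tokens
    n_experts: int = 1         # MoE: FFN replicated into N experts
    n_active_experts: int = 1  # top-k experts routed per token

# Llama 2 70B, as published: 80 layers, d_model 8192, 64 Q heads, 8 KV heads
llama2_70b = ModelConfig(n_layers=80, d_model=8192, n_heads=64,
                         n_kv_heads=8, d_ff=28672, vocab_size=32000)
```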

Results

Total Parameters (billions)
Memory Required at FP16 (GB)
Inference FLOPs per Token (TFLOPs)

Parameter Distribution

Embedding Matrix (M parameters)
Attention Q, K, V, O projections (M parameters)
Output Projection (M parameters)
Formulas
Attention: 4 × d_model² per layer (Q, K, V, O projections)
FFN (ReLU): 2 × d_model × d_ff per layer (W1, W2)
FFN (SwiGLU): 3 × d_model × d_ff per layer (W, V, W2)
LayerNorm: 2 × d_model per LayerNorm (gain, bias)
Embedding: vocab_size × d_model
Output head: d_model × vocab_size (unless weights are tied with the embedding)
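These formulas translate directly into a small counting function. This is a sketch under common assumptions (two LayerNorms per block plus a final one, d_k = d_v = d_model / h, GQA-aware K/V projections); real models deviate slightly, e.g. Llama 2 uses RMSNorm with a gain only.

```python
def count_parameters(n_layers, d_model, n_heads, n_kv_heads, d_ff,
                     vocab_size, swiglu=True, weight_tying=False):
    d_head = d_model // n_heads                       # d_k = d_v = d_head
    # Attention: Q and O are d_model x d_model; K and V shrink under GQA.
    attn = 2 * d_model * d_model + 2 * d_model * (n_kv_heads * d_head)
    # FFN: 3 weight matrices for SwiGLU, 2 for a ReLU/GELU FFN.
    ffn = (3 if swiglu else 2) * d_model * d_ff
    # Two LayerNorms per block, 2 * d_model (gain, bias) each.
    norms = 2 * (2 * d_model)
    per_layer = attn + ffn + norms
    embedding = vocab_size * d_model
    output_head = 0 if weight_tying else d_model * vocab_size
    final_norm = 2 * d_model
    return n_layers * per_layer + embedding + output_head + final_norm

# Llama-2-70B-like settings land at roughly 69 B parameters.
total = count_parameters(80, 8192, 64, 8, 28672, 32000, swiglu=True)
print(f"{total / 1e9:.1f} B")   # -> 69.0 B
```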
GQA Adjustment
Grouped Query Attention shrinks the K/V projections: their output dimension drops from h × (d_k + d_v) to n_kv_heads × (d_k + d_v). Llama 2 70B, for example, uses 64 query heads but only 8 KV heads.
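As a quick check of the saving, here is the per-layer K/V projection count for the Llama 2 70B shape, once with full MHA and once with GQA (a sketch, assuming d_k = d_v = d_model / h):

```python
d_model, n_heads, n_kv_heads = 8192, 64, 8
d_head = d_model // n_heads                    # 128
kv_mha = 2 * d_model * (n_heads * d_head)      # K + V with 64 KV heads
kv_gqa = 2 * d_model * (n_kv_heads * d_head)   # K + V with 8 KV heads
print(kv_mha / 1e6, kv_gqa / 1e6)              # ~134.2 M vs. ~16.8 M per layer
```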
MoE Efficiency
With Mixture of Experts, the FFN parameters are replicated across experts, but only the top-k experts are active per token. Mixtral 8x7B, for example, has about 47B total parameters but only about 13B active per token.
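A sketch of that split for a Mixtral-like FFN (8 experts, top-2 routing, SwiGLU, published d_model = 4096 and d_ff = 14336); all expert weights sit in memory, but only the routed ones contribute to per-token compute:

```python
d_model, d_ff = 4096, 14336                  # Mixtral 8x7B published sizes
n_experts, top_k = 8, 2
ffn_one_expert = 3 * d_model * d_ff          # SwiGLU: 3 matrices
ffn_total = n_experts * ffn_one_expert       # counts toward memory
ffn_active = top_k * ffn_one_expert          # counts toward per-token compute
print(ffn_total / 1e9, ffn_active / 1e9)     # ~1.41 B vs. ~0.35 B per layer
```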
Memory Considerations
FP16 needs 2 bytes per parameter, FP32 4 bytes, INT8 1 byte, and INT4 0.5 bytes. On top of the weights, memory is needed for gradients (training), the KV cache (inference), and activations.
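A sketch of the weight-memory part of that estimate; the byte counts are the ones listed above, and the KV cache, activations, and (for training) gradients come on top:

```python
BYTES_PER_PARAM = {"fp32": 4.0, "fp16": 2.0, "int8": 1.0, "int4": 0.5}

def weight_memory_gb(n_params, dtype="fp16"):
    # Weights only, in decimal GB (1e9 bytes).
    return n_params * BYTES_PER_PARAM[dtype] / 1e9

print(weight_memory_gb(70e9, "fp16"))   # 70 B params -> 140.0 GB
print(weight_memory_gb(70e9, "int4"))   # 70 B params ->  35.0 GB
```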
FLOPs Estimate
Inference costs roughly 2 × parameters FLOPs per generated token; training costs roughly 6 × parameters × tokens (forward plus backward pass). For MoE models, only the active parameters count.
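The rules of thumb as code, a sketch only; the example uses 13 B active parameters (Mixtral-like) and a 70 B dense model trained on 2 T tokens:

```python
def inference_flops_per_token(active_params):
    # ~2 FLOPs per (active) parameter per generated token
    return 2 * active_params

def training_flops(params, tokens):
    # ~6 FLOPs per parameter per training token (forward + backward)
    return 6 * params * tokens

print(inference_flops_per_token(13e9) / 1e9)   # 13 B active: ~26 GFLOPs per token
print(f"{training_flops(70e9, 2e12):.1e}")     # 70 B dense, 2 T tokens: ~8.4e+23 FLOPs
```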