[Interactive figure readout: panels "Simple: saturation at ~2 examples (85%→88%)", "Medium: saturation at ~5 examples (70%→85%)", "Complex: saturation at ~10 examples (50%→75%)"; at the current optimal point (n = 5): accuracy 85%, gain per example 3.2%, total token cost ~650.]
Fig. 1 | Few-shot scaling curves for three task difficulties. Each curve has a different saturation point: simple tasks need only ~2 examples, complex tasks up to ~10; beyond that, gains stagnate (diminishing returns).
📉
Plateau After 5-10 Examples
The gain curve follows a consistent pattern: fast gains at the beginning → gradual flattening → stagnation. After 10 examples, the marginal gain is often <1% per additional example.
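A minimal sketch of how this cutoff could be detected from measured accuracies; the accuracy values and the 1-point threshold below are illustrative assumptions, not benchmark results:

```python
# Illustrative sketch: find the point where the marginal gain per added
# example drops below a threshold (e.g. 1 percentage point).
# The accuracy values are made up for demonstration.

def find_plateau(acc_by_n: list[float], min_gain: float = 1.0) -> int:
    """acc_by_n[n] = accuracy (%) with n examples. Return the smallest n
    at which the gain from adding the n-th example falls below min_gain."""
    for n in range(1, len(acc_by_n)):
        if acc_by_n[n] - acc_by_n[n - 1] < min_gain:
            return n
    return len(acc_by_n)

# hypothetical accuracies with 0..8 examples
acc = [60.0, 72.0, 78.0, 81.5, 83.5, 84.5, 85.0, 85.3, 85.5]
print(find_plateau(acc))  # -> 6: the 6th example adds < 1 pp, so ~5 examples suffice
```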
⏱️
Task Difficulty Determines Plateau
Simple tasks: Plateau at 2-3 examples. Medium: at 5-7. Complex: at 10-15. The task complexity limits how much the model can learn from examples.
💰
Token Costs Outweigh Gains
15+ examples add roughly 1,500 tokens plus a latency penalty, while the accuracy gain is minimal. Optimal: 5-10 examples, depending on task complexity.
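One way to operationalize this trade-off is to stop adding examples once the marginal token cost per percentage point of gain exceeds a budget. A rough sketch, where the per-example token count, the budget, and the accuracy values are all assumed for illustration:

```python
# Cost-benefit sketch: stop adding few-shot examples once the marginal gain
# no longer justifies the marginal token cost. All numbers are illustrative.

TOKENS_PER_EXAMPLE = 100       # assumed average length of one example
MAX_TOKENS_PER_POINT = 150     # budget: at most 150 extra tokens per +1 pp accuracy

# hypothetical accuracy (%) with 0, 1, 2, ... examples
acc_by_n = [60.0, 72.0, 78.0, 81.5, 83.5, 84.5, 85.0, 85.3]

def choose_n_examples(acc_by_n, tokens_per_example, max_tokens_per_point):
    best_n = 0
    for n in range(1, len(acc_by_n)):
        gain = acc_by_n[n] - acc_by_n[n - 1]           # pp gained by the n-th example
        if gain <= 0 or tokens_per_example / gain > max_tokens_per_point:
            break                                      # too expensive per point gained
        best_n = n
    return best_n

print(choose_n_examples(acc_by_n, TOKENS_PER_EXAMPLE, MAX_TOKENS_PER_POINT))
# -> 5: the 6th example would cost 200 tokens per pp, above the 150-token budget
```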
🎯
Format Is Learned Early, Semantics Late
Examples 1-2: the model learns the output format. Examples 3-5: semantic patterns. Examples 6+: nuances. The saturation threshold itself, however, is predetermined by the task's nature and complexity.
📊
Power-Law Scaling
Accuracy follows a power law: A(n) = A_∞ − c·n^(−α), with α ≈ 0.4-0.6. This explains the plateau: the marginal gain per example scales as n^(−α−1), so each additional example contributes less and less.
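A quick numerical illustration of what that exponent implies; the constants A_∞ = 90, c = 30, α = 0.5 below are assumed for the example, not fitted to real data:

```python
# Numerical illustration of the power-law plateau A(n) = A_inf - c * n**(-alpha).
# The constants are assumptions chosen for illustration, not fitted values.

A_INF, C, ALPHA = 90.0, 30.0, 0.5

def accuracy(n: int) -> float:
    return A_INF - C * n ** (-ALPHA)

for n in (1, 2, 5, 10, 20):
    marginal = accuracy(n) - accuracy(n - 1) if n > 1 else float("nan")
    print(f"n={n:2d}  A(n)={accuracy(n):5.1f}%  gain vs. n-1 = {marginal:4.1f} pp")

# Output (rounded):
# n= 1  A(n)= 60.0%  gain vs. n-1 =  nan pp
# n= 2  A(n)= 68.8%  gain vs. n-1 =  8.8 pp
# n= 5  A(n)= 76.6%  gain vs. n-1 =  1.6 pp
# n=10  A(n)= 80.5%  gain vs. n-1 =  0.5 pp
# n=20  A(n)= 83.3%  gain vs. n-1 =  0.2 pp
```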
🚀
Larger Models Plateau Earlier
70B models plateau at 3-5 examples; 7B models at 8-12. Larger models have stronger prior knowledge and need fewer examples to define the structure of the task.
| Task Type | Baseline | Optimal N | Accuracy @ Optimal | Gain | Recommendation |
|---|---|---|---|---|---|
| Simple | 70% | 2 | 88% | +18 pp | 2-3 examples; more yields <1% gain |
| Medium | 60% | 5 | 85% | +25 pp | 5-7 examples; cost-benefit sweet spot |
| Complex | 45% | 10 | 75% | +30 pp | 8-12 examples; over 12, gains are marginal |
| Very Complex | 30% | 15 | 68% | +38 pp | Reconsider: prompt engineering or fine-tuning? |
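These recommendations can be encoded as a small lookup, sketched below; classifying a task as simple, medium, complex, or very complex is assumed to happen elsewhere, and the ranges simply mirror the table:

```python
# Sketch: recommended few-shot counts per task type, mirroring the table above.
# How a task gets classified into one of these buckets is assumed to be
# handled elsewhere.

RECOMMENDATIONS = {
    "simple":       ((2, 3),   "more than 3 examples yields <1% gain"),
    "medium":       ((5, 7),   "cost-benefit sweet spot"),
    "complex":      ((8, 12),  "over 12, gains are marginal"),
    "very_complex": ((15, 15), "reconsider: prompt engineering or fine-tuning"),
}

def recommended_examples(task_type: str) -> tuple[int, int]:
    """Return the (min, max) recommended number of few-shot examples."""
    n_range, _note = RECOMMENDATIONS[task_type]
    return n_range

print(recommended_examples("medium"))  # -> (5, 7)
```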