Overview: Two Concepts for Flexible Inference

Traditional LLMs spend a fixed amount of compute per token. Flexible Inference Scaling breaks this paradigm: users can directly control how much thinking time the model invests in a specific task. Two complementary approaches enable this flexibility:

Effort Parameter (Claude 4.5)

  • Range: 1–10
  • Thinking tokens: 100–5,000
  • Speed: up to 10× faster (at level 1)
  • Quality: unchanged up to level 8

Thinking Budget (o1, Qwen 3)

  • Token range: 100–32,000
  • Allocation: user-defined
  • Speed: up to 5× slower (at maximum budget)
  • Quality: very high

Effort Parameter: Quick Calibration

The Effort Parameter is a simple numerical value (1–10) that the user adjusts per request. The model uses this value to internally decide how much reasoning to invest. This enables quick optimization for different task requirements.

Example (Claude 4.5 with Effort Parameter):

  • Effort 1–2: Quick response, minimal reasoning (~100 thinking tokens). Ideal for classification and simple questions.
  • Effort 5: Balanced (default behavior, ~1000 thinking tokens). Recommended for general use.
  • Effort 8–10: Deep reasoning, maximum quality (~5000 thinking tokens). For complex mathematical and logical problems.
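As an illustration, effort selection can be wrapped in a small helper that maps task categories to the levels above. The categories, mapping, and function name here are assumptions for illustration, not part of any official API:

```python
# Hypothetical sketch: choosing an effort level per task category.
# The categories and thresholds are illustrative assumptions.

def choose_effort(task_type: str) -> int:
    """Map a task category to an effort level (1-10)."""
    effort_by_task = {
        "classification": 1,  # quick response, minimal reasoning
        "general": 5,         # balanced default
        "math_proof": 9,      # deep reasoning, maximum quality
    }
    return effort_by_task.get(task_type, 5)  # fall back to the balanced default

print(choose_effort("classification"))  # low effort for simple tasks
```

In production, such a mapping would typically be driven by request metadata rather than hard-coded categories.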

Thinking Budget: Explicit Token Control

The Thinking Budget is more precise than the Effort Parameter: users explicitly specify how many tokens the model may spend on thinking (e.g., 1,000, 5,000, or 32,000). The model uses this budget to generate chain-of-thought reasoning in its hidden thinking tokens before producing the final response.

The main advantage is precise cost control. The budget can be specified in the API:

    # Python API example
    # (assumes an SDK client has already been initialized as `client`)
    response = client.messages.create(
        model="o1",
        messages=[{"role": "user", "content": "Solve this math problem..."}],
        thinking_budget_tokens=8000,  # explicit budget in tokens
    )

Difference from the Effort Parameter: a Thinking Budget is an absolute token count (e.g., exactly 8,000 tokens), not a relative setting like the effort parameter. This enables precise cost planning for batch operations and production systems.
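Because the budget is an absolute token count, the worst-case spend for a batch job reduces to a multiplication. A minimal sketch, assuming an illustrative per-token price (not a real price list):

```python
# Sketch: an absolute thinking budget gives a hard upper bound on
# batch cost. The price constant is an assumption for illustration.

PRICE_PER_1K_TOKENS = 0.015  # assumed price, USD per 1,000 tokens

def max_batch_thinking_cost(n_requests: int, budget_tokens: int) -> float:
    """Upper bound on thinking-token cost for a batch job."""
    total_tokens = n_requests * budget_tokens
    return total_tokens / 1000 * PRICE_PER_1K_TOKENS

# 10,000 requests, each capped at 8,000 thinking tokens:
print(round(max_batch_thinking_cost(10_000, 8000), 2))
```

With a relative effort parameter, no such hard bound exists, since the actual number of thinking tokens is decided by the model.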

Architecture: Dual-Mode Models

Both parameter types rely on dual-mode models: a single model can run with high (Effort 8) or low (Effort 1) compute, at an additional overhead of only ~15% for the feature flag during tokenization. This allows one model to serve all requirements.
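A toy sketch of how effort-conditioned early exit might behave: low effort lowers the confidence bar for stopping early, high effort runs the full stack. The layer confidences and threshold formula are invented for illustration; real dual-mode implementations differ:

```python
# Toy model of effort-conditioned computational gating (early exit).
# Layer confidences and the threshold formula are illustrative
# assumptions, not a real architecture.

def forward_with_gating(layer_confidences: list[float], effort: int) -> int:
    """Return the number of layers actually executed.

    Low effort -> low confidence threshold -> earlier exit.
    High effort -> run (nearly) the full layer stack.
    """
    threshold = 0.5 + 0.05 * effort  # effort 1 -> ~0.55, effort 10 -> ~1.0
    for i, conf in enumerate(layer_confidences, start=1):
        if conf >= threshold:
            return i  # gating: skip the remaining layers
    return len(layer_confidences)

confidences = [0.3, 0.5, 0.6, 0.8, 0.95]
print(forward_with_gating(confidences, effort=1))   # exits early
print(forward_with_gating(confidences, effort=10))  # runs all layers
```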


Under the hood (implementation details):

  • Tokenization: the effort level is encoded as an additional feature-flag bit in the embedding
  • Attention layers: a router uses this flag to decide which expert group is activated
  • FFN: early exit is possible at low effort (computational gating)
  • Output layer: precision may be reduced when speed is prioritized

6 Insights: Flexible Inference in Production

🎯 Parameter Type Choice

Effort Parameter: for general use with manual calibration.
Thinking Budget: for batch processing with cost guarantees.

💰 Cost Reduction via Effort 1

Effort 1 is ~60% cheaper than Effort 10, since it uses only 100 instead of 5,000 thinking tokens. Ideal for simple classification tasks with high throughput.
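A back-of-the-envelope check, assuming total cost scales with total tokens and an assumed response length of 3,000 tokens (the actual saving depends on how large the response is relative to the thinking tokens):

```python
# Rough sanity check of the cost gap between Effort 1 and Effort 10.
# OUTPUT_TOKENS is an illustrative assumption, not a measured value.

OUTPUT_TOKENS = 3000  # assumed response length

def relative_cost(thinking_tokens: int) -> int:
    """Total billable tokens for one request."""
    return thinking_tokens + OUTPUT_TOKENS

low = relative_cost(100)     # Effort 1:  ~100 thinking tokens
high = relative_cost(5000)   # Effort 10: ~5,000 thinking tokens
saving = 1 - low / high
print(f"{saving:.0%}")  # lands near the ~60% figure quoted above
```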

📈 Quality Plateau at Effort 8

Evaluations indicate that quality gains beyond Effort 8 are marginal (~2% additional accuracy). The best ROI for balanced applications lies at Effort 5–7.

⏱️ Latency Tradeoff

Effort 10: ~3–5 seconds per response.
Effort 1: ~200 ms per response.
Recommendation for user-facing apps: Effort 3–5.

🔄 Adaptive Effort in Production

A proven pattern: start with Effort 3 and escalate to Effort 7 only when a quality check fails. This saves ~40% compute while matching the high-effort output quality in 99% of cases.
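The escalation pattern can be sketched as follows; `call_model` and `passes_check` are hypothetical stand-ins for a real API call and a real quality validator:

```python
# Sketch of adaptive effort escalation: try cheap first, retry with
# higher effort only when a quality check fails. `call_model` and
# `passes_check` are hypothetical stand-ins, not a real API.

def call_model(prompt: str, effort: int) -> str:
    # Stand-in for an LLM call with an effort parameter.
    return f"answer(effort={effort})"

def passes_check(answer: str) -> bool:
    # Stand-in quality gate; here it only accepts the high-effort answer.
    return "effort=7" in answer

def answer_with_escalation(prompt: str) -> str:
    for effort in (3, 7):  # cheap first, escalate once on failure
        answer = call_model(prompt, effort)
        if passes_check(answer):
            return answer
    return answer  # return the last attempt regardless

print(answer_with_escalation("Summarize this contract."))
```

The compute saving comes from the fact that, in practice, most requests pass the quality check on the first, cheap attempt.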

🚀 Future: Token-Based Optimizers

Next generation: models that optimize effort per token automatically instead of setting it globally. Per-token effort adjustment enables even higher efficiency.