Overview: Two Concepts for Flexible Inference

Traditional LLMs spend a fixed amount of compute per token. Flexible Inference Scaling breaks this paradigm: users can directly control how much thinking time the model invests in a specific task. Two complementary approaches enable this flexibility:

Effort Parameter (Claude 4.5)

  • Range: 1–10
  • Thinking tokens: 100–5,000
  • Speed: up to 10× faster (at level 1)
  • Quality: unchanged up to level 8

Thinking Budget (o1, Qwen 3)

  • Token range: 100–32,000
  • Allocation: user-defined
  • Speed: up to 5× slower (at maximum budget)
  • Quality: very high

Effort Parameter: Quick Calibration

The Effort Parameter is a simple numerical value (1–10) that the user adjusts per request. The model uses this value to internally decide how much reasoning to invest. This enables quick optimization for different task requirements.

Example (Claude 4.5 with Effort Parameter):

  • Effort 1–2: Quick response, minimal reasoning (~100 thinking tokens). Ideal for classification and simple questions.
  • Effort 5: Balanced (default behavior, ~1000 thinking tokens). Recommended for general use.
  • Effort 8–10: Deep reasoning, maximum quality (~5000 thinking tokens). For complex mathematical and logical problems.
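As an illustration, effort selection can be wrapped in a small helper that maps task categories to the levels above. The categories, mapping, and function name here are assumptions for illustration, not part of any official API:

```python
# Hypothetical sketch: choosing an effort level per task category.
# The categories and thresholds are illustrative assumptions.

def choose_effort(task_type: str) -> int:
    """Map a task category to an effort level (1-10)."""
    effort_by_task = {
        "classification": 1,  # quick response, minimal reasoning
        "general": 5,         # balanced default
        "math_proof": 9,      # deep reasoning, maximum quality
    }
    return effort_by_task.get(task_type, 5)  # fall back to the balanced default

print(choose_effort("classification"))  # low effort for simple tasks
```

In production, such a mapping would typically be driven by request metadata rather than hard-coded categories.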

Thinking Budget: Explicit Token Control

The Thinking Budget is more precise than the Effort Parameter: users explicitly specify how many tokens the model may spend on thinking (e.g., 1,000, 5,000, or 32,000). The model uses this budget to generate chain-of-thought reasoning in its hidden thinking tokens before producing the final response.

The main advantage is precise cost control. The budget can be specified in the API:

    # Python API example
    # (assumes an SDK client has already been initialized as `client`)
    response = client.messages.create(
        model="o1",
        messages=[{"role": "user", "content": "Solve this math problem..."}],
        thinking_budget_tokens=8000,  # explicit budget in tokens
    )

Difference from the Effort Parameter: a Thinking Budget is an absolute token count (e.g., exactly 8,000 tokens), not a relative setting like the effort parameter. This enables precise cost planning for batch operations and production systems.
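Because the budget is an absolute token count, the worst-case spend for a batch job reduces to a multiplication. A minimal sketch, assuming an illustrative per-token price (not a real price list):

```python
# Sketch: an absolute thinking budget gives a hard upper bound on
# batch cost. The price constant is an assumption for illustration.

PRICE_PER_1K_TOKENS = 0.015  # assumed price, USD per 1,000 tokens

def max_batch_thinking_cost(n_requests: int, budget_tokens: int) -> float:
    """Upper bound on thinking-token cost for a batch job."""
    total_tokens = n_requests * budget_tokens
    return total_tokens / 1000 * PRICE_PER_1K_TOKENS

# 10,000 requests, each capped at 8,000 thinking tokens:
print(round(max_batch_thinking_cost(10_000, 8000), 2))
```

With a relative effort parameter, no such hard bound exists, since the actual number of thinking tokens is decided by the model.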

Architecture: Dual-Mode Models

Both parameter types rely on dual-mode models: a single model can run with high (Effort 8) or low (Effort 1) compute, at an additional overhead of only ~15% for the feature flag during tokenization. This allows one model to serve all requirements.
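A toy sketch of how effort-conditioned early exit might behave: low effort lowers the confidence bar for stopping early, high effort runs the full stack. The layer confidences and threshold formula are invented for illustration; real dual-mode implementations differ:

```python
# Toy model of effort-conditioned computational gating (early exit).
# Layer confidences and the threshold formula are illustrative
# assumptions, not a real architecture.

def forward_with_gating(layer_confidences: list[float], effort: int) -> int:
    """Return the number of layers actually executed.

    Low effort -> low confidence threshold -> earlier exit.
    High effort -> run (nearly) the full layer stack.
    """
    threshold = 0.5 + 0.05 * effort  # effort 1 -> ~0.55, effort 10 -> ~1.0
    for i, conf in enumerate(layer_confidences, start=1):
        if conf >= threshold:
            return i  # gating: skip the remaining layers
    return len(layer_confidences)

confidences = [0.3, 0.5, 0.6, 0.8, 0.95]
print(forward_with_gating(confidences, effort=1))   # exits early
print(forward_with_gating(confidences, effort=10))  # runs all layers
```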


Under the hood (implementation details):

  • Tokenization: the effort level is encoded as an additional feature-flag bit in the embedding
  • Attention layers: a router uses this flag to decide which expert group is activated
  • FFN: early exit is possible at low effort (computational gating)
  • Output layer: precision may be reduced when speed is prioritized

6 Insights: Flexible Inference in Production

🎯 Parameter Type Choice

Effort Parameter: for general use with manual calibration.
Thinking Budget: for batch processing with cost guarantees.

💰 Cost Reduction via Effort 1

Effort 1 is ~60% cheaper than Effort 10, since it uses only 100 instead of 5,000 thinking tokens. Ideal for simple classification tasks with high throughput.
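A back-of-the-envelope check, assuming total cost scales with total tokens and an assumed response length of 3,000 tokens (the actual saving depends on how large the response is relative to the thinking tokens):

```python
# Rough sanity check of the cost gap between Effort 1 and Effort 10.
# OUTPUT_TOKENS is an illustrative assumption, not a measured value.

OUTPUT_TOKENS = 3000  # assumed response length

def relative_cost(thinking_tokens: int) -> int:
    """Total billable tokens for one request."""
    return thinking_tokens + OUTPUT_TOKENS

low = relative_cost(100)     # Effort 1:  ~100 thinking tokens
high = relative_cost(5000)   # Effort 10: ~5,000 thinking tokens
saving = 1 - low / high
print(f"{saving:.0%}")  # lands near the ~60% figure quoted above
```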

📈 Quality Plateau at Effort 8

Evaluations indicate that quality gains beyond Effort 8 are marginal (~2% additional accuracy). The best ROI for balanced applications lies at Effort 5–7.

⏱️ Latency Tradeoff

Effort 10: ~3–5 seconds per response.
Effort 1: ~200 ms per response.
Recommendation for user-facing apps: Effort 3–5.

🔄 Adaptive Effort in Production

A proven pattern: start with Effort 3 and escalate to Effort 7 only when a quality check fails. This saves ~40% compute while matching the high-effort output quality in 99% of cases.
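The escalation pattern can be sketched as follows; `call_model` and `passes_check` are hypothetical stand-ins for a real API call and a real quality validator:

```python
# Sketch of adaptive effort escalation: try cheap first, retry with
# higher effort only when a quality check fails. `call_model` and
# `passes_check` are hypothetical stand-ins, not a real API.

def call_model(prompt: str, effort: int) -> str:
    # Stand-in for an LLM call with an effort parameter.
    return f"answer(effort={effort})"

def passes_check(answer: str) -> bool:
    # Stand-in quality gate; here it only accepts the high-effort answer.
    return "effort=7" in answer

def answer_with_escalation(prompt: str) -> str:
    for effort in (3, 7):  # cheap first, escalate once on failure
        answer = call_model(prompt, effort)
        if passes_check(answer):
            return answer
    return answer  # return the last attempt regardless

print(answer_with_escalation("Summarize this contract."))
```

The compute saving comes from the fact that, in practice, most requests pass the quality check on the first, cheap attempt.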

🚀 Future: Token-Based Optimizers

Next generation: models that optimize effort per token automatically instead of setting it globally. Per-token effort adjustment enables even higher efficiency.