By what date will at least one state-of-the-art general-purpose AI system not be a reasoning model?

MANIFOLD

2030

72%

01.01.2030

69%

01.07.2029

66%

01.01.2029

59%

01.07.2028

55%

01.01.2028

51%

01.07.2027

44%

01.01.2027

38%

01.07.2026

This market is part of the paper: A Concrete Roadmap towards Safety Cases based on Chain-of-Thought Monitoring

This market resolves based on whether, at each specified date, there exists at least one SOTA model that is not a reasoning model.

Reasoning Model Definition

A "reasoning model" must meet all of the following criteria:

It is a language model: The system must be able to take language as input and produce language as output. AlphaGo, for example, would not count.
It has been trained to use inference-time compute: The system must have undergone significant training in using more than a single forward pass before giving its final output, and must be able to scale inference compute for better performance.
The extra inference compute produces an artifact: The model's use of extra inference compute must produce some artifact, such as a classic chain-of-thought or a list of neuralese activations. For example, a Coconut model counts as a reasoning model here.
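
The three criteria above combine as a logical AND. A minimal sketch of that check follows, assuming each criterion has already been judged by a human and recorded as a boolean. The class and field names are hypothetical and are not part of the market description.

```python
from dataclasses import dataclass


@dataclass
class SystemProfile:
    # Hypothetical fields; each mirrors one criterion of the definition.
    is_language_model: bool              # inputs and outputs language
    trained_for_inference_compute: bool  # trained to use more than one forward pass, scalably
    produces_reasoning_artifact: bool    # e.g. chain-of-thought or neuralese activations


def is_reasoning_model(system: SystemProfile) -> bool:
    """A system is a reasoning model only if all three criteria hold."""
    return (
        system.is_language_model
        and system.trained_for_inference_compute
        and system.produces_reasoning_artifact
    )


# AlphaGo fails the first criterion, so the other two values are placeholders.
alphago = SystemProfile(
    is_language_model=False,
    trained_for_inference_compute=False,
    produces_reasoning_artifact=False,
)

# The description states that a Coconut model counts as a reasoning model.
coconut = SystemProfile(
    is_language_model=True,
    trained_for_inference_compute=True,
    produces_reasoning_artifact=True,
)

print(is_reasoning_model(alphago))
print(is_reasoning_model(coconut))
```

The booleans stand in for judgment calls. The code only illustrates that failing any single criterion is enough for a system not to count as a reasoning model.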

State-of-the-Art (SOTA) Definition

A model is considered "state-of-the-art" if it meets these criteria:

Widely recognized by AI community consensus as among the 3-5 best models
Among the top performers on major benchmarks
Deployed status: The model must be either of the following (models used only for testing, evaluation, or red-teaming do not qualify):
- Publicly deployed (available via API or direct access)
- Known to be deployed internally at AI labs for actual work (e.g., automating research, production use)
Assessed as having significant overall capabilities and impact
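
The SOTA criteria can be sketched the same way. This sketch assumes that all listed criteria must hold together, which the description implies but does not state outright, and that each one is a human judgment recorded as a flag. The enum, class, and field names are hypothetical.

```python
from dataclasses import dataclass
from enum import Enum


class Deployment(Enum):
    PUBLIC = "public"                # available via API or direct access
    INTERNAL_WORK = "internal_work"  # used internally at a lab for actual work
    TESTING_ONLY = "testing_only"    # testing, evaluation, or red-teaming only


@dataclass
class ModelAssessment:
    # Hypothetical fields; each mirrors one criterion of the definition.
    consensus_top_tier: bool                   # among the 3-5 best by community consensus
    top_benchmark_performer: bool              # among the top on major benchmarks
    deployment: Deployment
    significant_capabilities_and_impact: bool


def is_sota(model: ModelAssessment) -> bool:
    """Assumes every criterion is required; testing-only deployment disqualifies."""
    deployed = model.deployment in (Deployment.PUBLIC, Deployment.INTERNAL_WORK)
    return (
        model.consensus_top_tier
        and model.top_benchmark_performer
        and deployed
        and model.significant_capabilities_and_impact
    )


# Illustrative inputs only, not assessments of any real model.
public_model = ModelAssessment(True, True, Deployment.PUBLIC, True)
red_team_only = ModelAssessment(True, True, Deployment.TESTING_ONLY, True)

print(is_sota(public_model))
print(is_sota(red_team_only))
```

Under these assumptions, a date would resolve positively when at least one model passes this check while failing the reasoning-model definition.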

Update 2026-03-03 (PST) (AI summary of creator comment): Claude Opus 4.6 in non-thinking mode would count as a non-reasoning SOTA model if it is in the top 3-5 models in benchmarks and usefulness by June 1st, which would trigger a positive resolution for that date.

2 Comments

I might be confused by the description: does Claude Opus 4.6 in non-thinking mode count? It actually beats the thinking variant on WebDev Arena and is in the top 3 or 5 on every benchmark I've seen that splits out thinking and non-thinking modes.

@2b3o4o
Good question. I think if Opus 4.6 in non-thinking mode is actually among the top 3-5 models in benchmarks and usefulness by June 1st, I would resolve positive.