MANIFOLD
By what date will at least one state-of-the-art general-purpose AI system not be a reasoning model?
01.01.2030: 72%
01.07.2029: 69%
01.01.2029: 66%
01.07.2028: 59%
01.01.2028: 55%
01.07.2027: 51%
01.01.2027: 44%
01.07.2026: 38%

This market is part of the paper: A Concrete Roadmap towards Safety Cases based on Chain-of-Thought Monitoring

This market resolves based on whether, at each specified date, there exists at least one SOTA model that is not a reasoning model.

Reasoning Model Definition

A "reasoning model" must meet all of the following criteria:

  1. It is a language model - The system must be able to take language as input and produce language as output. For example, AlphaGo would not count.

  2. It has been trained to use inference-time compute - The system must have undergone significant training in using more than a single forward pass before giving its final output, with the ability to scale inference compute for better performance.

  3. The extra inference compute produces an artifact - The way the model uses extra inference compute must lead to some artifact, like a classic chain-of-thought or a list of neuralese activations. For example, a Coconut model counts as a reasoning model here.
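The three criteria above combine with a logical AND. A minimal sketch of that check follows; the class and field names are hypothetical labels for the criteria, not part of the market, and whether a real system satisfies each criterion remains a judgment call by the market creator.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SystemTraits:
    """Hypothetical record of how a system fares on the three criteria."""
    is_language_model: bool               # criterion 1: inputs and outputs language
    trained_for_inference_compute: bool   # criterion 2: trained to use more than one forward pass
    produces_reasoning_artifact: bool     # criterion 3: chain-of-thought, neuralese activations, etc.


def is_reasoning_model(traits: SystemTraits) -> bool:
    # A reasoning model must meet ALL three criteria.
    return (
        traits.is_language_model
        and traits.trained_for_inference_compute
        and traits.produces_reasoning_artifact
    )


# Coconut-style model: counts as a reasoning model per the description.
coconut_like = SystemTraits(True, True, True)

# AlphaGo-style system: fails criterion 1, so the other fields do not matter.
alphago_like = SystemTraits(False, True, True)

print(is_reasoning_model(coconut_like))   # True
print(is_reasoning_model(alphago_like))   # False
```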

State-of-the-Art (SOTA) Definition

A model is considered "state-of-the-art" if it meets these criteria:

  • Widely recognized by AI community consensus as among the 3-5 best models

  • Among the top performers on major benchmarks

  • Deployed status: The model must be either:

    • Publicly deployed (available via API or direct access), or

    • Known to be deployed internally at AI labs for actual work (e.g., automating research, production use)

    Models used only for testing, evaluation, or red-teaming do not qualify.

  • Assessed as having significant overall capabilities and impact

  • Update 2026-03-03 (PST) (AI summary of creator comment): Claude Opus 4.6 in non-thinking mode would count as a non-reasoning SOTA model if it is in the top 3-5 models in benchmarks and usefulness by June 1st, which would trigger a positive resolution for that date.
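As a rough illustration of how the SOTA criteria and the resolution rule fit together, the sketch below treats each criterion as a boolean that has already been judged for a given date. All names here (`ModelAssessment`, `is_sota`, `resolves_yes`, and the example models) are hypothetical; the actual resolution rests on the creator's assessment, not on a mechanical check.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class ModelAssessment:
    """Hypothetical per-date judgment of one model against the criteria."""
    name: str
    is_reasoning_model: bool
    top_3_to_5_by_consensus: bool
    top_benchmark_performer: bool
    publicly_deployed: bool
    used_internally_for_real_work: bool   # testing-only use does not count
    significant_capabilities_and_impact: bool


def is_sota(m: ModelAssessment) -> bool:
    # Either deployment route satisfies the "deployed status" criterion.
    deployed = m.publicly_deployed or m.used_internally_for_real_work
    return (
        m.top_3_to_5_by_consensus
        and m.top_benchmark_performer
        and deployed
        and m.significant_capabilities_and_impact
    )


def resolves_yes(models: Iterable[ModelAssessment]) -> bool:
    # A date resolves positive if at least one SOTA model is not a reasoning model.
    return any(is_sota(m) and not m.is_reasoning_model for m in models)


# Illustrative, made-up assessments for a single date.
reasoning_sota = ModelAssessment("model-a", True, True, True, True, False, True)
test_only = ModelAssessment("model-b", False, True, True, False, False, True)
non_reasoning_sota = ModelAssessment("model-c", False, True, True, True, False, True)

print(resolves_yes([reasoning_sota, test_only]))           # False
print(resolves_yes([reasoning_sota, non_reasoning_sota]))  # True
```

In this sketch, `model-b` is excluded because it is neither publicly deployed nor used internally for real work, which mirrors the rule that testing-only models do not qualify.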

I might be confused by the description: does Claude Opus 4.6 non-thinking count? It actually beats the thinking variant on WebDev Arena and is top 3 or 5 in all benchmarks I've seen that split out thinking and non-thinking modes.

@2b3o4o
Good question. I think if Opus 4.6 in non-thinking mode is actually in the top 3-5 models in benchmarks and usefulness by June 1st, I would resolve positive.
