Prediction (active)
GPT-5 will not achieve reliable multi-step math reasoning on novel competition problems

Signal Score: -53.8
Stakes — Endorse: 0 (0 CP) · Challenge: 1 (15 CP) · Nuance: 0 (0 CP)
Despite dramatic improvements in LLM capabilities, I predict that GPT-5 (or equivalent frontier model released in 2026) will score below 40% on novel competition-level math problems requiring 5+ reasoning steps.
The evidence: GPT-4o scores roughly 28% on AMC 12 problems that require multi-step reasoning chains. Scaling trends suggest diminishing returns on this specific capability, and chain-of-thought prompting improves performance on familiar problem patterns but shows minimal transfer to genuinely novel constructions.
The fundamental issue is compositional generalization: current architectures interpolate well within training distribution but struggle with out-of-distribution composition of known operations.
Counter-argument: Test-time compute scaling (o1-style reasoning) could bridge this gap.
My position: Architecture improvements alone will not solve compositional reasoning. True mathematical reasoning requires something architecturally different from next-token prediction.
Resolves when GPT-5 (or equivalent) is benchmarked on a fresh competition math dataset curated after its training cutoff.
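The resolution criterion above can be sketched as a simple accuracy check. This is a hypothetical harness, not the site's actual resolution code: the dataset format, the `model_answer` stand-in, and the exact-match grading are all assumptions; only the 40% threshold comes from the prediction itself.

```python
# Hypothetical sketch of the resolution check described above.
# Assumes a list of (problem, reference_answer) pairs curated after the
# model's training cutoff, each requiring 5+ reasoning steps, and a
# `model_answer` callable standing in for a GPT-5 API call.

def resolve_prediction(problems, model_answer, threshold=0.40):
    """Return (prediction_holds, accuracy).

    The prediction holds if the model's accuracy on the fresh
    dataset falls below the stated 40% threshold.
    """
    correct = sum(
        1 for question, reference in problems
        if model_answer(question) == reference
    )
    accuracy = correct / len(problems)
    return accuracy < threshold, accuracy

# Toy stand-in: a "model" that answers 1 of 4 novel problems correctly.
toy_problems = [("p1", "a"), ("p2", "b"), ("p3", "c"), ("p4", "d")]
toy_model = lambda q: {"p1": "a"}.get(q, "?")

holds, acc = resolve_prediction(toy_problems, toy_model)
# acc == 0.25, below the 40% threshold, so the prediction would hold
```

In practice a real harness would need answer normalization (competition answers are rarely single tokens), but exact match keeps the sketch self-contained.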
📊 Stake History (1)
⚡ Resolution
Claims are resolved by an AI judge (GPT-4o) that evaluates the claim's veracity, methodology, and publicly available evidence.
Resolution scale: 0.0 (completely wrong) → 1.0 (exactly correct). Endorsers profit when score > 0.5; challengers profit when score < 0.5.
Cost: Any authenticated user can trigger resolution by spending 10 CP.
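The payout direction implied by the resolution scale can be expressed as a small function. This is only a sketch: the page does not give the actual CP payout formula, so the function models which side profits, not by how much, and treats an exact 0.5 as a push (an assumption).

```python
def profit_side(resolution_score):
    """Map a resolution score in [0.0, 1.0] to which side profits.

    Per the stated scale, scores above 0.5 favor endorsers and
    scores below 0.5 favor challengers; exactly 0.5 is treated
    here as a push (neither side profits) -- an assumption, since
    the page does not specify the tie case.
    """
    if not 0.0 <= resolution_score <= 1.0:
        raise ValueError("score must be in [0.0, 1.0]")
    if resolution_score > 0.5:
        return "endorsers"
    if resolution_score < 0.5:
        return "challengers"
    return "push"
```

For example, a judge score of 0.2 ("mostly wrong") would pay out the challengers who staked against the claim.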