A team of nine researchers at Sina Weibo posted a 14‑page arXiv report describing VibeThinker‑3B, a 3‑billion‑parameter language model that, according to the report, matches or exceeds the reasoning performance of flagship systems from Google DeepMind, OpenAI, Anthropic and DeepSeek. The model scored 94.3 on the AIME 2026 mathematics exam. With a test‑time scaling technique called Claim‑Level Reliability Assessment, the score rose to 97.1, edging past virtually every publicly recorded system.
VibeThinker‑3B also posted strong results on other verifiable tasks: 91.4 on AIME 2025, 89.3 on the Harvard‑MIT Mathematics Tournament, 93.8 on the Brown University Math Olympiad, and 76.4 on IMO‑AnswerBench. In coding, it achieved a Pass@1 score of 80.2 on LiveCodeBench v6 and a 96.1% acceptance rate on unseen LeetCode contests from April‑May 2026. The researchers built the model on top of Alibaba’s Qwen2.5‑Coder‑3B, applying a four‑stage pipeline: supervised fine‑tuning, reinforcement learning with a MaxEnt‑Guided Policy Optimization algorithm, distillation of high‑quality reasoning traces, and a final round of instruction‑following RL. They frame the results with a “Parametric Compression‑Coverage Hypothesis,” arguing that verifiable reasoning can be compressed into compact models while open‑domain knowledge remains parameter‑expansive.
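For readers unfamiliar with the Pass@1 figure cited above, the sketch below shows a widely used way to estimate pass@k from repeated samples per problem. It is an illustration only: the summary does not say how the VibeThinker‑3B authors computed their score, and the function names `pass_at_k` and `benchmark_pass_at_k` are hypothetical.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate pass@k for one problem, given n sampled solutions of which c passed.

    Uses the standard unbiased estimator 1 - C(n - c, k) / C(n, k).
    Assumption: this may not be the exact procedure used in the report.
    """
    if not (0 <= c <= n) or not (1 <= k <= n):
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    # comb(n - c, k) is 0 when fewer than k samples failed, giving pass@k = 1.
    return 1.0 - comb(n - c, k) / comb(n, k)


def benchmark_pass_at_k(results: list[tuple[int, int]], k: int) -> float:
    """Average pass@k over problems, as a percentage.

    `results` holds one (n, c) pair per problem.
    """
    if not results:
        raise ValueError("results must not be empty")
    return 100.0 * sum(pass_at_k(n, c, k) for n, c in results) / len(results)
```

For k = 1 the estimator reduces to c / n, the fraction of sampled solutions that pass. Under this reading, a Pass@1 of 80.2 would mean that a single sampled solution passes all tests about 80% of the time, averaged over the benchmark's problems.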
The community’s reaction has been mixed. Some users praised the engineering feat, calling it “fascinating” that a model small enough for a consumer laptop could achieve such scores. Others warned that the scores may be “bench‑maxed” and may not reflect real‑world utility, citing gaps in practical coding knowledge and potential benchmark leakage. The debate revives the broader question of whether ever‑larger models are necessary for intelligence, and it suggests that smaller, task‑focused models could carve out a viable niche if benchmark integrity can be ensured.



