AI Benchmarks for Code

Logical Intelligence Tops Leading AI Verification Benchmarks as Verified Code Generation Nears Reality with Aleph

Aleph, an AI coding agent, sets new records on four major formal reasoning benchmarks, proving that automated code generation can be formally verified for mission-critical systems.

13h

DeepSWE blows up the AI coding leaderboard, crowns GPT-5.5, and finds Claude Opus exploiting a benchmark loophole

DeepSWE puts GPT-5.5 atop the AI coding leaderboard while raising new questions about Claude Opus, SWE-Bench Pro, and ...

8h on MSN

Alibaba’s new AI model scores higher than OpenAI, Google rivals in coding ranking

The Chinese tech giant is the only non-US firm to crack the top five in Code Arena's latest leaderboard Alibaba Group Holding ...

Decrypt

Huawei's New Benchmark Gives AI Agents Months of Your Life—Then Watches Them Fail

Claw-Anything simulates a real digital existence and asks AI assistants to handle it. GPT-5.5, the best model available, scored 34.5%.

Morning Overview on MSN

AI models can now work for hours on their own to finish long, complex tasks — crossing the line from chatbots into tireless autonomous digital workers

Not long ago, the best AI models topped out at tasks a human could finish in a few minutes. Ask them to debug a function or ...

Alibaba's proprietary Qwen3.7-Max can run for 35 hours autonomously and supports external harnesses like Anthropic's Claude Code

On the Apex Math Reasoning benchmark, Qwen3.7-Max scored 44.5, eclipsing Claude Opus-4.6 Max's score of 34.5 and DeepSeek ...

Tabnine Named a Visionary in the 2026 Gartner® Magic Quadrant™ for Enterprise AI Coding Agents

A Strategic Game Plan For The Governance Of AI-Enabled Code Development

AI Benchmarks Investigated: Do Companies Tune Private Builds for Leaderboards, Then Ship Weaker Versions?

81% of Enterprise Technology Leaders Report Production Failures from AI-Generated Code, New Research Shows
