AI Coding Challenge K Prize Reveals First Results — and the Scores Are Shockingly Low

Key Takeaways

  • The K Prize AI coding challenge saw a top score of only 7.5% in its first round.
  • Eduardo Rocha de Andrade, a Brazilian prompt engineer, won $50,000 despite the low score.
  • The challenge, created by Andy Konwinski (co-founder of Databricks and Perplexity), aims to test the real-world coding ability of AI models.
  • Konwinski pledged $1 million for an open-source AI model that can score over 90%.
  • The K Prize introduces a “contamination-free” benchmark, avoiding pre-training on known test sets.

What Is the K Prize?

The K Prize is a new AI coding competition designed to measure how well AI models handle real-world programming tasks. The challenge is the brainchild of Andy Konwinski, co-founder of Databricks and Perplexity.

Unlike other benchmarks, the K Prize is built to be extremely difficult and fair. It uses a timed entry system to prevent models from training on the test data in advance.

First Round Results: A Reality Check

The Laude Institute announced the first winner of the K Prize on Wednesday at 5 PM PST.
The top scorer, Eduardo Rocha de Andrade, solved only 7.5% of the coding problems correctly, yet still took home the $50,000 prize.

Konwinski welcomed the tough results:

“Benchmarks should be hard if they’re going to matter,” he said. “This is a reality check on AI’s current coding abilities.”

Why Are the Scores So Low?

Traditional benchmarks like SWE-Bench report much higher results, with top models scoring around 75% on its easier "Verified" test and 34% on its harder full test. However, Konwinski believes these scores may not reflect true AI performance because of contamination: models may have already seen similar problems during training.

The K Prize sidesteps this issue by building its test set from GitHub issues posted after March 12th, the model submission deadline.
This timing guarantees that no submitted model could have trained on the test data beforehand.
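To illustrate the idea, here is a minimal Python sketch of collecting issues opened after a cutoff date via GitHub's public search API. The cutoff year, the repository, and every other detail are illustrative assumptions, not the K Prize's actual pipeline.

```python
import requests

# Cutoff mirroring the March 12 submission deadline mentioned above;
# the year is assumed here purely for illustration.
CUTOFF = "2025-03-12"

def fetch_fresh_issues(repo: str, cutoff: str = CUTOFF, per_page: int = 20):
    """Return issues in `repo` created strictly after `cutoff`.

    Because these issues did not exist before the cutoff, no model
    submitted by that date could have seen them during training.
    """
    query = f"repo:{repo} is:issue created:>{cutoff}"
    resp = requests.get(
        "https://api.github.com/search/issues",
        params={"q": query, "per_page": per_page},
        headers={"Accept": "application/vnd.github+json"},
        timeout=30,
    )
    resp.raise_for_status()
    return [
        {"title": item["title"], "url": item["html_url"], "created": item["created_at"]}
        for item in resp.json()["items"]
    ]

if __name__ == "__main__":
    # Example repository chosen arbitrarily for demonstration.
    for issue in fetch_fresh_issues("pallets/flask"):
        print(issue["created"], issue["title"])
```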

The $1 Million Open-Source Challenge

To push the boundaries of AI coding, Konwinski has pledged $1 million to the first open-source AI model that scores above 90% on the K Prize.

This bold reward aims to motivate developers and researchers to build more capable, open-source coding models.

A Necessary Wake-Up Call for AI

Many AI coding tools have flooded the market, claiming near-human performance. But the K Prize results suggest that AI is far from being a reliable software engineer.

“If we can’t even get more than 10% on a contamination-free SWE-Bench, that’s a reality check,” Konwinski emphasized.

Princeton researcher Sayash Kapoor agrees. He believes new tests and tougher benchmarks are essential to measure true progress in AI coding.

How K Prize Differs from SWE-Bench

The SWE-Bench system is widely used for evaluating AI coding abilities. However, it uses a fixed set of problems. Over time, AI models can be trained directly on these problems, which inflates performance scores.

The K Prize, by contrast:

  • Uses fresh, unseen GitHub issues for each round.
  • Prevents benchmark overfitting.
  • Focuses on real-world bug fixes and coding tasks (a minimal scoring sketch follows this list).
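To make the last bullet concrete, here is a minimal sketch of how a fix-the-issue task is typically scored in SWE-Bench-style setups: pin the repository to the commit the issue was reported against, apply the model's candidate patch, and run the tests tied to that issue. Every path, commit, and test selector below is a hypothetical placeholder; this is not the K Prize's published harness.

```python
import subprocess
from pathlib import Path

def evaluate_patch(repo_dir: Path, base_commit: str,
                   patch_file: Path, test_cmd: list[str]) -> bool:
    """Apply a model-generated patch at a pinned commit and run the issue's tests.

    Returns True if the test command exits cleanly (the fix is accepted),
    False otherwise.
    """
    # Reset the repository to the exact commit the issue was reported against.
    subprocess.run(["git", "checkout", "--force", base_commit],
                   cwd=repo_dir, check=True)

    # Apply the candidate patch; a patch that does not apply counts as a failure.
    applied = subprocess.run(["git", "apply", str(patch_file)], cwd=repo_dir)
    if applied.returncode != 0:
        return False

    # Run the tests that reproduce the issue (e.g. pytest on a specific file).
    result = subprocess.run(test_cmd, cwd=repo_dir)
    return result.returncode == 0

if __name__ == "__main__":
    # Hypothetical task: all paths, commits, and test files are made up.
    passed = evaluate_patch(
        repo_dir=Path("./workdir/example-project"),
        base_commit="abc1234",
        patch_file=Path("./candidate.patch"),
        test_cmd=["pytest", "tests/test_issue_regression.py"],
    )
    print("resolved" if passed else "unresolved")
```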

Looking Ahead

Konwinski expects future rounds of the K Prize to be even more competitive as teams adapt to its structure.
The next rounds should also help clarify whether the gap with older benchmarks reflects contamination in those test sets or simply the difficulty of fresh, real-world coding problems.

Final Thoughts

The K Prize highlights a hard truth: AI coding tools still have a long way to go before they can replace human engineers.
But by raising the bar with a contamination-free, real-world benchmark, the challenge could shape the next generation of AI coding models.