
GPT-5’s Coding Tests Reveal Startling Gaps — Many Developers Still Lean Towards GPT-4o


OpenAI's latest model, GPT-5, has been released, but early coding tests reveal significant weaknesses relative to GPT-4o.

In controlled programming challenges, GPT-5 passed only half of the tasks, a sharp decline from the near-flawless scores of prior models. And while GPT-5 boasts strong reasoning capabilities, early evidence suggests that for critical coding tasks, many developers may prefer the older models.

GPT-5: Soaring Hype, Mixed Reality  

GPT-5 arrived amid enormous anticipation. OpenAI integrated it into ChatGPT, replacing GPT-4o for many users. Early developer tests, however, quickly surfaced reliability issues.

In one test, GPT-5 was asked to create a WordPress plugin for booking appointments, and its first output was completely nonfunctional: the plugin simply redirected users to a traffic-control page instead of performing its primary function. The model only produced a working version on the second attempt.

All prior versions (GPT-3.5, GPT-4, and GPT-4o) had handled the same task without issue.

The Coding Tests — Where GPT-5 Shined and Stumbled  

A set of benchmark tasks was designed to assess GPT-5's programming capabilities:

Creating a WordPress Plugin — Fail

First attempt: captured the overall intent but delivered no functional plugin.

Second attempt: produced a working version.

Verdict: counted as a failure because of the first delivery.

Rewriting a String Function — Pass

GPT-5 adjusted the code to handle the dollars and cents portions of an amount without overcomplicating the logic.

The output matched the requested behavior exactly.
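
The article doesn't reproduce the function from this test, but the task class is familiar: splitting an amount into its dollars and cents portions without extra machinery. A minimal Python sketch of that kind of change (the function name, signature, and formatting are illustrative assumptions, not the test's actual code):

```python
def format_amount(amount: float) -> str:
    """Format a currency amount as a "dollars.cents" string.

    Illustrative only: the function from the actual test is not
    shown in the article.
    """
    # Work in integer cents to avoid floating-point rounding surprises.
    total_cents = round(amount * 100)
    dollars, cents = divmod(total_cents, 100)
    return f"{dollars}.{cents:02d}"


assert format_amount(12.3) == "12.30"
assert format_amount(7) == "7.00"
```

Doing the arithmetic in integer cents keeps the logic simple, which is presumably the "without overcomplicating" part of the verdict.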

Finding a Hidden Bug in WordPress Code — Pass

GPT-5 identified a subtle but critical architectural flaw in the WordPress code and offered a clear fix.

Writing a Multi-Tool Script — Fail

Task: combine Keyboard Maestro, AppleScript, and a script running in Google Chrome into a single workflow.

Mistakes: failed to account for AppleScript's case-sensitivity behavior, invented properties that don't exist, and referenced undefined variables.
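
The article doesn't show the script GPT-5 produced, but for context on what the task demands: on macOS these tools are typically glued together by shelling out to osascript, triggering Keyboard Maestro through its AppleScript interface, and asking Chrome to execute JavaScript in a tab. A minimal Python sketch of that glue, with a placeholder macro name and placeholder JavaScript:

```python
import subprocess


def run_applescript(script: str) -> str:
    """Run an AppleScript snippet via macOS's osascript and return its output."""
    result = subprocess.run(
        ["osascript", "-e", script],
        capture_output=True, text=True, check=True,
    )
    return result.stdout.strip()


# Trigger a Keyboard Maestro macro by name ("My Macro" is a placeholder).
run_applescript(
    'tell application "Keyboard Maestro Engine" to do script "My Macro"'
)

# Ask Google Chrome to run JavaScript in the active tab. Chrome must have
# View > Developer > Allow JavaScript from Apple Events enabled.
title = run_applescript(
    'tell application "Google Chrome" to '
    'execute front window\'s active tab javascript "document.title"'
)
print(title)
```

Notably, the kinds of mistakes reported, invented properties and undefined variables, live inside the AppleScript strings, which Python passes through verbatim and which only fail at run time.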

Editing Feature: Good Idea, Poor Execution

GPT-5 introduced an Edit button for generated code, meant to let users modify code directly within the ChatGPT interface. Unfortunately, the feature was riddled with bugs:

  • The saving mechanism was faulty.
  • You could not return to the original session without redoing the prompt.

Developers’ Frustrations and OpenAI’s Response

When GPT-5 replaced GPT-4o, users, especially Pro subscribers, were frustrated that they could not switch back to the older, more dependable model. Facing overwhelming dissatisfaction, OpenAI reinstated GPT-4o for Pro subscribers, who can now re-enable it in settings under "Show legacy models."

Free-tier users, however, have no choice but to use GPT-5.

Key Takeaways

Drop in performance – OpenAI's GPT-5 failed two of the four benchmark coding tasks, a 50% failure rate.

Plugin test trouble – GPT-5 botched the first attempt at a plugin task its "gold standard" predecessors had handled flawlessly.

Reasoning vs. coding – Better reasoning does not guarantee better programming output.

Developer choice – For high-stakes coding, GPT-4o remains the safer, preferred option.

New editing tool – Promising, but unreliable in its current form.

Final Takeaway

What stands out is that GPT-5 is responsive but not yet reliable for programming tasks. When code needs to work on the first attempt, GPT-4o is the better choice for now. GPT-5's strong reasoning gives it the potential to be a serious contender down the line, but to reclaim "gold standard" status, it will have to improve its coding precision.