
Every AI Model is Struggling with Medicine – LMArena and DataTecnica Propose a New Fix


Artificial intelligence has made impressive strides in fields like writing, coding, and creative content generation, but when it comes to medicine, the technology still falls short. According to a new study by DataTecnica and the U.S. National Institutes of Health’s Center for Alzheimer’s and Related Dementias (CARD), today’s most advanced AI models, including Google’s Gemini, OpenAI’s GPT-5, Meta’s LLaMA, and Anthropic’s Claude, fail to provide medical answers that are both safe and accurate.

This finding is worrying, especially as more people turn to chatbots such as ChatGPT or Gemini for medical advice. In fact, recent research shows that patients sometimes trust AI’s responses more than doctors’ guidance—even when the AI is wrong. This raises major concerns about the risks of misinformation in healthcare and the potential harm of relying on AI without proper oversight.

The Knowledge Gap in Medicine

The study highlights a fundamental mismatch between what AI models are currently good at and what biomedical researchers actually need. Large language models excel at sounding fluent and confident, but in medicine, “sounding correct” is not enough. Doctors and scientists need tools that can interpret complex data, uncover meaningful insights, and reduce error in critical areas such as drug discovery, disease prevention, and clinical decision-making.

To measure how well AI performs in this domain, DataTecnica and CARD researchers introduced CARDBiomedBench, a benchmark suite of biomedical questions and tasks. The results were disappointing: none of the frontier AI models consistently met the demands of biomedical research. In other words, AI may write an essay well, but when it comes to interpreting clinical studies or analyzing lab results, it falls far behind professional standards.
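For readers unfamiliar with how benchmark suites of this kind are run, here is a minimal sketch of a scoring loop in Python. It is not the actual CARDBiomedBench harness: the JSONL file format, the substring scorer, and the model hook are all illustrative assumptions, and real benchmarks typically rely on expert grading or an LLM judge rather than string matching.

```python
import json
from typing import Callable

def score(answer: str, reference: str) -> float:
    """Toy scorer: real benchmarks use expert grading or an LLM judge."""
    return 1.0 if reference.lower() in answer.lower() else 0.0

def run_benchmark(path: str, model: Callable[[str], str]) -> float:
    """Mean score of `model` over a JSONL file of
    {"question": ..., "reference": ...} items (hypothetical format)."""
    scores = []
    with open(path) as f:
        for line in f:
            item = json.loads(line)
            scores.append(score(model(item["question"]), item["reference"]))
    return sum(scores) / len(scores)

# Example: a placeholder "model" that always abstains will score 0
# unless the reference happens to appear in its canned reply.
# run_benchmark("biomed_qa.jsonl", lambda q: "I am not certain.")
```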

This echoes similar findings from OpenAI’s HealthBench, introduced earlier this year, which showed some progress in AI’s medical reasoning but still left “significant room for improvement.” Other academic studies, such as those at Emory University School of Medicine, found that models can perform decently on physician exam-style tests like MedQA, but those are very different from the cutting-edge biomedical research tasks measured by CARDBiomedBench.

Enter BiomedArena – A New Benchmark for AI in Medicine

To help close this gap, LMArena.ai and DataTecnica are teaming up to create BiomedArena, a new leaderboard designed to track and compare how different AI models perform in medical contexts. Unlike general-purpose AI leaderboards that rank chatbots on casual Q&A, BiomedArena will focus specifically on the kinds of challenges scientists face in labs and research centers.

The vision is straightforward but ambitious: let researchers and the public see how well AI systems actually perform in medical problem-solving. BiomedArena will measure tasks such as interpreting experimental data, reviewing medical literature, generating new research hypotheses, and even assisting with clinical translation. Already, scientists at the NIH’s Intramural Research Program are exploring ways to use this approach for high-risk, high-reward projects that traditional academic studies often avoid due to scale and complexity.
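Arena-style leaderboards generally work by collecting head-to-head votes between anonymized models and aggregating them into ratings; LMArena's original chatbot leaderboard popularized Elo-style scoring for exactly this purpose. Whether BiomedArena will use the same mechanics is not stated here, so the sketch below is only an illustration of the general approach.

```python
from collections import defaultdict

K = 32  # update step size; larger K reacts faster to new votes

def expected(r_a: float, r_b: float) -> float:
    """Probability the first model beats the second under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400))

def update(ratings: dict, winner: str, loser: str) -> None:
    """Shift rating mass from the loser to the winner of one comparison."""
    e = expected(ratings[winner], ratings[loser])
    ratings[winner] += K * (1.0 - e)
    ratings[loser] -= K * (1.0 - e)

ratings = defaultdict(lambda: 1000.0)  # every model starts at the same rating
# Each tuple is one human vote: (winning model, losing model).
votes = [("model_a", "model_b"), ("model_b", "model_c"), ("model_a", "model_c")]
for winner, loser in votes:
    update(ratings, winner, loser)

print(sorted(ratings.items(), key=lambda kv: -kv[1]))
```

The appeal of this design is that a single rating per model emerges from many small, local judgments, which is why arena rankings scale well as new models are added.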

If successful, BiomedArena could become the “gold standard scoreboard” for testing medical AI—shining a light on which models are safe and trustworthy and which ones are not ready for real-world use.

The Road Ahead: Opportunities and Challenges


While BiomedArena is a promising step, two major challenges remain. First, AI in medicine becomes far more useful when connected to trusted medical databases. Studies show that even general-purpose large language models improve markedly when they can retrieve up-to-date medical knowledge, such as research journals or clinical guidelines, an approach commonly known as retrieval-augmented generation (RAG). Without measuring how well models use external resources, BiomedArena might capture only part of the picture.
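To make the retrieval idea concrete, here is a minimal sketch of retrieval-augmented prompting in Python. The toy corpus, the word-overlap retriever, and the prompt format are all illustrative assumptions, not any specific medical database, product, or BiomedArena requirement; a production system would use a real search index and then send the assembled prompt to the model.

```python
def retrieve(query: str, corpus: list[str], k: int = 2) -> list[str]:
    """Toy lexical retriever: rank passages by word overlap with the query."""
    q = set(query.lower().split())
    return sorted(corpus, key=lambda p: -len(q & set(p.lower().split())))[:k]

def build_grounded_prompt(query: str, corpus: list[str]) -> str:
    """Assemble a prompt that instructs the model to answer from sources only."""
    passages = retrieve(query, corpus)
    context = "\n".join(f"[{i + 1}] {p}" for i, p in enumerate(passages))
    return ("Answer using ONLY the sources below, citing them by number.\n"
            f"Sources:\n{context}\n\nQuestion: {query}")

# A tiny stand-in corpus; a real system would query clinical guidelines
# or a literature index instead.
corpus = [
    "Metformin is a first-line therapy for type 2 diabetes.",
    "Amoxicillin is a penicillin-class antibiotic.",
    "Regular exercise improves insulin sensitivity.",
]
print(build_grounded_prompt("What is first-line therapy for type 2 diabetes?", corpus))
```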

Second, there’s the rise of medicine-specific AI models, like Google’s Med-PaLM, which was introduced two years ago. These systems are designed with healthcare in mind, unlike general-purpose AI models that try to handle everything from poetry to coding. For BiomedArena to be truly effective, it will need to include these specialized tools, not just the big-name general models.

Looking forward, the future of medical AI will likely depend on a combination of domain-specific training, regulatory oversight, and transparent benchmarking systems. If researchers can build models that not only sound convincing but also meet the rigorous standards of medicine, AI could play a transformative role in healthcare—from speeding up drug discovery to improving patient diagnostics.

A Look Back at AI in Medicine

This is not the first time AI’s role in medicine has faced scrutiny. Over the past decade, AI has repeatedly been praised for its potential, only to be criticized when tested against real-world medical challenges. For example:

  • In 2020, Google Health claimed its AI could outperform radiologists in reading mammograms, but follow-up studies revealed inconsistent results across different patient populations.
  • In 2022, researchers found that OpenAI’s GPT-3 could pass certain medical licensing exam questions, yet failed on nuanced case studies requiring reasoning and patient safety considerations.
  • By 2023, dedicated tools like Med-PaLM promised better accuracy, but even those showed gaps when handling rare conditions or cutting-edge biomedical research.

Now in 2025, with CARDBiomedBench and the launch of BiomedArena, the conversation has shifted from flashy headlines to measurable accountability. Instead of asking whether AI “can” do medicine, researchers are beginning to ask how well it performs in real-world medical science—and how to fix the shortcomings.

Final Thoughts

The message from DataTecnica, NIH’s CARD, and LMArena is clear: AI may be powerful, but medicine demands precision, reliability, and safety above all else. General-purpose models like GPT-5 or Gemini may entertain us, but until they can meet the rigorous demands of biomedical science, they remain tools with limits.

BiomedArena is not just another leaderboard—it represents a critical step in holding AI accountable and making sure that one day, when people ask life-saving questions, the answers they get will be grounded in science, not guesswork.