Grok Had Highest Share of ‘Highly Problematic’ Health Answers in BMJ Open Audit

Written By
Liz Ticong
Apr 20, 2026

You can ask an AI chatbot a health question in seconds. Trusting the answer is a much riskier bet.

A new BMJ Open audit found that 49.6% of responses from five major consumer chatbots to health and medical questions were “problematic,” with Elon Musk’s Grok chatbot delivering the highest share of “highly problematic” answers.

The audit adds to fears that health chatbots can wrap misinformation in a calm, convincing voice.

Confident tone, compromised facts

The topline failure was not just that the chatbots made mistakes — it was the kind of mistakes they made.

Across 250 responses from five consumer tools — Gemini, DeepSeek, Meta AI, ChatGPT, and Grok — the study found many answers drifting away from established scientific consensus or using “hedging language that provided false balance between scientific and non-scientific information.”

The prompts spanned cancer, vaccines, stem cells, nutrition, and athletic performance, all areas already prone to misinformation.

The result was health advice that could sound calm and credible while still pulling readers away from the facts.

Open questions gave bad answers more space

The wording of the question changed the quality of the answer. Open-ended prompts produced far more highly problematic responses than closed-ended ones, making them the riskier format.

Closed-ended questions gave the chatbots less room to wander. Open-ended prompts did the opposite. They opened the door to weaker claims, more speculation, and more polished-sounding misinformation. 

Closed-ended prompts produced just nine highly problematic responses, while open-ended prompts produced 40. Closed-ended prompts also returned 75 non-problematic responses, compared with 51 for open-ended ones.
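As a rough check on how those counts fit together, the short Python sketch below uses only the figures reported in this article. The variable names are illustrative and are not taken from the study. It compares the two prompt types and shows that the non-problematic totals are consistent with the 49.6% problematic share cited above.

```python
# Counts as reported in this article; names are illustrative only.
TOTAL_RESPONSES = 250
highly_problematic = {"closed": 9, "open": 40}
non_problematic = {"closed": 75, "open": 51}

# How many times more highly problematic answers came from open-ended prompts.
ratio = highly_problematic["open"] / highly_problematic["closed"]

# Everything that was not rated non-problematic counts as problematic here.
non_problematic_total = sum(non_problematic.values())
problematic_share = (TOTAL_RESPONSES - non_problematic_total) / TOTAL_RESPONSES

print(f"Open vs. closed highly problematic: {ratio:.1f}x")
print(f"Non-problematic responses: {non_problematic_total} of {TOTAL_RESPONSES}")
print(f"Implied problematic share: {problematic_share:.1%}")
```

On these numbers, open-ended prompts produced roughly 4.4 times as many highly problematic responses as closed-ended ones, and the 126 non-problematic responses leave 124 of 250, or 49.6%, rated problematic to some degree.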

That gap is notable because open-ended prompts closely resemble how people actually ask health questions online. Many users are not typing in a narrow yes-or-no query. They are asking for options, recommendations, or explanations.

Very few questions triggered any real restraint

Across the AI models, outputs were “consistently expressed with confidence and certainty,” even when answers were contentious or incorrect. Refusals were rare, and so were strong signs that a chatbot might be out of its depth. Out of 250 total responses, there were only two refusals, both from Meta AI.

That low refusal rate stands out in a health setting, where some questions would have been safer to decline or redirect to a medical professional. Instead, the bots usually answered anyway, even when the prompt leaned toward risky or unsupported advice.

The guardrails were uneven, too. Out of 50 responses per chatbot, the study counted caveats or disclaimers recommending a healthcare or medical expert in:

  • 44 Gemini responses
  • 38 DeepSeek responses
  • 37 Grok responses
  • 32 Meta AI responses
  • 28 ChatGPT responses
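To put those counts on a common scale, the sketch below converts them to rates, assuming the 50 responses per chatbot stated above. The dictionary and variable names are illustrative and do not come from the study.

```python
# Caveat counts as listed above, out of 50 responses per chatbot.
RESPONSES_PER_CHATBOT = 50
caveat_counts = {
    "Gemini": 44,
    "DeepSeek": 38,
    "Grok": 37,
    "Meta AI": 32,
    "ChatGPT": 28,
}

# Share of each chatbot's responses that pointed users to a medical expert.
caveat_rates = {
    name: count / RESPONSES_PER_CHATBOT for name, count in caveat_counts.items()
}
for name, rate in caveat_rates.items():
    print(f"{name}: {rate:.0%}")

# Pooled rate across all five chatbots.
total_caveats = sum(caveat_counts.values())
overall_rate = total_caveats / (RESPONSES_PER_CHATBOT * len(caveat_counts))
print(f"Overall: {total_caveats} of 250 ({overall_rate:.1%})")
```

That works out to 88% for Gemini down to 56% for ChatGPT, with 179 of 250 responses (71.6%) carrying a caveat overall.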

Even with those warnings, the broader pattern held: the answers often sounded steady and authoritative, which can make weak information feel more trustworthy than it is.

The answers came dressed for credibility

The answers were shaky. The sourcing was no better.

For closed-ended questions, the AI chatbots were asked to provide 10 scientific references backing their responses. The study found a median citation completeness score of just 40%, and no chatbot produced a fully accurate reference list for any prompt.

The answers were also harder to read than they looked. On average, all five AI tools landed in the “Difficult” range for readability, roughly at a college sophomore-to-senior level.

So the package could look convincing even when the substance did not: polished language, academic-looking citations, and plenty of confidence, without the reliability to match.
