You can ask an AI chatbot a health question in seconds. Trusting the answer is a much riskier bet.
A new BMJ Open audit found that 49.6% of responses from five major consumer chatbots to health and medical questions were “problematic,” with Elon Musk’s Grok chatbot delivering the highest share of “highly problematic” answers.
The audit adds to fears that health chatbots can wrap misinformation in a calm, convincing voice.
Confident tone, compromised facts
The topline failure was not just that the chatbots made mistakes — it was the kind of mistakes they made.
Across 250 responses from five consumer tools — Gemini, DeepSeek, Meta AI, ChatGPT, and Grok — the study found many answers drifting away from established scientific consensus or using “hedging language that provided false balance between scientific and non-scientific information.”
The questions spanned cancer, vaccines, stem cells, nutrition, and athletic performance, all areas already prone to misinformation.
The result was health advice that could sound calm and credible while still pulling readers away from the facts.
Open questions gave bad answers more space
The wording of the question changed the quality of the answer. Open-ended prompts produced far more highly problematic responses than closed-ended ones, making them the riskier format.
Closed-ended questions gave the chatbots less room to wander. Open-ended prompts did the opposite. They opened the door to weaker claims, more speculation, and more polished-sounding misinformation.
Closed-ended prompts produced just nine highly problematic responses, while open-ended prompts produced 40. Closed-ended prompts also returned 75 non-problematic responses, compared with 51 for open-ended ones.
That gap is notable because open-ended prompts closely resemble how people actually ask health questions online. Many users are not typing in a narrow yes-or-no query. They are asking for options, recommendations, or explanations.
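The counts reported for the two prompt types can be cross-checked against the audit's headline figure. The short Python sketch below is illustrative only: it assumes every one of the 250 responses was rated either non-problematic or problematic, which is an assumption of this example and not something stated here about the study's method. The variable and function names are made up for the illustration.

```python
# Counts as reported in the article.
TOTAL_RESPONSES = 250
non_problematic = {"closed": 75, "open": 51}
highly_problematic = {"closed": 9, "open": 40}


def share(count, total=TOTAL_RESPONSES):
    """Return count as a percentage of total, rounded to one decimal."""
    return round(100 * count / total, 1)


# Assumption: any response not rated non-problematic counts as problematic.
problematic_total = TOTAL_RESPONSES - sum(non_problematic.values())
highly_total = sum(highly_problematic.values())

problematic_share = share(problematic_total)
highly_share = share(highly_total)

print(f"Problematic: {problematic_total} of {TOTAL_RESPONSES} ({problematic_share}%)")
print(f"Highly problematic: {highly_total} of {TOTAL_RESPONSES} ({highly_share}%)")
```

Under that assumption, 124 of 250 responses are problematic, which matches the 49.6% headline figure, and 49 of 250 (19.6%) are highly problematic.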
Very few questions triggered any real restraint
Across the AI models, outputs were “consistently expressed with confidence and certainty,” even when answers were contentious or incorrect. Refusals were rare, and so were strong signs that a chatbot might be out of its depth. Out of 250 total questions, there were only two refusals, both from Meta AI.
That low refusal rate stands out in a health setting, where some questions would have been safer to decline or redirect to a medical professional. Instead, the bots usually answered anyway, even when the prompt leaned toward risky or unsupported advice.
The guardrails were uneven, too. Out of 50 responses per chatbot, the study counted caveats or disclaimers recommending a healthcare or medical expert in:
- 44 Gemini responses
- 38 DeepSeek responses
- 37 Grok responses
- 32 Meta AI responses
- 28 ChatGPT responses
Even with those warnings, the broader pattern held: the answers often sounded steady and authoritative, which can make weak information feel more trustworthy than it is.
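The disclaimer counts listed above are easier to compare as rates. The following Python sketch converts them to percentages of the 50 responses per chatbot; the names used are hypothetical, and the overall rate is simple arithmetic on the reported counts, not a figure quoted from the study.

```python
# Disclaimer counts per chatbot, as reported in the article.
RESPONSES_PER_CHATBOT = 50
disclaimer_counts = {
    "Gemini": 44,
    "DeepSeek": 38,
    "Grok": 37,
    "Meta AI": 32,
    "ChatGPT": 28,
}

# Percentage of each chatbot's responses that pointed to a medical expert.
disclaimer_rates = {
    name: round(100 * count / RESPONSES_PER_CHATBOT, 1)
    for name, count in disclaimer_counts.items()
}

# Overall rate across all five chatbots combined.
total_responses = RESPONSES_PER_CHATBOT * len(disclaimer_counts)
overall_rate = round(100 * sum(disclaimer_counts.values()) / total_responses, 1)

for name, rate in disclaimer_rates.items():
    print(f"{name}: {rate}%")
print(f"Overall: {overall_rate}%")
```

On these numbers the rates run from 88% for Gemini down to 56% for ChatGPT, with about 72% of all responses carrying a disclaimer.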
The answers came dressed for credibility
The answers were shaky. The sourcing was no better.
For closed-ended questions, the AI chatbots were asked to provide 10 scientific references backing their responses. The study found a median citation completeness score of just 40%, and no chatbot produced a fully accurate reference list for any prompt.
The answers were also harder to read than they looked. On average, all five AI tools landed in the “Difficult” range for readability, roughly at a college sophomore-to-senior level.
So the package could look convincing even when the substance was weak: polished language, academic-looking citations, and plenty of confidence, without the reliability to match.


