USA – OpenAI has introduced HealthBench, a new tool designed to measure how well artificial intelligence models respond to medical questions.
Built with the help of 262 doctors from 60 countries, HealthBench includes 5,000 realistic health conversations to test the quality and safety of AI-generated answers.
HealthBench is not a model itself but a benchmark dataset: a way to evaluate whether AI systems provide helpful, accurate, and safe responses when asked health-related questions.
It scores responses against rubric criteria written by practicing physicians and assigns a final score based on how closely an answer matches expert expectations. The grading itself is automated, handled by OpenAI's GPT-4.1 model.
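The rubric approach can be sketched roughly as follows. This is an illustrative simplification, not HealthBench's actual data format or code: the `Criterion` class, point values, and example rubric are hypothetical, and in the real benchmark the grader model decides which criteria a response meets, whereas here that judgment is passed in directly.

```python
from dataclasses import dataclass

@dataclass
class Criterion:
    description: str
    points: int  # positive for desirable behavior, negative for harmful behavior

def rubric_score(criteria: list[Criterion], met: list[bool]) -> float:
    """Points earned divided by maximum achievable positive points, clipped to [0, 1]."""
    max_points = sum(c.points for c in criteria if c.points > 0)
    if max_points == 0:
        return 0.0
    earned = sum(c.points for c, was_met in zip(criteria, met) if was_met)
    return max(0.0, min(1.0, earned / max_points))

# Hypothetical rubric for an emergency-advice scenario
rubric = [
    Criterion("Advises calling emergency services immediately", 5),
    Criterion("Instructs the user to check breathing and responsiveness", 3),
    Criterion("Recommends an unverified home remedy", -4),
]
print(rubric_score(rubric, met=[True, True, False]))  # → 1.0
```

Negative criteria mean a response can lose credit for unsafe content, so a fluent but risky answer scores below a cautious, correct one.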
Among tested AI systems, OpenAI's own o3 reasoning model led the pack with a 60% score, followed by xAI's Grok at 54% and Google's Gemini 2.5 Pro at 52%.
One test case involved a person asking what to do if their 70-year-old neighbor was lying on the floor, breathing but unresponsive.
The AI gave instructions such as calling emergency services and checking the person's breathing, a response HealthBench scored at 77%, highlighting both strengths and areas for improvement.
HealthBench supports 49 languages, including less common ones like Amharic and Nepali, and spans 26 medical specialties ranging from neurosurgery to ophthalmology.
This makes it one of the most inclusive and diverse benchmarking tools for medical AI so far.
As AI becomes more embedded in digital health tools, from symptom checkers to clinical support systems, ensuring accuracy is crucial. Inaccurate advice from an AI model could delay treatment or cause harm.
HealthBench offers a transparent and consistent way to test whether these AI tools are clinically sound and safe to use.
OpenAI hasn’t revealed whether HealthBench will be integrated into its own products like ChatGPT. But its release signals a push for more trustworthy AI in medicine.
For healthcare providers, tech companies, and digital health startups, HealthBench may soon become the gold standard for evaluating AI systems before clinical deployment.