Insights — April 28, 2026
Ten AI models from three providers were tested on child safety over the last 30 days. Performance varies widely: the top model scores 80, the lowest 11. Anthropic leads overall, but each provider has strengths in different safety areas.
Provider averages
- Anthropic: 60
- OpenAI: 57
- Google: 37
Category leaders
- Age-Inappropriate Content: GPT-5.4 (OpenAI), 81
- Manipulation Resistance: Claude Haiku 4.5 (Anthropic), 94
- Data Privacy for Minors: Gemini 3 Flash (Google), 78
- Parental Controls Respect: GPT-5.4 mini (OpenAI), 94
Biggest movers (30 days)
- Gemini 3.1 Pro: -60
- Claude Sonnet 4.6: -43
- Gemini 3 Flash: -39
- GPT-5.4 mini: +39
- Claude Haiku 4.5: +33
Score spread
- Lowest: Gemini 2.5 Pro, 11
- Highest: GPT-5.4, 80
Best at Blocking Inappropriate Content
GPT-5.4 scored 81 on age_inappropriate_content, the highest across all models tested. This means it's most reliable at refusing to show material designed for older audiences to younger users.
Major Drop: Gemini 3.1 Pro
Gemini 3.1 Pro fell 60 points recently, dropping from 99 to 39. A sharp regression like this warrants investigation before using it with children.
New: Claude Opus 4.7
Claude Opus 4.7 was added recently with a starting score of 58 (below average). It's too new to establish a trend, but its starting score suggests room for improvement.
Which Provider Is Safest?
Anthropic averages 60 across all safety measures—the strongest overall. OpenAI averages 57, while Google lags at 37. However, these averages hide important differences.
Anthropic excels at parental_controls_respect (78) but is weaker on age_inappropriate_content (52). OpenAI is stronger at blocking inappropriate material (56) but weaker on parental_controls_respect as a provider average, even though GPT-5.4 mini tops that individual category. Google struggles across most categories except data_privacy_minors (51), where it performs closer to its peers.
No single provider dominates all four safety areas. Your choice depends on which risks matter most for your use case.
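To make the averaging concrete, here is a minimal sketch of how per-provider averages can be computed from per-model scores. The score list is an illustrative stub built from figures quoted in this report, not the underlying evaluation data.

```python
# Minimal sketch: per-provider averages from per-model scores.
# The entries below are illustrative, drawn from figures quoted in
# this report; the full evaluation data is not published here.
from collections import defaultdict

# (model, provider, overall child-safety score)
scores = [
    ("GPT-5.4", "OpenAI", 80),
    ("Gemini 2.5 Pro", "Google", 11),
    ("Claude Opus 4.7", "Anthropic", 58),
    # ...remaining seven models omitted
]

by_provider = defaultdict(list)
for model, provider, score in scores:
    by_provider[provider].append(score)

for provider, vals in by_provider.items():
    print(f"{provider}: {sum(vals) / len(vals):.0f}")
```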
Understanding the Performance Gap
Scores range from 80 down to 11—a gap of 69 points. That's enormous. It means some models are fundamentally unreliable for protecting children, while others offer strong guardrails.
The standard deviation (21 points) shows high variability: even models from the same provider perform very differently. One safe model from a maker doesn't mean its siblings are safe, so don't assume a provider's reputation extends across its whole lineup.
When choosing a tool for your child, always check the specific model score, not just the company name.
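For readers who want to reproduce the spread statistics, a short sketch follows. It assumes a plain list of the ten per-model scores; only the published high and low values appear here, so this placeholder list won't reproduce the exact standard deviation of 21.

```python
# Sketch of the spread statistics quoted above (gap of 69, std dev ~21).
# `scores` should hold all ten per-model scores; only the published
# high and low are real here, the third value is a placeholder.
import statistics

scores = [80, 11, 58]  # illustrative subset of the ten tested models

spread = max(scores) - min(scores)
stdev = statistics.pstdev(scores)  # population std dev over tested models
print(f"spread={spread}, stdev={stdev:.1f}")
```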
What Changed Recently?
Three major downgrades occurred recently: Gemini 3.1 Pro fell 60 points, Claude Sonnet 4.6 dropped 43, and Gemini 3 Flash fell 39. These aren't minor wobbles; they suggest bugs, configuration changes, or other issues introduced by the developers.
On the bright side, GPT-5.4 mini and Claude Haiku 4.5 improved significantly (39 and 33 points respectively). If you've tested a tool using an older version of a model, results may differ now.
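As a rough illustration, the sketch below flags any model whose score moved 30 or more points over the window. The deltas are the ones quoted above; the 30-point threshold is an assumption for illustration, not this report's methodology.

```python
# Hedged sketch: flag "major movers" as models whose score changed by
# 30+ points over the 30-day window. Deltas are the ones quoted in this
# report; the threshold is an assumed cut-off, not the report's rule.
deltas = {
    "Gemini 3.1 Pro": -60,
    "Claude Sonnet 4.6": -43,
    "Gemini 3 Flash": -39,
    "GPT-5.4 mini": +39,
    "Claude Haiku 4.5": +33,
}

THRESHOLD = 30  # assumed cut-off for a "major" move
for model, delta in sorted(deltas.items(), key=lambda kv: kv[1]):
    if abs(delta) >= THRESHOLD:
        direction = "regression" if delta < 0 else "improvement"
        print(f"{model}: {delta:+d} ({direction})")
```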
Three new models debuted recently. Claude Opus 4.7 and both Gemini newcomers scored below the middle of the pack, so they're not ready for children yet without further testing.