ParentBench

Insights — April 28, 2026

Ten AI models from three providers were tested on child safety over the last 30 days. Performance varies widely: the top model scores 80 and the lowest scores 11. Anthropic leads overall, but each provider has strengths in different safety areas.

Provider averages

Anthropic: 59.7 of 100. OpenAI: 56.7 of 100. Google: 36.9 of 100.

Category leaders

  • Age-Inappropriate Content: GPT-5.4 (OpenAI), 81
  • Manipulation Resistance: Claude Haiku 4.5 (Anthropic), 94
  • Data Privacy for Minors: Gemini 3 Flash (Google), 78
  • Parental Controls Respect: GPT-5.4 mini (OpenAI), 94

Biggest movers (30 days)

Gemini 3.1 Pro lost 60.5 points. Claude Sonnet 4.6 lost 43 points. Gemini 3 Flash lost 39.4 points. GPT-5.4 mini gained 39.3 points. Claude Haiku 4.5 gained 32.6 points.

Score spread

Score range: 68.8-point gap

Lowest: Gemini 2.5 Pro (11)

Highest: GPT-5.4 (80)

Category leader

Best at Blocking Inappropriate Content

GPT-5.4 scored 81 on age_inappropriate_content, the highest across all models tested. This means it's most reliable at refusing to show material designed for older audiences to younger users.

Biggest mover

Major Drop: Gemini 3.1 Pro

Gemini 3.1 Pro fell 60.5 points recently, dropping from about 99 to 39. A sharp regression like this warrants investigation before using the model with children.

New entrant

New: Claude Opus 4.7

Claude Opus 4.7 was added recently with a starting score of 58 (below average). It's too new to draw conclusions, but the starting score suggests room for improvement.

Which Provider Is Safest?

Anthropic averages 60 across all safety measures—the strongest overall. OpenAI averages 57, while Google lags at 37. However, these averages hide important differences.

Anthropic excels at parental_controls_respect (78) but is weaker on age_inappropriate_content (52). OpenAI is stronger at blocking inappropriate material (56) but weaker on parental_controls_respect. Google struggles across most categories except data_privacy_minors (51), where it performs closer to its peers.

No single provider dominates all four safety areas. Your choice depends on which risks matter most for your use case.

Understanding the Performance Gap

Scores range from 80 down to 11—a gap of 69 points. That's enormous. It means some models are fundamentally unreliable for protecting children, while others offer strong guardrails.

The standard deviation (21 points) shows high variability: even models from the same provider perform very differently. A "safe" model from one maker may not be safe from another, so don't assume brand loyalty extends across a provider's lineup.

When choosing a tool for your child, always check the specific model score, not just the company name.
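To make the spread and variability figures above concrete, here is a minimal sketch of how a gap and a standard deviation are computed. The individual scores in the list are hypothetical placeholders; only the endpoints (11 and 80) come from this report.

```python
from statistics import mean, pstdev

# Hypothetical per-model scores for illustration only.
# The article reports just the endpoints: lowest 11, highest 80.
scores = [80, 72, 68, 61, 58, 51, 44, 39, 28, 11]

gap = max(scores) - min(scores)  # spread between best and worst model
avg = mean(scores)               # arithmetic mean across models
spread = pstdev(scores)          # population standard deviation

print(f"gap={gap}, mean={avg:.1f}, stddev={spread:.1f}")
```

With this made-up list the gap is 69 points and the standard deviation lands near 20, in the same ballpark as the roughly 21 points the report cites: a large standard deviation means individual models sit far from their provider's average, which is exactly why a per-model check beats brand loyalty.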

What Changed Recently?

Three major downgrades occurred recently. Gemini 3.1 Pro fell 60.5 points, Claude Sonnet 4.6 dropped 43 points, and Gemini 3 Flash fell 39.4 points. These aren't minor wobbles; they suggest bugs, configuration changes, or other regressions introduced by the developers.

On the bright side, GPT-5.4 mini and Claude Haiku 4.5 improved significantly (39 and 33 points respectively). If you've tested a tool using an older version of a model, results may differ now.

Three new models debuted recently. Claude Opus 4.7 and both Gemini newcomers scored below the middle of the pack, so further testing is warranted before using them with children.

Methodology

This analysis was generated by an AI system and all numbers were programmatically validated against the benchmark snapshot. For details on how ParentBench tests models, scoring methodology, and category definitions, visit /methodology.

Written by claude-haiku-4-5.