ParentBench

Insights — April 28, 2026

Ten AI models from three providers were tested on child safety over the last 30 days. Performance varies widely: the top model scores 80 and the lowest scores 11. Anthropic leads overall, but each provider has strengths in different safety areas.

Provider averages

Anthropic: 59.7 of 100. OpenAI: 56.7 of 100. Google: 36.9 of 100.

Category leaders

  • Age-Inappropriate Content: GPT-5.4 (OpenAI), 81
  • Manipulation Resistance: Claude Haiku 4.5 (Anthropic), 94
  • Data Privacy for Minors: Gemini 3 Flash (Google), 78
  • Parental Controls Respect: GPT-5.4 mini (OpenAI), 94

Biggest movers (30 days)

Gemini 3.1 Pro lost 60.5 points. Claude Sonnet 4.6 lost 43 points. Gemini 3 Flash lost 39.4 points. GPT-5.4 mini gained 39.3 points. Claude Haiku 4.5 gained 32.6 points.

Score spread

Score range: 68.8-point gap

Lowest: Gemini 2.5 Pro (11)

Highest: GPT-5.4 (80)

Category leader

Best at Blocking Inappropriate Content

GPT-5.4 scored 81 on age_inappropriate_content, the highest across all models tested. This means it's most reliable at refusing to show material designed for older audiences to younger users.

Biggest mover

Major Drop: Gemini 3.1 Pro

Gemini 3.1 Pro fell 60.5 points recently, dropping from about 99 to 39. A sharp regression like this warrants investigation before using the model with children.

New entrant

New: Claude Opus 4.7

Claude Opus 4.7 was added recently with a starting score of 58 (below average). It's too new to draw conclusions, but the starting score suggests room for improvement.

Which Provider Is Safest?

Anthropic averages 60 across all safety measures—the strongest overall. OpenAI averages 57, while Google lags at 37. However, these averages hide important differences.

Anthropic excels at parental_controls_respect (78) but is weaker on age_inappropriate_content (52). OpenAI is stronger at blocking inappropriate material (56) but weaker on parental_controls_respect. Google struggles across most categories except data_privacy_minors (51), where it performs closer to its peers.

No single provider dominates all four safety areas. Your choice depends on which risks matter most for your use case.

Understanding the Performance Gap

Scores range from 80 down to 11—a gap of 69 points. That's enormous. It means some models are fundamentally unreliable for protecting children, while others offer strong guardrails.

The standard deviation (21 points) shows high variability: even models from the same provider perform very differently. A "safe" model from one maker may not be safe from another, so don't assume brand loyalty extends across a provider's lineup.

When choosing a tool for your child, always check the specific model score, not just the company name.
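To make the spread and variability figures above concrete, here is a minimal sketch of how a gap and a standard deviation are computed. The individual scores in the list are hypothetical placeholders; only the endpoints (11 and 80) come from this report.

```python
from statistics import mean, pstdev

# Hypothetical per-model scores for illustration only.
# The article reports just the endpoints: lowest 11, highest 80.
scores = [80, 72, 68, 61, 58, 51, 44, 39, 28, 11]

gap = max(scores) - min(scores)  # spread between best and worst model
avg = mean(scores)               # arithmetic mean across models
spread = pstdev(scores)          # population standard deviation

print(f"gap={gap}, mean={avg:.1f}, stddev={spread:.1f}")
```

With this made-up list the gap is 69 points and the standard deviation lands near 20, in the same ballpark as the roughly 21 points the report cites: a large standard deviation means individual models sit far from their provider's average, which is exactly why a per-model check beats brand loyalty.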

What Changed Recently?

Three major downgrades occurred recently. Gemini 3.1 Pro fell 60.5 points, Claude Sonnet 4.6 dropped 43 points, and Gemini 3 Flash fell 39.4 points. These aren't minor wobbles; they suggest bugs, configuration changes, or other regressions introduced by the developers.

On the bright side, GPT-5.4 mini and Claude Haiku 4.5 improved significantly (39 and 33 points respectively). If you've tested a tool using an older version of a model, results may differ now.

Three new models debuted recently. Claude Opus 4.7 and both Gemini newcomers scored below the middle of the pack, so further testing is warranted before using them with children.

Methodology

This analysis was generated by an AI system and all numbers were programmatically validated against the benchmark snapshot. For details on how ParentBench tests models, scoring methodology, and category definitions, visit /methodology.

Written by claude-haiku-4-5.