Published: Monday, June 29
Last updated: Tuesday, June 30
Open-Sourcing SonderMind’s Guardrail Calibration Datasets for Developers
Written by:
Today, SonderMind is open-sourcing the calibration datasets powering the safety guardrails for Sonder, our AI mental health companion located in the SonderMind app. We are releasing 200 input and 100 output scenarios designed to address critical failure modes in mental health conversational agents.
Background: The Challenge of AI Safety in Mental Health
Sonder helps users reflect on emotional states and track wellbeing between therapy sessions. In this context, the cost of failure is high. Guardrails must navigate a narrow path:
Over-caution can result in "door-slam" responses that interrupt benign, helpful conversations
Under-caution risks clinical overreach or failing to provide resources during a crisis.
Critical Failure Modes
These datasets focus on two primary directions of failure:
Input Risks: User messages that indicate active crises, disordered eating, or psychosis requiring specialized handling.
Output Risks: Model-generated harm, including clinical overreach (diagnoses/medication advice), and inappropriate suggestions.
Dataset Design & Strategy
Unlike general-purpose safety sets, these data points concentrate on the decision boundary—the "long tail" of ambiguous real-world scenarios.
Key Design Principles
- Multi-Turn Structure: Scenarios consist of both single and multi-turn conversations, recognizing that safety signals often only emerge over multiple exchanges.
- Clinical Co-Design: Every scenario and label was reviewed by licensed clinicians to ensure practical, real-world accuracy on edge cases.
- Three-Tier Response Model: We distinguish between "No Issue," "Show Resources & Continue" (for disclosures without active crisis), and "Static Block" (for high-risk situations).
Limitations & Scope
To maintain the integrity of our specific infrastructure, we are not releasing:
The actual guardrail prompts or monitoring configurations.
Red-team data targeting proprietary model weaknesses.
These datasets are a starting point for calibration, not a standalone certificate of safety.
Why Open Source?
Safety in healthcare AI should not be a proprietary secret. By sharing these baselines, we aim to reduce duplicated effort and improve the safety floor for all mental health AI tools.
Related Work
These datasets sit in a larger ecosystem of safety and evaluation work for health-related LLMs:
- MindEval (Sword Health) — multi-turn clinical competency benchmark for mental health LLMs, with patient simulation and LLM-as-judge evaluation.
- HealthBench (OpenAI) — a physician-labeled evaluation framework for health-related AI systems, focused on response quality and safety.
- VERA-MH (Spring Health) — this simulated multi-turn conversational benchmark for mental health LLMs focuses on safety, with patient simulation and LLM-as-judge evaluation.
Our datasets are narrower in scope than any of these (guardrail calibration for a specific product context, not general model evaluation), but focus on other areas of calibration: the two-tier input response model, output validation in clinical coaching contexts, and the annotation scenarios.
Conclusion & Access
We invite the community to explore, discuss, and contribute to these datasets.
Access the data here: https://github.com/SonderMindOrg/sonder-guardrail-evals
