Lead
Imagine an app displays: "Distress cry 70%, hunger cry 20%, other 10%."
One parent reads this and thinks: "70% — that's probably distress, then." Another reads it and thinks: "70% means it's wrong 30% of the time. That's not reliable." Neither reading is quite right.
Probabilistic estimates from AI-powered features are showing up in more parenting apps and consumer products. The ability to read those estimates clearly — to understand what the number is actually claiming, and what it isn't — is a form of literacy that matters more as AI becomes more embedded in daily tools. This article is about what "70%" means in the context of a machine learning model, what calibration is, and how to use probabilistic output without being misled by it.
What "Probability 70%" Actually Claims
Start with the basic question: what is the model saying when it outputs 70%?
The output means, roughly: "When this model receives input with these characteristics — this acoustic pattern, this time of day, this feeding history — it assigns 70% probability to the 'distress' category." This is a statement about the model's learned pattern-matching, not a direct measurement of the child's internal state.
It is not: "This child is 70% distressed right now." It is not: "There is a 70% chance this interpretation is correct for all crying infants." It is a conditional probability — the model's assessment, given this particular input, given what it has learned from its training data.
This distinction matters because it sets the right frame for using the number. It is a hypothesis with a confidence weight attached, not a diagnosis.
Accuracy and Calibration Are Different Things
Here is a distinction that is easy to miss: a model can be accurate — good at identifying the right category most of the time — and still be poorly calibrated, meaning its stated probabilities are misleading.
Calibration refers to the correspondence between the probabilities a model outputs and the observed frequency with which its predictions turn out to be correct [1,2]. A well-calibrated model, when it says "70%," is right approximately 70% of the time across all instances where it said "70%." If the model says "70%" and is actually right 90% of the time at that confidence level, the model is underconfident. If it says "70%" and is right only 50% of the time, the model is overconfident: its stated probabilities are inflated.
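The definition can be made concrete with a few lines of Python. The sketch below is illustrative only: the function name `observed_rate`, the tolerance of 0.05, and both logs are invented for this example, not taken from any real app. It gathers every occasion on which the model stated a probability near 70% and reports how often the predicted event actually occurred.

```python
def observed_rate(stated_probs, outcomes, level, tol=0.05):
    """Share of cases where the event occurred, among predictions near `level`."""
    matched = [o for p, o in zip(stated_probs, outcomes) if abs(p - level) <= tol]
    if not matched:
        return None  # the model never stated a probability near this level
    return sum(matched) / len(matched)


# Hypothetical log: ten occasions on which the app said "distress 70%".
# 1 = a caregiver later judged the cry to be distress, 0 = something else.
stated = [0.70] * 10
well_calibrated_log = [1, 1, 1, 0, 1, 1, 0, 1, 0, 1]   # 7 of 10 confirmed
overconfident_log = [1, 0, 1, 0, 0, 1, 0, 1, 0, 1]     # 5 of 10 confirmed

print(observed_rate(stated, well_calibrated_log, level=0.70))  # 0.7
print(observed_rate(stated, overconfident_log, level=0.70))    # 0.5
```

In the first log the stated 70% matches the observed 70%. In the second the same stated number corresponds to only 50%, which is the overconfident case described above. Ten cases are far too few to judge a real model; the small sample is only there to keep the arithmetic visible.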
Van Calster and colleagues (2019) described calibration as "the Achilles heel of predictive analytics" in a paper in BMC Medicine, noting that many machine learning models achieve high discriminative performance, meaning the ability to distinguish between categories, while having poor calibration [2]. Discrimination is commonly measured by AUC (area under the curve), a metric from 0 to 1 that reflects how well the model separates categories regardless of whether its probabilities are calibrated. High accuracy does not guarantee that the probability numbers mean what they appear to mean.
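One way to see why discrimination and calibration can come apart is that AUC depends only on how the model ranks cases. The following sketch uses invented labels and probabilities, and the "exaggeration" formula is an arbitrary order-preserving transformation chosen for illustration. It computes AUC by direct pairwise comparison and shows that pushing every probability toward 0 or 1 leaves the AUC unchanged while changing what each number claims.

```python
def auc(scores, labels):
    """Chance that a random positive case is scored above a random negative case.

    Ties count as half a win. This is the pairwise definition of AUC.
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Hypothetical data: 1 = distress confirmed, 0 = not distress.
labels = [1, 1, 1, 0, 1, 0, 0, 0]
probs = [0.9, 0.8, 0.7, 0.6, 0.4, 0.3, 0.2, 0.1]

# Push every probability toward 0 or 1 without changing the ordering.
exaggerated = [p ** 3 / (p ** 3 + (1 - p) ** 3) for p in probs]

print(auc(probs, labels))        # 0.9375
print(auc(exaggerated, labels))  # 0.9375, identical ranking
print(round(exaggerated[2], 2))  # the stated 0.70 has become 0.93
```

Both versions discriminate equally well, yet one says "70%" where the other says "93%" for the same cry. At most one of those numbers can match the observed frequency, and AUC alone cannot say which.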
Niculescu-Mizil and Caruana (2005) systematically analyzed the calibration properties of major machine learning algorithms [3]. Their finding: different algorithms have characteristically different calibration biases. Boosted decision trees tend to push predicted probabilities away from 0 and 1, producing a sigmoid-shaped distortion. Naive Bayes tends to do the opposite, pushing probabilities toward 0 and 1. Neural networks and bagged trees tend to produce better-calibrated probabilities. The point is that whether a stated probability is trustworthy depends on the algorithm used and on whether the development process included calibration correction, and that information is rarely disclosed to end users of consumer apps [3].
How to Evaluate Calibration
Two practical tools are used to assess calibration.
The first is the Brier score, originally developed by meteorologist Glenn Brier (1950) to verify probabilistic weather forecasts [5]. In its common two-outcome form, the Brier score is the mean squared error between predicted probabilities and actual outcomes (coded 0 or 1), with 0 representing perfect prediction and 1 the worst possible. A model that simply outputs "50%" for everything would have a Brier score of 0.25; a model worth using should score below that threshold. The Brier score is a proper scoring rule, and it combines discrimination and calibration into a single number.
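The arithmetic behind these reference values is short enough to check by hand. The sketch below assumes the two-outcome form of the score, with outcomes coded 1 (the event occurred) or 0 (it did not). The outcome list is invented, and `brier_score` is a name chosen for this example.

```python
def brier_score(probs, outcomes):
    """Mean squared difference between predicted probability and the 0/1 outcome."""
    return sum((p - o) ** 2 for p, o in zip(probs, outcomes)) / len(probs)


# Hypothetical outcomes: 1 = distress confirmed, 0 = not distress.
outcomes = [1, 0, 1, 1, 0, 1, 0, 1]

always_half = [0.5] * len(outcomes)             # says "50%" every time
perfect = [float(o) for o in outcomes]          # certain and always right
always_wrong = [1.0 - o for o in outcomes]      # certain and always wrong

print(brier_score(always_half, outcomes))   # 0.25
print(brier_score(perfect, outcomes))       # 0.0
print(brier_score(always_wrong, outcomes))  # 1.0
```

Every "50%" prediction misses the outcome by exactly 0.5, and 0.5 squared is 0.25, which is why the uninformative model lands on that value whatever the outcomes are.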
The second tool is the reliability diagram (also called a calibration curve): a plot with predicted probability on the horizontal axis and actual outcome frequency on the vertical axis. A well-calibrated model produces points close to the diagonal line y = x. When the points bow above or below that line, calibration is poor in predictable ways. This kind of diagram is standard in clinical prediction modeling [1,2] but rarely presented in consumer-facing documentation.
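The points on a reliability diagram come from a simple binning step, sketched below without any plotting. The choice of five equal-width bins, the function name `reliability_points`, and the data are all assumptions made for illustration. Real evaluations use far more cases and often choose bins differently.

```python
def reliability_points(probs, outcomes, n_bins=5):
    """Return (mean predicted probability, observed frequency, count) per non-empty bin."""
    bins = [[] for _ in range(n_bins)]
    for p, o in zip(probs, outcomes):
        idx = min(int(p * n_bins), n_bins - 1)  # a prediction of 1.0 goes in the last bin
        bins[idx].append((p, o))
    points = []
    for members in bins:
        if members:
            mean_p = sum(p for p, _ in members) / len(members)
            freq = sum(o for _, o in members) / len(members)
            points.append((mean_p, freq, len(members)))
    return points


# Hypothetical predictions and outcomes (1 = distress confirmed).
probs = [0.1, 0.1, 0.3, 0.3, 0.5, 0.5, 0.7, 0.7, 0.7, 0.7, 0.9, 0.9]
outcomes = [0, 0, 0, 1, 1, 0, 1, 1, 0, 0, 1, 1]

for mean_p, freq, count in reliability_points(probs, outcomes):
    print(f"predicted {mean_p:.2f}  observed {freq:.2f}  n={count}")
```

Each printed row is one point on the diagram: predicted probability on the horizontal axis, observed frequency on the vertical. In this invented data the bin around 0.70 has an observed frequency of 0.50, so its point would fall below the diagonal, the signature of overconfidence at that level.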
Whether a parenting app has conducted calibration testing and whether it publishes those results is a direct indicator of the transparency of its AI claims. An app that offers probability outputs without disclosing its calibration data is asking users to take the numbers on faith.
Being Wrong Is Not a Failure
A conceptually important point: a model that says "70%" and is sometimes wrong is not malfunctioning. A well-calibrated model that says "70%" is expected to be wrong approximately 30% of the time at that confidence level, by design. That is what 70% means. The question is not whether the model is ever wrong; it is whether the probability it outputs corresponds to its actual error rate.
Ghassemi, Oakden-Rayner, and Beam (2021) wrote in The Lancet Digital Health about the tendency to invest AI outputs with more certainty than is warranted, specifically in health care contexts [4]. Their argument was that rigorous internal and external validation — testing whether a model's stated probabilities hold up in populations and settings outside its training data — is more protective than any amount of post-hoc explanation of how the model works. The same logic applies to a crying-classification model: what matters is not that it can produce a number, but whether that number has been validated against real outcomes in a representative sample.
This framing should move the question from "is this model smart?" to "has this model been tested?" — and whether the testing results are available to the people using it.
How to Use Probabilistic Output Well
A few practical orientations that make AI probability output more useful:
Treat the output as hypothesis-generation, not conclusion. "Distress cry 70%" means: "the model believes the most likely explanation is distress; treat this as a starting hypothesis." Then observe the child — expression, body tension, context, what happened before — to confirm or revise. The model is an input into your observation, not a replacement for it.
Pay attention to changes over time, not just absolute values. A single "70% distress" reading carries whatever uncertainty the model's calibration introduces. A pattern of readings shifting over weeks — "hunger cry has been climbing; distress cry has been stable" — can carry meaningful information even if the individual estimates have imprecision, because the signal is in the trend rather than the point estimate.
Ask whether the app discloses its calibration. If a service provides probability outputs without publishing calibration data, or at minimum describing the validation methodology, the probabilities cannot be assessed for trustworthiness. Treating undisclosed probabilities as established facts, acting as though "70%" is a precise, well-tested number, is the most common misuse.
Summary
"Distress cry 70%" is a model's conditional probability estimate, not a measurement of the child's internal state. Whether that 70% is trustworthy depends on the model's calibration — whether it is actually correct about 70% of the time when it says "70%" [1,2]. Calibration varies substantially by algorithm and is not guaranteed by high accuracy alone [3]. The appropriate tools for evaluating it — Brier score and reliability diagrams — are standard in the prediction modeling literature but rarely disclosed in consumer applications [1,5].
The healthy way to use AI probability output is as a weighted hypothesis: one more piece of information to bring to your own direct observation of the child. Closing the loop between the model's output and your own perception is where the value lies — not in treating the number as an answer.
References
[1] Steyerberg EW, Vergouwe Y. Towards better clinical prediction models: seven steps for development and an ABCD for validation. Eur Heart J. 2014;35(29):1925–1931. doi:10.1093/eurheartj/ehu207. PMID: 24898551.
[2] Van Calster B, McLernon DJ, van Smeden M, Wynants L, Steyerberg EW. Calibration: the Achilles heel of predictive analytics. BMC Med. 2019;17(1):230. doi:10.1186/s12916-019-1466-7. PMID: 31842878.
[3] Niculescu-Mizil A, Caruana R. Predicting good probabilities with supervised learning. In: Proceedings of the 22nd International Conference on Machine Learning (ICML 2005). Bonn, Germany: ACM; 2005:625–632. doi:10.1145/1102351.1102430.
[4] Ghassemi M, Oakden-Rayner L, Beam AL. The false hope of current approaches to explainable artificial intelligence in health care. Lancet Digit Health. 2021;3(11):e745–e750. doi:10.1016/S2589-7500(21)00208-9.
[5] Brier GW. Verification of forecasts expressed in terms of probability. Mon Weather Rev. 1950;78(1):1–3. doi:10.1175/1520-0493(1950)078<0001:VOFEIT>2.0.CO;2.