All Models Are Wrong — Part 4
How Flashpoint.AI’s synthetic panels work

“Essentially, all models are wrong, but some are useful.” — George E. P. Box
The earlier posts in this series made an argument: synthetic respondents are models, models are wrong in predictable ways, and the value lies in knowing exactly where. This post describes how Flashpoint.AI puts that argument into practice. A study runs in three stages.
Building the panel
A study begins with a description of the target population. Flashpoint.AI scopes the geography behind that description — a state, a county, a metro area, a country — and constructs its demographic profile. For US studies, the profile is drawn from the US Census API. For global studies, it comes from an ensemble of web-search models.
An ensemble of models then generates a set of personas that collectively match the demographic mix in question. The goal is not a single “average” respondent repeated a thousand times, but a distribution — because a real population is a distribution. That design choice is a direct response to one of the failure modes described in Part 1: synthetic respondents that fit their persona too neatly, with less variance than real people show.
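To make "a distribution, not an average" concrete, here is a minimal sketch of building a panel whose attribute shares match a target profile. It is an illustration under stated assumptions, not Flashpoint.AI's implementation: the `PROFILE` shares are made-up numbers, `quota_counts` and `build_panel` are hypothetical helpers, and each attribute is filled independently, whereas a real population has correlated attributes that a production system would need to respect.

```python
import random
from collections import Counter

# Hypothetical marginal shares for a target population
# (illustrative numbers, not real census figures).
PROFILE = {
    "age_band": {"18-34": 0.30, "35-54": 0.35, "55+": 0.35},
    "region_type": {"urban": 0.55, "suburban": 0.30, "rural": 0.15},
}

def quota_counts(shares, n):
    """Turn shares into integer quotas that sum to n (largest-remainder method)."""
    raw = {k: v * n for k, v in shares.items()}
    counts = {k: int(x) for k, x in raw.items()}
    leftover = n - sum(counts.values())
    # Give the remaining seats to the categories with the largest fractional parts.
    for k in sorted(raw, key=lambda k: raw[k] - counts[k], reverse=True)[:leftover]:
        counts[k] += 1
    return counts

def build_panel(profile, n, seed=0):
    """Return n persona dicts whose attribute shares match the profile."""
    rng = random.Random(seed)
    columns = {}
    for attr, shares in profile.items():
        values = [k for k, c in quota_counts(shares, n).items() for _ in range(c)]
        rng.shuffle(values)  # shuffle so attributes are combined at random
        columns[attr] = values
    return [{attr: columns[attr][i] for attr in profile} for i in range(n)]

panel = build_panel(PROFILE, 1000)
print(Counter(p["age_band"] for p in panel))
print(Counter(p["region_type"] for p in panel))
```

Filling quotas, rather than drawing each persona at random, keeps the panel's shares on target even at small panel sizes. The variation between individual personas comes from how the attributes combine.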
Building the survey
Given the research objective, Flashpoint.AI generates the instrument using agency-approved survey methods, with the full range of standard question types available. Behind the scenes, each non-Likert question is paired with a “shadow” Likert version. The reason becomes clear in the next stage: the response methodology scores answers on Likert scales, so every question needs a Likert form to be measurable.
The complete instrument then runs through SurveyCheck™ before fielding. SurveyCheck verifies that the survey’s logic is coherent — branches resolve, skip patterns don’t strand anyone, every question is reachable — and that the instrument makes sense in context: a survey fielded in Indonesia should ask about prices in rupiah, not dollars.
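The logic half of that check can be pictured as a graph problem. The sketch below is an assumption about what such a check might look like, not SurveyCheck's actual code: the survey is a hypothetical dictionary mapping each question to the questions it can route to, `"END"` marks completion, and `check_logic` is an invented name. It flags branches that point nowhere, questions with no exit, and questions no respondent can reach. It does not detect loops, and it says nothing about context such as currency.

```python
from collections import deque

def check_logic(survey, start="Q1", end="END"):
    """Return a list of human-readable logic problems (empty if none found)."""
    problems = []
    # 1. Every branch must resolve, and every question needs a way out.
    for q, targets in survey.items():
        for t in targets:
            if t != end and t not in survey:
                problems.append(f"{q} routes to unknown question {t}")
        if not targets:
            problems.append(f"{q} has no exit route")
    # 2. Breadth-first search from the first question to find what is reachable.
    seen = {start}
    queue = deque([start])
    while queue:
        q = queue.popleft()
        for t in survey.get(q, []):
            if t != end and t not in seen:
                seen.add(t)
                queue.append(t)
    for q in survey:
        if q not in seen:
            problems.append(f"{q} is unreachable from {start}")
    return problems

# Hypothetical instruments for illustration.
GOOD = {"Q1": ["Q2", "Q3"], "Q2": ["Q4"], "Q3": ["Q4"], "Q4": ["END"]}
BROKEN = {"Q1": ["Q2"], "Q2": [], "Q3": ["END"]}

print(check_logic(GOOD))
print(check_logic(BROKEN))
```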
Running the study
Each persona then answers the Likert version of each question in sequence, using semantic similarity rating (SSR), a method introduced by Maier et al. (2025). The insight behind SSR is that language models, asked directly for a numerical rating, produce unrealistically narrow response distributions. SSR instead elicits a textual answer (the persona responds in words, as a person would) and maps that text onto a Likert distribution by measuring its embedding similarity to a set of reference statements. In the original paper’s benchmark of 57 product surveys with 9,300 human responses, SSR recovered roughly 90% of human test-retest reliability while preserving realistic response distributions. Because each answer starts as text, the method also yields qualitative feedback explaining each rating. (The paper is worth reading: arXiv:2510.08338.)
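The mapping step can be sketched in a few lines. This is a simplified illustration, not the paper's or Flashpoint.AI's implementation: `embed` is a toy bag-of-words stand-in for a real sentence-embedding model, the reference statements are invented, and subtracting the lowest similarity before normalizing is just one simple way to turn similarities into a distribution. The paper's exact formula and reference sets may differ.

```python
import math
import re
from collections import Counter

# Hypothetical reference statements, one per point of a 5-point scale.
REFERENCES = {
    1: "I would definitely not buy this",
    2: "I would probably not buy this",
    3: "I might or might not buy this",
    4: "I would probably buy this",
    5: "I would definitely buy this",
}

def embed(text):
    """Toy bag-of-words vector; a stand-in for a real sentence-embedding model."""
    return Counter(re.findall(r"[a-z']+", text.lower()))

def cosine(a, b):
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[w] * b[w] for w in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def ssr_distribution(response, references=REFERENCES, eps=1e-3):
    """Map a free-text answer to a probability distribution over Likert points."""
    sims = {k: cosine(embed(response), embed(ref)) for k, ref in references.items()}
    floor = min(sims.values())
    # Shift so the least similar point gets (almost) no mass, then normalize.
    weights = {k: s - floor + eps for k, s in sims.items()}
    total = sum(weights.values())
    return {k: w / total for k, w in weights.items()}

dist = ssr_distribution("Honestly I would probably buy this, the price seems fair")
mean_rating = sum(k * p for k, p in dist.items())
print(dist, round(mean_rating, 2))
```

The output is a distribution, not a single number, which is what lets the aggregated panel keep a realistic spread. The toy embedding is crude (it barely distinguishes "probably buy" from "probably not buy"), which is exactly why the real method depends on a proper embedding model.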
Every run returns the two scores described earlier in the series: the Panel Calibration Score, which measures how well the panel reflects the target audience, and the Response Fit Score, which estimates how much to trust the results for each question asked. Questions that fall outside reliable bounds can be routed, unchanged, to human respondents through our sample partners.
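The routing decision itself is a simple filter. The sketch below assumes a per-question score between 0 and 1 and a cutoff of 0.7. Both are placeholders chosen for illustration, since this post does not state the Response Fit Score's scale or thresholds, and `QuestionResult` and `split_for_verification` are hypothetical names.

```python
from dataclasses import dataclass

@dataclass
class QuestionResult:
    question_id: str
    response_fit: float  # assumed to lie in [0, 1]; the real scale is not stated here

def split_for_verification(results, threshold=0.7):
    """Separate questions to keep as synthetic from those to send to human respondents."""
    keep = [r.question_id for r in results if r.response_fit >= threshold]
    route = [r.question_id for r in results if r.response_fit < threshold]
    return keep, route

# Hypothetical run output.
results = [
    QuestionResult("Q1", 0.91),
    QuestionResult("Q2", 0.55),
    QuestionResult("Q3", 0.78),
]
keep, route = split_for_verification(results)
print("keep synthetic:", keep)
print("route to humans:", route)
```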
That is the whole machine: a panel grounded in census or web-sourced demographic data, an instrument built to agency standards, an instant directional read, and a score that tells you which answers to verify with real humans.
All models are wrong, but some are useful. The useful ones are the ones that tell you where they’re wrong.