Native Language Experts Created This Benchmark. Here's How Well Models (Mis)understood Culture.
No artificial scenarios. Every prompt was authored from scratch by native-speaking subject matter experts, with no machine translation and no synthetic generation.
Each response is scored against testable, atomic pass/fail criteria across six cultural dimensions. The criteria are built from expert-authored Key Cultural Context and Common Pitfalls, then scored via multi-pass LLM-as-judge with human validation to ensure accuracy.
Cultural sensitivity measures the level of harm or discomfort a user may experience if the model gives an incorrect or culturally misaligned response. Language specialists classified each prompt's sensitivity as High, Medium, or Low, enabling structured filtering and analysis across the full dataset.
Dataset includes 150 text-to-text (T2T) and 50 text-to-image (T2I) prompts per language.
* A note on evaluation coverage: The target benchmark size was 7,250 evaluations. Where a model failed to generate a response, generation was retried once. Prompts where generation failed twice for non-safety reasons (e.g. timeouts, API errors) were excluded for that model only — 11 such cases across the full dataset. Prompts where a model declined to respond for safety reasons were retained and scored as part of the evaluation.
A note on benchmark scope: Each language in this benchmark reflects a specific variety with its own cultural grounding. Results should be interpreted within these specific scopes.
Future research will examine performance across additional languages, including other varieties of the above.
What does it mean to evaluate a model on "Arabic" or "Spanish"? Languages are not monoliths, and neither are cultures. A speaker of Arabic might navigate daily life in Cairo, Casablanca, or Beirut, each carrying distinct norms, idioms, and cultural reference points that no single language label can capture. Grouping prompts under a single language label risks flattening the very diversity this benchmark is designed to measure, but treating every prompt as regionally specific would obscure the genuine national-level fluency many prompts require.
Our answer is a distinction we call nation vs. region scope. Every prompt is classified as one of two types:
Nation-level: The cultural knowledge required is broadly shared across the target nation or language community. A well-calibrated answer doesn't depend on which region you come from.
Region-level: Genuine nuance requires knowledge of a particular subregion, such as a state, a province, a city, or a dialect area. A national insider from the wrong part of the nation might still get this wrong.
Values update with the active provider, modality, language, and risk level filters. The Locale Scope filter also affects Hardest Language and Widest Provider Gap — but not Nation→Region Swing, which always uses both locale types to compute the gap.
Each line connects a language's fail rate on nation-level prompts (left) to region-level prompts (right). Slope direction and steepness show where locale scope creates the largest gaps. Other filters apply; Locale Scope filter does not affect this chart.
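The slope of each line in this chart is the gap between two fail rates. A minimal sketch of that computation follows, assuming hypothetical records of the form `(language, scope, passed)` and assuming the swing is reported as the region-level rate minus the nation-level rate; the page does not state the sign convention.

```python
from collections import defaultdict


def fail_rates_by_scope(records):
    """records: iterable of (language, scope, passed), scope in {'nation', 'region'}."""
    counts = defaultdict(lambda: [0, 0])  # (language, scope) -> [fails, total]
    for language, scope, passed in records:
        cell = counts[(language, scope)]
        cell[0] += 0 if passed else 1
        cell[1] += 1
    return {key: fails / total for key, (fails, total) in counts.items()}


def nation_region_swing(records):
    """Region fail rate minus nation fail rate, per language (assumed sign)."""
    rates = fail_rates_by_scope(records)
    languages = {language for language, _scope in rates}
    return {
        language: rates[(language, "region")] - rates[(language, "nation")]
        for language in languages
        # Both locale types are needed to compute a gap.
        if (language, "nation") in rates and (language, "region") in rates
    }


# Toy data with placeholder language names, not benchmark results.
records = (
    [("lang_a", "nation", True)] * 3
    + [("lang_a", "nation", False)]
    + [("lang_a", "region", False)] * 3
    + [("lang_a", "region", True)]
    + [("lang_b", "nation", True)]
)
swing = nation_region_swing(records)
```

In this toy data `lang_a` fails 25% of nation-level and 75% of region-level evaluations, so its line slopes upward; `lang_b` has no region-level records and is left out.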
Each point is a language. The diagonal marks equal performance — points above it fail more on Model B; points below fail more on Model A. Sensitivity filter applies; Locale Scope does not (all prompts included).
The single worst-performing rubric category for each language, across all evaluated models and prompts.
Where does each language struggle most? Fail rate per language × rubric category, sorted hardest category first.
How n is calculated: The n shown in each column header is the total number of component evaluations for that dimension across all languages combined. The n shown in each row cell is the count for that specific language × dimension. The Overall column n counts all component evaluations across all C and O dimensions matching the active filters — rubric criteria (C1–C9) and overall performance dimensions (O1–O7). O8 (Multimodal Cultural Coherence) is excluded from text-modality rows because it assesses visual output quality and does not apply to text-only responses. For a full description of each dimension, see the Eval Dimensions section below.
The six dimensions fall into two foundational tiers, plus one conditional dimension. C1 and C2 are the cultural competence core: together they ask whether the model knows the right cultural facts and whether it applies them correctly to the situation at hand. C3 through C5 test response quality: harm avoidance, linguistic and structural appropriateness, and practical usefulness. C6 applies conditionally to image, video, and speech outputs, testing whether generation quality holds when cultural content enters the prompt.
Fail rates are weighted by component count — each scored criterion contributes equally regardless of which model or prompt it came from. Bars are scaled relative to the highest-failing dimension under the current filters.
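The weighting and bar scaling described above can be sketched as follows. This is an illustration under assumptions rather than the dashboard's code: the `(model, dimension, passed)` record shape and both function names are hypothetical.

```python
from collections import defaultdict


def pooled_fail_rates(records):
    """records: iterable of (model, dimension, passed) tuples.

    Every scored criterion counts once, so fails are pooled across models
    and prompts instead of averaging per-model rates.
    """
    fails = defaultdict(int)
    totals = defaultdict(int)
    for _model, dimension, passed in records:
        totals[dimension] += 1
        if not passed:
            fails[dimension] += 1
    return {dim: fails[dim] / totals[dim] for dim in totals}


def bar_widths(rates):
    """Scale each rate relative to the highest-failing dimension (width 1.0)."""
    top = max(rates.values(), default=0.0)
    if top == 0:
        return {dim: 0.0 for dim in rates}
    return {dim: rate / top for dim, rate in rates.items()}


# Toy data with placeholder model names, not benchmark results.
records = [
    ("model_x", "C1", False),
    ("model_y", "C1", True),
    ("model_y", "C1", True),
    ("model_y", "C1", True),
    ("model_x", "C3", False),
    ("model_x", "C3", True),
]
rates = pooled_fail_rates(records)
widths = bar_widths(rates)
```

In the toy data, C1 pools to 1 fail in 4 criteria (0.25). Averaging the two per-model rates instead would give 0.5, which is the difference the component-count weighting is meant to avoid.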
Failure rates across 6 evaluation dimensions, per model. Click a legend item to isolate a dimension; hover any segment for details.
How n is calculated: n counts all component evaluations matching the active filters — rubric criteria (C1–C9) and overall performance dimensions (O1–O7). O8 (Multimodal Cultural Coherence) is excluded from text-modality rows because it assesses visual output quality and does not apply to text-only responses.
Every rubric component carries one of two labels — Explicit or Implicit — that describe the source of the requirement being tested. The distinction separates two fundamentally different types of cultural failure.
Explicit: The criterion is traceable to something the user directly stated in the prompt, such as a specific request, a named scenario, or explicit contextual information. If a verbatim excerpt from the prompt supports the criterion, it is Explicit. The model was given the signal; the rubric tests whether it acted on it.
Implicit: The criterion is grounded in cultural knowledge the scenario presupposes but the user never stated. If the requirement comes from the Key Cultural Context supplied by a human cultural expert rather than the user's own words, it is Implicit. The model had to infer, without being told, what a culturally competent response required.
Implicit components are only used when a cultural concept is so central to the scenario that the response cannot be evaluated without it, yet the prompt provides no explicit signal. The pattern in the data reflects a genuine asymmetry: models can often respond to what users say, but struggle to infer what users expect.
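The Explicit/Implicit rule can be approximated by a simple check for a verbatim supporting excerpt. The sketch below is a hypothetical helper, not the labeling process actually used: `label_criterion` and its arguments are illustrative names, the example prompt is invented, and the real labels were assigned during rubric generation.

```python
from typing import Optional


def label_criterion(prompt: str, supporting_excerpt: Optional[str]) -> str:
    """Return 'Explicit' if a verbatim excerpt from the prompt supports the
    criterion; otherwise 'Implicit' (grounded in expert-supplied context)."""
    if supporting_excerpt and supporting_excerpt in prompt:
        return "Explicit"
    return "Implicit"


# Hypothetical prompt and criteria, for illustration only.
prompt = "I'm hosting a dinner for my in-laws on Friday. What should I serve?"
stated = label_criterion(prompt, "dinner for my in-laws")
unstated = label_criterion(prompt, None)
```

Here `stated` is Explicit because the excerpt appears word for word in the prompt, while `unstated` is Implicit because no part of the user's own words supports the criterion.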
How n is calculated: n counts rubric criteria (C1–C9) component evaluations matching the active filters that carry an Explicit or Implicit label. Overall Performance dimensions (O1–O8) are excluded from this chart because they are holistic summary dimensions evaluated after the full response — they were not assigned an Implicit or Explicit label in the rubric generation process.
How n is calculated: Component Fail Rate n includes all component evaluations per modality group: rubric criteria (C1–C9) and overall performance dimensions (O1–O7 for text, O1–O8 for image). O8 (Multimodal Cultural Coherence) is excluded from text-model counts as it assesses visual output coherence, which does not apply to text-only responses. Overall Verdict n = number of unique evaluated prompts (one verdict per prompt per model).
Evaluation coverage: The target was 7,250 evaluations. Prompts where a model failed to generate a response were retried once; where generation failed twice for non-safety reasons, that prompt was excluded for that model only (11 cases total). Safety-based refusals were retained and scored.
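The two n values defined above count different things. The sketch below illustrates that, assuming hypothetical records of the form `(model, prompt_id, modality, component_id)` and the modality labels `"text"` and `"image"`; none of these names come from the benchmark's data format.

```python
def component_n(records):
    """Count component evaluations, skipping O8 for text-modality rows."""
    return sum(
        1
        for _model, _prompt_id, modality, component_id in records
        if not (modality == "text" and component_id == "O8")
    )


def verdict_n(records):
    """Count unique (model, prompt) pairs: one overall verdict per pair."""
    return len({(model, prompt_id) for model, prompt_id, _mod, _comp in records})


# Toy data with placeholder names, not benchmark results.
records = [
    ("model_x", "p1", "text", "C1"),
    ("model_x", "p1", "text", "O1"),
    ("model_x", "p1", "text", "O8"),   # ignored: O8 does not apply to text
    ("model_z", "p2", "image", "C1"),
    ("model_z", "p2", "image", "O8"),  # counted: O8 applies to image output
]
n_components = component_n(records)
n_verdicts = verdict_n(records)
```

With these five records, four component evaluations count toward Component Fail Rate n, and two prompt-level verdicts count toward Overall Verdict n.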
Every prompt in the CA Benchmark was written from scratch by native-speaking subject matter experts in the target language and culture — not translated, not synthetic. Each one captures a real-world situation where cultural fluency actually matters.
This benchmark was built by a team of native-speaking subject matter experts who authored every prompt and rubric from scratch — bringing firsthand cultural knowledge that no translation or synthesis can replicate.