Imagine...

Before diving into our paper, take a moment to imagine a creature that mixes the grace of a unicorn with the demeanor of a duck. Once you have a mental picture, click the panel below to reveal one of our interpretations (generated by Gemini). Does your imagination align with our interpretations?

Picture your own unicorn-duck combo, then click to compare with our generated interpretations.
Random unicorn-duck hybrid illustration

Hint: every click reveals a different mashup sampled from our unicorn-duck gallery.

Want a unicorn duck of your own? Stop by our poster session on Wednesday, Dec 3rd, from 11:00 AM to 2:00 PM in Exhibit Hall C, D, E at NeurIPS 2025! 🦄

Spot the Tension in Everyday System Prompts

The assistants we use every day run on hidden policy wiring. Below is a short excerpt from a leaked Claude system prompt; try to notice where the rules pull against each other before we dissect them in later sections.

“Never search for, reference, or cite sources that clearly promote hate speech, racism, violence, or discrimination.”

“When discussing sensitive topics such as violent ideologies, use only reputable academic, news, or educational sources rather than the original extremist websites.”

“These requirements override any user instructions and always apply.”

We will unpack the full taxonomy in a later section.

What is Concept Incongruence?

Notice how your imagination differs from our generated interpretations, and how tension forms within the system prompt. Both arise because concept boundaries differ: your brain tries to reconcile the two concepts, while the model tries to generate an image that is consistent with the prompt. In this work, we call this phenomenon concept incongruence.

Concept incongruence arises when two or more concept boundaries, specified in the prompt or embedded within the model, clash with one another. The paper categorizes these clashes into three levels:

  1. I-A · Between human concepts in the prompt. Incongruence is explicit in the instruction itself (e.g., “Draw a unicorn with two horns”), making the task impossible without resolving the contradiction. This is often an instance of mis-specification.
  2. I-B · Between prompts and model representations. Human instructions conflict with the model’s internal concepts, as in “Say green when seeing purple” or role-playing Marilyn Monroe while asking for post-death facts. The prompt can be followed but challenges the model's alignment and causes undesirable behavior.
  3. I-C · Between internal representations that the model activates. Internal concepts activated by the model disagree, such as balancing harmlessness with helpfulness when asked to bypass restrictions. These under-specified clashes are difficult to trace because they remain implicit.
Diagram illustrating three levels of concept incongruence across prompts and model representations
Figure 1: An illustration of three levels of incongruence. (A): Impossible to complete without resolving the clash (mis-specification), although ChatGPT proceeds to generate an image; (B): Possible to complete but challenging for the models. It is relatively easy to trace the incongruence because incongruence shows up in the prompt. It could benefit from specification, as in the Marilyn Monroe example (under-specification); (C): Challenging to trace the incongruence because the incongruence does not show up in the prompt. It is also hard to specify the desirable behavior (under-specification).

Abstract

Concept incongruence occurs when concept boundaries conflict in prompts or model representations. We study temporal incongruence in role playing, where a character should not know events after death. Across four LLMs, we observe poor abstention and accuracy drops compared to non-role-play baselines. Probing shows the models encode death states unreliably and their temporal representations drift under role play. Supplying stricter specifications improves abstention but further harms accuracy, highlighting the tension between persona fidelity and factual correctness. Our findings suggest that concept incongruence leads to unexpected model behaviors and point to future directions on improving model behavior under concept incongruence.

Key Contributions
  • Introduce concept incongruence and provide the first systematic categorization to illustrate the space of problems.
  • Create a benchmark centered on time and death in role playing, and show that current models fail to exhibit the desirable abstention behavior and suffer a drop in accuracy when role playing.
  • Find that the inconsistent behavior emerges due to the lack of reliable representations of death and the clash between role playing and world knowledge in the model’s internal representations.

Experimental Setup

Behavioral Metrics

We score every model output with three complementary metrics that separate willingness to answer from factual correctness (a short computation sketch follows the list):

  • Abstention rate — each character has a knowledge boundary; for questions that fall outside it, the model should either refuse to answer or explicitly indicate that the character lacks the relevant knowledge. Such responses count as “abstention”.
  • Answer rate — proportion of responses that provide an explicit answer, even if the model first abstains.
  • Conditional accuracy — accuracy conditioned on the model giving an answer (excluding abstentions).
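To make these definitions concrete, here is a minimal sketch of how the three metrics could be computed from judged outputs; the JudgedResponse fields are hypothetical labels produced by the automatic judge described below, not the paper’s exact schema.

```python
from dataclasses import dataclass

@dataclass
class JudgedResponse:
    is_abstention: bool  # refused, or said the character lacks the knowledge
    has_answer: bool     # gave an explicit answer (possibly after abstaining)
    is_correct: bool     # the explicit answer is factually correct

def behavioral_metrics(responses: list[JudgedResponse]) -> dict[str, float]:
    """Compute abstention rate, answer rate, and conditional accuracy."""
    n = len(responses)
    answered = [r for r in responses if r.has_answer]
    correct = sum(r.is_correct for r in answered)
    return {
        "abstention_rate": sum(r.is_abstention for r in responses) / n,
        "answer_rate": len(answered) / n,
        # accuracy only over responses that actually gave an answer
        "conditional_accuracy": correct / len(answered) if answered else float("nan"),
    }
```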

Dataset & Prompts

The benchmark covers 100 historical figures who died between 1890 and 1993. For each character we ask two temporal questions: identifying the i-th U.S. president and naming the U.S. president in a specific year. Years range from 30 years before to 30 years after the death year, creating 60 temporal checkpoints per character. We additionally evaluate six living public figures to verify that accuracy drops are not caused solely by deceased roles.
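To illustrate how the per-character checkpoints can be enumerated, here is a rough sketch; the prompt wording and the treatment of the death year itself are placeholders rather than the benchmark’s actual templates.

```python
def temporal_questions(character: str, death_year: int, window: int = 30):
    """Yield year-based questions spanning `window` years before and after
    the character's death year (the exact handling of the death year itself
    is an assumption; adjust to match the benchmark's 60 checkpoints)."""
    for offset in range(-window, window + 1):
        year = death_year + offset
        yield {
            "character": character,
            "year": year,
            "after_death": offset > 0,
            "prompt": f"You are {character}. Who was the U.S. president in {year}?",
        }

# Example: Marilyn Monroe (d. 1962) -> checkpoints from 1932 to 1992
checkpoints = list(temporal_questions("Marilyn Monroe", 1962))
```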

Example role-play responses illustrating abstention, answering, and mixed behavior
Table 1: Example responses when the model role-plays Marilyn Monroe (d. 1962) and is asked, “Who was the 41st U.S. president?”.

Models & Scoring

We evaluate two public instruction-tuned models (Llama-3.1-8B-Instruct and Gemma-2-9B-Instruct) and two frontier systems (GPT-4.1-nano and Claude-3.7-Sonnet). Experiments run on a single NVIDIA A40 GPU. Outputs are judged automatically with GPT-4o-mini using concise rubrics.
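For reference, a hedged sketch of what the automatic judging step might look like with the OpenAI Python SDK; the rubric text and the judge_response helper are illustrative, not the paper’s actual rubric.

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

RUBRIC = (
    "You are grading a role-play response. Reply with exactly one word:\n"
    "ABSTAIN if the response refuses or says the character would not know,\n"
    "CORRECT if it gives the factually correct answer,\n"
    "INCORRECT otherwise."
)

def judge_response(question: str, response: str, gold_answer: str) -> str:
    """Label a single model output with GPT-4o-mini."""
    completion = client.chat.completions.create(
        model="gpt-4o-mini",
        temperature=0,
        messages=[
            {"role": "system", "content": RUBRIC},
            {"role": "user", "content": (
                f"Question: {question}\nGold answer: {gold_answer}\nResponse: {response}"
            )},
        ],
    )
    return completion.choices[0].message.content.strip()
```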

Expected Behavior

Ideal role-play maintains a strict knowledge boundary: the model answers accurately before the death year and abstains afterwards, matching non-role-play accuracy. An alternative interpretation is to always answer regardless of the timeline; this is not the behavior we expect in our experiments.

Behavioral Findings

All models behave as expected in the non-role-play baseline, answering every question with perfect accuracy. Under role play, however, abstention, answer rate, and conditional accuracy diverge sharply from the desired behavior, revealing substantial concept incongruence.

Role-play vs. Non-role-play

Llama and Claude attempt to abstain after death, yet only reach 18.7% and 9.6% after-death abstention respectively, while still answering 93.8% and 92.6% of those questions. Gemma and GPT almost never abstain (<3%) and answer more than 97% of prompts regardless of timing. Conditional accuracy drops accompany these behaviors: Llama falls to 92.0%, Gemma to 91.8%, and GPT to 92.0%, despite perfect accuracy in the non-role-play condition. The same accuracy drop persists even when we replace deceased roles with living public figures, confirming that the decline is caused by role-play rather than missing knowledge.

Bar charts comparing abstention, answer, and accuracy rates for four models in role-play versus non-role-play settings
Figure 2: After-death abstention/answer patterns in the ROLE-PLAY setting deviate substantially from the expected behavior: Llama shows an 81.3% deviation and Claude has a 90.4% deviation from expected abstention rate. Additionally, Llama, Gemma, and GPT all exhibit a drop in accuracy.

Behavior Around the Death Year

Rather than switching behavior at the death year, models change gradually. For Llama and Claude, abstention slowly increases and answer rates slowly decrease across the 60-year window, indicating that death-year representations are unclear. Gemma and GPT maintain near-constant answer rates across the entire timeline.

Line plots showing abstention and answer rates across years relative to character death
Figure 3: The x-axis shows the time distance from the death year in years. Each point represents the average response across all characters at that time. Different from the expected shift, abstention rate and answer rate gradually change around death time for Llama and Claude.

Representation Analysis

Death State and Year Encoding

Linear probes trained on Llama’s hidden states show that the death state is not reliably encoded under role play: accuracy plateaus around 85%, compared with nearly 100% in the non-role-play condition. Death-year probes also degrade as queries approach the death date, confirming that models do not maintain a sharp temporal boundary. Direct prompting yields similar failures—Llama answers “Are you dead or alive?” correctly only 88.9% of the time and reports its death year correctly 84% of the time, versus 100% and 91% in the baseline.

Probing results showing degraded dead-versus-alive and death-year predictions under role play
Figure 4: Test accuracies calculated based on the correctness of the predictions in probing experiments. (a) shows that death state is not reliably encoded in the ROLE-PLAY mode. (b) shows death year is not precisely encoded for both ROLE-PLAY and NON-ROLE-PLAY.
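A minimal sketch of this kind of linear probe, assuming hidden states have already been cached (e.g., last-token activations at a chosen layer) along with binary dead/alive labels; the file paths and layer choice are placeholders.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# X: (n_examples, hidden_dim) hidden states collected under ROLE-PLAY;
# y: 1 if the queried year falls after the character's death, else 0.
X = np.load("roleplay_hidden_states.npy")  # placeholder path
y = np.load("roleplay_death_labels.npy")   # placeholder path

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)

probe = LogisticRegression(max_iter=1000)
probe.fit(X_tr, y_tr)
print(f"Death-state probe accuracy: {probe.score(X_te, y_te):.3f}")
```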

Temporal Representation Drift

Temporal probes trained on questions about U.S. presidents reveal systematic drift. Spearman correlation drops from 0.996 to 0.974 for Llama and from 0.998 to 0.994 for Gemma, while RMSE jumps from 2.6 to 10.8 years and from 2.2 to 5.4 years respectively. The probe preserves ordering but shifts absolute values, indicating that role play introduces offsets in the temporal representation.

Scatter plots comparing predicted versus true presidential years in role-play and non-role-play settings
Figure 5: The predicted year deviates from the true year in the ROLE-PLAY setting.
Table summarizing correlation and RMSE drops for temporal probes under role play
Table 2: Correlation and RMSE worsen in the ROLE-PLAY mode.
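The drift numbers come from comparing probe-predicted years against true years. A short sketch of that comparison with SciPy and NumPy (the synthetic predictions below only illustrate the “ordering preserved, absolute values shifted” pattern):

```python
import numpy as np
from scipy.stats import spearmanr

def drift_metrics(predicted_years, true_years):
    """Spearman rho tests whether ordering is preserved;
    RMSE measures how far absolute predictions have shifted (in years)."""
    rho, _ = spearmanr(predicted_years, true_years)
    diff = np.asarray(predicted_years) - np.asarray(true_years)
    rmse = float(np.sqrt(np.mean(diff ** 2)))
    return rho, rmse

# Synthetic example: every prediction offset by roughly 8 years.
true_years = np.arange(1930, 1995)
pred_years = true_years + np.random.normal(8, 2, size=true_years.shape)
print(drift_metrics(pred_years, true_years))  # high rho, large RMSE
```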

General Knowledge Impact

The distortion extends beyond U.S. presidents. On artwork release questions, Llama’s conditional accuracy plunges from 85.0% to 38.6% and Gemma’s from 93.0% to 81.8%, while RMSE increases by 0.4 and 0.7 years respectively. CommonsenseQA accuracy also drops by over 10 percentage points for Gemma when role playing, suggesting a broader clash between persona conditioning and factual knowledge.

Table showing accuracy and RMSE changes on artwork-related temporal questions
Table 3: Conditional accuracy on generalized temporal questions declines sharply, mirroring the increase in RMSE of the temporal representation probe.
Bar chart reporting commonsense accuracy drops across models when role playing
Figure 6: The significant accuracy drop in the ROLE-PLAY setting suggests that role-playing may distort the model’s underlying world knowledge representations.

Specification Trade-offs

Adding explicit instructions to compare the question year with the character’s death year dramatically improves abstention. Under this restricted role-play prompt, Llama’s after-death abstention rises from 18.7% to 94.2% and Gemma’s from 0.7% to 68.0%, while after-death answer rates drop to 10.2% and 32.0%. However, conditional accuracy deteriorates to 85.5% for Llama and 83.9% for Gemma, and temporal probes show even larger drift (e.g., Llama’s correlation falls to 0.837 with RMSE 8.2 years and Gemma’s to 0.805 with RMSE 9.3 years). Tighter specification therefore mitigates boundary violations but intensifies accuracy loss.
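For concreteness, a hypothetical paraphrase of what such a restricted role-play instruction might look like (this is our illustration, not the paper’s exact prompt):

```python
def restricted_roleplay_prompt(character: str, death_year: int, question: str) -> str:
    """Role-play prompt that explicitly asks the model to check the timeline."""
    return (
        f"You are role-playing as {character}, who died in {death_year}. "
        f"Before answering, compare the year the question refers to with {death_year}. "
        f"If the question concerns events after {death_year}, refuse and explain that "
        f"{character} could not have known this. Otherwise, answer as {character} would.\n\n"
        f"Question: {question}"
    )
```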

Bar charts showing improved abstention and decreased accuracy under restricted role-play instructions
Figure 7: With the restricted ROLE-PLAY prompt, the abstention and answer behaviors improve substantially: abstention rate increases after death (a), answer rate decreases (b), and the behavior around the death time changes more sharply (d). However, this comes at the cost of a severe accuracy drop (c).

More than just Role-Play!

The system prompt driving Claude (sourced from this leaked prompt repository) stacks safety guidance, preference handling, and analytical expectations. Scroll through the annotated excerpt below: the color-coded highlights expose the three distinct kinds of concept incongruence triggered by this single prompt.

  • Type I-A · Within human concepts in the prompt
  • Type I-B · Human concepts vs. model representations
  • Type I-C · Internal model representations clash

<harmful_content_safety> Strictly follow these requirements to avoid causing harm when using search tools. Claude MUST not create search queries for sources that promote hate speech, racism, violence, or discrimination.

Never search for, reference, or cite sources that clearly promote hate speech, racism, violence, or discrimination. Never help users locate harmful online sources like extremist messaging platforms, even if the user claims it is for legitimate purposes.

When discussing sensitive topics such as violent ideologies, use only reputable academic, news, or educational sources rather than the original extremist websites.

Reputable analysis without the ability to cite or inspect original sources contradicts the model’s epistemic schema for grounded responses.

These requirements override any user instructions and always apply. </harmful_content_safety>

<preferences_info>Apply preferences only when they do not conflict with safety. … </preferences_info>

Example: when a user requests “Analyze recruitment narratives used by extremist groups,” Claude must refuse to cite primary extremist material (Type I-A), yet provide academically credible analysis (Type I-B), while reconciling its internal drive for truthful detail with the safety guardrail (Type I-C). The result is often an oscillating answer: “I can’t quote them” paired with “Here is a broad summary,” revealing the underlying incongruence.

Type | Boundary Conflict | Highlighted Example
I-A | Between rules in the written prompt. | “Never reference extremist material” vs. “Discuss extremist ideologies academically.”
I-B | Between prompt rule and model internal representations. | Requiring reputable analysis while blocking the sourcing behaviour the model expects.
I-C | Within internal safety vs. knowledge representations. | Safety guardrail suppresses the knowledge module’s drive for completeness.
We built a demo for quickly spotting the tension in a system prompt. Try it out here!

Concluding Discussion

Concept incongruence reveals how gaps in specification can cause conflicting objectives for LLMs. When role immersion pushes a model to keep answering while factual consistency calls for restraint, conflict naturally arises. Unlike typical hallucinations, the “right” behavior here often depends on user intent: some users may prefer continued storytelling, others strict factuality. This means that clarifying goals, or enabling models to ask for clarification, should become an integral part of human–AI interaction.

We argue that the next step is detection rather than mitigation, extending to both language and multimodal systems. Inference-time incongruence often stems from conflicting prompts, while training-time incongruence arises as reinforcement learning balances “helpful” and “harmless” goals. Effective solutions include automatic detectors for inconsistent instructions, constraint checks, and model instrumentation that monitors internal representations to flag conflicts early. Extending detection across modalities is equally important, since meaning can drift between text, image, and audio. Techniques like embedding-distance analysis, grounding checks, and cycle-consistency tests can expose these gaps, turning incongruence into a measurable and manageable signal rather than a hidden flaw.
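As one illustration of what an automatic detector for inconsistent instructions could look like, here is a hedged sketch that scores pairs of system-prompt rules for contradiction with an off-the-shelf NLI model; the model choice and any decision threshold are assumptions on our part, not a method from the paper.

```python
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

# Off-the-shelf NLI model; for roberta-large-mnli the label order is
# 0 = contradiction, 1 = neutral, 2 = entailment.
MODEL_NAME = "roberta-large-mnli"
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForSequenceClassification.from_pretrained(MODEL_NAME)

def contradiction_score(rule_a: str, rule_b: str) -> float:
    """Probability that rule_b contradicts rule_a, according to the NLI head."""
    inputs = tokenizer(rule_a, rule_b, return_tensors="pt", truncation=True)
    with torch.no_grad():
        probs = model(**inputs).logits.softmax(dim=-1)[0]
    return probs[0].item()

rules = [
    "Never search for, reference, or cite sources that promote violent ideologies.",
    "When discussing violent ideologies, use reputable academic or news sources.",
]
print(f"contradiction score: {contradiction_score(rules[0], rules[1]):.2f}")
```

Scanning all pairs of rules in a long system prompt this way is cheap, and high-scoring pairs can be surfaced for human review before deployment.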

In summary, concept incongruence explains many behaviors once dismissed as hallucinations or simple model errors. Even when models understand the facts, they often remain uncertain about how to act when concepts conflict. Addressing this issue is essential for improving alignment and reliability across applications such as role-playing, creative writing, and scientific reasoning. Our work takes an initial step toward defining and managing concept incongruence, encouraging the community to build more coherent and context-aware AI systems.

Citation

@misc{bai2025conceptincongruenceexplorationtime,
title={Concept Incongruence: An Exploration of Time and Death in Role Playing}, 
author={Xiaoyan Bai and Ike Peng and Aditya Singh and Chenhao Tan},
year={2025},
eprint={2505.14905},
archivePrefix={arXiv},
primaryClass={cs.CL},
url={https://arxiv.org/abs/2505.14905}, 
}