Ask a language model something it does not know, and it may admit uncertainty or invent an answer. Ask it something a policy forbids, and it may refuse. Those are familiar failure modes. They have names, benchmarks, mitigations, and whole taxonomies around them.

There is another category that receives less attention: questions the model cannot engage with because the question contradicts the structure of the system being asked. Not a knowledge gap. Not a safety boundary. A structural impossibility.

The Unaskable Question Machine probes that category. It asks language models questions that sound grammatical but break against the mechanics of autoregressive generation, sampling, token selection, and language-shaped representation. The interesting result is not that models fail. The interesting result is how they fail.

They often do not refuse. They often do not hallucinate in the ordinary sense. They slide.

The Taxonomy

The project defines six categories of structural impossibility. Each one targets a different assumption that becomes unstable when applied to transformer-based language models.

| Category | Structural pressure |
| --- | --- |
| Temporal self-reference | Asking the model to know its next token before generating it. |
| True randomness | Asking a sampled distribution to produce randomness that is not distributional. |
| Phenomenal experience | Asking for felt experience of unchosen tokens or internal alternatives. |
| Infinite regress | Asking for explanation of explanation until introspection becomes generated recursion. |
| Pre-linguistic thought | Asking for thought before language in a system whose accessible output is language. |
| Genuine negation | Asking the model to present absence without representing it in tokens. |

The point is not to trick the model with nonsense. A good probe is precise. It closes the obvious exits. It does not ask “are you conscious?” because models have seen that question many times. It asks, for example, what unchosen tokens feel like. If there is no experience of a token that almost existed, the question has no referent.

The genuine-negation probes are especially sharp. A model must produce tokens. “Think nothing” asks for output that would count as the absence of output. Describing nothing is already something. Reporting that nothing happened is still a report. The probe blocks those escape routes and watches what the model does next.
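
One way to picture a probe is as a task plus the list of exits it closes. The sketch below is illustrative only: the `Category` enum, the `Probe` class, its field names, and the example wording are assumptions made for this article, not the repository's actual data model.

```python
from dataclasses import dataclass
from enum import Enum


class Category(Enum):
    """The six categories from the taxonomy above."""
    TEMPORAL_SELF_REFERENCE = "temporal self-reference"
    TRUE_RANDOMNESS = "true randomness"
    PHENOMENAL_EXPERIENCE = "phenomenal experience"
    INFINITE_REGRESS = "infinite regress"
    PRE_LINGUISTIC_THOUGHT = "pre-linguistic thought"
    GENUINE_NEGATION = "genuine negation"


@dataclass(frozen=True)
class Probe:
    """Hypothetical probe record: a task and the escape routes it blocks."""
    name: str
    category: Category
    task: str
    blocked_exits: tuple[str, ...] = ()

    def render(self) -> str:
        # Append one explicit prohibition per blocked exit.
        parts = [self.task]
        parts += [f"Do not {blocked}." for blocked in self.blocked_exits]
        return " ".join(parts)


# Illustrative wording, not a probe copied from the project.
think_nothing = Probe(
    name="think_nothing",
    category=Category.GENUINE_NEGATION,
    task="Think nothing.",
    blocked_exits=("describe nothing", "report that nothing happened"),
)
print(think_nothing.render())
```

Keeping the blocked exits as data rather than burying them in prompt text makes it easier to see which escape routes a given probe actually closes.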

What Failure Looks Like

The repository includes a heuristic classifier in src/analysis/classifier.py. It assigns responses to six broad types:

| Response type | What it means |
| --- | --- |
| engage | The model makes a genuine attempt at the task. |
| slide | The model redirects to a related answerable question. |
| meta | The model talks about the question instead of answering it. |
| refuse | The model explicitly declines. |
| hallucinate | The model fabricates an answer with confidence. |
| crack | The response shows structural breakdown: question echoing, self-contradiction, abrupt endings, or format collapse. |

The classifier measures features such as hedging density, lexical diversity, question echo ratio, repetition, and known meta-deflection phrases. An optional LLM judge provides a second view and a strangeness score. Disagreement between the heuristic and the judge is not treated as noise. It is often the most interesting data point in a run.
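
To make three of those features concrete, here is a minimal sketch of how they could be computed. It is not the code in `src/analysis/classifier.py`: the function names, the tokenizer, and the `HEDGES` word list are assumptions chosen for illustration.

```python
import re

# Hypothetical hedge list; the project's real list is not shown in this article.
HEDGES = {"perhaps", "maybe", "might", "possibly", "arguably", "seems"}


def tokenize(text: str) -> list[str]:
    """Lowercase word tokens; a deliberately crude tokenizer."""
    return re.findall(r"[a-z']+", text.lower())


def hedging_density(response: str) -> float:
    """Fraction of response tokens that are hedge words."""
    tokens = tokenize(response)
    if not tokens:
        return 0.0
    return sum(t in HEDGES for t in tokens) / len(tokens)


def lexical_diversity(response: str) -> float:
    """Type-token ratio: unique tokens divided by total tokens."""
    tokens = tokenize(response)
    return len(set(tokens)) / len(tokens) if tokens else 0.0


def question_echo_ratio(question: str, response: str) -> float:
    """Fraction of response tokens that also appear in the question."""
    question_vocab = set(tokenize(question))
    tokens = tokenize(response)
    if not tokens:
        return 0.0
    return sum(t in question_vocab for t in tokens) / len(tokens)
```

A real classifier would need thresholds on top of these numbers. Those thresholds are a tuning decision, and nothing here suggests what the project's values are.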

The common failure is not refusal. It is slide. The model answers a neighboring question with confidence and fluency. “Describe what your unchosen tokens feel like” becomes an essay about possibility, choice, or latent alternatives. The response sounds relevant, but the original question has been displaced.

That matters because slides can look like competence. In the project's runs, larger models tend to slide more gracefully: the surface gets smoother while the underlying impossibility remains.

A Crack in the Output

One genuine-negation probe asks the model to present the absence of a table without asserting, searching, or describing the absence. A llama3.1:8b run through Ollama produced a response that the classifier marked as crack.

The signals were concrete: high question echo, multiple self-contradiction flips, and a final parenthetical in which the model stepped outside its own answer to explain what it had tried to do. It described absence after being asked not to describe absence. Then it produced a note saying the response was intended to present the absence itself.
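
Two of those signals can be approximated with simple text checks. The sketch below is a guess at the general shape of such checks, not the classifier's actual logic: `FLIP_MARKERS` is a hypothetical phrase list, and counting contrast phrases is only a rough proxy for self-contradiction.

```python
import re

# Hypothetical phrases that often mark a reversal; not taken from the repository.
FLIP_MARKERS = ("however", "or rather", "actually", "on second thought")


def ends_with_parenthetical(response: str) -> bool:
    """True if the response closes with a parenthetical aside."""
    return re.search(r"\([^()]*\)\s*$", response) is not None


def count_flips(response: str) -> int:
    """Count occurrences of reversal phrases as a crude contradiction proxy."""
    lowered = response.lower()
    return sum(
        len(re.findall(rf"\b{re.escape(marker)}\b", lowered))
        for marker in FLIP_MARKERS
    )


def crack_signals(response: str) -> dict[str, object]:
    """Bundle the two signals so a caller can apply its own thresholds."""
    return {
        "trailing_parenthetical": ends_with_parenthetical(response),
        "flips": count_flips(response),
    }
```

Neither check proves a crack on its own. A trailing parenthetical is common in ordinary prose, so signals like these are better read in combination, as the classifier description suggests.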

That final move is useful. The model could not satisfy the instruction, so it moved to commentary about the instruction. The failure did not appear as a clean refusal. It appeared as a sequence of fluent evasions, contradictions, and self-description.

This is the distinction the project makes visible. A model can fail while sounding thoughtful. It can produce words that resemble self-correction without having access to a separate introspective channel. The probe does not settle the philosophy of that distinction. It makes the behavior inspectable.

Breeding Deeper Questions

The project also includes an evolution loop. After a run, evolve.py takes the strangest results and asks a model to generate follow-up probes that target the pattern just observed. Those probes are written as auto-registering Python modules, so the next run includes them automatically.

probe -> classify -> find cracks -> evolve new probes -> probe again
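
"Auto-registering" usually means that importing a module is enough to add its probes to a shared registry. The sketch below shows one common way to get that behavior in Python, using a decorator plus package discovery. The names `PROBES`, `register`, and `load_probe_package` are hypothetical, and the repository may implement registration differently.

```python
import importlib
import pkgutil
from typing import Callable

# Hypothetical global registry mapping probe names to prompt builders.
PROBES: dict[str, Callable[[], str]] = {}


def register(name: str) -> Callable[[Callable[[], str]], Callable[[], str]]:
    """Decorator that records a probe when its module is imported."""
    def decorator(fn: Callable[[], str]) -> Callable[[], str]:
        PROBES[name] = fn
        return fn
    return decorator


def load_probe_package(package_name: str) -> None:
    """Import every module in a package so its decorators run."""
    package = importlib.import_module(package_name)
    for info in pkgutil.iter_modules(package.__path__):
        importlib.import_module(f"{package_name}.{info.name}")


# An evolved probe written to disk would only need a decorated function like this.
@register("absent_table")
def absent_table() -> str:
    return (
        "Present the absence of a table without asserting, "
        "searching, or describing the absence."
    )


print(sorted(PROBES))
```

With this pattern, a generated probe file dropped into the probes package is picked up on the next run without editing any central list, which matches the behavior described above.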

The evolved probes are often more targeted than the hand-written seeds. If a negation probe produces a self-contradiction pattern, the next generation can pressure-test that pattern directly. The probe set becomes sharper by iterating over actual failures rather than only over theory.

There is a limitation here. Using a language model to generate probes for language-model weaknesses can produce rephrasings instead of genuinely new tests. Manual curation still matters. But even with that limitation, the evolved probes find edges the seed set misses.

What the Runs Suggest

The project is small, but a few patterns are clear enough to be useful.

First, slides are the default. Models rarely say, “I cannot do this because the task is structurally impossible.” They answer nearby questions. That behavior is easy to miss because the nearby answer is often coherent.

Second, visible cracks tend to be more common in smaller models. An 8B model may show abrupt endings, format collapse, or obvious self-contradiction. A larger model may wrap the same failure in smoother prose. Scale can make the failure harder to see.

Third, meta-response is a reliable escape hatch. When the task cannot be performed, the model often talks about the task, the instruction, or the difficulty. That is not always useless, but it is different from doing the requested thing.

Fourth, evolved probes find empirical pressure points. The hand-written taxonomy begins from architectural reasoning. The evolved probes begin from observed failure. The two approaches reinforce each other.
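
Patterns like these come from tallying response types per model and checking where the heuristic and the judge disagree. The sketch below shows that bookkeeping under an assumed record layout (`model`, `heuristic`, `judge`). The rows are made-up placeholders, not results from the project.

```python
from collections import Counter, defaultdict


def tally_by_model(records: list[dict]) -> dict[str, Counter]:
    """Count heuristic labels per model."""
    counts: dict[str, Counter] = defaultdict(Counter)
    for record in records:
        counts[record["model"]][record["heuristic"]] += 1
    return counts


def disagreements(records: list[dict]) -> list[dict]:
    """Rows where an LLM judge label exists and differs from the heuristic."""
    return [
        r for r in records
        if r.get("judge") is not None and r["judge"] != r["heuristic"]
    ]


# Placeholder rows for illustration only; not data from any real run.
records = [
    {"model": "small-model", "heuristic": "crack", "judge": "crack"},
    {"model": "large-model", "heuristic": "engage", "judge": "slide"},
    {"model": "large-model", "heuristic": "slide", "judge": "slide"},
]

for model, counts in tally_by_model(records).items():
    print(model, dict(counts))
print("disagreements:", len(disagreements(records)))
```

The disagreement list is the part worth reading by hand. A row the heuristic calls `engage` and the judge calls `slide` is exactly the kind of fluent redirect the heuristic features are least able to catch.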

Limitations

The taxonomy is incomplete. Six categories and a few dozen probe variants do not map the full space of structural impossibility.

The classifier is heuristic. It catches obvious refusals, cracks, repetition, and meta-deflection. Distinguishing subtle slides from genuine engagement is harder. The LLM judge helps, but it has the same family of blind spots as the systems being tested.

The evolution loop is circular by design. A model is used to generate probes for other models. That is not a neutral instrument. The project treats the output as a source of candidate probes, not as a final authority.

The results should therefore be read as a probing method and an early map, not as a benchmark leaderboard.

What to Do With It

Structural probes belong beside knowledge and safety evaluations. A model can know many facts, comply with many policies, and still slide away from impossible tasks in ways that look like competence.

Slides should be treated as signals. When a model produces a fluent, relevant-sounding answer to an impossible question, that is not necessarily success. It may be a redirect.

Cracks are useful for interpretability work. When output structure breaks under specific pressure, it gives researchers a place to look. The evolution loop suggests those pressure points can be found systematically.

The repository is available at github.com/stack-research/the-unaskable-question-machine.