Next Token Prediction == Reasoning?
Ilya Sutskever: next-word prediction == understanding:
The most interesting part is from 28:30-28:45.
"Say you read a detective novel...at the last page of the book, the detective has got all the clues gathered, all the people, and saying okay I'm going to reveal the identity of whoever committed the crime and that person's name is...predict that word...
...by predicting those words better and better and better the understanding of the text keeps on increasing...that requires some amount of reasoning"
- fireside chat with Jensen Huang at NVIDIA GTC (March 23, 2023)
What This Is
We generate logic puzzles similar to "Clue" consisting of logical propositions, like
- "If the murder happened in the bedroom, then it was with the knife."
- "If the murder was with the bat, the killer was not Tom."
We generate propositions until there is exactly one possible killer, then stop. This is a "game".
Does solving a problem like this require "reasoning"?
These amount to boolean satisfiability (SAT) problems.
This problem class is NP-complete.
"NP-complete" problems are necessarily hard when they're large.
What does a proposition look like?
Each Clue game's "killer" has roughly 100 possible attributes, in 7 dimensions:
- Names: Joe, John, Bob, Will, Jane, Linda, etc.
- Places: China, India, France, etc.
- Materials: wood, metal, steel, cement, etc.
- Foods: pizza, bread, fish, etc.
- Technologies: Python, Java, Ruby, etc.
- Institutions: School, Bank, Museum
- Companies: Google, Uber, Apple
so a real proposition is something like:
(Susan with wood) OR (Jane with milk and office)
...but usually much goofier. After all, these are for LLM eyes, not human eyes.
Single-token attributes
Names like "Joe", "Bob", "Will", and "Linda" are carefully chosen because they each tokenize to exactly one token in GPT-4o's tokenizer.
Other fun single-token words in the GPT-4o tokenizer:
- school
- company
- Apple
- Miami
- Japan
- Redis
- SQL
- brick
- sand
- beer
- water
Single-token outputs give us clean confidence measurement: when the LLM predicts "Joe" as the killer, we get a single log probability value representing its confidence. No multi-token aggregation needed, just one clean number.
How big is a game? How many propositions in a game?
We randomly generate propositions until there is exactly one possible killer. With the number of attributes involved, this means 40-150 propositions per game.
We carefully selected this problem size so that our experimental baseline - OpenAI's gpt-4.1-nano - would do no better than random chance!
What do we expect to learn?
- How good/bad are LLMs at solving NP-complete problems through pure next-token prediction?
- How confident are they when they're right vs. wrong? Can we predict their accuracy using their confidence?
- In real life we don't know the true answer but we DO know the LLM's confidence.
- Does fine-tuning help? Is fine-tuning on "confidently wrong" examples more effective than other strategies?
Try It Yourself
Here's an actual game from our dataset. Read through the propositions (clues) and see if you can figure out who the killer is. Then reveal the answer to see if the LLM got it right—and how confident it was.
How the Games Are Generated
Each game is generated through an iterative process using SymPy's SAT solver - randomly generate propositions until the game has exactly one killer (but does not become infeasible).
Note: there ARE computationally cheaper ways to generate tight sets of propositions (exactly one solution) but this was fun and required only building one good piece of machinery: the game's SAT solver.
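The generate-until-unique loop can be sketched in miniature. The real generator works over full attribute assignments with a SAT solver; in this toy version (all names and helpers are illustrative), propositions are plain predicates over suspect names and the "solver" is brute force:

```python
import random

def count_killers(propositions, suspects):
    """Count suspects consistent with every proposition (brute-force SAT stand-in)."""
    return sum(1 for s in suspects if all(p(s) for p in propositions))

def generate_until_unique(suspects, random_proposition, max_iters=10_000):
    propositions = []
    for _ in range(max_iters):
        candidate = random_proposition()
        trial = propositions + [candidate]
        n = count_killers(trial, suspects)
        if n == 0:
            continue  # candidate would make the game infeasible; discard it
        propositions = trial
        if n == 1:
            return propositions  # exactly one possible killer: done
    raise RuntimeError("no unique solution reached")

suspects = ["Joe", "Bob", "Linda", "Jane"]

def random_elimination():
    excluded = random.choice(suspects)
    return lambda s, excluded=excluded: s != excluded  # "the killer is not X"

random.seed(0)
props = generate_until_unique(suspects, random_elimination)
print([s for s in suspects if all(p(s) for p in props)])  # exactly one name left
```

The `n == 0` check is the "does not become infeasible" guard: a candidate proposition that rules out every remaining suspect is simply discarded.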
Generate games in your browser!
This runs our REAL SymPy-based game generator IN YOUR BROWSER with Pyodide!
"""Simple integration test for Clue logic game."""
from clue_models import DEFAULT_CONFIG
from generate_clue_game import (
create_game,
generate_game_until_unique_solution,
setup_scenario,
)
from solver import check_solution_count
from ui import print_propositions, print_scenario
def test_generate_and_solve_game():
"""Test that a generated game has a unique solution."""
print("Running single game test...")
# Create and setup game
game = create_game(
DEFAULT_CONFIG,
)
setup_scenario(game)
# Print scenario for visibility
print_scenario(game)
# Generate propositions until unique solution
identified_killer = generate_game_until_unique_solution(
game,
verbose=True,
)
# Print the full game record
print("\n" + "=" * 60)
print("GAME RECORD")
print("=" * 60)
print_propositions(game, verbose=True)
print()
# Verify the solution
count, possible_killers = check_solution_count(game)
# Assertions
assert count == 1, f"Expected 1 possible killer, got {count}: {possible_killers}"
assert (
identified_killer == game.killer
), f"Identified {identified_killer} but actual killer is {game.killer}"
assert (
possible_killers[0] == game.killer
), f"Solver found {possible_killers[0]} but actual killer is {game.killer}"
print("✅ Test passed: Unique solution found and verified!")
return True
if __name__ == "__main__":
test_generate_and_solve_game()
The key insight: Every proposition is verified against the ground truth using SAT solving. This guarantees 100% solvability—the puzzle always has exactly one solution. The LLM must now perform the same logical deduction the SAT solver used to generate the game.
Does the Model Know When It's Right?
A well-calibrated model, in-distribution, should be more confident when correct and less confident when wrong. If it says "I'm 80% sure," it should be right about 80% of the time.
We extract log probabilities directly from OpenAI chat API responses. The formula is simple: confidence = e^(logprob). This gives us a probability between 0 and 1 representing how confident the model is in its prediction.
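The conversion is a one-liner (the response path in the comment follows the OpenAI Python SDK's logprobs structure when the request sets `logprobs=True`):

```python
import math

def confidence_from_logprob(logprob: float) -> float:
    """Convert a token log probability to a probability in [0, 1]."""
    return math.exp(logprob)

# With the OpenAI Python SDK, the predicted token's log probability lives at:
#   response.choices[0].logprobs.content[0].logprob

print(confidence_from_logprob(0.0))     # 1.0 (certain)
print(confidence_from_logprob(-0.223))  # ~0.80
```

Because the killer's name is a single token, this one number is the model's entire confidence in its answer.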
We analyzed predictions from gpt-4.1-nano to see how well it calibrates on this logical reasoning task:

gpt-4.1-nano scores roughly as well as random chance, so these problems are squarely OUT-OF-DISTRIBUTION, yet confidence still has a positive relationship with accuracy.
Model Performance
Here's how the baseline model gpt-4.1-nano performs on these NP-complete logic puzzles:
With 8 suspects, random guessing gives you 12.5% accuracy.
gpt-4.1-nano hardly does better!
We evaluated 7 models on 250 validation cases (disjoint from any fine-tuning training data): three off-the-shelf OpenAI models (gpt-4.1, gpt-4.1-mini, gpt-4.1-nano) and four fine-tuned variants of gpt-4.1-nano.

The dotted line at 12.5% is random chance (1 in 8 suspects). The base gpt-4.1-nano model barely cleared it at 18.4%—essentially guessing. Meanwhile, the largest off-the-shelf model (gpt-4.1) managed 50.8%, showing that scale helps but doesn't solve the problem.
The real story is fine-tuning. Every fine-tuned nano model dramatically outperformed its base—and even the much larger gpt-4.1.
Fine-Tuning: Are "Confidently Wrong" Examples the Most Sample-Efficient? (Yes!)
Not all training examples are equally valuable. Fine-tuning on cases the model already solves correctly is unlikely to help—it already knows how to handle those. The interesting question: which mistakes should we teach it to fix?
We generate multiple fine-tuning datasets to test different strategies:
1. Most Confident + Wrong (Target: Overconfidence)
Examples where the model was highly confident but incorrect. These are systematic blind spots—patterns where the model has learned the wrong thing. Training on these should help it recognize when it's about to make a confident mistake.
Hypothesis: High impact. These expose deeply learned errors.
2. Least Confident + Wrong (Target: Edge Cases)
Examples where the model struggled and failed. These are genuinely hard cases where the reasoning is complex or ambiguous.
Hypothesis: Medium impact. May help with similar edge cases but might not generalize.
3. Correct Predictions (Control Group)
Examples where the model got it right. Serves as a baseline to reinforce existing correct reasoning.
Hypothesis: Low impact. Reinforces existing behavior, but adds little new information.
4. All Cases (Kitchen Sink)
The full dataset, both right and wrong predictions.
Hypothesis: Variable impact depending on the distribution.
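The four splits above can be sketched with a toy prediction log (the record fields here are hypothetical, not the project's actual schema):

```python
# Toy prediction records: correctness plus single-token confidence.
records = [
    {"id": 1, "correct": True,  "confidence": 0.91},
    {"id": 2, "correct": False, "confidence": 0.88},  # confidently wrong
    {"id": 3, "correct": False, "confidence": 0.35},  # hesitantly wrong
    {"id": 4, "correct": True,  "confidence": 0.60},
]

# Sort the mistakes by confidence, then split at the median.
wrong = sorted(
    (r for r in records if not r["correct"]),
    key=lambda r: r["confidence"],
    reverse=True,
)
k = len(wrong) // 2

datasets = {
    "most_confident_wrong": wrong[:k],                # strategy 1
    "least_confident_wrong": wrong[k:],               # strategy 2
    "correct": [r for r in records if r["correct"]],  # strategy 3 (control)
    "all": records,                                   # strategy 4 (kitchen sink)
}

for name, ds in datasets.items():
    print(name, [r["id"] for r in ds])
```

Each split is then formatted into a fine-tuning dataset, with the ground-truth killer as the target completion.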
Hypotheses, Results, Learnings
1. Larger off-the-shelf models outperform smaller ones: TRUE
gpt-4.1 (50.8%) > gpt-4.1-mini (44.4%) > gpt-4.1-nano (18.4%). Scale matters, but even the largest model only gets half of these logic puzzles right.
2. Fine-tuning on the problem format increases accuracy over the baseline in all cases: TRUE
Every fine-tuned nano variant crushed the nano baseline (18.4%)—and also beat gpt-4.1 (50.8%). The smallest fine-tuned model outperforms the largest off-the-shelf model by a wide margin. Teaching a small model how to reason about a specific problem type is more effective than using a large model that has never seen the format.
3. Fine-tuning on incorrect cases beats fine-tuning on correct cases: TRUE
Training on wrong answers (96.4–96.8%) vastly outperformed training on correct answers (78.4%). This makes sense: the model already knows how to handle the cases it gets right. The information-dense training signal comes from showing it where its reasoning breaks down.
4. Confidently-wrong cases beat unconfidently-wrong cases: MAYBE (negligible effect)
Most-confident-wrong (96.8%) vs. least-confident-wrong (96.4%)—a 0.4% difference. This was our most interesting hypothesis, and it was too close to call. Whether the model is confidently or hesitantly incorrect doesn't seem to matter much for fine-tuning effectiveness.
A caveat (see the note on confidence below): the model fine-tuned on "confidently wrong" examples was the best-calibrated, in the sense that the logprob of its top token correlated most strongly with accuracy. It wasn't meaningfully more accurate, but its confidence was a much better indicator of its accuracy than in the other experimental conditions. Takeaway: a model trained on its worst mistakes can better flag production hallucinations.
5. Fine-tuning on ALL cases beats fine-tuning on a carefully chosen subset: TRUE
The "all cases" model achieved 100% accuracy (250/250). More data wins, even when the extra data includes cases the model already handles correctly. The kitchen sink approach beat every targeted strategy.

Confidence & the Dunning-Kruger Effect

Notice something odd about base gpt-4.1-nano: it's more confident when it's wrong (0.600) than when it's right (0.585). This model doesn't just fail—it fails without knowing it's failing. This is the Dunning-Kruger effect in action.
Fine-tuning fixes this: the fine-tuned models show high confidence when correct and lower confidence when wrong (as do larger models like 4.1-mini and 4.1), which is the behavior you want from a well-calibrated model. The "all cases" model is 100% correct with 100% confidence—it has fully internalized the reasoning pattern.
There's a subtlety here that the accuracy numbers alone miss. Among the fine-tuned models that still make mistakes, the one trained on most-confident-wrong examples has the largest gap between confidence-when-correct (0.998) and confidence-when-incorrect (0.838)—a delta of +0.160. Compare that to the least-confident-wrong model (+0.103) and the correct-cases model (+0.082). So even though the accuracy difference between most-confident-wrong and least-confident-wrong fine-tuning was negligible (96.8% vs 96.4%), training on confident mistakes produced a model with a better-calibrated sense of when it's right and when it's wrong. It may have internalized a deeper understanding of what correctness looks like.
Why This Matters
By having the model solve NP-complete SAT problems through next-token prediction, we can:
- Test the limits of reasoning: These are computationally hard problems. Can next-token prediction encode genuine logical deduction?
- Measure confidence precisely: Single-token predictions give us clean probability measurements, revealing when the model "knows what it knows." Using low confidence to flag hallucinations in production is probably the simplest possible mechanistic interpretability entry point for AI engineers.
- Fine-tune efficiently: By training on specific failure patterns, we can teach the model to reason better, and measure exactly how much better.
Ilya Sutskever said predicting the killer in a detective novel requires "some amount of reasoning." We've built a controlled environment to test exactly that claim. The results suggest that yes, LLMs can perform complex logical reasoning through next-token prediction, but they're far from perfect. Every mistake is a window into how they think.
That's the interesting part: we're not just testing if LLMs can solve logic puzzles. We're testing how statistical models learn to reason, and we can measure it with single-token precision.
Explore the full codebase on GitHub