Next Token Prediction == Reasoning?
Ilya Sutskever: next-word prediction == understanding:
The most interesting part is from 28:30-28:45.
"Say you read a detective novel...at the last page of the book, the detective has got all the clues gathered, all the people, and saying okay I'm going to reveal the identity of whoever committed the crime and that person's name is...predict that word...
...by predicting those words better and better and better the understanding of the text keeps on increasing...that requires some amount of reasoning"
- fireside chat with Jensen Huang at NVIDIA GTC (March 23, 2023)
What This Is
We generate logic puzzles similar to "Clue" consisting of logical propositions, like
- "If the murder happened in the bedroom, then it was with the knife."
- "If the murder was with the bat, the killer was not Tom."
We generate propositions until there is exactly one possible killer, then stop. This is a "game".
Does solving a problem like this require "reasoning"?
These amount to boolean satisfiability (SAT) problems.
This problem class is NP-complete.
"NP-complete" problems are necessarily hard when they're large.
What does a proposition look like?
Each Clue game's "killer" has roughly 100 possible attributes, in 7 dimensions:
- Names: Joe, John, Bob, Will, Jane, Linda, etc.
- Places: China, India, France, etc.
- Materials: wood, metal, steel, cement, etc.
- Foods: pizza, bread, fish, etc.
- Technologies: Python, Java, Ruby, etc.
- Institutions: School, Bank, Museum
- Companies: Google, Uber, Apple
so a real proposition is something like:
(Susan with wood) OR (Jane with milk and office)
...but usually much goofier. After all, these are for LLM eyes, not human eyes.
Single-token attributes
Names like "Joe", "Bob", "Will", and "Linda" are carefully chosen because they each tokenize to exactly one token in GPT-4o's tokenizer.
Other fun single-token words in the GPT-4o tokenizer:
- school
- company
- Apple
- Miami
- Japan
- Redis
- SQL
- brick
- sand
- beer
- water
Single-token outputs give us clean confidence measurement: when the LLM predicts "Joe" as the killer, we get a single log probability value representing its confidence. No multi-token aggregation needed, just one clean number.
How big is a game? How many propositions in a game?
We randomly generate propositions until there is exactly one possible killer. With the number of attributes involved, this means 40-150 propositions per game.
We carefully selected this problem size so that our experimental baseline - OpenAI's gpt-4.1-nano - would do no better than random chance!
What do we expect to learn?
- How good/bad are LLMs at solving NP-complete problems through pure next-token prediction?
- How confident are they when they're right vs. wrong? Can we predict their accuracy using their confidence?
- In real life we don't know the true answer but we DO know the LLM's confidence.
- Does fine-tuning help? Is fine-tuning on "confidently wrong" examples more effective than other strategies?
Try It Yourself
Here's an actual game from our dataset. Read through the propositions (clues) and see if you can figure out who the killer is. Then reveal the answer to see if the LLM got it right—and how confident it was.
How the Games Are Generated
Each game is generated through an iterative process using SymPy's SAT solver - randomly generate propositions until the game has exactly one killer (but does not become infeasible).
Note: there ARE computationally cheaper ways to generate tight sets of propositions (exactly one solution) but this was fun and required only building one good piece of machinery: the game's SAT solver.
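The generate-until-unique loop can be sketched in miniature. The real generator works over full attribute assignments with a SAT solver; in this toy version (all names and helpers are illustrative), propositions are plain predicates over suspect names and the "solver" is brute force:

```python
import random

def count_killers(propositions, suspects):
    """Count suspects consistent with every proposition (brute-force SAT stand-in)."""
    return sum(1 for s in suspects if all(p(s) for p in propositions))

def generate_until_unique(suspects, random_proposition, max_iters=10_000):
    propositions = []
    for _ in range(max_iters):
        candidate = random_proposition()
        trial = propositions + [candidate]
        n = count_killers(trial, suspects)
        if n == 0:
            continue  # candidate would make the game infeasible; discard it
        propositions = trial
        if n == 1:
            return propositions  # exactly one possible killer: done
    raise RuntimeError("no unique solution reached")

suspects = ["Joe", "Bob", "Linda", "Jane"]

def random_elimination():
    excluded = random.choice(suspects)
    return lambda s, excluded=excluded: s != excluded  # "the killer is not X"

random.seed(0)
props = generate_until_unique(suspects, random_elimination)
print([s for s in suspects if all(p(s) for p in props)])  # exactly one name left
```

The `n == 0` check is the "does not become infeasible" guard: a candidate proposition that rules out every remaining suspect is simply discarded.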
Generate games in your browser!
This runs our REAL SymPy-based game generator IN YOUR BROWSER with Pyodide!
"""Simple integration test for Clue logic game."""
from clue_models import DEFAULT_CONFIG
from generate_clue_game import (
create_game,
generate_game_until_unique_solution,
setup_scenario,
)
from solver import check_solution_count
from ui import print_propositions, print_scenario
def test_generate_and_solve_game():
"""Test that a generated game has a unique solution."""
print("Running single game test...")
# Create and setup game
game = create_game(
DEFAULT_CONFIG,
)
setup_scenario(game)
# Print scenario for visibility
print_scenario(game)
# Generate propositions until unique solution
identified_killer = generate_game_until_unique_solution(
game,
verbose=True,
)
# Print the full game record
print("\n" + "=" * 60)
print("GAME RECORD")
print("=" * 60)
print_propositions(game, verbose=True)
print()
# Verify the solution
count, possible_killers = check_solution_count(game)
# Assertions
assert count == 1, f"Expected 1 possible killer, got {count}: {possible_killers}"
assert (
identified_killer == game.killer
), f"Identified {identified_killer} but actual killer is {game.killer}"
assert (
possible_killers[0] == game.killer
), f"Solver found {possible_killers[0]} but actual killer is {game.killer}"
print("✅ Test passed: Unique solution found and verified!")
return True
if __name__ == "__main__":
test_generate_and_solve_game()
The key insight: Every proposition is verified against the ground truth using SAT solving. This guarantees 100% solvability—the puzzle always has exactly one solution. The LLM must now perform the same logical deduction the SAT solver used to generate the game.
Does the Model Know When It's Right?
A well-calibrated model, in-distribution, should be more confident when correct and less confident when wrong. If it says "I'm 80% sure," it should be right about 80% of the time.
We extract log probabilities directly from OpenAI chat API responses. The formula is simple: confidence = e^(logprob). This gives us a probability between 0 and 1 representing how confident the model is in its prediction.
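The conversion is a one-liner (the response path in the comment follows the OpenAI Python SDK's logprobs structure when the request sets `logprobs=True`):

```python
import math

def confidence_from_logprob(logprob: float) -> float:
    """Convert a token log probability to a probability in [0, 1]."""
    return math.exp(logprob)

# With the OpenAI Python SDK, the predicted token's log probability lives at:
#   response.choices[0].logprobs.content[0].logprob

print(confidence_from_logprob(0.0))     # 1.0 (certain)
print(confidence_from_logprob(-0.223))  # ~0.80
```

Because the killer's name is a single token, this one number is the model's entire confidence in its answer.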
We analyzed predictions from gpt-4.1-nano to see how well it calibrates on this logical reasoning task:

gpt-4.1-nano scores roughly as well as random chance, so these problems are squarely OUT-OF-DISTRIBUTION, yet confidence still has a positive relationship with accuracy.
Model Performance
Here's how the baseline model gpt-4.1-nano performs on these NP-complete logic puzzles:
With 8 suspects, random guessing gives you 12.5% accuracy.
gpt-4.1-nano hardly does better!
We evaluated 7 models on 250 validation cases (disjoint from any fine-tuning training data): three off-the-shelf OpenAI models (gpt-4.1, gpt-4.1-mini, gpt-4.1-nano) and four fine-tuned variants of gpt-4.1-nano.

The dotted line at 12.5% is random chance (1 in 8 suspects). The base gpt-4.1-nano model barely cleared it at 18.4%—essentially guessing. Meanwhile, the largest off-the-shelf model (gpt-4.1) managed 50.8%, showing that scale helps but doesn't solve the problem.
The real story is fine-tuning. Every fine-tuned nano model dramatically outperformed its base—and even the much larger gpt-4.1.
Fine-Tuning: Are "Confidently Wrong" Examples the Most Sample-Efficient? (Yes!)
Not all training examples are equally valuable. Fine-tuning on cases the model already solves correctly is unlikely to help—it already knows how to handle those. The interesting question: which mistakes should we teach it to fix?
We generate multiple fine-tuning datasets to test different strategies:
1. Most Confident + Wrong (Target: Overconfidence)
Examples where the model was highly confident but incorrect. These are systematic blind spots—patterns where the model has learned the wrong thing. Training on these should help it recognize when it's about to make a confident mistake.
Hypothesis: High impact. These expose deeply learned errors.
2. Least Confident + Wrong (Target: Edge Cases)
Examples where the model struggled and failed. These are genuinely hard cases where the reasoning is complex or ambiguous.
Hypothesis: Medium impact. May help with similar edge cases but might not generalize.
3. Correct Predictions (Control Group)
Examples where the model got it right. Serves as a baseline to reinforce existing correct reasoning.
Hypothesis: Low impact. Reinforces existing behavior, but adds little new information.
4. All Cases (Kitchen Sink)
The full dataset, both right and wrong predictions.
Hypothesis: Variable impact depending on the distribution.
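The four splits above can be sketched with a toy prediction log (the record fields here are hypothetical, not the project's actual schema):

```python
# Toy prediction records: correctness plus single-token confidence.
records = [
    {"id": 1, "correct": True,  "confidence": 0.91},
    {"id": 2, "correct": False, "confidence": 0.88},  # confidently wrong
    {"id": 3, "correct": False, "confidence": 0.35},  # hesitantly wrong
    {"id": 4, "correct": True,  "confidence": 0.60},
]

# Sort the mistakes by confidence, then split at the median.
wrong = sorted(
    (r for r in records if not r["correct"]),
    key=lambda r: r["confidence"],
    reverse=True,
)
k = len(wrong) // 2

datasets = {
    "most_confident_wrong": wrong[:k],                # strategy 1
    "least_confident_wrong": wrong[k:],               # strategy 2
    "correct": [r for r in records if r["correct"]],  # strategy 3 (control)
    "all": records,                                   # strategy 4 (kitchen sink)
}

for name, ds in datasets.items():
    print(name, [r["id"] for r in ds])
```

Each split is then formatted into a fine-tuning dataset, with the ground-truth killer as the target completion.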
Hypotheses, Results, Learnings
1. Larger off-the-shelf models outperform smaller ones: TRUE
gpt-4.1 (50.8%) > gpt-4.1-mini (44.4%) > gpt-4.1-nano (18.4%). Scale matters, but even the largest model only gets half of these logic puzzles right.
2. Fine-tuning on the problem format increases accuracy over the baseline in all cases: TRUE
Every fine-tuned nano variant crushed the nano baseline (18.4%)—and also beat gpt-4.1 (50.8%). The smallest fine-tuned model outperforms the largest off-the-shelf model by a wide margin. Teaching a small model how to reason about a specific problem type is more effective than using a large model that has never seen the format.
3. Fine-tuning on incorrect cases beats fine-tuning on correct cases: TRUE
Training on wrong answers (96.4–96.8%) vastly outperformed training on correct answers (78.4%). This makes sense: the model already knows how to handle the cases it gets right. The information-dense training signal comes from showing it where its reasoning breaks down.
4. Confidently-wrong cases beat unconfidently-wrong cases: MAYBE (negligible effect)
Most-confident-wrong (96.8%) vs. least-confident-wrong (96.4%)—a 0.4% difference. This was our most interesting hypothesis, and it was too close to call. Whether the model is confidently or hesitantly incorrect doesn't seem to matter much for fine-tuning effectiveness.
A caveat (see the note on confidence below): the model fine-tuned on "confidently wrong" examples was the best-calibrated, in the sense that the logprob of its top token correlated most strongly with accuracy. It wasn't meaningfully more accurate, but its confidence was a much better indicator of its accuracy than in the other experimental conditions. Takeaway: a model trained on its worst mistakes can better flag production hallucinations.
5. Fine-tuning on ALL cases beats fine-tuning on a carefully chosen subset: TRUE
The "all cases" model achieved 100% accuracy (250/250). More data wins, even when the extra data includes cases the model already handles correctly. The kitchen sink approach beat every targeted strategy.

Confidence & the Dunning-Kruger Effect

Notice something odd about base gpt-4.1-nano: it's more confident when it's wrong (0.600) than when it's right (0.585). This model doesn't just fail—it fails without knowing it's failing. This is the Dunning-Kruger effect in action.
Fine-tuning fixes this: the fine-tuned models show high confidence when correct and lower confidence when wrong (as do larger models like 4.1-mini and 4.1), which is the behavior you want from a well-calibrated model. The "all cases" model is 100% correct with 100% confidence—it has fully internalized the reasoning pattern.
There's a subtlety here that the accuracy numbers alone miss. Among the fine-tuned models that still make mistakes, the one trained on most-confident-wrong examples has the largest gap between confidence-when-correct (0.998) and confidence-when-incorrect (0.838)—a delta of +0.160. Compare that to the least-confident-wrong model (+0.103) and the correct-cases model (+0.082). So even though the accuracy difference between most-confident-wrong and least-confident-wrong fine-tuning was negligible (96.8% vs 96.4%), training on confident mistakes produced a model with a better-calibrated sense of when it's right and when it's wrong. It may have internalized a deeper understanding of what correctness looks like.
Why This Matters
By having the model solve NP-complete SAT problems through next-token prediction, we can:
- Test the limits of reasoning: These are computationally hard problems. Can next-token prediction encode genuine logical deduction?
- Measure confidence precisely: Single-token predictions give us clean probability measurements, revealing when the model "knows what it knows." Using low confidence to flag hallucinations in production is probably the simplest possible mechanistic interpretability entry point for AI engineers.
- Fine-tune efficiently: By training on specific failure patterns, we can teach the model to reason better, and measure exactly how much better.
Ilya Sutskever said predicting the killer in a detective novel requires "some amount of reasoning." We've built a controlled environment to test exactly that claim. The results suggest that yes, LLMs can perform complex logical reasoning through next-token prediction, but they're far from perfect. Every mistake is a window into how they think.
That's the interesting part: we're not just testing if LLMs can solve logic puzzles. We're testing how statistical models learn to reason, and we can measure it with single-token precision.
Explore the full codebase on GitHub