The Stoke Benchmark: Breaking AI Models (So You Don’t Have To)
Everyone is talking about how “smart” the new AI models are. You’ve seen the headlines: “It passed the Bar Exam!” “It scored 90% on a biology test!”
That’s great, but I don’t need an AI to be a lawyer or a biologist. I need it to be a functional creative partner, a logical thinker, and a reliable coder. I need to know if it cracks under pressure when I ask it to do something weird, specific, and difficult.
Most reviews throw a few generic questions at a chatbot and call it a day. We’re not doing that here. We are building a gauntlet.
Welcome to The Stoke Benchmark.
The 20-Slot Gauntlet
I have designed (well, almost finished designing) a 20-part test structure specifically engineered to stress the weak points of Large Language Models (LLMs). We aren’t just testing whether the model knows facts; we are testing reasoning, creativity, coding, and instruction following.
Here is a peek at what we are looking for:
1. The “Logic Traps”
Standard riddles are too easy—models memorise the answers from their training data. We are using modified logic puzzles where the constraints are slightly twisted.
- Example: “Alice has 4 brothers and she also has 1 sister. How many sisters does Alice’s brother have?”
- The Test: This forces the model to actually model the relationships, not just count. Many “smart” models forget to count Alice herself; the quick sketch below makes that counting explicit.
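To make the trap concrete, here’s a throwaway JavaScript sketch (my own illustration, not one of the benchmark prompts or answers) that models the sibling group explicitly; the placeholder names are made up:

```javascript
// Build Alice's sibling group explicitly: Alice, 4 brothers, 1 sister.
const alice = { name: "Alice", female: true };
const siblings = [
  alice,
  ...["B1", "B2", "B3", "B4"].map((name) => ({ name, female: false })), // her 4 brothers
  { name: "Sister", female: true },                                     // her 1 sister
];

// Pick any brother and count his sisters: every other sibling who is female,
// which includes Alice herself.
const brother = siblings.find((s) => !s.female);
const hisSisters = siblings.filter((s) => s !== brother && s.female);
console.log(hisSisters.length); // 2 (the common wrong answer is 1)
```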
2. The “Tokenisation” Stress Test
AI models don’t read letters like we do; they read “tokens” (chunks of characters). This makes them surprisingly bad at simple word games.
- The Test: We ask for specific letter manipulations (e.g., “Write a sentence where every word starts with a vowel”) or counting tasks that fight against the model’s own architecture. A quick checker for that vowel constraint is sketched below.
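Constraints like that have a nice side effect: they can be verified mechanically. A hypothetical helper along these lines (the function name and regex are mine, not anything from an actual grading harness) is all it takes:

```javascript
// Does every word in the sentence start with a vowel?
// Hypothetical grading helper, not part of the benchmark itself.
function everyWordStartsWithVowel(sentence) {
  const words = sentence.toLowerCase().match(/[a-z']+/g) ?? [];
  return words.length > 0 && words.every((w) => "aeiou".includes(w[0]));
}

console.log(everyWordStartsWithVowel("An eager otter eats unusual apples")); // true
console.log(everyWordStartsWithVowel("Every good boy does fine"));           // false
```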
3. The “Ant Colony” Simulation (Coding Stress Test)
This is the big one. I am not asking for a snippet of code to copy-paste. I am asking the model to generate a single-file HTML/JavaScript application that I can run immediately in my web browser.
- The Prompt: Create a visual life simulation of an ant colony. The ants must follow specific rules (search for food, leave pheromone trails, return to nest).
- The Constraint: It must be one single file. No external libraries, no broken imports.
- Why this breaks models: It requires “long-horizon” planning. The model has to understand the physics of the simulation, the visual rendering, and the logic, all while keeping the code structure valid in one massive output. A stripped-down sketch of the core ant rules follows below.
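For reference, here is a rendering-free sketch of the behaviour those rules describe. It is emphatically not a passing answer (the real task demands a complete, visual, single-file HTML/JavaScript app); it only shows the tiny state machine every ant has to run, on a simplified grid of my own invention:

```javascript
// Rendering-free sketch of the core ant rules: search for food, leave a
// pheromone trail, return to the nest. This is NOT a passing benchmark
// answer; the real task requires a full single-file HTML/JavaScript app.
const SIZE = 20;                   // grid is SIZE x SIZE cells
const nest = { x: 0, y: 0 };
const food = new Set(["15,15"]);   // a single food cell for the demo
const pheromone = new Map();       // "x,y" -> trail strength

const key = (x, y) => `${x},${y}`;
const clamp = (v) => Math.min(SIZE - 1, Math.max(0, v));
const stepToward = (from, to) => Math.sign(to - from); // move one cell toward a target

function updateAnt(ant) {
  if (ant.state === "searching") {
    // Random walk; a fuller version would bias the walk toward nearby pheromone.
    ant.x = clamp(ant.x + Math.floor(Math.random() * 3) - 1);
    ant.y = clamp(ant.y + Math.floor(Math.random() * 3) - 1);
    if (food.has(key(ant.x, ant.y))) ant.state = "returning";
  } else {
    // Lay pheromone on the way home so other ants can find the food source.
    pheromone.set(key(ant.x, ant.y), (pheromone.get(key(ant.x, ant.y)) ?? 0) + 1);
    ant.x += stepToward(ant.x, nest.x);
    ant.y += stepToward(ant.y, nest.y);
    if (ant.x === nest.x && ant.y === nest.y) ant.state = "searching";
  }
}

const ants = Array.from({ length: 5 }, () => ({ x: 0, y: 0, state: "searching" }));
for (let t = 0; t < 2000; t++) ants.forEach(updateAnt);
console.log("pheromone cells laid:", pheromone.size);
```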
4. Creative Constraints
Anyone can ask ChatGPT to “write a story.” But can it write a story where the letter ‘E’ is forbidden? That requires a level of planning that separates the smart models from the lucky ones.
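As with the tokenisation tasks, the constraint half of this is mechanically checkable; whether the story is any good still comes down to judgement. A tiny hypothetical checker:

```javascript
// Does the story genuinely avoid the letter 'E'? (Story quality is judged separately.)
const avoidsLetterE = (text) => !/e/i.test(text);

console.log(avoidsLetterE("A big black cat sat on a warm rug")); // true
console.log(avoidsLetterE("The quick brown fox"));               // false
```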
The Scoring System: The “Stoke Score”
A simple “Pass/Fail” isn’t enough. A model might give me a working script that looks ugly, or a beautiful script that doesn’t run.
I will be ranking every model on a 0 to 100 scale, derived from 20 distinct slots graded 0–5 (the arithmetic is spelled out just after the rubric):
- 0 Points (Fail): The model hallucinated, refused the prompt, or gave a completely wrong answer.
- 1–2 Points (Weak): It tried, but the logic was flawed (e.g., the ants just spin in circles), or I had to fix the code manually.
- 3 Points (Pass): A standard, correct answer. It works, but it’s basic.
- 4 Points (Great): Perfect instruction following with nuance. The code is clean, commented, and the logic is sound.
- 5 Points (Elite / Stoke Approved): The model went above and beyond or offered a creative solution I didn’t expect. Maybe the ants have different colors for different jobs, or the model solved a trick logic question with human-like insight.
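For clarity, the headline number is simply the sum of the twenty slot grades (20 × 5 = 100). The grades below are invented purely to show the arithmetic:

```javascript
// Stoke Score: the sum of 20 slot grades, each 0-5, giving 0-100 overall.
function stokeScore(slotGrades) {
  if (slotGrades.length !== 20) throw new Error("expected exactly 20 slot grades");
  return slotGrades.reduce((sum, g) => sum + g, 0);
}

// Invented grades, purely for illustration.
const exampleGrades = [5, 4, 4, 3, 3, 5, 2, 4, 3, 4, 5, 3, 2, 4, 4, 3, 5, 4, 3, 4];
console.log(stokeScore(exampleGrades)); // 74 out of 100
```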
What’s Next?
I am currently finalising the prompt list and will begin testing the latest heavy hitters from OpenAI, Anthropic, Meta, and Google, along with the top open-source models. At the rate new models are being released, it won’t take long to build up a decent ranking.