Beyond Chatbots: AI Models That Think More Like Humans
How to Understand, Prompt, and Run Your Own Local Reasoning Models (Like OpenAI's o1)
"With a little bit of extra effort it can get even better." - DeepSeek R1 paper, after their 7b model outperformed GPT-4o on complex math problems
The Breakthrough
Something remarkable happened in AI development over the past few months. Instead of just making AI models bigger and feeding them more data, researchers found a way to make them think more carefully about problems - much like how we might take extra time to think through a difficult puzzle.
Three Major Developments That Changed the Game
1. OpenAI Shows a New Path
It started when OpenAI released o1-preview in September 2024, after months of rumours and leaked details about an underlying project codenamed Strawberry. Rather than having the AI rush to answer questions immediately, they let it take its time and work through problems step by step. The results were impressive enough to crush existing benchmarks, and shortly after releasing the full version, o1, OpenAI announced o3, which is expected to be significantly more capable.
2. Microsoft's Chess-Like Approach
In early January 2025, Microsoft revealed they could make even a small AI model solve complex problems using an approach borrowed from chess computers. Their method builds on Monte Carlo Tree Search (MCTS), which lets the AI explore different solution paths systematically before choosing the best one.
Think of it like planning moves in chess: instead of making the first move that looks good, the AI considers many possible sequences of moves and their outcomes. This systematic exploration helps it find better solutions than just "going with its gut."
The result? A 7B model, which can be run on regular consumer hardware, outperformed o1-preview on math-specific tasks.
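To make the chess analogy concrete, here's a minimal, self-contained sketch of the MCTS loop (selection, expansion, simulation, backpropagation) applied to choosing between candidate reasoning steps. The `propose_steps` and `score_solution` functions are hypothetical stand-ins for an LLM proposing next steps and a verifier scoring finished attempts; this illustrates the general technique, not Microsoft's actual implementation.

```python
import math
import random

# Minimal MCTS sketch. propose_steps and score_solution are hypothetical
# stand-ins: in a real system, an LLM would propose candidate next reasoning
# steps and a verifier (e.g. an answer checker) would score finished attempts.

def propose_steps(state):
    return [f"{state} -> step{i}" for i in range(3)]

def score_solution(state):
    return random.random()  # placeholder for a real verifier's 0..1 score

class Node:
    def __init__(self, state, parent=None):
        self.state, self.parent = state, parent
        self.children, self.visits, self.value = [], 0, 0.0

    def ucb1(self, c=1.4):
        # Balance exploiting good paths against exploring untried ones.
        if self.visits == 0:
            return float("inf")
        return (self.value / self.visits
                + c * math.sqrt(math.log(self.parent.visits) / self.visits))

def mcts(problem, iterations=200, max_depth=4):
    root = Node(problem)
    for _ in range(iterations):
        # 1. Selection: descend, always taking the highest-UCB child.
        node = root
        while node.children:
            node = max(node.children, key=Node.ucb1)
        # 2. Expansion: add candidate next reasoning steps as children.
        if node.state.count("->") < max_depth:
            node.children = [Node(s, node) for s in propose_steps(node.state)]
            node = random.choice(node.children)
        # 3. Simulation: estimate how promising this line of reasoning is.
        reward = score_solution(node.state)
        # 4. Backpropagation: update statistics along the path to the root.
        while node:
            node.visits += 1
            node.value += reward
            node = node.parent
    # Return the most-visited first step, the standard MCTS final choice.
    return max(root.children, key=lambda n: n.visits).state

print(mcts("problem"))
```

The key property is visible in the loop: paths that scored well get revisited and explored more deeply, instead of the model committing to the first plausible-looking step.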
3. DeepSeek's Revolution: Teaching AI to Think
A few weeks later, DeepSeek took a completely different approach. Instead of using chess-like planning, they used reinforcement learning - essentially letting the AI learn through trial and error, much like how humans learn from experience. The results matched OpenAI's o1, but with a crucial difference: their models were small enough to run on regular computers, with their smallest model even working on phones.
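The paper describes simple rule-based rewards rather than a learned judge: the model gets credit when its final answer is verifiably correct, plus a bonus for putting its reasoning in the expected format. Here's a rough sketch of what such a reward function might look like; the `<think>` tag convention follows the paper's description, but extracting the answer from `\boxed{...}` and the specific weights are my assumptions.

```python
import re

# Sketch of a rule-based reward in the spirit of DeepSeek-R1. The format
# check follows the paper's <think> tag convention; extracting the answer
# from \boxed{...} and the 0.1/1.0 weights are illustrative assumptions.

def reward(completion: str, reference_answer: str) -> float:
    score = 0.0
    # Format reward: the chain of thought should sit inside <think> tags.
    if re.search(r"<think>.+?</think>", completion, flags=re.DOTALL):
        score += 0.1
    # Accuracy reward: does the final boxed answer match the reference?
    match = re.search(r"\\boxed\{(.+?)\}", completion)
    if match and match.group(1).strip() == reference_answer.strip():
        score += 1.0
    return score

# Example: a well-formatted, correct completion earns the full reward.
sample = "<think>2+2 is 4, double-checking... yes.</think> \\boxed{4}"
print(reward(sample, "4"))  # 1.1
```

During RL, many completions are sampled per problem and the model is nudged toward the higher-scoring ones (the paper uses an algorithm called GRPO for this). No neural reward model is involved, which is part of why the recipe is so cheap.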
The "Aha Moment"
One of the most fascinating findings in DeepSeek's research was what they call the "aha moment." During training, their AI spontaneously developed the ability to catch and correct its own mistakes. It would sometimes stop mid-solution, say "Wait, wait. That's an aha moment I can flag here," and then revise its approach - something it wasn't explicitly trained to do.
How They Got There: A Journey of Trial and Error
What Didn't Work
DeepSeek tried several approaches that failed before finding success:
Process Reward Models: Trying to teach the AI to evaluate its own thinking process turned out to be unreliable
Pure MCTS: Microsoft's approach proved too complex for practical use with language generation
Pure Reinforcement Learning: Without any initial examples, early attempts produced models that mixed languages and were hard to understand
What Finally Worked
The successful approach combined two key elements:
"Cold Start": Training the model on a small set of high-quality examples showing good reasoning
Reinforcement Learning: Letting the model practice and improve through trial and error
Real-World Impact
This breakthrough has immediate practical implications:
Developers can now run powerful reasoning models locally instead of paying for API calls
Researchers can experiment with and improve these models using modest computing resources. Training runs can cost hundreds of dollars, not millions.
Small companies can deploy advanced AI capabilities without massive infrastructure costs
You don’t have to send your data to other companies in order to use AI.
DeepSeek didn’t stop at developing reasoning breakthroughs; they worked to make these capabilities accessible to everyone. Using a process called "distillation," they transferred the reasoning abilities of their larger models into smaller ones.
The performance numbers are stunning: DeepSeek's reasoning version of a 1.5B model (which can effectively run on a phone) can solve complex math problems that GPT-4o struggled with. Their 7B model exceeds GPT-4o and Sonnet 3.5 on some benchmarks.
They trained these smaller models on carefully curated examples, refining them to retain the reasoning skills of the larger systems. For instance, their 7B model surpasses previous state-of-the-art models like QwQ-32B on benchmarks, while being far easier to run. This approach makes it possible for developers, researchers, and even small companies to experiment with advanced AI reasoning without requiring massive infrastructure or costs.
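Mechanically, distillation here is straightforward: the large "teacher" model writes out full reasoning traces, and the small "student" model is fine-tuned on them with ordinary supervised learning. Here's a minimal sketch of the data-collection step; the stub teacher is a placeholder for the large reasoning model.

```python
import json

# Minimal sketch of the data-collection step in distillation: the large
# "teacher" model writes out full reasoning traces, and the small "student"
# is later fine-tuned on them with ordinary supervised learning. The stub
# teacher below is a placeholder for the large reasoning model.

def stub_teacher(problem: str) -> str:
    return f"<think>working through: {problem} ...</think> final answer"

def build_distillation_set(problems, generate=stub_teacher,
                           out_path="distill.jsonl"):
    with open(out_path, "w") as f:
        for problem in problems:
            trace = generate(problem)  # full chain of thought + final answer
            f.write(json.dumps({"prompt": problem,
                                "completion": trace}) + "\n")

build_distillation_set(["What is 12 * 37?"])
# The student is then fine-tuned on these (prompt, completion) pairs;
# no reinforcement learning is needed at this stage.
```

The counterintuitive finding the paper reports is that this plain supervised transfer worked better for small models than running reinforcement learning on them directly.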
Current Limitations
While these advances are exciting, there are still some important challenges:
The models can be sensitive to how you phrase prompts
They sometimes mix languages unexpectedly, especially in non-English contexts
They're not yet great at practical programming tasks due to long evaluation times
They still struggle with some everyday tasks that regular LLMs handle well, like function calling and complex role-playing
This likely means these models are not a full-on replacement for current LLMs, but a really good addition. Think of the difference between chat messages and e-mail: regular LLMs are the fast back-and-forth of chat, while a reasoning model is more like e-mail, where you delegate a task and wait for a considered reply.
Reasoning Models are NOT chatbots
Although they might look similar, reasoning models are fundamentally different from the AI chatbots we’re used to, like Sonnet 3.5 or Llama 3.2. While reasoning models can function as chatbots, their real strength lies elsewhere. Think of them more like consultants rather than conversational partners. Instead of engaging in quick back-and-forth interactions, they excel at tackling complex, multi-step problems and producing detailed, thoughtful outputs. It’s more akin to sending a well-thought-out email than having a quick chat.
For example, if you’re rewriting a text or need quick suggestions, a regular LLM is probably your best bet. However, if you’re solving a critical bug, planning a multi-step AI integration, or figuring out a complicated database structure, a reasoning model’s ability to analyze and problem-solve becomes invaluable. They take their time to think, often working through problems methodically and from different angles, but this comes at the cost of slower response times and less direct control over their output.
The key takeaway? Use reasoning models when you need deep problem-solving and structured outcomes, not when you’re looking for speed or iterative brainstorming.
How to Prompt Reasoning Models
These models don’t excel at back-and-forth, but they are really good at solving problems; in other words, they’re most effective with one-shot prompts.
These models are designed to problem-solve towards a solution, so start by giving the model a concrete problem to solve. Otherwise, it’ll waste a bunch of effort figuring out what it is you’re looking for.
Similarly, adding clear constraints, relevant edge cases, and explicit rules to follow helps guide the model in the right direction.
If you fail to provide a clear goal, these models can get stuck overthinking, wasting valuable time on irrelevant problems. I gave the smallest reasoning model a prompt that reads like the start of a common puzzle ("Alice, Sarah and Charlie are sat at a table"), which completely threw it off. The bigger models like DeepSeek R1 and o1 handle this better, but even for them, a clear goal steers the reasoning in the right direction.
The same goes for the output. Be clear about what you want the result to look like. Do you want it formatted as a table? Just the solution, or something else? Without clear instructions, the output format becomes unpredictable, which is a current problem with all reasoning models, including the big ones.
The last but definitely not least important part: Context. Explain yourself fully, add in any relevant context. This does not need to be structured.
Tip: Go back and forth with a regular LLM, have it ask you questions, or use transcription software to quickly ramble and add context.
Try to keep irrelevant information out of the context. It’s fine to slightly overshare to make sure everything is in there, but don’t be too careless: irrelevant details might be weighed in ways you don’t want, and valuable thinking time (compute) can go towards irrelevant reasoning steps.
Working with reasoning models effectively means:
Writing a clear goal that you want the model to work towards
Giving clear rules and pointing the model in the right direction with a clear problem statement
Providing lots of (potentially) relevant context upfront, so the model has what it needs to solve the issue
Being clear about the format of the end result
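Putting those four points together, here's a simple prompt skeleton I'd reach for. The GOAL / RULES / CONTEXT / OUTPUT FORMAT labels are just a convention that makes the structure obvious, not an official format these models require.

```python
# A one-shot prompt skeleton following the advice above. The section labels
# are an illustrative convention, not a required format.

goal = "Find why our nightly sync job deadlocks when two workers run at once."
rules = """- Must work on PostgreSQL 14; no schema changes allowed
- Explain the root cause before proposing any fix
- If information is missing, state your assumptions explicitly"""
context = "Relevant schema, worker code, and the deadlock log: ..."
output_format = "A numbered list of fixes, each with a one-line trade-off."

prompt = f"""GOAL: {goal}

RULES:
{rules}

CONTEXT:
{context}

OUTPUT FORMAT: {output_format}"""

print(prompt)
```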
Try it yourself
If you’d like to try it for yourself, there are a few options. You can try DeepSeek’s R1 model for free on their website, which should be about as good as o1. Do be aware, though: using it through their site comes with very poor terms and conditions. Assume that your data will be used for training the model further.
Want to run a local model? It’s easier than you might think. I would recommend LM Studio, which runs on almost any machine, is easy to set up, and doesn’t require specialist knowledge. They were incredibly quick to make the reasoning models available on their platform, so you can give it a try yourself! Check the staff picks - they recommend a few good options.
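Once a model is downloaded, LM Studio can also expose it through a local server that speaks an OpenAI-compatible API (at the time of writing, defaulting to port 1234), so existing tooling works unchanged. A minimal sketch follows; the model name is an example, so use whichever model you've actually loaded.

```python
from openai import OpenAI

# Talk to a model served locally by LM Studio through its OpenAI-compatible
# endpoint. base_url and port are LM Studio's defaults at the time of
# writing; the model name is an example - use whatever you've downloaded.

client = OpenAI(base_url="http://localhost:1234/v1", api_key="lm-studio")

response = client.chat.completions.create(
    model="deepseek-r1-distill-qwen-7b",
    messages=[{
        "role": "user",
        "content": "GOAL: ...\n\nRULES: ...\n\nCONTEXT: ...\n\nOUTPUT FORMAT: ...",
    }],
    temperature=0.6,  # a moderate temperature tends to suit the R1 distills
)
print(response.choices[0].message.content)
```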
Closing Thoughts
While OpenAI hinted at breakthrough reasoning techniques with o1, DeepSeek has done something remarkable: they've opened the black box. By detailing their reinforcement learning approach, they've shown exactly how AI can be taught to reason more systematically.
This isn't just an academic exercise. It means AI could soon tackle complex real-world problems more effectively—from scientific research and medical diagnostics to engineering challenges that require nuanced, step-by-step reasoning, without having to rely on a few big companies.
But before we get there, we first need to get used to using them, and figure out what we can reliably use them for.