Wrangling Qwen's Overthinking with Workflows

So I've been running Qwen3.5 122b a10b lately on the M2 Ultra (currently GLM 5 is sitting on the M3), and if you've used any of the Qwen3.5 family, you've probably seen or heard about the overthinking issue. The models are great if you either have a lot of time to kill while you wait for a response, or for more straightforward work if you kill the reasoning. The 35b a3b with reasoning disabled has been my workhorse for the past couple of weeks, and it is the greatest thing since sliced bread.

Anyhow, now that I want to use the 122b for actual hobby work, I've realized how painful the overthinking really is. I had a conversation a few days ago where I asked it to translate something simple. Not anything complex, just a straightforward translation request. It spat out over 5,000 tokens of reasoning before giving me the actual answer. I tested, and actually got a faster response by sending my request to GLM 5 with reasoning enabled, despite it being a 744b a40b model. It just thought so much less, because the request wasn't THAT complex.

I tried all of the Qwen recommended samplers, and even kicked up repetition penalty alongside their recommended presence penalty just to see what it would do. But nope; think think think. I also sleuthed around the net a bit and saw that several folks ultimately solved this with forced thinking budgets in the newer llama.cpp, but I'm not a huge fan of that; if the reasoning isn't done, it just gets cut off mid-thought, and you aren't really getting the benefit of reasoning at all.

So after banging my head on this for a bit, I went back to something I used to do when reasoning models were newer and their CoT actually hurt more than it helped: Wilmer workflows to the rescue.

What I ended up doing was disabling Qwen3.5's native reasoning entirely. I'm passing enable_thinking: false into chat_template_kwargs through the llama.cpp server payload to disable thinking, then I built a workflow that handles the chain-of-thought process manually.
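As a minimal sketch, the request body looks something like this (the helper name and the exact messages are mine, not from my actual setup; recent llama.cpp server builds forward `chat_template_kwargs` into the Jinja chat template, and Qwen's template checks `enable_thinking` to decide whether to emit the think block):

```python
import json


def build_payload(messages):
    """Build a /v1/chat/completions payload for llama.cpp's server
    with Qwen's native reasoning turned off via the chat template."""
    return {
        "messages": messages,
        # Forwarded into the chat template; Qwen's template skips the
        # thinking block when enable_thinking is false.
        "chat_template_kwargs": {"enable_thinking": False},
    }


body = json.dumps(build_payload(
    [{"role": "user", "content": "Translate 'good morning' into French."}]
))
# POST `body` to http://localhost:8080/v1/chat/completions (default llama-server port)
```
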

The workflow does the usual context gathering that my setups always do, and then right before the final response there's a dedicated "thinking" node. This node gets all the context and produces a chain-of-thought analysis that then feeds into the responder node.
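A rough sketch of that two-node shape (here `call_model` stands in for whatever backend call your workflow engine makes, and the prompt strings are illustrative, not Wilmer's actual config):

```python
def run_workflow(chat, call_model):
    """Two-node pass: a dedicated thinking node produces the CoT,
    then the responder sees that CoT alongside the conversation."""
    thinking_prompt = (
        "Produce a concise chain-of-thought for the user's request, "
        "scaled to its complexity. End with a brief response plan."
    )
    responder_prompt = (
        "Answer the user. A reasoning plan from a prior step is included; "
        "follow it unless it is clearly wrong."
    )
    # Node 1: manual chain-of-thought over the gathered context.
    cot = call_model(thinking_prompt, chat)
    # Node 2: final response, with the CoT injected as extra context.
    plan_msg = {"role": "system", "content": "Reasoning plan:\n" + cot}
    return call_model(responder_prompt, chat + [plan_msg])
```

The key design point is that the CoT lives in its own call, so the responder's output is clean and the reasoning step can be prompted, capped, or swapped independently.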

Rather than wing the CoT, since things have probably changed a bit since the last time I did that in 2024 (lol), I had Claude do a deep research pass on how DeepSeek and GLM 4.7 structure their reasoning internally, to see if I could get some ideas. In my experience, both of those do amazingly well at CoT.

DeepSeek-R1 ended up having the most info available; it followed a four-phase pattern of problem definition, decomposition, reconstruction cycles, and final decision. The reconstruction cycles are where it either ruminates or genuinely tries new approaches. GLM 4.7 does something called interleaved thinking, where it reasons before each response and each tool call, not just at the start.

The research I found showed something interesting. Incorrect solutions have more and longer reconstruction cycles than correct ones. There's a problem-specific sweet spot for reasoning length. As we already knew: more reasoning doesn't always mean better answers. In fact, R1 had a bad habit of ruminating, re-examining the same formulations repeatedly, which actually hurts its ability to find novel solutions.

It was an overthinker, too; just not as bad as Qwen.

Anyhow, long story long: I took all that and threw together a new CoT prompt in a new node just before the responder. The model has to assess complexity first and scale its effort accordingly; a simple greeting gets maybe two or three sentences of thought, while a multi-step coding problem gets a thorough breakdown. Then it has to work through the problem, verify its reasoning, and output a response plan. If it catches itself repeating the same line of reasoning, it's instructed to stop and either move on or try a genuinely different approach.
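In prompt form, that node's instructions boil down to something like this (my paraphrase of the rules above, not the exact prompt from my workflow):

```python
# Illustrative CoT prompt for the thinking node: complexity-scaled effort,
# verification, and an explicit anti-rumination escape hatch.
COT_PROMPT = """Before answering, think through the request in plain text.
1. Rate the request's complexity: trivial, moderate, or complex.
2. Scale your effort to that rating: a trivial request gets two or three
   sentences of thought; a multi-step problem gets a full breakdown.
3. Work through the problem step by step, then verify your reasoning.
4. If you catch yourself repeating the same line of reasoning, stop:
   either move on or try a genuinely different approach.
5. Finish with a short response plan for the final answer."""
```
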

Despite Qwen3.5 122b not being trained for this, the results have been solid. Instead of 5,000+ tokens of circular thinking on a simple translation, I'm seeing 900 to 1,500 tokens on that same request. The quality of the final responses seems about the same, maybe slightly better because the thinking is actually structured rather than meandering. And despite making two separate model calls instead of one, the total response time is lower because I'm not burning tokens on endless rumination.

This isn't a new idea. I had to do this two years ago as well; it's just funny that I'm circling back to it now with one of the most powerful models out there.

Anyhow, that's how I got Qwen3.5 to behave. Your mileage may vary. But if you've got a workflow system set up and you're willing to spend some time on prompt engineering, there's a lot you can do to tame a model that doesn't self-regulate well.