Previously, I mentioned how I messed around with LLMs a bit. I’ve also made references to things I might release. I’m still not going to release them, but there are some things to note. I also hinted at some of those earlier when discussing Claude Code at work.
I simultaneously love and hate Claude Code.
Claude Sonnet 4.6 is a remarkably good model for what I end up using it for at work, and Claude Code seems to be a very good harness for it. I like this entire way of interacting with the LLM through a terminal interface. I like that it takes notes on what I tell it and chooses to remember things in a smart way. I like that I can feed it manuals and it will do things that I’m bored with. I like that I got it to reverse engineer ADP Workforce Now’s XHR payloads and give me JavaScript I can just inject into the developer console to bypass the single most tedious thing I have to do at work every week. I love that Claude, used effectively, whether in Claude Code or not, can basically lower the “activation energy” it takes for me to create something I would otherwise be too tired or busy to write myself. I cranked out a small Python script to poll my CPU temperature and change the RGB lights on my mouse accordingly. That sent me down a rabbit hole where I found out that my fan curve actually does nothing at various settings. Which has now given me yet another fan curve. Which has led me to learn that the system76-power service daemon actually spawns nvidia-smi processes every second to read GPU temperature. GUESS WHAT, I HAVE A NEW SIDE PROJECT NOW TO STOP ALL THAT SHIT! YAY!
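For the curious, the temperature-to-RGB script is roughly this shape. This is a from-memory sketch, not the exact script: it assumes psutil for the sensor read and an OpenRGB SDK server (via the openrgb-python package) for the mouse lighting, and the sensor chip names are guesses.

```python
#!/usr/bin/env python3
"""Poll CPU temperature and map it onto the mouse's RGB lighting.

Sketch only: assumes psutil for sensors and a running OpenRGB SDK
server on localhost. Chip names like "k10temp" are assumptions.
"""
import time

import psutil
from openrgb import OpenRGBClient
from openrgb.utils import RGBColor


def cpu_temp() -> float:
    # psutil surfaces lm-sensors data; the chip name varies by CPU
    # ("k10temp" on AMD, "coretemp" on Intel).
    for chip in ("k10temp", "coretemp"):
        readings = psutil.sensors_temperatures().get(chip)
        if readings:
            return readings[0].current
    raise RuntimeError("no CPU temperature sensor found")


def temp_to_color(t: float) -> RGBColor:
    # Linear blend from blue (cool, <=40C) to red (hot, >=85C).
    frac = min(max((t - 40.0) / 45.0, 0.0), 1.0)
    return RGBColor(int(255 * frac), 0, int(255 * (1 - frac)))


client = OpenRGBClient()  # connects to the OpenRGB SDK server
mouse = next(d for d in client.devices if "mouse" in d.name.lower())

while True:
    mouse.set_color(temp_to_color(cpu_temp()))
    time.sleep(2)
```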
I hate that Claude Code is actually a React app. Yes, this text-based silly thing is actually using React to render a textual interface; I find this morally wrong in a way I can’t fully articulate. And Claude Code had a bunch of Very Stupid Bugs last month, which didn’t directly affect me but made me wonder what it is they’re doing over there at Anthropic. More importantly, the recent game where it looked like they were taking away Claude Code from Pro subscriptions makes me reconsider whether or not I want to use this thing at all outside of work. I almost got a Pro subscription this week because I want to play with Claude Code outside of work. Now I’m thinking that’s probably a bad idea. Anthropic only aborted the rug pull when they got caught with their “testing”, but given the obvious compute shortage they’re suffering, I think it’s only a matter of time before they lock all the fun things behind Enterprise plans.
So… outside of work, I’ve been playing with more models, and also experimenting with harnesses and tools and other things. I have been playing with OpenCode. It is not as good as Claude Code, but I also haven’t used it with a model comparable to Sonnet (including… Sonnet itself); I’ve been using local models only. Some local models are passable for use in OpenCode. I have also started using Open WebUI to chat with my Ollama models, largely so I can do it from my iPad in other parts of the house. I avoided Open WebUI for a while because it seemed bloated and the maintainer seems a little insane, but it’s actually working kinda nicely for me. I’ll probably play with it more.
Since I did a list of models last time, here’s an updated list of models I’ve played around with, or changed my opinion on, lately:
- gemma4:e4b - My new favorite. I primarily use it as a ChatGPT replacement for the chat-about-nonsense-whenever use case. It’s a thinking model, which I generally dislike, but I like its thinking because it actually seems really focused and tends to result in better answers. Due to how the model loads in Ollama, it actually leaves a lot of VRAM free, so I can set num_ctx to 96k and not even notice any degradation in performance (see the sketch after this list).
- gemma4:26b - The bigger sister of my new favorite. I use 26b for agentic coding tasks. It’s decent. Not as good as Claude, but good enough to be useful most of the time. The only problem is that its thinking tags sometimes get mangled, so the response ends up stuck inside the thinking block. I can run it with a 64k context window without swapping much.
- gemma4:31b - This thing is just too slow on my machine; it outputs at about 1 token per second in the Ollama CLI.
- gpt-oss:20b - I used this before, but changes in Ollama now seem to make it run much, much faster than it used to, using only half of my RAM instead of all of it. I can run it with a 128k context window.
- qwen3.5:4b - This guy is speedy and pretty good. I don’t like its thinking much, though. “Wait! What about…” over and over again. That, and it’s Chinese, so it has the CCP political nonsense baked in, which is still worse than the woke Corporate America nonsense.
- qwen3.5:9b - I found that this is just too tight a fit for my GPU to run well, which is weird because I’ve run larger models just fine.
- qwen3-coder:30b - This model is really good, and it’s pretty good at writing code too. It’s really slow in OpenCode, though.
- deepseek-r1:8b - This is actually a distillation of DeepSeek into a Qwen 3 model. This sucks. I gave it one of my standard tasks to generate HTML, and it blew through the entire 4k context window generating thinking tokens, lost the original context entirely and ended up stuck in an infinite brainrot loop for 20 minutes. Never using this thing again.
- granite4:3b - For a model this tiny, it’s pretty good. IBM optimized it for tool calling and instruction following. It’s one of the only local models to successfully one-shot my “GO AWAY” HTML file request, which is saying something, because the only models that usually get close are 20b+. I use it to classify items on my Outlook calendar (there’s a sketch of that below, too).
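About that num_ctx thing: Ollama defaults to a fairly small context window, so you have to raise it yourself. Here’s a minimal sketch using the ollama Python client; the model tag matches the list above, and the prompt is obviously made up.

```python
# Minimal sketch: raising num_ctx per request through the ollama
# Python client. 96k here means 98304 tokens.
import ollama

resp = ollama.chat(
    model="gemma4:e4b",
    messages=[{"role": "user", "content": "Summarize this giant log for me…"}],
    options={"num_ctx": 98304},  # Ollama's default window is much smaller
)
print(resp["message"]["content"])
```

If you’d rather make it stick, `PARAMETER num_ctx 98304` in a Modelfile (then `ollama create`) does the same thing, and `/set parameter num_ctx 98304` works interactively in the CLI.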
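And since I mentioned the calendar thing: here’s the rough shape of the classifier. A sketch under assumptions, not my actual script: the categories are made up for illustration, and actually pulling the events out of Outlook is left out entirely.

```python
# Sketch of the calendar-classification idea, using the ollama Python
# client. Categories below are hypothetical placeholders.
import ollama

CATEGORIES = ["meeting", "interview", "focus block", "personal"]


def classify(event_title: str) -> str:
    resp = ollama.chat(
        model="granite4:3b",
        messages=[{
            "role": "user",
            "content": (
                "Classify this calendar event as exactly one of: "
                f"{', '.join(CATEGORIES)}. Reply with the label only.\n\n"
                f"Event: {event_title}"
            ),
        }],
        options={"temperature": 0},  # deterministic labels
    )
    label = resp["message"]["content"].strip().lower()
    return label if label in CATEGORIES else "unknown"


print(classify("1:1 with Sam"))
```

The temperature-0 bit matters more than you’d think for a 3b model; otherwise it occasionally editorializes instead of just giving you the label.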
Ok, now my dogs are begging for my attention, so I’m going to go deal with that.