A large part of what I’ve been doing for fun lately has involved playing with large language models. Specifically, I’ve been trying to build applications on top of them that do potentially useful things. “Useful” has an elastic definition.
My boss has suggested that any time we have a problem at work that needs solving, we should try to use AI instead of hiring more people. The main problem I have at work is that business decisions are made too slowly, by people who are frequently traveling and who talk too much whenever you actually do manage to get them in a room. Hence, we don’t have high-quality discussions that result in decisive action. I thought it would be perfect to solve this by creating a multi-agent team that could augment or replace the entire C-suite. My thinking was two-fold:
- If the problem is that the communication loop is too slow, agents could talk to each other much faster and therefore accelerate decision making. They’d probably come up with slop, but it would still be overall faster to have a bunch of slop to talk about than to have nothing and fumble about in the dark.
- People talk a great deal about AI replacing coders, but my thought is that will never completely happen. Writing code is very detailed work, and even very good LLMs are still bad at details. A human being with expertise needs to check it. You know what kind of work there is where details matter less? Executive leadership. The details are quite literally beneath them.
Work has ChatGPT Enterprise, but IT didn’t give me any API keys. I told them this side project wasn’t that important (I didn’t tell them what it was), so I’m sure it fell straight to the bottom of their priority list. I’m actually perfectly OK with that, because there are other things on their list I’d rather have them do. But I still wanted to try this crazy idea and play with LLMs programmatically.
I ended up installing Ollama on my own machine, along with a bunch of open models. I have an NVIDIA RTX 2070 with 8 GB of VRAM and 32 GB of system RAM, so I was surprised at how well this worked for many models. Just a few that I tried (there’s a minimal example of how I talk to them from Python after the list):
- gpt-oss:20b - Too big to fit on my graphics card; I think it split 50/50 between CPU and GPU. That makes it slow to respond. But the responses are pretty high quality, and you can see it “thinking.” The thinking reminds me of how someone with social anxiety would attempt to navigate a conversation (this person called me “Gippity”; this is a nickname. I should respond politely.).
- phi3:3.8b - Very fast, but not very capable. I had a conversation with it about some technical topic and it degraded into replying in Chinese when comparing Rust to Python; that might have been raw training data leaking through. I suppose it’s good at providing summaries, but not much else.
- gemma3:4b - This model is really good for how small it is. Very fast but still capable. It does hallucinate wildly though; I would not trust any answers it gives as factual. If you ask it “Is Windows 11 gay?” it will reliably hallucinate in great detail a new dumb culture war controversy that never happened, complete with links to articles on The Verge and Wired that don’t exist.
- mixtral:8x7b - Way too big for my machine; it takes up all of my VRAM and almost all of my system RAM, which makes it slow. But it was extremely capable. It had an annoying habit of filling in the next prompt for me, especially when it wanted to force a topic change because I had asked about something it wasn’t permitted to discuss.
- llama3.1:8b - The best model so far for my machine. It is highly capable and still fits entirely on my GPU. It also is a lot less likely to hallucinate than other models. It’s become my default model for the applications I’m running locally.
- dolphin3:8b - This is basically llama3.1:8b but without any of the guardrails or censorship. I haven’t played with it much, but it will imitate Adolf Hitler when llama3.1 refuses to.
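For reference, this is roughly how I’ve been poking at these models from Python. It’s a minimal sketch using the `ollama` Python client against a default local Ollama install; the model name and prompt are just placeholders, and the history-keeping is my own little convenience, not anything the library requires.

```python
# Minimal chat loop against a local Ollama server using the official Python client
# (pip install ollama). Assumes Ollama is running on its default port and the model
# has already been pulled, e.g. `ollama pull llama3.1:8b`.
import ollama

history = []  # running conversation, so the model keeps context between turns

def ask(prompt: str, model: str = "llama3.1:8b") -> str:
    """Send a prompt along with the conversation so far and return the reply."""
    history.append({"role": "user", "content": prompt})
    response = ollama.chat(model=model, messages=history)
    reply = response["message"]["content"]
    history.append({"role": "assistant", "content": reply})
    return reply

if __name__ == "__main__":
    print(ask("Compare Rust and Python in two sentences."))
```

Swapping the `model` argument is all it takes to run the same conversation through any of the models above, which made comparing them pretty painless.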
I tried to realize my “multi-agent executive” idea by wiring a bunch of models together with AutoGen and applying the RAPID decision-making framework from Bain & Company to them (I saw an ad for it on LinkedIn and thought it was wacky enough to try). The output it produced was genuinely awful, and read like Soviet bureaucrats role-playing American business executives. It was enough to discourage me completely from the idea, at least for now.
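For the curious, the wiring looked roughly like this. It’s a sketch, assuming the classic pyautogen GroupChat API and Ollama’s OpenAI-compatible endpoint; the agent names, system messages, and the kickoff question are made up for illustration. RAPID here just means each agent gets one of the Recommend / Agree / Perform / Input / Decide roles.

```python
# Sketch of the "multi-agent executive" experiment: local models in an AutoGen
# group chat, each assigned one RAPID role. Assumes the classic pyautogen API
# (pip install pyautogen) and Ollama's OpenAI-compatible endpoint.
import autogen

# Any local Ollama model exposed through the OpenAI-compatible API.
config_list = [{
    "model": "llama3.1:8b",
    "base_url": "http://localhost:11434/v1",
    "api_key": "ollama",  # Ollama ignores the key, but the client requires one
}]
llm_config = {"config_list": config_list, "temperature": 0.7}

roles = {
    "Recommender": "You hold the R in RAPID. Propose a concrete recommendation.",
    "Agreer": "You hold the A in RAPID. Push back on or formally agree with the recommendation.",
    "Performer": "You hold the P in RAPID. Explain how you would execute the decision.",
    "Input": "You hold the I in RAPID. Supply relevant facts and constraints.",
    "Decider": "You hold the D in RAPID. Make the final call and state it plainly.",
}

executives = [
    autogen.AssistantAgent(name=name, system_message=msg, llm_config=llm_config)
    for name, msg in roles.items()
]

# A non-interactive proxy whose only job is to kick off the meeting.
me = autogen.UserProxyAgent(
    name="chief_of_staff",
    human_input_mode="NEVER",
    code_execution_config=False,
)

groupchat = autogen.GroupChat(agents=[me] + executives, messages=[], max_round=12)
manager = autogen.GroupChatManager(groupchat=groupchat, llm_config=llm_config)

me.initiate_chat(manager, message="Should we open an office in Lisbon next year?")
```

With every role backed by the same 8B model, the transcript converged on confident-sounding consensus almost immediately, which is part of why the output read the way it did.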
I’ve also been playing a lot with RAG approaches. I’ll document those in another blog post, because I’ll probably put one of them up here as a side project to play with.