“Ship’s Computer” is the name I ended up giving the latest app I’ve added to Starbase Zebra. It is a Retrieval Augmented Generation (RAG) system for answering questions about the Original Series of Star Trek based on episode transcripts. This was a fun little project that I’ve had going for the past several months, and I’ve finally decided to release it semi-publicly. You can check it out if I’ve given you an access code.
Last year, I started spending a lot of time playing with local LLMs. As I mentioned previously, some of my initial attempts at doing something with them led me to learn about RAG. My first idea for RAG was to see if I could use it to make an LLM act as a fictional character. I happen to have a backup of a very large database of content that I wrote, primarily from one character’s perspective. I wanted to know whether feeding all of that content into a vector database, and then injecting relevant results into an LLM’s context, could get it to talk in the voice of that character. The answer was: no, not really; the kinds of small models I was using would, more often than not, fall back on their training data or insert details that were incorrect.
However, what I did notice was that this setup was actually very good at answering questions about the text. That is, if I asked Llama 3.1 8B to tell me who people were and how they were related to each other, it would usually give the correct answer from the context. This impressed me, because the characters here had some complicated stuff going on (alternate universe counterparts, flashbacks in other universes, love triangles, children from other parents, etc.), but Llama was still able to keep up with it quite well.
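To make the "retrieve, then inject into the context" idea concrete, here is a minimal sketch. It is not the pipeline described in this post: it assumes a toy word-count similarity in place of a real embedding model and vector database, and the function names (`embed`, `retrieve`, `build_prompt`) and the sample passages are all made up for illustration.

```python
import math
import re
from collections import Counter


def embed(text):
    """Toy 'embedding': lowercase word counts. A real system would use an embedding model."""
    return Counter(re.findall(r"[a-z0-9']+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(count * b[word] for word, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def retrieve(question, passages, k=2):
    """Return the k passages most similar to the question."""
    q = embed(question)
    ranked = sorted(passages, key=lambda p: cosine(q, embed(p)), reverse=True)
    return ranked[:k]


def build_prompt(question, passages, k=2):
    """Inject the retrieved passages into the prompt ahead of the question."""
    context = "\n".join(f"- {p}" for p in retrieve(question, passages, k))
    return (
        "Answer using only the context below.\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}"
    )


# Made-up placeholder passages, not real data from any corpus.
PASSAGES = [
    "Alice is the sister of Bob and lives on the station.",
    "The cargo bay was sealed after the coolant leak.",
    "Bob married Carol in the alternate universe.",
]

if __name__ == "__main__":
    print(build_prompt("Who is the sister of Bob?", PASSAGES))
```

The string returned by `build_prompt` is what would be sent to the model. The model only has to read the answer out of the injected passages, which is a much easier job than imitating a character's voice.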
Around the same time I was doing that experiment, I had stumbled upon someone’s attempt to use ChatGPT to simulate the experience of a computer on Star Trek. It was on their own webpage, and they invited people to ask any questions about the Star Trek universe and promised that the computer would give accurate answers. I tried this, and was thoroughly unimpressed by the chatbot’s inability to answer basic in-universe trivia questions about Star Trek. Further examination using my browser’s developer tools revealed that the application was nothing more than a front-end to ChatGPT-4o with a special system prompt telling it to pretend to be a Star Trek computer, all written in client-side JavaScript, complete with a perfectly valid OpenAI API key in plain view for anyone to steal. I have lost this link, but even if I had it I would not share it, because I bet that guy still hasn’t rotated his key or kept it safe. Anyway, I immediately knew that I could do better than this; I could use the same pipeline I used with my own fictional content to ingest Star Trek episode transcripts, and Llama 3.1 8B with RAG could beat ChatGPT-4o without RAG.
Transcripts of Star Trek episodes are readily available, so I had a corpus I could ingest. I deliberately chose to limit myself to the Original Series of Star Trek, because I happened to have exactly 1000 multiple choice Star Trek Original Series trivia questions that I could use to evaluate the model’s performance objectively. Most of the questions had four possible answers to choose from, and Llama 3.1 8B with the RAG system got the right answer about 60% of the time. That is actually pretty good; without RAG, the same model got only 38% of the questions right.
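Scoring a model on multiple choice questions like these comes down to counting how often it picks the right option. The sketch below shows one way that could look; the question format, the `accuracy` function, and the `always_first` stand-in for a model are all hypothetical, and the sample questions are placeholders rather than anything from the real trivia set.

```python
def accuracy(questions, answer_fn):
    """Fraction of questions where answer_fn picks the correct choice index."""
    if not questions:
        return 0.0
    correct = sum(
        1 for q in questions
        if answer_fn(q["question"], q["choices"]) == q["answer"]
    )
    return correct / len(questions)


def always_first(question, choices):
    """Stand-in 'model' that always picks the first choice (hypothetical)."""
    return 0


# Placeholder questions; "answer" is the index of the correct choice.
SAMPLE = [
    {"question": "Placeholder question 1", "choices": ["A", "B", "C", "D"], "answer": 0},
    {"question": "Placeholder question 2", "choices": ["A", "B", "C", "D"], "answer": 2},
    {"question": "Placeholder question 3", "choices": ["A", "B", "C", "D"], "answer": 0},
    {"question": "Placeholder question 4", "choices": ["A", "B", "C", "D"], "answer": 1},
]

if __name__ == "__main__":
    print(f"accuracy: {accuracy(SAMPLE, always_first):.0%}")
```

In a real run, `answer_fn` would call the model (with or without retrieved context) and parse which option it chose. For scale, blind guessing on a four-choice question is right 25% of the time.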
I think I’ll write a separate post about how I converted this thing from a pile of scripts running on my own PC to the version that is currently in production on Starbase Zebra, along with details about model evaluation. It’s interesting technical detail, but I’m a little too tired to write it all right now.