Adventures in Ollama (Part 3)
Another day in the mighty series of days that is a week. Today we will be comparing deepseek-r1:70b with qwen3-coder:30b.
One might initially find this matchup unfair. After all, deepseek is a reasoning model and has more than 2x the parameters. But I have a feeling qwen3-coder:30b will surprise.
Deepseek-R1:70b
I put the prompts in, and waited just under a minute for it to think and start responding. The third run took 3 minutes to finish thinking. It’s slow for stuff like this. You would hope it was worth it. No, it’s useless.
- Run #1: the model produced a syntax error on line 67. It accidentally used a slash ("/") instead of a $ on a variable name (see the sketch after this list). It also did not call exit() at the end of the file.
- Run #2: the model omitted the closing ?> at the end of the script (which, to be fair, is legal in a pure-PHP file).
- Run #3 had an extra comma in an array and, somehow, used a Chinese word for a variable name(!!). It named the variable "内容" ('content') and then proceeded to check whether $content was empty. Third, it forgot the $ in front of a variable (again; see run #1). I corrected the errors and it ran.
- Run #4 didn’t have any obvious errors when I scanned the code.
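To make the sigil bug concrete, here's a hypothetical reconstruction of the kind of line that broke runs #1 and #3 (my own illustration, not deepseek's actual output; story.txt is a guess at the input filename):

```php
<?php
// What the model wrote, roughly: a "/" (or nothing) where PHP needs a "$".
// if (/content === '') { ... }  // parse error: unexpected "/"
// if (content === '') { ... }   // PHP 8 fatal: Undefined constant "content"

// Corrected version:
$content = file_get_contents('story.txt');
if ($content === false || $content === '') {
    exit("no input\n");
}
exit(0); // run #1 also skipped an explicit exit() at the end of the script
```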
Deepseek run #3 did not produce a JSON file, and run #4 did not produce a text file. The JSON was also inconsistent between runs, with one run emitting null values for w (the word key), which is an unacceptable result. In the end, only the first two deepseek runs were able to accurately reproduce the file. Let it be said: Deepseek-R1:70b for coding is a 50% pass rate, and that's not great.
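The null w values would have been easy to catch mechanically, too. Here's a minimal validation sketch, assuming the tokens land in a tokens.json array of objects with a w key (the filename and the overall shape are my assumptions; the post only names the w key):

```php
<?php
// Sanity check: every token must have a non-null "w" (word) value.
$tokens = json_decode(file_get_contents('tokens.json'), true);
if (!is_array($tokens)) {
    exit("tokens.json did not decode to an array\n");
}
foreach ($tokens as $i => $token) {
    if (!isset($token['w'])) { // isset() is also false when 'w' is null
        exit("token $i is missing 'w' or has it set to null\n");
    }
}
echo 'all ' . count($tokens) . " tokens have a word\n";
```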
Qwen3-coder:30b
The first thing I noticed is that it's very quick and has a different style. It looks like Llama3.3:70b's code at first (not-great comments), but on second look the comments are 'appropriate', I suppose. The program is also very verbose; that is, it prints a lot of information as it runs. I think I prefer gpt-oss's coding and comment style, but let's see how the tests run.
I also tried qwen3-coder:30b-a3b-fp16 and it seems to perform about the same.
One thing I noticed is that qwen doesn't like to talk much about its code when it's done; it just lets the code stand on its own. From the standpoint of learning and understanding what the model has done, I like GPT-OSS's commenting style and summary overview better. But this terser approach has its place, I suppose.
Testing Qwen3-coder
Now it’s time to test qwen3-coder!
In retrospect, I guess I could get used to Qwen's comment style. Maybe. It's okay, I guess. Qwen3-coder:30b did much better than deepseek, but there were still issues. Run #1 did not preserve whitespace, so it reproduced the text file without any spaces. An easy fix, but a big oops, since the model was specifically instructed to make the original text file reproducible from the tokens; it didn't grasp that this can't happen without indexing the spaces too. Notably, no other model or run has made that mistake so far (a whitespace-preserving approach is sketched below).
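For the record, the standard way to make a tokenizer round-trip-safe in PHP is to capture the whitespace as tokens too, so nothing is thrown away. A minimal sketch of that idea (my illustration, not qwen's code; story.txt is an assumed filename):

```php
<?php
// Round-trip-safe tokenization: keep the whitespace as tokens so that
// imploding the token list reproduces the original file byte-for-byte.
$text = file_get_contents('story.txt');

// PREG_SPLIT_DELIM_CAPTURE keeps the matched whitespace in the results;
// PREG_SPLIT_NO_EMPTY drops the empty strings around leading/trailing matches.
$tokens = preg_split('/(\s+)/', $text, -1,
    PREG_SPLIT_DELIM_CAPTURE | PREG_SPLIT_NO_EMPTY);

// Lossless round trip:
var_dump(implode('', $tokens) === $text); // bool(true)
```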
Runs 2, 3, and 4 went off without a hitch. So what happened with run #1? I thought maybe I had somehow entered the prompt wrong, so I ran it again, and the new version produced by qwen3-coder:30b had the story-2.txt filename bug! It seems that qwen3-coder is hit or miss, possibly more so than gpt-oss:20b.
Model Selection Strategy going forward
I was a bit shocked by qwen3-coder:30b's poor performance compared to gpt-oss:20b. Right now, the best models seem to be:
- gpt-oss:120b
- gpt-oss:20b
- qwen3-coder:30b
- Llama3.3:70b
Unsuitable for coding:
- Deepseek-R1:70b (fail)
The takeaway is that Deepseek-R1:70b doesn't really work for coding, so gpt-oss is going to be my coding model for now. As for other things, well, I haven't done a creative writing, summarization, or translation test yet, so I can't really say for those use cases. Yet!