Flag Operations are Free

By Appledog
January 11th, 2026

Abstract

Profiling my CPU emulator to find out what was slowing it down produced some surprising insights into how an emulator should be implemented.

Issue

Sometime during the development of the SD-8516 virtual retro CPU, the processing speed dropped from 60 MIPS to under 15 MIPS. I first blamed my Mac, since that was where I had benchmarked it and it only reached 20 MIPS. I was pretty upset and went down a rabbit hole of installing various browsers, enabling SIMD, and compiling in C. The truth was quite different. Decisions I had made while implementing various opcodes, decisions that struck at the heart of the ISA itself, meant that certain instructions were dragging down the overall performance of the CPU.

I began by creating a series of profiling programs. I will refer to them by the name of the opcode being profiled plus a number, e.g. LDAL-1, LDAL-2, and so forth.

To make a very long story short, here are the programs, the results, and the conclusions:

Here's LDAL-1:

; Test program: 1 million LDAL [$1000] operations
; Uses CD (32-bit "count down" counter register)

.address $010100

    SEF ;; fast flags mode : do not perform flag ops for LD and DEC operations.
    LDCD #1000000        ; Load 1,000,000 into CD (0x989680 is 10 mil)

loop:
    LDAL [$1000]          ; Load AL from address $1000
    DEC CD                ; Decrement CD
    CMP CD, 0
    JNZ loop              ; Jump to loop if CD != 0

    HALT                  ; Halt when done

This is a pretty simple take on a simple concept: execute 1 million LDAL operations and see what happens. The result was a MIPS score of 1.85. I became depressed. How had my beautiful CPU become so slow? Just a few weeks ago it was pulling over 60 MIPS. Now it was showing scores that didn't make sense.
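
For reference, here is a minimal sketch of how a MIPS score like this can be derived. It assumes the score counts every executed instruction, including SEF, LDCD and HALT; the function names are hypothetical and are not taken from the SD-8516 source.

```javascript
// Hypothetical helper: millions of instructions per second.
function mips(instructionCount, elapsedMs) {
  return instructionCount / (elapsedMs / 1000) / 1e6;
}

// Instructions executed by LDAL-1 for a given loop count (assumed accounting).
function ldal1InstructionCount(iterations) {
  const setup = 2;    // SEF, LDCD
  const perPass = 4;  // LDAL, DEC, CMP, JNZ
  const teardown = 1; // HALT
  return setup + perPass * iterations + teardown;
}

const count = ldal1InstructionCount(1000000); // 4000003
// Run time implied by a score of 1.85 MIPS, in seconds.
const impliedSeconds = count / 1.85e6;
console.log(count, impliedSeconds.toFixed(2));
```

Under those assumptions, a score of 1.85 MIPS means the whole run took a little over two seconds.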

This was, in fact, the reason I added the SEF instruction you see above. Desperate to find the source of the slowdown, I had commented out all of the debug IF checks and set up a fence around most of the flag operations. Changing SEF to CLF above gives us LDAL-2, which turns flag updates back on for LD and DEC. It does not change the behavior of this program, since we explicitly check for zero with CMP.

The results of LDAL-2 shocked me. Even with fast flags mode turned off, the program stayed locked at 1.85 MIPS over multiple runs. In other words, even though there were over 4 million additional operations to set FLAGS data, the processing time did not change at all.
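
To make the "fence" concrete, here is a rough sketch of what a fast-flags guard around an 8-bit load might look like. The `cpu` object, its field names, and the function bodies are assumptions for illustration only; they are not the actual SD-8516 implementation.

```javascript
// Hypothetical CPU state; field names are illustrative only.
const cpu = {
  AL: 0,
  fastFlags: false,
  zero: false,
  negative: false,
  mem: new Uint8Array(0x2000),
};

function SEF() { cpu.fastFlags = true; }  // fast flags on: skip flag updates
function CLF() { cpu.fastFlags = false; } // fast flags off: perform flag updates

function LDAL(address) {
  const value = cpu.mem[address];
  cpu.AL = value;
  if (!cpu.fastFlags) { // the fence around the flag operations
    cpu.zero = (value & 0xFF) === 0;
    cpu.negative = (value & 0x80) !== 0;
  }
}

// With CLF, loading 0x80 sets the negative flag and clears the zero flag.
cpu.mem[0x1000] = 0x80;
CLF();
LDAL(0x1000);
console.log(cpu.AL, cpu.zero, cpu.negative);
```

The fence is a single branch per instruction, and the work it guards is a couple of masks and compares, which gives a sense of how little there is to save by skipping it.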

Next I moved to LDAL-3 where I removed the CMP since it was no longer needed:

; Test program: 1 million LDAL [$1000] operations
; Uses CD (32-bit "count down" counter register)

.address $010100

    CLF                  ; fast flags off : perform flag ops for LD and DEC operations.
    LDCD #1000000        ; Load 1,000,000 into CD (0x989680 is 10 mil)

loop:
    LDAL [$1000]         ; Load AL from address $1000
    DEC CD               ; Decrement CD
;   CMP CD, 0            ; removed since DEC CD will set zero flag if it DECs from 1 to 0.
    JNZ loop             ; Jump to loop if CD != 0

    HALT                 ; Halt when done

Now this was a real eye opener. Removing the explicit CMP while keeping the flag ops ON resulted in a MIPS score of 2.1! Well now, this was surprising but not entirely unexpected. Well, no, it was unexpected. Skipping the flag operations for LD and DEC should have been significant, since each of them executes 1 million times. Here's the code that we're talking about:

// Flag updates after an 8-bit load (LD)
ZERO_FLAG = (value & 0xFF) === 0;
NEGATIVE_FLAG = (value & 0x80) !== 0;
// Flag updates after a decrement (DEC)
ZERO_FLAG = result === 0;
NEGATIVE_FLAG = (result & 0x8000) !== 0;
OVERFLOW_FLAG = value === 0x8000;

That is a significant amount of code to skip, yet it made no difference, while ONE compare op was killing performance. Since the flag code had no impact whatsoever, I removed the IF statements fencing these flags on DEC. This produced LDAL-2b, which surprised me by getting, again, the exact same 2.1 MIPS. So over 2 million IF statements AND two million runs of the five lines of code above weren't moving the needle? Wow.

I put the flag fences back and created a new LDAL-3; this time, I had 100,000 runs of 10 LDAL operations. My heart leapt for joy when I saw the score: 7.55 MIPS! This meant that LDAL was executing much faster than the other instructions. I immediately created LDAL-4, which had 1,000 lines of LDAL and loaded CD with 1 million. The goal was simple: execute 1 billion LDAL instructions and time the result. The results were spectacular: 78 MIPS. I also tried it with CMP CD, 0 and SEF mode, and it was slower (73 MIPS). The immediate conclusion was that SEF mode was useless. CMP was dragging everything down. But I didn't know why.
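
One way to see what each of these benchmarks is really measuring is to work out how much of the instruction stream is LDAL and how much is loop overhead. The sketch below assumes the overhead per pass is just DEC and JNZ (CMP removed); the function names are hypothetical.

```javascript
// Fraction of executed instructions that are loads, per loop pass.
function loadShare(loadsPerPass, overheadPerPass) {
  return loadsPerPass / (loadsPerPass + overheadPerPass);
}

// Total instructions executed inside the loop.
function totalInstructions(passes, loadsPerPass, overheadPerPass) {
  return passes * (loadsPerPass + overheadPerPass);
}

console.log(loadShare(1, 2).toFixed(3));    // one LDAL per pass
console.log(loadShare(10, 2).toFixed(3));   // 10 LDAL per pass
console.log(loadShare(1000, 2).toFixed(3)); // 1,000 LDAL per pass
console.log(totalInstructions(1000000, 1000, 2));
```

With one LDAL per pass, only a third of the instructions are loads, so the score mostly reflects the cost of the loop instructions. With 1,000 per pass, loads make up more than 99% of the stream, so the score is close to the speed of LDAL itself.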

I experimented with some other LD instructions. It turned out that LDBLX and LDAB were extremely slow, just as slow as CMP. I once again tested CMP with and without SEF/CLF just to confirm: yes, one CMP operation was many times slower than millions of by-the-way flag updates. Adding a CMP lowered the score to 73 MIPS; removing it got us back over 78.

The final conclusion was that my memory system was not optimized. One of the major issues was that I was creating an array in WebAssembly on every register access. I moved that out of the loop and saw MIPS return to normal. In fact, it was better than normal: for ordinary load and store operations I was at 55 MIPS.
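
The general shape of that fix can be sketched as follows. This is plain JavaScript with a typed-array view standing in for whatever array was being created; the buffer, the counter, and the function names are all assumptions for illustration and do not come from the SD-8516 source.

```javascript
// Hypothetical register file backed by an ArrayBuffer.
const buffer = new ArrayBuffer(16);
let allocations = 0; // counts how many views the slow path creates

// Slow path: builds a fresh typed-array view on every register read.
function readRegisterSlow(index) {
  allocations++;
  const view = new Uint8Array(buffer);
  return view[index];
}

// Fast path: the view is created once, outside the hot loop.
const registers = new Uint8Array(buffer);
function readRegisterFast(index) {
  return registers[index];
}

registers[3] = 42;
console.log(readRegisterSlow(3), readRegisterFast(3), allocations);
```

Both functions return the same value, but the slow path creates a new object on every call, which adds allocation and garbage collection work to every instruction that touches a register.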
