Differences

This shows you the differences between two versions of the page.

--- flag_operations_are_free [2026/01/11 16:21] – appledog
+++ flag_operations_are_free [2026/01/11 23:19] (current) – appledog
@@ Line 15: / Line 15: @@
 Here's LDAL-1:
-<codify nasm>
+<codify armasm>
 ; Test program: 1 million LDAL [$1000] operations
 ; Uses CD (32-bit "count down" counter register)
@@ Line 67: / Line 67: @@
 * OVERFLOW_FLAG = value === 0x8000;
-That is a significant amount of flags. Having this make no impact whatsoever was surprising, so I removed the IF statements blocking these flags on DEC. This produces LDAL-2b, which surprised me by getting again the exact same 2.1 MIPS. So, over 2 million if statements wasn't moving the needle? That felt strange.
+That is a significant amount of code to remove, but ONE compare op was killing performance. Seeing this code make no impact whatsoever was surprising, so I removed the IF statements blocking these flags on DEC. This produced LDAL-2b, which surprised me again by scoring the exact same 2.1 MIPS. So over 2 million if statements AND two million runs of the five lines of code above weren't moving the needle? Wow.
-I replaced the flag fences and I created LDAL-3; this time, I had only 100,000 execution cycles, but 10 copies of LDAL. My heart lept when I saw the score; 7.55 MIPS! This meant that LDAL was executing much faster than the other instructions. I immediately created LDAL-4 which had 1,000 lines of LDAL and loaded CD with 1 million. The goal was simple: execute 1 billion LDAL instructions and time the result. The results were spectacular. 78 MIPS. I did try with CMP,0 and SEF mode, and it was slower (73 MIPS). The immediate conclusion is that SEF mode was useless. CMP was dragging everything down. But I didn't know why.
+I replaced the flag fences and created LDAL-3; this time, I had 100,000 runs of 10 LDAL operations. My heart leapt for joy when I saw the score: 7.55 MIPS! This meant that LDAL was executing much faster than the other instructions. I immediately created LDAL-4, which had 1,000 lines of LDAL and loaded CD with 1 million. The goal was simple: execute 1 billion LDAL instructions and time the result. The results were spectacular: 78 MIPS. I did try it with CMP CD, 0 and SEF mode, and it was slower (73 MIPS). The immediate conclusion was that SEF mode was useless. CMP was dragging everything down. But I didn't know why.
-For the record, I created versions which used LDA and LDAB
+I experimented with some other LD instructions. It turned out that LDBLX and LDAB were extremely slow, and when put into an unrolled loop they would drop to under 10 MIPS.
+<codify armasm>
+; Test program: 1 million LDAB [$1000] operations (1,000 loop passes x 1,000 unrolled loads)
+; Uses CD (32-bit "count down" counter register)
-MIPS With SEF & CMP CD, 0
+.address $010100
-MIPS With CLF & no CMP
+    LDCD #1000        ; Load 1,000 into CD (0x989680 is 10 mil)
+loop:
+    LDAB [$1000]          ; Many of these; x1000
+    DEC CD                ; Decrement CD
+    CMP CD, 0
+    JNZ loop              ; Jump to loop if CD != 0
+    HALT                  ; Halt when done
+</codify>
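The arithmetic behind a MIPS figure for a listing like the one above can be sketched as follows. This is only an illustration: the function names are hypothetical, and counting LDCD and HALT as one instruction each, plus three overhead instructions per pass (DEC, CMP, JNZ), is an assumption read off the listing, not a description of how the emulator actually counts.

```typescript
// Hypothetical helpers; the counting rules are assumptions taken from the listing.
function instructionsExecuted(iterations: number, unrolled: number, overhead: number): number {
  // 1 setup instruction (LDCD) + the loop body per pass + 1 final HALT
  return 1 + iterations * (unrolled + overhead) + 1;
}

// Millions of instructions per second for a measured wall-clock time.
function mips(instructions: number, seconds: number): number {
  return instructions / seconds / 1e6;
}

// 1,000 passes of 1,000 unrolled loads with DEC + CMP + JNZ as overhead.
const total = instructionsExecuted(1000, 1000, 3);

// Fraction of executed instructions that are the load under test.
const loadShare = (1000 * 1000) / total;

console.log(total, loadShare.toFixed(4));
```

With 1,000 unrolled loads per pass, the loop overhead is a fraction of a percent of the instruction count, which is why the unrolled tests isolate the cost of the load itself much better than a one-load loop.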
+The final conclusion was that my memory system was not optimized. One of the major issues was that I was creating an array in WebAssembly on every register access. I moved that out of the loop and inlined memory access directly into the LD/ST instructions. That brought LDA up to 87 MIPS and LDAB up to 55 MIPS. These were better numbers than before. I probably hadn't noticed how badly some instructions were weighing down the system.
+The turbo boost over and above this came from batching all the reads at the start of the opcode handler and then masking the value down depending on how we needed to access the registers. In closing, here's the 87.5 MIPS version of LDA [$addr]:
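As a rough illustration of the kind of change described here, the sketch below contrasts a read that builds a temporary array on every call with one that combines the bytes inline. It is not the emulator's code: `RAM`, `read16Alloc`, and `read16Inline` are hypothetical names, and little-endian byte order is an assumption.

```typescript
// Hypothetical 64 KiB memory for the sketch.
const RAM = new Uint8Array(1 << 16);

// Slow shape: a fresh temporary array is allocated on every access.
function read16Alloc(addr: number): number {
  const bytes = [RAM[addr], RAM[addr + 1]]; // new array each call
  return bytes[0] | (bytes[1] << 8);
}

// Faster shape: no temporary; the bytes are combined inline.
function read16Inline(addr: number): number {
  return RAM[addr] | (RAM[addr + 1] << 8);
}

RAM[0x1000] = 0x34;
RAM[0x1001] = 0x12;
console.log(read16Alloc(0x1000).toString(16), read16Inline(0x1000).toString(16));
```

Both functions return the same value; the difference is only that the second one gives the allocator nothing to do inside the hot loop.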
+<codify armasm>
+        case OP.LD_MEM: {
+            // Load reg (1 byte) + addr (3 bytes) = 4 bytes total
+            let instruction = load<u32>(RAM + IP);
+            let reg:u8 = instruction as u8;                    // Extract low byte
+            let addr = (instruction >> 8) & 0x00FFFFFF;      // Extract upper 3 bytes
+            // Pre-load 32 bits from target address
+            let value = load<u32>(RAM + addr);
+            let reg_index = reg & 0x0F;  // Extract physical register 0-15
+            IP += 4;
+            if (reg < 16) {
+                set_register_16bit(reg, value as u16);
+                ZERO_FLAG = (value & 0xFFFF) === 0;        // test only the 16 bits stored
+                NEGATIVE_FLAG = (value & 0x8000) !== 0;
+                //if (DEBUG) log(`$${hex24(IP_now)}    LD${reg_names(reg)} [$${hex24(addr)}] ; = ${hex16(value)}`);
+            } else if (reg < 48) {
+                set_register_8bit(reg, value as u8);
+                ZERO_FLAG = (value & 0xFF) === 0;          // test only the 8 bits stored
+                NEGATIVE_FLAG = (value & 0x80) !== 0;
+                //if (DEBUG) log(`$${hex24(IP_now)}    LD${reg_names(reg)} [$${hex24(addr)}] ; = ${hex8(value)}`);
+            } else if (reg < 80) {
+                set_register_24bit(reg, value & 0x00FFFFFF);
+                ZERO_FLAG = (value & 0x00FFFFFF) === 0;    // test only the 24 bits stored
+                NEGATIVE_FLAG = (value & 0x800000) !== 0;
+                //if (DEBUG) log(`$${hex24(IP_now)}    LD${reg_names(reg)} [$${hex24(addr)}] ; = ${hex24(value)}`);
+            } else {
+                set_register_32bit(reg, value);
+                ZERO_FLAG = value === 0;
+                NEGATIVE_FLAG = (value & 0x80000000) !== 0;
+                //if (DEBUG) log(`$${hex24(IP_now)}    LD${reg_names(reg)} [$${hex24(addr)}] ; = ${hex32(value)}`);
+            }
+            break;
+        }
+</codify>
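The decode-then-mask idea in the handler above can be tried outside the emulator with a small standalone sketch. The names `decode`, `widthBits`, and `flagsFor` are hypothetical; the register ranges (below 16, 48, and 80) are copied from the handler, and the sketch assumes the flags should reflect only the bits that end up in the register.

```typescript
interface Decoded {
  reg: number;  // low byte of the 32-bit instruction word
  addr: number; // upper 3 bytes of the instruction word
}

// Split one 32-bit read into the register byte and the 24-bit address.
function decode(instruction: number): Decoded {
  return { reg: instruction & 0xff, addr: (instruction >>> 8) & 0x00ffffff };
}

// Register width in bits, using the same ranges as the handler above.
function widthBits(reg: number): number {
  if (reg < 16) return 16;
  if (reg < 48) return 8;
  if (reg < 80) return 24;
  return 32;
}

// Mask the pre-loaded 32-bit value down to the register width, then derive flags.
function flagsFor(value: number, bits: number): { zero: boolean; negative: boolean } {
  const masked = bits === 32 ? value >>> 0 : value & ((1 << bits) - 1);
  return {
    zero: masked === 0,
    negative: ((masked >>> (bits - 1)) & 1) === 1, // top bit of the masked value
  };
}

const d = decode(0x00100005); // register 5, address $001000
console.log(d.reg, d.addr.toString(16), widthBits(d.reg));
```

Doing a single 32-bit read up front and masking afterwards keeps the memory access identical for every register width; only the cheap mask and flag tests differ per branch.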
+For more information, please contact Appledog.