flag_operations_are_free

flag_operations_are_free [2026/01/11 14:58] appledog
flag_operations_are_free [2026/01/11 23:19] (current) appledog
Line 15: Line 15:
 Here's LDAL-1:
 
-<Code:Assembler>
+<codify armasm>
 ; Test program: 1 million LDAL [$1000] operations
 ; Uses CD (32-bit "count down" counter register)
Line 31: Line 31:
  
     HALT                  ; Halt when done
-</Code>
+</codify>
 
 This is a pretty simple take on a simple concept: execute 1 million LDAL operations and see what happens. The result was a MIPS score of 1.85. I became depressed. How had my beautiful CPU become so slow? Just a few weeks ago it was pulling over 60 MIPS. Now, it was showing scores that didn't make sense.
Line 41: Line 41:
 Next I moved to LDAL-3 where I removed the CMP since it was no longer needed:
 
-<Code:Assembly>
+<codify armasm>
 ; Test program: 1 million LDAL [$1000] operations
 ; Uses CD (32-bit "count down" counter register)
Line 57: Line 57:
  
     HALT                 ; Halt when done
-</Code>
+</codify>
 
 Now this was a real eye-opener. Removing the explicit check and keeping the flag ops ON resulted in a MIPS score of 2.1! Well now, this was surprising but not entirely unexpected. Well, no, it was unexpected. Removing flag operations for LD and DEC is significant as they are both being executed 1 million times each. Here's the code that we're talking about:
Line 67: Line 67:
 * OVERFLOW_FLAG = value === 0x8000;
  
-That is a significant amount of flags. Having this make no impact whatsoever was surprising, so I removed the IF statements blocking these flags on DEC. This produces LDAL-2b, which surprised me by getting again the exact same 2.1 MIPS. So, over 2 million if statements wasn't moving the needle? That felt strange.
+That is a significant amount of code to remove, but ONE compare op was killing it. Having this make no impact whatsoever was surprising, so I removed the IF statements blocking these flags on DEC. This produced LDAL-2b, which surprised me by again getting the exact same 2.1 MIPS. So, over 2 million if statements AND two million executions of the five lines of code above weren't moving the needle? Wow.
  
-I replaced the flag fences and I created LDAL-3; this time, I had only 100,000 execution cycles, but 10 copies of LDAL. My heart lept when I saw the score; 7.55 MIPS! This meant that LDAL was executing much faster than the other instructions. I immediately created LDAL-4 which had 1,000 lines of LDAL and loaded CD with 1 million. The goal was simple: execute 1 billion LDAL instructions and time the result. The results were spectacular. 78 MIPS. I did try with CMP,0 and SEF mode, and it was slower (73 MIPS). The immediate conclusion is that SEF mode was useless. CMP was dragging everything down. But I didn't know why.
+I replaced the flag fences and I created LDAL-3; this time, I had 100,000 runs of 10 LDAL operations. My heart leapt for joy when I saw the score: 7.55 MIPS! This meant that LDAL was executing much faster than the other instructions. I immediately created LDAL-4, which had 1,000 lines of LDAL and loaded CD with 1 million. The goal was simple: execute 1 billion LDAL instructions and time the result. The results were spectacular: 78 MIPS. I did try with CMP,0 and SEF mode, and it was slower (73 MIPS). The immediate conclusion was that SEF mode was useless. CMP was dragging everything down. But I didn't know why.
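
As a rough sketch of the arithmetic behind these scores: the helper names below are hypothetical, and the sketch assumes the only loop overhead is DEC and JNZ (plus CMP when it is present). It shows how many instructions a run executes, how a MIPS figure follows from a wall-clock time, and what share of the executed instructions are the load being measured.

```typescript
// Hypothetical helpers for illustration only; none of these names come from the emulator.

// Total instructions for a loop body of `unrolledOps` loads plus `overheadOps`
// bookkeeping instructions (e.g. DEC, CMP, JNZ), repeated `iterations` times.
function instructionsExecuted(unrolledOps: number, overheadOps: number, iterations: number): number {
  return (unrolledOps + overheadOps) * iterations;
}

// MIPS = instructions / seconds / 1,000,000.
function mips(instructions: number, elapsedSeconds: number): number {
  return instructions / elapsedSeconds / 1_000_000;
}

// Fraction of executed instructions that are the load under test.
function loadShare(unrolledOps: number, overheadOps: number): number {
  return unrolledOps / (unrolledOps + overheadOps);
}

// LDAL-4 shape: 1,000 LDAL per pass plus DEC and JNZ, for 1,000,000 passes.
const total = instructionsExecuted(1000, 2, 1_000_000);
console.log(total);              // 1002000000
console.log(loadShare(1, 3));    // 0.25 (one LDAL next to DEC, CMP, JNZ)
console.log(loadShare(1000, 2)); // close to 1 (the overhead nearly vanishes)
```

With a single load per pass, most of what gets timed is loop bookkeeping; with 1,000 loads per pass, the score is almost entirely the load itself.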
  
-For the record, created versions which used LDA and LDAB
+I experimented with some other LD instructions. It turned out that LDBLX and LDAB were extremely slow, and when put into an unrolled loop would drop to under 10 MIPS.
  
 +<codify armasm>
 +; Test program: 1 million LDAB [$1000] operations (1,000 unrolled loads x 1,000 passes)
 +; Uses CD (32-bit "count down" counter register)
 +
 +.address $010100
 +
 +    LDCD #1000        ; Load 1,000 into CD (0x989680 is 10 mil)
 +
 +loop:
 +    LDAB [$1000]          ; Many of these; x1000 
 +    DEC CD                ; Decrement CD
 +    CMP CD, 0
 +    JNZ loop              ; Jump to loop if CD != 0
 +
 +    HALT                  ; Halt when done
 +</codify>
  
-78 MIPS With SEF & CMP CD, 0
-73 MIPS With CLF & no CMP
+The final conclusion was that my memory system was not optimized. One of the major issues was that I was creating an array in WebAssembly on every register access. I moved that out of the loop and inlined memory access directly into the LD/ST instructions. That brought MIPS for LDA up to 87 and MIPS for LDAB to 55. These were better numbers than before. I probably hadn't noticed how badly some instructions were weighing down the system.
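
A minimal sketch of the "array created on every register access" problem, written in plain TypeScript rather than the emulator's actual WebAssembly code. The register layout (16 registers of 32 bits) and every name here are assumptions for illustration; the point is only the difference between building a view per call and building it once.

```typescript
// Assumed layout: 16 registers of 32 bits each. Not the emulator's real register file.
const REG_COUNT = 16;
const regBytes = new Uint8Array(REG_COUNT * 4);

// Slow shape: a fresh typed-array view is allocated on every register read.
function readRegSlow(index: number): number {
  const view = new Uint32Array(regBytes.buffer); // allocated per call
  return view[index];
}

// Faster shape: the view is created once, outside the hot path, and reused.
const regView = new Uint32Array(regBytes.buffer);
function readRegFast(index: number): number {
  return regView[index];
}

regView[3] = 0xdeadbeef;
console.log(readRegSlow(3) === readRegFast(3)); // true
```

Both functions return the same value; the second simply avoids an allocation on each of the millions of accesses a benchmark loop performs.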
  
 +The turbo boost over and above this came from batching all the reads at the start of the opcode handler and then masking down depending on how we needed to access the registers. In closing, here's the 87.5 MIPS version of LDA [$addr]:
  
 +<codify armasm>
 +        case OP.LD_MEM: {
 +            // Load reg (1 byte) + addr (3 bytes) = 4 bytes total
 +            let instruction = load<u32>(RAM + IP);
 +            let reg:u8 = instruction as u8;                    // Extract low byte
 +            let addr = (instruction >> 8) & 0x00FFFFFF;      // Extract upper 3 bytes
 +            // Pre-load 32 bits from target address
 +            let value = load<u32>(RAM + addr);
 +            let reg_index = reg & 0x0F;  // Extract physical register 0-15
 +            IP += 4;
  
 +            if (reg < 16) {
 +                set_register_16bit(reg, value as u16);
 +                ZERO_FLAG = (value & 0xFFFF) === 0;   // test only the 16 bits actually loaded
 +                NEGATIVE_FLAG = (value & 0x8000) !== 0;
 +                //if (DEBUG) log(`$${hex24(IP_now)}    LD${reg_names(reg)} [$${hex24(addr)}] ; = ${hex16(value)}`);
 +            } else if (reg < 48) {
 +                set_register_8bit(reg, value as u8);
 +                ZERO_FLAG = (value & 0xFF) === 0;     // test only the 8 bits actually loaded
 +                NEGATIVE_FLAG = (value & 0x80) !== 0;
 +                //if (DEBUG) log(`$${hex24(IP_now)}    LD${reg_names(reg)} [$${hex24(addr)}] ; = ${hex8(value)}`);
 +            } else if (reg < 80) {
 +                set_register_24bit(reg, value & 0x00FFFFFF);
 +                ZERO_FLAG = (value & 0x00FFFFFF) === 0; // test only the 24 bits actually loaded
 +                NEGATIVE_FLAG = (value & 0x800000) !== 0;
 +                //if (DEBUG) log(`$${hex24(IP_now)}    LD${reg_names(reg)} [$${hex24(addr)}] ; = ${hex24(value)}`);
 +            } else {
 +                set_register_32bit(reg, value);
 +                ZERO_FLAG = value === 0;
 +                NEGATIVE_FLAG = (value & 0x80000000) !== 0;
 +                //if (DEBUG) log(`$${hex24(IP_now)}    LD${reg_names(reg)} [$${hex24(addr)}] ; = ${hex32(value)}`);
 +            }
 +            break;
 +        }
 +</codify>
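
The "batch the reads, then mask down" idea in the handler above can be sketched in plain TypeScript. This is an illustration under assumptions, not the emulator's code: `ram`, `decodeLdMem`, and `maskForWidth` are hypothetical names, and the sketch assumes the little-endian layout implied by the handler (register byte first, then a 3-byte address).

```typescript
// Hypothetical decoder mirroring the handler's layout; all names are assumptions.
const ram = new DataView(new ArrayBuffer(0x2000));

interface Decoded { reg: number; addr: number; }

// One little-endian 32-bit fetch yields both the register byte and the 24-bit address.
function decodeLdMem(ip: number): Decoded {
  const word = ram.getUint32(ip, true);
  return { reg: word & 0xff, addr: (word >>> 8) & 0x00ffffff };
}

// Mask a pre-loaded 32-bit value down to the register's width.
function maskForWidth(value: number, bits: 8 | 16 | 24 | 32): number {
  return bits === 32 ? value >>> 0 : value & ((1 << bits) - 1);
}

ram.setUint8(0x100, 0x05); // register selector
ram.setUint8(0x101, 0x00); // address low byte
ram.setUint8(0x102, 0x10); // address middle byte
ram.setUint8(0x103, 0x00); // address high byte
ram.setUint32(0x1000, 0x12345678, true);

const d = decodeLdMem(0x100);
const value = ram.getUint32(d.addr, true);
console.log(d.reg, d.addr.toString(16), maskForWidth(value, 16).toString(16)); // 5 1000 5678
```

Doing one wide read and narrowing it afterwards means the width decision only costs a mask, and the same masked value can feed both the register write and the zero/negative flag tests.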
  
 +For more information, please contact Appledog.
  
flag_operations_are_free.1768143533.txt.gz · Last modified: by appledog
