SD-8516 ISA Profile
Writing assembly language programs on the SD-8516 is similar to, but different from, writing them on the 8510. For one thing, programs are entered in much the same way:
ISA Profile Chart
The following program illustrates a baseline:
.address $010100
LDA $1010
LDB $0101
LDCD $989680 ; Load 10,000,000 into CD (0x989680 = 10,000,000)
loop:
DEC CD ; Decrement CD
JNZ @loop ; Jump to loop if CD != 0
HALT ; Halt when done
This program executes at some speed we can call X. The exact value doesn't matter for now; suffice it to say it is X in terms of MIPS or some other benchmark. When profiling an instruction, we determine whether adding it pulls the execution of this loop up or down from X. In this manner we can judge the relative speed of the instruction: if A is the cost of DEC and B is the cost of JNZ, then the portion remaining goes to the instruction being profiled. However, when adding just one instruction, it is difficult to judge its true speed. The solution is to increase the number of copies of the instruction per loop iteration, a technique known as loop unrolling.
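The subtraction logic above can be sketched in code. This is my own scaffolding, not part of the SD-8516 toolchain, and the run numbers are illustrative placeholders, not real measurements:

```typescript
// Estimate the cost of a profiled instruction by comparing a baseline
// DEC/JNZ loop against the same loop with the instruction added.

interface Run {
  instructions: number; // instructions retired
  seconds: number;      // wall-clock time for the whole run
}

function mips(run: Run): number {
  return run.instructions / run.seconds / 1e6;
}

// The time difference between the two runs, divided by the iteration
// count, is attributed to the instruction that was added to the loop.
function extraCostPerIteration(
  baseline: Run,
  profiled: Run,
  iterations: number,
): number {
  return (profiled.seconds - baseline.seconds) / iterations;
}

// Illustrative numbers only: a ~77 MIPS baseline vs. a run with one
// extra instruction per iteration.
const baseline: Run = { instructions: 2_000_000, seconds: 0.026 };
const withExtra: Run = { instructions: 3_000_000, seconds: 0.04 };

console.log(mips(baseline).toFixed(1)); // baseline MIPS
console.log(extraCostPerIteration(baseline, withExtra, 1_000_000)); // sec/instr
```

With only one added instruction the measured difference is tiny and noisy, which is exactly why the text moves to unrolling many copies per iteration.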
One idea is to increase the number of DEC instructions relative to JNZ and see what happens. In the regular run I got a score of 77 MIPS on my 12600K. Increasing the DEC:JNZ ratio to 10:1 brought us down to 56 MIPS; at 100:1 we got 54 MIPS.
Going the other way, a program with a 10:1 JNZ-to-DEC ratio brings MIPS up to 91. In either case, that's nearly a 20 MIPS swing. Clearly JNZ is a much faster operation than DEC, although you would expect the opposite! The reason is that DEC CD is very slow, as it is a dual-register DEC. Moving to a single-register DEC increases the speed by 50-100%:
.address $010100
LDC #10000
LDD #25000
loop:
DEC C
DEC C
DEC C
DEC C
DEC C
DEC C
DEC C
DEC C
DEC C
DEC C
JNZ @loop
; C reached zero, decrement D
LDC #10000
DEC D
JNZ @loop
; done
HALT
This version runs at 90 MIPS. Considering all of the results so far, we'll use the double-counter (C and D) version with 20 copies of the profiled instruction unrolled inside the loop. We'll also take the C loop down to 10,000 from 30,000, since the unrolled instructions are almost surely going to slow each iteration down.
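Hand-writing 20 copies of every profiled instruction gets tedious, so the harness can be generated. A minimal sketch, mirroring the double-counter listing above (the generator itself is my own helper, not part of the SD-8516 toolchain):

```typescript
// Emit the double-counter profiling loop with the instruction under
// test unrolled `unroll` times inside the inner loop body.
function emitHarness(
  op: string,
  unroll = 20,
  inner = 10_000,
  outer = 25_000,
): string {
  const lines: string[] = [
    ".address $010100",
    `LDC #${inner}`,
    `LDD #${outer}`,
    "loop:",
  ];
  for (let i = 0; i < unroll; i++) {
    lines.push(op); // instruction under test, repeated `unroll` times
  }
  lines.push("DEC C");
  lines.push("JNZ @loop");
  lines.push(`LDC #${inner}`); // C reached zero: reload it, decrement D
  lines.push("DEC D");
  lines.push("JNZ @loop");
  lines.push("HALT");
  return lines.join("\n");
}

console.log(emitHarness("LDA [$1000]"));
```

Swapping the `op` string is then enough to produce each row of the chart below.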
The following chart indicates the best results out of several runs:
LDA
| Instruction | Execution time | Notes |
|---|---|---|
| Empty Loop | 97 MIPS | |
| LDA [$1000]x10 | 90 MIPS | |
| LDA [$1000]x100 | 95 MIPS | |
| LDAL [$1000]x20 | 85 MIPS | Not native word size |
| LDAB [$1000]x20 | 76 MIPS | unexpected! will check code |
| LDBLX [$1000]x20 | 25 MIPS | array method |
| LDBLX [$1000]x20 | 45 MIPS | switch method |
| LDBLX [$1000]x20 | 64 MIPS | unified memory reads |
| LDBLX [$1000]x20 | 73 MIPS | inlined access |
Notes on LDA/LDAL
This is likely a branch prediction and instruction cache artifact in the WebAssembly/JavaScript JIT. With the empty loop, the CPU is paying pure branch and speculation overhead with nothing useful to hide it behind. Adding a single LDA gives the pipeline something productive to do between branches, potentially hiding some of the branch misprediction penalty or better aligning the instruction stream. At 10-20 instructions, you're hitting different bottlenecks:
- Increased loop body size may cause instruction cache pressure
- More register pressure in the generated machine code
- Loop overhead becomes proportionally smaller, but absolute instruction decode cost increases
The LDAL slowdown confirms this - non-native 32-bit operations require more complex codegen, putting additional pressure on the optimizer. This is classic JIT behavior: a tiny amount of work can sometimes improve performance by giving the CPU's execution units better scheduling opportunities, but too much work overwhelms those benefits. You might also be seeing alignment effects - the single instruction could be placing the loop branch at an optimal address boundary.
Finally, using LDBLX as a proxy for the process we went through earlier, we achieved a 3x speedup by using a switch instead of a map, unifying <u8> memory reads into <u32>, and inlining the load() calls into the opcode handler.
I wouldn't want to do this for every instruction because it produces ugly, hard-to-maintain code, but it works like a charm!
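The three changes behind the LDBLX progression can be sketched side by side. This is a simplified model of the emulator core, not its actual source: the opcode value, register names, and memory size here are all assumptions.

```typescript
// Hypothetical opcode value and core state for illustration.
const OP_LDBLX = 0x4a;

const mem8 = new Uint8Array(1 << 16);
const mem32 = new Uint32Array(mem8.buffer); // unified view: one u32 read
                                            // instead of four u8 reads
let regB = 0;

function load32(addr: number): number {
  // Assumes aligned access; byte order follows the host (little-endian
  // on typical platforms).
  return mem32[addr >>> 2];
}

// Map-based dispatch: every step pays a hash lookup plus a closure call,
// which the JIT struggles to optimize across.
const handlers = new Map<number, () => void>([
  [OP_LDBLX, () => { regB = load32(0x1000); }],
]);

// Switch-based dispatch with the load inlined: one flat function the
// JIT can compile to a jump table, keeping memory and register access hot.
function stepFast(opcode: number): void {
  switch (opcode) {
    case OP_LDBLX:
      regB = mem32[0x1000 >>> 2]; // inlined: no call, single u32 read
      break;
    default:
      throw new Error(`unknown opcode ${opcode}`);
  }
}
```

Each change removes a layer of indirection between the fetched opcode and the memory read, which is consistent with the stepwise 25 → 45 → 64 → 73 MIPS progression in the chart.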
DEC
A loop with 20xDEC had a high mark of 104.7 MIPS.
PUSH/POP
- PUSH and POP are slower operations, in the 80-85 MIPS range.
- But PUSHA/POPA are noticeably slow, in the 27 MIPS range.
- Using PUSHA/POPA everywhere will kill performance. We saw a 25% increase in speed after moving from PUSHA to PUSH (reg).
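The gap makes sense if PUSHA is handled as a loop over the whole register file, so one PUSHA costs many stack writes plus loop overhead where PUSH (reg) is a single write. A hypothetical handler sketch — the register count, word size, and stack layout are assumptions, not the SD-8516's actual spec:

```typescript
// Hypothetical machine model: 16 general registers, descending stack
// of 16-bit words.
const REG_COUNT = 16;
const regs = new Uint16Array(REG_COUNT);
const stackMem = new Uint16Array(1 << 12);
let sp = stackMem.length; // stack grows downward

// PUSH (reg): a single decrement and a single memory write.
function pushReg(index: number): void {
  stackMem[--sp] = regs[index];
}

// PUSHA: one instruction, but REG_COUNT writes plus loop overhead --
// consistent with it profiling far slower than individual pushes.
function pushAll(): void {
  for (let i = 0; i < REG_COUNT; i++) {
    stackMem[--sp] = regs[i];
  }
}
```

If only two or three registers actually need saving at a call site, a few explicit PUSH (reg) instructions do a fraction of the work of one PUSHA.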
