> ARM assembly is orthogonal and almost as high-level as C. The AArch64 is wacky...

kragen · 2025-08-06T18:21:38 1754504498

Yeah, ARM64 is a little weird, and I'm not sure it's a good design, though it does seem to be workable. But I'm talking about the original ARM instruction set implemented on the ARM2, as evidence that architectural design quality matters—the same 29000 transistors can give you 12 times the performance and a much better programming model.

The PDP-11 seems pleasant and orthogonal, but I've never written a program for it, just helped to disassemble the original Tetris, written for a Soviet PDP-11 clone. The instruction set doesn't feel nearly as pleasant as the ARM: no conditional execution, no bit-shifted index registers, no bit-shifted addends, only 8 registers instead of 16, and you need multiple instructions for procedure prologues and epilogues if you have to save multiple registers. They share the pleasant attribute of keeping the stack pointer and program counter in general-purpose registers, and having postincrement and predecrement addressing modes, and even the same condition-code flags. (ARM has postdecrement and preincrement, too, including by variable distances determined by a third register.)

The PDP-11 also wasn't a speed demon the way the ARM was. I believe that speed trades off against everything, and I think you're on board with that from your language designs. According to the page I linked above, a PDP-11/34 was about the same speed as an IBM PC/XT.

Loading a constant into a register is still a problem on the ARM2, but it's a problem that the assembler mostly solves for you with constant pools. And ARM doesn't have indirect addressing (via a pointer in memory), but most of the time you don't need it because of the much larger register set.

The ARM2 and ARM3 kept the condition code in the high bits of the program counter, which meant that subroutine calls automatically preserved it. I thought that was a cool feature, but later ARMs removed it in order to support being able to execute code out of more than just the low 64 mebibytes of memory.
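
A minimal Python sketch of that shared register, assuming the usual 26-bit-era layout (N, Z, C, V in bits 31..28, the word-aligned PC in bits 25..2, the processor mode in bits 1..0). The interrupt-mask bits are left out, and the names `pack_r15` and `unpack_r15` are made up for illustration:

```python
# Model of R15 on ARM2/ARM3: flags, PC, and mode share one 32-bit word.
FLAG_N, FLAG_Z, FLAG_C, FLAG_V = 1 << 31, 1 << 30, 1 << 29, 1 << 28
FLAG_MASK = FLAG_N | FLAG_Z | FLAG_C | FLAG_V
PC_MASK = 0x03FFFFFC      # bits 25..2: word-aligned address
MODE_MASK = 0x3           # bits 1..0: processor mode

def pack_r15(pc, flags=0, mode=0):
    """Combine a byte address, flag bits, and mode into one R15 value."""
    if pc & ~PC_MASK:
        raise ValueError("pc must be word-aligned and fit in bits 25..2")
    return (flags & FLAG_MASK) | pc | (mode & MODE_MASK)

def unpack_r15(r15):
    """Split R15 back into (pc, flags, mode)."""
    return r15 & PC_MASK, r15 & FLAG_MASK, r15 & MODE_MASK

# A call copies all of R15 into the link register, so the caller's
# flags travel with the return address.
link = pack_r15(0x8000, FLAG_Z | FLAG_C)
pc, flags, mode = unpack_r15(link)
```

The price is visible in `PC_MASK`: any address with bits set above bit 25 cannot be represented, because those bits are occupied by the flags.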

Here's an operating system I wrote in 32-bit ARM assembler. r10 is reserved for the current task pointer, which doesn't conform to the ARM procedure call standard. (I probably should have used r9.) It's five instructions:

            .syntax unified
            .thumb
            .fpu fpv4-sp-d16
            .cpu cortex-m4

            .thumb_func
    yield:  push {r4-r9, r11, lr}   @ save all callee-saved regs except r10
            str sp, [r10], #4       @ save stack pointer in current task
            ldr r10, [r10]          @ load pointer to next task
            ldr sp, [r10]           @ switch to next task's stack
            pop {r4-r9, r11, pc}    @ return into yielded context there

http://canonical.org/~kragen/sw/dev3/monokokko.S
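
To make the data flow in `yield` easier to follow, here is a rough Python model of its three middle instructions. The task layout (saved stack pointer in word 0, pointer to the next task in word 1) is inferred from the loads and stores above, the circular two-task list is an assumption, and `Task`, `Cpu`, and `yield_` are hypothetical names. The push and pop of callee-saved registers are not modeled:

```python
class Task:
    """Hypothetical task record: the two words yield touches through r10."""
    def __init__(self, name, saved_sp):
        self.name = name
        self.saved_sp = saved_sp   # word 0: stack pointer saved at yield
        self.next = None           # word 1: pointer to the next task

class Cpu:
    def __init__(self, current, sp):
        self.r10 = current         # current task pointer
        self.sp = sp

def yield_(cpu):
    """Mirror the three middle instructions of yield."""
    cpu.r10.saved_sp = cpu.sp      # str sp, [r10], #4
    cpu.r10 = cpu.r10.next         # ldr r10, [r10]
    cpu.sp = cpu.r10.saved_sp      # ldr sp, [r10]

a, b = Task("a", 0x2000), Task("b", 0x3000)
a.next, b.next = b, a              # assumed: circular list of two tasks
cpu = Cpu(a, 0x1FE0)               # task a is running with sp = 0x1FE0
yield_(cpu)                        # switch to b, saving a's sp
after_first = (cpu.r10.name, cpu.sp)
yield_(cpu)                        # switch back to a at its saved sp
after_second = (cpu.r10.name, cpu.sp)
```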

WalterBright · 2025-08-07T01:43:45 1754531025

Thanks for the interesting post!

The -11 could do things like:

    mov (PC)+,R0

where the (PC)+ addressing mode picked the constant out of the next 16 bits in the instruction stream. It's just brilliant.

kragen · 2025-08-07T03:05:27 1754535927

I'm glad you liked it!

Yeah, with (PC)+ (27), you didn't need a separate immediate addressing mode where you tried to stuff an operand such as 2 into the leftover bits in the instruction word; you could just put your full-word-sized immediate operands directly in the instruction stream, the way you did with subroutine parameters on the PDP-8. And there was a similar trick for @(PC)+ (37) where you could include the 16-bit address of the data you wanted to access instead of the literal data itself. But that kind of thing, plus the similarly powerful indexed addressing modes (6x and 7x), also meant that even the instruction decoder in a fast pipelined implementation of the PDP-11 instruction set would have been a lot more difficult, because it has to decode all the addressing modes—so, AFAIK, nobody ever tried to build one.
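
A small Python sketch of why modes 27 and 37 fall out of the general autoincrement modes for free. It models only word operands, treats memory as a dict from byte address to 16-bit word, and uses a made-up helper name, `fetch_operand`:

```python
def fetch_operand(mem, regs, mode, reg):
    """Resolve a word operand for PDP-11 modes 2 and 3.

    mem maps even byte addresses to 16-bit words; regs is a list of
    eight register values, with regs[7] as the PC.
    """
    if mode == 2:                       # (Rn)+  autoincrement
        value = mem[regs[reg]]
        regs[reg] += 2
        return value
    if mode == 3:                       # @(Rn)+ autoincrement deferred
        address = mem[regs[reg]]
        regs[reg] += 2
        return mem[address]
    raise NotImplementedError(f"mode {mode} not modeled")

PC = 7
# The word after the opcode, at 0o1002, holds 0o1234.
mem = {0o1002: 0o1234, 0o1234: 0o42}
regs = [0] * 8

regs[PC] = 0o1002                       # PC has already passed the opcode
immediate = fetch_operand(mem, regs, 2, PC)   # mode 27: the literal itself
pc_after_immediate = regs[PC]

regs[PC] = 0o1002
absolute = fetch_operand(mem, regs, 3, PC)    # mode 37: word at that address
```

With the PC as the register, the same code path that walks an array also steps over the inline operand, so the next instruction fetch starts at the right place.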

And the different kinds of PC-relative addressing are basically the only benefit of making the PC a general-purpose register; it's really rare to want to XOR the PC, multiply it, or compare it to another register. And it cost you one of the only eight registers.

And you still can't do ARM things like

    @ if (≥) r2 := mem[r0 + 4*r1]
    ldrge r2, [r0, r1, lsl 2]

    @ if (≤) { r2 := mem[r0]; r0 += 4*r1; }
    ldrle r2, [r0], r1, lsl 2

    @ store four words at r3 and increment it by 16
    stmia r3!, {r0, r1, r7, r9}

    @ load the first and third fields of the three-word
    @ object at r3, incrementing r3 to point to the next object
    ldr r1, [r3], #8
    ldr r2, [r3], #4
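
For readers who don't have these forms memorized, here is a rough Python model of register-offset, post-indexed, and pre-indexed loads plus `stmia` with writeback. Memory is a dict from byte address to word, registers are a dict from number to value, the helper names are invented, and conditional execution is omitted (it would just gate each call):

```python
def ldr_reg_offset(mem, r, rd, rn, rm, shift):
    """ldr rd, [rn, rm, lsl shift]: load from rn + (rm << shift), no writeback."""
    r[rd] = mem[r[rn] + (r[rm] << shift)]

def ldr_post_index(mem, r, rd, rn, rm, shift):
    """ldr rd, [rn], rm, lsl shift: load from rn, then rn += rm << shift."""
    r[rd] = mem[r[rn]]
    r[rn] += r[rm] << shift

def ldr_pre_index_imm(mem, r, rd, rn, imm):
    """ldr rd, [rn, #imm]!: rn += imm first, then load from the new rn."""
    r[rn] += imm
    r[rd] = mem[r[rn]]

def stmia_writeback(mem, r, rn, reglist):
    """stmia rn!, {reglist}: store in ascending register order, 4 bytes apart."""
    addr = r[rn]
    for reg in sorted(reglist):
        mem[addr] = r[reg]
        addr += 4
    r[rn] = addr

mem = {0x100 + 4 * i: 10 * i for i in range(8)}   # words 0, 10, ..., 70
r = {n: 0 for n in range(16)}

r[0], r[1] = 0x100, 3
ldr_reg_offset(mem, r, 2, 0, 1, 2)      # reads mem[0x100 + 12]
loaded_offset = r[2]
ldr_post_index(mem, r, 2, 0, 1, 2)      # reads mem[0x100], then r0 += 12
loaded_post, r0_after_post = r[2], r[0]

r[3] = 0x100
ldr_pre_index_imm(mem, r, 2, 3, 8)      # r3 becomes 0x108, then reads it
loaded_pre, r3_after_pre = r[2], r[3]

r[0], r[1], r[7], r[9], r[3] = 0xA, 0xB, 0xC, 0xD, 0x200
stmia_writeback(mem, r, 3, [0, 1, 7, 9])
```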

A lot of the hairier combinations have been removed from Thumb and ARM64, including most of conditional execution and, in ARM64, ldm and stm. Those probably made sense as instructions when you didn't have an instruction cache to execute instructions out of, because a single stm can store theoretically 16 registers in 17 cycles, so you can get almost your full memory bandwidth for copying and in particular for procedure prologues and epilogues, instead of wasting half of it on instruction fetch. And they're very convenient, as you saw above. But nowadays you could call a millicode subroutine if you want the convenience.
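
The bandwidth argument is simple arithmetic. This sketch takes the 16-registers-in-17-cycles figure from the paragraph above and assumes, for comparison, that a single-register store on a cacheless machine costs one instruction-fetch cycle plus one data cycle:

```python
def data_bandwidth_fraction(data_cycles, total_cycles):
    """Fraction of memory cycles that move data rather than fetch instructions."""
    return data_cycles / total_cycles

# One stm of 16 registers: 16 data cycles out of 17 total.
stm_fraction = data_bandwidth_fraction(16, 17)

# Assumption: 16 separate stores, each one fetch cycle plus one data cycle.
str_fraction = data_bandwidth_fraction(16, 16 * 2)

print(f"stm: {stm_fraction:.0%} of cycles move data")
print(f"str: {str_fraction:.0%} of cycles move data")
```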

All these shenanigans (both PDP-11 and ARM) also make it tricky to restart instructions after a page fault, so AFAIK the only paged PDP-11 anyone ever built was the VAX. A single instruction can perform up to four memory accesses or modify up to two registers, which may be PC (with autoincrement and decrement), as well as modifying a memory location, which could have been one of the values you read from memory—or one of the pointers that told you where to read from memory, or where to write. Backing out all those state changes successfully to handle a fault seems like a dramatic amount of complexity and therefore slowness.

I'm aware that I'm talking about things I don't know very much about, though, because I've:

- never programmed a PDP-11;

- never programmed a PDP-8;

- never programmed in VAX assembly;

- never designed a pipelined CPU;

- never designed a CPU that could handle page faults.

So I could be wrong about even the objective factors, and of course no argument could ever take away your pleasure in programming in PDP-11 assembly.

WalterBright · 2025-08-07T03:53:58 1754538838

I rewrote my Empire game in PDP-11 assembler, long ago:

https://github.com/DigitalMars/Empire-for-PDP-11

but I have little knowledge of how the CPU works internally. One could learn the -11 instruction set in a half hour, but learning the AArch64 is a never-ending quest. 2000 instructions!

kragen · 2025-08-19T04:08:59 1755576539

That sounds like a lot of fun! I had a Heathkit myself, but it was an H89.

As for ARM64, sure, but I'm not talking about ARM64, in case that wasn't just a randomly chosen unmanageable architecture. Check out the VLSI ARM3. All 26 instructions are listed in Table 1 on the bottom of page 1–7 of the datasheet: https://www.chiark.greenend.org.uk/~theom/riscos/docs/ARM3-d...

That's cheating a little bit because it doesn't include the addressing modes, conditionals, and bit shifts and rotations, because those are bitfields in other instructions, but even so, it's not cheating much. You can still learn the whole instruction set in an afternoon (though not half an hour!), and it's an instruction set that can be implemented much more efficiently than most ISAs before or since.