I had to translate it for the PIC and Hitachi H8 MCUs I was using for the project, but that wasn't a major hassle. The puzzle was why they seized upon that moment to insist I use a specific, and short algorithm. Did they think I was a mathematical dunce who couldn't implement something involving a few XORs? Were they just trying to impress me with their claim (I'm willing to accept it) that they actually wrote that algorithm and got it published in BSI standards? Were they just sticklers for standardized solutions?

Well, I was quite happy, because it was a bit of an education. It wasn't until about 5 years later I started to realize something was amiss with CRC16 algorithms. And what was amiss, was that no-one else seemed to be using byte parallel algorithms. Surely a BSI (and presumably US / ISO) standard would dominate? But here I was in other commercial environments where the references and implementations were all bit-serial or table-driven algorithms of this kind:

uint16_t crc16(char *addr, int len, uint16_t crc)

{

int bit;

while(num-- >0) { // For each byte.

crc = crc ^ (*addr++ <<8); // xor into high byte.

for(bit=0; bit<8;bit++)

if ((int16_t)crc<0) // top bit of poly set.

crc = ((crc<<1) ^ 0x1021); // xor with poly

else

crc<<=1;

}

}

return crc;

}

I could see they were all highly inefficient, or worse still, wrong - some of the table-driven versions actually contained wrong entries as I found out last year when I wasted a day or so trying to figure out why a colleague's Crc algorithm didn't generate the same results as mine. And that was because he'd just copied and pasted the first implementation he'd seen on the internet - and that was even

*after*I'd pointed him to this byte-parallel version and asked him to translate it to C#.

The reason I figure table-driven algorithms became popular is because they're easy on the brain. It's easy to grasp how a bit-serial algorithm relates to the CRC's polynomial and then easy to jump to a bit-serial algorithm that generates a table or just copy a table version directly. However, byte-parallel algorithms are, thankfully, making a comeback. Why? I guess because constrained MCUs are still used in a lot of applications and because cache-misses on table-driven CRCs are pretty costly on higher-end processors.

This leads me to - an alternative for a set of CRC algorithms published on MDFS.net, a wonderful site that has conversions of BBC Basic for every processor you'd ever want :-) (with the exception of an AVR). Here are equivalent byte-parallel versions for the same set of ancient processors and they run at least twice as fast:

First, Crc16_6502, which clocks in at 64 bytes and 92cycles per byte, over twice as fast as the bit-algorithm. In common with most of the 8-bit implementations, it's more efficient to swap the CrcHi and CrcLo nearer the end and instead perform the calculations on the 'wrong' halves of the Crc until then. The 6502 version also saves cycles by using y to represent both an index into the buffer, and the length of the buffer (which is incremented until it gets to 0). This means we have to adjust the buffer pointer and negate the length.

Crc16_6502: ;buff^buffer, y=len.

clc

tya

eor #255

adc #1 ;Negate length

beq Crc16_6502End

adc buff ;buff-len

;Carry would mean we don't need to dec buff+1

;but since we really have 256-length in y,

;then we need to inc buff+1 instead (no-carry

;means we don't need to inc buff+1)

bcc Crc16_6502Loop

inc buff+1

Crc16_6502Loop:

lda (buff),y

eor CrcHi

sta CrcHi ;really CrcLo

lsra

lsra

lsra

lsra ;(Crc&0xff)>>4.

eor CrcHi

sta Temp1

sta Temp0 ;Copy low byte for <<5 later.

asla

asla

asla

asla ;(Crc<<12)

eor CrcLo

sta CrcHi ;this is the new CrcHi

asl Temp0

rola

asl Temp0

rola

asl Temp0

rola

asl Temp0

rola

asl Temp0

rola ;<<5

eor CrcHi

sta CrcHi

lda Temp0

eor Temp1

sta CrcLo

iny

bne Crc16_6502

Crc16_6502End

rts

Then there's a Z80 version. This is more straight-forward, since there are enough registers to handle the entire algorithm. It clocks in at 33b and 139 T-states per byte, making it the shortest version and only 51% slower than a 6502 at the same clock speed. Here we use c as a temp while we perform the crc hi and lo swap over the course of the first two shifts, so that they end up nicely in hl ready for when we do the <<5 near the end.

Crc16_Z80: ;(Z80 style, b=length, de^data, hl=CRC).

ld a,(de)

inc de

xor h

ld c,a

rra ;the Z80 doesn't have a fast

rra ;right shift so we

rra ;rotate and mask.

rra

and a,15 ;>>4,

xor c

ld c,a ;new low byte, gets shifted.

add a,a

add a,a

add a,a

add a,a ;<<12.

xor l

ld h,a ;new crc hi

rrca

rrca

rrca ;<<5

ld l,a ;save in c

and 31

xor h

ld h,a

ld l,a

and 0E0h

xor c

ld l,a ;done.

djnz Crc16_Z80

Next up, the 6809. Despite having several 16-bit registers, the 6809's accumulator architecture means we need to allocate 2 temporary bytes on the stack and we can't make use of 16-bit operations on D. I estimate the length at 45bytes and the speed as 80 cycles per byte, a little faster than a 6502.

Crc16_6809: ;D=CRC, X^data, Y=Len.

leas -2,s ;Allocate 2 temp bytes.

Crc16_6809Loop:

eora ,x+

staa ,s

lsra

lsra

lsra

lsra

eora ,s

std ,s

lsla

lsla

lsla

lsla

eora 1,s

staa 1,s

ldab ,s

rorb

rorb

rorb

tfr b,a

anda #31

eora 1,s

rorb

andb #0xe0

eorb ,s

leay -1,y

bne Crc16_6809Loop

leas 2,s ;Deallocate 2 temp bytes.

rts

Finally, the pdp11 (with Extended Arithmetic). The pdp11's adequate number of 16-bit registers and programmer-friendly instruction set makes it easy to implement the algorithm. Nevertheless, if run on a typical 1970s pdp-11/34 it would require 40 bytes of code and 47.56µs per byte, roughly equivalent to a 2MHz 6502 or a 3MHz Z80. Yet more evidence to demonstrate that the pdp-11 in the 1970s wasn't theoretically much faster than a humble late 70s Microprocessor. Check out my Z80 Dhrystones article to discover why this might be.

Crc16_pdp11:

;With Extended Arithmetic. r0=crc, r1^data, r2=len, r3=tmp

swab r0

clr r4

movb r4,(r1)+

xor r0,r4 ;no byte version.

mov r3,#-4

movb r4,r0

ash r4,r3 ;>>4.

xor r0,r4 ;crc^=(crc&0xff)>>4

mov r4,r0 ;need copy

mov r3,#12 ;adds 1.6us vs much more for swap /and etc.

ash r4,r3 ;crc<<12

xor r0,r4

mov r4,r0

mov r3,#5

ash r4,r3 ;<<5

xor r0,r3 ;done.

sob r2,Crc16_pdp11