I've found the bug.
The circumstances that trigger the bug are a bit complicated, but I'll try to explain.
During multiplication, this routine is called 4 times, once for each byte of the mantissa of one of the factors. The current mantissa byte is contained in A, and the Z flag has been set accordingly:
.DA59 D0 03 BNE $DA5E
.DA5B 4C 83 D9 JMP $D983
.DA5E 4A LSR A
.DA5F 09 80 ORA #$80
.DA61 A8 TAY
.DA62 90 19 BCC $DA7D
.DA64 18 CLC
.DA65 A5 29 LDA $29
.DA67 65 6D ADC $6D
.DA69 85 29 STA $29
.DA6B A5 28 LDA $28
.DA6D 65 6C ADC $6C
.DA6F 85 28 STA $28
.DA71 A5 27 LDA $27
.DA73 65 6B ADC $6B
.DA75 85 27 STA $27
.DA77 A5 26 LDA $26
.DA79 65 6A ADC $6A
.DA7B 85 26 STA $26
.DA7D 66 26 ROR $26
.DA7F 66 27 ROR $27
.DA81 66 28 ROR $28
.DA83 66 29 ROR $29
.DA85 66 70 ROR $70
.DA87 98 TYA
.DA88 4A LSR A
.DA89 D0 D6 BNE $DA61
.DA8B 60 RTS
Now, the first two instructions, BNE $DA5E and JMP $D983, are supposed to execute a shortcut in case the mantissa byte to multiply with is 0. It is sensible to optimize for this, especially because small integer constants do contain zeroes in the less significant parts of the mantissa.
The routine would still work without that optimization, but the branch at $DA62 would be taken every time (8 times in total), skipping the additions to $26 .. $29. All the routine then does is painstakingly move the bytes $26 .. $29 over to $27 .. $29, $70, one bit at a time. Of course this can be done much faster with just four load and store instructions. For this, another routine at $D983 is 'reused', which normally 'normalizes' the mantissa:
.D983 A2 25 LDX #$25
.D985 B4 04 LDY $04,X
.D987 84 70 STY $70
.D989 B4 03 LDY $03,X
.D98B 94 04 STY $04,X
.D98D B4 02 LDY $02,X
.D98F 94 03 STY $03,X
.D991 B4 01 LDY $01,X
.D993 94 02 STY $02,X
.D995 A4 68 LDY $68
.D997 94 01 STY $01,X
.D999 69 08 ADC #$08
.D99B 30 E8 BMI $D985
.D99D F0 E6 BEQ $D985
.D99F E9 08 SBC #$08
.D9A1 A8 TAY
.D9A2 A5 70 LDA $70
.D9A4 B0 14 BCS $D9BA ---+ normally,
.D9A6 16 01 ASL $01,X | this branch
.D9A8 90 02 BCC $D9AC | is supposed
.D9AA F6 01 INC $01,X | to happen.
.D9AC 76 01 ROR $01,X |
.D9AE 76 01 ROR $01,X |
.D9B0 76 02 ROR $02,X |
.D9B2 76 03 ROR $03,X |
.D9B4 76 04 ROR $04,X |
.D9B6 6A ROR A |
.D9B7 C8 INY |
.D9B8 D0 EC BNE $D9A6 |
.D9BA 18 CLC <--+
.D9BB 60 RTS
The instructions from $D983 to $D993 do what they are supposed to. However, beginning at $D995, LDY $68 with STY $01,X is a little bit dubious: without the optimization, a 0 would end up at $26 as the result of being rotated to the right 8 times. Whether $68 holds a 0 every time the routine is called for this purpose cannot be guaranteed. But anyway, that is not what triggers the bug in the first place.
The routine at $DA59 is first called with a non-0 byte - for the examples in the earlier posts this is ultimately the result of adding something around 1..59 divided by 10^9 to the constant 0.75. The routine exits with C=1, which will become important now!
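Why the exit carry is always 1 can be seen from a small model of the loop's control flow. This is only a sketch in Python, not the ROM code: the name `multiply_byte_loop` is made up for illustration, and the additions and rotations of $26 .. $29 / $70 are deliberately left out, so only A, Y and C are tracked.

```python
def multiply_byte_loop(a):
    """Model the control flow of $DA5E..$DA8B for one multiplier byte.

    Returns (add_steps, passes, carry_out). add_steps[i] is 1 when the
    addition block at $DA64 would run in pass i, 0 when BCC skips it.
    """
    carry = a & 1            # LSR A: bit 0 drops into C
    a = (a >> 1) | 0x80      # ORA #$80: plant a sentinel bit at the top
    add_steps = []
    passes = 0
    while True:
        y = a                    # TAY
        add_steps.append(carry)  # BCC $DA7D skips the addition when C=0
        passes += 1
        # (the ROR chain runs here and changes C, but LSR A overwrites it)
        a = y                    # TYA
        carry = a & 1            # LSR A
        a >>= 1
        if a == 0:               # BNE $DA61 falls through
            return add_steps, passes, carry


steps, passes, carry_out = multiply_byte_loop(0b00000101)
print(steps, passes, carry_out)  # [1, 0, 1, 0, 0, 0, 0, 0] 8 1
```

The last bit shifted out of A is always the sentinel planted by ORA #$80, so the routine returns with C=1 regardless of the multiplier byte.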
The next mantissa byte is zero, so now the shortcut at $D983 is called. With A=0 and C=1 on entry, the two instructions ADC #$08 and SBC #$08 result in A=0 and C=1 again. At $D9A4, the instruction BCS $D9BA skips the second half of the routine, which is a good thing. However, it also lands on the CLC at $D9BA, and from there everything goes downhill.
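The carry arithmetic of that ADC/SBC pair can be checked by hand with a minimal model of the two instructions. The helpers `adc` and `sbc` below are illustrative names and assume binary (non-decimal) mode; they are not taken from any emulator.

```python
def adc(a, operand, carry):
    """6502 ADC, binary mode: A + operand + C. Returns (A, C)."""
    total = a + operand + carry
    return total & 0xFF, 1 if total > 0xFF else 0


def sbc(a, operand, carry):
    """6502 SBC, binary mode: A - operand - (1 - C). Returns (A, C)."""
    total = a - operand - (1 - carry)
    return total & 0xFF, 1 if total >= 0 else 0


a, c = adc(0, 8, 1)   # ADC #$08 with A=0, C=1
print(a, c)           # 9 0
a, c = sbc(a, 8, c)   # SBC #$08 borrows one extra because C=0
print(a, c)           # 0 1
```

The extra 1 added by the incoming carry is exactly cancelled by the borrow in SBC, which is why A comes back as 0 with C set.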
The third mantissa byte is *also* zero, so now the shortcut gets called a second time. This time, however, C is 0 (with A again being 0), which results in C=0 and A=255 after SBC #$08. The instructions from $D9A6 .. $D9B8 are now executed, which shift the whole mantissa one bit to the right (Y=255 after TAY, so INY wraps to 0 and the loop ends after a single pass)!
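The two back-to-back shortcut calls can be replayed with a rough model of the tail of the routine ($D999 onwards). As before, this is a sketch: `shortcut_tail` is a hypothetical name, the byte moves at $D985 .. $D997 and the mantissa itself are not modelled, and it assumes nothing between the two calls touches C.

```python
def shortcut_tail(a, carry):
    """Model $D999..$D9BB. Returns (right_shifts, carry_out)."""
    total = a + 8 + carry                      # ADC #$08
    a, carry = total & 0xFF, int(total > 0xFF)
    if a == 0 or a & 0x80:                     # BEQ / BMI $D985
        raise NotImplementedError("loop back to $D985 is not modelled")
    total = a - 8 - (1 - carry)                # SBC #$08
    a, carry = total & 0xFF, int(total >= 0)
    y = a                                      # TAY
    shifts = 0
    if not carry:                              # BCS $D9BA not taken
        while True:
            shifts += 1                        # $D9A6..$D9B6: one bit right
            y = (y + 1) & 0xFF                 # INY
            if y == 0:                         # BNE $D9A6 falls through
                break
    return shifts, 0                           # CLC at $D9BA


carry = 1                                      # left over from $DA59
for call in (1, 2):
    shifts, carry = shortcut_tail(0, carry)
    print(call, shifts, carry)
# 1 0 0
# 2 1 0
```

The first call leaves the mantissa alone but clears C; the second call inherits C=0 and performs the stray one-bit shift.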
For the remaining mantissa byte, the check for a shortcut is skipped. Whatever is finally added to the resulting mantissa, the earlier parts have already been inadvertently divided by 2, which is exactly what shows up as the wrong result.