Line drawing algorithm

MrSterlingBS · Post by **MrSterlingBS** » Fri Nov 01, 2024 11:48 am

Here are the results of the faster line drawing routine for a 128x256 Pixel screen.
There is no low byte calculation necessary.

Line from X0,Y0 to X1,Y1 / cycles
0,0 to 99,0 / 4347
0,0 to 99,0 / 4346
0,0 to 99,99 / 4386 !!!

Greeting
Sven

: FLDA.png (1.4 KiB) Viewed 3754 times

pixel · Post by **pixel** » Fri Nov 01, 2024 6:29 pm

Quite cool! I went for a char boundary check instead of unconditional address calculations. Something like this:

Code: Select all

    lda dx
loop2:
    sta error
loop:
    lda (d),y
    ora pixels,x
    sta (d),y
    dey
    cpy y1
    beq done
    lda error
    sbc dy
    bcs loop2
    adc dx
    sta error
    inx
    lda is_charboundary,x ; <--- !
    beq loop
    lda column_addrs_l,x
    sta d
    lda column_addrs_h,x
    sta d+1
    bne loop ; (jmp)

That's at least -5 cycles for your code, I assume. BUT: There's still clipping, so I'm thinking of doing just that with the cpx/cpy, correcting the drawing direction beforehand.

Mike · Post by **Mike** » Sat Nov 02, 2024 5:40 am

To get the rounding right, the first LDA dx should be followed by LSR A (i.e., error := dx/2).

pixel wrote:Quite cool! I went for a char boundary check instead of unconditional address calculations.

The routine knows where the character boundaries are without the need to explicitly check, if the loop is unrolled 8 times, and at start, the preparation part jumps into the loop body instance "x1 mod 8". Each loop body instance knows which bit to set (or reset/toggle), so that bit is encoded as immediate constant.

The address calculation/lookup is factored out and done simply by adding the column size to the address column pointer. This allows to re-purpose register X to count down the number of remaining pixels.

Other than all that, your code snippet pretty much does what my line routine does for |dy|>|dx|.

The whole adventure laid out in the thread here is about obtaining a fast line routine. The key to that is not stuffing more logic into the loop body, but less!

pixel · Post by **pixel** » Sat Nov 02, 2024 7:47 am

Mike wrote: ↑Sat Nov 02, 2024 5:40 am To get the rounding right, the first LDA dx should be followed by LSR A (i.e., error := dx/2).

Thanks!

I know what the adventure is. Some strategies might not include unrolling the loop.

Mike · Post by **Mike** » Sat Nov 02, 2024 12:01 pm

pixel wrote:Some strategies might not include unrolling the loop.

As can be seen by these two posts (here and here), 20000 pixels/second on a 1 MHz 65xx are easily attainable without loop unrolling, mostly by streamlining the address calculation, at the expense of using huge tables.

As long as there still happens a full or even only partial address calculation per pixel, you can't improve much on that. Moving the inner loop into the zeropage and applying those measures as hinted at by the last posts of MrSterlingBS may eke out some cycles, but still won't be able to approach the speeds possible by unrolling the loop.

So-called "chunky" Bresenham implementations use (variants of) this code snippet as part of their inner loop:

Code: Select all

.loop
 LSR mask          ; 5
 BEQ fixup_column  ; 2 (assumed not taken)
 SBC dy            ; 3
 BCC fixup_y       ; 2 (assumed not taken)
 DEX               ; 2
 BNE loop          ; 3 (assumed taken)
-----------------------------
             total: 17 cycles

... which however lengthen each step to about 51 cycles when either column fixup or y fixup are required. In any case, these chunky algorithms only work at all for |dx|>|dy| and only with |dx|/|dy| > 2 *) they can break even with the routine I keep in my drawer **): 50% of the pixels then need 17 cycles, and 50% of the pixels need 51 cycles, so (17+51)/2 = 34 cycles on average. A bit lower - down to a minimum of about (7 x 17 + 51)/8 ~= 21 cycles - for even smaller slopes and for horizontal lines. The majority of slopes (including those where |dy|>|dx|) do not profit at all.

As I wrote earlier in this thread, in any case - and already with my above implementation of Bresenham at 20000 pixels/second - we are working with drawing speeds here that demand a good and fast(!) geometry engine. Even my faster line routine, with a sustained speed of 30000 pixels/second, won't save the day when the geometry engine is crap.

*) ... measured in pixel steps. If one takes the PAR into account, that corresponds to a maximum slope angle of phi = arctan(1/2 x 3/5) ~= 16.7° (for PAL). When drawing objects in vector graphics, we can assume an equidistribution of angles (provided the programmer took the PAR into account), which means only 16.7°/90° ^= 18.6% of all lines have a chance to be faster with a chunky Bresenham. For NTSC (with its PAR ~= 3:2), this figure comes out to 20.6%, not much more. This should give you a good idea why I never bothered to actually implement that variant.

**) not entirely in the drawer: it is used in my Vector demo (rotating a icosahedron inscribed in a cube), the "Rotating Earth Globe" demo (see the *.d64 of the MG batch suite), in the Madness demo, and also in MINISKETCH, where it connects the single points that are returned from the lightpen each frame to a line traverse.

pixel · Post by **pixel** » Mon Nov 04, 2024 10:23 am

Mike wrote: ↑Sat Nov 02, 2024 12:01 pm Even my faster line routine, with a sustained speed of 30000 pixels/second, won't save the day when the geometry engine is crap.

Can be a very productive time figuring one out.
“One of my most productive days was throwing away 1000 lines of code.” (Ken Thompson)

I have a concept that looks very promising with throw-away-day dooming all over it. At least I arrived at only adding, subtracting, negating, swapping and look-up tables and some kind of 8.8 fixed-point math. Feels like doing bloody homework again. Did that with a cold, so there'd better be another round of reassurance and trying to science something out of it. But the new video mode is definitely a new kind of boost.

Mike · Post by **Mike** » Mon Nov 04, 2024 12:19 pm

pixel wrote:[...] the new video mode is definitely a new kind of boost.

You mean the 128x256 resolution brought into the discussion here by MrSterlingBS?

I can tell you that the speed increase from saving the addition to the low byte (to get the next column address) is marginal with my faster line routine, as - as I already wrote - the address calculation/lookup already has been factored out of the inner loop.

Can be a very productive time figuring one out.

The geometry engine I implemented for the vector cube/icosahedron demo was a first for me in 65xx code after an effective 16 year hiatus on coding for the VIC-20 (1986..2002). I put the focus on getting it working, with a good leeway on arithmetic precision (actually 2:14 fixpoint with 4:28 intermediate results), not so much on speed. Those are 20 rotated vertices (8 for the cube, 12 for the icosahedron) and 42 lines (12 for the cube, 30 for the icosahedron), around all three axes. Not actually using a 3x3 matrix but 3 single axis rotations, i.e. 12 multiplies instead of 9 so that could be streamlined. Perspective correction (and taking into account the PAR, for NTSC actually) needing another 2 multiplies per vertex (with lookup from a table of inverses). The mult/trig/inverse tables are rather big, and could also be streamlined for less size at the expense of accuracy. As long as the calculated screen co-ordinates still differ less than 1/2 pixel from the ideal value, that would be in order.

As I wrote somewhere else here in Denial, in the end the demo spends about the same time on geometry processing as it does on drawing the lines, these two tasks combined take about 80% of the total time. The rest of the time is spent in business logic, clearing the shadow buffer and copying the shadow buffer over to the display buffer when the current animation step has been completely rendered. As stated in the demo, it arrives at roughly 4 to 5 frames/second.

This very demo has been unchallenged for over 20 years now. So here's your benchmark: try to substantially better the frame rate with otherwise the same visual result.

pixel · Post by **pixel** » Mon Nov 04, 2024 2:57 pm

Mike wrote: ↑Mon Nov 04, 2024 12:19 pm This very demo has been unchallenged for over 20 years now.

+20y! That's even longer than my game release delays! True: a kick in the glockenspiel is overdue.

But instead I'll keep tossing around bytes in what's to become the engine of a game, hoping to get away with rather primitive calculations, LOGO-style.

At the moment I'm… well… kinda throwing things away. But I managed to rotate a cube:

Mike · Post by **Mike** » Mon Nov 04, 2024 3:18 pm

pixel wrote:But I managed to rotate a cube: [...]

That cube must have been exposed to a rather strong gravity field in the vicinity of a black hole.

pixel · Post by **pixel** » Mon Nov 04, 2024 3:46 pm

Mike wrote: ↑Mon Nov 04, 2024 3:18 pm
pixel wrote:But I managed to rotate a cube: [...]
That cube must have been exposed to a rather strong gravity field in the vicinity of a black hole.

That's the last working version. With the new thing I hope to be able to get the (8-bit) vertices of a rotated cube in about 600 cycles.

MrSterlingBS · Post by **MrSterlingBS** » Tue Nov 12, 2024 11:34 am

Hi,

After implementing the SBC/BCS technique, the line drawing routine became a little bit faster.
The results for a line are around 37 cycles per pixel. X1=10;Y1=10; X2=110;Y2=110
Screen sice 128 by 256 pixels.

The two dots are only to verify the start and end of the line.

: BCS_Line.png (1.49 KiB) Viewed 3114 times

Thanks a lot for the feedback.

BR
Sven

thegg · Post by **thegg** » Tue Nov 12, 2024 3:08 pm

Sorry, I've been following this with some interest, but I can't recall a SBC/BCS technique being discussed. Unless you are referring to Mike's chunky Bresenham snippet. Can you either explain further, or point me to the right post please.

Mike · Post by **Mike** » Wed Nov 13, 2024 12:39 am

Also, regardless how it is implemented, with

MrSterlingBS wrote:X1=10; Y1=10; X2=110; Y2=110

as start and end co-ordinates of the line, the routine should only do diagonal steps. However, in the screenshot I see interspersed horizontal steps that are not supposed to be there. So either the screenshot uses another set of co-ordinates, or there is a flaw with the routine. I take the given cycle count with a grain of salt.

pixel · Post by **pixel** » Wed Nov 13, 2024 1:53 am

Ah, yes, looks like a carry flag sneaking in with each new char column.

MrSterlingBS · Post by **MrSterlingBS** » Wed Nov 13, 2024 2:19 am

Yes, there was an error in the routine.
I hope I can fix it today.

Sven

The routine is not finished yet and need further optimizations.But the principle works…

: BCS_SBC_Line_Corrected.png (1.39 KiB) Viewed 2947 times

Here is the updated routine with no error. The screensize is now 168x192 pixels.

Denial

Line drawing algorithm

Re: Line drawing algorithm

Re: Line drawing algorithm

Re: Line drawing algorithm

Re: Line drawing algorithm

Re: Line drawing algorithm

Re: Line drawing algorithm

Re: Line drawing algorithm

Re: Line drawing algorithm

Re: Line drawing algorithm

Re: Line drawing algorithm

Re: Line drawing algorithm

Re: Line drawing algorithm

Re: Line drawing algorithm

Re: Line drawing algorithm

Re: Line drawing algorithm