Line drawing algorithm

Mike · Post by **Mike** » Sat Mar 16, 2024 2:51 pm

MrSterlingBS wrote:[...]

PM sent

Mike · Post by **Mike** » Sun Mar 24, 2024 7:53 am

Warning, longer post.

Before I posted my own version of the "20000 pixels/second" line routine on the previous page, I also had a PM exchange with Merytsetesh about MrSterlingBS's original implementation:

in a PM to Merytsetesh, Mike wrote:I'll take a look into this the next days whether I can derive a version of the routine that does not have those rounding issues, and will do a post about the improved version here in Denial. FYI, that routine roughly manages 20000 pixels/second.

While doing so, I found the 'skeleton' of the main loop to be general enough that only a single copy was necessary (not two, as I originally had imagined). One of the self-modifications changes between CPX and CPY at the end of the loop as necessary, that had been the decisive factor in it.

The routine is remarkable in the sense that it manages to keep the X and Y co-ordinates as work values in the respective registers throughout the whole operation of the main loop - that is something you normally do not see in non-trivial 6502 code. Normally, you would always need to swap between the registers and (zero page) memory. Still the 6502 is missing a second accumulator, and that means the single accumulator has to hold the current bitmap byte value and the decision/error value in turns (and for a short time, also the low and high byte parts of the bit column address).

Yet, this routine still does a full address calculation for each pixel, even if that is extremely streamlined by avoiding shift and mask instructions during retrieval of the bitmap column pointer which would otherwise be helpful to keep the tables small.

To further improve on this, the crucial observation is that the Y indexed address mode already does half the job of calculating the effective address of the bitmap byte, and the bitmap column pointer only changes its value when the respective 8-pixel bitmap column is left to either side. Most of the time, the bitmap column pointer stays put so we only need to update it as necessary. Factoring out that part is one of the two necessary steps.

The second step is saving the time on the bit mask retrieval. Again, the routine here reads the bit mask from a table with an X indexed ORA instruction. However, while stepping along the line, the bit in the mask either stays put or changes its position one to the left or to the right (with possible wraparound). This can be streamlined by unrolling the main loop and providing 8 copies of the loop body, one for each position of the bit to be set (or reset), as immediate constant.

In the end, the main loop of the line routine changes from:

Code: Select all

.Line_03
 INX            ; or DEX for x as main direction (or INY/DEY for y as main direction)
 SEC
 LDA accu
 SBC #&00       ; effectively, SBC #dy for x as main direction (or SBC #dx for y as main direction)
 BCS Line_04
 ADC #&00       ; effectively, ADC #dx for x as main direction (or ADC #dy for y as main direction)
 INY            ; or DEY for x as main direction (or INX/DEX for y as main direction)
.Line_04
 STA accu
.Line_05
 LDA Line_06,X
 STA scrn
 LDA Line_07,X
 STA scrn+1
 LDA (scrn),Y
 ORA Line_08,X
 STA (scrn),Y
 CPX #&00       ; effectively, CPX #x2 for x as main direction (or CPY #y2 for y as main direction)
 BNE Line_03
 
49.5/52.5 cycles/pixel

... (the comments show where the preparation part of the routine modifies the main loop) ...

to:

Code: Select all

5.5     LDA (base),Y
  2     EOR #bit     ; ** or AND/ORA
  6     STA (base),Y
  2     DEX
  2     BEQ end
  3     LDA error
  3     SBC dy
3/2 +-- BCS
0/3 |   ADC dx
0/2 |   INY          ; ** or DEY
  3 +-> STA error

29.5/33.5 cycles/pixel

**) self-modifying code

The latter is one copy of the loop body for x as main direction, the loop body for y as main direction has these instructions slightly re-arranged. Both main loops contain 8 copies of the loop body (i.e. they have been "unrolled 8 times") to cover all bit positions within a byte.

The X register now has its role changed from keeping the X co-ordinate to counting down the number of remaining pixels. If that number becomes 0, one of the "BEQ end" instructions is executed to terminate the loop. That is another small optimization as it saves one cycle per pixel in most cases. Furthermore, DEX does not change the carry flag (as do the CPX or CPY instructions in the former routine) so it is not necessary to do a SEC to prepare the SBC instruction in the Bresenham step. It is only necessary to enter the main loop with the C flag set once (actually, when the bitmap column pointer is adjusted, the state of the C flag needs to be kept in mind as well).

Adjusting the bitmap column pointer weighs in with 23 cycles, at most for every 8th pixel (less often so for y as main direction), so this adds in another 3 cycles per pixel at worst. In the end, the average number of cycles per pixel goes down from 51 cycles to 31.5(+3) cycles for the faster implementation, resulting in ~50 % more speed.

As the full address calculation of the bitmap column pointer is only done once, the tables can be kept small: only 20 entries for the low and high bytes each (assuming 160 pixels horizontal width for the bitmap, as used by MINIGRAFIK). The bit position table is not anymore necessary as it is now 'contained' within the immediate fields of the ORA/AND/EOR instructions. Another small improvement in play: the line is always drawn left to right - this is always possible by exchanging the line ends, should x2 be less than x1.

What does not carry over is using immediate operands for the SBC and ADC instructions - modifying these would require to change their operand fields for all 8 copies of the loop body in the relevant main loop. During init, this would incur an extra cost of at least 64 cycles (for 16 STA ABS instructions) which needs a line of at least 32 pixels length to break even.

All that needs to be done now is preparing the relevant of the two main loops with the correct instructions for stepping along the y axis (either INY or DEY), at 8 places. When the draw action is changed between set, reset or invert, also those corresponding instructions, possibly also their operands need be modified. My faster routine takes care to only do these modifications when they are really necessary. If one disregards the change of draw action, the setup part still only takes about the same time on average as the setup part of the former routine (150 cycles vs 130 cycles).

...

That being said, this faster routine now sits at a sweet spot of being quite fast with 30000 pixels/second while still being less than 1 KB in size. Any further improvement in speed to the direction of 20 cycles/pixel (as pondered about by me earlier in this thread) likely needs an exuberant increase in code size to cover all the definitions by cases. If the flexibility to change the draw action is still supposed to be kept, I would estimate it at about 10 KB or more, which IMO makes such a routine unfit for general use.

In any case, whether with the 20000 pixels/second routine here or with the faster 30000 pixels/second routine in my repository, the challenge now is to supply the line routine with co-ordinate data at a fast pace. When the geometry engine is sluggish, neither of the two line routines will save your day.

Greetings,

Michael

Merytsetesh · Post by **Merytsetesh** » Sun Mar 24, 2024 12:01 pm

First off, I'd like to take my hat off to you, Mike: this is incredible. I would never have looked at things from this perspective -- not for a long time, anyway. Getting 20K-30K px/s on a VIC is incredible. Everything here is a masterclass in elegance in engineering... I'm lost for words.

Mike wrote:Another small improvement in play: the line is always drawn left to right - this is always possible by exchanging the line ends, should x2 be less than x1.

I'm curious: what if you had y2>y1, so lines were always drawn in one direction vertically as well? I admit I may have missed something in your explanation, so as ever I apologise in advance.

Once again, I'm blown away, and humbled, by this kind of vision. Thank you for sharing this, Mike.

Best wishes,

Merytsetesh

Mike · Post by **Mike** » Sun Mar 24, 2024 1:37 pm

Merytsetesh wrote:I'm curious: what if you had y2>y1, so lines were always drawn in one direction vertically as well?

You can't have both.

Given x2>x1 and y2>y1, a line on screen looks like a backslash, "\", assuming a left-handed co-ordinate system. You can't enforce y2>y1 though when the endpoints already have been sorted for x2>x1, as that would exclude lines that look like a slash, "/".

Rewriting the line routine to sort the y co-ordinates for y2>y1 instead would indeed allow one to use INY only for stepping the y-co-ordinate, but in turn also requires two more main loops: the pixel bit sequence is hard-coded within the main loops! For x2>x1, one of the original two main loops would be used (with the bits ordered $80, $40, ..., $01 for ORA), x2<x1 would require the operands of the ORA/AND/EOR instructions to be placed in inverse order (for ORA: $01, $02, ..., $80). Alternatively, with still only two main loops, these operands would need to be modified everytime they are not in the correct order, which surely takes more time than just replacing INY for DEY (or vice versa).

Merytsetesh · Post by **Merytsetesh** » Mon Mar 25, 2024 1:17 pm

Mike wrote: ↑Sun Mar 24, 2024 1:37 pm Given x2>x1 and y2>y1, a line on screen looks like a backslash, "\", assuming a left-handed co-ordinate system. You can't enforce y2>y1 though when the endpoints already have been sorted for x2>x1, as that would exclude lines that look like a slash, "/".

Of course, and I'm an idiot for asking in the first place.

Teach me not to post when I've had not coffee.

Mike wrote: ↑Sun Mar 24, 2024 1:37 pmAlternatively, with still only two main loops, these operands would need to be modified everytime they are not in the correct order, which surely takes more time than just replacing INY for DEY (or vice versa).

No, that doesn't make sense to do. Better to keep things simple. Do the tables get generated each time the routine is called or only once on initialisation?

Mike · Post by **Mike** » Mon Mar 25, 2024 2:00 pm

Merytsetesh wrote:Do the tables get generated each time the routine is called or only once on initialisation?

I suppose you mean the tables that are written in lines 20..27 of the test rig: these of course need only be written once, at program start.

The main loop of the test rig just sets random co-ordinates for the endpoints, calls the line routine with SYS 15360 and then loops.

MrSterlingBS · Post by **MrSterlingBS** » Mon Jun 24, 2024 3:39 am

Mike wrote: ↑Sat Mar 16, 2024 4:59 am Here's my take on this subject:

Before any questions come up in this regard: MINIGRAFIK already contains a line routine, which is not as fast as the routine below, however it is much more compact (just about 100 bytes extra beyond the pixel set function otherwise already provided within the MG code) and also more capable (also works for multi-colour and sets the colour RAM along as necessary). The speed of the built-in line routine, about 4000 pixels per second, is entirely sufficient within the context of a running BASIC program. You would not see any substantial speed difference with the faster line routine shown below, as the surrounding overhead of BASIC takes the majority of time spent.

That being said, here is an implementation of Bresenham very similar to MrSterlingBS's port of "Bitbreaker's" line routine. Unlike the latter, my implementation gets the rounding right, and it only needs one single main loop. That main loop is heavily "self-modified" to handle all 8 possible octants. The code itself is fairly compact (just 151 bytes), however it needs another 3x160 = 480 bytes of tables worth for table lookups of low bytes, high bytes and bit positions for each X co-ordinate. The routine is embedded into a small BASIC test rig, just decompress the *.zip file and drag the file "boot" into the main window of VICE with at least a +8K RAM expansion (download):
Code: Select all
10 POKE55,0:POKE56,60:CLR:FORT=15360TO15510:READA:POKET,A:NEXT
11 :
12 DATA 56,162,232,165,253,229,251,176,6,162,202,73,255,105,1,134,5,133,3,56,160,200
13 DATA 165,254,229,252,176,6,160,136,73,255,105,1,132,6,133,4,166,251,164,252,165,3
14 DATA 197,4,176,34,141,121,60,165,6,141,116,60,165,4,141,125,60,74,133,3,165,5,141
15 DATA 126,60,169,192,141,146,60,165,254,141,147,60,76,129,60,141,125,60,74,133,3,165
16 DATA 5,141,116,60,165,4,141,121,60,165,6,141,126,60,169,224,141,146,60,165,253,141
17 DATA 147,60,76,129,60,232,56,165,3,233,0,176,3,105,0,200,133,3,189,0,61,133,5,189,0
18 DATA 62,133,6,177,5,29,0,63,145,5,224,0,208,222,96
19 :
20 BI=128:FORX=0TO159
21 AD=4352+192*INT(X/8)
22 HI=INT(AD/256):LO=AD-256*HI
23 POKE15616+X,LO
24 POKE15872+X,HI
25 POKE16128+X,BI
26 BI=BI/2:IFBI<1THENBI=128
27 NEXT
28 :
29 @ON:@CLR
30 POKE251,RND(1)*160:REM X1
31 POKE252,RND(1)*192:REM Y1
32 POKE253,RND(1)*160:REM X2
33 POKE254,RND(1)*192:REM Y2
34 SYS15360
35 GETA$:IFA$=""THEN35
36 GOTO30
The tables are not contained in the DATA lines, but calculated and written by the loop in lines 20 to 27. They have been aligned on page starts to avoid the odd extra cycle on page-crossing accesses.

The routine approaches a speed of roughly 20000 pixels per second. Roughly 130 cycles are spent until the first pixel is plotted, all subsequent pixels then are drawn at about 51 cycles per pixel. The archive contains the source code.

...

I do have another implementation of Bresenham in hand, which approaches 30000 pixels per second, i.e. it is roughly 50% faster than the routine presented here. For a comparison, that routine spends 150 cycles on average until the first pixel is plotted, all subsequent pixels then are drawn at about 33 cycles per pixel. Break even already happens for the second pixel plot. The routine (including some *small* tables) weighs in at about 800 bytes.

Unlike the routine here, I reserve the use of that routine for serious applications, and I have no intentions to publish its source just for seeing it used in toy programs (and that includes benchmark tests).

Greetings,

Michael

Dear Mike,

Thanks for sharing your line drawing routine with us.

As always, it is very clever, compact and precise

If I load the main loop into the zero page and execute it there, I have a very slight speed advantage of 1 cycle per run in the main loop.
Am I seeing this correctly?

BR
Sven

MrSterlingBS · Post by **MrSterlingBS** » Tue Jun 25, 2024 2:37 am

Hello,

I have adapted Mike's LineDrawing routine a little and think it is a tiny bit faster. The opcodes for INX and DEX are no longer stored in a ZP and called up again later, but are stored directly in the corresponding address. This means that an LDA and STA are no longer required in the Y query. The opcode SEC has also been implemented so that it only needs to be executed once per X query. As a further optimization, the main loop for drawing the line has been moved to the ZP (only 30bytes). It now uses one cycle less per point drawn.

The init loop need 95-105 cycles and the Pixel loop 40-45 cycles with JSR/RTS included.
Of cousre the drawing area is only in HighByte and faster if you plot on a larger screen.
I can't determine the speed exactly, but it should be a little over 22,000 pixels/second?

You found some comments in the code.

Best regards
Sven

Mike · Post by **Mike** » Tue Jun 25, 2024 2:26 pm

MrSterlingBS wrote:The opcode SEC has also been implemented so that it only needs to be executed once per X query.

You cannot assume C is always set before the code does the calculation of |dy|. It is fairly easy to find counterexamples.

...

I put the step instruction opcodes into temporaries because at the time their direction is determined it is not yet clear where they are supposed to be put.

Unfortunately, by going the route of a 'regenerated' source instead of adapting my original source to your's and your assembler's taste, your work is much less comprehensible than it ought to be: the target addresses of the to-be-modified instructions are hardcoded as literal constants, not labels, which makes following your implementation at least somewhat difficult, and any relocation attempt error-prone.

Putting the main loop into zeropage saves one cycle indeed (by holding the decision variable as immediate operand of LDA and doing a ZP store, i.e. 5 cycles in total vs. 6), it is however a design decision whether occupying a good part of the zeropage for code is in order. When the code is supposed to be interoperable with BASIC and KERNAL, the answer likely is no.

JonBrawn · Post by **JonBrawn** » Fri Jul 05, 2024 4:43 pm

I think I'm about to steal the line-drawing crown. Victor is currently clocking twenty-four million pixels per second.

OK, yes, it's a cheat - Victor is an FPGA replacement for the 6560, and the internal clock is running at 56MHz. It's calculating which point to plot in one cycle and then plotting it in the next. The only problem is keeping it fed with coordinates to draw!

MrSterlingBS · Post by **MrSterlingBS** » Fri Aug 30, 2024 10:48 am

Hello,

there is a new version of the LineDrawing routine.
Two special cases are also dealt with. The horizontal and vertical lines. The vertical line seems to work perfectly. The horizontal one still has some problems with values when X1 is below 6. It's best to try it out yourself. In addition, the cycle counter is added to show you what speed is being achieved. The horizontal line only needs 6 cycles per point the vertical is under 20 cycles.

I'm hoping for some good ideas to make the code even faster. Maybe there will actually be an ELITE version possible on the VIC-20 at some point.

BR
Sven

Line_Speed_Test.zip: (4.04 KiB) Downloaded 40 times

darkatx · Post by **darkatx** » Sun Sep 01, 2024 7:19 pm

You might try this idea and that is the calculate the slope and the midpoint and literally draw two halves of the line in the routine cutting down some time.

MrSterlingBS · Post by **MrSterlingBS** » Mon Sep 09, 2024 1:37 pm

Hi,

While working on an even faster line drawing algorithm, I was able to find three major improvements.
1.) An even more efficient algorithm for vertical lines ; a line from 0 to 127 needs 2147 cycles = <17 cycles each pixel

Code: Select all

deltaX_zero:	
		ldy y1
		cpy y0
		bcs noswap
		
		sec
		sty temp		
		ldy y0
		sty y1			
		ldy temp		
		sty y0					
noswap:
		;ldx x1
		lda y1
		sbc y0
		tay 
		
		lda Pixel_high_db,x
		sta LoopVLinex+2
		sta SLoopVLinex+2
		
		sta LastPixel+2
		sta LastPixel+7

		clc
		lda Pixel_low_db,x
		adc y0
		sta LoopVLinex+1
		sta SLoopVLinex+1
		
		sta LastPixel+1
		sta LastPixel+6

		lda Setpixel,x
		sta LoopVLinex+4
		
		sta LastPixel+4

LoopVLinex:		
		lda $1000,y
		ora #0			
SLoopVLinex:		
		sta $1000,y
		dey
		bne LoopVLinex

2.) A very fast diagonal algorithm X0=Y0=0;X1=Y1=127; 4578 cycles = 25 cycles each pixel

Code: Select all

	
		...
		ldx x0
		lda Pixel_low_db,x
		sta ZP_P_low
		lda Pixel_high_db,x
		sta ZP_P_high
		ldy y0
loopxy0:
;---------------------------------
; the core routine to plot a pixel
		lda Pixel_low_db,x
		cmp ZP_P_low
		beq nextpoint0
		sta ZP_P_low
		lda Pixel_high_db,x
		sta ZP_P_high
nextpoint0:		
		lda (ZP_P_low),y
		ora SetPixel,x
		sta (ZP_P_low),y				
;---------------------------------
		iny
		inx
toy0:   	cpx #0
		bne loopxy0
        	rts

3.) A small efficiency increase of the normal line algorithm. The increase in speed depends on the specific case that every eight Pixel the low Byte change.
The main change is the test to see if the X-axis contains a new address. If this is not the case, you can jump directly to the next point,
otherwise the high and low byte addresses are read again.

Code: Select all

X3C74:
	DEY						
	SEC						
	LDA #$00		
X3C78:
	SBC #$0A				
	BCS X3C7F			
X3C7C:
	ADC #$33				
X3C7E:
	DEX						
X3C7F:	
	STA $13			
X3C81:	
	LDA Pixel_low_db,x			
	cmp ZP_P_low
	beq nextstep
	STA ZP_P_low					
	LDA Pixel_high_db,x			
	STA ZP_P_high					
nextstep:
	LDA (ZP_P_low),Y				
	ORA Setpixel,X		
	STA (ZP_P_low),Y				
X3C92:
	CPY #$1B				
	BNE X3C74							
	RTS

Faster_Line_Drawing.zip: (10.07 KiB) Downloaded 28 times

Additional note:
Unfortunately, the very fast horizontal and vertical algorithms do not work with the line drawing routine. There are always some artifacts that are printed on the screen. Reflections of the actual line. So far I have not been able to identify the problem.

MrSterlingBS · Post by **MrSterlingBS** » Wed Sep 11, 2024 3:56 am

I found this book while searching for the fastest line drawing algorithm on the internet.
There are very useful pieces of code, ideas and funny passages printed.

https://twimgs.com/ddj/abrashblackbook/gpbb35.pdf

Denial

Line drawing algorithm

Re: Line drawing algorithm

Re: Line drawing algorithm

Re: Line drawing algorithm

Re: Line drawing algorithm

Re: Line drawing algorithm

Re: Line drawing algorithm

Re: Line drawing algorithm

Re: Line drawing algorithm

Re: Line drawing algorithm

Re: Line drawing algorithm

Re: Line drawing algorithm

Re: Line drawing algorithm

Re: Line drawing algorithm

Re: Line drawing algorithm