DASM - how to force table to not cross page boundary?

IsaacKuo · Post by **IsaacKuo** » Tue Aug 25, 2009 8:19 pm

I'm defining a couple of lookup tables to precalculate row*11 and 128>>(x&7). The source looks like this:

MASKTAB:
.byte 128,64,32,16,8,4,2,1

ROWTAB:
.byte 0*11,1*11,2*11,3*11,4*11,5*11
.byte 6*11,7*11,8*11,9*11,10*11,11*11
.byte 12*11,13*11,14*11,15*11,16*11,17*11
.byte 18*11,19*11,20*11,21*11,22*11

For performance reasons I'd like to force things so that neither of these tables crosses a page boundary. Is there a nice way to do this?

Thanks!

Mike · Post by **Mike** » Wed Aug 26, 2009 1:17 am

Would you also care to post the code whose performance you are concerned about?

nippur72 · Post by **nippur72** » Wed Aug 26, 2009 1:55 am

I suppose he's concerned about the extra CPU cycle needed when "LDA addr,X" crosses a page boundary.

In that case I would use an "origin" directive to force the data to a fixed address I am sure of (e.g. at the start of the program, between the BASIC stub and your program entry point).
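The penalty nippur72 describes can be modelled in a few lines. This is a Python sketch rather than DASM source. It assumes the documented 6502 timing for `LDA abs,X` (4 cycles, plus 1 when the low byte of the base address plus the index carries into the next page); the table address `0x10FC` is made up for illustration.

```python
def lda_abs_x_cycles(base, index):
    """Cycles for LDA base,X on a 6502: 4, plus 1 if the access crosses a page."""
    crosses = ((base & 0xFF) + index) > 0xFF
    return 4 + (1 if crosses else 0)

# Hypothetical placement: an 8-byte table starting 4 bytes before the end of a page.
MASKTAB = 0x10FC
timings = [lda_abs_x_cycles(MASKTAB, i) for i in range(8)]
print(timings)
```

With this placement only the last four entries pay the extra cycle, so the cost depends on the index as well as on where the table sits.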

carlsson · Post by **carlsson** » Wed Aug 26, 2009 2:05 am

The keyword you are looking for is ALIGN:

[label] ALIGN N[,fill]

Align the current PC to an N byte boundary. The default fill character is always 0, and has nothing to do with the default fill character specifiable in an ORG.

See doc/dasm.txt in your installation for somewhat incomplete, but still useful, documentation.

Mike · Post by **Mike** » Wed Aug 26, 2009 2:18 am

ALIGN 256 will nearly always pad to the beginning of the next page, except when the address is already aligned. While it ensures that indexed addressing never crosses a page boundary, as nippur72 has already pointed out, it leads to an unnecessary waste of bytes.

Obviously, a macro could also take the length of the following table as a parameter, calculate whether the table would cross a page boundary, and only then align to the next page.

The question remains whether it is really appropriate to be concerned about that extra cycle. That's why I asked IsaacKuo to post the code as well.
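The conditional alignment Mike describes comes down to one comparison of page numbers. The following is a Python model of that decision, not DASM macro syntax; the function names and the starting address `0x10F0` are hypothetical, and it assumes each table is at most 256 bytes long.

```python
PAGE = 256

def crosses_page(start, length):
    """True if a table of `length` bytes placed at `start` spans two pages."""
    return (start >> 8) != ((start + length - 1) >> 8)

def align_always(pc):
    """What an unconditional ALIGN 256 does to the program counter."""
    return (pc + PAGE - 1) // PAGE * PAGE

def place_table(pc, length):
    """Pad to the next page only when the table would otherwise cross one."""
    return align_always(pc) if crosses_page(pc, length) else pc

pc = 0x10F0                              # hypothetical current address
masktab = place_table(pc, 8)             # fits before the page end: no padding
rowtab = place_table(masktab + 8, 23)    # would cross: moved to the next page
print(hex(masktab), hex(rowtab))
```

Here the 8-byte table stays where it is and only the 23-byte table is moved, wasting 8 bytes instead of the 16 an unconditional `ALIGN 256` before the first table would cost.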

IsaacKuo · Post by **IsaacKuo** » Wed Aug 26, 2009 4:46 am

I'm still working on the code:

Code:

xytowork: ; converts from x,y to work coordinates
; (WORK),y points to SCREEN+(y>>3)*22+(x>>3)
; WORK+2 is bitmask (calculated from x&7)
; WORK+3 is y&7

  tya
  and #7
  sta WORK+3     ; y&7 --> WORK+3

  tya
  lsr
  lsr
  lsr
  tay            ; y>>3 --> y
  lda ROWTAB,y
  ldy #>SCREEN
  asl
  bcc xytoworkskip
  iny
xytoworkskip:
  sta WORK
  sty WORK+1    ; SCREEN+(y>>3)*22  --> WORK, WORK+1

  txa
  and #7
  tay
  lda MASKTAB,y
  sta WORK+2    ; 128>>(x&7) --> WORK+2

  txa
  lsr
  lsr
  lsr
  tay           ; x>>3 --> y

  rts
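To check what the routine computes, here is a small Python model of it. It is a sketch under assumptions: `SCREEN = 0x1E00` is assumed (substitute your own page-aligned screen base), and the tables are rebuilt from the definitions in the first post.

```python
SCREEN = 0x1E00   # assumption: page-aligned screen base; use your own value
ROWTAB = [row * 11 for row in range(23)]
MASKTAB = [128 >> bit for bit in range(8)]

def xytowork(x, y):
    """Mirror xytowork: returns (WORK pointer, Y register, bitmask, y&7)."""
    doubled = ROWTAB[y >> 3] << 1            # asl: 9-bit result, bit 8 is the carry
    high = (SCREEN >> 8) + (doubled >> 8)    # ldy #>SCREEN, iny if carry set
    work = (high << 8) | (doubled & 0xFF)
    return work, x >> 3, MASKTAB[x & 7], y & 7

work, column, mask, line = xytowork(13, 181)
print(hex(work + column), mask, line)
```

Because only the high byte of `SCREEN` is used, the model (like the routine) relies on the screen starting on a page boundary.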

I read about ALIGN in the DASM.TXT documentation, but as noted it would lead to a lot of memory waste (if used in the most obvious way).

For this example, it's not really a problem but I find it inelegant to sloppily ignore the page boundary issue.

I'm more concerned with future applications involving subpixel rendering. I anticipate a lot of tightly looped use of table lookups to deal with the inconvenient 14 pixel wide blocks and multiple irregular mask tables. I won't be able to just bit shift to step a pixel left/right, so text/sprite rendering will need a lot of table lookups. Pre-rolled/rotated/reflected sprites/fonts would consume far too much RAM, so pixel-by-pixel rendering looks like the most practical option.

Mike · Post by **Mike** » Wed Aug 26, 2009 5:26 am

If you try to avoid cycle wastage by avoiding the extra cycle when crossing page boundaries, you're looking in the wrong place.

The code you've posted can be made tighter by providing two tables that contain the precomputed low and high bytes of the start of each screen line, with the SCREEN start already included.

Furthermore, you look up the bitmask, which is quite possibly processed later via ORA. I'd keep X&7 in register X and then do an ORA powers_of_2,X in place. This saves you an LDA and an STA into a temporary location, which is most probably only used once during a point plot. The same applies to Y&7, which seems to select the line within a character definition where the pixel is going to be set. That value should be computed in place and then used via LDA/STA (char),Y. Inlining the subroutine will omit a JSR and an RTS, saving another 12 cycles (6 each).

Since you're going to re-compute the address for every point plot, the code won't be very fast, anyway. And this:

IsaacKuo wrote:[...] but I find it inelegant to sloppily ignore the page boundary issue.

is something nobody will blame you for.

Michael

P.S.: There are valid reasons to be concerned about the extra cycles when crossing page boundaries. For example cycle exact processing within raster lines, or transfer code for speedy disk I/O. In these cases the function critically depends on every cycle. But then we don't need to talk about performance in the sense of operations per second.
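The two-table idea from this post can be generated rather than typed by hand. The sketch below is Python, with several assumptions: `SCREEN = 0x1E00`, 22 columns and 23 rows as in the original `ROWTAB`, and the labels `ROWLO` and `ROWHI` are invented for illustration.

```python
SCREEN = 0x1E00          # assumption: substitute your own screen base
COLUMNS, ROWS = 22, 23   # 23 rows matches the original ROWTAB

row_addr = [SCREEN + row * COLUMNS for row in range(ROWS)]
row_lo = [addr & 0xFF for addr in row_addr]   # low byte of each line start
row_hi = [addr >> 8 for addr in row_addr]     # high byte of each line start

def byte_line(label, values):
    """Format one table as a label followed by a .byte line."""
    return f"{label}:\n  .byte " + ",".join(str(v) for v in values)

print(byte_line("ROWLO", row_lo))   # hypothetical label
print(byte_line("ROWHI", row_hi))   # hypothetical label
```

The trade is 46 bytes of table instead of 23, in exchange for dropping the shift and carry handling: each byte of the pointer becomes a single indexed load.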

IsaacKuo · Post by **IsaacKuo** » Wed Aug 26, 2009 6:14 am

Mike wrote:The code you've posted can be made tighter by providing two tables that contain the precomputed low and high bytes of the start of each screen line, with the SCREEN start already included.

The original version of my code used two tables. I was experimenting with using just one table. Like I said, I'm still working on the code.

Mike wrote:Furthermore, you look up the bitmask, which is quite possibly processed later via ORA. I'd keep X&7 in register X and then do an ORA powers_of_2,X in place.

I'll keep that in mind; right now it seems like there's just way too much processing going on in between to sacrifice the X register. The character in location SCREEN+(y>>3)*22+(x>>3) needs to be tested to see if it's a custom character or a ROM character, then its base address calculated, and so on.

Depending on the operation, a custom character may need to be allocated or deallocated. I can temporarily save the X register if I need to in that case, of course.

Hmm...looking at my TESTXY, SETXY, and RESETXY code, it looks like I can rewrite them to leave the X register untouched until it finally gets to the appropriate stage...

Mike wrote:Inlining the subroutine will omit a JSR and an RTS, saving another 12 cycles (6 each).

Yes, at the expense of triplicating the routine for TESTXY, SETXY, and RESETXY routines. Since I'm still working on the code, I'm going to use a shared routine (easier to debug).

Mike · Post by **Mike** » Wed Aug 26, 2009 10:02 am

BTW, you should take a look at the 8 bytes starting at each of the addresses $8268 and $8668.

IsaacKuo · Post by **IsaacKuo** » Wed Aug 26, 2009 12:13 pm

Huh? But that's in the middle of Character ROM...

...oh...

Well, that just saved me 16 bytes. Too bad it won't work for my irregular subpixel rendering mask tables.

Anyway, try as I might I can't shake the requirement to use the X register in my SETXY and RESETXY routines. I use INC and DEC commands to adjust the flea population counters of custom chars, and that requires using the X register.

Rather than obsess over a few cycles here or there, I'll just go back to the original plan of storing x&7 in zero page. It's not like I need high performance to animate just a single pixel per field.
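The two addresses Mike points at can be decoded with a little arithmetic. This Python sketch assumes the VIC-20 character ROM starts at $8000 with 8 bytes per glyph. It only computes which glyphs the addresses belong to and makes no claim about the ROM contents. Going by IsaacKuo's reply, those glyphs presumably hold the 128>>n pattern and its inverse.

```python
CHAR_ROM = 0x8000       # VIC-20 character ROM base
BYTES_PER_GLYPH = 8

def glyph_index(address):
    """Index of the glyph whose definition starts at `address`."""
    offset = address - CHAR_ROM
    if offset % BYTES_PER_GLYPH:
        raise ValueError("address is not the start of a glyph")
    return offset // BYTES_PER_GLYPH

first = glyph_index(0x8268)
second = glyph_index(0x8668)
print(first, second, second - first)
```

The two glyphs are exactly 128 entries apart, which is consistent with one being the reverse-video form of the other.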

matsondawson · Post by **matsondawson** » Wed Aug 26, 2009 4:44 pm

Is it possible to use a 24 wide screen?
Then you could save a fair bit of code and time.
Note: the code needs to live in zero page, and it is not tested yet.

Code:

; (WORK),y points to SCREEN+(y>>3)*24


2  tya                   ; this does (y&~7)*3 i.e. (y>>3)*24
2  lsr                   ; y>>1  
3  and #~3               ; (y>>1)&~3 = (y&~7)>>1 
3  sta $self_modifying   
2  asl                   ; * 3
2  adc #self_modifying   ; A = ((y>>1)&~3) + (((y>>1)&~3)<<1)
2  asl                   ; A<<1, A holds low result now, carry holds high bit
3  sta WORK              ;
2  lda #0                ;
2  adc #>SCREEN          ; add carry bit to screen location
3  sta WORK+1            ; 

26 cycles total, 18 bytes
vs
27/28 cycles, 16+23 bytes for the (y>>3)*22 version

Edited: I've fixed the previous non-working code.
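Since the routine is described as untested, its arithmetic can at least be checked on paper. The Python sketch below follows the listing step by step with 8-bit registers; the function name is hypothetical, and it assumes the carry is clear going into the `adc`, which holds because the preceding `asl` shifts out a zero bit.

```python
def row_offset_24(y):
    """Return (9-bit offset, adc_overflowed) for the 24-column routine."""
    a = (y >> 1) & ~3 & 0xFF        # tya / lsr / and #~3
    operand = a                     # sta into the operand of the later adc
    a = (a << 1) & 0xFF             # asl: a <= 124 here, so carry out is 0
    total = a + operand             # adc with carry clear
    overflowed = total > 0xFF       # a carry here is lost by the next asl
    a = total & 0xFF
    return a << 1, overflowed       # asl: bit 8 is the carry added to >SCREEN

ok = all(row_offset_24(y) == ((y >> 3) * 24, False) for y in range(176))
print(ok, row_offset_24(176))
```

Under these assumptions the result matches (y>>3)*24 for y below 176, that is for the first 22 character rows; from y = 176 on, the 8-bit add overflows and the offset is wrong.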

Mike · Post by **Mike** » Wed Aug 26, 2009 6:00 pm

Code:

3  sta $self_modifying    
2  asl                   ; * 3 
2  adc #self_modifying   ; A = ((y>>1)&~3) + (((y>>1)&~3)<<1)

Unless the immediate operand of ADC, and with that also the whole code, resides in the zeropage, STA needs 4 cycles. So this part takes 8, not 7 cycles.

Otherwise, 'STA zp/ASL A/ADC zp' takes the same 8 cycles, and doesn't rely on self-modifying code. There should be a spare ZP address available without problems.

And at the start of the routine, 'AND #$F8/LSR A' indeed gives the same result as 'LSR A/AND #$FC'.

P.S.: At first I missed the comment in the post saying that the code is indeed supposed to live in the zero page. IMO you're wasting a lot of valuable resources down there with that.
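Mike's remark that 'AND #$F8/LSR A' and 'LSR A/AND #$FC' give the same result is easy to confirm exhaustively, since there are only 256 possible inputs. A Python check, with hypothetical function names:

```python
def and_then_lsr(y):
    """AND #$F8 followed by LSR A."""
    return (y & 0xF8) >> 1

def lsr_then_and(y):
    """LSR A followed by AND #$FC."""
    return (y >> 1) & 0xFC

mismatches = [y for y in range(256) if and_then_lsr(y) != lsr_then_and(y)]
print(mismatches)
```

An empty list means the two orderings agree for every 8-bit value.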

matsondawson · Post by **matsondawson** » Wed Aug 26, 2009 6:02 pm

Yup, it was intended for zero page; I've modified the note.
If you were putting it somewhere else, you'd add the extra cycle and use a temporary zero-page location.

matsondawson · Post by **matsondawson** » Wed Sep 02, 2009 10:41 pm

Saved another cycle and removed the zero page self modifying code!

Code:

2  tya                   ; this does (y&~7)*3 i.e. (y>>3)*24
3  and #~7               ; (y&~7)
3  sta $temp             ; $temp = (y&~7)  
5  rra $temp             ; a = (y&~7) + (y&~7)>>1 = (y&~7)*1.5
2  asl                   ; A<<1, A holds low result now, carry holds high bit
3  sta WORK              ;
2  lda #0                ;
2  adc #>SCREEN          ; add carry bit to screen location
3  sta WORK+1            ;

25 cycles 16 bytes
vs
27/28 cycles 16+23 bytes for (y>>3)*22 version
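RRA is an undocumented opcode, so it is worth spelling out what this version relies on. The Python sketch below models RRA as "ROR the memory byte, then ADC it into A", which is the commonly documented behaviour; the function names are hypothetical. It suggests the listing depends on the carry being clear on entry, because nothing before the `rra` touches the carry and ROR rotates it into bit 7.

```python
def rra(a, mem, carry):
    """Model of RRA: ROR mem, then ADC mem. Returns (A, mem, carry out)."""
    rotated = (carry << 7) | (mem >> 1)   # ROR: old carry enters bit 7
    carry = mem & 1                       # ROR: old bit 0 becomes the carry
    total = a + rotated + carry           # ADC uses the carry left by ROR
    return total & 0xFF, rotated, total >> 8

def row_offset_rra(y, carry_in=0):
    """Return (9-bit offset, add_overflowed) for the RRA version."""
    a = y & ~7 & 0xFF                     # tya / and #~7
    a, _, overflow = rra(a, a, carry_in)  # sta temp / rra temp
    return a << 1, bool(overflow)         # asl: bit 8 is the carry for >SCREEN

ok = all(row_offset_rra(y) == ((y >> 3) * 24, False) for y in range(176))
print(ok, row_offset_rra(8, carry_in=1))
```

With the carry clear the model matches (y>>3)*24 for y below 176; with the carry set on entry the rotated value gains 128 and the offset is wrong, so a caller would need to guarantee a clear carry.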

IsaacKuo · Post by **IsaacKuo** » Thu Sep 03, 2009 1:02 am

The current version of the code is:

Code:

  tya
  lsr
  lsr
  lsr
  tay            ; y>>3 --> y
  lda ROWTAB,y
  asl
  sta WORK
  lda #SCREEN/512
  rol
  sta WORK+1     ; SCREEN+(y>>3)*22  --> WORK, WORK+1

And yes, it must be 22 columns so ROM print routines can be used without...interesting issues.

16+23 bytes, 26 cycles.

I use lda #SCREEN/512 and rol instead of lda #0 and adc #>SCREEN because it's more consistent with the routines for calculating CHARBASE addresses. Those can span up to 2K, so multiple rol's and asl's are used to pump the value up.

For my dynavic routines, though, I use a 20 column screen (no mixing of ROM characters; no concerns over ROM print routines). 20 is almost as convenient as 24 for mathematics, so I use math routines instead of a lookup table.
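The `lda #SCREEN/512` / `rol` pair in the listing above yields the same high byte as `lda #0` / `adc #>SCREEN` only when SCREEN is a multiple of 512, because `rol` shifts the carry into bit 0 rather than adding it. A Python comparison of the two, with hypothetical function names:

```python
def high_byte_rol(screen, carry):
    """lda #SCREEN/512 ; rol  ->  (SCREEN/512)*2 with the carry in bit 0."""
    return (((screen // 512) << 1) | carry) & 0xFF

def high_byte_adc(screen, carry):
    """lda #0 ; adc #>SCREEN  ->  high byte of SCREEN plus the carry."""
    return ((screen >> 8) + carry) & 0xFF

same = all(
    high_byte_rol(s, c) == high_byte_adc(s, c)
    for s in range(0, 0x10000, 512)
    for c in (0, 1)
)
print(same, high_byte_rol(0x1F00, 0), high_byte_adc(0x1F00, 0))
```

For a base such as 0x1F00, which is not a multiple of 512, the two variants disagree, so the `rol` form quietly assumes 512-byte alignment of the screen.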