DASM - how to force table to not cross page boundary?
I'm defining a couple lookup tables to precalculate row*11 and 128>>(x&7). The source looks like this:
MASKTAB:
.byte 128,64,32,16,8,4,2,1
ROWTAB:
.byte 0*11,1*11,2*11,3*11,4*11,5*11
.byte 6*11,7*11,8*11,9*11,10*11,11*11
.byte 12*11,13*11,14*11,15*11,16*11,17*11
.byte 18*11,19*11,20*11,21*11,22*11
For performance reasons I'd like to force things so that neither of these tables crosses a page boundary. Is there a nice way to do this?
Thanks!
The keyword you are looking for is ALIGN:
[label] ALIGN N[,fill]
Align the current PC to an N byte boundary. The default fill character is always 0, and has nothing to do with the default fill character specifiable in an ORG.
See doc/dasm.txt in your installation for somewhat incomplete documentation, but documentation nonetheless.
Anders Carlsson






- Mike
- Herr VC
- Posts: 5134
- Joined: Wed Dec 01, 2004 1:57 pm
- Location: Munich, Germany
- Occupation: electrical engineer
ALIGN 256 will nearly always pad to the beginning of the next page, except when the address is already aligned. While it ensures that indexed addressing never crosses a page boundary, as nippur72 has already pointed out, it wastes bytes unnecessarily.
Obviously, a macro could also take the length of the following table as a parameter, calculate whether the table would cross the page boundary, and only then align to the next page.
The question remains whether it is really appropriate to be concerned about that extra cycle. That's the reason I asked IsaacKuo to post the code as well.
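A rough model of the macro idea described above, written in Python rather than DASM so the arithmetic can be checked independently of any assembler: pad to the next page only when the table would otherwise straddle a boundary. The function names and the example addresses are made up for illustration; the table lengths (8 and 23 bytes) are those of MASKTAB and ROWTAB from the first post.

```python
PAGE_SIZE = 256  # a 6502 page is 256 bytes


def padding_for_plain_align(pc: int) -> int:
    """Bytes a plain ALIGN 256 would insert at address pc."""
    return (-pc) % PAGE_SIZE


def padding_for_table(pc: int, length: int) -> int:
    """Bytes to insert at pc so that a table of `length` bytes placed
    after the padding lies entirely within one page."""
    if not 0 < length <= PAGE_SIZE:
        raise ValueError("table length must be between 1 and 256 bytes")
    first_page = pc // PAGE_SIZE
    last_page = (pc + length - 1) // PAGE_SIZE
    if first_page == last_page:
        return 0                         # already fits, waste nothing
    return padding_for_plain_align(pc)   # otherwise start on the next page
```

With a hypothetical program counter of $10F0, the 8-byte MASKTAB needs no padding while the 23-byte ROWTAB needs 16 bytes. A plain ALIGN 256 at $1001 would pad 255 bytes even though both tables already fit.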
I read about ALIGN in the DASM.TXT documentation, but as noted, it would lead to a lot of memory waste (if used in the most obvious way).
I'm still working on the code:
Code: Select all
xytowork: ; converts from x,y to work coordinates
; (WORK),y points to SCREEN+(y>>3)*22+(x>>3)
; WORK+2 is bitmask (calculated from x&7)
; WORK+3 is y&7
tya
and #7
sta WORK+3 ; y&7 --> WORK+3
tya
lsr
lsr
lsr
tay ; y>>3 --> y
lda ROWTAB,y
ldy #>SCREEN
asl
bcc xytoworkskip
iny
xytoworkskip:
sta WORK
sty WORK+1 ; SCREEN+(y>>3)*22 --> WORK, WORK+1
txa
and #7
tay
lda MASKTAB,y
sta WORK+2 ; 128>>(x&7) --> WORK+2
txa
lsr
lsr
lsr
tay ; x>>3 --> y
rts
For this example, it's not really a problem but I find it inelegant to sloppily ignore the page boundary issue.
I'm more concerned with future applications involving subpixel rendering. I anticipate a lot of tightly looped use of table lookups to deal with the inconvenient 14 pixel wide blocks and multiple irregular mask tables. I won't be able to just bit shift to step a pixel left/right, so text/sprite rendering will need a lot of table lookups. Pre-rolled/rotated/reflected sprites/fonts would consume far too much RAM, so pixel-by-pixel rendering looks like the most practical option.
- Mike
If you're trying to save cycles by avoiding the extra cycle on page-boundary crossings, you're looking in the wrong place.
The code you've posted can be made tighter by providing two tables that contain the precomputed low and high bytes of the start of each screen line, with the SCREEN start already included.
Furthermore, you make a lookup of the bitmask, which is quite possibly later processed via ORA. I'd keep X&7 within register X, and then do an ORA powers_of_2,X in-place. This saves you an LDA and an STA into a temporary location, which is most probably only used once during a point plot. The same applies to Y&7, which seems to select the line within a character definition where the pixel is going to be set. That value should be computed in place, and then used via LDA/STA (char),Y. Inlining the subroutine will omit a JSR and an RTS, saving another 12 cycles (6 each).
Since you're going to recompute the address for every point plot, the code won't be very fast anyway. And this:
IsaacKuo wrote: [...] but I find it inelegant to sloppily ignore the page boundary issue.
is something nobody will blame you for.

Michael
P.S.: There are valid reasons to be concerned about the extra cycles when crossing page boundaries, for example cycle-exact processing within raster lines, or transfer code for speedy disk I/O. In these cases the function critically depends on every cycle. But then we don't need to talk about performance in the sense of operations per second.
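The two-table suggestion above can be sketched as follows. This is a Python generator for the table contents, not assembler source; it assumes the screen starts at $1E00 (the usual location on an unexpanded VIC-20) and has 22 columns by 23 rows, and the label names ROWLO and ROWHI are invented here.

```python
SCREEN = 0x1E00   # assumption: screen base of an unexpanded VIC-20
COLUMNS = 22
ROWS = 23


def row_tables(screen=SCREEN, columns=COLUMNS, rows=ROWS):
    """Return (low_bytes, high_bytes) of the start address of every screen row."""
    starts = [screen + row * columns for row in range(rows)]
    return [s & 0xFF for s in starts], [s >> 8 for s in starts]


def as_byte_lines(label, values, per_line=8):
    """Format a table as a label plus .byte lines, as in the first post."""
    lines = [f"{label}:"]
    for i in range(0, len(values), per_line):
        chunk = ",".join(f"${v:02X}" for v in values[i:i + per_line])
        lines.append(f".byte {chunk}")
    return lines


low, high = row_tables()
listing = as_byte_lines("ROWLO", low) + as_byte_lines("ROWHI", high)
```

With both tables in place, the row address comes from two indexed loads (one per table), and no shift or branch is needed to build the high byte.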
Mike wrote: The code you've posted can be made tighter by providing two tables [...]
The original version of my code used two tables. I was experimenting with using just one table. Like I said, I'm still working on the code.
Mike wrote: Furthermore, you make a lookup of the bitmask, which is quite possibly later processed via ORA. I'd keep X&7 within register X, and then do an ORA powers_of_2,X in-place.
I'll keep that in mind; right now it seems like there's just way too much processing going on in between to sacrifice the X register. The character in location SCREEN+(y>>3)*22+(x>>3) needs to be tested to see if it's a custom character or a ROM character, then its base address calculated, and so on.
Depending on the operation, a custom character may need to be allocated or deallocated. I can temporarily save the X register if I need to in that case, of course.
Hmm...looking at my TESTXY, SETXY, and RESETXY code, it looks like I can rewrite them to leave the X register untouched until it finally gets to the appropriate stage...
Mike wrote: Inlining the sub-routine will omit a JSR, and RTS [...]
Yes, at the expense of triplicating the routine across TESTXY, SETXY, and RESETXY. Since I'm still working on the code, I'm going to use a shared routine (easier to debug).
Huh? But that's in the middle of Character ROM...
...oh...
Well, that just saved me 16 bytes. Too bad it won't work for my irregular subpixel rendering mask tables.
Anyway, try as I might I can't shake the requirement to use the X register in my SETXY and RESETXY routines. I use INC and DEC commands to adjust the flea population counters of custom chars, and that requires using the X register.
Rather than obsess over a few cycles here or there, I'll just go back to the original plan of storing x&7 in zero page. It's not like I need high performance to animate just a single pixel per field.
-
- The Most Noble Order of Denial
- Posts: 343
- Joined: Fri May 01, 2009 4:44 pm
Is it possible to use a 24-column-wide screen?
Then you could save a fair bit of code and time.
Note: The code needs to live in zero page, and it has not been tested yet.
Edited: I've fixed the previous non-working code.
Code: Select all
; (WORK),y points to SCREEN+(y>>3)*24
2 tya ; this does (y&~7)*3 i.e. (y>>3)*24
2 lsr ; y>>1
3 and #~3 ; (y>>1)&~3 = (y&~7)>>1
3 sta $self_modifying
2 asl ; * 3
2 adc #self_modifying ; A = ((y>>1)&~3) + (((y>>1)&~3)<<1)
2 asl ; A<<1, A holds low result now, carry holds high bit
3 sta WORK ;
2 lda #0 ;
2 adc #>SCREEN ; add carry bit to screen location
3 sta WORK+1 ;
26 cycles total, 18 bytes
vs
27/28 cycles 16+23 bytes for (y>>3)*22 version
- Mike
Code: Select all
3 sta $self_modifying
2 asl ; * 3
2 adc #self_modifying ; A = ((y>>1)&~3) + (((y>>1)&~3)<<1)
Unless the immediate operand of ADC, and with it the whole code, resides in zero page, STA needs 4 cycles. So this part takes 8, not 7 cycles.
Otherwise, 'STA zp/ASL A/ADC zp' takes the same 8 cycles, and doesn't rely on self-modifying code. There should be a spare ZP address available without problems.
And at the start of the routine, 'AND #$F8/LSR A' indeed gives the same result as 'LSR A/AND #$FC'.
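Both claims can be checked exhaustively over all 8-bit values. The following Python sketch models the accumulator as an 8-bit integer (the function names are invented for this check) and also confirms that the STA zp/ASL A/ADC zp sequence followed by the final ASL yields (y>>3)*24, as long as the intermediate value fits in 8 bits.

```python
def and_then_lsr(y):
    """AND #$F8 followed by LSR A."""
    return (y & 0xF8) >> 1


def lsr_then_and(y):
    """LSR A followed by AND #$FC."""
    return (y >> 1) & 0xFC


def row_offset_24(y):
    """Model of LSR/AND, STA zp, ASL A, ADC zp, ASL A for a 24-column screen.

    Returns (low_byte, carry); the carry becomes the high-byte increment.
    """
    a = lsr_then_and(y)          # a = (y >> 3) * 4, at most $7C
    temp = a                     # STA zp
    a = (a << 1) & 0xFF          # ASL A, carry out is 0 because a <= $7C
    a = a + temp                 # ADC zp with carry clear: a = (y >> 3) * 12
    if a > 0xFF:
        raise ValueError("row too large: (y >> 3) * 12 must fit in 8 bits")
    a <<= 1                      # ASL A: bit 7 moves into the carry
    return a & 0xFF, a >> 8
```

In this model the 8-bit limit is reached at row 22 (y >= 176), where (y>>3)*12 no longer fits in one byte.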

P.S.: At first I missed the comment in the post saying that the code is indeed supposed to live in zero page. IMO, you'd be wasting a lot of valuable resources down there.
Last edited by Mike on Wed Aug 26, 2009 6:05 pm, edited 1 time in total.
-
- The Most Noble Order of Denial
- Posts: 343
- Joined: Fri May 01, 2009 4:44 pm
Saved another cycle and removed the zero page self modifying code!
Code: Select all
2 tya ; this does (y&~7)*3 i.e. (y>>3)*24
3 and #~7 ; (y&~7)
3 sta $temp ; $temp = (y&~7)
5 rra $temp ; undocumented opcode: ROR $temp, then ADC $temp. With carry clear on entry, A = (y&~7) + ((y&~7)>>1) = (y&~7)*1.5
2 asl ; A<<1, A holds low result now, carry holds high bit
3 sta WORK ;
2 lda #0 ;
2 adc #>SCREEN ; add carry bit to screen location
3 sta WORK+1 ;
25 cycles 16 bytes
vs
27/28 cycles 16+23 bytes for (y>>3)*22 version
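A small Python model of the sequence above, assuming the commonly documented behaviour of the undocumented RRA opcode (rotate the memory operand right through carry, then add it to A with carry) and binary mode. The helper names are invented here. The model suggests the routine relies on the carry flag being clear on entry, because ROR shifts the old carry into bit 7 of $temp.

```python
def ror(value, carry):
    """8-bit rotate right through carry; returns (result, carry_out)."""
    return (value >> 1) | (carry << 7), value & 1


def rra(a, mem, carry):
    """RRA as modelled here: ROR mem, then ADC mem. Returns (a, mem, carry_out)."""
    mem, carry = ror(mem, carry)
    total = a + mem + carry
    return total & 0xFF, mem, total >> 8


def row_offset_rra(y, carry_in=0):
    """Model of AND #$F8, STA temp, RRA temp, ASL A. Returns (low_byte, carry)."""
    a = y & 0xF8                          # and #~7 on an 8-bit value
    temp = a                              # sta $temp
    a, temp, _ = rra(a, temp, carry_in)   # a = (y&~7) * 1.5 if carry_in is 0
    a <<= 1                               # asl: bit 7 moves into the carry
    return a & 0xFF, a >> 8
```

With carry clear the model matches (y>>3)*24 for rows 0 to 21. With carry set on entry, y = 0 already comes out as 256 instead of 0.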
And yes, it must be 22 columns so the ROM print routines can be used without...interesting issues.
The current version of the code is:
Code: Select all
tya
lsr
lsr
lsr
tay ; y>>3 --> y
lda ROWTAB,y
asl
sta WORK
lda #SCREEN/512
rol
sta WORK+1 ; SCREEN+(y>>3)*22 --> WORK, WORK+1
16+23 bytes, 26 cycles.
I use lda #SCREEN/512 and rol instead of lda #0 and adc #>SCREEN because it's more consistent with the routines for calculating CHARBASE addresses. Those can span up to 2K, so multiple rols and asls are used to pump the address up.
For my dynavic routines, though, I use a 20 column screen (no mixing of ROM characters; no concerns over ROM print routines). 20 is almost as convenient as 24 for mathematics, so I use math routines instead of a lookup table.
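The lda #SCREEN/512 and rol idea can be checked with a short Python model. It assumes SCREEN = $1E00, which sits on a 512-byte boundary as the trick requires; the function name is invented here. ROWTAB holds row*11, so the asl doubles it to row*22 and pushes the ninth bit into the carry, and the rol then shifts SCREEN/512 left while pulling that carry in as bit 0 of the high byte.

```python
SCREEN = 0x1E00                            # assumption: must be a multiple of 512
ROWTAB = [row * 11 for row in range(23)]   # same contents as the table in the first post


def row_address(y):
    """Model of LSR x3, LDA ROWTAB,Y, ASL, LDA #SCREEN/512, ROL."""
    row = y >> 3                           # tya / lsr / lsr / lsr / tay
    a = ROWTAB[row]                        # lda ROWTAB,y
    carry = a >> 7                         # asl: bit 7 goes to the carry
    work_lo = (a << 1) & 0xFF              # sta WORK
    a = SCREEN // 512                      # lda #SCREEN/512
    work_hi = ((a << 1) | carry) & 0xFF    # rol, then sta WORK+1
    return work_hi * 256 + work_lo
```

For every y from 0 to 183 the model returns SCREEN + (y>>3)*22; the last row, for example, starts at $1FE4.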