My Favorite Instruction of 2023


chysn
Vic 20 Scientist
Posts: 1204
Joined: Tue Oct 22, 2019 12:36 pm
Website: http://www.beigemaze.com
Location: Michigan, USA
Occupation: Software Dev Manager

Post by chysn »

In my 6502 projects this year, I made liberal use of an instruction that I've heretofore ignored:

Code: Select all

ldy TABLE_IX
ldx Data,y
It's the absolute,Y mode of LDX! It might seem weird to load an index register with an indexed table member. But I started out doing something like this:

Code: Select all

ldy TABLE_IX
lda Data,y
tax
There's no need to involve the Accumulator here.

The beauty of this instruction (and its LDY counterpart) is that it acts as a translator between data tables. It's like Inception, dropping down into another level of reality. I used it for interfacing with other hardware, specifically a musical instrument that has its own way of organizing tabular data. It's a table of sixty (or thereabouts) data points, of about a dozen different types and ranges.

So when I designed my own software's tables, they needed to reference the instrument's native data layout. And it's LDX absolute,Y that glues the instrument's data structures and my data structures together!

Imagine a representation of a target system's data in memory, with each index in the table representing a specific attribute. Oscillator A frequency at index 0, Oscillator B frequency at index 1, and so on.

I want to create an abstraction for the idea of "Oscillator frequency," so I have a Types table in the EEPROM that stores the attributes for this kind of field (high/low ranges, pointer to a draw subroutine, and other stuff). Then, I have a third level that specifies physical attributes of each field: which page it's on, where it's placed on the screen, AND (this is the key part) a reference index to the target system's data table. If I know which field the user is currently editing, then sending the real data to a type-specific interface routine goes like this:

Code: Select all

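ldy USER_CURRENT_FIELD_INDEX        ; Y = index of the field being edited
ldx TARGET_DATA_INDICES,y           ; X = that field's index in the native table
lda DRAW_SUBROUTINE_POINTERS_HIGH,y ; push the routine address, high byte first
pha
lda DRAW_SUBROUTINE_POINTERS_LOW,y  ; then the low byte
pha
lda NATIVE_DATA_TABLE,x             ; A = the field's current native value
rts                                 ; "return" into the routine (RTS adds 1 to the pulled address)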
Updating the native data table works in much the same way, allowing me to check ranges based on the type table and then write to the location gleaned via LDX absolute,Y.
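
A minimal sketch of that update path, in generic assembler syntax. FIELD_TYPES, TYPE_LOW, TYPE_HIGH and NEW_VALUE are hypothetical names that do not appear in the post; only USER_CURRENT_FIELD_INDEX, TARGET_DATA_INDICES and NATIVE_DATA_TABLE come from the code above, and the real range check may well look different.

```asm
; All labels are assumed to be defined elsewhere.
; FIELD_TYPES, TYPE_LOW, TYPE_HIGH and NEW_VALUE are hypothetical.
update: ldy USER_CURRENT_FIELD_INDEX
        ldx FIELD_TYPES,y           ; X = type index for this field
        lda NEW_VALUE               ; A = candidate value
        cmp TYPE_LOW,x              ; below the low limit?
        bcc reject                  ; carry clear: A < low
        cmp TYPE_HIGH,x             ; compare against the high limit
        beq accept                  ; exactly the high limit is allowed
        bcs reject                  ; carry set and not equal: A > high
accept: ldx TARGET_DATA_INDICES,y   ; X = index into the native table
        sta NATIVE_DATA_TABLE,x     ; write the value where the instrument expects it
reject: rts
```

Both LDX absolute,Y instructions leave the Accumulator alone, so the candidate value can stay in A across the range check and the final store.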

It's a great instruction when you need to create more-complex-than-usual data structures.
VIC-20 Projects: wAx Assembler, TRBo: Turtle RescueBot, Helix Colony, Sub Med, Trolley Problem, Dungeon of Dance, ZEPTOPOLIS, MIDI KERNAL, The Archivist, Ed for Prophet-5

WIP: MIDIcast BASIC extension

he/him/his
Mike
Herr VC
Posts: 5134
Joined: Wed Dec 01, 2004 1:57 pm
Location: Munich, Germany
Occupation: electrical engineer

Re: My Favorite Instruction of 2023

Post by Mike »

My first thought when I read the topic title was you probably meant the BIT instruction, but then we two had already covered that in 2020. :mrgreen:

These are the kind of instructions I only remember when I actually need them, so to speak. It's a shame, though, that the corresponding store instructions (STX ABS,Y and STY ABS,X) are missing.
chysn wrote: The beauty of [LDX ABS,Y] (and its LDY counterpart) is that it acts as a translator between data tables.
Yeah, LDY ABS,X features quite prominently in my port of Killer Comet to 'fold' two complicated nested FOR loop constructs in BASIC ...

Code: Select all

25 F=1:REM ERA METEOR
26 FORE=0TO44STEP22
27 FORD=C+ETOC+3+E:POKED,32 :F=F+1:NEXT:NEXT
[...]
29 F=1:REM DRAW METEOR
30 [...]:FORE=0TO44STEP22
35 FORD=C+ETOC+3+E:POKED,B(F):F=F+1:NEXT:NEXT
... into a six-instruction copy loop ...

Code: Select all

.119A LDX #$0F
.119C LDY $101F,X ; <-- !!!
.119F LDA $102F,X
.11A2 STA ($FB),Y
.11A4 DEX
.11A5 BNE $119C
... using these two tables to index the meteor 'contents' into the screen:

Code: Select all

>1010 20 A0 A0 A0 A0
>1015 20 A0 A0 A0 A0
>101A 20 A0 A0 A0 A0

>1020 00 01 02 03 04
>1025 16 17 18 19 1A
>102A 2C 2D 2E 2F 30
...

If you can spare 256 bytes for an identity table (all values $00..$FF stored in that order), you can use LDX table,Y or LDY table,X to synthesize the 'missing' TYX or TXY instructions ... ;)
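
A minimal sketch of the identity-table idea, in generic assembler syntax. The address of IDENTITY is a placeholder; the only real requirement assumed here is that the table starts on a page boundary, so the indexed loads never pay the page-crossing cycle.

```asm
IDENTITY = $1D00        ; hypothetical page-aligned 256-byte buffer

; Build the table once: IDENTITY+n holds n for n = $00..$FF.
        ldx #0
fill:   txa
        sta IDENTITY,x
        inx
        bne fill        ; X wraps to 0 after 256 stores

; "TYX": 3 bytes, 4 cycles, A untouched
        ldx IDENTITY,y

; "TXY": 3 bytes, 4 cycles, A untouched
        ldy IDENTITY,x
```

The usual TYA:TAX pair is also 4 cycles and a byte shorter, so the gain is limited to keeping the Accumulator intact.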

Post by chysn »

Mike wrote: Mon Jan 01, 2024 6:59 am It's a shame though they are missing the corresponding store instructions for absolute indexed.
I haven't acutely felt the absence of the STi (i=some index register). Since LDi ABS,i frees up the Accumulator, it's more natural to use either STA (ZP),Y (as in your Comet loop) or STA ABS,X for these table translations. I did vaguely wish for it recently, but I don't remember the context.
If you can spare 256 bytes for an identity table (all values $00..$FF stored in that order), you can use LDX table,Y or LDY table,X to synthesize the 'missing' TYX or TXY instructions ... ;)
If there was already such a table in ROM for some reason*, that would be cool. I never have a spare 256 bytes!

________________
* Like finding ROM bytes to synthesize BIT #, or using character ROM for bit value positions (1, 2, 4, 8...).

Post by Mike »

chysn wrote:I never have a spare 256 bytes!
My reaction was roughly similar when I first read about it on nesdev.org - but one never knows ... perhaps some day this idea might come in handy.

Post by chysn »

Mike wrote: Mon Jan 01, 2024 9:59 am
chysn wrote:I never have a spare 256 bytes!
My reaction was roughly similar when I first read about it on nesdev.org - but one never knows ... perhaps some day this idea might come in handy.
The thing that comes to mind is the kind of stuff you do with raster timing, where cycles need to be exact. But you seem to have done just fine without this technique.

Post by chysn »

It's fun to think about how long a 6502 program would have to be to make up the 256 bytes and get a savings.

I mean, I'd love CMPX and CMPY, those are near the top of my wish list. But there'd need to be so many of them to make an identity table worthwhile.

Post by Mike »

chysn wrote:The thing that comes to mind is the kind of stuff you do with raster timing, where cycles need to be exact. But you seem to have done just fine without this technique.
If you're referring to that example in my recent "raster paper", the heavy processing that takes place in the inner loops is somewhat atypical:

Code: Select all

[...]
.Frame1
 LDA &900E                                ;    Load current aux. colour/volume register,
 EOR aux_extbrd_1,Y                       ;    change the aux. colour,
 AND #&0F                                 ;    but keep the volume
 EOR aux_extbrd_1,Y                       ;    and
 STA &FB                                  ;    store in $FB for later use.
 LDA aux_extbrd_1,Y                       ;    Calculate next combination
 EOR bck_intbrd_1,Y                       ;    of exterior border colour
 AND #&0F                                 ;    plus background colour from ...
 STX &900F                                ; ** re-instate exterior border colour at right edge of display window (keeping the current background colour)
 EOR bck_intbrd_1,Y                       ;    ... the table data and 
 TAX                                      ;    keep in X for later use.
 LDA &FB                                  ;    Load $FB and
 STA &900E                                ; ** write $FB as new value of aux. colour/volume register (immediately before horizontal retrace).
 STX &900F                                ; ** Change to new exterior border colour and background colour during horizontal retrace.
 ]
 IF NOT ntsc THEN [OPT pass:CMP (&00,X):] ;    6 cycles extra delay for PAL
[OPT pass
 LDA bck_intbrd_1,Y                       ;    Load combination of background colour and 'interior border colour' for the %01 multicolour pixels and
 STA &900F                                ; ** write background/border register at left edge of display window.
 NOP                                      ;    Not much leeway here.
 INY                                      ;    Count ...
 CPY #&C1                                 ;    ... 192+1 ...
 BCC Frame1                               ;    ... lines.
 [...]
The "**" markers show where the stores to the VIC registers happen.

If anything, that might have called for LD% ABS,% (% := X | Y) ... but what I actually wanted was to keep the colour tables compact. The exterior border colour is stored with the auxiliary colour, and the 'interior border colour' which then actually serves as independent colour source for the %01 multicolour pixels is stored with the background colour. All those EOR and AND instructions merely serve to mask out the necessary bits for the colour registers while still keeping the volume register intact. Indeed I ran out of registers here, which explains the presence of STA $FB and LDA $FB ($FB is preserved on stack during the IRQ).

Most other cycle exact raster code that tokra and I had written so far looks more like a data pump, which mostly just does LDA/STA to shuffle data from cartridge memory to either screen RAM, colour RAM or VIC registers. In the extreme cases, that code is fully unrolled, with immediate LDA instructions (instead of reading off tables) to provide the relevant data.
I'd love CMPX and CMPY, those are near the top of my wish list.
Comparing two registers can also use either a zeropage temporary or self-modifying code for 6 cycles in each version.
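
A minimal sketch of both variants, comparing Y against X in generic assembler syntax. TEMP is a hypothetical free zero-page byte, and the self-modifying version assumes the code runs from RAM.

```asm
TEMP = $FB              ; assumed free zero-page byte

; Zero-page temporary: 4 bytes, 6 cycles
        stx TEMP        ; 3 cycles
        cpy TEMP        ; 3 cycles, flags reflect Y - X

; Self-modifying code: 5 bytes, 6 cycles
        stx cmpval+1    ; 4 cycles, patch the immediate operand
cmpval: cpy #$00        ; 2 cycles, operand is replaced at run time
```

Neither version touches the Accumulator.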

Post by chysn »

Mike wrote: Mon Jan 01, 2024 11:12 am
I'd love CMPX and CMPY, those are near the top of my wish list.
Comparing two registers can also use either a zeropage temporary or self-modifying code for 6 cycles in each version.
ZP is the approach I take, and it's what you'd need to use as the standard for determining whether an identity table is worthwhile. The default ZP usage is four bytes and six cycles, whereas the identity table is three bytes and 4-5 cycles. So you'd need to find 256 uses of the identity table to break even in memory. And that's a tall order. I'd need to have used identity tables in like 10% of instructions in my most recent project.
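
The arithmetic behind that break-even figure can be sketched as follows, reading "CMPX" as "compare the Accumulator with X". TEMP and IDENTITY are hypothetical names, and the table is assumed to be 256 bytes holding $00..$FF in order.

```asm
TEMP     = $FB          ; assumed free zero-page byte
IDENTITY = $1D00        ; hypothetical identity table

; Zero-page version of "CMPX": 4 bytes, 6 cycles
        stx TEMP        ; 2 bytes, 3 cycles
        cmp TEMP        ; 2 bytes, 3 cycles

; Identity-table version of "CMPX": 3 bytes, 4 cycles (5 if a page is crossed)
        cmp IDENTITY,x

; Saving per use:  4 - 3 = 1 byte
; Cost of table:   256 bytes
; Break-even:      256 / 1 = 256 uses
```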

I guess the "256 bytes to spare" really comes into play. But I have a strong tendency to go right to the limit to add features, subtle niceties, and clear instructions/labels. If I have any "extra" memory, it's probably going into text.

I'm resolved to leave myself a couple hundred bytes right now to accommodate future firmware updates to the instrument. If I must, I can make the Help Screen more terse.
If you're referring to that example in my recent "raster paper", the heavy processing that takes place in the inner loops is somewhat atypical
That's the one!
MrSterlingBS
Vic 20 Afficionado
Posts: 304
Joined: Tue Jan 31, 2023 2:56 am

Re: My Favorite Instruction of 2023

Post by MrSterlingBS »

I found a code snippet in the circle routine at http://michaeljmahon.com/FASTCIRC%20description.html

Code: Select all

	stx savex
	
	txa 				; X = 15 - X
	eor #$0F
	tax
	
	sec 				; y = yc - dy
	lda yc
	sbc dy,x
	sta y
He uses the EOR instruction there in a very clever way.
I really like such ingenious tricks. ;-)
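
Why this works, as a short sketch: for values 0..15, flipping the low four bits gives the same result as subtracting from 15. The second half shows a plain subtraction for comparison; TEMP is a hypothetical zero-page byte and is not part of the linked routine.

```asm
TEMP = $FB              ; assumed free zero-page byte

; EOR version: 4 bytes, 6 cycles, carry untouched
        txa
        eor #$0F        ; for X in 0..15: X EOR 15 = 15 - X
        tax

; Subtraction version: 8 bytes, 14 cycles, needs SEC and a temporary
        stx TEMP
        lda #$0F
        sec
        sbc TEMP        ; A = 15 - X
        tax
```

The same idea gives 255 - X with EOR #$FF. Outside the range 0..15, EOR #$0F leaves the upper four bits of X as they were, so the two versions only agree for 4-bit values.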

My Favorite Instruction of 2023 / 2024

Post by MrSterlingBS »

The commands I haven't used so far were TSX and TXS. At Aitsch's suggestion, I looked into the topic of stacks in more detail.
Then the following occurred to me.
Normally the X register is pushed onto the stack via TXA and PHA. This takes a lot of time: 5 cycles. Because the 6502 lacks the PHX and PLX instructions of the newer 65C02, a better solution is to store the X register in the zero page with STX ZP and retrieve it later. An even more efficient solution, provided no further value is pushed onto or pulled from the stack in the meantime, is to use the TXS instruction. This happens at an insane speed of two cycles.

For me, this is my insight of the year. ;-)

I now use this technique in my 3D-Stars Demo and save 5 cycles per star, one ZP location and 5 bytes.
Not much, but if you only have 5 KB on an unexpanded VIC...
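
A minimal sketch of parking X in the stack pointer, in generic assembler syntax. This is not the code from the demo; SAVED_SP, the loop count and the loop body are placeholders, and the sketch assumes interrupts were enabled on entry.

```asm
SAVED_SP = $FB          ; assumed free zero-page byte

        sei             ; no IRQ may push while S holds data (SEI does not mask NMI)
        tsx
        stx SAVED_SP    ; remember the real stack pointer
        ldx #16         ; hypothetical loop count
loop:   txs             ; park the counter in S: 2 cycles, no flags changed
        ; The body goes here. It may clobber X, but must not use
        ; JSR, RTS, PHA, PLA, PHP or PLP.
        tsx             ; fetch the counter back: 2 cycles
        dex
        bne loop
        ldx SAVED_SP
        txs             ; restore the real stack pointer
        cli             ; assumes interrupts were enabled before
```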

Re: My Favorite Instruction of 2023 / 2024

Post by Mike »

MrSterlingBS wrote: The commands I haven't used so far were TSX and TXS.
Most of the time, TSX and TXS are used in system-level code. Any application code that messes with the stack pointer should only do so with interrupts disabled.

In the OP of the thread "65xx Opcode statistics, an example", I took the editor library of MINIPAINT as an example. There, the instructions BRK, BVC, CLV, NOP, RTI, TSX and TXS are not used at all. CLV and BVC may be used in combination for relocatable jumps or as a spin loop; the latter requires the SO pin of the CPU to be connected, which allows for ultra-fast response times. The CBM disk drives make use of that facility. NOP is almost always used for timing purposes. BRK, RTI, TSX and TXS are most likely part of interrupt code or other system-level code.
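
The two CLV/BVC uses can be sketched like this, in generic assembler syntax with placeholder labels. The spin loop only terminates on hardware where something drives the SO pin.

```asm
; Relocatable jump: 3 bytes, works wherever the code is loaded,
; because the branch target is encoded as a relative offset (-128..+127).
        clv             ; V = 0
        bvc target      ; always taken
        ; (bytes here are skipped)
target: nop

; Spin loop on the SO pin: a pulse on SO sets V in hardware,
; so the loop exits without reading any I/O register.
        clv
wait:   bvc wait        ; loops until SO sets the overflow flag
```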

Otherwise, the MP editor library uses every other instruction, and most of the available address modes with each of them. About the only address mode that goes unused is (ZP,X), and only more recently have I found a use for that address mode, not just for the trivial case of X=0, but also when X<>0.
A [...] solution would be to store the X register in the zero page via STX ZP and retrieve it later [...]
The purpose of a stack is to keep context even when a routine is recursively reentered. Storing register values for later retrieval in fixed addresses is only a legitimate approach when you can guarantee that this part of the code is never reentered. That may be easy for simple programs, but as the complexity of programs goes up, it becomes more likely that recursive calls happen.
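
The difference can be sketched with two placeholder routines in generic assembler syntax; SAVE_X is a hypothetical zero-page byte.

```asm
SAVE_X = $FB            ; assumed free zero-page byte

; Fixed-address save: 6 cycles for save plus restore, but a nested
; call to "fixed" overwrites SAVE_X before the outer call reads it back.
fixed:  stx SAVE_X
        ; work that changes X, possibly calling "fixed" again
        ldx SAVE_X
        rts

; Stack save: 11 cycles for save plus restore and A is clobbered,
; but every nesting level gets its own copy.
stacked:
        txa
        pha
        ; work that changes X, possibly calling "stacked" again
        pla
        tax
        rts
```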

A most prominent example is the expression parser in the BASIC interpreter. Each operator that is followed by a higher-priority operator or by an expression in parentheses requires 'setting aside' three things: the running expression value A before the lower-priority operator, the lower-priority operator B itself, and the value C between those operators. Then, the value or expression E to the right of the higher-priority operator D is evaluated. In 1+2*3, for example, A is 1, B is +, C is 2, D is * and E is 3. What gets evaluated first is C D E, and then A B (C D E). That means part of the expression is evaluated right-to-left, not left-to-right, which would be the normal direction of parsing. This can happen multiple times within an expression and may even lead to an expression 'tree'. The only case where such an expression tree is not necessary is when (binary) operators are only ever followed by same- or lower-priority operators.

That's my 2 cents about this. Advocating the use of TXS and TSX comes - for most application code - with the prospect that any non-trivial code that uses this technique will break. In case of modern CPUs and OSes, manipulating the stack pointer (by any other means than PUSH/POP or sub-routine calls/returns) is actually forbidden.
tlr
Vic 20 Nerd
Posts: 594
Joined: Mon Oct 04, 2004 10:53 am

Re: My Favorite Instruction of 2023 / 2024

Post by tlr »

MrSterlingBS wrote: Fri Nov 15, 2024 4:49 am The commands I haven't used so far were TSX and TXS. At Aitsch's suggestion, I looked into the topic of stacks in more detail.
Then the following occurred to me.
Normally the X register is pushed onto the stack via TXA and PHA. This takes a lot of time: 5 cycles. Because the 6502 lacks the PHX and PLX instructions of the newer 65C02, a better solution is to store the X register in the zero page with STX ZP and retrieve it later. An even more efficient solution, provided no further value is pushed onto or pulled from the stack in the meantime, is to use the TXS instruction. This happens at an insane speed of two cycles.

For me, this is my insight of the year. ;-)
Indeed! Usage of the stack pointer as a general register is a very useful trick that I've sometimes used. You can also use it for indexing a 256 byte table (the stack) which may be useful in some speed code.
Wilson
Vic 20 Devotee
Posts: 252
Joined: Mon Sep 28, 2009 7:19 am
Location: Brooklyn, NY

Re: My Favorite Instruction of 2023

Post by Wilson »

On the C64 the stack is sometimes used as video memory (in tightly constrained scenarios, i.e. demos) by setting the screen RAM pointer to $0000.
By using PHA instead of STA ABS, writes to the stack page (or page 1 of the screen RAM) can be done 1 cycle faster. Writes to the zeropage with STA ZP are likewise a cycle faster. This technique is pretty much limited to use in the oh-so-ubiquitous precalc'd load-store table. As a perk, the PHA and STA ZP instructions are smaller than the STA ABS instructions they replace (1 and 2 bytes instead of 3), so the memory savings could be significant in a fully unrolled loop.
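
A minimal sketch of the per-value cost, in generic assembler syntax. The data values and the starting address are placeholders, and the push version assumes the real stack pointer was saved beforehand and that no interrupt can use the stack while this runs.

```asm
; Plain stores: 5 bytes and 6 cycles per value
        lda #$11        ; 2 bytes, 2 cycles
        sta $01FF       ; 3 bytes, 4 cycles
        lda #$22
        sta $01FE

; Pushes: 3 bytes and 5 cycles per value, filling page 1 downwards
        ldx #$FF
        txs             ; S = $FF, so the first PHA writes $01FF
        lda #$11        ; 2 bytes, 2 cycles
        pha             ; 1 byte, 3 cycles: writes $01FF, S = $FE
        lda #$22
        pha             ; writes $01FE, S = $FD
```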

Maybe this would be possible on a VFLI modified Vic-20? Maybe it's already been done? ;)

Post by Mike »

tlr wrote:Usage of the stack pointer as a general register [...]
The stack pointer is not a general purpose register. It is a special purpose register intended for indexing into the 65xx hardware stack.
Wilson wrote:On the C64 the stack is sometimes used as video memory [...]
Unless you come up with actual examples, I would say you are confusing that either with decompressors that use the screen RAM - at its original address! - as a buffer to hold the decompressor code and work variables, or with the graphics modes tokra and I have built in recent years that actually use zeropage, stack and most of the lower 1 KB in $0000..$03FF to hold graphics data ... on the VIC-20!

In those graphics modes, the data that happen to reside in the stack area are indeed updated with an unrolled sequence of LDA #imm:PHA.
Maybe this would be possible on a VFLI modified Vic-20? Maybe it's already been done? ;)
As I just wrote, that's already "possible" on an unmodified VIC-20. The VFLI mod enables one to put these overscan bitmap modes onto display without those tricks.

Post by tlr »

Mike wrote: Sat Nov 16, 2024 9:45 am
tlr wrote:Usage of the stack pointer as a general register [...]
The stack pointer is not a general purpose register. It is a special purpose register intended for indexing into the 65xx hardware stack.
Ok then, "general" register.

It can index data on the stack, and it can serve as temporary storage of the X index register. The stack pointer also has its post-decrement (PHA) and pre-increment (PLA) modes which may be used for some trickery.
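
One way the pre-increment behaviour of PLA could be used to stream a table out of the stack page is sketched below, in generic assembler syntax. SAVED_SP is a hypothetical zero-page byte, DEST is just the VIC register from the raster example earlier in the thread, and the sketch assumes the table starts at $0100 with the real stack contents living above it.

```asm
SAVED_SP = $FB          ; assumed free zero-page byte
DEST     = $900F        ; VIC border/background register, used as an example target

        sei             ; keep IRQs away from the stack (NMI is not masked)
        tsx
        stx SAVED_SP    ; remember the real stack pointer
        ldx #$FF
        txs             ; S = $FF: the next PLA wraps S to $00 first
        pla             ; A = contents of $0100, S = $00
        sta DEST
        pla             ; A = contents of $0101, S = $01
        sta DEST
        ldx SAVED_SP
        txs             ; restore the real stack pointer
        cli             ; assumes interrupts were enabled before
```

Each PLA takes 4 cycles and advances the "index" by itself, compared with 6 cycles for LDA ABS,X followed by INX.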