ML Optimization discussion (split from: ROM calls and other tricks)

Noizer · Post by **Noizer** » Sat Oct 24, 2020 9:25 am

viewtopic.php?t=676&start=30

It would be nice if you would give short code examples within a code tag. Thank you in advance

J.E.E.K. · Post by **J.E.E.K.** » Sat Oct 24, 2020 11:07 am

For all points or just the ones I added?

Noizer · Post by **Noizer** » Mon Oct 26, 2020 2:41 pm

J.E.E.K. wrote: ↑Sat Oct 24, 2020 11:07 am For all points or just the ones I added?

Maybe for all the points.

It seems Wimoos hasn‘t time left to further explain his points? Oh, probably you have to look at his last posting

chysn · Post by **chysn** » Mon Oct 26, 2020 3:49 pm

It seems like ML optimization and ROM tips could be separate topics.

One solid optimization that I always try to look out for is when a subroutine sets up data and then ends in a subroutine call, like this:

Code: Select all

sub:   ; Subroutine in which something is done
       jsr $cb1e ; and then some call is made to act on data
       rts

JSR is really just a JMP that puts a return address (-1) on the stack, and RTS is really just a JMP that sets PC to a stack address+1; so you save one byte of code, two bytes of stack space, and three cycles by ending such a subroutine with

Code: Select all

sub:   ; Subroutine in which something is done
       jmp $cb1e ; etc.

Mike · Post by **Mike** » Mon Oct 26, 2020 4:30 pm

Noizer wrote:It seems Wimoos hasn‘t time left to further explain his points? Oh, probably you have to look at his last posting

IMO, the list of techniques that wimoos posted (link) is pretty much self-evident. It is you who would have to come up with a specific question about an own coding problem where you don't know how to apply one of the techniques listed, then you stand good chances that someone here will show you how. Conversely I doubt a random collection of code examples explaining these techniques would be helpful in case.

srowe · Post by **srowe** » Tue Oct 27, 2020 1:14 am

chysn wrote: ↑Mon Oct 26, 2020 3:49 pm so you save one byte of code, two bytes of stack space, and three cycles

Actually you save 9 cycles using the approach (6+6 vs 3) subroutine calls have quite an overhead on the 6502.

chysn · Post by **chysn** » Tue Oct 27, 2020 5:40 am

srowe wrote: ↑Tue Oct 27, 2020 1:14 am Actually you save 9 cycles using the approach (6+6 vs 3) subroutine calls have quite an overhead on the 6502.

Yes, that's right, thanks!

Mike · Post by **Mike** » Tue Oct 27, 2020 6:12 am

Wimoos already mentioned tail call optimization in his post: viewtopic.php?t=676&start=22 ... and to quote from there:

wimoos wrote: ↑Wed Dec 12, 2012 4:27 am[...]

Change JSRs, followed by RTS into JMPs

[...]

You'll have to be wary about two cases, where this technique cannot (easily) be applied:

the RTS may itself be a labeled (jump-)target, or
the called sub-routine uses the return address on stack. PRIMM is a good example. If you call PRIMM with JMP, wrong data will be printed (likely, from the return address one level higher) and same return address will be advanced somewhat uncontrolled, likely leading to a crash.

Any other jumps or branches pointing to that RTS (or references using the RTS opcode value as data) would need to be retargeted elsewhere.

Noizer · Post by **Noizer** » Sat Oct 31, 2020 5:40 am

J.E.E.K. wrote: ↑Sat Oct 24, 2020 9:02 am
wimoos wrote: ↑Wed Dec 12, 2012 4:27 amRemove NOP’s
[...]
This list could be extended by

Hold flag in bit 7 of a location, so you can check the flag by "BIT flag" followed by "BMI/BPL destination" keeping A untouched, clear the flag with "LSR flag". If carry is set by an condition before, use "ROR flag" to set the flag.

Operations without temporarily save the accumulator to memory:

To subtract the value in A from location's value ( (mem)-A ) use EOR #$FF; SEC; ADC mem (negate A with two's complement and simply add).

Merge some bits from A into a memory location using EOR mem; AND #mask_bits_taken_from_acc; EOR mem; STA mem.

I‘m still waiting for some code examples, any chance ?

chysn · Post by **chysn** » Sat Oct 31, 2020 10:04 pm

Noizer wrote: ↑Sat Oct 31, 2020 5:40 am
[*]Hold flag in bit 7 of a location, so you can check the flag by "BIT flag" followed by "BMI/BPL destination" keeping A untouched, clear the flag with "LSR flag". If carry is set by an condition before, use "ROR flag" to set the flag.
I‘m still waiting for some code examples, any chance ?

Set a flag (3 or 4 bytes)

Code: Select all

sec
ror flag

Check a flag (4 or 5 bytes)

Code: Select all

bit flag
bmi if_set ; ~or~ bpl if_unset

Clear a Flag (2 or 3 bytes)

Code: Select all

lsr flag

The idea here is that you save a couple bytes clearing the flag (leveraging the fact that LSR doesn't use Carry as an input), and you don't need to use a register when checking the flag. For setting the flag, you're leveraging the fact that ROR does use Carry as an input, and saving a byte over LDA/STA. The register-saving benefit of this system is almost as valuable as the smaller memory footprint, in my view.

For the other bullet point, "Operations without temporarily save the accumulator to memory," the code is right there in the nested bullet points.

Noizer · Post by **Noizer** » Wed Nov 04, 2020 2:05 pm

chysn wrote: ↑Sat Oct 31, 2020 10:04 pm...

Thanx for you efforts. I certainly will use some of the code snipplets as needed.
Now here are my different approaches on ML optimizations of MHO:

Use of self modifying code -- load everything you have to process from outer loops

Use of tables for all operations where
z = x operation y — why slowly calculate values if you can simple do a table lookup with at most two bytes indexing (x and y)

Always starting to code keeping in mind execution time is not a bad choice — if you can do it fast, you can do it slow too

avoiding of PHA / PLA, if you can use zeropage addresses for storing or indexing — saves cycles and values are still available, this beans counting is more annoying than the necessary and instructive cycle counting

use of unintended opcodes, very handy and mind freeing, new coding possibilities with almost zero overhead — example Zeropage indirect addressing indexed in X

to be continued

Mike · Post by **Mike** » Wed Nov 04, 2020 3:56 pm

Noizer, there is yet another dimension of optimization you left out in your list (BTW - been there, done that), and that is minimizing the time spent on writing the code and getting it right, to specification.

I'd presume quite a lot of fellow coders try to optimize their code for size or speed ... and never finish it.

Just to put things into perspective - I worked roughly 100 hours to implement MINIPAINT (spread over half a year in 2009, mostly on rainy afternoons and weekends), resulting in roughly 7 KB library code 'scripted' by a main program written in MINIGRAFIK-enhanced BASIC.

I optimized only a few routines of said library in any of the dimensions speed or size - the zoom routine of the main edit window and the cursor display are the most notable examples. The rest of the lot, around 50 JSR-callable sub-routines, is just straightforward code to get the job done.

So, what track record are you able to show? Finished programs I mean. No cheap talk.

chysn · Post by **chysn** » Wed Nov 04, 2020 6:50 pm

Most of the time, when we speak of optimization, we're talking about things that we do to improve code when it's already somewhat-functional. But a great deal of optimization is available in the design stage.

Consider a data structure for a game enemy that requires four properties: direction, x position, y position, hit points. It's tempting to put data related to an instance in consecutive memory locations.

Code: Select all

OBJ_DATA = $03c0
DIRECTION = OBJ_DATA
X_POS = OBJ_DATA + 1
Y_POS = OBJ_DATA + 2
HP = OBJ_DATA + 3

The downside to this is that it can be costly to manage the index to the object, because there's math involved:

Code: Select all

GetHP:  asl        ; Multiply object index by 4 to get the table index
        asl
        tax
        lda HP,x
        rts

I've used this sort of data structure, which comes from my background in C, and my love of structs. Unfortunately, it's inefficient for 6502, and I've now switched to storing data with properties together instead of instances together:

Code: Select all

MAX_OBJ = 6
DIRECTION = $03c0
X_POS = DIRECTION + MAX_OBJ
Y_POS = X_POS + MAX_OBJ
HP = Y_POS + MAX_OBJ

And now all you need is an instance index to access properties

Code: Select all

lda HP,x

Notice that you do need to specify a maximum number of instances (MAX_OBJ) to allocate space. But in VIC-20 practice, this is easy enough.

Noizer · Post by **Noizer** » Thu Nov 05, 2020 9:29 am

Mike wrote: ↑Wed Nov 04, 2020 3:56 pm Noizer, there is yet another dimension of optimization you left out in your list (BTW - been there, done that), and that is minimizing the time spent on writing the code and getting it right, to specification.
...
So, what track record are you able to show? Finished programs I mean. No cheap talk.

Dear Mike, some things just take time. And since it's a hobby, I alone decide when I invest time in it and when I want to make something public. I like to program cheaply in my quiet little room, I don't live for it and nobody pays me for it. When I feel like it, I write a little bit of my tiny experience in the world where you are and that's it.

J.E.E.K. · Post by **J.E.E.K.** » Thu Nov 05, 2020 9:43 am

Noizer wrote: ↑Sat Oct 31, 2020 5:40 am
J.E.E.K. wrote: ↑Sat Oct 24, 2020 9:02 am [..]

[..]

Operations without temporarily save the accumulator to memory:

To subtract the value in A from location's value ( (mem)-A ) use EOR #$FF; SEC; ADC mem (negate A with two's complement and simply add).

Merge some bits from A into a memory location using EOR mem; AND #mask_bits_taken_from_acc; EOR mem; STA mem.

I‘m still waiting for some code examples, any chance ?

Ad "subtraction":
Request Memory from String Space (BASIC V2, C64)

The length of the request (bytes to allocate) is given in A. The heap pointer ($33/$34, the result is finally in A/Y) has to be moved to lower addresses:

Code: Select all

B4F7: 49 FF     EOR #$FF
B4F9: 38        SEC
B4FA: 65 33     ADC $33       ; Pointer: Bottom of String space
B4FC: A4 34     LDY $34       ; Pointer: Bottom of String space
B4FE: B0 01     BCS $B501
B500: 88        DEY

Ad "merge":

In my Graphic-Extension code I am using a mode byte (savemo), which consists of bits from different sources:

Code: Select all

                                ; put A's bit 4-7 into savemo
        EOR savemo              ; ********
        AND #%11110000          ; ****0000
        EOR savemo              ; AAAAmmmm
        STA savemo              ;

Denial

ML Optimization discussion (split from: ROM calls and other tricks)

ML Optimization discussion (split from: ROM calls and other tricks)

Re: ML Optimization discussion (split from: ROM calls and other tricks)

Re: ML Optimization discussion (split from: ROM calls and other tricks)

Re: ML Optimization tips and tricks

Re: ML Optimization discussion (split from: ROM calls and other tricks)

Re: ML Optimization tips and tricks

Re: ML Optimization tips and tricks

Re: ML Optimization discussion (split from: ROM calls and other tricks)

Re: ML Optimization tips and tricks

Re: ML Optimization tips and tricks

Re: ML Optimization tips and tricks

Re: ML Optimization discussion (split from: ROM calls and other tricks)

Re: ML Optimization discussion (split from: ROM calls and other tricks)

Re: ML Optimization discussion (split from: ROM calls and other tricks)

Re: ML Optimization tips and tricks