copy from bank1 to bank0 via stack

Started by wte, February 18, 2010, 10:04 AM

wte

Maybe someone remembers ...

It is possible to use the "relocation of the stack" trick to copy data from bank 1 to bank 0. This is much quicker than bank switching (of course). I know that I've read an article or forum message about this, but I can't find it anymore.

Any idea how to find this information in the world wide swamp?

Regards WTE

Hydrophilic

Although I have never seen such a program on the "intertubes" (as airship would say), it is a practical idea because the 8502 can PHA in only 3 cycles.

Here is a short program I have tested (it runs in RAM Bank 1). I can't promise it is the most efficient, but it should be much faster than explicit bank switching.

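It copies $1300~$22FF in bank 1 to $1300~$22FF in bank 0. It is easy to modify for other ranges (change the source high byte stored in $fd, the destination high byte stored in $d509 and the page count stored in $fe), as long as they are page aligned (for example, $c000~$cfff in bank 1 to $2200~$31ff in bank 0).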
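A quick way to check the page arithmetic for other ranges: a block of n pages starting at page p runs from $p00 to $p00 + n*256 - 1. The Python helper below is hypothetical (a checking aid only, not part of the 8502 program listed next) and simply does that sum for the ranges mentioned above.

```python
def page_range(start_page: int, pages: int) -> tuple[int, int]:
    """Return the first and last address of a page-aligned block."""
    if not (0 <= start_page <= 0xFF and 1 <= pages <= 256):
        raise ValueError("page must be $00-$ff and count 1-256")
    first = start_page << 8
    last = first + pages * 256 - 1
    if last > 0xFFFF:
        raise ValueError("block runs past $ffff")
    return first, last


def show(start_page: int, pages: int) -> str:
    """Format a block as $first~$last, like the forum posts do."""
    first, last = page_range(start_page, pages)
    return f"${first:04x}~${last:04x}"


# $10 pages: source page $13 (bank 1) -> destination page $13 (bank 0)
print(show(0x13, 0x10), "->", show(0x13, 0x10))  # $1300~$22ff -> $1300~$22ff
# $c000~$cfff is also $10 pages; moved to page $22 in bank 0
print(show(0xC0, 0x10), "->", show(0x22, 0x10))  # $c000~$cfff -> $2200~$31ff
```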


.10400 lda #$7e    ;bank 1 + i/o
.10402 sta $ff00
.10405 ldy #0
.10407 sty $fc     ;source low
.10409 lda #$13
.1040b sta $fd     ;source high
.1040d lda #$10
.1040f sta $fe     ;# pages
.10411 tsx
.10412 sei
.10413 stx $ff     ;save SP
.10415 tya      ;zero
.10416 tax
.10417 txs      ;zero
.10418 ldx $d509   ;save page pointer
.1041b lda #$13    ;destination high
.1041d sta $d509

;page loop

;byte loop
.10420 lda ($fc),y ;read bank 1
.10422 pha      ;write bank 0
.10423 dey
.10424 bne $0420   ;byte loop

.10426 inc $d509   ;next page (bank 0)
.10429 inc $fd     ;next page (bank 1)
.1042b dec $fe     ;count #pages
.1042d bne $0420   ;page loop

.1042f stx $d509   ;restore page pointer
.10432 ldx $ff     ;get SP
.10434 cli
.10435 txs
.10436 rts


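The inner (byte) loop takes 13 cycles per byte. Each pass of the outer (page) loop takes 256*13-1 + 19 = 3346 cycles (the final BNE of the byte loop falls through, saving 1 cycle, and the page bookkeeping costs 19). Of course you can get a marginal improvement (almost 1/13) by using self-modifying code, that is, LDA xx00,Y (4 cycles) instead of LDA ($fc),Y (5 cycles).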
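For anyone who wants to re-check those numbers, here is a rough Python tally using the standard 6502/8502 cycle counts (the helper names are hypothetical). The self-modifying variant assumes the high byte of the LDA $xx00,Y operand is bumped with an absolute INC (6 cycles) in place of INC $fd (5 cycles); that part is an assumption, not something from the program above.

```python
# Standard 6502/8502 cycle counts for the first program's loops.
# The pointer at $fc/$fd always has a zero low byte, so the indexed
# loads never pay the +1 page-crossing penalty.
LDA_IND_Y, LDA_ABS_Y = 5, 4
PHA, DEY = 3, 2
BNE_TAKEN, BNE_NOT_TAKEN = 3, 2
INC_ABS, INC_ZP, DEC_ZP = 6, 5, 5


def page_cycles(load: int, bump_source: int) -> int:
    """One pass of the outer loop, with the outer BNE taken."""
    byte_loop = load + PHA + DEY + BNE_TAKEN
    inner = 256 * byte_loop - (BNE_TAKEN - BNE_NOT_TAKEN)  # last BNE falls through
    outer = INC_ABS + bump_source + DEC_ZP + BNE_TAKEN     # inc $d509, next source page, dec $fe, bne
    return inner + outer


def copy_cycles(pages: int, load: int = LDA_IND_Y, bump_source: int = INC_ZP) -> int:
    """Loop cycles for a whole copy; the final outer BNE falls through."""
    return pages * page_cycles(load, bump_source) - (BNE_TAKEN - BNE_NOT_TAKEN)


original = page_cycles(LDA_IND_Y, INC_ZP)
# Assumed self-modifying variant: LDA $xx00,Y, operand high byte bumped by INC abs
self_mod = page_cycles(LDA_ABS_Y, INC_ABS)

print(original)                          # 3346
print(self_mod)                          # 3091
print(copy_cycles(0x10))                 # 53535 for the $10-page example
print(f"{1 - self_mod / original:.3f}")  # 0.076 (close to 1/13 = 0.077)
```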

I hope this helps!

Please note, the 8502 program assumes the "default" MMU configuration: stack page in RAM Bank 0 and Common RAM at the bottom of memory. (However, the stack page need not be page 1, because the program saves and restores the stack page pointer.)

When transferring from bank 0 to bank 1 (the opposite of what wte requested), there is a potential problem with a "bug" in the MMU (because it is documented, some may claim it is a "feature"). You see, when the CPU stack is included in Common RAM (i.e., every time low Common RAM is active), the MMU ignores the "bank byte" of stack relocation and always goes to bank 0... grrrrr!

I can try to explain this problem further if anyone is interested.
I'm kupo for kupo nuts!

airship

Quote from: Hydrophilic
...on the "intertubes" (as airship would say)
Yes. Yes, I would.  :)
Serving up content-free posts on the Interwebs since 1983.
History of INFO Magazine

orinoco

Quote from: Hydrophilic on February 18, 2010, 07:14 PM
I can try to explain this problem further if anyone is interested.

Here!!!
?FORMULA TOO COMPLEX ERROR IN 10
READY.
█

wte

Thank you very much Hydrophilic!

That is what I need. I'm sure I've seen something like this on the "intertubes" ;) a long time ago. Unfortunately, the code has to run in bank 1. But it should also work from the common area (*g*), so I'll copy the code to $0200 (there should be enough space).

Quote from: Hydrophilic on February 18, 2010, 07:14 PM
When transfering from bank 0 to bank 1 (opposite of what wte requested), there is a potential problem ...
I think in this case you use PLA instead of PHA and STA,Y instead of LDA,Y. Then it should work too.

Regards WTE

Hydrophilic

Please note: this is the opposite of what wte requested; here we copy from RAM Bank 0 to 1.
Quote from: wte
I think in this case you use PLA instead of PHA and STA,Y instead of LDA,Y.
My thought exactly!  Below is code I have tested (copy Bank 0 $1300~1AFF to Bank 1 $1000~17FF)... again this is run from Bank 1, unless you put it in Common RAM like wte suggested in a prior post.

.10400 lda #$7e    ;RAM 1 + I/O
.10402 sta $ff00
.10405 ldy #0      ;destination low
.10407 sty $fc
.10409 lda #$10    ;destination high
.1040b sta $fd
.1040d lda #8      ;#pages
.1040f sta $fe
.10411 tsx
.10412 sei
.10413 stx $ff     ;save SP
.10415 tya         ;zero
.10416 tax
.10417 txs         ;zero
.10418 ldx $d509   ;save stack page
.1041b lda #$13    ;source high
.1041d sta $d509

;page loop

;byte loop
.10420 pla         ;read bank 0 (PLA increments SP first)
.10421 iny         ;so step Y first to stay aligned
.10422 sta ($fc),y ;write bank 1
.10424 bne $0420   ;Z comes from INY

.10426 inc $d509   ;next page (bank 0)
.10429 inc $fd     ;next page (bank 1)
.1042b dec $fe     ;count #pages
.1042d bne $0420

.1042f stx $d509   ;restore stack page
.10432 ldx $ff     ;recall SP
.10434 cli
.10435 txs
.10436 rts


orinoco asked me to explain the page redirection / Common RAM problem of the MMU. In my opinion this is a bug (a SERIOUS bug) in the MMU, but since it is documented, some may call it a "feature." Whatever!

You see, the MMU has specific registers for the RAM Bank to use for page 0 and page 1 redirection (at $d508 and $d50a, respectively). Thus, from a theoretical programming perspective, one should be able to redirect zero page or stack page to any page of any bank by setting appropriate values in the MMU page registers. This is what VICE does, from as far back as I can remember up to the latest version, 2.2.

Below is a program that works fine in VICE (versions 2.0 to 2.2, tested by me) but fails on a real C128. See my post regarding VICE 2.2, where I said "I hope they fixed the page redirection..." (obviously they did not).

;copy RAM from Bank 1 to Bank 0
;this works in VICE, but does NOT work on a real C128
.00b00 lda #$3e    ;RAM 0 + I/O
.00b02 sta $ff00
.00b05 ldy #0      ;destination low
.00b07 sty $fc
.00b09 lda #$13    ;destination high
.00b0b sta $fd
.00b0d lda #8     ;#pages
.00b0f sta $fe
.00b11 tsx
.00b12 sei
.00b13 stx $ff     ;save SP
.00b15 tya         ;zero
.00b16 tax
.00b17 txs         ;zero
.00b18 inx         ;1
.00b19 stx $d50a   ;set stack bank (1)
.00b1c ldx $d509   ;save stack page
.00b1f lda #4      ;source high
.00b21 sta $d509   ;set stack page

;page loop

;byte loop
.00b24 pla         ;read bank 1
.00b25 sta ($fc),y ;write bank 0
.00b27 iny
.00b28 bne $0b24

.00b2a inc $d509   ;next page (bank 1)
.00b2d inc $fd     ;next page (bank 0)
.00b2f dec $fe     ;count #pages
.00b31 bne $0b24

.00b33 sty $d50a   ;zero stack bank
.00b36 stx $d509   ;restore stack page
.00b39 ldx $ff     ;recall SP
.00b3b cli
.00b3c txs
.00b3d rts


It fails on a real C128 when Common RAM is active at the bottom of memory (which always includes zero page and stack page), because the MMU directs zero page and stack page accesses to RAM Bank 0, regardless of the page (bank) registers!

In my opinion, the page registers are more specific and thus should take priority. Unfortunately for programmers, the MMU designer(s) let Common RAM take priority. So with the default/KERNAL setting of low Common RAM active, you cannot use RAM Bank 1 (or 2 or 3) for either zero page or stack page.

ANYWAY, if you run the above program on a real C128, you will discover that you have copied Bank 0 $400~BFF to Bank 0 $1300~1AFF... cross-bank transfer failed!!

You see, even though the program specifies Bank 1 for stack access, Common RAM overrides the setting and the MMU grabs data from Bank 0 (Common RAM) when accessing the stack. This is "the problem" that I mentioned in a previous post of this thread. I hope that explains it. If not, I hope somebody else has a better explanation...
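A deliberately simplified Python model of where a stack access ends up, following the description above. The function and parameter names are hypothetical, and it models only the stack page (not zero page, not the page-bank buffering mentioned later in the thread). On the real-C128 path, active low Common RAM forces bank 0 while the page redirection itself still happens, which matches the Bank 0 $400~BFF result described above.

```python
def stack_access(offset: int, p1l: int, p1h: int,
                 low_common: bool, real_c128: bool = True) -> tuple[int, int]:
    """Return (RAM bank, address) for a CPU stack access at $0100+offset.

    p1l/p1h stand for the MMU stack-page registers ($d509/$d50a).
    real_c128=True:  active low Common RAM wins and forces bank 0,
                     although the page is still redirected to p1l.
    real_c128=False: the page registers win (the VICE behaviour described).
    """
    address = (p1l << 8) | (offset & 0xFF)
    if real_c128 and low_common:
        bank = 0
    else:
        bank = p1h
    return bank, address


def describe(bank: int, address: int) -> str:
    return f"bank {bank} ${address:04x}"


# Third program: stack page $04 in bank 1, low Common RAM left on
print(describe(*stack_access(0x00, 0x04, 1, low_common=True)))                   # bank 0 $0400
print(describe(*stack_access(0x00, 0x04, 1, low_common=True, real_c128=False)))  # bank 1 $0400
# With low Common RAM switched off, the bank register is honoured
print(describe(*stack_access(0xFF, 0x0B, 1, low_common=False)))                  # bank 1 $0bff
```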

I guess I share some of the blame, because I noticed this problem in VICE some time ago and even wrote a patch to fix it. But I never submitted my patch to the VICE team. Sorry for all the trouble this has caused!

Edit
Note that the code uses PLA; STA ($FC),Y, so the byte loop takes 15 cycles. If the MMU bug/feature did not exist, we could use LDA ($FC),Y; PHA like the first example program, which takes only 13 cycles.

Also, instead of LDA ($FC),Y / STA ($FC),Y you could use LDA $00,X / STA $00,X with page zero redirection to gain a little more speed. This should save about 256 / 512 cycles per page (1 / 2 cycles per byte), but would require extra code to deal with the special locations $00 and $01 (the 8502's built-in I/O port). But don't try this unless you disable (low) Common RAM!
/Edit
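The cycle figures in the edit note can be tallied like this. This is a hypothetical Python sanity check using the standard 6502/8502 cycle counts, assuming taken branches stay on the same page and no indexed access crosses a page.

```python
# Standard 6502/8502 cycle counts for the instructions in the edit note.
CYCLES = {
    "PLA": 4, "PHA": 3, "INY": 2, "DEY": 2,
    "BNE": 3,           # branch taken, target on the same page
    "LDA (zp),Y": 5,    # +1 only on a page crossing (pointer low byte is $00 here)
    "STA (zp),Y": 6,    # always 6
    "LDA zp,X": 4, "STA zp,X": 4,
}


def loop_cost(*instructions: str) -> int:
    """Cycles for one pass through a loop made of the given instructions."""
    return sum(CYCLES[name] for name in instructions)


pull_loop = loop_cost("PLA", "STA (zp),Y", "INY", "BNE")
push_loop = loop_cost("LDA (zp),Y", "PHA", "DEY", "BNE")
# Savings per 256-byte page from switching to zero page,X addressing
save_lda = 256 * (CYCLES["LDA (zp),Y"] - CYCLES["LDA zp,X"])
save_sta = 256 * (CYCLES["STA (zp),Y"] - CYCLES["STA zp,X"])
print(pull_loop, push_loop, save_lda, save_sta)  # 15 13 256 512
```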

wte

I made some tests with the last code (in VICE) and got:

source (bank 1)
byte1 byte2 byte3 ... byte256

target (bank 0)
byte2 byte3 ... byte256 byte1

Hmmm, needs some brainwork ....

Regards WTE

wte

Brainwork done. The stack pointer has to be initialized with $ff, that's all.
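Why $ff: PLA increments the stack pointer before it reads. Below is a small Python model of one page of the PLA / STA ($fc),Y / INY / BNE loop (hypothetical function, with bytes 0-255 as test data). Starting with SP=$00 it reproduces the one-byte rotation above; starting with SP=$ff the page is copied straight. After 256 pulls SP is back to its starting value, so every page behaves the same way.

```python
def pull_copy_page(src: bytes, initial_sp: int) -> bytes:
    """Model one page of PLA / STA ($fc),Y / INY / BNE, starting with Y=0.

    src stands for the 256-byte page the stack is redirected to; the
    return value is what ends up in the destination page.
    """
    dst = bytearray(256)
    sp, y = initial_sp & 0xFF, 0
    while True:
        sp = (sp + 1) & 0xFF  # PLA increments SP first ...
        a = src[sp]           # ... and then reads the stack page
        dst[y] = a            # STA ($fc),Y
        y = (y + 1) & 0xFF    # INY
        if y == 0:            # BNE falls through after 256 bytes
            return bytes(dst)


page = bytes(range(256))      # test data: byte k holds the value k
shifted = pull_copy_page(page, 0x00)
print(list(shifted[:3]), shifted[255])     # [1, 2, 3] 0  (the rotation seen in VICE)
print(pull_copy_page(page, 0xFF) == page)  # True
```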

I also added some code to solve the Common RAM problem and made it "flexible" [start the routine with sys (adr),a,x,y]. The following code should work in bank 0 (!), also on a real C128 (but I haven't checked that yet; it works fine in VICE).

; --- copy data from bank 1 to bank 0
; syntax: bank15:sys copy,a,x,y
;         a: target high
;         x: source high
;         y: # of pages
;
copy:
sei
sta $fd     ;target high
stx $fa     ;save source high
sty $fe     ;# of pages
;
lda $ff00   ;save pcr
pha
lda #$3e    ;ram 0 + i/o
sta $ff00
lda $d506   ;ram config. register
pha         ;save rcr (common area)
and #$f3    ;common area off
sta $d506
;
lda #$00
sta $fc     ;target low
tsx
stx $ff     ;save org stack pointer
tay         ;zero
tax
dex
txs         ;$ff
inx
inx         ;$01
stx $d50a   ;set stack bank 1
ldx $d509   ;save page pointer
lda $fa     ;source high
sta $d509
;
;page and byte loop
;
copyloop:
pla         ;read bank 1
sta ($fc),y ;write bank 0
iny
bne copyloop;next byte
;
inc $d509   ;next page (bank 1)
inc $fd     ;next page (bank 0)
dec $fe     ;count #pages
bne copyloop;next page
;
sty $d50a   ;zero stack bank
stx $d509   ;restore page pointer
ldx $ff     ;get stack pointer
txs
pla
sta $d506   ;restore rcr
pla
sta $ff00   ;restore pcr
cli
rts
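The AND #$F3 above clears bits 2 and 3 of $D506 and keeps all other bits. Assuming the commonly documented layout of the Common RAM bits (bits 0-1 select the size, bit 2 enables it at the bottom, bit 3 at the top), this Python sketch shows the effect. The helpers and the two example values are made up for illustration. Since the routine pushes the original value first, it can restore it at the end.

```python
SIZES = ("1K", "4K", "8K", "16K")  # bits 0-1 (assumed layout)


def decode_common(rcr: int) -> str:
    """Describe the Common RAM bits of a $d506 value (bits 0-3 only)."""
    where = [name for bit, name in ((0x04, "bottom"), (0x08, "top")) if rcr & bit]
    if not where:
        return "common RAM off"
    return f"{SIZES[rcr & 0x03]} common RAM at " + " and ".join(where)


def common_off(rcr: int) -> int:
    """Equivalent of LDA $d506 / AND #$F3: clear bits 2 and 3 only."""
    return rcr & 0xF3


for rcr in (0x04, 0x4F):  # made-up example values
    off = common_off(rcr)
    print(f"${rcr:02x}: {decode_common(rcr)} -> ${off:02x}: {decode_common(off)}")
```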


Regards WTE

Hydrophilic

Quote from: wte
I made some test with the last code (in VICE) and got:

source (bank 1)
byte1 byte2 byte3 ... byte256

target (bank 0)
byte2 byte3 ... byte256 byte1

Hmmm, needs some brainwork ....

The original code I posted (what you originally requested) works fine when I test it. The way it works is a bit strange:

... [note we start with Y=0 and SP=0]
.10420 lda ($fc),y ;read bank 1
.10422 pha      ;write bank 0
.10423 dey
.10424 bne $0420   ;byte loop
...


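You see, it first reads byte 0 and PHAs it, then reads byte 255 and PHAs it, then bytes 254, 253, 252... and finally it reads byte 1, PHAs it, and exits the loop. Since PHA writes at SP and then decrements it, while DEY counts Y down in step, every byte lands at the same offset it was read from, and SP is back to 0 for the next page.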
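The same walk-through as a small Python model (hypothetical function; Y=0 and SP=0 at entry, bytes 0-255 as test data). It records the read order and confirms that the destination page ends up identical to the source.

```python
def push_copy_page(src: bytes) -> tuple[bytes, list[int], int]:
    """Model one page of LDA ($fc),Y / PHA / DEY / BNE with Y=0, SP=0 at entry.

    Returns the destination (stack) page, the order in which source
    offsets were read, and the final stack pointer.
    """
    dst = bytearray(256)
    order = []
    sp, y = 0x00, 0x00
    while True:
        a = src[y]            # LDA ($fc),Y
        order.append(y)
        dst[sp] = a           # PHA writes at SP ...
        sp = (sp - 1) & 0xFF  # ... then decrements SP
        y = (y - 1) & 0xFF    # DEY
        if y == 0:            # BNE falls through
            return bytes(dst), order, sp


page = bytes(range(256))
copied, order, final_sp = push_copy_page(page)
print(order[:4], order[-1])  # [0, 255, 254, 253] 1
print(copied == page)        # True
print(final_sp)              # 0, ready for the next page
```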

Of course I guess it is possible one of the other codes I wrote was Off By 1.  That has never happened before... ever... I promise  ::)   Just let me know which one.

Well, the important thing is you provided a solution... Awesome! I see your code completely disables Common RAM. I haven't tested your code on a real C128 myself, but Common RAM priority is the main problem with the VICE MMU, so I think it should work! The other problems with the VICE MMU are buffering of the page-bank byte and active Common RAM during DMA (some other things I can elaborate upon if anyone cares).

wte

Quote from: Hydrophilic on February 22, 2010, 03:09 PM
Of course I guess it is possible one of the other codes I wrote was Off By 1.  That has never happened before... ever... I promise  ::)   Just let me know which one.

As I wrote: "I made some test with the last code ..."
First two lines ==>
;copy RAM from Bank 1 to Bank 0
;this works in VICE, but does NOT work on a real C128


The original program was ok, I know.

Regards WTE

Could be a good subject for a magazine article or a C128 codebase.

Hydrophilic

Oh the last code... you did specifically say "last."  Sorry I missed that.  Thanks for your response, wte!

Looking at that code, it is easy to correct without changing any numbers!  Here is how the main loop should be coded (moving bank 1 to bank 0, which works in VICE but not real C128):

;byte loop
.00b24 pla         ;read bank 1
.00b25 iny
.00b26 sta ($fc),y ;write bank 0
.00b28 bne $0b24

The keen reader should notice that INY has been moved before STA. The reason for this is the way PLA works: it first increments the SP before grabbing data. So by doing the same thing in software (INY, then STA), everything is correctly aligned. And because STA does not affect the flags, BNE still tests the result of INY. This code operates strangely (similar to, but the opposite of, the first code)... it first moves byte 1, then byte 2, byte 3, ... byte 255, and finally byte 0.
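A small Python model of the corrected loop (hypothetical function; Y=0 and SP=0 at entry, bytes 0-255 as test data). It records the order in which destination offsets are written and confirms every byte lands at its own offset.

```python
def pull_copy_page_iny_first(src: bytes) -> tuple[bytes, list[int]]:
    """Model one page of PLA / INY / STA ($fc),Y / BNE with Y=0, SP=0 at entry.

    Returns the destination page and the order in which offsets were written.
    """
    dst = bytearray(256)
    order = []
    sp, y = 0x00, 0x00
    while True:
        sp = (sp + 1) & 0xFF  # PLA increments SP, then reads
        a = src[sp]
        y = (y + 1) & 0xFF    # INY: Y now matches SP; Z set when Y wraps to 0
        dst[y] = a            # STA ($fc),Y leaves the flags untouched ...
        order.append(y)
        if y == 0:            # ... so BNE still tests the result of INY
            return bytes(dst), order


page = bytes(range(256))
copied, order = pull_copy_page_iny_first(page)
print(order[:3], order[-2:])  # [1, 2, 3] [255, 0]
print(copied == page)         # True
```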

Again, remember the last code I wrote only works in VICE... it is meant to demonstrate the MMU bug in the emulator. Use wte's code or my first code on a real C128 (luckily they work in VICE also).

wte

Brilliant!

first iny, then sta

That makes the code 2 bytes shorter (compared to my solution). Not easier to understand, but really smart!

Brilliant!

Regards WTE