ASSEMBLER STANDARDS

BigDumbDinosaur · February 12, 2008, 11:42 AM

MOS TECHNOLOGY ASSEMBLER SYNTAX STANDARDS

Scattered about the Internet landscape are a number of homegrown 6502/65C02 assemblers free for the downloading. Unfortunately, many of them don't conform to the original MOS Technology standard developed with the 6502 microprocessor (MPU), which drives us old-time 6502 programmers to drink (we have to blame something for our bad habits). Variances or outright errors are typically found in instruction mnemonic representation, number radices, operand representation, label/symbol representation, comment delimiters, whitespace rules or assembler pseudo-ops.

For those of you reading this who are contemplating writing a 6502 assembler from scratch, the following may be of interest.

Mnemonics and Operands
MOS Technology defined a complete set of mnemonics to describe every legal instruction that the MPU can execute, and also defined how to represent the operand with any instruction that can operate on either memory or an MPU register. For example:

ASL FLAG ;Memory is the operand.

ASL A ;The MPU accumulator is the operand.

ASL ;WRONG! Any questions?

Any instruction that can operate on memory or an MPU register MUST be followed by an operand to avoid ambiguity.

Number Radices
A radix (plural radices) is a symbol for the base representation of a number. In almost all assembly language programs, the programmer may freely mix binary, octal, decimal and hexadecimal.

The correct MOS Technology radices for representing the decimal number 165 would be as follows:

%10100101 Base 2 or bitwise binary.

@245 Base 8 or octal.

165 Base 10 or decimal.

$A5 Base 16 or hexadecimal.

The MOS Technology standard designated base 10 as the assembler default, which is nearly universal. However, the C-128's machine language monitor uses hexadecimal as the default base (e.g., LDA #A5 is actually LDA #$A5) and symbolizes decimal numbers with a plus sign (e.g., +165 = $A5 = @245 = %10100101). The other MOS Technology radices are supported as described above.
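For anyone implementing the radix rules, here is a minimal sketch in Python of how a numeric literal might be converted. The names RADIX_PREFIXES and parse_number are made up for illustration and are not part of any MOS Technology specification.

```python
# Hypothetical table mapping each MOS Technology radix prefix to its base.
RADIX_PREFIXES = {"%": 2, "@": 8, "$": 16}

def parse_number(token: str) -> int:
    """Convert a literal such as %10100101, @245, 165 or $A5 to an integer."""
    base = RADIX_PREFIXES.get(token[:1])
    if base is None:
        return int(token, 10)      # no prefix: decimal is the default
    # Strip the prefix and convert the remaining digits in the selected base.
    # Note: Python's int() is more lenient than a real assembler should be.
    return int(token[1:], base)

for literal in ("%10100101", "@245", "165", "$A5"):
    print(literal, "=", parse_number(literal))
```

All four literals should come out as 165.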

Operand Representation
The operand to any MPU instruction or assembler directive (see labels and symbols below) must be in a format that clearly delineates between different addressing modes. Also, the operand must be in a form that the assembler can evaluate in an unambiguous fashion. Examples of proper operands are as follows:

165 165, an eight bit value.

$3FFF 16383, a 16 bit value.

165+21 186.

5+4*2/3 6.

'A 65, the ASCII value for the uppercase letter A.

* The program counter (assembly address).

*=$2000 Sets the program counter to decimal location 8192.

*=*+10 Advances the program counter ten locations.

Expressions in all cases are evaluated from left to right with no algebraic precedence, which is why 5+4*2/3 in the list above evaluates to 6.
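The left-to-right rule is easy to get wrong, so here is a rough Python sketch of an evaluator that applies it. It assumes decimal operands only, the four operators + - * /, and integer division; the function name evaluate is hypothetical.

```python
import re

def evaluate(expr: str) -> int:
    """Evaluate a decimal-only expression strictly left to right."""
    tokens = re.findall(r"\d+|[-+*/]", expr)
    result = int(tokens[0])
    # Apply each operator to the running result as soon as it is read,
    # ignoring conventional algebraic precedence.
    for op, operand in zip(tokens[1::2], tokens[2::2]):
        value = int(operand)
        if op == "+":
            result += value
        elif op == "-":
            result -= value
        elif op == "*":
            result *= value
        else:
            result //= value   # assumption: integer division
    return result

print(evaluate("5+4*2/3"))
```

The steps are 5+4 = 9, 9*2 = 18 and 18/3 = 6. Conventional precedence with integer division would give 7 instead.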

The assembler should also support the entry of character strings and other data into a static data storage area, which will be discussed below.

Label and Symbol Notation
A label refers to a memory location within a program and is automatically assigned the value of the program counter when assembled. A symbol (symbolic constant), on the other hand, refers to the explicit assignment of a value by program code. The difference between a label and a symbol, aside from the manner in which the assignment is made, is that some assemblers allow the value of a symbol to be changed later in the program (reassigned). On the other hand, a label cannot be changed, and an attempt to do so will cause assembly to halt with an error.

The MOS Technology standard for label and symbol naming states that the name must begin with an alphabetic character, may be followed by up to five alphanumeric characters, and may not be any MPU instruction mnemonic, such as CLC or LDA. A single character label or symbol cannot be A, X or Y, as those are reserved for representing the MPU registers—the use of single character names is not recommended. Labels and symbols are not supposed to be case-sensitive, which in practice is seldom observed anymore. Although I will be using uppercase examples below, I never use uppercase names in any assembly language programs that I develop, as it is too easy to get mixed up.

The size limitation of six characters was due to the modest capabilities of the systems in use at the time of the 6502's development, and is an unnecessary restriction in modern implementations. Contemporary assemblers (such as the HCD65 C-128 assembler) generally permit much longer names and also allow the use of the underscore character as part of the name. Hence a symbol such as THE_SIZE_OF_AN_INVENTORY_RECORD is acceptable (although ridiculous). This could also be done as TheSizeOfAnInventoryRecord, as is often found in Windows software development (and also ridiculous), as a standards-compliant assembler would ignore case.
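As a sketch of how an assembler might check names, the following Python function applies the original rules when strict is true and the relaxed, modern rules (any length, underscores allowed) otherwise. The function name, the strict flag and the abbreviated mnemonic list are all assumptions made for the example.

```python
import re

# Illustrative subset only; a real assembler would list every 6502 mnemonic.
MNEMONICS = {"ADC", "ASL", "CLC", "JSR", "LDA", "LDX", "LDY", "RTS", "STA"}
REGISTERS = {"A", "X", "Y"}

def is_valid_name(name: str, strict: bool = True) -> bool:
    """Check a label or symbol name against the naming rules."""
    upper = name.upper()            # names are not case-sensitive
    if upper in MNEMONICS or upper in REGISTERS:
        return False
    if strict:
        # Original rule: one letter, then up to five alphanumerics.
        return re.fullmatch(r"[A-Z][A-Z0-9]{0,5}", upper) is not None
    # Relaxed rule: any length, underscores allowed after the first letter.
    return re.fullmatch(r"[A-Z][A-Z0-9_]*", upper) is not None

print(is_valid_name("PUTREC"))                 # six characters
print(is_valid_name("S_CMMAS"))                # underscore, seven characters
print(is_valid_name("S_CMMAS", strict=False))  # relaxed rules
```

Under the strict rules PUTREC passes and S_CMMAS fails; under the relaxed rules S_CMMAS passes.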

Some label and symbol examples:

S_CMMAS =721 ;customer record master record size

The above creates the symbol S_CMMAS, representing the size of a customer master record, and assigns the value 721 to it. Following this assignment, any other instruction in the program can (and should) use S_CMMAS when referring to the size of a customer master record, instead of using the hard coded number 721. It is bad practice to bury "magic numbers" like 721 in code. Always symbolically represent them in a separate "include" file, where changes can be quickly and easily made if needed.

A_NUL =0 ;ASCII <NUL>
A_LF =10 ;ASCII <LF>
A_CR =13 ;ASCII <CR>

The above creates the symbolic constants A_NUL (ASCII null), A_LF (ASCII linefeed) and A_CR (ASCII carriage return) with the values 0, 10 and 13, respectively. Again, it is bad practice to hard code such values into programs.

OPCMMAS *=*+S_CMMAS ;customer master record image

The above code assigns the location in memory where a customer record will be buffered and reserves S_CMMAS bytes for storage, where S_CMMAS was previously defined as 721. The expression *=*+S_CMMAS advances the program counter S_CMMAS bytes. The rest of the program can refer to the record image location with OPCMMAS, such as in the following example:

LDX #<OPCMMAS
LDY #>OPCMMAS
JSR PUTREC

The above example illustrates the MOS Technology standard notation for taking the least significant byte (LSB, represented by <) and most significant byte (MSB, represented by >) of the 16 bit value of OPCMMAS, which we defined in the previous example, and passing the address to a function for processing.
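In an assembler's expression evaluator, the < and > operators reduce to a mask and a shift. A small Python sketch follows; the function names and the address $2A40 are invented for the example.

```python
def low_byte(value: int) -> int:
    """The < operator: least significant byte of a 16 bit value."""
    return value & 0xFF

def high_byte(value: int) -> int:
    """The > operator: most significant byte of a 16 bit value."""
    return (value >> 8) & 0xFF

OPCMMAS = 0x2A40   # hypothetical address, chosen only for illustration
print(f"LDX #${low_byte(OPCMMAS):02X}")
print(f"LDY #${high_byte(OPCMMAS):02X}")
```

With that assumed address, X would receive $40 and Y would receive $2A.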

Comments and Code Line Structure
All programs should be sufficiently commented to allow a third party to read and understand what the program is supposed to accomplish. Uncommented code is a sign of laziness and (in my opinion) programmer ignorance, especially when complex routines are involved. Nobody's memory is so good that s/he can write a program this year and expect several years hence to recall all the details of what was supposed to happen.

The MOS Technology standard for comments requires that they begin with a semicolon (;). Any text following the semicolon up to the end of the line will be ignored by the assembler, although it will usually be regurgitated in a printed listing as the assembler runs. Here's a simple example, which also illustrates how a fully formed line of code should appear:

PUTREC STX IMGPTR ;store image pointer LSB
STY IMGPTR+1 ;store image pointer MSB

Note that a fully formed code line (the STX IMGPTR line above) consists of four fields: label, instruction, operand and comment. The MOS Technology standard requires that the label always start at the left end of the line, followed by whitespace, followed by the instruction mnemonic, followed by whitespace, followed by the operand (if required), followed by whitespace again, followed by the comment, if any.

Whitespace is defined as at least one blank (ASCII 32) or horizontal tab (ASCII 9) character that is not bounded by quotes. If no label is to be used (e.g., the STY IMGPTR+1 line above), whitespace should start the line as shown above. Incidentally, MOS Technology never stated how the assembler should behave when it encounters a blank line or one consisting only of whitespace characters. The logical thing to do would be to discard the line without complaining about it, or add a semicolon to its beginning so it appears on a printed listing as a blank comment line.
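The field rules translate fairly directly into code. Below is a simplified Python sketch of a line splitter; split_comment and parse_line are hypothetical names, and the sketch assumes quoted strings are always closed with a matching quote (so the one-quote 'A form and symbol assignments such as *=$2000 are not handled).

```python
def split_comment(line: str):
    """Split at the first semicolon that is not inside single quotes."""
    in_quotes = False
    for i, ch in enumerate(line):
        if ch == "'":
            in_quotes = not in_quotes
        elif ch == ";" and not in_quotes:
            return line[:i], line[i + 1:]
    return line, ""

def parse_line(line: str):
    """Return (label, mnemonic, operand, comment).

    Returns None for a line with no code and no comment text, so the
    caller can quietly discard it.
    """
    code, comment = split_comment(line.rstrip("\n"))
    code = code.rstrip()
    if not code and not comment:
        return None
    # A label is present only if the line does not start with whitespace.
    has_label = code[:1] not in ("", " ", "\t")
    fields = code.split(None, 2 if has_label else 1)
    label = fields.pop(0) if has_label else ""
    mnemonic = fields[0] if fields else ""
    operand = fields[1] if len(fields) > 1 else ""
    return label, mnemonic.upper(), operand, comment.strip()

print(parse_line("PUTREC STX IMGPTR ;store image pointer LSB"))
print(parse_line(" STY IMGPTR+1 ;store image pointer MSB"))
```

Limiting the number of splits keeps whitespace inside a quoted operand intact, since everything after the mnemonic is taken as one field.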

Pseudo-Ops
A pseudo-op (pseudo-operation) is an instruction to which the assembler itself responds, as opposed to one that is assembled into code. The distinguishing characteristic of a pseudo-op is that it always starts with a period (.). The standard pseudo-ops required by MOS Technology are:

.BYTE Assembles one eight bit value.

.DBYTE Assembles one big-endian 16 bit value.

.END Terminates assembly, even if not at end-of-file.

.TEXT Assembles a null-terminated character string.

.WORD Assembles one little-endian 16 bit value.

As with symbols and labels, pseudo-ops are not supposed to be case-sensitive. Here are some syntactically correct examples:

FLAG .byte $00

The label FLAG is defined at the current address in the program counter and zero is assembled into that location.

INPBFR =$0200
IPBADR .WORD INPBFR

The symbol INPBFR refers to a fixed location in memory (an input buffer, in this case, at decimal 512 in RAM). The label IPBADR is set to the current address in the program counter and the little-endian address of the input buffer INPBFR is assembled at that location.

DSPTCH .DBYTE DPADD-1,DPSUB-1,DPMUL-1,DPDIV-1

The above assembles a comma-delimited table of big-endian addresses at location DSPTCH, each address being one location before the actual location to which it refers. If the following are true:

DPADD =$2000
DPSUB =$2140
DPMUL =$22FF
DPDIV =$2368

and the program counter is $1000 when assembly of the table occurs, RAM will appear as follows:

>1000 1F FF 21 3F 22 FE 23 67
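To double-check the byte order, here is a short Python sketch that emits the same table. The helper names dbyte and word are made up to mirror the pseudo-ops; the addresses are the ones given above.

```python
def dbyte(value: int) -> bytes:
    """.DBYTE: 16 bit value, most significant byte first (big-endian)."""
    return (value & 0xFFFF).to_bytes(2, "big")

def word(value: int) -> bytes:
    """.WORD: 16 bit value, least significant byte first (little-endian)."""
    return (value & 0xFFFF).to_bytes(2, "little")

DPADD, DPSUB, DPMUL, DPDIV = 0x2000, 0x2140, 0x22FF, 0x2368

# Each entry is the routine address minus one, stored big-endian.
table = b"".join(dbyte(addr - 1) for addr in (DPADD, DPSUB, DPMUL, DPDIV))
print(table.hex(" ").upper())   # prints: 1F FF 21 3F 22 FE 23 67
```

Swapping dbyte for word would give the .WORD layout instead, with each pair of bytes reversed.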

A typical use for the above table would be as follows:

LDX #INDEX ;table index
LDA DSPTCH,X ;routine-1 MSB
PHA ;setting up a return address
LDA DSPTCH+1,X ;routine-1 LSB
PHA
RTS ;goto routine

The above is an example of the 6502 machine code equivalent of an ON GOTO statement in BASIC. By the way, this unorthodox technique works because RTS, in pulling the "return address" from the stack, will increment it before loading it into the MPU's program counter. Hence the MPU will go to the selected routine's real address, not the address minus one.
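The off-by-one behavior can be checked with a toy model. This Python sketch mimics only the stack traffic of the PHA/PHA/RTS sequence; the function name dispatch is hypothetical, and the table bytes are the ones from the memory dump above.

```python
def dispatch(table: bytes, index: int) -> int:
    """Mimic PHA/PHA/RTS and return the address execution reaches."""
    stack = []
    stack.append(table[index])        # LDA DSPTCH,X / PHA: routine-1 MSB
    stack.append(table[index + 1])    # LDA DSPTCH+1,X / PHA: routine-1 LSB
    lsb = stack.pop()                 # RTS pulls the LSB first
    msb = stack.pop()                 # then the MSB
    return (((msb << 8) | lsb) + 1) & 0xFFFF   # and increments the result

TABLE = bytes.fromhex("1FFF213F22FE2367")
for index in range(0, len(TABLE), 2):
    print(f"index {index}: ${dispatch(TABLE, index):04X}")
```

Since each entry is two bytes, the index steps by two: 0, 2, 4 and 6 lead to $2000, $2140, $22FF and $2368.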

TITLE .TEXT 'C-128 80 Column Display Manager'

The above assembles the null-terminated character string C-128 80 Column Display Manager and stores the string starting at memory location TITLE, which will be whatever the program counter happens to be when the code is encountered. An alternate way of accomplishing the above would be:

TITLE .BYTE 'C-128 80 Column Display Manager',A_CR,A_LF,A_NUL

which allows characters that can't be entered from the keyboard to be included in the string. Each character bounded by the single quotes ('') is assembled with its ASCII value into memory. Values defined by symbols (e.g., A_CR) are assembled in the order shown. Commas are used to delimit values, as the entire series of bytes is treated as a single operand by the assembler. Note the use of the symbolic representation for the carriage return, linefeed and null characters, rather than hard-coded numbers.

Incidentally, to get a single quote into the string you would code as follows:

TITLE .BYTE 'BigDumbDinosaur''s Software',A_CR,A_LF,A_NUL

Note the BigDumbDinosaur''s construct, which will assemble as BigDumbDinosaur's.
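For those writing the operand parser, here is one way the doubled-quote rule might be handled, sketched in Python. The function name byte_operand is invented, the symbol table is filled in by hand with the values defined earlier, and the sketch assumes ASCII text and decimal numbers only.

```python
SYMBOLS = {"A_NUL": 0, "A_LF": 10, "A_CR": 13}   # values from the examples

def byte_operand(operand: str) -> bytes:
    """Assemble a comma-delimited .BYTE operand of strings and symbols."""
    out = bytearray()
    i = 0
    while i < len(operand):
        if operand[i] == "'":
            i += 1                                   # skip opening quote
            while i < len(operand):
                if operand[i] == "'":
                    if operand[i + 1:i + 2] == "'":  # doubled quote: literal '
                        out.append(ord("'"))
                        i += 2
                        continue
                    i += 1                           # closing quote
                    break
                out.append(ord(operand[i]))
                i += 1
        elif operand[i] == "," or operand[i].isspace():
            i += 1                                   # delimiter
        else:
            end = operand.find(",", i)
            end = len(operand) if end < 0 else end
            token = operand[i:end].strip()
            # A symbol name is looked up; anything else is taken as decimal.
            out.append(SYMBOLS[token] if token in SYMBOLS else int(token))
            i = end
    return bytes(out)

print(byte_operand("'BigDumbDinosaur''s Software',A_CR,A_LF,A_NUL"))
```

The doubled quote yields one apostrophe in the output, followed by the carriage return, linefeed and null bytes.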

In addition to the above pseudo-ops, others may be supported to control assembly. This would include macro-instructions (macros were not part of the original MOS Technology-supplied assembler), listing controls, conditional assembly, and so forth. The implementation of any of these features would depend on the inclination of the assembler's developer.

The Commodore MADS assembler package only looked at the first, second and third characters of pseudo-ops, which meant that the pseudo-ops .WORD and .WOR were functionally identical. This, of course, was not in keeping with the MOS Technology standard.

hannenz · February 12, 2008, 10:15 PM

hey - thanks for this one. this is quite informative.
though, i think - from having written an assembler myself and knowing the demands of an ml programmer towards an assembler - some things are not covered yet by this standard and some things should just be handled differently:

1.) the < and > operators for designating low and high byte values of an operand. they are still handled quite differently in many assemblers and i didn't find anything about this in the standard. Here someone should clear things up once and for all: which part of an expression, or even of a whole pseudo-directive line, the operators should apply to.
The most frightening thing i ever saw was Power-Assembler, which interprets a sequence of expressions in a pseudo-opcode instruction beginning with a < or > operator by applying it to the WHOLE line, like:
