% This file is part of the MMIXware package (c) Donald E Knuth 1999
@i boilerplate.w %<< legal stuff: PLEASE READ IT BEFORE MAKING ANY CHANGES!

\def\title{MMIX}
\input epsf % input macros for dvips to include METAPOST illustrations

\def\MMIX{\.{MMIX}}
\def\NNIX{\hbox{\mc NNIX}}
\def\Hex#1{\hbox{$^{\scriptscriptstyle\#}$\tt#1}} % experimental hex constant
\def\beginword{\vcenter\bgroup\let\\=\wordrule\halign\bgroup&\hfil##\hfil\cr}
\def\endword{\noalign{\vskip\baselineskip}\egroup\egroup
   \advance\belowdisplayskip-\baselineskip}
\def\wordrule{\vrule height 9.5pt depth 4.5pt width .4pt}
\newdimen\bitwd \bitwd=6.6pt
\def\field#1#2{\vrule depth 3pt width 0pt \hbox to#1\bitwd{\hss$#2$\hss}}

\def\XF{\\{XF}}\def\XM{\\{XM}}\def\XD{\\{XD}} % these not in \tt
\def\PC{\\{PC}}
\def\Jx{\.{J}} % conversely, I type J_ to get J in \tt

\def\s{{\rm s}}
\def\rX{{\rm\$X}} \def\rY{{\rm\$Y}} \def\rZ{{\rm\$Z}}
\def\mm{{\rm M}} \def\xx{{\rm X}} \def\yy{{\rm Y}} \def\zz{{\rm Z}}
%\def\ll{{\rm L}} \def\gg{{\rm G}}
\def\ll{L} \def\gg{G}
\def\?{\mkern-1mu}

\def\9#1{} % this is used for sort keys in the index via @@:sort key}{entry@@>

@* Introduction to MMIX.
Thirty-eight years have passed since the \.{MIX} computer was designed, and
computer architecture has been converging during those years
towards a rather different
style of machine. Therefore it is time to replace \.{MIX} with a new
computer that contains even less saturated fat than its predecessor.

Exercise 1.3.1--25 in the third edition of
{\sl Fundamental Algorithms\/} speaks of an extended
\.{MIX} called MixMaster, which is upward compatible with the old version.
But MixMaster itself is hopelessly obsolete; although it allows for
several gigabytes of memory, we can't even use it with {\mc ASCII} code to
get lowercase letters. And ouch, the standard subroutine calling convention
of \.{MIX} is irrevocably based on self-modifying code! Decimal arithmetic
and self-modifying code were popular in 1962, but they sure have disappeared
quickly as machines have gotten bigger and faster. A completely new
design is called for, based on the principles of RISC architecture as
expounded in {\sl Computer Architecture\/} by Hennessy and Patterson
(Morgan Kaufmann, 1996). % first ed was "Morgan Kaufman"! but now "nn" is legit
@^Hennessy, John LeRoy@>
@^Patterson, David Andrew@>

So here is \MMIX, a computer that will totally replace \.{MIX}
in the ``ultimate'' editions of {\sl The Art of Computer Programming},
Volumes 1--3, and in the first editions of the remaining volumes.
I~must confess that
I~can hardly wait to own a computer like this.

How do you pronounce \MMIX? I've been saying ``em-mix'' to myself,
because the first `\.M' represents a new millennium. Therefore I~use
the article ``an'' instead of~``a'' before the name \MMIX\
in English phrases like ``an \MMIX\ simulator.''

Incidentally, the {\sl Dictionary of American Regional English\/ \bf3} (1996)
lists ``mommix'' as a common dialect word used both as a noun and a verb;
to mommix something means to botch it, to bollix it. Only time will
tell whether I~have mommixed the definition of \MMIX.

@ The original \.{MIX} computer could be operated without an operating
system; you could bootstrap it with punched cards or paper tape and do
everything yourself. But nowadays such power is no longer in the hands of
ordinary users. The \MMIX\ hardware, like all other computing machines
made today, relies on an operating system to get jobs
started in their own address spaces and to provide I/O capabilities.

Whenever anybody has asked if I will be writing about operating systems,
my reply has always been ``Nix.'' Therefore the name of\/ \MMIX's operating
system, \NNIX, will come as no surprise.
@:NNIX}{\NNIX\ operating system@>
@^operating system@>
From time to time I will necessarily have to refer to things that \NNIX\ does
for its users, but I am unable to build \NNIX\ myself. Life is
too short. It would be wonderful if some expert in operating system design
became inspired to write a book that explains exactly how to construct a nice,
clean \NNIX\ kernel for an \MMIX\ chip.

@ I am deeply grateful to the many people who have helped me shape the behavior
of\/ \MMIX. In particular, John Hennessy and (especially) Dick Sites
have made significant contributions.
@^Hennessy, John LeRoy@>
@^Sites, Richard Lee@>

@ A programmer's introduction to \MMIX\ appears in ``Volume~1, Fascicle~1,''
@^Fascicle 1@>
a booklet containing tutorial material that will ultimately appear in the
fourth edition of {\sl The Art of Computer Programming}.
The description in the following sections is rather different, because
we are concerned about a complete implementation, including all of the
features used by the operating system and invisible to normal programs.
Here it is important to emphasize exceptional cases that were glossed over
in the tutorial, and~to consider
nitpicky details about things that might go wrong.

@* MMIX basics.
\MMIX\ is a 64-bit RISC machine with at least 256 general-purpose registers
and a 64-bit address space.
Every instruction is four bytes long and has the form
$$\vcenter{\offinterlineskip
 \def\\#1&{\omit&#1&}
 \hrule
 \halign{&\vrule#&\hbox to 4em{\tt\hfil#\hfil}\cr
 height 9pt depth4pt&OP&&X&&Y&&Z&\cr}
 \hrule}\,.$$
The 256 possible OP codes fall into a dozen or so easily remembered
@^OP codes@>
categories; an instruction usually means, ``Set register X to the
result of\/ Y~OP~Z\null.'' For example,
$$\vcenter{\offinterlineskip
 \def\\#1&{\omit&#1&}
 \hrule
 \halign{&\vrule#&\hbox to 4em{\tt\hfil#\hfil}\cr
 height 9pt depth4pt&32&&1&&2&&3&\cr}
 \hrule}$$
sets register~1 to the sum of registers 2 and 3.
A few instructions combine the Y and Z bytes into
a 16-bit YZ field; two of the jump instructions use a 24-bit XYZ field.
But the three bytes X, Y, Z usually have three-pronged significance
independent of each other.

Instructions are usually represented in a symbolic form corresponding
to the \MMIX\ assembly language, in which each operation code has a mnemonic
name. For example, operation~32 is \.{ADD}, and the instruction above
might be written `\.{ADD} \.{\$1,\$2,\$3}'; a dollar sign `\.\$' symbolizes
a register number. In general, the instruction
\.{ADD}~\.{\$X,\$Y,\$Z} is the operation of setting $\rX=\rY+\rZ$.
An assembly language instruction with two commas has three operand
fields X, Y,~Z; an instruction with one comma has two operand fields
X,~YZ; an instruction with no comma has one operand field,~XYZ;
an instruction with no operands has $\xx=\yy=\zz=0$.

\def\0{\$Z\char'174Z}
Most instructions have two forms, one in which the Z field stands for
register \$Z, and one in which Z is an unsigned ``immediate'' constant.
@^immediate operands@>
Thus, for example, the command `\.{ADD} \.{\$X,\$Y,\$Z}' has a counterpart
`\.{ADD} \.{\$X,\$Y,Z}', which sets $\rX=\rY+\zz$. Immediate constants
are always nonnegative.
In the descriptions
below we will introduce such pairs of instructions
by writing just `\.{ADD}~\.{\$X,\$Y,\0}' instead of naming both
cases explicitly.

The operation code for \.{ADD}~\.{\$X,\$Y,\$Z} is 32, but the operation
code for \.{ADD}~\.{\$X,\$Y,Z} is~33. The \MMIX\ assembler chooses the correct
code by noting whether the third argument is a register number or~not.

Register numbers and constants can be given symbolic names; for example, the
assembly language instruction `\.x~\.{IS}~\.{\$1}' makes \.x an
abbreviation for register number~1. Similarly, `\.{FIVE}~\.{IS}~\.5'
makes \.{FIVE} an abbreviation for the constant~5.
After these abbreviations have been specified, the instruction
\.{ADD}~\.{x,x,FIVE} increases \$1 by~5, using opcode~33, while
the instruction \.{ADD}~\.{x,x,x} doubles \$1 using opcode~32.
Symbolic names that stand for register numbers
conventionally begin with a lowercase letter, while names that stand
for constants conventionally begin with an uppercase letter.
This convention is not actually enforced by the assembler,
but it tends to reduce a programmer's confusion.

@ A {\it nybble\/} is a 4-bit quantity, often used to denote a decimal
or hexadecimal digit.
A {\it byte\/} is an 8-bit quantity, often used to denote an alphanumeric
character in {\mc ASCII} code. The Unicode standard extends {\mc ASCII} to
@^Unicode@>
@^ASCII@>
essentially all the world's languages by using 16-bit-wide characters called
{\it wydes\/}. (Weight watchers know that two nybbles make one byte,
but two bytes make one wyde.)
In the discussion below we use the term
{\it tetrabyte\/} or ``tetra'' for a 4-byte quantity, and the similar term
@^nybble@>
@^byte@>
@^wyde@>
@^tetrabyte@>
@^octabyte@>
{\it octabyte\/} or ``octa'' for an 8-byte quantity. Thus, a tetra is
two wydes, an octa is two tetras; an octabyte has 64~bits. Each \MMIX\
register can be thought of as containing one octabyte, or two tetras,
or four wydes, or eight bytes, or sixteen nybbles.

When bytes, wydes, tetras, and octas represent numbers they are said to be
either {\it signed\/} or {\it unsigned}. An unsigned byte is a number between
0~and $2^8-1=255$ inclusive; an unsigned wyde lies, similarly, between
0~and $2^{16}-1=65535$; an unsigned tetra lies between
0~and $2^{32}-1=4{,}294{,}967{,}295$; an unsigned octa lies between
0~and $2^{64}-1=18{,}446{,}744{,}073{,}709{,}551{,}615$.
Their signed counterparts use the
conventions of two's complement notation, by subtracting respectively $2^8$,
$2^{16}$, $2^{32}$, or~$2^{64}$ times the most significant bit. Thus,
the unsigned bytes 128 through 255 are regarded as the numbers $-128$
through~$-1$ when they are evaluated as signed bytes; a signed byte therefore
lies between $-128$ and $+127$, inclusive. A signed wyde is a number
between $-32768$ and $+32767$; a signed tetra lies between
$-2{,}147{,}483{,}648$ and $+2{,}147{,}483{,}647$; a signed octa lies between
$-9{,}223{,}372{,}036{,}854{,}775{,}808$ and $+9{,}223{,}372{,}036{,}854{,}775{,}807$.

The virtual memory of\/ \MMIX\ is an array M of $2^{64}$ bytes. If $k$ is any
unsigned octabyte, M[$k$]~is a 1-byte quantity. \MMIX\ machines do not
actually have such vast memories, but programmers can act as if $2^{64}$ bytes
are indeed present, because \MMIX\ provides address translation mechanisms by
which an operating system can maintain this illusion.

We use the notation $\mm_{2^t}[k]$ to stand for a number consisting of
$2^t$~consecutive bytes starting at location~$k\land\nobreak(2^{64}-2^t)$.
(The notation $k\land(2^{64}-2^t)$ means that the least
significant $t$ bits of~$k$ are set to~0, and only the least 64~bits
@^bit stuffing@>
of the resulting address are retained. Similarly, the notation
$k\lor(2^t-1)$ means that the least significant $t$ bits of~$k$ are set to~1.)
All accesses to $2^t$-byte quantities by \MMIX\ are {\it aligned}, in the sense
that the first byte is a multiple of~$2^t$.

Addressing is always ``big-endian.'' In other words, the
@^big-endian versus little-endian@>
@^little-endian versus big-endian@>
most significant (leftmost) byte of $\mm_{2^t}[k]$ is
$\mm_1[k\land\nobreak(2^{64}-2^t)]$
and the least significant (rightmost) byte is $\mm_1[k\lor(2^t-1)]$.
We use the notation $\s(\mm_{2^t}[k])$ when we want to regard
this $2^t$-byte number as a {\it signed\/} integer.
Formally speaking, if $l=2^t$,
@^signed integers@>
$$\s(\mm_l[k])=\bigl(\mm_1[k\land(-l)]\,\mm_1[k\land(-l)+1]\,\ldots\,
 \mm_1[k\lor(l-1)]\bigr)_{256}
-2^{8l}[\mm_1[k\land(-l)]\!\ge\!128].$$

@* Loading and storing.
Several instructions can be used to get information from memory into
registers. For example, the ``load tetra unsigned'' instruction
\.{LDTU} \.{\$1,\$4,\$5}
puts the four bytes $\mm_4[\$4+\$5]$ into register~1 as an unsigned
integer;
the most significant four bytes of register~1 are set to zero.
The similar instruction \.{LDT} \.{\$1,\$4,\$5}, ``load tetra,'' sets
\$1 to the {\it signed\/} integer $\s(\mm_4[\$4+\$5])$.
(Instructions generally treat numbers as
@^signed integers@>
signed unless the operation code specifically calls them
unsigned.) In the signed case, the most significant four bytes of the
register will be copies of the most significant bit of the tetrabyte
loaded; thus they will be all~0s or all~1s, depending on whether the
number is $\ge0$ or $<0$.

\def\bull{\smallbreak\textindent{$\bullet$}}
\def\bul{\par\textindent{$\bullet$}}
\def\<#1 #2 {\.{#1}~\.{#2} }
\def\>{\hfill\break}

\bull\<LDB \$X,\$Y,\0 `load byte'.\>
@.LDB@>
Byte $\s(\mm[\rY+\rZ])$ or $\s(\mm[\rY+\zz])$ is loaded into register~X as a
signed number between $-128$ and $+127$, inclusive.

\bull\<LDBU \$X,\$Y,\0 `load byte unsigned'.@>
@.LDBU@>
Byte $\mm[\rY+\rZ]$ or $\mm[\rY+\zz]$ is loaded into register~X as an
unsigned number between $0$ and $255$, inclusive.

\bull\<LDW \$X,\$Y,\0 `load wyde'.\>
@.LDW@>
Bytes $\s(\mm_2[\rY+\rZ])$ or $\s(\mm_2[\rY+\zz])$
are loaded into register~X as a signed number between $-32768$ and $+32767$,
inclusive. As mentioned above, our notation $\mm_2[k]$ implies that
the least significant bit of the address $\rY+\rZ$ or $\rY+\zz$ is
ignored and assumed to be~0.
@^bit stuffing@>

\bull\<LDWU \$X,\$Y,\0 `load wyde unsigned'.@>
@.LDWU@>
Bytes $\mm_2[\rY+\rZ]$ or $\mm_2[\rY+\zz]$ are loaded
into register~X as an unsigned number between $0$ and $65535$, inclusive.

\bull\<LDT \$X,\$Y,\0 `load tetra'.\>
@.LDT@>
Bytes $\s(\mm_4[\rY+\rZ])$ or $\s(\mm_4[\rY+\zz])$
are loaded into register~X as a signed number between $-2{,}147{,}483{,}648$ and
$+2{,}147{,}483{,}647$, inclusive.
As mentioned above, our notation $\mm_4[k]$ implies that
the two least significant bits of the address $\rY+\rZ$ or $\rY+\zz$ are
ignored and assumed to be~0.
@^bit stuffing@>

\bull\<LDTU \$X,\$Y,\0 `load tetra unsigned'.\>
@.LDTU@>
Bytes $\mm_4[\rY+\rZ]$ or $\mm_4[\rY+\zz]$
are loaded into register~X as an unsigned number between 0 and
4{,}294{,}967{,}296, inclusive.

\bull\<LDO \$X,\$Y,\0 `load octa'.\>
@.LDO@>
Bytes $\mm_8[\rY+\rZ]$ or $\mm_8[\rY+\zz]$ are loaded into
register~X\null.
As mentioned above, our notation $\mm_8[k]$ implies that
the three least significant bits of the address $\rY+\rZ$ or $\rY+\zz$ are
ignored and assumed to be~0.
@^bit stuffing@>

\bull\<LDOU \$X,\$Y,\0 `load octa unsigned'.\>
@.LDOU@>
Bytes $\mm_8[\rY+\rZ]$ or $\mm_8[\rY+\zz]$ are loaded into
register~X\null. There is in fact no difference between the behavior of
\.{LDOU} and~\.{LDO}, since
an octabyte can be regarded as either signed or unsigned. \.{LDOU} is included
in \MMIX\ just for completeness and consistency, in spite of the fact that
a foolish consistency is the hobgoblin of little minds.
@^Emerson, Ralph Waldo@>
(Niklaus Wirth made a strong plea for such consistency in his early critique
of System/360; see {\sl JACM\/ \bf15} (1967), 37--74.)
@^Wirth, Niklaus Emil@>
@^System/360@>

\bull\<LDHT \$X,\$Y,\0 `load high tetra'.\>
@.LDHT@>
Bytes $\mm_4[\rY+\rZ]$ or $\mm_4[\rY+\zz]$ are loaded into the most
significant half of register~X, and the least significant half is
cleared to zero. (One use of ``high tetra arithmetic'' is to detect
overflow easily when tetrabytes are added or subtracted.)

\bull\<LDA \$X,\$Y,\0 `load address'.\>
The address $\rY+\rZ$ or $\rY+\zz$ is loaded into register~X. This
instruction is simply another name for the \.{ADDU} instruction
discussed below; it can
be used when the programmer is thinking of memory addresses
instead of numbers.
The \MMIX\ assembler converts \.{LDA} into the same OP-code as \.{ADDU}.
@.LDA@>
@.ADDU@>

@ Another family of instructions goes the other way, storing registers into
memory. For example, the ``store octa immediate'' command
\<STO \$3,\$2,17 puts the current contents of register~3
into $\mm_8[\$2+17]$.

\bull\<STB \$X,\$Y,\0 `store byte'.\>
@.STB@>
The least significant byte of register~X is stored into
byte $\mm[\rY+\rZ]$ or $\mm[\rY+\zz]$. An integer overflow exception occurs if
@^overflow@>
\$X is not between $-128$ and $+127$. (We will discuss overflow and other
kinds of exceptions later.)

\bull\<STBU \$X,\$Y,\0 `store byte unsigned'.@>\>
@.STBU@>
The least significant byte of register~X is stored into
byte $\mm[\rY+\rZ]$ or $\mm[\rY+\zz]$. \.{STBU} instructions are the same
as \.{STB} instructions, except that no test for overflow is made.

\bull\<STW \$X,\$Y,\0 `store wyde'.\>
@.STW@>
The two least significant bytes of register~X are stored into
bytes $\mm_2[\rY+\rZ]$ or $\mm_2[\rY+\zz]$.
An integer overflow exception occurs if
\$X is not between $-32768$ and $+32767$.

\bull\<STWU \$X,\$Y,\0 `store wyde unsigned'.@>\>
@.STWU@>
The two least significant bytes of register~X are stored into
bytes $\mm_2[\rY+\rZ]$ or $\mm_2[\rY+\zz]$.
\.{STWU} instructions are the same
as \.{STW} instructions, except that no test for overflow is made.

\bull\<STT \$X,\$Y,\0 `store tetra'.\>
@.STT@>
The four least significant bytes of register~X are stored into
bytes $\mm_4[\rY+\rZ]$ or $\mm_4[\rY+\zz]$.
An integer overflow exception occurs if
\$X is not between $-2{,}147{,}483{,}648$ and $+2{,}147{,}483{,}647$.

\bull\<STTU \$X,\$Y,\0 `store tetra unsigned'.\>
@.STTU@>
The four least significant bytes of register~X are stored into
bytes $\mm_4[\rY+\rZ]$ or $\mm_4[\rY+\zz]$.
\.{STTU} instructions are the same
as \.{STT} instructions, except that no test for overflow is made.

\bull\<STO \$X,\$Y,\0 `store octa'.\>
@.STO@>
Register X is stored into bytes $\mm_8[\rY+\rZ]$ or
$\mm_8[\rY+\zz]$.

\bull\<STOU \$X,\$Y,\0 `store octa unsigned'.\>
@.STOU@>
Identical to \.{STO} \.{\$X,\$Y,\0}.

\bull\<STCO X,\$Y,\0 `store constant octabyte'.\>
@.STCO@>
An octabyte whose value is the unsigned byte X is stored into
$\mm_8[\rY+\rZ]$ or $\mm_8[\rY+\zz]$.

\bull\<STHT \$X,\$Y,\0 `store high tetra'.\>
The most significant four bytes of register~X are stored into
$\mm_4[\rY+\rZ]$ or $\mm_4[\rY+\zz]$.
@.STHT@>

@* Adding and subtracting.
Once numbers are in registers, we can compute with them. Let's consider
addition and subtraction first.

\bull\<ADD \$X,\$Y,\0 `add'.\>
@.ADD@>
The sum $\rY+\rZ$ or $\rY+\zz$ is placed into register~X using signed,
two's complement arithmetic.
An integer overflow exception occurs if the sum is $\ge2^{63}$ or $<-2^{63}$.
(We will discuss overflow and other kinds of exceptions later.)
@^overflow@>

\bull\<ADDU \$X,\$Y,\0 `add unsigned'.\>
@.ADDU@>
The sum $(\rY+\rZ)\bmod2^{64}$ or $(\rY+\zz)\bmod2^{64}$
is placed into register~X\null.
These instructions are the same
as \.{ADD}~\.{\$X,\$Y,\0} commands
except that no test for overflow is made.
(Overflow could be detected if desired by using the command
\<CMPU ovflo,\$X,\$Y after addition, where \.{CMPU} means
@.CMPU@>
``compare unsigned''; see below.)

\bull\<2ADDU \$X,\$Y,\0 `times 2 and add unsigned'.\>
@.2ADDU@>
The sum $(2\rY+\rZ)\bmod2^{64}$ or $(2\rY+\zz)\bmod2^{64}$
is placed into register~X\null.

\bull\<4ADDU \$X,\$Y,\0 `times 4 and add unsigned'.\>
@.4ADDU@>
The sum $(4\rY+\rZ)\bmod2^{64}$ or $(4\rY+\zz)\bmod2^{64}$
is placed into register~X\null.

\bull\<8ADDU \$X,\$Y,\0 `times 8 and add unsigned'.\>
@.8ADDU@>
The sum $(8\rY+\rZ)\bmod2^{64}$ or $(8\rY+\zz)\bmod2^{64}$
is placed into register~X\null.

\bull\<16ADDU \$X,\$Y,\0 `times 16 and add unsigned'.\>
@.16ADDU@>
The sum $(16\rY+\rZ)\bmod2^{64}$ or $(16\rY+\zz)\bmod2^{64}$
is placed into register~X\null.

\bull\<SUB \$X,\$Y,\0 `subtract'.\>
@.SUB@>
The difference $\rY-\rZ$ or $\rY-\zz$ is placed into register~X using
signed, two's complement arithmetic.
An integer overflow exception occurs if the difference is $\ge2^{63}$ or
$<-2^{63}$.

\bull\<SUBU \$X,\$Y,\0 `subtract unsigned'.\>
@.SUBU@>
The difference $(\rY-\rZ)\bmod2^{64}$ or $(\rY-\zz)\bmod2^{64}$
is placed into register~X\null.
These two instructions are the same
as \.{SUB}~\.{\$X,\$Y,\0} except that no test for overflow is made.

\bull\<NEG \$X,Y,\0 `negate'.\>
@.NEG@>
The value $\yy-\rZ$ or $\yy-\zz$ is placed into register~X using
signed, two's complement arithmetic.
An integer overflow exception occurs if the result is greater
than~$2^{63}-\nobreak1$.
(Notice that in this case \MMIX\ works with the ``immediate'' constant~Y,
not register~Y\null. \.{NEG} commands are analogous to the immediate variants
of other commands, because they save us from having to put one-byte
constants into a register. When $\yy=0$, overflow occurs if and
only if $\rZ=-2^{63}$. The instruction \<NEG \$X,1,2 has exactly the
same effect as \.{NEG}~\.{\$X,0,1}.)

\bull\<NEGU \$X,Y,\0 `negate unsigned'.\>
@.NEGU@>
The value $(\yy-\rZ)\bmod2^{64}$ or $(\yy-\zz)\bmod2^{64}$
is placed into register~X\null.
\.{NEGU} instructions are the same
as \.{NEG} instructions, except that no test for overflow is made.

@* Bit fiddling.
Before looking at multiplication and division, which take longer than
addition and subtraction, let's look at some of the other things that
\MMIX\ can do fast. There are eighteen instructions for bitwise
logical operations on unsigned numbers.

\bull\<AND \$X,\$Y,\0 `bitwise and'.\>
@.AND@>
Each bit of register Y is logically anded with the corresponding bit of
register~Z or of the constant~Z, and the result is placed in register~X\null.
In other words, a bit of register~X is set to~1 if and only if the
corresponding bits of the operands are both~1;
in symbols, $\rX=\rY\land\rZ$ or $\rX=\rY\land\zz$.
This means in particular that \<AND \$X,\$Y,Z always zeroes out the seven
most significant bytes of register~X, because 0s are prefixed to the
constant byte~Z\null.

\bull\<OR \$X,\$Y,\0 `bitwise or'.\>
@.OR@>
Each bit of register Y is logically ored with the corresponding bit of
register~Z or of the constant~Z, and the result is placed in register~X\null.
In other words, a bit of register~X is set to~0 if and only if the
corresponding bits of the operands are both~0;
in symbols, $\rX=\rY\lor\rZ$ or $\rX=\rY\lor\zz$.

In the special case $\zz=0$, the immediate variant of
this command simply copies register~Y to
register~X\null. The \MMIX\ assembler allows us to write
`\.{SET}~\.{\$X,\$Y}' as a convenient abbreviation for
`\.{OR}~\.{\$X,\$Y,0}'.
@.SET@>

\bull\<XOR \$X,\$Y,\0 `bitwise exclusive-or'.\>
@.XOR@>
Each bit of register Y is logically xored with the corresponding bit of
register~Z or of the constant~Z, and the result is placed in register~X\null.
In other words, a bit of register~X is set to~0 if and only if the
corresponding bits of the operands are equal;
in symbols, $\rX=\rY\oplus\rZ$ or $\rX=\rY\oplus\zz$.

\bull\<ANDN \$X,\$Y,\0 `bitwise and-not'.\>
@.ANDN@>
Each bit of register Y is logically anded with the complement of the
corresponding bit of
register~Z or of the constant~Z, and the result is placed in register~X\null.
In other words, a bit of register~X is set to~1 if and only if the
corresponding bit of register~Y is~1 and the other corresponding bit is~0;
in symbols, $\rX=\rY\setminus\rZ$ or $\rX=\rY\setminus\zz$.
(This is the {\it logical difference\/} operation; if the operands 
are bit strings representing sets, we are computing the elements that
lie in one set but not the other.)

\bull\<ORN \$X,\$Y,\0 `bitwise or-not'.\>
@.ORN@>
Each bit of register Y is logically ored with the complement of the
corresponding bit of
register~Z or of the constant~Z, and the result is placed in register~X\null.
In other words, a bit of register~X is set to~1 if and only if the
corresponding bit of register~Y is greater than or equal to the other corresponding bit;
in symbols, $\rX=\rY\lor\overline\rZ$
or $\rX=\rY\lor\overline\zz$.
(This is the complement of $\rZ\setminus\rY$ or $\zz\setminus\rY$.)

\bull\<NAND \$X,\$Y,\0 `bitwise not-and'.\>
@.NAND@>
Each bit of register Y is logically anded with the corresponding bit of
register~Z or of the constant~Z, and the complement of the result is placed in register~X\null.
In other words, a bit of register~X is set to~0 if and only if the
corresponding bits of the operands are both~1;
in symbols, $\rX=\rY\mathbin{\overline\land}\rZ$ or
$\rX=\rY\mathbin{\overline\land}\zz$.

\bull\<NOR \$X,\$Y,\0 `bitwise not-or'.\>
@.NOR@>
Each bit of register Y is logically ored with the corresponding bit of
register~Z or of the constant~Z, and the complement of the result is placed in register~X\null.
In other words, a bit of register~X is set to~1 if and only if the
corresponding bits of the operands are both~0;
in symbols, $\rX=\rY\mathbin{\overline\lor}\rZ$ or
$\rX=\rY\mathbin{\overline\lor}\zz$.

\bull\<NXOR \$X,\$Y,\0 `bitwise not-exclusive-or'.\>
@.NAND@>
Each bit of register Y is logically xored with the corresponding bit of
register~Z or of the constant~Z, and the complement of the result is placed in register~X\null.
In other words, a bit of register~X is set to~1 if and only if the
corresponding bits of the operands are equal;
in symbols, $\rX=\rY\mathbin{\overline\oplus}\rZ$ or
$\rX=\rY\mathbin{\overline\oplus}\zz$.

\bull\<MUX \$X,\$Y,\0 `bitwise multiplex'.\>
@.MUX@>
For each bit position~$j$, the $j$th bit of register~X is set either to
bit~$j$ of register~Y
or to bit~$j$ of the other operand \$Z~or~Z, depending on
whether bit~$j$ of the special {\it mask register\/}~rM is 1 or 0:
@^rM@>
if ${\rm M}_j$ then $\yy_j$ else~$\zz_j$.
In symbols, $\rm\rX=(\rY\land rM)\lor(\rZ\land\overline{rM})$ or
            $\rm\rX=(\rY\land rM)\lor(\zz\land\overline{rM})$.
(\MMIX\ has several such special registers, associated with instructions that
need more than two inputs or produce more than one output.)

@ Besides the eighteen bitwise operations, \MMIX\ can also perform unsigned
bytewise and biggerwise operations that are somewhat more exotic.

\bull\<BDIF \$X,\$Y,\0 `byte difference'.\>
@.BDIF@>
For each byte position~$j$, the $j$th byte of register~X is set to byte~$j$ of
register~Y minus byte~$j$ of the other operand \$Z~or~Z, unless that
difference is negative; in the latter case, byte~$j$ of~\$X is set to zero.

\bull\<WDIF \$X,\$Y,\0 `wyde difference'.\>
@.WDIF@>
For each wyde position~$j$, the $j$th wyde of register~X is set to wyde~$j$ of
register~Y minus wyde~$j$ of the other operand \$Z~or~Z, unless that
difference is negative; in the latter case, wyde~$j$ of~\$X is set to zero.

\bull\<TDIF \$X,\$Y,\0 `tetra difference'.\>
@.TDIF@>
For each tetra position~$j$, the $j$th tetra of register~X is set to tetra~$j$ of
register~Y minus tetra~$j$ of the other operand \$Z~or~Z, unless that
difference is negative; in the latter case, tetra~$j$ of~\$X is set to zero.

\bull\<ODIF \$X,\$Y,\0 `octa difference'.\>
@.ODIF@>
Register~X is set to register~Y minus the other operand \$Z~or~Z, unless
\$Z~or~Z exceeds register~Y; in the latter case,
\$X~is set to zero. The operands are treated as unsigned integers.

\smallskip
The \.{BDIF} and \.{WDIF} commands are useful
in applications to graphics or video; \.{TDIF} and \.{ODIF} are also
present for reasons of consistency. For example, if \.a and \.b are
registers containing
8-byte quantities, their bytewise maxima~\.c and
bytewise minima~\.d are computed by
$$\hbox{\tt BDIF x,a,b; ADDU c,x,b; SUBU d,a,x;}$$
similarly, the individual ``pixel differences'' \.e, namely the absolute
values of the differences of corresponding bytes, are computed by
$$\hbox{\tt BDIF x,a,b; BDIF y,b,a; OR e,x,y.}$$
To add individual
bytes of \.a and \.b while clipping all sums to 255 if they don't fit
in a single byte, one can say
$$\hbox{\tt NOR acomp,a,0; BDIF x,acomp,b; NOR clippedsums,x,0;}$$
in other words, complement \.a, apply \.{BDIF}, and complement the result.
The operations can also be used to construct efficient operations on
strings of bytes or wydes.
@^graphics@>
@^pixels@>
@^saturating arithmetic@>
@^nybble@>

Exercise: Implement a ``nybble difference'' instruction that operates
in a similar way on sixteen nybbles at a time.

Answer: {\tt\spaceskip=.5em minus .3em
AND x,a,m; AND y,b,m; ANDN xx,a,m; ANDN yy,b,m;
BDIF x,x,y; BDIF xx,xx,yy; OR ans,x,xx} where register \.m contains
the mask \Hex{0f0f0f0f0f0f0f0f}.

(The \.{ANDN} operation can be regarded as
a ``bit difference'' instruction that operates
in a similar way on 64 bits at a time.)

@ Three more pairs of bit-fiddling instructions round out the collection of exotics.

\bull\<SADD \$X,\$Y,\0 `sideways add'.\>
@.SADD@>
Each bit of register Y is logically anded with the complement of the
corresponding bit of
register~Z or of the constant~Z, and the number of 1~bits in the
result is placed in register~X\null.
In other words, register~X is set to the number of bit positions
in which register~Y has a~1 and the other operand has a~0;
in symbols, $\rX=\nu(\rY\setminus\rZ)$ or $\rX=\nu(\rY\setminus\zz)$.
When the second operand is zero this operation is sometimes called
``population counting,'' because it counts the number of 1s in register~Y\null.
@^population counting@>
@^counting ones@>

\bull\<MOR \$X,\$Y,\0 `multiple or'.\>
@.MOR@>
Suppose the 64 bits of register Y are indexed as
$$y_{00}y_{01}\ldots y_{07}y_{10}y_{11}\ldots y_{17}\ldots
  y_{70}y_{71}\ldots y_{77};$$
in other words, $y_{ij}$ is the $j$th bit of the $i$th byte, if we
number the bits and bytes from 0 to 7 in big-endian fashion from left to right.
Let the bits of the other operand, \$Z or~Z, be indexed similarly:
$$z_{00}z_{01}\ldots z_{07}z_{10}z_{11}\ldots z_{17}\ldots
  z_{70}z_{71}\ldots z_{77}.$$
The \.{MOR} operation replaces each bit $x_{ij}$ of register~X by the bit
$$ y_{0j}z_{i0}\lor y_{1j}z_{i1}\lor \cdots \lor y_{7j}z_{i7}.$$
Thus, for example, if register Z contains the constant
\Hex{0102040810204080},
\.{MOR} reverses the order of the bytes in register~Y, converting between
little-endian and big-endian addressing.
@^big-endian versus little-endian@>
@^little-endian versus big-endian@>
(The $i$th byte of~\$X depends on the bytes of~\$Y as specified by the
$i$th byte of~\$Z or~Z\null. If we regard
64-bit words as $8\times8$ Boolean matrices, with one byte per column,
this operation computes the
Boolean product $\rX=\rY\,\rZ$ or $\rX=\rY\,\zz$. Alternatively, if we
regard 64-bit words as $8\times8$ matrices with one byte per~{\it row},
\.{MOR} computes the Boolean product $\rX=\rZ\,\rY$ or $\rX=\zz\,\rY$
with operands in the opposite order. The immediate form
\<MOR \$X,\$Y,Z always sets the leading seven bytes of register~X
to zero; the other byte is set to the bitwise or of whatever bytes of
register~Y are specified by the immediate operand~Z\null.)

Exercise: Explain how to compute a mask \.m that is \Hex{ff} in byte
positions where \.a exceeds \.b, \Hex{00} in all other bytes.
Answer: \.{BDIF}~\.{x,a,b;} \.{MOR}~\.{m,minusone,x;}
here \.{minusone} is a register consisting of all 1s. (Moreover,
if we \.{AND} this result
with \Hex{8040201008040201}, then \.{MOR} with $\zz=255$, we get
a one-byte encoding~of~\.m.)

\bull\<MXOR \$X,\$Y,\0 `multiple exclusive-or'.\>
@.MXOR@>
This operation is like the Boolean multiplication just discussed, but
exclusive-or is used to combine the bits. Thus we obtain a matrix
product over the field of two elements instead of a Boolean matrix product.
This operation can be used to construct hash functions, among many other things.
(The hash functions aren't bad, but they are not ``universal'' in the
sense of {\sl Sorting and Searching}, exercise 6.4--72.)
@^matrices of bits@>
@^Boolean multiplication@>

@ Sixteen ``immediate wyde'' instructions are available for the common
case that a 16-bit constant is needed. In this case the Y~and~Z fields
of the instruction are regarded as a single 16-bit unsigned number~YZ\null.
@^immediate operands@>

\bull\<SETH \$X,YZ `set to high wyde';
@.SETH@>
     \<SETMH \$X,YZ `set to medium high wyde';
@.SETMH@>
     \<SETML \$X,YZ `set to medium low wyde';
@.SETML@>
     \<SETL \$X,YZ `set to low wyde'.\>
@.SETL@>
The 16-bit unsigned number YZ is shifted left
by either 48 or 32 or 16 or 0 bits, respectively, and placed into register~X\null.
Thus, for example, \.{SETML} inserts
a given value into the second-least-significant wyde of register~X and
sets the other three wydes to zero.

\bull\<INCH \$X,YZ `increase by high wyde';
@.INCH@>
     \<INCMH \$X,YZ `increase by medium high wyde';
@.INCMH@>
     \<INCML \$X,YZ `increase by medium low wyde';
@.INCML@>
     \<INCL \$X,YZ `increase by low wyde'.\>
@.INCL@>
The 16-bit unsigned number YZ is shifted left
by either 48 or 32 or 16 or 0 bits, respectively, and added to register~X,
ignoring overflow; the result is placed back into register~X\null.

If YZ is the hexadecimal constant \Hex{8000}, the command \<INCH \$X,YZ
complements the most significant bit of register~X\null. We will see
below that this can be used to negate a floating point number.
@^negation, floating point@>

\bull\<ORH \$X,YZ `bitwise or with high wyde';
@.ORH@>
     \<ORMH \$X,YZ `bitwise or with medium high wyde';
@.ORMH@>
     \<ORML \$X,YZ `bitwise or with medium low wyde';
@.ORML@>
     \<ORL \$X,YZ `bitwise or with low wyde'.\>
@.ORL@>
The 16-bit unsigned number YZ is shifted left
by either 48 or 32 or 16 or 0 bits, respectively, and ored with register~X;
the result is placed back into register~X\null.

Notice that any desired 4-wyde constant \.{GH} \.{IJ} \.{KL} \.{MN}
can be inserted into a register with a sequence of four instructions
such as
$$\hbox{\tt SETH \$X,GH; INCMH \$X,IJ; INCML \$X,KL; INCL \$X,MN;}$$
any of these \.{INC} instructions could also be replaced by \.{OR}.

\bull\<ANDNH \$X,YZ `bitwise and-not high wyde';
@.ANDNH@>
     \<ANDNMH \$X,YZ `bitwise and-not medium high wyde';\>
@.ANDNMH@>
     \<ANDNML \$X,YZ `bitwise and-not medium low wyde';
@.ANDNML@>
     \<ANDNL \$X,YZ `bitwise and-not low wyde'.\>
@.ANDNL@>
The 16-bit unsigned number YZ is shifted left
by either 48 or 32 or 16 or 0 bits, respectively, then
complemented and anded with register~X;
the result is placed back into register~X\null.

If YZ is the hexadecimal
constant \Hex{8000}, the command \<ANDNH \$X,YZ forces the most significant
bit of register~X to be~0. This can be used to compute the absolute value of
a floating point number.
@^absolute value, floating point@>

@ \MMIX\ knows several ways to shift a register left or right
by any number of bits.

\bull\<SL \$X,\$Y,\0 `shift left'.\>
@.SL@>
The bits of register~Y are shifted left by \$Z or Z places, and 0s
are shifted in from the right; the result is placed in register~X\null.
Register~Y is treated as a signed number, but
the second operand is treated as an unsigned number.
The effect is the same as multiplication by 
$2^{\mkern1mu\rZ}$ or by $2^\zz$; an integer overflow exception occurs if the
result is $\ge2^{63}$ or $<-2^{63}$.
In particular, if the second operand is 64 or~more, register~X will
become entirely zero, and integer overflow will be signaled unless
register~Y was zero.

\bull\<SLU \$X,\$Y,\0 `shift left unsigned'.\>
@.SLU@>
The bits of register~Y are shifted left by \$Z or Z places, and 0s
are shifted in from the right; the result is placed in register~X\null.
Both operands are treated as unsigned numbers. The \.{SLU} instructions
are equivalent to \.{SL}, except that no test for overflow is made.

\bull\<SR \$X,\$Y,\0 `shift right'.\>
@.SR@>
The bits of register~Y are shifted right by \$Z or Z places, and copies
of the leftmost bit (the sign bit) are shifted in from the left; the result is
placed in register~X\null.
Register~Y is treated as a signed number, but
the second operand is treated as an unsigned number.
The effect is the same as division by $2^{\mkern1mu\rZ}$ or by
$2^\zz$ and rounding down. In particular, if the second operand is 64 or~more,
register~X will become zero if \$Y was nonnegative, $-1$ if \$Y was negative.

\bull\<SRU \$X,\$Y,\0 `shift right unsigned'.\>
@.SRU@>
The bits of register~Y are shifted right by \$Z or Z places, and 0s
are shifted in from the left; the result is placed in register~X\null.
Both operands are treated as unsigned numbers.
The effect is the same as unsigned division of a 64-bit number
by $2^{\mkern1mu\rZ}$ or by~$2^\zz$;
if the second operand is 64 or~more, register~X will become entirely~zero. 

@* Comparisons.
Arithmetic and logical operations are nice,
but computer programs also need to compare numbers
and to change the course of a calculation depending on what they find.
\MMIX\ has four comparison instructions to facilitate such decision-making.

\bull\<CMP \$X,\$Y,\0 `compare'.\>
@.CMP@>
Register X is set to $-1$ if register Y is less than register Z or less than
the unsigned immediate value~Z, using the conventions of signed
arithmetic; it is set to 0 if register~Y is equal to register Z or equal to
the unsigned immediate value~Z; otherwise it is set to~1.
In symbols, $\rX=[\rY\!>\!\rZ]-[\rY\!<\!\rZ]$ or $\rX=[\rY\!>\!\zz]-[\rY\!<\!\zz]$.

\bull\<CMPU \$X,\$Y,\0 `compare unsigned'.\>
@.CMPU@>
Register X is set to $-1$ if register Y is less than register Z or less than
the unsigned immediate value Z, using the conventions of unsigned
arithmetic; it is set to 0 if register Y is equal to register Z or equal to
the unsigned immediate value~Z; otherwise it is set to~1.
In symbols, $\rX=[\rY\!>\!\rZ]-[\rY\!<\!\rZ]$ or $\rX=[\rY\!>\!\zz]-[\rY\!<\!\zz]$.

@ There also are 32 conditional instructions, which choose quickly between
two alternative courses of action.

\bull\<CSN \$X,\$Y,\0 `conditionally set if negative'.\>
@.CSN@>
If register Y is negative (namely if its most significant bit is~1),
register~X is set to the contents of register~Z or to the
unsigned immediate value~Z. Otherwise nothing happens.

\bull\<CSZ \$X,\$Y,\0 `conditionally set if zero'.
@.CSZ@>
\bul\<CSP \$X,\$Y,\0 `conditionally set if positive'.
@.CSP@>
\bul\<CSOD \$X,\$Y,\0 `conditionally set if odd'.
@.CSOD@>
\bul\<CSNN \$X,\$Y,\0 `conditionally set if nonnegative'.
@.CSNN@>
\bul\<CSNZ \$X,\$Y,\0 `conditionally set if nonzero'.
@.CSNZ@>
\bul\<CSNP \$X,\$Y,\0 `conditionally set if nonpositive'.
@.CSNP@>
\bul\<CSEV \$X,\$Y,\0 `conditionally set if even'.\>
@.CSEV@>
These instructions are entirely analogous to \.{CSN}, except
that register~X changes only if register~Y is respectively zero, positive,
odd, nonnegative, nonzero, nonpositive, or nonodd.

\bull\<ZSN \$X,\$Y,\0 `zero or set if negative'.\>
@.ZSN@>
If register Y is negative (namely if its most significant bit is~1),
register~X is set to the contents of register~Z or to the
unsigned immediate value~Z. Otherwise register~X is set to zero.

\bull\<ZSZ \$X,\$Y,\0 `zero or set if zero'.
@.ZSZ@>
\bul\<ZSP \$X,\$Y,\0 `zero or set if positive'.
@.ZSP@>
\bul\<ZSOD \$X,\$Y,\0 `zero or set if odd'.
@.ZSOD@>
\bul\<ZSNN \$X,\$Y,\0 `zero or set if nonnegative'.
@.ZSNN@>
\bul\<ZSNZ \$X,\$Y,\0 `zero or set if nonzero'.
@.ZSNZ@>
\bul\<ZSNP \$X,\$Y,\0 `zero or set if nonpositive'.
@.ZSNP@>
\bul\<ZSEV \$X,\$Y,\0 `zero or set if even'.\>
@.ZSEV@>
These instructions are entirely analogous to \.{ZSN}, except
that \$X is set to \$Z or~Z if register~Y is respectively zero, positive,
odd, nonnegative, nonzero, nonpositive, or even; otherwise
\$X is set to zero.

Notice that the two instructions \<CMPU r,s,0 and \<ZSNZ r,s,1 have
the same effect. So do the two instructions \<CSNP r,s,0 and \.{ZSP} \.{r,s,r}.
So do \<AND r,s,1 and \.{ZSOD}~\.{r,s,1}.

@* Branches and jumps.
\MMIX\ ordinarily executes instructions in sequence, proceeding from
an instruction in tetrabyte M$_4[\lambda]$ to the instruction in
M$_4[\lambda+4]$. But there are several ways to interrupt
the normal flow of control, most of which use the Y and Z fields of
an instruction as a combined 16-bit YZ field.
For example, \<BNZ \$3,@@+4000 (branch if nonzero)
is typical: It means that control should skip ahead 1000 instructions
to the command that appears $4000$ bytes after the
\.{BNZ}, if register~3 is not equal to zero.

There are eight branch-forward instructions, corresponding to the
eight conditions in the \.{CS} and \.{ZS} commands that we discussed earlier.
And there are eight similar branch-backward instructions; for example,
\<BOD \$2,@@-4000 (branch if odd) takes control to the
instruction that appears $4000$ bytes {\it before\/}
this \.{BOD} command, if register~2 is odd. The numeric OP-code when branching
backward is one greater than the OP-code when branching
forward; the assembler takes care of this automatically, just as it takes
cares of changing \.{ADD} from 32 to 33 when necessary.

Since branches are relative to the current location, the \MMIX\ assembler
treats branch instructions in a special way.
Suppose a programmer writes `\.{BNZ} \.{\$3,Case5}',
where \.{Case5} is the address of an instruction in location~$l$.
If this instruction appears in location~$\lambda$, the assembler first
computes the displacement $\delta=\lfloor(l-\lambda)/4\rfloor$. Then if
$\delta$ is nonnegative, the quantity~$\delta$
is placed in the YZ field of a \.{BNZ}
command, and it should be less than $2^{16}$; if $\delta$ is negative,
the quantity $2^{16}+\delta$ is placed in the YZ field of a \.{BNZ}
command with OP-code increased by~1,
and $\delta$ should not be less than $-2^{16}$.

The symbol \.{@@} used in our examples of
\.{BNZ} and \.{BOD} above is interpreted by the
assembler as an abbreviation for ``the location of the current
instruction.'' In the following
notes we will define pairs of branch commands by writing, for example,
`\.{BNZ}~\.{\$X,@@+4*YZ[-262144]}'; this stands for a branch-forward
command that
branches to the current location plus four times~YZ, as well as for
a branch-backward command that branches to the current
location plus four times $(\rm YZ-65536)$.

\bull\<BN \$X,@@+4*YZ[-262144] `branch if negative'.
@.BN@>
\bul\<BZ \$X,@@+4*YZ[-262144] `branch if zero'.
@.BZ@>
\bul\<BP \$X,@@+4*YZ[-262144] `branch if positive'.
@.BP@>
\bul\<BOD \$X,@@+4*YZ[-262144] `branch if odd'.
@.BOD@>
\bul\<BNN \$X,@@+4*YZ[-262144] `branch if nonnegative'.
@.BNN@>
\bul\<BNZ \$X,@@+4*YZ[-262144] `branch if nonzero'.
@.BNZ@>
\bul\<BNP \$X,@@+4*YZ[-262144] `branch if nonpositive'.
@.BNP@>
\bul\<BEV \$X,@@+4*YZ[-262144] `branch if even'.\>
@.BEV@>
If register X is respectively negative, zero, positive, odd, nonnegative,
nonzero, nonpositive, or even, and if this instruction appears in memory
location $\lambda$, the next instruction is taken from memory location
$\lambda+4{\rm YZ}$ (branching forward) or $\lambda+4({\rm YZ}-2^{16})$
(branching backward). Thus one can go from location~$\lambda$ to any location
between $\lambda-262{,}144$ and $\lambda+262{,}140$, inclusive.

\smallskip
Sixteen additional branch instructions called {\it probable branches\/}
are also provided. They have exactly the same meaning as ordinary
branch instructions; for example, \<PBOD \$2,@@-4000 and \<BOD \$2,@@-4000 both
go backward 4000 bytes if register~2 is odd. But they differ in running time:
On some implementations of\/ \MMIX,
a branch instruction takes longer when the branch is taken, while a
probable branch takes longer when the branch is {\it not\/} taken.
Thus programmers should use a \.B instruction when they think branching is
relatively unlikely, but they should use \.{PB} when they expect
branching to occur more often than not. Here is a list of the
probable branch commands, for completeness:

\bull\<PBN \$X,@@+4*YZ[-262144] `probable branch if negative'.
@.PBN@>
\bul\<PBZ \$X,@@+4*YZ[-262144] `probable branch if zero'.
@.PBZ@>
\bul\<PBP \$X,@@+4*YZ[-262144] `probable branch if positive'.
@.PBP@>
\bul\<PBOD \$X,@@+4*YZ[-262144] `probable branch if odd'.
@.PBOD@>
\bul\<PBNN \$X,@@+4*YZ[-262144] `probable branch if nonnegative'.
@.PBNN@>
\bul\<PBNZ \$X,@@+4*YZ[-262144] `probable branch if nonzero'.
@.PBNZ@>
\bul\<PBNP \$X,@@+4*YZ[-262144] `probable branch if nonpositive'.
@.PBNP@>
\bul\<PBEV \$X,@@+4*YZ[-262144] `probable branch if even'.
@.PBEV@>

@ Locations that are relative to the current instruction can be
transformed into absolute locations with \.{GETA} commands.

\bull\<GETA \$X,@@+4*YZ[-262144] `get address'.\>
@.GETA@>
The value $\lambda+4{\rm YZ}$ or $\lambda+4({\rm YZ}-2^{16})$ is placed in
register~X\null. (The assembly language conventions of branch instructions
apply; for example, we can write `\.{GETA} \.{\$X,Addr}'.)

@ \MMIX\ also has unconditional jump instructions, which change the
location of the next instruction no matter what.

\bull\<JMP @@+4*XYZ[-67108864] `jump'.\>
@.JMP@>
A \.{JMP} command treats bytes X, Y, and Z as an unsigned 24-bit
integer XYZ. It allows a program to transfer control from location $\lambda$ to any
location between $\lambda-67\?{,}108{,}864$ and $\lambda+67\?{,}108{,}860$
inclusive, using relative addressing as in the \.{B} and \.{PB} commands.

\bull\<GO \$X,\$Y,\0 `go to location'.\>
@.GO@>
\MMIX\ takes its next instruction from location $\rY+\rZ$ or $\rY+\zz$,
and continues from there. Register~X is set equal to $\lambda+4$, the
location of the instruction that would ordinarily have been executed next.
(\.{GO} is similar to a jump, but it is not relative
to the current location. Since \.{GO} has the same format as a load or store
instruction, a loading routine can treat program labels with the same mechanism
that is used to treat references to data.)

An old-fashioned type of subroutine linkage can be implemented by saying
either `\.{GO}~\.{r,subloc,0}' or `\.{GETA}~\.{r,@@+8;}
\.{JMP}~\.{Sub}' to~enter a subroutine,
then `\.{GO}~\.{r,r,0}' to return.
But subroutines are normally entered with the instructions
\.{PUSHJ} or \.{PUSHGO}, described below.

The two least significant bits of the address
in a \.{GO} command are essentially ignored. They will, however, appear in
the value of~$\lambda$ returned by \.{GETA} instructions, and in the
@^bit stuffing@>
return-jump register~rJ after \.{PUSHJ} or \.{PUSHGO} instructions are
performed, and in
@^rJ@>
the where-interrupted register at the time of an interrupt. Therefore they
could be used to send some kind of signal to a subroutine or (less likely)
to an interrupt handler.

@* Multiplication and division.
Now for some instructions that make \MMIX\ work harder.

\bull\<MUL \$X,\$Y,\0 `multiply'.\>
@.MUL@>
The signed product of the number in register Y by either the
number in register~Z or the unsigned byte~Z
replaces the contents of register~X\null. An
integer overflow exception can occur, as with \.{ADD} or \.{SUB}, if the
result is less than $-2^{63}$ or greater than $2^{63}-1$. (Immediate
multiplication by powers of~2 can be done more rapidly with the \.{SL}
instruction.)

\bull\<MULU \$X,\$Y,\0 `multiply unsigned'.\>
@.MULU@>
The lower 64 bits of the
unsigned 128-bit product of register~Y and either
register~Z or~Z are placed in register~X, and the upper 64 bits are
placed in the special {\it himult register\/}~rH\null. (Immediate multiplication
@^rH@>
by powers of~2 can be done more rapidly with the \.{SLU} instruction,
if the upper half is not needed.
Furthermore, an instruction like \<4ADDU \$X,\$Y,\$Y is faster than
\.{MULU} \.{\$X,\$Y,5}.)

\bull\<DIV \$X,\$Y,\0 `divide'.\>
@.DIV@>
The signed quotient of the number in register Y divided
by either the number in register~Z or the unsigned byte~Z
replaces the contents of register~X, and the signed remainder
is placed in the special {\it remainder register\/}~rR\null.
@^rR@>
An integer divide check exception occurs if the divisor is zero; in that
case \$X is set to zero and rR is set to~\$Y\null.
@^divide check exception@>
@^overflow@>
An integer overflow exception occurs if the number $-2^{63}$ is divided
by~$-1$; otherwise integer overflow is impossible. The quotient of
$y$ divided by~$z$ is defined to be $\lfloor y/z\rfloor$, and the remainder
is defined to be $y-\lfloor y/z\rfloor z$ (also written $y\bmod z$).
Thus, the remainder is either
zero or has the sign of the divisor. Dividing by $z=2^t$ gives
exactly the same quotient as shifting right~$t$ via the \.{SR} command, and
exactly the same remainder as anding with $z-1$ via the \.{AND} command.
Division of a positive 63-bit number by a positive constant can be accomplished
more quickly by computing the upper half of a suitable unsigned product and
shifting it right appropriately.

\bull\<DIVU \$X,\$Y,\0 `divide unsigned'.\>
@.DIVU@>
The unsigned 128-bit number obtained by prefixing the special {\it dividend
register}~rD to the contents of register~Y is divided either by the
@^rD@>
unsigned number in register~Z or by the unsigned byte~Z, and the quotient is placed
in register~X\null. The remainder is placed in the remainder
register~rR\null.
However, if rD is greater than or equal to
the divisor (and in particular if the divisor is zero),
then \$X is set to~rD and rR is set to~\$Y\null.
(Unsigned arithmetic never signals an exceptional condition, even
when dividing by zero.)
If rD is zero, unsigned division by $z=2^t$ gives exactly the same quotient as
shifting right~$t$ via the \.{SRU} command, and
exactly the same remainder as anding with $z-1$ via the \.{AND} command.
Section 4.3.1 of {\sl Seminumerical Algorithms\/}
explains how to use unsigned division to obtain the quotient and remainder
of extremely large numbers.

@* Floating point computations.
Floating point arithmetic conforming to the famous IEEE/ANSI
Standard~754 is provided for arbitrary 64-bit numbers. The IEEE standard
refers to such numbers as ``double format'' quantities, but \MMIX\
calls them simply floating point numbers because 64-bit quantities are
the~norm.
@^floating point arithmetic@>
@^IEEE/ANSI Standard 754@>
@^subnormal numbers@>
@^normal numbers@>
@^NaN@>
@^overflow@>
@^underflow@>
@^invalid exception@>
@^inexact exception@>
@^signaling NaN@>
@^quiet NaN@>
@^infinity@>
@^rounding modes@>

A positive floating point number has 53 bits of precision and can range
from approximately $10^{-308}$ to $10^{308}$. ``Subnormal numbers''
between $10^{-324}$ and $10^{-308}$ can also be represented, but with fewer
bits of precision.
Floating point numbers can be
infinite, and they satisfy such identities as $1.0/\infty=+0.0$, $-2.8\times\infty
=-\infty$. Floating
point quantities can also be ``Not-a-Numbers'' or NaNs, which are
further classified into signaling NaNs and quiet NaNs.

Five kinds
of exceptions can occur during floating point computations, and they
each have code letters: Floating
overflow~(O) or underflow~(U); floating divide by zero~(Z);
floating inexact~(X); and floating invalid~(I).
For example, the multiplication of sufficiently small integers causes
no exceptions, and the division of 91.0 by~13.0 is also exception-free,
but the division 1.0/3.0 is inexact. The multiplication of extremely
large or extremely small floating point numbers is inexact and it
also causes overflow or underflow.
Invalid results occur when taking the square root of a negative
number; mathematicians can remember the I exception
by relating it to the square root of $-1.0$.
Invalid results also occur when trying to convert infinity
or a quiet NaN to a fixed-point
integer, or when any signaling NaN is encountered, or when
mathematically undefined operations like $\infty-\infty$ or $0/0$ are
requested.
(Programmers can be sure that they have not erroneously
used uninitialized floating point data if they initialize all their variables
to signaling NaN values.)

Four different rounding modes for inexact results are available:
round to nearest (and to even in case of ties);
round off (toward zero); round up (toward $+\infty)$;
or round down (toward $-\infty$). \MMIX\
has a special {\it arithmetic status register\/}~rA that specifies the
@^rA@>
current rounding mode and the user's current preferences for exception
handling.

\def\NaN{{\rm NaN}}
IEEE standard arithmetic provides an excellent foundation for scientific
calculations, and it will be thoroughly explained in the fourth
edition of {\sl Seminumerical Algorithms}, Section 4.2.
For our present purposes, we need not study all the details; but
we do need to specify \MMIX's behavior with respect to several
things that are not completely defined by the standard.
For example, the IEEE standard does not fully define the
result of operations with NaNs.

When an octabyte represents a floating point number
in \MMIX's registers, the leftmost bit is the sign; then come 11 bits for an
exponent~$e$; and the remaining 52 bits are the fraction part~$f$.
We regard $e$ as an integer between 0 and $(11111111111)_2=2047$, and we regard $f$ as
a fraction between 0 and $(.111\ldots1)_2=1-2^{-52}$.
Each octabyte has the following
significance:
$$\vbox{\halign{\hfil$\pm#$,\quad if &#\hfil\cr
0.0&$e=f=0$ (zero);\cr
2^{-1022}f&$e=0$ and $f>0$ (subnormal);\cr
2^{\mkern1mu e-1023}(1+f)&$0<e<2047$ (normal);\cr
\infty&$e=2047$ and $f=0$ (infinite);\cr
\NaN(f)&$e=2047$ and $0<f<1/2$ (signaling NaN);\cr
\NaN(f)&$e=2047$ and $f\ge1/2$ (quiet NaN).\cr}}$$
Notice that $+0.0$ is distinguished from $-0.0$; this fact is
important for interval arithmetic.
@^minus zero@>

Exercise: What 64 bits represent the floating point number 1.0?
Answer: We want $e=1023$ and $f=0$, so the answer is \Hex{3ff0000000000000}.

Exercise: What is the largest finite floating point number?
Answer: We want $e=2046$ and $f=1-2^{-52}$, so the answer is
$\Hex{7fefffffffffffff}=2^{1024}-2^{971}$.

@ The seven IEEE floating point arithmetic operations (addition, subtraction,
multiplication, division, remainder, square root, and nearest-integer)
all share common features, called the {\it standard floating point
conventions\/} in the discussion below:
@^standard floating point conventions@>
@^overflow@>
@^underflow@>
The operation is performed on floating point numbers found in two registers,
\$Y and~\$Z, except that square root and integerization
involve only one operand.
If neither input operand is a NaN, we first determine the exact result,
then round it using the current rounding mode
found in special register~rA\null. Infinite results are exact and need
no rounding. A floating overflow exception occurs if the rounded
result is finite but needs an exponent greater than 2046.
A floating underflow exception occurs if the rounded result needs an exponent
less than~1 and either (i)~the unrounded result cannot be represented exactly
@^rA@>
as a subnormal number or (ii)~the ``floating underflow trip'' is enabled in~rA\null.
(Trips are discussed below.)
NaNs are treated specially as follows: If either \$Y or~\$Z is a signaling NaN,
an invalid exception occurs and the NaN is quieted by adding 1/2 to its
fraction part. Then if \$Z is a quiet NaN, the result is set
to \$Z; otherwise if \$Y is a quiet NaN, the result is set to \$Y\null.
(Registers \$Y and \$Z do not actually change.)

\bull\<FADD \$X,\$Y,\$Z `floating add'.@>\>
@.FADD@>
The floating point sum $\rY+\rZ$ is computed by the
standard floating point conventions just described,
and placed in register~X\null.
An invalid exception occurs if the sum is $(+\infty)+(-\infty)$ or
$(-\infty)+(+\infty)$; in that case the result is $\NaN(1/2)$ with the sign
of~\$Z\null. If the sum is exactly zero and the current mode is
not rounding-down, the result is $+0.0$ except that $(-0.0)+(-0.0)=-0.0$. If the
@^minus zero@>
sum is exactly zero and the current mode is rounding-down, the result
is $-0.0$ except that $(+0.0)+(+0.0)=+0.0$.
These rules for signed zeros turn out to be useful when doing interval
arithmetic: If the lower bound of an interval is $+0.0$ or if the
upper bound is $-0.0$, the interval does not contain zero, so the
numbers in the interval have a known sign.

Floating point underflow cannot occur unless the U-trip has been enabled,
because any underflowing result of floating point
addition can be represented exactly as a subnormal number.

Silly but instructive exercise: Find all pairs of numbers $(\rY,\rZ)$ such
that the commands \<FADD \$X,\$Y,\$Z and \<ADDU \$X,\$Y,\$Z both produce
the same result in~\$X
(although \.{FADD} may cause floating exceptions).
Answer: Of course \$Y or \$Z could be zero, if the other one is not a signaling
NaN. Or one could be signaling and the other \Hex{0008000000000000}.
Other possibilities
occur when they are both positive and less than
\Hex{0010000000000001}; or when one operand is \Hex{0000000000000001}
and the other is an odd number between \Hex{0020000000000001} and
\Hex{002ffffffffffffd} inclusive (rounding to nearest).
And still more surprising possibilities exist, such as
\Hex{7f6001b4c67bc809}\thinspace+\thinspace\Hex{ff5ffb6a4534a3f7}.
All eight families of solutions will be revealed some day in the fourth edition
of {\sl Seminumerical Algorithms}.

\bull\<FSUB \$X,\$Y,\$Z `floating subtract'.\>
@.FSUB@>
This instruction is equivalent to \.{FADD}, but with the sign of~\$Z negated
unless \$Z is a~NaN.

\bull\<FMUL \$X,\$Y,\$Z `floating multiply'.\>
@.FMUL@>
The floating point product $\rY\times\rZ$ is computed by
the standard floating point conventions, and placed in register~X\null.
An invalid exception occurs if
the product is $(\pm0.0)\times(\pm\infty)$ or $(\pm\infty)\times(\pm0.0)$;
in that case the result is $\pm\NaN(1/2)$. No exception occurs for the
product $(\pm\infty)\times(\pm\infty)$. If neither \$Y nor~\$Z is a NaN,
the sign of the result is the product of the signs of \$Y and~\$Z\null.

\bull\<FDIV \$X,\$Y,\$Z `floating divide'.\>
@.FDIV@>
The floating point quotient $\rY\?/\rZ$ is computed by
the standard floating point conventions, and placed in \$X\null.
@^standard floating point conventions@>
A floating divide by zero exception occurs if the
quotient is $(\hbox{normal or subnormal})/(\pm0.0)$. An invalid exception occurs if
the quotient is $(\pm0.0)/(\pm0.0)$ or $(\pm\infty)/(\pm\infty)$; in that case the
result is $\pm\NaN(1/2)$. No exception occurs for the
quotient $(\pm\infty)/(\pm0.0)$. If neither \$Y nor~\$Z is a NaN,
the sign of the result is the product of the signs of \$Y and~\$Z\null.

If a floating point number in register X is known to have an exponent between
2 and~2046, the instruction \<INCH \$X,\char`\#fff0 will divide it by~2.0.

\bull\<FREM \$X,\$Y,\$Z `floating remainder'.\>
@.FREM@>
The  floating point remainder $\rY\,{\rm rem}\,\rZ$ is computed by
the standard floating point conventions, and placed in register~X\null.
(The IEEE standard defines the remainder to be $\rY-n\times\rZ$,
where $n$ is the nearest integer to $\rY/\rZ$, and $n$ is an even
integer in case of ties. This is not the same as the remainder
$\rY\bmod\rZ$ computed by \.{DIV} or \.{DIVU}.)
A zero remainder has the sign of~\$Y\null.
An invalid exception occurs if \$Y is infinite and/or \$Z is zero; in
that case the result is $\NaN(1/2)$ with the sign of~\$Y\null.

\bull\<FSQRT \$X,\$Z `floating square root'.\>
@.FSQRT@>
The floating point square root $\sqrt\rZ$ is computed by the
standard floating point conventions, and placed in register~X\null.  An
invalid exception occurs if \$Z is a negative number (either infinite, normal,
or subnormal); in that case the result is $-\NaN(1/2)$. No exception occurs
when taking the square root of $-0.0$ or $+\infty$. In all cases the sign of
the result is the sign of~\$Z\null.

The Y field of \.{FSQRT} can be used to specify a
special rounding mode, as explained below.
             
\bull\<FINT \$X,\$Z `floating integer'.\>
@.FINT@>
The floating point number in register~Z is rounded (if
necessary) to a floating point integer, using the current
rounding mode, and placed in register~X\null. Infinite values and quiet NaNs
are not changed; signaling NaNs are treated as in the standard conventions.
Floating point overflow and underflow exceptions cannot occur. 

The Y field of \.{FINT} can be used to specify a
special rounding mode, as explained below.
             
@ Besides doing arithmetic, we need to compare floating point numbers
with each other, taking proper account of NaNs and the fact that $-0.0$
should be considered equal to $+0.0$. The following instructions are
analogous to the comparison operators \.{CMP} and \.{CMPU} that we
have used for integers.
@^minus zero@>

\bull\<FCMP \$X,\$Y,\$Z `floating compare'.\>
@.FCMP@>
Register X is set to $-1$ if $\rY<\rZ$ according to the conventions of
floating point arithmetic, or to~1 if $\rY>\rZ$ according to those
conventions. Otherwise it is set to~0. An invalid exception
occurs if either \$Y or \$Z is a NaN; in such cases the result is zero.

\bull\<FEQL \$X,\$Y,\$Z `floating equal to'.\>
@.FEQL@>
Register X is set to 1 if $\rY=\rZ$ according to the conventions of
floating point arithmetic. Otherwise it is set to~0. The result is zero if
either \$Y or \$Z is a NaN, even if a NaN is being compared with itself.
However, no invalid exception occurs, not even when \$Y or \$Z is a signaling
NaN\null. (Perhaps \MMIX\ differs slightly from the IEEE standard in this
regard, but programmers sometimes need to look at signaling NaNs without
encountering side effects.
Programmers who insist on raising
an invalid exception whenever a signaling NaN is compared for floating equality
should issue the instructions \<FSUB \$X,\$Y,\$Y; \<FSUB \$X,\$Z,\$Z just before saying
\.{FEQL}~\.{\$X,\$Y,\$Z}.)

Suppose $w$, $x$, $y$, and $z$ are unsigned 64-bit integers with
$w<x<2^{63}\le y<z$. Thus, the leftmost bits of $w$ and~$x$ are~0,
while the leftmost bits of $y$ and~$z$ are~1.
Then we have $w<x<y<z$ when these numbers are considered
as unsigned integers, but $y<z<w<x$ when they are considered as signed
integers, because $y$ and~$z$ are negative. Furthermore, we have
$z<y\le w<x$ when these same 64-bit quantities are considered to be
floating point numbers, assuming that no NaNs are present,
because the leftmost bit of a floating point
number represents its sign and the remaining bits represent its magnitude.
The case $y=w$ occurs in floating point comparison
 if and only if $y$ is the representation of $-0.0$
and $w$ is the representation of $+0.0$.

\bull\<FUN \$X,\$Y,\$Z `floating unordered'.\>
@.FUN@>
Register X is set to 1 if \$Y and \$Z are unordered according to the conventions
of floating point arithmetic (namely, if either one is a NaN); otherwise
register~X is set to~0. No invalid exception occurs, not even when \$Y or \$Z is
a signaling NaN\null.

\smallskip
The IEEE standard discusses 26 different possible
relations on floating point numbers;
\MMIX\ implements 14 of them with single instructions, followed by
a branch (or by a \.{ZS} to make a ``pure'' 0~or~1 result); all 26
can be evaluated with a sequence of at most four \MMIX\ commands
and a subsequent branch. The
hardest case to handle is `?$>=$' (unordered or greater or equal,
to be computed without exceptions), for which the following
sequence makes $\rX\ge0$ if and only if $\rY\mathrel?>=\rZ$:
$$\vbox{\halign{&\tt#\hfil\ \cr
  &FUN &\$255,\$Y,\$Z\cr
  &BP  &\$255,1F&\% skip ahead if unordered\cr
  &FCMP&\$X,\$Y,\$Z&\% \$X=[\$Y>\$Z]-[\$Y<\$Z]; no exceptions will arise\cr
1H&CSNZ &\$X,\$255,1&\% \$X=1 if unordered\cr
}}$$

@ Exercise: Suppose \MMIX\ had no \.{FINT} instruction. Explain how to
@.FINT@>
obtain the equivalent of \<FINT \$X,\$Z using other instructions. Your
program should do the proper thing with respect to NaNs and exceptions.
(For example, it should cause an invalid exception if and only if \$Z is
a signaling NaN; it should cause an inexact exception only if \$Z needs
to be rounded to another value.)
@^emulation@>

Answer: (The assembler prefixes hexadecimal constants by \.\#.)
$$\vbox{\halign{&\tt#\hfil\ \cr
  &SETH &\$0,\char`\#4330&\% \$0=2\char`\^52\cr
  &SET &\$1,\$Z&\% \$1=\$Z\cr
  &ANDNH &\$1,\char`\#8000&\% \$1=abs(\$Z)\cr
  &ANDN &\$2,\$Z,\$1&\% \$2=signbit(\$Z)\cr
  &FUN &\$3,\$Z,\$Z&\% \$3=[\$Z is a NaN]\cr
  &BNZ &\$3,1F&\% skip ahead if \$Z is a NaN\cr
  &FCMP &\$3,\$1,\$0&\% \$3=[abs(\$Z)>2\char`\^52]-[abs(\$Z)<2\char`\^52]\cr
  &CSNN &\$0,\$3,0&\% set \$0=0 if \$3>=0\cr
  &OR  &\$0,\$2,\$0&\% attach sign of \$Z to \$0\cr
1H\ &FADD &\$1,\$Z,\$0&\% \$1=\$Z+\$0\cr
  &FSUB &\$1,\$1,\$0&\% \$X=\$1-\$0\cr
  &OR   &\$X,\$1,\$2&\% make sure minus zero isn't lost\cr}}$$
This program handles most cases of interest by adding and subtracting
$\pm2^{52}$ using floating point arithmetic.
It would be incorrect to do this in all cases;
for example, such addition/subtraction might fail to give the correct
answer when \$Z is a small negative
quantity (if rounding toward zero), or when \$Z is a number like
$2^{105}+2^{53}$ (if rounding to nearest).

@ \MMIX\ goes beyond the IEEE standard to define additional relations
between floating point numbers, as suggested by the theory in
Section 4.2.2 of {\sl Seminumerical Algorithms}. Given a nonnegative
number~$\epsilon$, each normal floating point number $u=(f,e)$ has
a {\it neighborhood\/}
$$N_\epsilon(u)=\{x\,\mid\,\vert x-u\vert\le 2^{e-1022}\epsilon\};$$
we also define $N_\epsilon(0)=\{0\}$,
$N_\epsilon(u)=\{x\mid\vert x-u\vert\le2^{-1021}\epsilon\}$ if $u$ is
subnormal; $N_\epsilon(\pm\infty)=\{\pm\infty\}$ if $\epsilon<1$,
$N_\epsilon(\pm\infty)=\{$everything except $\mp\infty\}$ if $1\le\epsilon<2$,
$N_\epsilon(\pm\infty)=\{$everything$\}$ if $\epsilon\ge2$. Then we write
$$\vbox{\halign{$u#v\ (\epsilon)$, &#\hfil\cr
\prec&if $u<N_\epsilon(v)$ and $N_\epsilon(u)<v$;\cr
\sim&if $u\in N_\epsilon(v)$ or $v\in N_\epsilon(u)$;\cr
\approx&if $u\in N_\epsilon(v)$ and $v\in N_\epsilon(u)$;\cr
\succ&if $u>N_\epsilon(v)$ and $N_\epsilon(u)>v$.\cr}}$$

\def\rE{{\rm rE}}
\bull\<FCMPE \$X,\$Y,\$Z `floating compare (with respect to epsilon)'.\>
@.FCMPE@>
Register X is set to $-1$ if $\rY\prec\rZ\ \ (\rE)$ according to the
conventions of {\sl Seminumerical Algorithms} as stated above; it is set to~1
if $\rY\succ\rZ\ \ (\rE)$ according to those conventions; otherwise
it is set to~0. Here rE is a floating point number in
@^rE@>
the special {\it epsilon register\/}, which is used only by the
floating point comparison operations \.{FCMPE}, \.{FEQLE}, and \.{FUNE}.
An invalid exception occurs, and the result is zero,
if any of \$Y, \$Z, or rE are NaN, or if rE is negative.
If no such exception occurs, exactly one of the three conditions
$\rY\prec\rZ$, $\rY\sim\rZ$, $\rY\succ\rZ$ holds with respect to~rE.

\bull\<FEQLE \$X,\$Y,\$Z `floating equivalent (with respect to epsilon)'.\>
@.FEQLE@>
Register X is set to 1 if $\rY\approx\rZ\ \ (\rE)$ according to the
conventions of {\sl Seminumerical Algorithms\/} as stated above; otherwise
it is set to~0.
An invalid exception occurs, and the result is zero,
if any of \$Y, \$Z, or rE are NaN, or if rE is negative.
Notice that the relation $\rY\approx\rZ$ computed by \.{FEQLE} is
stronger than the relation $\rY\sim\rZ$ computed by \.{FCMPE}.

\bull\<FUNE \$X,\$Y,\$Z `floating unordered (with respect to epsilon)'.\>
@.FUNE@>
Register X is set to 1 if
\$Y, \$Z, or~rE are exceptional as discussed for \.{FCMPE} and \.{FEQLE};
otherwise it is set to~0. No exceptions occur, even if \$Y, \$Z, or~rE is
a signaling NaN.

\smallskip\noindent
Exercise: What floating point numbers does \.{FCMPE} regard
as $\sim0.0$ with respect to
$\epsilon=1/2$, when no exceptions arise? \ Answer: Zero, subnormal
numbers, and normal numbers with $f=0$.
(The numbers similar to zero with respect to~$\epsilon$ are zero,
subnormal numbers with $f\le2\epsilon$, normal numbers with $f\le2\epsilon-1$,
and $\pm\infty$ if $\epsilon>=1$.)\looseness=-1

@ The IEEE standard also defines 32-bit floating point quantities, which
it calls ``single format'' numbers. \MMIX\ calls them {\it short floats},
@^short float@>
and converts between 32-bit and 64-bit forms when such numbers are
loaded from memory or stored into memory. A short float consists of a sign
bit followed by an 8-bit exponent and a 23-bit fraction. After it has
been loaded into one of\/ \MMIX's registers, its 52-bit fraction part
will have 29 trailing zero bits, and its exponent~$e$ will be one of the
256 values 0, $(01110000001)_2=897$, $(01110000010)_2=898$, \dots,
$(10001111110)_2=1150$, or~2047, unless it was subnormal; a subnormal
short float loads into a normal number with $874\le e\le896$.

\bull\<LDSF \$X,\$Y,\0 `load short float'.\>
@.LDSF@>
Register~X is set to the 64-bit floating point number corresponding
to the 32-bit floating point number represented by
$\mm_4[\rY+\rZ]$ or $\mm_4[\rY+\zz]$.
No arithmetic exceptions occur, not even if a signaling NaN is loaded.

\bull\<STSF \$X,\$Y,\0 `store short float'.\>
@.STSF@>
The value obtained by rounding register~X to a 32-bit floating
point number is placed in $\mm_4[\rY+\rZ]$ or $\mm_4[\rY+\zz]$.
Rounding is done with the current rounding mode, in a manner
exactly analogous to the standard conventions for rounding 64-bit results,
except that the precision and exponent range are limited. In particular,
floating overflow, underflow, and inexact exceptions might occur;
a signaling NaN will trigger an invalid exception and it will become quiet.
The fraction part of a NaN is truncated if necessary to a multiple of
$2^{-23}$, by ignoring the least significant 29 bits.

If we load any two short floats and operate on them once with either \.{FADD},
\.{FSUB}, \.{FMUL}, \.{FDIV}, \.{FREM}, \.{FSQRT}, or \.{FINT}, and if we then
store the result as a short float, we obtain the results required by
the IEEE standard for single format arithmetic, because
the double format can be shown to have enough precision to avoid any
problems of ``double rounding.'' But programmers are usually better
off sticking
to 64-bit arithmetic unless they have a strong reason to emulate the
precise behavior of a 32-bit computer; 32 bits do not offer
much precision.

@ Of course we need to be able to go back and forth between integers and
floating point values.

\bull\<FIX \$X,\$Z `convert floating to fixed'.\>
@.FIX@>
The floating point number in register~Z is converted to an integer
as with the \.{FINT} instruction, and the resulting integer (mod~$2^{64}$)
is placed in register~X\null.
An invalid exception occurs if \$Z is infinite
or a NaN; in that case \$X is simply set equal to~\$Z\null. A float-to-fix
exception occurs if the result is less than
@^float-to-fix exception@>
@^short float@>
$-2^{63}$ or greater than $2^{63}-1$.

\bull\<FIXU \$X,\$Z `convert floating to fixed unsigned'.\>
@.FIXU@>
This instruction is identical to \.{FIX} except that no float-to-fix
exception occurs.

\bull\<FLOT \$X,\0 `convert fixed to floating'.\>
@.FLOT@>
The integer in \$Z or the immediate constant~Z is
converted to the nearest floating point value (using the current rounding
mode) and placed in register~X\null. A floating inexact exception
occurs if rounding is necessary.

\bull\<FLOTU \$X,\0 `convert fixed to floating unsigned'.\>
@.FLOTU@>
\.{FLOTU} is like \.{FLOT}, but \$Z is treated as an unsigned integer.

\bull\<SFLOT \$X,\0 `convert fixed to short float';
\<SFLOTU \$X,\0 `convert fixed to short float unsigned'.\>
@.SFLOT@>
@.SFLOTU@>
The \.{SFLOT} instructions are like the \.{FLOT} instructions, except that
they round to a floating point number whose fraction part is a multiple
of $2^{-23}$. (Thus, the resulting value will not be changed by a ``store
short float'' instruction.) Such conversions appear in \MMIX's repertoire only
to establish complete conformance with the IEEE standard; a programmer
needs them only when emulating a 32-bit machine.
@^emulation@>

@ Since the variants of \.{FIX} and \.{FLOT} involve only one input operand (\$Z
or~Z), their Y~field is normally zero. A programmer can, however, force the
mode of rounding used with these commands by setting
$$\vbox{\halign{$\yy=#$,\quad &\.{ROUND\_#}\hfil&\quad(#);\hfil\cr
1&OFF&none\cr
2&UP&away from zero\cr
3&DOWN&toward zero\cr
4&NEAR&to closest\cr}}$$
for example, the instruction \<FLOTU \$X,ROUND\_OFF,\$Z will set the
exponent~$e$ of register~X to $1086-l$ if \$Z is a nonzero quantity with
$l$~leading zero bits. Thus we can count leading zeros by continuing
with \.{SETL}~\.{\$0,1086}; \.{SR}~\.{\$X,\$X,52}; \.{SUB}~\.{\$X,\$0,\$X};
\.{CSZ}~\.{\$X,\$Z,64}.
@^counting leading zeros@>
@.FLOT@>
@.FLOTU@>
@.SFLOT@>
@.SFLOTU@>
@.FIX@>
@.FIXU@>
@:ROUND_OFF}\.{ROUND\_OFF@>
@:ROUND_UP}\.{ROUND\_UP@>
@:ROUND_DOWN}\.{ROUND\_DOWN@>
@:ROUND_NEAR}\.{ROUND\_NEAR@>

The Y field can also be used in the same way
to specify any desired rounding mode in the other
floating point instructions that have only a single operand, namely
\.{FSQRT} and~\.{FINT}.
@.FSQRT@>
@.FINT@>
An illegal instruction interrupt occurs if Y exceeds~4 in any of these
commands.
@^illegal instructions@>

@* Subroutine linkage.
\MMIX\ has several special operations designed to facilitate the process of
calling and implementing subroutines. The key notion is the idea of a
hardware-supported {\it register stack}, which can coexist with a
software-supported stack of variables that are not maintained in registers.
From a programmer's standpoint, \MMIX\ maintains a potentially unbounded list
$S[0]$, $S[1]$, \dots,~$S[\tau-1]$ of octabytes holding the contents
of registers that are temporarily inaccessible; initially $\tau=0$.
When a subroutine is entered, registers can be ``pushed'' on to the end of
this list, increasing~$\tau$; when the subroutine has finished its
execution, the registers are ``popped'' off again and $\tau$~decreases.

Our discussion so far has treated all 256 registers \$0, \$1, \dots,~\$255 as if
they were alike. But in fact, \MMIX\ maintains two internal one-byte counters
$L$ and~$G$, where $0\le\ll\le\gg<256$, with the property that
$$\vbox{\halign{#\hfil\cr
registers 0, 1, \dots, $\ll-1$ are ``local'';\cr
registers @!|L|, $\ll+1$, \dots, $\gg-1$ are ``marginal'';\cr
registers @!|G|, $\gg+1$, \dots, 255 are ``global.''\cr}}$$
A marginal register is zero when its value is read.
@^illegal instructions@>
@^rG@>
@^rL@>
@^local registers@>
@^marginal registers@>
@^global registers@>
@^register stack@>

The $G$ counter is normally set to a fixed value once and for all when a program
is loaded, thereby defining the number of program variables that will live
entirely in registers rather than in memory during the course of execution.
A programmer may, however, change~$G$ dynamically using the \.{PUT}
instruction described below.

The $L$ counter starts at 0. If an instruction places a value into a register
that is currently marginal, namely a register $x$ such that
$\ll\le x<\gg$, the value of~$L$ will increase to $x+1$, and any
newly local registers will be zero. For example, if $\ll=10$ and
$\gg=200$, the instruction \<ADD \$5,\$15,1 would simply set \$5 to~1. But the
instruction \<ADD \$15,\$5,\$200 would set \$10, \$11, \dots,~\$14 to zero,
\$15 to $\$5+\$200$, and $L$~to~16. (The process of clearing registers and
increasing~$L$ might take quite a few machine cycles in the worst case. We will
see later that \MMIX\ is able to take care of any high-priority interrupts
that might occur during this time.)

\bull\<PUSHJ \$X,@@+4*YZ[-262144] `push registers and jump'.
\bul\<PUSHGO \$X,\$Y,\0 `push registers and go'.\>
@.PUSHGO@>
@.PUSHJ@>
Suppose first that $\xx<\ll$.
Register~X is set equal to the number~X, then
registers 0, 1, \dots,~X are pushed onto the register stack as
described below.
If this instruction is in
location $\lambda$, the value $\lambda+4$ is placed into the special {\it
return-jump register\/}~rJ\null. Then control jumps to instruction
@^rJ@>
$\lambda+4\rm YZ$ or $\lambda+4\rm YZ-262144$ or
$\rY+\rZ$ or $\rY+\zz$, as in a 
\.{JMP} or \.{GO} command.

Pushing the first $\xx+1$ registers onto the stack means essentially that we
set $S[\tau]\gets\$0$, $S[\tau+1]\gets\$1$, \dots, $S[\tau+\xx]\gets\$\xx$,
$\tau\gets\tau+\xx+1$, $\$0\gets\$(\xx+1)$, \dots,
$\$(\ll-\xx-2)\gets\$(\ll-1)$, $\ll\gets\ll-\xx-1$. For example, if
$\xx=1$ and $\ll=5$, the current contents of \$0 and the number~1 are
placed on the register stack, where they will be temporarily inaccessible.
Then control jumps to a subroutine with $L$ reduced to~3; the registers that we
had been calling \$2, \$3, and \$4 appear as \$0, \$1, and \$2 to the subroutine.

If $\ll\le\xx<\gg$, the value of $\ll$ increases to $\xx+1$ as described
above; then the rules for $\xx<\ll$ apply.

If $\xx\ge\gg$ the actions are similar, except that {\it all\/} of the local
registers \$0, \dots,~$\$(\ll-1)$ are placed on the register stack
followed by the number~$L$, and $L$~is reset to zero. In particular, the
instruction \<PUSHGO \$255,\$Y,\$Z pushes all the local registers
onto the stack and sets $L$ to zero, regardless of the previous value of~$L$.

We will see later that \MMIX\ is able to achieve the effect of pushing and
renaming local registers without actually doing very much work at all.

\bull\<POP X,YZ `pop registers and return from subroutine'.\>
@.POP@>
This command preserves X of the current local registers,
undoes the effect of the most recent \.{PUSHJ} or \.{PUSHGO}, and jumps
to the instruction in $\mm_4[{\rm4YZ+rJ}]$. If $\xx>0$, the value of
$\$(\xx-1)$ goes into the ``hole'' position where \.{PUSHJ} or
\.{PUSHGO} stored the number of registers previously pushed.

The formal details of \.{POP} are slightly complicated, but we will see that
they make sense: If $\xx>\ll$, we first replace X by $\ll+1$. Then we
set $x\gets S[\tau-1]\bmod 256$; this is the effective value of the X~field
in the push instruction that is being undone. Stack position $S[\tau-1]$ is
now set to $\$(\xx-1)$ if $0<\xx\le L$, otherwise it is set to zero.
Then we essentially set
$\ll\gets\min(x+\xx,\gg)$, $\$(\ll-1)\gets\$(\ll-x-2)$, \dots,
$\$(x+1)\gets\$0$, $\$x\gets S[\tau-1]$, \dots,
$\$0\gets S[\tau-x-1]$, $\tau\gets\tau-x-1$. The operating system should
@^operating system@>
arrange things so that a memory-protection
interrupt will occur if a program does more pops than pushes.
(If $x>\gg$, these formulas don't make sense as written; we actually
set $\$j\gets S[\tau-x-1+j]$ for $\ll>j\ge0$ in that rare case.)

Suppose, for example, that a subroutine has three input parameters
$(\$0,\$1,\$2)$ and produces two outputs $(\$0,\$1)$. If the subroutine does
not call any other subroutines, it can simply end with \.{POP} \.{2,0},
because rJ will contain the return address. Otherwise it should begin by
saving rJ, for example with the instruction \<GET \$4,rJ if it will be
using local registers \$0 through~\$3, and it should use \<PUSHJ \$5 or
\<PUSHGO \$5 when
calling sub-subroutines; finally it should \<PUT rJ,\$4 before
saying \.{POP}~\.{2,0}. To call the subroutine from another routine that
has, say, 6~local registers, we would put the input arguments into \$7, \$8,
and~\$9, then issue the command \.{PUSHGO} \.{\$6,base,Subr};
in due time the outputs of the subroutine will appear in \$7 and~\$6.

Notice that the push and pop commands make use of a one-place ``hole'' in the
register stack, between the registers that are pushed down and the registers
that remain local. (The hole is position \$6 in the example just considered.)
\MMIX\ needs this hole position to remember the number of
registers that are pushed down.
A subroutine with no outputs ends with \<POP 0,0 and the hole disappears
(becomes marginal). A subroutine with one output~\$0 ends with \<POP 1,0 and
the hole gets the former value of~\$0. A subroutine with two outputs
$(\$0,\$1)$ ends with \<POP 2,0 and the hole gets the former value of~\$1; in
this case, therefore, the relative order of the two outputs has been switched
on the register stack. If a subroutine has, say, five outputs
$(\$0,\ldots,\$4)$, it ends with \<POP 5,0 and \$4~goes into the hole position,
where it is followed by $(\$0,\$1,\$2,\$3)$.
\MMIX\ makes this curious permutation in the case of multiple outputs because
the hole is most easily plugged by moving one value down (namely~\$4) instead
of by sliding each of five values down in the stack.

These conventions for parameter passing are admittedly a bit confusing in the
general case, and I~suppose people who use them extensively might someday find
themselves talking about ``the infamous \MMIX\ register shuffle.'' However,
there is good use for subroutines that convert
a sequence of register contents like $(x,a,b,c)$ into $(f,a,b,c)$ where
$f$ is a function of $a$, $b$, and $c$ but not~$x$. Moreover,
\.{PUSHGO} and \.{POP} can be implemented with great efficiency,
and subroutine linkage tends to be a significant bottleneck when
other conventions are used.

Information about a subroutine's calling conventions needs to be communicated
to a debugger. That can readily be done at the same time as we inform the
debugger about the symbolic names of addresses in memory.

A subroutine that uses 50 local registers will not function properly if it is
called by a program that sets $G$ less than~50. \MMIX\ does not allow the
value of~$G$ to become less than~32. Therefore any subroutine that avoids
global registers and uses at most~32 local registers
can be sure to work properly regardless of the current value of~$G$.

The rules stated above imply that a \.{PUSHJ} or
\.{PUSHGO} instruction with $\xx=255$ pushes all of the currently defined
local registers onto the stack and sets $L$ to~zero.
This makes $G$ local registers available for use by the subroutine
jumped~to. If that subroutine later returns with \.{POP} \.{0,0}, the former
value of~$L$ and the former contents of \$0, \dots,~$\$(\ll-1)$ will be
restored (assuming that $G$ doesn't decrease).

A \.{POP} instruction with $\xx=255$
preserves all the local registers as outputs of
the subroutine (provided that the total doesn't exceed~$G$ after popping),
and puts zero into the hole (unless $L=G=255$). The best policy, however, is
almost always to use \.{POP} with a small value of~X, and in general to keep
the value of~$L$ as small as
possible by decreasing it when registers are no longer active.
A smaller value of~$L$ means that \MMIX\ can change context more
easily when switching from one process to another.

@* System considerations.
High-performance implementations of\/ \MMIX\ gain speed by keeping {\it
caches\/} of instructions and data that are likely to be needed as computation
@^caches@>
proceeds. [See M.~V. Wilkes, {\sl IEEE Transactions\/ \bf EC-14} (1965),
270--271; J.~S. Liptay, {\sl IBM System J. \bf7} (1968), 15--21.]
@^Wilkes, Maurice Vincent@>
@^Liptay, John S.@>
Careful programmers can make the computer run even faster by giving
hints about how to maintain such caches.

\bull\<LDUNC \$X,\$Y,\0 `load octa uncached'.\>
@.LDUNC@>
These instructions, which have the same meaning as \.{LDO}, also
inform the computer that the loaded octabyte (and its neighbors in a cache
block) will probably not be read or written in the near future.

\bull\<STUNC \$X,\$Y,\0 `store octa uncached'.\>
@.STUNC@>
These instructions, which have the same meaning as \.{STO}, also
inform the computer that the stored octabyte (and its neighbors in a cache
block) will probably not be read or written in the near future.

\bull\<PRELD X,\$Y,\0 `preload data'.\>
@.PRELD@>
These instructions have no effect on registers or memory, but they inform the
computer that many of the $\xx+1$ bytes $\mm[\rY+\rZ]$ through
$\mm[\rY+\rZ+\xx]$, or $\mm[\rY+\zz]$ through $\mm[\rY+\zz+\xx]$,
will probably be loaded and/or stored in the near future.
No protection failure occurs if the memory is not accessible.

\bull\<PREGO X,\$Y,\0 `prefetch to go'.\>
@.PREGO@>
These instructions have no effect on registers or memory, but they inform the
computer that many of the $\xx+1$ bytes $\mm[\rY+\rZ]$ through
$\mm[\rY+\rZ+\xx]$, or $\mm[\rY+\zz]$ through $\mm[\rY+\zz+\xx]$,
will probably be used as instructions in the near future.
No protection failure occurs if the memory is not accessible.

\bull\<PREST X,\$Y,\0 `prestore data'.\>
@.PREST@>
These instructions have no effect on registers or memory if the computer has
no data cache. But when such a cache exists, they inform the
computer that all of the $\xx+1$ bytes $\mm[\rY+\rZ]$ through
$\mm[\rY+\rZ+\xx]$, or $\mm[\rY+\zz]$ through $\mm[\rY+\zz+\xx]$,
will definitely be stored in the near future before they are loaded.
(Therefore it is permissible for the machine to ignore the present contents of
those bytes. Also, if those bytes are being shared by several processors,
the current processor should try to acquire exclusive access.)
No protection failure occurs if the memory is not accessible.

\bull\<SYNCD X,\$Y,\0 `synchronize data'.\>
@.SYNCD@>
When executed from nonnegative locations, these instructions have no effect on
registers or memory if neither a write buffer nor a ``write back''
data cache are present. But when such a buffer or cache exists, they force the
computer to make sure that all data for the $\xx+1$ bytes 
$\mm[\rY+\rZ]$ through $\mm[\rY+\rZ+\xx]$, or
$\mm[\rY+\zz]$ through $\mm[\rY+\zz+\xx]$,
will be present in memory.
(Otherwise the result of a previous store instruction might appear only
in the cache; the computer is being told that now is the time to
write the information back, if it hasn't already been written. A program
can use this feature before outputting directly from memory.)
No protection failure occurs if the memory is not accessible.

The action is similar when \.{SYNCD} is executed from a negative address,
but in this case the specified bytes are also removed from the data
cache (and from a secondary cache, if present). The operating system can
use this feature when a page of virtual memory is being swapped out,
or when data is input directly into memory.
@^operating system@>

\bull\<SYNCID X,\$Y,\0 `synchronize instructions and data'.\>
@.SYNCID@>
When executed from nonnegative locations these instructions have no effect on
registers or memory if the computer has no instruction cache separate from a
data cache. But when such a cache exists, they force the
computer to make sure that the $\xx+1$ bytes 
$\mm[\rY+\rZ]$ through $\mm[\rY+\rZ+\xx]$, or
$\mm[\rY+\zz]$ through $\mm[\rY+\zz+\xx]$,
will be interpreted correctly
if used as instructions before they are next modified.
(Generally speaking, an \MMIX\ program is not expected to store anything in
memory locations that are also being used as instructions.
Therefore \MMIX's instruction cache is allowed to become inconsistent with
respect to its data cache. Programmers who insist on executing instructions
that have been fabricated dynamically, for example when setting a breakpoint
for debugging, must first \.{SYNCID} those instructions
in order to guarantee that the intended results will be obtained.) A \.{SYNCID}
command might be implemented in several ways; for example, the machine
might update its instruction cache to agree with its data cache. A simpler
solution, which is good enough because the need for \.{SYNCID} ought to
be rare, removes instructions in the specified range
from the instruction cache, if
present, so that they will have to be fetched from memory the next time
they are needed; in this case the machine also carries out the effect of
a~\.{SYNCD} command.
No protection failure occurs if the memory is not accessible.

The behavior is more drastic, but faster, when \.{SYNCID} is executed
from a negative location. Then all bytes in the specified range are
simply removed from all caches, and the memory corresponding to
any ``dirty'' cache blocks involving such bytes is {\it not\/} brought up
to date. An operating system can use this version of the command
when pages of virtual memory are being discarded (for example, when
a program is being terminated).

@ \MMIX\ is designed to work not only on a single processor but also
in situations where several processors
share a common memory. The following commands are useful
for efficient operation in such circumstances.

\bull\<CSWAP \$X,\$Y,\0 `compare and swap octabytes'.\>
@.CSWAP@>
If the octabyte $\mm_8[\rY+\rZ]$ or $\mm_8[\rY+\zz]$
is equal to the contents of the special {\it prediction register\/}~rP,
@^rP@>
it is replaced in memory with the contents of register~X, and
register~X is set equal to~1. Otherwise the octabyte in memory
replaces rP and register~X is set to zero.
This is an atomic (indivisible, uninterruptible) operation,
useful for interprocess communication
when independent computers are sharing the same memory.

The compare-and-swap operation was introduced by IBM in late
models of the
@^IBM Corporation@>
@^compare-and-swap@>
@^atomic instruction@>
System/370 architecture, and it soon spread to several
@^System/370@>
other machines. Significant ways to use it are discussed, for example,
in section 7.2.3 of Harold Stone's
{\sl High-Performance Computer Architecture\/} (Reading, Massachusetts:\
Addison--Wesley, 1987), and in sections 8.2 and 8.3 of {\sl Transaction
Processing\/} by Jim Gray and Andreas Reuter (San Francisco:\ Morgan
Kaufmann, 1993). % Kaufmann: stet
@^Stone, Harold Stuart@>
@^Gray, James Nicholas@>
@^Reuter, Andreas Horst@>

\bull\<SYNC XYZ `synchronize'.\>
@.SYNC@>
If $\rm XYZ=0$, the machine drains its pipeline (that is, it
stalls until all preceding instructions have completed their activity).
If $\rm XYZ=1$, the machine controls its actions less drastically,
in such a way that all
store instructions preceding this \.{SYNC} will be completed
before all store instructions after it.
If $\rm XYZ=2$, the machine controls its actions in such a way that all
load instructions preceding this \.{SYNC} will be completed
before all load instructions after it.
If $\rm XYZ=3$, the machine controls its actions
in such a way that all {\it load or store\/} instructions preceding this
\.{SYNC} will be completed before all load or store instructions after it.
If $\rm XYZ=4$, the machine goes into a power-saver mode, in which
@^power-saver mode@>
instructions may be executed more slowly (or not at all) until some kind
of ``wake-up'' signal is received.
If $\rm XYZ=5$, the machine empties its write buffer and
cleans its data caches, if any (including a possible secondary cache);
the caches retain their data,
but the cache contents also appear in memory.
If $\rm XYZ=6$, the machine clears its virtual address translation
caches (see below).
If $\rm XYZ=7$, the machine clears its instruction and data caches,
discarding any information in the data caches that wasn't previously
in memory. (``Clearing'' is stronger than ``cleaning''; a clear cache
remembers nothing. Clearing is also faster, because it simply obliterates
everything.)
If $\rm XYZ>7$, an illegal instruction interrupt occurs.

Of course no \.{SYNC} is necessary between a command that loads from or stores
into memory and a subsequent command that loads from or stores into exactly
the same location. However, \.{SYNC} might be necessary in certain cases even
on a one-processor system, because input/output processes take place in
parallel with ordinary computation.

The cases $\rm XYZ>3$ are {\it privileged}, in the sense that
only the operating system can use them. More precisely, if a \.{SYNC}
command is encountered with $\rm XYZ=4$ or
$\rm XYZ=5$ or $\rm XYZ=6$ or $\rm XYZ=7$,
a ``privileged instruction interrupt'' occurs unless that interrupt
is currently disabled. Only the operating system can disable
interrupts (see below).
@^privileged operations@>

@* Trips and traps.
Special register rA records the current status information
about arithmetic exceptions. Its least significant byte contains eight
``event'' bits called DVWIOUZX from left to right, where D stands for
integer divide check, V~for integer overflow, W~for float-to-fix overflow,
I~for invalid operation, O~for floating overflow, U~for
floating underflow, Z~for floating division by zero, and X~for floating
inexact. % The low order five bits agree with SPARC I conventions
% but Alpha, for example, uses the order VXUOZI
The next least significant byte of rA contains eight
``enable'' bits with the same names DVWIOUZX and the same meanings.
When an exceptional condition occurs, there are two cases: If the
corresponding enable bit is~0, the corresponding event bit is set
to~1.  But if the corresponding enable bit is~1, \MMIX\ interrupts
its current instruction stream and executes a special ``exception
handler.'' Thus, the event bits record exceptions that have not been
``tripped.''
@^overflow@>
@^underflow@>
@^exceptions@>
@^handlers@>
@^float-to-fix exception@>
@^inexact exception@>
@^invalid exception@>
@^divide check exception@>

Floating point overflow always causes two exceptions, O and~X\null.
(The strictest interpretation of the IEEE standard would raise exception~X
on overflow only if floating overflow is not enabled, but \MMIX\ always
considers an overflowed result to be inexact.)
Floating point underflow always causes both U and~X when underflow is
not enabled, and it might cause both U and~X when underflow is enabled.
If both enable bits are set to~1 in such cases, the overflow or underflow
handler is called and the inexact handler is ignored. All other types
of exceptions arise one at a time, so there is no ambiguity about which
exception handler should be invoked unless exceptions are raised by
``ropcode~2'' (see below); in general the first enabled exception
in the list DVWIOUZX takes precedence.

What about the six high-order bytes of the status register rA?
@^rA@>
@^rounding modes@>
At present, only two of those 48 bits are defined;
the others must be zero for compatibility
with possible future extensions. The two bits corresponding to $2^{17}$
and $2^{16}$ in rA specify a rounding mode, as follows: 00~means
round to nearest (the default); 01~means round off (toward zero);
10~means round up (toward positive infinity); and
11~means round down (toward negative infinity).
% Alpha conventions differ: 10,00,11,01 for nearest,off,up,down

@ The execution of\/ \MMIX\ programs can be interrupted in several ways.
We have just seen that arithmetic exceptions will cause interrupts if
they are enabled; so will illegal or privileged instructions, or instructions
@^illegal instructions@>
@^privileged operations@>
@^emulation@>
@^interrupts@>
@^I/O@>
@^input/output@>
that are emulated in software instead of provided by the hardware.
Input/output operations or external timers are another common source
of interrupts; the operating system knows how to deal with
all gadgets that might be hooked up to an \MMIX\ processor chip.
Interrupts occur also when memory accesses fail---for example if
memory is nonexistent or protected.
Power failures that force the machine to use its backup battery power
in order to keep running in an emergency,
or hardware failures like parity errors,
all must be handled as gracefully as possible.

Users can also force interrupts to happen by giving explicit \.{TRAP} or
\.{TRIP} instructions:

\bull\<TRAP X,Y,Z `trap'; \<TRIP X,Y,Z `trip'.\>
@.TRIP@>
@.TRAP@>
Both of these instructions interrupt processing and transfer control
to a handler. The difference between them is that \.{TRAP}
is handled by the operating system but \.{TRIP} is handled by the user.
@^operating system@>
More precisely, the X, Y, and Z fields of \.{TRAP} have special significance
predefined by the operating system kernel. For example, a system call---say an I/O
command, or a command to allocate more memory---might be invoked
by certain settings of X, Y, and~Z\null.
The X, Y, and Z fields of \.{TRIP}, on the other hand, are definable by
users for their own applications, and users also define their own
handlers. ``Trip handler'' programs
invoked by \.{TRIP} are interruptible, but interrupts are normally inhibited
while a \.{TRAP} is being serviced. Specific details about the
precise actions of \.{TRIP} and \.{TRAP} appear below, together
with the description of another command called \.{RESUME} that
returns control from a handler to the interrupted program.

Only two variants of \.{TRAP} are predefined by the \MMIX\ architecture:
If $\rm XYZ=0$ in a \.{TRAP}
command, a user process should terminate. If $\rm XYZ=1$,
the operating system should provide default action for cases in which
the user has not provided any handler for a particular
kind of interrupt (see below).

A few additional variants of \.{TRAP} are predefined in the rudimentary
operating system used with \MMIX\ simulators. These variants, which
allow simple input/output operations to be done, all have $\xx=0$,
and the Y~field is a small positive constant. For example, $\yy=1$ invokes
the \.{Fopen} routine, which opens a file. (See the program
{\mc MMIX-SIM} for full details.)
@^I/O@>
@^input/output@>

@ Non-catastrophic interrupts in \MMIX\ are always {\it precise}, in the sense that all legal
instructions before a certain point have effectively been executed, and
no instructions after that point have yet been executed. The current
instruction, which may or may not have been completed at the time of
interrupt and which may or may not need to be resumed after the interrupt has
been serviced, is
put into the special {\it execution register\/}~rX, and its operands (if any)
are placed in special registers rY and~rZ\null. The address of the following
instruction is placed in the special {\it where-interrupted
register\/}~rW\null.
@^interrupts@>
@^rW@>
@^rX@>
@^rY@>
@^rZ@>
The instruction in~rX might not be the same as the instruction in
location $\rm rW-4$; for example, it might be an instruction that
branched or jumped to~rW\null. It might also be an instruction
inserted internally by the \MMIX\ processor.
(For example, the computer silently inserts an internal instruction
that increases~$L$ before an instruction
like \<ADD \$9,\$1,\$0 if $L$~is currently less than~10. If an interrupt
occurs, between the inserted instruction and the \.{ADD},
the instruction in~rX will
say \.{ADD}, because an internal instruction retains the identity of the
actual command that spawned it; but rW will point to the {\it real\/}
\.{ADD} command.)

When an instruction has the normal meaning ``set \$X to
the result of \$Y~op~\$Z'' or ``set \$X to the result of \$Y~op~Z,''
special registers rY and~rZ will relate in the
obvious way to the Y and~Z operands of the instruction; but this is not
always the case. For example, after an interrupted
store instruction, the first operand~rY will hold
the virtual memory address (\$Y plus either \$Z or~Z),
and the second operand~rZ will be the octabyte to be stored in memory
(including bytes that have not changed, in cases like \.{STB}). In
other cases the actual
contents of rY and~rZ are defined by each implementation of\/ \MMIX,
and programmers should not rely on their significance.

Some instructions take an unpredictable and possibly long amount of time, so
it may be necessary to interrupt them in progress. For example, the \.{FREM}
@.FREM@>
instruction (floating point remainder) is extremely difficult to compute
rapidly if its first operand has an exponent of~2046 and its second operand
has an exponent of~1. In such cases the rY and rZ registers saved during an
interrupt show the current state of the computation, not necessarily the
original values of the operands. The value of $\rm rY\,{rem}\,rZ$ will still
be the desired remainder, but rY may well have been reduced to a
number that has an exponent closer to the exponent of~rZ\null.
After the interrupt has been processed, the remainder
computation will continue where it left off.
(Alternatively, an operation like \.{FREM} or even \.{FADD} might be
implemented in software instead of hardware, as we will see later.)

Another example arises with an instruction like \.{PREST} (prestore), which can
@.PREST@>
specify prestoring up to 256 bytes. An implementation of\/ \MMIX\ might choose
to prestore only 32 or 64 bytes at a time, depending on the cache block size;
then it can change the contents of rX to reflect the unfinished part of
a partially completed \.{PREST} command.

Commands that decrease $G$, pop the stack, save the
current context, or unsave an old context also are interruptible. Register~rX
is used to communicate information about partial completion in such a
way that the interruption will be essentially ``invisible'' after
a program is resumed.

@ Three kinds of interruption are possible: trips, forced traps, and
dynamic traps. We will discuss each of these in turn.
@^interrupts@>
@^trips@>
@^traps@>
@^forced traps@>
@^dynamic traps@>
@^handlers@>
@^operating system@>

A \.{TRIP} instruction puts itself into the right half of the execution
@.TRIP@>
register~rX, and sets the 32 bits of the left half to \Hex{80000000}.
(Therefore rX is {\it negative\/}; this fact will
tell the \.{RESUME} command not to \.{TRIP} again.) The special registers
rY and rZ are set to the contents of the registers specified by the
Y and Z fields of the \.{TRIP} command, namely \$Y and~\$Z.
Then \$255 is placed into the special {\it bootstrap
register\/}~rB, and \$255 is set to~rJ. \MMIX\ now takes its next instruction
@^rB@>
@^rJ@>
from virtual memory address~0.

Arithmetic exceptions interrupt the computation in essentially the
same way as \.{TRIP}, if they are enabled. The only difference is that
their handlers begin at the respective addresses
16, 32, 48, 64, 80, 96, 112, and~128, for exception bits D, V, W, I, O, U,
Z, and~X of~rA; registers rY and~rZ are set to the operands of the
interrupted instruction as explained earlier.

A 16-byte block of memory is just enough for a sequence of commands like
$$\hbox{\tt PUSHJ 255,Handler; PUT rJ,\$255; GET \$255,rB; RESUME}$$
which will invoke a user's handler. And if the user does not choose to
provide a custom-designed handler, the operating system provides a
default handler via the instructions
$$\hbox{\tt TRAP 1; GET \$255,rB; RESUME.}$$

A trip handler might simply record the fact that tripping occurred.
But the handler for an arithmetic interrupt might want to change the
default result of a computation. In such cases, the handler should place
the desired substitute result into~rZ, and it should change the most
significant byte of~rX from \Hex{80} to \Hex{02}. This will have the desired
effect, because of the rules of \.{RESUME} explained below, {\it unless\/}
the exception occurred on a command like \.{STB} or \.{STSF}. (A~bit more
work is needed to alter the effect of a command that stores into memory.)

Instructions in {\it negative\/} virtual locations do not invoke trip
handlers, either for \.{TRIP} or for arithmetic exceptions. Such instructions
are reserved for the operating system, as we will see.
@^negative locations@>

@ A \.{TRAP} instruction interrupts the computation essentially
@^interrupts@>
like \.{TRIP}, but with the following modifications:
@^rT@>
@.TRAP@>
@^rK@>
(i)~the interrupt mask register~rK is cleared
to zero, thereby inhibiting interrupts; (ii)~control jumps to virtual memory
address~rT, not zero; (iii)~information is placed
@^rBB@>
@^rWW@>
@^rXX@>
@^rYY@>
@^rZZ@>
in a separate set of special registers rBB, rWW, rXX, rYY, and~rZZ, instead of
rB, rW, rX, rY, and~rZ\null. (These special registers are needed because a trap
might occur while processing a \.{TRIP}.)

Another kind of forced trap occurs on implementations of\/ \MMIX\ that
emulate certain instructions in software rather than in hardware.
Such instructions cause a \.{TRAP} even though their opcode is something
else like \.{FREM} or \.{FADD} or \.{DIV}. The trap handler can tell
what instruction to emulate by looking at the opcode, which appears
in~rXX\null. 
In such cases the left-hand half of~rXX is set to \Hex{02000000}; the handler
emulating \.{FADD}, say, should compute the floating point sum of rYY and~rZZ
and place the result in~rZZ\null. A~subsequent
\.{RESUME}~\.1 will then place the value of~rZZ in the proper register.
@^emulation@>
@^forced traps@>

When a forced trap occurs on a store instruction because of memory protection
failure, the settings of rYY and rZZ are undefined. They do not necessarily
correspond to the virtual address rY and the octabyte to be stored rZ
that are supplied to a trip handler after a tripped store instruction,
because a forced trap aborts its instruction as soon as possible.

Implementations of\/ \MMIX\ might also emulate the process of
virtual-address-to-physical-address translation described below,
instead of providing for page table calculations in hardware.
Then if, say, a \.{LDB} instruction does not know the physical memory
address corresponding to a specified virtual address, it will cause
a forced trap with the left half of~rXX set to \Hex{03000000} and with
rYY set to the virtual address in question. The trap handler should
place the physical page address into~rZZ; then \.{RESUME}~\.1 will
complete~the~\.{LDB}.

@ The third and final kind of interrupt is called a {\it dynamic\/} trap.
@^interrupts@>
@^dynamic traps@>
Such interruptions occur when one or more of the 64 bits in the
special {\it interrupt request register\/}~rQ have been set to~1,
@^rQ@>
@^rK@>
and when at least one corresponding bit of the special
{\it interrupt mask register\/}~rK is also equal to~1. The bit positions
of rQ and~rK have the general form
$$\beginword
&\field{24}{24}&&\field88&&\field{24}{24}&&\field88\cr
\noalign{\hrule}
\\&low-priority I/O&\\&program&\\&high-priority I/O&\\&machine&\\\cr
\noalign{\hrule}\endword$$
where the 8-bit ``program'' bits are called \.{rwxnkbsp} and have
the following meanings:
$$\vbox{\halign{\.# bit: &#\hfil\cr
r&instruction tries to load from a page without read permission;\cr
w&instruction tries to store to a page without write permission;\cr
x&instruction appears in a page without execute permission;\cr
n&instruction refers to a negative virtual address;\cr
k&instruction is privileged, for use by the ``kernel'' only;\cr
b&instruction breaks the rules of\/ \MMIX;\cr
s&instruction violates security (see below);\cr
p&instruction comes from a privileged (negative) virtual address.\cr}}$$
Negative addresses are for the use of the operating system only;
@^operating system@>
@^protection bits@>
@^permission bits@>
@^security violation@>
@^privileged instructions@>
@^illegal instructions@>
@^page fault@>
a security violation occurs if an instruction in a nonnegative address
is executed without the \.{rwxnkbsp} bits of~rK all set to~1.
(In such cases the \.s bits of both rQ and~rK are set to~1.)

The eight ``machine'' bits of rQ and rK represent the most urgent
kinds of interrupts. The rightmost bit stands for power failure,
the next for memory parity error, the next for nonexistent memory, 
the next for rebooting, etc.
Interrupts that need especially quick service, like requests from
a high-speed network, also are allocated bit positions near the right end.
Low priority I/O devices like keyboards are assigned to bits at the left.
The allocation of input/output devices to bit positions will
differ from implementation to implementation, depending on
what devices are available.
@^I/O@>
@^input/output@>

Once $\rm rQ\land rK$ becomes nonzero, the machine waits
briefly until it can give a precise interrupt.
Then it proceeds as with a forced trap,
except that it uses the special ``dynamic
trap address register''~rTT instead of~rT. The trap handler that
@^rTT@>
begins at location~rTT can figure out the reason for interrupt by
examining $\rm rQ\land rK$. (For example, after the instructions
$$\hbox spread-10pt{\tt\spaceskip .5em minus .1em
 GET \$0,rQ; LDOU \$1,savedK; AND \$0,\$0,\$1; SUBU \$1,\$0,1;
   SADD \$2,\$1,\$0; ANDN \$1,\$0,\$1}$$
the highest-priority offending bit will be in \$1 and its position will be
in~\$2.)
@^counting trailing zeros@>

If the interrupted instruction contributed 1s to any of the \.{rwxnkbsp} bits
of~rQ, the corresponding bits are set to~1 also in~rXX\null. A~dynamic trap
handler might be able to use this information (although it should
service higher-priority interrupts first if the right half
of $\rm rQ\land rK$ is nonzero).
@^rX@>

The rules of\/ \MMIX\ are rigged
so that only the operating system can execute instructions
with interrupts suppressed. Therefore the operating system can in fact
use instructions that would interrupt an ordinary program. Control of
register rK turns out to be the ultimate privilege, and in a sense the
only important one.
@^privileged operations@>

An instruction that causes a dynamic trap is usually executed before the
interruption occurs. However, an instruction that traps with
bits \.x, \.k, or \.b does nothing; a load instruction that traps
with \.r or \.n loads zero; a store instruction that traps with any
of \.{rwxnkbsp} stores nothing.

@ After a trip handler or trap handler has done its thing, it
generally invokes the following command.

\bull\<RESUME Z `resume after interrupt'; the X and Y fields must be zero.\>
@.RESUME@>
@^interrupts@>
@^handlers@>
If the Z field of this instruction is zero,
\MMIX\ will use the
information found in special registers rW, rX, rY, and~rZ to restart an
@^rW@>
@^rX@>
@^rY@>
@^rZ@>
@^rBB@>
@^rWW@>
@^rXX@>
@^rYY@>
@^rZZ@>
@^rK@>
interrupted computation. If the execution register rX is negative, it will be
ignored and instructions will be executed starting at virtual address~rW\null;
otherwise the instruction in the right half of the execution register will be
inserted into the program as if it had appeared in location $\rm rW-4$,
subject to certain modifications that we will explain momentarily,
and the {\it next\/} instruction will come from rW.

If the Z field of \.{RESUME}
is 1 and if this instruction appears in a negative location,
registers rWW, rXX, rYY, and~rZZ are used instead of rW, rX, rY, and~rZ\null.
Also, just before resuming the computation,
mask register rK is set to \$255 and \$255 is set to rBB\null.
(Only the operating system gets to use this feature.)
@^operating system@>

An interrupt handler within the operating system might choose to allow itself
to be interrupted. In such cases it should save the contents of
rBB, rWW, rXX, rYY, and~rZZ on some kind of stack, before making rK nonzero.
Then, before resuming whatever caused the base level interrupt, it
must again disable all interrupts; this can be done with \.{TRAP},
because the trap handler can tell from the virtual address in~rWW that
it has been invoked by the operating system. Once rK is again zero,
the contents of rBB, rWW, rXX, rYY, and~rZZ are restored from the stack,
the outer level interrupt mask is placed in \$255, and \<RESUME 1
finishes the job.

Values of Z greater than 1 are reserved for possible later
definition. Therefore they cause an illegal instruction interrupt (that
is, they set the `\.b' bit of~rQ) in the present version of\/ \MMIX.
@^illegal instructions@>

If the execution register rX is nonnegative, its leftmost byte controls
the way its right-hand half will be inserted into the program.
Let's call this byte the ``ropcode.'' A ropcode of~0 simply
inserts the instruction into the execution stream; a ropcode of~1
is similar, but it substitutes rY and rZ for the 
two operands, assuming that this makes sense for the operation considered.
@^ropcodes@>

Ropcode~2 inserts a command that sets \$X to rZ, where
X~is the second byte in the right half of rX\null.
This ropcode is normally used with forced-trap emulations, so that the result
of an emulated instruction is placed into the correct register.
It also uses the third-from-left byte of~rX to raise any or all of the
arithmetic exceptions DVWIOUZX, at the same time as rZ is
being placed in \$X. Emulated instructions and
explicit \.{TRAP} commands can therefore cause overflow, say,
just as ordinary instructions can.
(Such new exceptions may, of
course, spawn a trip interrupt, if any of the corresponding bits are enabled
in~rA.)
@^rA@>
@^emulation@>

Finally, ropcode 3 is the same as ropcode 0, except that it also
tells \MMIX\ to treat rZ as the page table entry for the virtual
address~rY\null. (See the discussion of virtual address translation below.)
Ropcodes greater than~3 are not permitted; moreover,
only \<RESUME 1 is allowed to use ropcode~3.

The ropcode rules in the previous paragraphs should of course be understood to
involve rWW, rXX, rYY, and rZZ instead of rW, rX, rY, and~rZ when
the ropcode is seen by \.{RESUME}~\.1. Thus, in particular, ropcode~3
always applies to rYY and~rZZ, never to rY and~rZ.

Special restrictions must hold if resumption is to work properly: Ropcodes
0~and~3 must not insert a \.{RESUME} instruction; ropcode~1 must insert
a ``normal'' instruction, namely one whose opcode begins with
one of the hexadecimal digits \Hex{0}, \Hex{1}, \Hex{2}, \Hex{3}, \Hex{6},
\Hex{7}, \Hex{C}, \Hex{D}, or~\Hex{E}. (See the opcode chart below.)
Some implementations may also allow ropcode~1 with \.{SYNCD[I]}
and \.{SYNCID[I]}, so that those instructions can conveniently be
interrupted.
Moreover, the destination register \$X used with ropcode 1 or~2 must
not be marginal. All of these restrictions hold automatically in normal
use; they are relevant only if the programmer tries to do something tricky.

Notice that the slightly tricky sequence
$$\hbox{\tt LDA \$0,Loc; PUT rW,\$0; LDTU \$1,Inst; PUT rX,\$1; RESUME}$$
will execute an almost arbitrary instruction \.{Inst} as if it had been in
location \.{Loc-4}, and then will jump to location \.{Loc} (assuming
that \.{Inst} doesn't branch elsewhere).

@* Special registers.
@^special registers@>
Quite a few special registers have been mentioned so far, and \MMIX\ actually
has even more. It is time now to enumerate them all, together with their
internal code numbers:
$$\vbox{\halign{\hfil#,\quad&#;\hfil\cr
rA&arithmetic status register [21]\cr
rB&bootstrap register (trip) [0]\cr
rC&continuation register [8]\cr
rD&dividend register [1]\cr
rE&epsilon register [2]\cr
rF&failure location register [22]\cr
rG&global threshold register [19]\cr
rH&himult register [3]\cr
rI&interval counter [12]\cr
rJ&return-jump register [4]\cr
rK&interrupt mask register [15]\cr
rL&local threshold register [20]\cr
rM&multiplex mask register [5]\cr
rN&serial number [9]\cr
rO&register stack offset [10]\cr
rP&prediction register [23]\cr
rQ&interrupt request register [16]\cr
rR&remainder register [6]\cr
rS&register stack pointer [11]\cr
rT&trap address register [13]\cr
rU&usage counter [17]\cr
rV&virtual translation register [18]\cr
rW&where-interrupted register (trip) [24]\cr
rX&execution register (trip) [25]\cr
rY&Y operand (trip) [26]\cr
rZ&Z operand (trip) [27]\cr
rBB&bootstrap register (trap) [7]\cr
rTT&dynamic trap address register [14]\cr
rWW&where-interrupted register (trap) [28]\cr
rXX&execution register (trap) [29]\cr
rYY&Y operand (trap) [30]\cr
rZZ&Z operand (trap) [31]\cr}}$$
@^rG@>
@^rL@>
In this list rG and rL are what we have been calling simply $G$ and $L$; \
rC, rF, rI, rN, rO, rS, rU, and~rV have not been mentioned before.

@ The {\it interval counter\/}~rI decreases by~1
 on every ``clock pulse'' of the
@^rI@>
\MMIX\ processor. Thus if \MMIX\ is running at 500 MHz, the interval
counter decreases every 2 nanoseconds.
It causes an {\it interval interrupt\/}
when it reaches zero. Such interrupts can be extremely useful for
``continuous profiling'' as a means of studying
the empirical running time of programs;
see Jennifer~M. Anderson, Lance~M. Berc, Jeffrey Dean, Sanjay Ghemawat,
Monika~R. Henzinger, Shun-Tak~A. Leung, Richard~L. Sites, Mark~T. Vandevoorde,
Carl~A. Waldspurger, and William~E. Weihl, {\sl ACM Transactions on Computer
Systems\/ \bf15} (1997), 357--390.
The interval interrupt is achieved by setting the next-to-leftmost bit of the
``machine'' byte of~rQ equal to~1; this is the seventh-least-significant bit.
@^rQ@>
@^continuous profiling@>
@^performance monitoring@>
@^Anderson, Jennifer-Ann Monique@>
@^Berc, Lance Michael@>
@^Dean, Jeffrey Adgate@>
@^Ghemawat, Sanjay@>
@^Henzinger, Monika Hildegard Rauch@>
@^Leung, Shun-Tak Albert@>
@^Sites, Richard Lee@>
@^Vandevoorde, Mark Thierry@>
@^Waldspurger, Carl Alan@>
@^Weihl, William Edward@>

The {\it usage counter\/}~rU consists of three fields $(u_p,u_m,u_c)$,
@^rU@>
called the usage pattern~$u_p$, the usage mask~$u_m$,
and the usage count~$u_c$. The most significant byte of~rU is the usage
pattern; the next most significant byte is the usage mask; and
the remaining 48 bits are the usage count. Whenever an instruction whose
${\rm OP}\land u_m=u_p$ has been executed, the value of $u_c$ increases by~1
(modulo~$2^{47}$).
Thus, for example, the OP-code chart below implies that
all instructions are counted if $u_p=u_m=0$;
all loads and stores are counted together with \.{GO} and \.{PUSHGO}
if $u_p=(10000000)_2$ and $u_m=(11000000)_2$;
all floating point instructions are counted together with fixed point
multiplications and divisions if $u_p=0$ and $u_m=(11100000)_2$;
fixed point multiplications and divisions alone are counted if
$u_p=(00011000)_2$ and $u_m=(11111000)_2$; completed subroutine calls
are counted if $u_p=\.{POP}$ and $u_m=(11111111)_2$.
Instructions in negative locations, which belong to the operating system,
are exceptional: They are included in the usage count only if the leading bit
of $u_c$ is~1.
@^negative locations@>

Incidentally, the 64-bit counter rI can be implemented rather cheaply with
only two levels of logic, using an old trick called ``carry-save addition''
[see, for example, G.~Metze and J.~E. Robertson, {\sl Proc.\ International
Conf.\ Information Processing\/} (Paris:\ 1959), 389--396]. One nice
embodiment of this idea is to
@^Metze, Gernot@>
@^Robertson, James Evans@>
@^carry-save addition@>
represent a binary number~$x$ in a redundant form as the difference $x'-x''$
of two binary numbers. Any two such numbers can be added without carry
propagation as follows: Let
$$f(x,y,z)=
 (x\land\bar y)\lor(x\land z)\lor(\bar y\land z), \qquad
% ((x\oplus y)\land(x\oplus z))\oplus z, \qquad
  g(x,y,z)=x\oplus y\oplus z.$$
Then it is easy to check that $x-y+z=2f(x,y,z)-g(x,y,z)$; we need only verify
this in the eight cases when $x$, $y$, and~$z$ are 0 or~1.
Thus we can subtract~1 from a counter $x'-x''$ by setting
$$(x',x'')\gets(f(x',x'',-1)\LL1,\;g(x',x'',-1));$$
we can add~1 by setting
$(x',x'')\gets(g(x'',x',-1),f(x'',x',-1)\LL1)$.
The result is zero if and only if
$x'=x''$. We need not actually compute the difference $x'-x''$ until
we need to examine the register. The computation
of $f(x,y,z)$ and $g(x,y,z)$ is particularly simple in the special
cases $z=0$ and $z=-1$. A similar trick works for~rU,
but extra care is needed in that case
because several instructions might finish at the same time.
(Thanks to Frank Yellin for his improvements to this paragraph.)
@^Yellin, Frank Nathan@>

@ The special {\it serial number register\/}~rN is permanently set to
@^rN@>
the time this particular instance of\/ \MMIX\ was created (measured as the
number of seconds since 00:00:00 Greenwich Mean Time on 1~January 1970),
in its five least significant bytes. The three most significant bytes
are permanently set to the {\it version number\/} of the \MMIX\ architecture
that is being implemented together with
two additional bytes that modify the version
number. This quantity serves as an essentially unique identification
number for each copy of\/ \MMIX.
@^version number@>

Version 1.0.0 of the architecture is described in the present document.
Version~1.0.1 is similar, but simplified to avoid the
complications of pipelines and operating systems.
Other versions may become necessary in the future.

@ The {\it register stack offset\/}~rO and {\it register stack
pointer\/}~rS are especially interesting, because they are used to implement
@^register stack@>
@^rO@>
@^rS@>
\MMIX's register stack~$S[0]$, $S[1]$, $S[2]$,~\dots. 

The operating system
initializes a register stack by assigning a large area of virtual memory to
each running process, beginning at an address like
\Hex{6000000000000000}.
If this starting address is~$\sigma$, stack entry $S[k]$ will go into
the octabyte $\mm_8[\sigma+8k]$. Stack underflow will be detected because
the process does not have permission to read from $\mm[\sigma-1]$.
Stack overflow will be detected because something will give out---either
the user's budget or the user's patience or the user's swap space---long before
$2^{61}$~bytes of virtual memory are filled by a register stack.
@^terabytes@>

The \MMIX\ hardware maintains the register stack by having two banks
of 64-bit general-purpose registers, one for globals and one for locals.
The global registers $\rm g[32]$, $\rm g[33]$, \dots, $\rm g[255]$ are used for
register numbers that are $\ge\gg$ in \MMIX\ commands;
recall that $G$~is always 32 or more. The local
registers come from another array that contains $2^n$ registers for
some~$n$ where $8\le n\le10$; for simplicity of exposition
we will assume that there are exactly 512 local
registers, but there may be only 256 or there may be 1024.

\def\l{{\rm l}}
@^ring of local registers@>
The local register slots l[0], l[1], \dots, l[511] act as a cyclic buffer with
addresses that wrap around mod~512, so that $\l[512]=\l[0]$,
$\l[513]=\l[1]$, etc. This buffer is divided into three parts by three
pointers, which we will call $\alpha$, $\beta$, and $\gamma$.
$$\epsfbox{mmix.1}$$
Registers $\l[\alpha]$, $\l[\alpha+1]$, \dots,~$\l[\beta-1]$ are
what program instructions currently call \$0, \$1, \dots,~$\$(\ll-1)$;
registers $\l[\beta]$, $\l[\beta+1]$, \dots,~$\l[\gamma-1]$ are currently
unused; and registers $\l[\gamma]$, $\l[\gamma+1]$, \dots,~$\l[\alpha-1]$
contain items of the register stack that have been pushed down but not yet
stored in memory. Special register~rS holds the virtual memory address where
$\l[\gamma]$ will be stored, if necessary. Special register~rO holds the
address where $\l[\alpha]$ will be stored; this always equals $8\tau$ plus
the address of~$S[0]$. We can deduce the values of $\alpha$, $\beta$,
and~$\gamma$ from the contents of rL, rO, and~rS, because
$$\rm\alpha=(rO/8)\bmod512,\qquad \beta=(\alpha+rL)\bmod512,\qquad
\hbox{and}\qquad \gamma=(rS/8)\bmod512.$$

To maintain this situation we need to make sure that the pointers $\alpha$,
$\beta$, and $\gamma$ never move past each other. A~\.{PUSHJ} or
\.{PUSHGO} operation simply
advances $\alpha$ toward~$\beta$, so it is very simple. The first part of a
\.{POP} operation, which moves $\beta$ toward~$\alpha$, is also very simple.
But the next part of a~\.{POP} requires $\alpha$ to move downward, and
memory accesses might be required. \MMIX\ will decrease rS by~8 (thereby
decreasing $\gamma$ by~1) and set $\l[\gamma]\gets\mm_8[{\rm rS}]$,
one or more times if necessary, to keep $\alpha$ from decreasing
past~$\gamma$. Similarly, the operation of increasing~$L$ may cause \MMIX\ to
set $\mm_8[{\rm rS}]\gets\l[\gamma]$ and increase rS by~8 (thereby increasing
$\gamma$ by~1) one or more times, to keep $\beta$ from increasing
past~$\gamma$. (Actually $\beta$ is never allowed to increase to the point
where it becomes {\it equal\/} to $\gamma$.)
If many registers need to be loaded or stored at once,
these operations are interruptible.

[A somewhat similar scheme was introduced by David R. Ditzel and H.~R.
McLellan in {\sl SIGPLAN Notices\/ \bf17},\thinspace4 (April 1982), 48--56,
and incorporated in the so-called {\mc CRISP} architecture developed at
AT{\AM}T Bell Labs. An even more similar scheme was adopted in the late 1980s
@^AT{\AM}T Bell Laboratories@>
@^Advanced Micro Devices@>
by Advanced Micro Devices, in the processors of their Am29000 series---a
family of computers whose instructions have essentially the
format `OP~X~Y~Z' used by~\MMIX.]
@^Ditzel, David Roger@>
@^McLellan, Hubert Rae, Jr.@>

Limited versions of\/ \MMIX, having fewer registers, can also be envisioned. For
example, we might have only 32 local registers $\l[0]$, $\l[1]$,
\dots,~$\l[31]$ and only 32 global registers $\rm g[224]$, $\rm g[225]$,
\dots,~$\rm g[255]$. Such a machine could run any \MMIX\ program that
maintains the inequalities $\ll<32$ and $\gg\ge224$.

@ Access to \MMIX's special registers is obtained via the \.{GET} and
\.{PUT} commands.
@^special registers@>
@^rL@>
@^rQ@>

\bull\<GET \$X,Z `get from special register'; the Y field must be zero.\>
@.GET@>
Register X is set to the contents of the special register identified by
its code number~Z, using the code numbers listed earlier.
An illegal instruction interrupt occurs if $\zz\ge32$.

Every special register is readable; \MMIX\ does not keep secrets from
an inquisitive user. But of course only the operating system is allowed
@^operating system@>
to change registers like rK and~rQ (the interrupt mask and request
registers). And not even the operating system is allowed to change
rN~(the serial number) or the stack pointers
rO~and~rS.

\bull\<PUT X,\0 `put into special register';
@.PUT@>
the Y field must be zero.\>
The special register identified by~X is set to
the contents of register Z or to the unsigned byte~Z itself,
if permissible. Some changes are, however, impermissible:
Bits of rA that are always zero must remain zero; the leading seven bytes
of rG and rL must remain zero, and rL must not exceed~rG;
special registers 9--11 (namely rN, rO, and~rS) must not change;
special registers 8 and 12--18 (namely rC,
rI, rK, rQ, rT, rU, rV, and~rTT) can be changed only if the privilege
bit of rK is zero;
and certain bits of~rQ (depending on available hardware) might not
allow software to change them from 0 to~1. Moreover, any bits of~rQ that have
changed from 0 to~1 since the most recent \<GET x,rQ
will remain~1 after \.{PUT}~\.{rQ,z}.
The \.{PUT} command will not increase~rL; it sets rL to the minimum
of the current value and the new value. (A~program should say
\<SETL \$99,0 instead of \<PUT rL,100 when rL is known to be less than~100.)

Impermissible \.{PUT} commands cause an illegal instruction interrupt,
or (in the case of rC, rI, rK, rQ, rT, rU, rV, and~rTT) a privileged
operation interrupt.
@^illegal instructions@>
@^privileged operations@>

\bull\<SAVE \$X,0 `save process state';
@.SAVE@>
@^register stack@>
@^ring of local registers@>
@^rO@>
@^rS@>
\<UNSAVE 0,\$Z `restore process state'; the Y~field must be~0, and
so must the Z field of~\.{SAVE}, the X~field of \.{UNSAVE}.\>
@.UNSAVE@>
The \.{SAVE} instruction stores all registers and special registers
that might affect the computation of the currently running process. 
First the current local registers \$0, \$1, \dots,~$\$(\ll-1)$ are
pushed down as in \.{PUSHGO}~\.{\$255}, and $L$~is set to zero.
Then the current global
registers $\$\gg$, $\$(\gg+1)$, \dots,~\$255 are placed above them
in the register stack; finally
rB, rD, rE, rH, rJ, rM, rR, rP, rW, rX, rY, and~rZ
are placed at the very top, followed by registers rG and~rA packed
into eight bytes:
$$\beginword
&\field88&&\field{24}{24}&&\field{32}{32}\cr
\noalign{\hrule}
\\&rG&\\&0&\\&rA&\\\cr
\noalign{\hrule}\endword$$
The address of the topmost octabyte is then placed in register~X, which
must be a global register. (This instruction is interruptible. If an
interrupt occurs while the registers are being saved, we will have
$\alpha=\beta=\gamma$ in the ring of local registers;
thus rO will equal~rS and rL will be zero. The interrupt handler
essentially has a new register stack, starting on top of the partially
saved context.) Immediately after a \.{SAVE} the values of rO and~rS
are equal to the location of the first byte following the stack
just saved. The current register stack is effectively empty at this
point; thus one shouldn't do a \.{POP} until this context
or some other context has been unsaved.
@^rO@>
@^rS@>

The \.{UNSAVE} instruction goes the other way, restoring all the
registers when given an address in register~Z that was returned
by a previous \.{SAVE}. Immediately after an \.{UNSAVE} the values of
rO and~rS will be equal. Like \.{SAVE}, this instruction is interruptible.

The operating system uses \.{SAVE} and \.{UNSAVE}
to switch context between different processes.
It can also use \.{UNSAVE} to
establish suitable initial values of rO and~rS\null.
But a user program that knows what it is doing can in fact allocate its own
register stack or stacks and do its own process switching.

Caution: \.{UNSAVE} is destructive, in the sense that a program can't reliably
\.{UNSAVE} twice from the same saved context. Once an
\.{UNSAVE} has been done,
further operations are likely to change the memory
record of what was saved. Moreover, an interrupt during the middle
of an \.{UNSAVE} may have already clobbered some of the data in memory before
the \.{UNSAVE} has completely finished, although the data will appear
properly in all registers.

@* Virtual and physical addresses.
Virtual 64-bit addresses are converted to physical addresses in a manner
@^virtual addresses@>
@^physical addresses@>
governed by the special {\it virtual translation register\/}~rV. Thus
@^rV@>
$\rm M[A]$ really refers to $\rm m[\phi(A)]$, where m~is the physical
memory array and $\phi(A)$
is determined by the physical mapping function~$\phi$. The details of
this conversion are rather technical and of interest mainly to the operating
system, but two simple rules are important to ordinary users:
@^operating system@>

\bull Negative addresses are mapped directly to physical addresses, by simply
@^negative locations@>
suppressing the sign bit:
$$\phi(A)=A+2^{63}=A\land\Hex{7fffffffffffffff},\qquad
\hbox{if $A<0$.}$$
{\it All accesses to negative addresses are privileged}, for use by the
operating system only.
@^privileged operations@>
(Thus, for example, the trap addresses in~rT and~rTT should be negative,
because they are addresses inside the operating system.) Moreover, all physical
addresses $\ge2^{48}$ are intended for use by memory-mapped I/O devices;
values read from or written to such locations are never placed in a cache.
@^I/O@>
@^input/output@>
@^memory-mapped input/output@>

\bull Nonnegative addresses belong to four {\it segments}, depending on
@^segments@>
whether the three leading bits are 000, 001, 010, or 011. These $2^{61}$-byte
segments are traditionally used for a program's text, data, dynamic
memory, and register stack, respectively, but such conventions are
not mandatory. There are four mappings $\phi_0$, $\phi_1$, $\phi_2$,
and~$\phi_3$ of 61-bit addresses into 48-bit physical memory space, one for
each segment:
$$\phi(A)=\phi_{\lfloor A/2^{61}\rfloor}(A\bmod2^{61}),\qquad
\hbox{if $0\le A<2^{63}$.}$$
In general, the machine is able to access smaller addresses of a segment more
efficiently than larger addresses. Thus a programmer should let each segment
grow upward from zero, trying to keep any of the 61-bit addresses from
becoming larger than necessary, although arbitrary addresses are legal.

@ Now it's time for the technical details of virtual address translation.
@^segments@>
@^virtual addresses@>
@^physical addresses@>
@^rV@>
The mappings $\phi_0$, $\phi_1$, $\phi_2$, and~$\phi_3$ are defined
by the following rules.
\smallskip

(1) The first two bytes of rV are four nybbles called $b_1$, $b_2$, $b_3$,
$b_4$; we also define $b_0=0$. Segment~$i$ has at most $1024^{\,b_{i+1}-b_i}$
pages. In particular, segment~$i$ must have at most one page when
$b_i=b_{i+1}$, and it must be entirely empty if $b_i>b_{i+1}$.

(2) The next byte of rV, $s$, specifies the current {\it page size},
which is $2^s$ bytes. We must have $s\ge13$ (hence at least 8192~bytes
per page). Values of~$s$ larger than, say, 20 or~so are of use only in rather
large programs that will reside in main memory for long periods of time,
because memory protection and swapping are applied to entire pages.
The maximum legal value of~$s$ is~48.

(3) The remaining five bytes of rV are a 27-bit {\it root location\/}~$r$,
a 10-bit {\it address space number\/}~$n$, and a 3-bit {\it function
field\/}~$f$:
$$\centerline{$\hbox{rV}=\beginword
&\field44&&\field44&&\field44&&\field44&&\field88&&
 \field{27}{27}&&\field{10}{10}&&\field33\cr
\noalign{\hrule}
\\&$b_1$&\\&$b_2$&\\&$b_3$&\\&$b_4$&\\&$s$&\\&$r$&\\&$n$&\\&$f$&\\\cr
\noalign{\hrule}\endword$}$$
Normally $f=0$; if $f=1$, virtual address translation will be done by
software instead of hardware, and the $b_1$, $b_2$, $b_3$, $b_4$,
and~$r$ fields of~rV will be ignored by the hardware.
(Values of $f>1$ are reserved for possible future use; if $f>1$
when \MMIX\ tries to translate an address, a memory-protection
failure will occur.)
@^illegal instructions@>

(4) Each page has an 8-byte {\it page table entry\/} (PTE), which looks
@^page table entry@>
@^PTE@>
like this:
$$\centerline{$\hbox{PTE}=\beginword
&\field{16}{16}&&\field{32}{48-s}&&\field3{s-13}&&\field{10}{10}&&
 \field33\cr
\noalign{\hrule}
\\&$x$&\\&$a$&\\&$y$&\\&$n$&\\&$p$&\\\cr
\noalign{\hrule}\endword$}$$
Here $x$ and $y$ are ignored (thus they are usable for any purpose by the
operating
system); $2^s a$~is the physical address of byte~0 on the page; and $n$~is
the address space number (which must match the number in~rV). The final three
bits are the {\it protection bits\/} $p_r\,p_w\,p_x$; the user needs
$p_r=1$ to load from this page, $p_w=1$ to store on this page, and
$p_x=1$ to execute instructions on this page. If $n$~fails to
match the number in~rV, or if the appropriate protection bit is zero,
a memory-protection fault occurs.
@^protection fault@>

Page table entries should be writable only by the operating system.
The 16 ignored bits of~$x$ imply that physical memory size is limited
to $2^{48}$ bytes (namely 256 large terabytes); that should be enough capacity
for awhile, if not for the entire new millennium.
@^terabytes@>

(5) A given 61-bit address $A$ belongs to page $\lfloor A/2^s\rfloor$ of
its segment, and
$$\phi_i(A)=2^s\,a+(A\bmod2^s)$$
if $a$ is the address in the PTE for page $\lfloor A/2^s\rfloor$ of
segment~$i$.

(6) Suppose $\lfloor A/2^s\rfloor$ is equal to $(a_4a_3a_2a_1a_0)_{1024}$
in the radix-1024 number system. In the common case $a_4=a_3=a_2=a_1=0$, the
PTE is simply the octabyte ${\rm m}_8[2^{13}(r+b_i)+8a_0]$; this rule
defines the mapping for the first 1024 pages. The next million or~so pages are
accessed through an auxiliary {\it page table pointer}
@^page table pointer@>
@^PTP@>
$$\centerline{$\hbox{PTP}=\beginword
&\field11&&\field{50}{50}&&\field{10}{10}&&\field33\cr
\noalign{\hrule}
\\&1&\\&$c$&\\&$n$&\\&$q$&\\\cr
\noalign{\hrule}\endword$}$$
in ${\rm m}_8[2^{13}(r+b_i+1)+8a_1]$; here the sign must be~1 and the
$n$-field must match~rV, but the $q$~bits are ignored. The desired PTE for
page $(a_1a_0)_{1024}$ is then in ${\rm m}_8[2^{13}c+8a_0]$. The next billion
or so pages, namely the pages $(a_2a_1a_0)_{1024}$ with $a_2\ne0$,
are accessed similarly, through an auxiliary PTP at level~two; and
so on.

Notice that if $b_3=b_4$, there is just one page in segment~3, and its PTE
appears all alone in physical location $2^{13}(r+b_3)$.
Otherwise the PTEs appear in 1024-octabyte blocks. We usually
have $0<b_1<b_2<b_3<b_4$, but the null case $b_1=b_2=b_3=b_4=0$ is
worthy of mention: In this special case there is only one page, and the
segment bits of a virtual address are ignored; the other $61-s$ bits of each
virtual address must be zero.

If $s=13$, $b_1=3$, $b_2=2$, $b_3=1$, and $b_4=0$, there are at most
$2^{30}$ pages of 8192 bytes each, all belonging to segment~0. This is
essentially the virtual memory setup in the Alpha~21064 computers with
{\mc DIGITAL~UNIX}$^{\rm\,TM}$.
@^Alpha computers@>

Several special cases have weird behavior, which probably isn't going to
be useful. But I might as well mention them so that the flexibility
of this scheme is clarified: If, for example, $b_1=2$, $b_2=b_3=1$, and
$b_4=5$, then $r+1$ is used both for PTPs of segment~0 and PTEs of
segment~2. And if $b_2=b_3<b_4$, then $r+b_2$ is used for the PTE of
page~0 segments 2 and~3; page~1 of segment~2 is not allowed, but there
is a page~1 in segment~3.

I know these rules look extremely complicated, and I sincerely wish I could
have found an alternative that would be both simple and efficient in practice.
I tried various schemes based on hashing, but came to the conclusion that
``trie'' methods such as those described here are better for this application.
Indeed, the page tables in most contemporary computers are based on very
similar ideas, but with significantly smaller virtual addresses and without
the shortcut for small page numbers. I tried also to find formats for rV
and the page tables that would match byte boundaries in a more friendly way,
but the corresponding page sizes did not work well. Fortunately these grungy
details are almost always completely hidden from ordinary users.

Stack overflow presents a potential problem:
@^Stack overflow@>
If $\gamma$ increases to a virtual address on a new page for which there
is no permission to write, the protection interrupt handler would have no stack
space in which to work! Therefore \MMIX\ has a
{\it continuation register\/}~rC,
@^rC@>
which contains the physical address of a ``continuation page.'' Pushed-down
information is written to the continuation page until \MMIX\ comes to
an instruction that is safely interruptible. Then a stack overflow
interrupt occurs, and the operating system can restore order.
The format of~rC is just like an ordinary PTE entry, except that
the $n$ field is ignored.

@ Of course \MMIX\ can't afford to perform a lengthy calculation of physical
addresses every time it accesses memory. The machine therefore maintains a
{\it translation cache\/} (TC),
@^translation caches@>
@^TC@>
which contains the translations of recently
accessed pages. (In fact, there usually are two such caches,
one for instructions
and one for data.) A~TC holds a set of 64-bit translation keys
$$\beginword
&\field{1.2}1&&\field22&&\field{44.8}{61-s}&&\field3{s-13}&&\field{10}{10}&&
 \field33\cr
\noalign{\hrule}
\\&0&\\&$i$&\\&$v$&\\&0&\\&$n$&\\&0&\\\cr
\noalign{\hrule}\endword$$
associated with 38-bit translations
$$\beginword
&\field{32}{48-s}&&\field3{s-13}&&\field33\cr
\noalign{\hrule}
\\&$a$&\\&0&\\&$p$&\\\cr
\noalign{\hrule}\endword$$
representing the relevant parts of the PTE for page $v$ of segment $i$.
Different processes typically have different values of~$n$, and possibly also
different values of~$s$. The operating system needs a way to keep such caches
up to date when pages are being allocated, moved, swapped, or recycled.
The operating system also likes to know which pages have been recently
used. The \.{LDVTS} instructions facilitate such operations:
@^protection bits@>
@^permission bits@>

\bull\<LDVTS \$X,\$Y,\0 `load virtual translation status'.\>
@.LDVTS@>
The sum $\rY+\rZ$ or $\rY+\zz$ should have the form of
a translation cache key as above,
except that the rightmost three bits need not be zero.
If this key is present in a TC,
the rightmost three bits replace the current protection code~$p$;
however, if $p$ is thereby set to zero, the key is removed from
the TC. Register~X is set to 0 if the key was not present
in any translation cache, or to 1 if the key was present in the TC
for instructions, or to 2 if the key was present in the TC for data,
or to~3 if the key was present in both. This instruction is for the
operating system only. (Changes to the TC are not immediate; so \.{SYNC}
and/or \.{SYNCD} ought to be done when appropriate, as discussed in
{\mc MMIX-PIPE}.)

@ We mentioned earlier that
cheap versions of\/ \MMIX\ might calculate the physical addresses with
@^emulation@>
@^rV@>
software instead of hardware, using forced traps when the operating
system needs to do page table calculations.
@^operating system@>
Here is some code that could be used for
such purposes; it defines the translation process precisely, given a
nonnegative virtual
address in register~rYY\null. First we must unpack the fields of~rV and
@^virtual addresses@>
@^physical addresses@>
@^rV@>
@^PTE@>
@^PTP@>
@^segments@>
compute the relevant base addresses for PTEs and PTPs:
$$\vbox{\halign{&\tt#\hfil\ \cr
&GET &virt,rYY\cr
&GET &\$7,rV         &\% \$7=(virtual translation register)\cr
&SRU    &\$1,virt,61    &\% \$1=i (segment number of virtual address)\cr
&SLU    &\$1,\$1,2       \cr
&NEG    &\$1,52,\$1      &\% \$1=52-4i\cr
&SRU    &\$1,\$7,\$1      \cr
&SLU    &\$2,\$1,4       \cr
&SETL   &\$0,\#f000     \cr
&AND    &\$1,\$1,\$0      &\% \$1=b[i]<<12\cr
&AND    &\$2,\$2,\$0      &\% \$2=b[i+1]<<12\cr
&SLU    &\$3,\$7,24      \cr
&SRU    &\$3,\$3,37      \cr
&SLU    &\$3,\$3,13      &\% \$3=(r field of rV)\cr
&ORH   &\$3,\#8000     &\% make \$3 a physical address\cr
&2ADDU   &base,\$1,\$3    &\% base=address of first page table\cr
&2ADDU   &limit,\$2,\$3   &\% limit=address after last page table\cr
&SRU    &s,\$7,40      \cr
&AND    &s,s,\#ff       &\% s=(s field of rV)\cr
&CMP    &\$0,s,13       \cr
&BN     &\$0,Fail       &\% s must be 13 or more\cr
&CMP    &\$0,s,49       \cr
&BNN    &\$0,Fail       &\% s must be 48 or less\cr
&SETH  &mask,\#8000 \cr
&ORL   &mask,\#1ff8&\% mask=(sign bit and n field)\cr 
&ORH   &\$7,\#8000      &\% set sign bit for PTP validation below\cr
&ANDNH &virt,\#e000     &\% zero out the segment number\cr
&SRU    &\$0,virt,s     &\% \$0=a4a3a2a1a0 (page number of virt)\cr
&ZSZ   &\$1,\$0,1       &\% \$1=[page number is zero]\cr
&ADD    &limit,limit,\$1&\% increase limit if page number is zero\cr
&SETL&\$6,\#3ff\cr
}}$$
The next part of the routine finds the ``digits'' of
the page number $(a_4a_3a_2a_1a_0)_{1024}$, from right to left:
$$
\vcenter{\halign{&\tt#\hfil\ \cr
&CMP &\$5,base,limit\cr
&SRU &\$1,\$0,10\cr
&PBZ &\$1,1F\cr
&AND &\$0,\$0,\$6\cr
&INCL&base,\#2000\cr}}
\vcenter{\halign{&\tt#\hfil\ \cr
&CMP &\$5,base,limit\cr
&SRU &\$2,\$1,10\cr
&PBZ &\$2,2F\cr
&AND &\$1,\$1,\$6\cr
&INCL&base,\#2000\cr}}
\vcenter{\halign{&\tt#\hfil\ \cr
&CMP &\$5,base,limit\cr
&SRU &\$3,\$2,10\cr
&PBZ &\$3,3F\cr
&AND &\$2,\$2,\$6\cr
&INCL&base,\#2000\cr}}
\vcenter{\halign{&\tt#\hfil\ \cr
&CMP &\$5,base,limit\cr
&SRU &\$4,\$3,10\cr
&PBZ &\$4,4F\cr
&AND &\$3,\$3,\$6\cr
&INCL&base,\#2000\cr}}
$$
Then the process cascades back through PTPs.
$$
\vcenter{\halign{&\tt#\hfil\ \cr
&CMP &\$5,base,limit\cr
&BNN &\$5,Fail\cr
&8ADDU&\$6,\$4,base\cr
&LDO  &base,\$6,0\cr
&XOR &\$6,base,\$7\cr
&AND &\$6,\$6,mask\cr
&BNZ &\$6,Fail\cr}}
\vcenter{\halign{&\tt#\hfil\ \cr
&ANDNL&base,\#1fff\cr
4H&BNN &\$5,Fail\cr
&8ADDU &\$6,\$3,base\cr
&LDO  &base,\$6,0\cr
&XOR &\$6,base,\$7\cr
&AND &\$6,\$6,mask\cr
&BNZ &\$6,Fail\cr}}
\vcenter{\halign{&\tt#\hfil\ \cr
&ANDNL&base,\#1fff\cr
3H&BNN &\$5,Fail\cr
&8ADDU &\$6,\$2,base\cr
&LDO  &base,\$6,0\cr
&XOR &\$6,base,\$7\cr
&AND &\$6,\$6,mask\cr
&BNZ &\$6,Fail\cr}}
\vcenter{\halign{&\tt#\hfil\ \cr
&ANDNL&base,\#1fff\cr
2H&BNN &\$5,Fail\cr
&8ADDU &\$6,\$1,base\cr
&LDO  &base,\$6,0\cr
&XOR &\$6,base,\$7\cr
&AND &\$6,\$6,mask\cr
&BNZ &\$6,Fail\cr}}
$$
Finally we obtain the PTE and communicate it to the machine.
If errors have been detected, we set the translation to zero; actually
any translation with permission bits zero would have the same effect.
$$\chardef\_=`\_
\vcenter{\halign{&\tt#\hfil\ \cr
&ANDNL &base,\#1fff &\% remove low 13 bits of PTP\cr
1H &BNN &\$5,Fail\cr
&8ADDU &\$6,\$0,base \cr
&LDO  &base,\$6,0  &\% base=PTE\cr
&XOR &\$6,base,\$7\cr
&ANDN&\$6,\$6,\#7\cr
&SLU &\$6,\$6,51\cr
&PBZ &\$6,Ready &\% branch if n matches\cr
Fail&SETL &base,0 &\% errors lead to PTE of zero\cr
Ready&PUT&rZZ,base\cr
&LDO&\$255,IntMask &\% load the desired setting of rK\cr
&RESUME&1 &\% now the machine will digest the translation\cr}}$$
All loads and stores in this program deal with negative virtual addresses.
This effectively shuts off memory mapping and makes the page tables
inaccessible to the user.\looseness=-1

The program assumes that the ropcode in rXX is 3 (which it is when
a forced trap is triggered by the need for virtual translation).
@^ropcodes@>
@^translation caches@>

The translation from virtual pages to physical pages need not actually
follow the rules for PTPs and PTEs; any other mapping could be
substituted by operating systems with special needs. But people usually
want compatibility between different implementations whenever
possible. The only parts of~rV that \MMIX\ really needs are the $s$~field,
which defines page sizes, and the $n$~field, which keeps TC entries
of one process from being confused with the TC entries of another.

@* The complete instruction set. We have now described all of\/ \MMIX's
special registers---except one: The special
{\it failure location register\/}~rF is set
@^rF@>
to a physical memory address when a parity error or other memory
fault occurs. (The instruction leading to this error will probably be
long gone before such a fault is detected; for example, the machine might
be trying to write old data from a cache in order to make room for
new data. Thus there is generally no connection between the current virtual
program location~rW and the physical location of a memory error. But knowledge
of the latter location can still be useful for hardware repair, or when
an operating system is booting up.)

@ One additional instruction proves to be useful.

\bull\<SWYM X,Y,Z `sympathize with your machinery'.\>
This command lubricates the disk drives, fans, magnetic tape drives,
laser printers, scanners, and any other mechanical equipment hooked
up to \MMIX, if necessary. Fields X, Y, and~Z are ignored.
@.SWYM@>

The \.{SWYM} command was originally included in \MMIX's repertoire because
machines occasionally need grease to keep in shape, just as
human beings occasionally need to swim or do some other kind of exercise
in order to maintain good muscle tone. But in fact, \.{SWYM} has turned out to
be a ``no-op,'' an instruction that does nothing at all; the
@^no-op@>
hypothetical manufacturers of our hypothetical machine have pointed out that
modern computer equipment is already well oiled and sealed for permanent use.
Even so, a no-op instruction provides a good way for software to
send signals to the hardware, for such things as scheduling the way
instructions are issued on superscalar superpipelined buzzword-compliant
machines. Software programs can also use no-ops to communicate with other
programs like symbolic debuggers.

When a forced trap computes the translation~rZZ of a virtual address~rYY,
ropcode~3 of \<RESUME 1 will put $\rm(rYY,rZZ)$ into the TC for instructions if
the opcode in~rXX is \.{SWYM}; otherwise $\rm(rYY,rZZ)$ will be put
into the TC for data.
@^ropcodes@>
@^translation caches@>
@.RESUME@>
@^virtual address emulation@>
@^emulation@>

@ The running time of\/ \MMIX\ programs depends to a great extent
on changes in technology.
\MMIX\ is a mythical machine, but its mythical hardware exists in
cheap, slow versions as well as in costly high-performance models.
Details of running time usually depend on things like the amount of main memory
available to implement virtual memory, as well as the sizes of
caches and other buffers.

For practical purposes, the running time of an \MMIX\ program can often be
estimated satisfactorily by assigning a fixed cost
to each operation, based on the approximate running time that would be obtained
on a high-performance machine with lots of main memory; so that's what
we will do. Each operation will be assumed to take an integer number
of~$\upsilon$,
where $\upsilon$ (pronounced ``oops'') is a unit that represents the clock cycle time in
@^mems@>
@^oops@>
a pipelined implementation. The value of $\upsilon$ will probably decrease
from year to year, but I'll keep calling it $\upsilon$. The running
time will also depend on the number of memory references or {\it mems\/}
that a program uses;
this is the number of load and store instructions. For example,
each \.{LDO} (load octa) instruction will be assumed to cost
$\mu+\upsilon$, where $\mu$ is the average cost of
a memory reference. The total running time of a program might be reported as,
say, $35\mu+1000\upsilon$, meaning 35 mems plus 1000~oops. The
ratio $\mu/\upsilon$ will probably increase with time, so mem-counting
is likely to become increasingly important. [See the discussion of mems in
{\sl The Stanford GraphBase\/} (New York:\ ACM Press, 1994).]
@^oops@>
@^running times, approximate@>

Integer addition, subtraction, and comparison all take just $1\upsilon$.
The same is true for \.{SET}, \.{GET}, \.{PUT}, \.{SYNC}, and \.{SWYM}
instructions,
as well as bitwise logical operations, shifts, relative jumps, comparisons,
conditional assignments,
and correctly predicted branches-not-taken or probable-branches-taken.
Mispredicted branches or probable branches cost $3\upsilon$, and
so do the \.{POP} and \.{GO} commands.
Integer multiplication takes $10\upsilon$; integer division weighs in
at~$60\upsilon$.
@.MUL@>
@.DIV@>
@.TRAP@>
@.TRIP@>
@.RESUME@>
\.{TRAP}, \.{TRIP}, and \.{RESUME} cost $5\upsilon$ each.

Most floating point operations have a nominal running time of $4\upsilon$,
although the comparison operators \.{FCMP}, \.{FEQL}, and \.{FUN}
need only $1\upsilon$.
\.{FDIV} and \.{FSQRT} cost $40\upsilon$ each.
@.FDIV@>
@.FSQRT@>
@.FREM@>
The actual running time of floating point computations
will vary depending on the operands; for example,
the machine might need one extra $\upsilon$ for each subnormal input
or output, and it might slow down greatly when trips are enabled.
The \.{FREM} instruction might typically cost
$(3+\delta)\upsilon$, where $\delta$ is the amount
by which the exponent of the first operand exceeds the exponent of the
second (or zero, if this amount is negative). A floating point
operation might take only $1\upsilon$
if at least one of its operands is zero, infinity, or~NaN\null.
However, the fixed values stated at the beginning of this paragraph
will be used for all seat-of-the-pants estimates of running time,
since we want to keep the estimates as simple as possible
without making them terribly out of line.

All load and store operations will be assumed to cost $\mu+\upsilon$,
except that \.{CSWAP} costs $2\mu+2\upsilon$.
(This applies to all OP~codes that begin with
\Hex8, \Hex9, \Hex{A}, and \Hex{B}, except \Hex{98}--\Hex{9F} and
\Hex{B8}--\Hex{BF}. It's best
to keep the rules simple, because $\mu$ is just
an approximate device for estimating average memory cost.)
\.{SAVE} and \.{UNSAVE} are charged $20\mu+\upsilon$.
@.CSWAP@>
@.SAVE@>
@.UNSAVE@>

Of course we must remember that these numbers are very rough.
We have not included the cost of fetching instructions from memory.
Furthermore, an integer multiplication or division might have an effective
cost of only $1\upsilon$, if the result is not needed while other
numbers are being calculated.
Only a detailed simulation can be expected to be truly realistic.

@ If you think that \MMIX\ has plenty of operation codes, you are right;
we have now described them all. Here is a chart that shows their
numeric values:
\def\oddline#1{\cr
  \noalign{\nointerlineskip}
  \omit&\setbox0=\hbox{\lower 2.3pt\hbox{\Hex{#1x}}}\smash{\box0}&
  \multispan{17}\hrulefill&
  \setbox0=\hbox{\lower 2.3pt\hbox{\Hex{#1x}}}\smash{\box0}\cr
  \noalign{\nointerlineskip}}
\def\evenline{\cr\noalign{\hrule}}
\def\chartstrut{\lower4.5pt\vbox to14pt{}}
\def\beginchart{$$\tt\halign to\hsize\bgroup
    \chartstrut##\tabskip0pt plus10pt&
    &\hfil##\hfil&\vrule##\cr
    \lower6.5pt\null
    &&&\Hex0&&\Hex1&&\Hex2&&\Hex3&&\Hex4&&\Hex 5&&\Hex 6&&\Hex 7&\evenline}
\def\endchart{\raise11.5pt\null&&&\Hex 8&&\Hex 9&&\Hex A&&\Hex B&
  &\Hex C&&\Hex D&&\Hex E&&\Hex F&\cr\egroup$$}
\def\\#1[#2]{\multispan3\hfil#1[#2]\hfil}
\beginchart
&&&TRAP&&FCMP&&FUN&&FEQL&&FADD&&FIX&&FSUB&&FIXU&\oddline 0
&&&\\FLOT[I]&&\\FLOTU[I]&&\\SFLOT[I]&&\\SFLOTU[I]&\evenline
&&&FMUL&&FCMPE&&FUNE&&FEQLE&&FDIV&&FSQRT&&FREM&&FINT&\oddline 1
&&&\\MUL[I]&&\\MULU[I]&&\\DIV[I]&&\\DIVU[I]&\evenline
&&&\\ADD[I]&&\\ADDU[I]&&\\SUB[I]&&\\SUBU[I]&\oddline 2
&&&\\2ADDU[I]&&\\4ADDU[I]&&\\8ADDU[I]&&\\16ADDU[I]&\evenline
&&&\\CMP[I]&&\\CMPU[I]&&\\NEG[I]&&\\NEGU[I]&\oddline 3
&&&\\SL[I]&&\\SLU[I]&&\\SR[I]&&\\SRU[I]&\evenline
&&&\\BN[B]&&\\BZ[B]&&\\BP[B]&&\\BOD[B]&\oddline 4
&&&\\BNN[B]&&\\BNZ[B]&&\\BNP[B]&&\\BEV[B]&\evenline
&&&\\PBN[B]&&\\PBZ[B]&&\\PBP[B]&&\\PBOD[B]&\oddline 5
&&&\\PBNN[B]&&\\PBNZ[B]&&\\PBNP[B]&&\\PBEV[B]&\evenline
&&&\\CSN[I]&&\\CSZ[I]&&\\CSP[I]&&\\CSOD[I]&\oddline 6
&&&\\CSNN[I]&&\\CSNZ[I]&&\\CSNP[I]&&\\CSEV[I]&\evenline
&&&\\ZSN[I]&&\\ZSZ[I]&&\\ZSP[I]&&\\ZSOD[I]&\oddline 7
&&&\\ZSNN[I]&&\\ZSNZ[I]&&\\ZSNP[I]&&\\ZSEV[I]&\evenline
&&&\\LDB[I]&&\\LDBU[I]&&\\LDW[I]&&\\LDWU[I]&\oddline 8
&&&\\LDT[I]&&\\LDTU[I]&&\\LDO[I]&&\\LDOU[I]&\evenline
&&&\\LDSF[I]&&\\LDHT[I]&&\\CSWAP[I]&&\\LDUNC[I]&\oddline 9
&&&\\LDVTS[I]&&\\PRELD[I]&&\\PREGO[I]&&\\GO[I]&\evenline
&&&\\STB[I]&&\\STBU[I]&&\\STW[I]&&\\STWU[I]&\oddline A
&&&\\STT[I]&&\\STTU[I]&&\\STO[I]&&\\STOU[I]&\evenline
&&&\\STSF[I]&&\\STHT[I]&&\\STCO[I]&&\\STUNC[I]&\oddline B
&&&\\SYNCD[I]&&\\PREST[I]&&\\SYNCID[I]&&\\PUSHGO[I]&\evenline
&&&\\OR[I]&&\\ORN[I]&&\\NOR[I]&&\\XOR[I]&\oddline C
&&&\\AND[I]&&\\ANDN[I]&&\\NAND[I]&&\\NXOR[I]&\evenline
&&&\\BDIF[I]&&\\WDIF[I]&&\\TDIF[I]&&\\ODIF[I]&\oddline D
&&&\\MUX[I]&&\\SADD[I]&&\\MOR[I]&&\\MXOR[I]&\evenline
&&&SETH&&SETMH&&SETML&&SETL&&INCH&&INCMH&&INCML&&INCL&\oddline E
&&&ORH&&ORMH&&ORML&&ORL&&ANDNH&&ANDNMH&&ANDNML&&ANDNL&\evenline
&&&\\JMP[B]&&\\PUSHJ[B]&&\\GETA[B]&&\\PUT[I]&\oddline F
&&&POP&&RESUME&&SAVE&&UNSAVE&&SYNC&&SWYM&&GET&&TRIP&\evenline
\endchart
The notation `\.{[I]}' indicates an operation with an ``immediate'' variant
in which the Z field denotes a constant instead of a register number.
Similarly, `\.{[B]}' indicates an operation with a ``backward'' variant
in which a relative address has a negative displacement. Simulators and
other programs that need to present \MMIX\ instructions in symbolic
form will say that opcode \Hex{20} is \.{ADD} while opcode \Hex{21}
is~\.{ADDI}; they will say that \Hex{F2} is \.{PUSHJ} while \Hex{F3}
is~\.{PUSHJB}. But the \MMIX\ assembler uses only the forms \.{ADD}
and \.{PUSHJ}, not \.{ADDI} or \.{PUSHJB}.

To read this chart, use the hexadecimal digits at the top, bottom,
left, and right.
For example, operation code \.{A9} in hexadecimal notation appears in
the lower part of the \Hex{Ax} row and in the \Hex1/\Hex9 column; it is
\.{STTI}, `store tetrabyte immediate'.
@^OP codes, table@>

%The blank spaces in this chart are undefined opcodes,
%reserved for future extension.
%If an instruction with such
%an opcode is encountered in a user program, it is considered to be
%an illegal instruction (like, say, \.{FIX} with the \.Y field greater than~9),
%@^illegal instructions@>
%triggering an interrupt. Such instructions might become defined in
%later versions of\/ \MMIX, at which time the operating system
%could probably emulate the new instructions for backward compatibility.
%@^version number@>

\def\\#1{\leavevmode\hbox{\it#1\/\kern.05em}} % italic type for identifiers

@*Index. (References are to section numbers, not page numbers.)
