Call versus branch-and-link

What’s the big difference?

Years ago, while pondering the “nature” of threaded code something struck me: In a threaded code system the called routine decides whether or not to push a return address onto the call stack.

This is in contrast to native code running on “traditional” architectures with “call” instructions. Call instructions always push the return address, even when it’s unnecessary. (It’s unnecessary when the code being is called is leaf routine – one that makes no further calls. But the caller doesn’t – and can’t! – know this.)

RISC architects realized that always pushing the return address was inefficient – doing so creates unnecessary memory traffic – and instead of call instructions specified “branch and link” instructions.

What’s the difference?

Operationally, a call instruction does two things:

pushes onto the stack the address of the following instruction
jumps to the destination

A branch-and-link instruction – maybe it should be called link-and-branch? – also does two things:

captures in a register (the “link” register) the address of the following instruction
jumps to the destination

If the called code is a leaf routine, it doesn’t need to push the link register onto the stack. It is careful, however, to preserve its value; then, at exit, it returns to its caller by jumping to the address in the link register.

If the called code will be making calls to other code, on entry it pushes the link register (to preserve it), and on exit it first pops and restores it, and then retuns to its caller as above, by jumping to the address in the register.

Not created equal

There are two variations of branch-and-link:

one that always uses the same link register (eg, ARM)
one that allows specification of the link register (RISC-V, and probably others – I’m not a RISC expert. ;-)

The latter is much more powerful. If all your code does is “normal” calls and returns, the difference is unimportant. But there is a curious class of uses for which the difference is important: when using a branch-and-link to capture the following address for a purpose other than to return to it later.

Say whaaaaat? Why would you ever do this?

Creative misuse

I mentioned in my discussion of threaded code that one can “misuse” a call instruction to capture the following address. This is sometimes useful when writing the “runtime” behaviors of Forth words. But using call in this way is inefficient: you capture the following address by pushing it, but then immediately pop it and do something with it.

Misusing branch-and-link instructions to do the same thing is much more efficient. The address is captured in a register, and the code moves on. No push and pop. (There is a jump involved, so perhaps a pipeline refill occurs.)

And if, for some reason, we want to do this twice, in immediate succession, we simply specify a different link register in each of the branch-and-links. (Hmm, and now we are perhaps doing two pipeline refills in quick succession...)

But again, why?

In a non-threaded implementation of Forth’s create/does words, this is precisely what happens. I’ll explain how this works by first showing how it works in a ITC (indirect-threaded code) system, and then replace some pointers with jump-and-link instructions.

We’ll look how three kinds of Forth words are represented: colon words, variables, and “incrementers” (which are going to be defined via create/does).

Our example colon word was defined in Forth like this:

  : 4*   2* 2* ;

and its compiled form looks like this:

      4*
  ~~~~~~~~~~~
  addr of NEST    ( code field)
  addr of 2*      ( parameter field)
  addr of 2*
  addr of UNNEST

Our variable looks like this:

     radix
  ~~~~~~~~~~~
  addr of VARIABLE    (code field)
  2                   (parameter field)

And our create’d word – the most complicated of our examples – was created by the following Forth code:

  : incr   create ,  does> @ + ;
  4 incr cell+

To further complicate this example – which is really the key to my argument – we will assume a tethered cross-compiled Forth, which means that the words create , does> execute on the host machine, and the target contains only the runtime pieces. cell+ and incr look like this:

     cell+
  +----------+
  |   addr   |---+   ( code field)
  +----------+   |
  4              |   ( parameter field)
                 |
     incr        |
  ~~~~~~~~~~~    |
  jal DODOES   <-+
  addr of @
  addr of +
  addr of UNNEST

  DODOES:  Push IP onto R stack
           Push pfa onto D stack
           Execute NEXT

Two notes about incr. It is not a normal Forth word with a code field and a parameter field. It is like the built-in “runtimes” for colon words or variables. In this case – unlike with VARIABLE or NEST – we want to express the runtime in Forth code rather than machine code. But runtimes have to start with machine code. And a does> runtime has to do two things: nest the execution of Forth (just as we would do if we were calling a colon word), and push the pfa of the create’d word (cell+ in our example.)

Since this behavior is common to all does> runtimes, we compile it once and jal to it.

Thus, incr begins with jal DODOES. This is the one place in an ITC system where code specific to an architecture has to be compiled into a word’s body. The Forth code in a parent (defining) word that specifies the execution behavior of child words – ie, the code following does> – has to be prefixed by machine code of some kind so that the code field in child words can point to it. Code fields always point to machine code.

Now for the point of this exercise. Let’s convert these three examples to “native” code, using jump-and-link instructions in the bodies of colon words instead of lists of pointers. Note that we use two different link registers: w in code fields, and ra in colon word bodies.

      4*
  ~~~~~~~~~~~
  jal  w NEST    (code field)
  jal ra 2*      (parameter field)
  jal ra 2*
  jal ra UNNEST

NEST in this world changes slightly from the ITC version. Notice that the jal w NEST captures the pfa – which is the address of the first call to 2* – in w. NEST now looks like this:

  NEST:  Push RA onto R stack
         jr 0(w)

That jr 0(w) means “return to the body of the colon word and start executing the code there”.

Our variable looks like this:

     radix
  ~~~~~~~~~~~
  jal w VARIABLE    (code field)
  2                 (parameter field)

Again, the jal w VARIABLE captures the pfa in w.

Since there is one more level of “nesting” in does> words, we need a third link register. Let’s use x.

     cell+
  ~~~~~~~~~~~
  jal w INCR   (code field)
  4            (parameter field)

  INCR:  jal  x DODOES
         jal ra @
         jal ra +
         jal ra UNNEST

  DODOES:  Push RA onto R stack
           Push W (pfa) onto D stack
           jr 0(x)

The jal w INCR in cell+ captures the pfa in w. The jal x DODOES in INCR captures the address of the Forth code that will be executed after W is pushed onto D stack. DODOES pushes both stacks, and then “returns” to the body of INCR.

It’s convoluted, and it’s possibly inefficient – because we are executing three jal instructions in quick succession, and probably causing a series of pipeline refills – but it’s very little code, and it’s elegant in a twisted way. ;-)