Dynamic Binary Translation

Introduction and Discussion

What is Dynamic Binary Translation, or "DBT"?

Machine code for a target machine is dynamically translated (converted) to another form, usually machine code for a host machine, or to threaded code.

Why use DBT?

Many reasons, including:

so code written for one machine can run on another;
to perform dynamic optimization so code runs faster;
to instrument the code in order to understasnd its performance or the effect of a change to hardware, software, etc.;
to isolate or "sandbox" untrusted code so it runs safely.

These uses can conver a wide range. For example, "so code written for one machine may run on another" can include:

simulating an old machine so old applications can run today;
simulating a machine which does not yet exist, to develop and test tools for the new machine;
building various hardware implementations that run the same instruction set but with varying attributes, for example to support "legacy" instructions at reduced hardware cost, or to add high-performance functions that are harder to exploit in a pure-hardware approach.

What is a simple example of DBT?

The SPARC instruction

add %r1,%r2, %r3

is translated by Shade [CK93] to

ld [%vs + R1], %l1 ld [%vs + R2], %l2 add %l1,%l2, %l3 st %l3, [%vs + R3] inc 4, %g7 br mainloop_chainable

The two ld instructions read simulated target registers %r1 and %r2 from the host "virtual state" data structure; the host add simulates the target add; the st saves the result back to the virtual state; the inc updates the target program counter, which is used frequently and is thus allocated to a global register; and the br returns to the simulator main loop to continue simulating at the instruction indcated by the new target program counter.

The branch to mainloop_chainable tells the main loop that once a successor translation is generated, this translation can have the branch target (mainloop_chainable) overwritten with the address of the successor translation, so that subsequent use of this translation can skip further calls to the main loop.

In normal use, Shade simulates several instructions together, often up to the next branch. Once a target register is cached in the host, it is reused without reloading it. For example, if %r3 is used by the next instruction and if the next instruction is translated together with the add shown above, then the value of %l3 can be used directly instead of reloading it, and one inc can update the target program counter for both instructions.

What are some common challenges to DBT?

Many, the details of which vary by application. Common issues include:

Correctness — is the DBT system correct "enough" to run all needed workloads?
Implementation effort — user-level DBT tends to be simpler and have higher performance when host and target systems are the same, but it tends to be hard to get all corner cases correct, as the DBT system is simulating both the instruction set and the OS interface. On the other hand, full-system DBT requires implementation of kernel-mode instructions, I/O devices, address translation, and system protection.
Performance, both absolute and performance variability.
Transparency — do applications need to be modified or constrained to run under DBT?

What are some important papers introducing DBT?

There are many papers on DBT, here are a few that introduce general topics in DBT:

[DS84]

Peter Deutsch and Alan M. Schiffman, ``Efficient Implementation of the Smalltalk-80 System,'' 11th Annual Symposium on Principles of Programming Languages (POPL-11), January 1984, pp.~297-302.

Describes a straightforward dynamic translator for Smalltalk-80 bytecodes and explores basic DBT issues and tradeoffs, including the use of speculation. It may be the first use of the word "translation" as used for dynamic translation.

[May87]

Cathy May, ``Mimic: A Fast S/370 Simulator,'' Proceedings of the ACM SIGPLAN 1987 Symposium on Interpreters and Interpretive Techniques; SIGPLAN Notices 22(7), June 1987, pp.~1-13.

Describes a sophisticated translator for a conventional machine instruction set. The paper discusses many of the most significant ideas and issues relating to dynamic translation.

[CK94]

Robert F. Cmelik and David Keppel, ``Shade: A Fast Instruction-Set Simulator for Execution Profiling,'' Proceedings of the 1994 ACM SIGMETRICS Conference on Measurement and Modeling of Computer Systems May 1994, pp.~128-137.

Describes DBT used for program instrumentation, discusses implementation tradeoffs, and surveys various related systems, including those based on other techniques.

[EA97]

Kemal Ebcioglu and Erik R. Altman, "DAISY: Dynamic Compilation for 100% Architectural Compatibility." In Proceedings of the 24th International Symposium on Computer Architecture, pp.~26-37, June 1997.

Describes DBT of various target systems, run on a VLIW, and various hardware features that improve DBT efficiency.

[DGB+03]

J. Dehnert, B. Grant, J. Banning, R. Johnson, T. Kistler, A. Klaiber, J. Mattson. The Transmeta code morphing software: using speculation, recovery, and adaptive retranslation to address real-life challenges. CGO 2003, pg. 15-24.

Discusses the use of specualation to improve DBT performance.

[Kep09]

D. Keppel, "Transmeta Crusoe: Hardware, Software, and Development" AMAS-BT workshop of ISCA 2009.

Describes many issues in the life cycle of a DBT project.

How does DBT compare to simulation and emulation?

The words "simulation" and "emulation" do not have exact definitions. Broadly speaking, they mean to get one machine to pretend to be another. DBT is one of several implementation techniques used in simulators/emulators. DBT is also used as an implementation techniqe for other systems, such as debugging, dynamic optimization, instrumentation, and isolation.

What approaches are similar or related to DBT?

Other approaches include:

Static translation is similar to dynamic translation, but done before the program starts running.
Decode-and-dispatch interpretation avoids translating the code to a new form, and instead decodes and executes it immediately.

When would I prefer static translation versus DBT?

Static translation costs do not affect program runtime. Thus, a static translator can use expensive analysis to generate good code, without slowing down the runtime.

However, static translators do not have access to runtime state. Thus, they often require a "fallback" strategy in case static translation fails. For example, finding all possible execution paths in a program is generally impossible, so a program execution may use a path that was not translated statically. It is also difficult to perform static translation on code which does not exist until runtime.

Static translators may use DBT for fallback.

When I would prefer interpretation versus DBT?

Decode-and-dispatch interpreters are typically simple to construct and get correct, compared to even simpler DBT systems. Simulators also tend to have consistent performance: a DBT system can "pause" while it translates, and overall performance depends on details of how often an application reuses translated instructions. In contrast, a decode-and-dispatch interpreter always pays about the same cost to execute an instruction.

However, instruction decode and dispatch tend to take more time than the actual operation. An interpreter has overhead to decode and dispatch for every instruction, whereas DBT typically decodes an instruction only the first time it is used, and greatly reduces dispatch overhead by collecting host code in contiguous blocks, then using the host program counter to perform dispatch between simulated instuctions.

DBT systems may also use interpreters to avoid translation of low-reuse code; to collect execution profiles to improve translation; and to avoid needing to implement corner cases in translated code.

What is direct execution?

A target instruction is implemented using a host instruction, rather than a sequence of host instructions. Direct execution is used most often when the host and target are identical or nearly identical. A risk of direct execution is "leakage", where target code had unintended effects on the host; or where the target has unintended access to the host. For this discussion, direct execution can be thought of as one DBT implementation strategy.

What is partial DBT?

When some parts of a program are run directly, but others are run using DBT. As an example, when virtualizing an OS to run in user space, the OS binary might be patched wherever there is a direct access to a hardware resource, or where the host machine's cost of virtualizing is too high, such as host virtualizing done via a trap, and the resource is used too frequently to run fast with many traps.

Does DBT provide an exact implementation of the target machine?

DBT systems are often inexact in order to save effort in DBT construction or to improve performance. An inexact implementation can often run many workloads, and DBT is often used where it is not important to run absolutely all workloads.

Further, it is sometimes hard to say if something is "exact". For example, on some machines, certain operations leave certain condition code bits undefined. One machine may set the bits to 0b00, while another sets them to 0b11. If DBT sets them as 0b10 some applications may fail, since some applications depend on undefined behavior. Even if the DBT system computes the same bits, an application may fail if the speed is different. For example many early 8086 programs depend on instructions taking a specified time, and executing too slow or too fast can cause an application to fail.

When runtime performance is important, it may be impractical to provide "exact" implementation, because DBT pauses execution while translating, and pauses may be unacceptable.

What are common issues for DBT?

Common DBT issues include correctness and performance.

Correctness is an issue because:

DBT systems are often complex.
The "source" is machine code, and mistranslation errors are even more difficult to diagnose than for less-complex systems like decode-and-dispatch interpreters.
DBT decouples decoding from execution and in doing so makes debugging yet more complex.
DBT often makes extensive use of caching and similar techniques that expose errors by "circumstances" such as the size of a DBT cache rather than a feature of the workload being run.

Performance is an issue because:

DBT introduces overheads: time spent running DBT "support" such as translation and storage management is time when the translated application is making no forward progress. Further, translations have overhead, so translated execution may be slower than the original.
DBT introduces variability: translation is interleaved with execution, so real progress occurs in bursts. Variability may be visible within an application, when an interactive application goes slower than faster; may be visible as run-to-run variation of one program; and may be visible when two programs run at different speeds under DBT than they do when run without DBT.

What are some common performance considerations for DBT?

Translator overhead — how long does it take the translator to generate the translations for an application?
Translation overhead — what is the speed of translated code compared to untranslated code?
Memory effects — how does DBT change the memory "footprint"

What are common metrics for DBT performance?

Three common metrics are:

Host instructions executed per target instruction translated.
Host instructions executed per target instruction simulated, excluding translation cost.
Overall execution time on a specific host system, compared to overall execution time on a specific target.

DBT metrics are problematic because sometimes the interest is in general behavior across a variety of platforms, and sometimes in specific behavior on specific platforms. For example, it is interesting to know both the approximate efficiency static vs. dynamic translation; and also interesting to know how the speed of a specific simualtion compares to the speed on the original hardware.

What are other interesting performance metrics?

Variability, within a given run of an application; across runs of a given application; and across applications. Variability within a given run is especially interesting for interactive applications, because humans remember when things are slow, even if the application overall is fast. Variability across runs of a program is interesting because it indicates what is normal behavior. Variability across applications indicates how application performance varies on a given system.

Why is Translation Cost Important?

A common argument is that translation cost is unimportant because any time there is new code, there is also I/O, and I/O is slow.

However, translation costs can be large enough to lead to user-visible delays, even on top of I/O costs.

In addition, some code is "new" even without I/O. For example, self-modifying code, adaptive compilation, or implementations that take advantage of the ability to retranslate on demand.

Does "gradual optimization" aka "gear shifting" give higher performance?

The answer depends in part on whether the goal is faster overall execution, or fewer user-visible delays. The answer also depends a lot on the detail costs of various translation strategies, and on program exection statistics.

Gradual optimization is less often beneficial than supposed for total program performance, but is sometimes beneficial to reduce variability in performance, such as slowness during program start.

Gradual optimization is often proposed on the theory a "quick and dirty" translation of low-use code saves enough translation overhead to reduce the overall running time. In practice, gradual retranslation often does not lead to significant savings, and sometimes leads to worse overall performance: high-use code is translated several times, thus increasing the translation cost of high-use code; and all code runs slowly before any of it runs quickly.

Of course the total cost depends on translation cost, translated code speed, and the statistics of use. Thus, there are situations where gradual optimization does yield higher performance. The point here is gradual optimization is not such an "obviously" good idea as it first seems. Again, the observation is that the initial translation cost is reduced, but high-running instructions pay a higher total translation cost, and all instructions have to run slowly before any of them run fast. (This observation goes at least as far back as Hansen's 1973 dissertation.)

However, if we look at interactive programs, we observe gradual optimization tends to reduce interactive pauses. Here, the highest translation costs are paid only when code is executed many times. Thus, gradual optimization shifts some translation costs from low-execution code to high-execution code. Translation costs are thus better interleaved with execution, and the performance variability during the run is reduced and so the user's perception of performance is improved.

What are issues for DBT on multiple cooperating processors?

A multiprocessor system with coherent data caches extends translation cache coherency, since writes by any processor may invalidate a translation. In uniprocessors doing DBT, it is common to write-protect a page and trap on writes. A similar strategy in a multiprocessor requires translation on one processor performs suitable synchronization with all other processors, to ensure writes by one processor are consistent with translation by another. Fine-grained strategies for self-modifying code need to be safe against concurrent writes by a different processor.

For systems built as a set of uniprocessors, there are no new coherency issues, but consistent rates of forward progress are more important as inter-processor cooperation gets more fine-grained.