A list of tools, organized according to various interesting features.
See also a listing of tools ordered
alphabetically.
Interesting things about the tools include:
Simulation and tracing tools can perform a wide variety of tasks.
Here are some common uses:
- atr: address tracing
Classical ``address tracing'' gathers a list of instruction
and/or data memory references performed by a system.
There are many variations, such as
tracing only targets of control transfers
or tracing other resources.
- db: debugging
A simulator can help with debugging
because: it runs deterministically and repeatably;
it is possible to query system state without disturbing it;
the simulator can be backed up to an earlier checkpoint in order
to implement reverse execution
(``foo is twelve ... what was the value of bar
in the routine we just returned from?'');
and because a simulator can perform consistency checks that cannot
be done on real hardware.
- otr: other tracing and event counting
A generalization of address tracing is to
trace, count, or categorize any kind of processor or
system event or resource use.
For example, a tool may collect
the common values of variables, register usage patterns,
interrupt or exception event counts, timing information, and so on.
- sim: (instruction set) simulation
Simulators commonly implement a processor architecture
that does not yet exist or no longer exists.
Simulators can also implement other devices such as
memory, bus, I/O devices, user input, and so on.
- tb: tool building
Here, ``tool building'' is meant to encompass tools
that are used to build other tools.
For example, a tool that builds various tracing tools is a
tool-building tool, whereas a configurable cache simulator
is not.
The usual distinction is that a tool-building tool can be
extended
[NG87,
NG88]
using a general-purpose programming language
(e.g., C, C++, ...), whereas a configurable tool is programmed
with a less powerful language, e.g., a list of
cache size, line size, associativity, etc.
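As a rough illustration of address tracing and event counting, the
following sketch interprets a toy accumulator machine and records every
instruction fetch and data reference. The instruction format, the opcode
names, and the function run_traced are hypothetical and are not taken
from any tool listed here.

```python
from collections import Counter

# Hypothetical toy machine: each instruction is (opcode, operand).
# Code addresses are list indices; data addresses index a separate memory.
PROGRAM = [
    ("load", 0),    # acc = memory[0]
    ("add", 1),     # acc += memory[1]
    ("store", 2),   # memory[2] = acc
    ("halt", 0),
]

def run_traced(program, memory):
    """Interpret `program`, returning (address trace, per-opcode counts)."""
    trace = []            # list of (kind, address) records
    counts = Counter()    # event counting: one bucket per opcode
    acc = 0
    pc = 0
    while True:
        op, arg = program[pc]
        trace.append(("ifetch", pc))
        counts[op] += 1
        if op == "load":
            trace.append(("read", arg))
            acc = memory[arg]
        elif op == "add":
            trace.append(("read", arg))
            acc += memory[arg]
        elif op == "store":
            trace.append(("write", arg))
            memory[arg] = acc
        elif op == "halt":
            break
        else:
            raise ValueError(f"unknown opcode {op!r}")
        pc += 1
    return trace, counts

memory = [5, 7, 0]
trace, counts = run_traced(PROGRAM, memory)
```

Filtering the trace (for example, keeping only the "write" records) gives
one of the variations mentioned above.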
In addition, some tools are used for
- No: Application errors
such as stores to random memory locations
may cause the simulation or tracing tool to fail
or produce spurious answers,
or may cause the application program to fail
in an unexpected (unintended) way or produce spurious answers.
- Some:
Certain kinds of errors are detected or serviced.
For example, application errors may be constrained
so that they can clobber application data in random ways
but cannot cause the simulation or tracing tool
to fail or produce erroneous results.
- Yes:
Application errors are detected and handled in some
predictable way.
Typically, ``predictable'' means that the error
model is the same as a reference for the target architecture.
- Yes*:
Selectable; turning on checking may slow execution.
THIS CATEGORY NOT YET ORGANIZED, SEE THE
SHADE PAPER.
- No
- Y1:
multiplexes all target processors on a single host processor
- Y=:
same number of host and target processors
(to be precise, there should also be a ``Y-'' category
for several host processors per target processor).
- Y+:
can multiplex a large number of target processors
onto a potentially smaller number of host processors
THIS CATEGORY NOT YET ORGANIZED, SEE THE
SHADE PAPER.
- No
- S: yes, but not all kinds.
For example, a tracing tool might execute the traced program
correctly but fail to trace signal handlers.
- Yes
(Detail)
THIS CATEGORY NOT YET ORGANIZED, SEE THE
SHADE PAPER.
- d: device
- u: user
- s: system
Note: the system mode may be marked in parentheses,
e.g. (s),
indicating that the host processor does not have a distinct
system mode in hardware,
but the tool is intended to work with
(simulate, trace, etc.) operating system code.
Processor simulators typically implement either a full processor
or just the user-mode part of the instruction set.
A full simulation is more precise and allows analysis of
operating systems, etc.
However, it also requires implementing the processor's
protected mode architecture, simulated devices, etc.
An alternative is to implement just the user-mode portion
of the ISA and to implement system calls (transitions to
protected mode) using simulator code rather than by simulating
the operating system.
OS emulation is typically less accurate.
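The user-mode alternative can be sketched as follows: the simulator
interprets ordinary instructions itself and, on a system call, runs host
code instead of simulating an operating system. The instruction set, the
call numbers (1 = write a value, 2 = exit), and all function names are
assumptions made up for this illustration.

```python
# Hypothetical user-mode-only simulator: the "syscall" instruction is
# serviced by simulator (host) code rather than by simulated OS code.

def emulate_syscall(number, arg, output):
    """Service a target system call directly in the simulator."""
    if number == 1:            # assumed convention: 1 = write a value
        output.append(arg)
        return 0               # success status returned to the target
    if number == 2:            # assumed convention: 2 = exit
        return None            # tells the main loop to stop
    return -1                  # unknown call: report an error code

def run_user_mode(program):
    """Run a list of (op, a, b) instructions; return (registers, output)."""
    regs = {"r0": 0, "r1": 0}
    output = []
    pc = 0
    while pc < len(program):
        op, a, b = program[pc]
        if op == "li":          # load immediate: regs[a] = b
            regs[a] = b
        elif op == "syscall":   # a = call number, b = argument register
            result = emulate_syscall(a, regs[b], output)
            if result is None:
                break
            regs["r0"] = result
        else:
            raise ValueError(f"unknown opcode {op!r}")
        pc += 1
    return regs, output

PROGRAM = [
    ("li", "r1", 42),
    ("syscall", 1, "r1"),   # write 42
    ("syscall", 2, "r1"),   # exit
    ("li", "r1", 99),       # never reached
]
regs, output = run_user_mode(PROGRAM)
```

Nothing about protected mode, devices, or kernel timing is modeled here,
which is one reason this style of emulation tends to be less accurate than
a full simulation.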
THIS CATEGORY NOT YET ORGANIZED, SEE THE
SHADE PAPER.
- asm: assembly code
- exe: executable code, no symbol table information
- exe*: executable code, with symbol table information
- hll: high-level language
``Decompilation technology'' here refers to the process of analyzing a
(machine code)
fragment and, through analysis, creating some higher-level
information about the fragment.
For simulation and tracing tools, decompilation is typically simpler
than
static program decompilation,
in which the goal is to read a binary program and produce source code
for it in some high-level language.
Simulation and tracing ``has it easy'' in comparison because it is
possible to get by with a lower-level representation and also to punt
hard problems to the runtime, when more information is available.
Even so, executable machine code is difficult to simulate and trace
efficiently (within 2 orders of magnitude of the performance
of native execution) when using ``naive'' instruction-by-instruction
translation,
because lots of relevant information is unavailable statically.
For example, every instruction is potentially a branch target;
every word of memory is potentially used both as code and as data;
every mutable word of memory is potentially executed, modified
(at runtime), and then executed again; and so on.
Executable machine code is also inherently (target) machine-dependent
and thus lexing and parsing the machine code is a source of potential
portability problems.
(Note that
some tools use a high-level input, so that relatively little
analysis is needed to determine the original programmer's intent,
at least at a level needed to simulate the program with modest efficiency.)
The following is a list of tools and papers that show how to reduce
the overhead of analyzing each instruction;
how to reduce the number of times each instruction is analyzed;
how to perform optimistic analysis and recover when it's wrong;
and how to improve the abstraction of machine-dependent parts of the
tool.
A short list:
A slightly longer list:
The ``simulation technology'' is how the original machine instructions
(or other source representation) get translated into an executable
representation that is suitable for simulation and/or tracing.
Choices include:
- ddi: Decode-and-dispatch
interpretation: the input representation for an operation is
fetched and decoded each time it is executed.
- pdi: Predecode
interpretation:
the input form is translated into a form that is faster to
decode; that form is then saved so that successive invocations
(e.g. subsequent iterations of a loop) need only fetch and
decode the ``fast'' form.
Note that
- The translation may happen before program invocation,
during startup, or incrementally during execution; and
that the translated form may be discarded and regenerated.
- If the original instructions change, the translated
form becomes incoherent with the original
representation; a system that fails to update
(invalidate) the translated form before it is then
reexecuted will simulate the old instructions
instead of the new ones. For some systems (e.g., those
with hardware coherent instruction caches) such
behavior is erroneous.
- tci: Threaded code
interpretation:
a particularly common and efficient form of predecode
interpretation.
- scc: Static
cross-compilation:
The input form is statically (before program execution)
translated from the target instruction set to the host
instruction set.
Note that:
- All translation costs are paid statically, so runtime
efficiency may be very good.
In contrast, dynamic analysis and transformation costs
are paid during simulation, and so it may be necessary
to ``cut corners'' with dynamic translation in order to
manage the runtime cost.
Cutting corners may affect both the quality of
analysis of the original program and the quality of
code generation.
- Instructions that cannot be located statically
or which do not exist until runtime cannot be
translated statically.
- Historically, it is difficult to distinguish between
memory words that are used for instructions and those
that are used for data; translating data as
instructions may cause errors.
- Translating to machine code allows the use of the
host hardware's instruction fetch/decode/dispatch
hardware to help simulate the target's.
- Translating to machine code makes it easier to
translate clumps of target instructions;
most dispatching between target instructions is thus
eliminated.
- dcc: Dynamic Cross
Compilation:
Host machine code is generated dynamically, as the program
runs.
Note that:
- Translating ``on demand'' eases the problem of
determining what is code and what is data; a given
word may even be used as both code and data.
- Translating to machine code is often more expensive
than translating to other representations; both the
cost of generating the machine code and the cost of
executing it contribute to the overall execution time.
- Theoretical performance advantages from dynamic
cross-compilation may be overwhelmed by the host's
increased cache miss ratio due to dynamic
cross-compilation's larger code sizes
[Pittman 95].
- aug: Augmentation:
cross-compilation
where the host and target are the same machine.
Note that
- Augmentation is typically done statically.
- There is a fine line between having identical host and
target machines (augmentation) and having
nearly-identical machines in which just a few
features (e.g. memory references) are simulated, but
in which the bulk of instruction sets and encodings are
identical.
- emu: Emulation:
Where software simulation is sped up using hardware
assistance.
``Hardware assistance'' might include special compatibility
modes but might also include careful use of page mappings.
(See ``emulation''.)
Move an instruction from one place to another,
but execute with the same host and target.
Compile instruction sequences from a target machine
to run on a host machine.
Simulation and tracing tools that perform execution
using interpretation;
the original executable code is neither preprocessed
(augmentation or static cross-compilation)
nor is it dynamically compiled to host code.
Statically cross-compile instruction sequences from a
target machine to run on some host machine.
Augmentation-based tracing tools run host instructions natively,
but some instructions are simulated.
For example,
Proteus executes arithmetic and stack-relative memory reference
instructions natively,
and simulates load and store instructions that may reference
shared memory.
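A minimal sketch of the augmentation idea, under heavy simplification: a
program for a toy machine is rewritten statically so that every load and
store is preceded by an added "trace" instruction, while all other
instructions are left untouched. The toy instruction set and the names
augment and execute are hypothetical; this is not how Proteus or any other
listed tool is implemented. The toy has no branches, so the sketch does
not need to adjust branch targets after inserting instructions, which a
real rewriting tool would have to do.

```python
def augment(program):
    """Return a copy of `program` with a trace op before each load/store."""
    augmented = []
    for op, arg in program:
        if op in ("load", "store"):
            augmented.append(("trace", (op, arg)))   # inserted instruction
        augmented.append((op, arg))                  # original, unchanged
    return augmented

def execute(program, memory):
    """Run a straight-line program; return the collected trace records."""
    acc = 0
    trace = []
    for op, arg in program:
        if op == "load":
            acc = memory[arg]
        elif op == "addi":
            acc += arg
        elif op == "store":
            memory[arg] = acc
        elif op == "trace":
            trace.append(arg)
        else:
            raise ValueError(f"unknown opcode {op!r}")
    return trace

ORIGINAL = [("load", 0), ("addi", 1), ("store", 1)]
AUGMENTED = augment(ORIGINAL)

plain_memory = [10, 0]
plain_trace = execute(ORIGINAL, plain_memory)

traced_memory = [10, 0]
traced_trace = execute(AUGMENTED, traced_memory)
```

The augmented program computes the same result as the original; the only
difference is the extra work done around memory references.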
Some tools rely on having multiple strategies
in order to achieve their desired functionality.
For the purposes here,
``untraced native execution''
counts as a translator.
- 1951: EDSAC Debug
(displaced execution, native execution)
- 1991: Dynascope
(interpretation, native execution)
- 1992: Accelerator
(static cross-compilation, interpretation)
- 1993: MINT
(dynamic cross-compilation, interpretation)
- 1993: Vest and mx
(static cross-compilation, interpretation)
- 1994: Executor
(interpretation, dynamic cross-compilation)
- 1994: SimICS
(interpretation, dynamic cross-compilation)
- 1995: FreePort Express
(static cross-compilation, interpretation;
uses Vest and mx technology)
Some tools/papers not listed under other headings.
THIS CATEGORY NOT YET ORGANIZED.
Generally, the closer the match between the
host
and the
target,
the easier it is to write a simulator,
and the better the efficiency.
Possible mismatches include:
- Byte or word size.
For example,
Kx10
simulates a machine with 36-bit words;
it runs on machines with 32-bit and 64-bit words.
- Numeric representation.
For example, whether integers are sign-magnitude,
one's complement, or two's complement.
Or, for example,
Vest,
which simulates all VAX floating-point formats
on a host machine that lacks some of the VAX formats.
- Which instruction combinations cause exceptions,
and how those exceptions are reported.
- Synchronization and atomicity.
In particular, the details may be messy
where the target machine synchronizes
implicitly and the host does so explicitly,
since all target operations that might
cause synchronization generally need to be treated as if they
do.
Note that target support for self-modifying code may be treated as a
special case of synchronization.
For example, target machines with no caches or unified instruction and
data caches will typically write instructions using ordinary store
instructions.
Therefore, all store instructions must be treated as potential
code-modifying instructions.
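The consequence for a simulator that caches decoded instructions can be
sketched as follows. The machine, its encoding (opcode in the high byte,
operand in the low byte), and the function run are made up for this
example. Code and data share one memory, so every store invalidates any
saved decoding of the address it writes; the invalidate_on_store flag
exists only to show what goes wrong when that step is skipped.

```python
OP_HALT, OP_ADDI, OP_LOADT, OP_STORET, OP_DJNZ = range(5)

def encode(opcode, operand=0):
    return (opcode << 8) | operand

def run(memory, counter, invalidate_on_store=True):
    """Interpret `memory` with a predecode cache; return the accumulator."""
    acc = tmp = pc = 0
    predecoded = {}                     # address -> (opcode, operand)
    while True:
        if pc not in predecoded:
            word = memory[pc]
            predecoded[pc] = (word >> 8, word & 0xFF)
        opcode, operand = predecoded[pc]
        pc += 1
        if opcode == OP_ADDI:           # acc += operand
            acc += operand
        elif opcode == OP_LOADT:        # tmp = memory[operand]
            tmp = memory[operand]
        elif opcode == OP_STORET:       # memory[operand] = tmp
            memory[operand] = tmp
            if invalidate_on_store:
                # Any store may overwrite code: drop the stale decoding.
                predecoded.pop(operand, None)
        elif opcode == OP_DJNZ:         # decrement counter, jump if nonzero
            counter -= 1
            if counter != 0:
                pc = operand
        elif opcode == OP_HALT:
            return acc
        else:
            raise ValueError(f"bad opcode {opcode}")

IMAGE = [
    encode(OP_ADDI, 1),     # 0: executed, overwritten, executed again
    encode(OP_LOADT, 5),    # 1: tmp = replacement instruction word
    encode(OP_STORET, 0),   # 2: overwrite the instruction at address 0
    encode(OP_DJNZ, 0),     # 3: loop back once
    encode(OP_HALT),        # 4
    encode(OP_ADDI, 10),    # 5: data that becomes code
]

coherent = run(list(IMAGE), 2, invalidate_on_store=True)
stale = run(list(IMAGE), 2, invalidate_on_store=False)
```

With invalidation the second pass executes the new instruction; without
it the simulator silently re-executes the old one.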
For timing-accurate simulation
(see Talisman
and RSIM),
some matches between the host and target can improve the efficiency,
but many do not.
THIS CATEGORY NOT YET ORGANIZED.
Some instruction-set simulators also perform timing simulation.
Timing is not strictly an element of instruction-set simulation, but is often
useful, since one major use for instruction set simulation is to
collect information for predicting or analyzing performance.
Important features of timing simulation include both the processor
pipeline and the memory system
(see Talisman
and RSIM).
There are many ways to measure performance.
Some common metrics include:
- host instructions executed per target instruction executed;
- host cycles executed per target instruction executed;
- relative wallclock time of host and target.
Metrics that are more abstract have the advantage that they are
typically
simple to reason about
and applicable across a variety of implementations.
For example, host instructions may be counted relatively easily for each
of a variety of target instructions,
and the counts are relatively isolated from the structure of the caches
and microarchitecture.
Conversely, concrete metrics tend to more accurately reflect all related
costs.
For example, the effects of caches and microarchitectures are
included.
It is worth noting that few reports give enough information about the
measurement methodology to allow a valid comparison.
For example, if dilation is ``typically'' 20x, what is ``typical'', and
what is the performance for ``non-typical'' workloads?
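The three metrics listed above are simple ratios, and one way to avoid an
unqualified ``typical'' figure is to report the range over all measured
workloads. The sketch below does both. The field names, the function
names, and every number in it are invented purely for illustration; they
are not measurements of any tool.

```python
def slowdown_metrics(host_instructions, host_cycles, target_instructions,
                     simulated_seconds, native_seconds):
    """Compute the three ratios for one workload."""
    return {
        "host_insns_per_target_insn": host_instructions / target_instructions,
        "host_cycles_per_target_insn": host_cycles / target_instructions,
        "dilation": simulated_seconds / native_seconds,
    }

def dilation_range(per_workload):
    """Return (min, max) dilation over a dict of workload -> metrics."""
    values = [m["dilation"] for m in per_workload.values()]
    return min(values), max(values)

# Invented example numbers, one entry per hypothetical workload.
results = {
    "compute_bound": slowdown_metrics(20_000_000, 30_000_000, 1_000_000,
                                      5.0, 0.25),
    "io_bound": slowdown_metrics(8_000_000, 16_000_000, 1_000_000,
                                 2.0, 0.5),
}
lowest, highest = dilation_range(results)
```

Reporting the range (here 4x to 20x for the invented data) alongside the
workload descriptions says more than a single ``typical'' number.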
THIS CATEGORY NOT YET ORGANIZED.
The status of each tool:
- info:
only information is available
- nonprod:
the tool is available but is not a product
- product:
the tool is a commercial product
From instruction-set simulation and tracing