• The New Mexico Statue University Parallel Trace Archive
  • The Paradyn system uses runtime (during execution) code generation (instrumentation).
  •    Method for verifying contiquity of a binary translated block of instructions
       by attaching a compare and/or branch instruction to predecessor block of
       A method for enabling a first block of instructions to verify whether the
       first block of instructions follows a second block of instructions in an
       order of execution. The method includes appending a compare instruction to
       the first block of instructions. The compare instruction compares a first
       value from the first block of instructions with a second value from the
       second block of instructions, which precedes the first block of instructions
       in the order of execution. The method further includes appending a branching
       instruction to the first block of instructions. The branching instruction is
       executed in response to the first value being unequal to the second value.
       The branching instruction, when executed, branches to an alternative look-up
       routine to obtain a block of instructions that follows the second block of
       instructions in the order of execution.
    What is claimed is: 
        1. A computer-implemented method for enabling a first block of instructions to verify whether the first block of instructions
    follows a second block of instructions in an order of execution the method comprising the steps of: 
         a) appending a compare instruction to the first block of instructions, the compare instruction when executed compares a
         first value from the first block of instructions with a second value from the second block of instructions, said second block
         of instructions preceding said first block of instructions in the order of execution; and 
         b) appending a branching instruction to the first block of instructions, said branching instruction is executed in response to
         the first value being unequal to the second value, said branching instruction, when executed, branches to an alternative
         look-up routine to obtain a block of instructions that follows the second block of instructions in the order of execution.
    U.S. REFERENCES:   (No patents reference this one) 
           De Nicolas et al.
                       11 /1992 
                             Translating a dynamic transfer control instruction address in a simulated CPU
    ABSTRACT:   The system and method of this invention simulates the flow of control of an application program targeted for a
    specific instruction set of a specific processor by utilizing a simulator running on a second processing system having a second
    processor with a different instruction set. The simulator reduces the number of translated instructions needed to simulate the
    flow of control of the first processor instructions when translating the address of the next executable instruction resulting from a
    dynamic transfer of control, i.e., resulting from a return instruction. The simulator compares the address that is loaded at run time
    by the return instruction with the return address previously executed by that instruction. If the last return address matches, the
    location of the return is the same. If the last return does not match, a translate look-aside buffer is used to determine the address.
    If the translate look-aside buffer does not find the address, then a binary tree look up mechanism is used to determine the
    address of the next instruction after a return. The performance of the simulator is enhanced by utilizing the easiest approaches
    first in the chance that a translated instruction will result most efficiently.
                       2 /1994 
                             Identifying plausible variable length machine code of selecting address in numerical
                             sequence, decoding code strings, and following execution transfer paths 
    ABSTRACT:   Information about the location of untranslated instructions in an original program is discovered during execution of a
    partial translation of the program, and that information is used later during re-translation of the original program. Preferably the
    information includes origin addresses of translated instructions and corresponding destination address of untranslated
    instructions of execution transfers that occur during the execution of the partial translation. Preferably this feedback of
    information from execution to re-translation is performed after each execution of the translated program so that virtually all of
    the instructions in the original program will eventually be located and translated. To provide an indication of the fraction of the
    code that has been translated, the program is scanned to find plausible code in the areas of memory that do not contain translated
    code. The plausible code is identified by selecting addresses according to three different scanning modes and attempting to
    decode variable-length instructions beginning at the selected addresses. The scanning modes include a first mode in which
    addresses are selected in numerical sequence by a scan pointer, a second mode in which addresses are selected in
    instruction-length sequence by an instruction decode pointer, and a third mode in which the selected addresses are destination
    addresses of previously-decoded execution transfer instructions.
    hat is claimed is: 
        26. A method of operating a digital computer having an addressable memory, said addressable memory containing a computer
    program, said computer program including instructions and data at respective address locations of said addressable memory,
    each of said instructions consisting of contents of a variable number of contiguous ones of said address locations depending upon
    an operation specified by said each of said instructions, said method identifying address locations of said addressable memory
    that appear to contain said instructions of said computer program, said method comprising the steps of: 
         a) selecting program addresses in numerical sequence, and attempting to decode an instruction in said addressable
         memory at each program address until an initial instruction is decoded; and when said initial instruction is decoded, then 
         b) attempting to decode a string of instructions immediately following said initial instruction until an execution transfer
         instruction is decoded, and when an attempt to decode an instruction fails, continuing said selecting program addresses
         and said attempting to decode an instruction at each program address as set out in said step a), and when an execution
         transfer instruction is decoded, then 
         c) attempting to decode an instruction at a destination address of the decoded execution transfer instruction, and when
         the attempt to decode an instruction at the destination address of the decoded execution transfer instruction fails,
         continuing said selecting program addresses and said attempting to decode an instruction at each program address as set
         out in step a), and when the attempt to decode an instruction at the destination address of the decoded execution transfer
         instruction succeeds, then identifying, as said address locations of said addressable memory that appear to contain said
         instructions of said computer program, the address locations including said initial instruction and said string of instructions
         including said execution transfer instruction, 
         wherein some program addresses of said computer program are known to contain instructions, and wherein said step a)
         skips over the program addresses that are known to contain instructions, 
         wherein the decoding of an instruction is not permitted when an instruction being decoded partially overlaps program
         addresses known to contain an instruction, and 
         wherein said step a) skips over a program address containing a value that is included in a predefined set of values,
         regardless of whether an attempt to decode an instruction starting at the program address would be successful, wherein
         said set of values includes values that indicate instructions having a length of one program address location, said set of
         values includes opcodes of privileged instructions, and said set of values includes the value of zero, and 
         wherein said step a) skips over a program address that is the first address of a string of at least four printable ASCII
         alphanumeric characters.
           Scalzi et al.
                       9 /1996 
                             Method of using a target processor to execute programs of a source architecture that
                             uses multiple address spaces 
    ABSTRACT:   A method of utilizing large virtual addressing in a target computer to implement an instruction set translator (1ST)
    for dynamically translating the machine language instructions of an alien source computer into a set of functionally equivalent
    target computer machine language instructions, providing in the target machine, an execution environment for source machine
    operating systems, application subsystems, and applications. The target system provides a unique pointer table in target virtual
    address space that connects each source program instruction in the multiple source virtual address spaces to a target instruction
    translation which emulates the function of that source instruction in the target system. The target system efficiently stores the
    translated executable source programs by actually storing only one copy of any source program, regardless of the number of
    source address spaces in which the source program exists. The target system efficiently manages dynamic changes in the
    source machine storage, accommodating the nature of a preemptive, multitasking source operating system. The target system
    preserves the security and data integrity for the source programs on a par with their security and data integrity obtainable when
    executing in source processors (i.e. having the source architecture as their native architecture). The target computer execution
    maintains source-architected logical separations between programs and data executing in different source address
    spaces--without a need for the target system to be aware of the source virtual address spaces.
    Having thus described our invention, what we claim as new and desire to secure by Letters patent is: 
        1. An emulation method for executing individual source instructions in a target processor to execute source programs
    requiring source processor features not built into the target processor, comprising the steps of: 
         inputting instructions of a source processor program to an emulation target processor having significant excess virtual
         addressing capacity compared to a virtual addressing capacity required for a source processor to natively execute the
         source processor program, and supporting multiple source virtual address spaces in the operation of the source
         building a virtual ITM (instruction translation map) in a target virtual address space supported by the target processor, the
         virtual ITM containing an ITM entry for each source instruction addressable unit, each source instruction addressable
         unit beginning on a source storage instruction boundary, structuring each ITM entry for containing a translation address
         to a target translation program that executes a source instruction having a source address associated with the ITM entry,
         determining a ratio R by dividing the length of each ITM entry by the length of each source instruction addressable unit, 
         accessing an ITM entry for an executing source instruction by: 
             generating a source aggregate virtual address for the source instruction by combining the source address of the
             source instruction with a source address space identifier of a source virtual address space containing the
             multiplying the source aggregate virtual address by R to obtain a target virtual address component, and
             inserting the target virtual address component into a predetermined component location in a target virtual address
             to generate an ITM entry target virtual address for locating an ITM entry associated with the source instruction in
             order to obtain a one-to-one addressing relationship between ITM entry target virtual addresses and source
             instruction addresses.
                       4 /1997 
                             Method and apparatus for the transparent emulation of an existing instruction-set
                             architecture by an arbitrary underlying instruction-set architecture 
    ABSTRACT:   The invention provides means and methods for extending an instruction-set architecture without impacting the
    software interface. This circumvents all software compatibility issues, and allows legacy software to benefit from new
    architectural extensions without recompilation and reassembly. The means employed are a translation engine for translating
    sequences of old architecture instructions into primary, new architecture instructions, and an extended instruction (EI) cache
    memory for storing the translations. A processor requesting a sequence of instructions will look first to the EI-cache for a
    translation, and if translations are unavailable, will look to a conventional cache memory for the sequence, and finally, if still
    unavailable, will look to a main memory.
    I claim: 
        1. A method for translating a series of one or more instructions of a first semantic type into one or more instructions of a
    second semantic type, comprising the steps of: 
         providing a first memory; 
         providing a second memory; 
         translating a sequence of instructions of the first semantic type stored in the first memory into one or more primary
         instructions of the second semantic type and storing the instructions of the second type in the second memory; 
         upon a request from the processor for the sequence of instructions of the first semantic type: 
             providing the corresponding instructions of the second semantic type if available in the second memory;
             providing the sequence of instructions of the first semantic type if the corresponding instructions of the second
             semantic type are not available in the second memory.
    [Others found.]
           Kareda et al.
                     8 /1982 
                            Address control system for software simulation 
    ABSTRACT:   An address control system for software simulation in a virtual machine system having a virtual storage function.
    When a simulator program is simulating an instruction of a program to be simulated, an address translation of an operand address
    in the program to be simulated is achieved using a translation lookaside buffer, thereby greatly reducing the overhead for the
    address translation during the simulator program execution. 
                     1 /1987 
                            Emulating computer 
    ABSTRACT:   An apparatus and method is disclosed for providing an emulating computer. The present invention consists of a
    computer having a storage area, processing unit, control circuits and translation circuit. The original instructions are first loaded
    into the storage area. When the processor attempts to operate an instruction the control circuit loads a section of the instructions
    into the translating circuit. These instructions are then translated and stored in a memory area of the translating circuit having the
    address of the original instruction. The processor unit then accesses the storage area and retrieves the translated instruction. 
    What is claimed is: 
        7. A method of emulating a computer comprising the steps of: 
         transmitting an instruction to a processing unit; 
         checking a cache memory for a translated instruction; 
         loading an instruction block into an instruction memory if said translated instruction is not in said cache memory; 
         translating an instruction of said instruction block providing a translated instruction; 
         storing said translated instruction in said cache memory; and 
         transmitting said translated instruction from said cache memory to said processing unit. 
  • Find/write up a 1984 bib cite on a Bell Labs project to emulate the PDP-11. They implemented it in portable FORTRAN (minus some host-specific work around to handle random access files for swapping the simulated memory). They were able to boot Unix straight from distribution tapes. The work as done 81-82, I believe. The intention was to simplify bootstrapping Unix on new hardware in environments that did not have an existing Unix machine. The objective sort of failed since by the time they got it working, Unix was so succesful, few locations were that desperate. Their slowdown was about 120, and among other ideas they said that they could re-implement the interpreter kernel using threaded code for performance.

  • TeraGen emulating microcontroller. Following from an EE Times article by David Lammers, TeraGen architecture primes single engine for multiple instruction sets (01/25/99, 02:08:32 PM EDT).

  • DATE:  June 10, Thursday, 2:30
    TITLE:  Jalapeno --- a new Java Virtual Machine for Servers
    SPEAKER: Vivek Sarkar
             IBM T. J. Watson Research Center
    In this talk, we give an overview of the Jalapeno Java Virtual Machine
    (JVM) research project at the IBM T. J. Watson Research Center.  The
    goal of Jalapeno is to expand the frontier of JVM technologies for
    server nodes --- especially in the areas of dynamic optimized
    compilation and specialization, scalable exploitation of multiple
    processors in SMPs, and the use of a JVM as a 7x24 application server.
    The Jalapeno JVM has two key distinguishing features.  First, the
    Jalapeno JVM takes a compile-only approach to program execution.
    Instead of providing both an interpreter and a JIT/dynamic compiler,
    it provides two dynamic compilers --- a quick non-optimizing
    "baseline" compiler, and a slower production-strength optimizing
    compiler.  Both compilers share the same interfaces with the rest of
    the JVM, thus making it easy to mix execution of unoptimized methods
    with optimized methods.  Second, the Jalapeno JVM is itself
    implemented in Java!  This design choice brings with it several
    advantages as well as technical challenges. The advantages include a
    uniform memory space for JVM objects and application objects, and ease
    of portability.  The key technical challenge is to overcome the large
    performance penalties of executing Java code (compared to native code)
    that has been the experience of current JVMs; if we succeed in doing
    so, we will simultaneously improve the performance of our JVM as well
    as of the applications running on our JVM.
    The Jalapeno project was initiated in January 1998 and is still work
    in progress.  This talk will highlight our design decisions and early
    experiences in working towards our goal of building a high-performance
    JVM for SMP servers.
  • Get info on EPP (Ball, Larus, et. al.).

  • Get info on xtrace.
  • Jack Veenstra (MINT) is probably at MIPS (1998).
  • Paint, a PA-RISC simulator based on Mint. See also
  • Get more info on this quote:
    "What's visible about software is the effect it has on something else. If two thoroughly different programs have the same observable effects, you cannot tell which one has executed. If a given portion of a program has no observable effects, then you have no way of knowing if it is executing, if it has finished, if it got part way through and then stopped, or if it produced 'the right answer.' Programmers nearly always must rely on highly indirect measures to determine what happens when their programs execute. This is one reason why debugging is so difficult."
    [Digital Woes, Lauren Ruth Weiner, 1993, Addison-Wesley]
  • Dixie: M. Ferna'ndez and R. Espasa. Dixie Architecture Reference Manual (version 1.0). TR UPC-DAC-1998-55, Computer Architecture Department, Universitat Politecnica de Catalunya-Barcelona, 1998.

  • SimpleScaler: D. Burger and T. M. Austin. The SimpleScalar Tool Set, Version 2.0. TR CS-TR-97-1342, computer Sciences Department, University of Wisconsin-Madison, 1997.

  • Get, read and incorporate: R. Uhlig and T. N. Mudge, ``Trace-Driven Memory Simulation: A Survey'', ACM Computing Surveys, pages 128-170, Feb. 1997.
  • Get, read, and incorporate: J. E Veenstra and R. J. Fowler. MINT: A front end for efficient simulation of shared-memory multiprocessors. In "Proceedings of the Second Interational Workshop on Modeling, Analysis and Simulation of Computer and Telecommuncation Systems (MASCOTS), pages 201-207, January 1994.
  • Get, read, and incorporate: B.-S. Yang, S.-M. Moon, S. Park, J. Lee, S. Lee, J. Lee, K. Ebcioglu, and E. Altman. LaTTe: A Java VM Just-in-Time Compiler with Fast and Efficient Register Allocation. In 1999 International Conference on Paralle Architectures and Compilatin Techniques, Ocotber 1999.
  • Get, read, and incorporate: MichaelCierniak, James. M. Stichnoth, Guie-Yuan Lueh. "Support for Garbage Collection at Every Instruction in a Java Compiler." In 1999 ACM SigPLAN Conference on Program Language Design and Implementation (POLDI) 1999.
  • Get, read, and incorporate: A. Krall, and M. Probst> "Monitors and Exceptions: How to Implement Java Efficiently." In AMC 1998 Workshop on Java for High-Performance Network Computing, 1998.
  • Get, read and incorproate: T. Wilkinson, "Kaffe: A JIT and Interpreting Virtual Machine to Run Java Code." See
  • Get, read, and incorporate: Kemal Ebcioglu and Erik R. Altman, "DAISY: Dynamic Compilation for 100% Architectural Compatibility." In Proceedings of the 24th International Symposium on Computer Architecture, pp. 26-37, June 1997.
  • MAME is a ``Multiple Arcade Machine Emulator''. See for more info. Note that MAME even runs on a Digita camera! Courtesy of James W. Surine.
  • [Transmeta 00]
    %A Alexander Klaiber
    %I Transmeta Corporation
    %T The Technology Behind Crusoe(tm) Processors
    %R From
    	as of 2002/08/19.
    %D 2000
    White paper on Crusoe, emulation.

  • [Halfhill 94b]
    \bibitem{Halfhill:94} Tom. R. Halfhill, ``Apple's 680x0 Emulation for Unix'' Byte, April 1994

  • [Scantlin 96]
    Scantlin. ``RISC architecture computer configured for emulation of the instruction set of a target computer.'' 1996/11, United States Patent #5574927.

  • [Baraz et al 98]
    Baraz et al. ``Method for verifying contiquity of a binary translated block of instructions by attaching a compare and/or branch instruction to predecessor block of instructions.'' 1998/02, United States Patent #5721927.

  • [Klein et al 98]
    Klein et al. ``Optimizing hardware and software co-simulator.'' 1998/06, United States Patent #5768567.

  • [Bunza 98]
    Bunza. ``System and method for simulation of computer systems combining hardware and software interaction.'' 1998/11, U.S. Patent #5838948

  • Find, read and incorporate:
    Kumar et al., emulation Verification of the Motorola 68060, Proceedings,
    ICCD, 1995, pp. 150-158.
    Note et al., Rapid Prototyping of DSP Systems:
    Requirements and Solutions, 6th IEEE Int'l Wkshp on
    RSP, 1995,
    pp. 40-47. 
    Tremblay et al., A Fast and Flexible Performance Simulator for
    Micro-Architecture Trade-off Analysis on
    Ultrasparc-1 '1995, pp 2. 
    Rosenberg, J.M., Dictionary of Computers, Information Processing &
    Telecommunications, John Wiley & Sons, pp 382

  • Transitive Technologies Ltd., a spinoff of Manchester University, HQ'd in San Diego, CA. ``Dynamite'' dynamic translator. Claims can do several machines; can run above or below the OS; is not quite shrink-wrap -- ``a licensing proposition''; can do instruction decoding very quickly; has issues around interrupt/exception handling; can lead to speedups during translation; less than six months to port; is ready for production. Story at EE Times, 11 June 2001.
  • ACM Transactions on Computer Systems (TOCS) Volume 15 , Issue 4 (November 1997)

    Continuous profiling: where have all the cycles gone?

    Authors Jennifer M. Anderson Digital Equipment Corp., Palo Alto, CA William E. Weihl Digital Equipment Corporation, Palo Alto, CA Lance M. Berc Digital Equipment Corp., Palo Alto, CA Jeffrey Dean Digital Equipment Corp., Palo Alto, CA Sanjay Ghemawat Digital Equipment Corp., Palo Alto, CA Monika R. Henzinger Digital Equipment Corp., Palo Alto, CA Shun-Tak A. Leung Digital Equipment Corporation, Palo Alto, CA Richard L. Sites Digital Equipment Corporation, Palo Alto, CA Mark T. Vandevoorde Digital Equipment Corporation, Palo Alto, CA Carl A. Waldspurger Digital Equipment Corporation, Palo Alto, CA

    Publisher ACM Press New York, NY, USA Pages: 357 - 390 Periodical-Issue-Article Year of Publication: 1997 ISSN:0734-2071

    ABSTRACT This article describes the Digital Continuous Profiling Infrastructure, a sampling-based profiling system designed to run continuously on production systems. The system supports multiprocessors, works on unmodified executables, and collects profiles for entire systems, including user programs, shared libraries, and the operating system kernel. Samples are collected at a high rate (over 5200 samples/sec. per 333MHz processor), yet with low overhead (1-3% slowdown for most workloads). Analysis tools supplied with the profiling system use the sample data to produce a precise and accurate accounting, down to the level of pipeline stalls incurred by individual instructions, of where time is bring spent. When instructions incur stalls, the tools identify possible reasons, such as cache misses, branch mispredictions, and functional unit contention. The fine-grained instruction-level analysis guides users and automated optimizers to the causes of performance problems and provides important insights for fixing them.

    • Dynamically Recompiling ARM Emulator, Julian Brown (May 1, 2000) WEB
    • Generator: A Sega Genesis Emulator, James Ponder (1997-1998) DOC
    • A Robust Foundation Binary Translation of x86 Code, Liang Chuan Hsu, University of Illinois, DOC WEB
    • DAISY: Dynamic Compilation for 100% Architectural Compatibility, Kemal Ebcioglu, Erik R. Altman, IBM Research Division Yorktown Center [Computer Science RC 20538 08/05/96] DOC WEB
    • DAISY Dynamic Binary Translation Software, Erik R. Altman, Kemal Ebcioglu, IBM T. J. Watson Research Center, 2000 DOC SOURCE WEB
    • DAISY: Dynamic Compilation for 100% Architectural Compatibility, Kemal Ebcioglu, Erik R. Altman, IBM T. J. Watson Research Center [Presentation] PRESENTATION WEB
    • UQBT: A Resourceable and Retargetable Binary Translator (Web Page), Cristina Cifuentes, Mike Van Emmaik (Queensland University), Norman Ramsey (Harvard University), December 1999 DOC WEB
    • UQBT: Adaptable Binary Translation at Low Cost, Cristina Cifuentes, Mike Van Emmerik, PACT99, DOC WEB
    • Machine-Adaptable Dynamic Binary Translation, David Ung, Cristina Cifuentes (University of Queensland , 1999?) DOC WEB
    • Binary Translation: Static, Dynamic, Retargetable?, Cristina Cifuentes (University of Queensland,, Vishv Malhotra (University of Tasmania, 1996? DOC WEB
    • The Design of a Resourceable and Retargetable Binary Translator, Cristina Cifuentes, Mike Van Emmerik (Queensland University), Norman Ramsey (Virgina University), 1999? DOC WEB
    • Transparent Dynamic Optimization, Vasanth Bala (, Evelyn Duesterwald, Sanjeev Banerjia, HPL Cambridge, June 1997, DOC(short), DOC(large), WEB [DYNAMO]
    • Dynamo: A Transparent Dynamic Optimization System, Vasanth Bala (, Evelyn Duesterwald (, Sanjeev Banerjia (, HPL Cambridge, 2000, DOC WEB
    • Microprocessor = Dynamic Optimization Software + CPU Hardware, HPL (DYNAMO), TALK, WEB
    • Dynamo as VM Accelerator, HPL (DYNAMO), TALK, WEB
    • An Out-of-Order Execution Technique for Runtime Binary Translators, Bich C. Le (PDL-HP, 1998) [1008 ACM 1-58113-107-0/98/0010] DOC WEB
    • Migrating a CISC Computer Family onto RISC via Object Code Translation, Kristy Andrews, Duane Sand (Tandem Computers Inc. 1992) [1992 ACM O-89791-535-6/92/0010/0213] DOC WEB
    • Optimizations and Oracle Parallelism with Dynamic Translation, Kemal Ebciuglu, Erik R. Altman, Sumedh Sathaye, Michael Gschwind (IBM Yorktown Center, { , , , 1999) [1072-4451/99 1999 IEEE] DOC PRESENTATION WEB
    • BOA: Targeting Multi-GHz with Binary Translation, Erik Altman and others, IBM Research, PRESENTATION WEB
    • Dynamic and Transparent Binary Translation, Michael Gschwind, Erik R. Altman, Sumedh Sathaye, Paul Ledak, David Appenzeller, PACT99, DOC WEB
    • Binary Translation, Richard L. Sites, Anton Chernoff, Matthew B. Kirk, Maurice P. Marks, Scott G. Robinson (Digital 1993) [ACM 1993, vol.36 no. 2] DOC WB
    • Binary Translation: A Short Tutorial, Erik R. Altman, David Kaeli, Yaron Sheffer (PACT99) DOC WEB
    • Welcome to the opportunities of Binary Translation, Erik R. Altman, David Kaeli, Yaron Seffer, PACT99 DOC WEB
    • PA-RISC to IA-64: Transparent Execution, No Recompilation, Cindy Zheng, Carol Thompson (HP March 2000) [0018-9162/00 IEEE March 2000] DOC WEB
    • Embra: Fast and Flexible Machine Simulation, Emmett Witchel, Mendel Rosenblum (MIT, Standford 1996) [Sigmetrics96] DOC1 DOC2 WEB
    • A Structuring Algorithm for Decompilation, Cristina Cifuentes, Queensland Universtiy, August 1993 DOC WEB
    • The Impact of Copyright on the Development of Cutting Edge Binary Reverse Engineering Technology, Cristina Cifuentes, U. Queensland DOC WEB
    • A Transformational Approach to Binary Translation of Delayed Branches, Norman Ramsey, Cristina Cifuentes, 1999
    • Software Profiling for Hot Path Prediction: Less is More, Evelyn Duesterwald (, Vasanth Bala (, HPL 2000 DOC WEB

    • A JAVA ILP Machine Based on Fast Dynamic Compilation, Kemal Ebcioglu (, Erik Altman (, Erdem Hokene (, IBM Yorktown Center, 1997 DOC WEB
    • Timing Insensitive Binary to Binary Translation of Real Time Systems, Bryce Cogswell (CMU), Zary Segall (U. Oregon), 1994 DOC WEB
    • DIGITAL FX!32: Combining Emulation and Binary Translation, Raymond J. Hookway and Mark A.Herdeg, (DIGITAL Technical Journal, 28 August 1997) DOC WEB
    • White Paper: How DIGITAL FX!32 works, DIGITAL DOC WEB
    • Executor Internals: How to Efficiently Run Mac Programs on PCs, Mathew J. Hostetter , Clifford T. Matthews (Ardi 1996) DOC WEB
    • Syn68k: ARDIs dynamically compiling 68LC040 emulator, Mat Hostetter (Ardi October 27 1995) DOC WEB
    • The Technology Behind Crusoe Processor, Alexander Klaiber, Transmeta Corp., January 2000 DOC WEB
    • Combining hardware and software to provide an improved microprocessor (Crusoes Transmeta Patent), Robert F. Cmelik, et al. (US Patent and Trademark Offioce, February 29, 2000) DOC WEB
    • Builiding the Virtual PC, Eric Traut, Core Technologies, Nov 1995 DOC WEB
    • The DR Emulator (Technote PT39), Eric Traut, Apple, February 1995 DOC WEB
    • HW 28 DR Emulator Caches, Apple, 8 April 1996 DOC WEB
    • Java Hot Spot Documentation DOC WEB
    • Opendoc and Java Beans, Gregg Williams DOC WEB
    • What is Binary Translation (Freeport Express), Richard Gorton, Digital 1995 DOC WEB
    • DRFAQ: Dynamic Recompilation Frequently Asked Questions, Michael König (M.I.K.e), ( 29-8-20000). DOC WEB
    • Emulators and Emulation, Bill Haygood 1999
    • Interview to Jeremey Chadwick
    • VCODE: A Retargetable, Extensible, Very Fast Dynamic Code Generation System, Dawson R. Engler (MIT 1995) [ACM PACT96] DOC WEB
    • A VCODE Tutorial, Dawson R. Engler (MIT, April 28 1996) DOC WEB
    • DCG: An Efficient, Retargetable Dynamic Code Generation System, Dawson R. Engler (MIT), Todd A. Proebsting (University of Arizona) 1993? DOC WEB
    • Dynamo: A Staged Compiler Architecture for Dynamic Program Optimization, Mark Leone, R. Kent Dybvig (Indiana University 1997) DOC WEB
    • Efficient Compilation and Profile-Driven Dynamic Recompilation in Scheme, Robert G. Burger, Indiana Universtity, March 97 DOC WEB
    • C: A Language for High-Level, Efficient and Machine-Independant Dynamic Code Generation, Dawson R. Engler, Wilson C. Hsieh, M. Franskaashoek, MIT 1995 (ACM) DOC WEB
    • tcc: A System for Fast, Flexible and High-Level Dynamic Code Generation, Massimiliano Poletto, Dawson R. Engler, M. Frans Keashoek, MIT 1996 DOC WEB
    • tcc: A Template-Based Compiler for ‘C. Massimiliano Poletto, Dawson R. Engler, M. Frans Keashoek, MIT 1996 DOC WEB
    • Fast, Effective Dynamic Compilation, Joel Auslander, Matthai Philipose, Craig Cambers, Susan J. Eggers,and Brian. Bershad, University of Washington, 1996 DOC WEB
    • Fast and Efficient Procedure Inlining, Oscar Weddell, R. Kent Dybvig, Indiana University, June 11 1997 DOC WEB [**DELETE**]
    • Some Efficient Architecture Simulation Techniques, Robert Bedichek, University of Washington, 1990 DOC WEB
    • SUPERSIM -- A New Technique for Simlation of Programmable DSP Architectures, C. Zivojnovic, S. Pees, Ch. Schlger, R. Weber, H. Meyr, Aachen University (Germany), 1995, DOC [DELETE?]
  • Kevin Scott and Jack Davidson, Strata: A software dynamic translation infrastructure. (,
  • bintrans by Mark Probst. See
  • Strata -- see
  • Dynamic Optimization Infrastructure and Algorithms for IA-64 (includes discussion of delayed compilation and dynamic translation). See
  • ``Binary Translation: Classification of emulators'' by Arjan Tijms, Leiden Institute for Advanced Computer Science, Notes some early work (e.g., IBM 1401). Read and review. See as of 2003/01/30.
  • ``Studying the Performance of the FX!32 Binary Translation System'', see as of 2003/01/30.
  • ``Three factors contribute to the success of a microprocessor: price, performance, and software availability.'' Digital Technical Journal Vol. 9, No. 1, 1997. See as of 2003/01/30.
  • `` B. Case, "Rehosting Binary Code for Software Portability," Microprocessor Report, Vol. 3, No. 1, Jan. 1989, pp. 4-9.'' See as of 2003/01/30.
  • Valgrind is a memory tracing tool by Julian Seward and Nick Nethercote Inofrmation at including ``The design and implementation of Valgrind'' from Notes: GPL'd. For finding memory-management problems. Targeted for x86 GNU/Linux executables. Dynamically link with executable using -z initfirst so it runs before any other object in the image; Valgrind thus gains control. Translates and instruments basic blocks; saved in translation cache and mapped by a translation table. Fast map typically has 98% hit rate. Translations are procedures; they return the virtual (not physical) eip. The main loop does all dispatch. Translation storage management uses an approximate LRU algorithm. Multiple main loop entry points handle special cases such as returns of system call handlers; calls to malloc(), free(), etc. User memory is tracked at the level of allocation blocks. Tracking information includes the call stack at allocation time. Calls to free() look for tracking information, if none, the block has already been freed or was never allocate and an error is signaled. If found, the block is marked inaccessible and the tracking information goes on a free list to look for invalid accesses. Various per-byte information via a sparse map of the entire 4G address space -- 64K sections mapping 64KB each, indexed by the high and low 16 bits of the address, respectively. Valgrind supports --stop-after= to return to native execution. Among other things, this helps with a binary search for Valgrind bugs. To support native execution, memory must be identical whether or not running on the simulated CPU. Signals are handled ``differently'' -- because they arrive asynchronously, --stop-after= varies. Thus, there is a mode to run all signals native. Valgrind also uses a different stack format than native, so a given signal must be handled all-native or all-Valgrind. Contains numerous assertions and internal checks, which are enabled all the time; more expensive checks are done once every N uses of the relevant coe. Some very slow checks are only enabled optionally: low-level memory management complete checks; symbol table reader; per-byte tracking; translation IR sanity checking; system call wrapper sanity checks; main loop %ebp; register allocator checks. Most symbols prefixed using a CPP macro; malloc(), free(), etc. are intercepted and have the usual names. Is largely independent of other shared objects, avoids conflicts if Valgrind and the application simultaneously execute impure code. Also, glibc defines sigset_t differently than the kernel, and Valgrind uses the kernel definition. Still imports too many ``unsafe'' .h files. Limitations include: no threads (``most-requested feature'', needs fast mutex and race safety), no MMX, support for non-POSIX signals, Linux API is not a module, so nontrivial to port to other APIs. Does not use %edi in order to avoid save/restore issues around ``special uses''. Translations are position-independent and are moved in TC as part of LRU management. %ebp always points at a fixed region except on translation return a special value indicates the return virtual pc requires special handling. Thus, virtual state accesses use N(%ebp) addressing. with the most common using 8-bit immediates. Values are re-homed at the end of every translation. Snapshotting initial state takes place before Valgrind is set up; therefore, dump state to a static area, initialize, home. System calls operate on the virtual state except the program counter is the real (Valgrind) value. Need to save Valgrind state while faking up system call state. Horror: may need to handle a second system call while blocked inside a first system call, so actually a stack of Valgrind save areas. Translation based on UCode to avoid details of the x86. UCode is approximately 2-address code, with a third field for immediates. UCode is much simpler than x86; for example only LOAD and STORE touch memory. Register allocation on UCode but some fixup done later. Valgrind does not use the FPU, so simulation is much easier. Valgrind does not cache FPU state across multiple FPU instructions. Optimisation passes can be disabled. Instrumentation can be disabled. Can single-step. UCode is annotated with memory instrumentation code. Code generation aided by GNU superopt v2.5. Memory instrumentation is 1==problem 0==valid; unintuitive, but faster to test against 0 than all 1's. Code generation is paranoid about partial-register updates. Simulated %esp is not updated lazily because the instrumenter needs the current value. Instrumentation is at the UCode level, rather than the x86 level. Valgrind attempts to track defined-ness of memory at the bit level, but ``throws in the towel'' on things that call helper functions. Optimize bit-level referneces that touch a whole object. Valgrind runs on top of a setjmp-installed exception handler so segfaults, etc., can bail out of current translation. %eip is kept within 4 bytes of real. Signals are queued and checked every 1000 blocks. Signal frames have a Valgrind return address for frame cleanup, but some signals e.g., longjmp() to exit. No systematic verification or regression. Has a cache simulator -- tracks which instructions have effects, not just the cache effects. Data structure for cache/instruction must not be discarded when the instruction is discarded. Projects include user-defined permission ranges.

    Questions: Can Valgrind run itself? The -z trick suggests no. Probably no SSE/SSE2. Further explanation of nested system calls (how they arise) would be useful.

  • KCachegrind: visualization tool for data from Cachegrind.
  • T. E. Jeremiassen. Sleipnir --- an instruction-level simulator generator. In International Conference on Computer Design, pages 23--31. IEEE, 2000.
  • @article{ chang01fast, author = "Felix Sheng-Ho Chang and Alan J. Hu", title = "Fast Specification of Cycle-Accurate Processor Models", booktitle = "Proc. International Conf. Computer Design ({ICCD})", publisher = "IEEE Computer Society Press", pages = "488--492", year = "2001", url = "" }
  • S. Onder and R. Gupta ``Automatic Generation of Microarchitecture Simulators'', In International Conference on Computer Languages, pp. 80-89, IEEE, May 1998. (Cited by [CH01], [JerXX].)
  • S. Pees, A. Hoffmann, V. Zivojnoovic, and H. Meyer. ``LISA -- Machine Description Language for Cycle-Accurate Models of Programable DSP Architectures''. In 36th Design Automation Conference, pp. 933-938, ACM/IEEE 1999. (Cited by [CH01], [JerXX].)
  • An architectural description language for simulator and other tools construction.
  • Links to Wine/Plex86/VMware/etc.
  • [RF 97]

      author="Norman Ramsey and Mary F. Fernandez",
      title="{S}pecifying {R}epresentations of {M}achine {I}nstructions",
      journal="ACM Transactions on Programming Languages and Systems",
      volume = "19",
      number = "3",
      pages = "492--524",

  • [Larsson 97]

      name = "F. Larsson",
      title="{G}enerating {E}fficient {S}imulators
    	from a {S}pecification {L}anguage",
      institution="Swedish Institute of Computer Science",

  • [PZRM 97]

      author="S. Pees, V. Zivojnovic, A. Ropers, H. Meyr",
      title="{F}ast {S}imulation of the {T}{I} {T}{M}{S}320{C}54x {D}{S}{P}",
      journal="International Conference on Signal Processing Applications and Technology},
      pages = "995-999",

    • Bergh et al., HP 3000 Emulation on HP Precision Architecture Computers, Hewlett-Packard Journal, Dec. 1987, pp. 87-89.
    • Hunter et al., DOS at RISC, Byte, Nov. 1989, pp. 361-368.
    • Saari, Michael, 68000 Binary Code Translator, FORML Conference Proceedings, 1987, pp. 48-52.
    • Banning, John, The XDOS Binary Code Conversion System, IEEE, 1989, pp. 282-287.
  • SESC simulator -- superscalar simulator for in-order and out-of-order, also several multiprocessor configurations. As of 2003/12, supports only MIPS ISA.
  • ReXSim: A Retargetable Framework for Instruction-Set Architecture Simulation, Mehrdad Reshadi, Prabhat Mishra, Nikhil Bansal, Nikil Dutt
  • SICS technical reports, ``T97-01 Generating Efficient Simulators from a Specification Language 51 pages (PostScript) Fredrik Larsson
    A simulator is a powerful tool for hardware as well as software development. However, implementing an efficient simulator by hand is a very labour intensive and error-prone task. This paper describes a tool for automatic generation of efficient instruction set architecture (ISA) simulators. A specification file describing the ISA is used as input to the tool. Besides a simulator, the tool also generates an assembler and a disassembler for the architecture. We present a method where statistics is used to identify frequently used instructions. Special versions of these instructions are then created by the tool in order to speed up the simulator. With this technique we have generated a SPARC V8 simulator which is more efficient than our hand-coded and hand-optimized one.''
  • R97-03 SimGen: Development of Efficient Instruction Set Simulators (abstract, PostScript) Fredrik Larsson, Peter Magnusson, Bengt Werner
  • T97-02 Performance Debugging and Tuning using an Instruction-Set Simulator (PostScript) Peter S. Magnusson, Johan Montelius
    Instruction-set simulators allow programmers a detailed level of insight into, and control over, the execution of a program, including parallel programs and operating systems. In principle, instruction set simulation can model any target computer and gather any statistic. Furthermore, such simulators are usually portable, independent of compiler tools, and deterministic-allowing bugs to be recreated or measurements repeated. Though often viewed as being too slow for use as a general programming tool, in the last several years their performance has improved considerably. We describe SIMICS, an instruction set simulator of SPARC-based multiprocessors developed at SICS, in its role as a general programming tool. We discuss some of the benefits of using a tool such as SIMICS to support various tasks in software engineering, including debugging, testing, analysis, and performance tuning. We present in some detail two test cases, where we've used SimICS to support analysis and performance tuning of two applications, Penny and EQNTOTT. This work resulted in improved parallelism in, and understanding of, Penny, as well as a performance improvement for EQNTOTT of over a magnitude. We also present some early work on analyzing SPARC/Linux, demonstrating the ability of tools like SimICS to analyze operating systems. (NOTE: A later version of this report was published in ILPS'97)
  • ``IA-32 Execution Layer: a two-phase dynamic translator designed to support IA-32 applications on Itanium(R)-based systems'', Leonid Baraz, Tevi Devor, Orna Etzion, Shalom Goldenberg, Alex Kaletsky, Yun Wang, and Yigal Zemach. IEEE MICRO-36 2003.
  • Simulators to keep old software alive. Products include PDP-11 on ?, VAX on ?, and Alpha on Itanium. [According to one reader, they reported about 10 VAX MIPS on a 600 MHz P-III, with 4X memory expansion.].
  • Sulima SPARC V9 system simulator (as of 2004/02). Also a paper.
  • Anything you know about that I haven't included and any bugs you find that I haven't fixed.

  • From instruction-set simulation and tracing