Find and incorporate
ssim: A Superscalar Simulator
Mike Johnson
AMD
M. D. Smith
Stanford Univ.
Pixie front ends in
ftp://velox.stanford.edu/pub
Find and incorporate
(based on work with `pixie' and `ssim', may tell about them?)
Johnson, Mike: Superscalar Microprocessor Design.
Englewood Cliffs, NJ: Prentice Hall, 1991. XXIV, 288 pp., with diagrams.
(Prentice-Hall Series in Innovative Technology.) Bibliography pp. 273-278.
ISBN 0-13-875634-1
Find and incorporate
(mostly results of tracing, but may discuss simulation and tracing):
@Book{Huck:89,
author = {Jerome C. Huck and Michael J. Flynn},
title = {Analyzing Computer Architectures},
publisher = {IEEE Computer Society Press},
year = 1989,
address = {Washington, DC}
}
Find out more about Robert Bedichek's
T2 simulator.
Yaze Z80 and CP/M emulator.
more info
and
source code.
WinDLX,
an MS Windows GUI for DLX.
Also include information about DLX
from [Hennessy & Patterson 93]
UAE
Commodore Amiga hardware emulator (incomplete).
DEC FX!32
binary translation/emulation system for running
Microsoft Windows applications.
Find and incorporate
%A Max Copperman
%A Jeff Thomas
%T Poor Man's Watchpoints
%J ACM SIGPLAN Notices
%V 30
%N 1
%D January 1995
%P 37-44
Pardo has a copy.
Executive summary: debugging tool; statically patches loads and stores
with code to check for data breakpoints.
Amusing story:
The processor they were running on
has load delay slots and does not have pipeline interlocks.
Their tool replaces each load or store with several instructions;
it patched a piece of user-mode code of the form
load addr -> r5
store r5 -> addr2
Before patching, the code saved the old value of r5
to addr2.
After patching, it saved the new value.
Technically, this code was broken already because the symptom could
have also been exhibited by an interrupt or exception between the load
and the store.
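The anecdote above can be made concrete with a toy sketch (not from the paper): a "patched" interpreter in which every load and store first checks its address against a set of watchpoints, mimicking the checking code the tool statically inserts around each memory access.

```python
# Toy sketch (not from the paper): every load/store checks the address
# against a watchpoint set, as the statically patched code would.
WATCHPOINTS = set()
MEMORY = {}
HITS = []

def checked_load(addr):
    if addr in WATCHPOINTS:
        HITS.append(("load", addr))   # report the data breakpoint
    return MEMORY.get(addr, 0)

def checked_store(addr, value):
    if addr in WATCHPOINTS:
        HITS.append(("store", addr))
    MEMORY[addr] = value

# The pattern from the anecdote: a load into r5 immediately followed by
# a store of r5.  Once each access expands into several instructions,
# code that relied on the old r5 surviving the load delay slot changes
# behavior -- though such code was already broken under interrupts.
WATCHPOINTS.add(0x2000)
r5 = checked_load(0x1000)
checked_store(0x2000, r5)

print(HITS)  # -> [('store', 8192)]
```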
Find and incorporate information about Spike.
Referenced in
[Conte & Gimarc 95],
Tom Conte conte@eos.ncsu.edu says (paraphrased):
``Spike was built inside GNU GCC by Michael Golden and myself.
It includes a lot of features that have appeared in ATOM,
including combining the simulator with the benchmark into a single
``self-tracing'' binary.
The instruction trace was based on an abstract machine model
distilled from GCC's RTL;
it had both a high-level and a low-level form.
Spike is still in occasional use,
but has never been released.''
Find and incorporate information about Reiser & Skudlarek's
paper "Program Profiling Problems, and a Solution via Machine Language
Rewriting",
from ACM SIGPLAN Notices, V29, #1, January 1994.
Pardo has a copy.
Basic summary: Wanted to profile. -p/-pg code is larger and slower
by enough to make it hard to justify profiling as the default.
Assumes the entire source is available.
For these and other reasons, wrote jprof
which operates with disassembly, analysis and rewriting.
Discusses sampling errors, expected accuracy, stability, randomness,
etc.
Describes jprof: counters and stopwatches; subroutine call graph.
Domain/OS on HP/Apollo using 68030.
Discusses shared libraries. Can also use page-fault clock.
4-microsecond clocks.
Some lessons/observations.
Doesn't explain how program running time is affected by jprof.
Design
tradeoffs
between various 68k implementations
(comp.arch posting).
More on
decompilation
of PC executables
Update the reference for
Alvin R. "Alvy" Lebeck.
Review and include FX!32.
March 5 1996 Microprocessor Report.
Jim Turley, "Alpha Runs x86 Code with FX!32".
Summary: DEC is running Win32 application binaries on Alpha via a new
combination of an interpreter and a static translator. The static
translator runs in the background, between the first and second
executions of the application. It uses information collected by the
interpreter during the first run to reliably distinguish active code
paths from read-only data and to work out the effects of indirect
jumps. Static analysis can't do this on its own for typical x86
binaries.
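The interpret-then-translate scheme can be sketched as follows; this is a hypothetical toy, not DEC's implementation. Run 1 interprets and records which addresses actually held executed instructions and where indirect jumps went; the background translator later translates only addresses the profile proved are code.

```python
# Hypothetical sketch of the FX!32-style two-phase scheme (all names
# invented): run 1 interprets and profiles; between runs, a background
# translator uses the profile to tell code from data and to resolve
# indirect-jump targets it could not find statically.

def interpret(program, entry):
    """Run 1: execute and record which addresses held instructions."""
    profile = {"executed": set(), "indirect_targets": {}}
    pc = entry
    while pc in program:
        profile["executed"].add(pc)
        op, arg = program[pc]
        if op == "jmp_indirect":
            target = arg()            # value known only at run time
            profile["indirect_targets"].setdefault(pc, set()).add(target)
            pc = target
        elif op == "halt":
            break
        else:
            pc += 1
    return profile

def translate(program, profile):
    """Between runs: translate only addresses proven to be code."""
    return {pc: ("native", program[pc]) for pc in profile["executed"]}

prog = {
    0: ("add", None),
    1: ("jmp_indirect", lambda: 5),   # target invisible to static analysis
    5: ("halt", None),
    9: ("data", 0xDEAD),              # never executed: left untranslated
}
prof = interpret(prog, 0)
native = translate(prog, prof)
print(sorted(native))  # -> [0, 1, 5]
```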
Add info about Doug Kwan
(author of YAE,
an Apple ][ emulator)
to "Who's who" section.
Nino says: the only freely available dynamic recompilation.
(Dynamic recompilation for SPARC and MIPS).
Information forwarded by
Marinos Yannikos <nino@complang.tuwien.ac.at>.
Find, read, and incorporate decompilation info
(also cites a program verification dissertation):
%A P. J. Brown
%T Re-creation of Source Code from Reverse Polish Form
%J Software \- Practice & Experience
%V 2
%N 3
%P 275-278
%D 1972
Note: there's a slightly later SPE that has a follow-up article
explaining how to do it faster/more efficiently.
From: faase@cs.utwente.nl (Frans F.J. Faase)
Newsgroups: comp.compilers
Subject: Re: Need decompiler for veryyy old code....
Date: 29 Apr 1996 23:11:51 -0400
Organization: University of Twente, Dept. of Computer Science
Message-ID: <96-04-144@comp.compilers>
References: <96-04-110@comp.compilers>
Keywords: disassemble, IBM
> Currently I am undertaking to modify some very old IBM code (at least
> 20 years old). I believe that the code is either Assembler or Cobol.
I do not know whether the following is of use for you, but I do
maintain a WWW page about decompilation, which has some links to other
resources as well.
<a href="http://www.cs.utwente.nl/~faase/Ha/decompile.html">
http://www.cs.utwente.nl/~faase/Ha/decompile.html</a>
Maybe you should contact Martin Ward <Martin.Ward@durham.ac.uk>:
<a href="http://www.dur.ac.uk/~dcs0mpw/">
http://www.dur.ac.uk/~dcs0mpw/</a>
Or Tim Bull <tim.bull@durham.ac.uk>:
<a href="http://www.dur.ac.uk/~dcs1tmb/home.html">
http://www.dur.ac.uk/~dcs1tmb/home.html</a>
Frans
(P.S. Email to <PROCUNIERA@ucfv.bc.ca> bounced with 451 error)
--
Frans J. Faase
Information Systems Group Tel : +31-53-4894232
Department of Computer Science secr. : +31-53-4893690
University of Twente Fax : +31-53-4892927
PO box 217, 7500 AE Enschede, The Netherlands Email : faase@cs.utwente.nl
--------------- http://www.cs.utwente.nl/~faase/ ---------------------
A Java runtime, which generates native code at runtime:
Softway's
Guava.
Info from
Jeremy Fitzhardinge (jeremy@suede.sw.oz.au)
Find, read, and summarize the following:
%A Ariel Pashtan
%T A Prolog Implementation of an Instruction-Level Processor Simulator
%J Software \- Practice and Experience
%V 17
%N 5
%P 309-318
%D May 1987
Find, read and summarize "Augmint".
According to Anthony-Trung Nguyen <anguyen@csrd.uiuc.edu>,
it is based on MINT, and understands x86
instruction set and runs on Intel x86 boxes with UNIX (Linux,
Unixware, etc.) or Windows NT.
It is described further at
http://www.csrd.uiuc.edu/iacoma/augmint.html
and there was an ICCD-96 paper,
available from
ftp://ftp.csrd.uiuc.edu/pub/Projects/iacoma/aug.ps.
Find, read and summarize "Etch".
See http://memsys.cs.washington.edu/memsys/html/etch.html.
Etch is an x86 Windows/NT tool for annotating x86 binaries, without
source code.
Find, read and summarize "Etch".
From: bchen@eecs.harvard.edu (Brad Chen)
Newsgroups: comp.arch
Subject: Windows x86 Address Traces Available
Date: 7 Oct 1996 22:20:30 GMT
Organization: Harvard University EECS
Message-ID: <53bvne$5lb@necco.harvard.edu>
Keywords: Windows x86 address traces
A collection of x86 memory reference traces from Win32
applications is now available from the following URL:
http://etch.eecs.harvard.edu/traces/index.html.
The collection includes traces from both commercial and
public-domain applications. The collection currently
includes:
- Perl
- MPeg Play
- Borland C++
- Microsoft Visual C
- Microsoft Word
These traces were created using Etch, an instrumentation
and optimization tool for Win32 executables. For more
information on Etch see the above URL.
(etch-info@cs.washington.edu)
Add information on iprof.
Here's a summary from
Peter Kuhn:
Peter Kuhn voice: +49-89-289-23092
Institute for Integrated Circuits (LIS) fax1: +49-89-289-28323
Technical University of Munich fax2: +49-89-289-25304
Arcisstr. 21, D-80290 Munich, Germany
email: P_Kuhn@lis.e-technik.tu-muenchen.de
http://www.lis.e-technik.tu-muenchen.de/people/kp.html
- portable to GNU gcc/g++ supported platforms,
operating systems and processors
- detailed instrumentation of instruction usage
- no source code modification necessary
- no restrictions for the application programmer
(only "-a" switch for gcc/g++ compilers)
- applicable to statically linked libraries
- minimal slowdown of program execution time (about 5%)
- fast: no source recompilation necessary for repeated simulation runs
- less trace data produced
- high reliability: no executable modification
- covered by the GNU General Public License
- available via anonymous ftp at:
ftp://ftp.lis.e-technik.tu-muenchen.de/pub/iprof
The operation is:
With the gcc/g++ option -a (version 2.6.3 and above) you can produce
a basic block statistics file (bb.out), which contains the number
of times each basic block of the program is accessed during runtime.
iprof processes this basic block statistics file and accesses the
program's executable to summarize the machine instructions used
for each basic block.
So iprof doesn't make any modifications to gcc/g++ itself and is
easily portable among gcc/g++-supported architectures. Currently
binaries for Linux 486, Pentium, and Sparc Solaris are provided;
ports to other architectures are straightforward.
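Conceptually, iprof's job reduces to combining two tables: per-basic-block execution counts (as bb.out records them) and a per-block static instruction mix (as read from the executable). A hedged sketch, with both tables invented for illustration:

```python
# Conceptual sketch (file contents invented): multiply per-basic-block
# execution counts, as gcc -a's bb.out records them, by the static
# instruction mix of each block, as read from the executable, to get
# dynamic instruction-usage totals without modifying the binary.

from collections import Counter

# Hypothetical bb.out contents: basic block id -> times executed.
bb_counts = {0: 1, 1: 1000, 2: 999}

# Hypothetical disassembly: basic block id -> opcodes in that block.
bb_instrs = {
    0: ["mov", "call"],
    1: ["load", "add", "store", "branch"],
    2: ["add", "branch"],
}

usage = Counter()
for bb, count in bb_counts.items():
    for opcode in bb_instrs[bb]:
        usage[opcode] += count      # each instruction runs once per pass

print(usage["add"])    # -> 1999
print(usage["branch"]) # -> 1999
print(usage["mov"])    # -> 1
```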
There are many ways to measure slowdown.
Each has certain benefits, each has shortcomings.
- Time to execute target code on simulated target vs. native target
running time. This is particularly interesting if you are trying
to determine relative performance for a commercial product such as
SoftPC or if you're otherwise interested in real-time response.
However, it ignores the implementation technology of the host
machine. For example, a simulated Z-80 on a SPARC will be faster
than a simulated SPARC on a Z-80, and performance may vary by 6X
depending on which Z-80 you use.
- The time or number of host instructions to execute the workload
vs. executing the workload native on the host tells you the most
about simulation efficiency if the host and the target are the same
machine. The numbers get less useful if the host and target are
different; there's also differences if the simulator executes some
part of the program "native" (e.g., system calls). For example, a
workload compiled for the EDSAC (17-bit words) and then run on a
MIPS is unlikely to be close to the performance of the workload
compiled and run natively on the MIPS.
- Number of host instructions per target instruction captures more
of the "simulation efficiency" without getting caught in the
confusion of processor implementation technologies. However, it
potentially does the least accurate job of predicting real-time
performance, as it may be unduly hurt by real-world concerns such
as the number of cache misses. For example, SimICS got faster when
the IR got smaller but more complicated to decode. The number of
host instructions increased, but the overall running time
decreased.
- Multiprocessor performance is even harder to judge. For example,
multiplexing target processors on a single host processor may
induce TLB, cache and paging misses that lead to much worse
performance. Conversely, I/O effects may be overlapped with
simulation of other processors, reducing the effective overhead of
simulation.
- Simulating more costs more; simulators such as Shade, FX!32, etc.
are as fast as they are in part because some parts of the overall
workload (e.g., OS code) are executed native on the host machine,
rather than simulating all of the target OS code.
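For concreteness, the first three metrics above can be computed side by side; the numbers below are made up purely for illustration.

```python
# The slowdown metrics above, computed for made-up numbers.

target_native_secs = 10.0    # workload on real target hardware
simulated_secs     = 120.0   # same workload under the simulator
host_native_secs   = 8.0     # workload recompiled and run on the host
host_instrs        = 3.6e9   # host instructions spent simulating
target_instrs      = 1.2e8   # target instructions simulated

# 1. Real-time ratio: simulated vs. native target execution time.
realtime_slowdown = simulated_secs / target_native_secs        # 12.0

# 2. Host-relative ratio: simulated vs. the workload run native on host.
host_slowdown = simulated_secs / host_native_secs              # 15.0

# 3. Host instructions per target instruction: implementation-neutral,
#    but blind to cache effects (cf. the SimICS example above).
instrs_per_instr = host_instrs / target_instrs                 # 30.0

print(realtime_slowdown, host_slowdown, instrs_per_instr)
```

Note that the three numbers disagree, which is exactly the point: which metric is "the" slowdown depends on what question you are asking.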
So what we see includes:
- You can't measure the running time of a workload on a target
that does not yet or no longer exists.
- Anything that uses elapsed running times depends strongly
on the implementation technology.
- The real-world performance does vary depending on the
implementation technology.
- The host/target ratio fails to capture some significant effects,
e.g., the SimICS example.
- Multiprocessor simulation may cause higher miss rates in the
processor cache, TLB and paging memory. Conversely, simulation may
be overlapped with computation.
- Running more of the application as host code improves the observed
running time and host/target instruction ratio.
(I forget the details, but I'd definitely check out some of the early
SimICS papers for a discussion of running times; Peter has more to
say.)
Find and incorporate
Harish Patil's dissertation <patil@ch.hp.com>
on ``efficient program monitoring''.
See the
TR.
Or, try here.
From: Harish Patil
Newsgroups: comp.compilers
Subject: Thesis available: Program Monitoring
Date: 29 Jan 1997 11:21:02 -0500
Organization: Compilers Central
Message-ID: <97-01-223@comp.compilers>
Reply-To: Harish Patil
Keywords: report, available, performance
Hello everyone:
I am glad to announce that my Ph.D. thesis, titled "Efficient Program
Monitoring Techniques", is available on-line. This thesis was
completed under the supervision of Prof. Charles Fischer at the
Department of Computer Sciences, University of Wisconsin--Madison.
The thesis is available as technical report # 1320. Please check it
out at the URL:
http://www.cs.wisc.edu/Dienst/UI/2.0/Describe/ncstrl.uwmadison%2fCS-TR-96-1320
An abstract of the thesis follows.
Regards,
-Harish
Efficient Program Monitoring Techniques
---------------------------------------
Programs need to be monitored for many reasons, including performance
evaluation, correctness checking, and security. However, the cost of
monitoring programs can be very high. This thesis contributes two
techniques for reducing the high execution time overhead of program
monitoring: 1) customization and 2) shadow processing. These
techniques have been tested using a memory access monitoring system
for C programs.
"Customization" reduces the cost of monitoring programs by decoupling
monitoring from original computation. A user program can be customized
for any desired monitoring activity by deleting computation not
relevant for monitoring. The customized program is smaller, easier to
analyze, and almost always faster than the original program. It can be
readily instrumented to perform the desired monitoring. We have
explored the use of program slicing technology for customizing C
programs. Customization can cut the overhead of memory access
monitoring by up to half.
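The customization idea can be illustrated with a toy backward slice; this is a hedged sketch of the general slicing technique, not Patil's implementation, and the program representation is invented.

```python
# Toy sketch of "customization" via backward slicing: keep only the
# statements that the monitored memory accesses depend on; the smaller
# customized program is what then gets instrumented.

def customize(stmts, monitored_vars):
    """Keep statements whose results the monitored accesses depend on."""
    needed = set(monitored_vars)
    kept = []
    for target, sources in reversed(stmts):   # walk backward
        if target in needed:
            kept.append((target, sources))
            needed |= set(sources)            # their inputs are needed too
    return list(reversed(kept))

# (target, sources) pairs for a toy straight-line program; only 'addr'
# feeds the monitored access, so the 'color' computation is deleted.
program = [
    ("base",  []),
    ("i",     []),
    ("addr",  ["base", "i"]),     # address of the monitored access
    ("color", ["i"]),             # output-only work, irrelevant here
]
print(customize(program, ["addr"]))
```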
"Shadow processing" hides the cost of on-line monitoring by using idle
processors in multiprocessor workstations. A user program is
partitioned into two run-time processes. One is the main process
executing as usual, without any monitoring code. The other is a shadow
process following the main process and performing the desired
monitoring. One key issue in the use of a shadow process is the degree
to which the main process is burdened by the need to synchronize and
communicate with the shadow process. We believe the overhead to the
main process must be very modest to allow routine use of shadow
processing for heavily-used production programs. We therefore limit
the interaction between the two processes to communicating certain
irreproducible values. In our experimental shadow processing system
for memory access checking the overhead to the main process is very
low - almost always less than 10%. Further, since the shadow process
avoids repeating some of the computations from the main program, it
runs much faster than a single process performing both the computation
and monitoring.
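The shadow-processing split can be sketched in a few lines; here a thread stands in for the shadow processor, and the only traffic between the two is an irreproducible input value forwarded through a queue. This is an illustrative toy under those assumptions, not the thesis's system.

```python
# Minimal sketch of shadow processing (a thread stands in for the
# shadow processor): the main computation runs unmonitored, forwarding
# only irreproducible values through a queue; the shadow re-executes
# the same code with memory-access checking enabled.

import queue
import threading

chan = queue.Queue()
violations = []

def compute(read_input, check=None):
    x = read_input()              # irreproducible: must be forwarded
    addr = 0x100 + x
    if check:
        check(addr)               # monitoring happens only in the shadow
    return addr

def main_process():
    value = 7                     # stands in for real external input
    chan.put(value)               # forward the irreproducible value
    return compute(lambda: value)

def shadow_process():
    value = chan.get()            # replay the forwarded value
    compute(lambda: value,
            check=lambda a: violations.append(a) if a > 0x105 else None)

result = main_process()
t = threading.Thread(target=shadow_process)
t.start()
t.join()
print(result, violations)  # -> 263 [263]
```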
==========================================================================
Harish Patil: Massachusetts Language Lab - Hewlett Packard
Mail Stop CHR02DC, 300 Apollo Drive, Chelmsford MA 01824
Phone: 508 436 5717 Fax: 508 436 5135 Email: patil@apollo.hp.com