Find and incorporate
ssim: A Superscalar Simulator
Mike Johnson
AMD
M. D. Smith
Stanford Univ.
Pixie front ends in
ftp://velox.stanford.edu/pub
Find and incorporate
(based on work with `pixie' and `ssim', may tell about them?)
Johnson, Mike: Superscalar Microprocessor Design.
Englewood Cliffs, NJ: Prentice Hall, 1991. XXIV, 288 pp., with diagrams.
(Prentice-Hall Series in Innovative Technology.) Bibliography pp. 273-278.
ISBN 0-13-875634-1
Find and incorporate
(mostly results of tracing, but may discuss simulation and tracing):
@Book{Huck:89,
author = {Jerome C. Huck and Michael J. Flynn},
title = {Analyzing Computer Architectures},
publisher = {IEEE Computer Society Press},
year = 1989,
address = {Washington, DC}
}
Find out more about Robert Bedichek's
T2 simulator.
Yaze Z80 and CP/M emulator.
more info
and
source code.
WinDLX,
an MS Windows GUI for DLX.
Also include information about DLX
from [Hennessy & Patterson 93]
UAE
Commodore Amiga hardware emulator (incomplete).
DEC FX!32
binary translation/emulation system for running
Microsoft Windows applications.
Find and incorporate
%A Max Copperman
%A Jeff Thomas
%T Poor Man's Watchpoints
%J ACM SIGPLAN Notices
%V 30
%N 1
%D January 1995
%P 37-44
Pardo has a copy.
Executive summary: debugging tool; statically patches loads and stores
with code to check for data breakpoints.
Amusing story:
The processor they were running on
has load delay slots and does not have pipeline interlocks.
Their tool replaces each load or store with several instructions;
it patched a piece of user-mode code of the form
load addr -> r5
store r5 -> addr2
Before patching, the code saved the old value of r5
to addr2.
After patching, it saved the new value.
Technically, this code was broken already because the symptom could
have also been exhibited by an interrupt or exception between the load
and the store.
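The anecdote above can be made concrete with a toy sketch (not from the paper): a "patched" interpreter in which every load and store first checks its address against a set of watchpoints, mimicking the checking code the tool statically inserts around each memory access.

```python
# Toy sketch (not from the paper): every load/store checks the address
# against a watchpoint set, as the statically patched code would.
WATCHPOINTS = set()
MEMORY = {}
HITS = []

def checked_load(addr):
    if addr in WATCHPOINTS:
        HITS.append(("load", addr))   # report the data breakpoint
    return MEMORY.get(addr, 0)

def checked_store(addr, value):
    if addr in WATCHPOINTS:
        HITS.append(("store", addr))
    MEMORY[addr] = value

# The pattern from the anecdote: a load into r5 immediately followed by
# a store of r5.  Once each access expands into several instructions,
# code that relied on the old r5 surviving the load delay slot changes
# behavior -- though such code was already broken under interrupts.
WATCHPOINTS.add(0x2000)
r5 = checked_load(0x1000)
checked_store(0x2000, r5)

print(HITS)  # -> [('store', 8192)]
```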
Find and incorporate information about Spike.
Referenced in
[Conte & Gimarc 95],
Tom Conte conte@eos.ncsu.edu says (paraphrased):
``Spike was built inside GNU GCC by Michael Golden and myself.
It includes a lot of features that have appeared in ATOM,
including combining the simulator with the benchmark into a single
``self-tracing'' binary.
The instruction trace was based on an abstract machine model
distilled from GCC's RTL;
it had both a high-level and a low-level form.
Spike is still in occasional use,
but has never been released.''
Find and incorporate information about Reiser & Skudlarek's
paper "Program Profiling Problems, and a Solution via Machine Language
Rewriting",
from ACM SIGPLAN Notices, V29, #1, January 1994.
Pardo has a copy.
Basic summary: Wanted to profile. -p/-pg code is larger and slower
by enough to make it hard to justify profiling as the default.
Assumes the entire source is available.
For these and other reasons, wrote jprof
which operates with disassembly, analysis and rewriting.
Discusses sampling errors, expected accuracy, stability, randomness,
etc.
Describes jprof: counters and stopwatches; subroutine call graph.
Domain/OS on HP/Apollo using 68030.
Discusses shared libraries. Can also use page-fault clock.
4-microsecond clocks.
Some lessons/observations.
Doesn't explain how program running time is affected by jprof.
Design
tradeoffs
between various 68k implementations
(comp.arch posting).
More on
decompilation
of PC executables
Update the reference for
Alvin R. "Alvy" Lebeck.
Review and include FX!32.
March 5 1996 Microprocessor Report.
Jim Turley, "Alpha Runs x86 Code with FX!32".
Summary: DEC is running Win32 application binaries on Alpha via a new
combination of an interpreter and a static translator. The static
translator runs in the background, between the first and second
executions of the application. It uses information collected by the
interpreter during the first run to reliably distinguish active code
paths from read-only data and to work out the effects of indirect
jumps. Static analysis can't do this on its own for typical x86
binaries.
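The interpret-then-translate scheme can be sketched as follows; this is a hypothetical toy, not DEC's implementation. Run 1 interprets and records which addresses actually held executed instructions and where indirect jumps went; the background translator later translates only addresses the profile proved are code.

```python
# Hypothetical sketch of the FX!32-style two-phase scheme (all names
# invented): run 1 interprets and profiles; between runs, a background
# translator uses the profile to tell code from data and to resolve
# indirect-jump targets it could not find statically.

def interpret(program, entry):
    """Run 1: execute and record which addresses held instructions."""
    profile = {"executed": set(), "indirect_targets": {}}
    pc = entry
    while pc in program:
        profile["executed"].add(pc)
        op, arg = program[pc]
        if op == "jmp_indirect":
            target = arg()            # value known only at run time
            profile["indirect_targets"].setdefault(pc, set()).add(target)
            pc = target
        elif op == "halt":
            break
        else:
            pc += 1
    return profile

def translate(program, profile):
    """Between runs: translate only addresses proven to be code."""
    return {pc: ("native", program[pc]) for pc in profile["executed"]}

prog = {
    0: ("add", None),
    1: ("jmp_indirect", lambda: 5),   # target invisible to static analysis
    5: ("halt", None),
    9: ("data", 0xDEAD),              # never executed: left untranslated
}
prof = interpret(prog, 0)
native = translate(prog, prof)
print(sorted(native))  # -> [0, 1, 5]
```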
Add info about Doug Kwan
(author of YAE,
an Apple ][ emulator)
to "Who's who" section.
Nino says: the only freely available dynamic recompilation.
(Dynamic recompilation for SPARC and MIPS).
Information forwarded by
Marinos Yannikos <nino@complang.tuwien.ac.at>.
Find, read, and incorporate decompilation info
(also cites a program verification dissertation):
%A P. J. Brown
%T Re-creation of Source Code from Reverse Polish Form
%J Software \- Practice & Experience
%V 2
%N 3
%P 275-278
%D 1972
Note: there's a slightly later SPE that has a follow-up article
explaining how to do it faster/more efficiently.
From: faase@cs.utwente.nl (Frans F.J. Faase)
Newsgroups: comp.compilers
Subject: Re: Need decompiler for veryyy old code....
Date: 29 Apr 1996 23:11:51 -0400
Organization: University of Twente, Dept. of Computer Science
Message-ID: <96-04-144@comp.compilers>
References: <96-04-110@comp.compilers>
Keywords: disassemble, IBM
> Currently I am undertaking to modify some very old IBM code (at least
> 20 years old). I believe that the code is either Assembler or Cobol.
I do not know whether the following is of use for you, but I do
maintain a WWW page about decompilation, which has some links to other
resources as well.
<a href="http://www.cs.utwente.nl/~faase/Ha/decompile.html">
http://www.cs.utwente.nl/~faase/Ha/decompile.html</a>
Maybe you should contact Martin Ward <Martin.Ward@durham.ac.uk>:
<a href="http://www.dur.ac.uk/~dcs0mpw/">
http://www.dur.ac.uk/~dcs0mpw/</a>
Or Tim Bull <tim.bull@durham.ac.uk>:
<a href="http://www.dur.ac.uk/~dcs1tmb/home.html">
http://www.dur.ac.uk/~dcs1tmb/home.html</a>
Frans
(P.S. Email to <PROCUNIERA@ucfv.bc.ca> bounced with 451 error)
--
Frans J. Faase
Information Systems Group Tel : +31-53-4894232
Department of Computer Science secr. : +31-53-4893690
University of Twente Fax : +31-53-4892927
PO box 217, 7500 AE Enschede, The Netherlands Email : faase@cs.utwente.nl
--------------- http://www.cs.utwente.nl/~faase/ ---------------------
A Java runtime, which generates native code at runtime:
Softway's
Guava.
Info from
Jeremy Fitzhardinge (jeremy@suede.sw.oz.au)
Find, read, and summarize the following:
%A Ariel Pashtan
%T A Prolog Implementation of an Instruction-Level Processor Simulator
%J Software \- Practice and Experience
%V 17
%N 5
%P 309-318
%D May 1987
Find, read and summarize "Augmint".
According to Anthony-Trung Nguyen <anguyen@csrd.uiuc.edu>,
it is based on MINT, and understands x86
instruction set and runs on Intel x86 boxes with UNIX (Linux,
Unixware, etc.) or Windows NT.
It is described further at
http://www.csrd.uiuc.edu/iacoma/augmint.html
and there was an ICCD-96 paper,
available from
ftp://ftp.csrd.uiuc.edu/pub/Projects/iacoma/aug.ps.
Find, read and summarize "Etch".
See http://memsys.cs.washington.edu/memsys/html/etch.html.
Etch is an x86 Windows/NT tool for annotating x86 binaries, without
source code.
Find, read and summarize "Etch".
From: bchen@eecs.harvard.edu (Brad Chen)
Newsgroups: comp.arch
Subject: Windows x86 Address Traces Available
Date: 7 Oct 1996 22:20:30 GMT
Organization: Harvard University EECS
Message-ID: <53bvne$5lb@necco.harvard.edu>
Keywords: Windows x86 address traces
A collection of x86 memory reference traces from Win32
applications is now available from the following URL:
http://etch.eecs.harvard.edu/traces/index.html.
The collection includes traces from both commercial and
public-domain applications. The collection currently
includes:
- Perl
- MPeg Play
- Borland C++
- Microsoft Visual C
- Microsoft Word
These traces were created using Etch, an instrumentation
and optimization tool for Win32 executables. For more
information on Etch see the above URL.
(etch-info@cs.washington.edu)
Add information on iprof.
Here's a summary from
Peter Kuhn:
Peter Kuhn voice: +49-89-289-23092
Institute for Integrated Circuits (LIS) fax1: +49-89-289-28323
Technical University of Munich fax2: +49-89-289-25304
Arcisstr. 21, D-80290 Munich, Germany
email: P_Kuhn@lis.e-technik.tu-muenchen.de
http://www.lis.e-technik.tu-muenchen.de/people/kp.html
- portable to GNU gcc/g++ supported platforms,
operating systems and processors
- detailed instrumentation of instruction usage
- no source code modification necessary
- no restrictions for the application programmer
(only "-a" switch for gcc/g++ compilers)
- applicable to statically linked libraries
- minimal slowdown of program execution time (about 5%)
- fast: no source recompilation necessary for repeated simulation runs
- less trace data produced
- high reliability: no executable modification
- covered by the GNU General Public License
- available via anonymous ftp at:
ftp://ftp.lis.e-technik.tu-muenchen.de/pub/iprof
The operation is:
With the gcc/g++ option -a (version 2.6.3 and above) you can produce
a basic block statistics file (bb.out), which contains the number
of times each basic block of the program is accessed during runtime.
iprof processes this basic block statistics file and accesses the
program's executable to summarize the machine instructions used
for each basic block.
So iprof doesn't make any modifications to gcc/g++ itself and is
easily portable among gcc/g++-supported architectures. Currently
binaries for Linux 486, Pentium, and Sparc Solaris are provided;
ports to other architectures are straightforward.
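Conceptually, iprof's job reduces to combining two tables: per-basic-block execution counts (as bb.out records them) and a per-block static instruction mix (as read from the executable). A hedged sketch, with both tables invented for illustration:

```python
# Conceptual sketch (file contents invented): multiply per-basic-block
# execution counts, as gcc -a's bb.out records them, by the static
# instruction mix of each block, as read from the executable, to get
# dynamic instruction-usage totals without modifying the binary.

from collections import Counter

# Hypothetical bb.out contents: basic block id -> times executed.
bb_counts = {0: 1, 1: 1000, 2: 999}

# Hypothetical disassembly: basic block id -> opcodes in that block.
bb_instrs = {
    0: ["mov", "call"],
    1: ["load", "add", "store", "branch"],
    2: ["add", "branch"],
}

usage = Counter()
for bb, count in bb_counts.items():
    for opcode in bb_instrs[bb]:
        usage[opcode] += count      # each instruction runs once per pass

print(usage["add"])    # -> 1999
print(usage["branch"]) # -> 1999
print(usage["mov"])    # -> 1
```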
There are many ways to measure slowdown.
Each has certain benefits, each has shortcomings.
- Time to execute target code on simulated target vs. native target
running time. This is particularly interesting if you are trying
to determine relative performance for a commercial product such as
SoftPC or if you're otherwise interested in real-time response.
However, it ignores the implementation technology of the host
machine. For example, a simulated Z-80 on a SPARC will be faster
than a simulated SPARC on a Z-80, and performance may vary by 6X
depending on which Z-80 you use.
- The time or number of host instructions to execute the workload
vs. executing the workload native on the host tells you the most
about simulation efficiency if the host and the target are the same
machine. The numbers get less useful if the host and target are
different; there's also differences if the simulator executes some
part of the program "native" (e.g., system calls). For example, a
workload compiled for the EDSAC (17-bit words) and then run on a
MIPS is unlikely to be close to the performance of the workload
compiled and run natively on the MIPS.
- Number of host instructions per target instruction captures more
of the "simulation efficiency" without getting caught in the
confusion of processor implementation technologies. However, it
potentially does the least accurate job of predicting real-time
performance, as it may be unduly hurt by real-world concerns such
as the number of cache misses. For example, SimICS got faster when
the IR got smaller but more complicated to decode. The number of
host instructions increased, but the overall running time
decreased.
- Multiprocessor performance is even harder to judge. For example,
multiplexing target processors on a single host processor may
induce TLB, cache and paging misses that lead to much worse
performance. Conversely, I/O effects may be overlapped with
simulation of other processors, reducing the effective overhead of
simulation.
- Simulating more costs more; simulators such as Shade, FX!32, etc.
are as fast as they are in part because some parts of the overall
workload (e.g., OS code) are executed native on the host machine,
rather than simulating all of the target OS code.
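For concreteness, the first three metrics above can be computed side by side; the numbers below are made up purely for illustration.

```python
# The slowdown metrics above, computed for made-up numbers.

target_native_secs = 10.0    # workload on real target hardware
simulated_secs     = 120.0   # same workload under the simulator
host_native_secs   = 8.0     # workload recompiled and run on the host
host_instrs        = 3.6e9   # host instructions spent simulating
target_instrs      = 1.2e8   # target instructions simulated

# 1. Real-time ratio: simulated vs. native target execution time.
realtime_slowdown = simulated_secs / target_native_secs        # 12.0

# 2. Host-relative ratio: simulated vs. the workload run native on host.
host_slowdown = simulated_secs / host_native_secs              # 15.0

# 3. Host instructions per target instruction: implementation-neutral,
#    but blind to cache effects (cf. the SimICS example above).
instrs_per_instr = host_instrs / target_instrs                 # 30.0

print(realtime_slowdown, host_slowdown, instrs_per_instr)
```

Note that the three numbers disagree, which is exactly the point: which metric is "the" slowdown depends on what question you are asking.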
So what we see includes:
- You can't measure the running time of a workload on a target
that does not yet or no longer exists.
- Anything that uses elapsed running times depends strongly
on the implementation technology.
- The real-world performance does vary depending on the
implementation technology.
- The host/target ratio fails to capture some significant effects,
e.g., the SimICS example.
- Multiprocessor simulation may cause higher miss rates in the
processor cache, TLB and paging memory. Conversely, simulation may
be overlapped with computation.
- Running more of the application as host code improves the observed
running time and host/target instruction ratio.
(I forget the details, but I'd definitely check out some of the early
SimICS papers for a discussion of running times; Peter has more to
say.)
Find and incorporate
Harish Patil's dissertation <patil@ch.hp.com>
on ``efficient program monitoring''.
See the
TR.
Or, try here.
From: Harish Patil
Newsgroups: comp.compilers
Subject: Thesis available: Program Monitoring
Date: 29 Jan 1997 11:21:02 -0500
Organization: Compilers Central
Message-ID: <97-01-223@comp.compilers>
Reply-To: Harish Patil
Keywords: report, available, performance
Hello everyone:
I am glad to announce that my Ph.D. thesis, titled "Efficient Program
Monitoring Techniques", is available on-line. This thesis was
completed under the supervision of Prof. Charles Fischer at the
Department of Computer Sciences, University of Wisconsin--Madison.
The thesis is available as technical report # 1320. Please check it
out at the URL:
http://www.cs.wisc.edu/Dienst/UI/2.0/Describe/ncstrl.uwmadison%2fCS-TR-96-1320
An abstract of the thesis follows.
Regards,
-Harish
Efficient Program Monitoring Techniques
---------------------------------------
Programs need to be monitored for many reasons, including performance
evaluation, correctness checking, and security. However, the cost of
monitoring programs can be very high. This thesis contributes two
techniques for reducing the high execution time overhead of program
monitoring: 1) customization and 2) shadow processing. These
techniques have been tested using a memory access monitoring system
for C programs.
"Customization" reduces the cost of monitoring programs by decoupling
monitoring from original computation. A user program can be customized
for any desired monitoring activity by deleting computation not
relevant for monitoring. The customized program is smaller, easier to
analyze, and almost always faster than the original program. It can be
readily instrumented to perform the desired monitoring. We have
explored the use of program slicing technology for customizing C
programs. Customization can cut the overhead of memory access
monitoring by up to half.
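The customization idea can be illustrated with a toy backward slice; this is a hedged sketch of the general slicing technique, not Patil's implementation, and the program representation is invented.

```python
# Toy sketch of "customization" via backward slicing: keep only the
# statements that the monitored memory accesses depend on; the smaller
# customized program is what then gets instrumented.

def customize(stmts, monitored_vars):
    """Keep statements whose results the monitored accesses depend on."""
    needed = set(monitored_vars)
    kept = []
    for target, sources in reversed(stmts):   # walk backward
        if target in needed:
            kept.append((target, sources))
            needed |= set(sources)            # their inputs are needed too
    return list(reversed(kept))

# (target, sources) pairs for a toy straight-line program; only 'addr'
# feeds the monitored access, so the 'color' computation is deleted.
program = [
    ("base",  []),
    ("i",     []),
    ("addr",  ["base", "i"]),     # address of the monitored access
    ("color", ["i"]),             # output-only work, irrelevant here
]
print(customize(program, ["addr"]))
```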
"Shadow processing" hides the cost of on-line monitoring by using idle
processors in multiprocessor workstations. A user program is
partitioned into two run-time processes. One is the main process
executing as usual, without any monitoring code. The other is a shadow
process following the main process and performing the desired
monitoring. One key issue in the use of a shadow process is the degree
to which the main process is burdened by the need to synchronize and
communicate with the shadow process. We believe the overhead to the
main process must be very modest to allow routine use of shadow
processing for heavily-used production programs. We therefore limit
the interaction between the two processes to communicating certain
irreproducible values. In our experimental shadow processing system
for memory access checking the overhead to the main process is very
low - almost always less than 10%. Further, since the shadow process
avoids repeating some of the computations from the main program, it
runs much faster than a single process performing both the computation
and monitoring.
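The shadow-processing split can be sketched in a few lines; here a thread stands in for the shadow processor, and the only traffic between the two is an irreproducible input value forwarded through a queue. This is an illustrative toy under those assumptions, not the thesis's system.

```python
# Minimal sketch of shadow processing (a thread stands in for the
# shadow processor): the main computation runs unmonitored, forwarding
# only irreproducible values through a queue; the shadow re-executes
# the same code with memory-access checking enabled.

import queue
import threading

chan = queue.Queue()
violations = []

def compute(read_input, check=None):
    x = read_input()              # irreproducible: must be forwarded
    addr = 0x100 + x
    if check:
        check(addr)               # monitoring happens only in the shadow
    return addr

def main_process():
    value = 7                     # stands in for real external input
    chan.put(value)               # forward the irreproducible value
    return compute(lambda: value)

def shadow_process():
    value = chan.get()            # replay the forwarded value
    compute(lambda: value,
            check=lambda a: violations.append(a) if a > 0x105 else None)

result = main_process()
t = threading.Thread(target=shadow_process)
t.start()
t.join()
print(result, violations)  # -> 263 [263]
```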
==========================================================================
Harish Patil: Massachusetts Language Lab - Hewlett Packard
Mail Stop CHR02DC, 300 Apollo Drive, Chelmsford MA 01824
Phone: 508 436 5717 Fax: 508 436 5135 Email: patil@apollo.hp.com