Noor Ahamed Bauani
Learn Linux for Beginner - A BLOG type Post by Bauani
Intel x86 speed
Pentium timings
See Agner Fog's Pentium optimization manual [text copy of old version]
and Intel's Intel Architecture Optimization Manual 242816.
Pentium Pro timings
See the manuals listed above. Note that Pentium
Pro optimization is very different from Pentium optimization.
Pentium MMX timings
The Pentium MMX is essentially the same as the
Pentium. Big exceptions: MMX instructions; a 16K L1 data cache; the
Pentium-Pro branch-prediction mechanism; and no first-time-in-cache
pairing restrictions.
Pentium II timings
The Pentium II is essentially the same as the
Pentium Pro. Big exceptions: MMX instructions; a 16K L1 data cache.
Pentium III timings
The Pentium III is essentially the same as the
Pentium II. Big exceptions: cache prefetch instructions (welcome to the
1990s, Intel!); SSE instructions. See Intel's
Intel Architecture Optimization Reference Manual 245127 for MMX and
SSE information. Beware that SSE uses new registers that need to be
saved in context switches; SSE code will fail sporadically on older
operating systems.
Pentium 4 timings
The Pentium 4 has a similar feel to the Pentium
III, plus SSE2 instructions. However, the internal architecture is
different. Cycle counts are generally much worse than on the Pentium III,
and often even worse than on the original Pentium.
AMD K6-2 timings
See AMD's Note 21924 (PDF).
AMD Athlon timings
See AMD's Note 22007 (PDF).
The Athlon L1 data cache is only two-way but
is a gigantic 64K. (This is one of the reasons that the Athlon is much
faster than the Pentium III.) In one cycle it can handle two 64-bit
loads, or one 64-bit load and one 64-bit store, or two 32-bit stores. It
has a first-level TLB with 24 entries for 4K pages and 8 entries for
large pages, and a second-level four-way TLB with 256 entries for 4K
pages.
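As a rough illustration of what those TLB sizes mean, the sketch below
computes "reach": the amount of memory one TLB level can map before it
starts missing. The function name is hypothetical; the arithmetic is
simply entries times page size, using the figures quoted above for 4K
pages.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: bytes of address space covered by one TLB level. */
size_t tlb_reach_bytes(size_t entries, size_t page_bytes)
{
    assert(page_bytes > 0);
    return entries * page_bytes;
}
```

With 4K pages, tlb_reach_bytes(24, 4096) is 98304 bytes (96K) for the
first level and tlb_reach_bytes(256, 4096) is 1048576 bytes (1M) for the
second level.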
The Athlon can do an FADD and an FMUL, along
with two loads, every cycle, if the code is properly scheduled. (This is
another of the reasons that the Athlon is much faster than the Pentium
III.) Both FADD and FMUL have latency 4. For example, the code
f = x[1]; f *= y[4]; r5 += f;
f = x[1]; f *= y[5]; r6 += f;
f = x[1]; f *= y[6]; r7 += f;
f = x[1]; f *= y[7]; r8 += f;
f = x[2]; f *= y[3]; r5 += f;
f = x[2]; f *= y[4]; r6 += f;
...
takes 1 cycle per line if the 8
instruction bytes in each line (3 for FLD with 8-bit displacement, 3 for
FMUL with 8-bit displacement, 2 for FADDP) are aligned to an 8-byte
boundary. The same code takes 1.5 cycles per line if the instructions
are not aligned. Julian Ruhe suggests padding floating-point
instructions with REP to hit 8-byte boundaries; an Athlon assembler
could easily take care of this.
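The dependency pattern in that example can be sketched in plain C. This
is only a stand-in: conv4 and pad_to_8 are hypothetical names, a compiler
is free to schedule the arithmetic differently, and nothing here
reproduces the hand-aligned x87 instruction stream. The point is that
four independent accumulators are updated in rotation, so each addition
chain is revisited only every fourth line, matching the latency of 4
mentioned above. pad_to_8 shows the padding arithmetic an assembler would
need to reach the next 8-byte boundary.

```c
#include <assert.h>
#include <stddef.h>

/* Accumulate four adjacent outputs out[j] = sum over i in [lo, hi] of
   x[i] * y[k + j - i], using four independent accumulators.
   The caller must guarantee that every index used is in bounds. */
void conv4(const double *x, const double *y,
           size_t lo, size_t hi, size_t k, double out[4])
{
    assert(lo <= hi && hi <= k);
    double r0 = 0.0, r1 = 0.0, r2 = 0.0, r3 = 0.0;
    for (size_t i = lo; i <= hi; ++i) {
        double f = x[i];          /* one load of x[i], reused four times */
        r0 += f * y[k - i];       /* the four chains do not depend on    */
        r1 += f * y[k + 1 - i];   /* each other, so an add on one chain  */
        r2 += f * y[k + 2 - i];   /* can overlap with adds on the others */
        r3 += f * y[k + 3 - i];
    }
    out[0] = r0; out[1] = r1; out[2] = r2; out[3] = r3;
}

/* Number of padding bytes needed to move offset up to a multiple of 8. */
size_t pad_to_8(size_t offset)
{
    return (8 - offset % 8) % 8;
}
```

Calling conv4(x, y, 1, 2, 5, out) performs the same multiplications as
the first lines of the example above: x[1] against y[4] through y[7],
then x[2] against y[3] through y[6].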
The Athlon does an excellent job of reordering
operations. (This is another of the reasons that the Athlon is much
faster than the Pentium III.)
Cycle counters
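The Pentium line and the Athlon have built-in
64-bit cycle counters, measuring time since boot. To read the cycle
counter, use machine-language bytes 15 and 49 (hexadecimal 0F 31, the
RDTSC instruction); the result is put into EDX:EAX, with the high 32
bits in EDX and the low 32 bits in EAX.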
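A minimal sketch of reading the counter from C, assuming a GCC- or
Clang-style compiler with inline assembly on an x86 target. The function
names are hypothetical. The two bytes are emitted literally, as described
above, and the two 32-bit halves are joined into one 64-bit value. On a
non-x86 target this sketch simply returns 0 rather than failing to
compile.

```c
#include <assert.h>
#include <stdint.h>

/* Join the two 32-bit register halves into one 64-bit count. */
uint64_t combine_halves(uint32_t lo, uint32_t hi)
{
    return ((uint64_t)hi << 32) | lo;
}

/* Read the cycle counter by emitting bytes 15 and 49 directly.
   Assumption: GCC/Clang inline-assembly syntax on x86. */
uint64_t read_cycle_counter(void)
{
#if defined(__i386__) || defined(__x86_64__)
    uint32_t lo, hi;
    __asm__ __volatile__(".byte 15, 49" : "=a"(lo), "=d"(hi));
    return combine_halves(lo, hi);
#else
    return 0; /* placeholder: no cycle counter read on this target */
#endif
}

/* Cycles elapsed between two readings taken in order. */
uint64_t cycles_between(uint64_t start, uint64_t end)
{
    assert(end >= start);
    return end - start;
}
```

A typical use is to call read_cycle_counter() before and after the code
being timed and pass both readings to cycles_between(). Out-of-order
execution and interrupts can blur short measurements, so repeated runs
are advisable.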
Code measurement tools
Intel's VTune Analyzer includes a Pentium
simulator and a Pentium II simulator, but it isn't free.
A usable simulator is a tremendous asset for
programmers trying to identify bottlenecks in speed-critical code. Every
CPU company has simulators for its chips; it amazes me that these
simulators aren't released for free.
Other sources of information
The Pentium Compiler Group has a Pentium-optimized version of gcc; their
documentation page has some links to x86 chip information. For more
links try Paul Hsieh's page. For an introduction to programming using
the x86 see Randall Hyde's Art of Assembly Language Programming.
* This page contains some technical information which may be hard for
most computer users to understand.
** This information is taken from the homepage of D. J. Bernstein, who
wrote qmail, the most stable MTA on the Internet.