Interesting... "The actual number of micro-ops that are dispatched may be lower, depending on a number of
factors, such as whether the processor is executing in fast or slow mode and..."
DB has his own answer:
Anyway, here is also a short excerpt:
"If a lot of things happen in, or last, multiples of 2 cycles, then we might actually be looking at two different clocks: a slow clock and a fast clock. So the small L1D$ has a latency of 4 fast clocks, but also of 2 slow clocks (which doesn't even sound like anything special for a 16K cache). The L2 latency could be seen as either 18-20 fast or 9-10 slow cycles."
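The clock arithmetic in that excerpt can be checked with a small Python sketch. It assumes, as DB speculates, that one slow clock lasts exactly two fast clocks; that ratio (FAST_PER_SLOW) and the helper to_slow_clocks are hypothetical and do not come from AMD documentation.

```python
# Hypothetical ratio taken from DB's speculation, not from AMD documentation:
# one "slow" clock tick lasts as long as two "fast" clock ticks.
FAST_PER_SLOW = 2


def to_slow_clocks(fast_clocks: float) -> float:
    """Convert a latency counted in fast clocks into slow clocks."""
    return fast_clocks / FAST_PER_SLOW


# Latencies quoted in the excerpt, counted in fast clocks.
LATENCIES_FAST = {
    "L1D": [4],
    "L2": [18, 20],  # lower and upper end of the quoted range
}

if __name__ == "__main__":
    for cache, fast in LATENCIES_FAST.items():
        slow = [to_slow_clocks(c) for c in fast]
        print(f"{cache}: {fast} fast clocks -> {slow} slow clocks")
```

If the 2:1 ratio holds, the quoted 4-cycle L1D and 18-20-cycle L2 latencies map directly onto 2 and 9-10 slow clocks, which is the point of DB's argument.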
1.6.4 Instruction Fetching Improvements
While previous AMD64 processors had a single 32-byte fetch window, AMD Family 15h processors
have two 32-byte fetch windows, from which four µops can be selected. These fetch windows, when
combined with the 128-bit floating-point execution unit, allow the processor to sustain a
fetch/dispatch/retire sequence of four instructions per cycle.
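The peak width quoted above gives a simple lower bound on execution time. The Python sketch below is an idealized model, not AMD's: it only uses the stated figures (four instructions per cycle, two 32-byte windows), ignores stalls, and, as the guide itself notes, real dispatch may be lower. The names min_cycles and max_avg_length are hypothetical, and the "single window" limit assumes all four instructions must come from one 32-byte window.

```python
import math

# Figures stated in the guide excerpt above.
PEAK_WIDTH = 4           # instructions fetched/dispatched/retired per cycle (peak)
FETCH_WINDOW_BYTES = 32  # size of one fetch window in bytes
FETCH_WINDOWS = 2        # number of fetch windows on Family 15h


def min_cycles(n_instructions: int, width: int = PEAK_WIDTH) -> int:
    """Lower bound on cycles needed to retire n instructions at the peak width."""
    return math.ceil(n_instructions / width)


def max_avg_length(window_bytes: int = FETCH_WINDOW_BYTES,
                   width: int = PEAK_WIDTH) -> float:
    """Largest average instruction length (in bytes) for which `width`
    instructions still fit into a single fetch window (idealized)."""
    return window_bytes / width


if __name__ == "__main__":
    print("combined size of both fetch windows:", FETCH_WINDOWS * FETCH_WINDOW_BYTES)
    print("cycles for 1000 instructions, at best:", min_cycles(1000))
    print("max average instruction length, one window:", max_avg_length())
```

Under this model, code whose instructions average more than 8 bytes cannot reach four per cycle from a single 32-byte window, which is one way to read why a second window helps.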
1.6.6 Notable Performance Improvements
Several enhancements to the AMD64 architecture have resulted in significant performance
improvements in AMD Family 15h processors, including:
• Improved performance of shuffle instructions
• Improved data transfer between floating-point registers and general purpose registers
• Improved floating-point register to floating-point register moves
• Optimization of repeated move instructions
• More efficient PUSH/POP stack operations
• 1-Gbyte paging
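Why 1-Gbyte pages can matter is easiest to see by counting translations. The Python sketch below computes how many pages are needed to map a region at the common x86-64 page sizes (4 KiB, 2 MiB, 1 GiB); fewer pages means each TLB entry covers more memory. The helper pages_needed is hypothetical, and the actual TLB sizes of Family 15h are not given in this excerpt.

```python
KiB = 1 << 10
MiB = 1 << 20
GiB = 1 << 30

# Common x86-64 page sizes; "1-Gbyte paging" refers to the largest one.
PAGE_SIZES = {"4 KiB": 4 * KiB, "2 MiB": 2 * MiB, "1 GiB": 1 * GiB}


def pages_needed(region_bytes: int, page_bytes: int) -> int:
    """Number of pages (ideally, translations) needed to map a region.

    Uses integer ceiling division so a partial last page still counts.
    """
    return (region_bytes + page_bytes - 1) // page_bytes


if __name__ == "__main__":
    region = 16 * GiB
    for name, size in PAGE_SIZES.items():
        print(f"{name} pages for 16 GiB: {pages_needed(region, size):,}")
```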
2.1 Key Microarchitecture Features
AMD Family 15h processors include many features designed to improve software performance. The
internal design, or microarchitecture, of these processors provides the following key features:
• Integrated DDR3 memory controller with memory prefetcher
• 64-Kbyte L1 instruction cache and 16-Kbyte L1 data cache
• Shared L2 cache between cores of compute unit
• Shared L3 cache between compute units on chip (for supported platforms)
• 32-byte instruction fetch
• Instruction predecode and branch prediction during cache-line fills
• Decoupled prediction and instruction fetch pipelines
• Four-way AMD64 instruction decoding (This is a theoretical limit. See section 2.3 on page 31.)
• Dynamic scheduling and speculative execution
• Two-way integer execution
• Two-way address generation
• Two-way 128-bit wide floating-point execution
• Legacy single-instruction multiple-data (SIMD) instruction extensions, as well as support for
XOP, FMA4, VPERMILx, and Advanced Vector Extensions (AVX).
• Superforwarding
• Prefetch into L2 or L1 data cache
• Deep out-of-order integer and floating-point execution
• HyperTransport™ technology


