White Paper
Inside Intel® Core™
Microarchitecture
Setting New Standards for
Energy-Efficient Performance
Ofri Wechsler
Intel Fellow, Mobility Group Director,
Mobility Microprocessor Architecture
Intel Corporation
Contents
Introduction
Intel® Core™ Microarchitecture Design Goals
Delivering Energy-Efficient Performance
Intel® Core™ Microarchitecture Innovations
  Intel® Wide Dynamic Execution
  Intel® Intelligent Power Capability
  Intel® Advanced Smart Cache
  Intel® Smart Memory Access
  Intel® Advanced Digital Media Boost
Intel® Core™ Microarchitecture and Software
Summary
Learn More
Author Biographies
Introduction
The Intel® Core™ microarchitecture is a new foundation for
Intel® architecture-based desktop, mobile, and mainstream server
multi-core processors. This state-of-the-art multi-core optimized
and power-efficient microarchitecture is designed to deliver
increased performance and performance-per-watt, thus increasing
overall energy efficiency. This new microarchitecture extends
the energy efficient philosophy first delivered in Intel's mobile
microarchitecture found in the Intel® Pentium® M processor, and
greatly enhances it with many new and leading edge microarchitectural
innovations as well as existing Intel NetBurst®
microarchitecture features. What's more, it incorporates many
new and significant innovations designed to optimize the
power, performance, and scalability of multi-core processors.
The Intel Core microarchitecture shows Intel's continued
innovation by delivering both greater energy efficiency
and compute capability required for the new workloads
and usage models now making their way across computing.
With its higher performance and low power, the new Intel Core
microarchitecture will be the basis for many new solutions
and form factors. In the home, these include higher performing,
ultra-quiet, sleek and low-power computer designs, and new
advances in more sophisticated, user-friendly entertainment
systems. For IT, it will reduce space and electricity burdens in
server data centers, as well as increase responsiveness, produc-
tivity and energy efficiency across client and server platforms.
For mobile users, the Intel Core microarchitecture means
greater computer performance combined with leading battery
life to support a variety of small form factors that enable world-class
computing "on the go." Overall, its higher performance,
greater energy efficiency, and more responsive multitasking
will enhance user experiences in all environments: in homes,
businesses, and on the go.
Intel® Core™ Microarchitecture Design Goals
Intel continues to drive platform enhancements that increase
the overall user experience. Some of these enhancements
include areas such as connectivity, manageability, security,
and reliability, as well as compute capability. One of the means
of significantly increasing compute capability is with Intel®
multi-core processors delivering greater levels of performance
and performance-per-watt capabilities. The move to
multi-core processing has also opened the door to many
other microarchitectural innovations that improve
performance even further. Intel Core microarchitecture
is one such state-of-the-art microarchitectural update that
was designed to deliver increased performance combined
with superior power efficiency. As such, Intel Core microarchitecture
is focused on enhancing existing and emerging
application and usage models across each platform segment,
including desktop, server, and mobile.
Figure 1. This diagram shows the difference between processor architecture and microarchitecture. Processor Architecture refers to the
instruction set, registers, and memory-resident data structures that are public to a programmer. Processor architecture maintains instruction
set compatibility so processors will run code written for processor generations, past, present, and future. Microarchitecture refers to the
implementation of processor architecture in silicon. Within a family of processors, the microarchitecture is often enhanced over time to deliver
improvements in performance and capability, while maintaining compatibility to the architecture.
Delivering Energy-Efficient Performance
In the microprocessor world, performance usually refers to the amount of time it takes to execute
a given application or task, or the ability to run multiple applications or tasks within a given period
of time. Contrary to a popular misconception, it is not clock frequency (GHz) alone or the number
of instructions executed per clock cycle (IPC) alone that equates to performance. True performance
is a combination of both clock frequency (GHz) and IPC [1]. As such, performance can be computed
as a product of frequency and instructions per clock cycle:
Performance = Frequency x Instructions per Clock Cycle
This shows that the performance can be improved by increasing frequency, IPC, or possibly both.
It turns out that frequency is a function of both the manufacturing process and the micro-
architecture. At a given clock frequency, the IPC is a function of processor microarchitecture
and the specific application being executed. Although it is not always feasible to improve both
the frequency and the IPC, increasing one and holding the other close to constant with the prior
generation can still achieve a significantly higher level of performance.
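As a rough illustration of the relation above, the following Python sketch computes performance as the product of frequency and IPC. The function name and the sample frequency and IPC values are hypothetical and are not measurements of any processor. They only show that raising IPC at a constant clock raises performance in proportion.

```python
def performance(frequency_ghz: float, ipc: float) -> float:
    """Billions of instructions per second: frequency x instructions per clock."""
    return frequency_ghz * ipc


# Hypothetical figures chosen for illustration only.
baseline = performance(frequency_ghz=3.0, ipc=1.0)
wider_core = performance(frequency_ghz=3.0, ipc=1.5)  # same clock, higher IPC
speedup = wider_core / baseline

print(f"baseline={baseline:.1f}  wider core={wider_core:.1f}  speedup={speedup:.2f}x")
```

Holding frequency constant and raising IPC by half yields a 1.5x result in this model, which is the "increase one, hold the other" case the text describes.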
In addition to the two methods of increasing performance described above, it is also possible to
increase performance by reducing the number of instructions that it takes to execute the specific
task being measured. Single Instruction Multiple Data (SIMD) is a technique used to accomplish
this. Intel first implemented 64-bit integer SIMD instructions in 1996 on the Intel® Pentium®
processor with MMX™ technology and subsequently introduced 128-bit SIMD single precision
floating point, or Streaming SIMD Extensions (SSE), on the Pentium III processor and SSE2 and
SSE3 extensions in subsequent generations. Another innovative technique that Intel introduced
in its mobile microarchitecture is called microfusion. Intel's microfusion combines many common
micro-operations or micro-ops (instructions internal to the processor) into a single micro-op,
such that the total number of micro-ops that need to be executed for a given task is reduced.
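To make the instruction-count argument for SIMD concrete, here is a minimal Python model. It is an assumption-laden sketch, not a description of real hardware: `packed_add` and `simd_add_count` are hypothetical helpers that count one "instruction" for each register-wide group of elements.

```python
import math


def scalar_add_count(n_elements: int) -> int:
    """Scalar code issues one add per element."""
    return n_elements


def simd_add_count(n_elements: int, element_bits: int, register_bits: int = 128) -> int:
    """Packed code issues one add per register-wide group of elements."""
    lanes = register_bits // element_bits
    return math.ceil(n_elements / lanes)


def packed_add(a, b, lanes):
    """Model a packed add: each step processes `lanes` element pairs at once."""
    out, steps = [], 0
    for i in range(0, len(a), lanes):
        out.extend(x + y for x, y in zip(a[i:i + lanes], b[i:i + lanes]))
        steps += 1
    return out, steps


values, steps = packed_add(list(range(8)), [1] * 8, lanes=4)
print(values, steps)
print(scalar_add_count(16), simd_add_count(16, element_bits=32))
```

With 32-bit elements in a 128-bit register, sixteen additions need four packed instructions in this model instead of sixteen scalar ones.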
As Intel has continued to focus on delivering capabilities that best meet customer needs,
it has also become important to look at delivering optimal performance combined with energy
efficiency, taking into account the amount of power the processor will consume to generate
the performance needed for a specific task. Here power consumption is the product of three
terms: the dynamic capacitance (the ratio of the electrostatic charge on a conductor to the
potential difference between the conductors required to maintain that charge) required to
maintain IPC efficiency, the square of the voltage supplied to the transistors and I/O buffers,
and the frequency at which the transistors and signals switch. This can be expressed as:
Power = Dynamic Capacitance x Voltage x Voltage x Frequency
Taking into account this power equation along with the previous performance equation, designers can
carefully balance IPC efficiency and dynamic capacitance with the required voltage and frequency to
optimize for performance and power efficiency. The balance of this paper will explain how Intel's new
microarchitecture delivers leadership performance and performance-per-watt using this foundation.
1. Performance also can be achieved through multiple cores, multiple threads, and using
special purpose hardware. Those discussions are beyond the scope of this paper.
Please refer to Intel's white paper: Platform 2015: Intel® Processor and Platform
Evolution for the Next Decade for further details.
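The trade-off between the performance and power relations can be explored with a small Python sketch. All numbers below are hypothetical and in arbitrary units. The two "designs" are invented for illustration and do not correspond to any Intel product.

```python
def performance(frequency: float, ipc: float) -> float:
    """Performance = Frequency x Instructions per Clock Cycle."""
    return frequency * ipc


def dynamic_power(capacitance: float, voltage: float, frequency: float) -> float:
    """Power = Dynamic Capacitance x Voltage x Voltage x Frequency."""
    return capacitance * voltage * voltage * frequency


# Hypothetical design points (arbitrary units).
designs = {
    "high clock, low IPC": dict(capacitance=1.0, voltage=1.2, frequency=3.0, ipc=1.0),
    "lower clock, high IPC": dict(capacitance=1.0, voltage=1.0, frequency=2.5, ipc=1.4),
}

for name, d in designs.items():
    perf = performance(d["frequency"], d["ipc"])
    power = dynamic_power(d["capacitance"], d["voltage"], d["frequency"])
    print(f"{name}: performance={perf:.2f} power={power:.2f} perf/watt={perf / power:.2f}")
```

Because voltage enters as a square, a design that reaches its performance through higher IPC at a lower voltage and frequency can come out ahead on both performance and power in this model.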
Intel® Core™ Microarchitecture Innovations
Intel has long been the leader in driving down power consumption in laptops. The mobile microarchitecture
found in the Intel Pentium M processor and Intel® Centrino® mobile technology has consistently delivered
an industry-leading combination of laptop performance, performance-per-watt, and battery life. Intel
NetBurst microarchitecture has also delivered a number of innovations enabling great performance
in the desktop and server segments.
Now, Intel's new microarchitecture will combine key industry-leading elements of each of these
existing microarchitectures, along with a number of new and significant performance and
power innovations designed to optimize the performance, energy efficiency, and scalability
of multi-core processors.
The balance of this paper will discuss these key Intel Core microarchitecture innovations:
• Intel® Wide Dynamic Execution
• Intel® Intelligent Power Capability
• Intel® Advanced Smart Cache
• Intel® Smart Memory Access
• Intel® Advanced Digital Media Boost
Intel® Wide Dynamic Execution
Dynamic execution is a combination of techniques (data flow analysis, speculative execution,
out-of-order execution, and superscalar) that Intel first implemented in the P6 microarchitecture
used in the Pentium Pro processor, Pentium II processor, and Pentium III processors. For Intel
NetBurst microarchitecture, Intel introduced its Advanced Dynamic Execution engine, a very
deep, out-of-order speculative execution engine designed to keep the processor's execution
units executing instructions. It also featured an enhanced branch-prediction algorithm to reduce
the number of branch mispredictions.
Now with the Intel Core microarchitecture, Intel significantly enhances this capability with Intel
Wide Dynamic Execution. It enables delivery of more instructions per clock cycle to improve
execution time and energy efficiency. Every execution core is wider, allowing each core to fetch,
dispatch, execute, and return up to four full instructions simultaneously. (Intel's Mobile and Intel
NetBurst microarchitectures could handle three instructions at a time.) Further efficiencies include
more accurate branch prediction, deeper instruction buffers for greater execution flexibility, and
additional features to reduce execution time.
One such feature for reducing execution time is macrofusion. In previous generation processors,
each incoming instruction was individually decoded and executed. Macrofusion enables
common instruction pairs (such as a compare followed by a conditional jump) to be combined
into a single internal instruction (micro-op) during decoding. Two program instructions can
then be executed as one micro-op, reducing the overall amount of work the processor has to do.
This increases the overall number of instructions that can be run within any given period of time
or reduces the amount of time to run a set number of instructions. By doing more in less
time, macrofusion improves overall performance and energy efficiency.
The Intel Core microarchitecture also includes an enhanced Arithmetic Logic Unit (ALU)
to further facilitate macrofusion. Its single cycle execution of combined instruction pairs results
in increased performance for less power.
The Intel Core microarchitecture also enhances micro-op fusion, an energy-saving technique
Intel first used in the Pentium M processor. In modern mainstream processors, x86 program
instructions (macro-ops) are broken down into small pieces, called micro-ops, before being sent
down the processor pipeline to be processed. Micro-op fusion "fuses" micro-ops derived from
the same macro-op to reduce the number of micro-ops that need to be executed. Reduction
in the number of micro-ops results in more efficient scheduling and better performance
at lower power. Studies have shown that micro-op fusion can reduce the number of micro-ops
handled by the out-of-order logic by more than ten percent. With the Intel Core microarchitecture,
the number of micro-ops that can be fused internally within the processor is extended.
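The idea behind macrofusion, pairing a compare with the conditional jump that follows it, can be sketched as a toy decoder in Python. This is a simplified model under stated assumptions: every unfused instruction is treated as exactly one micro-op, instructions are bare mnemonics, and the `COMPARES` and `CONDITIONAL_JUMPS` sets are illustrative rather than the actual list of fusible pairs.

```python
# Illustrative mnemonic sets; the real fusible pairs are defined by the hardware.
COMPARES = {"cmp"}
CONDITIONAL_JUMPS = {"je", "jne", "jl", "jg"}


def decode(instructions, macrofusion=True):
    """Return the micro-op stream for a list of instruction mnemonics."""
    micro_ops = []
    i = 0
    while i < len(instructions):
        current = instructions[i]
        following = instructions[i + 1] if i + 1 < len(instructions) else None
        if macrofusion and current in COMPARES and following in CONDITIONAL_JUMPS:
            micro_ops.append(f"{current}+{following}")  # one fused micro-op
            i += 2
        else:
            micro_ops.append(current)
            i += 1
    return micro_ops


program = ["mov", "add", "cmp", "jne", "mov", "cmp", "je"]
print(len(decode(program, macrofusion=False)), decode(program, macrofusion=False))
print(len(decode(program)), decode(program))
```

In this toy program the seven instructions decode to five micro-ops when the two compare-and-jump pairs are fused.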
Figure 2. With the Intel Wide Dynamic
Execution of the Intel Core microarchitecture,
every execution core in a multi-core processor
is wider. This allows each core to fetch, dispatch,
execute, and return up to four full instructions
simultaneously. A single multi-core processor
with four cores could fetch, dispatch, execute,
and return up to 16 instructions simultaneously.
Intel® Intelligent Power Capability
Intel Intelligent Power Capability is a set of capabilities designed
to reduce power consumption and design requirements. This feature
manages the runtime power consumption of all the processor's
execution cores. It includes an advanced power gating capability
that allows for an ultra fine-grained logic control that turns on
individual processor logic subsystems only if and when they are
needed. Additionally, many buses and arrays are split so that data
required in some modes of operation can be put in a low power
state when not needed.
In the past, implementing power gating has been challenging
because of the power consumed in the powering down and ramping
back up, as well as the need to maintain system responsiveness
when returning to full power. Through Intel Intelligent Power
Capability, we've been able to satisfy these concerns, ensuring
significant power savings without sacrificing responsiveness.
The result is excellent energy optimization enabling the Intel Core
microarchitecture to deliver more energy-efficient performance
for desktop PCs, mobile PCs, and servers.
Intel® Advanced Smart Cache
The Intel Advanced Smart Cache is a multi-core optimized cache
that improves performance and efficiency by increasing the
probability that each execution core of a dual-core processor can
access data from a higher-performance, more-efficient cache sub-
system. To accomplish this, Intel shares L2 cache between cores.
To understand the advantage of this design, consider that
most current multi-core implementations don't share L2 cache
among execution cores. This means when two execution cores
need the same data, they each have to store it in their own
L2 cache. With Intel's shared L2 cache, the data only has to
be stored in one place that each core can access. This better
optimizes cache resources.
By sharing the L2 cache among the cores, the Intel Advanced
Smart Cache also allows each core to dynamically utilize up to
100 percent of available L2 cache. When one core has minimal
cache requirements, other cores can increase their percentage
of L2 cache, reducing cache misses and increasing performance.
Multi-Core Optimized Cache also enables obtaining data from
cache at higher throughput rates.
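The benefit of letting a busy core use cache space that an idle core does not need can be illustrated with a small simulation. The Python sketch below is only a model: it assumes a simple least-recently-used (LRU) replacement policy, a hypothetical capacity of four lines per private cache versus eight lines shared, and an invented access trace. None of these reflect the actual cache organization.

```python
from collections import OrderedDict


class LRUCache:
    """A tiny LRU cache that only counts misses."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.lines = OrderedDict()
        self.misses = 0

    def access(self, address):
        if address in self.lines:
            self.lines.move_to_end(address)   # mark as most recently used
            return
        self.misses += 1
        self.lines[address] = True
        if len(self.lines) > self.capacity:
            self.lines.popitem(last=False)    # evict least recently used


def count_misses(trace, cache_for_core):
    """Replay (core, address) accesses and return total misses."""
    for core, address in trace:
        cache_for_core[core].access(address)
    return sum(cache.misses for cache in set(cache_for_core.values()))


# Core 0 loops over six lines; core 1 is nearly idle and touches one line.
trace = []
for _ in range(10):
    trace.extend((0, address) for address in range(6))
    trace.append((1, 100))

private_misses = count_misses(trace, {0: LRUCache(4), 1: LRUCache(4)})
shared_cache = LRUCache(8)
shared_misses = count_misses(trace, {0: shared_cache, 1: shared_cache})
print(private_misses, shared_misses)  # prints "61 7"
```

With private caches, core 0's six-line working set thrashes its four-line cache while most of core 1's cache sits unused. With the same total capacity shared, only the first touch of each line misses.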
Figure 3. In a multi-core processor where two cores don't share L2 cache,
an idle core also means idle L2 cache space. This is a critical waste of resources,
especially when another core may be suffering a performance hit because its
L2 cache is too full. Intel's shared L2 cache design enables the working core
to dynamically take over the entire L2 cache and maximize performance.
Intel® Smart Memory Access
Intel Smart Memory Access improves system performance by optimizing the use of the
available data bandwidth from the memory subsystem and hiding the latency of memory
accesses. The goal is to ensure that data can be used as quickly as possible and that this
data is located as close as possible to where it's needed to minimize latency and thus
improve efficiency and speed.
Intel Smart Memory Access includes an important new capability called memory
disambiguation, which increases the efficiency of out-of-order processing by providing the
execution cores with the built-in intelligence to speculatively load data for instructions
that are about to execute BEFORE all previous store instructions are executed. To understand
how this works, we have to look at what happens in most out-of-order microprocessors.
Normally when an out-of-order microprocessor reorders instructions, it can't reschedule loads
ahead of stores because it doesn't know if there are any data location dependencies it might be
violating. Yet in many cases, loads don't depend on a previous store and really could be loaded
before, thus improving efficiency. The problem is identifying which loads are okay to load and
which aren't.
Intel's memory disambiguation uses special intelligent algorithms to evaluate whether or
not a load can be executed ahead of a preceding store. If it intelligently speculates that it can,
then the load instructions can be scheduled before the store instructions to enable the
highest possible instruction-level parallelism. If the speculative load ends up being valid, the
processor spends less time waiting and more time processing, resulting in faster execution
and more efficient use of processor resources. In the rare event that the load is invalid, Intel's
memory disambiguation has built-in intelligence to detect the conflict, reload the correct data
and re-execute the instruction.
In addition to memory disambiguation, Intel Smart Memory Access includes advanced
prefetchers. Prefetchers do just that: they "prefetch" memory contents before they are requested
so they can be placed in cache and then readily accessed when needed. Increasing the number
of loads that occur from cache versus main memory reduces memory latency and
improves performance.
To ensure data is where each execution core needs it, the Intel Core microarchitecture
uses two prefetchers per L1 cache and two prefetchers per L2 cache. These prefetchers
detect multiple streaming and strided access patterns simultaneously. This enables them
to ready data in the L1 cache for "just-in-time" execution. The prefetchers for the L2 cache
analyze accesses from cores to ensure that the L2 cache holds the data the cores may
need in the future.
Combined, the advanced prefetchers and the memory disambiguation result in improved
execution throughput by maximizing the available system-bus bandwidth and hiding
latency to the memory subsystem.
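The speculate, verify, and re-execute flow of memory disambiguation can be sketched in a few lines of Python. This is a deliberately simplified model and not Intel's algorithm: the hypothetical function below always speculates, leaving out the prediction step that decides whether speculation is worthwhile, and it represents memory as a plain dictionary.

```python
def run_with_speculative_load(memory, older_stores, load_address):
    """
    memory:        dict mapping address -> value before the stores execute
    older_stores:  list of (address, value) that precede the load in program order
    load_address:  address read by the load

    Returns (value the load finally observes, whether it had to be re-executed).
    """
    # 1. Speculate: perform the load before the older stores have executed.
    speculative_value = memory.get(load_address, 0)

    # 2. The older stores execute, in program order.
    for address, value in older_stores:
        memory[address] = value

    # 3. Verify: a store to the same address invalidates the early load.
    conflict = any(address == load_address for address, _ in older_stores)
    if conflict:
        return memory[load_address], True   # reload the correct data
    return speculative_value, False


# Independent load: the speculation holds.
print(run_with_speculative_load({0x10: 1, 0x20: 2}, [(0x10, 99)], 0x20))  # (2, False)
# Dependent load: the conflict is detected and the load is redone.
print(run_with_speculative_load({0x10: 1, 0x20: 2}, [(0x10, 99)], 0x10))  # (99, True)
```

In both cases the load returns the value that program order requires. The difference is only whether the early load could be kept or had to be repeated.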
Intel® Advanced Digital Media Boost
The Intel Advanced Digital Media Boost is a feature that significantly improves performance
when executing Streaming SIMD Extension (SSE) instructions. 128-bit SIMD integer arithmetic
and 128-bit SIMD double-precision floating-point operations reduce the overall number of
instructions required to execute a particular program task, and as a result can contribute to an
overall performance increase. They accelerate a broad range of applications, including video,
speech and image, photo processing, encryption, financial, engineering, and scientific applications.
SSE instructions enhance the Intel architecture by enabling programmers to develop algorithms
that can mix packed single-precision floating-point and integer data, using SSE and MMX
instructions respectively.
On many previous generation processors, 128-bit SSE, SSE2 and SSE3 instructions were executed
at a sustained rate of one complete instruction every two clock cycles (for example, the lower
64 bits in one cycle and the upper 64 bits in the next). The Intel Advanced Digital Media Boost
feature enables these 128-bit instructions to be completely executed at a throughput rate of one
per clock cycle, effectively doubling the speed of execution for these instructions. This further
adds to the overall efficiency of Intel Core microarchitecture by increasing the number of
instructions handled per cycle. Intel Advanced Digital Media Boost is particularly useful when
running many important multimedia operations involving graphics, video and audio, and
processing other rich data sets that use SSE, SSE2 and SSE3 instructions.
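The difference between executing a 128-bit operation as two 64-bit halves and executing it at full width can be modeled in Python. This is a sketch under assumptions: 128-bit registers are represented as Python integers, the operation is a packed add of two 64-bit lanes, and the cycle counts are simply the one-versus-two figures quoted in the text, not measured values.

```python
MASK64 = (1 << 64) - 1


def add_lane(x: int, y: int) -> int:
    """Add one 64-bit lane with wraparound; no carry leaves the lane."""
    return (x + y) & MASK64


def execute_split(a: int, b: int):
    """Earlier model: lower 64 bits in one cycle, upper 64 bits in the next."""
    low = add_lane(a & MASK64, b & MASK64)   # cycle 1
    high = add_lane(a >> 64, b >> 64)        # cycle 2
    return (high << 64) | low, 2


def execute_full_width(a: int, b: int):
    """Full-width model: both lanes are processed in the same cycle."""
    low = add_lane(a & MASK64, b & MASK64)
    high = add_lane(a >> 64, b >> 64)
    return (high << 64) | low, 1


a = (5 << 64) | MASK64   # upper lane 5, lower lane all ones
b = (7 << 64) | 1        # upper lane 7, lower lane 1
split_result, split_cycles = execute_split(a, b)
full_result, full_cycles = execute_full_width(a, b)
print(split_result == full_result, split_cycles, full_cycles)  # True 2 1
```

Both paths compute the same result. Only the cycle count differs, which is why a stream of such instructions finishes in half the cycles in this model.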
Figure 4. With Intel Single
Cycle SSE, 128-bit instructions
can be completely executed
at a throughput rate of one
per clock cycle, effectively
doubling the speed of execution
for these instructions.
Intel® Core™ Microarchitecture and Software
Intel expects that the majority of existing applications will see immediate benefits
when running on processors that are based upon the Intel Core microarchitecture.
For more information on software and the Intel Core microarchitecture, please visit
the Intel® Software Network on the Intel Web site at www.intel.com/software.
Summary
The Intel Core microarchitecture is a new, state-of-the-art, multi-core optimized
microarchitecture that delivers a number of new and innovative features that will
set new standards for energy-efficient performance. This energy-efficient, low
power, high-performing, and scalable blueprint will be the foundation for future
Intel-based server, desktop, and mobile multi-core processors.
This new microarchitecture extends the energy efficient philosophy first delivered
in Intel's mobile microarchitecture found in the Intel® Pentium® M processor, and
greatly enhances it with many new and leading edge microarchitectural innovations
as well as existing Intel NetBurst® microarchitecture features. Products based on
Intel Core microarchitecture will enter the market in the second half of 2006, and
will enable a wave of innovation across desktop, server, and mobile platforms. Desktops
can deliver greater compute performance as well as ultra-quiet, sleek and low-power
designs. Servers can deliver greater compute density, and laptops can take the
increasing compute capability of multi-core to new mobile form factors.
Author Biography
Ofri Wechsler is an Intel Fellow in the Mobility Group and director of Mobility
Microprocessor Architecture at Intel Corporation. In this role, Wechsler is responsible
for the architecture of the new Intel® Core™ Duo processor, the upcoming processor code
named "Merom," and the architecture development of other next-generation CPUs.
Previously, Wechsler served as manager for the IDC AV, responsible for the validation of
the P55C. Wechsler joined Intel in 1989 as a design engineer for i860. He received his
bachelor's degree in electrical engineering from Ben Gurion University, Beer Sheva,
Israel, in 1998. He has four U.S. patents.
Learn More
You can discover much more by visiting these Intel Web sites:
Intel® Core™ Duo processors
www.intel.com/products/processor/coreduo
Intel® Platforms
www.intel.com/platforms
Intel Multi-Core
www.intel.com/multi-core
Intel Architectural Innovation
www.intel.com/technology/architecture
Energy-Efficient Performance
www.intel.com/technology/eep
Intel, Intel logo, Intel. Leap ahead., Intel. Leap ahead. logo, Centrino, Pentium,
and Xeon are trademarks or registered trademarks of Intel Corporation
or its subsidiaries in the United States and other countries.
Copyright © 2006 Intel Corporation. All rights reserved.
*Other names and brands may be claimed as the property of others.
Printed in the United States.
0306/RMR/HBD/2K
311830-001US