
P6 Family of Processors
Hardware Developer's Manual
September 1998

Order No: 244001-001

Information in this document is provided in connection with Intel products. No license, express or implied, by estoppel or otherwise, to any intellectual property rights is granted by this document. Except as provided in Intel's Terms and Conditions of Sale for such products, Intel assumes no liability whatsoever, and Intel disclaims any express or implied warranty, relating to sale and/or use of Intel products including liability or warranties relating to fitness for a particular purpose, merchantability, or infringement of any patent, copyright or other intellectual property right. Intel products are not intended for use in medical, life saving, or life sustaining applications.

Intel may make changes to specifications and product descriptions at any time, without notice.

Designers must not rely on the absence or characteristics of any features or instructions marked "reserved" or "undefined." Intel reserves these for future definition and shall have no responsibility whatsoever for conflicts or incompatibilities arising from future changes to them.

The Pentium® II processor may contain design defects or errors known as errata which may cause the product to deviate from published specifications. Current characterized errata are available on request.

Contact your local Intel sales office or your distributor to obtain the latest specifications and before placing your product order.

Copies of documents which have an ordering number and are referenced in this document, or other Intel literature, may be obtained by calling 1-800-548-4725 or by visiting Intel's website at http://www.intel.com

Copyright © Intel Corporation 1998.

* Third-party brands and names are the property of their respective owners.


Contents
1 Introduction
    1.1 P6 FAMILY OF PROCESSORS OVERVIEW
    1.2 TERMINOLOGY
    1.3 SPECIFIC PRODUCT REFERENCES

2 Micro-Architecture Overview
    2.1 FULL CORE UTILIZATION
    2.2 THE P6 FAMILY PROCESSOR PIPELINE
        2.2.1 The Fetch/Decode Unit
        2.2.2 The Dispatch/Execute Unit
        2.2.3 The Retire Unit
        2.2.4 The Bus Interface Unit
    2.3 ARCHITECTURE SUMMARY

3 System Bus Overview
    3.1 SIGNALING ON P6 FAMILY SYSTEM BUS
    3.2 SIGNAL OVERVIEW
        3.2.1 Execution Control Signals
        3.2.2 Arbitration Signals
        3.2.3 Request Signals
        3.2.4 Snoop Signals
        3.2.5 Response Signals
        3.2.6 Data Response Signals
            3.2.6.1 LINE TRANSFERS
            3.2.6.2 PART LINE ALIGNED TRANSFERS
            3.2.6.3 PARTIAL TRANSFERS
        3.2.7 Error Signals
        3.2.8 Compatibility Signals
        3.2.9 Diagnostic Signals

4 Data Integrity
    4.1 ERROR CLASSIFICATION
    4.2 P6 FAMILY PROCESSOR SYSTEM BUS DATA INTEGRITY ARCHITECTURE
        4.2.1 Bus Signals Protected Directly
        4.2.2 Bus Signals Protected Indirectly
        4.2.3 Unprotected Bus Signals
        4.2.4 Hard-Error Response
        4.2.5 P6 Family Processor System Bus Error Code Algorithms
            4.2.5.1 PARITY ALGORITHM
            4.2.5.2 P6 FAMILY SYSTEM BUS ECC ALGORITHM

5 Configuration
    5.1 DESCRIPTION
        5.1.1 Output Tristate
        5.1.2 Built-in Self-test
        5.1.3 Data Bus Error Checking Policy
        5.1.4 Response Signal Parity Error Checking Policy
        5.1.5 AERR# Driving Policy
        5.1.6 AERR# Observation Policy
        5.1.7 BERR# Driving Policy for Initiator Bus Errors
        5.1.8 BERR# Driving Policy for Target Bus Errors
        5.1.9 BERR# Driving Policy for Initiator Internal Errors
        5.1.10 BINIT# Driving Policy
        5.1.11 BINIT# Observation Policy
        5.1.12 In-Order Queue Pipelining
        5.1.13 Power-On Reset Vector
        5.1.14 FFRC Mode Enable
        5.1.15 APIC Mode
        5.1.16 APIC Cluster ID
        5.1.17 Symmetric Agent Arbitration ID
        5.1.18 Low Power Standby Enable
    5.2 CLOCK FREQUENCIES AND RATIOS
    5.3 POWER-ON CONFIGURATION REGISTER
    5.4 INITIALIZATION PROCESS

6 Test Access Port (TAP)
    6.1 INTERFACE
    6.2 ACCESSING THE TAP LOGIC
        6.2.1 Accessing the Instruction Register
        6.2.2 Accessing the Data Registers
    6.3 INSTRUCTION SET
    6.4 DATA REGISTER SUMMARY
        6.4.1 Bypass Register
        6.4.2 Device ID Register
        6.4.3 BIST Result Boundary Scan Register
        6.4.4 Boundary Scan Register
    6.5 RESET BEHAVIOR

7 Integration Tools
    7.1 IN-TARGET PROBE (ITP) FOR P6 FAMILY PROCESSORS
        7.1.1 Primary Function
        7.1.2 Debug Port Connector Description
        7.1.3 Debug Port Signal Descriptions
        7.1.4 Debug Port Signal Notes
            7.1.4.1 SIGNAL NOTE 1: DBRESET#
            7.1.4.2 SIGNAL NOTE 5: TDO AND TDI
            7.1.4.3 SIGNAL NOTE 7: TCK
        7.1.5 Debug Port Layout
            7.1.5.1 SIGNAL QUALITY NOTES
            7.1.5.2 DEBUG PORT CONNECTOR
        7.1.6 Using Boundary Scan to Communicate to the Processor

A Signals Reference
    A.1 ALPHABETICAL SIGNALS LISTING
        A.1.1 A[35:3]# (I/O)
        A.1.2 A20M# (I)
        A.1.3 ADS# (I/O)
        A.1.4 to A.1.54 (signal names not listed)
        A.1.55 TDO (O)
        A.1.56 THERMTRIP# (O)
        A.1.57 TMS (I)
        A.1.58 TRDY# (I)
        A.1.59 TRST# (I)
    A.2 SIGNAL SUMMARIES

Index

Figures
2-1 Three Engines Communicating Using an Instruction Pool
2-2 A Typical Pseudo Code Fragment
2-3 The Three Core Engines Interface with Memory via Unified Caches
2-4 Inside the Fetch/Decode Unit
2-5 Inside the Dispatch/Execute Unit
2-6 Inside the Retire Unit
2-7 Inside the Bus Interface Unit
3-1 Latched Bus Protocol
5-1 Hardware Configuration Signal Sampling
5-2 BR[1:0]# Physical Interconnection with Two Symmetric Agents
5-3 BR[1:0]# Physical Interconnection with Four Symmetric Agents
6-1 Simplified Block Diagram of Processor TAP Logic
6-2 TAP Controller Finite State Machine
6-3 Processor TAP Instruction Register
6-4 Operation of the Processor TAP Instruction Register
6-5 TAP Instruction Register Access
7-1 Hardware Components of the ITP
7-2 GTL+ Signal Termination
7-3 Generic DP System Layout for Debug Port Connection
7-4 Debug Port Connector on Thermal Plate Side of Circuit Board
7-5 Hole Positioning for Connector on Thermal Plate Side of Circuit Board
7-6 Processor System Where Boundary Scan Is Not Used

Tables
3-1 Execution Control Signals
3-2 Arbitration Signals
3-3 Request Signals
3-4 Snoop Signals
3-5 Response Signals
3-6 Data Phase Signals
3-7 Burst Order Used for P6 Family Processor Bus Line Transfers
3-8 Error Signals
3-9 PC Compatibility Signals
3-10 Diagnostic Support Signals
4-1 Direct Bus Signal Protection
5-1 APIC Cluster ID Configuration
5-2 P6 Family Processor Bus BREQ[1:0]# Interconnect (2-Way MP Processors)
5-3 P6 Family Processor Bus BREQ[3:0]# Interconnect (4-Way MP Processors)
5-4 Arbitration ID Configuration (Two Agents)
5-5 Arbitration ID Configuration (Four Agents)
5-6 System Bus To Core Frequency Multiplier Configuration
5-7 Processor Power-On Configuration Register
5-8 Power-On Configuration Register APIC Cluster ID Bit Field
5-9 Power-On Configuration Register Arbitration ID Configuration
5-10 Power-On Configuration Register Bus Frequency to Core Frequency Ratio Bit Field
6-1 1149.1 Instructions in the Processor TAP
6-2 TAP Data Registers
6-3 Device ID Register
6-4 TAP Reset Actions
7-1 Debug Port Pinout Description and Requirements
A-6 LEN[1:0]# Signals Data Transfer Lengths


1 Introduction

1.1 P6 FAMILY OF PROCESSORS OVERVIEW

The P6 family of processors is the generation of processors that succeeds the Pentium® line of Intel processors. This processor family implements Intel's dynamic execution microarchitecture, which incorporates a unique combination of multiple branch prediction, data flow analysis, and speculative execution. This enables P6 family processors to deliver higher performance than the Pentium family of processors, while maintaining binary compatibility with all previous Intel Architecture processors. The first P6 family processor was the Pentium Pro processor, which was followed by the Pentium II processor.

As new products are designed, new technologies are utilized. For example, features were added to some P6 family processor products to aid in the design of energy efficient computer systems by offering multiple low-power states, such as AutoHALT, StopGrant, Sleep and Deep Sleep, to conserve power during idle times.

Products in the P6 family are also differentiated by the markets they target: the Server and Workstation Market, Performance PC Market, Mobile Market and Basic PC Market. All of these market segments demand specific features and performance. While all P6 family products have the benefits of Intel's dynamic execution microarchitecture, there are also product-specific differentiators.

For example, the P6 family offers products with larger cache sizes and support for up to four processors to meet the higher performance demand of the server and workstation markets. Additionally, the Pentium II Xeon™ processor meets the manageability requirements of the server and workstation environment by incorporating a System Management Bus (SMBus) interface. This interface can be used in conjunction with system hardware and software to provide more manageability options than any previous P6 family product. Memory is cacheable for up to 64 GB of addressable memory. The SMBus interface and larger L2 cache sizes enable these products to provide higher performance and manageability for the server and workstation environment.

For high end desktop and business applications, the Pentium II processor can deliver the necessary computing power. Memory is cacheable for up to 4 GB of addressable memory space, allowing significant headroom for applications. It also incorporates Intel's MMX™ technology for enhanced media and communications performance.

The Intel P6 family also contains processors which are specifically designed and manufactured for the mobile market. These processors can operate under much more restrictive power and size constraints than the previously mentioned products, while still maintaining a high level of performance.

The Intel Celeron™ processor is designed for Basic PC desktops. It provides the same benefits of the P6 family architecture and adds the capabilities of Intel's MMX technology to bring a balanced level of performance and price to Basic PC consumers.


1.2 TERMINOLOGY

In this document, a '#' symbol after a signal name denotes an active-low signal: the signal is in the active state (based on the name of the signal) when driven to a low level. For example, when FLUSH# is low, a flush has been requested. When NMI is high, a non-maskable interrupt has occurred. In the case of signals where the name does not imply an active state but describes part of a binary sequence (such as address or data), the '#' symbol implies that the signal is inverted. For example, D[3:0] = 'HLHL' refers to a hex 'A', and D#[3:0] = 'LHLH' also refers to a hex 'A' (H = High logic level, L = Low logic level).

The term "system bus" refers to the interface between the processor, system core logic (a.k.a. the core logic components) and other bus agents. The system bus is a multiprocessing interface to processors, memory and I/O.

The term "cache bus" refers to the interface between the processor and the L2 cache components. The cache bus does NOT connect to the system bus, and is not visible to other agents on the system bus.

When signal values are referenced in tables, a 0 indicates inactive and a 1 indicates active. 0 and 1 do not reflect voltage levels. A # after a signal name indicates active low. An entry of 1 for ADS# means that ADS# is active, with a low voltage level.
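The naming convention above can be illustrated with a short Python sketch. It is only an illustration of the convention, not part of any Intel tool; the function names is_asserted and bus_value are hypothetical, and levels are written as 'H' and 'L' as in the text.

```python
def is_asserted(name: str, level: str) -> bool:
    """Return True when a control signal is in its active state.

    A trailing '#' in the signal name marks it as active low.
    level is the voltage level on the pin: 'H' or 'L'.
    """
    if level not in ("H", "L"):
        raise ValueError("level must be 'H' or 'L'")
    active_low = name.endswith("#")
    return level == ("L" if active_low else "H")


def bus_value(levels: str, inverted: bool) -> int:
    """Decode a string of pin levels (most significant bit first).

    On an inverted ('#') bus a low level represents a logical 1.
    """
    one = "L" if inverted else "H"
    value = 0
    for ch in levels:
        if ch not in ("H", "L"):
            raise ValueError("levels may contain only 'H' and 'L'")
        value = (value << 1) | int(ch == one)
    return value


if __name__ == "__main__":
    print(is_asserted("FLUSH#", "L"))      # flush requested
    print(is_asserted("NMI", "H"))         # non-maskable interrupt pending
    print(hex(bus_value("HLHL", False)))   # D[3:0]
    print(hex(bus_value("LHLH", True)))    # D#[3:0]
```

Both bus_value calls decode to hex A, matching the D[3:0] and D#[3:0] example in the text.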

1.3 SPECIFIC PRODUCT REFERENCES

The reader of this document should also refer to the product datasheets for product-specific details. Datasheets for Intel processors are located at http://developer.intel.com. This reference manual does not attempt to provide all of the technical details of each product in one comprehensive manual; its goal is to describe what is common to all P6 family products. It is the role of each processor's datasheet to provide the specific differentiating details of each product. In the event that the datasheet and this reference manual contradict one another, please use the datasheet as the correct reference.

The P6 family of processors may contain design defects known as errata. All characterized errata are available on-line at http://developer.intel.com.


2 Micro-Architecture Overview

The P6 family of processors uses a dynamic execution micro-architecture. This three-way superscalar, pipelined micro-architecture features a decoupled, multi-stage superpipeline, which trades less work per pipestage for more stages. A P6 family processor, for example, has twelve stages with a pipestage time 33 percent less than the Pentium processor, which helps achieve a higher clock rate on any given manufacturing process.

The approach used in the P6 family micro-architecture removes the constraint of linear instruction sequencing between the traditional "fetch" and "execute" phases, and opens up a wide instruction window using an instruction pool. This approach allows the "execute" phase of the processor to have much more visibility into the program instruction stream so that better scheduling may take place. It requires the instruction "fetch/decode" phase of the processor to be much more efficient in terms of predicting program flow. Optimized scheduling requires the fundamental "execute" phase to be replaced by decoupled "dispatch/execute" and "retire" phases. This allows instructions to be started in any order but always be completed in the original program order.

Processors in the P6 family may be thought of as three independent engines coupled with an instruction pool as shown in Figure 2-1.

Figure 2-1. Three Engines Communicating Using an Instruction Pool
[Figure: the Fetch/Decode Unit, Dispatch/Execute Unit, and Retire Unit, each communicating with the Instruction Pool]

2.1 FULL CORE UTILIZATION
The three independent-engine approach was taken to more fully utilize the processor core. Consider the pseudo code fragment in Figure 2-2:

Figure 2-2. A Typical Pseudo Code Fragment

    r1 <= mem [r0]    /* Instruction 1 */
    r2 <= r1 + r2     /* Instruction 2 */
    r5 <= r5 + 1      /* Instruction 3 */
    r6 <= r6 - r3     /* Instruction 4 */


The first instruction in this example is a load of r1 that, at run time, causes a cache miss. A traditional processor core must wait for its bus interface unit to read this data from main memory and return it before moving on to instruction 2. This processor stalls while waiting for this data and is thus being under-utilized.

To avoid this memory latency problem, a P6 family processor looks ahead into the instruction pool at subsequent instructions and does useful work rather than stalling. In the example in Figure 2-2, instruction 2 is not executable since it depends upon the result of instruction 1; however, both instructions 3 and 4 have no prior dependencies and are therefore executable. The processor executes instructions 3 and 4 out-of-order. The results of this out-of-order execution cannot be committed to permanent machine state (i.e., the programmer-visible registers) immediately since the original program order must be maintained. The results are instead stored back in the instruction pool awaiting in-order retirement. The core executes instructions depending upon their readiness to execute, and not on their original program order, and is therefore a true dataflow engine. This approach has the side effect that instructions are typically executed out-of-order.

The cache miss on instruction 1 will take many internal clocks, so the core continues to look ahead for other instructions that could be speculatively executed, and is typically looking 20 to 30 instructions in front of the instruction pointer. Within this 20 to 30 instruction window there will be, on average, five branches that the fetch/decode unit must correctly predict if the dispatch/execute unit is to do useful work. The sparse register set of an Intel Architecture (IA) processor will create many false dependencies on registers, so the dispatch/execute unit will rename the Intel Architecture registers into a larger register set to enable additional forward progress. The Retire Unit owns the programmer's Intel Architecture register set, and results are only committed to permanent machine state in these registers when it removes completed instructions from the pool in original program order.

Dynamic Execution technology can be summarized as optimally adjusting instruction execution by predicting program flow, having the ability to speculatively execute instructions in any order, and then analyzing the program's dataflow graph to choose the best order to execute the instructions.
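The behavior described for the Figure 2-2 fragment can be sketched as a toy simulation in Python. This is a minimal model under stated assumptions, not the processor's actual scheduler: the latencies are made-up values (10 clocks for the missing load, 1 clock otherwise), execution resources are unlimited, and only true data dependencies are tracked. The names Instr, PROGRAM and simulate are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Instr:
    number: int      # position in program order, as in Figure 2-2
    dest: str        # register written
    srcs: tuple      # registers read
    latency: int     # clocks from start to result (illustrative values only)


# The fragment from Figure 2-2; the load is assumed to miss the cache.
PROGRAM = [
    Instr(1, "r1", ("r0",), 10),       # r1 <= mem[r0]
    Instr(2, "r2", ("r1", "r2"), 1),   # r2 <= r1 + r2
    Instr(3, "r5", ("r5",), 1),        # r5 <= r5 + 1
    Instr(4, "r6", ("r6", "r3"), 1),   # r6 <= r6 - r3
]


def simulate(program):
    """Return (completion order, retirement order) as instruction numbers."""
    # For each instruction, find the nearest older writer of each source.
    producers = []
    for i, ins in enumerate(program):
        deps = set()
        for src in ins.srcs:
            for j in range(i - 1, -1, -1):
                if program[j].dest == src:
                    deps.add(j)
                    break
        producers.append(deps)

    done_at = {}     # index -> clock at which the result is available
    retired = []
    clock = 0
    while len(retired) < len(program):
        # Start every waiting instruction whose producers have completed.
        for i, ins in enumerate(program):
            if i in done_at:
                continue
            if all(j in done_at and done_at[j] <= clock for j in producers[i]):
                done_at[i] = clock + ins.latency
        # Retire strictly in program order, only once the result exists.
        while len(retired) < len(program):
            nxt = len(retired)
            if nxt in done_at and done_at[nxt] <= clock:
                retired.append(program[nxt].number)
            else:
                break
        clock += 1

    completed = sorted(done_at, key=lambda i: (done_at[i], i))
    return [program[i].number for i in completed], retired


if __name__ == "__main__":
    completed, retired = simulate(PROGRAM)
    print("completed:", completed)
    print("retired:  ", retired)
```

With these assumed latencies, instructions 3 and 4 complete first, then 1, then 2, while retirement still proceeds in the order 1, 2, 3, 4.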

2.2 THE P6 FAMILY PROCESSOR PIPELINE

For a closer look at how the P6 family micro-architecture implements Dynamic Execution, Figure 2-3 shows a block diagram of the P6 family of processor products, including cache and memory interfaces. The "Units" shown in Figure 2-3 represent stages of the P6 family processor pipeline.


Figure 2-3. The Three Core Engines Interface with Memory via Unified Caches

[Figure: block diagram showing the System Bus, L2 Cache, Bus Interface Unit, L1 ICache, L1 DCache, Fetch/Decode Unit, Dispatch/Execute Unit, Retire Unit, and Instruction Pool, with Fetch, Load, and Store paths]

· The FETCH/DECODE unit: An in-order unit that takes as input the user program instruction stream from the instruction cache, and decodes it into a series of µoperations (µops) that represent the dataflow of that instruction stream. The pre-fetch is speculative.

· The DISPATCH/EXECUTE unit: An out-of-order unit that accepts the dataflow stream, schedules execution of the µops subject to data dependencies and resource availability, and temporarily stores the results of these speculative executions.

· The RETIRE unit: An in-order unit that knows how and when to commit ("retire") the temporary, speculative results to permanent architectural state.

· The BUS INTERFACE unit: A partially ordered unit responsible for connecting the three internal units to the real world. The bus interface unit communicates directly with the L2 (second level) cache, supporting up to four concurrent cache accesses. The bus interface unit also controls a transaction bus, with MESI snooping protocol, to system memory.

2.2.1 The Fetch/Decode Unit
Figure 2-4 shows a more detailed view of the Fetch/Decode unit.


Figure 2-4. Inside the Fetch/Decode Unit

[Figure: blocks shown are From Bus Interface Unit, ICache, Next_IP, Branch Target Buffer, Instruction Decoder (x3), Microcode Instruction Sequencer, Register Alias Table Allocate, and To Instruction Pool (ReOrder Buffer)]

The L1 Instruction Cache is a local instruction cache. The Next_IP unit provides the L1 Instruction Cache index, based on inputs from the Branch Target Buffer (BTB), trap/interrupt status, and branch-misprediction indications from the integer execution section.

The L1 Instruction Cache fetches the cache line corresponding to the index from the Next_IP, and the next line, and presents 16 aligned bytes to the decoder. The prefetched bytes are rotated so that they are justified for the instruction decoders (ID). The beginning and end of the Intel Architecture instructions are marked.

Three parallel decoders accept this stream of marked bytes, and proceed to find and decode the Intel Architecture instructions contained therein. The decoder converts the Intel Architecture instructions into triadic µops (two logical sources, one logical destination per µop). Most Intel Architecture instructions are converted directly into single µops; some instructions are decoded into one to four µops; and the complex instructions require microcode (the box labeled Microcode Instruction Sequencer in Figure 2-4). This microcode is just a set of preprogrammed sequences of normal µops.

The µops are queued and sent to the Register Alias Table (RAT) unit, where the logical Intel Architecture-based register references are converted into references to P6 family processor physical registers, and then to the Allocator stage, which adds status information to the µops and enters them into the instruction pool. The instruction pool is implemented as an array of Content Addressable Memory called the ReOrder Buffer (ROB).
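The renaming step performed by the Register Alias Table can be sketched in Python as follows. This is a simplified illustration, not the actual RAT design: it assumes an unlimited supply of physical registers that are never freed, and the names rename, p0, p1 and so on are hypothetical. Each µop is a triadic tuple (destination, source 1, source 2), as described above.

```python
def rename(uops, logical_regs):
    """Map logical register references in triadic uops to physical registers.

    uops: list of (dest, src1, src2) logical register names, in program order.
    Returns the same uops expressed with physical register names.
    """
    # Initially every logical register has its own physical register.
    rat = {reg: f"p{i}" for i, reg in enumerate(logical_regs)}
    next_free = len(logical_regs)   # assumption: the free pool never runs out

    renamed = []
    for dest, src1, src2 in uops:
        # Sources are looked up before the destination mapping changes,
        # so a uop that reads and writes the same register sees the old value.
        s1, s2 = rat[src1], rat[src2]
        rat[dest] = f"p{next_free}"   # every write gets a fresh register
        next_free += 1
        renamed.append((rat[dest], s1, s2))
    return renamed


if __name__ == "__main__":
    program = [
        ("r1", "r2", "r3"),   # r1 <= r2 op r3
        ("r2", "r4", "r5"),   # r2 <= r4 op r5  (reuses r2: false dependency)
        ("r1", "r2", "r6"),   # r1 <= r2 op r6  (reuses r1: false dependency)
    ]
    for uop in rename(program, ["r1", "r2", "r3", "r4", "r5", "r6"]):
        print(uop)
```

After renaming, the second µop writes a register that the first µop never reads, so the two can execute in either order; only the true dependency of the third µop on the second remains.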

2.2.2 The Dispatch/Execute Unit
The Dispatch unit selects µops from the instruction pool depending upon their status. If the status indicates that a µop has all of its operands then the dispatch unit checks to see if the execution resource needed by that µop is also available. If both are true, the Reservation Station removes that µop and sends it to the resource where it is executed. The results of the µop are later returned to the pool. There are five ports on the Reservation Station, and the multiple resources are accessed as shown in Figure 2-5.


Figure 2-5. Inside the Dispatch/Execute Unit

[Figure: the Reservation Station, connected to and from the Instruction Pool (ReOrder Buffer), feeds the execution resources through five ports]

Port 0: MMX™ Execution Unit, Floating-Point Execution Unit, Integer Execution Unit
Port 1: MMX Execution Unit, Jump Execution Unit, Integer Execution Unit
Port 2: Load Unit (loads)
Ports 3, 4: Store Unit (stores)

The P6 family of processors can schedule at a peak rate of 5 µops per clock, one to each resource port, but a sustained rate of 3 µops per clock is more typical. The activity of this scheduling process is the out-of-order process; µops are dispatched to the execution resources strictly according to dataflow constraints and resource availability, without regard to the original ordering of the program. Note that the actual algorithm employed by this execution-scheduling process is vitally important to performance. If only one µop per resource becomes data-ready per clock cycle, then there is no choice. But if several are available, it must choose. The P6 family micro-architecture uses a pseudo FIFO scheduling algorithm favoring back-to-back µops. Note that many of the µops are branches. The Branch Target Buffer (BTB) will correctly predict most of these branches but it can't correctly predict them all. Consider a BTB that is correctly predicting the backward branch at the bottom of a loop; eventually that loop is going to terminate, and when it does, that branch will be mispredicted. Branch µops are tagged (in the in-order pipeline) with their fall-through address and the destination that was predicted for them. When the branch executes, what the branch actually did is compared against what the prediction hardware said it would do. If those coincide, then the branch eventually retires and the speculatively executed work between it and the next branch instruction in the instruction pool is good. But if they do not coincide, then the Jump Execution Unit (JEU) changes the status of all of the µops behind the branch to remove them from the instruction pool. In that case the proper branch destination is provided to the BTB which restarts the whole pipeline from the new target address.
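
The selection step described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `Uop` record, its field names, and the oldest-first tie-break are hypothetical stand-ins. The manual says only that a pseudo FIFO algorithm favoring back-to-back µops is used, and does not give the algorithm itself.

```python
from dataclasses import dataclass

@dataclass
class Uop:
    seq: int      # position in original program order (hypothetical field)
    port: int     # Reservation Station port the µop needs, 0 to 4
    ready: bool   # True once all source operands are available

def dispatch(pool, num_ports=5):
    """Pick at most one data-ready µop per port for the current clock.

    Oldest-first selection is an assumed stand-in for the pseudo FIFO policy.
    """
    chosen = {}
    for uop in sorted(pool, key=lambda u: u.seq):
        if uop.ready and 0 <= uop.port < num_ports and uop.port not in chosen:
            chosen[uop.port] = uop
    return [chosen[p] for p in sorted(chosen)]

# µop 0 is not data-ready, so the younger µop 1 goes out first on port 0.
pool = [Uop(0, 0, False), Uop(1, 0, True), Uop(2, 2, True), Uop(3, 0, True)]
print([u.seq for u in dispatch(pool)])
```

At most five µops can be returned per call, one per port, which mirrors the peak rate of 5 µops per clock quoted above.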

2.2.3

The Retire Unit
Figure 2-6 shows a more detailed view of the Retire Unit.

Figure 2-6. Inside the Retire Unit
[Figure: block labels are Reservation Station, Memory Interface Unit (to/from DCache), Retirement Register File, and Instruction Pool (from/to).]

The Retire Unit is also checking the status of µops in the instruction pool. It is looking for µops that have executed and can be removed from the pool. Once removed, the original architectural target of the µops is written as per the original Intel Architecture instruction. The Retire Unit must not only notice which µops are complete, it must also re-impose the original program order on them. It must also do this in the face of interrupts, traps, faults, breakpoints and mispredictions. The Retire Unit must first read the instruction pool to find the potential candidates for retirement and determine which of these candidates are next in the original program order. Then it writes the results of this cycle's retirements to the Retirement Register File (RRF). The Retire Unit is capable of retiring 3 µops per clock.
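
In-order retirement can be illustrated with a short sketch. The list-of-tuples layout for the instruction pool and the function name are assumptions made for the example. Only the rule itself (retire completed µops in original program order, at most 3 per clock) comes from the text.

```python
def retire(rob, width=3):
    """Retire up to `width` completed µops from the head of the pool.

    `rob` is a list of (seq, done) tuples kept in program order
    (a hypothetical layout). Returns (retired_seqs, remaining_rob).
    """
    retired = []
    while rob and len(retired) < width and rob[0][1]:
        retired.append(rob[0][0])
        rob = rob[1:]
    return retired, rob

# µop 3 has executed, but it cannot retire ahead of the unfinished µop 2.
retired, remaining = retire([(0, True), (1, True), (2, False), (3, True)])
print(retired, remaining)
```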

2.2.4

The Bus Interface Unit
Figure 2-7 shows a more detailed view of the Bus Interface Unit.

Figure 2-7. Inside the Bus Interface Unit
[Figure: block labels are System Memory, Memory Order Buffer, Memory I/F, L2 Cache, DCache, From Address Generation Unit, and To/From Instruction Pool (ReOrder Buffer).]

There are two types of memory access: loads and stores. Loads only need to specify the memory address to be accessed, the width of the data being retrieved, and the destination register. Loads are encoded into a single µop. Stores need to provide a memory address, a data width, and the data to be written. Stores therefore require two µops, one to generate the address and one to generate the data. These µops must later re-combine for the store to complete. Stores are never performed speculatively since there is no transparent way to undo them. Stores are also never re-ordered among themselves. A store is dispatched only when both the address and the data are available and there are no older stores awaiting dispatch. A study of the importance of memory access reordering concluded:

· Stores must be constrained from passing other stores, for only a small impact on performance.
· Stores can be constrained from passing loads, for an inconsequential performance loss.
· Constraining loads from passing other loads or stores has a significant impact on performance.

The Memory Order Buffer (MOB) allows loads to pass other loads and stores by acting like a reservation station and re-order buffer. It holds suspended loads and stores and re-dispatches them when a blocking condition (dependency or resource) disappears.

2.3

ARCHITECTURE SUMMARY
All products within the P6 family of processors are based upon this architectural summary. Dynamic Execution is the combination of improved branch prediction, speculative execution and data flow analysis that enables P6 family processors to deliver superior performance.

System Bus Overview

3

This chapter provides an overview of the P6 family system bus, bus transactions, and bus signals. The P6 family of processors also supports two other synchronous busses (the APIC and the TAP bus), PC compatibility signals, and several implementation specific signals. For a complete signal listing of a specific product, please refer to the relevant datasheet.

3.1

SIGNALING ON P6 FAMILY SYSTEM BUS
The P6 family processor system bus supports a synchronous latched protocol. On the rising edge of the bus clock, all agents on the system bus are required to drive their active outputs and sample required inputs. No additional logic is located in the output and input paths between the buffer and the latch stage, thus keeping setup and hold times constant for all bus signals following the latched protocol. The System bus requires that every input be sampled during a valid sampling window on a rising clock edge and its effect be driven out no sooner than the next rising clock edge. This approach allows one full clock for inter-component communication and at least one full clock at the receiver to compute a response. Figure 3-1 illustrates the latched bus protocol as it appears on the bus. In subsequent descriptions, the protocol is described as "B# is asserted in the clock after A# is observed active," or "B# is asserted two clocks after A# is asserted." Note that A# is asserted in T1, but not observed active until T2. The receiving agent uses T2 to determine its response and asserts B# in T3. Other agents observe B# active in T4.
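
The clock-by-clock relationship in the A#/B# example can be written out as simple arithmetic. The function name and the tuple it returns are illustrative assumptions. The one-clock steps follow the rule that an input sampled on one rising edge can affect outputs no sooner than the next rising edge.

```python
def latched_response(assert_clock):
    """Given the clock in which A# is driven, return the clocks in which
    A# is observed, B# is driven in response, and B# is observed."""
    observe_a = assert_clock + 1   # one full clock for signal propagation
    assert_b = observe_a + 1       # one full clock for the receiver's logic
    observe_b = assert_b + 1       # B# propagates like any other signal
    return observe_a, assert_b, observe_b

# A# asserted in T1: observed in T2, B# asserted in T3, B# observed in T4.
print(latched_response(1))
```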

Figure 3-1. Latched Bus Protocol
[Figure: timing diagram over BCLK clocks 1 through 5 showing A# and B#, with the events Assert A#, Latch A#, Assert B#, and Latch B#. One full clock is allowed for signal propagation and one full clock for logic delays.]

The square and circle symbols are used in the timing diagrams to indicate the clock in which particular signals of interest are driven and sampled. The square indicates that a signal is driven (asserted, initiated) in that clock. The circle indicates that a signal is sampled (observed, latched) in that clock.

Signals that are driven in the same clock by multiple System bus agents exhibit a "wired-OR glitch" on the electrical-low-to-electrical-high transition. To account for this situation, these signal state transitions are specified to have two clocks of settling time when deasserted before they can be safely observed. The bus signals that must meet these criteria are: BINIT#, HIT#, HITM#, BNR#, AERR#, BERR#.

3.2

SIGNAL OVERVIEW
This section describes the function of the System bus signals. In this section, the signals are grouped according to function.

3.2.1

Execution Control Signals
Table 3-1 lists the execution control signals, which control the execution and initialization of the processor.

Table 3-1. Execution Control Signals
Pin/Signal Name                               Pin/Signal Mnemonic
Bus Clock                                     BCLK
Initialization                                INIT#, RESET#
Flush                                         FLUSH#
Stop Clock                                    STPCLK#
Sleep                                         SLP#
Interprocessor Communication and Interrupts   PICCLK, PICD[1:0]#, LINT[1:0]

The BCLK (Bus Clock) input signal is the System bus clock. All agents drive their outputs and latch their inputs on the BCLK rising edge. Each processor in the P6 family derives its internal clock from BCLK by multiplying the BCLK frequency by a multiplier determined at configuration. See Chapter 5, "Configuration," for possible clock configuration frequencies.

The RESET# input signal resets all System bus agents to known states and invalidates their internal caches. Modified or dirty cache lines are NOT written back. After RESET# is deasserted, each processor begins execution at the power on reset vector defined during configuration.

The INIT# input signal resets all processors without affecting their internal (L1 or L2) caches, floating-point registers, or their Machine Check Architecture registers (MCi_CTL). Each processor begins execution at the address vector as defined during power on configuration. INIT# has another meaning on RESET#'s active to inactive transition: if INIT# is sampled active on RESET#'s active to inactive transition, then the processor executes its built-in self-test (BIST).

If the FLUSH# input signal is asserted, the processor writes back all internal cache lines in the Modified state (L1 and L2 caches) and invalidates all internal cache lines (L1 and L2 caches), leaving every internal cache line in the Invalid state. The FLUSH# signal has a different meaning when it is sampled asserted on the active to inactive transition of RESET#: the processor then tristates all of its outputs. This function is used during board testing.

The P6 family processor supplies a STPCLK# pin to enable the processor to enter a low power state. When STPCLK# is asserted, the processor puts itself into the Stop-Grant state. The processor continues to snoop bus transactions while in Stop-Grant state. When STPCLK# is deasserted, the processor restarts its internal clock to all units and resumes execution. The assertion of STPCLK# has no effect on the bus clock.

Some processors in the P6 family support the SLP# (sleep) signal. When SLP# is asserted in Stop-Grant state, the processor enters a new low power state, the Sleep state. During Sleep state, the processor stops providing internal clock signals to all units, leaving only the PLL running. Snooping during the Sleep state is not supported. Please see specific processor datasheets for more information on this feature.

The PICCLK and PICD[1:0]# signals support the Advanced Programmable Interrupt Controller (APIC) interface. The PICCLK signal is the clock input for the processor's APIC bus. The PICD[1:0]# signals are used for bi-directional serial message passing on the APIC bus.

LINT[1:0] are local interrupt signals, also defined by the APIC interface. In APIC disabled mode, LINT0 defaults to INTR, a maskable interrupt request signal, and LINT1 defaults to NMI, a nonmaskable interrupt. Both signals are asynchronous inputs. In APIC enabled mode, LINT0 and LINT1 are defined with the local vector table. LINT[1:0] are also used along with the A20M# and IGNNE# signals to determine the multiplier for the internal clock frequency of some processors as described in Chapter 5, "Configuration."

3.2.2

Arbitration Signals
The arbitration signal group (see Table 3-2) is used to arbitrate for the bus. These signals are used to gain ownership of the bus, a requirement for initiating a bus transaction. Some P6 family processors permit up to five agents to simultaneously arbitrate for the system bus with one to four symmetric agents (on BREQ[3:0]#) and one priority agent (on BPRI#). The following paragraphs describe arbitration from this perspective. P6 family processors arbitrate as symmetric agents. The priority agent normally arbitrates on behalf of the I/O subsystem (I/O agents) and memory subsystem (memory agents). Please note that in systems based on P6 family processors that support only two symmetric agents, the following descriptions are accurate with the exception of the number of symmetric agents, the number of BR# pins and the number of BREQ# signals.

Table 3-2. Arbitration Signals
Pin/Signal Name               Pin Mnemonic   Signal Mnemonic
Symmetric Agent Bus Request   BR[3:0]#       BREQ[3:0]#
Priority Agent Bus Request    BPRI#          BPRI#
Block Next Request            BNR#           BNR#
Lock                          LOCK#          LOCK#

The symmetric agents arbitrate for the bus based on a round-robin rotating priority scheme. The arbitration is fair and symmetric. After reset, agent 0 has the highest priority followed by agent 1. All bus agents track the current bus owner. A symmetric agent requests the bus by asserting its BREQn# signal. Based on the values sampled on BREQ[3:0]#, and the last symmetric bus owner, all agents simultaneously determine the next symmetric bus owner.
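
A rotating-priority decision of this kind can be sketched as follows. The function name, the boolean list used for the sampled BREQ[3:0]# values, and the exact rotation rule (priority starts at the agent after the last owner) are assumptions for illustration. The manual states only that the scheme is round-robin and that agent 0 has the highest priority after reset.

```python
def next_symmetric_owner(breq, last_owner, num_agents=4):
    """Return the next symmetric bus owner, or None if nobody requests.

    breq[n] is True when BREQn# is sampled asserted. last_owner is the
    agent ID of the last symmetric owner, or None right after reset.
    """
    start = 0 if last_owner is None else (last_owner + 1) % num_agents
    for offset in range(num_agents):
        agent = (start + offset) % num_agents
        if breq[agent]:
            return agent
    return None

# After reset agent 0 wins; once it has owned the bus, agent 1 is next.
print(next_symmetric_owner([True, True, False, False], None))
print(next_symmetric_owner([True, True, False, False], 0))
```

Because every agent samples the same BREQ[3:0]# values and tracks the same last owner, each one can evaluate such a function locally and reach the same answer without a central arbiter.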

The priority agent asks for the bus by asserting BPRI#. The assertion of BPRI# temporarily overrides, but does not otherwise alter the symmetric arbitration scheme. When BPRI# is sampled active, no symmetric agent issues another unlocked bus transaction until BPRI# is sampled inactive. The priority agent is always the next bus owner. BNR# can be asserted by any bus agent to block further transactions from being issued to the bus. It is typically asserted when system resources (such as address and/or data buffers) are about to become temporarily busy or filled and cannot accommodate another transaction. After bus initialization, BNR# can be asserted to delay the first bus transaction until all bus agents are initialized. The assertion of the LOCK# signal indicates that the bus agent is executing an atomic sequence of bus transactions that must not be interrupted. A locked operation cannot be interrupted by another transaction regardless of the assertion of BREQ[3:0]# or BPRI#. LOCK# can be used to implement memory-based semaphores. LOCK# is asserted from the start of the first transaction through the end of the last transaction. The LOCK# signal is always deasserted between two sequences of locked transactions on the System bus.

3.2.3

Request Signals
The request signals (see Table 3-3) initiate a transaction.

Table 3-3. Request Signals
Pin Name          Pin Mnemonic   Signal Name      Signal Mnemonic
Address Strobe    ADS#           Address Strobe   ADS#
Request Command   REQ[4:0]#      Request          REQ[4:0]#
Address           A[35:3]#       Address          A[35:3]#
Address Parity    AP[1:0]#       Address Parity   AP[1:0]#
Request Parity    RP#            Request Parity   RP#

The assertion of ADS# defines the beginning of the transaction. The REQ[4:0]#, A[35:3]#, RP# and AP[1:0]# signals are valid in the clock that ADS# is asserted. In the clock that ADS# is asserted, the A[35:3]# signals provide a 36-bit, active-low address as part of the request. The P6 family processor maximum physical address space is 2^36 bytes, or 64 gigabytes (64 GByte). Some P6 family processors may implement fewer address lines. Address bits 2, 1, and 0 are mapped into byte enable signals for 1 to 8 byte transfers.

The address signals are protected by the AP[1:0]# pins on some processors. AP1# covers A[35:24]#, AP0# covers A[23:3]#. AP[1:0]# must be valid for two clocks beginning when ADS# is asserted. A parity signal on the system bus is correct if there are an even number of electrically low signals in the set consisting of the covered signals plus the parity signal. Parity is computed using voltage levels, regardless of whether the covered signals are active high or active low. The Request Parity pin RP# covers the request pins REQ[4:0]# and the address strobe, ADS#.
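
The parity rule can be made concrete with a small sketch. The function names and the 0/1 encoding of electrical levels (0 = low, 1 = high) are assumptions for the example. The coverage ranges and the even-number-of-lows rule are taken from the text above.

```python
def parity_level(levels):
    """Return the electrical level (0 = low, 1 = high) a parity pin must take
    so that the covered signals plus the parity pin hold an even number of lows."""
    lows = sum(1 for v in levels if v == 0)
    return 1 if lows % 2 == 0 else 0

def address_parity(addr):
    """Return the (AP1#, AP0#) electrical levels for a 36-bit physical address.

    A[35:3]# are active low, so a logical 1 address bit is an electrical low.
    Address bits 2:0 are not driven on A[35:3]# and are ignored here.
    """
    def pin_levels(hi, lo):
        return [0 if (addr >> bit) & 1 else 1 for bit in range(lo, hi + 1)]
    ap1 = parity_level(pin_levels(35, 24))   # AP1# covers A[35:24]#
    ap0 = parity_level(pin_levels(23, 3))    # AP0# covers A[23:3]#
    return ap1, ap0

# Address 0x8 sets only bit 3: one low pin in the AP0# group, so AP0# goes low.
print(address_parity(0x8))
```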

3.2.4

Snoop Signals
The snoop signal group (see Table 3-4) provides snoop result information to the System bus agents.

Table 3-4. Snoop Signals
Type                                Signal Names
Keeping a Non-Modified Cache Line   HIT#
Hit to a Modified Cache Line        HITM#
Defer Transaction Completion        DEFER#

On observing a transaction, HIT# and HITM# are used to indicate that the line is valid or invalid in the snooping agent, whether the line is in the modified (dirty) state in the caching agent, or whether the transaction needs to be extended. The HIT# and HITM# signals are used to maintain cache coherency at the system level. If the memory agent observes HITM# active, it relinquishes responsibility for the data return and becomes a target for the implicit cache line writeback. The memory agent must merge the cache line being written back with any write data and update memory. The memory agent must also provide the implicit writeback response for the transaction. If HIT# and HITM# are sampled asserted together, it means that a caching agent is not ready to indicate snoop status, and it needs to extend the transaction. DEFER# is deasserted to indicate that the transaction can be guaranteed in-order completion. An agent asserting DEFER# ensures proper removal of the transaction from the In-order Queue by generating the appropriate response.
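
The four HIT#/HITM# combinations can be summarized as a small decoder. The function name and the result strings are illustrative assumptions. The meaning of each combination follows the paragraph above and Table 3-4.

```python
def decode_snoop(hit_asserted, hitm_asserted):
    """Interpret the sampled HIT# and HITM# signals (True = asserted)."""
    if hit_asserted and hitm_asserted:
        return "stall"      # an agent is not ready; the transaction is extended
    if hitm_asserted:
        return "modified"   # a caching agent holds the line dirty; implicit writeback
    if hit_asserted:
        return "hit"        # a caching agent keeps a non-modified copy
    return "miss"           # no caching agent keeps the line

print(decode_snoop(False, True))
```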

3.2.5

Response Signals
The response signal group (see Table 3-5) provides response information to the requesting agent.

Table 3-5. Response Signals
Type                        Signal Names
Response Status             RS[2:0]#
Response Parity             RSP#
Target Ready (for writes)   TRDY#

Requests initiated in the Request Phase enter the In-order Queue, which is maintained by every agent. The response agent is the agent responsible for completing the transaction at the top of the In-order Queue. The response agent is the agent addressed by the transaction. For write transactions, TRDY# is asserted by the response agent to indicate that it is ready to accept write or writeback data. For write transactions with an implicit writeback, TRDY# is asserted twice, first for the write data transfer and then again for the implicit writeback data transfer. The RSP# signal provides parity for RS[2:0]#. A parity signal on the System bus is correct if there are an even number of low signals in the set consisting of the covered signals plus the parity signal. Parity is computed using voltage levels, regardless of whether the covered signals are active high or active low.

3.2.6

Data Response Signals
The data response signals (see Table 3-6) control the transfer of data on the bus and provide the data path.

Table 3-6. Data Phase Signals
Type                  Signal Names
Data Ready            DRDY#
Data Bus Busy         DBSY#
Data                  D[63:0]#
Data ECC Protection   DEP[7:0]#

DRDY# indicates that valid data is on the bus and must be latched. The data bus owner asserts DRDY# for each clock in which valid data is to be transferred. DRDY# can be deasserted to insert wait states in the Data transfer. DBSY# is used to hold the bus before the first DRDY# and between DRDY# assertions for a multiple clock data transfer. DBSY# need not be asserted for single clock data transfers if no wait states are needed. The D[63:0]# signals provide a 64-bit data path between bus agents. Some P6 family processors support data bus error correcting code (ECC). In these processors, the DEP[7:0]# signals provide optional ECC covering D[63:0]#. As described in Chapter 5, "Configuration," the P6 family data bus can be configured with either no checking or ECC. If ECC is enabled, then DEP[7:0]# provides valid ECC for the entire data bus on each data clock, regardless of which bytes are enabled. The error correcting code can correct single bit errors and detect double bit errors. Please see specific processor datasheets for more information.

3.2.6.1

LINE TRANSFERS
A line transfer reads or writes a cache line, the unit of caching on the P6 family processor system bus. For current products, this is 32 bytes aligned on a 32-byte boundary. While a line is always aligned on a 32-byte boundary, a line transfer need not begin on that boundary. For a line transfer, A[35:3]# carry the upper 33 bits of a 36-bit physical address. Address bits A[4:3]# determine the transfer order, called burst order. A line is transferred in four eight-byte (64-bit) chunks, each of which can be identified by address bits 4:3. Table 3-7 specifies the transfer order used for a 32-byte line, based on address bits A[4:3]# specified in the transaction's Request Phase.

Table 3-7. Burst Order Used for P6 Family Processor Bus Line Transfers
A[4:3]#    Requested   1st Address   2nd Address   3rd Address   4th Address
(binary)   Address     Transferred   Transferred   Transferred   Transferred
           (hex)       (hex)         (hex)         (hex)         (hex)
00         0           0             8             10            18
01         8           8             0             18            10
10         10          10            18            0             8
11         18          18            10            8             0

Note that the requested read data is always transferred first. Unlike the Pentium processor, which always transfers writeback data starting at address 0, the P6 family transfers writeback data starting at the requested address.
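
The burst order in Table 3-7 can be reproduced with a short sketch. The function name is hypothetical, and the XOR formulation is an observation that happens to match every row of the table. The manual itself specifies the order only by listing it.

```python
def burst_order(requested_addr):
    """Return the four chunk offsets within a 32-byte line, in transfer order."""
    first = requested_addr & 0x18   # A[4:3] select the first 8-byte chunk
    return [first ^ (i << 3) for i in range(4)]

# A request with A[4:3] = 10 (offset 0x10) is transferred as 10, 18, 0, 8.
print([format(offset, "x") for offset in burst_order(0x10)])
```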

3.2.6.2

PART LINE ALIGNED TRANSFERS
A part-line aligned transfer moves a quantity of data smaller than a cache line but an even multiple of the chunk size between a bus agent and memory using the burst order. A part-line transfer affects no more than one line in a cache. A 16-byte transfer on a 64-bit data bus with a 32-byte cache line size is a part-line transfer, where a chunk is eight bytes aligned on an eight-byte boundary. All chunks in the span of a part-line transfer are moved across the data bus. Address bits A[4:3]# determine the transfer order for the included chunks, using the burst order specified in Table 3-7 for line transfers.

3.2.6.3

PARTIAL TRANSFERS
On a 64-bit data bus, a partial transfer moves 0 to 8 bytes within an aligned 8-byte span to or from a memory or I/O address. Processors convert non-cacheable misaligned memory accesses that cross 8-byte boundaries into two partial transfers. For example, a non-cacheable, misaligned 8-byte read requires two Read Data Partial transactions. Similarly, processors convert I/O write accesses that cross 4-byte boundaries into two partial transfers. I/O reads are treated the same as memory reads. I/O Read and I/O Write transactions are 1 to 4 byte partial transactions.
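
How a misaligned access breaks into partial transfers can be sketched as below. The function name and the (address, length) tuple representation are assumptions for the example. The 8-byte span for memory accesses and the 4-byte boundary for I/O writes come from the text.

```python
def split_partial(addr, length, span=8):
    """Split an access into pieces that each stay inside one aligned span.

    Returns a list of (start_address, byte_count) tuples.
    """
    pieces = []
    end = addr + length
    while addr < end:
        boundary = (addr // span + 1) * span   # start of the next aligned span
        chunk_end = min(end, boundary)
        pieces.append((addr, chunk_end - addr))
        addr = chunk_end
    return pieces

# A misaligned 8-byte read at 0x1004 becomes two 4-byte partial transfers.
print(split_partial(0x1004, 8))
```

Passing `span=4` models the I/O write case, where an access crossing a 4-byte boundary is split in the same way.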

3.2.7

Error Signals
Table 3-8 lists the error signals on the system bus.

Table 3-8. Error Signals
Type                   Signal Names
Address Parity Error   AERR#
Bus Initialization     BINIT#
Bus Error              BERR#
Internal Error         IERR#
FRC Error              FRCERR
Thermal Overrun        THERMTRIP#

AERR# can be enabled or disabled as part of the power on configuration (see Chapter 5, "Configuration"). If AERR# is disabled for all system bus agents, request and address parity errors are ignored and no action is taken by bus agents. If AERR# is enabled for at least one bus agent, the agents supporting address parity and observing the start of a transaction check the Address Parity signals (AP[1:0]#) and the RP# parity signal and assert AERR# appropriately if an address parity error is detected.

P6 family processors support two modes of response when the AERR# signal is enabled. This may be configured at power-up with "AERR# observation" mode. AERR# observation configuration must be consistent between all bus agents. If AERR# observation is disabled, AERR# is ignored and no action is taken by the bus agents. If AERR# observation is enabled and AERR# is sampled asserted, the transaction is canceled. In addition, the requesting agent may retry the transaction at a later time up to its retry limit, after which the error becomes a hard error as determined by the initiating processor.

If a transaction is canceled by AERR# assertion, then the transaction is aborted. Snoop results are ignored if they cannot be canceled in time. All agents reset their rotating ID for bus arbitration to the state at reset (such that bus agent 0 has highest priority). BINIT# is used to signal any bus condition that prevents reliable future operation of the bus. Like the AERR# pin, the BINIT# driver can be enabled or disabled as part of the power-on configuration (see Chapter 5, "Configuration"). If the BINIT# driver is disabled, BINIT# is never asserted and no action is taken on bus errors. Regardless of whether the BINIT# driver is enabled, the P6 family processor supports two modes of operation that may be configured at power on. These are the BINIT# observation and driving modes. If BINIT# observation is disabled, BINIT# is ignored and no action is taken by the processor even if BINIT# is sampled asserted. If BINIT# observation is enabled and BINIT# is sampled asserted, all bus state machines are reset. All agents reset their rotating ID for bus arbitration, and internal state information is lost. L1 and L2 cache contents are not affected. The BERR# pin is used to signal any error condition caused by a bus transaction that will not impact the reliable operation of the bus protocol (for example, memory data error, non-modified snoop error). A bus error that causes the assertion of BERR# can be detected by the processor, or by another bus agent. The BERR# driver can be enabled or disabled at power-on reset. If the BERR# driver is disabled, BERR# is never asserted. If the BERR# driver is enabled, the processor may assert BERR#. A machine check exception may or may not be taken for each assertion of BERR# as configured at power on. A processor will always disable the machine check exception by default. If a processor detects an internal error unrelated to bus operation, it asserts IERR#. 
For example, a parity error in an L1 or L2 cache causes a Pentium Pro processor to assert IERR#. A machine check exception may be taken instead of assertion of IERR# as configured with software.

Two processor agents in the P6 family may be configured as an FRC (functional redundancy checking) pair. In this configuration, one processor acts as the master and the other acts as a checker, and the pair operates as a single processor. If the checker agent detects a mismatch between its internally sampled outputs and the master processor's outputs, the checker asserts FRCERR. FRCERR observation can be enabled at the master processor with software. The master enters machine check on an FRCERR provided that Machine Check Execution is enabled. The FRCERR signal is also toggled during an FRC checker agent's reset action. FRCERR is asserted one clock after RESET# transitions from its active to inactive state. If the checker processor executes its built-in self-test (BIST), then FRCERR is asserted throughout that test. After BIST completes, the checker processor deasserts FRCERR only if BIST succeeded but continues to assert FRCERR if BIST failed. This feature allows the failure to be externally observed. If the checker processor does not execute its BIST, then it keeps FRCERR asserted for less than 20 clocks and then deasserts it.

The processor protects itself from catastrophic overheating by use of an internal thermal sensor. This sensor is set well above the normal operating temperature to ensure that there are no false trips. The processor will stop all execution when the junction temperature exceeds the sensor setting. This is signaled to the system by the THERMTRIP# (Thermal Trip) pin. Once activated, the signal remains latched, and the processor stopped, until RESET# goes active. There is no hysteresis built into the thermal sensor itself; as long as the die temperature drops below the trip level, a RESET# pulse will reset the processor and execution will continue.
If the temperature has not dropped below the trip level, the processor will continue to drive THERMTRIP# and remain stopped.

3.2.8

Compatibility Signals
The compatibility signals group (see Table 3-9) contains signals defined for compatibility within the Intel Architecture processor family.

Table 3-9. PC Compatibility Signals
Type                          Signal Names
Floating-Point Error          FERR#
Ignore Numeric Error          IGNNE#
Address 20 Mask               A20M#
System Management Interrupt   SMI#

A P