## Dynamic Cache Partitioning and Adaptive Cache Replacement Schemes for Chip Multiprocessors

## Norfadila Mahrom

B.Eng.(Computer Engineering) M.Sc.(Intelligent System)

# A THESIS SUBMITTED IN FULFILMENT OF THE REQUIREMENTS FOR THE DEGREE OF DOCTOR OF PHILOSOPHY

SCHOOL OF ELECTRICAL AND ELECTRONIC ENGINEERING
THE UNIVERSITY OF ADELAIDE
AUSTRALIA



Copyright © 2014 Norfadila Mahrom All rights reserved

## **CONTENTS**

| Li | st of ] | Figures                                                 | ix    |
|----|---------|---------------------------------------------------------|-------|
| Li | st of   | Tables                                                  | xiii  |
| Li | st of A | Algorithms                                              | XV    |
| Al | bstrac  | ct                                                      | xvii  |
| De | eclara  | ation of Originality                                    | xix   |
| A  | cknov   | wledgments                                              | xxi   |
| Li | st of A | Abbreviations                                           | xxiii |
| 1  | Intr    | roduction                                               | 1     |
|    | 1.1     | Shared Cache of CMP Architectures and Design Challenges | 2     |
|    | 1.2     | Research Goal                                           | 6     |
|    | 1.3     | Contributions of this Thesis                            | 7     |
|    | 1.4     | Thesis Overview.                                        | 9     |
| 2  | Bac     | ekground                                                | 13    |
|    | 2.1     | Multiprocessor System Design Space Exploration          | 13    |
|    |         | 2.1.1 Multiprocessor architectures                      | 16    |
|    | 2.2     | Cache Management Schemes                                | 21    |
|    | 2.3     | Cache Replacement Policy                                | 28    |
|    | 2 4     | Summary                                                 | 35    |

| 3 | Sim | ulation | Framework                                                 | 37  |
|---|-----|---------|-----------------------------------------------------------|-----|
|   | 3.1 | The X   | Itensa LX4.0 Processor                                    | 37  |
|   | 3.2 | Simul   | ation Framework and Environment                           | 43  |
|   | 3.3 | SPEC    | 2000 Benchmarks                                           | 49  |
|   | 3.4 | Metho   | odology                                                   | 52  |
|   | 3.5 | Summ    | nary                                                      | 55  |
| 4 | Cac | he Part | titioning Schemes                                         | 57  |
|   | 4.1 | Simpl   | e Static Allocation Scheme                                | 58  |
|   |     | 4.1.1   | Introduction                                              | 58  |
|   |     | 4.1.2   | Evaluation                                                | 59  |
|   | 4.2 | Utility | y-based Cache Partitioning Scheme                         | 66  |
|   |     | 4.2.1   | Overview                                                  | 66  |
|   |     | 4.2.2   | Evaluation                                                | 67  |
|   | 4.3 | CPI-b   | ased Cache Partitioning Scheme                            | 72  |
|   |     | 4.3.1   | Overview                                                  | 72  |
|   |     | 4.3.2   | Evaluation                                                | 73  |
|   | 4.4 | Summ    | nary                                                      | 77  |
| 5 | Cac | he Rep  | lacement Policy                                           | 81  |
|   | 5.1 | Introd  | uction                                                    | 82  |
|   | 5.2 | Motiv   | ration                                                    | 83  |
|   | 5.3 | Adapt   | ive Insertion and Promotion Policy                        | 85  |
|   |     | 5.3.1   | Description                                               | 85  |
|   |     | 5.3.2   | Comparing Operation of MI2PP policy to PIPP               | 88  |
|   |     | 5.3.3   | Simulation Results                                        | 92  |
|   |     | 5.3.4   | Hardware Overhead                                         | 107 |
|   |     | 5.3.5   | Performance Analysis                                      | 110 |
|   |     |         | 5.3.5.1 Comparison with LRU, UCP, and PIPP                | 110 |
|   |     |         | 5.3.5.2 Comparison with another cache partitioning scheme | 115 |
|   |     |         | 5.3.5.3 Cache size sensitivity study                      | 117 |
|   |     |         | 5.3.5.4 Implementation in the unpartitioned shared cache  | 121 |
|   |     | 5.3.6   | Conclusion                                                | 123 |

|   | 5.4 | Partition-based Replacement Policy                           | 124 |
|---|-----|--------------------------------------------------------------|-----|
|   |     | 5.4.1 Description                                            | 124 |
|   |     | 5.4.2 Evaluation                                             | 126 |
|   |     | 5.4.3 Conclusion                                             | 129 |
|   | 5.5 | Summary                                                      | 133 |
| 6 | Sha | red Cache Partitioning Based On Performance Gain Estimations | 135 |
|   | 6.1 | Introduction                                                 | 136 |
|   | 6.2 | Motivation                                                   | 138 |
|   | 6.3 | Structure of The New Scheme                                  | 141 |
|   |     | 6.3.1 Monitoring Environment and Scheduler                   | 143 |
|   |     | 6.3.2 Identifying the Partitioning Dependence                | 145 |
|   | 6.4 | Adaptive CPI-based Cache Partitioning Scheme                 | 147 |
|   | 6.5 | Experimental Evaluation.                                     | 150 |
|   |     | 6.5.1 Simulation Methodology                                 | 151 |
|   |     | 6.5.2 Simulation Results                                     | 151 |
|   | 6.6 | Performance Analysis                                         | 155 |
|   |     | 6.6.1 Cache Scaling and Sensitivity Analysis                 | 155 |
|   |     | 6.6.2 Effect of Dynamic Set Sampling                         | 159 |
|   |     | 6.6.3 Comparison with Different Cache Replacement Policies   | 161 |
|   | 6.7 | Complexity Management and Hardware Implementation            | 164 |
|   | 6.8 | Summary                                                      | 166 |
| 7 | Con | eclusion                                                     | 169 |
|   | 7.1 | Cache Replacement Policy                                     | 170 |
|   | 7.2 | Cache Partitioning Scheme                                    | 171 |
|   | 7.3 | List of Contributions                                        | 173 |
|   | 7.4 | Limitations                                                  | 174 |
|   | 7.5 | Future Work                                                  | 175 |
|   | 7.6 | Final Remarks                                                | 177 |
|   |     |                                                              |     |

**Bibliography** 

179

## LIST OF FIGURES

| 2.1 | Multithreading in a multiprocessor system.                                                                                          | 14       |
|-----|-------------------------------------------------------------------------------------------------------------------------------------|----------|
| 2.2 | Basic structure of a CMP. Multiple processors with a private level 1 cache share the same physical level 2 cache [Pal et al. 2012]. |          |
| 2.3 | Basic structure of a SMP. Multiple processors interconnected to their private level and level 2 caches [Pal et al. 2012].           |          |
| 2.4 | The basic structure of CLMP consists of multiple processors grouped into clusters (cluster size = 2) [Pal et al. 2012].             |          |
| 3.1 | Xtensa LX4.0 processor architectural block diagram [Tensilica Inc. 2011]                                                            |          |
| 3.2 | Xtensa processor interfaces.  The baseline architecture.                                                                            | 41<br>44 |
| 3.4 | The interconnection of the cores and cache hierarchy using the PIF in a dual-core system.                                           | 46       |
|     |                                                                                                                                     |          |
| 4.1 | Total L2 cache misses improvement over standard LRU                                                                                 | 61       |
| 4.2 | Distribution of total misses caused by each individual core in the quad-core                                                        | 63       |

| 4.3 | standard LRU replacement policy                                                                                                                                                                                               |
|-----|-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| 4.4 | Comparison of UCP, LRU, and EQ cache misses in a quad-core system                                                                                                                                                             |
| 4.5 | Total L2 misses incurred by each core when the system using UCP, LRU and EQ                                                                                                                                                   |
| 4.6 | Performance improvement of UCP over LRU and EQ for a quad-core system 71                                                                                                                                                      |
| 4.7 | Reduction of total cache misses of CPI-based partitioning scheme over LRU and EQ on in a quad-core system.                                                                                                                    |
| 4.8 | Distribution of total L2 misses incurred by each core in the system                                                                                                                                                           |
| 4.9 | Performance of CPI-based partitioning scheme over LRU and EQ for a quad-core system                                                                                                                                           |
| 5.1 | Example of insertion positions in PIPP. Consider Core 1 is allocated with 5 ways and Core 2 receives 3 ways from UCP scheme.                                                                                                  |
| 5.2 | Example operation of (a) PIPP from Xie and Loh [2009], and (b) MI2PP policy for a variety of insertions and promotions activities. Assume evictions for both policies always choose the lowest-priority cache line in the set |
| 5.3 | Performance result for IPC throughput of UCP, PIPP, and MI2PP policy normalised to an LRU-managed cache. 94                                                                                                                   |
| 5.4 | Performance as measured by the weighted speedups of IPC for UCP, PIPP and MI2PP policy compared to baseline LRU                                                                                                               |
| 5.5 | Comparison of UCP, PIPP and MI2PP policy over traditional LRU for the harmonic mean fairness metric.                                                                                                                          |
| 5.6 | L2 miss rate in MPKI of UCP, PIPP and MI2PP compared to baseline LRU policy                                                                                                                                                   |
| 5.7 | (a) IPC throughput of the 8-core system, and (b) L2 miss rate of the 4-core and 8-core systems, using PIPP and MI2PP, relative to the baseline LRU policy 100                                                                 |

| 5.8  | (a) IPC throughput, and (b) L2 miss rate of the 16-core system, using PIPP and                                                                                            |
|------|---------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
|      | MI2PP, relative to the baseline LRU policy                                                                                                                                |
| 5.9  | Distribution of L2 miss rate of each core in the 16-core system                                                                                                           |
| 5.10 | Distribution of L2 miss rate between MI2PP and LRU                                                                                                                        |
| 5.11 | Number of cores using lower priority positions as the insertion location                                                                                                  |
| 5.12 | Distribution of L2 miss rate among UCP, PIPP and MI2PP                                                                                                                    |
| 5.13 | Performance gain of MI2PP over a system using CPI-based cache partitioning scheme                                                                                         |
| 5.14 | Total miss rate reduction normalised to the original LRU replacement policy 117                                                                                           |
| 5.15 | Miss rate relative to the baseline LRU policy of a system with shared cache sizes of  (a) 512KB, and (b) 1MB using different cache partitioning schemes and MI2PP  policy |
| 5.16 | Miss rate, relative to UCP, of MI2PP/UCP policy implemented in different cache sizes                                                                                      |
| 5.17 | Miss rate reduction of MI2PP policy implemented in different sizes of unmanaged shared cache                                                                              |
| 5.18 | Performance improvement of PRP normalised to baseline LRU policy and UCP. 126                                                                                             |
| 5.19 | Performance results for the weighted speedup and fair speedup metrics 127                                                                                                 |
| 5.20 | L2 miss rate of PRP normalised to LRU and UCP                                                                                                                             |
| 5.21 | Performance improvement of UCP, PRP and EPRP in a 512KB cache, normalized to LRU policy                                                                                   |
| 5.22 | L2 miss rate of PRP and EPRP, relative to LRU policy                                                                                                                      |
|      |                                                                                                                                                                           |
| 6.1  | Each cache line is extended to include core identification                                                                                                                |
| 6.2  | The shadow tag array and four miss counters associated with each recency position in the monitoring hardware structure of each core                                       |

| 6.3  | Miss rate improvements relative to unmanaged traditional LRU cache 152                                                                                             |
|------|--------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| 6.4  | Performance comparison of UCP, CPI and ACCP over unmanaged LRU L2 shared cache                                                                                     |
| 6.5  | Performance comparison using the (a) weighted speedup, and (b) harmonic mean of weighted IPC metrics                                                               |
| 6.6  | Performance comparison of ACCP, UCP and CPI-based schemes on different sizes of L2 shared cache, relative to baseline LRU                                          |
| 6.7  | Miss rate reduction of ACCP over UCP and CPI-based schemes                                                                                                         |
| 6.8  | Distribution of L2 miss rate among UCP, CPI and ACCP                                                                                                               |
| 6.9  | IPC performance improvement of ACCP using (1) DSS (labelled as ACCP_S), and (2) UMON-global (labelled as ACCP) for different cache sizes, relative to baseline LRU |
| 6.10 | Performance comparison of MI2PP replacement policy employed on ACCP, UCP, and CPI-based schemes, relative to baseline LRU                                          |
| 6.11 | Miss rate of MI2PP replacement policy employed on ACCP, UCP and CPI-based schemes, relative to baseline LRU.                                                       |

## LIST OF TABLES

| 3.1 | Xtensa LX4.0 processor configuration                                                |
|-----|-------------------------------------------------------------------------------------|
| 3.2 | Baseline system's configuration                                                     |
| 3.3 | Classification of SPEC CPU2000 applications                                         |
| 3.4 | Workload summary                                                                    |
| 3.5 | List of stoppers/first applications to finish its execution in each workload mix 54 |
|     |                                                                                     |
|     |                                                                                     |
|     |                                                                                     |
| 5.1 | Workload mixes for the 8-core system. 102                                           |
| 5.2 | A hardware cost of an UMON circuit associated to a single processor 108             |
| 5.3 | Replacement logic complexity of the LRU, PIPP and MI2PP policies 109                |
|     |                                                                                     |
|     |                                                                                     |
|     |                                                                                     |
| 6.1 | A snapshot of the information recorded during one of the execution intervals 146    |
| 6.2 | Hardware overhead per processor                                                     |

## LIST OF ALGORITHMS

| 6.1 | Pseudo code for finding a line for eviction. The function returns the position of the | line |
|-----|---------------------------------------------------------------------------------------|------|
|     | to be replaced                                                                        | 142  |
| 6.2 | Adaptive CPI-based Cache Partitioning scheme.                                         | 149  |

#### **ABSTRACT**

One of the dominant approaches towards implementing fast and high performance computer architectures is the Chip Multi Processor (CMP), in which the design of the memory hierarchy has a critical effect on performance. Performance can be improved by the use of a shared cache on the chip, but it is a matter of ongoing research as to how each processor can gain the greatest advantage from the cache without affecting the performance of other processors. Moreover, power is a critical issue in CMP design.

Cache replacement policies and cache partitioning schemes have been investigated and proven able to enhance shared cache management. However, it is still desirable to have an optimal replacement policy that can retain useful data as long as possible to minimise miss rate and not degrade performance in a partitioned shared cache. Many of the metrics that have led to innovations in various partitioning schemes have increased the complexity of the partitioning strategies and the hardware overhead. There is scope for more work in achieving the right balance between power consumption and performance improvement in the CMP.

This thesis investigates the effects of the cache replacement policy in a partitioned shared cache. The goal is to quantify whether a better power/performance trade-off can be achieved by using less complex replacement strategies. A Middle Insertion 2 Positions Promotion (MI2PP) policy is proposed to eliminate cache misses that could adversely affect

the access patterns and the throughput of the processors in the system. The insertion, promotion and eviction strategies of the replacement policy are investigated and modified to improve shared cache utilisation by the competing processors. The MI2PP policy employs a static predefined insertion point, near distance promotion and the concept of ownership in the eviction policy to avoid resource stealing among the processors. With these strategies, the performance of the shared cache and the overall system were enhanced and the miss rate showed significant improvement over the Least Recently Used (LRU) policy.

While existing cache partitioning schemes use a variety of performance metrics to allocate the cache for each competing processor, most of the schemes focus only on one metric in their partitioning algorithm. An Adaptive Cycles per Instruction (CPI)-based Cache Partitioning (ACCP) scheme is introduced to investigate the efficiency of using two metrics to optimise partitioning decisions and to study the trade-offs between the complexity of using more performance metrics in partition decision-making and additional hardware cost. The analysis performed on ACCP showed that the performance of the processors was improved compared to the existing CPI-based partitioning scheme introduced by Muralidhara et al. [2010], which uses only one of the performance metrics employed in ACCP. Evaluation on a more complex scheme, namely the Utility Cache Partitioning (UCP) scheme demonstrated that the ACCP on average achieved similar performance although the ACCP is simpler to implement. The low hardware overhead incurred by ACCP showed that it is superior to UCP. ACCP demonstrates that the complexity of the partitioning mechanism and hardware cost could be reduced without degrading the overall system performance.

#### DECLARATION OF ORIGINALITY

This work contains no material that has been accepted for the award of any other degree or diploma in any university or other tertiary institution to Norfadila Mahrom and, to the best of my knowledge and belief, contains no material previously published written by another person, except where due reference has been made in the text.

I give consent to this copy of the thesis, when deposited in the University Library, being available for loan, photocopying, and dissemination through the library digital thesis collection, subject to the provisions of the Copyright Act 1968.

I also give permission for the digital version of my thesis to be made available on the web, via the University's digital research repository, the Library catalogue, the Australasian Digital Thesis Program (ADTP) and also through web search engines, unless permission has been granted by the University to restrict access for a period of time.

| Signed | Date |  |
|--------|------|--|

#### ACKNOWLEDGEMENTS

In the Name of Allah, the Most Gracious, the Most Merciful.

All praise and thanks to Allah.

My very great appreciation and deep gratitude to my principal supervisor, Associate Professor Michael J. Liebelt for his patient guidance, assistance, encouragement and useful criticism of my research work. His willingness to allocate his time so generously to support me in many aspects and his trust in me has been very much appreciated. I also would like to express my sincere thanks to Dr. Braden J. Phillips for his valuable and constructive suggestions during the planning and development of this research work.

The generosity of the School of Electrical and Electronic Engineering, The University of Adelaide in funding extensive access to the Tensilica tools under Tensilica Inc. University Program is highly appreciated. This work would not have been possible without the support of David Bowler in all issues related to technical resources. My research candidature was sponsored by the Ministry of Higher Education, Malaysia and Universiti Malaysia Perlis, and supported by Prof. Dr. R. Badlishah Ahmad, Prof. Dr. Ali Yeon Md Shakaff and Prof. Dato' Dr. Zul Azhar Zahid Jamal.

My grateful thanks are extended for the members of CHiPTec, particularly for everyone in the Journal Club for the valuable meetings that had introduced me to the best platform in expending my knowledge and research work. Not to forget, million thanks to Noorfazila

Kamal, Dr. Puteh Saad, Deborah Coleman-George, Ivana Rebellato and Chris Andrew Haerberli for their motivation and continuous moral support.

Finally, to my family. Thank you all for being you. You know how much you have done to make me come this far.

Norfadila Mahrom

#### LIST OF ABBREVIATIONS

ACCP Adaptive CPI-based Cache Partitioning

ACCP\_S ACCP with Sampled sets

API Application Programming Interface

ART 2 Adaptive Resonance Theory 2

BIP Bimodal Insertion Policy

CLMP Clustered Multi Processor

CMP Chip Multi Processor

CPI Cycles per Instruction

CPU Central Processing Unit

DIP Dynamic Insertion Policy

DPP Dynamic Promotion Policy

DSS Dynamic Set Sampling

EQ Equally Partitioned

FIFO First-In First-Out

FPGA Field-Programmable Gate Array

HU High Utility

I/O Input/Output

ID Identification

IPC Instructions per Cycle

ISA Instruction Set Architecture

ISS Instruction Set Simulator

ISSCC IEEE international Solid-State Circuits Conference

L1 Level one

L2 Level two

L3 Level three

LFU Least Frequently Used

LIP LRU Insertion Policy

LLC Last Level Cache

LRU Least Recently Used

LU Low Utility

MI2PP Middle Insertion 2 Positions Promotion

MMU Memory Management Unit

MPKI Misses per 1000 Instructions

MPP MRU Promotion Policy

MRU Most Recently Used

MU Marginal Utility

PIF Processor Interface

PIPP Promotion/Insertion Pseudo-Partitioning

PRP Partition-based Replacement Policy

PSEL Policy Selector

RAM Random Access Memory

RISC Reduced Instruction Set Computer

ROM Read Only Memory

SIPP Single-step Incremental Promotion Policy

SMP Symmetric Multi Processor

SPEC Standard Performance Evaluation Corporation

SU Saturating Utility

TADIP Thread-Aware Dynamic Insertion Policy

TADPP Thread-Aware Dynamic Promotion Policy

TIE Tensilica Instruction Extension

UCP Utility-Cache Partitioning

UMON Utility Monitor

UMON\_DSS Utility Monitor with Dynamic Set Sampling

VPR Versatile Place and Route

XCC Xtensa C/C++ Compiler

XPG Xtensa Processor Generator

XTMP Xtensa Modeling Protocol