# COMP201 Computer Systems & Programming

Koç University

Lab #08 - Memory Organization

Fall 2024

#### Recall: Memory Hierarchy





### Why do we need Memory Hierarchies?

#### Some fundamental properties of computer systems

- Fast storage technologies cost more per byte, have less capacity, and require more power (heat!).
- The gap between CPU and main memory speed is widening.
- Locality comes to the rescue!

These fundamental properties of hardware and software suggest an approach for organizing memory and storage systems known as a memory hierarchy.

#### Fundamental idea of a memory hierarchy

- For each k, the faster, smaller device at level k serves as a cache for the larger, slower device at level k+1.
- Because of locality, programs tend to access the data at level k more often than they access the data at level k+1.

*(Ideal):* The memory hierarchy creates a large pool of storage that costs as much as the cheap storage near the bottom, but that serves data to programs at the rate of the fast storage near the top.



### Caching in Memory Hierarchy

| Cache Type           | What is Cached?      | Where is it Cached? | Latency (cycles) | Managed By      |
|----------------------|----------------------|---------------------|------------------|-----------------|
| Registers            | 4-8 byte words       | CPU core            | 0                | Compiler        |
| TLB                  | Address translations | On-Chip TLB         | 0                | Hardware MMU    |
| L1 cache             | 64-byte blocks       | On-Chip L1          | 4                | Hardware        |
| L2 cache             | 64-byte blocks       | On-Chip L2          | 10               | Hardware        |
| Virtual Memory       | 4-KB pages           | Main memory         | 100              | Hardware + OS   |
| Buffer cache         | Parts of files       | Main memory         | 100              | OS              |
| Disk cache           | Disk sectors         | Disk controller     | 100,000          | Disk firmware   |
| Network buffer cache | Parts of files       | Local disk          | 10,000,000       | NFS client      |
| Browser cache        | Web pages            | Local disk          | 10,000,000       | Web browser     |
| Web cache            | Web pages            | Remote server disks | 1,000,000,000    | Web proxy server |



#### Cache Example #1: TIO Breakdown

- Cache Size: 1 MB
- Block Size: 64 Bytes
- 4-way Set-Associative
- 36-bit byte-addressable address space.

#### Complete the TIO address breakdown:





#### Cache Example #2: TIO Breakdown

Assume a system with the following properties:

- Cache Size: 16 KB
- Line Size: 32 Bytes
- Direct Mapping

### What would be the values of each of the three fields for the following addresses?

| Address    | Tag | Index | Offset |
|------------|-----|-------|--------|
| 0x00B248AC |     |       |        |
| 0x5002AEF3 |     |       |        |
| 0x10203000 |     |       |        |
| 0x0023AF7C |     |       |        |



#### Cache Example #2: TIO Breakdown

Assume a system with the following properties:

- Cache Size: 16 KB
- Line Size: 32 Bytes
- Direct Mapping

What would be the values of each of the three fields for the following addresses?

| Address    | Tag     | Index | Offset |
|------------|---------|-------|--------|
| 0x00B248AC | 0x2C9   | 0x45  | 0xC    |
| 0x5002AEF3 | 0x1400A | 0x177 | 0x13   |
| 0x10203000 | 0x4080  | 0x180 | 0x0    |
| 0x0023AF7C | 0x8E    | 0x17B | 0x1C   |



#### Cache Simulator

- Simulates usage of Cache
- Step-by-step explanation
- Adjustable system parameters
- Cache hits, misses, counts and history







https://courses.cs.washington.edu/courses/cse351/cachesim/







*(Figure: cache simulator screenshot showing the cache lines (valid bit V, dirty bit D, tag T, data) next to physical memory, configured with m = 6, C = 16, K = 4, E = 2, write-back, write-allocate, LRU eviction.)*



#### Cache Simulator: Writing 0x13 at 0x22

| System parameter | Value               |
|------------------|---------------------|
| Address width    | 6 bits (m = 6)      |
| Cache size       | 16 bytes (C = 16)   |
| Block size       | 4 bytes (K = 4)     |
| Associativity    | 2-way (E = 2)       |
| Write hit        | Write-back          |
| Write miss       | Write-allocate      |
| Replacement      | Least Recently Used |

Manual memory access: write byte 0x13 to address 0x22. History: R(0x23), then W(0x22, 0x13).

The address breakdown shown in the screenshot is Tag = 100, Index = 0, Offset = 11, which corresponds to address 0x23. Address 0x22 has the same tag and index, with offset 10, so it falls in the same block.

Simulation message: Write: 0x13 at address 0x22

Physical memory:

| Address | +0 | +1 | +2 | +3 | +4 | +5 | +6 | +7 |
|---------|----|----|----|----|----|----|----|----|
| 0x00    | 20 | f6 | ef | ea | a2 | 5e | 9f | 1a |
| 0x08    | a2 | d0 | 4f | c4 | a0 | 0c | f7 | 27 |
| 0x10    | b8 | bd | 1a | ca | 35 | 95 | cb | 80 |
| 0x18    | 84 | 3f | 02 | 4f | 8e | f3 | f6 | e5 |
| 0x20    | cd | 4a | f6 | 48 | 1a | 6f | 7e | 63 |
| 0x28    | e9 | 36 | ae | 32 | 0d | 37 | bc | c9 |
| 0x30    | 93 | dc | b8 | 7a | 3b | 1a | b2 | 0c |
| 0x38    | d3 | a6 | a4 | 71 | e2 | 23 | 9c | 59 |

Cache contents (Set 0, first line):

| Step             | V | D | T | Cache data  |
|------------------|---|---|---|-------------|
| Before the write | 1 | 0 | 4 | cd 4a f6 48 |
| After the write  | 1 | 1 | 4 | cd 4a 13 48 |

All other cache lines are invalid. Physical memory at 0x20 still holds cd 4a f6 48 after the write, because a write-back cache defers the memory update until the dirty line is evicted.

#### Recall: General Caching Concepts: 3 Types of Cache Misses

- Cold (compulsory) miss
  - Cold misses occur because the cache starts empty and this is the first reference to the block.
- Capacity miss
  - Occurs when the set of active cache blocks (working set) is larger than the cache.
- Conflict miss
  - Most caches limit blocks at level k+1 to a small subset (sometimes a singleton) of the block positions at level k.
    - E.g. Block i at level k+1 must be placed in block (i mod 4) at level k.
  - Conflict misses occur when the level k cache is large enough, but multiple data objects all map to the same level k block.
    - E.g. Referencing blocks 0, 8, 0, 8, 0, 8, ... would miss every time.



#### Cache Example #3: Effective Access Time

Find the EAT for a system with the following properties:

- Cache access time: 10 ns
- Cache miss rate: 1%
- Main Memory access time: 200 ns

$$
\begin{aligned}
\text{EAT} &= T_{cache} + (1 - \text{Hit Rate}) \times T_{Memory} \\
           &= 10 + 0.01 \times 200 \\
           &= 10 + 2 \\
           &= 12 \text{ ns}
\end{aligned}
$$



### Locality in Programs

Principle of Locality:

 Programs tend to use data and instructions with addresses near or equal to those they have used recently.

#### • Temporal locality:

 Recently referenced items are likely to be referenced in the near future.

#### • Spatial locality:

 Items with nearby addresses tend to be referenced close together in time.



| Address | Bytes                | Instruction                  |
|---------|----------------------|------------------------------|
| 400512: | 55                   | `push %rbp`                  |
| 400513: | 48 89 e5             | `mov  %rsp,%rbp`             |
| 400516: | c7 45 fc 00 00 00 00 | `movl $0x0,-0x4(%rbp)`       |
| 40051d: | c7 45 f8 00 00 00 00 | `movl $0x0,-0x8(%rbp)`       |
| 400524: | c7 45 fc 00 00 00 00 | `movl $0x0,-0x4(%rbp)`       |
| 40052b: | eb 14                | `jmp  400541 <main+0x2f>`    |
| 40052d: | 8b 45 fc             | `mov  -0x4(%rbp),%eax`       |
| 400530: | 0f af 45 fc          | `imul -0x4(%rbp),%eax`       |
| 400534: | 89 45 f4             | `mov  %eax,-0xc(%rbp)`       |
| 400537: | 8b 45 f4             | `mov  -0xc(%rbp),%eax`       |
| 40053a: | 01 45 f8             | `add  %eax,-0x8(%rbp)`       |
| 40053d: | 83 45 fc 01          | `addl $0x1,-0x4(%rbp)`       |
| 400541: | 83 7d fc 09          | `cmpl $0x9,-0x4(%rbp)`       |
| 400545: | 7e e6                | `jle  40052d <main+0x1b>`    |
| 400547: | b8 00 00 00 00       | `mov  $0x0,%eax`             |
| 40054c: | 5d                   | `pop  %rbp`                  |
| 40054d: | c3                   | `retq`                       |
| 40054e: | 66 90                | `xchg %ax,%ax`               |

#### **Temporal or Spatial Locality?**




Both!

#### Recall: Spatial Locality in Arrays

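```
/* M and N are compile-time constants (M = 2, N = 3 in the table below). */
int sumarraycols(int a[M][N])
{
    int i, j, sum = 0;

    /* Column-wise traversal: the inner loop walks down one column,
       so consecutive accesses are N elements apart in memory. */
    for (j = 0; j < N; j++)
        for (i = 0; i < M; i++)
            sum += a[i][j];
    return sum;
}
```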

| Address      | 0        | 4        | 8        | 12       | 16       | 20       |
|--------------|----------|----------|----------|----------|----------|----------|
| Contents     | $a_{00}$ | $a_{01}$ | $a_{02}$ | $a_{10}$ | $a_{11}$ | $a_{12}$ |
| Access order | 1        | 3        | 5        | 2        | 4        | 6        |

#### **Good Locality?**

No! (Stride-N pattern)



#### Recall: Spatial Locality in Arrays



| Address      | 0        | 4        | 8        | 12       | 16       | 20       |
|--------------|----------|----------|----------|----------|----------|----------|
| Contents     | $a_{00}$ | $a_{01}$ | $a_{02}$ | $a_{10}$ | $a_{11}$ | $a_{12}$ |
| Access order | 1        | 2        | 3        | 4        | 5        | 6        |

#### **Good Locality?**



#### Recall: Spatial Locality in Arrays



**Good Locality?** 

No!



#### Locality in Data



**Good Locality?** 



#### Locality in Data



How about this one?



#### **Concluding Observations**

#### Programmer can optimize for cache performance

- How data structures are organized
- How data are accessed
  - Nested loop structure
  - Blocking is a general technique

#### All systems favor "cache friendly code"

- Getting absolute optimum performance is very platform specific
  - Cache sizes, line sizes, associativities, etc.
- Can get most of the advantage with generic code
  - Keep working set reasonably small (temporal locality)
  - Use small strides (spatial locality)





https://valgrind.org/

### Code Profiling

- A **code profiler** is a tool to analyze a program and report on its resource usage
  - "resource" could be memory, CPU cycles, network bandwidth, and so on
- The program is run under control of a profiling tool
- During application development, a common step is to improve runtime performance using profiling tools.
- To avoid wasting time optimizing functions that are rarely used, you need to know in which parts of the program most of the time is spent.
- Some examples:
  - Callgrind, GProf, JConsole, CLR



### Valgrind

The Valgrind framework supports a variety of runtime analysis tools:

- memcheck
  - detects memory errors/leaks
- massif
  - reports on heap usage
- helgrind
  - detects multithreaded race conditions
- callgrind/cachegrind
  - profiles CPU/cache performance



#### Callgrind/cachegrind

- The Valgrind profiling tools are cachegrind and callgrind
- The cachegrind tool simulates the L1/L2 caches and counts cache misses/hits.
- The callgrind tool counts function calls and the CPU instructions executed within each call and builds a function callgraph
- The callgrind tool includes a cache simulation feature adopted from cachegrind, so you can actually use **callgrind for both CPU and cache profiling**.



#### Basic Usage of Callgrind

- First, compile the program with debugging information enabled
  - `gcc -g -ggdb name.c -o name.out`
- Then run the program under Valgrind and explicitly request the callgrind tool (if unspecified, the tool defaults to memcheck)
  - `valgrind --tool=callgrind [possible options] ./name.out program-arguments`
- The result is stored in the file `callgrind.out.PID`, where PID is the process identifier.

| Output line                     | Meaning                          |
|---------------------------------|----------------------------------|
| `==22417== Events    : Ir`      | Event type collected             |
| `==22417== Collected : 7247606` | Number of instructions read (Ir) |
| `==22417==`                     |                                  |
| `==22417== I refs: 7,247,606`   | Same count, formatted            |

### Basic Usage of Callgrind

Counting instructions with callgrind

- The callgrind output file is a text file, but its contents are not intended for you to read yourself.
  - Use `callgrind_annotate` to read the output properly:
    - `callgrind_annotate --auto=yes callgrind.out.PID`
- The `--auto=yes` option reports counts for each C statement.
- Do not forget to replace PID with the actual number.


#### Sorts a 1000-member array using selection sort

```
        .  void swap(int *a, int *b)
    3,000  {
    3,000      int tmp = *a;
    4,000      *a = *b;
    3,000      *b = tmp;
    2,000  }
        .
        .  int find_min(int arr[], int start, int stop)
    3,000  {
    2,000      int min = start;
2,005,000      for(int i = start+1; i <= stop; i++)
4,995,000          if (arr[i] < arr[min])
    6,178              min = i;
    1,000      return min;
    2,000  }
        .
        .  void selection_sort(int arr[], int n)
        3  {
    4,005      for (int i = 0; i < n; i++) {
    9,000          int min = find_min(arr, i, n-1);
7,014,178  => sorts.c:find_min (1000x)
   10,000          swap(&arr[i], &arr[min]);
   15,000  => sorts.c:swap (1000x)
```

#### Interpreting the results

- The Ir counts are basically the count of assembly instructions executed.
- By default, the counts are exclusive
- The counts for a function include only the time spent in that function and not in the functions that it calls.
- By using exclusive counts you can detect the bottlenecks.
- Here, the work is concentrated in the loop to find the min value





### Basic Usage of Callgrind

Adding in cache simulation

Enable the cache simulation by adding the `--simulate-cache=yes` option (documented as `--cache-sim=yes` in newer Valgrind releases):

`valgrind --tool=callgrind --simulate-cache=yes ./name.out args`

- The cache simulator models a machine with a split L1 cache (separate instruction I1 and data D1), backed by a unified second-level cache (L2).
- Similar to the previous example, callgrind\_annotate should be used to interpret the output.



#### Callgrind Example

Callgrind summary for process 16409:

| Ir        | Dr        | Dw      | I1mr | D1mr | D1mw | I2mr | D2mr | D2mw |
|-----------|-----------|---------|------|------|------|------|------|------|
| 7,163,066 | 4,062,243 | 537,262 | 591  | 610  | 182  | 16   | 103  | 94   |

| Metric        | Total     | Reads (rd) | Writes (wr) |
|---------------|-----------|------------|-------------|
| I refs        | 7,163,066 |            |             |
| I1 misses     | 591       |            |             |
| L2i misses    | 16        |            |             |
| I1 miss rate  | 0.0%      |            |             |
| L2i miss rate | 0.0%      |            |             |
| D refs        | 4,599,505 | 4,062,243  | 537,262     |
| D1 misses     | 792       | 610        | 182         |
| L2d misses    | 197       | 103        | 94          |
| D1 miss rate  | 0.0%      | 0.0%       | 0.0%        |
| L2d miss rate | 0.0%      | 0.0%       | 0.0%        |
| L2 refs       | 1,383     | 1,201      | 182         |
| L2 misses     | 213       | 119        | 94          |
| L2 miss rate  | 0.0%      | 0.0%       | 0.0%        |

It looks like we have cache-friendly code: every miss rate rounds to 0.0%.



- Ir: I cache reads (instructions executed)
- I1mr: I1 cache read misses (instruction wasn't in I1 cache but was in L2)
- I2mr: L2 cache instruction read misses (instruction wasn't in I1 or L2 cache, had to be fetched from memory)
- Dr: D cache reads (memory reads)
- D1mr: D1 cache read misses (data location not in D1 cache, but in L2)
- D2mr: L2 cache data read misses (location not in D1 or L2)
- Dw: D cache writes (memory writes)
- D1mw: D1 cache write misses (location not in D1 cache, but in L2)
- D2mw: L2 cache data write misses (location not in D1 or L2)

#### Callgrind Example


| Ir        | Dr        | Dw      | Source                                             |
|-----------|-----------|---------|----------------------------------------------------|
| .         | .         | .       | `void swap(int *a, int *b)`                        |
| 3,000     | 0         | 1,000   | `{`                                                |
| 3,000     | 2,000     | 1,000   | `int tmp = *a;`                                    |
| 4,000     | 3,000     | 1,000   | `*a = *b;`                                         |
| 3,000     | 2,000     | 1,000   | `*b = tmp;`                                        |
| 2,000     | 2,000     | .       | `}`                                                |
| .         | .         | .       | `int find_min(int arr[], int start, int stop)`     |
| 3,000     | 0         | 1,000   | `{`                                                |
| 2,000     | 1,000     | 1,000   | `int min = start;`                                 |
| 2,005,000 | 1,002,000 | 500,500 | `for(int i = start+1; i <= stop; i++)`             |
| 4,995,000 | 2,997,000 | 0       | `if (arr[i] < arr[min])`                           |
| 6,144     | 3,072     | 3,072   | `min = i;`                                         |
| 1,000     | 1,000     | .       | `return min;`                                      |
| 2,000     | 2,000     | .       | `}`                                                |
| .         | .         | .       | `void selection_sort(int arr[], int n)`            |
| 3         | 0         | 1       | `{`                                                |
| 4,005     | 2,002     | 1,001   | `for (int i = 0; i < n; i++) {`                    |
| 9,000     | 3,000     | 5,000   | `int min = find_min(arr, i, n-1);`                 |
| 7,014,144 | 4,006,072 | 505,572 | `=> sorts.c:find_min (1000x)`                      |
| 10,000    | 4,000     | 3,000   | `swap(&arr[i], &arr[min]);`                        |
| 15,000    | 9,000     | 4,000   | `=> sorts.c:swap (1000x)`                          |

The six miss columns (I1mr, D1mr, D1mw, I2mr, D2mr, D2mw) are omitted here because they are almost entirely 0 or 1 and mostly illegible in the extracted slide; the notable exception is D1mr = 32 on the `if (arr[i] < arr[min])` line.




#### **Additional Points**

- L2 misses are much more expensive than L1 misses, so pay attention to passages with high **D2mr** or **D2mw** counts.
- Even a small number of misses can be quite important: an L1 miss typically costs around 5-10 cycles, while an L2 miss can cost as much as 100-200 cycles.
- Callgrind cannot detect the bottleneck of your program if it is related to file I/O
- Try to examine different paths of your program



#### Callgrind Example

*(Figure: `callgrind_annotate` output for `./matrix_good.out`, profile data file `callgrind.out.18974`, creator callgrind-3.15.0, with auto-annotated source `matrix_good.c`.)*

- The simulated I1 and D1 caches are each 32768 B with 64 B lines.
- `efficient_sum(int arr[100][100][100])` nests its loops in the order `i`, `j`, `k` and executes `sum += arr[i][j][k];` in the innermost loop.
- That statement accounts for 62,500 D1 read misses (D1mr), about one miss per 16 elements read, since a 64 B line holds sixteen 4-byte ints.
- Program total: D1mr = 63,834.

#### Callgrind Example

*(Figure: `callgrind_annotate` output for `./matrix_bad.out`, profile data file `callgrind.out.27711`, creator callgrind-3.15.0, with auto-annotated source `matrix_bad.c`.)*

- The simulated I1 and D1 caches are each 32768 B with 64 B lines.
- `inefficient_sum(int arr[100][100][100])` nests its loops in the order `k`, `i`, `j` and executes `sum += arr[i][j][k];` in the innermost loop.
- That statement accounts for 999,985 D1 read misses (D1mr), roughly one miss for each of the 1,000,000 element reads.
- Program total: D1mr = 1,001,339.

#### References

1. Some of the slides are borrowed from materials in Stanford CS107, CMU 15-213, and CS201 at Portland State University.
2. https://stackoverflow.com/questions/16699247/what-is-a-cache-friendly-code
3. https://www.valgrind.org/docs/manual/manual.html
4. The Cache Simulator and its demos are borrowed from materials in University of Washington, CSE 351.

### Readings

1. What Every Programmer Should Know About Memory

