Reopening of Unexpected Performance Monitoring Unit Data Cache Miss counter discussion.
Hi Franck,
I did a lot of tests today (using my own code rather than memset()) and saw similar results:
1. byte write, byte read: 1 MByte
2. 32-bit word write and read
// Write numBytes incrementing byte values starting at EMIF address addr.
// Per the ARM calling convention, addr arrives in r0 and numBytes in r1.
void str_bytes(uint32_t addr, uint32_t numBytes) {
    asm("\tadd r1, r0, r1");      // r1 = end address (addr + numBytes)
    asm("\tmov r2, #0");          // r2 = byte value to store
    asm("copy_loop:");
    asm("\tstrb r2, [r0], #1");   // store one byte, post-increment the address
    asm("\tadd r2, r2, #1");      // next byte value
    asm("\tcmp r0, r1");
    asm("\tblt copy_loop");       // loop until the end address is reached
}
// Read numBytes bytes starting at EMIF address addr (addr in r0, numBytes in r1).
void ldr_bytes(uint32_t addr, uint32_t numBytes) {
    asm("\tadd r1, r0, r1");      // r1 = end address (addr + numBytes)
    asm("copy_loop1:");
    asm("\tldrb r3, [r0], #1");   // load one byte, post-increment the address
    asm("\tcmp r0, r1");
    asm("\tblt copy_loop1");      // loop until the end address is reached
}
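For reference, the two routines are called as ordinary C functions. The EMIF base address and transfer size below are only placeholders for illustration, not the actual values used in the test:

// Hypothetical EMIF window base and transfer size -- placeholders only.
#define EMIF_TEST_BASE   0x60000000u
#define TEST_SIZE_BYTES  (1024u * 1024u)   // 1 MByte, as in test case 1

void run_byte_test(void) {
    str_bytes(EMIF_TEST_BASE, TEST_SIZE_BYTES);   // byte writes to EMIF
    ldr_bytes(EMIF_TEST_BASE, TEST_SIZE_BYTES);   // byte reads from EMIF
}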
The Data Cache Miss count for both read and write is much lower than the expected value. I don't know how the PMU counts this event.
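For context, here is a minimal sketch of how the data cache miss event can be counted through the standard ARMv7 CP15 PMU interface (using GCC-style inline asm, event 0x03 = L1 data cache refill, and event counter 0, in a privileged mode). This is only an assumption about the general setup, not necessarily the exact configuration used in this test:

#include <stdint.h>

// Configure event counter 0 to count L1 data cache refills (event 0x03)
// and start counting. Assumes privileged (non-user) execution.
static inline void pmu_start_dcache_miss_count(void) {
    uint32_t val;

    // PMCR: set E (enable all counters) and P (reset event counters)
    asm volatile("mrc p15, 0, %0, c9, c12, 0" : "=r"(val));
    val |= (1u << 0) | (1u << 1);
    asm volatile("mcr p15, 0, %0, c9, c12, 0" : : "r"(val));

    // PMSELR: select event counter 0
    val = 0;
    asm volatile("mcr p15, 0, %0, c9, c12, 5" : : "r"(val));

    // PMXEVTYPER: event 0x03 = L1 data cache refill (miss)
    val = 0x03;
    asm volatile("mcr p15, 0, %0, c9, c13, 1" : : "r"(val));

    // PMCNTENSET: enable event counter 0
    val = (1u << 0);
    asm volatile("mcr p15, 0, %0, c9, c12, 1" : : "r"(val));
}

// Read back the current value of event counter 0.
static inline uint32_t pmu_read_dcache_miss_count(void) {
    uint32_t val = 0;

    // PMSELR: select event counter 0, then read PMXEVCNTR
    asm volatile("mcr p15, 0, %0, c9, c12, 5" : : "r"(val));
    asm volatile("mrc p15, 0, %0, c9, c13, 2" : "=r"(val));
    return val;
}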
Hello Franck,
This is what we got from ARM regarding the cache miss counts:
The PMU events are counting a high number of write instructions, a similar (slightly smaller) number of cache line evictions, but only very few cache linefills.
I suspect the reason for these numbers is simply that the processor core only needs to generate this small number of linefills.
When the processor core is writing to a full cacheline, the first write instruction will trigger a single external linefill. The subsequent write instructions will fill up the store buffer, and it is quite possible that they will cover the full cache line space before the linefill access returns from external memory. This means that the linefill data is no longer required and can be discarded when it is returned from the memory system. Meanwhile, the instructions will start writing to the next cacheline location and will again trigger a linefill. If the external memory access time is sufficiently long, then eventually the Cortex-R5 core will have the maximum number of outstanding linefills possible and will not be able to issue any more until one of the outstanding linefills completes. When this happens, the write instructions can continue to fill up the store buffer before issuing a new linefill. If the write instructions can fill a full cacheline, the data can be merged into the cacheline location without ever triggering an external linefill, so the cache location is updated without the need for a linefill access.
So this is what I suspect is happening in this test: the Cortex-R5 does not need to issue a significant number of linefill accesses, because the store instructions are filling the cacheline locations without needing a linefill.
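As a rough illustration (assuming the 1 MByte linear byte-write test and the Cortex-R5's 32-byte cache line), a naive expectation would be on the order of 1 MByte / 32 bytes = 32768 write linefills, i.e. 32768 data cache miss events. With the store buffer merging full lines as described above, only a small fraction of those linefills actually has to be issued, which would explain the low Data Cache Miss counts observed.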
Hi QJ,
Thanks for the clarification, which makes complete sense, as my test is filling the cache lines in a completely linear way.
From a performance point of view, such an optimization indeed behaves like a cache hit, as the line is marked dirty without paying the penalty of the external memory access.
Thanks again for your support.
Best Regards,
Franck.