
AM5726: memset to DDR monopolizes whole EMIF bandwidth

Part Number: AM5726

Hi,

Is it possible to set a different quality of service (for EMIF/DDR) per A15 core?

Or limit bursts per core?

Is there some example?

Thanks Rasty

  • Hi Rasty,

    I have assigned the question to the corresponding expert. He will get back to you.

    Regards

    Gokul

  • Hello Rasty,

    I think there is a mechanism in place to do this with our SoC but I do not believe we have any examples of execution.

    See the following chapter in TRM:

    15.3.4.15 Class of Service

    The MFLAG feature, which lets MPU accesses to DDR carry priority 1-for-1 through the L3 interconnect, would likely give the A15s as much DDR bandwidth as required. MFLAG overrides COS.

    Best,

    Josue

  • Hi Josue,

    I found a document that explains some performance "knobs". I understood only the first part, which talks about bandwidth control; it did not do what I expected. The second part is about COS, but it is just copy-paste from the Sitara manual, with no examples, so there is no added value in that part of the doc.

    It seems that COS is what I need, but it talks about some "master ID". I did not find anything in the Sitara manual that sounds like "master ID" in the context of COS.

    My goal is to give A15 core 1 priority over core 0 in access to DDR.

    I'd like to stress that we are talking about priority at the A15 core level: one core is more important than the other.

    Can someone give me an example?

    Thanks

    Rasty

  • Hello Rasty,

    What document are you referring to? You wrote:

    "I found some document that explains some performance 'knobs'"

    I think I found the document... is it this one? https://www.ti.com/lit/an/sprabx1a/sprabx1a.pdf

    Anyhow, I doubt you will find any examples from us that separate the A15 cores into individual cores, because the A15 was almost always used in SMP mode with an HLOS for the use cases that TI deployed.

    Furthermore, following the master ID thread, there is no differentiation between the A cores; the A15 cluster is merely referred to as MPU.

    -Josue

  • Hi

    I understand.

    We use a hypervisor, so SMP is not needed.

    Can you elaborate on "SMP mode with an HLOS for the use cases that were deployed by TI"? How can we put the A15 into non-SMP mode? We do not need cache coherency in the A15; each core is independent - AMP.

    Thanks

    Rasty

  • Rasty,

    This use-case is no longer supported but you can look into the reference design that was published here:

    https://www.ti.com/tool/TIDEP-009 - In particular the following document: https://www.ti.com/lit/ug/tidudf8a/tidudf8a.pdf

    Best,

    Josue

  • Hi,

    Is it possible to modify the cache policy?

    What is the default write policy? Write-through or write-back?

    Is it possible to re-program the cache controller (L2CACHE_CTRL_MPU)?

    Thanks

    Rasty

  • Rasty,

    What SDK are you using or planning to use?
    Please read the following chapter in the TRM: 4.3.2.1, MPU L2 Cache Memory System.

    I will need to refer you to a colleague for further details.

    -Josue

  • Hi

    The SDK does not matter; we can take examples and code from any SDK that you suggest.

    My question is exactly about chapter 4.3.2.1!

    Where is the description of MPU_L2CACHE_CTRL that is mentioned in that chapter?

    I'm looking for the "knobs" of the cache controller.

    Thanks

    Rasty

  • Rasty,

    The SDK matters because that is what delineates our standard support; anything outside of those boundaries is not supported.

    Second, the fact that the TRM does not list the "knobs" tells me this is not something you can change; at least it seems MPU_L2CACHE_CTRL was not made available in our SoC. Our ARM engineer is out of office until 2/25; as soon as he is able, he will follow up on your query.

    Best

    Josue

  • Hello,

    The A15 cluster does not differentiate between A15-0 and A15-1 data for common transactions. The shared L2 cache below the cores anonymizes the cached accesses from the rest of the system. The entire cluster is assigned one master ID.

    The priority carried on transactions for both cores can be set by the MA_MPU register. On the path to DDR there are a few mechanisms to help shape traffic, and there are a couple of application notes which give more focused information.

    The A15 does allow for some cache policy settings via the MMU tables for each core. What is practically achievable is somewhat constrained by the operating system and by hardware coherency choices. The A15 cluster is SMP-capable and ARM has set the implementation such that copy-back needs to be set to ensure the L1x2 and L2 caches are coherent. A person can mark a page table entry as write-through but the hardware will demote the access to a normal non-cached access.
    To your subject, the A15 can post and track a lot of outstanding transactions. This can result in some bubbles at the DDR if it gets more than its share of transactions. If there are some RT peripherals like a camera or a display, setting the EMIF_OCP_CONFIG register to limit MPU_THRESH_MAX can help throttle the A15 cores at the DDR by limiting credits to the MPU port. If there is a conflict at the SYS port it is possible to use DMM-PEG priorities and MFLAG. The application note, TRM, and source code are useful references. If you are mostly concerned about A15-0 vs. A15-1 (say if you are using a Jailhouse VM with HLOS+RTOS), jitter mitigations are limited as the design is an SMP cluster. You could mark areas as non-cached or add some cross-CPU coordination points to ensure no conflict (for example, have one CPU SEV to the other and have it pause, to stop it from interfering in a critical RT section).
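
    For illustration only, here is a minimal user-space sketch of adjusting MPU_THRESH_MAX in EMIF_OCP_CONFIG from Linux via /dev/mem. The base address, register offset, and field position below are my own assumptions and must be verified against the EMIF chapter of the TRM before trying anything like this.

    #include <fcntl.h>
    #include <stdint.h>
    #include <stdio.h>
    #include <sys/mman.h>
    #include <unistd.h>

    #define EMIF1_BASE        0x4C000000u  /* assumption: EMIF1 base address, verify in TRM */
    #define OCP_CONFIG_OFF    0x54u        /* assumption: EMIF_OCP_CONFIG offset, verify in TRM */
    #define MPU_THRESH_SHIFT  24           /* assumption: MPU_THRESH_MAX bit position, verify in TRM */
    #define MPU_THRESH_MASK   (0xFu << MPU_THRESH_SHIFT)

    int main(void)
    {
        int fd = open("/dev/mem", O_RDWR | O_SYNC);
        if (fd < 0) { perror("open /dev/mem"); return 1; }

        volatile uint32_t *emif = mmap(NULL, 0x1000, PROT_READ | PROT_WRITE,
                                       MAP_SHARED, fd, EMIF1_BASE);
        if (emif == MAP_FAILED) { perror("mmap"); return 1; }

        uint32_t v = emif[OCP_CONFIG_OFF / 4];
        printf("EMIF_OCP_CONFIG before: 0x%08x\n", v);

        /* Lower the MPU command credits so the A15 port cannot flood the DDR queue. */
        v = (v & ~MPU_THRESH_MASK) | (0x4u << MPU_THRESH_SHIFT);
        emif[OCP_CONFIG_OFF / 4] = v;

        printf("EMIF_OCP_CONFIG after:  0x%08x\n", emif[OCP_CONFIG_OFF / 4]);
        munmap((void *)emif, 0x1000);
        close(fd);
        return 0;
    }
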
    Regards,
    Richard W.
  • Hi Richard,

    I played with EMIF_OCP_CONFIG and MPU_THRESH_MAX. It did not do what I expected, maybe because traffic from both cores is queued through the same bottleneck.

    But you gave me an interesting direction.

    How can I throttle/pause one core from another?

    Where can I read about "CPU SEV"?

    What if we use a separate DDR (controller) per core? Will it help?

    Another thing that bothers me is burst size. According to the information I found, a burst is limited to 120 bytes, which implies a latency much lower than what we measure.

    What can the real worst-case burst from cache to DDR be? Maybe the cache is not the right direction?

    Thanks

    Rasty

  • Hello Rasty,

    The background leading questions are: what is the range of acceptable latency, and what does the per-SW/HW component budget look like? If the goal is a few µs, then DDR itself is likely not a good choice, as its refresh and periodic training may cost more than that. Using OCMC for critical RT might be more reliable. If the needs are in the low-ms range, that is likely easier to achieve. What SW is in use is very important.

    An EMIF per core probably would help reduce one source of jitter. The L2 cache is still a shared resource, so contention there might dominate. If you can mark critical regions as normal non-cached, maybe that will be less jittery; however, those HW paths are lower performance and you don't get cache-locality boosts.

    I suggested WFE-SEV as a low-overhead cross-core synchronization method. It's documented in the ARM TRMs. Often spin locks are built atop these. Other methods like cross-CPU PPI interrupts would also be low-overhead async messaging. Your SW and budgets will dictate which cross-core sync methods to use. Basically, I was suggesting something like a cross-core semaphore/spin-lock: before kicking off a low-latency event, take a resource lock, which causes the general-purpose CPU to spin (no conflicting traffic), do your RT work, then release the lock. Stalling the other CPU is a bit heavy-handed but it may be OK for your use case.
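
    Purely as an illustrative sketch (the gate variable, the function names, and the split of which core takes the gate are mine, not from a TI example), such a cross-core gate could look like this with GCC atomics and WFE/SEV:

    #include <stdint.h>

    static volatile uint32_t rt_gate;   /* 0 = free, 1 = RT core owns the shared path */

    /* RT core: take the gate before a latency-critical section. */
    static inline void rt_gate_take(void)
    {
        while (__atomic_exchange_n(&rt_gate, 1, __ATOMIC_ACQUIRE) != 0)
            __asm__ volatile("wfe" ::: "memory");     /* park until the owner signals */
    }

    /* RT core: release the gate and wake any core parked in WFE. */
    static inline void rt_gate_release(void)
    {
        __atomic_store_n(&rt_gate, 0, __ATOMIC_RELEASE);
        __asm__ volatile("dsb sy\n\tsev" ::: "memory");
    }

    /* GP core: call at a safe point; parks quietly while the RT core owns the gate. */
    static inline void gp_core_wait_if_gated(void)
    {
        while (__atomic_load_n(&rt_gate, __ATOMIC_ACQUIRE) != 0)
            __asm__ volatile("wfe" ::: "memory");
    }

    The gate variable would need to live in memory visible to both cores (for example OCMC, or a shared non-cached page under Jailhouse), and the GP side has to poll it at points where stalling is acceptable.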

    The DDR part itself has a fundamental burst size in which it pulls/pushes data. Some chopping or interruption may happen for small sizes. The A15, for cached traffic, will be talking in cache-line sizes (64 bytes).

    Regards,
    Richard W.
  • Hi

    The cores are completely independent; there is no need for hardware sync.

    My assumption was that the worst-case latency is a few µs, maybe 10-15 µs, which is still good.

    Interrupt latency on the RTOS side is very small, within expectations. Task latency was also good.

    But when we ported the complete application we started seeing something weird.

    The application (an RTOS thread), triggered by the ISR on time, does not finish its work on time under some conditions!

    It is a sort of black hole: the second core, which runs the RTOS, spends tens of µs somewhere. There are no RTOS calls in this program flow. It looks like a CPU stall.

    At the beginning we thought it might be DMA/eMMC, but after some research we are able to reproduce this behavior with three lines of code on the Linux side:

    #include <stdlib.h>
    #include <string.h>

    int main(void)
    {
        char *x = malloc(2 * 1024 * 1024);      /* 2 MB */
        memset(x, 0, 2 * 1024 * 1024);          /* dirty the whole allocation */
        free(x);                                /* hand the dirty pages back to Linux */
        return 0;
    }

    Two IMPORTANT notes:

    1. Without the memset+free the problem does not occur. It looks like the disposal (brk/sbrk) of dirty memory from the process heap back to the Linux free pool causes this problem.

    2. If I allocate less than 2 MB, the problem does not occur!

    2 MB is the size of the shared L2 cache; that's why I focus on the cache.

    Best regards

    Rasty

  • Hello,

    Using HW ETM trace is an effective way to understand what is happening on each core. The TI EVMs have a MIPI-60 connector which exports all the core execution information. In the past I debugged, with another engineer, almost the exact scenario you cite (Linux + RTOS) with a periodically missed deadline. Using ETM trace we could see that Linux had CPUFreq running and it sometimes changed the shared A15 core clock frequency, and this had a couple of impacts on the RTOS. Speed was one, but the RTOS was also using a PMU cycle timer with a wrong assumption about the rate, so it calculated the delta time for some deadline wrongly. After pinning the frequency, the latency was consistent and met that use case's deadlines.

    The above could easily play into your observations. A 'large' memset will also affect timing, but how much will vary. That memset could blow away the complete L2 cache under some conditions, forcing reloads for a lot of code. The summation of a bunch of loads will easily add many µs to something. An uncontrolled Linux UI can do lots of disruptive things... imagine a constant cache-flush call, or a TLB-invalidate broadcast which hits both cores' MMUs. As I mentioned, the default machinery for the A15 MPCore and Linux will define things as shared between the cores at a HW signaling level. A 'DSB' to flush buffers on CPU0 will also signal CPU1 to flush its buffers and wait for a completion before proceeding. If you mark a lot of MMU entries on the Linux side as 'non-shared' it can also help some from this angle. Force-quieting the alternate core as I mention may be an OK option if you are trying to guarantee something in the low-µs range... that could even mean a 'pre-touch' of ISRs to ensure they are cached before kicking off the jitter-sensitive routine.
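
    Purely as a sketch of the 'pre-touch' idea (isr_start/isr_end are hypothetical linker symbols I made up for the example, and 64 bytes is the A15 cache-line size):

    /* Preload the ISR's code into the I-cache right before arming the
     * jitter-sensitive event, so the first real run does not pay cold-miss
     * penalties.  Hot data could be walked the same way with plain loads or PLD. */
    extern const unsigned char isr_start[], isr_end[];   /* assumed linker symbols */

    static void pretouch_isr(void)
    {
        for (const unsigned char *p = isr_start; p < isr_end; p += 64)
            __asm__ volatile("pli [%0]" :: "r"(p));      /* preload into instruction cache */
        __asm__ volatile("dsb sy" ::: "memory");
    }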

    Regards,
    Richard W.
  • Hi,

    For the RTOS core we use an external interrupt from the FPGA via the nIRQ pins. ISR latency is fine (we measure it with an FPGA timer, and measure the distance between interrupts with a CPU clock timestamp; both give consistent results), so all is fine with interrupt latency.

    We do all useful work in threads.

    The system is resistant to 5-10 µs of jitter, but we miss by something like 30, maybe 50 µs; it is hard to tell exactly. It just looks like an irregularity in the execution time of the same code: it starts on time, but finishes 30-40 µs later than the previous run of exactly the same code.

    CPU frequency throttling and temperature management are disabled in Linux, and the CPU is properly cooled.

    We also made a test with a memset of only a few bytes - the same result; Linux marks the whole mmap'd block as "dirty".

    The size of the memset does not matter; one dirty byte in the whole allocated block yields the same result.

    Do you remember how to declare the whole DDR as non-shared in Linux/Jailhouse?

    Thanks

    Rasty

  • Hello,

    I do not know how to do this in modern Linux kernels. In the far past I attempted hack-surgery to do this with experimental success, but it was fragile and broke at the next update. If the system is using the short-descriptor format you might be able to use the PRRR to remap shareability for regions. Using NMRR/PRRR or MAIR is sometimes a good angle, as those attribute remappers typically don't get touched much post-init.
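
    To make the access path concrete (just a sketch using the ARMv7-A encodings; the actual remap values would need careful design, and this must run at PL1, e.g. from a kernel module):

    /* Read the short-descriptor attribute remap registers so the current
     * mapping can be inspected before any remap experiment. */
    static inline unsigned int read_prrr(void)
    {
        unsigned int v;
        __asm__ volatile("mrc p15, 0, %0, c10, c2, 0" : "=r"(v));   /* PRRR (MAIR0 with LPAE) */
        return v;
    }

    static inline unsigned int read_nmrr(void)
    {
        unsigned int v;
        __asm__ volatile("mrc p15, 0, %0, c10, c2, 1" : "=r"(v));   /* NMRR (MAIR1 with LPAE) */
        return v;
    }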

    One other surgical, experimental 'hack' might be to clear the ACTLR.SMP bit on both cores. The defined action of this bit is to take a core out of coherency. In the A15 that is how it is described in the power-down process used to off-line a CPU. On the A9 this sequence was similar and it did result in the blocking of broadcast actions (it removed the need to ack back a DSB/broadcast request). On the A15 it is all more complicated, as under the hood the HW sometimes ignores these settings and fixes things up in its own implementation-defined way. This should be a relatively easy quick hack.

    I'd still recommend something more direct like looking at ETM trace. If you can set up a CTI halt trigger from the high-latency event, the trace history likely holds exactly what the core was doing. Generally these kinds of things are easy to observe with, and sometimes require, JTAG tools. For MMU checks and ETM trace, I find TRACE32 provides a trusted decode which is otherwise very hard to get using other methods.

    Regards,
    Richard W.
  • Hi

    Setting ACTLR.SMP to 0 on core 1 (the RT core) did not help. According to ChatGPT, this register is shared between cores.

    We have only a basic JTAG connection (TDI/TDO) on our custom board.

    Regards

    Rasty

  • Hello,

    ChatGPT is wrong; this register is per-core. AI is a starting point for searching, but detailed experiments on relatively less-travelled paths require consulting the specification.

    With any number of pins brought out you can get good long-run CPU trace information. However, even with no trace pins exported (simple JTAG control only), ETM trace can still be used for this case, as it streams into a small internal buffer. Using the timing violation as a stop trigger, plus some filtering, you can see what both cores are doing. The flow is logically similar to how a scope with memory is used.

    I'll attach a short video showing ACTLR checks and basic trace usage to show the approach concretely. The time it takes varies depending on the level of support a tool provides.

    Regards,
    Richard W.
  • Hi 

    I'm back from vacation.

    We tried the following code on both cores, with the same result.

    Any other thoughts?

    Thanks

    Rasty

    void disable_smp_mode(void) {
        unsigned int actlr;

        // Read ACTLR
        asm volatile(
            "MRC p15, 0, %0, c1, c0, 1\n"  // Read ACTLR into actlr
            : "=r" (actlr)                 // Output operand
        );

        // Clear the SMP bit (bit 6)
        actlr &= ~(1 << 6);

        // Write back to ACTLR
        asm volatile(
            "MCR p15, 0, %0, c1, c0, 1\n"  // Write actlr back to ACTLR
            :
            : "r" (actlr)                  // Input operand
        );

        // Ensure the change takes effect immediately
        asm volatile("ISB\n" ::: "memory");
    }

    Update: this code is not correct.

    From your video I understood that we can also access this register via the JTAG debugger.

    If we clear this bit with the debugger it HELPS, even when cleared only on the RTOS core.

    We will try to fix this code and post an update.

  • Hello Rasty,

    From your notes, I understand you made an experiment to clear the bit, but when checking with the debugger you saw it did not take effect. You then forced the clear of SMP using the debugger and measured an improvement. Next, you are going to try to get the same result via run-time code (factoring out the debugger). It would be great if clearing SMP clearly helps reduce the coupling and thereby the interference.

    Regards,
    Richard W. 
  • Hi Richard,

    We cleared it, let the program run fully, and ran our original test a few times. The results are promising.

    The problem is that any attempt to write to this CP15 register from the RTOS app either has no effect or raises an exception.

    The CPSR says that we are in System mode. Is that OK, or must we be in SVC mode? Is there some other protection on CP15?

    One more observation: if I do this via the RTOS, there is no problem setting ACTLR to zero with the assembly instruction. The value after boot is 0x40.

    Under Linux/Jailhouse the value is 0x41 and the same instruction does not work.

  • Hello,

    This is a privileged instruction; it will not work from all contexts. Linux re-writes some of these registers as part of errata workarounds. Linux assumes SMP operation, so handling the bit for anything past experiments may need a bit deeper surgery.
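
    For the experiment itself, a rough sketch of a minimal kernel module that attempts the clear on every core from PL1 might look like the following (my own sketch, not TI code; whether the write is honored from non-secure PL1 depends on the secure-side settings, so it may silently have no effect):

    #include <linux/module.h>
    #include <linux/smp.h>

    static void clear_actlr_smp(void *info)
    {
        unsigned int actlr;

        asm volatile("mrc p15, 0, %0, c1, c0, 1" : "=r"(actlr));
        actlr &= ~(1u << 6);                       /* bit 6 = SMP on Cortex-A15 */
        asm volatile("mcr p15, 0, %0, c1, c0, 1" :: "r"(actlr));
        asm volatile("isb" ::: "memory");

        asm volatile("mrc p15, 0, %0, c1, c0, 1" : "=r"(actlr));
        pr_info("cpu%d: ACTLR now 0x%08x\n", smp_processor_id(), actlr);
    }

    static int __init actlr_smp_init(void)
    {
        on_each_cpu(clear_actlr_smp, NULL, 1);     /* run the clear on every online core */
        return 0;
    }

    static void __exit actlr_smp_exit(void) { }

    module_init(actlr_smp_init);
    module_exit(actlr_smp_exit);
    MODULE_LICENSE("GPL");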

    Regards,
    Richard W.
  • Hi Richard,

    We managed to create a Linux .ko that does it.

    On one machine it works as expected; on the second it has no effect.

    Can you send me a reference for the errata and share your thoughts about the "surgery"?

    We have not abandoned this direction; we have contention for the equipment, which is why it moves slowly.

    Best regards

    Rasty

  • Hello Rasty,

    I am not understanding. What is different between the two machines? Are they different board or CPU types, or something else? What difference allows it to work on one and not the other?

    Some of the ARM CP15 registers have special properties. Their accesses can be made selective (some bits can be written and others not) per mode, and some are banked (different values in secure and non-secure). Hypervisor mode can also add other indirections where a write traps to some other spot. For secure vs. non-secure, TI does have a 'monitor mode' call (as only in secure can the bits be written), and based on the chip version different values (for errata) might be written at boot. I can imagine that if you altered the SMP bit in one spot, some other spot could change it back (say, if it thinks an errata needs setting)... or after the ARM subsystem "OFF code" a context save and restore will happen (including errata) which will change the value... so if you have 'cpu-idle' running with a C-state that has the ARM subsystem off, or you go to a sleep state with off, the value will be reprogrammed.

    Regards,
    Richard W.
  • Hi

    By two different machines I mean two custom boards that should be identical.

    Do you know how to check, in Linux, the CPU information that contains the revisions?

    Thanks

    Rasty

    The run-time calls for the errata and the special CP15s go through the SMC call omap_smc1. At run time there are checks for ARM core revisions and SoC revisions. Any errata are applied at run time. A different chip version can activate different paths. The ARM-specific ones can be seen here: https://source.puri.sm/Librem5/linux/-/blob/b96eb58a96d9564d1880c112af705cb463fc4910/arch/arm/mach-omap2/omap-smp.c
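
    As a quick way to see the per-core revision fields from Linux user space (just an illustrative sketch; these lines are part of the standard ARM /proc/cpuinfo output), you can read /proc/cpuinfo and look at the "CPU variant"/"CPU revision" lines, which together give the rNpM revision:

    #include <stdio.h>
    #include <string.h>

    int main(void)
    {
        FILE *f = fopen("/proc/cpuinfo", "r");
        if (!f) { perror("fopen"); return 1; }

        char line[256];
        while (fgets(line, sizeof line, f)) {
            /* Keep only the revision-related fields for each core. */
            if (strstr(line, "CPU implementer") || strstr(line, "CPU variant") ||
                strstr(line, "CPU part")        || strstr(line, "CPU revision"))
                fputs(line, stdout);
        }
        fclose(f);
        return 0;
    }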

  • Thank you very much for the hint.

    We see different behavior with different boards. I'm not sure about date code of processors.

    We will check revisions.

    Best regards

    Rasty