AM5726: Access same GPMC mapped memory from dsp and a15

Vadim Malinovsky

Part Number: AM5726

Hi, we have two cores enabled:

1. ARM A15_0 with TI RTOS.

2. DSP1 with TI RTOS.

I want to access the same GPMC address from both cores. Currently I have access from a15_0, and the ASIC on the GPMC is mapped to 0x0800_0000, Chip Select A27.

From the TRM I see that the GPMC is mapped to 0x0000_0000.

The MPU (a15_0, a15_1) also views L3_MAIN from 0x00_0000_0000.

But DSP sees,

Does it mean that the DSP sees 0x0000_0000 with a base offset of 0x1400_0000, or is there no access for the DSP to 0x0000_0000 at all?

When I try to access 0x1400_0000 from the DSP, all the memory reads as 0...

This table from the DSP Subsystem adds to the confusion:

There is no mention of 0x1400_0000 here...

The 0x802_0000 address is also L3_MAIN; maybe it is the 0x0000_0000 mapped on the DSP?

Can you advise? Thanks.

over 3 years ago

0 Brad Griffis over 3 years ago

TI__Guru*** 125430 points

Vadim,

For the DSP core, any data accesses to addresses between 0 and 0x8200_0000 will stay local and access internal DSP registers. Those accesses will never get outside of the DSP megamodule. To get an access to exit the DSP megamodule, you must access an address such as the 0x1400_0000 address range or higher.

The "trick" to accessing address 0 from the DSP is that you must use the DSP MMU. That way from the DSP perspective you can access an address such as 0x1400_0000 which will cause the access to leave the megamodule. The MMU can correspondingly map that access to address 0 of the L3 interconnect which would allow you to access the GPMC.
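To make the address arithmetic in that explanation concrete, here is a minimal host-side sketch of what one MMU mapping does. It is only a model of the translation, not DSP MMU register programming. The struct, the function name and the 16 MB window size are hypothetical, and the physical base of 0x0800_0000 is assumed from the original post (where the ASIC chip select is mapped) rather than from the TRM.

```c
#include <stdbool.h>
#include <stdint.h>

/* Illustrative model of a single MMU mapping (not real register setup). */
typedef struct {
    uint32_t virt_base; /* address used by DSP code, leaves the megamodule */
    uint32_t phys_base; /* address presented on the L3 interconnect */
    uint32_t size;      /* window size in bytes (assumed 16 MB here) */
} dsp_mmu_map_t;

/* Assumed example: DSP 0x1400_0000 -> L3 0x0800_0000 (GPMC chip select). */
static const dsp_mmu_map_t gpmc_map = {
    0x14000000u, 0x08000000u, 0x01000000u
};

/* Translate a DSP-side address; returns false if it is outside the window. */
static bool dsp_virt_to_l3(const dsp_mmu_map_t *m, uint32_t va, uint32_t *pa)
{
    if (va < m->virt_base || (va - m->virt_base) >= m->size) {
        return false;
    }
    *pa = m->phys_base + (va - m->virt_base);
    return true;
}
```

With this model, a DSP access to 0x1400_0010 would appear on L3 as 0x0800_0010, while addresses outside the window are not translated by this entry.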

Best regards,
Brad

0 Vadim Malinovsky over 3 years ago in reply to Brad Griffis

Intellectual 320 points

Thanks Brad,

So the next question, obviously: is there any example of enabling and using the MMU on the DSP for AM572x on TI RTOS?

Also a question: do I need this range of memory to be cacheable or non-cacheable? Thanks.

Best regards,

Vadim.

0 Vadim Malinovsky over 3 years ago in reply to Brad Griffis

Intellectual 320 points

Update:

I managed to read the desired address of our ASIC from 0x8200_0000 without enabling the MMU. The TRM specifies that:

and also:

But now I have a problem: when dsp1 reads the same address, the data is corrupted by a15_0. I can suspend a15_0 with JTAG and then the data is read correctly. It seems I need to somehow protect the same GPMC registers. Do you have any idea how to do it?

Bottom line:

We have a counter on our ASIC, and we read the counter from both CPUs, a15_0 and dsp1, but we see that when the two processors read the same counter, one of the processors gets corrupted data. How can we solve it?

0 Brad Griffis over 3 years ago in reply to Vadim Malinovsky

TI__Guru*** 125430 points

Vadim Malinovsky said:

But now I have a problem: when dsp1 reads the same address, the data is corrupted by a15_0. I can suspend a15_0 with JTAG and then the data is read correctly. It seems I need to somehow protect the same GPMC registers. Do you have any idea how to do it?

Bottom line:

We have a counter on our ASIC, and we read the counter from both CPUs, a15_0 and dsp1, but we see that when the two processors read the same counter, one of the processors gets corrupted data. How can we solve it?

Can you connect a logic analyzer to the bus to understand what happens at the bus level during one of these corruptions? Also, can you give an example of how it is corrupted, e.g. what is the expected value and what is the actual value? From a processor perspective, I do not expect any such corruption, so I believe something is wrong with the way in which the processor and ASIC interact. You might also share a schematic snippet showing the connection.

0 Vadim Malinovsky over 3 years ago in reply to Brad Griffis

Intellectual 320 points

Brad Griffis said:

Also, can you give an example of how it is corrupted, e.g. what is the expected value and what is the actual value?

The expected value is between 1 and 32. We read the value at a frequency of 32 kHz: every 31.25 uSec the GPIO fires an interrupt on A15_0 and on DSP1, and the ISR reads the value of the counter, which increments every 31.25 uSec. Hence "1" tick means 31.25 uSec have passed. I expect the value to increment by "1" tick each interrupt. When a15_0 is stopped I read the correct values: the increment is "1" per interrupt and the value is between 1 and 32. But when a15_0 is running I get increments of more than "1", and the maximum value of the counter is 285.
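The check described above (one tick per interrupt, values in the range 1 to 32) could be automated with something like the following sketch. It is plain C that runs on a host; the type and function names are hypothetical, and the 1 to 32 range is taken from this post. It only classifies samples and does not touch the GPMC.

```c
#include <stdint.h>

#define COUNTER_MIN 1u
#define COUNTER_MAX 32u

/* Ticks elapsed between two in-range samples of a counter that wraps
 * from COUNTER_MAX back to COUNTER_MIN. */
static uint32_t ticks_between(uint16_t prev, uint16_t curr)
{
    uint32_t span = COUNTER_MAX - COUNTER_MIN + 1u;
    return (curr + span - prev) % span;
}

typedef struct {
    uint16_t last;      /* last in-range sample */
    uint32_t samples;   /* total samples seen */
    uint32_t anomalies; /* out-of-range values, gaps and duplicates */
    int      primed;    /* non-zero once a first valid sample is stored */
} tick_monitor_t;

/* Call once per interrupt with the value read from the counter. */
static void tick_monitor_update(tick_monitor_t *m, uint16_t sample)
{
    m->samples++;
    if (sample < COUNTER_MIN || sample > COUNTER_MAX) {
        m->anomalies++;          /* e.g. 0 or 285: not a legal value */
        return;                  /* keep the last good sample */
    }
    if (m->primed && ticks_between(m->last, sample) != 1u) {
        m->anomalies++;          /* gap (>1) or duplicate (0) */
    }
    m->last = sample;
    m->primed = 1;
}
```

Logging the sample index alongside each anomaly would help separate lost interrupts (a clean jump of N ticks) from values that are simply out of range.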

Thanks.

P.S: Brad, I'm still eager to know how to configure MMU for DSP1, can you provide some examples, thanks.

0 Rasty Slutsker over 3 years ago in reply to Brad Griffis

Expert 1520 points

Hi Brad,

We can share schematics but, since FPGA is involved it would be quite difficult to understand what is connected and how it is implemented.

Some facts:

1. We have worked with GPMC (access to FPGA) from a single ARM15 core for quite a long time, without any problem.

2. FPGA implements counter that is incremented every 31.25 uSec and runs from 1 to 1024

3. Every 31.25 uSec both cores get interrupt and access FPGA (via GPMC) independently.

4. If ARM is stopped, DSP gets correct information, otherwise we see missing (or duplicated) values

We cannot connect a logic analyzer, but we can sample some signals with FPGA tools. This is also not easy.

Before we do it, please explain the motivation behind this test:

a) why access from single core works good, but when second core is involved we have certain problem

b) what you expect to see and how this helps to find solution

My understanding of the situation.

When asymmetric cores (DSP, ARM15, CortexM) access peripheral bus it needs certain arbitration, either transaction serialization or kind of interrupt/resume.

I guess that some entity inside the SoC is responsible for this job. We do not do any programming work to set up this "entity" and assume that it is either set up by the BSP or does not need programming at all and should work out of the box.

Please suggest how to continue.

Best regards

Rasty

0 Brad Griffis over 3 years ago in reply to Rasty Slutsker

TI__Guru*** 125430 points

Rasty Slutsker said:
We can share schematics but, since FPGA is involved it would be quite difficult to understand what is connected and how it is implemented.

Let's start with some basic info:

Are you in a 8-bit or 16-bit data mode?
Are you configuring as non-muxed address data, muxed address/data, or muxed address/address/data (i.e. gpmc_adX pins only)?
Besides the address and data signals, what other pins are you connecting to the FPGA?

Rasty Slutsker said:
2. FPGA implements counter that is incremented every 31.25 uSec and runs from 1 to 1024

What is the size of this counter, e.g. 8/16/32 bit?

Rasty Slutsker said:
4. If ARM is stopped, DSP gets correct information, otherwise we see duplicated values

Can you please give a very specific example using hex numbers? I don't understand.

Rasty Slutsker said:
We cannot connect logic, but we can sample some signals with FPGA tools. This is also not easy.

If it is going to be very difficult, let's pause on this for now. The rationale was to correlate activity on the bus with code/activity inside the device. We might be able to figure this out with other methods though.

Rasty Slutsker said:

When asymmetric cores (DSP, ARM15, CortexM) access peripheral bus it needs certain arbitration, either transaction serialization or kind of interrupt/resume.

I guess that some entity inside the SoC is responsible for this job. We do not do any programming work to set up this "entity" and assume that it is either set up by the BSP or does not need programming at all and should work out of the box.

This is discussed briefly in the Technical Reference Manual in Section 15.4.4.6 L3 Interconnect Interface. The data requests are queued at the interface level. The way you access the GPMC makes a big difference. For example, two 16-bit reads will behave differently than a single 32-bit read. A single 32-bit read would become "atomic". The GPMC would break it down into multiple reads (e.g. 8 or 16 bit as defined by your interface width). This is a very simple thing for you to check, and quite possibly the reason for your issue. In other words, you do NOT need to match your variable size to the bus size. For example, even if you're using an 8-bit wide data bus, it is OK to perform a 64-bit access to the bus. In that scenario it would just turn into 8 consecutive 8-bit reads.
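As a rough illustration of that splitting, the sketch below lists the sub-accesses that a single CPU access would turn into for a given bus width. It is a simplified host-side model with hypothetical function names; it ignores alignment, bursts and address/data multiplexing, and it is not derived from the GPMC register description.

```c
#include <stddef.h>
#include <stdint.h>

/* Model: one CPU access of 'access_bits' on a bus of 'bus_bits' becomes
 * access_bits / bus_bits consecutive sub-accesses at incrementing
 * addresses. Writes up to 'max' sub-access addresses into 'out' and
 * returns the total number of sub-accesses (0 for invalid input). */
static size_t gpmc_split_access(uint32_t addr, unsigned access_bits,
                                unsigned bus_bits, uint32_t *out, size_t max)
{
    if (bus_bits == 0u || access_bits == 0u) {
        return 0;
    }
    size_t n = (access_bits > bus_bits) ? (access_bits / bus_bits) : 1u;
    unsigned step = bus_bits / 8u; /* bytes per sub-access */
    for (size_t i = 0; i < n && i < max; i++) {
        out[i] = addr + (uint32_t)(i * step);
    }
    return n;
}

/* A 16-bit register should be read through a volatile 16-bit pointer so
 * the compiler emits exactly one 16-bit load. */
static inline uint16_t read_reg16(const volatile uint16_t *reg)
{
    return *reg;
}
```

For example, a 32-bit access at 0x0800_0000 on a 16-bit bus yields two sub-accesses at 0x0800_0000 and 0x0800_0002, and a 64-bit access on an 8-bit bus yields eight.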

If you wish to assign "ownership" of the interface (i.e. extending for multiple consecutive reads/writes) you would need to do that in software with some sort of shared mutex that is respected by the ARM/DSP.
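The locking pattern behind such a shared mutex looks roughly like the sketch below. Note the big assumption: C11 atomics are used here only as a host-runnable stand-in. They would not work between the ARM and the DSP on the real device; there the flag would have to be replaced by a primitive that both cores can see, such as a hardware spinlock or an inter-processor gate if your software stack provides one. All names here are hypothetical.

```c
#include <stdatomic.h>
#include <stdint.h>

/* Stand-in lock; on target this would wrap a cross-core primitive. */
typedef struct {
    atomic_flag busy;
} bus_lock_t;

/* Returns 1 if the lock was taken, 0 if someone else holds it. */
static int bus_lock_try(bus_lock_t *l)
{
    return !atomic_flag_test_and_set_explicit(&l->busy, memory_order_acquire);
}

static void bus_lock_acquire(bus_lock_t *l)
{
    while (!bus_lock_try(l)) {
        /* spin until the other owner releases the bus */
    }
}

static void bus_lock_release(bus_lock_t *l)
{
    atomic_flag_clear_explicit(&l->busy, memory_order_release);
}

/* Read two consecutive 16-bit registers as one owned sequence, so the
 * other core cannot interleave its own accesses between them. */
static void read_pair_locked(bus_lock_t *l, const volatile uint16_t *regs,
                             uint16_t out[2])
{
    bus_lock_acquire(l);
    out[0] = regs[0];
    out[1] = regs[1];
    bus_lock_release(l);
}
```

This only helps when a sequence of accesses must not be interleaved. A single 16-bit read on a 16-bit bus should not need it.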

0 Rasty Slutsker over 3 years ago in reply to Brad Griffis

Expert 1520 points

Hi,

We work in 16-bit mode (GPMC is configured as 16-bit), one chip-select, multiplexed.

We tested it with a logic analyzer at the beginning of the project and saw that a 16-bit access produces only one single bus transaction.

table with signals is below.

Access from CPU (both ARM and DSP) is 16-bit.

I completely understand the idea of non-atomic access to data that is longer than bus size (more than 16-bits on 16-bit bus).

I expect that access to 16-bit is not interrupted, access to data longer than 16-bit can be interrupted, data maybe inconsistent (half words are unrelated) but no garbage expected.

We refer to one single read-access to 16-bit data on 16-bit bus.

Best regards

Rasty

0 Brad Griffis over 3 years ago in reply to Rasty Slutsker

TI__Guru*** 125430 points

The table didn't come through. Also, you didn't give an example of the issue. What do you mean by "duplicated values"? Does that mean that the top half of the bus is a mirror image of the bottom half? Does it mean that you're reading a stale value, i.e. a repeat of whatever you read the previous time?

0 Rasty Slutsker over 3 years ago in reply to Brad Griffis

Expert 1520 points

Hi,

FPGA increments 16-bit register and generates interrupt every 31.25 uSec.

Every interrupt we read (DSP) and store that register to memory.

We see gaps in counter. Interrupt rate is correct, tested with external equipment.

If we stop A15 with JTAG, counter is incremented and recorded without gaps.

Files with counter value will follow.

Question:

Is there a need to setup some arbitration policy in interconnection controller/arbiter? Maybe need to set some non-pre-emptive/exclusive access or priorities?

Pin	Signal
4	GPMC_CS0
8	GPMC_ADVN_ALE
10	GPMC_OEN_REN
12	GPMC_WEN
14	GPMC_BEN0
16	GPMC_BEN1
31	AD0
32	AD8
33	AD1
34	AD9
35	AD2
36	AD10
37	AD3
38	AD11
39	GND
40	GND
42	AD12
44	AD13
46	AD14
48	AD15
51	AD4
52	GPMC_WAIT0
53	AD5
55	AD6
57	AD7
58	GPMC_CLK
59	GND
60	GND

0 Rasty Slutsker over 3 years ago in reply to Brad Griffis

Expert 1520 points

I attach a record of the counter that we read from the FPGA. It is recorded by both sides (columns taken from separate records, not synchronized).

Counter runs from 1 to 32 (covers 1 msec period)

As you can see, in both cases counter is disturbed, while at DSP side disturbance is more significant.

We also prepare record where we shutdown 31.25 uSec interrupt on ARM side.

hw_mts_record.xlsx

0 Brad Griffis over 3 years ago in reply to Rasty Slutsker

TI__Guru*** 125430 points

Thanks for sharing. I would describe that as "missing" data, i.e. you end up with a value of 0 instead of the expected counter value. Since this only happens when both the ARM and the DSP are accessing the bus, it is possible that you have a timing issue in the GPMC programming with respect to back-to-back accesses. This is a situation where having visibility would help, but I have ideas for other things you can try.

Since these are uncached reads from the DSP, you will have gaps between the accesses (I would venture to guess a few hundred nanoseconds). However, when ARM and DSP are both accessing the GPMC you might have back-to-back accesses without gaps. I suspect your issue relates to a mismatch between the GPMC timing and the ability of the FPGA to respond to back to back accesses. How have you programmed GPMC_CONFIG_6_i? Try setting CYCLE2CYCLESAMECSEN=1 if it's not set already in conjunction with CYCLE2CYCLEDELAY=F. That will force some clock cycles between back-to-back reads to the same chip select. If that fixes the issue you can look into optimizing CYCLE2CYCLEDELAY, but for starters I would just use 0xF.
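A read-modify-write of those two fields might look like the sketch below. The bit positions (CYCLE2CYCLESAMECSEN at bit 7, CYCLE2CYCLEDELAY at bits 11:8) are my assumption and should be checked against the GPMC_CONFIG6_i description in the TRM before use. The function works on a plain 32-bit value, so reading and writing the actual register is left to the caller.

```c
#include <stdint.h>

/* Assumed GPMC_CONFIG6_i field layout; verify against the TRM. */
#define GPMC_CONFIG6_CYCLE2CYCLESAMECSEN     (1u << 7)
#define GPMC_CONFIG6_CYCLE2CYCLEDELAY_SHIFT  8u
#define GPMC_CONFIG6_CYCLE2CYCLEDELAY_MASK   (0xFu << 8)

/* Return 'reg' with the same-chip-select delay enabled and set to
 * 'delay_cycles' (0..15). All other bits are preserved. */
static uint32_t gpmc_config6_set_same_cs_delay(uint32_t reg,
                                               uint32_t delay_cycles)
{
    reg &= ~GPMC_CONFIG6_CYCLE2CYCLEDELAY_MASK;
    reg |= (delay_cycles << GPMC_CONFIG6_CYCLE2CYCLEDELAY_SHIFT)
           & GPMC_CONFIG6_CYCLE2CYCLEDELAY_MASK;
    reg |= GPMC_CONFIG6_CYCLE2CYCLESAMECSEN;
    return reg;
}
```

Starting with a delay of 0xF and reducing it once the problem disappears matches the approach suggested above.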

A related thought is to slow down the transfers to see if that impacts the issue.

0 Rasty Slutsker over 3 years ago in reply to Brad Griffis

Expert 1520 points

I'll share tomorrow GPMC configuration registers.

We already tuned the access time and ran extensive stress tests on FPGA/GPMC, like reading/writing data back to back, with all kinds of data, 16 and 32 bits.

When we read/write 32 bits to a 16-bit bus, 2 transactions are generated back-to-back, aren't they? This is also included in the stress test and gave positive results.

Slowing down is not feasible. We even want to increase the speed, because due to the relatively slow access to GPMC the software spends a lot of time on the bus and burns CPU time.

I'm pretty sure that the GPMC configuration is OK. The problem must be somewhere in arbitration, when 2 cores independently access the external bus and interfere with each other.

Would you please check what we can tune in the "interconnection" module, like parameters related to arbitration/priority/round robin/preemption/etc.?

0 Brad Griffis over 3 years ago in reply to Rasty Slutsker

TI__Guru*** 125430 points

Rasty Slutsker said:
when we read/write 32-bits to 16-bit bus, 2 transactions are generated back-to-back, isn't it? It is also included in stress test and gave positive results.

Yes, a 32-bit access will generate two 16-bit sub-accesses. They will be incrementing accesses. Not sure if the FPGA might possibly respond differently to back-to-back fixed accesses vs back-to-back incrementing accesses? You could only generate that sort of pattern using DMA. You cannot perform a 32-bit access where both halves access the same 16-bit data. Have you tried a 64-bit access? Does that also work properly?

Rasty Slutsker said:
I'm pretty sure that GPMC configuration is OK. Problem must be somewhere in arbitration, when 2 cores independently access external bus and interfere with each other.

While there are various knobs with respect to arbitration, there are none that would ever relate to data being dropped. The arbitration knobs relate to things like priority and throughput. There are not any "timings" for the interconnect where things would break and data lost or corrupted.

What operating system are you running on the ARM and DSP? If for example you have a clock out of spec or perhaps a voltage out of spec, that might also result in this sort of unexpected behavior. We can start to dig into some of those details, but FYI, it will be a lot of work. This might be a time to consider revisiting the interface itself to gain visibility into what's happening. If the issue is on the GPMC interface itself then you might waste a tremendous amount of time looking through all the configuration inside the rest of the SoC.

0 Rasty Slutsker over 3 years ago in reply to Brad Griffis

Expert 1520 points

I agree that back-to-back access to the same 16-bit address requires DMA or multiple processors, but on the other hand the access is done from a task, which is scheduled by a timer interrupt on both CPUs. However, due to different interrupt and task latency on both cores, a clash is virtually impossible. I'd expect rare corruption instead of a very systematic one.

We use TI-RTOS on both ARM and DSP, which is pretty transparent compared to Linux.

I'm looking for the priority "knob". I'd like to see what happens if I change the defaults. Would you send me some reference please?

Plans for tomorrow

We will test what happens if we

1. Stop reading counter by ARM

2. Read another address instead of counter on ARM side

3. Make a copy of counter at different address and access different locations on both sides

0 Brad Griffis over 3 years ago in reply to Rasty Slutsker

TI__Guru*** 125430 points

Rasty,

I discussed with some colleagues. Here were two key conclusions:

1. There is a similarity here to issues we've observed in the past related to the ARM making speculative accesses to the GPMC space. Do you ever see issues on the ARM? I believe that so far you have only mentioned the ARM causing issues for the DSP. Is that correct? That is similar to issues observed due to speculative accesses in the past. The way to stop the ARM from generating speculative accesses to GPMC is by mapping the GPMC space as "device" or "strongly ordered" memory.

2. If above suggestion doesn't help, do you by chance use Lauterbach JTAG? It is possible to capture traffic to the GPMC with Lauterbach Trace32 using OCP-Watchpoint capability (there's a GPMC probe group). That would be useful for gaining further insight.

Best regards,
Brad

0 Rasty Slutsker over 3 years ago in reply to Brad Griffis

Expert 1520 points

Hi Brad

1. We are doing some experiments in order to learn more about the influence of the ARM on the DSP. First conclusion: there is an influence. Second conclusion: it may be *not only* from GPMC access.

a) I'd not expect speculative prefetch from non-cached memory (I/O) because reads from peripherals can be destructive.

b) We would have seen the prefetch earlier, during debugging of GPMC with a logic analyzer 2 years ago, and would have noticed that behavior.

c) GPMC on both sides is defined as non-cached space. I hope it is enough. We did not see any problem until now.

2. We do not have Lauterbach.

Thanks

Rasty

0 Brad Griffis over 3 years ago in reply to Rasty Slutsker

TI__Guru*** 125430 points

Rasty Slutsker said:
a) I'd not expect speculative prefetch from non-cached memory (I/O) because read from peripherals can be destructive .

From the ARM Architecture Manual (v7A):

The architecture permits speculative accesses to memory locations marked as Normal if the access permissions and domain permit an access to the locations.

Also:

The architecture does not permit speculative data accesses to memory marked as Device or Strongly-ordered. However, it does not prohibit speculative translation table walks to Device or Strongly-ordered memory.

Furthermore be sure to mark the "NX" (no execute) bit for this region.

Rasty Slutsker said:
c) GPMC on both sides is defined as non-cached space. I hope it is enough. We did not see any problem that till now.

That is fine for the DSP. It is not sufficient for the ARM.

0 Rasty Slutsker over 3 years ago in reply to Brad Griffis

Expert 1520 points

Can you post example of "NX" entry?

Thanks

0 Brad Griffis over 3 years ago in reply to Rasty Slutsker

TI__Guru*** 125430 points

Oops, sorry. I mis-remembered it slightly. I was referring to the "XN" (eXecute Never) bit that is part of the MMU translation pages. Sorry for the confusion!
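For readers who want to see what "device memory with XN set" means at the descriptor level, here is a sketch of an ARMv7-A long-descriptor (LPAE) 2 MB block entry. Several things are assumptions: that the RTOS on the A15 uses the long-descriptor format, that MAIR slot 1 is free to hold the Device attribute, and that 0x0800_0000 is the GPMC window (taken from the first post). In practice you would normally set these attributes through the RTOS MMU module rather than build descriptors by hand, so treat the helper names as hypothetical.

```c
#include <stdint.h>

/* ARMv7-A long-descriptor (LPAE) stage 1 block descriptor fields. */
#define LPAE_BLOCK        0x1ull                          /* bits[1:0] = 0b01 */
#define LPAE_ATTRINDX(n)  (((uint64_t)(n) & 0x7u) << 2)   /* MAIR slot index */
#define LPAE_AF           (1ull << 10)                    /* access flag */
#define LPAE_PXN          (1ull << 53)                    /* privileged XN */
#define LPAE_XN           (1ull << 54)                    /* execute never */

/* MAIR attribute byte for Device memory in ARMv7 LPAE. */
#define MAIR_ATTR_DEVICE  0x04u

/* Place an 8-bit attribute into MAIR slot 'idx' (0..7). */
static uint64_t mair_set_attr(uint64_t mair, unsigned idx, uint8_t attr)
{
    unsigned shift = (idx & 0x7u) * 8u;
    mair &= ~(0xFFull << shift);
    mair |= (uint64_t)attr << shift;
    return mair;
}

/* Build a 2 MB block entry: Device attributes via 'attr_index',
 * read/write at PL1 only (AP[2:1] = 0b00), never executable. */
static uint64_t lpae_device_block_2mb(uint32_t phys_base, unsigned attr_index)
{
    uint64_t base = (uint64_t)phys_base & ~((1ull << 21) - 1ull);
    return base | LPAE_XN | LPAE_PXN | LPAE_AF
                | LPAE_ATTRINDX(attr_index) | LPAE_BLOCK;
}
```

The point of the Device attribute is that the core may not issue speculative data reads to the region, and XN stops speculative instruction fetches from it.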

0 Rasty Slutsker over 3 years ago in reply to Brad Griffis

Expert 1520 points

It appears that the DSP runs from DDR, so we are moving the software to on-chip memory. This will allow a cleaner experiment.

Meanwhile,

How do I convert TSCL to absolute time? The numbers that we see make no sense: they show big jitter and do not match the expected intervals.

Thanks

Rasty

0 Brad Griffis over 3 years ago in reply to Rasty Slutsker

TI__Guru*** 125430 points

Rasty,

Please start separate threads for related topics such as if you need more help with the TI-RTOS MMU settings or have questions related to TSCL.

You can put a link to them here if you'd like to connect them, but it's easier to get the right people assigned when we decouple some of the questions.

Thanks,
Brad

0 Brad Griffis over 3 years ago in reply to Brad Griffis

TI__Guru*** 125430 points

Hello,

Do you have any updates on this thread? Were you able to test using "device" memory with XN bit set?

Best regards,
Brad

0 Rasty Slutsker over 3 years ago in reply to Brad Griffis

Expert 1520 points

Short update.

This issue is still under investigation.

Down the road we found that we cannot get an interrupt from one digital input on 2 cores.

That was the reason for interrupt jitter and skipped numbers - we simply lost interrupts.

Then we found DSP performance issue - speed of DSP is much lower than we expected and we opened separate thread.

Finally, under heavy load (CPU is loaded 100%) we see that we read garbage from GPMC (unexpected values), so we are not sure whether it is real garbage or some software glitch due to CPU overload.

Best regards

Rasty

0 Brad Griffis over 3 years ago in reply to Rasty Slutsker

TI__Guru*** 125430 points

Rasty Slutsker said:
Finally, under heavy load (CPU is loaded 100%) we see that we read garbage from GPMC (unexpected values), so we are not sure whether it is real garbage or some software glitch due to CPU overload.

I've never seen an instance where data was corrupted internally on the buses. If, for example, there's a bottleneck somewhere, that could result in reduced throughput, but I have never seen it cause corrupted data. The FPGA tracing of this interface would be useful. You may want to wait just a little longer till we've looked through the clock settings in your other thread, but I think the FPGA tracing is quickly rising to the top of the list.

0 Brad Griffis over 3 years ago in reply to Brad Griffis

TI__Guru*** 125430 points

Your co-worker Vadim was able to run one of my scripts related to clocking in the other thread. Perhaps you could work with him to run a similar script that relates to voltages:

http://git.ti.com/sitara-dss-files/am57xx-dss-files/blobs/raw/main/am57xx-avs-abb-decode.dss

Can you please run that script and send the resulting file? On that same board please measure these 3 rails:

VDD_MPU
VDD_DSPEVE
VDD (aka "VDD_CORE")

Best regards,
Brad
