TMS570 Flash ROM waitstate with higher CPU HCLK operating frequency



Hi there,

I understand that I have to enable pipeline mode and configure waitstate=3 when the CPU HCLK/GCLK operating frequency exceeds 105 MHz.

Can anyone please tell me what waitstate=3 means when loading instructions from the Flash ROM, given that this MCU has a CPU pipeline depth of 8 levels?

If I move some critical code to RAM (no waitstate required even at higher CPU operating speeds), will this give an appreciable performance boost? It would be greatly appreciated if someone could quantify the gain of running a section of code from RAM.

Thanks.

  • Chuck,

    The read and write speed for the Flash memory is limited to about 45 MHz. Reading flash with 3 wait states means that a flash read takes about 4 CPU clocks. The pipeline we propose to enable has nothing to do with the CPU pipeline; it is a small buffer in the Flash wrapper. The CPU can read from the Flash wrapper pipeline buffer in a single cycle. If the data requested by the CPU is not in the pipeline buffer, the Flash wrapper reads 128 bits from Flash in 4 cycles, and the CPU can then access the rest of that data in the pipeline buffer in a single cycle if the accesses are sequential. I would not recommend moving your critical code into RAM right away. I would suggest you first measure the timing of the code executing from Flash with the pipeline enabled.
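
    For reference, the setup being discussed looks roughly like this. This is only a minimal sketch: the wrapper base address, register name, and bit positions below are assumptions to be checked against the device TRM, and HALCoGen's generated flash setup code normally does the equivalent for you.

        #include <stdint.h>

        /* Sketch only: base address and FRDCNTL bit layout are assumptions,
         * verify against the device TRM.  HALCoGen's generated setup code
         * normally performs the equivalent of this. */
        #define FLASH_WRAPPER_BASE   0xFFF87000U
        #define FRDCNTL              (*(volatile uint32_t *)(FLASH_WRAPPER_BASE + 0x00U))

        #define FRDCNTL_RWAIT_SHIFT  8U   /* read wait-state field, assumed at bits 11:8 */
        #define FRDCNTL_ENPIPE       1U   /* pipeline (prefetch buffer) enable, assumed bit 0 */

        void flash_setup_for_high_hclk(void)
        {
            /* 3 wait states + pipeline mode for HCLK above ~105 MHz.
             * A flash line fetch then costs 1 + 3 = 4 HCLK cycles, but
             * sequential fetches that hit the 128-bit pipeline buffer
             * complete in 1 cycle. */
            FRDCNTL = (3U << FRDCNTL_RWAIT_SHIFT) | FRDCNTL_ENPIPE;
        }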

    Thanks and regards,

    Zhaohong

  • Thanks Zhaohong for your prompt reply. Your explanation is easy to understand.

    I believe enabling the pipeline is a requirement for CPU frequencies above 36 MHz, so it was already enabled along with the Flash waitstate=3.

    Right now my setup is not fast enough to decode a 1 MHz serial data stream fed to the NHET unit, which is also running at 128 MHz. I have about 6700 edge detections per 10 ms, transferred from the NHET to CPU RAM via the HTU (dual buffer of 250x64 bits). The 64-bit combination of Control Field (for RISING or FALLING edge indication) + Data Field (for the edge timestamp) is transferred to CPU RAM without issues, but when it is time to analyze the records to recover the serial stream data, the CPU, running at 128 MHz, cannot keep up the pace (my calculation gives about 190 clock cycles available per edge within a 10 ms window)!

     

    10 ms / (1/128 MHz) = 1 280 000 CPU cycles

    1 280 000 cycles / 6700 edges ≈ 191 CPU cycles per edge, which is not much.
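
    The same budget check expressed as compile-time constants (sketch only, numbers taken directly from the figures above):

        /* Cycle budget per edge, using the numbers above. */
        #define HCLK_MHZ         128UL     /* CPU clock in MHz */
        #define WINDOW_US        10000UL   /* 10 ms frame */
        #define EDGES_PER_FRAME  6700UL

        #define CYCLES_PER_FRAME (HCLK_MHZ * WINDOW_US)                /* 1 280 000 */
        #define CYCLES_PER_EDGE  (CYCLES_PER_FRAME / EDGES_PER_FRAME)  /* ~= 191    */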

     

    I was thinking that the bottleneck is the waitstate imposed on the Flash ROM, hence the idea of moving the edge-analysis code into RAM to avoid the waitstate. Don't you think this is a good idea?

    Regards.

     

  • Chuck,

    I think that you first need to optimize the data processing to make the best use of the CPU. I would suggest first taking a look at how the data processing is done. Do you call the processing every 10 ms or at every edge? How is the processing triggered? What is the latency requirement (maximum delay in response to the input)? Can you do some of the processing in the NHET? The NHET can easily measure period and pulse width by itself.

    Thanks and regards,

    Zhaohong

  • Zhaohong,

    I believe that my code is already well structured to minimize CPU overhead, as explained below:

    NHET-->HTU transfer requests every 500 ns: no CPU intervention, and no loss of data at all.

    The HTU buffer-full interrupt (every 250 edges) triggers the CPU to copy the buffer using memcpy(). I know I could also use DMA, but memcpy() is relatively fast for this task as well. At the same time, if the post-processing task (a software interrupt with lower priority) has NOT yet been launched, it is triggered immediately after the memcpy() transfer. I expected this post-processing task to finish before the next HTU buffer-full interrupt, but it doesn't: during the whole time the system is receiving the 6700 edges (~3.5 ms), the post-processing task stays busy and extends beyond the 10 ms time frame. The end result is an overflow of the 56 KB CPU buffer once the next serial data starts to clock in (after 10 ms), which shows that the post-processing task cannot keep pace with the data stream.
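
    A stripped-down sketch of that flow (the identifiers and the software-interrupt trigger are illustrative placeholders, not the real HTU driver names):

        #include <string.h>
        #include <stdint.h>
        #include <stdbool.h>

        /* Illustrative only: buffer layout, names and the software-interrupt
         * trigger are placeholders, not the actual HTU/driver API. */
        #define RECORDS_PER_BUFFER  250U

        typedef struct
        {
            uint32_t control;    /* WCAP Control Field: edge direction */
            uint32_t timestamp;  /* WCAP Data Field: edge time         */
        } edge_record_t;

        extern volatile edge_record_t g_htuBuffer[RECORDS_PER_BUFFER];  /* completed HTU half */
        static edge_record_t          g_cpuBuffer[RECORDS_PER_BUFFER];
        static volatile bool          g_postProcessingBusy = false;

        extern void trigger_post_processing_swi(void);   /* placeholder for the low-priority SWI */

        void htu_buffer_full_isr(void)    /* fires every 250 captured edges */
        {
            /* Copy the completed HTU buffer half into CPU RAM. */
            memcpy(g_cpuBuffer, (const void *)g_htuBuffer, sizeof(g_cpuBuffer));

            /* Launch the lower-priority post-processing task if it is idle. */
            if (!g_postProcessingBusy)
            {
                g_postProcessingBusy = true;
                trigger_post_processing_swi();
            }
        }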

    The data I need in the post-analysis to determine each data bit are the transition direction (the previous-bit indication in the Control Field of WCAP) and the timestamp (the Data Field of WCAP).

    After reading the available instruction set and registers, I'm somewhat stunned that I can't perform this data-bit determination within the NHET (having it act like an FPGA), given that it is such a sophisticated device!

    Do you have a hint or two on how to perform the task within the NHET, so that instead of providing the previous bit and the timestamp, it would provide the data bit directly?

    Thanks again.

  • Chuck,

    (1) Since you use a double-buffer scheme for the HTU transfer, I think you can process the data directly from the HTU buffer. You do not need to move the data to another buffer for processing. Minimizing data movement is one of the keys to reducing CPU load; "in place" processing should be performed wherever possible (a rough sketch follows after point (2)).

    (2) You can use the NHET PCNT instruction to directly capture the rise-fall, fall-rise, rise-rise, and fall-fall times. Would those values make your calculation easier? You may need to connect the input to multiple NHET pins. You can also set up WCAP to capture both edges. If you have a way to determine the first edge, you do not need to check the polarity of the remaining edges; they have to follow a ...rise-fall-rise-fall... sequence.
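
    Regarding (1), a minimal sketch of what in-place processing could look like (names and the way the completed half is identified are placeholders, not the real HTU API): the buffer-full handler only records which half of the double buffer just finished, and the processing task walks that half directly while the HTU fills the other one.

        #include <stdint.h>
        #include <stdbool.h>

        /* Hypothetical illustration of "in place" processing; names are placeholders. */
        #define RECORDS_PER_BUFFER  250U

        typedef struct
        {
            uint32_t control;    /* WCAP Control Field (edge polarity) */
            uint32_t timestamp;  /* WCAP Data Field (edge time)        */
        } edge_record_t;

        extern volatile edge_record_t g_htuBufA[RECORDS_PER_BUFFER];   /* ping half */
        extern volatile edge_record_t g_htuBufB[RECORDS_PER_BUFFER];   /* pong half */

        extern bool htu_completed_half_is_A(void);                     /* stand-in for the HTU status read */
        extern void decode_edge(uint32_t control, uint32_t timestamp); /* application decoder */

        static volatile edge_record_t *g_readyHalf = 0;

        void htu_buffer_full_isr(void)   /* replaces the memcpy()-based handler */
        {
            /* No memcpy(): just remember which half the HTU has finished filling,
             * then trigger the lower-priority processing task as before. */
            g_readyHalf = htu_completed_half_is_A() ? g_htuBufA : g_htuBufB;
        }

        void process_ready_half(void)
        {
            volatile edge_record_t *rec = g_readyHalf;

            for (uint32_t i = 0U; i < RECORDS_PER_BUFFER; i++)
            {
                /* Decode straight out of the HTU buffer while the other half fills. */
                decode_edge(rec[i].control, rec[i].timestamp);
            }
        }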

    Thanks and regards,

    Zhaohong

  • Zhaohong,

    The suggested PCNT instruction could work just fine. I can split the input signal, or even share the input within the NHET, to have one PCNT detect the period/timestamp of rise-to-fall and a second one detect fall-to-rise.

    However, even if I can trust that the signal always goes R-F-R-F ... without losing an edge (with the NHET at 128 MHz, this should be OK), I still have to base the data-bit detection on one of 5 possible cases:

    • 0.5 us long: data bit
    • 1.0 us long: data bit
    • 1.5 us long: start/end sync
    • 2.0 us long: start/end sync
    • all the rest: Error

    On top of that, given the imperfect nature of the signal, the HR resolution (or even the loop resolution) will yield numbers that are not always an exact match for a compare instruction; what I really need is something like "if (timestamp is within [14, 18]) then ...". My understanding of the NHET, after reading the instruction set, is that this cannot be done inside it ... unless I missed something with the "Compare" instructions.
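
    On the CPU side the tolerance check itself is cheap; something along these lines (the tick values and tolerance are placeholders — they depend on the actual loop/HR resolution and on how much jitter the signal really has):

        #include <stdint.h>

        typedef enum
        {
            PULSE_DATA_SHORT,   /* ~0.5 us */
            PULSE_DATA_LONG,    /* ~1.0 us */
            PULSE_SYNC_SHORT,   /* ~1.5 us */
            PULSE_SYNC_LONG,    /* ~2.0 us */
            PULSE_ERROR
        } pulse_class_t;

        /* Placeholder tick values: 0.5/1.0/1.5/2.0 us assuming a 128 MHz time
         * base with no prescaling; recompute for the real loop/HR resolution.
         * The tolerance is likewise a guess to be tuned against the signal. */
        #define TICKS_0_5_US   64U
        #define TICKS_1_0_US  128U
        #define TICKS_1_5_US  192U
        #define TICKS_2_0_US  256U
        #define TOL_TICKS       8U

        static int in_window(uint32_t w, uint32_t nominal)
        {
            return (w >= (nominal - TOL_TICKS)) && (w <= (nominal + TOL_TICKS));
        }

        pulse_class_t classify_pulse(uint32_t width_ticks)
        {
            if (in_window(width_ticks, TICKS_0_5_US)) { return PULSE_DATA_SHORT; }
            if (in_window(width_ticks, TICKS_1_0_US)) { return PULSE_DATA_LONG;  }
            if (in_window(width_ticks, TICKS_1_5_US)) { return PULSE_SYNC_SHORT; }
            if (in_window(width_ticks, TICKS_2_0_US)) { return PULSE_SYNC_LONG;  }
            return PULSE_ERROR;
        }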

    I would love to hear from you if "there's one more thing ...", as in the Steve Jobs quote. :0)

    Thanks and regards!

  • Zhaohong Zhang said:
    (1) Since you use a double-buffer scheme for the HTU transfer, I think you can process the data directly from the HTU buffer. You do not need to move the data to another buffer for processing. Minimizing data movement is one of the keys to reducing CPU load; "in place" processing should be performed wherever possible.

    For your suggestion #1:

    Gaining speed is exactly what I'm trying to achieve, and I can see the benefit of processing "in place". My problem is that the processing time for 250 edges seems to exceed the time it takes to receive 250 new edges, so this cannot work without substantially improving the processing algorithm. So I'm looking at all possibilities, including declaring functions to run from RAM (roughly sketched below).
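
    The RAM-function idea would look roughly like this with the TI ARM compiler; the section and memory-range names are placeholders, and the copy from flash to RAM still has to be arranged in the linker .cmd file or at startup:

        #include <stdint.h>

        /* Sketch for the TI ARM compiler (armcl): put the hot routine in its
         * own code section so the linker can give it a run address in SRAM.
         *
         * In the linker command file, something along the lines of
         *     .ramFunc : load = FLASH0, run = SRAM, table(BINIT)
         * lets the boot code copy the section from flash to RAM before main();
         * alternatively, LOAD_START()/RUN_START()/SIZE() symbols plus a
         * memcpy() early in startup achieve the same thing.  Section and
         * memory-range names here are placeholders for whatever the project
         * defines. */
        #pragma CODE_SECTION(processEdgeBuffer, ".ramFunc")
        void processEdgeBuffer(const uint64_t *records, uint32_t count)
        {
            /* ...edge decoding would run here with no flash wait states... */
            (void)records;
            (void)count;
        }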

    Chuck.