AM625: I would like to get the PRU running at 333 MHz in AM625

Part Number: AM625

Hi,

I would like to continue where the following thread left off and get the PRU running at 333 MHz on the AM625. I can provide feedback and testing.

https://e2e.ti.com/support/processors-group/processors/f/processors-forum/1430106/am625-am625-how-to-set-333mhz-clk-for-pruss

Thank you

  • I tried these settings as suggested. The PRU clock did switch to 333 MHz, but the time for LBBO reads from DDR increased substantially. Is there a document that describes these registers and values in detail? The TRM is pretty vague.

    pruss_coreclk_mux: coreclk-mux@3c {
        reg = <0x3c>;
        #clock-cells = <0>;
        clocks = <&k3_clks 81 0>,  /* pruss_core_clk */
                 <&k3_clks 81 14>; /* pruss_iclk */
        assigned-clocks = <&pruss_coreclk_mux>;
        assigned-clock-parents = <&k3_clks 81 14>;
        assigned-clock-rates = <333333333>;
    };

  • For future readers, this previous customer was able to set a 333 MHz clock using the same settings that D Anthony posted:

    https://e2e.ti.com/support/processors-group/processors/f/processors-forum/1450236/am625-help-setting-pruss-clock-to-333mhz-on-am6254/5566801#5566801

    Can we get you to post your test code and your test results with both the default clock frequency and the updated 333 MHz clock frequency?

    Regards,

    Nick

  • The PRU instructions are being run at 333 MHz with the settings above. However, the instruction sequence below was taking ~600 ns with the default settings (250 MHz); with the proposed 333 MHz clock settings the same instructions take over 3 µs. It seems the proposed settings affect the DDR read speed. I was hoping to be able to decrease the 600 ns read time if possible.

        SET  R30, R30, PRU_TEST_1
        NOP
        CLR  R30, R30, PRU_TEST_1
        LBBO &r5, ddr_mapped_addr, 0, 50
        SET  R30, R30, PRU_TEST_1
        NOP
        CLR  R30, R30, PRU_TEST_1

    [Image: DDR 50-byte read timing with the default 250 MHz settings]

    [Image: DDR 50-byte read timing with the proposed 333 MHz settings]

  • Hello Anthony,

    We will be tag-teaming on your thread for the next week or so - I will provide some comments, and one of my teammates will be looking into actually running tests.

    General read latency thoughts

    First off, all the previous work I have done around read/write latencies can be found in this FAQ. Please take a look if you have not read through it already:
    https://e2e.ti.com/support/processors-group/processors/f/processors-forum/1096933/faq-pru-how-do-i-calculate-read-and-write-latencies

    There is one thing that I did NOT consider when writing that FAQ: the impact of asynchronous bridges when the ICLK/VCLK frequency (fixed at 250MHz) is different from the PRU core frequencies (in your case, 333MHz). At this point in time, I am not sure if anyone at TI has done a thorough analysis of this potential impact, so I cannot give you an exact equation for predicting the impact on Read commands. I would expect that write commands are unaffected, and should still behave as described in that FAQ.

    The AM64x PRU Ethernet firmware developers mentioned that they did have to weigh the trade-off between longer reads and running the firmware at a higher frequency. At least for them, the benefit of 33% more speed (3 ns/instruction instead of 4 ns/instruction) spread across 6 cores per ICSSG outweighed the penalty of increased latency to access external memory.

    Other thoughts - DDR is not great for low latency access 

    DDR is the slowest memory you can access, since the signal has to physically leave the processor to get there. How much memory does your application need, and what do you want to do with it?

    If you have the space, you will get fastest performance with the PRU Subsystem's local memory. The next-fastest access would be to the on-chip SRAM (fairly limited on AM62x, only 64kB to work with).
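    As a rough illustration, here is a minimal sketch of the same LBBO read issued against each memory. The local offsets are the standard PRU subsystem map, the global addresses are only examples to check against the AM62x memory map, and the byte count is arbitrary:

        // PRU0's own data RAM (local offset 0x0): fastest
        LDI32   r8, 0x00000000
        LBBO    &r10, r8, 0, 32

        // PRU shared RAM (local offset 0x10000): still inside the subsystem
        LDI32   r8, 0x00010000
        LBBO    &r10, r8, 0, 32

        // On-chip SRAM (example global address 0x70000000): leaves the subsystem but stays on-chip
        LDI32   r8, 0x70000000
        LBBO    &r10, r8, 0, 32

        // DDR (example reserved carveout at 0x8f000000): the access leaves the processor entirely
        LDI32   r8, 0x8f000000
        LBBO    &r10, r8, 0, 32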

    Regards,

    Nick

  • Hi. Thanks for the response. The read speed from DDR is more important to our use case than the increased speed of the PRU instructions. The amount of data that needs to be read is 10x the size of the PRU data RAMs combined, so we need to keep using the DDR. Maybe there is a solution to lower the read time when running at 333 MHz; if not, we will be fine running at 250 MHz.

  • Hi Anthony,

    Maybe there is a solution to lower the read time when running at 333 MHz; if not, we will be fine running at 250 MHz.

    I will discuss more on this with the team and get back to you.

    Regards,

    Nitika

  • Was there ever a resolution to this? I'm seeing the same issue. Using LBBO to access various memory locations with 32-byte reads, I'm seeing a SEVERE slowdown when the PRU clock is increased to 333 MHz.

    I was able to verify that an LBBO to the PRU local/shared RAM stays at a consistent 11 clock ticks at both speeds. However, access to DDR goes from about 83 clock ticks (on average) to about 690. Access to the on-chip SRAM goes from 24 ticks to about 143. Access to the M4F's SRAM (we aren't using the M4F, so I am considering using its RAM) goes from 47 clock ticks to 363. With the clock going faster, we expect slightly higher numbers of clock ticks for the reads, but not this much.

  • Hi Daniel,

    If it is okay, can I request that you run one more test on your setup and share the results?

    Can you try the same memory reads at a 200 MHz clock frequency as well and note the values?

    Regards,

    Nitika

  • Running at 200 MHz also resulted in very slow access. So I decided to try running at 250 MHz, but actually configuring it to be 250 MHz instead of relying on the defaults, using:

    pruss_coreclk_mux: coreclk-mux@3c {
        reg = <0x3c>;
        #clock-cells = <0>;
        clocks = <&k3_clks 81 0>,  /* pruss_core_clk */
                 <&k3_clks 81 14>; /* pruss_iclk */
        assigned-clocks = <&pruss_coreclk_mux>;
        assigned-clock-parents = <&k3_clks 81 14>;
        assigned-clock-rates = <250000000>;
    };

    That ALSO resulted in very slow access. So it looks like any attempt to configure the clock results in slow access to external RAM/memory. Are there any instructions on how to get the clock configured for 333 MHz (or 200 MHz or 250 MHz) without affecting the access speed to the external RAM?

  • Hi Daniel,

    Thank you for running the tests and sharing your observations.

    I will bring this to our Linux expert's attention for more comments.

    Regards,

    Nitika

  • Hi Daniel,

    One more thing: if your main objective is to increase the read speed from DDR, I would also suggest looking into the xfr2vbus_rd widget (refer to TRM section 7.4.5.3.1, PRUSS XFR2VBUS Hardware Accelerator).

    The programming guide for the same is available in the TRM and you can also refer to this thread.

    Regards,

    Nitika

  • I did try xfr2vbus as well and got a similar slowdown compared to LBBO. xfr2vbus is a bit better if I'm able to alternate between the two read units, but it's still SIGNIFICANTLY slower when the PRU is running at 333 MHz compared to the default.

    xfr2vbus has other complications as well. In general, I need 48 bytes of data per read, so I have to read 64 bytes and transfer some of it to a scratch pad with shifts and such. However, if the speed isn't any better at 333 MHz (which it's not), then it's still not usable.

  • Hello Daniel & D Anthony,

    Are you on the same team?

    Your clock frequency settings do NOT look correct

    Step 1: please test using ICLK WITHOUT setting the frequency

    Taking another look at the clock settings in this thread, they do NOT look correct to me. ICLK is a fixed 250MHz clock, so you should NOT manually set the frequency. If you are using ICLK, you should do it like this:
    https://git.ti.com/cgit/ti-linux-kernel/ti-linux-kernel/commit/arch/arm64/boot/dts/ti/k3-am62-main.dtsi?h=ti-linux-6.6.y-cicd&id=34e6b1d215aae3e5d9c9d455edc943140ec99775 
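    For reference, that settings block boils down to something like the sketch below (see the commit above for the authoritative version): select pruss_iclk as the parent and do NOT set any assigned-clock-rates.

    pruss_coreclk_mux: coreclk-mux@3c {
        reg = <0x3c>;
        #clock-cells = <0>;
        clocks = <&k3_clks 81 0>,  /* pruss_core_clk */
                 <&k3_clks 81 14>; /* pruss_iclk */
        assigned-clocks = <&pruss_coreclk_mux>;
        assigned-clock-parents = <&k3_clks 81 14>; /* ICLK, fixed 250MHz */
    };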

    My best guess is that Linux is selecting a different clock source than ICLK and setting that other clock source to 250MHz. Thus, even though you have a 250MHz clock, it is a different 250MHz clock than the bus clock, and thus you still get lowered read/write performance.

    Step 2: ok, how SHOULD we be setting the clock frequency if we want something other than 250MHz? 

    This is a bit tricky, since the AM62x TRM clock diagram is wrong right now. However, the clock diagram for AM64x should be similar. Let's take a look at the clock sources:

    [Image: AM64x ICSSG clock diagram, with the paths we care about highlighted]

    The PRU core clocks (CORE_CLK in the image) have 2 possible sources: pruss_iclk, and pruss_core_clk.

    pruss_iclk (or ICLK) is a fixed 250MHz. This is the "bus interface clock" or VCLK, which is used to communicate with the rest of the processor. As you can see in the diagram, this clock source can also be routed to the PRU's internal CORE_CLK. I would NOT expect you to see extra read/write latencies with this clock.

    pruss_core_clk (or ICSSG0_CORE_CLK in the image) has 2 different clock sources that go into it: MAIN_2_HSDIVOUT0_CLK and MAIN_0_HSDIVOUT9_CLK. These clock sources can be configured to have different frequencies. You should ONLY be setting the clock frequency if you are selecting one of these clock sources that will be routed through pruss_core_clk.

    Let's check the PRU clock IDs that are defined in the TISCI documentation. Look under ICSSM0 (even though AM62x has PRUSS, NOT PRU-ICSS):
    https://software-dl.ti.com/tisci/esd/09_02_07/5_soc_doc/am62x/clocks.html#clocks-for-icssm0-device

    For pruss_core_clk / CORE_CLK, we need to use clock ID 0 and list either clock ID 1 or clock ID 2 as its parent (MAIN_2_HSDIVOUT0_CLK and MAIN_0_HSDIVOUT9_CLK, respectively).

    For pruss_iclk / VCLK, we use only clock ID 14.

    Step 3: ok, so what SHOULD my clock settings look like? 

    We already know what it should look like if using ICLK: https://git.ti.com/cgit/ti-linux-kernel/ti-linux-kernel/commit/arch/arm64/boot/dts/ti/k3-am62-main.dtsi?h=ti-linux-6.6.y-cicd&id=34e6b1d215aae3e5d9c9d455edc943140ec99775 

    So let's focus on using pruss_core_clk to get other frequencies.

    Please follow the concepts of the FAQ here. It is for ICSSG instead of PRUSS, but similar concepts:
    https://e2e.ti.com/support/processors-group/processors/f/processors-forum/1049800/faq-pru_icssg-how-to-check-and-set-pru-core-frequency-in-linux
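    A quick way to see which rates and parents Linux actually programmed after boot is to dump the common clock framework state from debugfs. This is a generic sketch; the grep pattern is only a guess at how the PRUSS clocks are named on your kernel, so adjust it or simply page through clk_summary:

    mount -t debugfs none /sys/kernel/debug 2>/dev/null
    grep -i -E 'pruss|icssm|81:' /sys/kernel/debug/clk/clk_summary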

    I'll ask Nitika to follow up with exactly which frequencies are supported for each clock source, so this code might NOT have the correct frequencies right now. But let's pretend that MAIN_0_HSDIVOUT9_CLK can output 200MHz.

    Then in order to set the PRU core clocks to 200MHz, I would expect the code to look like this:

    pruss_coreclk_mux: coreclk-mux@3c {
        reg = <0x3c>;
        #clock-cells = <0>;
        clocks = <&k3_clks 81 0>,  /* pruss_core_clk */
                 <&k3_clks 81 14>; /* pruss_iclk */
        assigned-clocks = <&k3_clks 81 0>, <&pruss_coreclk_mux>;
        assigned-clock-parents = <&k3_clks 81 2>, <&k3_clks 81 0>; /* DEV_ICSSM0_CORE_CLK_PARENT_POSTDIV4_16FF_MAIN_0_HSDIVOUT9_CLK */
        assigned-clock-rates = <200000000>, <0>; /* 200MHz */
    };

    Please let us know if this works for you. We will use this interaction to create a new e2e FAQ and new documentation for the AM62x clock.

    Regards,

    Nick

  • Absolutely no change in the timings when reading from DDR. It looks like it's assigning the clocks and the frequency is changing, but reads from DDR are still taking almost an order of magnitude longer than with the default (pruss_iclk) clock.

  • Hello Daniel,

    Setting PRU clock frequency

    Even with the "updated" devicetree settings, I would expect to see reads taking longer when using something other than the interface clock (exact time TBD). This work is mostly to make sure that your code is forward portable if using something other than the default 250MHz ICLK.

    Please post the exact devicetree settings that you are testing, and the observed results with each setting.

    I assume that when you set PRU core clock = ICLK as discussed above, the DDR reads are not taking extra cycles, right?

    You can verify your changes in hardware by checking these registers with devmem2:

    within the PRU subsystem: 

    ICSSG_CORE_SYNC_REG (offset 0x26000, register 0x3C)
    Bit CORE_VBUSP_SYNC_EN: 0h = ICSSGn_CORE_CLK is the source, 1h = ICSSGn_ICLK is the source

    within the control MMR registers:

    CFG0_ICSSM0_CLKSEL (physical address 0x00108040)

    Bit 0, ICSSM0_CLKSEL_CORE_CLKSEL, selects the ICSSM0 functional clock source (other field values are reserved):
    1'b0 = MAIN_PLL2_HSDIV0_CLKOUT
    1'b1 = MAIN_PLL0_HSDIV9_CLKOUT
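    For example (a sketch: the CTRL_MMR address comes from the list above, while PRUSS_BASE is an assumed value that should be checked against the AM62x TRM memory map):

    # CFG0_ICSSM0_CLKSEL: bit 0 = 0 -> MAIN_PLL2_HSDIV0_CLKOUT, bit 0 = 1 -> MAIN_PLL0_HSDIV9_CLKOUT
    devmem2 0x00108040 w

    # ICSSG_CORE_SYNC_REG at PRU subsystem base + 0x26000 + 0x3C
    PRUSS_BASE=0x30040000   # assumed -- verify against the TRM memory map
    devmem2 $(printf '0x%x' $((PRUSS_BASE + 0x2603C))) w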

    Ok, but I still see much higher latencies. Is there anything else I can do? 

    Are you using the SRAM for anything? Access latencies to the SRAM are lower than DDR in general. I would also be curious to see if the difference in latency between ICLK and other clocks is different when accessing SRAM than when accessing DDR.

    Regards,

    Nick

  • For 250 MHz, I'm using:

    pruss_coreclk_mux: coreclk-mux@3c {
        reg = <0x3c>;
        #clock-cells = <0>;
        clocks = <&k3_clks 81 0>,  /* pruss_core_clk */
                 <&k3_clks 81 14>; /* pruss_iclk */
        assigned-clocks = <&pruss_coreclk_mux>;
        assigned-clock-parents = <&k3_clks 81 14>;
    };

    and for 200 MHz and 333 MHz, I'm using:

    pruss_coreclk_mux: coreclk-mux@3c {
        reg = <0x3c>;
        #clock-cells = <0>;
        clocks = <&k3_clks 81 0>,  /* pruss_core_clk */
                 <&k3_clks 81 14>; /* pruss_iclk */
        assigned-clocks = <&k3_clks 81 0>, <&pruss_coreclk_mux>;
        assigned-clock-parents = <&k3_clks 81 2>, <&k3_clks 81 0>; /* DEV_ICSSM0_CORE_CLK_PARENT_POSTDIV4_16FF_MAIN_0_HSDIVOUT9_CLK */
        assigned-clock-rates = <200000000>, <0>; /* 200MHz */
    };

    (just updating the assigned-clock-rates value for the 333 MHz case)

    This results in timings that look like this: [Image: measured timing results]

    And the code I'm using:

        RESET_PRU_CLOCK r8, r9
        LDI32    r8,  0x70000000   //  OCSRAM
        LDI      r9,  1024
        LOOP TEST_OCSRAM_DONE, r9
            LBBO &r10, r8, 0, 32
            ADD r8, r8, 32
    TEST_OCSRAM_DONE: 
        GET_PRU_CLOCK r8, r9, 4
        SBCO &r8, CONST_PRUDRAM, 16, 4
    
        RESET_PRU_CLOCK r8, r9
        LDI32    r8,  0x0005040000  // M4 DRAM
        LDI      r9,  1024
        LOOP TEST_M4_DONE, r9
            LBBO &r10, r8, 0, 32
            ADD r8, r8, 32
    TEST_M4_DONE: 
        GET_PRU_CLOCK r8, r9, 4
        SBCO &r8, CONST_PRUDRAM, 24, 4
    
    
        RESET_PRU_CLOCK r8, r9
        LDI32    r8,  0x8f000000  // DDR DRAM (reserved)
        LDI      r9,  1024
        LOOP TEST_DDR_DONE, r9
            LBBO &r10, r8, 0, 32
            ADD r8, r8, 32
    TEST_DDR_DONE: 
        GET_PRU_CLOCK r8, r9, 4
        SBCO &r8, CONST_PRUDRAM, 32, 4
    
    
        RESET_PRU_CLOCK r8, r9
        LDI32    r8,  0x00010000  // local shared ram
        LDI      r9,  1024
        LOOP TEST_SHARED_DONE, r9
            LBBO &r10, r8, 0, 32
            ADD r8, r8, 32
    TEST_SHARED_DONE: 
        GET_PRU_CLOCK r8, r9, 4
        SBCO &r8, CONST_PRUDRAM, 40, 4

    I can then use devmem2 to query the four values from the PRU's RAM, along the lines of the sketch below.
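    (A sketch of that query; PRU0_DRAM is an assumed value for the PRU0 data RAM physical address, to be checked against the AM62x TRM memory map, and the offsets match the SBCO offsets in the code above.)

    PRU0_DRAM=0x30040000   # assumed -- verify against the TRM memory map
    for OFF in 16 24 32 40; do
        devmem2 $(printf '0x%x' $((PRU0_DRAM + OFF))) w
    done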

  • Hi Daniel,

    I'll ask Nitika to follow up with exactly which frequencies are supported for each clock source, so this code might NOT have the correct frequencies right now.

    As Nick mentioned above, please find the frequencies that can be used with each of the PLL clock sources:

    PLL2 (Parent Clock ID 1): 225 and 300 MHz
    PLL0 (Parent Clock ID 2): 200, 250, and 333 MHz
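    For example, plugging the PLL2 option into Nick's template above would look something like this (a sketch only, not verified on hardware; the parent clock ID and rate come from the list above):

    pruss_coreclk_mux: coreclk-mux@3c {
        reg = <0x3c>;
        #clock-cells = <0>;
        clocks = <&k3_clks 81 0>,  /* pruss_core_clk */
                 <&k3_clks 81 14>; /* pruss_iclk */
        assigned-clocks = <&k3_clks 81 0>, <&pruss_coreclk_mux>;
        assigned-clock-parents = <&k3_clks 81 1>, <&k3_clks 81 0>; /* MAIN_2_HSDIVOUT0_CLK (PLL2) */
        assigned-clock-rates = <300000000>, <0>; /* 300MHz */
    };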

    Regards,

    Nitika

  • Hello Daniel,

    Have there been any updates from your side?

    Next month I am planning to spend some time benchmarking PRU latencies (turns out there is a minor bug in my FAQ linked above: https://e2e.ti.com/support/processors-group/processors/f/processors-forum/1459880/am625-i-would-like-to-get-the-pru-running-at-333mhz-in-am625/5602605#5602605 ) and updating the app note from a couple of years back. As a part of that benchmarking, we are planning to try using different clock frequencies and measuring exactly what performance hit we see.

    I do not have specific numbers, but I have been told that your latencies are much longer than the latencies other team members have observed when they used a PRU clock other than the interface bus clock. So there might be something else going on here.

    Regards,

    Nick

  • No real "changes" from my side.  

    I couldn't get the latencies down to a reasonable number, so I kind of gave up and just kept the PRU at 250 MHz. I tried a bunch of different things for the clocks, and I could definitely verify (based on the width/count of pulses on a PRU pin) that the PRU frequency was changing to the desired settings, but the really long latencies always remained.

    I'd love to get it to 333 MHz, but the latencies on the DRAM reads need to remain as low as possible. Obviously, as the clock goes up, the number of raw clock ticks per read will likely go up, but it cannot go up by almost an entire order of magnitude.

    Dan

  • Hello Daniel,

    Hypothetical expectations

    When I talk to our hardware guys, they expect a much smaller latency increase than you are seeing when running in async mode with the PRU cores at 333 MHz as opposed to sync mode with the PRU cores at 250 MHz.

    They also expect the latency hit to be the same size, regardless of the size of the read (e.g., 4 byte read and 32 byte read would theoretically have the same latency increase).

    They would also expect to see the latency hit once per read (so a single 32 byte LBBO command would have 8 times less latency increase than eight 4 byte LBBO commands).
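    To make that concrete, a minimal sketch (hypothetical register choices, same LBBO syntax as the test code in this thread): a single burst read versus the equivalent data split into eight commands.

        // one 32-byte burst into r10-r17: the per-read latency hit is paid once
        LBBO    &r10, r8, 0, 32

        // eight 4-byte reads of the same data: the per-read latency hit is paid on every command
        LBBO    &r10, r8,  0, 4
        LBBO    &r11, r8,  4, 4
        LBBO    &r12, r8,  8, 4
        LBBO    &r13, r8, 12, 4
        LBBO    &r14, r8, 16, 4
        LBBO    &r15, r8, 20, 4
        LBBO    &r16, r8, 24, 4
        LBBO    &r17, r8, 28, 4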

    collecting info for tests on my side next month

    For your test code, could you show me RESET_PRU_CLOCK and GET_PRU_CLOCK just so I can make sure there is nothing fishy going on there?

    Otherwise, the test code looks reasonable at first glance.

    Also, how was this run? CCS load with nothing running on any other cores? In a different setup?

    Regards,

    Nick

  • Those are fairly simple; they just grab the count from the control registers:

    /* needs two temporary registers that can be cleared out */
    RESET_PRU_CLOCK .macro reg1, reg2
        LDI32  reg2, PRU_CONTROL_REG   // base of this PRU's control register block
        LBBO   &reg1, reg2, 0, 4       // read CONTROL
        CLR    reg1, reg1, 3           // clear CTR_EN (bit 3) to stop the cycle counter
        SBBO   &reg1, reg2, 0, 4
        SET    reg1, reg1, 3           // set CTR_EN to start the counter again
        SBBO   &reg1, reg2, 0, 4
        LDI    reg1, 0
        SBBO   &reg1, reg2, 0xC, 4     // zero the CYCLE register (offset 0xC)
        .endm

    /* if size = 8, the register after the one passed in will contain the stall count */
    GET_PRU_CLOCK .macro  reg, treg, size
        LDI32 treg, PRU_CONTROL_REG
        LBBO  &reg, treg, 0xC, size    // read CYCLE (and STALL at 0x10 when size = 8)
        .endm

    The test code was just loaded via the normal remoteproc start/stop mechanism in sysfs. At the time, almost everything was shut down on the ARM cores. The M4F was not enabled in the device tree.

  • Hello Daniel,

    Ok, good to know. So Linux was running and probably influencing the test results in some way, but I still would not expect latencies like the ones you observed.

    Thanks for sharing the test code.

    At this point I really need to run benchmarks in silicon and compare notes with the hardware designers until we have a complete understanding of what is going on. I have marked your thread down in my test notes so I can circle back to you next month. Feel free to ping the thread for an update.

    Regards,

    Nick