C6748 slow GPIO read



Hi,

I am trying to achieve the fastest possible boot-up with the C6748. My last post, about the overall plan and SYS/BIOS startup delays, is here: http://e2e.ti.com/support/dsp/omap_applications_processors/f/42/p/290700/1015923.aspx#1015923

I have now implemented a 4-bit bit-banged GPIO read for the Micron 25Q032 flash memory. The problem is that GPIO bit-banging is not fast enough. The 25Q032 accepts a clock of up to 108 MHz, but I am very far from that with bit-banging.

The attached picture shows that toggling a pin takes about 40 ns, whereas reading a GPIO pin value takes 260 ns. Why is reading so slow compared to writing? The C6748 GPIO user's guide http://www.ti.com/lit/ug/sprufl8b/sprufl8b.pdf says (page 15) that GPIO reads are synchronized to the GPIO clock, which is SYSCLK4. I am running the DSP at 300 MHz, so SYSCLK4 should be 75 MHz. Page 256 of the C6748 datasheet http://www.ti.com/lit/ds/symlink/tms320c6748.pdf shows some GPIO timing diagrams. From those I read that the minimum toggle time is 2C, where C is the GPIO clock period. At 75 MHz, C is about 13.3 ns, so 2C = 2 × 13.3 ≈ 26.7 ns. My logic analyzer does not sample fast enough to resolve that, so the 40 ns pin toggle I measured could actually be shorter. If we assume writes also need to be synchronized to the clock, the write time is roughly in line with the datasheet. I expected a read to complete within about 30-40 ns as well, but it takes more like 260 ns, which seems to be out of spec.
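The clock arithmetic in the paragraph above can be checked with a few lines of C. This is only a sketch: it assumes SYSCLK4 is the CPU clock divided by 4 (as the post states for a 300 MHz DSP), and the function names are made up for illustration.

```c
#include <stdio.h>

/* Period in nanoseconds of a clock whose frequency is given in MHz. */
static double period_ns(double freq_mhz)
{
    return 1000.0 / freq_mhz;
}

/* GPIO module clock, ASSUMING SYSCLK4 = CPU clock / 4 as stated in the post. */
static double sysclk4_mhz(double cpu_mhz)
{
    return cpu_mhz / 4.0;
}

/* Minimum pulse duration 2C, where C is the GPIO clock period. */
static double min_pulse_ns(double cpu_mhz)
{
    return 2.0 * period_ns(sysclk4_mhz(cpu_mhz));
}

/* Print the derived timing numbers for a given CPU clock. */
static void print_gpio_timing(double cpu_mhz)
{
    printf("SYSCLK4 = %.1f MHz, C = %.2f ns, 2C = %.2f ns\n",
           sysclk4_mhz(cpu_mhz),
           period_ns(sysclk4_mhz(cpu_mhz)),
           min_pulse_ns(cpu_mhz));
}
```

Calling `print_gpio_timing(300.0)` reports the 75 MHz GPIO clock and the 2C value discussed above.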

I wrote my own GPIO functions based on the StarterWare GPIO API. I did not like that the StarterWare driver calculates pin registers and offsets at run time, so I created these defines.

#define NONVOLATILEHWREG(x) (*((unsigned int *)(x)))

#define GPIOPINNUMBER(bank, pin) (((bank) << 4) | ((pin) + 1))
#define CS0PIN GPIOPINNUMBER(1, 6)
#define CS0REGNUMBER ((CS0PIN - 1) / 32)
#define CS0PINOFFSET ((CS0PIN - 1) % 32)

// ; |../bl_spi.c:156|
inline void CLKPinWrite0() {
    HWREG((SOC_GPIO_0_REGS + GPIO_CLR_DATA(CLKREGNUMBER))) = (1 << CLKPINOFFSET);
}
// ; |../bl_spi.c:160|
inline void CLKPinWrite1() {
    HWREG((SOC_GPIO_0_REGS + GPIO_SET_DATA(CLKREGNUMBER))) = (1 << CLKPINOFFSET);
}
inline unsigned int DQ0DQ3PinRead() {
    unsigned int val = NONVOLATILEHWREG(SOC_GPIO_0_REGS + GPIO_IN_DATA(DQ0REGNUMBER));
    return ((val & (1 << DQ3PINOFFSET)) << 1) | ((val & (1 << DQ2PINOFFSET)) >> 2) | ((val & (3 << DQ0PINOFFSET)) >> 5);
}
void pseudotestread(unsigned char* restrict destAddr) {
    volatile int a;
    a = 0xBEEF; // |../bl_spi.c:325|

    CLKPinWrite0();
    CLKPinWrite1();
    rx_data = (DQ0DQ3PinRead() << 4); // |../bl_spi.c:329|
    CLKPinWrite0();
    CLKPinWrite1();
    rx_data |= (DQ0DQ3PinRead()); // |../bl_spi.c:333|
    CLKPinWrite0();
    CLKPinWrite1();
    destAddr[0] = rx_data;
    a = 0xBEEF; // |../bl_spi.c:341|
    while(1) {
    }
}
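To illustrate how defines like these resolve entirely at compile time, here is a small stand-alone sketch using the same pin encoding as the post. The helper macros `REGNUMBER`, `PINOFFSET` and `CS0MASK` are hypothetical names added for illustration, and the `_Static_assert` checks assume a C11 compiler.

```c
#include <stdio.h>

/* Same pin encoding as in the post, with macro arguments parenthesized. */
#define GPIOPINNUMBER(bank, pin) (((bank) << 4) | ((pin) + 1))

/* Hypothetical helpers: register index and bit offset for a pin number. */
#define REGNUMBER(pinnum) (((pinnum) - 1) / 32)
#define PINOFFSET(pinnum) (((pinnum) - 1) % 32)

#define CS0PIN       GPIOPINNUMBER(1, 6)
#define CS0REGNUMBER REGNUMBER(CS0PIN)
#define CS0PINOFFSET PINOFFSET(CS0PIN)
#define CS0MASK      (1u << CS0PINOFFSET)

/* All values are integer constant expressions, so they can be checked
 * by the compiler and cost nothing at run time. */
_Static_assert(CS0PIN == 23, "bank 1, pin 6 encodes to pin number 23");
_Static_assert(CS0REGNUMBER == 0, "pin 23 lives in register pair 0");
_Static_assert(CS0PINOFFSET == 22, "pin 23 is bit 22 of that register");

/* Print the resolved constants. */
static void print_cs0(void)
{
    printf("CS0: pin %d, reg %d, offset %d, mask 0x%08X\n",
           CS0PIN, CS0REGNUMBER, CS0PINOFFSET, CS0MASK);
}
```

Because every value is a constant expression, the compiler can fold the register address and bit mask into immediates, which is what the `MVKL`/`MVKH` pairs in the generated assembly reflect.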

Assembly:

; EXCLUSIVE CPU CYCLES: 24
MVKL .S2 0xbeef,B4
|| MVKL .S1 0x1e2601c,A4
MVKH .S2 0xbeef,B4
|| MVKH .S1 0x1e2601c,A4
|| ZERO .L1 A6
ADD .L1 -4,A4,A3
|| STW .D2T2 B4,*+SP(4) ; |../bl_spi.c:325|
|| SET .S1 A6,0x18,0x18,A6
STW .D1T1 A6,*A4 ; |../bl_spi.c:156| 
|| MVK .S2 168,B6
ADD .L2X A3,B6,B6
|| STW .D1T1 A6,*A3 ; |../bl_spi.c:160|
LDW .D2T2 *B6,B5 ; |../bl_spi.c:329| 
ZERO .L2 B7
SET .S2 B7,0x2,0x1e,B7
MV .L2X A4,B9 ; |../bl_spi.c:329|
STW .D2T1 A6,*B9 ; |../bl_spi.c:156|
SHRU .S2 B5,2,B8 ; |../bl_spi.c:329| 
|| AND .L2 B7,B5,B7 ; |../bl_spi.c:329|
AND .L2 4,B8,B8 ; |../bl_spi.c:329| 
|| ADD .S2 B7,B7,B7 ; |../bl_spi.c:329|
OR .L2 B8,B7,B7 ; |../bl_spi.c:329| 
|| EXTU .S2 B5,25,30,B5 ; |../bl_spi.c:329|
|| STW .D1T1 A6,*A3 ; |../bl_spi.c:160|
LDW .D2T2 *B6,B5 ; |../bl_spi.c:333| 
|| OR .L2 B5,B7,B6 ; |../bl_spi.c:329|
EXTU .S2 B6,28,24,B6 ; |../bl_spi.c:329| 
STW .D1T1 A6,*A4 ; |../bl_spi.c:156|
STW .D1T1 A6,*A3 ; |../bl_spi.c:160|
STW .D2T2 B4,*+SP(4) ; |../bl_spi.c:341|
AND .L1X 4,B5,A5 ; |../bl_spi.c:333| 
|| SHRU .S2 B5,2,B7 ; |../bl_spi.c:333|
ADD .L1 A5,A5,A5 ; |../bl_spi.c:333| 
|| AND .L2 4,B7,B7 ; |../bl_spi.c:333|
OR .L1X A5,B6,A5 ; |../bl_spi.c:333| 
|| EXTU .S2 B5,25,30,B5 ; |../bl_spi.c:333|
OR .L1X B7,A5,A4 ; |../bl_spi.c:333| 
OR .L1X B5,A4,A3 ; |../bl_spi.c:333|
STB .D1T1 A3,*A10 ; |../bl_spi.c:337|
;*----------------------------------------------------------------------------*
;* SOFTWARE PIPELINE INFORMATION
;* Disqualified loop: Bad loop structure
;*----------------------------------------------------------------------------*
$C$L11:
; EXCLUSIVE CPU CYCLES: 6
BNOP .S1 $C$L11,5 ; |../bl_spi.c:342|
; BRANCH OCCURS {$C$L11} ; |../bl_spi.c:342|
.sect ".text"
.clink

Build options

"C:/ti/ccsv5/tools/compiler/c6000_7.4.2/bin/cl6x" -mv6740 --abi=eabi -O2 --symdebug:none --optimize_with_debug=off --include_path="C:/ti/ccsv5/tools/compiler/c6000_7.4.2/include" --include_path="C:/Program Files/Texas Instruments/pdk_C6748_2_0_0_0/C6748_StarterWare_1_20_03_03/include" --include_path="C:/Program Files/Texas Instruments/pdk_C6748_2_0_0_0/C6748_StarterWare_1_20_03_03/include/hw" --include_path="C:/Program Files/Texas Instruments/pdk_C6748_2_0_0_0/C6748_StarterWare_1_20_03_03/include/c674x/c6748" --program_level_compile --define=c6748 --display_error_number --diag_warning=225 --no_bad_aliases --debug_software_pipeline --opt_for_speed=5 --call_assumptions=3 -k  "../bl_copy_rprc.c" "../bl_platform.c" "../bl_spi.c" "../main.c" "../uartConsole.c" 

My question is: is this code optimal? In other words, is the delay caused by hardware rather than software, even though my reading of the datasheet says the hardware should be faster?

Andres

EDIT: I did not notice this note in the datasheet before:

GPIx duration must be extended to allow the device enough time to access the GPIO register through the internal bus.

So the actual read time is 2C plus bus time. I guess the read is so slow because of the bus time, and there is nothing I can do about it.
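For a sense of scale, the measured numbers can be turned into a rough throughput estimate. The sketch below assumes each nibble costs two clock-pin writes plus one read, executed back to back with no overlap, which is a simplification; the function names are hypothetical.

```c
#include <stdio.h>

/* Time for one nibble: CLK low write + CLK high write + one data read.
 * ASSUMPTION: the three accesses are strictly sequential. */
static double nibble_period_ns(double write_ns, double read_ns)
{
    return 2.0 * write_ns + read_ns;
}

/* Effective bit-banged clock frequency in MHz (one nibble per clock). */
static double bitbang_clock_mhz(double write_ns, double read_ns)
{
    return 1000.0 / nibble_period_ns(write_ns, read_ns);
}

/* Throughput in MB/s: two nibbles (two clocks) per byte. */
static double throughput_mbytes_per_s(double write_ns, double read_ns)
{
    return 1000.0 / (2.0 * nibble_period_ns(write_ns, read_ns));
}

/* Print the estimate for the given measured access times. */
static void print_estimate(double write_ns, double read_ns)
{
    printf("nibble: %.0f ns, clock: %.2f MHz, throughput: %.2f MB/s\n",
           nibble_period_ns(write_ns, read_ns),
           bitbang_clock_mhz(write_ns, read_ns),
           throughput_mbytes_per_s(write_ns, read_ns));
}
```

With the measured 40 ns write and 260 ns read, `print_estimate(40.0, 260.0)` gives a clock of roughly 3 MHz, far below the 108 MHz the flash accepts.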

  • Hi Andres,

    Thanks for your post.

    You may be right that the GPIO read duration is not what the spec suggests. In general, though, a GPIO write followed by a read appears slower because the read is wait-stated, and because the peripheral frame has "write-followed-by-read pipeline protection", the next read will not occur until the previous write has completed.

    You are also right that the bus access time adds to the GPIx duration, which makes the GPIO read much slower than expected. There could also be some delay from the software GPIO programming sequence when using the StarterWare APIs, so please make sure the GPIO is configured appropriately. Please check the programming sequence for the StarterWare GPIO on the wiki page below:

    http://processors.wiki.ti.com/index.php/StarterWare_GPIO

    Thanks & regards,

    Sivaraj K

  • Thank you for the reply.

    I am currently wondering whether any other module could be used for a fast (> 50 MHz) 4-bit parallel read. The C6748 has a module called uPP. I have not used uPP yet, but it might do the trick.

    I am also considering the PRU. If GPIO access is very fast from the PRU, I could build the interface logic with that module. It would help a lot to get some numbers on whether PRU GPIO access is much faster.

    So currently I have uPP and PRU on deck. If someone has a better idea, I would be very happy to hear it.

    Andres

  • Today I was digging through the PRUSS documentation. I would say that it is not very clear.

    I found a nice overview of the AM335x next-generation PRUSS: http://e2e.ti.com/support/dsp/tms320c6000_high_performance_dsps/f/115/t/291100.aspx From it I gather that there are dedicated pins for GPIO output and separate pins for GPIO input. So I cannot use the same pin for both input and output?

    I have not yet found documentation showing where the PRUSS I/O pins are located on the C6748. Could someone point me to it?

    Andres

    EDIT: I did not notice it before, but page 37 of the C6748 datasheet http://www.ti.com/lit/ds/symlink/tms320c6748.pdf has a very nice overview of the PRU input/output pins.

  • Andres,

    What solution have you come up with for your interface?

    There is information on the Wiki about the PRU. You can search there for PRU or PRUSS, I think.

    I have not tried it, but I have been told that the PRU has faster access to the GPIO pins. This may be an enhancement in the AM335x or it may be true for all the PRU module versions. Sorry that I do not know that.

    The uPP would appear to be the best fit for your operations. Did you look at it closer?

    Regards,
    RandyP

  • I have not taken an in-depth look at uPP. When I last checked it, though, I got the impression that it cannot do 4-bit parallel reads efficiently. I would have to take two 8-bit readings in which the upper 4 bits are don't-care, and then manually combine the two 4-bit slices into a single byte. I assume that a start-stop-combine sequence after every two 4-bit slices would be quite slow.
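The combining step described above can be sketched in plain C. This assumes the captured data arrives as a buffer of 8-bit words whose upper 4 bits are don't-care, and that the first word of each pair carries the high nibble (the same order as in `pseudotestread`). The function name `pack_nibbles` is hypothetical, and no actual uPP API is used.

```c
#include <stddef.h>
#include <stdint.h>

/* Combine pairs of 8-bit words, each carrying one valid nibble in its
 * low 4 bits, into full bytes.
 * ASSUMPTION: the first word of each pair holds the high nibble.
 * Returns the number of bytes written to out (n_words / 2). */
static size_t pack_nibbles(const uint8_t *words, size_t n_words, uint8_t *out)
{
    size_t n_bytes = n_words / 2;
    size_t i;

    for (i = 0; i < n_bytes; i++) {
        uint8_t hi = words[2 * i] & 0x0F;     /* mask off don't-care bits */
        uint8_t lo = words[2 * i + 1] & 0x0F;
        out[i] = (uint8_t)((hi << 4) | lo);
    }
    return n_bytes;
}
```

If a whole block were captured first and packed afterwards in one loop like this, the combining cost would be paid once per block rather than once per byte; whether uPP can capture such a block for this flash interface is not established in this thread.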

    I got stuck with the PRU because my custom board does not currently have all the suitable PRU signals available. The PRU will be fast only if I can load all 4 bits with a single read, which is possible if the PRU input signals are connected to consecutive pins. I will come back to this and try it out later.
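The difference between consecutive and scattered data pins can be seen in a small C sketch. The 32-bit input word and the bit positions used here are hypothetical and do not reflect the actual PRU pin mapping; both function names are made up for illustration.

```c
#include <stdint.h>

/* Consecutive pins: the 4 data bits sit next to each other starting at
 * bit position `base`, so one shift and one mask extract the nibble. */
static uint8_t read_nibble_contiguous(uint32_t reg, unsigned base)
{
    return (uint8_t)((reg >> base) & 0x0Fu);
}

/* Scattered pins: pos[i] is the bit position of data line DQi, so every
 * bit has to be extracted and placed individually. */
static uint8_t read_nibble_scattered(uint32_t reg, const uint8_t pos[4])
{
    uint8_t nibble = 0;
    unsigned i;

    for (i = 0; i < 4; i++) {
        nibble |= (uint8_t)(((reg >> pos[i]) & 1u) << i);
    }
    return nibble;
}
```

The contiguous case needs two operations per nibble regardless of which four adjacent pins are used, while the scattered case grows with the number of separately placed bits, much like the shift-and-mask chain in `DQ0DQ3PinRead`.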

    Andres