This thread has been locked.

If you have a related question, please click the "Ask a related question" button in the top right corner. The newly created question will be automatically linked to this question.

C66x core cryptographic algorithm optimization

TI experts-

Is there an app note, benchmark doc, or wiki that talks about optimizing C66x core arithmetic operations (add, xor, shift), for example using 8 byte data operands and performing multiple operations in a each cycle in order to fully utilize the pipeline?

I found the SLAA547 doc (C Implementation of Cryptographic Algorithms) but this is chip agnostic.

Thanks.

-Jeff
Signalogic

  • Hi Jeff,

    Have you checked -sprabf2.pdf - Introduction to TMS320C6000 DSP Optimization and spru187u.pdf -TMS320C6000 Optimizing Compiler v 7.3 User's Guide

    Thanks,

    HR

  • HR (and TI guys)-

    sprabf2 doesn't appear to have non-multiply specific examples.

    One way to narrow my question would be ask:

      -using 16-bit operands (i.e. following the dictum
       of using smallest applicable data type in order
       to enable SIMD instructions)

      -assuming no loops

    what is the maximum number of sets of:

      add, shift, and xor

    that can be performed in one clock cycle?  As one example, if we are using C code:

    uint32 x[16];

      x[i1] ^= (x[i2] + x[i3]) << s1;
      x[i4] ^= (x[i5] + x[i6]) << s2;
      :
      :

    where iN are pre-determined indexes, sN are shift constants, and the left-shift is a rotating shift (no bits lost), would the optimizing compiler still be able to use SIMD instructions?

    If there are any app notes or benchmarks for such crypto type code, please let us know.  Thanks.

    -Jeff

  • Hi Jeff,

    OK, as you know there are  2 sets of 4 functional units (.D, .L, .M, .S) so you can have 8 commands executed in parallel (1 cycle), you can see the commands and what functional unit in sprugh7.pdf -TMS320C66x DSP CPU and Instruction Set in Appendix B table B-1, now there are commands which uses SIMD like the ADD2 (2 additions of 16bit) or ADD4 (4 additions of 8bit) now to use the asm commands use Intrinsics,

    Thanks,

    HR

  • HR-

    Thanks again.

    TI guys-

    Are there any benchmarks for stream ciphers, rather than block ciphers handled by the C66x SA?  The source code I listed above (excerpt here again):

    uint32 x[16];

      x[i1] ^= (x[i2] + x[i3]) << s1;
      x[i4] ^= (x[i5] + x[i6]) << s2;

    is typical for this category, for example the Salsa20 algorithm does this.   Thanks.

    -Jeff

  • We have SA benchmarking in progress but I am not sure if it covers your scenario. I will check.

    Regards, Eric

  • Jeff,

    This is not really a fair question. I mean it sounds like there is something common about stream ciphers, and this something is for some reason resembles Salsa. But the thing is that you can't generally judge performance of any particular algorithm based on performance of some other, even if they fall into category of stream ciphers. It's more about critical path lengths and possibility to parallelise operations, but there are no universal commons. Even boundary between block vs. stream is blurred. There are ways to build stream cipher from block function. In fact, even Salsa does it, it's just that its specification discusses counter mode as the only way to use its block function...

    As for Salsa itself. To tell you the truth I haven't looked at Salsa, only at its close relative Chacha, though not in context of C6x. Yet, based on dependencies observed in Chacha and experience with C6x, I'd say that you should be able to produce pair of values in referred snippet every cycle. Though in order to amortize for critical path of 4 cycles (1 for add, 1 for xor and 2 for rotate), you'd need to accommodate in register bank two blocks for parallel processing. Whether or not compiler will be able to achieve that, I don't know.

    You mention something about 16-bit and then refer to 32-bit code. If you mean that there is 16-bit cipher similar in structure to Salsa, then I don't know. But one has to remember that there is no "SIMD" rotate, meaning that you'll have to do it hard way, in which case you are likely to run out of execution ports, which would lead to failure to emit expected amount of results every cycle.

  • Andy-

    Thanks very much for your insightful reply.  I agree with your analysis.   We're currently working on some hand-optimized C66x asm code for Salsa, I will report back here when we have some benchmarks.

    Yes, rotate is an issue.  A suggestion for TI in their next generation devices.

    -Jeff