
How efficient is the cryptographic hardware accelerator in the AM335x?


I am working on a project in which data is backed up to the device and the files need to be encrypted. Since encryption adds overhead, I tried the cryptographic hardware accelerator of the AM335x, thinking it would reduce that overhead. But when I used it from my application, performance (in terms of total backup time) was worse than with the software implementation of encryption: the backup took more time with hardware-based encryption.

For further context: because of the application design, I am encrypting the files in chunks of 512 bytes each. Does that hurt performance because of the increased number of hardware accesses?

In one of the previous posts someone mentioned that using OpenSSL with the cryptodev kernel module may even make performance worse. If that is so, in which cases does the accelerator give better performance?
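
For context, per chunk my code path boils down to roughly the following (a minimal sketch of the cryptodev path, not my actual application code; error handling omitted):

    #include <string.h>
    #include <sys/ioctl.h>
    #include <crypto/cryptodev.h>

    /* One session per file, then one CIOCCRYPT ioctl -- i.e. one
     * kernel round-trip -- per 512-byte chunk. */
    static struct session_op sess;

    int open_session(int cfd, const unsigned char key[16])
    {
        memset(&sess, 0, sizeof(sess));
        sess.cipher = CRYPTO_AES_CBC;
        sess.keylen = 16;                /* AES-128 */
        sess.key    = (void *)key;
        return ioctl(cfd, CIOCGSESSION, &sess);
    }

    int encrypt_chunk(int cfd, unsigned char iv[16], unsigned char buf[512])
    {
        struct crypt_op cryp;
        memset(&cryp, 0, sizeof(cryp));
        cryp.ses = sess.ses;
        cryp.len = 512;                  /* one chunk per ioctl */
        cryp.src = buf;
        cryp.dst = buf;                  /* encrypt in place */
        cryp.iv  = iv;
        cryp.op  = COP_ENCRYPT;
        return ioctl(cfd, CIOCCRYPT, &cryp);
    }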

Thank you.

  • Hi,

    I will forward this to the Crypto experts.

  • The accelerator is only useful for big chunks of data. Try to encrypt 1 MByte at once.
  • I tried. For larger chunks of data, performance increases gradually. But to beat the software implementation I may have to use a very large chunk, like the 1 MB you suggested.
  • The problem lies not in the hardware but in the overhead of the kernel driver. Big chunks of data help to amortize it, and it probably affects in-kernel users (e.g. IPsec) less than userspace users.

    I've done some preliminary tests with directly accessing the accelerator from userspace, and the hardware is fast. The limiting factor in most cases turned out to be the rate at which the Cortex-A8 could fetch data from the accelerator. The catch is that the AES accelerator has only two contexts available, so unless some mechanism is used to arbitrate access, at most two userspace applications can use the accelerator this way (or one userspace application plus the kernel). It would also be tough to combine this with DMA.
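
    Incidentally, OpenSSL+cryptodev is not the only way through the kernel: the kernel's own AF_ALG socket interface can drive the same driver directly. A minimal sketch (assuming the hardware driver is loaded and wins the priority comparison for "cbc(aes)"), pushing one large buffer per request to amortize the syscall overhead; note the kernel may cap the size of a single request, in which case you'd loop with MSG_MORE:

      #include <string.h>
      #include <unistd.h>
      #include <sys/socket.h>
      #include <linux/if_alg.h>

      /* Encrypt len bytes in place with AES-128-CBC through AF_ALG,
       * as a single request.  Error handling omitted. */
      int afalg_encrypt(const unsigned char key[16], const unsigned char iv[16],
                        unsigned char *buf, size_t len)
      {
          struct sockaddr_alg sa = {
              .salg_family = AF_ALG,
              .salg_type   = "skcipher",
              .salg_name   = "cbc(aes)",
          };
          int tfm = socket(AF_ALG, SOCK_SEQPACKET, 0);
          bind(tfm, (struct sockaddr *)&sa, sizeof(sa));
          setsockopt(tfm, SOL_ALG, ALG_SET_KEY, key, 16);
          int req = accept(tfm, NULL, 0);

          /* The operation and the IV travel as control messages. */
          char cbuf[CMSG_SPACE(sizeof(int)) +
                    CMSG_SPACE(sizeof(struct af_alg_iv) + 16)];
          memset(cbuf, 0, sizeof(cbuf));
          struct iovec iov = { .iov_base = buf, .iov_len = len };
          struct msghdr msg = { .msg_iov = &iov, .msg_iovlen = 1,
                                .msg_control = cbuf,
                                .msg_controllen = sizeof(cbuf) };

          struct cmsghdr *cm = CMSG_FIRSTHDR(&msg);
          cm->cmsg_level = SOL_ALG;
          cm->cmsg_type  = ALG_SET_OP;
          cm->cmsg_len   = CMSG_LEN(sizeof(int));
          *(int *)CMSG_DATA(cm) = ALG_OP_ENCRYPT;

          cm = CMSG_NXTHDR(&msg, cm);
          cm->cmsg_level = SOL_ALG;
          cm->cmsg_type  = ALG_SET_IV;
          cm->cmsg_len   = CMSG_LEN(sizeof(struct af_alg_iv) + 16);
          struct af_alg_iv *aiv = (struct af_alg_iv *)CMSG_DATA(cm);
          aiv->ivlen = 16;
          memcpy(aiv->iv, iv, 16);

          sendmsg(req, &msg, 0);     /* one syscall for the whole buffer */
          read(req, buf, len);       /* ciphertext comes back in place */
          close(req);
          close(tfm);
          return 0;
      }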

  • Matthijs van Duin said:
    The limiting factor in most cases turned out to be the rate at which the Cortex-A8 could fetch data from the accelerator.

    I should note that this is probably because the data was fetched in an inefficient way, even though I used 16-byte NEON loads and stores to maximize performance. The problem is that peripherals are normally mapped as a "Device"-type memory region, and while NEON stores to such a region are fast (a single 16-byte posted write), loads are split into four separate 4-byte reads. I fear each read actually needs to complete before the next one is issued, which means the total time ends up four times the "ping time" to the peripheral.

    Having EDMA read the accelerator's output and place it in a normal uncacheable memory region ("coherent DMA memory" in Linux kernel terminology), from which the CPU could then fetch it, would bypass this problem, but it would be rather non-trivial to set up and would increase overhead for small transactions. Another alternative would be mapping the peripheral itself as normal uncacheable memory, but it's not clear to me how one would do this, and it would rely heavily on the exact behaviour of the Cortex-A8 (specifically its lack of speculative data access): any code doing this would break spectacularly if executed e.g. on a Cortex-A15.

    Still, if I remember correctly, my simple test came within a reasonable margin of the accelerator's maximum speed, so even without the benefit of DMA, direct userspace access should easily outperform both the kernel driver and software implementations, especially for small chunks of data.
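
    To make the access pattern concrete, the 16-byte NEON copies mentioned above look roughly like this with GCC NEON intrinsics (the fifo pointer stands in for the accelerator's mapped data registers):

      #include <arm_neon.h>
      #include <stdint.h>

      /* 16-byte accesses to a peripheral mapped as Device-type memory. */
      static inline void write_block(volatile uint8_t *fifo, const uint8_t *src)
      {
          /* Store: issued as a single posted 16-byte write -- fast. */
          vst1q_u8((uint8_t *)fifo, vld1q_u8(src));
      }

      static inline void read_block(volatile uint8_t *fifo, uint8_t *dst)
      {
          /* Load: split into four 4-byte reads, each apparently waiting
           * for the previous to complete, so the cost is roughly four
           * round-trips to the peripheral. */
          vst1q_u8(dst, vld1q_u8((const uint8_t *)fifo));
      }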

  • What do you mean by two contexts in AES? Encryption and decryption contexts?
  • No, the AES accelerator has two independent contexts, each with its own register set, located at separate addresses: 0x53400000 for context 0 (aka the "secure context") and 0x53500000 for context 1 (aka the "public context"). Despite the terminology, both contexts are freely usable on GP devices. The IRQs are 102 and 103 respectively.

    The two contexts are mostly independent, with some minor caveats:

    • There is only one actual AES computation engine, so its use must somehow be arbitrated between the two contexts. A plausible guess would be that context 0 takes precedence, but I haven't tested this.
    • The DMA events from context 0 seem to be unconnected as far as I can tell, so DMA is only possible using context 1.
    • A few things affect the whole module (both contexts), such as softreset and idlemode. These are only accessible via context 0.
    • Context 0 has two extra registers: one indicates when a context has been accessed/modified (write 1 to clear); the other allows locking some or all of the configuration of context 1.

    Other than this, the two contexts are identical in operation.
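
    For reference, getting at context 1 from userspace can be as simple as mapping it through /dev/mem. A sketch (the 0x80 revision-register offset is taken from the TRM/omap-aes driver and should be double-checked; also, the module's clock must already be enabled, e.g. by the kernel driver, or the read will bus-error):

      #include <fcntl.h>
      #include <stdint.h>
      #include <stdio.h>
      #include <sys/mman.h>

      #define AES1_BASE 0x53500000u  /* context 1, the "public context" */
      #define AES_REV   0x80         /* revision register (check the TRM) */

      int main(void)
      {
          int fd = open("/dev/mem", O_RDWR | O_SYNC);
          volatile uint32_t *aes = mmap(NULL, 4096, PROT_READ | PROT_WRITE,
                                        MAP_SHARED, fd, AES1_BASE);
          if (fd < 0 || aes == MAP_FAILED)
              return 1;
          printf("AES revision: %08x\n", aes[AES_REV / 4]);
          return 0;
      }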

    The hash accelerator similarly has two contexts (0x53000000 and 0x53100000, irqs 108 and 109). I haven't studied it in any detail but I'd expect a similar situation.

  • Thank you so much for your information.
  • After testing with my application code, I found that the hardware accelerator is not useful for my application: it adds further overhead. So I am continuing with the software implementation. Here, can I make use of the NEON coprocessor with an OpenSSL library compiled for it? If the cryptographic functions use the optimized code, I think I could get better performance (less time for encryption/decryption).
  • So I am continuing with the software implementation. Here, can I make use of the NEON coprocessor with an OpenSSL library compiled for it? If the cryptographic functions use the optimized code, I think I could get better performance (less time for encryption/decryption).

    OpenSSL 1.0.2 added a fast NEON bitsliced implementation of AES-128 (ECB, CBC, CTR, XTS), which claims:

    • 19.5 cycles/byte for encrypt
    • 22.1 cycles/byte for decrypt
    • 440 cycles for key setup

    on a Cortex-A8 for the raw primitive (ECB). The useful modes of operation (CTR and XTS, plus CBC for backwards compatibility) will obviously add some, probably small, overhead. XTS will add at least the time of one 16-byte encryption as part of its key setup.

    The numbers I get from "openssl speed aes-128-cbc" suggest about 22 cycles/byte for encryption. The difference may be due to the overhead of CBC mode or the library calls, or perhaps because you're supposed to run benchmarks on an idle system and I was installing system updates at the same time ;-)
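
    To actually use this from a program, nothing special is needed beyond the standard EVP interface; a minimal CTR-mode sketch (OpenSSL 1.0.2-style context handling; the NEON code is selected automatically at runtime):

      #include <openssl/evp.h>

      /* AES-128-CTR over a buffer (CTR encryption and decryption are
       * the same operation). */
      int aes128_ctr(const unsigned char key[16], const unsigned char iv[16],
                     const unsigned char *in, unsigned char *out, int len)
      {
          EVP_CIPHER_CTX ctx;
          int outl = 0;

          EVP_CIPHER_CTX_init(&ctx);
          if (!EVP_EncryptInit_ex(&ctx, EVP_aes_128_ctr(), NULL, key, iv) ||
              !EVP_EncryptUpdate(&ctx, out, &outl, in, len)) {
              EVP_CIPHER_CTX_cleanup(&ctx);
              return -1;
          }
          EVP_CIPHER_CTX_cleanup(&ctx);
          return outl;
      }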

    After testing with my application code, I found that the hardware accelerator is not useful for my application: it adds further overhead.

    Well, to be more precise, the overhead is due to using the kernel driver. The hardware accelerator itself should still be about twice as fast as the bitsliced NEON implementation (assuming the Cortex-A8 runs at 1 GHz and the L3F clock is 200 MHz).

    Earlier I mentioned I did a test interfacing directly with the AES accelerator from userspace, and that the bottleneck seemed to be data transfer between the HW accelerator and the CPU. I've recently discovered that, unlike normal device drivers, both /dev/mem and UIO map device memory as "strongly ordered". This is disastrous for the performance of any device I/O done from userspace, so I'm going to try to patch the kernel to fix that and see if I can make a working demo of HW-accelerated AES from userspace.

  • So, are you saying that the architecture of the HW engine itself is not efficient?
    I have also observed one thing. Initially I thought encrypting a larger chunk of data would give better performance with the HW engine. When I checked, the maximum limit for a data chunk is 64 kB. I tested by increasing that limit to 900 kB and to 1 MB. At 900 kB the encryption went fine; at 1 MB it ended in a crash (page allocation failure). And at 900 kB the performance was nothing great: just a few seconds better than the software approach.

  • I'm not sure what you mean by "HW engine" here. The AES accelerator hardware itself is fast. However, when using it from userspace you currently go through many layers, each adding overhead and/or restrictions (there is, for example, no good reason why the data size should be limited at all): the abstraction layer(s) in OpenSSL, the userspace<->kernel interface, the kernel's crypto framework, and the kernel driver for the AES accelerator.

    Software has fewer layers of overhead to cut through, although I noticed "openssl speed aes-128-cbc" is slightly misleading, since it uses the old openssl/aes.h API, which has a limited set of modes (and seems to be deprecated?). To use the new API one needs to run "openssl speed -evp aes-128-cbc", and this gives me noticeably slower timings, especially for small block sizes: the price of adding a "nice" abstraction layer.
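
    For comparison, the thin old interface that plain "openssl speed aes-128-cbc" exercises amounts to this:

      #include <openssl/aes.h>

      /* The old low-level API: a key schedule plus one direct call,
       * with no engine dispatch or padding logic in between. */
      void old_api_cbc(const unsigned char key[16], unsigned char iv[16],
                       const unsigned char *in, unsigned char *out, size_t len)
      {
          AES_KEY ks;
          AES_set_encrypt_key(key, 128, &ks);
          AES_cbc_encrypt(in, out, len, &ks, iv, AES_ENCRYPT);
      }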

  • I meant the whole stack: OpenSSL, kernel modules, driver, accelerator. The accelerator would be useful if I could find or implement some other way to deal with it (I am not sure how to do this). If the current method of accessing the accelerator is not efficient, then why don't the TI wiki pages suggest the efficient one? Are we expected to discover our own way?