How is the structure of the AES

How can AES be accelerated?

AES (Advanced Encryption Standard) is a symmetrical block code that can be implemented with different key lengths of 128, 192 and 256 bits. The number of processing steps for the encryption and decryption of the relevant data results from the key length. Block encryption algorithms, as the name suggests, work with blocks of data. The AES algorithm has a fixed block size of 16 bits each. If the encryption is less than 16 bytes, the unused bytes must be filled in accordingly. Since AES is a symmetrical code, the same actions and the same key are used to encrypt and decrypt information. In contrast to AES, asymmetrical encryption methods such as RSA use different keys for both operations.

Each of the four steps of the AES algorithm applies to a state. The combination of the four AES steps is called a round. How many rounds are required depends on the length of the key. The AES state begins with the 16 bytes that are to be encrypted. Every further step updates the status. Before processing the state, the input byte string must be correctly formatted in the initial state to form a 4 × 4 matrix (Fig. 1). After the initial 16 bytes have been converted into the initial state as a 4 × 4 array, it is possible to examine how each step affects its input state.

Figure 1. Initial state conversion of the 16 bytes into a 4 × 4 array. Xilinx

The add-round key is the only step that uses the key. As already mentioned, the number of rounds in the encryption algorithm depends on the key length (128, 192 or 256 bits). The coding key undergoes an expansion. This ensures that its bytes are not used multiple times in each round before they are used. Thus, the expanded key size is different every time. The extended bowl length is

Expanded Key Size (Byte) = 16 * (Rounds + 1)

During the operation of this step, the bytes of the input state are linked via Exclusive OR with the 16 bytes of the extended key. Each round uses a different section of the extended key. Round 0 uses bytes 0 to 15, round 1 uses bytes 16 to 31, and so on. In each round, byte 1 of the state is linked by exclusive OR with the lowest byte of the expanded key.

The Sub-Bytes step uses a byte substitution to swap the state values ​​with another value. The values ​​in the substitution box are predefined and designed so that they have a low correlation between the input and output bits. The substitution box (S-Box) is a 16 × 16 matrix. The upper and lower nibbles of the substituted byte appear as an index in the substitution table. So if in the S-Box encryption according to Figure 2 the first initial state byte is 0 × 69, it is replaced by the substitution value 0 × F9. The upper nibble of the state byte determines the row of the substitution box, the lower nibble the column. In Figure 2 it should be noted that there are separate substitution boxes for encryption and decryption, and that their contents are different.

Figure 2. Contents of the AES substitution box. Xilinx

The Shift-Rows step is used to rearrange the input state matrix via a circular byte shift of each row. Each line is rotated to the right by a different factor (Fig. 3). Line 1 remains unchanged. Line 2 is shifted by one byte, line 3 by two bytes and line 4 by three bytes. The same operations are performed when decrypting, but rotating to the left instead of to the right.

Fig. 3. Line shift operation. Xilinx

Mix-Columns is the most complicated step of a round. It requires 16 multiplications and twelve exclusive OR operations. These operations are carried out column by column on the input state matrix, which is multiplied by a fixed matrix to generate a new state column (Fig. 4). Each entry in the column is multiplied by a row in the matrix. The results of all multiplications are XORed to form the new state value. In Figure 4, the first column and row to be multiplied are highlighted in yellow.

Figure 4. Mix columns function for encryption and decryption. Xilinx

The Mix Columns equations in the first column are as follows:

B1 '= (B1 * 2) XOR (B2 * 3) XOR (B3 * 1) XOR (B4 * 1)

B2 '= (B1 * 1) XOR (B2 * 2) XOR (B3 * 3) XOR (B4 * 1)

B3 '= (B1 * 1) XOR (B2 * 1) XOR (B3 * 2) XOR (B4 * 3)

B4 '= (B1 * 3) XOR (B2 * 1) XOR (B3 * 1) XOR (B4 * 2)

This process is repeated with the same multiplication matrix for the next column of the input state until all columns of the input state have been addressed.

Each AES encryption round includes all four steps in this order: Sub-Bytes, Shift-Rows, Mix-Columns (for rounds 1 to N - 1) and Add-Round-Key (with an expanded key). It is imperative to ensure that this process is reversible and that the illegible block text can be converted back into plain text so that the encrypted information can be used. The following steps are used for this: Invert Shift-Rows, Invert Sub-Bytes, Add-Round-Key (with expanded key) and Invert Mix-Columns (for rounds 1 to N - 1). Before the first round of encryption, an initial add round key operation must be carried out for encryption and decryption.

A look at the algorithm that is to be used to expand the key so that there are enough bits available in the key for the number of add round key steps to be carried out (Fig. 5). Key lengths of 16, 24 or 32 bytes require 44, 52 or 60 rounds for expansion. The first bytes of the expanded key are the same as the initial key. This means that in the AES256 example explained here, the first 32 bytes of the expanded key are the key itself. The key expansion generates the 32 additional bits for the expanded key in each iteration.

Figure 5. Key expansion algorithm. Xilinx

The main steps in key expansion are: Rotate-Word - This step is similar to Shift-Rows. It rearranges a 32-bit word so that the most significant byte (MSB) becomes the least significant byte (LSB). The Subword step uses the same substitution box as for the byte substitutions in encryption, while the rcon step forms the power of two to a value defined by the user. As with mix columns, rcon is executed in the Galois body. That is why a precalculated lookup table is usually used here. EK delivers four bytes from the expanded key, K as in EK, four bytes.

How can the correct implementation of the encryption and key extension algorithms be verified? The NIST specification (National Institute of Standards and Technology) for AES offers a number of tried and tested examples that can be used to check your own implementation.

Code creation

In order to ensure that the encryption part of the AES code can be accelerated in the programmable logic (PL) of the Zynq SoC, the code must be developed with this aim in mind from the start. The first consideration is the architecture of the algorithm and its appropriate segmentation. AES is well suited for this approach because you can write functions for each step and call them as needed. In addition, a separate file must be provided for the function to be accelerated.

The corresponding software architecture therefore contains the following components: main.c, aes_enc.c, aes_enc.h and sbox.h. The main.c file contains the algorithm for key expansion, the key itself and the plain text input, as well as the call for the AES encryption function. The file aes_enc.c performs the encryption. Each level is coded as a separate function so that it can be called up as required for the AES round. So that this design corresponds to the usual implementations on the processors, the lookup tables of the mixed-step multiplications are used. The file aes_enc.h contains the definition of the aes_function and the parameters for determining its size (for example mk, nb and nr). The sbox.h file contains the substitution box for the substitution bytes, the lookup table for the rcon function for key expansion and the lookup tables for the mix columns multiplications. With this structure, the AES encryption function to be accelerated can be selected by right-clicking on the function and using the HW / SW switch (Figure 6).

Fig. 6. The function to be accelerated. Xilinx

In order to ensure the realization of the minimum performance and the savings through the acceleration of the function, it must be possible to schedule the execution of the function. The sds_clock_counter in sds_lib.h is used for this.

After writing the source code, the AES algorithm could be executed in 36,662 processor cycles in software on a single ARM Cortex-A9 processor core in the Zynq SoC.

Optimize the acceleration

The acceleration of the AES algorithm is somewhat more complicated than the algorithm for matrix multiplication because the main loop of the AES algorithm consists of interdependent stages.

In practice, the AES algorithm can be speeded up by examining the loops to see where to unroll them to optimize memory bandwidth. In addition, the correct clock frequency for the data flow and the frequency of the hardware functions are selected at this point.

The main loop of the AES encryption function includes the functions for each AES step. Every function of the AES algorithm must be completely executed and the result calculated before the next function can run. This dependency requires full concentration on the AES steps created as separate functions. There is ample potential for optimization within these steps.

Key data

The AES (Advanced Encryption Standard) encryption standard has firmly established itself in processor, microcontroller, FPGA and SoC applications for securing data input and storage. Because of the demanding operations of the AES algorithm, developers usually use FPGA solutions to implement this encryption standard. After the algorithm has been described in the high-level language C, it is accelerated in the hardware implementation. AES is a good example to illustrate the advantages of the SDSoC development environment from Xilinx.

The steps Add-Round-Key, Sub-Bytes and Mix-Columns can be arranged as a pipeline in order to increase the performance. These functions are used to execute the HLS pipeline instruction by inserting pragmas in the first loop. You should also unroll the inner loop. Some of these functions come from lookup tables that are normally implemented as block RAM. In order to increase the memory bandwidth, this example uses "complete" as the pragma parameter. This implements the memory contents as discrete registers.

The ability to transfer data between the PS and PL sections of the Zynq SoC is also important for increasing performance. In the first step, the clock in the network for the data flow was set to the highest possible value: 200 MHz. The next step was to ensure direct memory access during data transfer between PS and PL. This required a rewrite of the interface and the use of the sds_alloc function so that the data are coherent in the memory, as required by the DMA transfer (Figure 7).

Figure 7. Network for the data flow between processor system (PS) and programmable logic (PL). Xilinx

As a third and final optimization step, the clock rate for the hardware function was set to the highest frequency supported by this application: 166.67 MHz.

Results with the supported operating systems

After setting up the example circuit, the PL-accelerated AES code ran under Linux in 16,544 processor clock cycles. This corresponds to 45% (16,544 / 36,662) of the number of cycles when operating the AES code only in software, which means a massive 55% reduction in the execution time for a very complex and interdependent algorithm.

The Baremetal or Free-RTOS operating systems can also be used for the SDSoC development environment. The creation of baremetal and free RTOS projects with code reuse allows a good comparison of the performance of all three supported operating systems. The choice of operating system for a specific project depends on the type of application, the required performance and the time budget for the development.

Figure 8 illustrates the performance of the three operating systems mentioned with the Zynq SoC. It is not surprising that there are similar reductions with Free-RTOS and Baremetal, because both enable much simpler implementations than Linux.

Figure 8. Performance of Free-RTOS and Baremetal for the component from the Zynq family (PS and PL) with similar reduction results. Xilinx

As the results show, the use of the SDSoC development environment to accelerate AES encryption results in a real improvement in performance with simple implementation - even without special prior knowledge of FPGA design.