How to reverse a byte

I am currently working on a project in which I have to reverse the bit order of a byte. I am using AVR Studio with an ATmega32 microcontroller.



For example:


0000 0001 becomes 1000 0000
0001 0110 becomes 0110 1000
1101 1001 becomes 1001 1011



To start I have this:


ldi r20,0b00010110



What is the easiest way to reverse the byte so that r20 becomes 0b01101000?





Have a look at stackoverflow.com/questions/4924060/… for a few suggestions on how to solve it. You may also look at graphics.stanford.edu/~seander/bithacks.html#BitReverseObvious for more byte-reversal algorithms (the code is in C but should be easy to port to asm).
– dwalter
Mar 15 '13 at 10:48





Easy for whom? Personally I'd use (table16[orig & 15] << 4) | table16[orig >> 4]; one can easily convert that to assembler, and the tables can be generated with a little mental effort. The memory requirements and speed are also in reasonably good balance.
– Aki Suihkonen
Mar 15 '13 at 11:33






@AkiSuihkonen: table indexing is not cheap on AVR, and 4-bit shifts take a SWAP + mask. So it might be barely faster than a pure ALU version. I put this in an answer.
– Peter Cordes
Jun 30 at 10:09




5 Answers



Here's a snippet. It's written for the GNU toolchain (avr-gcc, binutils, avr-libc, etc.), but it should be easy to adapt:


#include <stdint.h>

static inline __attribute__ ((always_inline))
uint8_t avr_reverse_byte (uint8_t x)
{
    x = ((x & 0x55) << 1) | ((x & 0xaa) >> 1);
    x = ((x & 0x33) << 2) | ((x & 0xcc) >> 2);

    /* x = ((x & 0x0f) << 4) | ((x & 0xf0) >> 4); */

    __asm__ ("swap %0" : "=r" (x) : "0" (x)); /* swap nibbles. */

    return x;
}



So, not much of an improvement over the 'C' code, except for the final hi-lo nibble swap implemented with the swap instruction.
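As the comment below points out, avr-gcc also recognizes the nibble-rotate idiom in plain C, so the inline asm can be avoided entirely. A pure-C sketch (the explicit uint8_t casts work around the gcc 4.6 integer-promotion problem described in that comment):

#include <stdint.h>

/* Pure-C variant: avr-gcc compiles the final 4-bit rotate to a single
 * SWAP instruction, so no inline asm is required. */
static inline uint8_t avr_reverse_byte_c (uint8_t x)
{
    x = (uint8_t)((x & 0x55) << 1) | (uint8_t)((x & 0xaa) >> 1);
    x = (uint8_t)((x & 0x33) << 2) | (uint8_t)((x & 0xcc) >> 2);
    x = (uint8_t)((x << 4) | (x >> 4)); /* nibble swap -> SWAP */
    return x;
}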





AVR gcc recognizes the x = ((x << 4) | (x >> 4)) rotate idiom with or without &0x0f / &0xf0, and compiles it to a swap instruction. You don't need inline asm. Ironically, it's the rest of the function that compiles horribly. For some reason, if you don't explicitly cast the operands of | in the first two expressions, gcc fails to optimize the int (from integer promotion) down to an actual 1-register operation and uses 2 regs: godbolt.org/g/sYx1uP. (I haven't checked current gcc because Godbolt only has AVR gcc4.6.)
– Peter Cordes
Jun 30 at 9:08




I can't provide AVR code just now, but the general bit-reversal technique is the following:


abcd efgh    p
badc fehg    p = ((p and 0AAh) shr 1) or ((p shl 1) and 0AAh)
dcba hgfe    p = ((p and 0CCh) shr 2) or ((p shl 2) and 0CCh)
hgfe dcba    p = ((p and 0F0h) shr 4) or ((p shl 4) and 0F0h)
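As a check (not part of the original answer), tracing the question's example byte through the three steps:

0001 0110    input
0010 1001    after step 1 (adjacent bits swapped)
1000 0110    after step 2 (adjacent 2-bit groups swapped)
0110 1000    after step 3 (nibbles swapped), the expected result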



Another easy way is to use the carry flag:



Repeat 8x:


    lsl r20    ; shift one bit into the carry flag
    ror r0     ; rotate carry flag into result



(Input in r20, output in r0, content of r20 destroyed; registers may be changed freely.)





This uses 16 instructions @ 2 bytes, 1 cycle each = 32 bytes of program memory and 16 cycles to reverse one byte when completely 'unrolled'. Wrapped in a loop, the code size can be reduced but execution time will increase; see the sketch below.
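For illustration (a sketch, not from the original answer), the same bit-at-a-time approach written as a C loop; a compiler such as avr-gcc may well turn this into a short shift/rotate loop along the lines of the snippet above:

#include <stdint.h>

/* Bit-at-a-time reversal, the C analogue of the lsl/ror pair above:
 * pull the lowest bit off x and push it into result from the right,
 * so the first bit extracted ends up in the highest position. */
uint8_t reverse_byte_loop (uint8_t x)
{
    uint8_t result = 0;
    for (uint8_t i = 0; i < 8; i++) {
        result = (uint8_t)((result << 1) | (x & 1));
        x >>= 1;
    }
    return result;
}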



A 4-bit (16-entry) lookup table for the two halves of the byte looks like a good tradeoff (as @Aki points out in a comment).



AVR instructions are 2 bytes each, so a 16-byte table costs the same space as 8 instructions. (It turns out not to be worth it for speed or size, except maybe if you can align your array to a 256-byte boundary to make indexing much cheaper than what gcc does.)



It would be possible to pack the LUT, using the high and low half of each byte. But that would cost more in indexing (using a branch on bit 4 of index to conditionally SWAP before masking) than it saves in table size (8 bytes = 4 instructions).
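To make that tradeoff concrete, here is a hypothetical C sketch of the packed variant (the table name and layout are assumptions for illustration: entry i holds reverse(i) in its low nibble and reverse(i+8) in its high nibble, where reverse(i+8) == reverse(i) | 1):

#include <stdint.h>

/* Hypothetical packed table: two reversed nibbles per byte. */
static const uint8_t packed_rev[8] = {
    0x10, 0x98, 0x54, 0xdc, 0x32, 0xba, 0x76, 0xfe
};

static inline uint8_t rev4_packed (uint8_t n) /* n = 0..15 */
{
    uint8_t e = packed_rev[n & 7];
    if (n & 8)      /* top bit of the index selects the high half... */
        e >>= 4;    /* ...costing a SWAP (plus the branch) on AVR */
    return e & 0x0f;
}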



Let's compare what AVR GCC does. gcc4.6 has surprisingly terrible missed optimizations when compiling Brett's code (actually promoting to int and not taking full advantage of truncating the result to uint8_t). Ironically, it does optimize x<<4 | x>>4 into a rotate using the SWAP instruction. (AVR doesn't have a multiple-count rotate instruction, and normal rotates are rotate-through-carry. Shifts are only available with one count per instruction.)




#include <stdint.h>

uint8_t reverse_byte_alu (uint8_t x)
{
    uint8_t xeven = x & 0x55, xodd = x & 0xaa;
    x = (xeven << 1) | (xodd >> 1);  // swap adjacent bit pairs

    xeven = x & 0x33, xodd = x & 0xcc;
    x = (xeven << 2) | (xodd >> 2);  // swap adjacent 2-bit chunks

    x = ((x << 4) | (x >> 4));       // 4-bit rotate is recognized as SWAP
    return x;
}



compiles into this asm with gcc4.6 -O3 (Godbolt compiler explorer). I don't see any missed optimizations.




reverse_byte_alu:
    mov  r25,r24
    andi r25,lo8(85)
    lsl  r25
    andi r24,lo8(-86)
    lsr  r24
    or   r25,r24       # swap of adjacent bits done

    mov  r24,r25
    andi r24,lo8(51)
    lsl  r24
    lsl  r24           # << 2
    andi r25,lo8(-52)
    lsr  r25
    lsr  r25           # >> 2
    or   r24,r25       # swap pairs of bits done

    swap r24           # swap nibbles
    ret



16 instructions, 2 bytes each, all of them single-cycle (except ret): https://www.microchip.com/webdoc/avrassembler/avrassembler.wb_instruction_list.html





So this is slightly better than @JimmyB's answer, which needs 16 single-cycle instructions not including a ret. (But it can be rolled up into a small loop).





Array indexing is not cheap in AVR. The only choice of addressing mode is post-increment or pre-decrement, no immediate displacements. So the 16-bit array address has to be added to the 4-bit nibble values. If your array is in the low 256 bytes of address space, this could be cheaper. Or if your array is 256-byte aligned, you could just set the upper byte of a pointer register and put your lookup value in the low byte. (gcc misses this optimization).
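For illustration, a hypothetical C sketch of that 256-byte-alignment trick (the table name and the aligned(256) attribute are assumptions; the point is that the base address's low byte is zero, so the index can simply replace it):

#include <stdint.h>

/* Hypothetical: force the table onto a 256-byte boundary so the low
 * byte of its address is zero. */
__attribute__((aligned(256)))
static const uint8_t reverse_nibble_256[16] = {
    0b0000, 0b1000, 0b0100, 0b1100, 0b0010, 0b1010, 0b0110, 0b1110,
    0b0001, 0b1001, 0b0101, 0b1101, 0b0011, 0b1011, 0b0111, 0b1111
};

static inline uint8_t rev4_aligned (uint8_t nib) /* nib = 0..15 */
{
    /* OR-ing the index into the base is safe: the base's low byte is
     * zero, so no 16-bit add (and no carry propagation) is needed. */
    return *(const uint8_t *)((uint16_t)reverse_nibble_256 | nib);
}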



Telling gcc to align the array on a 16-byte boundary makes the address calculation cheaper, but the total instruction count is 18, and some of them are multi-cycle instructions.


__attribute__((aligned(16))) // makes indexing cheaper
static const uint8_t reverse_nibble[16] = {
0, 0b1000, 0b0100, 0b1100,
0b0010, 0b1010, 0b0110, 0b1110,
0b0001, 0b1001, 0b0101, 0b1101,
0b0011, 0b1011, 0b0111, 0b1111
};

uint8_t reverse_byte_nibble_LUT (uint8_t x) {
    uint8_t hi = reverse_nibble[x >> 4];
    hi = ((hi << 4) | (hi >> 4));      // SWAP instead of SWAP+AND for just a shift
    uint8_t lo = reverse_nibble[x & 0x0f];
    return hi | lo;
}



compiles to 17 instructions, and LD is a 2-cycle instruction when accessing flash with a non-increment/decrement addressing mode. (It's 1 cycle on some CPUs when not accessing flash or internal SRAM.)


# gcc4.6 output, not optimal
    mov  r26,r24
    swap r26
    andi r26,lo8(15)
    ldi  r27,lo8(0)
    subi r26,lo8(-(reverse_nibble))  # AVR has no add-immediate for bytes, only sub
    sbci r27,hi8(-(reverse_nibble))  # X register = (x>>4) - (-LUT_base)
    ld   r25,X
    swap r25
    mov  r30,r24
    ldi  r31,lo8(0)
    andi r30,lo8(15)
    andi r31,hi8(15)                 # missed opt: this is a doubly-redundant 0 & 0
    subi r30,lo8(-(reverse_nibble))
    sbci r31,hi8(-(reverse_nibble))
    ld   r24,Z
    or   r24,r25
    ret



ldi r27, 0 / sbci r27 is probably a missed optimization. With a 16-byte aligned table, we can't get carry into the high byte. I think we can do:




# generate r30 = x & 0x0f
    subi r30,lo8(-(reverse_nibble))  # ORI would work, too; no-op with a 256-byte aligned table
    ldi  r31,hi8(reverse_nibble)     # reuse this for hi and lo



So this might come out ahead on speed with better indexing, but total size (code + table) definitely looks worse.



With a little bit of tweaking, one can get extra performance from the code snippet (of mine) in the comments.



When we ensure that the LUT is aligned to a 16-byte boundary, we can generate the entry's address by XORing the base with the index. We can also XOR each table entry with its own index, which lets the argument x be updated in place: x ^= table[x & 15] replaces the low nibble with its reversal while leaving the high nibble untouched. I've commented out the unnecessary instructions generated by GCC.




__attribute__((aligned(16))) // makes indexing cheaper
static const uint8_t reverse_nibble_xor[16] = {
0 ^ 0, 1 ^ 0b1000, 2 ^ 0b0100, 3 ^ 0b1100,
4 ^ 0b0010, 5 ^ 0b1010, 6 ^ 0b0110, 7 ^ 0b1110,
8 ^ 0b0001, 9 ^ 0b1001, 10 ^ 0b0101, 11 ^ 0b1101,
12 ^ 0b0011, 13 ^ 0b1011, 14 ^ 0b0111, 15 ^ 0b1111
};


uint8_t reverse_ams (uint8_t x)
{
    const uint8_t *p = (const uint8_t *)((uint16_t)reverse_nibble_xor ^ (x & 15));
    x ^= p[0];                   // low nibble replaced by its reversal
    x = ((x << 4) | (x >> 4));   // nibble swap (SWAP)
    const uint8_t *q = (const uint8_t *)((uint16_t)reverse_nibble_xor ^ (x & 15));
    x ^= q[0];                   // second nibble replaced likewise
    return x;
}

reverse_ams:
    ldi  r18,lo8(reverse_nibble_xor)
    // ldi r19,hi8(reverse_nibble_xor)
    ldi  r31,hi8(reverse_nibble_xor)  // use r31 directly instead of r19
    mov  r30,r24
    // ldi r31,lo8(0)
    andi r30,lo8(15)
    // andi r31,hi8(15)
    eor  r30,r18
    // eor r31,r19
    ld   r25,Z
    eor  r25,r24
    swap r25
    mov  r30,r25
    // ldi r31,lo8(0)
    andi r30,lo8(15)
    // andi r31,hi8(15)
    eor  r30,r18
    // eor r31,r19
    ld   r24,Z
    eor  r24,r25
    ret





"we can generate the address by xoring" - Why not just add?
– JimmyB
yesterday








Why not indeed. Theoretically a compiler should know that bitwise operations do not produce a carry, meaning that the low byte is isolated from the top byte. But the compiler doesn't understand that, whether XORing or adding 0..15 to 0xABC0.
– Aki Suihkonen
yesterday






