When a processor stores an item of data back to memory, it actually goes through quite a complex set of operations. A sketch of the activities is as follows. First, the cache line containing the target address of the store needs to be fetched from memory. While this is happening, the data to be stored is placed on a store queue. When the store is the oldest item in the queue, and the cache line has been successfully fetched from memory, the data can be placed into the cache line and removed from the queue.
This works very well if data is stored and either never reused, or reused after a relatively long delay. Unfortunately it is common for data to be needed almost immediately. There are plenty of reasons why this is the case. If parameters are passed through the stack, then they will be stored to the stack, and then immediately reloaded. If a register is spilled to the stack, then the data will be reloaded from the stack shortly afterwards.
It could take a considerable number of cycles if loads had to wait for the stores to exit the queue before they could fetch the data, so many processors implement some kind of bypassing: if a load finds the data it needs in the store queue, it can fetch it from there. There are often some caveats associated with this bypass. For example, the store and load often have to be of the same size and to the same address, so you cannot bypass a byte out of a store of a word. If the bypass fails, the situation is referred to as a "RAW" hazard, meaning "Read-After-Write", and the load has to wait until the store has completed before it can retrieve the new value. This can take many cycles.
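The bypass rule described above can be illustrated with a toy model. This is only a sketch: the names (pending_store, classify_load, load_result) are hypothetical, and the "exact address and size match" rule is an assumption standing in for whatever rule a particular processor actually implements.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical toy model of a store queue; not any real processor's logic. */
enum load_result { LOAD_FROM_CACHE, LOAD_BYPASSED, LOAD_RAW_STALL };

struct pending_store {
    uintptr_t addr;  /* address of the first byte written */
    size_t    size;  /* number of bytes written */
};

/* Classify a load against the pending stores (index 0 is the oldest). */
enum load_result classify_load(const struct pending_store *queue, size_t count,
                               uintptr_t addr, size_t size)
{
    /* Search youngest-first: the most recent overlapping store decides. */
    for (size_t i = count; i-- > 0; ) {
        const struct pending_store *s = &queue[i];
        int overlaps = addr < s->addr + s->size && s->addr < addr + size;
        if (!overlaps)
            continue;
        /* Assumed rule: bypass only when address and size match exactly. */
        if (s->addr == addr && s->size == size)
            return LOAD_BYPASSED;
        /* Overlap without an exact match: the load waits for the store. */
        return LOAD_RAW_STALL;
    }
    /* No pending store touches these bytes, so the cache is up to date. */
    return LOAD_FROM_CACHE;
}
```

In this model, a four-byte load that follows four one-byte stores to the same word is classified as LOAD_RAW_STALL, because no single queued store matches the load in both address and size.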
As a general rule it is best to avoid potential RAWs. Whether a RAW hazard actually occurs depends on the hardware and on the runtime situation, so avoiding the possibility is the best defense. Consider the following code, which uses loads and stores of bytes to construct an integer.
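#include <stdio.h>
#include <sys/time.h>
/* gethrtime() returns a time in nanoseconds. Dividing the elapsed time by
   the 100000000 loop iterations gives nanoseconds per iteration. */
void tick()
{
    hrtime_t now = gethrtime();
    static hrtime_t then = 0;
    if (then>0) printf("Elapsed = %f\n", 1.0*(now-then)/100000000.0);
    then = now;
}
/* Build the integer with four byte stores, then load it back as an int. */
int func(char * value)
{
    int temp;
    ((char*)&temp)[0] = value[3];
    ((char*)&temp)[1] = value[2];
    ((char*)&temp)[2] = value[1];
    ((char*)&temp)[3] = value[0];
    return temp;
}
int main()
{
    int value = 0x01020304;
    tick();   /* first call only records the start time */
    for (int i=0; i<100000000; i++) func((char*)&value);
    tick();   /* second call prints the time taken by the loop */
}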
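In the above code we're reversing the byte order by loading the bytes one-by-one, storing them into an integer in the correct position, and then loading the integer. The final load of the integer is a different size from the four byte stores that precede it, which is exactly the kind of mismatch that can cause the bypass to fail. Running this code on a test machine, it reports 12ns per iteration.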
However, it is possible to perform the same reordering using logical operations (shifts and ORs) as follows:
int func2(char* value)
{
    return (value[0]<<24) | (value[1]<<16) | (value[2]<<8) | value[3];
}
This modified routine takes about 8ns per iteration, which is significantly faster than the original code.
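The shift-and-OR approach can also be written so that it does not depend on whether char is signed. The following is a minimal sketch, not the code that was timed above: the name bytes_to_u32 is hypothetical, and it assumes the four bytes are supplied most significant first. Casting each byte to uint32_t before shifting avoids sign extension when a byte is 0x80 or larger, and avoids shifting a value into the sign bit of an int.

```c
#include <stdint.h>

/* Hypothetical helper: assemble four bytes, most significant first,
   using only shifts and ORs, so the result is never built up in memory
   with byte stores and then reloaded as a word. */
uint32_t bytes_to_u32(const unsigned char *value)
{
    return ((uint32_t)value[0] << 24) |
           ((uint32_t)value[1] << 16) |
           ((uint32_t)value[2] <<  8) |
            (uint32_t)value[3];
}
```

Because the result is computed from the byte values rather than from their layout in memory, this version gives the same answer regardless of the byte order of the machine it runs on.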
The actual speed up observed will depend on many factors, the most obvious being how often the code is encountered. The more subtle observation is that the speed up depends on the platform. Some platforms will be more sensitive to the impact of RAWs than others. So the best advice is, wherever possible, to avoid passing data through the stack.
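Since the size of the effect depends on the platform, it is worth measuring on the machine you care about. gethrtime() is not available everywhere, so the following sketch uses the standard C clock() function instead. clock() reports processor time at a much coarser resolution, so it needs a large iteration count to give a meaningful figure. The names ns_per_call and sample_func are hypothetical, and sample_func is only a stand-in for the routine being timed.

```c
#include <time.h>

/* Hypothetical stand-in for the routine under test. */
int sample_func(char *value)
{
    return (value[0] << 8) | value[1];
}

/* Average processor time, in nanoseconds, of one call to fn. */
double ns_per_call(int (*fn)(char *), char *arg, long iterations)
{
    volatile int sink = 0;  /* stops the calls being optimised away */
    clock_t start = clock();
    for (long i = 0; i < iterations; i++)
        sink = fn(arg);
    clock_t end = clock();
    (void)sink;
    /* Convert clock ticks to nanoseconds, then average over the calls. */
    return 1e9 * (double)(end - start) / CLOCKS_PER_SEC / (double)iterations;
}
```

Timing both versions of a routine with the same harness, the same iteration count, and the same compiler flags gives a like-for-like comparison on a given machine.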