Saturday, 28 September 2013

OS portable memcpy optimized for SSE2 & SSE3

OS portable memcpy optimized for SSE2 & SSE3

If I was to write a OS portable memcpy optimized for SSE2/SSE3 how would
that look like? I want to support both the GCC and ICC compilers. The
reason I ask is that memcpy is written in assembler code in glibc and not
optimized for SSE2/SSE3, and other generic memcpy implementations may not
fully take advantage of the systems capabilities with data alignment and
size etc.
Here is my current memcpy that take data alignment into consideration and
is optimized for SSE2 (I think) but not for SSE3:
#ifndef __SSE2__
// SSE2 optimized memcpy()
void *CMemUtils::MemCpy(void *restrict b, const void *restrict a, size_t n)
{
char *s1 = b;
const char *s2 = a;
for(; 0<n; --n)*s1++ = *s2++;
return b;
}
#else
// Generic memcpy() implementation
void *CMemUtils::MemCpy(void *dest, const void *source, size_t count) const
{
#ifdef _USE_SYSTEM_MEMCPY
// Use system memcpy()
return memcpy(dest, source, count);
#else
size_t blockIdx;
size_t blocks = count >> 3;
size_t bytesLeft = count - (blocks << 3);
// Copy 64-bit blocks first
_UINT64 *sourcePtr8 = (_UINT64*)source;
_UINT64 *destPtr8 = (_UINT64*)dest;
for (blockIdx = 0; blockIdx < blocks; blockIdx++) destPtr8[blockIdx] =
sourcePtr8[blockIdx];
if (!bytesLeft) return dest;
blocks = bytesLeft >> 2;
bytesLeft = bytesLeft - (blocks << 2);
// Copy 32-bit blocks
_UINT32 *sourcePtr4 = (_UINT32*)&sourcePtr8[blockIdx];
_UINT32 *destPtr4 = (_UINT32*)&destPtr8[blockIdx];
for (blockIdx = 0; blockIdx < blocks; blockIdx++) destPtr4[blockIdx] =
sourcePtr4[blockIdx];
if (!bytesLeft) return dest;
blocks = bytesLeft >> 1;
bytesLeft = bytesLeft - (blocks << 1);
// Copy 16-bit blocks
_UINT16 *sourcePtr2 = (_UINT16*)&sourcePtr4[blockIdx];
_UINT16 *destPtr2 = (_UINT16*)&destPtr4[blockIdx];
for (blockIdx = 0; blockIdx < blocks; blockIdx++) destPtr2[blockIdx] =
sourcePtr2[blockIdx];
if (!bytesLeft) return dest;
// Copy byte blocks
_UINT8 *sourcePtr1 = (_UINT8*)&sourcePtr2[blockIdx];
_UINT8 *destPtr1 = (_UINT8*)&destPtr2[blockIdx];
for (blockIdx = 0; blockIdx < bytesLeft; blockIdx++)
destPtr1[blockIdx] = sourcePtr1[blockIdx];

No comments:

Post a Comment