Mike Klein 8f506d3257 SSE filter speed improvements for bpp=3.
- memcpy-free implementations of load3() / store3().
  These should vary less from compiler to compiler.

- call load3() only when needed, at the end of a scanline.
  In the middle, we can use the faster load4() and ignore the extra fourth byte.
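
The two points above can be sketched in portable scalar C++. This is an illustration under stated assumptions, not the code from this commit: a uint32_t stands in for the low lane of an SSE register, and invert_row() is a hypothetical per-pixel operation chosen only to show the loop shape. The names load3(), load4() and store3() come from the message; their bodies here are guesses at the idea.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Assemble exactly 3 bytes into the low 24 bits without memcpy.
// Never touches memory past p[2]. (Assumed shape, not the commit's code.)
static inline uint32_t load3(const uint8_t* p) {
    return uint32_t(p[0]) | (uint32_t(p[1]) << 8) | (uint32_t(p[2]) << 16);
}

// Read 4 bytes. The caller must guarantee that p[3] is readable.
static inline uint32_t load4(const uint8_t* p) {
    return load3(p) | (uint32_t(p[3]) << 24);
}

// Write back only the low 3 bytes; the top byte of v is dropped.
static inline void store3(uint8_t* p, uint32_t v) {
    p[0] = uint8_t(v);
    p[1] = uint8_t(v >> 8);
    p[2] = uint8_t(v >> 16);
}

// Hypothetical filter: invert every channel of a bpp=3 scanline in place.
void invert_row(uint8_t* row, size_t pixels) {
    if (pixels == 0) return;

    // Middle of the scanline: the 4th byte belongs to the next pixel,
    // so load4() stays in bounds. Its value is ignored by store3().
    for (size_t i = 0; i + 1 < pixels; i++) {
        uint8_t* p = row + 3 * i;
        store3(p, ~load4(p));
    }

    // End of the scanline: no 4th byte is guaranteed, so use load3().
    uint8_t* last = row + 3 * (pixels - 1);
    store3(last, ~load3(last));
}
```

In a real SSE version the 32-bit value would be moved into a vector register (for example with _mm_cvtsi32_si128) before the filter arithmetic; that step is omitted here to keep the sketch portable.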
2016-04-04 16:10:09 -04:00