ARM NEON optimisations for gf-complete/jerasure/ceph-erasure

Discussion:

Janne Grunau

2014-09-04 14:42:37 UTC

Hi,

I've started writing ARM/AArch64 NEON optimizations for gf-complete.
http://git.jannau.net/gf-complete.git/log/?h=neon has proof-of-concept
AArch64 NEON optimisations for w8.

Implemented methods are so far the carry-less/polynomial multiplication
and the split table. The polynomial multiplication is reasonably fast
for region multiplications (~2000MB/s on an Apple A7 at 1.3GHz) since
NEON has an 8-bit to 16-bit SIMD polynomial multiplication.
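
For readers unfamiliar with the operation, here is a minimal scalar
sketch of what one lane of that 8-bit to 16-bit polynomial
multiplication computes, followed by a reduction back to 8 bits. It is
not the code from the neon branch: the function names are made up for
illustration, and the reduction polynomial 0x11d is an assumption (a
common choice for w=8). On NEON the carry-less product is available
through the vmull_p8 intrinsic, which does eight such multiplications
at once.

```c
#include <stddef.h>
#include <stdint.h>

/* Carry-less (polynomial) multiply of two 8-bit values into a 16-bit
 * product: partial products are combined with XOR instead of addition. */
static uint16_t clmul8(uint8_t a, uint8_t b)
{
    uint16_t product = 0;
    for (int i = 0; i < 8; i++) {
        if (b & (1u << i))
            product ^= (uint16_t)((uint16_t)a << i);
    }
    return product;
}

/* Reduce a product of degree <= 14 modulo x^8 + x^4 + x^3 + x^2 + 1
 * (0x11d, an assumed polynomial) to get back into GF(2^8). */
static uint8_t gf8_reduce(uint16_t product)
{
    for (int bit = 14; bit >= 8; bit--) {
        if (product & (1u << bit))
            product ^= (uint16_t)(0x11du << (bit - 8));
    }
    return (uint8_t)product;
}

/* Hypothetical helper: multiply every byte of src by the constant c. */
void gf8_region_mul_poly(uint8_t *dst, const uint8_t *src, uint8_t c,
                         size_t len)
{
    for (size_t i = 0; i < len; i++)
        dst[i] = gf8_reduce(clmul8(src[i], c));
}
```

The reduction loop is the part that costs extra work per byte, which is
one plausible reason this method ends up slower than a table lookup.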

The split table method is still faster though, 5700MB/s on the same CPU.
I'm actually surprised by that since it is faster (per cycle) than the
Core i7-3770 from gf-complete's manual (page 14). That suggests that
SSE3 code might not be optimal.
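
As a rough illustration of why the split table maps so well onto SIMD:
multiplication by a constant is linear over XOR, so each byte can be
split into two 4-bit halves that are looked up in two 16-entry tables.
A 16-entry byte table fits in one 128-bit register, which is what lets
a table-lookup instruction (pshufb on SSSE3, tbl on AArch64 NEON)
handle 16 bytes per lookup. The scalar sketch below shows only the
arithmetic; the names and the polynomial 0x11d are assumptions, and it
is not taken from gf-complete.

```c
#include <stddef.h>
#include <stdint.h>

/* Bit-by-bit reference multiply in GF(2^8), assumed polynomial 0x11d. */
static uint8_t gf8_mul_ref(uint8_t a, uint8_t b)
{
    uint8_t result = 0;
    while (b) {
        if (b & 1)
            result ^= a;
        a = (uint8_t)((a << 1) ^ ((a & 0x80) ? 0x1d : 0x00));
        b >>= 1;
    }
    return result;
}

/* Hypothetical region multiply using two 16-entry tables: one for the
 * low nibble of each source byte, one for the high nibble. */
void gf8_region_mul_split(uint8_t *dst, const uint8_t *src, uint8_t c,
                          size_t len)
{
    uint8_t low[16], high[16];

    for (int i = 0; i < 16; i++) {
        low[i]  = gf8_mul_ref(c, (uint8_t)i);
        high[i] = gf8_mul_ref(c, (uint8_t)(i << 4));
    }
    /* c * x == c * (x & 0x0f) ^ c * (x & 0xf0) because of linearity. */
    for (size_t i = 0; i < len; i++)
        dst[i] = low[src[i] & 0x0f] ^ high[src[i] >> 4];
}
```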

I'm currently working on integrating NEON into the build system and then
will extend the existing code to work on ARMv7-a too. Those two are
straightforward. There are a couple of other issues I would like to
discuss before I start to work on them.

The #if/#ifdefs in the source are starting to make the source hard to
read when more than one optimization is added. Separating arch-specific
implementations from each other and from the generic implementation
works reasonably well for the multimedia-related projects I have
experience with (libav/FFmpeg, x264). There would be arch-specific init
functions which set the appropriate function pointers. The NEON
optimisations would then live in w8_arm.c, which would only be compiled
for ARM. If someone has another idea for how to avoid the #ifdefs, I'm
open to that too.
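
To make the proposal concrete, a minimal single-file sketch of that
layout, with hypothetical names throughout (gf_w8_ops, gf_w8_init and
friends are not gf-complete's API). In a real tree the ARM part would
sit in its own file and the build system would decide whether to
compile it; the one remaining guard here exists only because the sketch
is a single file, and the "NEON" routine is a placeholder that calls
the C version.

```c
#include <stddef.h>
#include <stdint.h>

/* Table of function pointers filled in by the init functions. */
typedef struct gf_w8_ops {
    const char *impl;
    void (*multiply_region)(uint8_t *dst, const uint8_t *src,
                            uint8_t c, size_t len);
} gf_w8_ops;

/* ---- generic code (would live in w8.c) ---- */
static void multiply_region_c(uint8_t *dst, const uint8_t *src,
                              uint8_t c, size_t len)
{
    for (size_t i = 0; i < len; i++) {
        uint8_t a = src[i], b = c, r = 0;
        while (b) {                 /* shift-and-XOR multiply, 0x11d assumed */
            if (b & 1)
                r ^= a;
            a = (uint8_t)((a << 1) ^ ((a & 0x80) ? 0x1d : 0x00));
            b >>= 1;
        }
        dst[i] = r;
    }
}

static void gf_w8_init_generic(gf_w8_ops *ops)
{
    ops->impl = "generic";
    ops->multiply_region = multiply_region_c;
}

/* ---- ARM code (would live in w8_arm.c, built only for ARM) ---- */
#if defined(__ARM_NEON) || defined(__ARM_NEON__)
static void multiply_region_neon(uint8_t *dst, const uint8_t *src,
                                 uint8_t c, size_t len)
{
    /* Placeholder: the NEON intrinsics would go here. */
    multiply_region_c(dst, src, c, len);
}

static void gf_w8_init_arm(gf_w8_ops *ops)
{
    ops->impl = "neon";
    ops->multiply_region = multiply_region_neon;
}
#endif

/* Common entry point: start generic, let the arch init override. */
void gf_w8_init(gf_w8_ops *ops)
{
    gf_w8_init_generic(ops);
#if defined(__ARM_NEON) || defined(__ARM_NEON__)
    gf_w8_init_arm(ops);
#endif
}
```

Callers only ever go through ops->multiply_region, so the generic file
contains no architecture checks beyond the init hook.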

I'm currently using the SSE/NOSSE region option which is bogus. I'm
wondering whether I should just rename that SIMD/NOSIMD (not really true
since the carry-less operations for w64 and w128 only use the SIMD
instruction set but are single data). That would need to have backward
compatibility for SSE/NOSSE. The other option would be to add
NEON/NONEON flags.
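
One way the backward compatibility could look, sketched with made-up
names and flag values (this is not gf-complete's actual option parser):
the old spellings stay accepted as aliases and map onto the same flags
as the new ones.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical flag values, chosen only for this sketch. */
enum {
    REGION_SIMD   = 1 << 0,
    REGION_NOSIMD = 1 << 1
};

/* Map a region option name to its flag; returns 0 for unknown names. */
int parse_region_option(const char *name)
{
    static const struct {
        const char *name;
        int flag;
    } options[] = {
        { "SIMD",   REGION_SIMD   },
        { "NOSIMD", REGION_NOSIMD },
        { "SSE",    REGION_SIMD   },  /* legacy alias */
        { "NOSSE",  REGION_NOSIMD },  /* legacy alias */
    };

    for (size_t i = 0; i < sizeof(options) / sizeof(options[0]); i++) {
        if (strcmp(name, options[i].name) == 0)
            return options[i].flag;
    }
    return 0;
}
```

With separate NEON/NONEON flags the table would instead grow by two
entries with their own flag values.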

I'm sure I'll find other issues to discuss when I start integrating the
NEON optimisations into jerasure and ceph.

thanks

Janne
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to ***@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html

Loic Dachary

2014-09-04 15:21:38 UTC

Hi Janne,

This is great news :-) Added Ethan & Kevin to the discussion.

Cheers

--
Loïc Dachary, Artisan Logiciel Libre

Janne Grunau

2014-09-18 10:11:28 UTC

Hi Kevin,

> I feel that separating the arch-specific implementations out and having a
> default 'generic' implementation would be a huge improvement. Note that
> gf-complete was in active development for some time before including the
> SIMD code. In hindsight, we should have done this separation back in 2012,
> but had some time pressure due to a paper deadline and limited time
> available to the contributors.
> Also, I agree w.r.t. the preprocessor stuff. Going with SIMD/NOSIMD is
> fine by me.

I'll rename that and start implementing the NEON-optimized functions in
their own files.

> Also, there should be very little "SIMD" work with jerasure, as gf-complete
> is the Galois field backend, so I would not worry too much about that.

I noticed that too: I have already hooked my NEON code into ceph locally
without touching jerasure.

> That covers "clean-up" work. We can discuss the best way to choose the
> underlying implementation (looks like we have a bunch of options) as this
> work is completed.
> With this in mind, what work were you planning to do? I can try to free up
> cycles to help, but that may not happen for a few weeks.

Primarily NEON optimisations for gf-complete/ceph. Shouldn't take more
than a few days though.

> One last thing... If you do have code you want to push upstream, please
> submit a pull request(s) to our main bitbucket repo.
> Make sense?

Yes, thanks.

Janne

Janne Grunau

2014-10-10 14:01:09 UTC

Hi Kevin,

I created a pull request with my neon optimisations, the SSE -> SIMD
rename and some minor fixes.

The NEON methods all reside in their own files. I didn't come up with a
good solution for the init / scratch_size functions, so I added
arm-specific defines there.

> Also, there should be very little "SIMD" work with jerasure, as gf-complete
> is the Galois field backend, so I would not worry too much about that.

Yes, there was no SIMD work in jerasure.

Please have a look at
https://bitbucket.org/jimplank/gf-complete/pull-request/25/arm-neon-optimisations
I'll be available to address review comments and suggestions.

regards

Janne

Loic Dachary

2014-09-04 15:57:55 UTC

Hi Janne,

Would it be possible to make use of ifunc ( https://gcc.gnu.org/onlinedocs/gcc-4.7.2/gcc/Function-Attributes.html#index-g_t_0040code_007bifunc_007d-attribute-2529 ) to choose the function depending on CPU features?

http://gcc.gnu.org/onlinedocs/gcc-4.8.2/gcc/i386-and-x86-64-Options.html#i386-and-x86-64-Options

http://www.spinics.net/lists/ceph-devel/msg18452.html
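
A minimal sketch of what that could look like, assuming GCC or clang on
an ELF/glibc system (ifunc is not available elsewhere, hence the
fallback branch). All names are hypothetical and region_xor_words
merely stands in for a real SIMD routine. The x86 check uses the
compiler builtin __builtin_cpu_supports; an ARM resolver would have to
inspect the hwcaps instead, which this sketch does not attempt.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

typedef void region_xor_fn(uint8_t *dst, const uint8_t *src, size_t len);

/* Baseline: one byte at a time. */
static void region_xor_bytes(uint8_t *dst, const uint8_t *src, size_t len)
{
    for (size_t i = 0; i < len; i++)
        dst[i] ^= src[i];
}

/* Stand-in for a SIMD version: eight bytes at a time, then the tail. */
static void region_xor_words(uint8_t *dst, const uint8_t *src, size_t len)
{
    size_t i = 0;
    for (; i + 8 <= len; i += 8) {
        uint64_t d, s;
        memcpy(&d, dst + i, 8);
        memcpy(&s, src + i, 8);
        d ^= s;
        memcpy(dst + i, &d, 8);
    }
    region_xor_bytes(dst + i, src + i, len - i);
}

/* Resolver: returns the implementation the symbol gets bound to. */
static region_xor_fn *resolve_region_xor(void)
{
#if defined(__x86_64__) || defined(__i386__)
    __builtin_cpu_init();           /* required before cpu_supports here */
    if (!__builtin_cpu_supports("ssse3"))
        return region_xor_bytes;
#endif
    /* Other architectures: no feature check in this sketch. */
    return region_xor_words;
}

#if defined(__GNUC__) && defined(__ELF__) && defined(__GLIBC__)
/* The dynamic linker calls the resolver once and binds region_xor. */
void region_xor(uint8_t *dst, const uint8_t *src, size_t len)
    __attribute__((ifunc("resolve_region_xor")));
#else
/* No ifunc support: resolve on every call instead. */
void region_xor(uint8_t *dst, const uint8_t *src, size_t len)
{
    resolve_region_xor()(dst, src, len);
}
#endif
```

Compared with explicit function pointers, callers see an ordinary
function, at the price of depending on a toolchain-specific extension.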

Cheers

--
Loïc Dachary, Artisan Logiciel Libre

Ethan L. Miller

2014-09-05 00:27:16 UTC

Yes, it's possible to use CPU flags to allow the use of advanced
instruction sets automatically. The difficulty is that, if those
instructions aren't available, it's not clear which of the "basic"
approaches to use, since performance can vary based on a lot of
factors. Even with advanced instructions, there are often multiple
reasonable approaches to take, as Janne's email makes clear, so it's
impossible to say "this algorithm is always best".

We can certainly set up a default approach if we want, though, that
can be overridden by compile-time flags.
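
A small sketch of such a default with a compile-time override. The
enum, the function and the GF_FORCE_METHOD macro are hypothetical, and
the particular defaults (split table with SIMD, plain table without)
are placeholders rather than a recommendation.

```c
/* Hypothetical method identifiers for this sketch. */
typedef enum {
    GF_METHOD_TABLE,
    GF_METHOD_SPLIT_TABLE,
    GF_METHOD_CARRY_FREE
} gf_method;

/* Pick the default method. Building with, for example,
 * -DGF_FORCE_METHOD=GF_METHOD_CARRY_FREE overrides the choice. */
gf_method gf_default_method(int have_simd)
{
#ifdef GF_FORCE_METHOD
    (void)have_simd;
    return GF_FORCE_METHOD;
#else
    return have_simd ? GF_METHOD_SPLIT_TABLE : GF_METHOD_TABLE;
#endif
}
```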

Incidentally, I'm starting to work on coding a version of gf-complete
(and associated erasure coding functions) in C++ using templates,
which will hopefully allow us to better separate out different
implementations. We could still have run-time dispatch for the
desired routines, but templates should allow for more compact code and
better isolation of architecture-specific code. The big drawback is
that C++ code isn't typically used in the kernel....

ethan
