Discussion:
ARM NEON optimisations for gf-complete/jerasure/ceph-erasure
Janne Grunau
2014-09-04 14:42:37 UTC
Hi,

I've started writing ARM/AArch64 NEON optimizations for gf-complete.
http://git.jannau.net/gf-complete.git/log/?h=neon has proof of concept
AArch64 NEON optimisations for w8.

The methods implemented so far are the carry-less/polynomial multiplication
and the split table. The polynomial multiplication is reasonably fast
for region multiplications (~2000MB/s on an Apple A7 at 1.3GHz) since
NEON has an 8-bit to 16-bit SIMD polynomial multiplication.
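
For readers unfamiliar with the technique, here is a minimal scalar sketch of what a carry-less region multiplication for w8 computes. It is not the NEON code from the branch above: the function names are made up for illustration, and the field polynomial 0x11d is an assumption. The inner `clmul8` corresponds to what one lane of an 8-bit to 16-bit polynomial multiply produces.

```c
#include <stddef.h>
#include <stdint.h>

/* Field polynomial x^8 + x^4 + x^3 + x^2 + 1; an assumption of this sketch. */
#define SKETCH_GF8_POLY 0x11d

/* Carry-less multiply of two 8-bit polynomials into a 16-bit product:
 * partial products are combined with XOR instead of addition. */
static uint16_t clmul8(uint8_t a, uint8_t b)
{
    uint16_t product = 0;
    for (int bit = 0; bit < 8; bit++) {
        if (b & (1u << bit))
            product ^= (uint16_t)(a << bit);
    }
    return product;
}

/* Reduce the (at most 15-bit) product modulo the field polynomial,
 * clearing one high bit per iteration. */
static uint8_t gf8_reduce(uint16_t product)
{
    for (int bit = 14; bit >= 8; bit--) {
        if (product & (1u << bit))
            product ^= (uint16_t)(SKETCH_GF8_POLY << (bit - 8));
    }
    return (uint8_t)product;
}

/* dst[i] = src[i] * val in GF(2^8) for a whole region (hypothetical name). */
static void sketch_clm_region(uint8_t *dst, const uint8_t *src,
                              size_t len, uint8_t val)
{
    for (size_t i = 0; i < len; i++)
        dst[i] = gf8_reduce(clmul8(src[i], val));
}
```

A SIMD version would do the multiply for 8 or 16 bytes at once and replace the bit-by-bit reduction with a few further polynomial multiplies.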

The split table method is still faster though, 5700MB/s on the same CPU.
I'm actually surprised by that since it is faster (per cycle) than the
Core i7-3770 from gf-complete's manual (page 14). That suggests that
SSE3 code might not be optimal.
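
As a rough illustration of the split table idea for w8, the sketch below multiplies a region by a constant using two 16-entry tables, one per nibble of the source byte. It is plain C with hypothetical names and an assumed polynomial of 0x11d, not the code from the branch; a SIMD implementation would typically perform the two lookups with a byte table-lookup/shuffle instruction on a whole vector at a time.

```c
#include <stddef.h>
#include <stdint.h>

/* Bitwise GF(2^8) multiply, used only to fill the tables.
 * The polynomial 0x11d is an assumption of this sketch. */
static uint8_t gf8_mul_slow(uint8_t a, uint8_t b)
{
    uint8_t product = 0;
    while (b) {
        if (b & 1)
            product ^= a;
        /* Multiply a by x and reduce if bit 7 was shifted out. */
        a = (uint8_t)((a << 1) ^ ((a & 0x80) ? 0x1d : 0));
        b >>= 1;
    }
    return product;
}

/* Two 16-entry tables: products of val with every low and high nibble. */
struct split_tables {
    uint8_t low[16];
    uint8_t high[16];
};

static void split_tables_init(struct split_tables *t, uint8_t val)
{
    for (int i = 0; i < 16; i++) {
        t->low[i]  = gf8_mul_slow(val, (uint8_t)i);
        t->high[i] = gf8_mul_slow(val, (uint8_t)(i << 4));
    }
}

/* dst[i] = src[i] * val, or dst[i] ^= src[i] * val when xor_dst is set.
 * Multiplication distributes over XOR, so the two nibble products
 * can simply be XORed together. */
static void split_mul_region(uint8_t *dst, const uint8_t *src, size_t len,
                             uint8_t val, int xor_dst)
{
    struct split_tables t;
    split_tables_init(&t, val);
    for (size_t i = 0; i < len; i++) {
        uint8_t p = (uint8_t)(t.low[src[i] & 0x0f] ^ t.high[src[i] >> 4]);
        dst[i] = xor_dst ? (uint8_t)(dst[i] ^ p) : p;
    }
}
```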

I'm currently working on integrating NEON into the build system and will
then extend the existing code to work on ARMv7-A too. Those two are
straightforward. There are a couple of other issues I would like to
discuss before I start to work on them.

The #if/#ifdefs in the source are starting to make the source hard to
read when more than one optimization is added. Separating arch-specific
implementations from each other and from the generic implementation
works reasonably well for the multimedia-related projects I have
experience with (libav/FFmpeg, x264). There would be arch-specific init
functions which set the appropriate function pointers. The NEON
optimisations would then live in w8_arm.c, which would only be compiled
for ARM. If someone has another idea how to avoid the #ifdefs, I'm open
to that too.
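
A minimal sketch of that layout, squeezed into one file for illustration. All names (`struct w8_ops`, `w8_init`, `w8_init_arm`) are hypothetical and not gf-complete's actual API, and the "NEON" function is only a placeholder so the sketch compiles anywhere. In a real tree the ARM part would sit in its own file selected by the build system, leaving a single guarded call in the common init.

```c
#include <stddef.h>
#include <stdint.h>

typedef void (*w8_region_fn)(uint8_t *dst, const uint8_t *src,
                             size_t len, uint8_t val);

/* Hypothetical dispatch table filled in once at init time. */
struct w8_ops {
    w8_region_fn multiply_region;
    const char *impl_name;
};

/* --- generic implementation (would live in the generic w8 source) --- */

/* Bitwise GF(2^8) multiply; the polynomial 0x11d is an assumption. */
static uint8_t w8_mul(uint8_t a, uint8_t b)
{
    uint8_t product = 0;
    while (b) {
        if (b & 1)
            product ^= a;
        a = (uint8_t)((a << 1) ^ ((a & 0x80) ? 0x1d : 0));
        b >>= 1;
    }
    return product;
}

static void w8_multiply_region_c(uint8_t *dst, const uint8_t *src,
                                 size_t len, uint8_t val)
{
    for (size_t i = 0; i < len; i++)
        dst[i] = w8_mul(src[i], val);
}

/* --- ARM implementation (would live in w8_arm.c, built only for ARM) --- */
#if defined(__arm__) || defined(__aarch64__)
static void w8_multiply_region_neon(uint8_t *dst, const uint8_t *src,
                                    size_t len, uint8_t val)
{
    /* Placeholder: real code would use NEON intrinsics or assembly. */
    w8_multiply_region_c(dst, src, len, val);
}

static void w8_init_arm(struct w8_ops *ops)
{
    ops->multiply_region = w8_multiply_region_neon;
    ops->impl_name = "neon";
}
#endif

/* --- common init: start from generic, let the arch init override --- */
static void w8_init(struct w8_ops *ops)
{
    ops->multiply_region = w8_multiply_region_c;
    ops->impl_name = "c";
#if defined(__arm__) || defined(__aarch64__)
    w8_init_arm(ops);
#endif
}
```

Callers only ever go through `ops.multiply_region`, so adding another architecture means adding one file and one init call rather than more conditionals in every function.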

I'm currently using the SSE/NOSSE region option, which is bogus. I'm
wondering whether I should just rename that to SIMD/NOSIMD (not really
accurate, since the carry-less operations for w64 and w128 only use the
SIMD instruction set but operate on single data). That would need
backward compatibility for SSE/NOSSE. The other option would be to add
NEON/NONEON flags.
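
One way the rename with backward compatibility could look is sketched below. The option strings SIMD/NOSIMD/SSE/NOSSE are the ones discussed here; the enum and the parsing function are hypothetical and only illustrate accepting the old spellings as aliases.

```c
#include <string.h>

/* Hypothetical region option values for this sketch. */
enum sketch_region_opt {
    SKETCH_REGION_INVALID = -1,
    SKETCH_REGION_SIMD    = 1,
    SKETCH_REGION_NOSIMD  = 2
};

/* Parse a region option, accepting the old SSE/NOSSE spellings
 * as aliases for the new SIMD/NOSIMD names. */
static enum sketch_region_opt sketch_parse_region_opt(const char *arg)
{
    if (strcmp(arg, "SIMD") == 0 || strcmp(arg, "SSE") == 0)
        return SKETCH_REGION_SIMD;
    if (strcmp(arg, "NOSIMD") == 0 || strcmp(arg, "NOSSE") == 0)
        return SKETCH_REGION_NOSIMD;
    return SKETCH_REGION_INVALID;
}
```

Existing command lines and scripts that pass SSE or NOSSE keep working, while new documentation only needs to mention the architecture-neutral names.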

I'm sure I'll find other issues to discuss when I start integrating the
NEON optimisations into jerasure and ceph.

thanks

Janne
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to ***@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Loic Dachary
2014-09-04 15:21:38 UTC
Hi Janne,

This is great news :-) Added Ethan & Kevin to the discussion.

Cheers
--
Loïc Dachary, Artisan Logiciel Libre
Janne Grunau
2014-09-18 10:11:28 UTC
Hi Kevin,
Post by Kevin
I feel that separating the arch-specific implementations out and having a
default 'generic' implementation would be a huge improvement. Note that
gf-complete was in active development for some time before including the
SIMD code. In hindsight, we should have done this separation back in 2012,
but had some time pressure due to a paper deadline and limited time
available to the contributors.
Also, I agree w.r.t. the preprocessor stuff. Going with SIMD/NOSIMD is
fine by me.
I'll rename that and start implementing the NEON-optimized functions in
their own files.
Post by Kevin
Also, there should be very little "SIMD" work with jerasure, as gf-complete
is the Galois field backend, so I would not worry too much about that.
I noticed. I have already hooked my NEON code into ceph locally without
touching jerasure.
Post by Kevin
That covers "clean-up" work. We can discuss the best way to choose the
underlying implementation (looks like we have a bunch of options) as this
work is completed.
With this in mind, what work were you planning to do? I can try to free up
cycles to help, but that may not happen for a few weeks.
Primarily NEON optimisations for gf-complete/ceph. It shouldn't take more
than a few days though.
Post by Kevin
One last thing... If you do have code you want to push upstream, please
submit a pull request(s) to our main bitbucket repo.
Make sense?
Yes, thanks.

Janne
Janne Grunau
2014-10-10 14:01:09 UTC
Hi Kevin,
Post by Kevin
I feel that separating the arch-specific implementations out and having a
default 'generic' implementation would be a huge improvement. Note that
gf-complete was in active development for some time before including the
SIMD code. In hindsight, we should have done this separation back in 2012,
but had some time pressure due to a paper deadline and limited time
available to the contributors.
Also, I agree w.r.t. the preprocessor stuff. Going with SIMD/NOSIMD is
fine by me.
I created a pull request with my NEON optimisations, the SSE -> SIMD
rename and some minor fixes.

The NEON methods all reside in their own files. I didn't come up with a
good solution for the init / scratch_size functions, so I added
ARM-specific defines there.
Post by Kevin
Also, there should be very little "SIMD" work with jerasure, as gf-complete
is the Galois field backend, so I would not worry too much about that.
Yes, there was no SIMD work in jerasure.

Please have a look at
https://bitbucket.org/jimplank/gf-complete/pull-request/25/arm-neon-optimisations
I'll be available to address review comments and suggestions.

regards

Janne
Loic Dachary
2014-09-04 15:57:55 UTC
Hi Janne,
Post by Janne Grunau
The #if/#ifdefs in the source are starting to make the source hard to
read when more than one optimization is added. Separating arch-specific
implementations from each other and from the generic implementation
works reasonably well for the multimedia-related projects I have
experience with (libav/FFmpeg, x264). There would be arch-specific init
functions which set the appropriate function pointers. The NEON
optimisations would then live in w8_arm.c, which would only be compiled
for ARM. If someone has another idea how to avoid the #ifdefs, I'm open
to that too.
Would it be possible to make use of ifunc ( https://gcc.gnu.org/onlinedocs/gcc-4.7.2/gcc/Function-Attributes.html#index-g_t_0040code_007bifunc_007d-attribute-2529 ) to choose the function depending on CPU features?

http://gcc.gnu.org/onlinedocs/gcc-4.8.2/gcc/i386-and-x86-64-Options.html#i386-and-x86-64-Options

http://www.spinics.net/lists/ceph-devel/msg18452.html
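
For reference, a small sketch of what ifunc-based selection could look like. The function names are invented for illustration, a plain XOR of two regions stands in for a real Galois field routine, and the "wide" variant is ordinary C rather than SIMD. The ifunc attribute needs GCC (or a compatible compiler) on an ELF target with loader support, so the sketch falls back to an ordinary wrapper elsewhere.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

typedef void (*xor_region_fn)(uint8_t *dst, const uint8_t *src, size_t len);

/* Baseline variant: one byte at a time. */
static void xor_region_generic(uint8_t *dst, const uint8_t *src, size_t len)
{
    for (size_t i = 0; i < len; i++)
        dst[i] ^= src[i];
}

/* Stand-in for an optimised variant: eight bytes at a time. */
static void xor_region_wide(uint8_t *dst, const uint8_t *src, size_t len)
{
    size_t i = 0;
    for (; i + 8 <= len; i += 8) {
        uint64_t d, s;
        memcpy(&d, dst + i, 8);
        memcpy(&s, src + i, 8);
        d ^= s;
        memcpy(dst + i, &d, 8);
    }
    for (; i < len; i++)
        dst[i] ^= src[i];
}

/* Resolver: with ifunc it runs once, when the symbol is bound at load time. */
static xor_region_fn sketch_resolve_xor_region(void)
{
#if defined(__x86_64__) && defined(__GNUC__) && defined(__linux__)
    __builtin_cpu_init();              /* required before cpu_supports here */
    if (__builtin_cpu_supports("sse2"))
        return xor_region_wide;
#endif
    return xor_region_generic;
}

#if defined(__has_attribute)
#  if __has_attribute(ifunc) && defined(__ELF__)
#    define SKETCH_HAVE_IFUNC 1
#  endif
#endif

#ifdef SKETCH_HAVE_IFUNC
/* The dynamic loader calls the resolver and binds the symbol to its result. */
void sketch_xor_region(uint8_t *dst, const uint8_t *src, size_t len)
    __attribute__((ifunc("sketch_resolve_xor_region")));
#else
/* Fallback without ifunc: resolve on every call. */
void sketch_xor_region(uint8_t *dst, const uint8_t *src, size_t len)
{
    sketch_resolve_xor_region()(dst, src, len);
}
#endif
```

Callers simply call `sketch_xor_region()`; no function pointer is visible in the API, but the choice is fixed at load time and cannot be overridden per call.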

Cheers
--
Loïc Dachary, Artisan Logiciel Libre
Ethan L. Miller
2014-09-05 00:27:16 UTC
Yes, it's possible to use CPU flags to allow the use of advanced
instruction sets automatically. The difficulty is that, if those
instructions aren't available, it's not clear which of the "basic"
approaches to use, since performance can vary based on a lot of
factors. Even with advanced instructions, there are often multiple
reasonable approaches to take, as Janne's email makes clear, so it's
impossible to say "this algorithm is always best".

We can certainly set up a default approach if we want, though, that
can be overridden by compile-time flags.

Incidentally, I'm starting to work on coding a version of gf-complete
(and associated erasure coding functions) in C++ using templates,
which will hopefully allow us to better separate out different
implementations. We could still have run-time dispatch for the
desired routines, but templates should allow for more compact code and
better isolation of architecture-specific code. The big drawback is
that C++ code isn't typically used in the kernel....

ethan
--
( Ethan L. Miller Email: ***@cs.ucsc.edu )
( Professor, Computer Science Web: http://www.cs.ucsc.edu/~elm/ )
( University of California Phone: +1 831 459-1222 )
( Santa Cruz, CA 95064 USA Fax: +1 831 459-1041 )
( PGP keyprint: 76C7 D699 1FF6 A1A4 B7A1 9629 2EBF 1273 A6ED 6A09 )
Janne Grunau
2014-09-05 07:51:30 UTC
Hi,
Post by Ethan L. Miller
Yes, it's possible to use CPU flags to allow the use of advanced
instruction sets automatically.
Runtime detection of supported instruction sets is possible with the
current function pointer approach too.
Post by Ethan L. Miller
The difficulty is that, if those
instructions aren't available, it's not clear which of the "basic"
approaches to use, since performance can vary based on a lot of
factors. Even with advanced instructions, there are often multiple
reasonable approaches to take, as Janne's email makes clear, so it's
impossible to say "this algorithm is always best".
I agree that the current approach fits the model of implementations with
different cpu/memory use better. Using ifunc would be mostly orthogonal
to the issue of badly structured code.
Post by Ethan L. Miller
We can certainly set up a default approach if we want, though, that
can be overridden by compile-time flags.
I don't think this would be an improvement.
Post by Ethan L. Miller
Incidentally, I'm starting to work on coding a version of gf-complete
(and associated erasure coding functions) in C++ using templates,
which will hopefully allow us to better separate out different
implementations. We could still have run-time dispatch for the
desired routines, but templates should allow for more compact code and
better isolation of architecture-specific code. The big drawback is
that C++ code isn't typically used in the kernel....
One possible simplification for the carry-less multiplication would be
relying on inlining and optimisation of compile-time constants.

Implement one function which does a variable number of polynomial
reductions. The current functions would then just be thin wrappers which
call the general function with a compile time constant for the number of
reductions. Forced inlining and dead code removal will optimize branches
away. The same method could be used to avoid the duplication of the
inner loop for the optional xor with the destination.
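
A scalar sketch of that idea for w8, with hypothetical names and an assumed polynomial of 0x11d (low byte 0x1d). The general function takes the number of reductions and the xor flag as parameters; the thin wrappers pass constants, so after forced inlining the compiler can unroll the reduction loop and drop the unused branch.

```c
#include <stddef.h>
#include <stdint.h>

#if defined(__GNUC__)
#define SKETCH_FORCE_INLINE static inline __attribute__((always_inline))
#else
#define SKETCH_FORCE_INLINE static inline
#endif

/* Carry-less 8-bit x 8-bit -> 16-bit multiply (scalar stand-in). */
SKETCH_FORCE_INLINE uint16_t sketch_clmul8(uint8_t a, uint8_t b)
{
    uint16_t product = 0;
    for (int bit = 0; bit < 8; bit++) {
        if (b & (1u << bit))
            product ^= (uint16_t)(a << bit);
    }
    return product;
}

/* General function: `reductions` and `xor_dst` are meant to be
 * compile-time constants at every call site. Each reduction folds the
 * high byte back in, using x^8 = poly_low modulo the field polynomial. */
SKETCH_FORCE_INLINE void sketch_clm_region(uint8_t *dst, const uint8_t *src,
                                           size_t len, uint8_t val,
                                           uint8_t poly_low,
                                           int reductions, int xor_dst)
{
    for (size_t i = 0; i < len; i++) {
        uint16_t p = sketch_clmul8(src[i], val);
        for (int r = 0; r < reductions; r++)
            p = (uint16_t)((p & 0xff) ^
                           sketch_clmul8((uint8_t)(p >> 8), poly_low));
        if (xor_dst)
            dst[i] ^= (uint8_t)p;
        else
            dst[i] = (uint8_t)p;
    }
}

/* Thin wrappers: two reductions are enough for 0x11d, because its low
 * part has degree 4 and each step lowers the excess degree by 4. */
static void sketch_clm_region_2(uint8_t *dst, const uint8_t *src,
                                size_t len, uint8_t val)
{
    sketch_clm_region(dst, src, len, val, 0x1d, 2, 0);
}

static void sketch_clm_region_2_xor(uint8_t *dst, const uint8_t *src,
                                    size_t len, uint8_t val)
{
    sketch_clm_region(dst, src, len, val, 0x1d, 2, 1);
}
```

Variants for polynomials that need three or four reductions would be further one-line wrappers around the same general function.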

Janne