Discussion:
fio read and randread cpu usage results for qemu and host machine
Alexandre DERUMIER
2014-10-23 12:30:40 UTC
Hi,

I have done fio tests on multiple qemu setups and on the host machine,
to see if librbd uses more cpu than krbd,
and also to find the best qemu optimisations.

The fio tests were run with blocksize=4K, read and randread, with numjobs=32.
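For reference, the jobs look roughly like this (the device path, pool/image name and runtime are placeholders, not the exact values I used):

# krbd: the image is mapped with "rbd map" and tested as a block device
fio --name=krbd-randread --filename=/dev/rbd0 --direct=1 \
    --ioengine=libaio --rw=randread --bs=4k --numjobs=32 \
    --runtime=60 --group_reporting

# librbd: the same image tested through fio's rbd engine
fio --name=librbd-randread --ioengine=rbd --clientname=admin \
    --pool=rbd --rbdname=testimg \
    --rw=randread --bs=4k --numjobs=32 \
    --runtime=60 --group_reporting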





1) First, I have done more tests with qemu optimisations, like iothread (dataplane) for the virtio-blk disk,
and num_queue=8 for virtio-scsi disks.

for virtio iothread:
--------------------
It works only with qemu + krbd, and only for sequential reads.
It also seems that qemu aggregates the reads into bigger ceph reads (I see 4x more iops in fio than in ceph, with the same bandwidth).


for virtio-scsi num_queue = 8:
-------------------------------
works with both krbd and librbd

- for random read : I jump from 7000 to 12000 iops
- for sequential : qemu aggregates the reads into bigger ceph reads (same as above, I see 4x more iops in fio than in ceph, with the same bandwidth).


So it seems to be useful for some specific workloads (a sketch of the qemu options is below).
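Roughly, the qemu options behind these two setups look like this (the ids, drive definitions and paths are assumptions, and the actual virtio-scsi property is spelled num_queues):

# virtio-blk with a dedicated iothread (dataplane), on a krbd-mapped device
qemu-system-x86_64 ... \
    -object iothread,id=iothread0 \
    -drive file=/dev/rbd0,if=none,id=drive0,format=raw,cache=none \
    -device virtio-blk-pci,drive=drive0,iothread=iothread0

# virtio-scsi with 8 request queues, here on a librbd drive
qemu-system-x86_64 ... \
    -device virtio-scsi-pci,id=scsi0,num_queues=8 \
    -drive file=rbd:rbd/testimg,if=none,id=drive1,format=raw,cache=none \
    -device scsi-hd,drive=drive1,bus=scsi0.0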



2) Now, about cpu usage : it seems that librbd uses really more cpu than krbd:

on the host, librbd uses 4x more cpu than krbd
in qemu, librbd uses 2x more cpu than krbd


So, what could explain such a big difference between the two?


Regards,

Alexandre



fio iops seq read summary results
----------------------------------
qemu virtio iothread krbd vs qemu virtio iothread librbd : 27000 iops vs 15000 iops
qemu virtio krbd vs qemu virtio librbd : 19000 iops vs 15000 iops
qemu virtio-scsi krbd vs qemu virtio-scsi librbd : 50000 iops vs 48000 iops
host krbd vs host librbd : 36000 iops vs 25000 iops



fio iops randread summary results
------------------------------
qemu virtio iothread krbd vs qemu virtio iothread librbd : 15000 iops vs 14000 iops
qemu virtio krbd vs qemu virtio librbd : 14000 iops vs 15000 iops
qemu virtio-scsi krbd vs qemu virtio-scsi librbd : 7500 iops vs 12000 iops
host krbd vs host librbd : 38000 iops vs 25000 iops


cpu usage ratio summary
------------------------
qemu virtio krbd vs qemu virtio librbd : 2x more cpu usage for librbd
qemu virtio-scsi krbd vs qemu virtio-scsi librbd : 2x more cpu usage for librbd
host krbd vs host librbd : 4x more cpu usage for librbd (fio rbd engine)






RESULTS
-------

host + fio + krbd
-------------------
read
-----
fio : 142.9 MB/s, 36000 iops
ceph : 134 MB/s rd, 34319 op/s

fio : 70,4% cpu - kworker : 93,9% cpu = 164% cpu

100% cpu : 21000 iops
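(For clarity, the "100% cpu" lines in these results are roughly the ceph op/s normalized to one fully-used core, i.e. op/s divided by total cpu/100; a quick sketch of the arithmetic:)

# 34319 op/s at 164% total cpu  ->  op/s per 100% of one core
echo "34319 164" | awk '{ printf "%.0f iops per 100%% cpu\n", $1 / ($2 / 100) }'
# prints ~20926, i.e. the ~21000 reported above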

randread
--------
fio : 151 MB/s, 38000 iops
ceph : 148 MB/s rd, 37932 op/s

fio : 80% cpu - kworker : 110,3% cpu = 180% cpu

100% cpu : 21000 iops




host + fio rbd engine :
---------------------
randread (cpu bound)
--------------------
fio : 25000 iops
ceph : 99636 kB/s rd, 24909 op/s

fio : 460% cpu

100% cpu : 5415 iops

read (cpu bound)
-----------------
fio : 25000 iops
ceph : 94212 kB/s rd, 23553 op/s

fio : 480% cpu

100% cpu : 5323 iops





qemu + krbd + virtio + iothread
---------------------------------
read
----
fio : 107 MB/s, 27000 iops >>> SEEMS THAT QEMU AGGREGATES READ ops
ceph : 93942 kB/s rd, 12430 op/s

kvm : 130% cpu - kworker : 41,2% cpu = 171,2% cpu

100% cpu ratio : 7260 iops

randread
--------
fio : 60 MB/s, 15000 iops
ceph : 54400 kB/s rd, 13600 op/s

kvm : 95,0% cpu - kworker : 42,1% cpu = 137,1% cpu

100% cpu ratio : 9919 iops





qemu + krbd + virtio
----------------------
read
-----
fio : 70 MB/s, 19000 iops
ceph : 75705 kB/s rd, 18926 op/s
kvm : 164% cpu - kworker : 48,5% cpu = 212,5% cpu

100% cpu ratio : 8906 iops

randread
--------
fio : 54 MB/s, 14000 iops
ceph : 54800 kB/s rd, 13700 op/s
kvm : 103% cpu - kworker : 41,2% cpu = 144,2% cpu

100% cpu ratio : 9513 iops



qemu + krbd + virtio-scsi (num_queue 8)
--------------------------------------
read
----
fio : 200 MB/s, 50000 iops >>> SEEMS THAT QEMU AGGREGATES READ ops
ceph : 205 MB/s rd, 7648 op/s

kvm : 145% cpu - kworker : 46,5% cpu = 191,5% cpu

100% cpu ratio : 3993 iops

randread
--------
fio : 30 MB/s, 7500 iops
ceph : 29318 kB/s rd, 7329 op/s
kvm : 150% cpu - kworker : 21,4% cpu = 171,4% cpu

100% cpu ratio : 4275 iops




qemu + librbd + virtio + iothread
----------------------------------
read
----
fio : 60 MB/s, 15000 iops
ceph : 56199 kB/s rd, 14052 op/s

kvm : 300% cpu

100% cpu : 4666 iops


randread
--------
fio : 56 MB/s, 14000 iops
ceph : 55916 kB/s rd, 13979 op/s

kvm : 300% cpu

100% cpu : 4659 iops



qemu + librbd + virtio
-------------------------
read
-----
fio : 60 MB/s, 15000 iops
ceph : 63021 kB/s rd, 15755 op/s

kvm : 300% cpu

100% cpu : 5233 iops

randread
--------
fio : 60 MB/s, 15000 iops
ceph : 55916 kB/s rd, 13979 op/s

kvm : 300% cpu

100% cpu : 4659 iops


qemu + librbd + virtio-scsi (num_queue 8)
----------------------------------------
read
----
fio : 256 MB/s, 48000 iops >>> SEEMS THAT QEMU AGGREGATES READ ops
ceph : 244 MB/s rd, 12002 op/s

kvm : 300% cpu

100% cpu : 4000 iops

randread
--------
fio : 12000 iops
ceph : 47511 kB/s rd, 11877 op/s

kvm : 300% cpu

100% cpu : 3959 iops




Mark Nelson
2014-10-23 13:07:29 UTC
Post by Alexandre DERUMIER
[...]
2) Now, about cpu usage : it seems that librbd uses really more cpu than krbd:
on the host, librbd uses 4x more cpu than krbd
in qemu, librbd uses 2x more cpu than krbd
So, what could explain such a big difference between the two?
Hi Alexandre, do you have access to perf? Especially on newer
kernels/distributions where you can use dwarf symbols, perf can give you
a lot of information about what is using CPU where. vtune is also a
nice option if you have a license.
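Something like this is what I have in mind (just a sketch; the 30-second sampling window and the exact options are examples, not a prescription):

# system-wide sampling with dwarf call graphs while the benchmark runs
perf record -a --call-graph dwarf -- sleep 30
# then browse the report and expand the hot call chains
perf report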
Alexandre DERUMIER
2014-10-23 15:20:57 UTC
Post by Mark Nelson
Hi Alexandre, do you have access to perf?
yes. I can send the perf reports if you want. (I'm not sure how to analyze the results)

Post by Mark Nelson
Especially on newer kernels/distributions where you can use dwarf symbols,
perf can give you a lot of information about what is using CPU where.
I'm using kernel 3.16, so I think it'll be ok.

Post by Mark Nelson
vtune is also a nice option if you have a license.
Don't have it, sorry.




BTW, about the seq read results with virtio iothread and virtio-scsi multiqueue:
it seems that they enable linux queue merging.
I tried to disable it with echo 2 > /sys/block/sda/queue/nomerges,
and I got the same result as without multiqueue.
So it's more of a qemu optimisation,
but it does not improve the real maximum iops we can reach from rbd.
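For reference, the merge check looks roughly like this (sda is the virtio-scsi disk as seen inside the guest; adjust the device name to your setup):

# inside the guest: disable request merging on the benchmarked disk
echo 2 > /sys/block/sda/queue/nomerges
# verify the setting, and watch the merge counters during a fio run
cat /sys/block/sda/queue/nomerges
cat /sys/block/sda/stat    # 2nd and 6th fields are reads/writes merged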


Alexandre DERUMIER
2014-10-23 21:38:45 UTC
I have redone the perf reports with the call tree and with the missing debug symbols installed.
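Roughly, something like this (the -dbg package names are the Debian ones and only an assumption here; the 30s sampling window is arbitrary):

# make the libc/librbd/librados frames resolve in the reports
apt-get install libc6-dbg librbd1-dbg librados2-dbg
# record during the fio run, then group samples by process and shared object
perf record -a --call-graph dwarf -- sleep 30
perf report --stdio --sort comm,dso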


Here is my analysis:



fio + aio engine + krbd : 35000 iops randread 4K
-----------------------------------------------

TOP
---
87,1% idle : 104% cpu total usage

2132 root 20 0 97740 4000 3612 S 47,9 0,0 0:14.50 fio
2134 root 20 0 0 0 0 S 42,6 0,0 0:12.8 kworker/0:2
48624 root 20 0 0 0 0 S 6,3 0,0 0:54.9 kworker/2:1
48396 root 20 0 0 0 0 S 5,3 0,0 0:13.14 kworker/4:0
3 root 20 0 0 0 0 S 4,0 0,0 2:34.53 ksoftirqd/0
48387 root 20 0 0 0 0 S 1,3 0,0 0:07.82 kworker/6:1
2130 root 20 0 67788 38m 38m S 0,3 0,1 0:00.09


perf
----

+ 24,23% kworker/0:2 [kernel.kallsyms]
+ 21,47% swapper [kernel.kallsyms]
+ 21,10% fio [kernel.kallsyms]
+ 5,37% fio fio
+ 5,36% fio [libceph]
+ 4,93% kworker/2:1 [kernel.kallsyms]
+ 4,89% kworker/0:2 [libceph]
+ 2,08% fio [rbd]
+ 1,80% kworker/0:2 [rbd]
+ 1,69% swapper [tg3]
+ 0,78% kworker/0:2 [tg3]
+ 0,74% kworker/4:0 [kernel.kallsyms]
+ 0,73% ksoftirqd/0 [kernel.kallsyms]
+ 0,66% kworker/2:1 [libceph]
+ 0,60% kworker/3:1 [kernel.kallsyms]
...


I think that kworker/0:2 is the main rbd kernel management thread : 42% cpu


FIO + RBD ENGINE : 35000 iops randread 4K
------------------------------------------

TOP
---
28% idle : 576% cpu total usage

1231 root 20 0 922m 38m 35m S 576,1 0,1 1:28.96 fio
3 root 20 0 0 0 0 S 1,3 0,0 2:27.24 ksoftirqd/0

perf
----
+ 25,00% fio [kernel.kallsyms]
+ 24,84% fio libc-2.13.so -----> malloc,free,... from fio rbdengine
+ 16,92% fio librados.so.2.0.0
+ 12,68% swapper [kernel.kallsyms]
+ 9,87% fio librbd.so.1.0.0
+ 4,64% fio libpthread-2.13.so
+ 2,33% fio libstdc++.so.6.0.17
+ 1,88% fio fio

librados + librbd = 26,79% of 576% = 154% cpu.

So, it seems that librbd+librados use 3x more cpu than krbd. Is that normal?



For the fio rbd engine, it seems that a lot of optimisations are missing:
malloc/free take around 25% of 576% = 144% cpu,
and the rest seems to be related to the fio code too.

Alexandre

Loading...