Bug #1842020 “ceph patch as of 8/29 segfaults all bluestore osds...” : Bugs : ceph package : Ubuntu

Harry Coin (hcoin) wrote on 2019-08-30:

#1

Confirmed on two different boxes of the same processor vintage. Otherwise latest eoan updates. Desktop ceph mate though it shouldn't matter.

Harry Coin (hcoin) wrote on 2019-08-30:

#2

Bisected: the problem starts with 14.2.2-0ubuntu1, or possibly the one after it. It works in 14.2.1-0ubuntu3. Look for the change in the file size of ceph-bluestore-osd.

James Page (james-page) wrote on 2019-08-30:

#3

Attempting to reproduce:

$ apt-cache policy ceph-osd
ceph-osd:
  Installed: 14.2.2-0ubuntu2
  Candidate: 14.2.2-0ubuntu2
  Version table:
*** 14.2.2-0ubuntu2 500
        500 http://gb.archive.ubuntu.com/ubuntu eoan/main amd64 Packages
        100 /var/lib/dpkg/status

$ ceph-bluestore-tool
must specify an action; --help for help

$ ceph-bluestore-tool --help
All options:

Options:
  -h [ --help ] produce help message
  --path arg bluestore path
  --out-dir arg output directory
  -l [ --log-file ] arg log file
  --log-level arg log level (30=most, 20=lots, 10=some, 1=little)
  --dev arg device(s)
  --devs-source arg bluefs-dev-migrate source device(s)
  --dev-target arg target/resulting device
  --deep arg deep fsck (read all data)
  -k [ --key ] arg label metadata key name
  -v [ --value ] arg label metadata value

Positional options:
  --command arg fsck, repair, bluefs-export, bluefs-bdev-sizes,
                         bluefs-bdev-expand, bluefs-bdev-new-db,
                         bluefs-bdev-new-wal, bluefs-bdev-migrate, show-label,
                         set-label-key, rm-label-key, prime-osd-dir,
                         bluefs-log-dump

That said, I'm running on a much newer processor type.

Changed in ceph (Ubuntu):
assignee:	nobody → James Page (james-page)

James Page (james-page) wrote on 2019-08-30:

#4

I don't have access to the same processor class; I suspect this is not a ceph-specific issue but might be a compiler bug in eoan.

The stacktrace for the issue might be recorded on one of your deployments - could you try to collect it using:

apport-collect 1842020

Alternatively, you can collect a backtrace with full debug symbols:

https://wiki.ubuntu.com/Backtrace

and attach it to this bug report.

Trent Lloyd (lathiat) wrote on 2019-08-30:

#5

At a super basic level I can't reproduce this. With an eoan container on an eoan host I don't get a segfault from ceph-bluestore-tool.

I'd suggest we may need to look at:
(1) getting a coredump;
(2) the somewhat unlikely but not impossible chance that it's CPU-dependent, for some kind of optimization reason or similar, as this CPU is quite old [can you confirm the install is also 64-bit?];
(3) getting a bunch of information about the system configuration, e.g. from 'sosreport' or similar [I'm not sure if you can use reportbug to upload system info about an existing bug], including at least the full "dpkg -l" package list.

Harry Coin (hcoin) wrote on 2019-08-30:

#6

result of apport-bug (1.1 MiB, text/plain)

Attached is the result of
apport-bug --save
It can be viewed with
apport-unpack
It has the answers to most of the above questions. Yes, it is amd64 (dual Xeon...)

Harry Coin (hcoin) wrote on 2019-08-30:

#7

backtrace log (4.5 KiB, text/plain)

Backtrace log for you.

Harry Coin (hcoin) wrote on 2019-08-30:

#8

list of packages installed on system with crashes. (233.4 KiB, text/plain)

The above data is from a run in an eoan VM on the same processor that hosts the OSDs in the baremetal layer (same bug in both cases, but I reverted the baremetal layer to the state prior to the 'upgrade' to restore ceph OSD function).
Here's the dpkg -l you asked for.

Here's cat /proc/cpuinfo on the VM
processor : 0
vendor_id : GenuineIntel
cpu family : 6
model : 15
model name : Intel Celeron_4x0 (Conroe/Merom Class Core 2)
stepping : 3
microcode : 0x1
cpu MHz : 2327.284
cache size : 16384 KB
physical id : 0
siblings : 1
core id : 0
cpu cores : 1
apicid : 0
initial apicid : 0
fpu : yes
fpu_exception : yes
cpuid level : 10
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ss syscall nx lm constant_tsc rep_good nopl cpuid tsc_known_freq pni vmx ssse3 cx16 x2apic tsc_deadline_timer hypervisor lahf_lm cpuid_fault pti tpr_shadow vnmi flexpriority tsc_adjust arat arch_capabilities
bugs : cpu_meltdown spectre_v1 spectre_v2 spec_store_bypass l1tf mds swapgs
bogomips : 4654.56
clflush size : 64
cache_alignment : 64
address sizes : 40 bits physical, 48 bits virtual
power management:

processor : 1
vendor_id : GenuineIntel
cpu family : 6
model : 15
model name : Intel Celeron_4x0 (Conroe/Merom Class Core 2)
stepping : 3
microcode : 0x1
cpu MHz : 2327.284
cache size : 16384 KB
physical id : 1
siblings : 1
core id : 0
cpu cores : 1
apicid : 1
initial apicid : 1
fpu : yes
fpu_exception : yes
cpuid level : 10
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ss syscall nx lm constant_tsc rep_good nopl cpuid tsc_known_freq pni vmx ssse3 cx16 x2apic tsc_deadline_timer hypervisor lahf_lm cpuid_fault pti tpr_shadow vnmi flexpriority tsc_adjust arat arch_capabilities
bugs : cpu_meltdown spectre_v1 spectre_v2 spec_store_bypass l1tf mds swapgs
bogomips : 4654.56
clflush size : 64
cache_alignment : 64
address sizes : 40 bits physical, 48 bits virtual
power management:
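
The flags lines above can be checked mechanically. What follows is a minimal sketch, assuming that a Nehalem-class ("corei7") baseline needs at least ssse3, sse4_1, sse4_2 and popcnt; that set is an assumption about the compiler target, not something stated in this comment, and the helper names cpu_flags and missing_flags are hypothetical.

```python
import re

# Flags assumed (not stated in the report) to be required by a
# Nehalem-class ("corei7") baseline; names as spelled in /proc/cpuinfo.
ASSUMED_BASELINE = {"ssse3", "sse4_1", "sse4_2", "popcnt"}


def cpu_flags(cpuinfo_text):
    """Return the set of flags from the first 'flags' line of cpuinfo text."""
    match = re.search(r"^flags\s*:\s*(.*)$", cpuinfo_text, re.MULTILINE)
    return set(match.group(1).split()) if match else set()


def missing_flags(cpuinfo_text, required=ASSUMED_BASELINE):
    """Return the required flags that the CPU does not advertise, sorted."""
    return sorted(set(required) - cpu_flags(cpuinfo_text))


# Abbreviated copy of the flags line reported above for the Conroe-class VM.
SAMPLE = "flags : fpu sse sse2 ss syscall nx lm pni vmx ssse3 cx16 x2apic\n"

if __name__ == "__main__":
    print(missing_flags(SAMPLE))
```

On a live system the same check would be missing_flags(open("/proc/cpuinfo").read()). For the abbreviated sample it reports popcnt, sse4_1 and sse4_2 as absent.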

Harry Coin (hcoin) wrote on 2019-08-30:

#9

FYI, in the eoan VM you used to duplicate the bug, try it again with the processor set to 2 'Conroe' cpus. i440FX chipset, BIOS, kvm-spice emulator.

Also, I tried compiling ceph nautilus from upstream. It can't be done in eoan without installing tox from bionic and a couple of other packages not in eoan (e.g. upstream depends on libicu60, eoan has ...63). I had to add these #pragmas, which look memory related...

src/spdk/dpdk/lib/librte_eal/common/eal_common_memory.c:#pragma GCC diagnostic warning "-Waddress-of-packed-member"
src/spdk/dpdk/lib/librte_eal/common/eal_common_memzone.c:#pragma GCC diagnostic warning "-Waddress-of-packed-member"
src/spdk/dpdk/lib/librte_eal/common/eal_common_tailqs.c:#pragma GCC diagnostic warning "-Waddress-of-packed-member"
src/spdk/dpdk/lib/librte_eal/common/malloc_heap.c:#pragma GCC diagnostic warning "-Waddress-of-packed-member"
src/spdk/dpdk/lib/librte_eal/common/rte_malloc.c:#pragma GCC diagnostic warning "-Waddress-of-packed-member"
src/spdk/dpdk/lib/librte_eal/linuxapp/eal/eal_memalloc.c:#pragma GCC diagnostic warning "-Waddress-of-packed-member"
src/spdk/dpdk/lib/librte_eal/linuxapp/eal/eal_memory.c:#pragma GCC diagnostic warning "-Waddress-of-packed-member"
src/spdk/dpdk/lib/librte_eal/linuxapp/eal/eal_vfio.c:#pragma GCC diagnostic warning "-Waddress-of-packed-member"
src/spdk/dpdk/lib/librte_eal/linuxapp/eal/eal_vfio.h:#pragma message("VFIO configured but not supported by this kernel, disabling.")
src/spdk/dpdk/lib/librte_ethdev/rte_tm.c:#pragma GCC diagnostic warning "-Waddress-of-packed-member"
src/spdk/dpdk/lib/librte_ethdev/ethdev_profile.c:#pragma GCC diagnostic warning "-Waddress-of-packed-member"
src/spdk/dpdk/lib/librte_ethdev/rte_ethdev.c:#pragma GCC diagnostic warning "-Waddress-of-packed-member"
src/spdk/dpdk/lib/librte_ethdev/rte_flow.c:#pragma GCC diagnostic warning "-Waddress-of-packed-member"
src/spdk/dpdk/lib/librte_ethdev/rte_mtr.c:#pragma GCC diagnostic warning "-Waddress-of-packed-member"
src/spdk/dpdk/lib/librte_mempool/rte_mempool.c:#pragma GCC diagnostic warning "-Waddress-of-packed-member"
src/spdk/dpdk/lib/librte_net/rte_arp.c:#pragma GCC diagnostic warning "-Waddress-of-packed-member"
src/spdk/dpdk/lib/librte_net/rte_net.c:#pragma GCC diagnostic warning "-Waddress-of-packed-member"
src/spdk/dpdk/lib/librte_ring/rte_ring.c:#pragma GCC diagnostic warning "-Waddress-of-packed-member"

Harry Coin (hcoin) wrote on 2019-08-30: Dependencies.txt

#10

Dependencies.txt (5.8 KiB, text/plain)

apport information

tags:	added: apport-collected eoan
description:	updated

Harry Coin (hcoin) wrote on 2019-08-30: ProcCpuinfoMinimal.txt

#11

ProcCpuinfoMinimal.txt (858 bytes, text/plain)

apport information

Harry Coin (hcoin) wrote on 2019-08-30: modified.conffile..etc.default.apport.txt

#12

modified.conffile..etc.default.apport.txt (151 bytes, text/plain)

apport information

Harry Coin (hcoin) wrote on 2019-08-30:

#13

core.gz of typical crash (807.2 KiB, application/octet-stream)

Here's a core.gz of a typical crash, in case the previous apport upload didn't get it to you.

Harry Coin (hcoin) wrote on 2019-08-30: Re: [Bug 1842020] Re: ceph patch as of 8/29 segfaults all bluestore osds

#14

Try it setting the processors on the VM to dual conroe.

On 8/30/19 3:38 AM, Trent Lloyd wrote:
> At a super basic level I can't reproduce this. With an eoan container on
> an eoan host I don't get a segfault from ceph-bluestore-tool.
>
> I'd suggest we may need to look at getting
> (1) a coredump
> (2) the somewhat unlikely but not impossible chance that it's CPU-dependent for some kind of optimization reason or similar as this CPU is quite old [can you confirm the install is also 64-bit?]
> (3) A bunch of information about the system configuration.. e.g. from 'sosreport' would work or similar. [I'm not sure if you can use reportbug to upload system info about an existing bug] - including at least the "dpkg -l" full package list.
>

Harry Coin (hcoin) wrote on 2019-08-31:

#15

And with debug symbols:

(gdb) run
Starting program: /usr/bin/ceph-bluestore-tool 
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
Program received signal SIGILL, Illegal instruction.
0x0000555555743984 in eth_dev_init_cb_lists ()
(gdb) backtrace full
#0  0x0000555555743984 in eth_dev_init_cb_lists ()
No symbol table info available.
#1  0x0000555555dc045d in __libc_csu_init ()
No symbol table info available.
#2  0x00007fffee524e2e in __libc_start_main (main=0x5555557346b0 <main(int, char**)>, argc=1, argv=0x7fffffffe328, init=0x555555dc0410 <__libc_csu_init>, fini=<optimized out>, rtld_fini=<optimized out>, stack_end=0x7fffffffe318)
at ../csu/libc-start.c:264
result = <optimized out>
unwind_buf = {cancel_jmp_buf = {{jmp_buf = {140737351328816, 5390835150573769728, 140737353589600, 140737353586152, 1, 140737488347944, 140737488347960, 140737354007706}, mask_was_saved = 8}}, priv = {pad = {0x1, 
0x7fffffffe328, 0x7fffffffe338, 0x7ffff7ffe190}, data = {prev = 0x1, cleanup = 0x7fffffffe328, canceltype = -7368}}}
not_first_call = <optimized out>
#3  0x000055555581e47e in _start () at /usr/include/c++/9/ostream:108
No symbol table info available.
(gdb) backtrace full
#0  0x0000555555743984 in eth_dev_init_cb_lists () at /usr/include/c++/9/ostream:108
No symbol table info available.
#1  0x0000555555dc045d in __libc_csu_init ()
No symbol table info available.
#2  0x00007fffee524e2e in __libc_start_main (main=0x5555557346b0 <main(int, char**)>, argc=1, argv=0x7fffffffe328, init=0x555555dc0410 <__libc_csu_init>, fini=<optimized out>, rtld_fini=<optimized out>, stack_end=0x7fffffffe318)
at ../csu/libc-start.c:264
result = <optimized out>
unwind_buf = {cancel_jmp_buf = {{jmp_buf = {140737351328816, 5390835150573769728, 140737353589600, 140737353586152, 1, 140737488347944, 140737488347960, 140737354007706}, mask_was_saved = 8}}, priv = {pad = {0x1, 
0x7fffffffe328, 0x7fffffffe338, 0x7ffff7ffe190}, data = {prev = 0x1, cleanup = 0x7fffffffe328, canceltype = -7368}}}
not_first_call = <optimized out>
#3  0x000055555581e47e in _start () at /usr/include/c++/9/ostream:108
No symbol table info available.
(gdb) info registers
rax            0x555555fe0340      93825003291456
rbx            0x36                54
rcx            0xb                 11
rdx            0x5555568921a0      93825012408736
rsi            0x7fffffffe328      140737488347944
rdi            0x1                 1
rbp            0xc5                0xc5
rsp            0x7fffffffe208      0x7fffffffe208
r8             0x0                 0
r9             0x0                 0
r10            0x642e6264626f6c62  7218815436009204834
r11            0x20                32
r12            0x555555f2b510      93825002550544
r13            0x1                 1
r14            0x7fffffffe328      140737488347944
r15            0x5555568921a0      93825012408736
rip            0x555555743984      0x555555743984 <eth_dev_init_cb_lists+68>
eflags         0x10212             [ AF IF RF ]
cs             0x33                51
ss             0x2b                43
ds             0x0                 0
es             0x0                 0
fs             0x0                 0
gs             0x0                 0
(gdb) x/16i $pc
=> 0x555555743984 <eth_dev_init_cb_lists+68>:   pextrq $0x1,%xmm2,0x40c0(%rax)
0x55555574398f <eth_dev_init_cb_lists+79>:   pextrq $0x1,%xmm1,0xc1c0(%rax)
0x55555574399a <eth_dev_init_cb_lists+90>:   movdqa 0x6c2eae(%rip),%xmm2        # 0x555555e06850
0x5555557439a2 <eth_dev_init_cb_lists+98>:   movq   %xmm1,0x8140(%rax)
0x5555557439aa <eth_dev_init_cb_lists+106>:  movdqa 0x6c2eae(%rip),%xmm1        # 0x555555e06860
0x5555557439b2 <eth_dev_init_cb_lists+114>:  movq   $0x0,0x8138(%rax)
0x5555557439bd <eth_dev_init_cb_lists+125>:  paddq  %xmm0,%xmm2
0x5555557439c1 <eth_dev_init_cb_lists+129>:  movq   %xmm2,0x10240(%rax)
0x5555557439c9 <eth_dev_init_cb_lists+137>:  paddq  %xmm0,%xmm1
0x5555557439cd <eth_dev_init_cb_lists+141>:  pextrq $0x1,%xmm2,0x142c0(%rax)
0x5555557439d8 <eth_dev_init_cb_lists+152>:  movdqa 0x6c2e90(%rip),%xmm2        # 0x555555e06870
0x5555557439e0 <eth_dev_init_cb_lists+160>:  movq   %xmm1,0x18340(%rax)
0x5555557439e8 <eth_dev_init_cb_lists+168>:  pextrq $0x1,%xmm1,0x1c3c0(%rax)
0x5555557439f3 <eth_dev_init_cb_lists+179>:  movdqa 0x6c2e85(%rip),%xmm1        # 0x555555e06880
0x5555557439fb <eth_dev_init_cb_lists+187>:  movq   $0x0,0xc1b8(%rax)
0x555555743a06 <eth_dev_init_cb_lists+198>:  paddq  %xmm0,%xmm2
(gdb) thread apply all backtrace
Thread 1 (Thread 0x7fffee0e20c0 (LWP 825)):
#0  0x0000555555743984 in eth_dev_init_cb_lists () at /usr/include/c++/9/ostream:108
#1  0x0000555555dc045d in __libc_csu_init ()
#2  0x00007fffee524e2e in __libc_start_main (main=0x5555557346b0 <main(int, char**)>, argc=1, argv=0x7fffffffe328, init=0x555555dc0410 <__libc_csu_init>, fini=<optimized out>, rtld_fini=<optimized out>, stack_end=0x7fffffffe318) at ../csu/libc-start.c:264
#3  0x000055555581e47e in _start () at /usr/include/c++/9/ostream:108
(gdb)
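
The SIGILL lands on pextrq, an SSE4.1 instruction, while the flags reported for this VM include ssse3 but no sse4_1. A minimal sketch of that cross-check follows. It assumes a small hand-written mnemonic table (illustrative and far from complete), and the helper name unsupported_instructions is hypothetical.

```python
import re

# Illustrative, incomplete mapping from AT&T mnemonics to the
# /proc/cpuinfo flag they need. Hand-written for this sketch.
MNEMONIC_TO_FLAG = {
    "pextrb": "sse4_1",
    "pextrd": "sse4_1",
    "pextrq": "sse4_1",
    "pinsrq": "sse4_1",
    "popcnt": "popcnt",
}

# Matches gdb 'x/i' lines: optional '=>' marker, address, <symbol+offset>:,
# then the mnemonic as the first token of the instruction.
LINE_RE = re.compile(r"^(?:=>)?\s*(0x[0-9a-f]+)\s+<[^>]+>:\s+(\S+)")


def unsupported_instructions(disassembly, cpu_flags):
    """Return (address, mnemonic, flag) for instructions the CPU lacks."""
    hits = []
    for line in disassembly.splitlines():
        match = LINE_RE.match(line)
        if not match:
            continue
        address, mnemonic = match.groups()
        flag = MNEMONIC_TO_FLAG.get(mnemonic)
        if flag and flag not in cpu_flags:
            hits.append((address, mnemonic, flag))
    return hits


# Three lines copied from the disassembly above.
SAMPLE = """\
=> 0x555555743984 <eth_dev_init_cb_lists+68>:   pextrq $0x1,%xmm2,0x40c0(%rax)
0x55555574399a <eth_dev_init_cb_lists+90>:   movdqa 0x6c2eae(%rip),%xmm2
0x5555557439bd <eth_dev_init_cb_lists+125>:  paddq  %xmm0,%xmm2
"""

if __name__ == "__main__":
    for hit in unsupported_instructions(SAMPLE, {"sse2", "ssse3"}):
        print(hit)
```

Mnemonics missing from the table are silently ignored, so an empty result only means that nothing in the table was hit, not that the binary is safe for the CPU.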

Harry Coin (hcoin) wrote on 2019-09-01:

#16

I think I found it, but could use some validation.
At https://ceph.io/geen-categorie/sse-optimization-for-erasure-code-in-ceph/
we have the precedent of ceph checking which level of SSE instructions is available and then using the appropriate one.

However, in the Ubuntu version, scattered around the makefiles, we see -msse4.2 in several places
oddly (there is no mssse3)
-msse -msse2 -msse3 -mssse3 -mpclmul -msse4.1 -msse4.2

In rocksdb we often see -msse4.2.

Canonical should remove the -msse4.2 compiler flags, as ceph doesn't advertise that it is incompatible with systems that have less than SSE4 capabilities.

I'm looking into this further, but it appears to fit what I know so far.
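
The search for such flags can be scripted. Below is a minimal sketch that walks an unpacked source tree and lists every -msse4* or -march= option found in build files. The file-name selection and the flag pattern are assumptions made for this sketch, not a list taken from the ceph tree, and the helper name find_cpu_flags is hypothetical.

```python
import os
import re

# Compiler options that raise the minimum CPU level; the pattern is an
# assumption about what is worth flagging, not a list taken from ceph.
FLAG_RE = re.compile(r"(?<![\w-])(-msse4(?:\.[12])?|-march=[\w.-]+)")

# File names treated as build files in this sketch (hypothetical selection).
BUILD_FILE_RE = re.compile(r"(^CMakeLists\.txt$|^Makefile|\.cmake$|\.mk$|^rules$)")


def find_cpu_flags(root):
    """Yield (path, line_number, flag) for each matching flag under root."""
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            if not BUILD_FILE_RE.search(name):
                continue
            path = os.path.join(dirpath, name)
            with open(path, encoding="utf-8", errors="replace") as handle:
                for number, line in enumerate(handle, start=1):
                    for flag in FLAG_RE.findall(line):
                        yield path, number, flag


if __name__ == "__main__":
    # "." stands in for wherever the source package was unpacked.
    for path, number, flag in find_cpu_flags("."):
        print(f"{path}:{number}: {flag}")
```

A hit only shows where a flag is spelled out; whether that option actually reaches the compiler for a given binary still has to be confirmed from the build log.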

Harry Coin (hcoin) wrote on 2019-09-01:

#17

Not so great minds think alike. Here it is, from upstream:
https://tracker.ceph.com/issues/41330

Harry Coin (hcoin) wrote on 2019-09-02:

#18

Upstream has two approaches to a solution. One is to disable SPDK except for development versions, because the SPDK folks set their minimum CPU level to corei7. I couldn't get that patch to work in the Ubuntu packaging ('apt-get source ceph').

I was able to get the other patch working, the one that edits the two files mentioned in the upstream issue above: editing the memcpy code and commenting out the corei7.

What I would like to see, as a general solution that might set Canonical apart from others in a good way, is this:

when compiling using dpkg-buildpackage ...
a Canonical-wide flag that overrides whatever -msse and -march defaults are in place and replaces them with -march=native.

That way, those who want to compile a package for best performance (or any performance) on a particular machine can make it 'just work'.
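
The flag rewriting that such an override implies can be sketched as a plain string transformation. This is only an illustration under stated assumptions: the helper name nativize is hypothetical, the set of prefixes to drop is a guess at what "-msse and -march" covers, and nothing here shows how the result would be wired into dpkg-buildpackage.

```python
import shlex

# Option prefixes dropped by this sketch (an assumed reading of
# "-msse and -march" in the comment above).
DROPPED_PREFIXES = ("-march=", "-msse", "-mssse")


def nativize(flags):
    """Drop -march=/-msse*/-mssse* options and append -march=native."""
    kept = [
        token
        for token in shlex.split(flags)
        if not token.startswith(DROPPED_PREFIXES)
    ]
    # Tokens are re-joined with spaces and not re-quoted, which is fine
    # for simple flags but would mangle arguments containing spaces.
    return " ".join(kept + ["-march=native"])


if __name__ == "__main__":
    print(nativize("-O2 -g -msse4.2 -mssse3 -march=corei7 -fPIC"))
```

A build made this way is tied to the machine that compiled it, so it suits local rebuilds rather than packages meant for an archive.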

James Page (james-page) on 2019-09-03

Changed in ceph (Ubuntu):
status:	New → Triaged
importance:	Undecided → High

James Page (james-page) wrote on 2019-09-03:

#19

Package with SPDK disabled (in line with the short-term upstream fix) is building here:

https://launchpad.net/~ci-train-ppa-service/+archive/ubuntu/3535

This also includes a fix for py3 compat in ceph-crash.

James Page (james-page) on 2019-09-03

Changed in ceph (Ubuntu):
status:	Triaged → Fix Committed

Launchpad Janitor (janitor) wrote on 2019-09-03:

#20

This bug was fixed in the package ceph - 14.2.2-0ubuntu3

---------------
ceph (14.2.2-0ubuntu3) eoan; urgency=medium

  * d/rules: Disable SPDK support as this generates a build which
    has a minimum CPU baseline of 'corei7' on x86_64 which is not
    compatible with older CPU's (LP: #1842020).
  * d/p/issue40781.patch: Cherry pick fix for py3 compatibility in ceph-
    crash.

-- James Page <email address hidden> Tue, 03 Sep 2019 14:52:38 +0100

Changed in ceph (Ubuntu):
status:	Fix Committed → Fix Released
