Hello Jason,
I can confirm this release fixes the crashes. There hasn't been a single
crash for the past 4 days.
On Mon, Sep 14, 2020 at 2:55 PM Jason Dillaman <jdillama(a)redhat.com> wrote:
On Mon, Sep 14, 2020 at 5:13 AM Lomayani S. Laizer <lomlaizer(a)gmail.com> wrote:
Hello,
Last week I got time to try to debug the crashes of these VMs.
The log below includes the rados debug output that was missing last time:
https://storage.habari.co.tz/index.php/s/AQEJ7tQS7epC4Zn
I have observed the following with these settings in OpenStack and Ceph:

- disk_cachemodes="network=writeback" in the OpenStack environment and
  rbd_cache_policy = writearound in Ceph --- crashes occur
- disk_cachemodes="network=writeback" in the OpenStack environment and
  rbd_cache_policy = writeback in Ceph --- no crashes
- disk_cachemodes="none" in the OpenStack environment and
  rbd_cache_policy = writearound in Ceph --- no crashes
- disk_cachemodes="none" in the OpenStack environment and
  rbd_cache_policy = writeback in Ceph --- crashes occur

Is disk_cachemodes="network=writeback" no longer recommended in Octopus?
I ask because it is left out of the new documentation for Octopus:
https://ceph.readthedocs.io/en/latest/rbd/rbd-openstack/
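For context, the two settings above live in different config files. A sketch of the combination that stayed stable in my tests (standard nova.conf / ceph.conf locations and section names assumed; adjust for your deployment):

```ini
# nova.conf on the compute node (OpenStack side)
[libvirt]
disk_cachemodes = "network=writeback"

# ceph.conf on the compute node (Ceph client side)
[client]
rbd cache = true
# writeback was the combination that did not crash in the tests above;
# writearound (the Octopus default) crashed together with network=writeback
rbd cache policy = writeback
```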
Can you try the latest development release of Octopus [1]? A librbd
crash fix has been sitting in that branch for about a month now and will be
included in the next point release.
> > Hello,
> >
> > Below is full debug log of 2 minutes before crash of virtual machine.
> > https://storage.habari.co.tz/index.php/s/31eCwZbOoRTMpcU
>
> This log has rbd debug output, but not rados :(
>
> I guess you'll need to try and capture a coredump if you can't get a
> backtrace.
>
> I'd also suggest opening a tracker in case one of the rbd devs has any
> ideas on this, or has seen something similar. Without a backtrace or
> core it will be impossible to definitively identify the issue though.
+1 to needing the backtrace. I don't see any indications of a problem in
that log.
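For anyone else chasing this, the steps that usually produce something actionable look roughly like the sketch below. Everything here is an assumption-laden example (Ubuntu-style paths, dbgsym package names, and the librbd install location are guesses for a given setup); the addresses come from the kernel segfault line quoted further down in this thread.

```shell
# Sketch (assumed paths): getting a backtrace for the fn-radosclient segfault.
#
# -- one-time setup: let qemu dump core instead of apport intercepting it --
#   ulimit -c unlimited
#   sudo sysctl -w kernel.core_pattern=/var/crash/core.%e.%p
#
# -- after the next crash, with matching debug symbols installed --
#   (librbd1-dbgsym on Ubuntu, ceph-debuginfo on EL-style distros)
#   gdb /usr/bin/qemu-system-x86_64 /var/crash/core.<name>.<pid> \
#       -ex 'thread apply all bt' -ex quit

# Even without a core, the kernel line already gives the faulting offset
# inside librbd: the ip minus the mapping base printed in brackets.
ip=0x00007f4c8c85d7ed      # "segfault at 0 ip ..." from the kernel log below
base=0x7f4c8c65a000        # "librbd.so.1.12.0[7f4c8c65a000+5cb000]"
printf 'offset: 0x%x\n' $((ip - base))

# That offset can be fed to addr2line against a librbd with debug symbols, e.g.:
#   addr2line -Cfe /usr/lib/x86_64-linux-gnu/librbd.so.1.12.0 0x2037ed
```

The bracketed byte in the kernel `Code:` dump (`<48>` below) marks the faulting instruction, so it should line up with whatever addr2line reports at that offset.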
> >
> > apport.log
> >
> > Wed May 13 09:35:30 2020: host pid 4440 crashed in a separate mount namespace, ignoring
> >
> > kernel.log
> > May 13 09:35:30 compute5 kernel: [123071.373217] fn-radosclient[4485]: segfault at 0 ip 00007f4c8c85d7ed sp 00007f4c66ffc470 error 4 in librbd.so.1.12.0[7f4c8c65a000+5cb000]
> > May 13 09:35:30 compute5 kernel: [123071.373228] Code: 8d 44 24 08 48 81 c3 d8 3e 00 00 49 21 f9 48 c1 e8 30 83 c0 01 48 c1 e0 30 48 89 02 48 8b 03 48 89 04 24 48 8b 34 24 48 21 fe <48> 8b 06 48 89 44 24 08 48 8b 44 24 08 48 8b 0b 48 21 f8 48 39 0c
> > May 13 09:35:33 compute5 kernel: [123074.832700] brqa72d845b-e9: port 1(tap33511c4d-2c) entered disabled state
> > May 13 09:35:33 compute5 kernel: [123074.838520] device tap33511c4d-2c left promiscuous mode
> > May 13 09:35:33 compute5 kernel: [123074.838527] brqa72d845b-e9: port 1(tap33511c4d-2c) entered disabled state
> >
> > syslog
> > compute5 kernel: [123071.373217] fn-radosclient[4485]: segfault at 0 ip 00007f4c8c85d7ed sp 00007f4c66ffc470 error 4 in librbd.so.1.12.0[7f4c8c65a000+5cb000]
> > May 13 09:35:30 compute5 kernel: [123071.373228] Code: 8d 44 24 08 48 81 c3 d8 3e 00 00 49 21 f9 48 c1 e8 30 83 c0 01 48 c1 e0 30 48 89 02 48 8b 03 48 89 04 24 48 8b 34 24 48 21 fe <48> 8b 06 48 89 44 24 08 48 8b 44 24 08 48 8b 0b 48 21 f8 48 39 0c
> > May 13 09:35:30 compute5 libvirtd[1844]: internal error: End of file from qemu monitor
> > May 13 09:35:33 compute5 systemd-networkd[1326]: tap33511c4d-2c: Link DOWN
> > May 13 09:35:33 compute5 systemd-networkd[1326]: tap33511c4d-2c: Lost carrier
> > May 13 09:35:33 compute5 kernel: [123074.832700] brqa72d845b-e9: port 1(tap33511c4d-2c) entered disabled state
> > May 13 09:35:33 compute5 kernel: [123074.838520] device tap33511c4d-2c left promiscuous mode
> > May 13 09:35:33 compute5 kernel: [123074.838527] brqa72d845b-e9: port 1(tap33511c4d-2c) entered disabled state
> > May 13 09:35:33 compute5 networkd-dispatcher[1614]: Failed to request link: No such device
> >
> > On Fri, May 8, 2020 at 5:40 AM Brad Hubbard <bhubbard@xxxxxxxxxx> wrote:
> >>
> >> On Fri, May 8, 2020 at 12:10 PM Lomayani S. Laizer <lomlaizer@xxxxxxxxx> wrote:
> >> >
> >> > Hello,
> >> > On my side at point of vm crash these are logs below. At the moment my debug is at 10 value. I will rise to 20 for full debug. these crashes are random and so far happens on very busy vms. Downgrading clients in host to Nautilus these crashes disappear
> >>
> >> You could try adding debug_rados as well but you may get a very large
> >> log so keep an eye on things.
> >>
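For reference, Brad's suggestion corresponds to something like the following in the [client] section on the compute node (the log-file pattern is only an illustrative example, not from this thread):

```ini
# ceph.conf on the compute node -- sketch of the client-side debug logging
# suggested above; level 20 rados logs grow very fast, so watch disk space
[client]
debug rbd = 20
debug rados = 20
log file = /var/log/ceph/client.$name.$pid.log   # example pattern; any writable path works
```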
> >> >
> >> > Qemu is not shutting down in general because other vms on the same host continues working
> >>
> >> A process can not reliably continue after encountering a segfault so
> >> the qemu-kvm process must be ending and therefore it should be
> >> possible to capture a coredump with the right configuration.
> >>
> >> In the following example, if you were to search for pid 6060 you would
> >> find it is no longer running.
> >> >> > [ 7682.233684] fn-radosclient[6060]: segfault at 2b19 ip 00007f8165cc0a50 sp 00007f81397f6490 error 4 in librbd.so.1.12.0[7f8165ab4000+537000]
> >>
> >> Without a backtrace at a minimum it may be very difficult to work out
> >> what's going on with certainty. If you open a tracker for the issue
> >> though maybe one of the devs specialising in rbd may have some
> >> feedback.
> >>
> >> >
> >> > 2020-05-07T13:02:12.121+0300 7f88d57fa700 10 librbd::io::ReadResult: 0x7f88c80bfbf0 finish: got {} for [0,24576] bl 24576
> >> > 2020-05-07T13:02:12.193+0300 7f88d57fa700 10 librbd::io::ReadResult: 0x7f88c80f9330 finish: C_ObjectReadRequest: r=0
> >> > 2020-05-07T13:02:12.193+0300 7f88d57fa700 10 librbd::io::ReadResult: 0x7f88c80f9330 finish: got {} for [0,16384] bl 16384
> >> > 2020-05-07T13:02:28.694+0300 7f890ba90500 10 librbd::ImageState: 0x5569b5da9bb0 0x5569b5da9bb0 send_close_unlock
> >> > 2020-05-07T13:02:28.694+0300 7f890ba90500 10 librbd::ImageState: 0x5569b5da9bb0 0x5569b5da9bb0 send_close_unlock
> >> > 2020-05-07T13:02:28.694+0300 7f890ba90500 10 librbd::image::CloseRequest: 0x7f88c8175fd0 send_block_image_watcher
> >> > 2020-05-07T13:02:28.694+0300 7f890ba90500 10 librbd::ImageWatcher: 0x7f88c400dfe0 block_notifies
> >> > 2020-05-07T13:02:28.694+0300 7f890ba90500 5 librbd::Watcher: 0x7f88c400dfe0 block_notifies: blocked_count=1
> >> > 2020-05-07T13:02:28.694+0300 7f890ba90500 10 librbd::image::CloseRequest: 0x7f88c8175fd0 handle_block_image_watcher: r=0
> >> > 2020-05-07T13:02:28.694+0300 7f890ba90500 10 librbd::image::CloseRequest: 0x7f88c8175fd0 send_shut_down_update_watchers
> >> > 2020-05-07T13:02:28.694+0300 7f88d4ff9700 10 librbd::image::CloseRequest: 0x7f88c8175fd0 handle_shut_down_update_watchers: r=0
> >> > 2020-05-07T13:02:28.694+0300 7f88d4ff9700 10 librbd::image::CloseRequest: 0x7f88c8175fd0 send_shut_down_io_queue
> >> > 2020-05-07T13:02:28.694+0300 7f88d4ff9700 5 librbd::io::ImageRequestWQ: 0x7f88e8001570 shut_down: shut_down: in_flight=0
> >> > 2020-05-07T13:02:28.694+0300 7f88d4ff9700 10 librbd::image::CloseRequest: 0x7f88c8175fd0 handle_shut_down_io_queue: r=0
> >> > 2020-05-07T13:02:28.694+0300 7f88d4ff9700 10 librbd::image::CloseRequest: 0x7f88c8175fd0 send_shut_down_exclusive_lock
> >> > 2020-05-07T13:02:28.694+0300 7f88d4ff9700 10 librbd::ExclusiveLock: 0x7f88c4011ba0 shut_down
> >> > 2020-05-07T13:02:28.694+0300 7f88d4ff9700 10 librbd::ManagedLock: 0x7f88c4011bb8 shut_down:
> >> > 2020-05-07T13:02:28.694+0300 7f88d4ff9700 10 librbd::ManagedLock: 0x7f88c4011bb8 send_shutdown:
> >> > 2020-05-07T13:02:28.694+0300 7f88d4ff9700 10 librbd::ManagedLock: 0x7f88c4011bb8 send_shutdown_release:
> >> > 2020-05-07T13:02:28.694+0300 7f88d4ff9700 10 librbd::ExclusiveLock: 0x7f88c4011ba0 pre_release_lock_handler
> >> > 2020-05-07T13:02:28.694+0300 7f88d4ff9700 10 librbd::exclusive_lock::PreReleaseRequest: 0x7f88c80b6020 send_cancel_op_requests:
> >> > 2020-05-07T13:02:28.694+0300 7f88d4ff9700 10 librbd::exclusive_lock::PreReleaseRequest: 0x7f88c80b6020 handle_cancel_op_requests: r=0
> >> > 2020-05-07T13:02:28.694+0300 7f88d4ff9700 10 librbd::exclusive_lock::PreReleaseRequest: 0x7f88c80b6020 send_block_writes:
> >> > 2020-05-07T13:02:28.698+0300 7f88d4ff9700 5 librbd::io::ImageRequestWQ: 0x7f88e8001570 block_writes: 0x5569b5e1ffd0, num=1
> >> > 2020-05-07T13:02:28.698+0300 7f88d4ff9700 10 librbd::exclusive_lock::PreReleaseRequest: 0x7f88c80b6020 handle_block_writes: r=0
> >> > 2020-05-07T13:02:28.698+0300 7f88d4ff9700 10 librbd::exclusive_lock::PreReleaseRequest: 0x7f88c80b6020 send_wait_for_ops:
> >> > 2020-05-07T13:02:28.698+0300 7f88d4ff9700 10 librbd::exclusive_lock::PreReleaseRequest: 0x7f88c80b6020 handle_wait_for_ops:
> >> > 2020-05-07T13:02:28.698+0300 7f88d4ff9700 10 librbd::exclusive_lock::PreReleaseRequest: 0x7f88c80b6020 send_invalidate_cache:
> >> > 2020-05-07T13:02:28.698+0300 7f88d4ff9700 5 librbd::io::ObjectDispatcher: 0x5569b5dab700 invalidate_cache:
> >> > 2020-05-07T13:02:28.698+0300 7f88d4ff9700 10 librbd::exclusive_lock::PreReleaseRequest: 0x7f88c80b6020 handle_invalidate_cache: r=0
> >> > 2020-05-07T13:02:28.698+0300 7f88d4ff9700 10 librbd::exclusive_lock::PreReleaseRequest: 0x7f88c80b6020 send_flush_notifies:
> >> > 2020-05-07T13:02:28.698+0300 7f88d4ff9700 10 librbd::exclusive_lock::PreReleaseRequest: 0x7f88c80b6020 handle_flush_notifies:
> >> > 2020-05-07T13:02:28.698+0300 7f88d4ff9700 10 librbd::exclusive_lock::PreReleaseRequest: 0x7f88c80b6020 send_close_object_map:
> >> > 2020-05-07T13:02:28.698+0300 7f88d4ff9700 10 librbd::object_map::UnlockRequest: 0x7f88c807a450 send_unlock: oid=rbd_object_map.2f18f2a67fad72
> >> > 2020-05-07T13:02:28.702+0300 7f88d57fa700 10 librbd::object_map::UnlockRequest: 0x7f88c807a450 handle_unlock: r=0
> >> > 2020-05-07T13:02:28.702+0300 7f88d57fa700 10 librbd::exclusive_lock::PreReleaseRequest: 0x7f88c80b6020 handle_close_object_map: r=0
> >> > 2020-05-07T13:02:28.702+0300 7f88d57fa700 10 librbd::exclusive_lock::PreReleaseRequest: 0x7f88c80b6020 send_unlock:
> >> > 2020-05-07T13:02:28.702+0300 7f88d4ff9700 10 librbd::ManagedLock: 0x7f88c4011bb8 handle_shutdown_pre_release: r=0
> >> > 2020-05-07T13:02:28.702+0300 7f88d4ff9700 10 librbd::managed_lock::ReleaseRequest: 0x7f88c80b68a0 send_unlock: entity=client.58292796, cookie=auto 140225447738256
> >> > 2020-05-07T13:02:28.702+0300 7f88d57fa700 10 librbd::managed_lock::ReleaseRequest: 0x7f88c80b68a0 handle_unlock: r=0
> >> > 2020-05-07T13:02:28.702+0300 7f88d4ff9700 10 librbd::ExclusiveLock: 0x7f88c4011ba0 post_release_lock_handler: r=0 shutting_down=1
> >> > 2020-05-07T13:02:28.702+0300 7f88d4ff9700 5 librbd::io::ImageRequestWQ: 0x7f88e8001570 unblock_writes: 0x5569b5e1ffd0, num=0
> >> > 2020-05-07T13:02:28.702+0300 7f88d4ff9700 10 librbd::ImageWatcher: 0x7f88c400dfe0 notify released lock
> >> > 2020-05-07T13:02:28.702+0300 7f88d4ff9700 10 librbd::ImageWatcher: 0x7f88c400dfe0 current lock owner: [0,0]
> >> > 2020-05-07T13:02:28.702+0300 7f88d4ff9700 10 librbd::ManagedLock: 0x7f88c4011bb8 handle_shutdown_post_release: r=0
> >> > 2020-05-07T13:02:28.702+0300 7f88d4ff9700 10 librbd::ManagedLock: 0x7f88c4011bb8 wait_for_tracked_ops: r=0
> >> > 2020-05-07T13:02:28.702+0300 7f88d4ff9700 10 librbd::ManagedLock: 0x7f88c4011bb8 complete_shutdown: r=0
> >> > 2020-05-07T13:02:28.702+0300 7f88d4ff9700 10 librbd::image::CloseRequest: 0x7f88c8175fd0 handle_shut_down_exclusive_lock: r=0
> >> > 2020-05-07T13:02:28.702+0300 7f88d4ff9700 10 librbd::image::CloseRequest: 0x7f88c8175fd0 send_unregister_image_watcher
> >> > 2020-05-07T13:02:28.702+0300 7f88d4ff9700 10 librbd::ImageWatcher: 0x7f88c400dfe0 unregistering image watcher
> >> > 2020-05-07T13:02:28.702+0300 7f88d4ff9700 10 librbd::Watcher: 0x7f88c400dfe0 unregister_watch:
> >> > 2020-05-07T13:02:28.702+0300 7f88d57fa700 5 librbd::Watcher: 0x7f88c400dfe0 notifications_blocked: blocked=1
> >> > 2020-05-07T13:02:28.706+0300 7f88ceffd700 10 librbd::image::CloseRequest: 0x7f88c8175fd0 handle_unregister_image_watcher: r=0
> >> > 2020-05-07T13:02:28.706+0300 7f88ceffd700 10 librbd::image::CloseRequest: 0x7f88c8175fd0 send_flush_readahead
> >> > 2020-05-07T13:02:28.706+0300 7f88d4ff9700 10 librbd::image::CloseRequest: 0x7f88c8175fd0 handle_flush_readahead: r=0
> >> > 2020-05-07T13:02:28.706+0300 7f88d4ff9700 10 librbd::image::CloseRequest: 0x7f88c8175fd0 send_shut_down_object_dispatcher
> >> > 2020-05-07T13:02:28.706+0300 7f88d4ff9700 5 librbd::io::ObjectDispatcher: 0x5569b5dab700 shut_down:
> >> > 2020-05-07T13:02:28.706+0300 7f88d4ff9700 5 librbd::io::ObjectDispatch: 0x5569b5ee8360 shut_down:
> >> > 2020-05-07T13:02:28.706+0300 7f88d4ff9700 5 librbd::io::SimpleSchedulerObjectDispatch: 0x7f88c4013ce0 shut_down:
> >> > 2020-05-07T13:02:28.706+0300 7f88d4ff9700 5 librbd::cache::WriteAroundObjectDispatch: 0x7f88c8003780 shut_down:
> >> > 2020-05-07T13:02:28.706+0300 7f88d4ff9700 10 librbd::image::CloseRequest: 0x7f88c8175fd0 handle_shut_down_object_dispatcher: r=0
> >> > 2020-05-07T13:02:28.706+0300 7f88d4ff9700 10 librbd::image::CloseRequest: 0x7f88c8175fd0 send_flush_op_work_queue
> >> > 2020-05-07T13:02:28.706+0300 7f88d4ff9700 10 librbd::image::CloseRequest: 0x7f88c8175fd0 handle_flush_op_work_queue: r=0
> >> > 2020-05-07T13:02:28.706+0300 7f88d4ff9700 10 librbd::image::CloseRequest: 0x7f88c8175fd0 handle_flush_image_watcher: r=0
> >> > 2020-05-07T13:02:28.706+0300 7f88d4ff9700 10 librbd::ImageState: 0x5569b5da9bb0 0x5569b5da9bb0 handle_close: r=0
> >> >
> >> > On Fri, May 8, 2020 at 12:40 AM Brad Hubbard <bhubbard@xxxxxxxxxx> wrote:
> >> >>
> >> >> On Fri, May 8, 2020 at 3:42 AM Erwin Lubbers <erwin@xxxxxxxxxxx> wrote:
> >> >> >
> >> >> > Hi,
> >> >> >
> >> >> > Did anyone found a way to resolve the problem? I'm seeing the same on a clean Octopus Ceph installation on Ubuntu 18 with an Octopus compiled KVM server running on CentOS 7.8. The KVM machine shows:
> >> >> >
> >> >> > [ 7682.233684] fn-radosclient[6060]: segfault at 2b19 ip 00007f8165cc0a50 sp 00007f81397f6490 error 4 in librbd.so.1.12.0[7f8165ab4000+537000]
> >> >>
> >> >> Are you able to either capture a backtrace from a coredump or set up
> >> >> logging and hopefully capture a backtrace that way?
> >> >>
> >> >> >
> >> >> > Ceph is healthy and stable for a few weeks and I did not get these messages while running on KVM compiled with Luminous libraries.
> >> >> >
> >> >> > Regards,
> >> >> > Erwin
> >> >> > _______________________________________________
> >> >> > ceph-users mailing list -- ceph-users@xxxxxxx
> >> >> > To unsubscribe send an email to ceph-users-leave@xxxxxxx
> >> >> >
> >> >>
> >> >>
> >> >> --
> >> >> Cheers,
> >> >> Brad
> >>
> >>
> >>
> >> --
> >> Cheers,
> >> Brad
> >>
> --
> Cheers,
> Brad
_______________________________________________
ceph-users mailing list -- ceph-users(a)ceph.io
To unsubscribe send an email to ceph-users-leave(a)ceph.io
[1] https://shaman.ceph.com/repos/ceph/octopus/
--
Jason