Dear Cephalopodians,
Running 13.2.6 on the source cluster and 14.2.5 on the rbd-mirror nodes and the target
cluster,
I observe regular failures of the rbd-mirror processes.
By failures, I mean that traffic stops while the daemons are still running and are
still listed as active rbd-mirror daemons in "ceph -s". This coincides with a flood
of the messages below in the mirror logs.
This happens "sometimes" when some OSDs go down and come back up in the target cluster
(which happens each night, since the disks in that cluster
briefly go offline during "online" SMART self-tests - that's a problem in
itself, but it's a cluster built from hardware that would have been trashed
otherwise).
The rbd-mirror daemons keep running in any case, but synchronization stops. If not all
rbd-mirror daemons have failed (we have three running, and it usually does not hit all of
them), the "surviving" one(s) do not seem to take over the images the other daemons had
locked.
Right now, I am considering a "quick fix" of regularly restarting the
rbd-mirror daemons, but if there are any good ideas on which debug info I could collect
to get this analyzed and fixed, that would of course be appreciated :-).
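As a rough sketch of how such a restart could be triggered by the log contents
instead of a fixed timer: the Python snippet below scans rbd-mirror log lines for the
errors quoted further down. It assumes one message per line (unlike the wrapped excerpt
in this mail). The function names and the decision rule are my own assumptions, not
anything rbd-mirror provides, and the actual restart is deliberately left out.

```python
import re
from collections import Counter

# Matches the errno part of messages such as
# "... failed to commit journal event: (108) Cannot send after transport endpoint shutdown"
ERRNO_RE = re.compile(r": \((\d+)\) ")

# Marker string taken from the InstanceReplayer messages in the log excerpt.
BLACKLIST_MARKER = "blacklisted detected during image replay"

# Three lines from the excerpt, joined back into single lines.
SAMPLE = [
    "2019-12-24 02:08:51.379 7f31c530e700 -1 rbd::mirror::ImageReplayer: 0x559dcb968d00 "
    "[2/aabba863-89fd-4ea5-bb8c-0f417225d394] handle_process_entry_safe: failed to commit "
    "journal event: (108) Cannot send after transport endpoint shutdown",
    "2019-12-24 02:08:54.392 7f31c530e700 -1 rbd::mirror::ImageReplayer: 0x559dcb87bb00 "
    "[2/23699357-a611-4557-9d73-6ff5279da991] handle_replay_complete: replay encountered an "
    "error: (125) Operation canceled",
    "2019-12-24 02:09:25.707 7f31f5b6f700 -1 rbd::mirror::InstanceReplayer: 0x559dcabd5b80 "
    "start_image_replayer: global_image_id=0577bd16-acc4-4e9a-81f0-c698a24f8771: blacklisted "
    "detected during image replay",
]


def summarize(lines):
    """Count errno codes and blacklist messages in rbd::mirror log lines."""
    errnos = Counter()
    blacklisted = 0
    for line in lines:
        if "rbd::mirror" not in line:
            continue  # ignore messages from other subsystems
        if BLACKLIST_MARKER in line:
            blacklisted += 1
        match = ERRNO_RE.search(line)
        if match:
            errnos[int(match.group(1))] += 1
    return errnos, blacklisted


def restart_suggested(lines):
    """Hypothetical rule: restart on any blacklist message or errno 108."""
    errnos, blacklisted = summarize(lines)
    return blacklisted > 0 or errnos[108] > 0


if __name__ == "__main__":
    errnos, blacklisted = summarize(SAMPLE)
    print(dict(errnos), blacklisted, restart_suggested(SAMPLE))
```

Whether errno 108 or the "blacklisted" message is the better trigger is an open
question; both appear in the failing daemons' logs.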
Cheers,
Oliver
-----------------------------------------------
2019-12-24 02:08:51.379 7f31c530e700 -1 rbd::mirror::ImageReplayer: 0x559dcb968d00
[2/aabba863-89fd-4ea5-bb8c-0f417225d394] handle_process_entry_safe: failed to commit
journal event: (108) Cannot send after transport endpoint shutdown
2019-12-24 02:08:51.379 7f31c530e700 -1 rbd::mirror::ImageReplayer: 0x559dcb968d00
[2/aabba863-89fd-4ea5-bb8c-0f417225d394] handle_replay_complete: replay encountered an
error: (108) Cannot send after transport endpoint shutdown
...
2019-12-24 02:08:54.392 7f31c530e700 -1 rbd::mirror::ImageReplayer: 0x559dcb87bb00
[2/23699357-a611-4557-9d73-6ff5279da991] handle_process_entry_safe: failed to commit
journal event: (125) Operation canceled
2019-12-24 02:08:54.392 7f31c530e700 -1 rbd::mirror::ImageReplayer: 0x559dcb87bb00
[2/23699357-a611-4557-9d73-6ff5279da991] handle_replay_complete: replay encountered an
error: (125) Operation canceled
2019-12-24 02:08:55.707 7f31ea358700 -1
rbd::mirror::image_replayer::GetMirrorImageIdRequest: 0x559dce2e05b0 handle_get_image_id:
failed to retrieve image id: (108) Cannot send after transport endpoint shutdown
2019-12-24 02:08:55.707 7f31ea358700 -1
rbd::mirror::image_replayer::GetMirrorImageIdRequest: 0x559dcf47ee70 handle_get_image_id:
failed to retrieve image id: (108) Cannot send after transport endpoint shutdown
...
2019-12-24 02:08:55.716 7f31f5b6f700 -1 rbd::mirror::ImageReplayer: 0x559dcb997680
[2/f8218221-6608-4a2b-8831-84ca0c2cb418] operator(): start failed: (108) Cannot send after
transport endpoint shutdown
2019-12-24 02:09:25.707 7f31f5b6f700 -1 rbd::mirror::InstanceReplayer: 0x559dcabd5b80
start_image_replayer: global_image_id=0577bd16-acc4-4e9a-81f0-c698a24f8771: blacklisted
detected during image replay
2019-12-24 02:09:25.707 7f31f5b6f700 -1 rbd::mirror::InstanceReplayer: 0x559dcabd5b80
start_image_replayer: global_image_id=05bd4cca-a561-4a5c-ad83-9905ad5ce34e: blacklisted
detected during image replay
2019-12-24 02:09:25.707 7f31f5b6f700 -1 rbd::mirror::InstanceReplayer: 0x559dcabd5b80
start_image_replayer: global_image_id=0e614ece-65b1-4b4a-99bd-44dd6235eb70: blacklisted
detected during image replay
-----------------------------------------------