Dear Jason,
On 12.02.20 at 19:29, Jason Dillaman wrote:
On Wed, Feb 12, 2020 at 11:55 AM Oliver Freyermuth
<freyermuth@physik.uni-bonn.de> wrote:
Dear Cephalopodians,
for those on the list also fighting rbd-mirror process instabilities: with 14.2.7 (though
it may also have been present before; it does not happen often),
I very rarely encounter a case where neither of the two hacks I described works
anymore, since "ceph daemon /var/run/ceph/ceph-client.rbd_mirror...." just hangs
forever.
So I am now using the following three cronjobs:
1) Restart the RBD mirror if any images enter unknown or stopped state. This is probably
not needed, since (2) should catch everything:
----------------------
rbd --id=${rbd_id} mirror pool status | egrep -q 'unknown|stopped' &&
systemctl -q is-active ceph-rbd-mirror@${rbd_id}.service && systemctl restart
ceph-rbd-mirror@${rbd_id}.service
----------------------
2) Restart the RBD mirror if the RBD client is not replaying anything anymore (this
happens randomly for us due to client blacklisting when many OSDs restart, and it never
recovers from that situation):
----------------------
ceph daemon /var/run/ceph/ceph-client.${rbd_id}.\$(systemctl show --property MainPID
ceph-rbd-mirror@${rbd_id}.service | sed 's/MainPID=//').*.asok rbd mirror status |
grep -q Replaying || (systemctl -q is-active ceph-rbd-mirror@${rbd_id}.service &&
systemctl restart ceph-rbd-mirror@${rbd_id}.service)
----------------------
If you can repeat this situation, can you provide any supporting
rbd-mirror daemon logs? The rbd-mirror daemon periodically scans (~30
seconds) to see if it has been blacklisted and then starts a recovery
process. If something is broken with this, we would like to fix it.
of course - in fact, I did not even need to reproduce it manually, since one of our RBD
mirrors has heard you ;-).
Interestingly, I did not see any OSDs going down this time; I only restarted some VMs,
but this should hopefully not affect anything here...
Here are the logs from this occasion:
2020-02-12 19:57:03.178 7f39519c2040 0 ceph version 14.2.7
(3d58626ebeec02d8385a4cefb92c6cbc3a45bfe8) nautilus (stable), process rbd-mirror, pid
4003991
2020-02-12 19:57:03.234 7f39519c2040 1 mgrc service_daemon_register rbd-mirror.3286479
metadata {arch=x86_64,ceph_release=nautilus,ceph_version=ceph version 14.2.7
(3d58626ebeec02d8385a4cefb92c6cbc3a45bfe8) nautilus
(stable),ceph_version_short=14.2.7,cpu=Intel(R) Xeon(R) CPU E5410 @
2.33GHz,distro=centos,distro_description=CentOS Linux 7
(Core),distro_version=7,hostname=rbd-mirror002.example.com,id=rbd_mirror_backup,instance_id=3286479,kernel_description=#1
SMP Fri Dec 6 15:49:49 UTC
2019,kernel_version=3.10.0-1062.9.1.el7.x86_64,mem_swap_kb=5119996,mem_total_kb=16264948,os=Linux}
2020-02-12 19:57:06.533 7f39339b4700 0 rbd::mirror::ImageReplayer: 0x5614036bdf80
[2/73b65031-0770-4c72-bf74-896ad960dc8f] handle_shut_down: mirror image no longer exists
2020-02-12 19:57:30.568 7f39339b4700 0 rbd::mirror::ImageReplayer: 0x561403c89200
[2/61b047c8-7069-4f6d-abbd-4515c21a1c04] handle_shut_down: mirror image no longer exists
2020-02-12 19:57:30.628 7f39339b4700 0 rbd::mirror::ImageReplayer: 0x561403c8a400
[2/87ba1bbb-7afc-4604-830c-5b0abdf0c0e5] handle_shut_down: mirror image no longer exists
2020-02-12 19:57:30.658 7f39339b4700 0 rbd::mirror::ImageReplayer: 0x561404c0a900
[2/f9c1e3ad-3552-4b41-b3d3-c1b0daf58122] handle_shut_down: mirror image no longer exists
2020-02-12 19:57:33.584 7f39339b4700 0 rbd::mirror::ImageReplayer: 0x561403c88480
[2/61b047c8-7069-4f6d-abbd-4515c21a1c04] handle_shut_down: mirror image no longer exists
2020-02-12 20:10:04.855 7f39339b4700 0 rbd::mirror::ImageReplayer: 0x56140564c880
[2/61b047c8-7069-4f6d-abbd-4515c21a1c04] handle_shut_down: mirror image no longer exists
2020-02-12 20:10:04.879 7f39339b4700 0 rbd::mirror::ImageReplayer: 0x561403cd1f80
[2/73b65031-0770-4c72-bf74-896ad960dc8f] handle_shut_down: mirror image no longer exists
2020-02-12 20:10:04.890 7f39339b4700 0 rbd::mirror::ImageReplayer: 0x561403cd1680
[2/87ba1bbb-7afc-4604-830c-5b0abdf0c0e5] handle_shut_down: mirror image no longer exists
2020-02-12 20:10:04.971 7f39339b4700 0 rbd::mirror::ImageReplayer: 0x561403cd0480
[2/f9c1e3ad-3552-4b41-b3d3-c1b0daf58122] handle_shut_down: mirror image no longer exists
2020-02-12 20:11:30.786 7f39339b4700 0 rbd::mirror::ImageReplayer: 0x561404d04900
[2/81b08854-60d9-4ac7-9169-c27db7966a06] handle_shut_down: mirror image no longer exists
2020-02-12 20:11:30.793 7f39339b4700 0 rbd::mirror::ImageReplayer: 0x561404d05f80
[2/ac260aaf-b6e4-465e-b804-af612e5ad531] handle_shut_down: mirror image no longer exists
2020-02-12 20:11:30.796 7f39339b4700 0 rbd::mirror::ImageReplayer: 0x561404cae000
[2/fc0bf498-db79-4acf-b23e-1a6bd8ad90c4] handle_shut_down: mirror image no longer exists
2020-02-12 20:13:28.638 7f393f1cb700 0 rbd::mirror::LeaderWatcher: 0x561403615200
handle_get_locker: breaking leader lock after 3 failed attempts to acquire
2020-02-12 20:13:30.061 7f392a1a1700 -1 rbd::mirror::ImageReplayer: 0x5614036bf600
[2/7fbf7e3e-22cc-4e78-b570-8938d6d13190] handle_process_entry_safe: failed to commit
journal event: (108) Cannot send after transport endpoint shutdown
2020-02-12 20:13:30.061 7f392a1a1700 -1 rbd::mirror::ImageReplayer: 0x5614036bf600
[2/7fbf7e3e-22cc-4e78-b570-8938d6d13190] handle_replay_complete: replay encountered an
error: (108) Cannot send after transport endpoint shutdown
2020-02-12 20:13:30.064 7f392a1a1700 -1 rbd::mirror::ImageReplayer: 0x5614036bed00
[2/7f2bbf53-566c-409f-926f-894a2648c175] handle_process_entry_safe: failed to commit
journal event: (108) Cannot send after transport endpoint shutdown
2020-02-12 20:13:30.064 7f392a1a1700 -1 rbd::mirror::ImageReplayer: 0x5614036bed00
[2/7f2bbf53-566c-409f-926f-894a2648c175] handle_replay_complete: replay encountered an
error: (108) Cannot send after transport endpoint shutdown
...
many more of these
...
2020-02-12 20:13:30.121 7f392a1a1700 -1 rbd::mirror::ImageReplayer: 0x5614036bdb00
[2/66f808b0-2d75-4ffa-b6a8-dc2ac934a0e5] handle_replay_complete: replay encountered an
error: (108) Cannot send after transport endpoint shutdown
2020-02-12 20:13:30.122 7f393f1cb700 -1 rbd::mirror::LeaderWatcher: 0x561403615200
handle_break_leader_lock: error breaking leader lock: (108) Cannot send after transport
endpoint shutdown
2020-02-12 20:13:30.320 7f393f1cb700 -1
rbd::mirror::image_replayer::PrepareLocalImageRequest: 0x5614051d1280
handle_get_tag_owner: failed to retrieve journal tag owner: (108) Cannot send after
transport endpoint shutdown
2020-02-12 20:13:30.336 7f393f1cb700 -1 rbd::mirror::ImageReplayer: 0x561403c8a880
[2/ba137ddf-f520-4e2d-b56d-b5c572ca93b5] operator(): start failed: (108) Cannot send after
transport endpoint shutdown
2020-02-12 20:13:33.329 7f39339b4700 -1
rbd::mirror::image_replayer::GetMirrorImageIdRequest: 0x561408125570 handle_get_image_id:
failed to retrieve image id: (108) Cannot send after transport endpoint shutdown
2020-02-12 20:13:33.329 7f39339b4700 -1
rbd::mirror::image_replayer::GetMirrorImageIdRequest: 0x561408125500 handle_get_image_id:
failed to retrieve image id: (108) Cannot send after transport endpoint shutdown
2020-02-12 20:13:33.329 7f39339b4700 -1
rbd::mirror::image_replayer::GetMirrorImageIdRequest: 0x561408125490 handle_get_image_id:
failed to retrieve image id: (108) Cannot send after transport endpoint shutdown
...
many more of these
...
2020-02-12 20:13:33.333 7f39339b4700 -1
rbd::mirror::image_replayer::GetMirrorImageIdRequest: 0x561405dd7880 handle_get_image_id:
failed to retrieve image id: (108) Cannot send after transport endpoint shutdown
2020-02-12 20:13:33.334 7f393f1cb700 -1 rbd::mirror::InstanceReplayer: 0x561402a1fb80
start_image_replayer: global_image_id=ba137ddf-f520-4e2d-b56d-b5c572ca93b5: blacklisted
detected during image replay
2020-02-12 20:13:33.334 7f39339b4700 -1
rbd::mirror::image_replayer::GetMirrorImageIdRequest: 0x56140369ad90 handle_get_image_id:
failed to retrieve image id: (108) Cannot send after transport endpoint shutdown
...
many more of the "Cannot send after..." lines
...
2020-02-12 20:13:33.337 7f393f1cb700 -1 rbd::mirror::ImageReplayer: 0x561403c8ba80
[2/e41c5629-a158-4a5c-a3ae-5a971082cffa] operator(): start failed: (108) Cannot send after
transport endpoint shutdown
2020-02-12 20:13:33.337 7f39339b4700 -1
rbd::mirror::image_replayer::GetMirrorImageIdRequest: 0x561405c75d50 handle_get_image_id:
failed to retrieve image id: (108) Cannot send after transport endpoint shutdown
2020-02-12 20:13:33.337 7f39339b4700 -1
rbd::mirror::image_replayer::GetMirrorImageIdRequest: 0x561405c75420 handle_get_image_id:
failed to retrieve image id: (108) Cannot send after transport endpoint shutdown
2020-02-12 20:13:33.338 7f39339b4700 -1
rbd::mirror::image_replayer::GetMirrorImageIdRequest: 0x56140451cbd0 handle_get_image_id:
failed to retrieve image id: (108) Cannot send after transport endpoint shutdown
2020-02-12 20:13:33.338 7f39339b4700 -1
rbd::mirror::image_replayer::GetMirrorImageIdRequest: 0x561408398230 handle_get_image_id:
failed to retrieve image id: (108) Cannot send after transport endpoint shutdown
2020-02-12 20:13:33.338 7f393f1cb700 -1 rbd::mirror::ImageReplayer: 0x561403cd0d80
[2/e64b7f9a-eb24-4b46-a55e-f821e8f63353] operator(): start failed: (108) Cannot send after
transport endpoint shutdown
2020-02-12 20:13:33.339 7f393f1cb700 -1 rbd::mirror::ImageReplayer: 0x561404c0a000
[2/ed096674-37da-4988-9898-ae27fa3a82b6] operator(): start failed: (108) Cannot send after
transport endpoint shutdown
2020-02-12 20:13:33.340 7f393f1cb700 -1 rbd::mirror::ImageReplayer: 0x56140362ff80
[2/f5711272-9fec-42ba-8a03-e65b8f3d55f0] operator(): start failed: (108) Cannot send after
transport endpoint shutdown
2020-02-12 20:13:33.340 7f393f1cb700 -1 rbd::mirror::ImageReplayer: 0x56140362fb00
[2/fafed4a3-802f-4860-a523-0fc8d677d9cc] operator(): start failed: (108) Cannot send after
transport endpoint shutdown
2020-02-12 20:14:03.328 7f393f1cb700 -1 rbd::mirror::InstanceReplayer: 0x561402a1fb80
start_image_replayer: global_image_id=17b2a7d9-aa98-4f85-b9da-bd27afee8aff: blacklisted
detected during image replay
2020-02-12 20:14:03.328 7f393f1cb700 -1 rbd::mirror::InstanceReplayer: 0x561402a1fb80
start_image_replayer: global_image_id=1be8a0ce-68df-42ce-8678-eafeeb007ad3: blacklisted
detected during image replay
2020-02-12 20:14:03.328 7f393f1cb700 -1 rbd::mirror::InstanceReplayer: 0x561402a1fb80
start_image_replayer: global_image_id=20d42977-e22a-41f2-b316-a4620991aee8: blacklisted
detected during image replay
2020-02-12 20:14:03.328 7f393f1cb700 -1 rbd::mirror::InstanceReplayer: 0x561402a1fb80
start_image_replayer: global_image_id=2fd1ae38-3be2-45b7-b221-8021732a741a: blacklisted
detected during image replay
...
more and more of those
...
2020-02-12 20:43:33.333 7f393f1cb700 -1 rbd::mirror::InstanceReplayer: 0x561402a1fb80
start_image_replayer: global_image_id=e2d9b334-f995-442a-bd52-c7acc22d4e63: blacklisted
detected during image replay
2020-02-12 20:43:33.333 7f393f1cb700 -1 rbd::mirror::InstanceReplayer: 0x561402a1fb80
start_image_replayer: global_image_id=e41c5629-a158-4a5c-a3ae-5a971082cffa: blacklisted
detected during image replay
2020-02-12 20:43:33.333 7f393f1cb700 -1 rbd::mirror::InstanceReplayer: 0x561402a1fb80
start_image_replayer: global_image_id=e64b7f9a-eb24-4b46-a55e-f821e8f63353: blacklisted
detected during image replay
2020-02-12 20:43:33.333 7f393f1cb700 -1 rbd::mirror::InstanceReplayer: 0x561402a1fb80
start_image_replayer: global_image_id=ed096674-37da-4988-9898-ae27fa3a82b6: blacklisted
detected during image replay
2020-02-12 20:43:33.333 7f393f1cb700 -1 rbd::mirror::InstanceReplayer: 0x561402a1fb80
start_image_replayer: global_image_id=f5711272-9fec-42ba-8a03-e65b8f3d55f0: blacklisted
detected during image replay
2020-02-12 20:43:33.333 7f393f1cb700 -1 rbd::mirror::InstanceReplayer: 0x561402a1fb80
start_image_replayer: global_image_id=fafed4a3-802f-4860-a523-0fc8d677d9cc: blacklisted
detected during image replay
...
Actually, right now it still is in this state. I can easily tell because it is
doing network I/O at ~4 MB/s in and out via the interface facing the "target"
cluster to which the images should be replayed.
Checking the mon logs, the origin of the blacklisting this time:
2020-02-12 20:13:28.641 7f5dfad0c700 0 mon.mon001@0(leader) e3 handle_command
mon_command({"prefix": "osd blacklist", "blacklistop":
"add", "addr": "10.160.19.242:0/3907577364"} v 0) v1
2020-02-12 20:13:28.641 7f5dfad0c700 0 log_channel(audit) log [INF] : from='client.?
10.160.19.241:0/993249630' entity='client.rbd_mirror_backup'
cmd=[{"prefix": "osd blacklist", "blacklistop":
"add", "addr": "10.160.19.242:0/3907577364"}]: dispatch
was actually another rbd-mirror asking to blacklist its "colleague". Checking
the logs of rbd-mirror001 (10.160.19.241), I see:
...
2020-02-12 20:11:20.182 7f16d24d2040 0 ceph version 14.2.7
(3d58626ebeec02d8385a4cefb92c6cbc3a45bfe8) nautilus (stable), process rbd-mirror, pid 674
2020-02-12 20:11:20.237 7f16d24d2040 1 mgrc service_daemon_register rbd-mirror.3245041
metadata {arch=x86_64,ceph_release=nautilus,ceph_version=ceph version 14.2.7
(3d58626ebeec02d8385a4cefb92c6cbc3a45bfe8)
nautilus (stable),ceph_version_short=14.2.7,cpu=Intel(R) Xeon(R) CPU E5410 @
2.33GHz,distro=centos,distro_description=CentOS Linux 7
(Core),distro_version=7,hostname=rbd-mirror001.backup.example.com,id=rbd_mirror_backup,instance_id=3245041,kernel_description=#1
SMP Fri Dec 6 15:49:49 UTC
2019,kernel_version=3.10.0-1062.9.1.el7.x86_64,mem_swap_kb=5119996,mem_total_kb=16264948,os=Linux}
2020-02-12 20:11:26.588 7f16b44c5700 0 rbd::mirror::ImageReplayer: 0x563e900bcd80
[2/5569d343-98fb-49c9-88cb-19251ce0e4cb] handle_shut_down: mirror image no longer exists
2020-02-12 20:13:28.631 7f16bfcdc700 0 rbd::mirror::LeaderWatcher: 0x563e9002f200
handle_get_locker: breaking leader lock after 3 failed attempts to acquire
2020-02-12 20:13:31.353 7f16b44c5700 0 rbd::mirror::ImageReplayer: 0x563e91510000
[2/5569d343-98fb-49c9-88cb-19251ce0e4cb] handle_shut_down: mirror image no longer exists
2020-02-12 20:13:31.660 7f16bfcdc700 -1
rbd::mirror::image_replayer::PrepareLocalImageRequest: 0x563e92992d80
handle_get_tag_owner: failed to retrieve journal tag owner: (108) Cannot send after
transport endpoint shutdown
2020-02-12 20:13:31.660 7f16bfcdc700 -1 rbd::mirror::ImageReplayer: 0x563e916eb680
[2/0e614ece-65b1-4b4a-99bd-44dd6235eb70] operator(): start failed: (108) Cannot send after
transport endpoint shutdown
2020-02-12 20:13:31.660 7f16bfcdc700 -1
rbd::mirror::image_replayer::PrepareLocalImageRequest: 0x563e913af500
handle_get_tag_owner: failed to retrieve journal tag owner: (108) Cannot send after
transport endpoint shutdown
2020-02-12 20:13:31.661 7f16bfcdc700 -1 rbd::mirror::ImageReplayer: 0x563e916ec880
[2/12513e0f-a87e-44ee-81aa-57de5aca739a] operator(): start failed: (108) Cannot send after
transport endpoint shutdown
2020-02-12 20:13:31.662 7f16bfcdc700 -1
rbd::mirror::image_replayer::PrepareLocalImageRequest: 0x563e92993780
handle_get_tag_owner: failed to retrieve journal tag owner: (108) Cannot send after
transport endpoint shutdown
2020-02-12 20:13:31.662 7f16bfcdc700 -1 rbd::mirror::ImageReplayer: 0x563e900bd680
[2/5949f093-8aee-41b3-b50f-5902d94ae90e] operator(): start failed: (108) Cannot send after
transport endpoint shutdown
2020-02-12 20:13:31.686 7f16bfcdc700 -1
rbd::mirror::image_replayer::PrepareLocalImageRequest: 0x563e92992e80
handle_get_tag_owner: failed to retrieve journal tag owner: (108) Cannot send after
transport endpoint shutdown
2020-02-12 20:13:31.686 7f16bfcdc700 -1 rbd::mirror::ImageReplayer: 0x563e916ec400
[2/11e30f19-7975-4dfe-94ae-d73b2cff7f61] operator(): start failed: (108) Cannot send after
transport endpoint shutdown
2020-02-12 20:13:31.709 7f16bfcdc700 -1
rbd::mirror::image_replayer::PrepareLocalImageRequest: 0x563e92992f00
handle_get_tag_owner: failed to retrieve journal tag owner: (108) Cannot send after
transport endpoint shutdown
2020-02-12 20:13:31.710 7f16bfcdc700 -1 rbd::mirror::ImageReplayer: 0x563e90047680
[2/1816f3b2-e5ae-461d-94bb-07a376249922] operator(): start failed: (108) Cannot send after
transport endpoint shutdown
2020-02-12 20:13:31.712 7f16a44a5700 -1
rbd::mirror::image_replayer::OpenLocalImageRequest: 0x563e923ab5c0 handle_open_image :
failed to open image '3225d8dacb970d': (108) Cannot send after transport endpoint
shutdown
2020-02-12 20:13:31.712 7f16bfcdc700 0 rbd::mirror::PoolWatcher: 0x563e913cdb00
handle_rewatch_complete: detected client is blacklisted
2020-02-12 20:13:31.713 7f16a44a5700 -1 rbd::mirror::image_replayer::BootstrapRequest:
0x563e8f3ecfc0 handle_open_local_image: failed to open local image: (108) Cannot send
after transport endpoint shutdown
2020-02-12 20:13:31.713 7f16a44a5700 -1
rbd::mirror::image_replayer::OpenLocalImageRequest: 0x563e91947560 handle_open_image :
failed to open image '30272383ed48b0': (108) Cannot send after transport endpoint
shutdown
2020-02-12 20:13:31.713 7f16bfcdc700 0 rbd::mirror::PoolWatcher: 0x563e913cdb00
handle_rewatch_complete: detected client is blacklisted
2020-02-12 20:13:31.714 7f16bfcdc700 0 rbd::mirror::PoolWatcher: 0x563e913cdb00
handle_rewatch_complete: detected client is blacklisted
...
So it "feels" like our three RBD mirrors start fighting (maybe after I/O
congestion, e.g. caused by restarting OSDs, high load, ...?) and sometimes blacklist each
other in a frenzy.
The blacklisted one becomes rather unhappy and never recovers, though.
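For the record, the blacklist on the target cluster can be inspected and, as a stopgap, cleared by hand; the address below is only an example in the IP:0/nonce form seen in the logs above:

```shell
# List the current blacklist entries (client address plus expiry time):
ceph osd blacklist ls
# Manually remove one entry; the stuck rbd-mirror should then recover on its
# next periodic blacklist check. The address here is only an example.
ceph osd blacklist rm 10.160.19.242:0/3907577364
```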
Let me know if there is more info I can provide; I can (sometimes) force this by simply
restarting many OSDs in unison.
Cheers,
Oliver
3) Restart the RBD mirror if the control socket becomes
unresponsive. This catches the cases where (2) just hangs forever, and (1) does not fire
yet because at least one functional RBD mirror is left (we have 3):
----------------------
timeout 10 ceph daemon /var/run/ceph/ceph-client.${rbd_id}.\$(systemctl show --property
MainPID ceph-rbd-mirror@${rbd_id}.service | sed 's/MainPID=//').*.asok help ||
{ [ $? -eq 124 ] && systemctl -q is-active ceph-rbd-mirror@${rbd_id}.service &&
systemctl restart ceph-rbd-mirror@${rbd_id}.service; }
----------------------
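For completeness, the three checks could also be combined into a single hourly watchdog script. This is only a sketch using the same unit name, id and asok conventions as above, not something battle-tested:

```shell
#!/bin/sh
# Sketch: checks (1)-(3) in one script; same unit/id/asok conventions as above.
rbd_id=rbd_mirror_backup
unit="ceph-rbd-mirror@${rbd_id}.service"

restart_if_active() {
    systemctl -q is-active "$unit" && systemctl restart "$unit"
}

asok_path() {
    pid=$(systemctl show --property MainPID "$unit" | sed 's/MainPID=//')
    # The glob resolves to the daemon's admin socket.
    echo /var/run/ceph/ceph-client.${rbd_id}.${pid}.*.asok
}

# (1) Any image in unknown/stopped state?
rbd --id="$rbd_id" mirror pool status | egrep -q 'unknown|stopped' &&
    { restart_if_active; exit 0; }
# (3) Admin socket unresponsive? (timeout exits with 124 on a hang)
timeout 10 ceph daemon $(asok_path) help >/dev/null 2>&1
[ $? -eq 124 ] && { restart_if_active; exit 0; }
# (2) Nothing in Replaying state anymore?
ceph daemon $(asok_path) rbd mirror status | grep -q Replaying || restart_if_active
```

Dropped into /etc/cron.hourly/ (a hypothetical location), this would replace the three separate crontab entries.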
With this, we are running quite stably: with 3 RBD mirrors, there never is a real outage
(and in any case, all issues seem to correlate clearly with restarts or short
"hangs" of many OSD processes).
Cheers, and I hope this helps somebody with similar issues,
Oliver
On 27.12.19 at 02:43, Oliver Freyermuth wrote:
Dear Cephalopodians,
for those following along through the holiday season, here is my "quick
hack" for now, since our rbd mirrors keep going into a "blacklisted" state
whenever a bunch of OSDs restart in the cluster.
For those not following along, nice holidays to you and hopefully some calm days off
:-).
To re-summarize: once our rbd-mirrors are in that "blacklisted" state, they
do not recover by themselves, so I think what is missing is an auto-restart /
reconnect after blacklisting
(and, of course, an idea why the daemons' clients get blacklisted when OSDs
restart?). Let me know if I should open a tracker issue on that one,
or whether I can provide more information (it happens every few nights for us).
Since I wanted to restart them only in case of failure, I came up with some lengthy
commands.
I have now set up two cronjobs on the rbd-mirror daemon nodes: one which works
"whatever happens", restarting an rbd mirror if any image sync is broken:
rbd --id=rbd_mirror_backup mirror pool status | egrep -q 'unknown|stopped'
&& systemctl -q is-active ceph-rbd-mirror@rbd_mirror_backup.service &&
systemctl restart ceph-rbd-mirror@rbd_mirror_backup.service
I run this hourly. With multiple rbd mirrors this does not catch everything, though: if
we enter the failure state (blacklisted rbd-mirror clients), this part only ensures that
at least one client recovers
and takes over the full load. To get the other clients to restart only if they are also
blacklisted, I do:
ceph daemon /var/run/ceph/ceph-client.rbd_mirror_backup.$(systemctl show --property
MainPID ceph-rbd-mirror@rbd_mirror_backup.service | sed 's/MainPID=//').*.asok rbd
mirror status | grep -q Replaying || (systemctl -q is-active
ceph-rbd-mirror@rbd_mirror_backup.service && systemctl restart
ceph-rbd-mirror@rbd_mirror_backup.service)
This also runs hourly and queries the daemon state itself. If no image is in the
"Replaying" state, something is wrong and the daemon is restarted.
Technically, the latter cronjob should be sufficient; the first one is only there in case
the daemons go completely awry (but I have not observed this so far).
I made two interesting observations, though:
- It seems the log of the rbd-mirror sometimes contains no errors at all. The
cause seems to be that the "rbd-mirror" processes are not SIGHUPed by the
logrotate rule shipped with ceph-base.
I created a tracker issue here:
https://tracker.ceph.com/issues/43428
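Until that is fixed upstream, the shipped rule can be extended locally. A sketch of the relevant postrotate part only; the exact daemon list should match whatever your ceph-base version ships, rbd-mirror is the addition here:

```
# /etc/logrotate.d/ceph, postrotate excerpt: also SIGHUP rbd-mirror so it
# reopens its log file after rotation.
postrotate
    killall -q -1 rbd-mirror ceph-mon ceph-mgr ceph-mds ceph-osd || true
endscript
```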
- The output of the "rbd mirror status" admin socket command is not valid JSON;
it is missing the trailing brace.
I created a tracker issue here:
https://tracker.ceph.com/issues/43429
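Until the JSON issue is fixed, one can patch the output up before parsing it. A minimal sketch, assuming the only defect is the missing trailing brace; the sample document here is made up and not the real "rbd mirror status" schema:

```shell
# Made-up stand-in for the truncated admin-socket output:
out='{"image_replayers": [{"state": "Replaying"}]'
# Append the missing brace, then parse as usual:
state=$(printf '%s}' "$out" |
    python3 -c 'import json,sys; print(json.load(sys.stdin)["image_replayers"][0]["state"])')
echo "$state"
```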
Cheers,
Oliver
On 24.12.19 at 04:39, Oliver Freyermuth wrote:
Dear Cephalopodians,
running 13.2.6 on the source cluster and 14.2.5 on the rbd-mirror nodes and the target
cluster,
I observe regular failures of the rbd-mirror processes.
By "failures" I mean that traffic stops, but the daemons are still listed as active
rbd-mirror daemons in
"ceph -s", and the processes are still running. This coincides with a flood of
the messages below in the mirror logs.
This happens "sometimes" when some OSDs go down and come back up in the target
cluster (which happens each night, since the disks in that cluster
briefly go offline during "online" SMART self-tests - that is a problem in
itself, but it is a cluster built from hardware that would otherwise have been trashed).
The rbd-mirror daemons keep running in any case, but synchronization stops. If not all
rbd-mirror daemons have failed (we have three running, and it usually does not hit all of
them),
the surviving one(s) do not seem to take over the images the failed daemons had
locked.
Right now I am eyeing a "quick solution" of regularly restarting the
rbd-mirror daemons, but if there are any good ideas on what debug info I could collect
to get this analyzed and fixed, that would of course be appreciated :-).
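Concretely, if higher verbosity helps, these are the knobs I could turn up on a running daemon via its admin socket; the asok path is shortened here (the real one has the ceph-client.<id>.<pid>.*.asok form as in the commands above), and 20 is just a maximal example value:

```shell
# Raise rbd-mirror and journaler log verbosity on the fly via the admin socket;
# the asok path is abbreviated here.
ceph daemon /var/run/ceph/ceph-client.rbd_mirror_backup.*.asok config set debug_rbd_mirror 20
ceph daemon /var/run/ceph/ceph-client.rbd_mirror_backup.*.asok config set debug_journaler 20
```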
Cheers,
Oliver
-----------------------------------------------
2019-12-24 02:08:51.379 7f31c530e700 -1 rbd::mirror::ImageReplayer: 0x559dcb968d00
[2/aabba863-89fd-4ea5-bb8c-0f417225d394] handle_process_entry_safe: failed to commit
journal event: (108) Cannot send after transport endpoint shutdown
2019-12-24 02:08:51.379 7f31c530e700 -1 rbd::mirror::ImageReplayer: 0x559dcb968d00
[2/aabba863-89fd-4ea5-bb8c-0f417225d394] handle_replay_complete: replay encountered an
error: (108) Cannot send after transport endpoint shutdown
...
2019-12-24 02:08:54.392 7f31c530e700 -1 rbd::mirror::ImageReplayer: 0x559dcb87bb00
[2/23699357-a611-4557-9d73-6ff5279da991] handle_process_entry_safe: failed to commit
journal event: (125) Operation canceled
2019-12-24 02:08:54.392 7f31c530e700 -1 rbd::mirror::ImageReplayer: 0x559dcb87bb00
[2/23699357-a611-4557-9d73-6ff5279da991] handle_replay_complete: replay encountered an
error: (125) Operation canceled
2019-12-24 02:08:55.707 7f31ea358700 -1
rbd::mirror::image_replayer::GetMirrorImageIdRequest: 0x559dce2e05b0 handle_get_image_id:
failed to retrieve image id: (108) Cannot send after transport endpoint shutdown
2019-12-24 02:08:55.707 7f31ea358700 -1
rbd::mirror::image_replayer::GetMirrorImageIdRequest: 0x559dcf47ee70 handle_get_image_id:
failed to retrieve image id: (108) Cannot send after transport endpoint shutdown
...
2019-12-24 02:08:55.716 7f31f5b6f700 -1 rbd::mirror::ImageReplayer: 0x559dcb997680
[2/f8218221-6608-4a2b-8831-84ca0c2cb418] operator(): start failed: (108) Cannot send after
transport endpoint shutdown
2019-12-24 02:09:25.707 7f31f5b6f700 -1 rbd::mirror::InstanceReplayer: 0x559dcabd5b80
start_image_replayer: global_image_id=0577bd16-acc4-4e9a-81f0-c698a24f8771: blacklisted
detected during image replay
2019-12-24 02:09:25.707 7f31f5b6f700 -1 rbd::mirror::InstanceReplayer: 0x559dcabd5b80
start_image_replayer: global_image_id=05bd4cca-a561-4a5c-ad83-9905ad5ce34e: blacklisted
detected during image replay
2019-12-24 02:09:25.707 7f31f5b6f700 -1 rbd::mirror::InstanceReplayer: 0x559dcabd5b80
start_image_replayer: global_image_id=0e614ece-65b1-4b4a-99bd-44dd6235eb70: blacklisted
detected during image replay
-----------------------------------------------
_______________________________________________
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-leave@ceph.io