On Wed, Feb 12, 2020 at 2:53 PM Oliver Freyermuth
<freyermuth(a)physik.uni-bonn.de> wrote:
Dear Jason,
On 12.02.20 at 19:29, Jason Dillaman wrote:
On Wed, Feb 12, 2020 at 11:55 AM Oliver
Freyermuth
<freyermuth(a)physik.uni-bonn.de> wrote:
Dear Cephalopodians,
for those on the list also fighting rbd-mirror process instabilities: with 14.2.7 (though
maybe it was also present before; it does not happen often),
I very rarely encounter a case in which neither of the two hacks I described works
anymore, because "ceph daemon /var/run/ceph/ceph-client.rbd_mirror...." just hangs
forever.
So I am now using the following three cronjobs:
1) Restart RBD mirror if any images enter unknown or stopped state. This is probably not
needed since (2) should catch everything:
----------------------
rbd --id=${rbd_id} mirror pool status | egrep -q 'unknown|stopped' \
  && systemctl -q is-active ceph-rbd-mirror@${rbd_id}.service \
  && systemctl restart ceph-rbd-mirror@${rbd_id}.service
----------------------
2) Restart RBD mirror if the RBD client is not replaying anything anymore (happens
randomly for us due to client blacklisting when many OSDs restart, and it never recovers
from that situation):
----------------------
ceph daemon /var/run/ceph/ceph-client.${rbd_id}.$(systemctl show --property MainPID ceph-rbd-mirror@${rbd_id}.service | sed 's/MainPID=//').*.asok rbd mirror status \
  | grep -q Replaying \
  || (systemctl -q is-active ceph-rbd-mirror@${rbd_id}.service && systemctl restart ceph-rbd-mirror@${rbd_id}.service)
----------------------
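For what it's worth, the grep logic in (2) can be factored into a small shell function so it can be tested against captured admin-socket output on its own. This is only a sketch, under the assumption (same as the cronjob above) that a healthy daemon always reports at least one image in "Replaying" state:

```shell
#!/bin/bash
# Sketch: succeed (exit 0) when the "rbd mirror status" text on stdin
# shows no image in "Replaying" state, i.e. a restart is warranted.
needs_restart() {
    ! grep -q 'Replaying'
}

# Example usage against captured status output:
if echo 'state: Stopped' | needs_restart; then
    echo "restart warranted"   # prints "restart warranted"
fi
```

Separating the decision from the `systemctl restart` side effect makes it easy to dry-run the check against saved daemon output before trusting it in cron.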
If you can repeat this situation, can you provide any supporting
rbd-mirror daemon logs? The rbd-mirror daemon periodically scans (~30
seconds) to see if it has been blacklisted and then starts a recovery
process. If something is broken with this, we would like to fix it.
Of course: I did not even need to reproduce it manually, as one of our RBD mirrors must
have heard you ;-).
Interestingly, I did not see any OSDs going down this time, only some VMs being restarted,
but that should hopefully not matter here...
Here are the logs from this occasion:
2020-02-12 19:57:03.178 7f39519c2040 0 ceph version 14.2.7
(3d58626ebeec02d8385a4cefb92c6cbc3a45bfe8) nautilus (stable), process rbd-mirror, pid
4003991
2020-02-12 19:57:03.234 7f39519c2040 1 mgrc service_daemon_register rbd-mirror.3286479
metadata {arch=x86_64,ceph_release=nautilus,ceph_version=ceph version 14.2.7
(3d58626ebeec02d8385a4cefb92c6cbc3a45bfe8) nautilus
(stable),ceph_version_short=14.2.7,cpu=Intel(R) Xeon(R) CPU E5410 @
2.33GHz,distro=centos,distro_description=CentOS Linux 7
(Core),distro_version=7,hostname=rbd-mirror002.example.com,id=rbd_mirror_backup,instance_id=3286479,kernel_description=#1
SMP Fri Dec 6 15:49:49 UTC
2019,kernel_version=3.10.0-1062.9.1.el7.x86_64,mem_swap_kb=5119996,mem_total_kb=16264948,os=Linux}
2020-02-12 19:57:06.533 7f39339b4700 0 rbd::mirror::ImageReplayer: 0x5614036bdf80
[2/73b65031-0770-4c72-bf74-896ad960dc8f] handle_shut_down: mirror image no longer exists
2020-02-12 19:57:30.568 7f39339b4700 0 rbd::mirror::ImageReplayer: 0x561403c89200
[2/61b047c8-7069-4f6d-abbd-4515c21a1c04] handle_shut_down: mirror image no longer exists
2020-02-12 19:57:30.628 7f39339b4700 0 rbd::mirror::ImageReplayer: 0x561403c8a400
[2/87ba1bbb-7afc-4604-830c-5b0abdf0c0e5] handle_shut_down: mirror image no longer exists
2020-02-12 19:57:30.658 7f39339b4700 0 rbd::mirror::ImageReplayer: 0x561404c0a900
[2/f9c1e3ad-3552-4b41-b3d3-c1b0daf58122] handle_shut_down: mirror image no longer exists
2020-02-12 19:57:33.584 7f39339b4700 0 rbd::mirror::ImageReplayer: 0x561403c88480
[2/61b047c8-7069-4f6d-abbd-4515c21a1c04] handle_shut_down: mirror image no longer exists
2020-02-12 20:10:04.855 7f39339b4700 0 rbd::mirror::ImageReplayer: 0x56140564c880
[2/61b047c8-7069-4f6d-abbd-4515c21a1c04] handle_shut_down: mirror image no longer exists
2020-02-12 20:10:04.879 7f39339b4700 0 rbd::mirror::ImageReplayer: 0x561403cd1f80
[2/73b65031-0770-4c72-bf74-896ad960dc8f] handle_shut_down: mirror image no longer exists
2020-02-12 20:10:04.890 7f39339b4700 0 rbd::mirror::ImageReplayer: 0x561403cd1680
[2/87ba1bbb-7afc-4604-830c-5b0abdf0c0e5] handle_shut_down: mirror image no longer exists
2020-02-12 20:10:04.971 7f39339b4700 0 rbd::mirror::ImageReplayer: 0x561403cd0480
[2/f9c1e3ad-3552-4b41-b3d3-c1b0daf58122] handle_shut_down: mirror image no longer exists
2020-02-12 20:11:30.786 7f39339b4700 0 rbd::mirror::ImageReplayer: 0x561404d04900
[2/81b08854-60d9-4ac7-9169-c27db7966a06] handle_shut_down: mirror image no longer exists
2020-02-12 20:11:30.793 7f39339b4700 0 rbd::mirror::ImageReplayer: 0x561404d05f80
[2/ac260aaf-b6e4-465e-b804-af612e5ad531] handle_shut_down: mirror image no longer exists
2020-02-12 20:11:30.796 7f39339b4700 0 rbd::mirror::ImageReplayer: 0x561404cae000
[2/fc0bf498-db79-4acf-b23e-1a6bd8ad90c4] handle_shut_down: mirror image no longer exists
2020-02-12 20:13:28.638 7f393f1cb700 0 rbd::mirror::LeaderWatcher: 0x561403615200
handle_get_locker: breaking leader lock after 3 failed attempts to acquire
2020-02-12 20:13:30.061 7f392a1a1700 -1 rbd::mirror::ImageReplayer: 0x5614036bf600
[2/7fbf7e3e-22cc-4e78-b570-8938d6d13190] handle_process_entry_safe: failed to commit
journal event: (108) Cannot send after transport endpoint shutdown
2020-02-12 20:13:30.061 7f392a1a1700 -1 rbd::mirror::ImageReplayer: 0x5614036bf600
[2/7fbf7e3e-22cc-4e78-b570-8938d6d13190] handle_replay_complete: replay encountered an
error: (108) Cannot send after transport endpoint shutdown
2020-02-12 20:13:30.064 7f392a1a1700 -1 rbd::mirror::ImageReplayer: 0x5614036bed00
[2/7f2bbf53-566c-409f-926f-894a2648c175] handle_process_entry_safe: failed to commit
journal event: (108) Cannot send after transport endpoint shutdown
2020-02-12 20:13:30.064 7f392a1a1700 -1 rbd::mirror::ImageReplayer: 0x5614036bed00
[2/7f2bbf53-566c-409f-926f-894a2648c175] handle_replay_complete: replay encountered an
error: (108) Cannot send after transport endpoint shutdown
...
many more of these
...
2020-02-12 20:13:30.121 7f392a1a1700 -1 rbd::mirror::ImageReplayer: 0x5614036bdb00
[2/66f808b0-2d75-4ffa-b6a8-dc2ac934a0e5] handle_replay_complete: replay encountered an
error: (108) Cannot send after transport endpoint shutdown
2020-02-12 20:13:30.122 7f393f1cb700 -1 rbd::mirror::LeaderWatcher: 0x561403615200
handle_break_leader_lock: error breaking leader lock: (108) Cannot send after transport
endpoint shutdown
2020-02-12 20:13:30.320 7f393f1cb700 -1
rbd::mirror::image_replayer::PrepareLocalImageRequest: 0x5614051d1280
handle_get_tag_owner: failed to retrieve journal tag owner: (108) Cannot send after
transport endpoint shutdown
2020-02-12 20:13:30.336 7f393f1cb700 -1 rbd::mirror::ImageReplayer: 0x561403c8a880
[2/ba137ddf-f520-4e2d-b56d-b5c572ca93b5] operator(): start failed: (108) Cannot send after
transport endpoint shutdown
2020-02-12 20:13:33.329 7f39339b4700 -1
rbd::mirror::image_replayer::GetMirrorImageIdRequest: 0x561408125570 handle_get_image_id:
failed to retrieve image id: (108) Cannot send after transport endpoint shutdown
2020-02-12 20:13:33.329 7f39339b4700 -1
rbd::mirror::image_replayer::GetMirrorImageIdRequest: 0x561408125500 handle_get_image_id:
failed to retrieve image id: (108) Cannot send after transport endpoint shutdown
2020-02-12 20:13:33.329 7f39339b4700 -1
rbd::mirror::image_replayer::GetMirrorImageIdRequest: 0x561408125490 handle_get_image_id:
failed to retrieve image id: (108) Cannot send after transport endpoint shutdown
...
many more of these
...
2020-02-12 20:13:33.333 7f39339b4700 -1
rbd::mirror::image_replayer::GetMirrorImageIdRequest: 0x561405dd7880 handle_get_image_id:
failed to retrieve image id: (108) Cannot send after transport endpoint shutdown
2020-02-12 20:13:33.334 7f393f1cb700 -1 rbd::mirror::InstanceReplayer: 0x561402a1fb80
start_image_replayer: global_image_id=ba137ddf-f520-4e2d-b56d-b5c572ca93b5: blacklisted
detected during image replay
2020-02-12 20:13:33.334 7f39339b4700 -1
rbd::mirror::image_replayer::GetMirrorImageIdRequest: 0x56140369ad90 handle_get_image_id:
failed to retrieve image id: (108) Cannot send after transport endpoint shutdown
...
many more of the "Cannot send after..." lines
...
2020-02-12 20:13:33.337 7f393f1cb700 -1 rbd::mirror::ImageReplayer: 0x561403c8ba80
[2/e41c5629-a158-4a5c-a3ae-5a971082cffa] operator(): start failed: (108) Cannot send after
transport endpoint shutdown
2020-02-12 20:13:33.337 7f39339b4700 -1
rbd::mirror::image_replayer::GetMirrorImageIdRequest: 0x561405c75d50 handle_get_image_id:
failed to retrieve image id: (108) Cannot send after transport endpoint shutdown
2020-02-12 20:13:33.337 7f39339b4700 -1
rbd::mirror::image_replayer::GetMirrorImageIdRequest: 0x561405c75420 handle_get_image_id:
failed to retrieve image id: (108) Cannot send after transport endpoint shutdown
2020-02-12 20:13:33.338 7f39339b4700 -1
rbd::mirror::image_replayer::GetMirrorImageIdRequest: 0x56140451cbd0 handle_get_image_id:
failed to retrieve image id: (108) Cannot send after transport endpoint shutdown
2020-02-12 20:13:33.338 7f39339b4700 -1
rbd::mirror::image_replayer::GetMirrorImageIdRequest: 0x561408398230 handle_get_image_id:
failed to retrieve image id: (108) Cannot send after transport endpoint shutdown
2020-02-12 20:13:33.338 7f393f1cb700 -1 rbd::mirror::ImageReplayer: 0x561403cd0d80
[2/e64b7f9a-eb24-4b46-a55e-f821e8f63353] operator(): start failed: (108) Cannot send after
transport endpoint shutdown
2020-02-12 20:13:33.339 7f393f1cb700 -1 rbd::mirror::ImageReplayer: 0x561404c0a000
[2/ed096674-37da-4988-9898-ae27fa3a82b6] operator(): start failed: (108) Cannot send after
transport endpoint shutdown
2020-02-12 20:13:33.340 7f393f1cb700 -1 rbd::mirror::ImageReplayer: 0x56140362ff80
[2/f5711272-9fec-42ba-8a03-e65b8f3d55f0] operator(): start failed: (108) Cannot send after
transport endpoint shutdown
2020-02-12 20:13:33.340 7f393f1cb700 -1 rbd::mirror::ImageReplayer: 0x56140362fb00
[2/fafed4a3-802f-4860-a523-0fc8d677d9cc] operator(): start failed: (108) Cannot send after
transport endpoint shutdown
2020-02-12 20:14:03.328 7f393f1cb700 -1 rbd::mirror::InstanceReplayer: 0x561402a1fb80
start_image_replayer: global_image_id=17b2a7d9-aa98-4f85-b9da-bd27afee8aff: blacklisted
detected during image replay
2020-02-12 20:14:03.328 7f393f1cb700 -1 rbd::mirror::InstanceReplayer: 0x561402a1fb80
start_image_replayer: global_image_id=1be8a0ce-68df-42ce-8678-eafeeb007ad3: blacklisted
detected during image replay
2020-02-12 20:14:03.328 7f393f1cb700 -1 rbd::mirror::InstanceReplayer: 0x561402a1fb80
start_image_replayer: global_image_id=20d42977-e22a-41f2-b316-a4620991aee8: blacklisted
detected during image replay
2020-02-12 20:14:03.328 7f393f1cb700 -1 rbd::mirror::InstanceReplayer: 0x561402a1fb80
start_image_replayer: global_image_id=2fd1ae38-3be2-45b7-b221-8021732a741a: blacklisted
detected during image replay
...
more and more of those
...
2020-02-12 20:43:33.333 7f393f1cb700 -1 rbd::mirror::InstanceReplayer: 0x561402a1fb80
start_image_replayer: global_image_id=e2d9b334-f995-442a-bd52-c7acc22d4e63: blacklisted
detected during image replay
2020-02-12 20:43:33.333 7f393f1cb700 -1 rbd::mirror::InstanceReplayer: 0x561402a1fb80
start_image_replayer: global_image_id=e41c5629-a158-4a5c-a3ae-5a971082cffa: blacklisted
detected during image replay
2020-02-12 20:43:33.333 7f393f1cb700 -1 rbd::mirror::InstanceReplayer: 0x561402a1fb80
start_image_replayer: global_image_id=e64b7f9a-eb24-4b46-a55e-f821e8f63353: blacklisted
detected during image replay
2020-02-12 20:43:33.333 7f393f1cb700 -1 rbd::mirror::InstanceReplayer: 0x561402a1fb80
start_image_replayer: global_image_id=ed096674-37da-4988-9898-ae27fa3a82b6: blacklisted
detected during image replay
2020-02-12 20:43:33.333 7f393f1cb700 -1 rbd::mirror::InstanceReplayer: 0x561402a1fb80
start_image_replayer: global_image_id=f5711272-9fec-42ba-8a03-e65b8f3d55f0: blacklisted
detected during image replay
2020-02-12 20:43:33.333 7f393f1cb700 -1 rbd::mirror::InstanceReplayer: 0x561402a1fb80
start_image_replayer: global_image_id=fafed4a3-802f-4860-a523-0fc8d677d9cc: blacklisted
detected during image replay
...
Actually, right now, it still is in this state. I notice this easily because it is
doing network I/O at ~4 MB/s in and out, via the interface facing the "target"
cluster to which the images should be replayed.
Checking the mon logs, this seems to be the origin of the blacklisting this time:
2020-02-12 20:13:28.641 7f5dfad0c700 0 mon.mon001@0(leader) e3 handle_command
mon_command({"prefix": "osd blacklist", "blacklistop":
"add", "addr": "10.160.19.242:0/3907577364"} v 0) v1
2020-02-12 20:13:28.641 7f5dfad0c700 0 log_channel(audit) log [INF] : from='client.?
10.160.19.241:0/993249630' entity='client.rbd_mirror_backup'
cmd=[{"prefix": "osd blacklist", "blacklistop":
"add", "addr": "10.160.19.242:0/3907577364"}]: dispatch
This actually was another rbd-mirror asking to blacklist its "colleague". Checking
the logs of rbd-mirror001 (10.160.19.241), I see:
...
2020-02-12 20:11:20.182 7f16d24d2040 0 ceph version 14.2.7
(3d58626ebeec02d8385a4cefb92c6cbc3a45bfe8) nautilus (stable), process rbd-mirror, pid 674
2020-02-12 20:11:20.237 7f16d24d2040 1 mgrc service_daemon_register rbd-mirror.3245041
metadata {arch=x86_64,ceph_release=nautilus,ceph_version=ceph version 14.2.7
(3d58626ebeec02d8385a4cefb92c6cbc3a45bfe8)
nautilus (stable),ceph_version_short=14.2.7,cpu=Intel(R) Xeon(R) CPU E5410 @
2.33GHz,distro=centos,distro_description=CentOS Linux 7
(Core),distro_version=7,hostname=rbd-mirror001.backup.example.com,id=rbd_mirror_backup,instance_id=3245041,kernel_description=#1
SMP Fri Dec 6 15:49:49 UTC
2019,kernel_version=3.10.0-1062.9.1.el7.x86_64,mem_swap_kb=5119996,mem_total_kb=16264948,os=Linux}
2020-02-12 20:11:26.588 7f16b44c5700 0 rbd::mirror::ImageReplayer: 0x563e900bcd80
[2/5569d343-98fb-49c9-88cb-19251ce0e4cb] handle_shut_down: mirror image no longer exists
2020-02-12 20:13:28.631 7f16bfcdc700 0 rbd::mirror::LeaderWatcher: 0x563e9002f200
handle_get_locker: breaking leader lock after 3 failed attempts to acquire
2020-02-12 20:13:31.353 7f16b44c5700 0 rbd::mirror::ImageReplayer: 0x563e91510000
[2/5569d343-98fb-49c9-88cb-19251ce0e4cb] handle_shut_down: mirror image no longer exists
2020-02-12 20:13:31.660 7f16bfcdc700 -1
rbd::mirror::image_replayer::PrepareLocalImageRequest: 0x563e92992d80
handle_get_tag_owner: failed to retrieve journal tag owner: (108) Cannot send after
transport endpoint shutdown
2020-02-12 20:13:31.660 7f16bfcdc700 -1 rbd::mirror::ImageReplayer: 0x563e916eb680
[2/0e614ece-65b1-4b4a-99bd-44dd6235eb70] operator(): start failed: (108) Cannot send after
transport endpoint shutdown
2020-02-12 20:13:31.660 7f16bfcdc700 -1
rbd::mirror::image_replayer::PrepareLocalImageRequest: 0x563e913af500
handle_get_tag_owner: failed to retrieve journal tag owner: (108) Cannot send after
transport endpoint shutdown
2020-02-12 20:13:31.661 7f16bfcdc700 -1 rbd::mirror::ImageReplayer: 0x563e916ec880
[2/12513e0f-a87e-44ee-81aa-57de5aca739a] operator(): start failed: (108) Cannot send after
transport endpoint shutdown
2020-02-12 20:13:31.662 7f16bfcdc700 -1
rbd::mirror::image_replayer::PrepareLocalImageRequest: 0x563e92993780
handle_get_tag_owner: failed to retrieve journal tag owner: (108) Cannot send after
transport endpoint shutdown
2020-02-12 20:13:31.662 7f16bfcdc700 -1 rbd::mirror::ImageReplayer: 0x563e900bd680
[2/5949f093-8aee-41b3-b50f-5902d94ae90e] operator(): start failed: (108) Cannot send after
transport endpoint shutdown
2020-02-12 20:13:31.686 7f16bfcdc700 -1
rbd::mirror::image_replayer::PrepareLocalImageRequest: 0x563e92992e80
handle_get_tag_owner: failed to retrieve journal tag owner: (108) Cannot send after
transport endpoint shutdown
2020-02-12 20:13:31.686 7f16bfcdc700 -1 rbd::mirror::ImageReplayer: 0x563e916ec400
[2/11e30f19-7975-4dfe-94ae-d73b2cff7f61] operator(): start failed: (108) Cannot send after
transport endpoint shutdown
2020-02-12 20:13:31.709 7f16bfcdc700 -1
rbd::mirror::image_replayer::PrepareLocalImageRequest: 0x563e92992f00
handle_get_tag_owner: failed to retrieve journal tag owner: (108) Cannot send after
transport endpoint shutdown
2020-02-12 20:13:31.710 7f16bfcdc700 -1 rbd::mirror::ImageReplayer: 0x563e90047680
[2/1816f3b2-e5ae-461d-94bb-07a376249922] operator(): start failed: (108) Cannot send after
transport endpoint shutdown
2020-02-12 20:13:31.712 7f16a44a5700 -1
rbd::mirror::image_replayer::OpenLocalImageRequest: 0x563e923ab5c0 handle_open_image :
failed to open image '3225d8dacb970d': (108) Cannot send after transport endpoint
shutdown
2020-02-12 20:13:31.712 7f16bfcdc700 0 rbd::mirror::PoolWatcher: 0x563e913cdb00
handle_rewatch_complete: detected client is blacklisted
2020-02-12 20:13:31.713 7f16a44a5700 -1 rbd::mirror::image_replayer::BootstrapRequest:
0x563e8f3ecfc0 handle_open_local_image: failed to open local image: (108) Cannot send
after transport endpoint shutdown
2020-02-12 20:13:31.713 7f16a44a5700 -1
rbd::mirror::image_replayer::OpenLocalImageRequest: 0x563e91947560 handle_open_image :
failed to open image '30272383ed48b0': (108) Cannot send after transport endpoint
shutdown
2020-02-12 20:13:31.713 7f16bfcdc700 0 rbd::mirror::PoolWatcher: 0x563e913cdb00
handle_rewatch_complete: detected client is blacklisted
2020-02-12 20:13:31.714 7f16bfcdc700 0 rbd::mirror::PoolWatcher: 0x563e913cdb00
handle_rewatch_complete: detected client is blacklisted
...
So it "feels" like our three RBD mirrors start to fight (maybe after I/O
congestion, e.g. from restarting OSDs, high load, ...?) and sometimes blacklist each other
in a frenzy.
The blacklisted one becomes rather unhappy and never recovers, though.
Let me know if there is more info I can provide; I can (sometimes) force this by just
restarting many OSDs in unison.
Cheers,
Oliver
>
>> 3) Restart RBD mirror if control socket becomes unresponsive. This catches the
cases where (2) just hangs forever, and (1) does not fire yet because at least one
functional RBD mirror is left (we have 3):
>> ----------------------
>> timeout 10 ceph daemon /var/run/ceph/ceph-client.${rbd_id}.$(systemctl show --property MainPID ceph-rbd-mirror@${rbd_id}.service | sed 's/MainPID=//').*.asok help || { [ $? -eq 124 ] && systemctl -q is-active ceph-rbd-mirror@${rbd_id}.service && systemctl restart ceph-rbd-mirror@${rbd_id}.service; }
>> ----------------------
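One subtlety when chaining `||` and `&&` in a one-liner like the one above: in POSIX shell they have equal precedence and associate left-to-right, so `A || B && C` parses as `(A || B) && C`, which makes C run whenever A succeeds. Grouping the timeout handling in braces keeps the restart on the failure path only. A minimal demonstration:

```shell
# "A || B && C" parses as "(A || B) && C": C runs even though A succeeded.
true || false && echo "C runs after A succeeds"

# With braces, the right-hand side only runs when A fails:
true || { echo "never printed"; }
false || { echo "printed only on failure"; }
```

This prints "C runs after A succeeds" and "printed only on failure"; the middle line produces no output.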
>>
>> With this, we are running quite stably: with 3 RBD mirrors, there never is a
real outage (and in any case, all issues seem to clearly correlate with restarts or short
"hangs" of many OSD processes).
>>
>> Cheers and hope this helps somebody with similar issues,
>> Oliver
>>
>> On 27.12.19 at 02:43, Oliver Freyermuth wrote:
>>> Dear Cephalopodians,
>>>
>>> for those following along through the holiday season, here's my
"quick hack" for now, since our rbd mirrors keep going into
"blacklisted" state whenever a bunch of OSDs restart in the cluster.
>>>
>>> For those not following along, nice holidays to you and hopefully some calm
days off :-).
>>>
>>> To re-summarize: Once our rbd-mirrors are in that "blacklisted"
state, they don't recover by themselves, so I think what is missing would be an
auto-restart / reconnect after blacklisting
>>> (and, of course, an idea why the daemons' clients get blacklisted when
OSDs restart?). Let me know if I should open a tracker issue on that one,
>>> or can provide more information (it happens every few nights for us).
>>>
>>> Since I was looking to restart them only in case of failure, I came up with
some lengthy commands.
>>>
>>> I now have two cronjobs on the rbd-mirror daemon nodes set up, one which
works "whatever happens", restarting an rbd mirror if any image sync is broken:
>>>
>>> rbd --id=rbd_mirror_backup mirror pool status | egrep -q 'unknown|stopped' && systemctl -q is-active ceph-rbd-mirror@rbd_mirror_backup.service && systemctl restart ceph-rbd-mirror@rbd_mirror_backup.service
>>>
>>> I run this hourly. With multiple rbd mirrors, this does not catch
everything, though. If we enter the failure state (blacklisted rbd mirror clients), this
part only ensures at least one client recovers
>>> and takes over the full load. To get the other clients to restart only if
they are also blacklisted, I do:
>>>
>>> ceph daemon /var/run/ceph/ceph-client.rbd_mirror_backup.$(systemctl show --property MainPID ceph-rbd-mirror@rbd_mirror_backup.service | sed 's/MainPID=//').*.asok rbd mirror status | grep -q Replaying || (systemctl -q is-active ceph-rbd-mirror@rbd_mirror_backup.service && systemctl restart ceph-rbd-mirror@rbd_mirror_backup.service)
>>>
>>> This also runs hourly and queries the daemon state itself. If there's no
image in "Replaying" state, something is wrong and the daemon is restarted.
>>> Technically, the latter cronjob should be sufficient; the first one is only
there in case the daemons go completely awry (but I have not observed this so far).
>>>
>>> I made two interesting observations, though:
>>> - It seems the log of the rbd-mirror is sometimes not filled with errors at
all. The cause seems to be that the "rbd-mirror" processes are not SIGHUPed in
the logrotate rule shipped with ceph-base.
>>> I created a tracker issue here:
>>>
https://tracker.ceph.com/issues/43428
>>> - The output of the "rbd mirror status" command is not valid JSON,
it is missing the trailing brace.
>>> I created a tracker issue here:
>>>
https://tracker.ceph.com/issues/43429
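Until that is fixed upstream, the truncated output can be repaired on the consumer side by appending the missing brace before parsing. A sketch of such a workaround (the helper name is made up, and it only handles this one malformation, nothing else):

```shell
# Hypothetical helper: append the missing trailing brace reported in
# tracker #43429, then validate the result with a real JSON parser.
fix_json() {
    { cat; echo '}'; } | python3 -c 'import json, sys; print(json.dumps(json.load(sys.stdin)))'
}

# Example with a truncated document:
echo '{"leader": true, "image_states": []' | fix_json
# prints {"leader": true, "image_states": []}
```

Piping through a strict JSON parser also means the workaround fails loudly if the daemon output changes shape after the bug is fixed.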
>>>
>>> Cheers,
>>> Oliver
>>>
>>> On 24.12.19 at 04:39, Oliver Freyermuth wrote:
>>>> Dear Cephalopodians,
>>>>
>>>> running 13.2.6 on the source cluster and 14.2.5 on the rbd mirror nodes
and the target cluster,
>>>> I observe regular failures of rbd-mirror processes.
>>>>
>>>> With failures, I mean that traffic stops, but the daemons are still
listed as active rbd-mirror daemons in
>>>> "ceph -s", and the daemons are still running. This coincides with a flood of
the messages below in the mirror logs.
>>>>
>>>> This happens "sometimes" when some OSDs go down and up in the
target cluster (which happens each night, since the disks in that cluster
>>>> briefly go offline during "online" SMART self-tests -
that's a problem in itself, but it's a cluster built from hardware that would
otherwise have been trashed).
>>>>
>>>> The rbd daemons keep running in any case, but synchronization stops. If
not all rbd-mirror daemons have failed (we have three running, and it usually does not hit
all of them),
>>>> the "surviving" one(s) seem not to take over the images the
other daemons had locked.
>>>>
>>>> Right now, I am eyeing a "quick solution" of regularly
restarting the rbd-mirror daemons, but if there are any good ideas on which debug info I
could collect
>>>> to get this analyzed and fixed, that would of course be appreciated :-).
The peer rbd-mirror processes will blacklist the lock owner / leader
rbd-mirror daemon if it fails to heartbeat on time. What I don't see
that I would expect to see is a "restarting blacklisted pool replayer
for XYZ" log message about 30 seconds after the first
blacklisting was detected. Plus, after a daemon acquires the lock, I
wouldn't expect to see it be blacklisted by another peer for at least
another 10-15 seconds.
If you wouldn't mind, could you possibly increase the debug level for
rbd-mirror (you can temporarily inject it via the admin socket) to
"debug rbd_mirror = 15" level and attach the log to a new tracker
ticket @ tracker.ceph.com?
You can also tweak the heartbeat timeouts for the rbd-mirror daemon to
hopefully mitigate OSD slow-downs:
rbd_mirror_leader_heartbeat_interval = 5 (default value -- sends a
heartbeat to peers every 5 seconds)
rbd_mirror_leader_max_missed_heartbeats = 2 (default value -- increase
this to something like 20 for a 60 second grace period).
rbd_mirror_leader_max_acquire_attempts_before_break = 3 (default value
-- how many graceful attempts to transfer ownership before
blacklisting peer)
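Taken together, those options would look roughly like this in ceph.conf on the rbd-mirror hosts. The section name matches the `rbd_mirror_backup` client id seen in the logs above, and the values are just the example numbers from this thread, not tested recommendations:

```ini
[client.rbd_mirror_backup]
# default: send a heartbeat to peers every 5 seconds
rbd_mirror_leader_heartbeat_interval = 5
# raised from the default of 2, per the suggestion above, for a longer grace period
rbd_mirror_leader_max_missed_heartbeats = 20
# default: graceful ownership-transfer attempts before blacklisting the peer
rbd_mirror_leader_max_acquire_attempts_before_break = 3
```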