After looking through the documentation, soft log kills are apparently "normal"; however, in the radosgw logs we found:
2023-10-06T01:31:32.920+0200 7fb6f440b700 0 INFO: RGWReshardLock::lock
found lock on reshard.0000000002 to be held by another RGW process;
skipping for now
2023-10-06T01:31:33.371+0200 7fb6f440b700 0 INFO: RGWReshardLock::lock
found lock on reshard.0000000004 to be held by another RGW process;
skipping for now
2023-10-06T01:31:33.521+0200 7fb6f440b700 0 INFO: RGWReshardLock::lock
found lock on reshard.0000000006 to be held by another RGW process;
skipping for now
2023-10-06T01:31:33.853+0200 7fb6f440b700 0 INFO: RGWReshardLock::lock
found lock on reshard.0000000008 to be held by another RGW process;
skipping for now
2023-10-06T01:31:34.598+0200 7fb6f440b700 0 INFO: RGWReshardLock::lock
found lock on reshard.0000000012 to be held by another RGW process;
skipping for now
2023-10-06T01:31:34.740+0200 7fb6f440b700 0 INFO: RGWReshardLock::lock
found lock on reshard.0000000014 to be held by another RGW process;
skipping for now
...
After this line it seems that RGW stopped responding.
The next day it stopped again at almost the same time:
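For anyone who wants to check the same thing on their cluster: the entries queued for resharding (the reshard.NNNNNNNNNN objects in the log are the queue shards these locks are taken on) can be inspected with radosgw-admin. A diagnostic sketch; the bucket name is a placeholder:

```shell
# List bucket entries currently queued for dynamic resharding
radosgw-admin reshard list

# Check the reshard status of one bucket (replace the placeholder)
radosgw-admin reshard status --bucket=<bucket>
```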
2023-10-07T01:27:26.299+0200 7f6216651700 0 INFO: RGWReshardLock::lock
found lock on reshard.0000000005 to be held by another RGW process;
skipping for now
2023-10-07T01:37:28.077+0200 7f6216651700 0 INFO: RGWReshardLock::lock
found lock on reshard.0000000014 to be held by another RGW process;
skipping for now
2023-10-07T01:47:27.333+0200 7f6216651700 0 INFO: RGWReshardLock::lock
found lock on reshard.0000000001 to be held by another RGW process;
skipping for now
2023-10-07T02:47:29.863+0200 7f6216651700 0 INFO: RGWReshardLock::lock
found lock on reshard.0000000006 to be held by another RGW process;
skipping for now
...
After this line RGW stopped responding. We had to restart it.
We were just about to upgrade to Ceph 17.x, but we had to postpone it
because of this.
Rok
On Fri, Oct 6, 2023 at 9:30 AM Rok Jaklič <rjaklic(a)gmail.com> wrote:
Hi,
yesterday we changed RGW from civetweb to beast and at 04:02 RGW stopped
working; we had to restart it in the morning.
In one RGW log for the previous day we can see:
2023-10-06T04:02:01.105+0200 7fb71d45d700 -1 received signal: Hangup from
killall -q -1 ceph-mon ceph-mgr ceph-mds ceph-osd ceph-fuse radosgw
rbd-mirror cephfs-mirror (PID: 3202663) UID: 0
and in the next day's log we can see:
2023-10-06T04:02:01.133+0200 7fb71d45d700 -1 received signal: Hangup from
(PID: 3202664) UID: 0
and after that no requests came through. We had to restart RGW.
In ceph.conf we have something like
[client.radosgw.ctplmon2]
host = ctplmon2
log_file = /var/log/ceph/client.radosgw.ctplmon2.log
rgw_dns_name = ctplmon2
rgw_frontends = "beast ssl_endpoint=0.0.0.0:4443 ssl_certificate=..."
rgw_max_put_param_size = 15728640
We assume it has something to do with logrotate.
/etc/logrotate.d/ceph:
/var/log/ceph/*.log {
    rotate 90
    daily
    compress
    sharedscripts
    postrotate
        killall -q -1 ceph-mon ceph-mgr ceph-mds ceph-osd ceph-fuse radosgw rbd-mirror cephfs-mirror || pkill -1 -x "ceph-mon|ceph-mgr|ceph-mds|ceph-osd|ceph-fuse|radosgw|rbd-mirror|cephfs-mirror" || true
    endscript
    missingok
    notifempty
    su root ceph
}
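In case it helps narrow this down, the rotation and its postrotate signal can be exercised by hand, outside of logrotate's 04:02 cron run. A diagnostic sketch only; it assumes the beast port 4443 from the config above and that radosgw runs under that process name:

```shell
# Dry-run the rotation config without actually rotating or signalling
logrotate -d /etc/logrotate.d/ceph

# Reproduce the postrotate step in isolation: SIGHUP only radosgw,
# then check whether the beast frontend still answers requests
pkill -1 -x radosgw
curl -sk https://localhost:4443/ -o /dev/null -w '%{http_code}\n'
```

If RGW stops answering after a manual SIGHUP as well, that would point at the signal handling rather than at anything specific to the nightly rotation.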
ceph version 16.2.14 (238ba602515df21ea7ffc75c88db29f9e5ef12c9) pacific
(stable)
Any ideas why this happened?
Kind regards,
Rok