Hello dear fellow ceph users,
it seems that for some months now all current Ceph releases (16.x, 17.x,
18.x) have had a bug in ceph-volume that causes disk
activation to fail with the error "IndexError: list index out of range"
(details below, [0]).
It also seems that a fix is already available [1], but that it has not yet
been merged into any official release [2,3,4].
This has started to affect more and more nodes in our clusters, so I was
wondering whether others are seeing this issue as well and whether anyone
knows if a new release containing the fix is planned soon.
Best regards,
Nico
--------------------------------------------------------------------------------
[0]
kubectl -n rook-ceph logs -c activate rook-ceph-osd-30-6558b7cf69-5cbbl
+ OSD_ID=30
+ CEPH_FSID=bd3061a0-ecf3-4af6-9017-51b63c90b526
+ OSD_UUID=319e5756-318c-46a0-b7e9-429e39069302
+ OSD_STORE_FLAG=--bluestore
+ OSD_DATA_DIR=/var/lib/ceph/osd/ceph-30
+ CV_MODE=raw
+ DEVICE=/dev/sdf
+ cp --no-preserve=mode /etc/temp-ceph/ceph.conf /etc/ceph/ceph.conf
+ python3 -c '
import configparser
config = configparser.ConfigParser()
config.read('\''/etc/ceph/ceph.conf'\'')
if not config.has_section('\''global'\''):
config['\''global'\''] = {}
if not config.has_option('\''global'\'','\''fsid'\''):
config['\''global'\'']['\''fsid'\''] = '\''....\''
with open('\''/etc/ceph/ceph.conf'\'', '\''w'\'') as configfile:
config.write(configfile)
'
+ ceph -n client.admin auth get-or-create osd.30 mon 'allow profile osd' mgr 'allow profile osd' osd 'allow *' -k /etc/ceph/admin-keyring-store/keyring
[osd.30]
key = ...
+ [[ raw == \l\v\m ]]
++ mktemp
+ OSD_LIST=/tmp/tmp.OpZRJJOcrX
+ ceph-volume raw list /dev/sdf
Traceback (most recent call last):
File "/usr/sbin/ceph-volume", line 11, in <module>
load_entry_point('ceph-volume==1.0.0', 'console_scripts', 'ceph-volume')()
File "/usr/lib/python3.6/site-packages/ceph_volume/main.py", line 41, in __init__
self.main(self.argv)
File "/usr/lib/python3.6/site-packages/ceph_volume/decorators.py", line 59, in newfunc
return f(*a, **kw)
File "/usr/lib/python3.6/site-packages/ceph_volume/main.py", line 153, in main
terminal.dispatch(self.mapper, subcommand_args)
File "/usr/lib/python3.6/site-packages/ceph_volume/terminal.py", line 194, in dispatch
instance.main()
File "/usr/lib/python3.6/site-packages/ceph_volume/devices/raw/main.py", line 32, in main
terminal.dispatch(self.mapper, self.argv)
File "/usr/lib/python3.6/site-packages/ceph_volume/terminal.py", line 194, in dispatch
instance.main()
File "/usr/lib/python3.6/site-packages/ceph_volume/devices/raw/list.py", line 166, in main
self.list(args)
File "/usr/lib/python3.6/site-packages/ceph_volume/decorators.py", line 16, in is_root
return func(*a, **kw)
File "/usr/lib/python3.6/site-packages/ceph_volume/devices/raw/list.py", line 122, in list
report = self.generate(args.device)
File "/usr/lib/python3.6/site-packages/ceph_volume/devices/raw/list.py", line 91, in generate
info_device = [info for info in info_devices if info['NAME'] == dev][0]
IndexError: list index out of range
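The crash comes from the last line above: when lsblk does not report the
device under the exact name passed to "ceph-volume raw list", the filtered
list is empty and indexing it with [0] raises IndexError. A purely
illustrative defensive sketch of that spot (not the actual upstream patch
referenced in [1]):

# Illustrative only -- not the actual fix from [1].
def find_device_info(info_devices, dev):
    """Return the lsblk entry matching dev, or None instead of raising IndexError."""
    matches = [info for info in info_devices if info.get('NAME') == dev]
    return matches[0] if matches else None

# Made-up lsblk-style data: the reported name does not match the path that
# was passed in, so the original [0] indexing would crash here.
info_devices = [{'NAME': '/dev/sdf1', 'TYPE': 'part'}]
print(find_device_info(info_devices, '/dev/sdf'))  # -> None, no crash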
[1] https://github.com/ceph/ceph/pull/49954
[2] https://github.com/ceph/ceph/pull/54705
[3] https://github.com/ceph/ceph/pull/54706
[4] https://github.com/ceph/ceph/pull/54707
--
Sustainable and modern Infrastructures by ungleich.ch
Hi, we are currently testing a Ceph (v16.2.14) cluster: 3 mon nodes and 6 OSD
nodes with 8 NVMe SSD OSDs each, distributed over 3 racks. The daemons are
deployed in containers with cephadm / podman. The cluster has 2 pools, one
with 3x replication and min_size=2 and one with EC (k=3, m=3). With 1 mon
node and 2 OSD nodes in each rack, the CRUSH rules are configured so that a
full rack can go down while the cluster stays accessible for client
operations (for the 3x pool: chooseleaf_firstn rack; for the EC pool:
choose_indep 3 rack / chooseleaf_indep 2 host). We have also set
mon_osd_down_out_subtree_limit=host so that in case of a host/rack outage the
cluster does not automatically start to backfill, but continues to run in a
degraded state until a human intervenes, and we
set mon_osd_reporter_subtree_level=rack.
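For context, in a decompiled crushmap the EC rule described above looks
roughly like this (a sketch with a made-up name and id, not an actual dump
from this cluster):

rule ec_k3m3_rack {
        id 2
        type erasure
        step set_chooseleaf_tries 5
        step set_choose_tries 100
        step take default
        step choose indep 3 type rack
        step chooseleaf indep 2 type host
        step emit
}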
We tested - while under synthetic test client load - what happens if we take
a full rack (one mon node and 2 OSD nodes) out of the cluster. We did that
using iptables to block the rack's nodes from the other nodes of the cluster
(public and cluster network) as well as from the clients. As expected, the
remainder of the cluster continues to run in a degraded state without
starting any backfilling or recovery, and all client requests get served
while the rack is out.
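A rough sketch of what that isolation and reintegration looks like
(hypothetical; the addresses are the anonymized ones from the logs below, and
the real rule set of course also covers the cluster network):

# block a rack's nodes from the rest of the cluster and from the clients
for ip in 1.2.3.5 1.2.3.6 1.2.3.8 1.2.3.9; do
    iptables -A INPUT  -s "$ip" -j DROP
    iptables -A OUTPUT -d "$ip" -j DROP
done
# ... and later the all-at-once reintegration that triggers the issue:
iptables -F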
But then something strange happens when we bring the rack (1 mon node, 2 OSD
nodes) back into the cluster by deleting all firewall rules at once with
iptables -F. Some OSDs get integrated into the cluster again immediately, but
others remain in state "down" for exactly 10 minutes. The OSDs that stay down
for those 10 minutes still seem unable to reach the other OSD nodes (see the
heartbeat_check logs below). After these 10 minutes have passed, these OSDs
come up as well, but at exactly that moment many PGs get stuck in the peering
state, other OSDs that were in the cluster the whole time get slow requests,
and the cluster blocks client traffic (I think it's just the PGs stuck in
peering soaking up all the client threads). Then, exactly 45 minutes after
the nodes of the rack were made reachable again with iptables -F, the
situation recovers: peering succeeds and client load is handled again.
We have repeated this test several times and it is always exactly the same
10-minute "down interval" followed by 45 minutes of affected client requests.
When we instead reintegrate the nodes one after another, with a delay of a
few minutes in between, this does not happen at all. I wonder what is
happening there. It must be some kind of split-brain situation caused by
blocking the nodes with iptables rather than rebooting them completely. The
10-minute and 45-minute intervals I described occur every time. During the 10
minutes, some OSDs stay down after the hosts were reintegrated; it is not all
16 OSDs from the 2 reintegrated OSD hosts, but just some of them, and which
ones varies randomly - sometimes it is only one. We also observed that the
longer the hosts were out of the cluster, the more OSDs are affected. Even
after they come up again after 10 minutes, it takes another 45 minutes until
the stuck peering situation resolves. During these 45 minutes we also see
slow ops on OSDs that remained in the cluster.
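One observation that may help narrow this down: the mon log further below
shows "Marking osd.0 out (has been down for 600 seconds)", i.e. the default
mon_osd_down_out_interval, which lines up with the exactly-10-minute mark. A
quick sketch of how to inspect the relevant timers (option names as in
current releases):

ceph config get mon mon_osd_down_out_interval
ceph config get osd osd_heartbeat_grace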
####################################################
Here are some OSD logs written after the reintegration:
####################################################
2024-01-04T08:25:03.856+0000 7f369132b700 -1 monclient:
_check_auth_rotating possible clock skew, rotating keys expired way too
early (before 2024-01-04T07:25:03.860426+0000)
2024-01-04T08:25:06.556+0000 7f3682882700 0 log_channel(cluster) log [WRN]
: Monitor daemon marked osd.0 down, but it is still running
2024-01-04T08:25:06.556+0000 7f3682882700 0 log_channel(cluster) log [DBG]
: map e62160 wrongly marked me down at e62136
2024-01-04T08:25:06.556+0000 7f3682882700 1 osd.0 62160
start_waiting_for_healthy
2024-01-04T08:25:06.872+0000 7f368caa5700 -1 osd.0 62160 heartbeat_check:
no reply from 1.2.3.5:6810 osd.2 ever on either front or back, first ping
sent 2024-01-04T08:21:04.601131+0000 (oldest deadline
2024-01-04T08:21:24.601131+0000)
2024-01-04T08:25:06.872+0000 7f368caa5700 -1 osd.0 62160 heartbeat_check:
no reply from 1.2.3.8:6814 osd.3 ever on either front or back, first ping
sent 2024-01-04T08:21:04.601131+0000 (oldest deadline
2024-01-04T08:21:24.601131+0000)
2024-01-04T08:25:06.872+0000 7f368caa5700 -1 osd.0 62160 heartbeat_check:
no reply from 1.2.3.6:6822 osd.4 ever on either front or back, first ping
sent 2024-01-04T08:21:04.601131+0000 (oldest deadline
2024-01-04T08:21:24.601131+0000)
2024-01-04T08:25:06.872+0000 7f368caa5700 -1 osd.0 62160 heartbeat_check:
no reply from 1.2.3.9:6830 osd.5 ever on either front or back, first ping
sent 2024-01-04T08:21:04.601131+0000 (oldest deadline
2024-01-04T08:21:24.601131+0000)
2024-01-04T08:25:06.872+0000 7f368caa5700 -1 osd.0 62160 heartbeat_check:
no reply from 1.2.3.6:6830 osd.7 ever on either front or back, first ping
sent 2024-01-04T08:21:04.601131+0000 (oldest deadline
2024-01-04T08:21:24.601131+0000)
2024-01-04T08:25:06.872+0000 7f368caa5700 -1 osd.0 62160 heartbeat_check:
no reply from 1.2.3.8:6830 osd.9 ever on either front or back, first ping
sent 2024-01-04T08:21:04.601131+0000 (oldest deadline
2024-01-04T08:21:24.601131+0000)
2024-01-04T08:25:06.872+0000 7f368caa5700 -1 osd.0 62160 heartbeat_check:
no reply from 1.2.3.5:6802 osd.10 ever on either front or back, first ping
sent 2024-01-04T08:21:04.601131+0000 (oldest deadline
2024-01-04T08:21:24.601131+0000)
2024-01-04T08:25:06.872+0000 7f368caa5700 -1 osd.0 62160 heartbeat_check:
no reply from 1.2.3.9:6802 osd.11 ever on either front or back, first ping
sent 2024-01-04T08:21:04.601131+0000 (oldest deadline
2024-01-04T08:21:24.601131+0000)
2024-01-04T08:25:06.872+0000 7f368caa5700 -1 osd.0 62160 heartbeat_check:
no reply from 1.2.3.6:6802 osd.13 ever on either front or back, first ping
sent 2024-01-04T08:21:04.601131+0000 (oldest deadline
2024-01-04T08:21:24.601131+0000)
2024-01-04T08:25:06.872+0000 7f368caa5700 -1 osd.0 62160 heartbeat_check:
no reply from 1.2.3.8:6802 osd.15 ever on either front or back, first ping
sent 2024-01-04T08:21:04.601131+0000 (oldest deadline
2024-01-04T08:21:24.601131+0000)
2024-01-04T08:25:06.872+0000 7f368caa5700 -1 osd.0 62160 heartbeat_check:
no reply from 1.2.3.5:6806 osd.16 ever on either front or back, first ping
sent 2024-01-04T08:21:04.601131+0000 (oldest deadline
2024-01-04T08:21:24.601131+0000)
2024-01-04T08:25:06.872+0000 7f368caa5700 -1 osd.0 62160 heartbeat_check:
no reply from 1.2.3.9:6806 osd.17 ever on either front or back, first ping
sent 2024-01-04T08:21:04.601131+0000 (oldest deadline
2024-01-04T08:21:24.601131+0000)
2024-01-04T08:25:06.872+0000 7f368caa5700 -1 osd.0 62160 heartbeat_check:
no reply from 1.2.3.6:6806 osd.20 ever on either front or back, first ping
sent 2024-01-04T08:21:04.601131+0000 (oldest deadline
2024-01-04T08:21:24.601131+0000)
2024-01-04T08:25:06.872+0000 7f368caa5700 -1 osd.0 62160 heartbeat_check:
no reply from 1.2.3.8:6806 osd.21 ever on either front or back, first ping
sent 2024-01-04T08:21:04.601131+0000 (oldest deadline
2024-01-04T08:21:24.601131+0000)
2024-01-04T08:25:06.872+0000 7f368caa5700 -1 osd.0 62160 heartbeat_check:
no reply from 1.2.3.9:6810 osd.22 ever on either front or back, first ping
sent 2024-01-04T08:21:04.601131+0000 (oldest deadline
2024-01-04T08:21:24.601131+0000)
2024-01-04T08:25:06.872+0000 7f368caa5700 -1 osd.0 62160 heartbeat_check:
no reply from 1.2.3.5:6814 osd.23 ever on either front or back, first ping
sent 2024-01-04T08:21:04.601131+0000 (oldest deadline
2024-01-04T08:21:24.601131+0000)
2024-01-04T08:25:06.872+0000 7f368caa5700 -1 osd.0 62160 heartbeat_check:
no reply from 1.2.3.6:6810 osd.26 ever on either front or back, first ping
sent 2024-01-04T08:21:04.601131+0000 (oldest deadline
2024-01-04T08:21:24.601131+0000)
2024-01-04T08:25:06.872+0000 7f368caa5700 -1 osd.0 62160 heartbeat_check:
no reply from 1.2.3.8:6810 osd.27 ever on either front or back, first ping
sent 2024-01-04T08:21:04.601131+0000 (oldest deadline
2024-01-04T08:21:24.601131+0000)
2024-01-04T08:25:06.872+0000 7f368caa5700 -1 osd.0 62160 heartbeat_check:
no reply from 1.2.3.9:6814 osd.28 ever on either front or back, first ping
sent 2024-01-04T08:21:04.601131+0000 (oldest deadline
2024-01-04T08:21:24.601131+0000)
2024-01-04T08:25:06.872+0000 7f368caa5700 -1 osd.0 62160 heartbeat_check:
no reply from 1.2.3.5:6818 osd.29 ever on either front or back, first ping
sent 2024-01-04T08:21:04.601131+0000 (oldest deadline
2024-01-04T08:21:24.601131+0000)
2024-01-04T08:25:06.872+0000 7f368caa5700 -1 osd.0 62160 heartbeat_check:
no reply from 1.2.3.6:6814 osd.32 ever on either front or back, first ping
sent 2024-01-04T08:21:04.601131+0000 (oldest deadline
2024-01-04T08:21:24.601131+0000)
2024-01-04T08:25:06.872+0000 7f368caa5700 -1 osd.0 62160 heartbeat_check:
no reply from 1.2.3.8:6818 osd.33 ever on either front or back, first ping
sent 2024-01-04T08:21:04.601131+0000 (oldest deadline
2024-01-04T08:21:24.601131+0000)
2024-01-04T08:25:06.872+0000 7f368caa5700 -1 osd.0 62160 heartbeat_check:
no reply from 1.2.3.9:6818 osd.34 ever on either front or back, first ping
sent 2024-01-04T08:21:04.601131+0000 (oldest deadline
2024-01-04T08:21:24.601131+0000)
2024-01-04T08:25:06.872+0000 7f368caa5700 -1 osd.0 62160 heartbeat_check:
no reply from 1.2.3.5:6822 osd.35 ever on either front or back, first ping
sent 2024-01-04T08:21:04.601131+0000 (oldest deadline
2024-01-04T08:21:24.601131+0000)
2024-01-04T08:25:06.872+0000 7f368caa5700 -1 osd.0 62160 heartbeat_check:
no reply from 1.2.3.6:6818 osd.37 ever on either front or back, first ping
sent 2024-01-04T08:21:04.601131+0000 (oldest deadline
2024-01-04T08:21:24.601131+0000)
2024-01-04T08:25:06.872+0000 7f368caa5700 -1 osd.0 62160 heartbeat_check:
no reply from 1.2.3.9:6822 osd.39 ever on either front or back, first ping
sent 2024-01-04T08:21:04.601131+0000 (oldest deadline
2024-01-04T08:21:24.601131+0000)
2024-01-04T08:25:06.872+0000 7f368caa5700 -1 osd.0 62160 heartbeat_check:
no reply from 1.2.3.8:6822 osd.40 ever on either front or back, first ping
sent 2024-01-04T08:21:04.601131+0000 (oldest deadline
2024-01-04T08:21:24.601131+0000)
2024-01-04T08:25:06.872+0000 7f368caa5700 -1 osd.0 62160 heartbeat_check:
no reply from 1.2.3.5:6826 osd.41 ever on either front or back, first ping
sent 2024-01-04T08:21:04.601131+0000 (oldest deadline
2024-01-04T08:21:24.601131+0000)
2024-01-04T08:25:06.872+0000 7f368caa5700 -1 osd.0 62160 heartbeat_check:
no reply from 1.2.3.6:6826 osd.43 ever on either front or back, first ping
sent 2024-01-04T08:21:04.601131+0000 (oldest deadline
2024-01-04T08:21:24.601131+0000)
2024-01-04T08:25:06.872+0000 7f368caa5700 -1 osd.0 62160 heartbeat_check:
no reply from 1.2.3.9:6826 osd.44 ever on either front or back, first ping
sent 2024-01-04T08:21:04.601131+0000 (oldest deadline
2024-01-04T08:21:24.601131+0000)
2024-01-04T08:25:06.872+0000 7f368caa5700 -1 osd.0 62160 heartbeat_check:
no reply from 1.2.3.8:6826 osd.46 ever on either front or back, first ping
sent 2024-01-04T08:21:04.601131+0000 (oldest deadline
2024-01-04T08:21:24.601131+0000)
2024-01-04T08:25:06.872+0000 7f368caa5700 -1 osd.0 62160 heartbeat_check:
no reply from 1.2.3.5:6830 osd.47 ever on either front or back, first ping
sent 2024-01-04T08:21:04.601131+0000 (oldest deadline
2024-01-04T08:21:24.601131+0000)
[The block of log lines above gets repeated for 45 minutes until everything
is fine again. The block below appears once at the beginning, right after the
reintegration, then starts again after the 10-minute interval and repeats
until the end of the 45-minute interval, when everything is fine again.]
2024-01-04T08:25:07.036+0000 7f3691b2c700 0 auth: could not find
secret_id=1363
2024-01-04T08:25:07.036+0000 7f3691b2c700 0 cephx: verify_authorizer could
not get service secret for service osd secret_id=1363
2024-01-04T08:25:07.036+0000 7f369132b700 0 auth: could not find
secret_id=1363
2024-01-04T08:25:07.036+0000 7f369132b700 0 cephx: verify_authorizer could
not get service secret for service osd secret_id=1363
2024-01-04T08:25:07.236+0000 7f369232d700 0 auth: could not find
secret_id=1363
2024-01-04T08:25:07.236+0000 7f369232d700 0 cephx: verify_authorizer could
not get service secret for service osd secret_id=1363
2024-01-04T08:25:07.236+0000 7f369232d700 0 auth: could not find
secret_id=1363
2024-01-04T08:25:07.236+0000 7f369232d700 0 cephx: verify_authorizer could
not get service secret for service osd secret_id=1363
2024-01-04T08:35:07.225+0000 7f369232d700 0 auth: could not find
secret_id=1365
2024-01-04T08:35:07.225+0000 7f369232d700 0 cephx: verify_authorizer could
not get service secret for service osd secret_id=1365
[The block below gets logged for 10 minutes, until the OSD is no longer
down.]
2024-01-04T08:25:08.368+0000 7f368d2a6700 1 osd.0 62162 is_healthy false
-- only 0/10 up peers (less than 33%)
2024-01-04T08:25:08.368+0000 7f368d2a6700 1 osd.0 62162 not healthy;
waiting to boot
2024-01-04T08:25:09.340+0000 7f368d2a6700 1 osd.0 62162 is_healthy false
-- only 0/10 up peers (less than 33%)
2024-01-04T08:25:09.340+0000 7f368d2a6700 1 osd.0 62162 not healthy;
waiting to boot
2024-01-04T08:25:10.316+0000 7f368d2a6700 1 osd.0 62162 is_healthy false
-- only 0/10 up peers (less than 33%)
2024-01-04T08:25:10.316+0000 7f368d2a6700 1 osd.0 62162 not healthy;
waiting to boot
After 10 minutes, the OSD then seems to reboot:
2024-01-04T08:35:07.005+0000 7f368d2a6700 1 osd.0 62509 start_boot
2024-01-04T08:35:07.009+0000 7f368b2a2700 1 osd.0 62509 set_numa_affinity
storage numa node 0
2024-01-04T08:35:07.009+0000 7f368b2a2700 -1 osd.0 62509 set_numa_affinity
unable to identify public interface '' numa node: (2) No such file or
directory
2024-01-04T08:35:07.009+0000 7f368b2a2700 1 osd.0 62509 set_numa_affinity
not setting numa affinity
2024-01-04T08:35:07.197+0000 7f367ea40700 2 osd.0 62509 ms_handle_reset
con 0x561d78ec6000 session 0x561d8ae5f0e0
2024-01-04T08:35:07.213+0000 7f3682882700 1 osd.0 62521 state: booting ->
active
##############################################################
Here are some logs from the active mon, written after the
reintegration:
##############################################################
2024-01-04T08:25:06.486+0000 7ff5fa87b700 5 mon.ceph-mon01(a)0(leader).osd
e62160 send_latest to osd.0 v2:... start 62136
2024-01-04T08:25:06.486+0000 7ff5fa87b700 1 mon.ceph-mon01(a)0(leader).osd
e62160 ignoring beacon from non-active osd.0
2024-01-04T08:25:06.490+0000 7ff5f9078700 0 log_channel(cluster) log [WRN]
: osd.0 (root=default,rack=rack3,host=ceph-osd07) is down
2024-01-04T08:25:06.642+0000 7ff5f9078700 0 log_channel(cluster) log [INF]
: osd.0 marked itself dead as of e62160
2024-01-04T08:25:07.434+0000 7ff5fa87b700 5 mon.ceph-mon01(a)0(leader).osd
e62161 preprocess_failure dne(/dup?): osd.0 [v2:...,v1:...], from osd.45
2024-01-04T08:29:59.998+0000 7ff5fd080700 0 log_channel(cluster) log [WRN]
: osd.0 (root=default,rack=rack3,host=ceph-osd07) is down
2024-01-04T08:29:59.998+0000 7ff5fd080700 0 log_channel(cluster) log [WRN]
: Slow OSD heartbeats on back from osd.45 [rack3] to osd.0 [rack3]
(down) 221701.273 msec
2024-01-04T08:29:59.998+0000 7ff5fd080700 0 log_channel(cluster) log [WRN]
: Slow OSD heartbeats on front from osd.45 [rack3] to osd.0 [rack3]
(down) 221700.153 msec
2024-01-04T08:31:43.019+0000 7ff5fa87b700 5 mon.ceph-mon01(a)0(leader).osd
e62434 send_incremental [62416..62434] to osd.0
2024-01-04T08:32:13.915+0000 7ff5fa87b700 5 mon.ceph-mon01(a)0(leader).osd
e62443 send_incremental [62435..62443] to osd.0
2024-01-04T08:32:44.891+0000 7ff5fa87b700 5 mon.ceph-mon01(a)0(leader).osd
e62461 send_incremental [62444..62461] to osd.0
2024-01-04T08:33:15.752+0000 7ff5fa87b700 5 mon.ceph-mon01(a)0(leader).osd
e62465 send_incremental [62462..62465] to osd.0
2024-01-04T08:33:39.148+0000 7ff5fa87b700 5 mon.ceph-mon01(a)0(leader).osd
e62483 preprocess_failure from dead osd.0, ignoring
2024-01-04T08:33:39.148+0000 7ff5fa87b700 5 mon.ceph-mon01(a)0(leader).osd
e62483 send_incremental [62466..62483] to osd.0
2024-01-04T08:33:39.148+0000 7ff5fa87b700 5 mon.ceph-mon01(a)0(leader).osd
e62483 preprocess_failure from dead osd.0, ignoring
2024-01-04T08:33:39.148+0000 7ff5fa87b700 5 mon.ceph-mon01(a)0(leader).osd
e62483 send_incremental [62466..62483] to osd.0
2024-01-04T08:33:39.148+0000 7ff5fa87b700 5 mon.ceph-mon01(a)0(leader).osd
e62483 preprocess_failure from dead osd.0, ignoring
2024-01-04T08:33:39.148+0000 7ff5fa87b700 5 mon.ceph-mon01(a)0(leader).osd
e62483 send_incremental [62466..62483] to osd.0
2024-01-04T08:33:39.148+0000 7ff5fa87b700 5 mon.ceph-mon01(a)0(leader).osd
e62483 preprocess_failure from dead osd.0, ignoring
2024-01-04T08:33:39.148+0000 7ff5fa87b700 5 mon.ceph-mon01(a)0(leader).osd
e62483 send_incremental [62466..62483] to osd.0
2024-01-04T08:33:39.148+0000 7ff5fa87b700 5 mon.ceph-mon01(a)0(leader).osd
e62483 preprocess_failure from dead osd.0, ignoring
2024-01-04T08:33:39.148+0000 7ff5fa87b700 5 mon.ceph-mon01(a)0(leader).osd
e62483 send_incremental [62466..62483] to osd.0
2024-01-04T08:33:39.148+0000 7ff5fa87b700 5 mon.ceph-mon01(a)0(leader).osd
e62483 preprocess_failure from dead osd.0, ignoring
2024-01-04T08:33:39.148+0000 7ff5fa87b700 5 mon.ceph-mon01(a)0(leader).osd
e62483 send_incremental [62466..62483] to osd.0
2024-01-04T08:33:39.148+0000 7ff5fa87b700 5 mon.ceph-mon01(a)0(leader).osd
e62483 preprocess_failure from dead osd.0, ignoring
2024-01-04T08:33:39.148+0000 7ff5fa87b700 5 mon.ceph-mon01(a)0(leader).osd
e62483 send_incremental [62466..62483] to osd.0
2024-01-04T08:33:39.148+0000 7ff5fa87b700 5 mon.ceph-mon01(a)0(leader).osd
e62483 preprocess_failure from dead osd.0, ignoring
2024-01-04T08:33:39.148+0000 7ff5fa87b700 5 mon.ceph-mon01(a)0(leader).osd
e62483 send_incremental [62466..62483] to osd.0
2024-01-04T08:33:39.148+0000 7ff5fa87b700 5 mon.ceph-mon01(a)0(leader).osd
e62483 preprocess_failure from dead osd.0, ignoring
2024-01-04T08:33:39.148+0000 7ff5fa87b700 5 mon.ceph-mon01(a)0(leader).osd
e62483 send_incremental [62466..62483] to osd.0
2024-01-04T08:33:39.148+0000 7ff5fa87b700 5 mon.ceph-mon01(a)0(leader).osd
e62483 preprocess_failure from dead osd.0, ignoring
2024-01-04T08:33:39.148+0000 7ff5fa87b700 5 mon.ceph-mon01(a)0(leader).osd
e62483 send_incremental [62466..62483] to osd.0
2024-01-04T08:33:51.240+0000 7ff5f9078700 5 mon.ceph-mon01(a)0(leader).osd
e62484 send_incremental [62484..62484] to osd.0
2024-01-04T08:35:04.077+0000 7ff5fd080700 0 log_channel(cluster) log [INF]
: Marking osd.0 out (has been down for 600 seconds)
2024-01-04T08:35:04.085+0000 7ff5fd080700 2 mon.ceph-mon01(a)0(leader).osd
e62517 osd.0 OUT
2024-01-04T08:35:07.173+0000 7ff5fd080700 2 mon.ceph-mon01(a)0(leader).osd
e62520 osd.0 UP [v2:...,v1:...]
2024-01-04T08:35:07.173+0000 7ff5fd080700 2 mon.ceph-mon01(a)0(leader).osd
e62520 osd.0 IN
This is what gets logged shortly before the situation recovers:
2024-01-04T09:09:59.996+0000 7ff5fd080700 0 log_channel(cluster) log [WRN]
: Health detail: HEALTH_WARN Reduced data availability: 292 pgs inactive,
292 pgs peering; Degraded data redundancy: 14614397/4091962683 objects
degraded (0.357%), 322 pgs degraded, 360 pgs undersized; 2 pools have too
many placement groups; 9472 slow ops, oldest one blocked for 2091 sec,
daemons [osd.10,osd.11,osd.13,osd.14,osd.15,osd.16,osd.17>
2024-01-04T09:09:59.996+0000 7ff5fd080700 0 log_channel(cluster) log [WRN]
: [WRN] PG_AVAILABILITY: Reduced data availability: 292 pgs inactive, 292
pgs peering
2024-01-04T09:09:59.996+0000 7ff5fd080700 0 log_channel(cluster) log [WRN]
: pg 14.212 is stuck peering for 34m, current state remapped+peering,
last acting [4,21]
2024-01-04T09:09:59.996+0000 7ff5fd080700 0 log_channel(cluster) log [WRN]
: pg 14.220 is stuck peering for 34m, current state remapped+peering,
last acting [3,35]
2024-01-04T09:09:59.996+0000 7ff5fd080700 0 log_channel(cluster) log [WRN]
: pg 14.227 is stuck peering for 34m, current state remapped+peering,
last acting [44,26]
2024-01-04T09:09:59.996+0000 7ff5fd080700 0 log_channel(cluster) log [WRN]
: pg 14.228 is stuck peering for 34m, current state remapped+peering,
last acting [23,5]
2024-01-04T09:09:59.996+0000 7ff5fd080700 0 log_channel(cluster) log [WRN]
: pg 14.22b is stuck peering for 34m, current state remapped+peering,
last acting [32,9]
2024-01-04T09:09:59.996+0000 7ff5fd080700 0 log_channel(cluster) log [WRN]
: pg 14.234 is stuck peering for 34m, current state remapped+peering,
last acting [32,44]
2024-01-04T09:09:59.996+0000 7ff5fd080700 0 log_channel(cluster) log [WRN]
: pg 14.24e is stuck peering for 34m, current state remapped+peering,
last acting [17,32]
2024-01-04T09:09:59.996+0000 7ff5fd080700 0 log_channel(cluster) log [WRN]
: pg 14.255 is stuck peering for 34m, current state remapped+peering,
last acting [4,22]
2024-01-04T09:09:59.996+0000 7ff5fd080700 0 log_channel(cluster) log [WRN]
: pg 14.260 is stuck peering for 34m, current state remapped+peering,
last acting [17,47]
2024-01-04T09:09:59.996+0000 7ff5fd080700 0 log_channel(cluster) log [WRN]
: pg 14.267 is stuck peering for 34m, current state remapped+peering,
last acting [44,23]
2024-01-04T09:09:59.996+0000 7ff5fd080700 0 log_channel(cluster) log [WRN]
: pg 14.27d is stuck peering for 34m, current state remapped+peering,
last acting [4,21]
2024-01-04T09:09:59.996+0000 7ff5fd080700 0 log_channel(cluster) log [WRN]
: pg 14.289 is stuck peering for 34m, current state remapped+peering,
last acting [13,9]
2024-01-04T09:09:59.996+0000 7ff5fd080700 0 log_channel(cluster) log [WRN]
: pg 14.292 is stuck peering for 34m, current state remapped+peering,
last acting [15,10]
2024-01-04T09:09:59.996+0000 7ff5fd080700 0 log_channel(cluster) log [WRN]
: pg 14.297 is stuck peering for 34m, current state remapped+peering,
last acting [2,21]
2024-01-04T09:09:59.996+0000 7ff5fd080700 0 log_channel(cluster) log [WRN]
: pg 14.29d is stuck peering for 34m, current state remapped+peering,
last acting [40,23]
2024-01-04T09:09:59.996+0000 7ff5fd080700 0 log_channel(cluster) log [WRN]
: pg 14.2a8 is stuck peering for 34m, current state remapped+peering,
last acting [33,4]
2024-01-04T09:09:59.996+0000 7ff5fd080700 0 log_channel(cluster) log [WRN]
: pg 15.20e is stuck inactive for 34m, current state remapped+peering,
last acting [33,39,2147483647,2147483647,10,43]
2024-01-04T09:09:59.996+0000 7ff5fd080700 0 log_channel(cluster) log [WRN]
: pg 15.212 is stuck peering for 34m, current state remapped+peering,
last acting [36,2147483647,22,40,10,43]
2024-01-04T09:09:59.996+0000 7ff5fd080700 0 log_channel(cluster) log [WRN]
: pg 15.219 is stuck peering for 34m, current state remapped+peering,
last acting [13,10,21,34,2147483647,2147483647]
2024-01-04T09:09:59.996+0000 7ff5fd080700 0 log_channel(cluster) log [WRN]
: pg 15.21c is stuck peering for 34m, current state remapped+peering,
last acting [41,4,2147483647,14,44,15]
2024-01-04T09:09:59.996+0000 7ff5fd080700 0 log_channel(cluster) log [WRN]
: pg 15.222 is stuck peering for 34m, current state remapped+peering,
last acting [2147483647,2147483647,23,32,3,34]
2024-01-04T09:09:59.996+0000 7ff5fd080700 0 log_channel(cluster) log [WRN]
: pg 15.22d is stuck peering for 34m, current state remapped+peering,
last acting [2147483647,45,41,20,17,33]
2024-01-04T09:09:59.996+0000 7ff5fd080700 0 log_channel(cluster) log [WRN]
: pg 15.233 is stuck peering for 34m, current state remapped+peering,
last acting [4,2,27,34,14,2147483647]
2024-01-04T09:09:59.996+0000 7ff5fd080700 0 log_channel(cluster) log [WRN]
: pg 15.23a is stuck peering for 34m, current state remapped+peering,
last acting [41,43,19,2147483647,34,33]
2024-01-04T09:09:59.996+0000 7ff5fd080700 0 log_channel(cluster) log [WRN]
: pg 15.23b is stuck peering for 34m, current state remapped+peering,
last acting [2147483647,30,7,41,34,15]
2024-01-04T09:09:59.996+0000 7ff5fd080700 0 log_channel(cluster) log [WRN]
: pg 15.23e is stuck peering for 34m, current state remapped+peering,
last acting [10,37,2147483647,2147483647,11,9]
2024-01-04T09:09:59.996+0000 7ff5fd080700 0 log_channel(cluster) log [WRN]
: pg 15.243 is stuck peering for 34m, current state remapped+peering,
last acting [23,13,11,15,2147483647,45]
2024-01-04T09:09:59.996+0000 7ff5fd080700 0 log_channel(cluster) log [WRN]
: pg 15.244 is stuck peering for 34m, current state remapped+peering,
last acting [13,35,14,2147483647,17,9]
2024-01-04T09:09:59.996+0000 7ff5fd080700 0 log_channel(cluster) log [WRN]
: pg 15.249 is stuck peering for 34m, current state remapped+peering,
last acting [32,47,2147483647,2147483647,46,17]
2024-01-04T09:09:59.996+0000 7ff5fd080700 0 log_channel(cluster) log [WRN]
: pg 15.24a is stuck peering for 34m, current state remapped+peering,
last acting [47,7,15,5,2147483647,2147483647]
2024-01-04T09:09:59.996+0000 7ff5fd080700 0 log_channel(cluster) log [WRN]
: pg 15.24b is stuck peering for 34m, current state remapped+peering,
last acting [30,2147483647,46,28,4,29]
2024-01-04T09:09:59.996+0000 7ff5fd080700 0 log_channel(cluster) log [WRN]
: pg 15.24f is stuck peering for 34m, current state remapped+peering,
last acting [2147483647,2147483647,13,35,33,5]
2024-01-04T09:09:59.996+0000 7ff5fd080700 0 log_channel(cluster) log [WRN]
: pg 15.254 is stuck peering for 34m, current state remapped+peering,
last acting [15,39,2147483647,2147483647,4,16]
2024-01-04T09:09:59.996+0000 7ff5fd080700 0 log_channel(cluster) log [WRN]
: pg 15.257 is stuck peering for 34m, current state remapped+peering,
last acting [13,10,2147483647,2147483647,22,40]
2024-01-04T09:09:59.996+0000 7ff5fd080700 0 log_channel(cluster) log [WRN]
: pg 15.25d is stuck peering for 34m, current state remapped+peering,
last acting [2147483647,2147483647,20,16,34,46]
2024-01-04T09:09:59.996+0000 7ff5fd080700 0 log_channel(cluster) log [WRN]
: pg 15.26f is stuck peering for 34m, current state remapped+peering,
last acting [33,17,2147483647,14,26,23]
2024-01-04T09:09:59.996+0000 7ff5fd080700 0 log_channel(cluster) log [WRN]
: pg 15.273 is stuck peering for 34m, current state remapped+peering,
last acting [36,2147483647,29,4,17,21]
2024-01-04T09:09:59.996+0000 7ff5fd080700 0 log_channel(cluster) log [WRN]
: pg 15.27a is stuck peering for 34m, current state remapped+peering,
last acting [40,34,2147483647,2147483647,26,23]
2024-01-04T09:09:59.996+0000 7ff5fd080700 0 log_channel(cluster) log [WRN]
: pg 15.27b is stuck peering for 34m, current state remapped+peering,
last acting [41,37,2147483647,19,11,3]
2024-01-04T09:09:59.996+0000 7ff5fd080700 0 log_channel(cluster) log [WRN]
: pg 15.27e is stuck peering for 34m, current state remapped+peering,
last acting [44,9,4,41,45,2147483647]
2024-01-04T09:09:59.996+0000 7ff5fd080700 0 log_channel(cluster) log [WRN]
: pg 15.281 is stuck peering for 34m, current state remapped+peering,
last acting [2,43,21,17,31,2147483647]
2024-01-04T09:09:59.996+0000 7ff5fd080700 0 log_channel(cluster) log [WRN]
: pg 15.290 is stuck peering for 34m, current state remapped+peering,
last acting [17,21,2147483647,2147483647,23,37]
2024-01-04T09:09:59.996+0000 7ff5fd080700 0 log_channel(cluster) log [WRN]
: pg 15.296 is stuck peering for 34m, current state remapped+peering,
last acting [44,3,35,20,2147483647,2147483647]
2024-01-04T09:09:59.996+0000 7ff5fd080700 0 log_channel(cluster) log [WRN]
: pg 15.298 is stuck peering for 34m, current state remapped+peering,
last acting [2147483647,36,43,23,44,9]
2024-01-04T09:09:59.996+0000 7ff5fd080700 0 log_channel(cluster) log [WRN]
: pg 15.2a5 is stuck peering for 34m, current state remapped+peering,
last acting [33,11,19,2147483647,43,10]
2024-01-04T09:09:59.996+0000 7ff5fd080700 0 log_channel(cluster) log [WRN]
: pg 15.2a6 is stuck peering for 34m, current state remapped+peering,
last acting [2147483647,2147483647,4,29,27,34]
2024-01-04T09:09:59.996+0000 7ff5fd080700 0 log_channel(cluster) log [WRN]
: pg 15.2a8 is stuck peering for 34m, current state remapped+peering,
last acting [47,4,2147483647,2147483647,15,5]
2024-01-04T09:09:59.996+0000 7ff5fd080700 0 log_channel(cluster) log [WRN]
: pg 15.2ae is stuck peering for 34m, current state remapped+peering,
last acting [23,32,2147483647,2147483647,3,22]
2024-01-04T09:09:59.996+0000 7ff5fd080700 0 log_channel(cluster) log [WRN]
: pg 15.2b4 is stuck peering for 34m, current state remapped+peering,
last acting [2147483647,2147483647,23,20,15,34]
2024-01-04T09:09:59.996+0000 7ff5fd080700 0 log_channel(cluster) log [WRN]
: pg 15.2b7 is stuck peering for 34m, current state remapped+peering,
last acting [17,3,2147483647,45,41,32]
2024-01-04T09:09:59.996+0000 7ff5fd080700 0 log_channel(cluster) log [WRN]
: pg 15.2b8 is stuck peering for 34m, current state remapped+peering,
last acting [40,44,2147483647,2147483647,43,10]
2024-01-04T09:10:02.572+0000 7ff5fd080700 0 log_channel(cluster) log [WRN]
: Health check update: Degraded data redundancy: 14296421/4091962644
objects degraded (0.349%), 322 pgs degraded, 360 pgs undersized
(PG_DEGRADED)
2024-01-04T09:10:02.572+0000 7ff5fd080700 0 log_channel(cluster) log [WRN]
: Health check update: 9127 slow ops, oldest one blocked for 2096 sec,
daemons
[osd.10,osd.11,osd.13,osd.14,osd.15,osd.16,osd.17,osd.2,osd.20,osd.21]...
have slow ops. (SLOW_OPS)
2024-01-04T09:10:27.608+0000 7ff5fd080700 0 log_channel(cluster) log [WRN]
: Health check update: Slow OSD heartbeats on back (longest 429402.973ms)
(OSD_SLOW_PING_TIME_BACK)
2024-01-04T09:10:27.608+0000 7ff5fd080700 0 log_channel(cluster) log [WRN]
: Health check update: Slow OSD heartbeats on front (longest 429531.265ms)
(OSD_SLOW_PING_TIME_FRONT)
2024-01-04T09:10:27.608+0000 7ff5fd080700 0 log_channel(cluster) log [WRN]
: Health check update: Degraded data redundancy: 96918631/4091964189
objects degraded (2.369%), 404 pgs degraded, 241 pgs undersized
(PG_DEGRADED)
2024-01-04T09:10:27.608+0000 7ff5fd080700 0 log_channel(cluster) log [WRN]
: Health check update: 706 slow ops, oldest one blocked for 2121 sec,
daemons
[osd.10,osd.11,osd.12,osd.13,osd.15,osd.16,osd.17,osd.18,osd.2,osd.20]...
have slow ops. (SLOW_OPS)
Any ideas what's going on here?
Facing a similar situation, any support would be helpful.
-Lokendra
On Tue, Jan 9, 2024 at 10:47 PM Kushagr Gupta <kushagrguptasps.mun(a)gmail.com>
wrote:
> Hi Team,
>
> Features used: Rados gateway, ceph S3 buckets
>
> We are trying to create a data pipeline using the S3 buckets capability
> and the rados gateway in Ceph.
> Our endpoint is a kafka topic.
>
> Currently, as soon as an object is created, a notification is sent to the
> kafka topic.
> Our goal is to create multiple objects first and then send the notification.
> Also, is there a way to assign objects to a group and collectively send a
> notification on an event for that group?
>
> Could anyone please help me?
>
> Thanks and Regards,
> Kushagra Gupta
>
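For reference, the per-object notification mechanism described above is
typically wired up against RGW's SNS-compatible API roughly like this (a
hedged sketch - endpoint, credentials and names are placeholders, and it does
not provide the batching/grouping being asked about):

import boto3

# Placeholders -- replace with your RGW endpoint and credentials.
rgw = dict(endpoint_url='http://rgw.example.com:8000', region_name='default',
           aws_access_key_id='ACCESS', aws_secret_access_key='SECRET')

# 1. Create a topic whose push-endpoint is the kafka broker.
sns = boto3.client('sns', **rgw)
topic_arn = sns.create_topic(
    Name='objects-created',
    Attributes={'push-endpoint': 'kafka://kafka.example.com:9092',
                'kafka-ack-level': 'broker'})['TopicArn']

# 2. Attach a notification for object-creation events to a bucket.
s3 = boto3.client('s3', **rgw)
s3.put_bucket_notification_configuration(
    Bucket='my-bucket',
    NotificationConfiguration={'TopicConfigurations': [
        {'Id': 'created', 'TopicArn': topic_arn,
         'Events': ['s3:ObjectCreated:*']}]})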
--
~ Lokendra
skype: lokendrarathour
Hello all,
Looking at the Grafana reports, can anyone point me to documentation that
outlines physical vs osd? https://docs.ceph.com/en/latest/monitoring/
gives some basic info, but I'm trying to get a better understanding. For
instance, if physical latency is 20 ms and OSD latency is 200 ms (made-up
numbers for this example), why the huge difference? The same question applies
to bytes or IOPS; I'm just using latency as an example.
Thanks,
Curt
Hi Jan,
indeed, this looks like some memory allocation problem - maybe the OSD's RAM
usage threshold was reached or something similar?
Curious whether you have any custom OSD settings or perhaps memory caps on
the Ceph containers?
Could you please set debug_bluestore to 5/20 and debug_prioritycache to 10
and try to start the OSD once again? Please monitor the process's RAM usage
along the way and share the resulting log.
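For example, something along these lines (a sketch for a cephadm-managed OSD;
adjust the id):

ceph config set osd.1 debug_bluestore 5/20
ceph config set osd.1 debug_prioritycache 10
# then start the OSD and watch its RAM usage while it comes up, e.g.:
podman stats --no-stream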
Thanks,
Igor
On 10/01/2024 11:20, Jan Marek wrote:
> Hi Igor,
>
> I've tried to repair osd.1 with command:
>
> ceph-bluestore-tool --path /var/lib/ceph/2c565e24-7850-47dc-a751-a6357cbbaf2a/osd.1 --command repair
>
> and then start osd.1 ceph-osd podman service.
>
> It semms, that there is problem with memory allocation, see
> attached log...
>
> Sincerely
> Jan
>
> Dne Út, led 09, 2024 at 02:23:32 CET napsal(a) Igor Fedotov:
>> Hi Marek,
>>
>> I haven't looked through those upgrade logs yet but here are some comments
>> regarding last OSD startup attempt.
>>
>> First of answering your question
>>
>>> _init_alloc::NCB::restore_allocator() failed! Run Full Recovery from ONodes (might take a while)
>>> Is it a mandatory part of fsck?
>> This is caused by previous non-graceful OSD process shutdown. BlueStore is unable to find up-to-date allocation map and recovers it from RocksDB. And since fsck is a read-only procedure the recovered allocmap is not saved - hence all the following BlueStore startups (within fsck or OSD init) cause another rebuild attempt. To avoid that you might want to run repair instead of fsck - this will persist up-to-date allocation map and avoid its rebuilding on the next startup. This will work till the next non-graceful shutdown only - hence unsuccessful OSD attempt might break the allocmap state again.
>>
>> Secondly - looking at OSD startup log one can see that actual OSD log ends with that allocmap recovery as well:
>>
>>> 2024-01-09T11:25:30.718449+01:00 osd1 ceph-osd[1734062]: bluestore(/var/lib/ceph/osd/ceph-1) _init_alloc::NCB::restore_allocator() failed! Run Full Recovery from ONodes (might take a while) ...
>> Subsequent log line indicating OSD daemon termination is from systemd:
>>> 2024-01-09T11:25:33.516258+01:00 osd1 systemd[1]: Stopping ceph-2c565e24-7850-47dc-a751-a6357cbbaf2a(a)osd.1.service - Ceph osd.1 for 2c565e24-7850-47dc-a751-a6357cbbaf2a...
>> And honestly these lines provide almost no clue why termination happened. No obvious OSD failures or something are shown. Perhaps containerized environment hides the details e.g. by cutting off OSD log's tail.
>> So you might want to proceed the investigation by running repair prior to starting the OSD as per above. This will result in no alloc map recovery and hopefully workaround the problem during startup - if the issue is caused by allocmap recovery.
>> Additionally you might want to increase debug_bluestore log level for osd.1 before starting it up to get more insight on what's happening.
>>
>> Alternatively you might want to play with OSD log target settings to write OSD.1 log to some file rather than using system wide logging infra - hopefully this will be more helpful.
>>
>> Thanks,
>> Igor
>>
>> On 09/01/2024 13:31, Jan Marek wrote:
>>> Hi Igor,
>>>
>>> I've sent you logs via filesender.cesnet.cz, if someone would
>>> be interested, they are here:
>>>
>>> https://filesender.cesnet.cz/?s=download&token=047b1ec4-4df0-4e8a-90fc-3170…
>>>
>>> Some points:
>>>
>>> 1) I've found, that on the osd1 server was bad time (3 minutes in
>>> future). I've corrected that. Yes, I know, that it's bad, but we
>>> moved servers to any other net segment, where they have no access
>>> to the timeservers in Internet, then I must reconfigure it to use
>>> our own NTP servers.
>>>
>>> 2) I've tried to start osd.1 service by this sequence:
>>>
>>> a)
>>>
>>> ceph-bluestore-tool --path /var/lib/ceph/2c565e24-7850-47dc-a751-a6357cbbaf2a/osd.1 --command fsck
>>>
>>> (without setting log properly :-( )
>>>
>>> b)
>>>
>>> export CEPH_ARGS="--log-file osd.1.log --debug-bluestore 5/20"
>>> ceph-bluestore-tool --path /var/lib/ceph/2c565e24-7850-47dc-a751-a6357cbbaf2a/osd.1 --command fsck
>>>
>>> - here I have one question: Why is it in this log stil this line:
>>>
>>> _init_alloc::NCB::restore_allocator() failed! Run Full Recovery from ONodes (might take a while)
>>>
>>> Is it a mandatory part of fsck?
>>>
>>> Log is attached.
>>>
>>> c)
>>>
>>> systemctl start ceph-2c565e24-7850-47dc-a751-a6357cbbaf2a(a)osd.1.service
>>>
>>> still crashing, gzip-ed log attached too.
>>>
>>> Many thanks for exploring problem.
>>>
>>> Sincerely
>>> Jan Marek
>>>
>>> Dne Po, led 08, 2024 at 12:00:05 CET napsal(a) Igor Fedotov:
>>>> Hi Jan,
>>>>
>>>> indeed fsck logs for the OSDs other than osd.0 look good so it would be
>>>> interesting to see OSD startup logs for them. Preferably to have that for
>>>> multiple (e.g. 3-4) OSDs to get the pattern.
>>>>
>>>> Original upgrade log(s) would be nice to see as well.
>>>>
>>>> You might want to use Google Drive or any other publicly available file
>>>> sharing site for that.
>>>>
>>>>
>>>> Thanks,
>>>>
>>>> Igor
>>>>
>>>> On 05/01/2024 10:25, Jan Marek wrote:
>>>>> Hi Igor,
>>>>>
>>>>> I've tried to start only osd.1, which seems to be fsck'd OK, but
>>>>> it crashed :-(
>>>>>
>>>>> I search logs and I've found, that I have logs from 22.12.2023,
>>>>> when I've did a upgrade (I have set logging to journald).
>>>>>
>>>>> Would you be interested in those logs? This file have 30MB in
>>>>> bzip2 format, how I can share it with you?
>>>>>
>>>>> It contains crash log from start osd.1 too, but I can cut out
>>>>> from it and send it to list...
>>>>>
>>>>> Sincerely
>>>>> Jan Marek
>>>>>
>>>>> Dne Čt, led 04, 2024 at 02:43:48 CET napsal(a) Jan Marek:
>>>>>> Hi Igor,
>>>>>>
>>>>>> I've ran this oneliner:
>>>>>>
>>>>>> for i in {0..12}; do export CEPH_ARGS="--log-file osd."${i}".log --debug-bluestore 5/20" ; ceph-bluestore-tool --path /var/lib/ceph/2c565e24-7850-47dc-a751-a6357cbbaf2a/osd.${i} --command fsck ; done;
>>>>>>
>>>>>> On osd.0 it crashed very quickly, on osd.1 it is still working.
>>>>>>
>>>>>> I've send those logs in one e-mail.
>>>>>>
>>>>>> But!
>>>>>>
>>>>>> I've tried to list disk devices in monitor view, and I've got
>>>>>> very interesting screenshot - some part I've emphasized by red
>>>>>> rectangulars.
>>>>>>
>>>>>> I've got a json from syslog, which was as a part cephadm call,
>>>>>> where it seems to be correct (for my eyes).
>>>>>>
>>>>>> Can be this coincidence for this problem?
>>>>>>
>>>>>> Sincerely
>>>>>> Jan Marek
>>>>>>
>>>>>> Dne Čt, led 04, 2024 at 12:32:47 CET napsal(a) Igor Fedotov:
>>>>>>> Hi Jan,
>>>>>>>
>>>>>>> may I see the fsck logs from all the failing OSDs to see the pattern. IIUC
>>>>>>> the full node is suffering from the issue, right?
>>>>>>>
>>>>>>>
>>>>>>> Thanks,
>>>>>>>
>>>>>>> Igor
>>>>>>>
>>>>>>> On 1/2/2024 10:53 AM, Jan Marek wrote:
>>>>>>>> Hello once again,
>>>>>>>>
>>>>>>>> I've tried this:
>>>>>>>>
>>>>>>>> export CEPH_ARGS="--log-file /tmp/osd.0.log --debug-bluestore 5/20"
>>>>>>>> ceph-bluestore-tool --path /var/lib/ceph/2c565e24-7850-47dc-a751-a6357cbbaf2a/osd.0 --command fsck
>>>>>>>>
>>>>>>>> And I've sending /tmp/osd.0.log file attached.
>>>>>>>>
>>>>>>>> Sincerely
>>>>>>>> Jan Marek
>>>>>>>>
>>>>>>>> Dne Ne, pro 31, 2023 at 12:38:13 CET napsal(a) Igor Fedotov:
>>>>>>>>> Hi Jan,
>>>>>>>>>
>>>>>>>>> this doesn't look like RocksDB corruption but rather like some BlueStore
>>>>>>>>> metadata inconsistency. Also assertion backtrace in the new log looks
>>>>>>>>> completely different from the original one. So in an attempt to find any
>>>>>>>>> systematic pattern I'd suggest to run fsck with verbose logging for every
>>>>>>>>> failing OSD. Relevant command line:
>>>>>>>>>
>>>>>>>>> CEPH_ARGS="--log-file osd.N.log --debug-bluestore 5/20"
>>>>>>>>> bin/ceph-bluestore-tool --path <path-to-osd> --command fsck
>>>>>>>>>
>>>>>>>>> Unlikely this will fix anything it's rather a way to collect logs to get
>>>>>>>>> better insight.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> Additionally you might want to run similar fsck for a couple of healthy OSDs
>>>>>>>>> - curious if it succeeds as I have a feeling that the problem with crashing
>>>>>>>>> OSDs had been hidden before the upgrade and revealed rather than caused by
>>>>>>>>> it.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> Thanks,
>>>>>>>>>
>>>>>>>>> Igor
>>>>>>>>>
>>>>>>>>> On 12/29/2023 3:28 PM, Jan Marek wrote:
>>>>>>>>>> Hello Igor,
>>>>>>>>>>
>>>>>>>>>> I'm attaching a part of syslog creating while starting OSD.0.
>>>>>>>>>>
>>>>>>>>>> Many thanks for help.
>>>>>>>>>>
>>>>>>>>>> Sincerely
>>>>>>>>>> Jan Marek
>>>>>>>>>>
>>>>>>>>>> Dne St, pro 27, 2023 at 04:42:56 CET napsal(a) Igor Fedotov:
>>>>>>>>>>> Hi Jan,
>>>>>>>>>>>
>>>>>>>>>>> IIUC the attached log is for ceph-kvstore-tool, right?
>>>>>>>>>>>
>>>>>>>>>>> Can you please share full OSD startup log as well?
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> Thanks,
>>>>>>>>>>>
>>>>>>>>>>> Igor
>>>>>>>>>>>
>>>>>>>>>>> On 12/27/2023 4:30 PM, Jan Marek wrote:
>>>>>>>>>>>> Hello,
>>>>>>>>>>>>
>>>>>>>>>>>> I've problem: my ceph cluster (3x mon nodes, 6x osd nodes, every
>>>>>>>>>>>> osd node have 12 rotational disk and one NVMe device for
>>>>>>>>>>>> bluestore DB). CEPH is installed by ceph orchestrator and have
>>>>>>>>>>>> bluefs storage on osd.
>>>>>>>>>>>>
>>>>>>>>>>>> I've started process upgrade from version 17.2.6 to 18.2.1 by
>>>>>>>>>>>> invocating:
>>>>>>>>>>>>
>>>>>>>>>>>> ceph orch upgrade start --ceph-version 18.2.1
>>>>>>>>>>>>
>>>>>>>>>>>> After upgrade of mon and mgr processes orchestrator tried to
>>>>>>>>>>>> upgrade the first OSD node, but they are falling down.
>>>>>>>>>>>>
>>>>>>>>>>>> I've stop the process of upgrade, but I have 1 osd node
>>>>>>>>>>>> completely down.
>>>>>>>>>>>>
>>>>>>>>>>>> After upgrade I've got some error messages and I've found
>>>>>>>>>>>> /var/lib/ceph/crashxxxx directories, I attach to this message
>>>>>>>>>>>> files, which I've found here.
>>>>>>>>>>>>
>>>>>>>>>>>> Please, can you advice, what now I can do? It seems, that rocksdb
>>>>>>>>>>>> is even non-compatible or corrupted :-(
>>>>>>>>>>>>
>>>>>>>>>>>> Thanks in advance.
>>>>>>>>>>>>
>>>>>>>>>>>> Sincerely
>>>>>>>>>>>> Jan Marek
>>>>>>>>>>>>
>>>>>>>>>>>> _______________________________________________
>>>>>>>>>>>> ceph-users mailing list -- ceph-users(a)ceph.io
>>>>>>>>>>>> To unsubscribe send an email to ceph-users-leave(a)ceph.io
>>>>>>>>>>> --
>>>>>>>>>>> Igor Fedotov
>>>>>>>>>>> Ceph Lead Developer
>>>>>>>>>>>
>>>>>>>>>>> Looking for help with your Ceph cluster? Contact us at https://croit.io
>>>>>>>>>>>
>>>>>>>>>>> croit GmbH, Freseniusstr. 31h, 81247 Munich
>>>>>>>>>>> CEO: Martin Verges - VAT-ID: DE310638492
>>>>>>>>>>> Com. register: Amtsgericht Munich HRB 231263
>>>>>>>>>>> Web: https://croit.io | YouTube: https://goo.gl/PGE1Bx
>>>>>>>>>>>
>>>>>>>>> --
>>>>>>>>> Igor Fedotov
>>>>>>>>> Ceph Lead Developer
>>>>>>>>>
>>>>>>>>> Looking for help with your Ceph cluster? Contact us at https://croit.io
>>>>>>>>>
>>>>>>>>> croit GmbH, Freseniusstr. 31h, 81247 Munich
>>>>>>>>> CEO: Martin Verges - VAT-ID: DE310638492
>>>>>>>>> Com. register: Amtsgericht Munich HRB 231263
>>>>>>>>> Web: https://croit.io | YouTube: https://goo.gl/PGE1Bx
>>>>>>>>> _______________________________________________
>>>>>>>>> ceph-users mailing list -- ceph-users(a)ceph.io
>>>>>>>>> To unsubscribe send an email to ceph-users-leave(a)ceph.io
>>>>>>> --
>>>>>>> Igor Fedotov
>>>>>>> Ceph Lead Developer
>>>>>>>
>>>>>>> Looking for help with your Ceph cluster? Contact us at https://croit.io
>>>>>>>
>>>>>>> croit GmbH, Freseniusstr. 31h, 81247 Munich
>>>>>>> CEO: Martin Verges - VAT-ID: DE310638492
>>>>>>> Com. register: Amtsgericht Munich HRB 231263
>>>>>>> Web: https://croit.io | YouTube: https://goo.gl/PGE1Bx
>>>>>>> _______________________________________________
>>>>>>> ceph-users mailing list -- ceph-users(a)ceph.io
>>>>>>> To unsubscribe send an email to ceph-users-leave(a)ceph.io
>>>>>> --
>>>>>> Ing. Jan Marek
>>>>>> University of South Bohemia
>>>>>> Academic Computer Centre
>>>>>> Phone: +420389032080
>>>>>> http://www.gnu.org/philosophy/no-word-attachments.cs.html
>>>>>
>>>>>
>>>>>
>>>>>> _______________________________________________
>>>>>> ceph-users mailing list -- ceph-users(a)ceph.io
>>>>>> To unsubscribe send an email to ceph-users-leave(a)ceph.io
>> _______________________________________________
>> ceph-users mailing list -- ceph-users(a)ceph.io
>> To unsubscribe send an email to ceph-users-leave(a)ceph.io
Hello everyone,
we are running a small cluster with 3 nodes and 25 OSDs per node, on Ceph
version 17.2.6.
Recently the active MDS crashed, and since then each newly started MDS has
remained in the up:replay state. In the output of the command 'ceph tell
mds.cephfs:0 status' you can see that the journal is read in completely; as
soon as that finishes, the MDS crashes and the next one starts reading the
journal.
At the moment I have a journal inspection running ('cephfs-journal-tool
--rank=cephfs:0 journal inspect').
Does anyone have any suggestions on how I can get the cluster running again
as quickly as possible?
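In case it matters, a sketch of the non-destructive commands that are
typically run first in this situation (adjust the file system name, rank and
paths to your setup):

cephfs-journal-tool --rank=cephfs:0 journal export /root/cephfs.0.journal.bin   # back up the journal before any recovery attempt
ceph crash ls          # list recent daemon crashes
ceph crash info <id>   # full backtrace of the MDS crash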
Best regards
Lars
Lars Köppel
Developer
Email: lars.koeppel(a)ariadne.ai
Phone: +49 6221 5993580
ariadne.ai (Germany) GmbH
Häusserstraße 3, 69115 Heidelberg
Amtsgericht Mannheim, HRB 744040
Geschäftsführer: Dr. Fabian Svara
https://ariadne.ai
Hi folks,
I am fighting a bit with odd deep-scrub behavior on HDDs and discovered a likely cause of why the distribution of last_deep_scrub_stamps is so weird. I wrote a small script to extract a histogram of scrubs by "days not scrubbed" (more precisely, intervals not scrubbed; see code) to find out how (deep-) scrub times are distributed. Output below.
What I expected is something along the lines of HDD OSDs scrubbing every 1-3 days and deep-scrubbing every 7-14 days. In other words, OSDs that have been deep-scrubbed within the last 7 days would *never* be in scrubbing+deep state. However, what I see is completely different: there seems to be no distinction between scrub and deep-scrub start times. This is really unexpected, as nobody would try to deep-scrub HDDs every day; weekly to bi-weekly is normal, especially for large drives.
Is there a way to configure something like osd_deep_scrub_min_interval (no, I don't want to run cron jobs for scrubbing yet)? In the output below, I would like to be able to configure a minimum period of 1-2 weeks before the next deep-scrub happens. How can I do that?
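For reference, there is no osd_deep_scrub_min_interval, but these existing options control the scheduling and might approximate a minimum period (a sketch; option names as in current releases, values in seconds - please verify the defaults on your version):

ceph config set osd osd_deep_scrub_interval 1209600      # aim for a ~14-day deep-scrub cadence
ceph config set osd osd_deep_scrub_randomize_ratio 0.0   # don't randomly start deep scrubs early
ceph config get osd osd_scrub_min_interval               # minimum gap between (shallow) scrubs of a PG
ceph config get osd osd_scrub_max_interval               # ceiling after which a scrub is forced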
The observed behavior is very unusual compared to RAID systems (if it's not a bug in the report script). With this behavior it's not surprising that people complain about "not deep-scrubbed in time" messages and excessive deep-scrub IO load when such a large percentage of OSDs is needlessly deep-scrubbed again after only 1-6 days.
Sample output:
# scrub-report
dumped pgs
Scrub report:
4121 PGs not scrubbed since 1 intervals (6h)
3831 PGs not scrubbed since 2 intervals (6h)
4012 PGs not scrubbed since 3 intervals (6h)
3986 PGs not scrubbed since 4 intervals (6h)
2998 PGs not scrubbed since 5 intervals (6h)
1488 PGs not scrubbed since 6 intervals (6h)
909 PGs not scrubbed since 7 intervals (6h)
771 PGs not scrubbed since 8 intervals (6h)
582 PGs not scrubbed since 9 intervals (6h) 2 scrubbing
431 PGs not scrubbed since 10 intervals (6h)
333 PGs not scrubbed since 11 intervals (6h) 1 scrubbing
265 PGs not scrubbed since 12 intervals (6h)
195 PGs not scrubbed since 13 intervals (6h)
116 PGs not scrubbed since 14 intervals (6h)
78 PGs not scrubbed since 15 intervals (6h) 1 scrubbing
72 PGs not scrubbed since 16 intervals (6h)
37 PGs not scrubbed since 17 intervals (6h)
5 PGs not scrubbed since 18 intervals (6h) 14.237* 19.5cd* 19.12cc* 19.1233* 14.40e*
33 PGs not scrubbed since 20 intervals (6h)
23 PGs not scrubbed since 21 intervals (6h)
16 PGs not scrubbed since 22 intervals (6h)
12 PGs not scrubbed since 23 intervals (6h)
8 PGs not scrubbed since 24 intervals (6h)
2 PGs not scrubbed since 25 intervals (6h) 19.eef* 19.bb3*
4 PGs not scrubbed since 26 intervals (6h) 19.b4c* 19.10b8* 19.f13* 14.1ed*
5 PGs not scrubbed since 27 intervals (6h) 19.43f* 19.231* 19.1dbe* 19.1788* 19.16c0*
6 PGs not scrubbed since 28 intervals (6h)
2 PGs not scrubbed since 30 intervals (6h) 19.10f6* 14.9d*
3 PGs not scrubbed since 31 intervals (6h) 19.1322* 19.1318* 8.a*
1 PGs not scrubbed since 32 intervals (6h) 19.133f*
1 PGs not scrubbed since 33 intervals (6h) 19.1103*
3 PGs not scrubbed since 36 intervals (6h) 19.19cc* 19.12f4* 19.248*
1 PGs not scrubbed since 39 intervals (6h) 19.1984*
1 PGs not scrubbed since 41 intervals (6h) 14.449*
1 PGs not scrubbed since 44 intervals (6h) 19.179f*
Deep-scrub report:
3723 PGs not deep-scrubbed since 1 intervals (24h)
4621 PGs not deep-scrubbed since 2 intervals (24h) 8 scrubbing+deep
3588 PGs not deep-scrubbed since 3 intervals (24h) 8 scrubbing+deep
2929 PGs not deep-scrubbed since 4 intervals (24h) 3 scrubbing+deep
1705 PGs not deep-scrubbed since 5 intervals (24h) 4 scrubbing+deep
1904 PGs not deep-scrubbed since 6 intervals (24h) 5 scrubbing+deep
1540 PGs not deep-scrubbed since 7 intervals (24h) 7 scrubbing+deep
1304 PGs not deep-scrubbed since 8 intervals (24h) 7 scrubbing+deep
923 PGs not deep-scrubbed since 9 intervals (24h) 5 scrubbing+deep
557 PGs not deep-scrubbed since 10 intervals (24h) 7 scrubbing+deep
501 PGs not deep-scrubbed since 11 intervals (24h) 2 scrubbing+deep
363 PGs not deep-scrubbed since 12 intervals (24h) 2 scrubbing+deep
377 PGs not deep-scrubbed since 13 intervals (24h) 1 scrubbing+deep
383 PGs not deep-scrubbed since 14 intervals (24h) 2 scrubbing+deep
252 PGs not deep-scrubbed since 15 intervals (24h) 2 scrubbing+deep
116 PGs not deep-scrubbed since 16 intervals (24h) 5 scrubbing+deep
47 PGs not deep-scrubbed since 17 intervals (24h) 2 scrubbing+deep
10 PGs not deep-scrubbed since 18 intervals (24h)
2 PGs not deep-scrubbed since 19 intervals (24h) 19.1c6c* 19.a01*
1 PGs not deep-scrubbed since 20 intervals (24h) 14.1ed*
2 PGs not deep-scrubbed since 21 intervals (24h) 19.1322* 19.10f6*
1 PGs not deep-scrubbed since 23 intervals (24h) 19.19cc*
1 PGs not deep-scrubbed since 24 intervals (24h) 19.179f*
PGs marked with a * are on busy OSDs and not eligible for scrubbing.
The script (pasted here because attaching doesn't work):
# cat bin/scrub-report
#!/bin/bash
# Compute last scrub interval count. Scrub interval 6h, deep-scrub interval 24h.
# Print how many PGs have not been (deep-)scrubbed since #intervals.
ceph -f json pg dump pgs 2>&1 > /root/.cache/ceph/pgs_dump.json
echo ""
T0="$(date +%s)"
scrub_info="$(jq --arg T0 "$T0" -rc '.pg_stats[] | [
.pgid,
(.last_scrub_stamp[:19]+"Z" | (($T0|tonumber) - fromdateiso8601)/(60*60*6)|ceil),
(.last_deep_scrub_stamp[:19]+"Z" | (($T0|tonumber) - fromdateiso8601)/(60*60*24)|ceil),
.state,
(.acting | join(" "))
] | @tsv
' /root/.cache/ceph/pgs_dump.json)"
# less <<<"$scrub_info"
# 1 2 3 4 5..NF
# pg_id scrub-ints deep-scrub-ints status acting[]
awk <<<"$scrub_info" '{
for(i=5; i<=NF; ++i) pg_osds[$1]=pg_osds[$1] " " $i
if($4 == "active+clean") {
si_mx=si_mx<$2 ? $2 : si_mx
dsi_mx=dsi_mx<$3 ? $3 : dsi_mx
pg_sn[$2]++
pg_sn_ids[$2]=pg_sn_ids[$2] " " $1
pg_dsn[$3]++
pg_dsn_ids[$3]=pg_dsn_ids[$3] " " $1
} else if($4 ~ /scrubbing\+deep/) {
deep_scrubbing[$3]++
for(i=5; i<=NF; ++i) osd[$i]="busy"
} else if($4 ~ /scrubbing/) {
scrubbing[$2]++
for(i=5; i<=NF; ++i) osd[$i]="busy"
} else {
unclean[$2]++
unclean_d[$3]++
si_mx=si_mx<$2 ? $2 : si_mx
dsi_mx=dsi_mx<$3 ? $3 : dsi_mx
pg_sn[$2]++
pg_sn_ids[$2]=pg_sn_ids[$2] " " $1
pg_dsn[$3]++
pg_dsn_ids[$3]=pg_dsn_ids[$3] " " $1
for(i=5; i<=NF; ++i) osd[$i]="busy"
}
}
END {
print "Scrub report:"
for(si=1; si<=si_mx; ++si) {
if(pg_sn[si]==0 && scrubbing[si]==0 && unclean[si]==0) continue;
printf("%7d PGs not scrubbed since %2d intervals (6h)", pg_sn[si], si)
if(scrubbing[si]) printf(" %d scrubbing", scrubbing[si])
if(unclean[si]) printf(" %d unclean", unclean[si])
if(pg_sn[si]<=5) {
split(pg_sn_ids[si], pgs)
osds_busy=0
for(pg in pgs) {
split(pg_osds[pgs[pg]], osds)
for(o in osds) if(osd[osds[o]]=="busy") osds_busy=1
if(osds_busy) printf(" %s*", pgs[pg])
if(!osds_busy) printf(" %s", pgs[pg])
}
}
printf("\n")
}
print ""
print "Deep-scrub report:"
for(dsi=1; dsi<=dsi_mx; ++dsi) {
if(pg_dsn[dsi]==0 && deep_scrubbing[dsi]==0 && unclean_d[dsi]==0) continue;
printf("%7d PGs not deep-scrubbed since %2d intervals (24h)", pg_dsn[dsi], dsi)
if(deep_scrubbing[dsi]) printf(" %d scrubbing+deep", deep_scrubbing[dsi])
if(unclean_d[dsi]) printf(" %d unclean", unclean_d[dsi])
if(pg_dsn[dsi]<=5) {
split(pg_dsn_ids[dsi], pgs)
osds_busy=0
for(pg in pgs) {
split(pg_osds[pgs[pg]], osds)
for(o in osds) if(osd[osds[o]]=="busy") osds_busy=1
if(osds_busy) printf(" %s*", pgs[pg])
if(!osds_busy) printf(" %s", pgs[pg])
}
}
printf("\n")
}
print ""
print "PGs marked with a * are on busy OSDs and not eligible for scrubbing."
}
'
Don't forget the last "'" when copy-pasting.
Thanks for any pointers.
=================
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14
Hi Marek,
I haven't looked through those upgrade logs yet, but here are some
comments regarding the last OSD startup attempt.
First of all, answering your question:
>_init_alloc::NCB::restore_allocator() failed! Run Full Recovery from ONodes (might take a while)
>Is it a mandatory part of fsck?
This is caused by a previous non-graceful OSD process shutdown. BlueStore is unable to find an up-to-date allocation map and recovers it from RocksDB. And since fsck is a read-only procedure, the recovered allocmap is not saved - hence all the following BlueStore startups (within fsck or OSD init) trigger another rebuild attempt. To avoid that you might want to run repair instead of fsck - this will persist the up-to-date allocation map and avoid rebuilding it on the next startup. This only works until the next non-graceful shutdown - hence an unsuccessful OSD start attempt might break the allocmap state again.
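E.g. something along these lines, run while the OSD is stopped (same path as used elsewhere in this thread):

ceph-bluestore-tool --path /var/lib/ceph/2c565e24-7850-47dc-a751-a6357cbbaf2a/osd.1 --command repair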
Secondly - looking at the OSD startup log, one can see that the actual OSD log ends with that allocmap recovery as well:
> 2024-01-09T11:25:30.718449+01:00 osd1 ceph-osd[1734062]: bluestore(/var/lib/ceph/osd/ceph-1) _init_alloc::NCB::restore_allocator() failed! Run Full Recovery from ONodes (might take a while) ...
The subsequent log line indicating OSD daemon termination is from systemd:
> 2024-01-09T11:25:33.516258+01:00 osd1 systemd[1]: Stopping ceph-2c565e24-7850-47dc-a751-a6357cbbaf2a@osd.1.service - Ceph osd.1 for 2c565e24-7850-47dc-a751-a6357cbbaf2a...
And honestly these lines provide almost no clue as to why the termination happened. No obvious OSD failures are shown. Perhaps the containerized environment hides the details, e.g. by cutting off the tail of the OSD log.
So you might want to proceed with the investigation by running repair prior to starting the OSD, as per the above. This avoids the alloc map recovery and will hopefully work around the problem during startup - if the issue is indeed caused by the allocmap recovery.
Additionally, you might want to increase the debug_bluestore log level for osd.1 before starting it up, to get more insight into what's happening.
Alternatively, you might want to play with the OSD log target settings to write the osd.1 log to a file rather than using the system-wide logging infra - hopefully that will be more helpful.
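For example (just a sketch; debug_bluestore and log_to_file are standard config options - pick values that suit your environment):
ceph config set osd.1 debug_bluestore 20
ceph config set osd.1 log_to_file true
# then retry the start; with the usual cephadm layout the log ends up in
# /var/log/ceph/2c565e24-7850-47dc-a751-a6357cbbaf2a/ceph-osd.1.log on the OSD host
systemctl start ceph-2c565e24-7850-47dc-a751-a6357cbbaf2a@osd.1.service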
Thanks,
Igor
On 09/01/2024 13:31, Jan Marek wrote:
> Hi Igor,
>
> I've sent you logs via filesender.cesnet.cz, if someone would
> be interested, they are here:
>
> https://filesender.cesnet.cz/?s=download&token=047b1ec4-4df0-4e8a-90fc-3170…
>
> Some points:
>
> 1) I've found that the osd1 server had its clock wrong (3 minutes in
> the future). I've corrected that. Yes, I know that's bad, but we
> moved the servers to another network segment where they have no access
> to the time servers on the Internet, so I have to reconfigure them to
> use our own NTP servers.
>
> 2) I've tried to start osd.1 service by this sequence:
>
> a)
>
> ceph-bluestore-tool --path /var/lib/ceph/2c565e24-7850-47dc-a751-a6357cbbaf2a/osd.1 --command fsck
>
> (without setting log properly :-( )
>
> b)
>
> export CEPH_ARGS="--log-file osd.1.log --debug-bluestore 5/20"
> ceph-bluestore-tool --path /var/lib/ceph/2c565e24-7850-47dc-a751-a6357cbbaf2a/osd.1 --command fsck
>
> - here I have one question: why is this line still in the log:
>
> _init_alloc::NCB::restore_allocator() failed! Run Full Recovery from ONodes (might take a while)
>
> Is it a mandatory part of fsck?
>
> Log is attached.
>
> c)
>
> systemctl start ceph-2c565e24-7850-47dc-a751-a6357cbbaf2a@osd.1.service
>
> still crashing, gzip-ed log attached too.
>
> Many thanks for looking into the problem.
>
> Sincerely
> Jan Marek
>
> On Mon, Jan 08, 2024 at 12:00:05 CET, Igor Fedotov wrote:
>> Hi Jan,
>>
>> indeed, the fsck logs for the OSDs other than osd.0 look good, so it would be
>> interesting to see the OSD startup logs for them. Preferably for
>> multiple (e.g. 3-4) OSDs, to get the pattern.
>>
>> Original upgrade log(s) would be nice to see as well.
>>
>> You might want to use Google Drive or any other publicly available file
>> sharing site for that.
>>
>>
>> Thanks,
>>
>> Igor
>>
>> On 05/01/2024 10:25, Jan Marek wrote:
>>> Hi Igor,
>>>
>>> I've tried to start only osd.1, which seems to be fsck'd OK, but
>>> it crashed :-(
>>>
>>> I searched the logs and found that I have logs from 22.12.2023,
>>> when I did the upgrade (I have logging set to journald).
>>>
>>> Would you be interested in those logs? The file is 30 MB in
>>> bzip2 format; how can I share it with you?
>>>
>>> It also contains the crash log from starting osd.1, but I can cut that
>>> out and send it to the list...
>>>
>>> Sincerely
>>> Jan Marek
>>>
>>> On Thu, Jan 04, 2024 at 02:43:48 CET, Jan Marek wrote:
>>>> Hi Igor,
>>>>
>>>> I've run this one-liner:
>>>>
>>>> for i in {0..12}; do export CEPH_ARGS="--log-file osd."${i}".log --debug-bluestore 5/20" ; ceph-bluestore-tool --path /var/lib/ceph/2c565e24-7850-47dc-a751-a6357cbbaf2a/osd.${i} --command fsck ; done;
>>>>
>>>> On osd.0 it crashed very quickly, on osd.1 it is still working.
>>>>
>>>> I've sent those logs in one e-mail.
>>>>
>>>> But!
>>>>
>>>> I've tried to list the disk devices in the monitor view, and I've got a
>>>> very interesting screenshot - I've highlighted some parts with red
>>>> rectangles.
>>>>
>>>> I've got a JSON from syslog, which was part of a cephadm call, and it
>>>> seems to be correct (to my eyes).
>>>>
>>>> Could this be related to the problem?
>>>>
>>>> Sincerely
>>>> Jan Marek
>>>>
>>>> On Thu, Jan 04, 2024 at 12:32:47 CET, Igor Fedotov wrote:
>>>>> Hi Jan,
>>>>>
>>>>> may I see the fsck logs from all the failing OSDs, to look for a pattern? IIUC
>>>>> the whole node is suffering from the issue, right?
>>>>>
>>>>>
>>>>> Thanks,
>>>>>
>>>>> Igor
>>>>>
>>>>> On 1/2/2024 10:53 AM, Jan Marek wrote:
>>>>>> Hello once again,
>>>>>>
>>>>>> I've tried this:
>>>>>>
>>>>>> export CEPH_ARGS="--log-file /tmp/osd.0.log --debug-bluestore 5/20"
>>>>>> ceph-bluestore-tool --path /var/lib/ceph/2c565e24-7850-47dc-a751-a6357cbbaf2a/osd.0 --command fsck
>>>>>>
>>>>>> And I've sending /tmp/osd.0.log file attached.
>>>>>>
>>>>>> Sincerely
>>>>>> Jan Marek
>>>>>>
>>>>>> On Sun, Dec 31, 2023 at 12:38:13 CET, Igor Fedotov wrote:
>>>>>>> Hi Jan,
>>>>>>>
>>>>>>> this doesn't look like RocksDB corruption but rather like some BlueStore
>>>>>>> metadata inconsistency. Also, the assertion backtrace in the new log looks
>>>>>>> completely different from the original one. So, in an attempt to find any
>>>>>>> systematic pattern, I'd suggest running fsck with verbose logging for every
>>>>>>> failing OSD. Relevant command line:
>>>>>>>
>>>>>>> CEPH_ARGS="--log-file osd.N.log --debug-bluestore 5/20"
>>>>>>> bin/ceph-bluestore-tool --path <path-to-osd> --command fsck
>>>>>>>
>>>>>>> This is unlikely to fix anything; it's rather a way to collect logs to get
>>>>>>> better insight.
>>>>>>>
>>>>>>>
>>>>>>> Additionally, you might want to run a similar fsck for a couple of healthy OSDs
>>>>>>> - curious whether it succeeds, as I have a feeling that the problem with the
>>>>>>> crashing OSDs had been hidden before the upgrade and was revealed rather than
>>>>>>> caused by it.
>>>>>>>
>>>>>>>
>>>>>>> Thanks,
>>>>>>>
>>>>>>> Igor
>>>>>>>
>>>>>>> On 12/29/2023 3:28 PM, Jan Marek wrote:
>>>>>>>> Hello Igor,
>>>>>>>>
>>>>>>>> I'm attaching a part of the syslog created while starting OSD.0.
>>>>>>>>
>>>>>>>> Many thanks for the help.
>>>>>>>>
>>>>>>>> Sincerely
>>>>>>>> Jan Marek
>>>>>>>>
>>>>>>>>> On Wed, Dec 27, 2023 at 04:42:56 CET, Igor Fedotov wrote:
>>>>>>>>> Hi Jan,
>>>>>>>>>
>>>>>>>>> IIUC the attached log is for ceph-kvstore-tool, right?
>>>>>>>>>
>>>>>>>>> Can you please share full OSD startup log as well?
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> Thanks,
>>>>>>>>>
>>>>>>>>> Igor
>>>>>>>>>
>>>>>>>>> On 12/27/2023 4:30 PM, Jan Marek wrote:
>>>>>>>>>> Hello,
>>>>>>>>>>
>>>>>>>>>> I have a problem: my ceph cluster (3x mon nodes, 6x osd nodes; every
>>>>>>>>>> osd node has 12 rotational disks and one NVMe device for the
>>>>>>>>>> bluestore DB). Ceph is installed by the ceph orchestrator and has
>>>>>>>>>> bluefs storage on the OSDs.
>>>>>>>>>>
>>>>>>>>>> I've started the upgrade process from version 17.2.6 to 18.2.1 by
>>>>>>>>>> invoking:
>>>>>>>>>>
>>>>>>>>>> ceph orch upgrade start --ceph-version 18.2.1
>>>>>>>>>>
>>>>>>>>>> After the upgrade of the mon and mgr processes, the orchestrator tried to
>>>>>>>>>> upgrade the first OSD node, but its OSDs keep falling down.
>>>>>>>>>>
>>>>>>>>>> I've stopped the upgrade process, but I have 1 osd node
>>>>>>>>>> completely down.
>>>>>>>>>>
>>>>>>>>>> After the upgrade I got some error messages and found
>>>>>>>>>> /var/lib/ceph/crashxxxx directories; I'm attaching to this message
>>>>>>>>>> the files which I found there.
>>>>>>>>>>
>>>>>>>>>> Please, can you advise what I can do now? It seems that RocksDB
>>>>>>>>>> is either incompatible or corrupted :-(
>>>>>>>>>>
>>>>>>>>>> Thanks in advance.
>>>>>>>>>>
>>>>>>>>>> Sincerely
>>>>>>>>>> Jan Marek
>>>>>>>>>>
>>>>>>>>> --
>>>>>>>>> Igor Fedotov
>>>>>>>>> Ceph Lead Developer
>>>>>>>>>
>>>>>>>>> Looking for help with your Ceph cluster? Contact us at https://croit.io
>>>>>>>>>
>>>>>>>>> croit GmbH, Freseniusstr. 31h, 81247 Munich
>>>>>>>>> CEO: Martin Verges - VAT-ID: DE310638492
>>>>>>>>> Com. register: Amtsgericht Munich HRB 231263
>>>>>>>>> Web: https://croit.io | YouTube: https://goo.gl/PGE1Bx
>>>>>>>>>
>>>> --
>>>> Ing. Jan Marek
>>>> University of South Bohemia
>>>> Academic Computer Centre
>>>> Phone: +420389032080
>>>> http://www.gnu.org/philosophy/no-word-attachments.cs.html
>>>
>>>
>>>
>>>
Hi All,
A little help please.
TL;DR: Please help with this error message:
~~~
REST API failure, code : 500
Unable to access the configuration object
Unable to contact the local API endpoint (https://localhost:5000/api)
~~~
The Issue
------------
I've been through the documentation and can't find what I'm looking for
- possibly because I'm not really sure what it is I *am* looking for, so
if someone can point me in the right direction I would really appreciate it.
I get the above error message when I run the `gwcli` command from inside
a cephadm shell.
What I'm trying to do is set up a set of iSCSI Gateways in our Ceph Reef
18.2.1 Cluster (yes, I know it's being deprecated as of Nov 22 - or
whatever). We recently migrated/upgraded from a manual install of
Quincy to a cephadm install of Reef - everything went AOK *except* for
the iSCSI Gateways. So we tore them down and then rebuilt them as per
the latest documentation. So now we've got 3 gateways as per the Services
page of the Dashboard, and I'm trying to create the targets.
I tried via the Dashboard but had errors, so instead I went in to do it
via gwcli and hit the above error (which I now believe to be the cause of
the GUI creation errors I encountered).
I have absolutely no experience with podman or containers in general,
and can't work out how to fix the issue. So I'm requesting some help -
not to solve the problem for me, but to point me in the right direction
to solve it myself. :-)
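From what I can tell from the docs, these are the sorts of checks I should be running on a gateway host (the daemon name in the last command is a placeholder I'd take from `cephadm ls`):
~~~
# list the cephadm-managed daemons on this host and find the iscsi one
sudo cephadm ls | grep -i iscsi

# is anything listening on the API port from the error message?
sudo ss -tlnp | grep ':5000'

# follow the logs of the iscsi container (daemon name is a placeholder)
sudo cephadm logs --name iscsi.iscsi.gw01.xxxxxx
~~~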
So, anyone?
Cheers
Dulux-Oz