I am running Ceph 15.2.13 on CentOS 7.9.2009 and recently my MDS servers
have started failing with the error message:
In function 'void Server::handle_client_open(MDRequestRef&)' thread
7f0ca9908700 time 2021-06-28T09:21:11.484768+0200
/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/gigantic/release/15.2.13/rpm/el7/BUILD/ceph-15.2.13/src/mds/Server.cc:
4149: FAILED ceph_assert(cur->is_auth())
Complete log is:
https://gist.github.com/pvanheus/4da555a6de6b5fa5e46cbf74f5500fbd
ceph status output is:
# ceph status
  cluster:
    id:     ed7b2c16-b053-45e2-a1fe-bf3474f90508
    health: HEALTH_WARN
            30 OSD(s) experiencing BlueFS spillover
            insufficient standby MDS daemons available
            1 MDSs report slow requests
            2 mgr modules have failed dependencies
            4347046/326505282 objects misplaced (1.331%)
            6 nearfull osd(s)
            23 pgs not deep-scrubbed in time
            23 pgs not scrubbed in time
            8 pool(s) nearfull

  services:
    mon: 3 daemons, quorum ceph-mon1,ceph-mon2,ceph-mon3 (age 22m)
    mgr: ceph-mon1(active, since 11w), standbys: ceph-mon2, ceph-mon3
    mds: SANBI_FS:2 {0=ceph-mon1=up:active(laggy or crashed),1=ceph-mon2=up:stopping}
    osd: 54 osds: 54 up (since 2w), 54 in (since 11w); 50 remapped pgs

  data:
    pools:   8 pools, 833 pgs
    objects: 42.37M objects, 89 TiB
    usage:   159 TiB used, 105 TiB / 264 TiB avail
    pgs:     4347046/326505282 objects misplaced (1.331%)
             782 active+clean
             49  active+clean+remapped
             1   active+clean+scrubbing+deep
             1   active+clean+remapped+scrubbing

  io:
    client: 29 KiB/s rd, 427 KiB/s wr, 37 op/s rd, 48 op/s wr
When restarting an MDS it goes through the replay, reconnect and resolve
states and finally sets itself to active before this crash happens.
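For reference, the crash details can be pulled from the cluster like this (a
minimal sketch; <crash-id> is a placeholder and this assumes the crash module
is enabled):

# list recent daemon crashes and dump the full backtrace for one of them
ceph crash ls
ceph crash info <crash-id>

# current ranks and states of the MDS daemons for this file system
ceph fs status SANBI_FS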
Any advice on what to do?
Thanks,
Peter
P.S. apologies if you received this email more than once - I have had some
trouble figuring out the correct mailing list to use.
Hi,
I have set up a Ceph cluster with cephadm using the Docker backend.
I want to move /var/lib/docker to a separate device to get better
performance and less load on the OS device.
I tried that by stopping Docker, copying the content of /var/lib/docker to
the new device, and mounting the new device at /var/lib/docker.
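The move was roughly the following (a sketch; /dev/sdX1 and /mnt/docker-new
are placeholders):

# stop Docker so /var/lib/docker is quiescent
systemctl stop docker

# copy the data to the new device, preserving hard links, ACLs and xattrs
mount /dev/sdX1 /mnt/docker-new
rsync -aHAX /var/lib/docker/ /mnt/docker-new/
umount /mnt/docker-new

# mount the new device at /var/lib/docker (plus an fstab entry) and restart Docker
mount /dev/sdX1 /var/lib/docker
systemctl start docker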
The other containers started as expected and continue to work and run
as expected.
But the Ceph containers seem to be broken, and I am not able to get them
back into a working state.
I have tried to remove the host with `ceph orch host rm itcnchn-bb4067`
and re-add it, but with no effect.
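The remove/re-add cycle was roughly (a sketch):

ceph orch host rm itcnchn-bb4067
ceph orch host add itcnchn-bb4067
ceph orch ps itcnchn-bb4067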
The strange thing is that 2 of the 4 containers come up as expected:
ceph orch ps itcnchn-bb4067
NAME                                  HOST            STATUS         REFRESHED  AGE  VERSION    IMAGE NAME               IMAGE ID      CONTAINER ID
crash.itcnchn-bb4067                  itcnchn-bb4067  running (18h)  10m ago    4w   15.2.7     docker.io/ceph/ceph:v15  2bc420ddb175  2af28c4571cf
mds.cephfs.itcnchn-bb4067.qzoshl      itcnchn-bb4067  error          10m ago    4w   <unknown>  docker.io/ceph/ceph:v15  <unknown>     <unknown>
mon.itcnchn-bb4067                    itcnchn-bb4067  error          10m ago    18h  <unknown>  docker.io/ceph/ceph:v15  <unknown>     <unknown>
rgw.ikea.dc9-1.itcnchn-bb4067.gtqedc  itcnchn-bb4067  running (18h)  10m ago    4w   15.2.7     docker.io/ceph/ceph:v15  2bc420ddb175  00d000aec32b
Docker logs from the active manager do not say much about what is
wrong:
debug 2021-01-05T09:57:52.537+0000 7fdb69691700 0 log_channel(cephadm)
log [INF] : Reconfiguring mds.cephfs.itcnchn-bb4067.qzoshl (unknown last
config time)...
debug 2021-01-05T09:57:52.541+0000 7fdb69691700 0 log_channel(cephadm)
log [INF] : Reconfiguring daemon mds.cephfs.itcnchn-bb4067.qzoshl on
itcnchn-bb4067
debug 2021-01-05T09:57:52.973+0000 7fdb64e88700 0 log_channel(cluster)
log [DBG] : pgmap v347: 241 pgs: 241 active+clean; 18 GiB data, 50 GiB
used, 52 TiB / 52 TiB avail; 18 KiB/s rd, 78 KiB/s wr, 24 op/s
debug 2021-01-05T09:57:53.085+0000 7fdb69691700 0 log_channel(cephadm)
log [INF] : Reconfiguring mon.itcnchn-bb4067 (unknown last config
time)...
debug 2021-01-05T09:57:53.085+0000 7fdb69691700 0 log_channel(cephadm)
log [INF] : Reconfiguring daemon mon.itcnchn-bb4067 on itcnchn-bb4067
debug 2021-01-05T09:57:53.625+0000 7fdb69691700 0 log_channel(cephadm)
log [INF] : Reconfiguring rgw.ikea.dc9-1.itcnchn-bb4067.gtqedc (unknown
last config time)...
debug 2021-01-05T09:57:53.629+0000 7fdb69691700 0 log_channel(cephadm)
log [INF] : Reconfiguring daemon rgw.ikea.dc9-1.itcnchn-bb4067.gtqedc on
itcnchn-bb4067
debug 2021-01-05T09:57:54.141+0000 7fdb69691700 0 log_channel(cephadm)
log [INF] : Reconfiguring crash.itcnchn-bb4067 (unknown last config
time)...
debug 2021-01-05T09:57:54.141+0000 7fdb69691700 0 log_channel(cephadm)
log [INF] : Reconfiguring daemon crash.itcnchn-bb4067 on itcnchn-bb4067
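In case it is relevant, the daemon state and logs can also be checked directly
on the host (a sketch, run on itcnchn-bb4067 itself):

# what cephadm thinks is deployed on this host, and the broken daemon's own log
cephadm ls
cephadm logs --name mon.itcnchn-bb4067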
- Karsten
Has anybody run into a 'stuck' OSD service specification? I've tried
to delete it, but it's stuck in 'deleting' state, and has been for
quite some time (even prior to the upgrade, on 15.2.x). This is on 16.2.3:
NAME          PORTS  RUNNING  REFRESHED   AGE  PLACEMENT
osd.osd_spec         504/525  <deleting>  12m  label:osd
root@ceph01:/# ceph orch rm osd.osd_spec
Removed service osd.osd_spec
From the active monitor:
debug 2021-05-06T23:14:48.909+0000 7f17d310b700 0
log_channel(cephadm) log [INF] : Remove service osd.osd_spec
Yet in `ceph orch ls` it's still there, same as above. Here's the --export on it:
root@ceph01:/# ceph orch ls osd.osd_spec --export
service_type: osd
service_id: osd_spec
service_name: osd.osd_spec
placement: {}
unmanaged: true
spec:
  filter_logic: AND
  objectstore: bluestore
We've tried --force, as well, with no luck.
To be clear, the --export even prior to delete looks nothing like the
actual service specification we're using, even after I re-apply it, so
something seems 'bugged'. Here's the OSD specification we're applying:
service_type: osd
service_id: osd_spec
placement:
  label: "osd"
data_devices:
  rotational: 1
db_devices:
  rotational: 0
db_slots: 12
I would appreciate any insight into how to clear this up (without
removing the actual OSDs; we just want to apply the updated
service specification - we used to use host placement rules and are
switching to label-based).
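For completeness, this is roughly how we re-apply the spec (a sketch; the
file name is arbitrary):

# preview what cephadm would do with the spec, then apply it
ceph orch apply -i osd_spec.yaml --dry-run
ceph orch apply -i osd_spec.yaml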
Thanks,
David
Dear cephers,
I have a strange problem. An OSD went down and recovery finished. For some reason, I have a slow ops warning for the failed OSD stuck in the system:
health: HEALTH_WARN
430 slow ops, oldest one blocked for 36 sec, osd.580 has slow ops
The OSD is auto-out:
| 580 | ceph-22 | 0 | 0 | 0 | 0 | 0 | 0 | autoout,exists |
It is probably a warning dating back to just before the failure. How can I clear it?
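In case it helps, the ops recorded on the OSD can be queried like this (a
sketch; assumes the osd.580 process is running again so its admin socket
responds):

# check whether any ops are actually still in flight / recorded as slow on the OSD
ceph daemon osd.580 ops
ceph daemon osd.580 dump_historic_slow_ops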
Thanks and best regards,
=================
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14
Hi,
I’m continuously getting scrub errors in my index pool and log pool that I always need to repair.
HEALTH_ERR 2 scrub errors; Possible data damage: 1 pg inconsistent
[ERR] OSD_SCRUB_ERRORS: 2 scrub errors
[ERR] PG_DAMAGED: Possible data damage: 1 pg inconsistent
pg 20.19 is active+clean+inconsistent, acting [39,41,37]
Why is this?
I have no clue at all, no log entry, no anything ☹
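For reference, this is roughly what gets run each time to inspect and repair
the PG (a sketch):

# list which objects/shards the scrub flagged as inconsistent
rados list-inconsistent-obj 20.19 --format=json-pretty

# repair the PG, then deep-scrub it again to confirm
ceph pg repair 20.19
ceph pg deep-scrub 20.19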