Hi all,
We have a CephFS with its data_pool in erasure coding (3+2) and 1024 PGs
(Nautilus 14.2.8).
One of the PGs is partially destroyed (we lost 3 OSDs, thus 3 shards); it
has 143 unfound objects and is stuck in state
"active+recovery_unfound+undersized+degraded+remapped".
We have therefore lost some data (we are using cephfs-data-scan pg_files... to
identify the files that have data on the bad PG).
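For reference, we are invoking it roughly like this (the directory and PG id
below are placeholders for our real ones):
cephfs-data-scan pg_files /path/in/cephfs 2.1f > files_on_bad_pg.txt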
We then created a new filesystem (this time with the data pool in replica 3)
and we are copying all the data from the broken FS to the new one.
But we need to remove files from the broken FS after copying, to free space
(because there will not be enough space on the cluster otherwise). To avoid
problems with strays, we removed the snapshots on the broken FS before
deleting files.
The problem is that the MDS managing the broken FS is now "Behind on
trimming (123036/128) max_segments: 128, num_segments: 123036"
and has "1 slow metadata IOs are blocked > 30 secs, oldest blocked for
83645 secs".
The slow IO corresponds to osd.27, which is acting_primary for the broken
PG, and the broken PG has a long "snap_trimq":
"[1e0c~1,1e0e~1,1e12~1,1e16~1,1e18~1,1e1a~1,........" and
"snap_trimq_len": 460.
It therefore seems that CephFS is not able to trim the ops corresponding to the
deletion of objects and snaps that have data on the broken PG, probably
because the PG is not healthy.
Is there a way to tell Ceph/CephFS to flush or forget about (only) the lost
objects on the broken PG, and to get this PG healthy enough to perform
trimming?
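For context, the only mechanism we are aware of is marking the unfound objects
lost, along these lines (the PG id is a placeholder, and we are unsure whether
"delete" or "revert" is appropriate, or whether this is safe at all on an EC pool):
ceph pg 2.1f mark_unfound_lost delete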
thanks for your help
F.
hi there,
trying to get my head around rocksdb spillovers and how to deal with
them … in particular, i have one osd which does not have any pools
associated (as per ceph pg ls-by-osd $osd ), yet it does show up in ceph
health detail as:
osd.$osd spilled over 2.9 MiB metadata from 'db' device (49 MiB
used of 37 GiB) to slow device
compaction doesn't help (see the exact command below). i am well aware of
https://tracker.ceph.com/issues/38745 , yet find it really
counter-intuitive that an empty osd with a more-or-less optimally sized db
volume can't fit its rocksdb on that volume.
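for completeness, the compaction i tried was roughly this (with the osd stopped):
ceph-kvstore-tool bluestore-kv /var/lib/ceph/osd/ceph-$osd compact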
is there any way to repair this, apart from re-creating the osd? fwiw,
dumping the database with
ceph-kvstore-tool bluestore-kv /var/lib/ceph/osd/ceph-$osd dump >
bluestore_kv.dump
yields a file of less than 100 MB in size.
and, while we're at it, a few more related questions:
- am i right to assume that the leveldb and rocksdb arguments to
ceph-kvstore-tool are only relevant for osds with a filestore backend?
- does ceph-kvstore-tool bluestore-kv … also deal with the rocksdb items of
osds with a bluestore backend?
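to make the second question concrete: i am asking whether something like
ceph-kvstore-tool bluestore-kv /var/lib/ceph/osd/ceph-$osd list
(run with the osd stopped) operates on the very same rocksdb that the bluestore
osd itself uses.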
thank you very much & with kind regards,
thoralf.
I am new to Ceph so I hope this is not a question of me not reading the documentation well enough.
I have set up a small cluster to learn on, with three physical hosts, each with two NICs.
The cluster is up and running, but I have not figured out how to tie the OSDs to my second interface for a separate cluster network; as it is now, all communication goes through the public network.
Is it possible to define the cluster network with cephadm in some way?
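For example, is setting the option centrally (the subnet below is just an example) the intended way with cephadm, or is there a bootstrap/spec option I am missing?
ceph config set global cluster_network 192.168.10.0/24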
/Jimmy
Hi, trying to migrate a second Ceph cluster to cephadm. All the hosts migrated successfully from "legacy" except one of the OSD hosts (cephadm kept duplicating OSD ids, e.g. two "osd.5"; still not sure why). To make things easier, we re-provisioned the node (reinstalled from netinstall, applied the same SaltStack traits as the other nodes, wiped the disks) and tried to use cephadm to set up the OSDs.
So, orch correctly starts the provisioning process (a docker container running ceph-volume is created), but the provisioning never completes (docker exec):
# ps axu
root 1 0.1 0.2 99272 22488 ? Ss 15:26 0:01 /usr/libexec/platform-python -s /usr/sbin/ceph-volume lvm batch --no-auto /dev/sdb /dev/sdc --dmcrypt --yes --no-systemd
root 807 0.9 0.5 154560 44120 ? S<L 15:26 0:06 /usr/sbin/cryptsetup --key-file - --allow-discards luksOpen /dev/ceph-851cae40-3270-45ea-b788-be6e05465e92/osd-data-e3157b54-f6b9-4ec9-ab12-e289f52c00a4 Afr6Ct-ok4h-pBEy-GfFF-xxYl-EKwi-cHhjZc
# cat /var/log/ceph/ceph-volume.log
Running command: /usr/sbin/cryptsetup --batch-mode --key-file - luksFormat /dev/ceph-851cae40-3270-45ea-b788-be6e05465e92/osd-data-e3157b54-f6b9-4ec9-ab12-e289f52c00a4
Running command: /usr/sbin/cryptsetup --key-file - --allow-discards luksOpen /dev/ceph-851cae40-3270-45ea-b788-be6e05465e92/osd-data-e3157b54-f6b9-4ec9-ab12-e289f52c00a4 Afr6Ct-ok4h-pBEy-GfFF-xxYl-EKwi-cHhjZc
# docker ps
2956dec0450d ceph/ceph:v15 "/usr/sbin/ceph-volu…" 14 minutes ago Up 14 minutes condescending_nightingale
# cat osd_spec_default.yaml
service_type: osd
service_id: osd_spec_default
placement:
  host_pattern: '*'
data_devices:
  all: true
encrypted: true
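(If it matters, the spec above was applied with something like "ceph orch apply osd -i osd_spec_default.yaml".)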
It looks like cephadm hangs on luksOpen.
Is this expected (encryption is said to be supported, although there is essentially no documentation for it)?
This is again about our bad cluster with too many objects, where the HDD
OSDs have a DB device that is (much) too small (e.g. 20 GB, i.e. 3 GB
usable). Now several OSDs do not come up any more.
Typical error message:
/build/ceph-14.2.8/src/os/bluestore/BlueFS.cc: 2261: FAILED
ceph_assert(h->file->fnode.ino != 1)
I also just tried to add a few GB to the DB device (lvextend, then
ceph-bluestore-tool bluefs-bdev-expand), but this crashes as well, with the
same message.
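Roughly what I ran (the LV name is a placeholder for our real one):
lvextend -L +10G /dev/vg-db/db-$OSD
ceph-bluestore-tool bluefs-bdev-expand --path /var/lib/ceph/osd/ceph-$OSD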
Options that helped us before (thanks Wido :-) do not help here, e.g.
CEPH_ARGS="--bluestore-rocksdb-options compaction_readahead_size=0"
ceph-kvstore-tool bluestore-kv /var/lib/ceph/osd/ceph-$OSD compact
Any ideas that I could try to save these OSDs?
Cheers
Harry
Hello,
I have a live Ceph cluster, and I need to modify its bucket hierarchy. I am currently using the default CRUSH rule (i.e. keep each replica on a different host). I need to add a "chassis" level and keep replicas separated at the chassis level.
From what I read in the documentation, I would have to edit the CRUSH map manually; however, this sounds rather scary on a live cluster.
Are there any “best known methods” to achieve that goal without messing things up?
In my current scenario I have one host per chassis, and I am planning on later adding nodes where there would be more than one host per chassis. It looks like, "in theory", there wouldn't be a need for any data movement after the CRUSH map changes. Will reality match theory? Anything else I need to watch out for?
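For example, would something along these lines be the sane way to do it (bucket, host and pool names below are placeholders), rather than decompiling and editing the map by hand?
ceph osd crush add-bucket chassis1 chassis
ceph osd crush move chassis1 root=default
ceph osd crush move host1 chassis=chassis1
ceph osd crush rule create-replicated replicated_chassis default chassis
ceph osd pool set <pool> crush_rule replicated_chassis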
Thank you!
George