Hello list,
first of all: yes, I made mistakes, and now I am trying to recover :-/
I had a healthy 3-node cluster which I wanted to convert to a single one.
My goal was to reinstall a fresh 3-node cluster and start with 2 nodes.
I managed to turn it from a healthy 3-node cluster into a healthy 2-node cluster.
Then the problems began.
I changed the pool to size=1 and min_size=1.
Health was okay until that point. Then, all of a sudden, both nodes got
fenced: one node refused to boot, mons were missing, etc. To make a
long story short, here is where I am right now:
root@node03:~ # ceph -s
cluster b3be313f-d0ef-42d5-80c8-6b41380a47e3
health HEALTH_WARN
53 pgs stale
53 pgs stuck stale
monmap e4: 2 mons at {0=10.15.15.3:6789/0,1=10.15.15.2:6789/0}
election epoch 298, quorum 0,1 1,0
osdmap e6097: 14 osds: 9 up, 9 in
pgmap v93644673: 512 pgs, 1 pools, 1193 GB data, 304 kobjects
1088 GB used, 32277 GB / 33366 GB avail
459 active+clean
53 stale+active+clean
root@node03:~ # ceph osd tree
ID WEIGHT TYPE NAME UP/DOWN REWEIGHT PRIMARY-AFFINITY
-1 32.56990 root default
-2 25.35992 host node03
0 3.57999 osd.0 up 1.00000 1.00000
5 3.62999 osd.5 up 1.00000 1.00000
6 3.62999 osd.6 up 1.00000 1.00000
7 3.62999 osd.7 up 1.00000 1.00000
8 3.62999 osd.8 up 1.00000 1.00000
19 3.62999 osd.19 up 1.00000 1.00000
20 3.62999 osd.20 up 1.00000 1.00000
-3 7.20998 host node02
3 3.62999 osd.3 up 1.00000 1.00000
4 3.57999 osd.4 up 1.00000 1.00000
1 0 osd.1 down 0 1.00000
9 0 osd.9 down 0 1.00000
10 0 osd.10 down 0 1.00000
17 0 osd.17 down 0 1.00000
18 0 osd.18 down 0 1.00000
My main mistakes seem to have been running, in this order:
--------------------------------
ceph osd out osd.1
ceph auth del osd.1
systemctl stop ceph-osd@1
ceph osd rm 1
umount /var/lib/ceph/osd/ceph-1
ceph osd crush remove osd.1
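For comparison, the commonly recommended removal order (a sketch, not cluster-specific advice) drains the OSD and removes it from CRUSH *before* deleting its auth key, so the cluster can migrate the data away first. With size=1 the PGs on osd.1 held their only copy, so draining would have been the only safe path:

```shell
ceph osd out osd.1           # stop placing data, start draining the OSD
# ... wait until all PGs are active+clean again ...
systemctl stop ceph-osd@1
ceph osd crush remove osd.1
ceph auth del osd.1
ceph osd rm 1
umount /var/lib/ceph/osd/ceph-1
```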
As far as I can tell, Ceph still waits for and needs data from osd.1
(which I removed):
root@node03:~ # ceph health detail
HEALTH_WARN 53 pgs stale; 53 pgs stuck stale
pg 0.1a6 is stuck stale for 5086.552795, current state
stale+active+clean, last acting [1]
pg 0.142 is stuck stale for 5086.552784, current state
stale+active+clean, last acting [1]
pg 0.1e is stuck stale for 5086.552820, current state
stale+active+clean, last acting [1]
pg 0.e0 is stuck stale for 5086.552855, current state
stale+active+clean, last acting [1]
pg 0.1d is stuck stale for 5086.552822, current state
stale+active+clean, last acting [1]
pg 0.13c is stuck stale for 5086.552791, current state
stale+active+clean, last acting [1]
[...] SNIP [...]
pg 0.e9 is stuck stale for 5086.552955, current state
stale+active+clean, last acting [1]
pg 0.87 is stuck stale for 5086.552939, current state
stale+active+clean, last acting [1]
When I try to start OSD.1 manually, I get:
--------------------------------------------
2020-02-10 18:48:26.107444 7f9ce31dd880 0 ceph version 0.94.10
(b1e0532418e4631af01acbc0cedd426f1905f4af), process ceph-osd, pid
10210
2020-02-10 18:48:26.134417 7f9ce31dd880 0
filestore(/var/lib/ceph/osd/ceph-1) backend xfs (magic 0x58465342)
2020-02-10 18:48:26.184202 7f9ce31dd880 0
genericfilestorebackend(/var/lib/ceph/osd/ceph-1) detect_features:
FIEMAP ioctl is supported and appears to work
2020-02-10 18:48:26.184209 7f9ce31dd880 0
genericfilestorebackend(/var/lib/ceph/osd/ceph-1) detect_features:
FIEMAP ioctl is disabled via 'filestore fiemap' config option
2020-02-10 18:48:26.184526 7f9ce31dd880 0
genericfilestorebackend(/var/lib/ceph/osd/ceph-1) detect_features:
syncfs(2) syscall fully supported (by glibc and kernel)
2020-02-10 18:48:26.184585 7f9ce31dd880 0
xfsfilestorebackend(/var/lib/ceph/osd/ceph-1) detect_feature: extsize
is disabled by conf
2020-02-10 18:48:26.309755 7f9ce31dd880 0
filestore(/var/lib/ceph/osd/ceph-1) mount: enabling WRITEAHEAD journal
mode: checkpoint is not enabled
2020-02-10 18:48:26.633926 7f9ce31dd880 1 journal _open
/var/lib/ceph/osd/ceph-1/journal fd 20: 5367660544 bytes, block size
4096 bytes, directio = 1, aio = 1
2020-02-10 18:48:26.642185 7f9ce31dd880 1 journal _open
/var/lib/ceph/osd/ceph-1/journal fd 20: 5367660544 bytes, block size
4096 bytes, directio = 1, aio = 1
2020-02-10 18:48:26.664273 7f9ce31dd880 0 <cls>
cls/hello/cls_hello.cc:271: loading cls_hello
2020-02-10 18:48:26.732154 7f9ce31dd880 0 osd.1 6002 crush map has
features 1107558400, adjusting msgr requires for clients
2020-02-10 18:48:26.732163 7f9ce31dd880 0 osd.1 6002 crush map has
features 1107558400 was 8705, adjusting msgr requires for mons
2020-02-10 18:48:26.732167 7f9ce31dd880 0 osd.1 6002 crush map has
features 1107558400, adjusting msgr requires for osds
2020-02-10 18:48:26.732179 7f9ce31dd880 0 osd.1 6002 load_pgs
2020-02-10 18:48:31.939810 7f9ce31dd880 0 osd.1 6002 load_pgs opened 53 pgs
2020-02-10 18:48:31.940546 7f9ce31dd880 -1 osd.1 6002 log_to_monitors
{default=true}
2020-02-10 18:48:31.942471 7f9ce31dd880 1 journal close
/var/lib/ceph/osd/ceph-1/journal
2020-02-10 18:48:31.969205 7f9ce31dd880 -1 ** ERROR: osd
init failed: (1) Operation not permitted
It's mounted:
/dev/sdg1 3.7T 127G 3.6T 4% /var/lib/ceph/osd/ceph-1
Is there any way I can get osd.1 back in?
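For reference, one possible recovery path (a sketch only, not verified against this cluster): "osd init failed: (1) Operation not permitted" is typically cephx refusing the daemon because its key was deleted with `ceph auth del osd.1`. Since the keyring is still on the intact, mounted data disk, re-registering it and recreating the id removed by `ceph osd rm 1` may let the OSD rejoin; the CRUSH weight below is illustrative, copied from the peer OSDs on node02:

```shell
ceph osd create                    # should hand back the lowest free id, i.e. 1
ceph auth add osd.1 osd 'allow *' mon 'allow profile osd' \
    -i /var/lib/ceph/osd/ceph-1/keyring
ceph osd crush add osd.1 3.63 host=node02   # weight is illustrative
systemctl start ceph-osd@1
```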
Thanks a lot,
mario
Dear people on this mailing list,
I've got the "problem" that our MAX AVAIL value increases by about
5-10 TB when I reboot a whole OSD node. After the reboot the value
goes back to normal.
I would love to know WHY.
Under normal circumstances I would ignore this behavior, but because I
am very new to Ceph I would like to understand why things like this
happen.
What I have read is that this value is calculated from the most-filled OSD.
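That matches the usual explanation. The idea can be sketched as follows (a simplified model, not Ceph's exact code; the numbers are made up): MAX AVAIL is a projection of how much the pool can still grow before the most-constrained OSD fills up. While a node's OSDs are down during a reboot their stats are missing from the calculation, so if one of them was the fullest, the projection jumps up until they report back.

```python
def max_avail(osds, replicas):
    """Projected pool MAX AVAIL for a replicated pool (simplified).

    osds: list of (free_tb, crush_weight) for the OSDs backing the pool.
    Data lands on each OSD in proportion to its weight share, so the
    pool can only grow until the most constrained OSD fills up.
    """
    total_weight = sum(w for _, w in osds)
    projected_raw = min(free * total_weight / w for free, w in osds if w > 0)
    return projected_raw / replicas

# Three equal-weight OSDs, one much fuller than the rest:
cluster = [(10.0, 1.0), (10.0, 1.0), (2.0, 1.0)]
print(max_avail(cluster, replicas=3))   # 2.0: bounded by the fullest OSD

# While that OSD's node is rebooting and its stats are missing,
# the projection is computed from the remaining OSDs and rises:
print(max_avail(cluster[:2], replicas=3))
```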
I've set noout and norebalance while the node is offline and I unset
both values after the reboot.
We are currently on nautilus.
Cheers and thanks in advance
Boris
This means it has been applied:
# ceph osd dump -f json | jq .require_osd_release
"nautilus"
-- dan
On Mon, Feb 17, 2020 at 11:10 AM Marc Roos <M.Roos(a)f1-outsourcing.eu> wrote:
>
>
> How do you check if you issued this command in the past?
>
>
> -----Original Message-----
> To: ceph-users(a)ceph.io
> Subject: [ceph-users] Re: Excessive write load on mons after upgrade
> from 12.2.13 -> 14.2.7
>
> Hi Peter,
>
> could be a totally different problem but did you run the command "ceph
> osd require-osd-release nautilus" after the upgrade?
> We had poor performance after upgrading to nautilus and running this
> command fixed it. The same was reported by others for previous updates.
> Here is my original message regarding this issue:
> https://lists.ceph.io/hyperkitty/list/ceph-users@ceph.io/message/OYFRWSJXPV…
>
> We did not observe the master election problem though.
>
> _______________________________________________
> ceph-users mailing list -- ceph-users(a)ceph.io
> To unsubscribe send an email to ceph-users-leave(a)ceph.io
On Wed, May 27, 2020 at 10:09 PM Dylan McCulloch <dmc(a)unimelb.edu.au> wrote:
>
> Hi all,
>
> The single active MDS on one of our Ceph clusters is close to running out of RAM.
>
> MDS total system RAM = 528GB
> MDS current free system RAM = 4GB
> mds_cache_memory_limit = 451GB
> current mds cache usage = 426GB
This mds_cache_memory_limit is way too high for the available RAM. We
normally recommend that your RAM be 150% of your cache limit but we
lack data for such large cache sizes.
> Presumably we need to reduce our mds_cache_memory_limit and/or mds_max_caps_per_client, but would like some guidance on whether it’s possible to do that safely on a live production cluster when the MDS is already pretty close to running out of RAM.
>
> Cluster is Luminous - 12.2.12
> Running single active MDS with two standby.
> 890 clients
> Mix of kernel client (4.19.86) and ceph-fuse.
> Clients are 12.2.12 (398) and 12.2.13 (3)
v12.2.12 has the changes necessary to throttle MDS cache size
reduction. You should be able to reduce mds_cache_memory_limit to any
lower value without destabilizing the cluster.
> The kernel clients have stayed under “mds_max_caps_per_client”: “1048576". But the ceph-fuse clients appear to hold very large numbers according to the ceph-fuse asok.
> e.g.
> “num_caps”: 1007144398,
> “num_caps”: 1150184586,
> “num_caps”: 1502231153,
> “num_caps”: 1714655840,
> “num_caps”: 2022826512,
This data from the ceph-fuse asok is actually the number of caps ever
received, not the current number. I've created a ticket for this:
https://tracker.ceph.com/issues/45749
Look at the data from `ceph tell mds.foo session ls` instead.
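For example, a small sketch of pulling per-session cap counts out of that JSON (the sample is hand-made and trimmed to the fields used here; real session entries carry many more):

```python
import json

# Trimmed, hand-made sample of `ceph tell mds.<name> session ls` output;
# only "id" and "num_caps" are kept from the real schema.
session_ls = json.loads("""
[
  {"id": 4101, "num_caps": 1048576},
  {"id": 4102, "num_caps": 1200000},
  {"id": 4103, "num_caps": 900}
]
""")

def top_cap_holders(sessions, limit=3):
    """Sessions holding the most caps, largest first."""
    return sorted(sessions, key=lambda s: s["num_caps"], reverse=True)[:limit]

for s in top_cap_holders(session_ls):
    print("client.%d holds %d caps" % (s["id"], s["num_caps"]))
```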
> Dropping caches on the clients appears to reduce their cap usage but does not free up RAM on the MDS.
The MDS won't free up RAM until the cache memory limit is reached.
> What is the safest method to free cache and reduce RAM usage on the MDS in this situation (without having to evict or remount clients)?
reduce mds_cache_memory_limit
> I’m concerned that reducing mds_cache_memory_limit even in very small increments may trigger a large recall of caps and overwhelm the MDS.
That used to be the case in older versions of Luminous but not any longer.
--
Patrick Donnelly, Ph.D.
He / Him / His
Senior Software Engineer
Red Hat Sunnyvale, CA
GPG: 19F28A586F808C2402351B93C3301A3E258DD79D
Hi there,
I am trying to get my head around RocksDB spillovers and how to deal
with them. In particular, I have one OSD which does not have any pools
associated with it (as per ceph pg ls-by-osd $osd), yet it does show up
in ceph health detail as:
osd.$osd spilled over 2.9 MiB metadata from 'db' device (49 MiB
used of 37 GiB) to slow device
Compaction doesn't help. I am well aware of
https://tracker.ceph.com/issues/38745 , yet I find it really
counter-intuitive that an empty OSD with a more-or-less optimally sized
DB volume can't fit its RocksDB on the former.
Is there any way to repair this, apart from re-creating the OSD? FWIW,
dumping the database with
ceph-kvstore-tool bluestore-kv /var/lib/ceph/osd/ceph-$osd dump >
bluestore_kv.dump
yields a file of less than 100 MB in size.
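In case it helps with cross-checking: the BlueFS usage the health warning is based on can also be read from the OSD's admin socket (a sketch; the exact set of socket commands varies a little between releases):

```shell
ceph daemon osd.$osd bluefs stats        # per-device (db/wal/slow) usage
ceph daemon osd.$osd perf dump bluefs    # db_used_bytes, slow_used_bytes, ...
ceph daemon osd.$osd compact             # trigger an online compaction
```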
And, while we're at it, a few more related questions:
- Am I right to assume that the leveldb and rocksdb arguments to
ceph-kvstore-tool are only relevant for OSDs with a FileStore backend?
- Does ceph-kvstore-tool bluestore-kv … also deal with RocksDB items for
OSDs with a BlueStore backend?
thank you very much & with kind regards,
thoralf.
Hi, I am trying to migrate a second Ceph cluster to cephadm. All the hosts migrated successfully from "legacy" except one of the OSD hosts (cephadm kept duplicating OSD ids, e.g. two "osd.5"; still not sure why). To make things easier, we re-provisioned the node (reinstalled from netinstall, applied the same SaltStack traits as the other nodes, wiped the disks) and tried to use cephadm to set up the OSDs.
So, orch correctly starts the provisioning process (a docker container running ceph-volume is created), but the provisioning never completes (docker exec):
# ps axu
root 1 0.1 0.2 99272 22488 ? Ss 15:26 0:01 /usr/libexec/platform-python -s /usr/sbin/ceph-volume lvm batch --no-auto /dev/sdb /dev/sdc --dmcrypt --yes --no-systemd
root 807 0.9 0.5 154560 44120 ? S<L 15:26 0:06 /usr/sbin/cryptsetup --key-file - --allow-discards luksOpen /dev/ceph-851cae40-3270-45ea-b788-be6e05465e92/osd-data-e3157b54-f6b9-4ec9-ab12-e289f52c00a4 Afr6Ct-ok4h-pBEy-GfFF-xxYl-EKwi-cHhjZc
# cat /var/log/ceph/ceph-volume.log
Running command: /usr/sbin/cryptsetup --batch-mode --key-file - luksFormat /dev/ceph-851cae40-3270-45ea-b788-be6e05465e92/osd-data-e3157b54-f6b9-4ec9-ab12-e289f52c00a4
Running command: /usr/sbin/cryptsetup --key-file - --allow-discards luksOpen /dev/ceph-851cae40-3270-45ea-b788-be6e05465e92/osd-data-e3157b54-f6b9-4ec9-ab12-e289f52c00a4 Afr6Ct-ok4h-pBEy-GfFF-xxYl-EKwi-cHhjZc
# docker ps
2956dec0450d ceph/ceph:v15 "/usr/sbin/ceph-volu…" 14 minutes ago Up 14 minutes condescending_nightingale
# cat osd_spec_default.yaml
service_type: osd
service_id: osd_spec_default
placement:
  host_pattern: '*'
data_devices:
  all: true
encrypted: true
It looks like cephadm hangs on luksOpen.
Is this expected? Encryption is mentioned as supported, although there seems to be no documentation for it.
This is again about our bad cluster with too many objects, where the
HDD OSDs have a DB device that is (much) too small (e.g. 20 GB, i.e.
3 GB usable). Now several OSDs do not come up any more.
Typical error message:
/build/ceph-14.2.8/src/os/bluestore/BlueFS.cc: 2261: FAILED
ceph_assert(h->file->fnode.ino != 1)
I also just tried to add a few GB to the DB device (lvextend,
ceph-bluestore-tool bluefs-bdev-expand), but that crashes too, with the
same message.
Options that helped us before (thanks Wido :-) do not help here, e.g.
CEPH_ARGS="--bluestore-rocksdb-options compaction_readahead_size=0"
ceph-kvstore-tool bluestore-kv /var/lib/ceph/osd/ceph-$OSD compact
Any ideas that I could try to save these OSDs?
Cheers
Harry
Hi,
I have 15.2.1 installed on all machines. On the primary machine I executed the ceph upgrade command:
$ ceph orch upgrade start --ceph-version 15.2.2
When I check ceph -s I see this:
progress:
Upgrade to docker.io/ceph/ceph:v15.2.2 (30m)
[=...........................] (remaining: 8h)
It says 8 hours, but it has already been running for 3 hours and no
daemon has been upgraded; it is stuck at this point.
Is there any way to find out why it is stuck?
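For anyone hitting the same thing, a few commands that should show what the orchestrator is doing (a sketch for Octopus; daemon names will differ per cluster):

```shell
ceph orch upgrade status     # target image and current upgrade state
ceph health detail           # upgrade problems surface as health warnings
ceph -W cephadm              # follow the cephadm module's log live
ceph orch ps                 # which daemons are running which version
```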
Thanks,
Gencer.