March 2024 - ceph-users - lists.ceph.io

by Ml Ml

Hello, i wonder why my autobalancer is not working here: root@ceph01:~# ceph -s cluster: id: 5436dd5d-83d4-4dc8-a93b-60ab5db145df health: HEALTH_ERR 1 backfillfull osd(s) 1 full osd(s) 1 nearfull osd(s) 4 pool(s) full => osd.17 was too full (92% or something like that) root@ceph01:~# ceph osd df tree ID CLASS WEIGHT REWEIGHT SIZE ... %USE ... PGS TYPE NAME -25 209.50084 - 213 TiB ... 69.56 ... - datacenter xxx-dc-root -19 84.59369 - 86 TiB ... 56.97 ... - rack RZ1.Reihe4.R10 -3 35.49313 - 37 TiB ... 57.88 ... - host ceph02 2 hdd 1.70000 1.00000 1.7 TiB ... 58.77 ... 44 osd.2 3 hdd 1.00000 1.00000 2.7 TiB ... 22.14 ... 25 osd.3 7 hdd 2.50000 1.00000 2.7 TiB ... 58.84 ... 70 osd.7 9 hdd 9.50000 1.00000 9.5 TiB ... 63.07 ... 268 osd.9 13 hdd 2.67029 1.00000 2.7 TiB ... 53.59 ... 65 osd.13 16 hdd 2.89999 1.00000 2.7 TiB ... 59.35 ... 71 osd.16 19 hdd 1.70000 1.00000 1.7 TiB ... 48.98 ... 37 osd.19 23 hdd 2.38419 1.00000 2.4 TiB ... 59.33 ... 64 osd.23 24 hdd 1.39999 1.00000 1.7 TiB ... 51.23 ... 39 osd.24 28 hdd 3.63869 1.00000 3.6 TiB ... 64.17 ... 104 osd.28 31 hdd 2.70000 1.00000 2.7 TiB ... 64.73 ... 76 osd.31 32 hdd 3.39999 1.00000 3.3 TiB ... 67.28 ... 101 osd.32 -9 22.88817 - 23 TiB ... 56.96 ... - host ceph06 35 hdd 7.15259 1.00000 7.2 TiB ... 55.71 ... 182 osd.35 36 hdd 5.24519 1.00000 5.2 TiB ... 53.75 ... 128 osd.36 45 hdd 5.24519 1.00000 5.2 TiB ... 60.91 ... 144 osd.45 48 hdd 5.24519 1.00000 5.2 TiB ... 57.94 ... 139 osd.48 -17 26.21239 - 26 TiB ... 55.67 ... - host ceph08 37 hdd 6.67569 1.00000 6.7 TiB ... 58.17 ... 174 osd.37 40 hdd 9.53670 1.00000 9.5 TiB ... 58.54 ... 250 osd.40 46 hdd 5.00000 1.00000 5.0 TiB ... 52.39 ... 116 osd.46 47 hdd 5.00000 1.00000 5.0 TiB ... 50.05 ... 112 osd.47 -20 59.11053 - 60 TiB ... 82.47 ... - rack RZ1.Reihe4.R9 -4 23.09996 - 24 TiB ... 79.92 ... - host ceph03 5 hdd 1.70000 0.75006 1.7 TiB ... 87.24 ... 66 osd.5 6 hdd 1.70000 0.44998 1.7 TiB ... 47.30 ... 36 osd.6 10 hdd 2.70000 0.85004 2.7 TiB ... 83.23 ... 100 osd.10 15 hdd 2.70000 0.75006 2.7 TiB ... 74.26 ... 88 osd.15 17 hdd 0.50000 0.85004 1.6 TiB ... 91.44 ... 67 osd.17 20 hdd 2.00000 0.85004 1.7 TiB ... 88.41 ... 68 osd.20 21 hdd 2.79999 0.75006 2.7 TiB ... 77.25 ... 91 osd.21 25 hdd 1.70000 0.90002 1.7 TiB ... 78.31 ... 60 osd.25 26 hdd 2.70000 1.00000 2.7 TiB ... 82.75 ... 99 osd.26 27 hdd 2.70000 0.90002 2.7 TiB ... 84.26 ... 101 osd.27 63 hdd 1.89999 0.90002 1.7 TiB ... 84.15 ... 65 osd.63 -13 36.01057 - 36 TiB ... 84.12 ... - host ceph05 11 hdd 7.15259 0.90002 7.2 TiB ... 85.45 ... 273 osd.11 39 hdd 7.20000 0.85004 7.2 TiB ... 80.90 ... 257 osd.39 41 hdd 7.20000 0.75006 7.2 TiB ... 74.95 ... 239 osd.41 42 hdd 9.00000 1.00000 9.5 TiB ... 92.00 ... 392 osd.42 43 hdd 5.45799 1.00000 5.5 TiB ... 84.84 ... 207 osd.43 -21 65.79662 - 66 TiB ... 74.29 ... - rack RZ3.Reihe3.R10 -2 28.49664 - 29 TiB ... 74.79 ... - host ceph01 0 hdd 2.70000 1.00000 2.7 TiB ... 73.82 ... 88 osd.0 1 hdd 3.63869 1.00000 3.6 TiB ... 73.47 ... 121 osd.1 4 hdd 2.70000 1.00000 2.7 TiB ... 74.63 ... 89 osd.4 8 hdd 2.70000 1.00000 2.7 TiB ... 77.10 ... 92 osd.8 12 hdd 2.70000 1.00000 2.7 TiB ... 78.76 ... 94 osd.12 14 hdd 5.45799 1.00000 5.5 TiB ... 78.86 ... 193 osd.14 18 hdd 1.89999 1.00000 2.7 TiB ... 63.79 ... 76 osd.18 22 hdd 1.70000 1.00000 1.7 TiB ... 74.85 ... 57 osd.22 30 hdd 1.70000 1.00000 1.7 TiB ... 76.34 ... 59 osd.30 64 hdd 3.29999 1.00000 3.3 TiB ... 73.48 ... 110 osd.64 -11 12.39999 - 12 TiB ... 73.40 ... - host ceph04 34 hdd 5.20000 1.00000 5.2 TiB ... 72.81 ... 171 osd.34 44 hdd 7.20000 1.00000 7.2 TiB ... 73.83 ... 236 osd.44 -15 24.89998 - 25 TiB ... 74.15 ... - host ceph07 66 hdd 7.20000 1.00000 7.2 TiB ... 74.07 ... 236 osd.66 67 hdd 7.20000 1.00000 7.2 TiB ... 73.74 ... 236 osd.67 68 hdd 3.29999 1.00000 3.3 TiB ... 72.99 ... 110 osd.68 69 hdd 7.20000 1.00000 7.2 TiB ... 75.18 ... 241 osd.69 -1 0 - 0 B ... 0 ... - root default TOTAL 213 TiB ... 69.56 root@ceph01:~# ceph balancer status { "active": true, "last_optimize_duration": "0:00:00.256761", "last_optimize_started": "Mon Mar 4 10:25:10 2024", "mode": "upmap", "no_optimization_needed": true, "optimize_result": "Unable to find further optimization, or pool(s) pg_num is decreasing, or distribution is already perfect", "plans": [] } Where as reweight-by-utilization would change quite a bit!: root@ceph01:~# ceph osd test-reweight-by-utilization 110 .5 10 moved 260 / 6627 (3.92334%) avg 127.442 stddev 81.345 -> 78.169 (expected baseline 11.18) min osd.17 with 24 -> 16 pgs (0.188321 -> 0.125547 * mean) max osd.42 with 401 -> 320 pgs (3.14652 -> 2.51094 * mean) oload 110 max_change 0.5 max_change_osds 10 average_utilization 0.6956 overload_utilization 0.7652 osd.42 weight 1.0000 -> 0.7561 osd.6 weight 0.4500 -> 0.6616 osd.17 weight 0.8500 -> 0.6466 osd.20 weight 0.8500 -> 0.6688 osd.5 weight 0.7501 -> 0.6004 osd.11 weight 0.9000 -> 0.7326 osd.43 weight 1.0000 -> 0.8199 osd.27 weight 0.9000 -> 0.7430 osd.63 weight 0.9000 -> 0.7440 osd.10 weight 0.8500 -> 0.7104 no change root@ceph01:~# ceph versions { "mon": { "ceph version 17.2.7 (b12291d110049b2f35e32e0de30d70e9a4c060d2) quincy (stable)": 3 }, "mgr": { "ceph version 17.2.7 (b12291d110049b2f35e32e0de30d70e9a4c060d2) quincy (stable)": 2 }, "osd": { "ceph version 17.2.7 (b12291d110049b2f35e32e0de30d70e9a4c060d2) quincy (stable)": 52 }, "mds": { "ceph version 17.2.7 (b12291d110049b2f35e32e0de30d70e9a4c060d2) quincy (stable)": 4 }, "overall": { "ceph version 17.2.7 (b12291d110049b2f35e32e0de30d70e9a4c060d2) quincy (stable)": 61 } } Cheers, Michael

2 months, 2 weeks

5
4
0 0

Successfully using dm-cache

by Michael Lipp

Just in case anybody is interested: Using dm-cache works and boosts performance -- at least for my use case. The "challenge" was to get 100 (identical) Linux-VMs started on a three node hyperconverged cluster. The hardware is nothing special, each node has a Supermicro server board with a single CPU with 24 cores and 4 x 4 TB hard disks. And there's that extra 1 TB NVMe... I know that the general recommendation is to use the NVMe for WAL and metadata, but this didn't seem appropriate for my use case and I'm still not quite sure about failure scenarios with this configuration. So instead I made each drive a logical volume (managed by an OSD) and added 85 GiB NVMe to each LV as read-only cache. Each VM uses as system disk an RBD based on a snapshot from the master image. The idea was that with this configuration, all VMs should share most (actually almost all) of the data on their system disk and this data should be available from the cache. Well, it works. When booting the 100 VMs, almost all read operations are satisfied from the cache. So I get close to NVMe speed but have payed for conventional hard drives only (well, SSDs aren't that much more expensive nowadays, but the hardware is 4 years old). So, nothing sophisticated, but as I couldn't find anything about this kind of setup, it might be of interest nevertheless. - Michael

2 months, 2 weeks

13
25
0 0

v16.2.15 Pacific released

by Yuri Weinstein

We're happy to announce the 15th, and expected to be the last, backport release in the Pacific series. https://ceph.io/en/news/blog/2024/v16-2-15-pacific-released/ Notable Changes --------------- * `ceph config dump --format <json|xml>` output will display the localized option names instead of their normalized version. For example, "mgr/prometheus/x/server_port" will be displayed instead of "mgr/prometheus/server_port". This matches the output of the non pretty-print formatted version of the command. * CephFS: MDS evicts clients who are not advancing their request tids, which causes a large buildup of session metadata, resulting in the MDS going read-only due to the RADOS operation exceeding the size threshold. The `mds_session_metadata_threshold` config controls the maximum size that an (encoded) session metadata can grow. * RADOS: The `get_pool_is_selfmanaged_snaps_mode` C++ API has been deprecated due to its susceptibility to false negative results. Its safer replacement is `pool_is_in_selfmanaged_snaps_mode`. * RBD: When diffing against the beginning of time (`fromsnapname == NULL`) in fast-diff mode (`whole_object == true` with `fast-diff` image feature enabled and valid), diff-iterate is now guaranteed to execute locally if exclusive lock is available. This brings a dramatic performance improvement for QEMU live disk synchronization and backup use cases. Getting Ceph ------------ * Git at git://github.com/ceph/ceph.git * Tarball at https://download.ceph.com/tarballs/ceph-16.2.15.tar.gz * Containers at https://quay.io/repository/ceph/ceph * For packages, see https://docs.ceph.com/en/latest/install/get-packages/ * Release git sha1: 618f440892089921c3e944a991122ddc44e60516

2 months, 2 weeks

2
1
0 0

[Quincy] cannot configure dashboard to listen on all ports

by wodel youchi

Hi, ceph dashboard fails to listen on all IPs. log_channel(cluster) log [ERR] : Unhandled exception from module 'dashboard' while running on mgr.controllera: OSError("No socket could be created -- (('0.0.0.0', 8443): [Errno -2] Name or service not known) -- (('::', 8443, 0, 0): ceph version 17.2.7 quincy (stable) Regards.

2 months, 2 weeks

1
0
0 0

ceph-crash NOT reporting crashes due to wrong permissions on /var/lib/ceph/crash/posted (Debian / Ubuntu packages)

by Christian Rohmann

Hey ceph-users, I just noticed issues with ceph-crash using the Debian /Ubuntu packages (package: ceph-base): While the /var/lib/ceph/crash/posted folder is created by the package install, it's not properly chowned to ceph:ceph by the postinst script. This might also affect RPM based installs somehow, but I did not look into that. I opened a bug report with all the details and two ideas to fix this: https://tracker.ceph.com/issues/64548 The wrong ownership causes ceph-crash to NOT work at all. I myself missed quite a few crash reports. All of them were just sitting around on the machines, but were reported right after I did chown ceph:ceph /var/lib/ceph/crash/posted systemctl restart ceph-crash.service You might want to check if you might be affected as well. Failing to post crashes to the local cluster results in them not being reported back via telemetry. Regards Christian

2 months, 2 weeks

3
3
1 0

Ceph orch doesn't execute commands and doesn't report correct status of daemons

by wodel youchi

Hi, I have finished the conversion from ceph-ansible to cephadm yesterday. Everything seemed to be working until this morning, I wanted to redeploy rgw service to specify the network to be used. So I deleted the rgw services with ceph orch rm, then I prepared a yml file with the new conf. I applied the file and the new rgw service was started but it was launched with an external image, so I wanted to redeploy using my local image so I did a redeploy ... and then nothing happened, I get the rescheduled message but nothing happened, then I restarted one of the controllers, the orchestrator doesn't seem to be aware that some service have restarted??? PS : I don't fully master the cephadm command line and use. Regards.

2 months, 2 weeks

2
3
0 0

[Quincy] NFS ingress mode haproxy-protocol not recognized

by wodel youchi

Hi; I tried to create an NFS cluster using this command : [root@controllera ceph]# ceph nfs cluster create mynfs "3 controllera controllerb controllerc" --ingress --virtual_ip 20.1.0.201 --ingress-mode haproxy-protocol Invalid command: haproxy-protocol not in default|keepalive-only And I got this error : Invalid command haproxy-protocol I am using Quincy : ceph version 17.2.7 (...) quincy (stable) Is it not supported yet? Regards.

2 months, 2 weeks

2
1
0 0

Question about erasure coding on cephfs

by Erich Weiler

Hi Y'all, We have a new ceph cluster online that looks like this: md-01 : monitor, manager, mds md-02 : monitor, manager, mds md-03 : monitor, manager store-01 : twenty 30TB NVMe OSDs store-02 : twenty 30TB NVMe OSDs The cephfs storage is using erasure coding at 4:2. The crush domain is set to "osd". (I know that's not optimal but let me get to that in a minute) We have a current regular single NFS server (nfs-01) with the same storage as the OSD servers above (twenty 30TB NVME disks). We want to wipe the NFS server and integrate it into the above ceph cluster as "store-03". When we do that, we would then have three OSD servers. We would then switch the crush domain to "host". My question is this: Given that we have 4:2 erasure coding, would the data rebalance evenly across the three OSD servers after we add store-03 such that if a single OSD server went down, the other two would be enough to keep the system online? Like, with 4:2 erasure coding, would 2 shards go on store-01, then 2 shards on store-02, and then 2 shards on store-03? Is that how I understand it? Thanks for any insight! -erich

2 months, 2 weeks

3
2
0 0

Re: has anyone enabled bdev_enable_discard?

by jsterr＠deckplosion.de

Is there any update on this? Did someone test the option and has performance values before and after? Is there any good documentation regarding this option?

2 months, 2 weeks

7
7
0 0

Re: Scrub stuck and 'pg has invalid (post-split) stat'

by Eugen Block

Please don't drop the list from your response. The first question coming to mind is, why do you have a cache-tier if all your pools are on nvme decices anyway? I don't see any benefit here. Did you try the suggested workaround and disable the cache-tier? Zitat von Cedric <yipikai7(a)gmail.com>: > Thanks Eugen, see attached infos. > > Some more details: > > - commands that actually hangs: ceph balancer status ; rbd -p vms ls ; > rados -p vms_cache cache-flush-evict-all > - all scrub running on vms_caches pgs are stall / start in a loop > without actually doing anything > - all io are 0 both from ceph status or iostat on nodes > > On Tue, Feb 20, 2024 at 10:00 AM Eugen Block <eblock(a)nde.ag> wrote: >> >> Hi, >> >> some more details would be helpful, for example what's the pool size >> of the cache pool? Did you issue a PG split before or during the >> upgrade? This thread [1] deals with the same problem, the described >> workaround was to set hit_set_count to 0 and disable the cache layer >> until that is resolved. Afterwards you could enable the cache layer >> again. But keep in mind that the code for cache tier is entirely >> removed in Reef (IIRC). >> >> Regards, >> Eugen >> >> [1] >> https://ceph-users.ceph.narkive.com/zChyOq5D/ceph-strange-issue-after-addin… >> >> Zitat von Cedric <yipikai7(a)gmail.com>: >> >> > Hello, >> > >> > Following an upgrade from Nautilus (14.2.22) to Pacific (16.2.13), we >> > encounter an issue with a cache pool becoming completely stuck, >> > relevant messages below: >> > >> > pg xx.x has invalid (post-split) stats; must scrub before tier agent >> > can activate >> > >> > In OSD logs, scrubs are starting in a loop without succeeding for all >> > pg of this pool. >> > >> > What we already tried without luck so far: >> > >> > - shutdown / restart OSD >> > - rebalance pg between OSD >> > - raise the memory on OSD >> > - repeer PG >> > >> > Any idea what is causing this? any help will be greatly appreciated >> > >> > Thanks >> > >> > Cédric >> > _______________________________________________ >> > ceph-users mailing list -- ceph-users(a)ceph.io >> > To unsubscribe send an email to ceph-users-leave(a)ceph.io >> >> >> _______________________________________________ >> ceph-users mailing list -- ceph-users(a)ceph.io >> To unsubscribe send an email to ceph-users-leave(a)ceph.io

2 months, 2 weeks

3
5
1 0

2024

2023

2022

2021

2020

2019

ceph-users March 2024