Hello,
In some documentation I was reading last night about laying out OSDs, it
was suggested that if more than one OSD uses the same NVMe drive, the
failure-domain should probably be set to node. However, for a small
cluster the inclination is to use EC-pools and failure-domain = OSD.
I was wondering if there is a middle ground - could we define
failure-domain = NVMe? I think the map would need to be defined
manually in the same way that failure-domain = rack requires information
about which nodes are in each rack.
Example: My latest OSD nodes have 8 HDDs and 3 U.2 NVMe drives. I'd set up
the WAL/DB for the HDD OSDs on the NVMe devices (with wasted space on the
3rd NVMe).
Across all my OSD nodes I will have 8 HDDs and either 2 or 3 NVMe
devices per node - 15 total NVMe devices. My preferred EC-pool profile
is 8+2. It seems that this profile could be safely dispersed across 15
failure domains, resulting in protection against NVMe failure.
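Something like the following is roughly what I have in mind - just a
sketch with made-up bucket/profile names, and I'd obviously test it on a
throwaway pool first:

ceph osd getcrushmap -o crushmap.bin
crushtool -d crushmap.bin -o crushmap.txt
# edit crushmap.txt: add a new bucket type (e.g. "nvme") between osd and
# host, declare one nvme bucket per physical NVMe device, and move the
# OSDs whose WAL/DB sit on that device into it under the host bucket
crushtool -c crushmap.txt -o crushmap-new.bin
ceph osd setcrushmap -i crushmap-new.bin
ceph osd erasure-code-profile set ec-8-2-nvme k=8 m=2 crush-failure-domain=nvme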
Please let me know if this is worth pursuing.
Thanks.
-Dave
--
Dave Hall
Binghamton University
kdhall(a)binghamton.edu
607-760-2328 (Cell)
607-777-4641 (Office)
Hi all,
I have a fairly pressing issue. I had a monitor fall out of quorum because
it ran out of disk space during rebalancing from switching to upmap. I
noticed that all my monitors' store.db directories had started taking up
nearly all the disk space, so I set noout, nobackfill and norecover and
shut down all the monitor daemons.
Each store.db was at:
mon.a 89GB (the one that first dropped out)
mon.b 400GB
mon.c 400GB
I tried setting mon_compact_on_start. This brought mon.a down to 1GB. Cool.
However, when I tried it on the other monitors the db size increased by
roughly 1 GB every 10 seconds, so I shut them down again.
Any idea what is going on? Or how can I shrink the db back down?
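For reference, what I mean by "setting mon_compact_on_start" is roughly
the following (the ceph.conf route, since the mons are down; the tell
variant only works while a mon is up):

# in ceph.conf on the monitor host, then start that mon
[mon]
    mon compact on start = true

# alternatively, with a mon up and running:
ceph tell mon.a compact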
Hi all,
we are observing a dramatic performance drop on our ceph file system and are wondering if this could be related to ceph fs snapshots. We are taking rotating snapshots in 2 directories and have 11 snapshots in each (ls below) as of today. We observe the performance drop with an rsync process that writes to another folder on the ceph fs, one *without* snapshots. The performance reduction is a factor of 3 or even higher.
Could this possibly be caused by the snapshots being present? Has anyone else seen something like this?
The reason we suspect snapshots is that not much else has changed on the cluster except that we started taking rolling snapshots on the 23rd of February. In addition, the kernel symbols ceph_update_snap_trace, rebuild_snap_realms and build_snap_context show up very high in a perf report. The performance reduction has been present for at least 3 days.
The ceph version is mimic 13.2.10. The kernel version of the rsync server is 3.10.0-1127.10.1.el7.x86_64.
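In case anyone wants to check for the same symbols on their own client,
something like this should reproduce the perf report (the 30 second
sampling window is just an example):

perf record -a -g -- sleep 30
perf report --sort symbol | head -n 30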
$ ls home/.snap
2021-02-23_183554+0100_weekly 2021-03-06_000611+0100_daily 2021-03-09_000611+0100_daily
2021-03-01_000911+0100_weekly 2021-03-07_000611+0100_daily 2021-03-10_000611+0100_daily
2021-03-04_000611+0100_daily 2021-03-08_000611+0100_daily 2021-03-11_000611+0100_daily
2021-03-05_000611+0100_daily 2021-03-08_000911+0100_weekly
$ ls groups/.snap
2021-02-23_183554+0100_weekly 2021-03-06_000611+0100_daily 2021-03-09_000611+0100_daily
2021-03-01_000912+0100_weekly 2021-03-07_000611+0100_daily 2021-03-10_000612+0100_daily
2021-03-04_000611+0100_daily 2021-03-08_000611+0100_daily 2021-03-11_000612+0100_daily
2021-03-05_000611+0100_daily 2021-03-08_000911+0100_weekly
Many thanks for any pointers and best regards,
=================
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14
Before I started the upgrade the cluster was healthy, but one
OSD (osd.355) was down; I can't remember if it was in or out.
Upgrade was started with
ceph orch upgrade start --image
goharbor.example.com/library/ceph/ceph:v15.2.9
The upgrade started but when Ceph tried to upgrade osd.355 it paused
with the following messages:
2021-03-11T09:15:35.638104+0000 mgr.pech-mon-2.cjeiyc [INF] Upgrade:
Target is goharbor.example.com/library/ceph/ceph:v15.2.9 with id
dfc48307963697ff48acd9dd6fda4a7a24017b9d8124f86c2a542b0802fe77ba
2021-03-11T09:15:35.639882+0000 mgr.pech-mon-2.cjeiyc [INF] Upgrade:
Checking mgr daemons...
2021-03-11T09:15:35.644170+0000 mgr.pech-mon-2.cjeiyc [INF] Upgrade:
All mgr daemons are up to date.
2021-03-11T09:15:35.644376+0000 mgr.pech-mon-2.cjeiyc [INF] Upgrade:
Checking mon daemons...
2021-03-11T09:15:35.647669+0000 mgr.pech-mon-2.cjeiyc [INF] Upgrade:
All mon daemons are up to date.
2021-03-11T09:15:35.647866+0000 mgr.pech-mon-2.cjeiyc [INF] Upgrade:
Checking crash daemons...
2021-03-11T09:15:35.652035+0000 mgr.pech-mon-2.cjeiyc [INF] Upgrade:
Setting container_image for all crash...
2021-03-11T09:15:35.653683+0000 mgr.pech-mon-2.cjeiyc [INF] Upgrade:
All crash daemons are up to date.
2021-03-11T09:15:35.653896+0000 mgr.pech-mon-2.cjeiyc [INF] Upgrade:
Checking osd daemons...
2021-03-11T09:15:36.273345+0000 mgr.pech-mon-2.cjeiyc [INF] It is
presumed safe to stop ['osd.355']
2021-03-11T09:15:36.273504+0000 mgr.pech-mon-2.cjeiyc [INF] Upgrade:
It is presumed safe to stop ['osd.355']
2021-03-11T09:15:36.273887+0000 mgr.pech-mon-2.cjeiyc [INF] Upgrade:
Redeploying osd.355
2021-03-11T09:15:36.276673+0000 mgr.pech-mon-2.cjeiyc [ERR] Upgrade:
Paused due to UPGRADE_REDEPLOY_DAEMON: Upgrading daemon osd.355 on host
pech-hd-009 failed.
One of the first things the upgrade did was to upgrade the mons, so they
were restarted, and now osd.355 no longer exists:
$ ceph osd info osd.355
Error EINVAL: osd.355 does not exist
But if I run a resume
ceph orch upgrade resume
it still tries to upgrade osd.355, same message as above.
I tried to stop and start the upgrade again with
ceph orch upgrade stop
ceph orch upgrade start --image
goharbor.example.com/library/ceph/ceph:v15.2.9
it still tries to upgrade osd.355, with the same message as above.
Looking at the source code, it looks like it gets the daemons to upgrade
from the mgr cache, so I restarted both mgrs, but it still tries to
upgrade osd.355.
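For reference, the closest thing I have found so far is force-removing
the stale daemon record, but I have not dared to run it yet and I am not
sure it is the right fix:

ceph orch ps | grep osd.355
ceph orch daemon rm osd.355 --force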
Does anyone know how I can get the upgrade to continue?
--
Kai Stian Olstad
Hi! So, after I selected the tags to add 2 NVMe SSDs I declared a
replicated n=2 pool... and for the last 30 minutes the progress shown in
the notification has been 0%, and iotop shows around 100 K/s for 2 (???)
ceph-mon processes, and that is all...
and in my service list the OSD services look somehow empty:
https://prntscr.com/10iwwbh
what did i miss?
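Happy to paste the output of the usual status checks if that helps, e.g.:

ceph -s
ceph osd tree
ceph orch ps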
Thanks a lot!
Adrian
Hi Guys,
I have been using Ceph RBD with OpenStack for some time, and I ran into a
problem while destroying a VM: OpenStack tried to delete the RBD image but
failed. I tested deleting an image with the rbd command, and it takes a
lot of time (image size 512 GB or more).
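For reference, a minimal way to reproduce the timing I am seeing would be
something like this (pool and image names are just examples):

rbd create --size 512G volumes/test-image
time rbd rm volumes/test-image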
Has anyone else met the same problem?
Thanks,
Norman
Hi! After an initially bumpy bootstrapping (IMHO the defaults should be
whatever is already defined in the user's .ssh, with custom values set
via CLI arguments), I'm now stuck adding any services/hosts/OSDs because
apparently I lack an orchestrator... and the documentation shows a big
"Page does not exist",
see
https://docs.ceph.com/en/latest/docs/octopus/mgr/orchestrator
So, what is it and what options do I have?
Setting it up seems to be as easy as:
ceph orch set backend
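If I read the non-404 parts of the docs correctly, for a cephadm-based
setup that would presumably be:

ceph mgr module enable cephadm
ceph orch set backend cephadm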
I just started with Ceph and I just want to run a Ceph service (I
cannot call it a cluster) on my desktop (with 2 dedicated OSDs) to also
get familiar with its usage.
Thanks a lot!
Adrian
> >
> > 2. If a down host comes up again and its OSDs are started, is data
> still being copied, or does ceph see that checksums(?)
>
> PG or RADOS object epoch I think. So if data hasn’t changed, the
> recovery completes without having anything to do.
>
> > are the same and just sets a pointer(?)
> back to the old location?
Yes, I mean the osdmap still having the old OSD on the node that was down.
Hmmm, I currently have PGs in 'active+remapped+backfill_wait' in a pool rbd.backup, in which I know nothing has changed, and the OSDs listed[1] are already up. Especially when you are getting to the 'end' of recovery, where osd_max_backfills=X has less effect, recovery takes longer. It would be nice to have the "no work needed" PGs be processed quickly.
[1]
[18,25,41]p18 [18,41,17]
[5,0,25]p5 [5,0,4
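For reference, a quick way to list such PGs together with their up/acting
sets should be something like:
ceph pg ls-by-pool rbd.backup backfill_wait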
Hello everyone,
We have an unpurgeable image living in the trash of one of our clusters:
# rbd --pool volumes trash ls
5afa5e5a07b8bc volume-02d959fe-a693-4acb-95e2-ca04b965389b
If we try to purge the whole trash, it says the image is being restored,
but we have never tried to restore it:
# rbd --pool volumes trash purge
Removing images: 0% complete...failed.
2021-03-10 13:58:42.849 7f78b3fc9c80 -1 librbd::api::Trash: remove:
error: image is pending restoration.
When trying to delete manually, it says there are some watchers, but
this is actually not the case:
# rbd --pool volumes trash remove 5afa5e5a07b8bc
rbd: error: image still has watchers2021-03-10 14:00:21.262 7f93ee8f8c80
-1 librbd::api::Trash: remove: error: image is pending restoration.
This means the image is still open or the client using it crashed. Try
again after closing/unmapping it or waiting 30s for the crashed client
to timeout.
Removing image:
0% complete...failed.
# rados listwatchers -p volumes rbd_header.5afa5e5a07b8bc
#
We have tried to stat the first 10 rbd_data objects and they had all
been deleted.
We know we can manually delete the omap key from rbd_trash, but we thought
it would be better to understand how an image might get into this state.
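For reference, the manual route we have in mind would be along these
lines, assuming the trash entry is keyed as id_<image id> (which we have
not verified on this cluster):
# rados -p volumes listomapkeys rbd_trash
# rados -p volumes rmomapkey rbd_trash id_5afa5e5a07b8bc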
Has anyone seen this before?
Many thanks!
Cheers,
Enrico
--
Enrico Bocchi
CERN European Laboratory for Particle Physics
IT - Storage Group - General Storage Services
Mailbox: G20500 - Office: 31-2-010
1211 Genève 23
Switzerland