Dear all;
Up until a few hours ago, I had a seemingly normally-behaving cluster
(Quincy, 17.2.5) with 36 OSDs, evenly distributed across 3 of its 6
nodes. The cluster is only used for CephFS and the only non-standard
configuration I can think of is that I had 2 active MDSs, but only 1
standby. I had also doubled mds_cache_memory_limit to 8 GB (all OSD
hosts have 256 GB of RAM) at some point in the past.
Then I rebooted one of the OSD nodes. The rebooted node held one of the
active MDSs. Now the node is back up: ceph -s says the cluster is
healthy, but all PGs are in an active+clean+remapped state and 166.67% of
the objects are misplaced (dashboard: -66.66% healthy).
The data pool is a threefold replica with 5.4M objects; the number of
misplaced objects is reported as 27087410/16252446. The denominator in
the ratio makes sense to me (16.2M / 3 = 5.4M), but the numerator does
not. I also note that the ratio is *exactly* 5 / 3. The filesystem is
still mounted and appears to be usable, but df reports it as 100% full;
I suspect it would say 167% but that is capped somewhere.
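For what it's worth, the reported counts are at least internally consistent with that exact 5/3 ratio (just redoing the arithmetic):
$ echo $(( 16252446 / 3 ))        # objects in the pool: ~5.4M
5417482
$ echo $(( 16252446 * 5 / 3 ))    # exactly the reported misplaced count
27087410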
Any ideas about what is going on? Any suggestions for recovery?
// Best wishes; Johan
We have an internal use case where we back the storage of a proprietary
database with a shared file system. When testing the same workload against a
file system on a local block device and against CephFS, we noticed something
very odd: the amount of network IO done by CephFS is almost double the IO
done by the local file system backed by an attached block device.
We also noticed that CephFS thrashes through the page cache very quickly
compared to the amount of data being read and think that the two issues
might be related. So, I wrote a simple test.
1. I wrote 10k files of 400 KB each using dd (approx. 4 GB of data).
2. I dropped the page cache completely.
3. I then read these files serially, again using dd. The page cache usage
shot up to 39 GB for reading such a small amount of data.
Following is the code used to repro this in bash:
# write 10k files of 400 KB each (~4 GB total)
for i in $(seq 1 10000); do
    dd if=/dev/zero of=test_${i} bs=4k count=100
done

# flush dirty data and drop the page cache
sync; echo 1 > /proc/sys/vm/drop_caches

# read the files back serially
for i in $(seq 1 10000); do
    dd if=test_${i} of=/dev/null bs=4k count=100
done
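To see the cache growth while the read loop runs, watching /proc/meminfo from a second shell is enough (nothing CephFS-specific, just a crude sketch):
# sample the page cache size once a second while the reads run
while sleep 1; do
    grep '^Cached:' /proc/meminfo
done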
The ceph version being used is:
ceph version 15.2.13 (c44bc49e7a57a87d84dfff2a077a2058aa2172e2) octopus
(stable)
The Ceph configs being overridden:
WHO     MASK  LEVEL     OPTION                                  VALUE        RO
mon           advanced  auth_allow_insecure_global_id_reclaim   false
mgr           advanced  mgr/balancer/mode                       upmap
mgr           advanced  mgr/dashboard/server_addr               127.0.0.1    *
mgr           advanced  mgr/dashboard/server_port               8443         *
mgr           advanced  mgr/dashboard/ssl                       false        *
mgr           advanced  mgr/prometheus/server_addr              0.0.0.0      *
mgr           advanced  mgr/prometheus/server_port              9283         *
osd           advanced  bluestore_compression_algorithm         lz4
osd           advanced  bluestore_compression_mode              aggressive
osd           advanced  bluestore_throttle_bytes                536870912
osd           advanced  osd_max_backfills                       3
osd           advanced  osd_op_num_threads_per_shard_ssd        8            *
osd           advanced  osd_scrub_auto_repair                   true
mds           advanced  client_oc                               false
mds           advanced  client_readahead_max_bytes              4096
mds           advanced  client_readahead_max_periods            1
mds           advanced  client_readahead_min                    0
mds           basic     mds_cache_memory_limit                  21474836480
client        advanced  client_oc                               false
client        advanced  client_readahead_max_bytes              4096
client        advanced  client_readahead_max_periods            1
client        advanced  client_readahead_min                    0
client        advanced  fuse_disable_pagecache                  false
The cephfs mount options (note that readahead was disabled for this test):
/mnt/cephfs type ceph (rw,relatime,name=cephfs,secret=<hidden>,acl,rasize=0)
Any help or pointers are appreciated; this is a major performance issue for
us.
Thanks and Regards,
Ashu Pachauri
Trying to upgrade a containerized setup from 16.2.10 to 16.2.11 gave us two
big surprises, which I wanted to share in case anyone else encounters the
same. I don't see any nice solution to this apart from a new release that
fixes the performance regression, which completely breaks the container setup
in cephadm due to timeouts:
After some digging, we found that it was the "ceph-volume" command that
kept timing out, and after a ton more digging, found that it does so because
of
https://github.com/ceph/ceph/commit/bea9f4b643ce32268ad79c0fc257b25ff2f8333…
which was introduced in 16.2.11.
Unfortunately, the vital fix for this
https://github.com/ceph/ceph/commit/8d7423c3e75afbe111c91e699ef3cb1c0beee61b
was not included in 16.2.11.
So, in a setup like ours, with *many* devices, a simple "ceph-volume raw
list" now takes over 10 minutes to run (instead of 5 seconds in 16.2.10).
As a result, the service files that cephadm generates
[Service]
LimitNOFILE=1048576
LimitNPROC=1048576
EnvironmentFile=-/etc/environment
ExecStart=/bin/bash
/var/lib/ceph/5406fed0-d52b-11ec-beff-7ed30a54847b/%i/unit.run
ExecStop=-/bin/bash -c '/bin/podman stop
ceph-5406fed0-d52b-11ec-beff-7ed30a54847b-%i ; bash
/var/lib/ceph/5406fed0-d52b-11ec-beff-7ed30a54847b/%i/unit.stop'
ExecStopPost=-/bin/bash
/var/lib/ceph/5406fed0-d52b-11ec-beff-7ed30a54847b/%i/unit.poststop
KillMode=none
Restart=on-failure
RestartSec=10s
TimeoutStartSec=120
TimeoutStopSec=120
StartLimitInterval=30min
StartLimitBurst=5
ExecStartPre=-/bin/rm -f %t/%n-pid %t/%n-cid
ExecStopPost=-/bin/rm -f %t/%n-pid %t/%n-cid
Type=forking
PIDFile=%t/%n-pid
Delegate=yes
will repeatedly be marked as failed, since they now take over 2 minutes to
run. That tells systemd to restart them, and we end up in an infinite loop:
because the 5 restarts take over 50 minutes, the StartLimitInterval is never
even triggered, leaving this OSD endlessly re-listing the n^2 devices (which,
as a bonus, also fills up the root disk with an enormous amount of repeated
logging in ceph-volume.log as it tries over and over to figure out whether
each block device is a bluestore device).
And trying to fix the service or unit files manually, to at least stop this
container from being incorrectly restarted over and over, is also a dead end,
since the orchestration layer just overwrites them automatically and restarts
the services again.
I found it seemed to be
/var/lib/ceph/5406fed0-d52b-11ec-beff-7ed30a54847b/cephadm.8d0364fef6c92fc3580b0d022e32241348e6f11a7694d2b957cdafcb9d059ff2
on my system that generated these files, so I tried tweaking that to set the
necessary 1200-second TimeoutStartSec, and that finally got the darn
container to stop restarting endlessly. (I admit I'm very fuzzy on how these
services and the orchestration are triggered, as I usually don't work on our
storage stuff.)
Still though, it takes 11 minutes to start each OSD service now, so this
isn't great.
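For anyone else stuck in the same loop, a possibly less invasive stopgap might be a systemd drop-in for the template unit to raise the timeout; I have not tested whether cephadm leaves drop-ins alone, so treat this as a sketch:
# raise the start timeout for all daemon instances of this cluster fsid
mkdir -p /etc/systemd/system/ceph-5406fed0-d52b-11ec-beff-7ed30a54847b@.service.d
cat > /etc/systemd/system/ceph-5406fed0-d52b-11ec-beff-7ed30a54847b@.service.d/override.conf <<'EOF'
[Service]
TimeoutStartSec=1200
EOF
systemctl daemon-reload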
We wanted to revert back to 16.2.10, but that turns out to be a no-go as
well, since a new operation was added to bluefs in 16.2.11 by
https://github.com/ceph/ceph/pull/42750 (this isn't mentioned in the
changelogs; I had to compare the source code to see that it was in fact added
in 16.2.11). Trying to revert an OSD then fails with:
debug 2023-04-04T11:42:45.927+0000 7f2c12f6a200 -1 bluefs _replay 0x100000:
stop: unrecognized op 12
debug 2023-04-04T11:42:45.927+0000 7f2c12f6a200 -1 bluefs mount failed to
replay log: (5) Input/output error
debug 2023-04-04T11:42:45.927+0000 7f2c12f6a200 -1
bluestore(/var/lib/ceph/osd/ceph-10) _open_bluefs failed bluefs mount: (5)
Input/output error
debug 2023-04-04T11:42:45.927+0000 7f2c12f6a200 -1
bluestore(/var/lib/ceph/osd/ceph-10) _open_db failed to prepare db
environment:
debug 2023-04-04T11:42:45.927+0000 7f2c12f6a200 1 bdev(0x5590e80a0400
/var/lib/ceph/osd/ceph-10/block) close
debug 2023-04-04T11:42:46.153+0000 7f2c12f6a200 -1 osd.10 0 OSD:init:
unable to mount object store
debug 2023-04-04T11:42:46.153+0000 7f2c12f6a200 -1 ** ERROR: osd init
failed: (5) Input/output error
Ouch
Best regards, Mikael
Hi,
One of my customers has an RGW cluster with two zones in one zonegroup that
was working correctly, but since a few days ago users are not able to create
buckets and always get Access Denied. Working with existing buckets works
(like listing/putting objects into existing buckets); the only operation that
does not work is bucket creation. We also tried to create a new user, but the
behavior is the same and it cannot create a bucket either. We tried s3cmd, a
Python script with the boto library, and also the Dashboard as admin user. We
always get Access Denied. Zones are in sync.
Has anyone experienced such behavior?
Thanks in advance, here are some outputs:
$ s3cmd -c .s3cfg_python_client mb s3://test
ERROR: Access to bucket 'test' was denied
ERROR: S3 error: 403 (AccessDenied)
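For completeness, the new-user test was roughly the following (a sketch; uid, endpoint and keys are placeholders, the keys being taken from the user create output):
# create a fresh user on the master zone, then try to create a bucket with its keys
radosgw-admin user create --uid=bucket-test --display-name="bucket test"
s3cmd --host=<rgw-endpoint> --host-bucket=<rgw-endpoint> \
      --access_key=<ACCESS_KEY> --secret_key=<SECRET_KEY> \
      mb s3://test-new-user
# fails with the same 403 AccessDenied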
Zones are in-sync:
Primary cluster:
# radosgw-admin sync status
realm 5429b434-6d43-4a18-8f19-a5720a89c621 (solargis-prod)
zonegroup 00e4b3ff-1da8-4a86-9f52-4300c6d0f149 (solargis-prod-ba)
zone 6067eec6-a930-45c7-af7d-a7ef2785a2d7 (solargis-prod-ba-dc)
metadata sync no sync (zone is master)
data sync source: e84fd242-dbae-466c-b4d9-545990590995 (solargis-prod-ba-hq)
syncing
full sync: 0/128 shards
incremental sync: 128/128 shards
data is caught up with source
Secondary cluster:
# radosgw-admin sync status
realm 5429b434-6d43-4a18-8f19-a5720a89c621 (solargis-prod)
zonegroup 00e4b3ff-1da8-4a86-9f52-4300c6d0f149 (solargis-prod-ba)
zone e84fd242-dbae-466c-b4d9-545990590995 (solargis-prod-ba-hq)
metadata sync syncing
full sync: 0/64 shards
incremental sync: 64/64 shards
metadata is caught up with master
data sync source: 6067eec6-a930-45c7-af7d-a7ef2785a2d7 (solargis-prod-ba-dc)
syncing
full sync: 0/128 shards
incremental sync: 128/128 shards
data is caught up with source
--
Kamil Madac
Hi,
We have a 3-site Ceph cluster and would like to create a 4+2 EC pool
with 2 chunks per datacenter, to maximise resilience in case one
datacenter goes down. I have not found a way to create an EC profile
with this 2-level allocation strategy. I created an EC profile with
failure domain = datacenter, but it doesn't work because, I guess, it
wants to ensure there are always 5 OSDs up (so that the pool remains
R/W), whereas with failure domain = datacenter the guarantee is only 4.
My idea was to create a 2-step allocation and a failure domain=host to
achieve our desired configuration, with something like the following in
the crushmap rule:
step choose indep 3 datacenter
step chooseleaf indep x host
step emit
Is it the right approach? If yes, what should be 'x'? Would 0 work?
From what I have seen, there is no way to create such a rule with the
'ceph osd crush' commands: I have to download the current CRUSH map, edit
it, and upload the modified version. Am I right?
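In case it is useful, the manual round-trip I am referring to would look like this (a sketch; file names and the rule id are placeholders):
# dump and decompile the current CRUSH map
ceph osd getcrushmap -o crushmap.bin
crushtool -d crushmap.bin -o crushmap.txt
# edit crushmap.txt and add the rule, e.g.
#   step take default
#   step choose indep 3 type datacenter
#   step chooseleaf indep 2 type host
#   step emit
# recompile and sanity-check the mappings before injecting it
crushtool -c crushmap.txt -o crushmap.new
crushtool -i crushmap.new --test --rule <rule-id> --num-rep 6 --show-mappings | head
ceph osd setcrushmap -i crushmap.new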
Thanks in advance for your help or suggestions. Best regards,
Michel
Hello guys!
We noticed an unexpected situation: in a recently deployed Ceph cluster, we
are seeing raw usage that is a bit odd. We have a new cluster with 5 nodes
with the following setup:
- 128 GB of RAM
- 2 cpus Intel(R) Intel Xeon Silver 4210R
- 1 NVMe of 2 TB for RocksDB caching
- 5 HDDs of 14TB
- 1 dual-port 25 Gb NIC in bond mode.
Right after deploying the Ceph cluster, we see raw usage of about 9 TiB, even
though no load has been applied to the cluster yet. Have you guys seen such a
situation? Or can you guys help us understand it?
We are using Ceph Octopus, and we have set the following configurations:
```
ceph_conf_overrides:
global:
osd pool default size: 3
osd pool default min size: 1
osd pool default pg autoscale mode: "warn"
perf: true
rocksdb perf: true
mon:
mon osd down out interval: 120
osd:
bluestore min alloc size hdd: 65536
```
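For reference, the raw usage we refer to is what the standard commands report; the per-OSD view may help narrow down where it sits:
# cluster-wide raw and per-pool usage
ceph df detail
# per-OSD breakdown (size, raw use, data vs. omap/meta)
ceph osd df tree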
Any tip or help on how to explain this situation is welcome!
Hello,
We are considering CephFS as an alternative to GlusterFS, and have some
questions about performance. Is anyone able to advise us please?
This would be for file systems between 100GB and 2TB in size, average file
size around 5MB, and a mixture of reads and writes. I may not be using the
correct terminology in the Ceph world, but in my parlance a node is a Linux
server running the Ceph storage software. Multiple nodes make up the whole
Ceph storage solution. Someone correct me if I should be using different
terms!
In our normal scenario the nodes in the replicated filesystem would be
around 0.3ms apart, but we're also interested in geographically remote
nodes which would be say 20ms away. We are using third party software which
relies on a traditional Linux filesystem, so we can't use an object storage
solution directly.
So my specific questions are:
1. When reading a file from CephFS, does it read from just one node, or
from all nodes?
2. If reads are from one node then does it choose the node with the fastest
response to optimise performance, or if from all nodes then will reads be
no faster than latency to the furthest node?
3. When writing to CephFS, are all nodes written to synchronously, or is the
write sent to one node, which then replicates it to the other nodes asynchronously?
4. Can anyone give a recommendation on maximum latency between nodes to
have decent performance?
5. How does CephFS handle a node which suddenly becomes unavailable on the
network? Is the block time configurable, and how good is the healing
process after the lost node rejoins the network?
6. I have read that CephFS is more complicated to administer than
GlusterFS. What does everyone think? Are things like healing after a net
split difficult for administrators new to Ceph to handle?
Thanks very much in advance.
--
David Cunningham, Voisonics Limited
http://voisonics.com/
USA: +1 213 221 1092
New Zealand: +64 (0)28 2558 3782
Hi,
I am reading some documentation about mClock and have two questions.
First, about the IOPS: are those disk IOPS or some other kind of IOPS? And what are the assumptions behind them (block size, sequential or random reads/writes)?
And the second question: how does mClock calculate its profiles? My lab cluster is running Quincy, and I have these parameters for mClock:
"osd_mclock_max_capacity_iops_hdd": "450.000000",
"osd_mclock_profile": "balanced",
According to the documentation: https://docs.ceph.com/en/quincy/rados/configuration/mclock-config-ref/#bala… I am expecting to have:
"osd_mclock_scheduler_background_best_effort_lim": "999999",
"osd_mclock_scheduler_background_best_effort_res": "90",
"osd_mclock_scheduler_background_best_effort_wgt": "2",
"osd_mclock_scheduler_background_recovery_lim": "675",
"osd_mclock_scheduler_background_recovery_res": "180",
"osd_mclock_scheduler_background_recovery_wgt": "1",
"osd_mclock_scheduler_client_lim": "450",
"osd_mclock_scheduler_client_res": "180", "osd_mclock_scheduler_client_wgt": "1",
But what I get is:
"osd_mclock_scheduler_background_best_effort_lim": "999999",
"osd_mclock_scheduler_background_best_effort_res": "18",
"osd_mclock_scheduler_background_best_effort_wgt": "2",
"osd_mclock_scheduler_background_recovery_lim": "135",
"osd_mclock_scheduler_background_recovery_res": "36",
"osd_mclock_scheduler_background_recovery_wgt": "1",
"osd_mclock_scheduler_client_lim": "90",
"osd_mclock_scheduler_client_res": "36",
"osd_mclock_scheduler_client_wgt": "1",
These seem very low compared to what my disk should be able to handle.
Is this calculation the expected one, or did I miss something about how those profiles are populated?
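For reference, the expected numbers above do follow directly from the configured 450 IOPS, and the values I actually get are exactly one fifth of them (a quick arithmetic check; the percentages are the ones implied by the expected values):
# expected "balanced" allocations derived from osd_mclock_max_capacity_iops_hdd = 450
echo $(( 450 * 40 / 100 ))    # 180 -> expected client_res and background_recovery_res
echo $(( 450 * 100 / 100 ))   # 450 -> expected client_lim
echo $(( 450 * 150 / 100 ))   # 675 -> expected background_recovery_lim
echo $(( 450 * 20 / 100 ))    # 90  -> expected background_best_effort_res
# the values actually reported are exactly these divided by 5
echo $(( 180 / 5 )) $(( 450 / 5 )) $(( 675 / 5 )) $(( 90 / 5 ))   # 36 90 135 18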
Luis Domingues
Proton AG