Dear all;
Up until a few hours ago, I had a seemingly normally-behaving cluster
(Quincy, 17.2.5) with 36 OSDs, evenly distributed across 3 of its 6
nodes. The cluster is only used for CephFS and the only non-standard
configuration I can think of is that I had 2 active MDSs, but only 1
standby. I had also doubled mds_cache_memory_limit to 8 GB (all OSD
hosts have 256 GB of RAM) at some point in the past.
Then I rebooted one of the OSD nodes. The rebooted node held one of the
active MDSs. Now the node is back up: ceph -s says the cluster is
healthy, but all PGs are in an active+clean+remapped state and 166.67% of
the objects are misplaced (dashboard: -66.66% healthy).
The data pool is a threefold replica with 5.4M objects; the number of
misplaced objects is reported as 27087410/16252446. The denominator in
the ratio makes sense to me (16.2M / 3 = 5.4M), but the numerator does
not. I also note that the ratio is *exactly* 5 / 3. The filesystem is
still mounted and appears to be usable, but df reports it as 100% full;
I suspect it would say 167% but that is capped somewhere.
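For what it's worth, the reported counts are at least internally consistent with that exact 5/3 ratio (just redoing the arithmetic):
$ echo $(( 16252446 / 3 ))        # objects in the pool: ~5.4M
5417482
$ echo $(( 16252446 * 5 / 3 ))    # exactly the reported misplaced count
27087410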
Any ideas about what is going on? Any suggestions for recovery?
// Best wishes; Johan
We have an internal use case where we back the storage of a proprietary
database with a shared file system. When testing the same workload against a
file system on a local block device and against CephFS, we noticed something
very odd: the amount of network IO done by CephFS is almost double the IO
done by the local file system backed by an attached block device.
We also noticed that CephFS thrashes through the page cache very quickly
compared to the amount of data being read and think that the two issues
might be related. So, I wrote a simple test.
1. I wrote 10k files of 400 KB each using dd (approx. 4 GB of data).
2. I dropped the page cache completely.
3. I then read these files serially, again using dd. The page cache usage
shot up to 39 GB for reading such a small amount of data.
Following is the code used to repro this in bash:
# write 10k files of 400 KB each (~4 GB total)
for i in $(seq 1 10000); do
    dd if=/dev/zero of=test_${i} bs=4k count=100
done

# flush dirty data and drop the page cache
sync; echo 1 > /proc/sys/vm/drop_caches

# read the files back serially
for i in $(seq 1 10000); do
    dd if=test_${i} of=/dev/null bs=4k count=100
done
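To see the cache growth while the read loop runs, watching /proc/meminfo from a second shell is enough (nothing CephFS-specific, just a crude sketch):
# sample the page cache size once a second while the reads run
while sleep 1; do
    grep '^Cached:' /proc/meminfo
done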
The ceph version being used is:
ceph version 15.2.13 (c44bc49e7a57a87d84dfff2a077a2058aa2172e2) octopus
(stable)
The Ceph configs being overridden:
WHO     MASK  LEVEL     OPTION                                  VALUE        RO
mon           advanced  auth_allow_insecure_global_id_reclaim   false
mgr           advanced  mgr/balancer/mode                       upmap
mgr           advanced  mgr/dashboard/server_addr               127.0.0.1    *
mgr           advanced  mgr/dashboard/server_port               8443         *
mgr           advanced  mgr/dashboard/ssl                       false        *
mgr           advanced  mgr/prometheus/server_addr              0.0.0.0      *
mgr           advanced  mgr/prometheus/server_port              9283         *
osd           advanced  bluestore_compression_algorithm         lz4
osd           advanced  bluestore_compression_mode              aggressive
osd           advanced  bluestore_throttle_bytes                536870912
osd           advanced  osd_max_backfills                       3
osd           advanced  osd_op_num_threads_per_shard_ssd        8            *
osd           advanced  osd_scrub_auto_repair                   true
mds           advanced  client_oc                               false
mds           advanced  client_readahead_max_bytes              4096
mds           advanced  client_readahead_max_periods            1
mds           advanced  client_readahead_min                    0
mds           basic     mds_cache_memory_limit                  21474836480
client        advanced  client_oc                               false
client        advanced  client_readahead_max_bytes              4096
client        advanced  client_readahead_max_periods            1
client        advanced  client_readahead_min                    0
client        advanced  fuse_disable_pagecache                  false
The cephfs mount options (note that readahead was disabled for this test):
/mnt/cephfs type ceph (rw,relatime,name=cephfs,secret=<hidden>,acl,rasize=0)
Any help or pointers are appreciated; this is a major performance issue for
us.
Thanks and Regards,
Ashu Pachauri
Trying to upgrade a containerized setup from 16.2.10 to 16.2.11 gave us two
big surprises, which I wanted to share in case anyone else encounters the
same. I don't see any nice solution to this apart from a new release that
fixes the performance regression, which completely breaks the container setup
in cephadm due to timeouts:
After some digging, we found that it was the "ceph-volume" command that
kept timing out, and after a ton more digging, found that it does so because
of
https://github.com/ceph/ceph/commit/bea9f4b643ce32268ad79c0fc257b25ff2f8333…
which was introduced in 16.2.11.
Unfortunately, the vital fix for this
https://github.com/ceph/ceph/commit/8d7423c3e75afbe111c91e699ef3cb1c0beee61b
was not included in 16.2.11.
So, in a setup like ours, with *many* devices, a simple "ceph-volume raw
list" now takes over 10 minutes to run (instead of 5 seconds in 16.2.10).
As a result, the service files that cephadm generates
[Service]
LimitNOFILE=1048576
LimitNPROC=1048576
EnvironmentFile=-/etc/environment
ExecStart=/bin/bash
/var/lib/ceph/5406fed0-d52b-11ec-beff-7ed30a54847b/%i/unit.run
ExecStop=-/bin/bash -c '/bin/podman stop
ceph-5406fed0-d52b-11ec-beff-7ed30a54847b-%i ; bash
/var/lib/ceph/5406fed0-d52b-11ec-beff-7ed30a54847b/%i/unit.stop'
ExecStopPost=-/bin/bash
/var/lib/ceph/5406fed0-d52b-11ec-beff-7ed30a54847b/%i/unit.poststop
KillMode=none
Restart=on-failure
RestartSec=10s
TimeoutStartSec=120
TimeoutStopSec=120
StartLimitInterval=30min
StartLimitBurst=5
ExecStartPre=-/bin/rm -f %t/%n-pid %t/%n-cid
ExecStopPost=-/bin/rm -f %t/%n-pid %t/%n-cid
Type=forking
PIDFile=%t/%n-pid
Delegate=yes
will repeatedly be marked as failed, since they now take over 2 minutes to
run. That tells systemd to restart them, and we end up in an infinite loop:
because the 5 restarts take over 50 minutes, the StartLimitInterval is never
even triggered, leaving this OSD endlessly re-listing the n^2 devices (which,
as a bonus, also fills up the root disk with an enormous amount of repeated
logging in ceph-volume.log as it tries over and over to figure out whether
each block device is a bluestore device).
And trying to fix the service or unit files manually, to at least stop this
container from being incorrectly restarted over and over, is also a dead end,
since the orchestration layer just overwrites them automatically and restarts
the services again.
I found it seemed to be
/var/lib/ceph/5406fed0-d52b-11ec-beff-7ed30a54847b/cephadm.8d0364fef6c92fc3580b0d022e32241348e6f11a7694d2b957cdafcb9d059ff2
on my system that generated these files, so I tried tweaking that to set the
necessary 1200-second TimeoutStartSec, and that finally got the darn
container to stop restarting endlessly. (I admit I'm very fuzzy on how these
services and the orchestration are triggered, as I usually don't work on our
storage stuff.)
Still though, it takes 11 minutes to start each OSD service now, so this
isn't great.
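For anyone else stuck in the same loop, a possibly less invasive stopgap might be a systemd drop-in for the template unit to raise the timeout; I have not tested whether cephadm leaves drop-ins alone, so treat this as a sketch:
# raise the start timeout for all daemon instances of this cluster fsid
mkdir -p /etc/systemd/system/ceph-5406fed0-d52b-11ec-beff-7ed30a54847b@.service.d
cat > /etc/systemd/system/ceph-5406fed0-d52b-11ec-beff-7ed30a54847b@.service.d/override.conf <<'EOF'
[Service]
TimeoutStartSec=1200
EOF
systemctl daemon-reload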
We wanted to revert back to 16.2.10, but that turns out to be a no-go as
well, since a new operation was added to bluefs in 16.2.11 by
https://github.com/ceph/ceph/pull/42750 (this isn't mentioned in the
changelogs; I had to compare the source code to see that it was in fact added
in 16.2.11). Trying to revert an OSD then fails with:
debug 2023-04-04T11:42:45.927+0000 7f2c12f6a200 -1 bluefs _replay 0x100000:
stop: unrecognized op 12
debug 2023-04-04T11:42:45.927+0000 7f2c12f6a200 -1 bluefs mount failed to
replay log: (5) Input/output error
debug 2023-04-04T11:42:45.927+0000 7f2c12f6a200 -1
bluestore(/var/lib/ceph/osd/ceph-10) _open_bluefs failed bluefs mount: (5)
Input/output error
debug 2023-04-04T11:42:45.927+0000 7f2c12f6a200 -1
bluestore(/var/lib/ceph/osd/ceph-10) _open_db failed to prepare db
environment:
debug 2023-04-04T11:42:45.927+0000 7f2c12f6a200 1 bdev(0x5590e80a0400
/var/lib/ceph/osd/ceph-10/block) close
debug 2023-04-04T11:42:46.153+0000 7f2c12f6a200 -1 osd.10 0 OSD:init:
unable to mount object store
debug 2023-04-04T11:42:46.153+0000 7f2c12f6a200 -1 ** ERROR: osd init
failed: (5) Input/output error
Ouch
Best regards, Mikael
Hi,
One of my customers has an RGW cluster with two zones in one zonegroup that
was working correctly, but since a few days ago users are not able to create
buckets and always get Access Denied. Working with existing buckets works
(like listing/putting objects into existing buckets); the only operation that
does not work is bucket creation. We also tried to create a new user, but the
behavior is the same and it cannot create a bucket either. We tried s3cmd, a
Python script with the boto library, and also the Dashboard as admin user. We
always get Access Denied. Zones are in sync.
Has anyone experienced such behavior?
Thanks in advance, here are some outputs:
$ s3cmd -c .s3cfg_python_client mb s3://test
ERROR: Access to bucket 'test' was denied
ERROR: S3 error: 403 (AccessDenied)
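For completeness, the new-user test was roughly the following (a sketch; uid, endpoint and keys are placeholders, the keys being taken from the user create output):
# create a fresh user on the master zone, then try to create a bucket with its keys
radosgw-admin user create --uid=bucket-test --display-name="bucket test"
s3cmd --host=<rgw-endpoint> --host-bucket=<rgw-endpoint> \
      --access_key=<ACCESS_KEY> --secret_key=<SECRET_KEY> \
      mb s3://test-new-user
# fails with the same 403 AccessDenied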
Zones are in-sync:
Primary cluster:
# radosgw-admin sync status
realm 5429b434-6d43-4a18-8f19-a5720a89c621 (solargis-prod)
zonegroup 00e4b3ff-1da8-4a86-9f52-4300c6d0f149 (solargis-prod-ba)
zone 6067eec6-a930-45c7-af7d-a7ef2785a2d7 (solargis-prod-ba-dc)
metadata sync no sync (zone is master)
data sync source: e84fd242-dbae-466c-b4d9-545990590995 (solargis-prod-ba-hq)
syncing
full sync: 0/128 shards
incremental sync: 128/128 shards
data is caught up with source
Secondary cluster:
# radosgw-admin sync status
realm 5429b434-6d43-4a18-8f19-a5720a89c621 (solargis-prod)
zonegroup 00e4b3ff-1da8-4a86-9f52-4300c6d0f149 (solargis-prod-ba)
zone e84fd242-dbae-466c-b4d9-545990590995 (solargis-prod-ba-hq)
metadata sync syncing
full sync: 0/64 shards
incremental sync: 64/64 shards
metadata is caught up with master
data sync source: 6067eec6-a930-45c7-af7d-a7ef2785a2d7 (solargis-prod-ba-dc)
syncing
full sync: 0/128 shards
incremental sync: 128/128 shards
data is caught up with source
--
Kamil Madac
Hi,
We have a 3-site Ceph cluster and would like to create a 4+2 EC pool
with 2 chunks per datacenter, to maximise resilience in case one
datacenter goes down. I have not found a way to create an EC profile
with this 2-level allocation strategy. I created an EC profile with
failure domain = datacenter, but it doesn't work because, I guess, it
wants to ensure there are always 5 OSDs up (so that the pool remains
R/W), whereas with failure domain = datacenter the guarantee is only 4.
My idea was to create a 2-step allocation and a failure domain=host to
achieve our desired configuration, with something like the following in
the crushmap rule:
step choose indep 3 datacenter
step chooseleaf indep x host
step emit
Is it the right approach? If yes, what should be 'x'? Would 0 work?
From what I have seen, there is no way to create such a rule with the
'ceph osd crush' commands: I have to download the current CRUSH map, edit
it, and upload the modified version. Am I right?
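In case it is useful, the manual round-trip I am referring to would look like this (a sketch; file names and the rule id are placeholders):
# dump and decompile the current CRUSH map
ceph osd getcrushmap -o crushmap.bin
crushtool -d crushmap.bin -o crushmap.txt
# edit crushmap.txt and add the rule, e.g.
#   step take default
#   step choose indep 3 type datacenter
#   step chooseleaf indep 2 type host
#   step emit
# recompile and sanity-check the mappings before injecting it
crushtool -c crushmap.txt -o crushmap.new
crushtool -i crushmap.new --test --rule <rule-id> --num-rep 6 --show-mappings | head
ceph osd setcrushmap -i crushmap.new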
Thanks in advance for your help or suggestions. Best regards,
Michel
Hello guys!
We noticed an unexpected situation: in a recently deployed Ceph cluster, we
are seeing raw usage that is a bit odd. We have a new cluster with 5 nodes
with the following setup:
- 128 GB of RAM
- 2 cpus Intel(R) Intel Xeon Silver 4210R
- 1 NVMe of 2 TB for RocksDB caching
- 5 HDDs of 14TB
- 1 dual-port 25 Gb NIC in bond mode.
Right after deploying the Ceph cluster, we see raw usage of about 9 TiB, even
though no load has been applied to the cluster yet. Have you guys seen such a
situation? Or can you guys help us understand it?
We are using Ceph Octopus, and we have set the following configurations:
```
ceph_conf_overrides:
global:
osd pool default size: 3
osd pool default min size: 1
osd pool default pg autoscale mode: "warn"
perf: true
rocksdb perf: true
mon:
mon osd down out interval: 120
osd:
bluestore min alloc size hdd: 65536
```
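For reference, the raw usage we refer to is what the standard commands report; the per-OSD view may help narrow down where it sits:
# cluster-wide raw and per-pool usage
ceph df detail
# per-OSD breakdown (size, raw use, data vs. omap/meta)
ceph osd df tree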
Any tip or help on how to explain this situation is welcome!
Hello,
We are considering CephFS as an alternative to GlusterFS, and have some
questions about performance. Is anyone able to advise us please?
This would be for file systems between 100GB and 2TB in size, average file
size around 5MB, and a mixture of reads and writes. I may not be using the
correct terminology in the Ceph world, but in my parlance a node is a Linux
server running the Ceph storage software. Multiple nodes make up the whole
Ceph storage solution. Someone correct me if I should be using different
terms!
In our normal scenario the nodes in the replicated filesystem would be
around 0.3ms apart, but we're also interested in geographically remote
nodes which would be say 20ms away. We are using third party software which
relies on a traditional Linux filesystem, so we can't use an object storage
solution directly.
So my specific questions are:
1. When reading a file from CephFS, does it read from just one node, or
from all nodes?
2. If reads are from one node then does it choose the node with the fastest
response to optimise performance, or if from all nodes then will reads be
no faster than latency to the furthest node?
3. When writing to CephFS, are all nodes written to synchronously, or is the
write sent to one node, which then replicates it to the other nodes asynchronously?
4. Can anyone give a recommendation on maximum latency between nodes to
have decent performance?
5. How does CephFS handle a node which suddenly becomes unavailable on the
network? Is the block time configurable, and how good is the healing
process after the lost node rejoins the network?
6. I have read that CephFS is more complicated to administer than
GlusterFS. What does everyone think? Are things like healing after a net
split difficult for administrators new to Ceph to handle?
Thanks very much in advance.
--
David Cunningham, Voisonics Limited
http://voisonics.com/
USA: +1 213 221 1092
New Zealand: +64 (0)28 2558 3782
Hi,
I am reading some documentation about mClock and have two questions.
First, about the IOPS: are those disk IOPS or some other kind of IOPS? And what are the assumptions behind them (block size, sequential or random reads/writes)?
And the second question: how does mClock calculate its profiles? My lab cluster is running Quincy, and I have these parameters for mClock:
"osd_mclock_max_capacity_iops_hdd": "450.000000",
"osd_mclock_profile": "balanced",
According to the documentation: https://docs.ceph.com/en/quincy/rados/configuration/mclock-config-ref/#bala… I am expecting to have:
"osd_mclock_scheduler_background_best_effort_lim": "999999",
"osd_mclock_scheduler_background_best_effort_res": "90",
"osd_mclock_scheduler_background_best_effort_wgt": "2",
"osd_mclock_scheduler_background_recovery_lim": "675",
"osd_mclock_scheduler_background_recovery_res": "180",
"osd_mclock_scheduler_background_recovery_wgt": "1",
"osd_mclock_scheduler_client_lim": "450",
"osd_mclock_scheduler_client_res": "180", "osd_mclock_scheduler_client_wgt": "1",
But what I get is:
"osd_mclock_scheduler_background_best_effort_lim": "999999",
"osd_mclock_scheduler_background_best_effort_res": "18",
"osd_mclock_scheduler_background_best_effort_wgt": "2",
"osd_mclock_scheduler_background_recovery_lim": "135",
"osd_mclock_scheduler_background_recovery_res": "36",
"osd_mclock_scheduler_background_recovery_wgt": "1",
"osd_mclock_scheduler_client_lim": "90",
"osd_mclock_scheduler_client_res": "36",
"osd_mclock_scheduler_client_wgt": "1",
These seem very low compared to what my disk should be able to handle.
Is this calculation the expected one, or did I miss something about how those profiles are populated?
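For reference, the expected numbers above do follow directly from the configured 450 IOPS, and the values I actually get are exactly one fifth of them (a quick arithmetic check; the percentages are the ones implied by the expected values):
# expected "balanced" allocations derived from osd_mclock_max_capacity_iops_hdd = 450
echo $(( 450 * 40 / 100 ))    # 180 -> expected client_res and background_recovery_res
echo $(( 450 * 100 / 100 ))   # 450 -> expected client_lim
echo $(( 450 * 150 / 100 ))   # 675 -> expected background_recovery_lim
echo $(( 450 * 20 / 100 ))    # 90  -> expected background_best_effort_res
# the values actually reported are exactly these divided by 5
echo $(( 180 / 5 )) $(( 450 / 5 )) $(( 675 / 5 )) $(( 90 / 5 ))   # 36 90 135 18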
Luis Domingues
Proton AG