https://docs.ceph.com/en/reef/cephfs/createfs/ says:
> The data pool used to create the file system is the “default” data pool and the location for storing all inode backtrace information, which is used for hard link management and disaster recovery.
> For this reason, all CephFS inodes have at least one object in the default data pool. If erasure-coded pools are planned for file system data, it is best to configure the default as a replicated pool to improve small-object write and read performance when updating backtraces.
This poses the question:
Are normal replicated CephFS installations (metadata on SSDs, data on HDDs) set up with suboptimal performance because they don't do this?
If having inodes/backtraces on replicated instead of EC improves performance, shouldn't one expect that putting inodes/backtraces on SSD would improve it even more?
From the docs I also cannot really conclude when inodes/backtraces become important.
Is that all the time, or only sometimes?
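For context, this is the kind of layout I understand the docs to be recommending - a rough sketch with made-up pool names, so I may well have misread it:

# small replicated pool as the default data pool (holds the backtraces)
ceph osd pool create cephfs_metadata 64
ceph osd pool create cephfs_data_default 64
ceph fs new myfs cephfs_metadata cephfs_data_default

# EC pool added afterwards as an additional data pool for the bulk data
ceph osd pool create cephfs_data_ec 64 erasure
ceph osd pool set cephfs_data_ec allow_ec_overwrites true
ceph fs add_data_pool myfs cephfs_data_ec

# direct new file data at the EC pool via a layout on the mounted root
setfattr -n ceph.dir.layout.pool -v cephfs_data_ec /mnt/myfs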
Thanks!
Hi,
In the discussion after the Ceph Month talks yesterday, there was a bit
of chat about cephadm / containers / packages. IIRC, Sage observed that
a common reason in the recent user survey for not using cephadm was that
it only worked on containerised deployments. I think he then went on to
say that he hadn't heard any compelling reasons why not to use
containers, and suggested that resistance was essentially a user
education question[0].
I'd like to suggest, briefly, that:
* containerised deployments are more complex to manage, and this is not
simply a matter of familiarity
* reducing the complexity of systems makes admins' lives easier
* the trade-off of the pros and cons of containers vs packages is not
obvious, and will depend on deployment needs
* Ceph users will benefit from both approaches being supported into the
future
We make extensive use of containers at Sanger, particularly for
scientific workflows, and also for bundling some web apps (e.g.
Grafana). We've also looked at a number of container runtimes (Docker,
Singularity, Charliecloud). They do have advantages - it's easy to
distribute a complex userland in a way that will run on (almost) any
target distribution; rapid "cloud" deployment; some separation (via
namespaces) of network/users/processes.
For what I think of as a 'boring' Ceph deploy (i.e. install on a set of
dedicated hardware and then run for a long time), I'm not sure any of
these benefits are particularly relevant and/or compelling - Ceph
upstream produce Ubuntu .debs and Canonical (via their Ubuntu Cloud
Archive) provide .debs of a couple of different Ceph releases per Ubuntu
LTS - meaning we can easily separate out OS upgrade from Ceph upgrade.
And upgrading the Ceph packages _doesn't_ restart the daemons[1],
meaning that we maintain control over restart order during an upgrade.
And while we might briefly install packages from a PPA or similar to
test a bugfix, we roll those (test-)cluster-wide, rather than trying to
run a mixed set of versions on a single cluster - and I understand this
single-version approach is best practice.
Deployment via containers does bring complexity; some examples we've
found at Sanger (not all of these are Ceph-related; Ceph itself we run
from packages):
* you now have 2 process supervision points - dockerd and systemd
* docker updates (via distribution unattended-upgrades) have an
unfortunate habit of rudely restarting everything
* docker squats on a chunk of RFC 1918 space (and telling it not to can
be a bore - see the daemon.json sketch after this list), which coincides
with our internal network...
* there is more friction if you need to look inside containers
(particularly if you have a lot running on a host and are trying to find
out what's going on)
* you typically need to be root to build docker containers (unlike packages)
* we already have package deployment infrastructure (which we'll need
regardless of deployment choice)
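On the RFC 1918 point above, the workaround is roughly an
/etc/docker/daemon.json like the following - the addresses here are just
examples, not our real ranges:

{
  "bip": "10.200.0.1/24",
  "default-address-pools": [
    { "base": "10.201.0.0/16", "size": 24 }
  ]
}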
We also currently use systemd overrides to tweak some of the Ceph units
(e.g. to do some network sanity checks before bringing up an OSD), and
have some tools to pair up OSD / journal / LVM / disk devices; I think
these would be more fiddly in a containerised deployment. I'd accept
that fixing these might just be a SMOP[2] on our part.
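To illustrate the kind of override I mean, a minimal sketch (the check
script path is illustrative; our real one does rather more):

# /etc/systemd/system/ceph-osd@.service.d/sanity-check.conf
[Service]
# run a network sanity check before the OSD is allowed to start
ExecStartPre=/usr/local/sbin/osd-network-sanity-check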
Now none of this is show-stopping, and I am most definitely not saying
"don't ship containers". But I think there is added complexity to your
deployment from going the containers route, and that is not simply a
"learn how to use containers" learning curve. I do think it is
reasonable for an admin to want to reduce the complexity of what they're
dealing with - after all, much of my job is trying to automate or
simplify the management of complex systems!
I can see from a software maintainer's point of view that just building
one container and shipping it everywhere is easier than building
packages for a number of different distributions (one of my other hats
is a Debian developer, and I have a bunch of machinery for doing this
sort of thing). But it would be a bit unfortunate if the general thrust
of "let's make Ceph easier to set up and manage" was somewhat derailed
with "you must use containers, even if they make your life harder".
I'm not going to criticise anyone who decides to use a container-based
deployment (and I'm sure there are plenty of setups where it's an
obvious win), but if I were advising someone who wanted to set up and
use a 'boring' Ceph cluster for the medium term, I'd still advise on
using packages. I don't think this makes me a luddite :)
Regards, and apologies for the wall of text,
Matthew
[0] I think that's a fair summary!
[1] This hasn't always been true...
[2] Simple (sic.) Matter of Programming
Hi again, hopefully for the last time with problems.
We had an MDS crash earlier, with the MDS staying in the failed state, and used a command to reset the filesystem (this was wrong, I know now; thanks Patrick Donnelly for pointing this out). I did a full scrub on the filesystem and two files were damaged. One of those got repaired, but the following file keeps giving errors and can't be removed.
What can I do now? Below some information.
# ceph tell mds.atlassian-prod:0 damage ls
[
    {
        "damage_type": "backtrace",
        "id": 2244444901,
        "ino": 1099534008829,
        "path": "/app1/shared/data/repositories/11271/objects/41/8f82507a0737c611720ed224bcc8b7a24fda01"
    }
]
Trying to repair the error (online research shows this should work for a backtrace damage type)
----------
# ceph tell mds.atlassian-prod:0 scrub start /app1/shared/data/repositories/11271 recursive,repair,force
{
    "return_code": 0,
    "scrub_tag": "d10ead42-5280-4224-971e-4f3022e79278",
    "mode": "asynchronous"
}
Cluster logs after this
----------
1/2/24 9:37:05 AM
[INF]
scrub summary: idle
1/2/24 9:37:02 AM
[INF]
scrub summary: idle+waiting paths [/app1/shared/data/repositories/11271]
1/2/24 9:37:01 AM
[INF]
scrub summary: active paths [/app1/shared/data/repositories/11271]
1/2/24 9:37:01 AM
[INF]
scrub summary: idle+waiting paths [/app1/shared/data/repositories/11271]
1/2/24 9:37:01 AM
[INF]
scrub queued for path: /app1/shared/data/repositories/11271
But the error doesn't disappear and I still can't remove the file.
On the client, trying to remove the file (we have a backup)
----------
$ rm -f /mnt/shared_disk-app1/shared/data/repositories/11271/objects/41/8f82507a0737c611720ed224bcc8b7a24fda01
rm: cannot remove '/mnt/shared_disk-app1/shared/data/repositories/11271/objects/41/8f82507a0737c611720ed224bcc8b7a24fda01': Input/output error
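The next thing I was considering is clearing the damage-table entry and re-running the scrub, roughly like below, but I'm not sure whether that is safe or just hides the problem:

# remove the damage entry listed above, then scrub the path again
ceph tell mds.atlassian-prod:0 damage rm 2244444901
ceph tell mds.atlassian-prod:0 scrub start /app1/shared/data/repositories/11271 recursive,repair,force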
Best regards,
Sake
Hi!
While reading through the documentation about subtree pinning, I was wondering if the following is possible.
We've got the following directory structure.
/
/app1
/app2
/app3
/app4
Can I pin /app1 to MDS rank 0 and 1, the directory /app2 to rank 2 and finally /app3 and /app4 to rank 3?
I would like to load balance the subfolders of /app1 across 2 (or 3) MDS ranks.
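For reference, the xattr commands I had in mind look roughly like this (paths assume the filesystem is mounted at /mnt/cephfs; whether /app1 can be limited to just ranks 0 and 1 with distributed pinning is exactly what I'm unsure about):

# pin /app2 to rank 2, /app3 and /app4 to rank 3
setfattr -n ceph.dir.pin -v 2 /mnt/cephfs/app2
setfattr -n ceph.dir.pin -v 3 /mnt/cephfs/app3
setfattr -n ceph.dir.pin -v 3 /mnt/cephfs/app4
# spread the immediate subdirectories of /app1 over the active ranks
setfattr -n ceph.dir.pin.distributed -v 1 /mnt/cephfs/app1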
Best regards,
Sake
Hi all,
I have a problem upgrading a Ceph cluster from Pacific to Quincy with
cephadm. I have successfully upgraded the cluster to the latest Pacific
(16.2.11). But when I run the following command to upgrade the cluster
to 17.2.5, the upgrade process stops with an "Unexpected error" after
upgrading 3/4 mgrs. (Everything is on a private network.)
ceph orch upgrade start my-private-repo/quay-io/ceph/ceph:v17.2.5
I also tried the 17.2.4 version.
cephadm fails to check the hosts' status and marks them as offline:
cephadm 2023-04-06T10:19:59.998510+0000 mgr.host9.arhpnd (mgr.4516356) 5782
: cephadm [DBG] host host4 (x.x.x.x) failed check
cephadm 2023-04-06T10:19:59.998553+0000 mgr.host9.arhpnd (mgr.4516356) 5783
: cephadm [DBG] Host "host4" marked as offline. Skipping daemon refresh
cephadm 2023-04-06T10:19:59.998581+0000 mgr.host9.arhpnd (mgr.4516356) 5784
: cephadm [DBG] Host "host4" marked as offline. Skipping gather facts
refresh
cephadm 2023-04-06T10:19:59.998609+0000 mgr.host9.arhpnd (mgr.4516356) 5785
: cephadm [DBG] Host "host4" marked as offline. Skipping network refresh
cephadm 2023-04-06T10:19:59.998633+0000 mgr.host9.arhpnd (mgr.4516356) 5786
: cephadm [DBG] Host "host4" marked as offline. Skipping device refresh
cephadm 2023-04-06T10:19:59.998659+0000 mgr.host9.arhpnd (mgr.4516356) 5787
: cephadm [DBG] Host "host4" marked as offline. Skipping osdspec preview
refresh
cephadm 2023-04-06T10:19:59.998682+0000 mgr.host9.arhpnd (mgr.4516356) 5788
: cephadm [DBG] Host "host4" marked as offline. Skipping autotune
cluster 2023-04-06T10:20:00.000151+0000 mon.host8 (mon.0) 158587 : cluster
[ERR] Health detail: HEALTH_ERR 9 hosts fail cephadm check; Upgrade: failed
due to an unexpected exception
cluster 2023-04-06T10:20:00.000191+0000 mon.host8 (mon.0) 158588 : cluster
[ERR] [WRN] CEPHADM_HOST_CHECK_FAILED: 9 hosts fail cephadm check
cluster 2023-04-06T10:20:00.000202+0000 mon.host8 (mon.0) 158589 : cluster
[ERR] host host7 (x.x.x.x) failed check: Unable to reach remote host
host7. Process exited with non-zero exit status 3
cluster 2023-04-06T10:20:00.000213+0000 mon.host8 (mon.0) 158590 : cluster
[ERR] host host2 (x.x.x.x) failed check: Unable to reach remote host
host2. Process exited with non-zero exit status 3
cluster 2023-04-06T10:20:00.000220+0000 mon.host8 (mon.0) 158591 : cluster
[ERR] host host8 (x.x.x.x) failed check: Unable to reach remote host
host8. Process exited with non-zero exit status 3
cluster 2023-04-06T10:20:00.000228+0000 mon.host8 (mon.0) 158592 : cluster
[ERR] host host4 (x.x.x.x) failed check: Unable to reach remote host
host4. Process exited with non-zero exit status 3
cluster 2023-04-06T10:20:00.000240+0000 mon.host8 (mon.0) 158593 : cluster
[ERR] host host3 (x.x.x.x) failed check: Unable to reach remote host
host3. Process exited with non-zero exit status 3
and here are some outputs of the commands:
[root@host8 ~]# ceph -s
cluster:
id: xxx
health: HEALTH_ERR
9 hosts fail cephadm check
Upgrade: failed due to an unexpected exception
services:
mon: 5 daemons, quorum host8,host1,host7,host2,host9 (age 2w)
mgr: host9.arhpnd(active, since 105m), standbys: host8.jowfih,
host1.warjsr, host2.qyavjj
mds: 1/1 daemons up, 3 standby
osd: 37 osds: 37 up (since 8h), 37 in (since 3w)
data:
io:
client:
progress:
Upgrade to 17.2.5 (0s)
[............................]
[root@host8 ~]# ceph orch upgrade status
{
"target_image": "my-private-repo/quay-io/ceph/ceph@sha256
:34c763383e3323c6bb35f3f2229af9f466518d9db926111277f5e27ed543c427",
"in_progress": true,
"which": "Upgrading all daemon types on all hosts",
"services_complete": [],
"progress": "3/59 daemons upgraded",
"message": "Error: UPGRADE_EXCEPTION: Upgrade: failed due to an
unexpected exception",
"is_paused": true
}
[root@host8 ~]# ceph cephadm check-host host7
check-host failed:
Host 'host7' not found. Use 'ceph orch host ls' to see all managed hosts.
[root@host8 ~]# ceph versions
{
"mon": {
"ceph version 16.2.11 (3cf40e2dca667f68c6ce3ff5cd94f01e711af894)
pacific (stable)": 5
},
"mgr": {
"ceph version 16.2.11 (3cf40e2dca667f68c6ce3ff5cd94f01e711af894)
pacific (stable)": 1,
"ceph version 17.2.5 (98318ae89f1a893a6ded3a640405cdbb33e08757)
quincy (stable)": 3
},
"osd": {
"ceph version 16.2.11 (3cf40e2dca667f68c6ce3ff5cd94f01e711af894)
pacific (stable)": 37
},
"mds": {
"ceph version 16.2.11 (3cf40e2dca667f68c6ce3ff5cd94f01e711af894)
pacific (stable)": 4
},
"overall": {
"ceph version 16.2.11 (3cf40e2dca667f68c6ce3ff5cd94f01e711af894)
pacific (stable)": 47,
"ceph version 17.2.5 (98318ae89f1a893a6ded3a640405cdbb33e08757)
quincy (stable)": 3
}
}
The strange thing is that I can roll back the cluster status by failing
over to a not-yet-upgraded mgr like this:
ceph mgr fail
ceph orch upgrade start my-private-repo/quay-io/ceph/ceph:v16.2.11
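If it helps, this is roughly what I plan to run next to gather more detail (assuming the cephadm debug log channel behaves the same on 16.2.11):

# pause the upgrade and turn on cephadm debug logging in the cluster log
ceph orch upgrade pause
ceph config set mgr mgr/cephadm/log_to_cluster_level debug
ceph -W cephadm --watch-debug
# re-run the host check against one of the "offline" hosts
ceph cephadm check-host host4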
Would you happen to have any idea about this?
Best regards,
Reza
Summary
----------
The relationship between the values configured for bluestore_min_alloc_size and bluefs_shared_alloc_size is reported to impact space amplification, partial overwrites in erasure-coded pools, and storage capacity as an OSD becomes more fragmented and/or more full.
Previous discussions including this topic
----------------------------------------
comment #7 in bug 63618 in Dec 2023 - https://tracker.ceph.com/issues/63618#note-7
pad writeup related to bug 62282 likely from late 2023 - https://pad.ceph.com/p/RCA_62282
email sent 13 Sept 2023 in mail list discussion of cannot create new osd - https://lists.ceph.io/hyperkitty/list/ceph-users@ceph.io/message/5M4QAXJDCN…
comment #9 in bug 58530 likely from early 2023 - https://tracker.ceph.com/issues/58530#note-9
email sent 30 Sept 2021 in mail list discussion of flapping osds - https://www.mail-archive.com/ceph-users@ceph.io/msg13072.html
email sent 25 Feb 2020 in mail list discussion of changing allocation size - https://lists.ceph.io/hyperkitty/list/ceph-users@ceph.io/thread/B3DGKH6THFG…
Current situation
-----------------
We have three Ceph clusters that were originally built via cephadm on Octopus and later upgraded to Pacific. All OSDs are HDD (we will be moving to WAL+DB on SSD) and they were resharded after the upgrade to enable RocksDB sharding.
The value for bluefs_shared_alloc_size has remained unchanged at 65536.
The value for bluestore_min_alloc_size_hdd was 65536 in Octopus but is reported as 4096 by "ceph daemon osd.<id> config show" in Pacific. However, the OSD label after upgrading to Pacific retains the value 65536 for bfm_bytes_per_block. BitmapFreelistManager.h in the Ceph source code (src/os/bluestore/BitmapFreelistManager.h) indicates that bytes_per_block is bdev_block_size. This suggests that the physical layout of the OSDs has not changed from 65536, despite the ceph daemon command reporting 4096. This interpretation is supported by the Minimum Allocation Size section of the BlueStore configuration reference for Quincy (https://docs.ceph.com/en/quincy/rados/configuration/bluestore-config-ref/#m…)
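For reference, this is roughly how I have been comparing the runtime value with the on-disk label (osd.0 and the device path are examples; with cephadm the commands run inside "cephadm shell --name osd.0"):

# runtime value reported by the OSD process
ceph daemon osd.0 config get bluestore_min_alloc_size_hdd
# value stamped into the OSD at mkfs time
ceph-bluestore-tool show-label --dev /var/lib/ceph/osd/ceph-0/block | grep bfm_bytes_per_block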
Questions
----------
What are the pros and cons of the following three cases, each with two variations - co-located WAL+DB on HDD versus separate WAL+DB on SSD:
1) bluefs_shared_alloc_size, bluestore_min_alloc_size, and bfm_bytes_per_block all equal
2) bluefs_shared_alloc_size greater than but a multiple of bluestore_min_alloc_size with bfm_bytes_per_block equal to bluestore_min_alloc_size
3) bluefs_shared_alloc_size greater than but a multiple of bluestore_min_alloc_size with bfm_bytes_per_block equal to bluefs_shared_alloc_size
Hi,
We're a student club from Montréal where we host an OpenStack cloud with
a Ceph backend for storing virtual machines and volumes using RBD.
Two weeks ago we received an email from our ceph cluster saying that
some placement groups were damaged. We ran "sudo ceph pg repair <pg-id>" but then
there was an I/O error on the disk during the recovery ("An
unrecoverable disk media error occurred on Disk 4 in Backplane 1 of
Integrated RAID Controller 1." and "Bad block medium error is detected
at block 0x1377e2ad on Virtual Disk 3 on Integrated RAID Controller 1."
messages on iDRAC).
After that, the PG we tried to repair was in the state
"active+recovery_unfound+degraded". After a week, we ran the command
"sudo ceph pg 2.1b mark_unfound_lost revert" to try to recover the
damaged PG. We tried to boot the virtual machine that had crashed
because of this incident, but the volume seemed to have been completely
erased; the "mount" command said there was no filesystem on it, so we
recreated the VM from a backup.
A few days later, the same PG was once again damaged, and since we knew
the physical disk on the OSD hosting one part of the PG had problems, we
tried to "out" the OSD from the cluster. That resulted in the two other
OSDs hosting copies of the problematic PG going down, which caused
timeouts on our virtual machines, so we put the OSD back in.
We then tried to repair the PG again, but that failed and the PG is now
"active+clean+inconsistent+failed_repair". Whenever the OSD with the bad
disk goes down, two other OSDs from two other hosts go down too after a
few minutes, so it's impossible to replace the disk right now, even
though we have new ones available.
We have backups for most of our services, but it would be very
disruptive to delete the whole cluster, and we don't know what to do
with the broken PG and the OSD that can't be shut down.
Any help would be really appreciated. We're not experts with Ceph and
OpenStack, and it's likely we handled things wrong at some point, but we
really want to get back to a healthy Ceph.
Here is some information about our cluster:
romain:step@alpha-cen ~ $ sudo ceph health detail
HEALTH_ERR 1 scrub errors; Possible data damage: 1 pg inconsistent
[ERR] OSD_SCRUB_ERRORS: 1 scrub errors
[ERR] PG_DAMAGED: Possible data damage: 1 pg inconsistent
pg 2.1b is active+clean+inconsistent+failed_repair, acting [3,11,0]
romain:step@alpha-cen ~ $ sudo ceph osd tree
ID CLASS WEIGHT TYPE NAME STATUS REWEIGHT PRI-AFF
-1 70.94226 root default
-7 20.00792 host alpha-cen
3 hdd 1.81879 osd.3 up 1.00000 1.00000
6 hdd 1.81879 osd.6 up 1.00000 1.00000
12 hdd 1.81879 osd.12 up 1.00000 1.00000
13 hdd 1.81879 osd.13 up 1.00000 1.00000
15 hdd 1.81879 osd.15 up 1.00000 1.00000
16 hdd 9.09520 osd.16 up 1.00000 1.00000
17 hdd 1.81879 osd.17 up 1.00000 1.00000
-5 23.64874 host beta-cen
1 hdd 5.45749 osd.1 up 1.00000 1.00000
4 hdd 5.45749 osd.4 up 1.00000 1.00000
8 hdd 5.45749 osd.8 up 1.00000 1.00000
11 hdd 5.45749 osd.11 up 1.00000 1.00000
14 hdd 1.81879 osd.14 up 1.00000 1.00000
-3 27.28560 host gamma-cen
0 hdd 9.09520 osd.0 up 1.00000 1.00000
5 hdd 9.09520 osd.5 up 1.00000 1.00000
9 hdd 9.09520 osd.9 up 1.00000 1.00000
romain:step@alpha-cen ~ $ sudo rados list-inconsistent-obj 2.1b
{"epoch":9787,"inconsistents":[]}
romain:step@alpha-cen ~ $ sudo ceph pg 2.1b query
https://pastebin.com/gsKCPCjr
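For what it's worth, the next thing we were planning to try (we're not sure whether it's the right thing) is to deep-scrub the PG so the inconsistent-object list gets repopulated, then look at it again:

sudo ceph pg deep-scrub 2.1b
sudo rados list-inconsistent-obj 2.1b --format=json-pretty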
Best regards,
Romain Lebbadi-Breteau
Hi All,
I'm looking for some pointers/help as to why I can't get my Win10 PC
to connect to our Ceph Cluster's CephFS Service. Details are as follows:
Ceph Cluster:
- IP Addresses: 192.168.1.10, 192.168.1.11, 192.168.1.12
- Each node above is a monitor & an MDS
- Firewall ports: open (i.e. 3300, etc.)
- CephFS System Name: my_cephfs
- Log files: nothing jumps out at me
Windows PC:
- Keyring file created and findable: ceph.client.me.keyring
- dokany installed
- ceph-for-windows installed
- Can ping all three ceph nodes
- Connection command: ceph-dokan -l v -o -id me --debug --client_fs
my_cephfs -c C:\ProgramData\Ceph\ceph.conf
Ceph.conf contents:
~~~
[global]
mon_host = 192.168.1.10, 192.168.1.11, 192.168.1.12
log to stderr = true
log to syslog = true
run dir = C:/ProgramData/ceph
crash dir = C:/logs/ceph
debug client = 2
[client]
keyring = C:/ProgramData/ceph/ceph.client.me.keyring
log file = C:/logs/ceph/$name.$pid.log
admin socket = C:/ProgramData/ceph/$name.$pid.asok
~~~
Windows logfile contents (i.e. C:/logs/ceph/client.me.NNNN.log):
~~~
2024-02-28T18:26:45.201+1100 1 0 monclient(hunting): authenticate timed
out after 300
2024-02-28T18:31:45.203+1100 1 0 monclient(hunting): authenticate timed
out after 300
2024-02-28T18:36:45.205+1100 1 0 monclient(hunting): authenticate timed
out after 300
~~~
Additional info from Windows CLI:
~~~
failed to fetch mon config (--no-mon-config to skip)
~~~
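In case I did this part wrong: the client and keyring were created on the cluster side with something like the following (typed from memory, so the exact invocation may be off):
~~~
ceph fs authorize my_cephfs client.me / rw > ceph.client.me.keyring
ceph auth get client.me    # to double-check the caps
~~~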
So I've gone through the doco and done some Google-foo and I can't work
out *why* I'm getting the authentication failure. I know it'll be
something simple, something staring me in the face, but I'm at the point
where I can't see the forest for the trees - please help.
Any help greatly appreciated
Thanks in advance
Cheers
Dulux-Oz
Hi
Cephadm Reef 18.2.0.
We would like to remove our cluster_network without stopping the cluster
and without having to route between the networks.
global    advanced    cluster_network    192.168.100.0/24    *
global    advanced    public_network     172.21.12.0/22      *
The documentation[1] states:
"
You may specifically assign static IP addresses or override
cluster_network settings using the cluster_addr setting for specific OSD
daemons.
"
So for one OSD at a time I could set cluster_addr to override the
cluster_network IP and use the public_network IP instead? As the
containers are using host networking they have access to both IPs and
will just layer 2 the traffic, avoiding routing?
When all OSDs are running with a public_network IP set via cluster_addr
we can just delete the cluster_network setting and then remove all the
cluster_addr settings, as with no cluster_network setting the
public_network setting will be used?
We tried with one OSD and it seems to work. Anyone see a problem with
this approach?
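For clarity, the sequence we used for the test OSD, and would repeat per OSD, looks roughly like this (OSD id and IP are examples):

# point one OSD's cluster address at its public_network IP and restart it
ceph config set osd.12 cluster_addr 172.21.12.34
ceph orch daemon restart osd.12
# ...repeat for every OSD, then drop the cluster network...
ceph config rm global cluster_network
# ...and finally remove the per-OSD overrides again
ceph config rm osd.12 cluster_addr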
Thanks
Torkil
[1]
https://docs.ceph.com/en/latest/rados/configuration/network-config-ref/#id3
--
Torkil Svensgaard
Sysadmin
MR-Forskningssektionen, afs. 714
DRCMR, Danish Research Centre for Magnetic Resonance
Hvidovre Hospital
Kettegård Allé 30
DK-2650 Hvidovre
Denmark
Tel: +45 386 22828
E-mail: torkil(a)drcmr.dk