Hoping someone may be able to help point out where my bottleneck(s) may be.
I have an 80TB kRBD image on an EC8:2 pool, with an XFS filesystem on top of that.
This was not an ideal scenario; rather, it was a rescue mission to dump a large, aging RAID array before it was too late, so I'm working with the hand I was dealt.
To further complicate the issues, the main directory structure consists of lots and lots of small files and deep directories.
My goal is to rsync (or otherwise copy) data from the RBD to CephFS, but it's just unbearably slow and will take ~150 days to transfer ~35TB, which is far from ideal.
> 15.41G 79% 4.36MB/s 0:56:09 (xfr#23165, ir-chk=4061/27259)
> avg-cpu: %user %nice %system %iowait %steal %idle
> 0.17 0.00 1.34 13.23 0.00 85.26
> Device r/s rMB/s rrqm/s %rrqm r_await rareq-sz w/s wMB/s wrqm/s %wrqm w_await wareq-sz d/s dMB/s drqm/s %drqm d_await dareq-sz aqu-sz %util
> rbd0 124.00 0.66 0.00 0.00 17.30 5.48 50.00 0.17 0.00 0.00 31.70 3.49 0.00 0.00 0.00 0.00 0.00 0.00 3.39 96.40
Above: rsync progress and iostat (during the rsync) for a copy from the RBD to a local SSD, to rule out any bottleneck from writing back into CephFS.
About 16G in 1h, not exactly blazing, and this is only 5 of the ~7000 directories I'm looking to offload to CephFS. Note that rbd0 sits at ~96% utilization while doing only ~124 reads/s with a ~5.5KiB average request size, so it looks IOPS-bound on small reads rather than throughput-bound.
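One thing I may still try (a sketch, mount points assumed, not something I've benchmarked here): several rsyncs in parallel, one per top-level directory, to hide the per-file round trips that dominate with this many small files:

# run 8 concurrent rsyncs, one per top-level directory under the RBD mount
cd /mnt/rbd && ls -d */ | xargs -P8 -I{} rsync -a /mnt/rbd/{} /mnt/cephfs/{}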
Currently running 15.2.11, and the host is Ubuntu 20.04 (5.4.0-72-generic) with a single E5-2620, 64GB of memory, and 4x10GbT bond talking to ceph, iperf proves it out.
EC8:2, across about 16 hosts, 240 OSDs, with 24 of those being 8TB 7.2k SAS, and the other 216 being 2TB 7.2K SATA. So there are quite a few spindles in play here.
Only 128 PGs in this pool, but it's the only RBD image in the pool. The autoscaler recommends going to 512, but I was hoping to avoid the performance overhead of the PG splits if possible, given that perf is bad enough as is.
Examining the main directory structure, it looks like there are ~7000 files per directory, about 60% of which are <1MiB, totaling nearly 5GiB per directory.
My fstab for this is:
> xfs _netdev,noatime 0 0
I tried to increase the read_ahead_kb to 4M from 128K at /sys/block/rbd0/queue/read_ahead_kb to match the object/stripe size of the EC pool, but that doesn't appear to have had much of an impact.
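For reference, this is what I ran (read_ahead_kb is in KiB, so 4 MiB = 4096):

# echo 4096 > /sys/block/rbd0/queue/read_ahead_kb
# cat /sys/block/rbd0/queue/read_ahead_kb
4096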
The only thing I can think of to try next would be to increase the queue depth in the rbdmap up from 128, so that's my next bullet to fire.
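A sketch of what I have in mind (queue_depth is a krbd map option; <pool> stands in for the image's base pool, which I haven't named here):

# rbd device map <pool>/rbd-image-name -o queue_depth=256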
Attaching xfs_info in case there are any useful nuggets:
> meta-data=/dev/rbd0 isize=256 agcount=81, agsize=268435455 blks
> = sectsz=512 attr=2, projid32bit=0
> = crc=0 finobt=0, sparse=0, rmapbt=0
> = reflink=0
> data = bsize=4096 blocks=21483470848, imaxpct=5
> = sunit=0 swidth=0 blks
> naming =version 2 bsize=4096 ascii-ci=0, ftype=0
> log =internal log bsize=4096 blocks=32768, version=2
> = sectsz=512 sunit=0 blks, lazy-count=0
> realtime =none extsz=4096 blocks=0, rtextents=0
> rbd image 'rbd-image-name':
> size 85 TiB in 22282240 objects
> order 22 (4 MiB objects)
> snapshot_count: 0
> id: a09cac2b772af5
> data_pool: rbd-ec82-pool
> block_name_prefix: rbd_data.29.a09cac2b772af5
> format: 2
> features: layering, exclusive-lock, object-map, fast-diff, deep-flatten, data-pool
> create_timestamp: Mon Apr 12 18:44:38 2021
> access_timestamp: Mon Apr 12 18:44:38 2021
> modify_timestamp: Mon Apr 12 18:44:38 2021
Any other ideas or hints are greatly appreciated.
I guess I should probably have been clearer: this is one pool of many, so the other OSDs aren't idle.
So I don't necessarily think the PG bump would be the worst thing to try, but it's definitely not as bad as I may have made it sound.
> On May 27, 2021, at 11:37 PM, Anthony D'Atri <anthony.datri(a)gmail.com> wrote:
> That gives you a PG ratio of …. 5.3 ???
> Run `ceph osd df` ; I wouldn’t be surprised if some of your drives have 0 PGs on them, for sure I would suspect that they aren’t even at all.
> There are bottlenecks in the PG code, and in the OSD code — one reason why with NVMe clusters it’s common to split each drive into at least 2 OSDs. With spinners you don’t want to do that, but you get the idea.
> The pg autoscaler is usually out of its Vulcan mind. 512 would give you a ratio of just 21.
> Prior to 12.2.1 conventional wisdom was a PG ratio of 100-200 on spinners.
> 2048 PGs would give you a ratio of 85, which current (retconned) guidance would call good. I’d probably go to 4096 but 2048 would be way better than 128.
> I strongly suspect that PG splitting would still get you done faster than the way it is, esp. if you’re running BlueStore OSDs.
> Try bumping pg_num up to say 256 and see how bad it is, and whether, once pgp_num catches up, your ingest rate isn't a bit higher than it was before.
>> EC8:2, across about 16 hosts, 240 OSDs, with 24 of those being 8TB 7.2k SAS, and the other 216 being 2TB 7.2K SATA. So there are quite a few spindles in play here.
>> Only 128 PGs, in this pool, but its the only RBD image in this pool. Autoscaler recommends going to 512, but was hoping to avoid the performance overhead of the PG splits if possible, given perf is bad enough as is.
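For reference, a minimal sketch of the suggested bump (assuming the EC data pool named in the rbd info above is the one holding the 128 PGs; on Octopus, pgp_num follows pg_num automatically):

# ceph osd pool set rbd-ec82-pool pg_num 256
# ceph osd pool get rbd-ec82-pool pg_num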
I am trying to place the two MDS daemons for CephFS on dedicated nodes. For that purpose I tried out a few different "ceph orch apply ..." commands with a label, but in the end it looks like I messed up the placement, as I now have two mds service_types, as you can see below:
# ceph orch ls --service-type mds --export
This second entry at the bottom seems totally wrong, and I would like to remove it, but I haven't found how to remove it completely. Any ideas?
Ideally I just want to place two MDS daemons on nodes ceph1a and ceph1g.
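A hedged sketch of what I expect should work (the service name below is a placeholder for the bad entry from the export, and the fs name is assumed):

# ceph orch rm <bad_mds_service_name>
# ceph orch apply mds <fs_name> --placement="ceph1a ceph1g"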
I'm attempting to get Ceph up and running, and currently feel like I'm
going around in circles.
I'm attempting to use cephadm and Pacific, currently on Debian Buster,
mostly because CentOS 7 isn't supported any more and CentOS 8 isn't
supported by some of my hardware.
Anyway, I have a few nodes with 59x 7.2TB disks, but for some reason the OSD
daemons don't start: the disks get formatted and the OSDs are created, but
the daemons never come up.
They are probably the wrong spec for Ceph (48GB of memory and only 4 cores),
but I was expecting them to start and either be dirt slow or crash later.
Anyway, I've got up to 30 of these nodes, so I was hoping to get at least
6PB of raw storage out of them.
As yet I've not spotted any helpful error messages.
This is for an archive / slow Ceph cluster, so I'm not expecting speed.
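One suspicion I have (an assumption on my part, nothing in the logs confirms it yet): with the default 4GiB osd_memory_target, 59 OSDs want roughly 236GiB of RAM, so on a 48GB box the daemons may simply be getting OOM-killed as they start. A sketch of what I plan to check and try:

# dmesg | grep -i -e oom -e 'out of memory'
# ceph config set osd osd_memory_target 1073741824
Even at 1GiB per OSD that's still ~59GiB across 59 OSDs, so it may not fit regardless.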
Thanks in advance.
I have removed one node, but now Ceph seems to be stuck in:
Degraded data redundancy: 67/2393 objects degraded (2.800%), 12 pgs
degraded, 12 pgs undersized
How to "force" rebalancing? Or should I just wait a little bit more?
The server runs 15.2.9 and has 15 HDDs and 3 SSDs.
The OSDs were created with this YAML file
The result was that the 3 SSDs were added to 1 VG with 15 LVs on it.
# vgs | egrep "VG|dbs"
VG #PV #LV #SN Attr
ceph-block-dbs-563432b7-f52d-4cfe-b952-11542594843b 3 15 0 wz--n-
One of the OSDs failed and I ran rm with --replace
# ceph orch osd rm 178 --replace
and the result is
# ceph osd tree | egrep "ID|destroyed"
ID CLASS WEIGHT TYPE NAME STATUS REWEIGHT
178 hdd 12.82390 osd.178 destroyed 0
But I'm not able to replace the disk with the same YAML file as shown
# ceph orch apply osd -i hdd.yml --dry-run
|SERVICE |NAME |HOST |DATA |DB |WAL |
I guess this is the wrong way to do it, but I can't find the answer in
the documentation.
So how can I replace this failed disk in cephadm?
Kai Stian Olstad
I have by mistake re-installed the OS of an OSD node of my Octopus cluster (managed by cephadm). Luckily the OSD data is on a separate disk and did not get affected by the re-install.
Now I have the following state:
1 stray daemon(s) not managed by cephadm
1 osds down
1 host (1 osds) down
To fix that I tried to run:
# ceph orch daemon add osd ceph1f:/dev/sda
Created no osd(s) on host ceph1f; already created?
That did not work, so I tried:
# ceph cephadm osd activate ceph1f
no valid command found; 10 closest matches:
Error EINVAL: invalid command
Did not work either. So I wanted to ask: how can I "adopt" an OSD disk back into my cluster?
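For what it's worth, "ceph cephadm osd activate" was only added in Pacific, which would explain the EINVAL on Octopus. A hedged sketch of one approach on the reinstalled host (the IDs below are placeholders, to be read from the ceph-volume output):

# cephadm ceph-volume lvm list
# cephadm deploy --fsid <cluster_fsid> --name osd.<id> --osd-fsid <osd_fsid>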
Thanks for your help.
After scaling the number of MDS daemons down, we now have a daemon stuck in the
"up:stopping" state. The documentation says it can take several minutes to stop the
daemon, but it has been stuck in this state for almost a full day. According to
the "ceph fs status" output attached below, it still holds information about 2
inodes, which we assume is the reason why it cannot stop completely.
Does anyone know what we can do to finally stop it?
cephfs - 71 clients
RANK STATE MDS ACTIVITY DNS INOS
0 active ceph-mon-01 Reqs: 0 /s 15.7M 15.4M
1 active ceph-mon-02 Reqs: 48 /s 19.7M 17.1M
2 stopping ceph-mon-03 0 2
POOL TYPE USED AVAIL
cephfs_metadata metadata 652G 185T
cephfs_data data 1637T 539T
MDS version: ceph version 15.2.11 (e3523634d9c2227df9af89a4eac33d16738c49cb) octopus (stable)
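In case it helps, the admin socket on the node hosting the stopping rank can show what it is still waiting on (a sketch, daemon name taken from the status output above):

# ceph daemon mds.ceph-mon-03 ops
# ceph daemon mds.ceph-mon-03 dump_blocked_ops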
Hi! Is it (technically) possible to instruct CephFS to store files < 1MiB on a (replicated) pool
and the other files on another (EC) pool?
And going further, is it possible to make the same kind of decision based on the path of the file?
(Let's say critical files with names matching r"/critical_path/critical_.*"; I want them in a 6x replicated SSD pool.)
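Path-based placement is possible with CephFS file layouts (xattrs on directories); size-based placement is not, since the data pool is chosen at file creation, before the size is known. A sketch of the path-based part (pool name and mount point are assumptions):

# ceph fs add_data_pool cephfs ssd-replica-pool
# setfattr -n ceph.dir.layout.pool -v ssd-replica-pool /mnt/cephfs/critical_path

New files created under that directory then land in the given pool; existing files keep the layout they were created with.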
Could you share the output of
lsblk -o name,rota,size,type
from the affected osd node?
My spec file is for a tiny lab cluster; in your case the db drive size
should be something like '5T:6T' to specify a range.
How large are the HDDs? Also, maybe you should use the option
'filter_logic: AND', but I'm not sure if that's already the default; I
remember there were issues in Nautilus because the default was OR. I
tried this just recently with a similar version, I believe it was
15.2.8, and it worked for me, but again, it's just a tiny virtual lab
cluster.
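A minimal spec sketch along those lines (host pattern and sizes are assumptions for illustration):

service_type: osd
service_id: hdd
placement:
  host_pattern: '*'
data_devices:
  rotational: 1
db_devices:
  rotational: 0
  size: '5T:6T'
filter_logic: AND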
Zitat von Kai Stian Olstad <ceph+list(a)olstad.com>:
> On 26.05.2021 11:16, Eugen Block wrote:
>> Yes, the LVs are not removed automatically, you need to free up the
>> VG, there are a couple of ways to do so, for example remotely:
>> pacific1:~ # ceph orch device zap pacific4 /dev/vdb --force
>> or directly on the host with:
>> pacific1:~ # cephadm ceph-volume lvm zap --destroy /dev/<CEPH_VG>/<DB_LV>
> I used the cephadm command and deleted the LV, and the VG now has free space
> # vgs | egrep "VG|dbs"
> VG                                                  #PV #LV #SN Attr   VSize  VFree
> ceph-block-dbs-563432b7-f52d-4cfe-b952-11542594843b   3  14   0 wz--n- <5.24t 357.74g
> But it doesn't seem to be able to use it, because it can't find anything
> # ceph orch apply osd -i hdd.yml --dry-run
> OSDSPEC PREVIEWS
> |SERVICE |NAME |HOST |DATA |DB |WAL |
> I tried adding size as you have in your configuration
> rotational: 0
> size: '30G:'
> Still it was unable to create the OSD.
> If I removed the ':' so it is an exact size of 30GB, it did find the
> disk, but the DB is not placed on an SSD since I do not have one of
> exactly 30 GB
> OSDSPEC PREVIEWS
> |SERVICE |NAME |HOST |DATA |DB |WAL |
> |osd |hdd |pech-hd-7 |/dev/sdt |- |- |
> To me it looks like cephadm can't find/use the free space on the VG
> to create a new LV for the OSD.
> Kai Stian Olstad