Hello guys!
I would like to ask if somebody has already experienced a similar
situation. We have a new cluster with 5 nodes with the following setup:
- 128 GB of RAM
- 2 CPUs: Intel(R) Xeon Silver 4210R
- 1 NVMe of 2 TB for RocksDB caching
- 5 HDDs of 14 TB
- 1 dual-port 25 Gb/s NIC in bond mode.
We are starting with a single dual-port NIC (the bond has 50 Gb/s in total).
The design allows a second NIC to be added and a second bond to be created
later, onto which we intend to offload the cluster network. For that reason
we have already configured separate VLANs and networks for Ceph's public and
cluster traffic.
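For reference, that split would eventually be expressed in ceph.conf roughly
like this (the subnets below are placeholders, not our real ones):

[global]
    public_network  = 192.168.10.0/24
    cluster_network = 192.168.20.0/24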
We are using Ubuntu 20.04 with Ceph Octopus. It is a standard deployment of
the kind we are used to. During our initial validation and evaluation of the
cluster, we are reaching write speeds between 250-300 MB/s, which in our
experience is in the ballpark for this kind of setup (HDDs with the NVMe as
RocksDB cache). The issue is read performance: while reading, we barely hit
the 100 MB/s mark, whereas we would expect at least something similar to the
write speed. These tests are being performed on a pool with a replication
factor of 3.
We have already checked the disks, and they all seem to be reading just
fine. The network does not seem to be the bottleneck either (checked with
atop while reading/writing to the cluster).
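For illustration, a read benchmark of this kind can be run with rados bench
along these lines (the pool name is only an example, not our real pool):

rados bench -p testpool 60 write --no-cleanup   # populate objects, measure writes
rados bench -p testpool 60 seq                  # sequential read test
rados bench -p testpool 60 rand                 # random read test
rados -p testpool cleanup                       # remove the benchmark objects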
Have you guys ever encountered similar situations? Do you have any tips for
us to proceed with the troubleshooting?
We suspect that we are missing some small tuning detail, which is affecting
the read performance only, but so far we could not pinpoint it. Any help
would be much appreciated :)
Hello ceph community,
We have a Ceph cluster (Proxmox based) which is HDD-based. We’ve had some performance and “slow MDS” issues while doing VM/CT backups from the Proxmox cluster, especially when rebalancing is going on at the same time.
My thought is that one of the following is going to improve performance / responsiveness:
1. Add an M.2 drive for DB store on each node
2. Migrate the cephfs metadata pool to SSDs
We have ~25 nodes with ~3 OSDs per node.
(1) is a lot of work and will cost more.
(2) seems more risky (to me) since the metadata pool would have to be migrated (potential loss in transit?)
Which one of the 2 solutions above will give us more bang for the buck, or just plain better performance? I would hate to implement (1) to find out that another solution would’ve been better.
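For option (2), my understanding is that the move would look roughly like
this, assuming the SSD OSDs carry the "ssd" device class (the rule and pool
names below are just examples, not necessarily ours):

ceph osd crush rule create-replicated replicated-ssd default host ssd
ceph osd pool set cephfs_metadata crush_rule replicated-ssd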
Any other solutions that I haven’t thought of?
Thank you!
George
Hi,
I want to:
1) copy a snapshot to an image,
2) without copying snapshots,
3) with no dependency after the copy,
4) all in image format 2.
In that case, is rbd cp the same as rbd clone + rbd flatten?
I ran some tests and it seems so, but I want to confirm in case I am missing anything.
Also, it seems cp is a bit faster than clone + flatten; is that true?
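For clarity, these are the two paths being compared (pool, image and
snapshot names are just examples):

# path 1: direct copy of a snapshot into a new image
rbd cp rbd/src@snap1 rbd/dst-copy

# path 2: clone the snapshot, then flatten to remove the parent dependency
rbd snap protect rbd/src@snap1
rbd clone rbd/src@snap1 rbd/dst-clone
rbd flatten rbd/dst-clone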
Thanks!
Tony
We set up a small Ceph cluster about 6 months ago with just 6x 200GB OSDs
and one EC 4x2 pool. When we created that pool, we enabled pg_autoscale.
The OSDs stayed pretty well balanced.
After our developers released a new "feature" that caused the storage to
balloon up to over 80%, we added another 6x 200GB OSDs. When we did that,
we looked at the number of PGs for that pool, and found that there was only
1 for the rgw.data and rgw.log pools, and "osd pool autoscale-status"
doesn't return anything, so it looks like that hasn't been working. The
rebalance operation was extremely slow, and wasn't balancing out osd.0, so
we bumped up the PGs for the rgw.data pool to 16. All the OSDs except osd.0
balanced out quickly, but that one OSD's utilization keeps climbing, and the
number of misplaced objects is increasing rather than decreasing. We set
noscrub and nodeep-scrub so scrubbing wouldn't slow down the process.
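Roughly, those changes amounted to commands of this form (the pool name is
taken from the ceph df output further down):

ceph osd pool set charlotte.rgw.buckets.data pg_num 16
ceph osd set noscrub
ceph osd set nodeep-scrub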
At this point, I don't want to do any more tuning to this cluster until we
can get it back to a healthy state, but it's not fixing itself. I'm open to
any ideas.
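For what it's worth, the autoscaler can be checked and, if the module itself
is off, re-enabled like this (just a guess at what might be missing here):

ceph osd pool autoscale-status
ceph mgr module enable pg_autoscaler
ceph osd pool set charlotte.rgw.buckets.data pg_autoscale_mode on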
Here's the output of ceph -s:
  cluster:
    id:     159d23e4-2a36-11ed-8b6e-fd27d573fa65
    health: HEALTH_WARN
            1 pools have many more objects per pg than average
            noscrub,nodeep-scrub flag(s) set
            1 backfillfull osd(s)
            Low space hindering backfill (add storage if this doesn't resolve itself): 12 pgs backfill_toofull
            7 pool(s) backfillfull

  services:
    mon: 3 daemons, quorum ceph3,ceph5,ceph6 (age 6h)
    mgr: ceph5.ksxevx(active, since 23h), standbys: ceph4.frkyyl, ceph6.slvpzl
    osd: 12 osds: 12 up (since 11h), 12 in (since 11h); 12 remapped pgs
         flags noscrub,nodeep-scrub
    rgw: 3 daemons active (3 hosts, 1 zones)

  data:
    pools:   7 pools, 161 pgs
    objects: 28.61M objects, 211 GiB
    usage:   1.5 TiB used, 834 GiB / 2.3 TiB avail
    pgs:     91779228/171665865 objects misplaced (53.464%)
             149 active+clean
             12 active+remapped+backfill_toofull

  io:
    client: 11 KiB/s rd, 61 KiB/s wr, 11 op/s rd, 27 op/s wr

  progress:
    Global Recovery Event (23h)
      [=========================...] (remaining: 115m)
ceph df:
--- RAW STORAGE ---
CLASS  SIZE     AVAIL    USED     RAW USED  %RAW USED
ssd    2.3 TiB  834 GiB  1.5 TiB  1.5 TiB       65.24
TOTAL  2.3 TiB  834 GiB  1.5 TiB  1.5 TiB       65.24

--- POOLS ---
POOL                         ID  PGS  STORED   OBJECTS  USED     %USED  MAX AVAIL
.mgr                          1    1  897 KiB        2  2.6 MiB   0.18    479 MiB
.rgw.root                     2   32  7.1 KiB       18  204 KiB   0.01    479 MiB
charlotte.rgw.log             3   32   27 KiB      347  2.0 MiB   0.14    479 MiB
charlotte.rgw.control         4   32      0 B        9      0 B      0    479 MiB
charlotte.rgw.meta            5   32  9.7 KiB       16  167 KiB   0.01    479 MiB
charlotte.rgw.buckets.data    6   16  734 GiB   28.61M  1.1 TiB  99.87    958 MiB
charlotte.rgw.buckets.index   7   16   16 GiB      691   47 GiB  97.12    479 MiB
ceph osd tree:
ID CLASS WEIGHT TYPE NAME STATUS REWEIGHT PRI-AFF
-1 2.34357 root default
-3 0.39059 host ceph1
0 ssd 0.19530 osd.0 up 0.89999 1.00000
1 ssd 0.19530 osd.1 up 1.00000 1.00000
-5 0.39059 host ceph2
6 ssd 0.19530 osd.6 up 1.00000 1.00000
7 ssd 0.19530 osd.7 up 1.00000 1.00000
-7 0.39059 host ceph3
2 ssd 0.19530 osd.2 up 1.00000 1.00000
8 ssd 0.19530 osd.8 up 1.00000 1.00000
-9 0.39059 host ceph4
3 ssd 0.19530 osd.3 up 1.00000 1.00000
9 ssd 0.19530 osd.9 up 1.00000 1.00000
-11 0.39059 host ceph5
4 ssd 0.19530 osd.4 up 1.00000 1.00000
10 ssd 0.19530 osd.10 up 1.00000 1.00000
-13 0.39059 host ceph6
5 ssd 0.19530 osd.5 up 1.00000 1.00000
11 ssd 0.19530 osd.11 up 1.00000 1.00000
ceph osd df:
ID  CLASS  WEIGHT   REWEIGHT  SIZE     RAW USE  DATA     OMAP     META     AVAIL    %USE   VAR   PGS  STATUS
 0  ssd    0.19530  0.89999   200 GiB  190 GiB  130 GiB   12 GiB   48 GiB   10 GiB  94.94  1.46   52  up
 1  ssd    0.19530  1.00000   200 GiB  7.3 GiB  9.8 MiB  6.4 GiB  858 MiB  193 GiB   3.64  0.06   42  up
 6  ssd    0.19530  1.00000   200 GiB  148 GiB   97 GiB   14 GiB   38 GiB   52 GiB  74.06  1.14   51  up
 7  ssd    0.19530  1.00000   200 GiB  133 GiB   97 GiB    2 KiB   35 GiB   67 GiB  66.47  1.02   43  up
 2  ssd    0.19530  1.00000   200 GiB  134 GiB   97 GiB   12 KiB   37 GiB   66 GiB  66.94  1.03   40  up
 8  ssd    0.19530  1.00000   200 GiB  136 GiB   97 GiB  2.2 GiB   36 GiB   64 GiB  67.85  1.04   40  up
 3  ssd    0.19530  1.00000   200 GiB  134 GiB   97 GiB    4 KiB   37 GiB   66 GiB  66.95  1.03   41  up
 9  ssd    0.19530  1.00000   200 GiB  138 GiB   97 GiB  5.2 GiB   36 GiB   62 GiB  69.19  1.06   49  up
 4  ssd    0.19530  1.00000   200 GiB  137 GiB   97 GiB  4.3 GiB   36 GiB   63 GiB  68.62  1.05   42  up
10  ssd    0.19530  1.00000   200 GiB  139 GiB   97 GiB  5.5 GiB   36 GiB   61 GiB  69.31  1.06   48  up
 5  ssd    0.19530  1.00000   200 GiB  134 GiB   97 GiB    7 KiB   38 GiB   66 GiB  67.13  1.03   34  up
11  ssd    0.19530  1.00000   200 GiB  136 GiB   97 GiB  2.2 GiB   36 GiB   64 GiB  67.80  1.04   49  up
                    TOTAL     2.3 TiB  1.5 TiB  1.1 TiB   52 GiB  414 GiB  834 GiB  65.24
MIN/MAX VAR: 0.06/1.46  STDDEV: 19.95
Thanks in advance if anyone has any suggestions.
Hi Ceph users!
An interesting EC setup that I hadn't thought about before has been proposed to me.
Scenario is: we have two server rooms and want to store ~4 PiB with the
ability to lose 1 server room without loss of data or RW availability.
For context, performance is not needed (mostly cold storage, used as
a big filesystem).
The idea is to use EC 8+12 over 24 servers (12 in each server room), so
if we lose 1 room we still have half of the EC parts (10/20) and are
able to lose 2 more servers before reaching the point where we lose data.
I find this pretty elegant in a two-site context, as the storage
efficiency is 40% (better than the 33% of three-times replication) and
the redundancy is good.
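As a sketch, the profile itself would be created roughly like this (the
names are placeholders, and a custom CRUSH rule would still be needed to
guarantee that exactly 10 chunks land in each room):

ceph osd erasure-code-profile set ec-8-12 k=8 m=12 crush-failure-domain=host
ceph osd pool create bigfs_data 2048 2048 erasure ec-8-12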
What do you think of this setup? Have you ever used EC profiles with m > k?
Thanks for sharing your thoughts!
Cheers,
Fabien
Good morning.
I just set up a Ceph environment with 9 storage nodes,
and I mounted CephFS on a 10th, independent node.
I executed a fio workload and got a 3 Mb/s throughput.
When I re-executed the same workload after a certain time, I got
a 9 Mb/s throughput this time.
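For context, the job was of this general shape (the parameters here are
only illustrative, not my exact job file):

fio --name=cephfs-test --directory=/mnt/cephfs --rw=randwrite \
    --bs=4k --size=1g --numjobs=1 --ioengine=libaio --direct=1

Running with --direct=1 at least rules out client-side page-cache effects
between repeated runs.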
Do you know why this is happening?
Cordially,
--
Nguetchouang Ngongang Kevin
ENS de Lyon
https://perso.ens-lyon.fr/kevin.nguetchouang/
Hi,
I have set up a Ceph cluster with cephadm using the Docker backend.
I want to move /var/lib/docker to a separate device to get better
performance and less load on the OS device.
I tried that by stopping Docker, copying the content of /var/lib/docker to
the new device, and mounting the new device at /var/lib/docker.
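The steps were essentially of this form (device and mount point names are
placeholders):

systemctl stop docker
rsync -aHAX /var/lib/docker/ /mnt/newdisk/   # new device temporarily mounted at /mnt/newdisk
umount /mnt/newdisk
mount /dev/sdX1 /var/lib/docker              # plus a matching /etc/fstab entry
systemctl start docker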
The other containers started and continue to run as expected,
but the Ceph containers seem to be broken,
and I am not able to get them back into a working state.
I have tried to remove the host with `ceph orch host rm itcnchn-bb4067`
and re-add it, but it had no effect.
The strange thing is that 2 of the 4 containers come up as expected.
ceph orch ps itcnchn-bb4067
NAME                                  HOST            STATUS         REFRESHED  AGE  VERSION    IMAGE NAME               IMAGE ID      CONTAINER ID
crash.itcnchn-bb4067                  itcnchn-bb4067  running (18h)  10m ago    4w   15.2.7     docker.io/ceph/ceph:v15  2bc420ddb175  2af28c4571cf
mds.cephfs.itcnchn-bb4067.qzoshl      itcnchn-bb4067  error          10m ago    4w   <unknown>  docker.io/ceph/ceph:v15  <unknown>     <unknown>
mon.itcnchn-bb4067                    itcnchn-bb4067  error          10m ago    18h  <unknown>  docker.io/ceph/ceph:v15  <unknown>     <unknown>
rgw.ikea.dc9-1.itcnchn-bb4067.gtqedc  itcnchn-bb4067  running (18h)  10m ago    4w   15.2.7     docker.io/ceph/ceph:v15  2bc420ddb175  00d000aec32b
The Docker logs from the active manager do not say much about what is wrong:
debug 2021-01-05T09:57:52.537+0000 7fdb69691700 0 log_channel(cephadm) log [INF] : Reconfiguring mds.cephfs.itcnchn-bb4067.qzoshl (unknown last config time)...
debug 2021-01-05T09:57:52.541+0000 7fdb69691700 0 log_channel(cephadm) log [INF] : Reconfiguring daemon mds.cephfs.itcnchn-bb4067.qzoshl on itcnchn-bb4067
debug 2021-01-05T09:57:52.973+0000 7fdb64e88700 0 log_channel(cluster) log [DBG] : pgmap v347: 241 pgs: 241 active+clean; 18 GiB data, 50 GiB used, 52 TiB / 52 TiB avail; 18 KiB/s rd, 78 KiB/s wr, 24 op/s
debug 2021-01-05T09:57:53.085+0000 7fdb69691700 0 log_channel(cephadm) log [INF] : Reconfiguring mon.itcnchn-bb4067 (unknown last config time)...
debug 2021-01-05T09:57:53.085+0000 7fdb69691700 0 log_channel(cephadm) log [INF] : Reconfiguring daemon mon.itcnchn-bb4067 on itcnchn-bb4067
debug 2021-01-05T09:57:53.625+0000 7fdb69691700 0 log_channel(cephadm) log [INF] : Reconfiguring rgw.ikea.dc9-1.itcnchn-bb4067.gtqedc (unknown last config time)...
debug 2021-01-05T09:57:53.629+0000 7fdb69691700 0 log_channel(cephadm) log [INF] : Reconfiguring daemon rgw.ikea.dc9-1.itcnchn-bb4067.gtqedc on itcnchn-bb4067
debug 2021-01-05T09:57:54.141+0000 7fdb69691700 0 log_channel(cephadm) log [INF] : Reconfiguring crash.itcnchn-bb4067 (unknown last config time)...
debug 2021-01-05T09:57:54.141+0000 7fdb69691700 0 log_channel(cephadm) log [INF] : Reconfiguring daemon crash.itcnchn-bb4067 on itcnchn-bb4067
- Karsten