Just shooting in the dark here, but you may be affected by a similar issue I
had a while back; it was discussed here:
https://lists.ceph.io/hyperkitty/list/ceph-users@ceph.io/thread/ZOPBOY6XQOY…
In short, the default for bluefs_buffered_io was changed to false in a
recent Nautilus release, and I guess the same change was applied to newer
releases. That led to severe performance issues and similar symptoms, i.e.
lower memory usage on OSD nodes. Worth checking out.
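If you want to rule that out, it is easy to check what the OSDs are actually
running with and, if needed, flip the setting back. Roughly (this assumes you
use the central config database; on some releases a change to
bluefs_buffered_io only takes effect after an OSD restart):

  ceph config get osd bluefs_buffered_io            (cluster-wide default)
  ceph daemon osd.0 config get bluefs_buffered_io   (what a running OSD uses, run on its host)
  ceph config set osd bluefs_buffered_io true       (switch back to buffered IO)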
Of course, it may be something completely different. You should look into
monitoring all your OSDs separately, checking their utilization, await, and
other parameters, while comparing them to pre-upgrade values, to
find the root cause.
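For a quick per-OSD comparison, something like the following on the cluster
and on each OSD node should show whether a single disk or host is the outlier:

  ceph osd perf   (commit/apply latency per OSD)
  iostat -x 5     (utilization and await per device)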
Mon, 2 Nov 2020 at 11:55, Marc Roos <M.Roos(a)f1-outsourcing.eu>:
>
> I have been advocating for a long time for publishing test data from a
> basic test cluster against different ceph releases: just a basic ceph
> cluster that covers most configs and runs the same tests, so you can
> compare ceph performance alone. That would mean a lot for smaller
> companies that do not have access to a good test environment. I have
> also asked about this at a ceph seminar.
>
>
>
> -----Original Message-----
> From: Martin Rasmus Lundquist Hansen [mailto:hansen@imada.sdu.dk]
> Sent: Monday, November 02, 2020 7:53 AM
> To: ceph-users(a)ceph.io
> Subject: [ceph-users] Seriously degraded performance after update to
> Octopus
>
> Two weeks ago we updated our Ceph cluster from Nautilus (14.2.0) to
> Octopus (15.2.5), an update that was long overdue. We used the Ansible
> playbooks to perform a rolling update and, except for a few minor
> problems with the Ansible code, the update went well. The Ansible
> playbooks were also used for setting up the cluster in the first place.
> Before updating the Ceph software we also performed a full update of
> CentOS and the Linux kernel (this part of the update had already been
> tested on one of the OSD nodes the week before and we didn't notice any
> problems).
>
> However, after the update we are seeing a serious decrease in
> performance, more than a factor of 10x in some cases. I spent a week
> trying to come up with an explanation or solution, but I am completely
> blank. Independently of Ceph I tested the network performance and the
> performance of the OSD disks, and I am not really seeing any problems
> here.
>
> The specifications of the cluster are:
> - 3x Monitor nodes running mgr+mon+mds (Intel(R) Xeon(R) Silver 4108 CPU
> @ 1.80GHz, 16 cores, 196 GB RAM)
> - 14x OSD nodes, each with 18 HDDs and 1 NVME (Intel(R) Xeon(R) Gold
> 6126 CPU @ 2.60GHz, 24 cores, 384 GB RAM)
> - CentOS 7.8 and Kernel 5.4.51
> - 100 Gbps Infiniband
>
> We are collecting various metrics using Prometheus, and on the OSD nodes
> we are seeing some clear differences when it comes to CPU and Memory
> usage. I collected some graphs here: http://mitsted.dk/ceph . After the
> update the system load is highly reduced, there is almost no longer any
> iowait for the CPU, and the free memory is no longer used for Buffers (I
> can confirm that the changes in these metrics are not due to the update
> of CentOS or the Linux kernel). All in all, now the OSD nodes are almost
> completely idle all the time (and so are the monitors). On the linked
> page I also attached two RADOS benchmarks. The first benchmark was
> performed when the cluster was initially configured, and the second is
> the same benchmark after the update to Octopus. When comparing these
> two, it is clear that the performance has changed dramatically. For
> example, in the write test the bandwidth is reduced from 320 MB/s to 21
> MB/s and the number of IOPS has also dropped significantly.
>
> I temporarily tried to disable the firewall and SELinux on all nodes to
> see if it made any difference, but it didn't look like it (I did not
> restart any services during this test, I am not sure if that could be
> necessary).
>
> Any suggestions for finding the root cause of this performance decrease
> would be greatly appreciated.
Hi everyone
We have been facing RGW outages recently, with RGW returning HTTP 503: first for a few requests, then for most, then for all of them, over the course of 1-2 hours. This seems to have started after we updated from 15.2.4 to 15.2.5.
The line that accompanies these outages in the log is the following:
s3:list_bucket Scheduling request failed with -2218
It first pops up a few times here and there, until it eventually applies to all requests. It seems to indicate that the throttler has reached the limit of open connections.
As we run a pair of HAProxy instances in front of RGW, which limit the number of connections to the two RGW instances to 400, this limit should never be reached. We do use RGW metadata sync between the instances, which could account for some extra connections, but if I look at open TCP connections between the instances I can count no more than 20 at any given time.
I also noticed that some connections in the RGW log seem to never complete. That is, I can find a ‘starting new request’ line, but no associated ‘req done’ or ‘beast’ line.
I don’t think there are any hung connections around, as they are killed by HAProxy after a short timeout.
Looking at the code, it seems as if the throttler in use (SimpleThrottler) eventually reaches the maximum count of 1024 connections (outstanding_requests) and never recovers. I believe that the request_complete function is not called in all cases, but I am not familiar with the Ceph codebase, so I am not sure.
See https://github.com/ceph/ceph/blob/cc17681b478594aa39dd80437256a54e388432f0/…
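If the counter really is leaking, a possible stop-gap (just an assumption on
my side, not a proper fix) would be to raise the scheduler limit, which as far
as I can tell is rgw_max_concurrent_requests with a default of 1024:

  ceph config set client.rgw rgw_max_concurrent_requests 4096

(or set it in the exact client.rgw.<name> section). That obviously only buys
time if requests are never marked complete.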
Does anyone see the same phenomenon? Could this be a bug in the request handling of RGW, or am I wrong in my assumptions?
For now we’re just restarting our RGWs regularly, which seems to keep the problem at bay.
Thanks for any hints.
Denis
I've been struggling with this one for a few days now. We had an OSD report as near full a few days ago. Had this happen a couple of times before and a reweight-by-utilization has sorted it out in the past. Tried the same again but this time we ended up with a couple of pgs in a state of backfill_toofull and a handful of misplaced objects as a result.
Tried doing the reweight a few more times and it's been moving data around. We did have another osd trigger the near full alert but running the reweight a couple more times seems to have moved some of that data around a bit better. However, the original near_full osd doesn't seem to have changed much and the backfill_toofull pgs are still there. I'd keep doing the reweight-by-utilization but I'm not sure if I'm heading down the right path and if it will eventually sort it out.
We have 14 pools, but the vast majority of data resides in just one of those pools (pool 20). The pgs in the backfill state are in pool 2 (as far as I can tell). That particular pool is used for some cephfs stuff and has a handful of large files in there (not sure if this is significant to the problem).
Overall, our utilization is showing as 55.13%, but some of our OSDs are at around 76% in use, with the problem OSD sitting at 85.02%. Right now I'm just not sure what the proper corrective action is. The last couple of reweights I've run have been a bit more targeted, in that I've set them to only act on two OSDs at a time. If I run a test-reweight targeting only one OSD, it does say it will reweight OSD 9 (the one at 85.02%). I gather this will move data away from that OSD and potentially get it below the threshold. However, at one point in the past couple of days no OSDs were shown as near full, yet the two pgs in backfill_toofull didn't change, so I'm not sure continually reweighting is going to solve this issue.
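For reference, the reweight commands I've been running look roughly like this
(Jewel syntax, as far as I understand it: overload threshold 120%, max weight
change 0.05, at most 2 OSDs touched per run):

  ceph osd test-reweight-by-utilization 120 0.05 2
  ceph osd reweight-by-utilization 120 0.05 2

I'm also wondering whether the 85% figure is simply the default
osd_backfill_full_ratio (0.85), i.e. whether osd.9 just needs to drop below
that ratio before the two pgs can backfill.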
I'm a long way from knowledgeable on Ceph, so I'm not really sure what information is useful here. Here's a bit of what I'm seeing; I can provide anything else that might help.
Basically, we have a three node cluster, but only two have OSDs. The third is there simply to enable a quorum to be established. The OSDs are evenly spread across these two nodes and the configuration of each is identical. We are running Jewel and are not in a position to upgrade at this stage.
# ceph --version
ceph version 10.2.11 (e4b061b47f07f583c92a050d9e84b1813a35671e)
# ceph health detail
HEALTH_WARN 2 pgs backfill_toofull; 2 pgs stuck unclean; recovery 33/62099566 objects misplaced (0.000%); 1 near full osd(s)
pg 2.52 is stuck unclean for 201822.031280, current state active+remapped+backfill_toofull, last acting [17,3]
pg 2.18 is stuck unclean for 202114.617682, current state active+remapped+backfill_toofull, last acting [18,2]
pg 2.18 is active+remapped+backfill_toofull, acting [18,2]
pg 2.52 is active+remapped+backfill_toofull, acting [17,3]
recovery 33/62099566 objects misplaced (0.000%)
osd.9 is near full at 85%
# ceph osd df
ID WEIGHT REWEIGHT SIZE USE AVAIL %USE VAR PGS
2 1.37790 1.00000 1410G 842G 496G 59.75 1.08 33
3 1.37790 0.45013 1410G 1079G 259G 76.49 1.39 21
4 1.37790 0.95001 1410G 1086G 253G 76.98 1.40 44
5 1.37790 1.00000 1410G 617G 722G 43.74 0.79 43
6 1.37790 0.65009 1410G 616G 722G 43.69 0.79 39
7 1.37790 0.95001 1410G 495G 844G 35.10 0.64 40
8 1.37790 1.00000 1410G 732G 606G 51.93 0.94 52
9 1.37790 0.70007 1410G 1199G 139G 85.02 1.54 37
10 1.37790 1.00000 1410G 611G 727G 43.35 0.79 41
11 1.37790 0.75006 1410G 495G 843G 35.11 0.64 32
0 1.37790 1.00000 1410G 731G 608G 51.82 0.94 43
12 1.37790 1.00000 1410G 851G 487G 60.36 1.09 44
13 1.37790 1.00000 1410G 378G 960G 26.82 0.49 38
14 1.37790 1.00000 1410G 969G 370G 68.68 1.25 37
15 1.37790 1.00000 1410G 724G 614G 51.35 0.93 35
16 1.37790 1.00000 1410G 491G 847G 34.84 0.63 43
17 1.37790 1.00000 1410G 862G 476G 61.16 1.11 50
18 1.37790 0.80005 1410G 1083G 255G 76.78 1.39 26
19 1.37790 0.65009 1410G 963G 375G 68.29 1.24 23
20 1.37790 1.00000 1410G 724G 614G 51.38 0.93 42
TOTAL 28219G 15557G 11227G 55.13
MIN/MAX VAR: 0.49/1.54 STDDEV: 15.57
# ceph pg ls backfill_toofull
pg_stat objects mip degr misp unf bytes log disklog state state_stamp v reported up up_primary acting acting_primary last_scrub scrub_stamp last_deep_scrub deep_scrub_stamp
2.18 9 0 0 18 0 0 3653 3653 active+remapped+backfill_toofull 2020-10-29 05:31:20.429912 610'549153 656:390372 [9,12] 9 [18,2] 18 594'547482 2020-10-25 20:28:39.680744 594'543841 2020-10-21 21:21:33.092868
2.52 15 0 0 15 0 0 4883 4883 active+remapped+backfill_toofull 2020-10-29 05:31:28.277898 652'502085 656:367288 [17,9] 17 [17,3] 17 594'499108 2020-10-26 11:06:48.417825 594'499108 2020-10-26 11:06:48.417825
pool : 17 18 19 11 20 21 12 13 0 14 1 15 2 16 | SUM
--------------------------------------------------------------------------------------------------------------------------------
osd.4 3 0 0 0 9 2 0 0 12 1 9 0 7 1 | 44
osd.17 1 0 0 0 7 3 1 0 8 1 17 1 11 0 | 50
osd.18 0 0 0 0 9 0 0 0 4 0 7 0 5 0 | 25
osd.5 0 0 0 2 5 1 1 0 5 0 16 0 11 2 | 43
osd.6 0 1 0 1 5 2 0 0 9 0 13 1 7 0 | 39
osd.19 0 0 1 0 8 2 0 1 2 0 6 0 3 0 | 23
osd.7 0 0 0 0 4 1 1 0 3 0 12 0 19 0 | 40
osd.8 0 1 0 0 6 3 0 2 10 1 13 1 15 0 | 52
osd.9 1 0 2 0 10 2 0 0 4 1 6 1 10 0 | 37
osd.10 0 0 1 1 5 2 0 1 7 0 12 0 11 1 | 41
osd.20 1 0 0 0 6 1 0 1 7 0 8 1 17 0 | 42
osd.11 0 0 0 0 4 1 1 1 5 0 11 0 9 0 | 32
osd.12 0 0 1 1 7 1 0 0 5 1 12 1 14 1 | 44
osd.13 0 2 0 0 3 1 0 0 10 1 11 0 10 0 | 38
osd.0 0 1 0 1 6 3 0 1 7 0 11 0 13 0 | 43
osd.14 1 0 0 0 8 1 1 0 4 1 12 0 9 0 | 37
osd.15 1 0 2 1 6 1 1 0 8 0 7 0 6 2 | 35
osd.2 0 2 1 0 7 2 1 0 7 1 4 1 6 0 | 32
osd.3 0 0 0 0 9 0 0 0 2 0 4 0 5 0 | 20
osd.16 0 1 0 1 4 3 1 1 9 0 9 1 12 1 | 43
--------------------------------------------------------------------------------------------------------------------------------
SUM : 8 8 8 8 128 32 8 8 128 8 200 8 200 8 |
Hi:
I have this ceph status:
-----------------------------------------------------------------------------
cluster:
id: 039bf268-b5a6-11e9-bbb7-d06726ca4a78
health: HEALTH_WARN
noout flag(s) set
1 osds down
Reduced data availability: 191 pgs inactive, 2 pgs down, 35
pgs incomplete, 290 pgs stale
5 pgs not deep-scrubbed in time
7 pgs not scrubbed in time
327 slow ops, oldest one blocked for 233398 sec, daemons
[osd.12,osd.36,osd.5] have slow ops.
services:
mon: 1 daemons, quorum fond-beagle (age 23h)
mgr: fond-beagle(active, since 7h)
osd: 48 osds: 45 up (since 95s), 46 in (since 8h); 4 remapped pgs
flags noout
data:
pools: 7 pools, 2305 pgs
objects: 350.37k objects, 1.5 TiB
usage: 3.0 TiB used, 38 TiB / 41 TiB avail
pgs: 6.681% pgs unknown
1.605% pgs not active
1835 active+clean
279 stale+active+clean
154 unknown
22 incomplete
10 stale+incomplete
2 down
2 remapped+incomplete
1 stale+remapped+incomplete
--------------------------------------------------------------------------------------------
How can I fix all of the unknown, incomplete, remapped+incomplete, etc. PGs? I
don't care if I need to remove PGs.
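If it helps, I can also post the output of the usual diagnostics, for example
(2.7 below is just a placeholder pg id):

  ceph health detail
  ceph pg ls incomplete
  ceph pg 2.7 query
  ceph osd tree

but what I'm mainly after is the right order of operations to get the cluster
healthy again, even at the cost of losing those PGs.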
Hi!
I have an interesting question regarding SSDs and I'll try to ask about it here.
During my testing of Ceph & Vitastor & Linstor on servers equipped with Intel D3-4510 SSDs I discovered a very funny problem with these SSDs:
They don't like overwrites of the same sector.
That is, if you overwrite the same sector over and over again you get very low iops:
$ fio -direct=1 -rw=write -bs=4k -size=4k -loops=100000 -iodepth=1
write: IOPS=3142, BW=12.3MiB/s (12.9MB/s)(97.9MiB/7977msec)
And if you overwrite at least ~128k of other sectors between overwriting the same sector you get normal results:
$ fio -direct=1 -rw=write -bs=4k -size=128k -loops=100000 -iodepth=1
write: IOPS=20.8k, BW=81.4MiB/s (85.3MB/s)(543MiB/6675msec)
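(The command lines above are abbreviated and omit the job name and target; a
complete invocation would look roughly like this, with /dev/sdX standing in
for the SSD under test:

$ fio -name=test -filename=/dev/sdX -direct=1 -rw=write -bs=4k -size=4k -loops=100000 -iodepth=1

and the same with -size=128k for the second case.)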
This slowdown almost doesn't hurt Ceph, slightly hurts Vitastor (the impact was greater before I added a fix), and MASSIVELY hurts Linstor/DRBD9 because of its "bitmap".
So far I've only seen it on this particular SSD model. For example, Intel P4500, Micron 9300 Pro, and Samsung PM983 don't have this issue.
Do you have any contacts among Intel's SSD firmware people to ask about this bug-o-feature? :-)
--
With best regards,
Vitaliy Filippov
Hi,
we're using rbd for VM disk images and want to make consistent backups of
groups of them.
I know I can create a group and make consistent snapshots of all of them:
# rbd --version
ceph version 14.2.9 (581f22da52345dba46ee232b73b990f06029a2a0) nautilus
(stable)
# rbd create test_foo --size 1M
# rbd create test_bar --size 1M
# rbd group create test
# rbd group image add test test_foo
# rbd group image add test test_bar
# rbd group snap create test@1
But how can I export the individual image snapshots? I tried different
ways of addressing them, but nothing worked:
# rbd export test_foo@1 -
error setting snapshot context: (2) No such file or directory
# rbd export test_foo@test/1 -
rbd: error opening pool 'test_foo@test': (2) No such file or directory
# rbd export rbd/test_foo@test/1 -
error setting snapshot context: (2) No such file or directory
# rbd export test@1/test_foo -
rbd: error opening pool 'test@1': (2) No such file or directory
# rbd export rbd/test@1/test_foo -
rbd: error opening image test: (2) No such file or directory
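My assumption is that the group snapshot is stored against each image under an
internal snapshot name in a separate (group) namespace rather than as "1", so
listing the snapshots with --all (if 14.2.9 supports that flag the way newer
releases do) might reveal the name that would have to be passed to rbd export:

# rbd snap ls test_foo --all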
Am I missing something?
Kind regards,
Timo Weingärtner
System Administrator
--
ITscope GmbH
Ludwig-Erhard-Allee 20
D-76131 Karlsruhe
Tel: +49 721 62737637
Fax: +49 721 66499175
https://www.itscope.com
Commercial register: AG Mannheim, HRB 232782; registered office: Karlsruhe
Managing directors: Alexander Münkel, Benjamin Mund, Stefan Reger
Hi,
I already submitted a ticket: https://tracker.ceph.com/issues/47951
Maybe other people noticed this as well.
Situation:
- Cluster is running IPv6
- mon_host is set to a DNS entry
- DNS entry is a Round Robin with three AAAA-records
root@wido-standard-benchmark:~# ceph -s
unable to parse addrs in 'mon.objects.xx.xxx.net'
[errno 22] error connecting to the cluster
root@wido-standard-benchmark:~#
The relevant part of the ceph.conf:
[global]
auth_client_required = cephx
auth_cluster_required = cephx
auth_service_required = cephx
mon_host = mon.objects.xxx.xxx.xxx
ms_bind_ipv6 = true
This works fine with 14.2.11 and breaks under 14.2.12
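As a workaround I would expect listing the monitor addresses explicitly to
still parse, e.g. (documentation-prefix addresses as placeholders):

mon_host = [2001:db8::11], [2001:db8::12], [2001:db8::13]

but that obviously defeats the purpose of having a single Round Robin record.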
Anybody else seeing this as well?
Wido