Hi, Cephers.
I would like to hear your ideas about a strange situation we have in one of
our clusters.
It's a Luminous 12.2.12 cluster. Recently we added 3 nodes with 10x SSD OSDs
and dedicated them to the SSD pool for our OpenStack volumes. Initial
tests went well, IOPS were great, throughput was perfect - all good. Until
the first real usage arrived: very limited IOPS (~450), disk utilization
near 100% and throughput below 1 MB/s brought us to tears.
After some investigation we found that this situation only occurs when all
of the following conditions are met:
1. The disk is RBD (the test went fine from the same server with local disks)
2. The file system is XFS (no problems with ext4)
3. The write size is smaller than the file system block size
4. Only one fio job (numjobs=1) is used
When at least one of these conditions is not met, we get ~40k IOPS, great
throughput, etc. We ran tests with fio using different values, and the
pattern is quite clear: if the write size is 4 KB (same as the block size),
IOPS go up to 40k. If the write size is 3 KB, it is limited to ~450 IOPS,
and from that point it doesn't matter how small the write is - it's always
~450 IOPS. After changing the block size to 2 KB the situation is the same -
great speed until the write is smaller than 2 KB. If we raise the fio
parameter "numjobs" to 10 we get the maximum possible IOPS (~40k), which is
more than a simple 10x increase.
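For clarity, this is roughly the fio invocation for the slow case (a minimal
sketch, not our exact job file; the mount point, size and runtime are only
illustrative):

  # 3k writes on a 4k-block XFS file system on the RBD volume, single job
  fio --name=rbd-xfs-test --directory=/mnt/rbd-xfs --ioengine=libaio --direct=1 \
      --rw=randwrite --bs=3k --size=1G --numjobs=1 --iodepth=32 \
      --runtime=60 --time_based --group_reporting

With --bs=4k (matching the block size) or --numjobs=10 the same job reaches
~40k IOPS.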
Any ideas what is going on, and why these sub-block-size writes have such a
big impact on performance with XFS but cause no problems with ext4?
Thank you for all the ideas!
Arvydas
Hi,
We hit the following assert:
-10001> 2020-02-13 17:42:35.543 7f11b5669700 -1 /build/ceph-13.2.8/src/mds/MDCache.cc: In function 'MDRequestRef MDCache::request_get(metareqid_t)' thread 7f11b5669700 time 2020-02-13 17:42:35.545815
/build/ceph-13.2.8/src/mds/MDCache.cc: 9523: FAILED assert(p != active_requests.end())
ceph version 13.2.8 (5579a94fafbc1f9cc913a0f5d362953a5d9c3ae0) mimic (stable)
1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x14e) [0x7f11bd8e69de]
2: (()+0x287b67) [0x7f11bd8e6b67]
3: (MDCache::request_get(metareqid_t)+0x94) [0x560cde8bb214]
4: (Server::journal_close_session(Session*, int, Context*)+0x9dd) [0x560cde829d1d]
5: (Server::handle_client_session(MClientSession*)+0x1071) [0x560cde82b0f1]
6: (Server::dispatch(Message*)+0x30b) [0x560cde86f87b]
7: (MDSRank::handle_deferrable_message(Message*)+0x434) [0x560cde7e1664]
8: (MDSRank::_dispatch(Message*, bool)+0x89b) [0x560cde7f8c7b]
9: (MDSRankDispatcher::ms_dispatch(Message*)+0xa3) [0x560cde7f92e3]
10: (MDSDaemon::ms_dispatch(Message*)+0xd3) [0x560cde7d92b3]
11: (DispatchQueue::entry()+0xb92) [0x7f11bd9a9e52]
12: (DispatchQueue::DispatchThread::entry()+0xd) [0x7f11bda46e2d]
13: (()+0x76db) [0x7f11bd1d76db]
14: (clone()+0x3f) [0x7f11bc3bd88f]
Before we hit this assert there were a few kernel clients (5.3.0-26/28)
that were not playing nicely:
16:32 < bitrot> mds.mds1 [WRN] client.61994841 isn't responding to mclientcaps(revoke), ino 0x1003846ddc5 pending
pAsLsXsFscr issued pAsLsXsFscr, sent 62.342791 seconds ago
16:32 < bitrot> mon.mon1 [WRN] Health check failed: 1 clients failing to respond to capability release
(MDS_CLIENT_LATE_RELEASE)
We rebooted both clients. After that, one of them again had some slow
requests. We unmounted the file system, and shortly after that the MDS hit
the assert. Failover went fine this time.
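In case it is useful for others hitting MDS_CLIENT_LATE_RELEASE: the sessions
can be inspected, and if necessary evicted, roughly like this (a sketch; the
MDS name and client id are just the ones from the warning above):

  # list client sessions and the caps they hold on the active MDS
  ceph tell mds.mds1 client ls
  # evict a client that keeps failing to release caps
  ceph tell mds.mds1 client evict id=61994841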
This looks like issue https://tracker.ceph.com/issues/23059 ... but
that should already have been resolved. Is this the same issue, and/or a
regression?
We run 13.2.8.
Thanks,
Stefan
--
| BIT BV https://www.bit.nl/ Kamer van Koophandel 09090351
| GPG: 0xD14839C6 +31 318 648 688 / info(a)bit.nl
Hi,
the current output of ceph -s reports a warning:
2 slow ops, oldest one blocked for 347335 sec, mon.ld5505 has slow ops
This time is increasing.
root@ld3955:~# ceph -s
cluster:
id: 6b1b5117-6e08-4843-93d6-2da3cf8a6bae
health: HEALTH_WARN
9 daemons have recently crashed
2 slow ops, oldest one blocked for 347335 sec, mon.ld5505
has slow ops
services:
mon: 3 daemons, quorum ld5505,ld5506,ld5507 (age 3d)
mgr: ld5507(active, since 8m), standbys: ld5506, ld5505
mds: cephfs:2 {0=ld5507=up:active,1=ld5505=up:active} 2
up:standby-replay 3 up:standby
osd: 442 osds: 442 up (since 8d), 442 in (since 9d)
data:
pools: 7 pools, 19628 pgs
objects: 65.78M objects, 251 TiB
usage: 753 TiB used, 779 TiB / 1.5 PiB avail
pgs: 19628 active+clean
io:
client: 427 KiB/s rd, 22 MiB/s wr, 851 op/s rd, 647 op/s wr
The details are as follows:
root@ld3955:~# ceph health detail
HEALTH_WARN 9 daemons have recently crashed; 2 slow ops, oldest one
blocked for 347755 sec, mon.ld5505 has slow ops
RECENT_CRASH 9 daemons have recently crashed
mds.ld4464 crashed on host ld4464 at 2020-02-09 07:33:59.131171Z
mds.ld5506 crashed on host ld5506 at 2020-02-09 07:42:52.036592Z
mds.ld4257 crashed on host ld4257 at 2020-02-09 07:47:44.369505Z
mds.ld4464 crashed on host ld4464 at 2020-02-09 06:10:24.515912Z
mds.ld5507 crashed on host ld5507 at 2020-02-09 07:13:22.400268Z
mds.ld4257 crashed on host ld4257 at 2020-02-09 06:48:34.742475Z
mds.ld5506 crashed on host ld5506 at 2020-02-09 06:10:24.680648Z
mds.ld4465 crashed on host ld4465 at 2020-02-09 06:52:33.204855Z
mds.ld5506 crashed on host ld5506 at 2020-02-06 07:59:37.089007Z
SLOW_OPS 2 slow ops, oldest one blocked for 347755 sec, mon.ld5505 has
slow ops
There are no errors on the services (mgr, mon, osd).
Can you please advise how to identify the root cause of these slow ops?
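So far I only looked at ceph -s and ceph health detail. I assume the blocked
ops can be dumped from the monitor's admin socket roughly like this (untested
on my side; to be run on the host of mon.ld5505):

  # show the operations this monitor is currently tracking, including how
  # long each has been blocked and in which state it is stuck
  ceph daemon mon.ld5505 ops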
THX
Hi ceph enthusiasts,
We have a Ceph cluster with CephFS and two pools: a replicated one for metadata on SSD and an EC (4+2) one on HDD. Recently we expanded from 4 to 7 nodes and now want to change the failure domain of the erasure-coded pool from 'osd' to 'host'.
What we did was create a new CRUSH rule and change the rule of our EC pool (a rough sketch of the commands is included after the details below). It still uses the old profile. Details can be found below.
Now there are a couple of questions:
1) Is this equivalent to changing the profile? Below you can see 'crush-failure-domain=osd' in the profile, but '"op": "chooseleaf_indep", "type": "host"' in the crush rule.
2) If we do need to change the failure domain in the profile, can this be done without creating a new pool, which seems troublesome?
3) Finally, if we really need to create a new pool to do this... what is the best way? For the record: our cluster is now (after the expansion) ~40% full (400 TB / 1 PB) with 173 OSDs.
Cheers,
Max
some more details:
[root@ceph-node-a ~]# ceph osd lspools
1 ec42
2 cephfs_metadata
[root@ceph-node-a ~]# ceph osd pool get ec42 erasure_code_profile
erasure_code_profile: ec42
[root@ceph-node-a ~]# ceph osd pool get ec42 crush_rule
crush_rule: ec42_host_hdd
[root@ceph-node-a ~]# ceph osd erasure-code-profile get ec42
crush-device-class=
crush-failure-domain=osd
crush-root=default
jerasure-per-chunk-alignment=false
k=4
m=2
plugin=jerasure
technique=reed_sol_van
w=8
[root@ceph-node-a ~]# ceph osd crush rule dump ec42_host_hdd
{
    "rule_id": 6,
    "rule_name": "ec42_host_hdd",
    "ruleset": 6,
    "type": 3,
    "min_size": 3,
    "max_size": 6,
    "steps": [
        {
            "op": "set_chooseleaf_tries",
            "num": 5
        },
        {
            "op": "set_choose_tries",
            "num": 100
        },
        {
            "op": "take",
            "item": -2,
            "item_name": "default~hdd"
        },
        {
            "op": "chooseleaf_indep",
            "num": 0,
            "type": "host"
        },
        {
            "op": "emit"
        }
    ]
}
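For reference, the rule swap we did was roughly the following (a sketch
rather than the literal commands; the new profile is only used to generate
the rule, which is exactly what question 1 above is about):

  # profile with host failure domain, used only to create the new rule
  ceph osd erasure-code-profile set ec42_host k=4 m=2 \
      crush-failure-domain=host crush-device-class=hdd
  ceph osd crush rule create-erasure ec42_host_hdd ec42_host
  # point the existing EC pool at the new rule; PGs then backfill to comply
  ceph osd pool set ec42 crush_rule ec42_host_hdd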
Hi,
We would like to replace the current Seagate ST4000NM0034 HDDs in our
Ceph cluster with SSDs, and before doing that we would like to check
the typical usage of our current drives over the last years, so we can
select the best (price/performance/endurance) SSD to replace them with.
I am trying to extract this info from the fields "Blocks received from
initiator" / "Blocks sent to initiator", as these are the fields
smartctl reports for the Seagate disks. But the numbers seem strange, and
I would like to ask for feedback here.
Three nodes, all equal, 8 OSDs per node, all 4TB ST4000NM0034
(filestore) HDDs with SSD-based journals:
> root@node1:~# ceph osd crush tree
> ID CLASS WEIGHT TYPE NAME
> -1 87.35376 root default
> -2 29.11688 host node1
> 0 hdd 3.64000 osd.0
> 1 hdd 3.64000 osd.1
> 2 hdd 3.63689 osd.2
> 3 hdd 3.64000 osd.3
> 12 hdd 3.64000 osd.12
> 13 hdd 3.64000 osd.13
> 14 hdd 3.64000 osd.14
> 15 hdd 3.64000 osd.15
> -3 29.12000 host node2
> 4 hdd 3.64000 osd.4
> 5 hdd 3.64000 osd.5
> 6 hdd 3.64000 osd.6
> 7 hdd 3.64000 osd.7
> 16 hdd 3.64000 osd.16
> 17 hdd 3.64000 osd.17
> 18 hdd 3.64000 osd.18
> 19 hdd 3.64000 osd.19
> -4 29.11688 host node3
> 8 hdd 3.64000 osd.8
> 9 hdd 3.64000 osd.9
> 10 hdd 3.64000 osd.10
> 11 hdd 3.64000 osd.11
> 20 hdd 3.64000 osd.20
> 21 hdd 3.64000 osd.21
> 22 hdd 3.64000 osd.22
> 23 hdd 3.63689 osd.23
We are looking at the numbers from smartctl, basing our calculations
on this output for each individual OSD:
> Vendor (Seagate) cache information
> Blocks sent to initiator = 3783529066
> Blocks received from initiator = 3121186120
> Blocks read from cache and sent to initiator = 545427169
> Number of read and write commands whose size <= segment size = 93877358
> Number of read and write commands whose size > segment size = 2290879
I created the following spreadsheet:
> blocks sent blocks received total blocks
> to initiator from initiator calculated read% write% aka
> node1
> osd0 905060564 1900663448 2805724012 32,26% 67,74% sda
> osd1 2270442418 3756215880 6026658298 37,67% 62,33% sdb
> osd2 3531938448 3940249192 7472187640 47,27% 52,73% sdc
> osd3 2824808123 3130655416 5955463539 47,43% 52,57% sdd
> osd12 1956722491 1294854032 3251576523 60,18% 39,82% sdg
> osd13 3410188306 1265443936 4675632242 72,94% 27,06% sdh
> osd14 3765454090 3115079112 6880533202 54,73% 45,27% sdi
> osd15 2272246730 2218847264 4491093994 50,59% 49,41% sdj
>
> node2
> osd4 3974937107 740853712 4715790819 84,29% 15,71% sda
> osd5 1181377668 2109150744 3290528412 35,90% 64,10% sdb
> osd6 1903438106 608869008 2512307114 75,76% 24,24% sdc
> osd7 3511170043 724345936 4235515979 82,90% 17,10% sdd
> osd16 2642731906 3981984640 6624716546 39,89% 60,11% sdg
> osd17 3994977805 3703856288 7698834093 51,89% 48,11% sdh
> osd18 3992157229 2096991672 6089148901 65,56% 34,44% sdi
> osd19 279766405 1053039640 1332806045 20,99% 79,01% sdj
>
> node3
> osd8 3711322586 234696960 3946019546 94,05% 5,95% sda
> osd9 1203912715 3132990000 4336902715 27,76% 72,24% sdb
> osd10 912356010 1681434416 2593790426 35,17% 64,83% sdc
> osd11 810488345 2626589896 3437078241 23,58% 76,42% sdd
> osd20 1506879946 2421596680 3928476626 38,36% 61,64% sdg
> osd21 2991526593 7525120 2999051713 99,75% 0,25% sdh
> osd22 29560337 3226114552 3255674889 0,91% 99,09% sdi
> osd23 2019195656 2563506320 4582701976 44,06% 55,94% sdj
But as can be seen above, this results in some very strange numbers; for
example for node3/osd21, node2/osd19 and node3/osd8 the ratios look
implausible. So probably we're doing something wrong in our logic here.
Can someone explain what we're doing wrong? And is it possible to obtain
stats like these from Ceph directly - does Ceph keep historical stats like
the above?
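I assume that at least the per-OSD op and byte counters can also be read from
Ceph directly via the OSD admin socket, e.g. (these reset whenever the daemon
restarts, so they are not lifetime stats like the SMART counters):

  # on the OSD host: counters since the daemon last started; the interesting
  # ones are op_r / op_w (ops) and op_in_bytes / op_out_bytes (client bytes)
  ceph daemon osd.0 perf dump osd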
MJ
Hi,
the current output of ceph -s reports a warning:
9 daemons have recently crashed
root@ld3955:~# ceph -s
cluster:
id: 6b1b5117-6e08-4843-93d6-2da3cf8a6bae
health: HEALTH_WARN
9 daemons have recently crashed
2 slow ops, oldest one blocked for 347335 sec, mon.ld5505
has slow ops
services:
mon: 3 daemons, quorum ld5505,ld5506,ld5507 (age 3d)
mgr: ld5507(active, since 8m), standbys: ld5506, ld5505
mds: cephfs:2 {0=ld5507=up:active,1=ld5505=up:active} 2
up:standby-replay 3 up:standby
osd: 442 osds: 442 up (since 8d), 442 in (since 9d)
data:
pools: 7 pools, 19628 pgs
objects: 65.78M objects, 251 TiB
usage: 753 TiB used, 779 TiB / 1.5 PiB avail
pgs: 19628 active+clean
io:
client: 427 KiB/s rd, 22 MiB/s wr, 851 op/s rd, 647 op/s wr
The details are as follows:
root@ld3955:~# ceph health detail
HEALTH_WARN 9 daemons have recently crashed; 2 slow ops, oldest one
blocked for 347755 sec, mon.ld5505 has slow ops
RECENT_CRASH 9 daemons have recently crashed
mds.ld4464 crashed on host ld4464 at 2020-02-09 07:33:59.131171Z
mds.ld5506 crashed on host ld5506 at 2020-02-09 07:42:52.036592Z
mds.ld4257 crashed on host ld4257 at 2020-02-09 07:47:44.369505Z
mds.ld4464 crashed on host ld4464 at 2020-02-09 06:10:24.515912Z
mds.ld5507 crashed on host ld5507 at 2020-02-09 07:13:22.400268Z
mds.ld4257 crashed on host ld4257 at 2020-02-09 06:48:34.742475Z
mds.ld5506 crashed on host ld5506 at 2020-02-09 06:10:24.680648Z
mds.ld4465 crashed on host ld4465 at 2020-02-09 06:52:33.204855Z
mds.ld5506 crashed on host ld5506 at 2020-02-06 07:59:37.089007Z
SLOW_OPS 2 slow ops, oldest one blocked for 347755 sec, mon.ld5505 has
slow ops
However, every host that had a crash is up and running again.
Therefore I would prefer to remove these messages.
Can you please advise how to clean up these messages?
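I assume it is something along the lines of the crash module's archive
commands (untested here, and presumably only available where ceph crash
exists, i.e. Nautilus or newer):

  # list the recorded crashes, then archive them so the warning clears
  ceph crash ls
  ceph crash archive <crash-id>
  # or archive everything at once
  ceph crash archive-all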
THX
Hello All,
On one of our Ceph data nodes we see that all OSDs are at 90-100% disk
utilization, even though they are all SSD drives and the traffic is normal
compared to the other data nodes.
How can we debug this?
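We were planning to start roughly like this (the OSD id is only an example):

  # compare latency across all OSDs to spot outliers
  ceph osd perf
  # on the busy node: see which operations a specific OSD spends its time on
  ceph daemon osd.12 dump_historic_ops
  # raw device utilization and await on the node itself
  iostat -x 1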
Hi,
we sometimes lose access to our CephFS mount and get "permission denied"
if we try to cd into it. This apparently happens only on some of our HPC
CephFS client nodes (fs mounted via the kernel client) when they are busy
with computation and I/O.
When we then manually force-unmount the fs and remount it, everything works
again.
This is the dmesg output of the affected client node:
<https://pastebin.com/z5wxUgYS>
All HPC clients and ceph servers are running CentOS 7.7 with the same
kernel:
$ uname -a
Linux apollo-08.local 3.10.0-1062.12.1.el7.x86_64 #1 SMP Tue Feb 4
23:02:59 UTC 2020 x86_64 x86_64 x86_64 GNU/Linux
and all are running ceph version 14.2.7
$ ceph -v
ceph version 14.2.7 (3d58626ebeec02d8385a4cefb92c6cbc3a45bfe8) nautilus
(stable)
Maybe someone has an idea of what is going wrong here and how we can fix/avoid this.
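So far we only remount. I assume one could at least check whether the
affected client was evicted and blacklisted by the MDS, which would explain
the permission denied, roughly like this (untested):

  # any entries here mean a client was blacklisted after an eviction
  ceph osd blacklist ls
  # list the current client sessions on the active MDS and look for the node
  ceph tell mds.0 client ls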
Thanks
Dietmar
--
_________________________________________
D i e t m a r R i e d e r, Mag.Dr.
Innsbruck Medical University
Biocenter - Institute of Bioinformatics
Email: dietmar.rieder(a)i-med.ac.at
Web: http://www.icbi.at
On Wed, Feb 12, 2020 at 6:08 PM Marc Roos <M.Roos(a)f1-outsourcing.eu> wrote:
>
> >
> >
> >>
> >> Say I think my cephfs is slow when I rsync to it, slower than it used
> >> to be. First of all, I do not get why it reads so much data. I assume
> >> the file attributes need to come from the mds server, so the rsync
> >> backup should mostly cause writes not?
> >>
> >
> >Are you running one or multiple MDS? I've seen cases where the
> >synchronization between the different MDS slow down rsync.
>
> One
>
> >The problem is that rsync creates and renames files a lot. When doing
> >this with small files it can be very heavy for the MDS.
> >
>
> The strange thing is that I did not have performance problems with Luminous;
> after upgrading to Nautilus and enabling snapshots on a different tree
> of the cephfs, rsync is now taking 10 hours more.
> Another possibility is degraded performance on the source,
> but that is impossible for me to verify.
> I have increased the mds_cache_memory_limit from 8GB to 16GB to see
> what that brings.
>
How many snapshots are there?
>
> >
> >> I think it started being slow, after enabling snapshots on the file
> >> system.
> >>
> >> - how can I determine if mds_cache_memory_limit = 8000000000 is still
> >> correct?
> >>
> >> - how can I test the mds performance from the command line, so I can
> >> experiment with cpu power configurations, and see if this brings a
> >> significant change?
We have been using RadosGW with Keystone integration for a couple of
years, to allow users of our OpenStack-based IaaS to create their own
credentials for our object store. This has caused us a fair amount of
performance headaches.
Last year, James Weaver (BBC) contributed a patch (PR #26095) that
changes the handling of S3 authentication when Keystone is used as a
backend for credentials. It was merged to master in March 2019. We run
Nautilus on our production clusters, which doesn't include the patch. A
few weeks ago, we decided to cherry-pick PR #26095 on top of Nautilus
(14.2.5/6/7) and deploy that in production.
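The cherry-pick itself is nothing special; roughly (a sketch with a
simplified branch/tag choice, and the PR's commits left as a placeholder):

  git clone https://github.com/ceph/ceph.git && cd ceph
  git checkout -b nautilus-pr26095 v14.2.7      # branch name is just an example
  git cherry-pick -x <commits from PR #26095>   # resolve conflicts, then build packages as usual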
So far we haven't noticed any issues. Load on our Keystone system has
decreased significantly, response times for small requests are now
consistently low, and we don't have to re-provision S3 credentials
locally anymore to fix performance emergencies. Thanks a lot!
Blog post with a few performance graphs:
https://cloudblog.switch.ch/2020/02/10/radosgw-keystone-integration-perform…
--
Simon.