Good day, cephers!
We've recently upgraded our cluster from the 14.2.8 to the 14.2.10 release, also
performing a full system package upgrade (Ubuntu 18.04 LTS).
After that, performance dropped significantly, the main reason being that the
journal SSDs now show no merges, huge queues, and increased latency.
There are a few screenshots in the attachments. This is for an SSD journal that
holds block.db/block.wal for 3 spinning OSDs, and it looks like this for
all our SSD block.db/wal devices across all nodes.
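For reference, the merge, queue and latency numbers in the screenshots are per-device stats of the kind iostat reports; something like this shows them on a node (the device name is just a placeholder):

$ iostat -x sdX 1

The rrqm/s and wrqm/s columns are the request merges, and the queue-size and await columns show the queue depth and latencies.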
Any ideas what might cause this? Maybe I've missed something important in
the release notes?
Dear Cephers,
we are currently mounting CephFS with relatime, using the FUSE client (version 13.2.6):
ceph-fuse on /cephfs type fuse.ceph-fuse (rw,relatime,user_id=0,group_id=0,allow_other)
For the first time, I wanted to use atime to identify old unused data. My expectation with "relatime" was that the access time stamp would be updated less often, for example,
only if the last file access was >24 hours ago. However, that does not seem to be the case:
----------------------------------------------
$ stat /cephfs/grid/atlas/atlaslocalgroupdisk/rucio/group/phys-higgs/ed/cb/group.phys-higgs.17620861._000004.HSM_common.root
...
Access: 2019-04-10 15:50:04.975959159 +0200
Modify: 2019-04-10 15:50:05.651613843 +0200
Change: 2019-04-10 15:50:06.141006962 +0200
...
$ cat /cephfs/grid/atlas/atlaslocalgroupdisk/rucio/group/phys-higgs/ed/cb/group.phys-higgs.17620861._000004.HSM_common.root > /dev/null
$ sync
$ stat /cephfs/grid/atlas/atlaslocalgroupdisk/rucio/group/phys-higgs/ed/cb/group.phys-higgs.17620861._000004.HSM_common.root
...
Access: 2019-04-10 15:50:04.975959159 +0200
Modify: 2019-04-10 15:50:05.651613843 +0200
Change: 2019-04-10 15:50:06.141006962 +0200
...
----------------------------------------------
I also tried this via an nfs-ganesha mount, and via a ceph-fuse mount with admin caps,
but atime never changes.
Is atime really never updated with CephFS, or is this configurable?
Something as coarse as "update at most once per day" would be perfectly fine for the use case.
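(For context, the eventual goal is something along these lines, which only makes sense if atime is updated at least occasionally; the age threshold is just an example:)

$ find /cephfs/grid/atlas -type f -atime +365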
Cheers,
Oliver
Hello,
We are planning to perform a small upgrade to our cluster and slowly start adding 12 TB SATA HDDs. We also need to accommodate the additional SSD WAL/DB requirements. Currently we are considering the following:
HDD Drives - Seagate EXOS 12TB
SSD Drives for WAL/DB - Intel D3 S4510 960GB or Intel D3 S4610 960GB
Our cluster isn't hosting any IO intensive DBs nor IO hungry VMs such as Exchange, MSSQL, etc.
From the documentation I've read, the recommended DB size is between 1% and 4% of the OSD's size. Would a 2% figure be sufficient (so around 240 GB of DB for each 12 TB OSD)? See the quick calculation below.
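(Just to sanity-check my own numbers, this is the arithmetic I'm working from, treating 1 TB as 1000 GB and ignoring WAL space and SSD overprovisioning:)

$ osd_tb=12; ssd_gb=960; for pct in 1 2 4; do db_gb=$((osd_tb*1000*pct/100)); echo "${pct}% of ${osd_tb} TB = ${db_gb} GB DB -> ${ssd_gb}/${db_gb} = $((ssd_gb/db_gb)) OSDs per SSD"; done
1% of 12 TB = 120 GB DB -> 960/120 = 8 OSDs per SSD
2% of 12 TB = 240 GB DB -> 960/240 = 4 OSDs per SSD
4% of 12 TB = 480 GB DB -> 960/480 = 2 OSDs per SSD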
Also, from your experience, which is the better model for the SSD DB/WAL? Would the Intel S4510 be sufficient for our purpose, or would the S4610 be a much better choice? Are there any other cost-effective options we should consider instead of the above models?
The same question for the HDDs: are there any other drives we should consider instead of the Seagate EXOS series?
Thanks for your help and suggestions.
Andrei
Hi all,
on a mimic 13.2.8 cluster I observe a gradual increase of memory usage by OSD daemons, in particular, under heavy load. For our spinners I use osd_memory_target=2G. The daemons overrun the 2G in virt size rather quickly and grow to something like 4G virtual. The real memory consumption stays more or less around the 2G of the target. There are some overshoots, but these go down again during periods with less load.
What I observe now is that the actual memory consumption slowly grows and OSDs start using more than 2G of virtual memory. I see this as slowly growing swap usage despite more RAM being available (swappiness=10). This indicates allocated but unused memory, or memory not accessed for a long time, which usually points to a leak. Here are some heap stats:
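(For reference, the heap stats and the mempool dump below were collected roughly like this on the OSD host:)

# ceph tell osd.101 heap stats
# ceph daemon osd.101 dump_mempools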
Before restart:
osd.101 tcmalloc heap stats:------------------------------------------------
MALLOC: 3438940768 ( 3279.6 MiB) Bytes in use by application
MALLOC: + 5611520 ( 5.4 MiB) Bytes in page heap freelist
MALLOC: + 257307352 ( 245.4 MiB) Bytes in central cache freelist
MALLOC: + 357376 ( 0.3 MiB) Bytes in transfer cache freelist
MALLOC: + 6727368 ( 6.4 MiB) Bytes in thread cache freelists
MALLOC: + 25559040 ( 24.4 MiB) Bytes in malloc metadata
MALLOC: ------------
MALLOC: = 3734503424 ( 3561.5 MiB) Actual memory used (physical + swap)
MALLOC: + 575946752 ( 549.3 MiB) Bytes released to OS (aka unmapped)
MALLOC: ------------
MALLOC: = 4310450176 ( 4110.8 MiB) Virtual address space used
MALLOC:
MALLOC: 382884 Spans in use
MALLOC: 35 Thread heaps in use
MALLOC: 8192 Tcmalloc page size
------------------------------------------------
# ceph daemon osd.101 dump_mempools
{
"mempool": {
"by_pool": {
"bloom_filter": {
"items": 0,
"bytes": 0
},
"bluestore_alloc": {
"items": 4691828,
"bytes": 37534624
},
"bluestore_cache_data": {
"items": 0,
"bytes": 0
},
"bluestore_cache_onode": {
"items": 51,
"bytes": 28968
},
"bluestore_cache_other": {
"items": 5761276,
"bytes": 46292425
},
"bluestore_fsck": {
"items": 0,
"bytes": 0
},
"bluestore_txc": {
"items": 67,
"bytes": 46096
},
"bluestore_writing_deferred": {
"items": 208,
"bytes": 26037057
},
"bluestore_writing": {
"items": 52,
"bytes": 6789398
},
"bluefs": {
"items": 9478,
"bytes": 183720
},
"buffer_anon": {
"items": 291450,
"bytes": 28093473
},
"buffer_meta": {
"items": 546,
"bytes": 34944
},
"osd": {
"items": 98,
"bytes": 1139152
},
"osd_mapbl": {
"items": 78,
"bytes": 8204276
},
"osd_pglog": {
"items": 341944,
"bytes": 120607952
},
"osdmap": {
"items": 10687217,
"bytes": 186830528
},
"osdmap_mapping": {
"items": 0,
"bytes": 0
},
"pgmap": {
"items": 0,
"bytes": 0
},
"mds_co": {
"items": 0,
"bytes": 0
},
"unittest_1": {
"items": 0,
"bytes": 0
},
"unittest_2": {
"items": 0,
"bytes": 0
}
},
"total": {
"items": 21784293,
"bytes": 461822613
}
}
}
Right after restart + health_ok:
osd.101 tcmalloc heap stats:------------------------------------------------
MALLOC: 1173996280 ( 1119.6 MiB) Bytes in use by application
MALLOC: + 3727360 ( 3.6 MiB) Bytes in page heap freelist
MALLOC: + 25493688 ( 24.3 MiB) Bytes in central cache freelist
MALLOC: + 17101824 ( 16.3 MiB) Bytes in transfer cache freelist
MALLOC: + 20301904 ( 19.4 MiB) Bytes in thread cache freelists
MALLOC: + 5242880 ( 5.0 MiB) Bytes in malloc metadata
MALLOC: ------------
MALLOC: = 1245863936 ( 1188.1 MiB) Actual memory used (physical + swap)
MALLOC: + 20488192 ( 19.5 MiB) Bytes released to OS (aka unmapped)
MALLOC: ------------
MALLOC: = 1266352128 ( 1207.7 MiB) Virtual address space used
MALLOC:
MALLOC: 54160 Spans in use
MALLOC: 33 Thread heaps in use
MALLOC: 8192 Tcmalloc page size
------------------------------------------------
Am I looking at a memory leak here or are these heap stats expected?
I don't mind the swap usage, it doesn't have any impact. I'm just wondering whether I need to restart OSDs regularly. The "leakage" above occurred within only 2 months.
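(One option instead of regular restarts might be asking tcmalloc to hand freed pages back to the OS; just an idea, I have not verified that it helps with the growth above:)

# ceph tell osd.101 heap release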
Best regards,
=================
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14
Hi,
I've got a problem on an Octopus (15.2.3, Debian packages) install: the bucket's
S3 index shows a file:
s3cmd ls s3://upvid/255/38355 --recursive
2020-07-27 17:48 50584342
s3://upvid/255/38355/juz_nie_zyjesz_sezon_2___oficjalny_zwiastun___netflix_mp4
radosgw-admin bi list also shows it:
{
    "type": "plain",
    "idx": "255/38355/juz_nie_zyjesz_sezon_2___oficjalny_zwiastun___netflix_mp4",
    "entry": {
        "name": "255/38355/juz_nie_zyjesz_sezon_2___oficjalny_zwiastun___netflix_mp4",
        "instance": "",
        "ver": {
            "pool": 11,
            "epoch": 853842
        },
        "locator": "",
        "exists": "true",
        "meta": {
            "category": 1,
            "size": 50584342,
            "mtime": "2020-07-27T17:48:27.203008Z",
            "etag": "2b31cc8ce8b1fb92a5f65034f2d12581-7",
            "storage_class": "",
            "owner": "filmweb-app",
            "owner_display_name": "filmweb app user",
            "content_type": "",
            "accounted_size": 50584342,
            "user_data": "",
            "appendable": "false"
        },
        "tag": "_3ubjaztglHXfZr05wZCFCPzebQf-ZFP",
        "flags": 0,
        "pending_map": [],
        "versioned_epoch": 0
    }
},
but trying to download it via curl (I've set permissions to public) only gets me:
<?xml version="1.0"
encoding="UTF-8"?><Error><Code>NoSuchKey</Code><BucketName>upvid</BucketName><RequestId>tx0000000000000000e716d-005f1f14cb-e478a-pl-war1</RequestId><HostId>e478a-pl-war1-pl</HostId></Error>
(actually nonexistent files give Access Denied in the same context)
Same with other tools:
$ s3cmd get s3://upvid/255/38355/juz_nie_zyjesz_sezon_2___oficjalny_zwiastun___netflix_mp4 /tmp
download: 's3://upvid/255/38355/juz_nie_zyjesz_sezon_2___oficjalny_zwiastun___netflix_mp4' -> '/tmp/juz_nie_zyjesz_sezon_2___oficjalny_zwiastun___netflix_mp4' [1 of 1]
ERROR: S3 error: 404 (NoSuchKey)
Cluster health is OK.
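(For reference, the head object can also be checked directly, something like this; the data pool name is the default one and may need adjusting:)

$ radosgw-admin object stat --bucket=upvid --object='255/38355/juz_nie_zyjesz_sezon_2___oficjalny_zwiastun___netflix_mp4'
$ rados -p default.rgw.buckets.data ls | grep juz_nie_zyjesz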
Any ideas what is happening here?
--
Mariusz Gronczewski, Administrator
Efigence S. A.
ul. Wołoska 9a, 02-583 Warszawa
T: [+48] 22 380 13 13
NOC: [+48] 22 380 10 20
E: admin(a)efigence.com
Hi,
On a recently deployed Octopus (15.2.2) cluster (240 OSDs) we are seeing
OSDs randomly drop out of the cluster.
Usually it's 2 to 4 OSDs spread out over different nodes. Each node has
16 OSDs and not all the failing OSDs are on the same node.
The OSDs are marked as down, and all they keep printing in their logs is:
monclient: _check_auth_rotating possible clock skew, rotating keys
expired way too early (before 2020-06-04T07:57:17.706529-0400)
Looking at their status through the admin socket:
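(That is, roughly this, for osd.206:)

# ceph daemon osd.206 status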
{
"cluster_fsid": "68653193-9b84-478d-bc39-1a811dd50836",
"osd_fsid": "87231b5d-ae5f-4901-93c5-18034381e5ec",
"whoami": 206,
"state": "active",
"oldest_map": 73697,
"newest_map": 75795,
"num_pgs": 19
}
The message brought me to my own ticket I created 2 years ago:
https://tracker.ceph.com/issues/23460
The first thing I checked was NTP/time, and I double- and triple-checked it. All
the clocks are in sync across the cluster. Nothing wrong there.
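(This is the kind of check I mean, on every node plus the mon-side view; chronyc here, but the equivalent ntpq output works just as well:)

$ chronyc tracking
$ ceph time-sync-status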
Again, it's not all the OSDs on a node failing. Just 1 or 2 dropping out.
Restarting them brings them back right away and then within 24h some
other OSDs will drop out.
Has anybody seen this behavior with Octopus as well?
Wido
Hello,
I'm running KVM virtualization with RBD storage; some images in the RBD pool become effectively unusable after a VM restart.
All I/O to a problematic RBD image blocks indefinitely.
I've checked that it is not a permission or locking problem.
The bug was silent until we performed a planned restart of a few VMs and some of the VMs failed to start (the kvm process timed out).
It could be related to recent upgrades: Luminous to Nautilus, or Proxmox 5 to 6.
The Ceph backend is clean, no observable problems, all mons/mgrs/osds up and running. The network is OK.
Nothing in the logs is relevant to the problem.
ceph version 14.2.6 (ba51347bdbe28c7c0e2e9172fa2983111137bb60) nautilus (stable)
kernel 5.3.13-2-pve #1 SMP PVE 5.3.13-2 (Fri, 24 Jan 2020 09:49:36 +0100) x86_64 GNU/Linux
HEALTH_OK
No locks:
# rbd status rbd-technet/vm-402-disk-0
Watchers: none
# rbd status rbd-technet/vm-402-disk-1
Watchers: none
Normal image vs. a problematic one:
# rbd object-map check rbd-technet/vm-402-disk-0
Object Map Check: 100% complete…done.
# rbd object-map check rbd-technet/vm-402-disk-1
^C
disk-0 is fine, while disk-1 is effectively lost. The command hangs for many minutes with no visible activity; I interrupted it.
rbd export runs without problems; however, some data is lost after importing it back (ext4 errors).
rbd deep copy worked for me. The copy looks good, no errors.
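(In case it helps, something like the following might give more information; the object-map rebuild is only a guess on my part, I have not verified that it is safe or useful here:)

# rbd --debug-rbd=20 object-map check rbd-technet/vm-402-disk-1
# rbd object-map rebuild rbd-technet/vm-402-disk-1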
# rbd info rbd-technet/vm-402-disk-1
rbd image 'vm-402-disk-1':
size 16 GiB in 4096 objects
order 22 (4 MiB objects)
snapshot_count: 0
id: c600d06b8b4567
block_name_prefix: rbd_data.c600d06b8b4567
format: 2
features: layering, exclusive-lock, object-map, fast-diff, deep-flatten, journaling
op_features:
flags:
create_timestamp: Fri Jan 31 17:50:50 2020
access_timestamp: Sat Mar 7 00:30:53 2020
modify_timestamp: Sat Mar 7 00:33:35 2020
journal: c600d06b8b4567
mirroring state: disabled
What can be done to debug this problem?
Thanks,
Ilia.
Hi,
I am running a nice Ceph cluster (Proxmox 4 / Debian 8 / Ceph 0.94.3) on
3 nodes (Supermicro X8DTT-HIBQF), 2 OSDs each (2 TB SATA hard disks),
interconnected via 40 Gb InfiniBand.
The problem is that the Ceph performance is quite bad (approx. 30 MiB/s
reading, 3-4 MiB/s writing), so I thought about plugging a PCIe-to-NVMe/M.2
adapter into each node and installing SSDs. The idea is to
have faster Ceph storage and also some storage extension.
The question now is which SSDs I should use. If I understand it correctly,
not every SSD is suitable for Ceph, as noted at the links below:
https://www.sebastien-han.fr/blog/2014/10/10/ceph-how-to-test-if-your-ssd-i…
or here:
https://www.proxmox.com/en/downloads/item/proxmox-ve-ceph-benchmark
In the first link, the Samsung SSD 950 PRO 512GB NVMe is listed as a
fast SSD for Ceph. As the 950 is no longer available, I ordered a
Samsung 970 1TB for testing, unfortunately the "EVO" instead of the PRO.
Before equipping all nodes with these SSDs, I did some tests with "fio"
as recommended, e.g. like this:
fio --filename=/dev/DEVICE --direct=1 --sync=1 --rw=write --bs=4k \
    --numjobs=1 --iodepth=1 --runtime=60 --time_based --group_reporting \
    --name=journal-test
The results are as follows:
-----------------------
1) Samsung 970 EVO NVMe M.2 with PCIe adapter
Jobs: 1:
read : io=26706MB, bw=445MiB/s, iops=113945, runt= 60001msec
write: io=252576KB, bw=4.1MiB/s, iops=1052, runt= 60001msec
Jobs: 4:
read : io=21805MB, bw=432.7MiB/s, iops=93034, runt= 60001msec
write: io=422204KB, bw=6.8MiB/s, iops=1759, runt= 60002msec
Jobs: 10:
read : io=26921MB, bw=448MiB/s, iops=114859, runt= 60001msec
write: io=435644KB, bw=7MiB/s, iops=1815, runt= 60004msec
-----------------------
So the read speed is impressive, but the write speed is really bad.
Therefore I ordered the Samsung 970 PRO (1TB), as it has faster NAND
chips (MLC instead of TLC). The results for writing are, however, even worse:
-----------------------
2) Samsung 970 PRO NVMe M.2 with PCIe adapter
Jobs: 1:
read : io=15570MB, bw=259.4MiB/s, iops=66430, runt= 60001msec
write: io=199436KB, bw=3.2MiB/s, iops=830, runt= 60001msec
Jobs: 4:
read : io=48982MB, bw=816.3MiB/s, iops=208986, runt= 60001msec
write: io=327800KB, bw=5.3MiB/s, iops=1365, runt= 60002msec
Jobs: 10:
read : io=91753MB, bw=1529.3MiB/s, iops=391474, runt= 60001msec
write: io=343368KB, bw=5.6MiB/s, iops=1430, runt= 60005msec
-----------------------
I did some research and found out that the "--sync" flag sets the O_DSYNC
flag, which seems to bypass the SSD's write cache and leads to these
horrid write speeds.
It seems this relates to the fact that the write cache is only left
enabled for SSDs that implement some kind of battery/capacitor buffer which
guarantees a flush of the data to flash in case of a power loss.
However, it seems impossible to find out which SSDs have this
power-loss protection, and these enterprise SSDs are crazy
expensive compared to the SSDs above. Moreover, it's unclear whether
power-loss protection is even available in the NVMe form factor. So
building a 1 or 2 TB cluster does not really seem affordable/viable.
So, can anyone please give me some hints on what to do? Is it possible to ensure
that the write cache is not disabled in some way (my server is situated
in a data center, so there will probably never be a loss of power)?
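(For what it's worth, the state of the volatile write cache can at least be inspected with nvme-cli; the device path below is just an example:)

# nvme id-ctrl /dev/nvme0 | grep -i vwc
# nvme get-feature /dev/nvme0 -f 0x06 -H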
Or is the link above already outdated because newer Ceph releases somehow
deal with this problem? Or maybe a later Debian release (10) will handle
the O_DSYNC flag differently?
Perhaps I should simply invest in faster (and bigger) hard disks and
forget the SSD-cluster idea?
Thank you in advance for any help,
Best Regards,
Hermann
--
hermann(a)qwer.tk
PGP/GPG: 299893C7 (on keyservers)
Hi
I would like to change the CRUSH rule so that data lands on SSDs instead of HDDs. Can this be done on the fly, with the data migration just happening, or do I need to do something to move the data? A rough sketch of what I mean is below.
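(Something like this, where the rule and pool names are only placeholders; the open question for me is whether setting the rule on the pool is enough for the data to move by itself:)

$ ceph osd crush rule create-replicated replicated_ssd default host ssd
$ ceph osd pool set mypool crush_rule replicated_ssd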
Jesper
Sent from myMail for iOS