Hello,
Over the last week I have been trying to optimise the performance of our
MDS nodes for the large number of files and concurrent clients we have.
It turns out that despite various stability fixes in recent releases, the
default configuration still doesn't appear to be optimal for keeping the
cache size under control and avoiding intermittent I/O blocks.
Unfortunately, it is very hard to tweak the configuration to something
that works, because the necessary tuning parameters are largely
undocumented or only described in very technical terms in the source
code, which makes them quite unapproachable for administrators not
familiar with all the CephFS internals. I would therefore like to ask
whether it would be possible to document the "advanced" MDS settings more
clearly: what they do and in which direction they have to be tuned for
more or less aggressive cap recall, for instance (sometimes it is not
even clear whether a threshold is a minimum or a maximum).
I am in the very (un)fortunate situation of having folders with several
hundred thousand direct subfolders or files (and one extreme case with
almost 7 million dentries), which makes for a pretty good benchmark for
measuring cap growth while performing operations on them. For the time
being, I came up with this configuration, which seems to work for me but
is still far from optimal:
mds basic mds_cache_memory_limit 10737418240
mds advanced mds_cache_trim_threshold 131072
mds advanced mds_max_caps_per_client 500000
mds advanced mds_recall_max_caps 17408
mds advanced mds_recall_max_decay_rate 2.000000
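For completeness, these are set via the central config database rather
than ceph.conf; the equivalent commands would be roughly the following
(same values as above, to be adjusted to taste):

ceph config set mds mds_cache_memory_limit 10737418240
ceph config set mds mds_cache_trim_threshold 131072
ceph config set mds mds_max_caps_per_client 500000
ceph config set mds mds_recall_max_caps 17408
ceph config set mds mds_recall_max_decay_rate 2.0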
The parameters I am least sure about---because I understand the least
how they actually work---are mds_cache_trim_threshold and
mds_recall_max_decay_rate. Despite reading the description in
src/common/options.cc, I understand only half of what they're doing and
I am also not quite sure in which direction to tune them for optimal
results.
Another point where I am struggling is the correct configuration of
mds_recall_max_caps. The default of 5K doesn't work too well for me, but
values above 20K also don't seem to be a good choice. While high values
result in fewer blocked ops and better performance without destabilising
the MDS, they also lead to slow but unbounded cache growth, which seems
counter-intuitive. 17K was the maximum I could go. Higher values work
for most use cases, but when listing very large folders with millions of
dentries, the MDS cache size slowly starts to exceed the limit after a
few hours, since the MDSs are failing to keep clients below
mds_max_caps_per_client despite not showing any "failing to respond to
cache pressure" warnings.
With the configuration above, I do not have cache size issues any more,
but it comes at the cost of performance and slow/blocked ops. A few
hints as to how I could optimise my settings for better client
performance would be much appreciated, as would additional
documentation for all the "advanced" MDS settings.
Thanks a lot
Janek
Dear Cephers,
we are currently mounting CephFS with relatime, using the FUSE client (version 13.2.6):
ceph-fuse on /cephfs type fuse.ceph-fuse (rw,relatime,user_id=0,group_id=0,allow_other)
For the first time, I wanted to use atime to identify old, unused data. My expectation with "relatime" was that the access time stamp would still be updated, just less often, for example
only if the last file access was more than 24 hours ago. However, that does not seem to be the case:
----------------------------------------------
$ stat /cephfs/grid/atlas/atlaslocalgroupdisk/rucio/group/phys-higgs/ed/cb/group.phys-higgs.17620861._000004.HSM_common.root
...
Access: 2019-04-10 15:50:04.975959159 +0200
Modify: 2019-04-10 15:50:05.651613843 +0200
Change: 2019-04-10 15:50:06.141006962 +0200
...
$ cat /cephfs/grid/atlas/atlaslocalgroupdisk/rucio/group/phys-higgs/ed/cb/group.phys-higgs.17620861._000004.HSM_common.root > /dev/null
$ sync
$ stat /cephfs/grid/atlas/atlaslocalgroupdisk/rucio/group/phys-higgs/ed/cb/group.phys-higgs.17620861._000004.HSM_common.root
...
Access: 2019-04-10 15:50:04.975959159 +0200
Modify: 2019-04-10 15:50:05.651613843 +0200
Change: 2019-04-10 15:50:06.141006962 +0200
...
----------------------------------------------
I also tried this via an nfs-ganesha mount, and via a ceph-fuse mount with admin caps,
but atime never changes.
Is atime really never updated with CephFS, or is this configurable?
Something as coarse as "update at most once per day" would be perfectly fine for the use case.
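For context, the kind of scan I have in mind is nothing fancy, just something along these lines (assuming atime were actually maintained; the 180-day threshold is only an example):

# list files not accessed within the last 180 days, oldest first
find /cephfs/grid/atlas -type f -atime +180 -printf '%A@ %p\n' | sort -n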
Cheers,
Oliver
Hi,
I am running a nice ceph cluster (proxmox 4 / debian-8 / ceph 0.94.3) on
3 nodes (Supermicro X8DTT-HIBQF), 2 OSDs each (2 TB SATA hard disks),
interconnected via 40 Gb InfiniBand.
The problem is that the ceph performance is quite bad (approx. 30 MiB/s
reading, 3-4 MiB/s writing), so I thought about plugging a PCIe-to-NVMe/M.2
adapter into each node and installing SSDs. The idea is to get
faster ceph storage and also some extra capacity.
The question now is which SSDs I should use. If I understand it
correctly, not every SSD is suitable for ceph, as noted in the links below:
https://www.sebastien-han.fr/blog/2014/10/10/ceph-how-to-test-if-your-ssd-i…
or here:
https://www.proxmox.com/en/downloads/item/proxmox-ve-ceph-benchmark
In the first link, the Samsung SSD 950 PRO 512GB NVMe is listed as a
fast SSD for ceph. As the 950 is not available anymore, I ordered a
Samsung 970 1TB for testing, unfortunately the "EVO" rather than the PRO.
Before equipping all nodes with these SSDs, I did some tests with "fio"
as recommended, e.g. like this:
fio --filename=/dev/DEVICE --direct=1 --sync=1 --rw=write --bs=4k \
    --numjobs=1 --iodepth=1 --runtime=60 --time_based --group_reporting \
    --name=journal-test
The results are as follows:
-----------------------
1) Samsung 970 EVO NVMe M.2 with PCIe adapter
Jobs: 1:
read : io=26706MB, bw=445MiB/s, iops=113945, runt= 60001msec
write: io=252576KB, bw=4.1MiB/s, iops=1052, runt= 60001msec
Jobs: 4:
read : io=21805MB, bw=432.7MiB/s, iops=93034, runt= 60001msec
write: io=422204KB, bw=6.8MiB/s, iops=1759, runt= 60002msec
Jobs: 10:
read : io=26921MB, bw=448MiB/s, iops=114859, runt= 60001msec
write: io=435644KB, bw=7MiB/s, iops=1815, runt= 60004msec
-----------------------
So the read speed is impressive, but the write speed is really bad.
Therefore I ordered the Samsung 970 PRO (1TB), as it has faster NAND
chips (MLC instead of TLC). The results, however, are even worse for writing:
-----------------------
Samsung 970 PRO NVMe M.2 with PCIe adapter
Jobs: 1:
read : io=15570MB, bw=259.4MiB/s, iops=66430, runt= 60001msec
write: io=199436KB, bw=3.2MiB/s, iops=830, runt= 60001msec
Jobs: 4:
read : io=48982MB, bw=816.3MiB/s, iops=208986, runt= 60001msec
write: io=327800KB, bw=5.3MiB/s, iops=1365, runt= 60002msec
Jobs: 10:
read : io=91753MB, bw=1529.3MiB/s, iops=391474, runt= 60001msec
write: io=343368KB, bw=5.6MiB/s, iops=1430, runt= 60005msec
-----------------------
I did some research and found out that the "--sync=1" flag makes fio
write with O_DSYNC, which effectively bypasses the SSD's volatile write
cache and leads to these horrid write speeds.
As far as I understand, the write cache only remains effective for SSDs
with power-loss protection, i.e. some kind of capacitor/battery buffer
that guarantees the cached data is flushed to flash in case of a power
loss.
However, it seems hard to find out which SSDs actually have this
power-loss protection; these enterprise SSDs are also crazy expensive
compared to the consumer drives above, and it's unclear whether
power-loss protection is even available in the M.2/NVMe form factor. So
building a 1 or 2 TB all-SSD cluster does not seem really affordable/viable.
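For what it's worth, nvme-cli should at least be able to show whether the
drive's volatile write cache is enabled; something like this (untested
sketch, /dev/nvme0 is just an example, 0x06 is the NVMe "Volatile Write
Cache" feature id):

# query the Volatile Write Cache feature (0x06); -H prints a human-readable decode
nvme get-feature /dev/nvme0 -f 0x06 -H
# it can in principle be toggled, although that does not help with O_DSYNC writes
nvme set-feature /dev/nvme0 -f 0x06 -v 0   # 0 = disable, 1 = enable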
So, can anyone please give me hints on what to do? Is there some way to
ensure that the write cache is not bypassed (my server sits in a data
center, so there will probably never be a loss of power)?
Or is the link above already outdated because newer ceph releases somehow
deal with this problem? Or maybe a later Debian release (10) handles
the O_DSYNC flag differently?
Perhaps I should simply invest in faster (and bigger) hard disks and
forget the SSD-cluster idea?
Thank you in advance for any help,
Best Regards,
Hermann
--
hermann(a)qwer.tk
PGP/GPG: 299893C7 (on keyservers)
Hello everybody,
Can somebody add support for Debian buster to ceph-deploy:
https://tracker.ceph.com/issues/42870
Highly appreciated,
Regards,
Jelle de Jong
Has anyone ever tried using this feature? I've added it to the [global]
section of ceph.conf on my POC cluster, but I'm not sure how to tell
whether it's actually working. I did find a reference to this feature via
Google where they had it in their [osd] section, so I've tried that too.
TIA
Adam
Hello List,
first of all: yes, I made mistakes. Now I am trying to recover :-/
I had a healthy 3-node cluster which I wanted to convert to a single one.
My goal was to reinstall a fresh 3-node cluster and start with 2 nodes.
I was able to turn it from a 3-node cluster into a healthy 2-node cluster.
Then the problems began.
I changed the pool to size=1 and min_size=1.
Health was okay until that point. Then all of a sudden both nodes got
fenced... one node refused to boot, mons were missing, etc. To make a
long story short, here is where I am right now:
root@node03:~ # ceph -s
cluster b3be313f-d0ef-42d5-80c8-6b41380a47e3
health HEALTH_WARN
53 pgs stale
53 pgs stuck stale
monmap e4: 2 mons at {0=10.15.15.3:6789/0,1=10.15.15.2:6789/0}
election epoch 298, quorum 0,1 1,0
osdmap e6097: 14 osds: 9 up, 9 in
pgmap v93644673: 512 pgs, 1 pools, 1193 GB data, 304 kobjects
1088 GB used, 32277 GB / 33366 GB avail
459 active+clean
53 stale+active+clean
root@node03:~ # ceph osd tree
ID WEIGHT TYPE NAME UP/DOWN REWEIGHT PRIMARY-AFFINITY
-1 32.56990 root default
-2 25.35992 host node03
0 3.57999 osd.0 up 1.00000 1.00000
5 3.62999 osd.5 up 1.00000 1.00000
6 3.62999 osd.6 up 1.00000 1.00000
7 3.62999 osd.7 up 1.00000 1.00000
8 3.62999 osd.8 up 1.00000 1.00000
19 3.62999 osd.19 up 1.00000 1.00000
20 3.62999 osd.20 up 1.00000 1.00000
-3 7.20998 host node02
3 3.62999 osd.3 up 1.00000 1.00000
4 3.57999 osd.4 up 1.00000 1.00000
1 0 osd.1 down 0 1.00000
9 0 osd.9 down 0 1.00000
10 0 osd.10 down 0 1.00000
17 0 osd.17 down 0 1.00000
18 0 osd.18 down 0 1.00000
My main mistakes seem to have been:
--------------------------------
ceph osd out osd.1
ceph auth del osd.1
systemctl stop ceph-osd@1
ceph osd rm 1
umount /var/lib/ceph/osd/ceph-1
ceph osd crush remove osd.1
As far as I can tell, ceph still waits for and needs data from that
osd.1 (which I removed):
root@node03:~ # ceph health detail
HEALTH_WARN 53 pgs stale; 53 pgs stuck stale
pg 0.1a6 is stuck stale for 5086.552795, current state
stale+active+clean, last acting [1]
pg 0.142 is stuck stale for 5086.552784, current state
stale+active+clean, last acting [1]
pg 0.1e is stuck stale for 5086.552820, current state
stale+active+clean, last acting [1]
pg 0.e0 is stuck stale for 5086.552855, current state
stale+active+clean, last acting [1]
pg 0.1d is stuck stale for 5086.552822, current state
stale+active+clean, last acting [1]
pg 0.13c is stuck stale for 5086.552791, current state
stale+active+clean, last acting [1]
[...] SNIP [...]
pg 0.e9 is stuck stale for 5086.552955, current state
stale+active+clean, last acting [1]
pg 0.87 is stuck stale for 5086.552939, current state
stale+active+clean, last acting [1]
When I try to start osd.1 manually, I get:
--------------------------------------------
2020-02-10 18:48:26.107444 7f9ce31dd880 0 ceph version 0.94.10
(b1e0532418e4631af01acbc0cedd426f1905f4af), process ceph-osd, pid
10210
2020-02-10 18:48:26.134417 7f9ce31dd880 0
filestore(/var/lib/ceph/osd/ceph-1) backend xfs (magic 0x58465342)
2020-02-10 18:48:26.184202 7f9ce31dd880 0
genericfilestorebackend(/var/lib/ceph/osd/ceph-1) detect_features:
FIEMAP ioctl is supported and appears to work
2020-02-10 18:48:26.184209 7f9ce31dd880 0
genericfilestorebackend(/var/lib/ceph/osd/ceph-1) detect_features:
FIEMAP ioctl is disabled via 'filestore fiemap' config option
2020-02-10 18:48:26.184526 7f9ce31dd880 0
genericfilestorebackend(/var/lib/ceph/osd/ceph-1) detect_features:
syncfs(2) syscall fully supported (by glibc and kernel)
2020-02-10 18:48:26.184585 7f9ce31dd880 0
xfsfilestorebackend(/var/lib/ceph/osd/ceph-1) detect_feature: extsize
is disabled by conf
2020-02-10 18:48:26.309755 7f9ce31dd880 0
filestore(/var/lib/ceph/osd/ceph-1) mount: enabling WRITEAHEAD journal
mode: checkpoint is not enabled
2020-02-10 18:48:26.633926 7f9ce31dd880 1 journal _open
/var/lib/ceph/osd/ceph-1/journal fd 20: 5367660544 bytes, block size
4096 bytes, directio = 1, aio = 1
2020-02-10 18:48:26.642185 7f9ce31dd880 1 journal _open
/var/lib/ceph/osd/ceph-1/journal fd 20: 5367660544 bytes, block size
4096 bytes, directio = 1, aio = 1
2020-02-10 18:48:26.664273 7f9ce31dd880 0 <cls>
cls/hello/cls_hello.cc:271: loading cls_hello
2020-02-10 18:48:26.732154 7f9ce31dd880 0 osd.1 6002 crush map has
features 1107558400, adjusting msgr requires for clients
2020-02-10 18:48:26.732163 7f9ce31dd880 0 osd.1 6002 crush map has
features 1107558400 was 8705, adjusting msgr requires for mons
2020-02-10 18:48:26.732167 7f9ce31dd880 0 osd.1 6002 crush map has
features 1107558400, adjusting msgr requires for osds
2020-02-10 18:48:26.732179 7f9ce31dd880 0 osd.1 6002 load_pgs
2020-02-10 18:48:31.939810 7f9ce31dd880 0 osd.1 6002 load_pgs opened 53 pgs
2020-02-10 18:48:31.940546 7f9ce31dd880 -1 osd.1 6002 log_to_monitors
{default=true}
2020-02-10 18:48:31.942471 7f9ce31dd880 1 journal close
/var/lib/ceph/osd/ceph-1/journal
2020-02-10 18:48:31.969205 7f9ce31dd880 -1 ** ERROR: osd
init failed: (1) Operation not permitted
It's mounted:
/dev/sdg1 3.7T 127G 3.6T 4% /var/lib/ceph/osd/ceph-1
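My guess is that I need to re-register the OSD with the cluster, since I
deleted its auth key and CRUSH entry; something like the sketch below,
but I have not dared to run it yet (assuming the data directory and its
keyring are still intact, and that "ceph osd create" hands back the free
id 1):

# re-allocate the OSD id (1 should be the lowest free id)
ceph osd create
# re-import the key that is still in the OSD's data directory
ceph auth add osd.1 osd 'allow *' mon 'allow profile osd' \
    -i /var/lib/ceph/osd/ceph-1/keyring
# put the OSD back into the CRUSH map under its host (weight as before)
ceph osd crush add osd.1 3.62999 host=node02
# try starting it again
systemctl start ceph-osd@1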
Is there any way I can get osd.1 back in?
Thanks a lot,
mario
This means it has been applied:
# ceph osd dump -f json | jq .require_osd_release
"nautilus"
-- dan
On Mon, Feb 17, 2020 at 11:10 AM Marc Roos <M.Roos(a)f1-outsourcing.eu> wrote:
>
>
> How do you check if you issued this command in the past?
>
>
> -----Original Message-----
> To: ceph-users(a)ceph.io
> Subject: [ceph-users] Re: Excessive write load on mons after upgrade
> from 12.2.13 -> 14.2.7
>
> Hi Peter,
>
> could be a totally different problem but did you run the command "ceph
> osd require-osd-release nautilus" after the upgrade?
> We had poor performance after upgrading to nautilus and running this
> command fixed it. The same was reported by others for previous updates.
> Here is my original message regarding this issue:
> https://lists.ceph.io/hyperkitty/list/ceph-users@ceph.io/message/OYFRWSJXPV…
>
> We did not observe the master election problem though.
>
> _______________________________________________
> ceph-users mailing list -- ceph-users(a)ceph.io
> To unsubscribe send an email to ceph-users-leave(a)ceph.io
Hi guys, I have deployed an EFK cluster that uses Ceph as block storage in Kubernetes, but the RBD write IOPS sometimes drops to zero and stays there for a few minutes. I want to check the RBD logs, so I added some config to ceph.conf and restarted Ceph.
Here is my ceph.conf:
[global]
fsid = 53f4e1d5-32ce-4e9c-bf36-f6b54b009962
mon_initial_members = db-16-4-hzxs, db-16-5-hzxs, db-16-6-hzxs
mon_host = 10.25.16.4,10.25.16.5,10.25.16.6
auth_cluster_required = cephx
auth_service_required = cephx
auth_client_required = cephx
osd pool default size = 3
[client]
debug rbd = 20
debug rbd mirror = 20
debug rbd replay = 20
log file = /var/log/ceph/client_rbd.log
I do not get any logs in /var/log/ceph/client_rbd.log. I also tried executing 'ceph daemon osd.* config set debug_rbd 20', but there are no related logs in ceph-osd.log either.
How can I get useful logs for this issue, or how else can I analyze this problem? I look forward to your reply.
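In case it is relevant, my next idea was to give the client an admin socket so the debug level can be changed at runtime; roughly like this (untested sketch, the .asok filename below is only an example of what librbd would create):

# ceph.conf on the client node:
[client]
admin socket = /var/run/ceph/$cluster-$name.$pid.asok
debug rbd = 20
log file = /var/log/ceph/client_rbd.log

# then, against a running librbd client process:
ceph --admin-daemon /var/run/ceph/ceph-client.admin.12345.asok config set debug_rbd 20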
Thanks
ceph version 14.2.4 (75f4de193b3ea58512f204623e6c5a16e6c1e1ba) nautilus (stable)
os :CentOS Linux release 7.7.1908 (Core)
It is a single-node ceph cluster with 1 mon, 1 mgr, 1 mds, 1 rgw and 12 OSDs, but only CephFS is used.
ceph -s blocks after shutting down the machine (192.168.0.104); the IP address then changed to 192.168.1.6.
I created a new monmap with monmaptool, updated ceph.conf and the hosts file, and then started ceph-mon.
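For reference, this is roughly the procedure I followed to rewrite the monmap (from memory, so the exact flags may not be verbatim; ceph-node1 is the single mon):

# stop the mon, then extract its current monmap
ceph-mon -i ceph-node1 --extract-monmap /tmp/monmap
# inspect it and replace the old address with the new one
monmaptool --print /tmp/monmap
monmaptool --rm ceph-node1 /tmp/monmap
monmaptool --add ceph-node1 192.168.1.6:6789 /tmp/monmap
# inject the edited map and start the mon again
ceph-mon -i ceph-node1 --inject-monmap /tmp/monmap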
and the ceph-mon log:
...
2019-12-11 08:57:45.170 7f952cdac700 1 mon.ceph-node1@0(leader).mds e34 no beacon from mds.0.10 (gid: 4384 addr: [v2:192.168.0.104:6898/4084823750,v1:192.168.0.104:6899/4084823750] state: up:active) since 1285.14s
2019-12-11 08:57:50.170 7f952cdac700 1 mon.ceph-node1@0(leader).mds e34 no beacon from mds.0.10 (gid: 4384 addr: [v2:192.168.0.104:6898/4084823750,v1:192.168.0.104:6899/4084823750] state: up:active) since 1290.14s
2019-12-11 08:57:55.171 7f952cdac700 1 mon.ceph-node1@0(leader).mds e34 no beacon from mds.0.10 (gid: 4384 addr: [v2:192.168.0.104:6898/4084823750,v1:192.168.0.104:6899/4084823750] state: up:active) since 1295.14s
2019-12-11 08:58:00.171 7f952cdac700 1 mon.ceph-node1@0(leader).mds e34 no beacon from mds.0.10 (gid: 4384 addr: [v2:192.168.0.104:6898/4084823750,v1:192.168.0.104:6899/4084823750] state: up:active) since 1300.14s
2019-12-11 08:58:05.172 7f952cdac700 1 mon.ceph-node1@0(leader).mds e34 no beacon from mds.0.10 (gid: 4384 addr: [v2:192.168.0.104:6898/4084823750,v1:192.168.0.104:6899/4084823750] state: up:active) since 1305.14s
2019-12-11 08:58:10.171 7f952cdac700 1 mon.ceph-node1@0(leader).mds e34 no beacon from mds.0.10 (gid: 4384 addr: [v2:192.168.0.104:6898/4084823750,v1:192.168.0.104:6899/4084823750] state: up:active) since 1310.14s
2019-12-11 08:58:15.173 7f952cdac700 1 mon.ceph-node1@0(leader).mds e34 no beacon from mds.0.10 (gid: 4384 addr: [v2:192.168.0.104:6898/4084823750,v1:192.168.0.104:6899/4084823750] state: up:active) since 1315.14s
2019-12-11 08:58:20.173 7f952cdac700 1 mon.ceph-node1@0(leader).mds e34 no beacon from mds.0.10 (gid: 4384 addr: [v2:192.168.0.104:6898/4084823750,v1:192.168.0.104:6899/4084823750] state: up:active) since 1320.14s
2019-12-11 08:58:25.174 7f952cdac700 1 mon.ceph-node1@0(leader).mds e34 no beacon from mds.0.10 (gid: 4384 addr: [v2:192.168.0.104:6898/4084823750,v1:192.168.0.104:6899/4084823750] state: up:active) since 1325.14s
...
I changed the IP back to 192.168.0.104 yesterday, but the result is the same.
# cat /etc/ceph/ceph.conf
[client.libvirt]
admin socket = /var/run/ceph/$cluster-$type.$id.$pid.$cctid.asok # must be writable by QEMU and allowed by SELinux or AppArmor
log file = /var/log/ceph/qemu-guest-$pid.log # must be writable by QEMU and allowed by SELinux or AppArmor
[client.rgw.ceph-node1.rgw0]
host = ceph-node1
keyring = /var/lib/ceph/radosgw/ceph-rgw.ceph-node1.rgw0/keyring
log file = /var/log/ceph/ceph-rgw-ceph-node1.rgw0.log
rgw frontends = beast endpoint=192.168.1.6:8080
rgw thread pool size = 512
# Please do not change this file directly since it is managed by Ansible and will be overwritten
[global]
cluster network = 192.168.1.0/24
fsid = e384e8e6-94d5-4812-bfbb-d1b0468bdef5
mon host = [v2:192.168.1.6:3300,v1:192.168.1.6:6789]
mon initial members = ceph-node1
osd crush chooseleaf type = 0
osd pool default crush rule = -1
public network = 192.168.1.0/24
[osd]
osd memory target = 7870655146
Hey all,
We've been running some benchmarks against Ceph, which we deployed using the Rook operator in Kubernetes. Everything seemed to scale linearly up to a point where a single OSD started receiving much higher CPU load than the other OSDs (nearly 100% saturation). After some investigation we noticed a ton of pubsub traffic coming from the RGW pods in the strace, like so:
[pid 22561] sendmsg(77, {msg_name(0)=NULL, msg_iov(3)=[{"\21\2)\0\0\0\10\0:\1\0\0\10\0\0\0\0\0\10\0\0\0\0\0\0\20\0\0-\321\211K"..., 73}, {"\200\0\0\0pubsub.user.ceph-user-wwITOk"..., 314}, {"\0\303\34[\360\314\233\2138\377\377\377\377\377\377\377\377", 17}], msg_controllen=0, msg_flags=0}, MSG_NOSIGNAL|MSG_MORE <unfinished …>
I've checked the other OSDs, and only a single OSD receives these messages; I suspect it's creating a bottleneck. Does anyone have an idea why these are being generated or how to stop them? The pubsub sync module doesn't appear to be enabled, and our benchmark is only doing simple gets/puts/deletes.
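In case it helps, this is roughly how we've been trying to narrow down what the hot OSD is serving (the OSD id 12 below is just a placeholder for the busy one):

# confirm which OSD is the outlier in terms of usage
ceph osd df tree
# list the PGs (and thus pools) hosted on the busy OSD
ceph pg ls-by-osd 12 | head -20
# map pool ids back to names, e.g. the RGW log/metadata pools
ceph df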
We’re running Ceph 14.2.5 nautilus
Thank you!