Hi,
We have a cluster where we mix HDDs and NVMe drives using device classes,
with a specific CRUSH rule for each class.
One of our NVMe drives physically died which caused some of our PGs to
go into this state:
pg 26.ac is stuck undersized for 60830.991784, current state activating+undersized+degraded+remapped, last acting [353,373,368,377,2147483647,350]
pg 26.d1 is stuck undersized for 60830.587711, current state activating+undersized+degraded+remapped, last acting [343,2147483647,347,358,366,355]
pg 26.e1 is stuck undersized for 60830.980585, current state activating+undersized+degraded+remapped, last acting [340,349,370,2147483647,360,376]
... and so on.
Recovery never happened and we had to manually restart all affected OSDs
for all PGs stuck in such a state.
The 2^31-1 in there seems to indicate an overflow somewhere. The way we
figured out where exactly was to query the PG and compare the "up" and
"acting" sets: only _one_ of them had the 2^31-1 value in place of the
correct OSD number. We restarted that OSD and the PG started doing its job
and recovered.
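In case it helps anyone else, the check looked roughly like this (the PG id
and OSD numbers below are only illustrative):

$ ceph pg 26.ac query | jq '{up: .up, acting: .acting}'
{
  "up": [353, 373, 368, 377, 365, 350],
  "acting": [353, 373, 368, 377, 2147483647, 350]
}

Whichever set still showed a real OSD id in that slot pointed us at the
daemon to restart.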
The issue seems to go back to 2015:
http://lists.ceph.com/pipermail/ceph-users-ceph.com/2015-May/001661.html
but no solution was ever posted...
I'm more concerned about the cluster not being able to recover (it's a
4+2 EC pool across 12 hosts - plenty of room
to heal) than about the weird print-out.
The VMs that tried to access data in any of the affected PGs of course
died.
Are we missing some settings to let the cluster self-heal even for EC
pools? First EC pool in production :)
Cheers,
Zoltan
Hi,
I have a rook-provisioned cluster to be used for RBDs only. I have 2 pools
named replicated-metadata-pool and ec-data-pool. EC parameters are 6+3.
I've been writing some data to this cluster for some time and noticed that
the reported usage is not what I was expecting.
# ceph df
RAW STORAGE:
    CLASS     SIZE        AVAIL       USED        RAW USED     %RAW USED
    hdd       5.4 PiB     4.3 PiB     1.2 PiB     1.2 PiB          21.77
    TOTAL     5.4 PiB     4.3 PiB     1.2 PiB     1.2 PiB          21.77
POOLS:
    POOL                         ID     STORED      OBJECTS     USED        %USED     MAX AVAIL
    replicated-metadata-pool      1     90 KiB          408     38 MiB          0       1.2 PiB
    ec-data-pool                  2     722 TiB     191.64M     1.2 PiB     25.04       2.4 PiB
Since these numbers are rounded a bit too much, I generally use prometheus
metrics on mgr, which are as follows:
ceph_pool_stored: 793,746 G for ec-data-pool and 92323 for replicated-metadata-pool
ceph_pool_stored_raw: 1,190,865 G for ec-data-pool and 99213 for replicated-metadata-pool
ceph_cluster_total_used_bytes: 1,329,374 G
ceph_cluster_total_used_raw_bytes: 1,333,013 G
sum(ceph_bluefs_db_used_bytes): 3,638 G
So ceph_pool_stored for the EC pool is a bit higher than the total used
space of the formatted RBDs. I think that's because of the sparse nature
and deleted blocks not being fstrimmed yet. That's OK.
ceph_pool_stored_raw is almost exactly 1.5x ceph_pool_stored which is what
I'd expect considering EC parameters of 6+3.
What I can't find is the 138,509 G difference between
ceph_cluster_total_used_bytes and ceph_pool_stored_raw. This is not static,
BTW; checking the same data historically shows we consistently have about
1.12x of what we expect. That seems to turn our nominal 1.5x EC overhead
into a 1.68x overhead in reality. Does anyone have any ideas why this is
the case?
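For clarity, here is the arithmetic (all figures in G, taken from the
metrics above):

1,190,865 / 793,746   = ~1.50    (stored_raw vs. stored, matching the (6+3)/6 EC factor)
1,329,374 - 1,190,865 = 138,509  (used vs. stored_raw)
1,329,374 / 1,190,865 = ~1.12    (used vs. stored_raw)
1,329,374 / 793,746   = ~1.68    (used vs. stored, i.e. the effective overhead)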
We also have the ceph_cluster_total_used_raw_bytes metric, which I believe
should be close to data + metadata; that is why I included
sum(ceph_bluefs_db_used_bytes) above. Is that correct?
Best,
--
erdem agaoglu
Hi all,
I seem to be running into an issue when attempting to unlink a bucket from
a user; this is my output:
user@server ~ $ radosgw-admin bucket unlink --bucket=user_5493/LF-Store --uid=user_5493
failure: 2019-11-26 15:19:48.689 7fda1c2009c0 0 bucket entry point user mismatch, can't unlink bucket: user_5493$BRTC != user_5493
(22) Invalid argument
user@server ~ $
I did some searching around, and no one seems to have seen this before. Any
ideas?
Thanks,
Mac
If I do an fstrim /mount/fs on an xfs filesystem that sits directly on an
RBD device, I can see space being freed instantly with e.g. rbd du.
However, when there is LVM in between, it looks like the space is not
freed. I already enabled issue_discards = 1 in lvm.conf, but as the comment
there says, that probably only applies to lvremove.
Is it possible to get fstrim working with LVM?
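In case it's relevant, the only check I know of is whether discards are
advertised through the device-mapper layer at all, e.g. (device name is
just an example):

$ lsblk --discard /dev/rbd0

Non-zero DISC-GRAN/DISC-MAX values for both the rbd device and the LV on
top of it should mean discards can at least be passed down to RBD.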
Hi,
Regarding your response:
"You should use not more 1Gb for WAL and 30Gb for RocksDB. Numbers != 3, 30, 300 (Gb) for block.db is useless."
Do you mean the block.db size should be 3, 30, or 300 GB and nothing else?
If so, why not?
Thanks,
Frank
Hi,
Recently I have been trying to mount CephFS as a non-privileged user via
ceph-fuse, but it always fails. I looked at the code and found that
ceph-fuse performs a remount operation as part of the mount: the remount
executes the 'mount -i -o remount {mountpoint}' command, and that is what
causes the mount to fail.
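For reference, the kind of invocation I am attempting is roughly the
following, run as a regular (non-root) shell user; the client name, monitor
address and mount point are just examples:

$ ceph-fuse --id myuser -m mon1:6789 /home/myuser/cephfs

It gets as far as the remount step described above and then fails.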
How can I mount CephFS via ceph-fuse as a non-privileged user?
Thanks.
Hi,
Just starting to use CephFS.
I would like to know the impact of having one single CephFS mount versus
having several.
If I have several subdirectories in my CephFS that should be accessible to
different users, with each user needing access to a different set of them,
is it important for me to try to predefine these sets so as to minimize the
number of mounts each user needs? Or can I treat each subdirectory as an
independent entity and simply mount whichever ones each user needs, even if
that increases the number of different mounts per user?
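To make the second option concrete, I mean something like this (monitor
address, user name and paths are just examples):

$ sudo mount -t ceph mon1:6789:/projects/a /mnt/a -o name=usera,secretfile=/etc/ceph/usera.secret
$ sudo mount -t ceph mon1:6789:/projects/b /mnt/b -o name=usera,secretfile=/etc/ceph/usera.secret

i.e. one mount per subdirectory the user needs, instead of a single mount
of a common parent directory.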
Regards,
Rodrigo Severo
I have a question about ceph cache pools as documented on this page:
https://docs.ceph.com/docs/nautilus/dev/cache-pool/
Is the cache pool feature still considered a good idea? Reading some of
the mailing list archives, I find discussion suggesting that this kind of
caching is no longer recommended, at least as of Nautilus. Is that correct?
What is my use case? We are using CephFS and have a large CephFS data pool
on HDDs, with some NVMe drives for the BlueStore WAL and DB. Total storage
is 2.1 PB, with replication = 3. There is also a separate metadata pool, as
required by CephFS.
We have a spare storage server with about 14 TB of NVME drives. Would it
be worthwhile to setup an NVMe cache pool for the main cephfs pool?
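To be explicit, what I have in mind is roughly the setup described on that
page, i.e. something like the following (pool names are placeholders):

ceph osd tier add cephfs_data nvme_cache
ceph osd tier cache-mode nvme_cache writeback
ceph osd tier set-overlay cephfs_data nvme_cache
ceph osd pool set nvme_cache hit_set_type bloom

with the NVMe-backed nvme_cache pool sitting in front of the main CephFS
data pool.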
Sincerely,
Shawn Kwang
--
Associate Scientist
Center for Gravitation, Cosmology, and Astrophysics
University of Wisconsin-Milwaukee
office: +1 414 229 4960
kwangs@uwm.edu
Hi,
I'm just deploying a CephFS service.
I would like to know the expected differences between a FUSE and a kernel mount.
Why the 2 options? When should I use one and when should I use the other?
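Just so we are talking about the same thing, by the two options I mean
roughly the following (monitor address, credentials and paths are only
examples):

# kernel client
$ sudo mount -t ceph mon1:6789:/ /mnt/cephfs -o name=myuser,secretfile=/etc/ceph/myuser.secret

# FUSE client
$ sudo ceph-fuse -n client.myuser /mnt/cephfs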
Regards,
Rodrigo Severo