Hi,
We have a cluster where we mix HDDs and NVMe drives using device classes,
with a specific CRUSH rule for each class.
One of our NVMe drives physically died which caused some of our PGs to
go into this state:
pg 26.ac is stuck undersized for 60830.991784, current state activating+undersized+degraded+remapped, last acting [353,373,368,377,2147483647,350]
pg 26.d1 is stuck undersized for 60830.587711, current state activating+undersized+degraded+remapped, last acting [343,2147483647,347,358,366,355]
pg 26.e1 is stuck undersized for 60830.980585, current state activating+undersized+degraded+remapped, last acting [340,349,370,2147483647,360,376]
... and so on.
Recovery never happened and we had to manually restart all affected OSDs
for all PGs stuck in such a state.
The 2^31-1 in there seems to indicate an overflow somewhere. The way we
figured out where exactly was to query the PG and compare the "up" and
"acting" sets: only _one_ of them had the 2^31-1 value in place of the
correct OSD number. We restarted that OSD and the PG started doing its job
and recovered.
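In case it helps anyone else, the check looked roughly like this (the PG id
and OSD numbers below are only illustrative):

$ ceph pg 26.ac query | jq '{up: .up, acting: .acting}'
{
  "up": [353, 373, 368, 377, 365, 350],
  "acting": [353, 373, 368, 377, 2147483647, 350]
}

Whichever set still showed a real OSD id in that slot pointed us at the
daemon to restart.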
The issue seems to go back to 2015:
http://lists.ceph.com/pipermail/ceph-users-ceph.com/2015-May/001661.html
but no solution was ever posted...
I'm more concerned about the cluster not being able to recover (it's a
4+2 EC pool across 12 hosts - plenty of room
to heal) than about the weird print-out.
The VMs that tried to access data in any of the affected PGs of course
died.
Are we missing some settings to let the cluster self-heal even for EC
pools? First EC pool in production :)
Cheers,
Zoltan
Hi,
I have a rook-provisioned cluster to be used for RBDs only. I have 2 pools
named replicated-metadata-pool and ec-data-pool. EC parameters are 6+3.
I've been writing some data to this cluster for some time and noticed that
the reported usage is not what I was expecting.
# ceph df
RAW STORAGE:
    CLASS     SIZE        AVAIL       USED        RAW USED     %RAW USED
    hdd       5.4 PiB     4.3 PiB     1.2 PiB     1.2 PiB          21.77
    TOTAL     5.4 PiB     4.3 PiB     1.2 PiB     1.2 PiB          21.77
POOLS:
    POOL                         ID     STORED      OBJECTS     USED        %USED     MAX AVAIL
    replicated-metadata-pool      1     90 KiB          408     38 MiB          0       1.2 PiB
    ec-data-pool                  2     722 TiB     191.64M     1.2 PiB     25.04       2.4 PiB
Since these numbers are rounded a bit too much, I generally use prometheus
metrics on mgr, which are as follows:
ceph_pool_stored: 793,746 G for ec-data-pool and 92323 for replicated-metadata-pool
ceph_pool_stored_raw: 1,190,865 G for ec-data-pool and 99213 for replicated-metadata-pool
ceph_cluster_total_used_bytes: 1,329,374 G
ceph_cluster_total_used_raw_bytes: 1,333,013 G
sum(ceph_bluefs_db_used_bytes): 3,638 G
So ceph_pool_stored for the EC pool is a bit higher than the total used
space of the formatted RBDs. I think that's because of the sparse nature
and deleted blocks not being fstrimmed yet. That's OK.
ceph_pool_stored_raw is almost exactly 1.5x ceph_pool_stored which is what
I'd expect considering EC parameters of 6+3.
What I can't find is the 138,509 G difference between
ceph_cluster_total_used_bytes and ceph_pool_stored_raw. This is not static,
BTW; checking the same data historically shows we consistently have about
1.12x of what we expect. That seems to turn our nominal 1.5x EC overhead
into a 1.68x overhead in reality. Does anyone have any ideas why this is
the case?
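For clarity, here is the arithmetic (all figures in G, taken from the
metrics above):

1,190,865 / 793,746   = ~1.50    (stored_raw vs. stored, matching the (6+3)/6 EC factor)
1,329,374 - 1,190,865 = 138,509  (used vs. stored_raw)
1,329,374 / 1,190,865 = ~1.12    (used vs. stored_raw)
1,329,374 / 793,746   = ~1.68    (used vs. stored, i.e. the effective overhead)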
We also have the ceph_cluster_total_used_raw_bytes metric, which I believe
should be close to data + metadata; that is why I included
sum(ceph_bluefs_db_used_bytes) above. Is that correct?
Best,
--
erdem agaoglu
Hi all,
I seem to be running into an issue when attempting to unlink a bucket from
a user; this is my output:
user@server ~ $ radosgw-admin bucket unlink --bucket=user_5493/LF-Store --uid=user_5493
failure: 2019-11-26 15:19:48.689 7fda1c2009c0 0 bucket entry point user mismatch, can't unlink bucket: user_5493$BRTC != user_5493
(22) Invalid argument
user@server ~ $
I did some searching around, and no one seems to have seen this before. Any
ideas?
Thanks,
Mac
If I do an fstrim /mount/fs on an xfs filesystem that sits directly on an
RBD device, I can see space being freed instantly with e.g. rbd du.
However, when there is LVM in between, it looks like the space is not
freed. I already enabled issue_discards = 1 in lvm.conf, but as the comment
there says, that probably only applies to lvremove.
Is it possible to get fstrim working with LVM?
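In case it's relevant, the only check I know of is whether discards are
advertised through the device-mapper layer at all, e.g. (device name is
just an example):

$ lsblk --discard /dev/rbd0

Non-zero DISC-GRAN/DISC-MAX values for both the rbd device and the LV on
top of it should mean discards can at least be passed down to RBD.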
Hi,
Regarding your response:
"You should use not more 1Gb for WAL and 30Gb for RocksDB. Numbers != 3, 30, 300 (Gb) for block.db is useless."
Do you mean the block.db size should be 3, 30, or 300 GB and nothing else?
If so, why not?
Thanks,
Frank
Hi,
Recently I have been trying to mount CephFS as a non-privileged user via
ceph-fuse, but it always fails. I looked at the code and found that
ceph-fuse performs a remount operation as part of the mount: the remount
executes the 'mount -i -o remount {mountpoint}' command, and that is what
causes the mount to fail.
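For reference, the kind of invocation I am attempting is roughly the
following, run as a regular (non-root) shell user; the client name, monitor
address and mount point are just examples:

$ ceph-fuse --id myuser -m mon1:6789 /home/myuser/cephfs

It gets as far as the remount step described above and then fails.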
How can I mount CephFS via ceph-fuse as a non-privileged user?
Thanks.
Hi,
Just starting to use CephFS.
I would like to know the impact of having one single CephFS mount versus
having several.
If I have several subdirectories in my CephFS that should be accessible to
different users, with each user needing access to a different set of them,
is it important for me to try to predefine these sets so as to minimize the
number of mounts each user needs? Or can I treat each subdirectory as an
independent entity and simply mount whichever ones each user needs, even if
that increases the number of different mounts per user?
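To make the second option concrete, I mean something like this (monitor
address, user name and paths are just examples):

$ sudo mount -t ceph mon1:6789:/projects/a /mnt/a -o name=usera,secretfile=/etc/ceph/usera.secret
$ sudo mount -t ceph mon1:6789:/projects/b /mnt/b -o name=usera,secretfile=/etc/ceph/usera.secret

i.e. one mount per subdirectory the user needs, instead of a single mount
of a common parent directory.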
Regards,
Rodrigo Severo
I have a question about ceph cache pools as documented on this page:
https://docs.ceph.com/docs/nautilus/dev/cache-pool/
Is the cache pool feature still considered a good idea? Reading some of
the mailing list archives, I find discussion suggesting that this kind of
caching is no longer recommended, at least as of Nautilus. Is that correct?
What is my use case? We are using CephFS and have a large CephFS data pool
on HDDs, with some NVMe drives for the BlueStore WAL and DB. Total storage
is 2.1 PB, with replication = 3. There is also a separate metadata pool, as
required by CephFS.
We have a spare storage server with about 14 TB of NVME drives. Would it
be worthwhile to setup an NVMe cache pool for the main cephfs pool?
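To be explicit, what I have in mind is roughly the setup described on that
page, i.e. something like the following (pool names are placeholders):

ceph osd tier add cephfs_data nvme_cache
ceph osd tier cache-mode nvme_cache writeback
ceph osd tier set-overlay cephfs_data nvme_cache
ceph osd pool set nvme_cache hit_set_type bloom

with the NVMe-backed nvme_cache pool sitting in front of the main CephFS
data pool.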
Sincerely,
Shawn Kwang
--
Associate Scientist
Center for Gravitation, Cosmology, and Astrophysics
University of Wisconsin-Milwaukee
office: +1 414 229 4960
kwangs@uwm.edu
Hi,
I'm just deploying a CephFS service.
I would like to know the expected differences between a FUSE and a kernel mount.
Why the 2 options? When should I use one and when should I use the other?
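Just so we are talking about the same thing, by the two options I mean
roughly the following (monitor address, credentials and paths are only
examples):

# kernel client
$ sudo mount -t ceph mon1:6789:/ /mnt/cephfs -o name=myuser,secretfile=/etc/ceph/myuser.secret

# FUSE client
$ sudo ceph-fuse -n client.myuser /mnt/cephfs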
Regards,
Rodrigo Severo