The ceph balancer sets upmap items which violate my CRUSH rule
the rule:
rule cslivebapfirst {
id 0
type replicated
min_size 2
max_size 4
step take csliveeubap-u01dc
step chooseleaf firstn 2 type room
step emit
step take csliveeubs-u01dc
step chooseleaf firstn 2 type room
step emit
}
My intention is that the first two replicas are stored in the
datacenter "csliveeubap-u01dc" and the next two replicas in the
datacenter "csliveeubs-u01dc".
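For reference, the rule can be exercised offline against the compiled
crushmap to see which OSDs it selects (a sketch; file paths are
placeholders):
ceph osd getcrushmap -o /tmp/cm
crushtool -d /tmp/cm -o /tmp/cm.txt
crushtool -i /tmp/cm --test --rule 0 --num-rep 4 --show-mappings | head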
The cluster has 49152 PGs and 665 of them have at least 3 replicas in
one datacenter, which is not expected!
One example is PG 3.96e.
The acting OSDs are in this order:
504 -> DC: csliveeubap-u01dc, room: csliveeubap-u01r03
1968 -> DC: csliveeubap-u01dc, room: csliveeubap-u01r01
420 -> DC: csliveeubap-u01dc, room: csliveeubap-u01r02
1945 -> DC: csliveeubs-u01dc, room: csliveeubs-u01r01
This PG has one upmap item:
ceph osd dump | grep 3.96e
3.96e pg_upmap_items 3.96e [2013,420]
OSD 2013 is in the DC: csliveeubs-u01dc
I checked this by hand with ceph osd pg-upmap-items.
If I try to map two replicas into one room, I get an appropriate error
in the mon log and nothing happens. But mapping a replica into the
other DC unfortunately worked.
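As a workaround, the offending entry can be removed by hand (a sketch;
the balancer may of course re-create it):
ceph osd rm-pg-upmap-items 3.96e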
I would suggest this is an ugly bug. What do you think?
ceph version 14.2.11 (f7fdb2f52131f54b891a2ec99d8205561242cdaf)
nautilus (stable)
Manuel
Hello,
Forwarding a security note that was shared with the OpenStack
community here for your awareness. This concerns a security
vulnerability that has now been addressed. I'd like to thank Ceph
contributors: Patrick Donnelly, Kotresh Hiremath Ravishankar and
Ramana Raja for their help in addressing this issue. Please find
information regarding patches and releases in the security note below.
Thanks,
Goutham
Ceph user credential leakage to consumers of OpenStack Manila
-------------------------------------------------------------
### Summary ###
OpenStack Manila users can request access on a share to any
arbitrary cephx user, including privileged pre-existing users
of a Ceph cluster. They can then retrieve access secret keys
for these pre-existing ceph users via Manila APIs. A cephx
client user name and access secret key are required to mount
a Native CephFS manila share. With a secret key, a manila user
can impersonate a pre-existing ceph user and gain capabilities
to manipulate resources that the manila user was never intended
to have access to. It is even possible to obtain the default
ceph "admin" user's key in this manner, and execute any command
as the ceph administrator.
### Affected Services / Software ###
- OpenStack Shared File Systems Service (Manila) versions Mitaka (2.0.0)
through Victoria (11.0.0)
- Ceph Luminous (<=v12.2.13), Mimic (<=v13.2.10),
Nautilus (<=v14.2.15), Octopus (<=v15.2.7)
### Discussion ###
OpenStack Manila can provide users with Native CephFS shared
file systems. When a user creates a "share" (short for
"shared file system") via Manila, a CephFS "subvolume" is
created on the Ceph cluster and exported. After creating
their share, a user can specify who can have access to the
share with the help of "cephx" client user names. A cephx
client corresponds to Ceph Client Users [2]. When access
is provided, a client user "access key" is returned via
manila.
A ceph client user account is required to access any ceph
resource. This includes interacting with Ceph cluster
infrastructure daemons (ceph-mgr, ceph-mds, ceph-mon, ceph-osd)
or consuming Ceph storage via RBD, RGW or CephFS. Deployment and
orchestration services like ceph-ansible, nfs-ganesha, kolla,
tripleo need ceph client users to work, as do OpenStack services
such as cinder, manila, glance and nova for their own interactions
with Ceph. For the purpose of illustrating this vulnerability,
we'll call them "pre-existing" users of the Ceph cluster. Another
example of a pre-existing user includes the "admin" user that
is created by default on the ceph cluster.
In theory, manila's cephx users are no different from any other
ceph client user. When a manila user requests access to a share,
a corresponding ceph user account is created if one does not
already exist. If a ceph user account already exists, the
existing capabilities of that user are adjusted to provide
them permissions to access the manila share in question.
There is no reasonable way for this mechanism to know what
pre-existing ceph client users must be protected against
unauthorized abuse. Therefore there is a risk that a
manila user can claim to be a pre-existing ceph user to
steal their access secret key.
To resolve this issue, the ceph interface that manila uses
was patched to no longer allow manila to claim a pre-existing
user account that it did not create. As a consequence, manila
users cannot use cephx usernames that correspond to ceph client
users that exist outside of manila.
### Recommended Actions ###
1. Upgrade your ceph software to the latest patched releases of
ceph to take advantage of the fix for this vulnerability.
2. Audit cephx access keys provisioned via manila. You may use
"ceph auth ls" to ensure that no clients have been compromised
(see the sketch after this list). If they have been, you may need
to delete and recreate the client credentials to prevent
unauthorized access.
3. The audit can also be performed on manila by enumerating all
CephFS shares and their access rules as a system administrator. If a
reserved ceph client username has been used, you may deny access
and recreate the client credential on ceph to refresh the
access secret.
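As an illustration of the audit in step 2 (a minimal sketch; the client
name and caps are hypothetical):
ceph auth ls                    # review all provisioned cephx clients and their caps
ceph auth del client.suspect    # revoke a credential that was claimed via manila
ceph auth get-or-create client.suspect \
    mon 'allow r' mds 'allow rw path=/volumes/share1' osd 'allow rw pool=cephfs_data'
                                # recreate it with a fresh key and scoped caps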
No code changes were necessary in the OpenStack Shared File
System service (manila). With an upgraded ceph, when manila
users attempt to provide share access to a cephx username
that they cannot use, the access rule's "state" attribute is
set to "error" because this operation is no longer permitted.
### Patches ###
The Ceph community has provided the following patches:
Ceph Octopus: https://github.com/ceph/ceph/commit/1b8a634fdcd94dfb3ba650793fb1b6d09af65e05
Ceph Nautilus: https://github.com/ceph/ceph/commit/7e3e4e73783a98bb07ab399438eb3aab41a6fc8b
Ceph Luminous: https://github.com/ceph/ceph/commit/956ceb853a58f6b6847b31fac34f2f0228a70579
The fixes are in the latest releases of Ceph Nautilus (14.2.16) and Ceph
Octopus (15.2.8). The patch for Luminous was provided as a courtesy to possible
users of OpenStack Manila; however, the Ceph community no longer produces
releases for Luminous or Mimic as they are end of life. See
https://docs.ceph.com/en/latest/releases/general/ for information about
Ceph releases.
### Contacts / References ###
Author:
- Pacha Ravi, Goutham gouthamr(a)redhat.com (Red Hat)
Credits:
- Garbutt, John john(a)johngarbutt.com (StackHPC)
- Babel, Jahson jahson.babel(a)cc.in2p3.fr (Centre de Calcul de l'IN2P3)
This OSSN : https://wiki.openstack.org/wiki/OSSN/OSSN-0087
Original LaunchPad Bug : https://launchpad.net/bugs/1904015
Mailing List : [Security] tag on openstack-discuss(a)lists.openstack.org
OpenStack Security Project : https://launchpad.net/~openstack-ossg
CVE: CVE-2020-27781
Hi.
When I deploy an OSD with a separate DB block device, it gives me a
"Permission denied" on its path! I have no idea why, but the only change
compared to my previous deployments is that I changed
osd_crush_initial_weight from 0 to 1. When I restart the host, the OSD
comes up without any errors. I tested another deployment with
osd_crush_initial_weight=0 and it deployed successfully, and again, when
I deployed with osd_crush_initial_weight=1 on another new node, it gave
me "Permission denied"!
Here is the full trace: https://paste.ubuntu.com/p/6HDzwSrK3p/
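One thing I plan to compare between the working and failing nodes (a
hypothetical diagnostic sketch, not taken from the trace; the device
path is a placeholder) is the ownership of the DB device:
ls -lL /var/lib/ceph/osd/ceph-*/block.db    # resolve the symlink to the db device
# if the device node is owned by root rather than ceph, the OSD cannot open it:
chown ceph:ceph /dev/sdX1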
Hello,
While working with a customer, I went through the output of "ceph auth
list" and found a client entry that nobody can explain the purpose of.
There is a strong suspicion that it is an unused leftover from old
times, but again, nobody is sure.
How can I confirm that it was not used for, say, the past week? Or,
what logs should I turn on so that if it is used during the next week,
it is mentioned there?
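One idea I had (a sketch; I am not sure it catches every access path) is
to raise the auth debug level on the monitors and then grep their logs
for the entity name (the name below is a placeholder):
ceph tell mon.\* injectargs '--debug_auth 20'
grep client.mystery /var/log/ceph/ceph-mon.*.log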
--
Alexander E. Patrakov
CV: http://pc.cd/PLz7
Hi,
I'm running a 15.2.4 test cluster in a rook-ceph environment. The cluster reports HEALTH_OK, but it seems to be stuck removing an image. Here is the last section of the 'ceph status' output:
progress:
Removing image replicapool/43def5e07bf47 from trash (6h)
[............................] (remaining: 32y)
This has now been going on for a couple of weeks, and I was wondering if there is a way to speed it up? The cluster doesn't seem to be doing much, judging from the system load.
I created this largish image to test what is possible with the setup, but how do I get it out of the trash now?
# rbd info --image-id 43def5e07bf47 -p replicapool
rbd image 'csi-vol-cfaa1b00-1711-11eb-b9c9-2aa51e1e24e5':
size 1 EiB in 274877906944 objects
order 22 (4 MiB objects)
snapshot_count: 0
id: 43def5e07bf47
block_name_prefix: rbd_data.43def5e07bf47
format: 2
features: layering
op_features:
flags:
create_timestamp: Sun Oct 25 22:31:23 2020
access_timestamp: Sun Oct 25 22:31:23 2020
modify_timestamp: Sun Oct 25 22:31:23 2020
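For reference, the image still shows up in the trash, and removal can be
retried directly from the CLI (a sketch; I am not sure whether this
behaves any differently from what rook already triggered):
rbd trash ls replicapool
rbd trash rm --pool replicapool 43def5e07bf47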
Any pointers on how to resolve this issue are much appreciated.
Regards
Andre
Hi all,
I have a 14.2.15 cluster with all SATA OSDs. Now we plan to add SSDs to
the cluster for DB/WAL usage. I checked the docs and found that the
'ceph-bluestore-tool' command can handle this.
I added a DB/WAL to the OSD in my test environment, but in the end I
still get the warning message:
"osd.0 spilled over 64 KiB metadata from 'db' device (7 MiB used of 8.0
GiB) to slow device"
My procedure (sdd is the new disk for DB/WAL):
sgdisk --new=1:0:+8GB --change-name=1:bluestore_block_db_0 \
    --partition-guid=1:$(uuidgen) --mbrtogpt -- /dev/sdd
sgdisk --new=2:0:+1GB --change-name=2:bluestore_block_wal_0 \
    --partition-guid=2:$(uuidgen) --mbrtogpt -- /dev/sdd
systemctl stop ceph-osd@0
CEPH_ARGS="--bluestore-block-db-size 8589934592" ceph-bluestore-tool \
    bluefs-bdev-new-db --path /var/lib/ceph/osd/ceph-0 --dev-target /dev/sdd1
ceph-bluestore-tool bluefs-bdev-new-wal --path /var/lib/ceph/osd/ceph-0/ \
    --dev-target /dev/sdd2
ceph-bluestore-tool bluefs-bdev-expand --path /var/lib/ceph/osd/ceph-0/
systemctl start ceph-osd@0
ceph tell osd.0 compact
The warning message says that some metadata is still on the slow
device.
How can I deal with this issue?
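One thing I am considering (a sketch based on the ceph-bluestore-tool
docs; untested on my side) is migrating the BlueFS data that already
landed on the slow device over to the new DB device:
systemctl stop ceph-osd@0
ceph-bluestore-tool bluefs-bdev-migrate \
    --path /var/lib/ceph/osd/ceph-0 \
    --devs-source /var/lib/ceph/osd/ceph-0/block \
    --dev-target /var/lib/ceph/osd/ceph-0/block.db
systemctl start ceph-osd@0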
Thanks
Hi,
Could you tell me whether read I/O is accepted when the number of
replicas is below a pool's min_size?
I read the official documentation and found that the pools document and
the pool configuration document describe the effect of a pool's
min_size differently.
The pools document:
https://docs.ceph.com/en/latest/rados/operations/pools/
> To set a minimum number of required replicas for I/O, you should use the
> min_size setting.
The pool configuration document:
https://docs.ceph.com/en/latest/rados/configuration/pool-pg-config-ref/
> osd pool default min size
> Description: Sets the minimum number of written replicas for objects in
> the pool in order to acknowledge a write operation to the client.
The former says that min_size affects both read and write I/O.
On the other hand, the latter says it is only about write I/O. So which
description is correct?
Based on my verification, the former looks correct.
I created an RGW object in a pool with 3 replicas and a min_size of 2.
Neither reads nor writes are accepted when 2 replicas are gone.
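The behaviour can also be reproduced with plain RADOS (a sketch; pool
and object names are placeholders):
ceph osd pool create testpool 32
ceph osd pool set testpool size 3
ceph osd pool set testpool min_size 2
rados -p testpool put obj1 /etc/hosts
# after stopping two of the three OSDs in obj1's acting set:
rados -p testpool get obj1 -    # blocks: reads are refused below min_size, too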
Thanks,
Satoru
Over the weekend I had multiple OSD servers in my Octopus cluster
(15.2.4) crash and reboot at nearly the same time. The OSDs are part of
an erasure coded pool. At the time the cluster had been busy with a
long-running (~week) remapping of a large number of PGs after I
incrementally added more OSDs to the cluster. After bringing all of the
OSDs back up, I have 25 unfound objects and 75 degraded objects. There
are other problems reported, but I'm primarily concerned with these
unfound/degraded objects.
The pool with the missing objects is a cephfs pool. The files stored in
the pool are backed up on tape, so I can easily restore individual files
as needed (though I would not want to restore the entire filesystem).
I tried following the guide at
https://docs.ceph.com/docs/octopus/rados/troubleshooting/troubleshooting-pg….
I found a number of OSDs that are still 'not queried'. Restarting a
sampling of these OSDs changed the state from 'not queried' to 'already
probed', but that did not recover any of the unfound or degraded objects.
I have also tried 'ceph pg deep-scrub' on the affected PGs, but never
saw them get scrubbed. I also tried doing a 'ceph pg force-recovery' on
the affected PGs, but only one seems to have been tagged accordingly
(see ceph -s output below).
The guide also says "Sometimes it simply takes some time for the cluster
to query possible locations." I'm not sure how long "some time" might
take, but it hasn't changed after several hours.
My questions are:
* Is there a way to force the cluster to query the possible locations
sooner?
* Is it possible to identify the files in cephfs that are affected, so
that I could delete only the affected files and restore them from backup
tapes? (A possible approach is sketched below.)
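For the second question, the approach I have in mind (an untested
sketch; it assumes the standard cephfs data-object naming of
<inode-hex>.<block-number>) would be:
ceph pg 7.1f list_unfound    # pg id is a placeholder; prints the unfound object names
# an object named 10000000abc.00000000 belongs to inode 0x10000000abc;
# locate the corresponding file on a mounted cephfs:
find /mnt/cephfs -inum $((16#10000000abc))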
--Mike
ceph -s:
cluster:
id: 066f558c-6789-4a93-aaf1-5af1ba01a3ad
health: HEALTH_ERR
1 clients failing to respond to capability release
1 MDSs report slow requests
25/78520351 objects unfound (0.000%)
2 nearfull osd(s)
Reduced data availability: 1 pg inactive
Possible data damage: 9 pgs recovery_unfound
Degraded data redundancy: 75/626645098 objects degraded
(0.000%), 9 pgs degraded
1013 pgs not deep-scrubbed in time
1013 pgs not scrubbed in time
2 pool(s) nearfull
1 daemons have recently crashed
4 slow ops, oldest one blocked for 77939 sec, daemons
[osd.0,osd.41] have slow ops.
services:
mon: 4 daemons, quorum ceph1,ceph2,ceph3,ceph4 (age 9d)
mgr: ceph3(active, since 11d), standbys: ceph2, ceph4, ceph1
mds: archive:1 {0=ceph4=up:active} 3 up:standby
osd: 121 osds: 121 up (since 6m), 121 in (since 101m); 4 remapped pgs
task status:
scrub status:
mds.ceph4: idle
data:
pools: 9 pools, 2433 pgs
objects: 78.52M objects, 298 TiB
usage: 412 TiB used, 545 TiB / 956 TiB avail
pgs: 0.041% pgs unknown
75/626645098 objects degraded (0.000%)
135224/626645098 objects misplaced (0.022%)
25/78520351 objects unfound (0.000%)
2421 active+clean
5 active+recovery_unfound+degraded
3 active+recovery_unfound+degraded+remapped
2 active+clean+scrubbing+deep
1 unknown
1 active+forced_recovery+recovery_unfound+degraded
progress:
PG autoscaler decreasing pool 7 PGs from 1024 to 512 (5d)
[............................]
Hello,
Over the last week I have tried optimising the performance of our MDS
nodes for the large amount of files and concurrent clients we have. It
turns out that despite various stability fixes in recent releases, the
default configuration still doesn't appear to be optimal for keeping the
cache size under control and avoid intermittent I/O blocks.
Unfortunately, it is very hard to tweak the configuration to something
that works, because the tuning parameters needed are largely
undocumented or only described in very technical terms in the source
code making them quite unapproachable for administrators not familiar
with all the CephFS internals. I would therefore like to ask if it were
possible to document the "advanced" MDS settings more clearly as to what
they do and in what direction they have to be tuned for more or less
aggressive cap recall, for instance (sometimes it is not clear if a
threshold is a min or a max threshold).
I am in the very (un)fortunate situation of having folders with several
hundred thousand direct subfolders or files (and one extreme case with
almost 7 million dentries), which makes for a pretty good benchmark for
measuring cap growth while performing operations on them. For the time
being, I came up with this configuration, which seems to work for me
but is still far from optimal:
mds basic mds_cache_memory_limit 10737418240
mds advanced mds_cache_trim_threshold 131072
mds advanced mds_max_caps_per_client 500000
mds advanced mds_recall_max_caps 17408
mds advanced mds_recall_max_decay_rate 2.000000
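For reference, these can be applied at runtime with "ceph config set"
(a sketch; same values as above):
ceph config set mds mds_cache_memory_limit 10737418240
ceph config set mds mds_cache_trim_threshold 131072
ceph config set mds mds_max_caps_per_client 500000
ceph config set mds mds_recall_max_caps 17408
ceph config set mds mds_recall_max_decay_rate 2.0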
The parameters I am least sure about---because I understand the least
how they actually work---are mds_cache_trim_threshold and
mds_recall_max_decay_rate. Despite reading the description in
src/common/options.cc, I understand only half of what they're doing and
I am also not quite sure in which direction to tune them for optimal
results.
Another point where I am struggling is the correct configuration of
mds_recall_max_caps. The default of 5K doesn't work too well for me, but
values above 20K also don't seem to be a good choice. While high values
result in fewer blocked ops and better performance without destabilising
the MDS, they also lead to slow but unbounded cache growth, which seems
counter-intuitive. 17K was the highest I could go. Higher values work
for most use cases, but when listing very large folders with millions of
dentries, the MDS cache size slowly starts to exceed the limit after a
few hours, since the MDSs are failing to keep clients below
mds_max_caps_per_client despite not showing any "failing to respond to
cache pressure" warnings.
With the configuration above, I do not have cache size issues any more,
but it comes at the cost of performance and slow/blocked ops. A few
hints as to how I could optimise my settings for better client
performance would be much appreciated and so would be additional
documentation for all the "advanced" MDS settings.
Thanks a lot
Janek
Hi,
It is a Nautilus 14.2.13 ceph.
The quota on the pool is 745 GiB; how can the stored data be 788 GiB? (It is a 2-replica pool.)
Based on the USED column, it means just 334 GiB is actually used, because the pool has only 2 replicas. I don't understand.
POOLS:
POOL ID STORED OBJECTS USED %USED MAX AVAIL QUOTA OBJECTS QUOTA BYTES DIRTY USED COMPR UNDER COMPR
k8s-dbss-w-mdc 12 788 GiB 202.42k 668 GiB 0.75 43 TiB N/A 745 GiB 202.42k 0 B 0 B
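For what it's worth, the quota and usage can be cross-checked like this
(a command sketch):
ceph osd pool get-quota k8s-dbss-w-mdc
ceph df detail | grep k8s-dbss-w-mdc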
Thank you