The ceph balancer sets upmap items which violate my CRUSH rule
the rule:
rule cslivebapfirst {
id 0
type replicated
min_size 2
max_size 4
step take csliveeubap-u01dc
step chooseleaf firstn 2 type room
step emit
step take csliveeubs-u01dc
step chooseleaf firstn 2 type room
step emit
}
My intention is that the first two replicas are stored in the
datacenter "csliveeubap-u01dc" and the next two replicas in the
datacenter "csliveeubs-u01dc".
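For reference, the rule can be exercised offline against the compiled
crushmap to see which OSDs it selects (a sketch; file paths are
placeholders):
ceph osd getcrushmap -o /tmp/cm
crushtool -d /tmp/cm -o /tmp/cm.txt
crushtool -i /tmp/cm --test --rule 0 --num-rep 4 --show-mappings | head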
The cluster has 49152 PGs and 665 of them have at least 3 replicas in
one datacenter, which is not expected!
One example is PG 3.96e.
The acting OSDs are in this order:
504 -> DC: csliveeubap-u01dc, room: csliveeubap-u01r03
1968 -> DC: csliveeubap-u01dc, room: csliveeubap-u01r01
420 -> DC: csliveeubap-u01dc, room: csliveeubap-u01r02
1945 -> DC: csliveeubs-u01dc, room: csliveeubs-u01r01
This PG has one upmap item:
ceph osd dump | grep 3.96e
3.96e pg_upmap_items 3.96e [2013,420]
OSD 2013 is in the DC: csliveeubs-u01dc
I checked this by hand with ceph osd pg-upmap-items.
If I try to map two replicas into one room, I get an appropriate error
in the mon log and nothing happens. But mapping a replica into the
other DC unfortunately worked.
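As a workaround, the offending entry can be removed by hand (a sketch;
the balancer may of course re-create it):
ceph osd rm-pg-upmap-items 3.96e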
I would suggest this is an ugly bug. What do you think?
ceph version 14.2.11 (f7fdb2f52131f54b891a2ec99d8205561242cdaf)
nautilus (stable)
Manuel
Hello,
Forwarding a security note that was shared with the OpenStack
community here for your awareness. This concerns a security
vulnerability that has now been addressed. I'd like to thank Ceph
contributors: Patrick Donnelly, Kotresh Hiremath Ravishankar and
Ramana Raja for their help in addressing this issue. Please find
information regarding patches and releases in the security note below.
Thanks,
Goutham
Ceph user credential leakage to consumers of OpenStack Manila
-------------------------------------------------------------
### Summary ###
OpenStack Manila users can request access on a share to any
arbitrary cephx user, including privileged pre-existing users
of a Ceph cluster. They can then retrieve access secret keys
for these pre-existing ceph users via Manila APIs. A cephx
client user name and access secret key are required to mount
a Native CephFS manila share. With a secret key, a manila user
can impersonate a pre-existing ceph user and gain capabilities
to manipulate resources that the manila user was never intended
to have access to. It is even possible to obtain the default
ceph "admin" user's key in this manner, and execute any command
as the ceph administrator.
### Affected Services / Software ###
- OpenStack Shared File Systems Service (Manila) versions Mitaka (2.0.0)
through Victoria (11.0.0)
- Ceph Luminous (<=v12.2.13), Mimic (<=v13.2.10),
Nautilus (<=v14.2.15), Octopus (<=v15.2.7)
### Discussion ###
OpenStack Manila can provide users with Native CephFS shared
file systems. When a user creates a "share" (short for
"shared file system") via Manila, a CephFS "subvolume" is
created on the Ceph cluster and exported. After creating
their share, a user can specify who can have access to the
share with the help of "cephx" client user names. A cephx
client corresponds to Ceph Client Users [2]. When access
is provided, a client user "access key" is returned via
manila.
A ceph client user account is required to access any ceph
resource. This includes interacting with Ceph cluster
infrastructure daemons (ceph-mgr, ceph-mds, ceph-mon, ceph-osd)
or consuming Ceph storage via RBD, RGW or CephFS. Deployment and
orchestration services like ceph-ansible, nfs-ganesha, kolla,
tripleo need ceph client users to work, as do OpenStack services
such as cinder, manila, glance and nova for their own interactions
with Ceph. For the purpose of illustrating this vulnerability,
we'll call them "pre-existing" users of the Ceph cluster. Another
example of a pre-existing user includes the "admin" user that
is created by default on the ceph cluster.
In theory, manila's cephx users are no different from any other
ceph client user. When a manila user requests access to a share,
a corresponding ceph user account is created if one does not
already exist. If a ceph user account already exists, the
existing capabilities of that user are adjusted to provide
them permissions to access the manila share in question.
There is no reasonable way for this mechanism to know what
pre-existing ceph client users must be protected against
unauthorized abuse. Therefore there is a risk that a
manila user can claim to be a pre-existing ceph user to
steal their access secret key.
To resolve this issue, the ceph interface that manila uses
was patched to no longer allow manila to claim a pre-existing
user account that it did not create. As a consequence, manila
users cannot use cephx usernames that correspond to ceph client
users that exist outside of manila.
### Recommended Actions ###
1. Upgrade your ceph software to the latest patched releases of
ceph to take advantage of the fix for this vulnerability.
2. Audit cephx access keys provisioned via manila. You may use
"ceph auth ls" to ensure that no clients have been compromised
(see the sketch after this list). If they have been, you may need
to delete and recreate the client credentials to prevent
unauthorized access.
3. The audit can also be performed on manila by enumerating all
CephFS shares and their access rules as a system administrator. If a
reserved ceph client username has been used, you may deny access
and recreate the client credential on ceph to refresh the
access secret.
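As an illustration of the audit in step 2 (a minimal sketch; the client
name and caps are hypothetical):
ceph auth ls                    # review all provisioned cephx clients and their caps
ceph auth del client.suspect    # revoke a credential that was claimed via manila
ceph auth get-or-create client.suspect \
    mon 'allow r' mds 'allow rw path=/volumes/share1' osd 'allow rw pool=cephfs_data'
                                # recreate it with a fresh key and scoped caps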
No code changes were necessary in the OpenStack Shared File
System service (manila). With an upgraded ceph, when manila
users attempt to provide share access to a cephx username
that they cannot use, the access rule's "state" attribute is
set to "error" because this operation is no longer permitted.
### Patches ###
The Ceph community has provided the following patches:
Ceph Octopus: https://github.com/ceph/ceph/commit/1b8a634fdcd94dfb3ba650793fb1b6d09af65e05
Ceph Nautilus: https://github.com/ceph/ceph/commit/7e3e4e73783a98bb07ab399438eb3aab41a6fc8b
Ceph Luminous: https://github.com/ceph/ceph/commit/956ceb853a58f6b6847b31fac34f2f0228a70579
The fixes are in the latest releases of Ceph Nautilus (14.2.16) and Ceph
Octopus (15.2.8). The patch for Luminous was provided as a courtesy to possible
users of OpenStack Manila; however, the Ceph community no longer produces
releases for Luminous or Mimic as they are end of life. See
https://docs.ceph.com/en/latest/releases/general/ for information about
Ceph releases.
### Contacts / References ###
Author:
- Pacha Ravi, Goutham gouthamr(a)redhat.com (Red Hat)
Credits:
- Garbutt, John john(a)johngarbutt.com (StackHPC)
- Babel, Jahson jahson.babel(a)cc.in2p3.fr (Centre de Calcul de l'IN2P3)
This OSSN : https://wiki.openstack.org/wiki/OSSN/OSSN-0087
Original LaunchPad Bug : https://launchpad.net/bugs/1904015
Mailing List : [Security] tag on openstack-discuss(a)lists.openstack.org
OpenStack Security Project : https://launchpad.net/~openstack-ossg
CVE: CVE-2020-27781
Hi.
When I deploy an OSD with a separate DB block device, it gives me a
"Permission denied" on its path! I have no idea why, but the only change
compared to my previous deployments is that I changed
osd_crush_initial_weight from 0 to 1. When I restart the host, the OSD
comes up without any errors. I tested another deployment with
osd_crush_initial_weight=0 and it deployed successfully, and again, when
I deployed with osd_crush_initial_weight=1 on another new node, it gave
me "Permission denied"!
Here is the full trace: https://paste.ubuntu.com/p/6HDzwSrK3p/
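One thing I plan to compare between the working and failing nodes (a
hypothetical diagnostic sketch, not taken from the trace; the device
path is a placeholder) is the ownership of the DB device:
ls -lL /var/lib/ceph/osd/ceph-*/block.db    # resolve the symlink to the db device
# if the device node is owned by root rather than ceph, the OSD cannot open it:
chown ceph:ceph /dev/sdX1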
Hello,
While working with a customer, I went through the output of "ceph auth
list" and found a client entry that nobody can explain the purpose of.
There is a strong suspicion that it is an unused leftover from old
times, but again, nobody is sure.
How can I confirm that it was not used for, say, the past week? Or,
what logs should I turn on so that if it is used during the next week,
it is mentioned there?
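One idea I had (a sketch; I am not sure it catches every access path) is
to raise the auth debug level on the monitors and then grep their logs
for the entity name (the name below is a placeholder):
ceph tell mon.\* injectargs '--debug_auth 20'
grep client.mystery /var/log/ceph/ceph-mon.*.log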
--
Alexander E. Patrakov
CV: http://pc.cd/PLz7
Hi,
I'm running a 15.2.4 test cluster in a rook-ceph environment. The cluster reports HEALTH_OK, but it seems to be stuck removing an image. Here is the last section of the 'ceph status' output:
progress:
Removing image replicapool/43def5e07bf47 from trash (6h)
[............................] (remaining: 32y)
This has now been going on for a couple of weeks, and I was wondering if there is a way to speed it up? The cluster doesn't seem to be doing much, judging from the system load.
I created this largish image to test what is possible with the setup, but how do I get it out of the trash now?
# rbd info --image-id 43def5e07bf47 -p replicapool
rbd image 'csi-vol-cfaa1b00-1711-11eb-b9c9-2aa51e1e24e5':
size 1 EiB in 274877906944 objects
order 22 (4 MiB objects)
snapshot_count: 0
id: 43def5e07bf47
block_name_prefix: rbd_data.43def5e07bf47
format: 2
features: layering
op_features:
flags:
create_timestamp: Sun Oct 25 22:31:23 2020
access_timestamp: Sun Oct 25 22:31:23 2020
modify_timestamp: Sun Oct 25 22:31:23 2020
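For reference, the image still shows up in the trash, and removal can be
retried directly from the CLI (a sketch; I am not sure whether this
behaves any differently from what rook already triggered):
rbd trash ls replicapool
rbd trash rm --pool replicapool 43def5e07bf47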
Any pointers on how to resolve this issue are much appreciated.
Regards
Andre
Hi all,
I have a 14.2.15 cluster with all SATA OSDs. Now we plan to add SSDs to
the cluster for DB/WAL usage. I checked the docs and found that the
'ceph-bluestore-tool' command can handle this.
I added a DB/WAL to the OSD in my test environment, but in the end I
still get the warning message:
"osd.0 spilled over 64 KiB metadata from 'db' device (7 MiB used of 8.0
GiB) to slow device"
My procedure (sdd is the new disk for DB/WAL):
sgdisk --new=1:0:+8GB --change-name=1:bluestore_block_db_0 \
    --partition-guid=1:$(uuidgen) --mbrtogpt -- /dev/sdd
sgdisk --new=2:0:+1GB --change-name=2:bluestore_block_wal_0 \
    --partition-guid=2:$(uuidgen) --mbrtogpt -- /dev/sdd
systemctl stop ceph-osd@0
CEPH_ARGS="--bluestore-block-db-size 8589934592" ceph-bluestore-tool \
    bluefs-bdev-new-db --path /var/lib/ceph/osd/ceph-0 --dev-target /dev/sdd1
ceph-bluestore-tool bluefs-bdev-new-wal --path /var/lib/ceph/osd/ceph-0/ \
    --dev-target /dev/sdd2
ceph-bluestore-tool bluefs-bdev-expand --path /var/lib/ceph/osd/ceph-0/
systemctl start ceph-osd@0
ceph tell osd.0 compact
The warning message says that some metadata is still on the slow
device.
How can I deal with this issue?
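One thing I am considering (a sketch based on the ceph-bluestore-tool
docs; untested on my side) is migrating the BlueFS data that already
landed on the slow device over to the new DB device:
systemctl stop ceph-osd@0
ceph-bluestore-tool bluefs-bdev-migrate \
    --path /var/lib/ceph/osd/ceph-0 \
    --devs-source /var/lib/ceph/osd/ceph-0/block \
    --dev-target /var/lib/ceph/osd/ceph-0/block.db
systemctl start ceph-osd@0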
Thanks
Hi,
Could you tell me whether read I/O is accepted when the number of
replicas is below a pool's min_size?
I read the official documentation and found that the pools document and
the pool configuration document describe the effect of a pool's
min_size differently.
The pools document:
https://docs.ceph.com/en/latest/rados/operations/pools/
> To set a minimum number of required replicas for I/O, you should use the
> min_size setting.
The pool configuration document:
https://docs.ceph.com/en/latest/rados/configuration/pool-pg-config-ref/
> osd pool default min size
> Description: Sets the minimum number of written replicas for objects in
> the pool in order to acknowledge a write operation to the client.
The former says that min_size affects both read and write I/O.
On the other hand, the latter says it is only about write I/O. So which
description is correct?
Based on my verification, the former looks correct.
I created an RGW object in a pool with 3 replicas and a min_size of 2.
Neither reads nor writes are accepted when 2 replicas are gone.
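The behaviour can also be reproduced with plain RADOS (a sketch; pool
and object names are placeholders):
ceph osd pool create testpool 32
ceph osd pool set testpool size 3
ceph osd pool set testpool min_size 2
rados -p testpool put obj1 /etc/hosts
# after stopping two of the three OSDs in obj1's acting set:
rados -p testpool get obj1 -    # blocks: reads are refused below min_size, too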
Thanks,
Satoru
Over the weekend I had multiple OSD servers in my Octopus cluster
(15.2.4) crash and reboot at nearly the same time. The OSDs are part of
an erasure coded pool. At the time the cluster had been busy with a
long-running (~week) remapping of a large number of PGs after I
incrementally added more OSDs to the cluster. After bringing all of the
OSDs back up, I have 25 unfound objects and 75 degraded objects. There
are other problems reported, but I'm primarily concerned with these
unfound/degraded objects.
The pool with the missing objects is a cephfs pool. The files stored in
the pool are backed up on tape, so I can easily restore individual files
as needed (though I would not want to restore the entire filesystem).
I tried following the guide at
https://docs.ceph.com/docs/octopus/rados/troubleshooting/troubleshooting-pg….
I found a number of OSDs that are still 'not queried'. Restarting a
sampling of these OSDs changed the state from 'not queried' to 'already
probed', but that did not recover any of the unfound or degraded objects.
I have also tried 'ceph pg deep-scrub' on the affected PGs, but never
saw them get scrubbed. I also tried doing a 'ceph pg force-recovery' on
the affected PGs, but only one seems to have been tagged accordingly
(see ceph -s output below).
The guide also says "Sometimes it simply takes some time for the cluster
to query possible locations." I'm not sure how long "some time" might
take, but it hasn't changed after several hours.
My questions are:
* Is there a way to force the cluster to query the possible locations
sooner?
* Is it possible to identify the files in cephfs that are affected, so
that I could delete only the affected files and restore them from backup
tapes? (A possible approach is sketched below.)
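For the second question, the approach I have in mind (an untested
sketch; it assumes the standard cephfs data-object naming of
<inode-hex>.<block-number>) would be:
ceph pg 7.1f list_unfound    # pg id is a placeholder; prints the unfound object names
# an object named 10000000abc.00000000 belongs to inode 0x10000000abc;
# locate the corresponding file on a mounted cephfs:
find /mnt/cephfs -inum $((16#10000000abc))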
--Mike
ceph -s:
cluster:
id: 066f558c-6789-4a93-aaf1-5af1ba01a3ad
health: HEALTH_ERR
1 clients failing to respond to capability release
1 MDSs report slow requests
25/78520351 objects unfound (0.000%)
2 nearfull osd(s)
Reduced data availability: 1 pg inactive
Possible data damage: 9 pgs recovery_unfound
Degraded data redundancy: 75/626645098 objects degraded
(0.000%), 9 pgs degraded
1013 pgs not deep-scrubbed in time
1013 pgs not scrubbed in time
2 pool(s) nearfull
1 daemons have recently crashed
4 slow ops, oldest one blocked for 77939 sec, daemons
[osd.0,osd.41] have slow ops.
services:
mon: 4 daemons, quorum ceph1,ceph2,ceph3,ceph4 (age 9d)
mgr: ceph3(active, since 11d), standbys: ceph2, ceph4, ceph1
mds: archive:1 {0=ceph4=up:active} 3 up:standby
osd: 121 osds: 121 up (since 6m), 121 in (since 101m); 4 remapped pgs
task status:
scrub status:
mds.ceph4: idle
data:
pools: 9 pools, 2433 pgs
objects: 78.52M objects, 298 TiB
usage: 412 TiB used, 545 TiB / 956 TiB avail
pgs: 0.041% pgs unknown
75/626645098 objects degraded (0.000%)
135224/626645098 objects misplaced (0.022%)
25/78520351 objects unfound (0.000%)
2421 active+clean
5 active+recovery_unfound+degraded
3 active+recovery_unfound+degraded+remapped
2 active+clean+scrubbing+deep
1 unknown
1 active+forced_recovery+recovery_unfound+degraded
progress:
PG autoscaler decreasing pool 7 PGs from 1024 to 512 (5d)
[............................]
Hello,
Over the last week I have tried optimising the performance of our MDS
nodes for the large amount of files and concurrent clients we have. It
turns out that despite various stability fixes in recent releases, the
default configuration still doesn't appear to be optimal for keeping the
cache size under control and avoid intermittent I/O blocks.
Unfortunately, it is very hard to tweak the configuration to something
that works, because the tuning parameters needed are largely
undocumented or only described in very technical terms in the source
code making them quite unapproachable for administrators not familiar
with all the CephFS internals. I would therefore like to ask if it were
possible to document the "advanced" MDS settings more clearly as to what
they do and in what direction they have to be tuned for more or less
aggressive cap recall, for instance (sometimes it is not clear if a
threshold is a min or a max threshold).
I am in the very (un)fortunate situation of having folders with several
hundred thousand direct subfolders or files (and one extreme case with
almost 7 million dentries), which makes for a pretty good benchmark for
measuring cap growth while performing operations on them. For the time
being, I came up with this configuration, which seems to work for me
but is still far from optimal:
mds basic mds_cache_memory_limit 10737418240
mds advanced mds_cache_trim_threshold 131072
mds advanced mds_max_caps_per_client 500000
mds advanced mds_recall_max_caps 17408
mds advanced mds_recall_max_decay_rate 2.000000
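For reference, these can be applied at runtime with "ceph config set"
(a sketch; same values as above):
ceph config set mds mds_cache_memory_limit 10737418240
ceph config set mds mds_cache_trim_threshold 131072
ceph config set mds mds_max_caps_per_client 500000
ceph config set mds mds_recall_max_caps 17408
ceph config set mds mds_recall_max_decay_rate 2.0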
The parameters I am least sure about---because I understand the least
how they actually work---are mds_cache_trim_threshold and
mds_recall_max_decay_rate. Despite reading the description in
src/common/options.cc, I understand only half of what they're doing and
I am also not quite sure in which direction to tune them for optimal
results.
Another point where I am struggling is the correct configuration of
mds_recall_max_caps. The default of 5K doesn't work too well for me, but
values above 20K also don't seem to be a good choice. While high values
result in fewer blocked ops and better performance without destabilising
the MDS, they also lead to slow but unbounded cache growth, which seems
counter-intuitive. 17K was the highest I could go. Higher values work
for most use cases, but when listing very large folders with millions of
dentries, the MDS cache size slowly starts to exceed the limit after a
few hours, since the MDSs are failing to keep clients below
mds_max_caps_per_client despite not showing any "failing to respond to
cache pressure" warnings.
With the configuration above, I do not have cache size issues any more,
but it comes at the cost of performance and slow/blocked ops. A few
hints as to how I could optimise my settings for better client
performance would be much appreciated and so would be additional
documentation for all the "advanced" MDS settings.
Thanks a lot
Janek
Hi,
It is a Nautilus 14.2.13 ceph.
The quota on the pool is 745 GiB; how can the stored data be 788 GiB? (It is a 2-replica pool.)
Based on the USED column, it means just 334 GiB is actually used, because the pool has only 2 replicas. I don't understand.
POOLS:
POOL ID STORED OBJECTS USED %USED MAX AVAIL QUOTA OBJECTS QUOTA BYTES DIRTY USED COMPR UNDER COMPR
k8s-dbss-w-mdc 12 788 GiB 202.42k 668 GiB 0.75 43 TiB N/A 745 GiB 202.42k 0 B 0 B
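For what it's worth, the quota and usage can be cross-checked like this
(a command sketch):
ceph osd pool get-quota k8s-dbss-w-mdc
ceph df detail | grep k8s-dbss-w-mdc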
Thank you