Hi,
Has something changed with 'rbd diff' in Octopus, or have I hit a bug? I am no longer able to obtain the list of objects that have changed between two snapshots of an image; it always lists all allocated regions of the RBD image. This behaviour only occurs when I add the '--whole-object' switch.
I'm using a KRBD client with kernel 5.11.7 and Ceph Octopus 15.2.11 as part of Proxmox PVE 6.4, which is based on Debian 10. Images have the features shown below, and I've performed offline object-map checks and rebuilds (no errors reported).
To reproduce the issue I first create a new RBD image (default features are 63), map it using KRBD, write some data, create the first snapshot, overwrite a single object (4 MiB), create a second snapshot and then list the differences:
[admin@kvm1a ~]# rbd create rbd_hdd/test --size 40G
[admin@kvm1a ~]# rbd info rbd_hdd/test
rbd image 'test':
size 40 GiB in 10240 objects
order 22 (4 MiB objects)
snapshot_count: 0
id: 73363f8443987b
block_name_prefix: rbd_data.73363f8443987b
format: 2
features: layering, exclusive-lock, object-map, fast-diff, deep-flatten
op_features:
flags:
create_timestamp: Wed May 12 23:01:11 2021
access_timestamp: Wed May 12 23:01:11 2021
modify_timestamp: Wed May 12 23:01:11 2021
[admin@kvm1a ~]# rbd map rbd_hdd/test
/dev/rbd18
[admin@kvm1a ~]# dd if=/dev/zero of=/dev/rbd18 bs=64M count=1
1+0 records in
1+0 records out
67108864 bytes (67 MB, 64 MiB) copied, 0.668701 s, 100 MB/s
[admin@kvm1a ~]# sync
[admin@kvm1a ~]# rbd snap create rbd_hdd/test@snap1
[admin@kvm1a ~]# dd if=/dev/zero of=/dev/rbd18 bs=4M count=1
1+0 records in
1+0 records out
4194304 bytes (4.2 MB, 4.0 MiB) copied, 0.265691 s, 15.8 MB/s
[admin@kvm1a ~]# sync
[admin@kvm1a ~]# rbd snap create rbd_hdd/test@snap2
[admin@kvm1a ~]# rbd diff --from-snap snap1 rbd_hdd/test@snap2 --format=json
[{"offset":0,"length":4194304,"exists":"true"}]
[admin@kvm1b ~]# rbd diff --from-snap snap1 rbd_hdd/test@snap2 --format=json --whole-object
[{"offset":0,"length":4194304,"exists":"true"},{"offset":4194304,"length":4194304,"exists":"true"},{"offset":8388608,"length":4194304,"exists":"true"},{"offset":12582912,"length":4194304,"exists":"true"},{"offset":16777216,"length":4194304,"exists":"true"},{"offset":20971520,"length":4194304,"exists":"true"},{"offset":25165824,"length":4194304,"exists":"true"},{"offset":29360128,"length":4194304,"exists":"true"},{"offset":33554432,"length":4194304,"exists":"true"},{"offset":37748736,"length":4194304,"exists":"true"},{"offset":41943040,"length":4194304,"exists":"true"},{"offset":46137344,"length":4194304,"exists":"true"},{"offset":50331648,"length":4194304,"exists":"true"},{"offset":54525952,"length":4194304,"exists":"true"},{"offset":58720256,"length":4194304,"exists":"true"},{"offset":62914560,"length":4194304,"exists":"true"}]
[admin@kvm1a ~]# rbd du rbd_hdd/test
NAME PROVISIONED USED
test@snap1 40 GiB 64 MiB
test@snap2 40 GiB 64 MiB
test 40 GiB 4 MiB
<TOTAL> 40 GiB 132 MiB
My tests appear to confirm that adding the '--whole-object' option to 'rbd diff' results in it listing every allocated extent instead of only the changes...
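In the meantime I'm considering post-processing the plain diff myself to get object-granularity extents. This is only a rough sketch, assuming 4 MiB objects and that the diff without '--whole-object' (as shown above) stays correct:

import json
import subprocess

# Rough sketch: round the plain 'rbd diff' extents up to whole 4 MiB objects,
# i.e. what '--whole-object' is supposed to return for the test image above.
OBJ = 4 * 1024 * 1024
out = subprocess.check_output(
    ["rbd", "diff", "--from-snap", "snap1", "rbd_hdd/test@snap2", "--format=json"])
objects = set()
for extent in json.loads(out):
    first = extent["offset"] // OBJ
    last = (extent["offset"] + extent["length"] - 1) // OBJ
    objects.update(range(first, last + 1))
print(json.dumps(
    [{"offset": o * OBJ, "length": OBJ, "exists": "true"} for o in sorted(objects)]))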
Regards
David Herselman
Hi
I'm new to ceph and have been following the Manual Deployment document [1]. The process seems to work correctly until step 18 ("Verify that the monitor is running"):
[centos@cnode-01 ~]$ uname -a
Linux cnode-01 3.10.0-693.5.2.el7.x86_64 #1 SMP Fri Oct 20 20:32:50 UTC 2017 x86_64 x86_64 x86_64 GNU/Linux
[centos@cnode-01 ~]$ ceph -v
ceph version 15.2.11 (e3523634d9c2227df9af89a4eac33d16738c49cb) octopus (stable)
[centos@cnode-01 ~]$ sudo ceph --cluster es-c1 -s
[errno 2] RADOS object not found (error connecting to the cluster)
[centos@cnode-01 ~]$
What is this error trying to tell me? TIA
[1] https://docs.ceph.com/en/latest/install/manual-deployment/
Dear All
We are trying to remove old multipart uploads but run into trouble with some of them that have null characters in their names:
rados -p zh-1.rgw.buckets.index rmomapkey .dir.cb1594b3-a782-49d0-a19f-68cd48870a63.81880353.1.0 '_multipart_MBS-35a9b79c-f27d-44f2-804f-472ef0520816/CBB_BSSRV01/CBB_DiskImage/Disk_4f8130ff-fef5-4b0f-b25e-c6b8b3dba9bf/Volume_NTFS_5b4f5274-9107-4386-93d9-e7f31193805a$/20201218230243/0.cbrevision.525Sr39KY5yVbD_w9ipOXSXsQ95YUnC.25'
rados -p zh-1.rgw.buckets.index rmomapkey .dir.cb1594b3-a782-49d0-a19f-68cd48870a63.81880353.1.0 $(echo -ne '_multipart_MBS-35a9b79c-f27d-44f2-804f-472ef0520816/CBB_BSSRV01/CBB_DiskImage/Disk_4f8130ff-fef5-4b0f-b25e-c6b8b3dba9bf/Volume_NTFS_5b4f5274-9107-4386-93d9-e7f31193805a$/20201218230243/0.cbrevision.525Sr39KY5yVbD_w9ipOXSXsQ95YUnC\0.25')
-bash: warning: command substitution: ignored null byte in input
rados -p zh-1.rgw.buckets.index listomapkeys .dir.cb1594b3-a782-49d0-a19f-68cd48870a63.81880353.1.0 | grep -a '_multipart_MBS-35a9b79c-f27d-44f2-804f-472ef0520816/CBB_BSSRV01/CBB_DiskImage/Disk_4f8130ff-fef5-4b0f-b25e-c6b8b3dba9bf/Volume_NTFS_5b4f5274-9107-4386-93d9-e7f31193805a$/20201218230243/0.cbrevision.525Sr39KY5yVbD_w9ipOXSXsQ95YUnC' | cat -A
_multipart_MBS-35a9b79c-f27d-44f2-804f-472ef0520816/CBB_BSSRV01/CBB_DiskImage/Disk_4f8130ff-fef5-4b0f-b25e-c6b8b3dba9bf/Volume_NTFS_5b4f5274-9107-4386-93d9-e7f31193805a$/20201218230243/0.cbrevision.525Sr39KY5yVbD_w9ipOXSXsQ95YUnC^@.25$ # <= not deleted !
It does not work, as the NUL character is stripped off by the shell.
Any Idea how to proceed?
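One idea we are considering is to bypass the shell entirely and remove the key via the librados Python bindings. This is only a sketch; we have not yet verified that the bindings pass the key with an explicit length rather than as a NUL-terminated C string, so it may hit the same truncation:

import rados

# Sketch: remove the bucket-index omap key that contains the embedded NUL byte,
# without going through a shell argument (shell arguments cannot carry NULs).
key = (b"_multipart_MBS-35a9b79c-f27d-44f2-804f-472ef0520816/CBB_BSSRV01/"
       b"CBB_DiskImage/Disk_4f8130ff-fef5-4b0f-b25e-c6b8b3dba9bf/"
       b"Volume_NTFS_5b4f5274-9107-4386-93d9-e7f31193805a$/20201218230243/"
       b"0.cbrevision.525Sr39KY5yVbD_w9ipOXSXsQ95YUnC\x00.25")

cluster = rados.Rados(conffile="/etc/ceph/ceph.conf")
cluster.connect()
ioctx = cluster.open_ioctx("zh-1.rgw.buckets.index")
with rados.WriteOpCtx() as op:
    ioctx.remove_omap_keys(op, (key,))
    ioctx.operate_write_op(op, ".dir.cb1594b3-a782-49d0-a19f-68cd48870a63.81880353.1.0")
ioctx.close()
cluster.shutdown()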
This bucket was created on Luminous, but this specific object was created after our upgrade to Nautilus.
Apparently some bugs have added NUL characters at the end of MPU object names, between the upload ID and the suffix.
Output from 'radosgw-admin bi list' (note the \u0000 NUL characters):
{
"type": "plain",
"idx": "_multipart_MBS-35a9b79c-f27d-44f2-804f-472ef0520816/CBB_BSSRV01/CBB_DiskImage/Disk_4f8130ff-fef5-4b0f-b25e-c6b8b3dba9bf/Volume_NTFS_5b4f5274-9107-4386-93d9-e7f31193805a$/20201218230243/0.cbrevision.525Sr39KY5yVbD_w9ipOXSXsQ95YUnC\u0000.25",
"entry": {
"name": "_multipart_MBS-35a9b79c-f27d-44f2-804f-472ef0520816/CBB_BSSRV01/CBB_DiskImage/Disk_4f8130ff-fef5-4b0f-b25e-c6b8b3dba9bf/Volume_NTFS_5b4f5274-9107-4386-93d9-e7f31193805a$/20201218230243/0.cbrevision.525Sr39KY5yVbD_w9ipOXSXsQ95YUnC\u0000.25",
"instance": "",
"ver": {
"pool": 6,
"epoch": 852938
},
"locator": "",
"exists": "true",
"meta": {
"category": 1,
"size": 157286400,
"mtime": "2020-12-25 23:39:20.019898Z",
"etag": "a126c2f0d439c44176a5d07bd5841575",
"storage_class": "",
"owner": "40eb21a9092c4948bcf94386f6042f94",
"owner_display_name": "amsler1",
"content_type": "",
"accounted_size": 157286400,
"user_data": "",
"appendable": "false"
},
"tag": "_vMx_4vu-E5nWf7kCHJIQCFPGEHRiUAG",
"flags": 0,
"pending_map": [],
"versioned_epoch": 0
}
},
On the same bucket, we also see NUL characters at the end of some ETags when using 'radosgw-admin bucket list --bucket' but not with 'radosgw-admin object stat':
object='MBS-35a9b79c-f27d-44f2-804f-472ef0520816/CBB_BSSRV01/CBB_HV/BSSRV01.Aerztehaus-allschwil.ch/BSSRV05/D0F970B6-DB86-48AF-AA68-946D4642E2A6.xml:/20191103183115/D0F970B6-DB86-48AF-AA68-946D4642E2A6.xml'
radosgw-admin object stat --bucket="$bucket" --object="$object" | jq -c '{name, size, etag, tag, obj_size: .manifest.obj_size, marker:.manifest.tail_placement.bucket.marker, bucket_id:.manifest.tail_placement.bucket.bucket_id}' | cat -A
{"name":"MBS-35a9b79c-f27d-44f2-804f-472ef0520816/CBB_BSSRV01/CBB_HV/BSSRV01.Aerztehaus-allschwil.ch/BSSRV05/D0F970B6-DB86-48AF-AA68-946D4642E2A6.xml:/20191103183115/D0F970B6-DB86-48AF-AA68-946D4642E2A6.xml","size":39372,"etag":"0e73d594032900acb74d3f06b230aeb9","tag":"_xhNKxuWrfxDO5XfYs8Llq8vLTUYqtmm","obj_size":39372,"marker":"cb1594b3-a782-49d0-a19f-68cd48870a63.19334234.139","bucket_id":"cb1594b3-a782-49d0-a19f-68cd48870a63.20382694.169"}$ # <= no NullChar
radosgw-admin bucket list --bucket "${bucket}" --allow-unordered --max-entries 20000000 | jq -c 'sort_by(.bucket) | .[] | {name, accounted_size: .meta.accounted_size, etag: .meta.etag}' | fgrep -a 0e73d594032900acb74d3f06b230aeb9 | cat -A
{"name":"MBS-35a9b79c-f27d-44f2-804f-472ef0520816/CBB_BSSRV01/CBB_HV/BSSRV01.Aerztehaus-allschwil.ch/BSSRV05/D0F970B6-DB86-48AF-AA68-946D4642E2A6.xml:/20191103183115/D0F970B6-DB86-48AF-AA68-946D4642E2A6.xml","accounted_size":39372,"etag":"0e73d594032900acb74d3f06b230aeb9\u0000"}$ # <= no NullChar
rados -p zh-1.rgw.buckets.data stat 'cb1594b3-a782-49d0-a19f-68cd48870a63.19334234.139_'"$object"
zh-1.rgw.buckets.data/cb1594b3-a782-49d0-a19f-68cd48870a63.19334234.139_MBS-35a9b79c-f27d-44f2-804f-472ef0520816/CBB_BSSRV01/CBB_HV/BSSRV01.Aerztehaus-allschwil.ch/BSSRV05/D0F970B6-DB86-48AF-AA68-946D4642E2A6.xml:/20191103183115/D0F970B6-DB86-48AF-AA68-946D4642E2A6.xml mtime 2020-04-21 14:21:27.000000, size 39372
This bucket was causing multi-site rgw sync to crash every minute when using rgw_sync_obj_etag_verify = true.
These ETag NUL characters may be the cause of this bug:
* https://tracker.ceph.com/issues/49955
It may also be related to:
* https://tracker.ceph.com/issues/23939
So we would be glad to know how to remove these NUL characters from the ETags and how to remove the MPUs with NUL characters in their object names...
Both of these issues seem to be the cause of many weird behaviors:
1. rgw sync crashes (with rgw_sync_obj_etag_verify = true)
2. radosgw-admin bucket sync status --bucket "$bucket" --source-zone ch-zh1-az2 => reports "bucket is caught up with source" even though most of the objects are missing
3. radosgw-admin bucket list --bucket "$bucket" --allow-unordered --max-entries 99000000 => returns an incomplete list
4. radosgw-admin bucket stats --bucket "$bucket" => returns a wrong number of objects and utilized size
The only reliable output is from 'bi list':
* radosgw-admin bi list --bucket=$bucket | jq -cr 'map(select(.type == "plain" or .type == "instance") | .entry)'
Do you know whether the following commands may help and whether they are safe in a multi-site setup?
* radosgw-admin bucket check --bucket $bucket --fix --check-objects
* radosgw-admin bucket rewrite --bucket $bucket --min-rewrite-size 0
Or does a dedicated tool need to be developed to deal with these NUL characters?
Many thanks in advance.
Cheers
Francois Scheurer
--
EveryWare AG
François Scheurer
Senior Systems Engineer
Zurlindenstrasse 52a
CH-8003 Zürich
tel: +41 44 466 60 00
fax: +41 44 466 60 10
mail: francois.scheurer(a)everyware.ch
web: http://www.everyware.ch
Hi all
The scenario is as follows:
Federated user assumes a role via AssumeRoleWithWebIdentity, which gives
permission to create a bucket.
The user creates a bucket and becomes its owner (this is visible in Ceph's web
UI as Owner $oidc$7f71c7c5-c24f-418e-87ac-aa8fe271289b).
The user cannot list the contents of the bucket, however, because the role's policy
does not grant access to the bucket.
Later on, the user re-authenticates and assumes the same role again.
At this point the user cannot access a bucket it owns, for the same reason as above,
I'm assuming.
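If it really is just the role's permission policy that lacks a grant, I assume attaching something like the following (via the RGW IAM endpoint with boto3) would let the user reach its own bucket again. I have not verified this; role name, bucket name, endpoint and credentials are placeholders:

import json
import boto3

# Minimal sketch, assuming the fix is simply to grant the role List/Get/Put on the bucket.
iam = boto3.client(
    "iam",
    endpoint_url="https://rgw.example.com",   # placeholder RGW endpoint
    aws_access_key_id="ACCESS",                # placeholder admin credentials
    aws_secret_access_key="SECRET",
)

role_policy = {
    "Version": "2012-10-17",
    "Statement": [{
        "Effect": "Allow",
        "Action": ["s3:ListBucket", "s3:GetObject", "s3:PutObject"],
        "Resource": ["arn:aws:s3:::my-bucket", "arn:aws:s3:::my-bucket/*"],
    }],
}

iam.put_role_policy(
    RoleName="MyOidcRole",                     # placeholder role name
    PolicyName="AllowMyBucket",
    PolicyDocument=json.dumps(role_policy),
)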
Bucket's ACL after creation
radosgw-admin policy --bucket my-bucket
{
"acl": {
"acl_user_map": [
{
"user": "$oidc$7f71c7c5-c24f-418e-87ac-aa8fe271289b",
"acl": 15
}
],
"acl_group_map": [],
"grant_map": [
{
"id": "$oidc$7f71c7c5-c24f-418e-87ac-aa8fe271289b",
"grant": {
"type": {
"type": 0
},
"id": "$oidc$7f71c7c5-c24f-418e-87ac-aa8fe271289b",
"email": "",
"permission": {
"flags": 15
},
"name": "",
"group": 0,
"url_spec": ""
}
}
]
},
"owner": {
"id": "$oidc$7f71c7c5-c24f-418e-87ac-aa8fe271289b",
"display_name": ""
}
}
This seems inconsistent with buckets created by regular users
Is this expected behaviour?
Regards
Daniel
Hi all
I'm working on the following scenario:
A user is authenticated with OIDC and tries to access a bucket which it does
not own.
How do I specify the user ID etc. in a bucket policy to give access to such a user?
By trial and error I found out that principal can be specified as
"Principal": {"Federated":["arn:aws:sts:::assumed-role/MySession"]},
but I want to use shadow user ID or something similar as the principal
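For context, the kind of policy I am applying looks roughly like this (a sketch; bucket name, session name, endpoint and credentials are placeholders, applied with put_bucket_policy via boto3):

import json
import boto3

s3 = boto3.client(
    "s3",
    endpoint_url="https://rgw.example.com",   # placeholder RGW endpoint
    aws_access_key_id="ACCESS",                # placeholder bucket-owner credentials
    aws_secret_access_key="SECRET",
)

policy = {
    "Version": "2012-10-17",
    "Statement": [{
        "Effect": "Allow",
        # found by trial and error; what I would like instead is to reference
        # the $oidc$... shadow user directly
        "Principal": {"Federated": ["arn:aws:sts:::assumed-role/MySession"]},
        "Action": ["s3:ListBucket", "s3:GetObject"],
        "Resource": ["arn:aws:s3:::my-bucket", "arn:aws:s3:::my-bucket/*"],
    }],
}

s3.put_bucket_policy(Bucket="my-bucket", Policy=json.dumps(policy))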
The docs at
https://docs.ceph.com/en/latest/radosgw/STS/
state:
'A shadow user is created corresponding to every federated user. The user
id is derived from the ‘sub’ field of the incoming web token. The user is
created in a separate namespace - ‘oidc’ such that the user id doesn’t
clash with any other user ids in rgw. The format of the user id is -
<tenant>$<user-namespace>$<sub> where user-namespace is ‘oidc’ for users
that authenticate with oidc providers.'
I see a shadow user in the web UI as e.g. 7f71c7c5-c24f-418e-87ac-aa8fe271289b,
but I cannot work out the syntax of the user ID. I was expecting something
like
"arn:aws:iam:::user/$oidc$7f71c7c5-c24f-418e-87ac-aa8fe271289b"
but when trying to list the contents of the bucket I get AccessDenied.
If the bucket policy has Principal "*", then my authenticated user can access the
bucket.
Is this possible?
Regards
Daniel
Hi,
How I got here
--------------
Yesterday evening I added an OSD to my hobby system, most likely using these
commands:
# ceph-volume raw prepare --bluestore --data /dev/bcache0
# cephadm adopt --style legacy --name osd.20
I also used the command (after not having much luck with that, but I don't
have the specifics):
% ceph orch daemon add osd tutu:/tmp/bcache0
per https://docs.ceph.com/en/latest/cephadm/osd/#creating-new-osds
...which I think resulted in a new osd.18, putting bcache0 inside its own
VG and its own LV.
I don't have an actual log of the commands used, but I did end up with
new OSDs 18 and 20. It was my first time using these commands as well; my previous
ways to achieve the same were a bit more long-winded..
According to my monitoring my main issue appeared around the same time.
In this post I don't worry about the state of the OSD but only about
management.
Actual issue
------------
So when I now issue "ceph orch ls" I get the following output:
% ceph orch ls
Error EINVAL: Traceback (most recent call last):
File "/usr/share/ceph/mgr/mgr_module.py", line 1204, in _handle_command
return self.handle_command(inbuf, cmd)
File "/usr/share/ceph/mgr/orchestrator/_interface.py", line 140, in handle_command
return dispatch[cmd['prefix']].call(self, cmd, inbuf)
File "/usr/share/ceph/mgr/mgr_module.py", line 320, in call
return self.func(mgr, **kwargs)
File "/usr/share/ceph/mgr/orchestrator/_interface.py", line 102, in <lambda>
wrapper_copy = lambda *l_args, **l_kwargs: wrapper(*l_args, **l_kwargs)
File "/usr/share/ceph/mgr/orchestrator/_interface.py", line 91, in wrapper
return func(*args, **kwargs)
File "/usr/share/ceph/mgr/orchestrator/module.py", line 503, in _list_services
raise_if_exception(completion)
File "/usr/share/ceph/mgr/orchestrator/_interface.py", line 642, in raise_if_exception
raise e
AssertionError: not
("ceph orch ps" works fine.)
Similarly the output of "ceph -s" is:
% ceph -s
...
health: HEALTH_ERR
Module 'cephadm' has failed: 'not'
...
The relevant log from the manager, as per the mgr web interface, is:
_Promise failed Traceback (most recent call last):
  File "/usr/share/ceph/mgr/orchestrator/_interface.py", line 294, in _finalize
    next_result = self._on_complete(self._value)
  File "/usr/share/ceph/mgr/cephadm/module.py", line 107, in <lambda>
    return CephadmCompletion(on_complete=lambda _: f(*args, **kwargs))
  File "/usr/share/ceph/mgr/cephadm/module.py", line 1333, in describe_service
    hosts=[dd.hostname]
  File "/lib/python3.6/site-packages/ceph/deployment/service_spec.py", line 429, in __init__
    assert service_type in ServiceSpec.KNOWN_SERVICE_TYPES, service_type
AssertionError: not
I also noticed this seemingly highly relevant bit in my ceph orch ps:
NAME HOST STATUS REFRESHED AGE VERSION IMAGE NAME IMAGE ID CONTAINER ID
not.osd.20 tutu stopped 13h ago 14h <unknown> docker.io/ceph/ceph:v15 <unknown> <unknown>
I'm not quite sure how I ended up with that, but I wouldn't exclude operator
error :) such as entering "cephadm adopt --style legacy --name not.osd.20"
(but WHY..).
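Looking at the assertion in the manager traceback above, my reading is that the service type is derived from everything before the first dot of the daemon name, so "not.osd.20" yields a service type of "not" (illustrative snippet only, this is just my interpretation of the traceback):

# illustrative only: how the bogus daemon name appears to turn into the failing assertion
daemon_name = "not.osd.20"
service_type = daemon_name.split(".", 1)[0]          # -> "not"
# ServiceSpec.__init__ then asserts:
#   assert service_type in ServiceSpec.KNOWN_SERVICE_TYPES, service_type
# which is exactly the "AssertionError: not" shown above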
Sure enough, there is no such docker container running on the host, and the
job ceph-3046312a-e453-11ea-b1f5-b42e993e47fc@osd.20.service has failed with
"RuntimeError: could not find osd.20 with osd_fsid
212c336a-9516-4818-aeaf-2d0c24c4ca65" (this error makes sense, as both OSDs
18 and 20 try to use the same bcache0, but the actual bluestore filesystem
is inside the VG/LV used by 18, whereas 20 tries to use bcache0 directly),
but as I said I won't worry about the OSD at the moment.
I tried the command "ceph orch daemon rm not.osd.20", although I'm not sure
whether it even should work. It nevertheless fails the same way:
% ceph orch daemon rm not.osd.20
Error EINVAL: Traceback (most recent call last):
File "/usr/share/ceph/mgr/mgr_module.py", line 1204, in _handle_command
return self.handle_command(inbuf, cmd)
File "/usr/share/ceph/mgr/orchestrator/_interface.py", line 140, in handle_command
return dispatch[cmd['prefix']].call(self, cmd, inbuf)
File "/usr/share/ceph/mgr/mgr_module.py", line 320, in call
return self.func(mgr, **kwargs)
File "/usr/share/ceph/mgr/orchestrator/_interface.py", line 102, in <lambda>
wrapper_copy = lambda *l_args, **l_kwargs: wrapper(*l_args, **l_kwargs)
File "/usr/share/ceph/mgr/orchestrator/_interface.py", line 91, in wrapper
return func(*args, **kwargs)
File "/usr/share/ceph/mgr/orchestrator/module.py", line 1061, in _daemon_rm
raise_if_exception(completion)
File "/usr/share/ceph/mgr/orchestrator/_interface.py", line 642, in raise_if_exception
raise e
KeyError: 'not'
with the following entries in the mgr log:
5/13/21 1:26:06 PM [ERR] _Promise failed Traceback (most recent call last):
  File "/usr/share/ceph/mgr/orchestrator/_interface.py", line 294, in _finalize
    next_result = self._on_complete(self._value)
  File "/usr/share/ceph/mgr/cephadm/module.py", line 107, in <lambda>
    return CephadmCompletion(on_complete=lambda _: f(*args, **kwargs))
  File "/usr/share/ceph/mgr/cephadm/module.py", line 1515, in remove_daemons
    return self._remove_daemons(args)
  File "/usr/share/ceph/mgr/cephadm/utils.py", line 65, in forall_hosts_wrapper
    return CephadmOrchestrator.instance._worker_pool.map(do_work, vals)
  File "/lib64/python3.6/multiprocessing/pool.py", line 266, in map
    return self._map_async(func, iterable, mapstar, chunksize).get()
  File "/lib64/python3.6/multiprocessing/pool.py", line 644, in get
    raise self._value
  File "/lib64/python3.6/multiprocessing/pool.py", line 119, in worker
    result = (True, func(*args, **kwds))
  File "/lib64/python3.6/multiprocessing/pool.py", line 44, in mapstar
    return list(map(*args))
  File "/usr/share/ceph/mgr/cephadm/utils.py", line 58, in do_work
    return f(self, *arg)
  File "/usr/share/ceph/mgr/cephadm/module.py", line 1804, in _remove_daemons
    return self._remove_daemon(name, host)
  File "/usr/share/ceph/mgr/cephadm/module.py", line 1818, in _remove_daemon
    self.cephadm_services[daemon_type].pre_remove(daemon)
KeyError: 'not'
5/13/21 1:26:06 PM [ERR] executing _remove_daemons((<cephadm.module.CephadmOrchestrator object at 0x7f1f4fec2bd0>, [('not.osd.20', 'tutu')])) failed. Traceback (most recent call last):
  File "/usr/share/ceph/mgr/cephadm/utils.py", line 58, in do_work
    return f(self, *arg)
  File "/usr/share/ceph/mgr/cephadm/module.py", line 1804, in _remove_daemons
    return self._remove_daemon(name, host)
  File "/usr/share/ceph/mgr/cephadm/module.py", line 1818, in _remove_daemon
    self.cephadm_services[daemon_type].pre_remove(daemon)
KeyError: 'not'
I also checked that "ceph orch daemon rm foo.bar.42" gives the error "Error
EINVAL: Unable to find daemon(s) ['foo.bar.42']", so it seems the command itself
is at least partly processed fine.
Thanks for any assistance!
--
_____________________________________________________________________
/ __// /__ ____ __ Erkki Seppälä\ \
/ /_ / // // /\ \/ / \ /
/_/ /_/ \___/ /_/\_\(a)inside.org http://www.inside.org/~flux/
Hi all,
I lost 2 OSDs deployed on a single Kingston SSD in a rather strange way and am wondering if anyone has made similar observations or is aware of a firmware bug with these disks.
Disk model: KINGSTON SEDC500M3840G (it ought to be a DC grade model with super capacitors)
Smartctl does not report any drive errors.
Performance per TB is as expected, OSDs are "ceph-volume lvm batch" bluestore deployed, everything collocated.
Short version: I disable volatile write cache on all OSD disks, but the Kingston disks seem to behave as if this cache is *not* disabled. Smartctl and hdparm report wcache=off though. The OSD loss looks like what unflushed write cache during power loss would result in. I'm afraid now that our cluster might be vulnerable to power loss.
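(For completeness, the kernel's own view of the cache setting can also be queried independently of smartctl/hdparm; a small sketch, values are "write back" vs "write through":)

# Sketch: report the kernel's view of the volatile write cache for every SCSI disk.
import glob
import pathlib

for path in sorted(glob.glob("/sys/block/sd*/queue/write_cache")):
    disk = path.split("/")[3]                          # e.g. "sda"
    state = pathlib.Path(path).read_text().strip()     # "write back" or "write through"
    print(f"{disk}: {state}")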
Long version:
Our disks are on Dell HBA330 Mini controllers and are in state "non-raid". The controller itself has no cache and is HBA-mode only.
Log entry:
The iDRAC log shows that the disk was removed from a drive group:
---
PDR5 Disk 6 in Backplane 2 of Integrated Storage Controller 1 is removed.
Detailed Description: A physical disk has been removed from the disk group. This alert can also be caused by loose or defective cables or by problems with the enclosure.
---
The iDRAC reported the disk neither as failed nor as "removed from drive bay". I reseated the disk and it came back as healthy. I assume it was a problem with connectivity to the backplane (chassis). If I now try to start up the OSDs on this disk, I get the error:
starting osd.581 at - osd_data /var/lib/ceph/osd/ceph-581 /var/lib/ceph/osd/ceph-581/journal
starting osd.580 at - osd_data /var/lib/ceph/osd/ceph-580 /var/lib/ceph/osd/ceph-580/journal
2021-05-06 09:23:47.160 7fead5a1fb80 -1 bluefs mount failed to replay log: (5) Input/output error
2021-05-06 09:23:47.160 7fead5a1fb80 -1 bluestore(/var/lib/ceph/osd/ceph-581) _open_db failed bluefs mount: (5) Input/output error
2021-05-06 09:23:47.630 7fead5a1fb80 -1 osd.581 0 OSD:init: unable to mount object store
2021-05-06 09:23:47.630 7fead5a1fb80 -1 ** ERROR: osd init failed: (5) Input/output error
I have removed disks of active OSDs before without any bluestore corruption happening. While it is very well possible that this particular "disconnect" event may lead to a broken OSD, there is another observation where the Kingston disks stick out compared with other SSD OSDs, which makes me suspicious of this being a disk-cache firmware problem:
The I/O indicator LED lights up with significantly lower frequency than for all other SSD types on the same pool, even though we have 2 instead of 1 OSD deployed on the Kingstons (the other disks are 2 TB Micron Pro). While this could be due to a wiring difference, I'm starting to suspect that this might be an indication of volatile caching.
Does anyone using Kingston DC-M-SSDs have similar or contradicting experience?
How did these disks handle power outages?
Any recommendations?
Thanks and best regards,
=================
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14
Hello,
we have a recurring, funky problem with managers on Nautilus (and
probably also earlier versions): the manager displays incorrect
information.
This is a recurring pattern and it also breaks the Prometheus graphs, as
the I/O is reported wildly incorrectly: "recovery: 43 TiB/s, 3.62k
keys/s, 11.40M objects/s", which basically makes the scale of any
related graph unusable.
The latest example from today shows slow ops for an OSD
that has been down for 17h:
--------------------------------------------------------------------------------
[09:50:31] black2.place6:~# ceph -s
cluster:
id: 1ccd84f6-e362-4c50-9ffe-59436745e445
health: HEALTH_WARN
18 slow ops, oldest one blocked for 975 sec, osd.53 has slow ops
services:
mon: 5 daemons, quorum server9,server2,server8,server6,server4 (age 2w)
mgr: server2(active, since 2w), standbys: server8, server4, server9, server6, ciara3
osd: 108 osds: 107 up (since 17h), 107 in (since 17h)
data:
pools: 4 pools, 2624 pgs
objects: 42.52M objects, 162 TiB
usage: 486 TiB used, 298 TiB / 784 TiB avail
pgs: 2616 active+clean
8 active+clean+scrubbing+deep
io:
client: 522 MiB/s rd, 22 MiB/s wr, 8.18k op/s rd, 689 op/s wr
--------------------------------------------------------------------------------
Killing the manager on server2 changes the status to another temporarily
incorrect one (the rebalance shown below actually finished hours ago), paired with
the incorrect rebalance speed that we see from time to time:
--------------------------------------------------------------------------------
[09:51:59] black2.place6:~# ceph -s
cluster:
id: 1ccd84f6-e362-4c50-9ffe-59436745e445
health: HEALTH_OK
services:
mon: 5 daemons, quorum server9,server2,server8,server6,server4 (age 2w)
mgr: server8(active, since 11s), standbys: server4, server9, server6, ciara3
osd: 108 osds: 107 up (since 17h), 107 in (since 17h)
data:
pools: 4 pools, 2624 pgs
objects: 42.52M objects, 162 TiB
usage: 486 TiB used, 298 TiB / 784 TiB avail
pgs: 2616 active+clean
8 active+clean+scrubbing+deep
io:
client: 214 TiB/s rd, 54 TiB/s wr, 4.86G op/s rd, 1.06G op/s wr
recovery: 43 TiB/s, 3.62k keys/s, 11.40M objects/s
progress:
Rebalancing after osd.53 marked out
[========================......]
--------------------------------------------------------------------------------
Then a bit later, the status on the newly started manager is correct:
--------------------------------------------------------------------------------
[09:52:18] black2.place6:~# ceph -s
cluster:
id: 1ccd84f6-e362-4c50-9ffe-59436745e445
health: HEALTH_OK
services:
mon: 5 daemons, quorum server9,server2,server8,server6,server4 (age 2w)
mgr: server8(active, since 47s), standbys: server4, server9, server6, server2, ciara3
osd: 108 osds: 107 up (since 17h), 107 in (since 17h)
data:
pools: 4 pools, 2624 pgs
objects: 42.52M objects, 162 TiB
usage: 486 TiB used, 298 TiB / 784 TiB avail
pgs: 2616 active+clean
8 active+clean+scrubbing+deep
io:
client: 422 MiB/s rd, 39 MiB/s wr, 7.91k op/s rd, 752 op/s wr
--------------------------------------------------------------------------------
Question: is this a known bug, is anyone else seeing it, or are we doing
something wrong?
Best regards,
Nico
--
Sustainable and modern Infrastructures by ungleich.ch
Hi all,
I wanted to provide an RCA for the outage you may have been affected by yesterday. Some services that went down:
- All CI/testing
- quay.ceph.io
- telemetry.ceph.com (your cluster may have gone into HEALTH_WARN if you report telemetry data)
- lists.ceph.io (so all mailing lists)
All of our critical infra is running in a Red Hat Virtualization (RHV) instance backed by Red Hat Gluster Storage (RHGS) as the storage. Before you go, "wait.. Gluster?" Yes, this cluster was set up before Ceph was supported as backend storage for RHV/RHEV.
The root cause of the outage is that the Gluster volumes got 100% full. Once no writes were possible, RHV paused all the VMs.
Why didn't monitoring catch this? I honestly don't know.
# grep ssdstore01 nagios-05-*2021* | grep Disk
nagios-05-01-2021-00.log:[1619740800] CURRENT SERVICE STATE: ssdstore01;Disk Space;OK;HARD;1;Disks are OK now
nagios-05-02-2021-00.log:[1619827200] CURRENT SERVICE STATE: ssdstore01;Disk Space;OK;HARD;1;Disks are OK now
nagios-05-03-2021-00.log:[1619913600] CURRENT SERVICE STATE: ssdstore01;Disk Space;OK;HARD;1;Disks are OK now
nagios-05-04-2021-00.log:[1620000000] CURRENT SERVICE STATE: ssdstore01;Disk Space;OK;HARD;1;Disks are OK now
nagios-05-05-2021-00.log:[1620086400] CURRENT SERVICE STATE: ssdstore01;Disk Space;OK;HARD;1;Disks are OK now
nagios-05-06-2021-00.log:[1620172800] CURRENT SERVICE STATE: ssdstore01;Disk Space;OK;HARD;1;Disks are OK now
nagios-05-07-2021-00.log:[1620259200] CURRENT SERVICE STATE: ssdstore01;Disk Space;OK;HARD;1;Disks are OK now
nagios-05-08-2021-00.log:[1620345600] CURRENT SERVICE STATE: ssdstore01;Disk Space;OK;HARD;1;Disks are OK now
nagios-05-09-2021-00.log:[1620432000] CURRENT SERVICE STATE: ssdstore01;Disk Space;OK;HARD;1;Disks are OK now
nagios-05-10-2021-00.log:[1620518400] CURRENT SERVICE STATE: ssdstore01;Disk Space;OK;HARD;1;Disks are OK now
nagios-05-11-2021-00.log:[1620604800] CURRENT SERVICE STATE: ssdstore01;Disk Space;OK;HARD;1;Disks are OK now
Yet RHV knew we were running out of space. I don't have e-mail notifications set up in RHV, however.
# zgrep "disk space" engine*202105*.gz | cut -d ',' -f4 | head -n 10
Low disk space. hosted_storage domain has 24 GB of free space.
Low disk space. hosted_storage domain has 24 GB of free space.
Low disk space. hosted_storage domain has 23 GB of free space.
Low disk space. hosted_storage domain has 23 GB of free space.
Low disk space. hosted_storage domain has 23 GB of free space.
Low disk space. hosted_storage domain has 23 GB of free space.
Low disk space. hosted_storage domain has 23 GB of free space.
Low disk space. hosted_storage domain has 21 GB of free space.
Low disk space. hosted_storage domain has 20 GB of free space.
Low disk space. hosted_storage domain has 11 GB of free space.
Our nagios instance runs this to check disk space: https://github.com/ceph/ceph-cm-ansible/blob/master/roles/common/files/libe…
You can ignore the comment about it only working for EXT2.
[root@ssdstore01 ~]# /usr/libexec/diskusage.pl 90 95
Disks are OK now
I ran this manually on one of the storage hosts and intentionally set the WARN level to a number lower than the current usage percentage.
[root@ssdstore01 ~]# df -h | grep 'Size\|gluster'
Filesystem Size Used Avail Use% Mounted on
/dev/md124 8.8T 6.7T 2.1T 77% /gluster
[root@ssdstore01 ~]# /usr/libexec/diskusage.pl 95 70
/gluster is at 77%
[root@ssdstore01 ~]# echo $?
2
When I logged in to the storage hosts yesterday morning, the /gluster mount was at 100%. So nagios should have known.
How'd it get fixed? I happened to have some large capacity drives that fit the storage nodes lying around. They're being installed in a different project soon. However, I was able to add these drives, add "bricks" to the Gluster storage, then rebalance the data. Once that was done, I was able to restart all the VMs and delete old VMs and snapshots I no longer needed.
How do we keep this from happening again? Well, as you may have been able to deduce... we were running out of space at a rate of 1-10 GB/day. As you can see now, the Gluster volume has 2.1TB of space left. So even if we grew by 10GB/day again, we'd be okay for 200ish days.
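In the meantime, a dirt-simple independent backstop check could run from cron on the storage hosts and complain well before we get to 100% again (just a sketch, threshold and mount point are arbitrary):

# Sketch: independent free-space check, intended as a cron backstop alongside nagios.
import os
import sys

MOUNT = "/gluster"
WARN_PCT = 90          # arbitrary threshold

st = os.statvfs(MOUNT)
used_pct = 100 * (1 - st.f_bavail / st.f_blocks)
if used_pct >= WARN_PCT:
    print(f"WARNING: {MOUNT} is at {used_pct:.0f}% used", file=sys.stderr)
    sys.exit(2)        # non-zero exit so cron/nagios-style wrappers notice
print(f"{MOUNT} is at {used_pct:.0f}% used")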
I aim to have some (if not all) of these services moved off this platform and into an Openshift cluster backed by Ceph this year. Sadly, I just don't think I have enough logging enabled to nail down exactly what happened.
--
David Galloway
Senior Systems Administrator
Ceph Engineering