Hi,
Has something changed with 'rbd diff' in Octopus, or have I hit a bug? I am no longer able to obtain the list of objects that have changed between two snapshots of an image; it always lists all allocated regions of the RBD image. This behaviour only occurs when I add the '--whole-object' switch.
I'm using a KRBD client with kernel 5.11.7 and Ceph Octopus 15.2.11 as part of Proxmox PVE 6.4, which is based on Debian 10. Images have the features shown below, and I've performed offline object-map checks and rebuilds (no errors reported).
To reproduce the issue I first create a new RBD image (default features are 63), map it using KRBD, write some data, create the first snapshot, overwrite a single object (4 MiB), create a second snapshot and then list the differences:
[admin@kvm1a ~]# rbd create rbd_hdd/test --size 40G
[admin@kvm1a ~]# rbd info rbd_hdd/test
rbd image 'test':
size 40 GiB in 10240 objects
order 22 (4 MiB objects)
snapshot_count: 0
id: 73363f8443987b
block_name_prefix: rbd_data.73363f8443987b
format: 2
features: layering, exclusive-lock, object-map, fast-diff, deep-flatten
op_features:
flags:
create_timestamp: Wed May 12 23:01:11 2021
access_timestamp: Wed May 12 23:01:11 2021
modify_timestamp: Wed May 12 23:01:11 2021
[admin@kvm1a ~]# rbd map rbd_hdd/test
/dev/rbd18
[admin@kvm1a ~]# dd if=/dev/zero of=/dev/rbd18 bs=64M count=1
1+0 records in
1+0 records out
67108864 bytes (67 MB, 64 MiB) copied, 0.668701 s, 100 MB/s
[admin@kvm1a ~]# sync
[admin@kvm1a ~]# rbd snap create rbd_hdd/test@snap1
[admin@kvm1a ~]# dd if=/dev/zero of=/dev/rbd18 bs=4M count=1
1+0 records in
1+0 records out
4194304 bytes (4.2 MB, 4.0 MiB) copied, 0.265691 s, 15.8 MB/s
[admin@kvm1a ~]# sync
[admin@kvm1a ~]# rbd snap create rbd_hdd/test@snap2
[admin@kvm1a ~]# rbd diff --from-snap snap1 rbd_hdd/test@snap2 --format=json
[{"offset":0,"length":4194304,"exists":"true"}]
[admin@kvm1b ~]# rbd diff --from-snap snap1 rbd_hdd/test@snap2 --format=json --whole-object
[{"offset":0,"length":4194304,"exists":"true"},{"offset":4194304,"length":4194304,"exists":"true"},{"offset":8388608,"length":4194304,"exists":"true"},{"offset":12582912,"length":4194304,"exists":"true"},{"offset":16777216,"length":4194304,"exists":"true"},{"offset":20971520,"length":4194304,"exists":"true"},{"offset":25165824,"length":4194304,"exists":"true"},{"offset":29360128,"length":4194304,"exists":"true"},{"offset":33554432,"length":4194304,"exists":"true"},{"offset":37748736,"length":4194304,"exists":"true"},{"offset":41943040,"length":4194304,"exists":"true"},{"offset":46137344,"length":4194304,"exists":"true"},{"offset":50331648,"length":4194304,"exists":"true"},{"offset":54525952,"length":4194304,"exists":"true"},{"offset":58720256,"length":4194304,"exists":"true"},{"offset":62914560,"length":4194304,"exists":"true"}]
[admin@kvm1a ~]# rbd du rbd_hdd/test
NAME PROVISIONED USED
test@snap1 40 GiB 64 MiB
test@snap2 40 GiB 64 MiB
test 40 GiB 4 MiB
<TOTAL> 40 GiB 132 MiB
My tests appear to confirm that adding the '--whole-object' option to 'rbd diff' results in it listing every allocated extent instead of only the changes...
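In the meantime I'm considering post-processing the plain diff myself to get object-granularity extents. This is only a rough sketch, assuming 4 MiB objects and that the diff without '--whole-object' (as shown above) stays correct:

import json
import subprocess

# Rough sketch: round the plain 'rbd diff' extents up to whole 4 MiB objects,
# i.e. what '--whole-object' is supposed to return for the test image above.
OBJ = 4 * 1024 * 1024
out = subprocess.check_output(
    ["rbd", "diff", "--from-snap", "snap1", "rbd_hdd/test@snap2", "--format=json"])
objects = set()
for extent in json.loads(out):
    first = extent["offset"] // OBJ
    last = (extent["offset"] + extent["length"] - 1) // OBJ
    objects.update(range(first, last + 1))
print(json.dumps(
    [{"offset": o * OBJ, "length": OBJ, "exists": "true"} for o in sorted(objects)]))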
Regards
David Herselman
Hi
I'm new to ceph and have been following the Manual Deployment document [1]. The process seems to work correctly until step 18 ("Verify that the monitor is running"):
[centos@cnode-01 ~]$ uname -a
Linux cnode-01 3.10.0-693.5.2.el7.x86_64 #1 SMP Fri Oct 20 20:32:50 UTC 2017 x86_64 x86_64 x86_64 GNU/Linux
[centos@cnode-01 ~]$ ceph -v
ceph version 15.2.11 (e3523634d9c2227df9af89a4eac33d16738c49cb) octopus (stable)
[centos@cnode-01 ~]$ sudo ceph --cluster es-c1 -s
[errno 2] RADOS object not found (error connecting to the cluster)
[centos@cnode-01 ~]$
What is this error trying to tell me? TIA
[1] https://docs.ceph.com/en/latest/install/manual-deployment/
Dear All
We are trying to remove old multipart uploads but run into trouble with some of them that have null characters in their names:
rados -p zh-1.rgw.buckets.index rmomapkey .dir.cb1594b3-a782-49d0-a19f-68cd48870a63.81880353.1.0 '_multipart_MBS-35a9b79c-f27d-44f2-804f-472ef0520816/CBB_BSSRV01/CBB_DiskImage/Disk_4f8130ff-fef5-4b0f-b25e-c6b8b3dba9bf/Volume_NTFS_5b4f5274-9107-4386-93d9-e7f31193805a$/20201218230243/0.cbrevision.525Sr39KY5yVbD_w9ipOXSXsQ95YUnC.25'
rados -p zh-1.rgw.buckets.index rmomapkey .dir.cb1594b3-a782-49d0-a19f-68cd48870a63.81880353.1.0 $(echo -ne '_multipart_MBS-35a9b79c-f27d-44f2-804f-472ef0520816/CBB_BSSRV01/CBB_DiskImage/Disk_4f8130ff-fef5-4b0f-b25e-c6b8b3dba9bf/Volume_NTFS_5b4f5274-9107-4386-93d9-e7f31193805a$/20201218230243/0.cbrevision.525Sr39KY5yVbD_w9ipOXSXsQ95YUnC\0.25')
-bash: warning: command substitution: ignored null byte in input
rados -p zh-1.rgw.buckets.index listomapkeys .dir.cb1594b3-a782-49d0-a19f-68cd48870a63.81880353.1.0 | grep -a '_multipart_MBS-35a9b79c-f27d-44f2-804f-472ef0520816/CBB_BSSRV01/CBB_DiskImage/Disk_4f8130ff-fef5-4b0f-b25e-c6b8b3dba9bf/Volume_NTFS_5b4f5274-9107-4386-93d9-e7f31193805a$/20201218230243/0.cbrevision.525Sr39KY5yVbD_w9ipOXSXsQ95YUnC' | cat -A
_multipart_MBS-35a9b79c-f27d-44f2-804f-472ef0520816/CBB_BSSRV01/CBB_DiskImage/Disk_4f8130ff-fef5-4b0f-b25e-c6b8b3dba9bf/Volume_NTFS_5b4f5274-9107-4386-93d9-e7f31193805a$/20201218230243/0.cbrevision.525Sr39KY5yVbD_w9ipOXSXsQ95YUnC^@.25$ # <= not deleted !
It does not work, as the NUL character is stripped off by the shell.
Any Idea how to proceed?
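One idea we are considering is to bypass the shell entirely and remove the key via the librados Python bindings. This is only a sketch; we have not yet verified that the bindings pass the key with an explicit length rather than as a NUL-terminated C string, so it may hit the same truncation:

import rados

# Sketch: remove the bucket-index omap key that contains the embedded NUL byte,
# without going through a shell argument (shell arguments cannot carry NULs).
key = (b"_multipart_MBS-35a9b79c-f27d-44f2-804f-472ef0520816/CBB_BSSRV01/"
       b"CBB_DiskImage/Disk_4f8130ff-fef5-4b0f-b25e-c6b8b3dba9bf/"
       b"Volume_NTFS_5b4f5274-9107-4386-93d9-e7f31193805a$/20201218230243/"
       b"0.cbrevision.525Sr39KY5yVbD_w9ipOXSXsQ95YUnC\x00.25")

cluster = rados.Rados(conffile="/etc/ceph/ceph.conf")
cluster.connect()
ioctx = cluster.open_ioctx("zh-1.rgw.buckets.index")
with rados.WriteOpCtx() as op:
    ioctx.remove_omap_keys(op, (key,))
    ioctx.operate_write_op(op, ".dir.cb1594b3-a782-49d0-a19f-68cd48870a63.81880353.1.0")
ioctx.close()
cluster.shutdown()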
This bucket was created on Luminous, but this specific object was created after our upgrade to Nautilus.
Apparently some bugs have added NUL characters at the end of MPU object names, between the upload ID and the suffix.
Output from 'radosgw-admin bi list' (note the \u0000 NUL characters):
{
"type": "plain",
"idx": "_multipart_MBS-35a9b79c-f27d-44f2-804f-472ef0520816/CBB_BSSRV01/CBB_DiskImage/Disk_4f8130ff-fef5-4b0f-b25e-c6b8b3dba9bf/Volume_NTFS_5b4f5274-9107-4386-93d9-e7f31193805a$/20201218230243/0.cbrevision.525Sr39KY5yVbD_w9ipOXSXsQ95YUnC\u0000.25",
"entry": {
"name": "_multipart_MBS-35a9b79c-f27d-44f2-804f-472ef0520816/CBB_BSSRV01/CBB_DiskImage/Disk_4f8130ff-fef5-4b0f-b25e-c6b8b3dba9bf/Volume_NTFS_5b4f5274-9107-4386-93d9-e7f31193805a$/20201218230243/0.cbrevision.525Sr39KY5yVbD_w9ipOXSXsQ95YUnC\u0000.25",
"instance": "",
"ver": {
"pool": 6,
"epoch": 852938
},
"locator": "",
"exists": "true",
"meta": {
"category": 1,
"size": 157286400,
"mtime": "2020-12-25 23:39:20.019898Z",
"etag": "a126c2f0d439c44176a5d07bd5841575",
"storage_class": "",
"owner": "40eb21a9092c4948bcf94386f6042f94",
"owner_display_name": "amsler1",
"content_type": "",
"accounted_size": 157286400,
"user_data": "",
"appendable": "false"
},
"tag": "_vMx_4vu-E5nWf7kCHJIQCFPGEHRiUAG",
"flags": 0,
"pending_map": [],
"versioned_epoch": 0
}
},
On the same bucket, we also see NUL characters at the end of some ETags when using 'radosgw-admin bucket list --bucket' but not with 'radosgw-admin object stat':
object='MBS-35a9b79c-f27d-44f2-804f-472ef0520816/CBB_BSSRV01/CBB_HV/BSSRV01.Aerztehaus-allschwil.ch/BSSRV05/D0F970B6-DB86-48AF-AA68-946D4642E2A6.xml:/20191103183115/D0F970B6-DB86-48AF-AA68-946D4642E2A6.xml'
radosgw-admin object stat --bucket="$bucket" --object="$object" | jq -c '{name, size, etag, tag, obj_size: .manifest.obj_size, marker:.manifest.tail_placement.bucket.marker, bucket_id:.manifest.tail_placement.bucket.bucket_id}' | cat -A
{"name":"MBS-35a9b79c-f27d-44f2-804f-472ef0520816/CBB_BSSRV01/CBB_HV/BSSRV01.Aerztehaus-allschwil.ch/BSSRV05/D0F970B6-DB86-48AF-AA68-946D4642E2A6.xml:/20191103183115/D0F970B6-DB86-48AF-AA68-946D4642E2A6.xml","size":39372,"etag":"0e73d594032900acb74d3f06b230aeb9","tag":"_xhNKxuWrfxDO5XfYs8Llq8vLTUYqtmm","obj_size":39372,"marker":"cb1594b3-a782-49d0-a19f-68cd48870a63.19334234.139","bucket_id":"cb1594b3-a782-49d0-a19f-68cd48870a63.20382694.169"}$ # <= no NullChar
radosgw-admin bucket list --bucket "${bucket}" --allow-unordered --max-entries 20000000 | jq -c 'sort_by(.bucket) | .[] | {name, accounted_size: .meta.accounted_size, etag: .meta.etag}' | fgrep -a 0e73d594032900acb74d3f06b230aeb9 | cat -A
{"name":"MBS-35a9b79c-f27d-44f2-804f-472ef0520816/CBB_BSSRV01/CBB_HV/BSSRV01.Aerztehaus-allschwil.ch/BSSRV05/D0F970B6-DB86-48AF-AA68-946D4642E2A6.xml:/20191103183115/D0F970B6-DB86-48AF-AA68-946D4642E2A6.xml","accounted_size":39372,"etag":"0e73d594032900acb74d3f06b230aeb9\u0000"}$ # <= no NullChar
rados -p zh-1.rgw.buckets.data stat 'cb1594b3-a782-49d0-a19f-68cd48870a63.19334234.139_'"$object"
zh-1.rgw.buckets.data/cb1594b3-a782-49d0-a19f-68cd48870a63.19334234.139_MBS-35a9b79c-f27d-44f2-804f-472ef0520816/CBB_BSSRV01/CBB_HV/BSSRV01.Aerztehaus-allschwil.ch/BSSRV05/D0F970B6-DB86-48AF-AA68-946D4642E2A6.xml:/20191103183115/D0F970B6-DB86-48AF-AA68-946D4642E2A6.xml mtime 2020-04-21 14:21:27.000000, size 39372
This bucket was causing multi-site rgw sync to crash every minute when using rgw_sync_obj_etag_verify = true.
These ETag NUL characters may be the cause of this bug:
* https://tracker.ceph.com/issues/49955
It may also be related to:
* https://tracker.ceph.com/issues/23939
So we would be glad to know how to remove these NUL characters from the ETags and how to remove the MPUs with NUL characters in their object names...
Both of these issues seem to be the cause of many weird behaviors:
1. rgw sync crashes (with rgw_sync_obj_etag_verify = true)
2. radosgw-admin bucket sync status --bucket "$bucket" --source-zone ch-zh1-az2 => reports "bucket is caught up with source" even though most of the objects are missing
3. radosgw-admin bucket list --bucket "$bucket" --allow-unordered --max-entries 99000000 => returns an incomplete list
4. radosgw-admin bucket stats --bucket "$bucket" => returns a wrong number of objects and utilized size
The only reliable output is from 'bi list':
* radosgw-admin bi list --bucket=$bucket | jq -cr 'map(select(.type == "plain" or .type == "instance") | .entry)'
Do you know whether the following commands may help and whether they are safe in a multi-site setup?
* radosgw-admin bucket check --bucket $bucket --fix --check-objects
* radosgw-admin bucket rewrite --bucket $bucket --min-rewrite-size 0
Or does a dedicated tool need to be developed to deal with these NUL characters?
Many thanks in advance.
Cheers
Francois Scheurer
--
EveryWare AG
François Scheurer
Senior Systems Engineer
Zurlindenstrasse 52a
CH-8003 Zürich
tel: +41 44 466 60 00
fax: +41 44 466 60 10
mail: francois.scheurer(a)everyware.ch
web: http://www.everyware.ch
Hi all
The scenario is as follows:
Federated user assumes a role via AssumeRoleWithWebIdentity, which gives
permission to create a bucket.
The user creates a bucket and becomes its owner (this is visible in Ceph's web
UI as Owner $oidc$7f71c7c5-c24f-418e-87ac-aa8fe271289b).
The user cannot list the contents of the bucket, however, because the role's policy
does not grant access to the bucket.
Later on, the user re-authenticates and assumes the same role again.
At this point the user cannot access a bucket it owns, for the same reason as above,
I'm assuming.
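If it really is just the role's permission policy that lacks a grant, I assume attaching something like the following (via the RGW IAM endpoint with boto3) would let the user reach its own bucket again. I have not verified this; role name, bucket name, endpoint and credentials are placeholders:

import json
import boto3

# Minimal sketch, assuming the fix is simply to grant the role List/Get/Put on the bucket.
iam = boto3.client(
    "iam",
    endpoint_url="https://rgw.example.com",   # placeholder RGW endpoint
    aws_access_key_id="ACCESS",                # placeholder admin credentials
    aws_secret_access_key="SECRET",
)

role_policy = {
    "Version": "2012-10-17",
    "Statement": [{
        "Effect": "Allow",
        "Action": ["s3:ListBucket", "s3:GetObject", "s3:PutObject"],
        "Resource": ["arn:aws:s3:::my-bucket", "arn:aws:s3:::my-bucket/*"],
    }],
}

iam.put_role_policy(
    RoleName="MyOidcRole",                     # placeholder role name
    PolicyName="AllowMyBucket",
    PolicyDocument=json.dumps(role_policy),
)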
Bucket's ACL after creation
radosgw-admin policy --bucket my-bucket
{
"acl": {
"acl_user_map": [
{
"user": "$oidc$7f71c7c5-c24f-418e-87ac-aa8fe271289b",
"acl": 15
}
],
"acl_group_map": [],
"grant_map": [
{
"id": "$oidc$7f71c7c5-c24f-418e-87ac-aa8fe271289b",
"grant": {
"type": {
"type": 0
},
"id": "$oidc$7f71c7c5-c24f-418e-87ac-aa8fe271289b",
"email": "",
"permission": {
"flags": 15
},
"name": "",
"group": 0,
"url_spec": ""
}
}
]
},
"owner": {
"id": "$oidc$7f71c7c5-c24f-418e-87ac-aa8fe271289b",
"display_name": ""
}
}
This seems inconsistent with buckets created by regular users
Is this expected behaviour?
Regards
Daniel
Hi all
I'm working on the following scenario:
A user is authenticated with OIDC and tries to access a bucket which it does
not own.
How do I specify the user ID etc. in a bucket policy to give access to such a user?
By trial and error I found out that principal can be specified as
"Principal": {"Federated":["arn:aws:sts:::assumed-role/MySession"]},
but I want to use shadow user ID or something similar as the principal
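For context, the kind of policy I am applying looks roughly like this (a sketch; bucket name, session name, endpoint and credentials are placeholders, applied with put_bucket_policy via boto3):

import json
import boto3

s3 = boto3.client(
    "s3",
    endpoint_url="https://rgw.example.com",   # placeholder RGW endpoint
    aws_access_key_id="ACCESS",                # placeholder bucket-owner credentials
    aws_secret_access_key="SECRET",
)

policy = {
    "Version": "2012-10-17",
    "Statement": [{
        "Effect": "Allow",
        # found by trial and error; what I would like instead is to reference
        # the $oidc$... shadow user directly
        "Principal": {"Federated": ["arn:aws:sts:::assumed-role/MySession"]},
        "Action": ["s3:ListBucket", "s3:GetObject"],
        "Resource": ["arn:aws:s3:::my-bucket", "arn:aws:s3:::my-bucket/*"],
    }],
}

s3.put_bucket_policy(Bucket="my-bucket", Policy=json.dumps(policy))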
The docs at
https://docs.ceph.com/en/latest/radosgw/STS/
state:
'A shadow user is created corresponding to every federated user. The user
id is derived from the ‘sub’ field of the incoming web token. The user is
created in a separate namespace - ‘oidc’ such that the user id doesn’t
clash with any other user ids in rgw. The format of the user id is -
<tenant>$<user-namespace>$<sub> where user-namespace is ‘oidc’ for users
that authenticate with oidc providers.'
I see a shadow user in the web UI as e.g. 7f71c7c5-c24f-418e-87ac-aa8fe271289b,
but I cannot work out the syntax of the user ID. I was expecting something
like
"arn:aws:iam:::user/$oidc$7f71c7c5-c24f-418e-87ac-aa8fe271289b"
but when trying to list the contents of the bucket I get AccessDenied.
If the bucket policy has Principal "*", then my authenticated user can access the
bucket.
Is this possible?
Regards
Daniel
Hi,
How I got here
--------------
Yesterday evening I added an OSD to my hobby system, most likely using these
commands:
# ceph-volume raw prepare --bluestore --data /dev/bcache0
# cephadm adopt --style legacy --name osd.20
I also used the command (after not having much luck with that, but I don't
have the specifics):
% ceph orch daemon add osd tutu:/tmp/bcache0
per https://docs.ceph.com/en/latest/cephadm/osd/#creating-new-osds
...which I think resulted in a new osd.18, putting bcache0 inside its own
VG and its own LV.
I don't have an actual log of the commands used, but I did end up with
new OSDs 18 and 20. It was my first time using these commands as well; my previous
ways to achieve the same were a bit more long-winded..
According to my monitoring my main issue appeared around the same time.
In this post I don't worry about the state of the OSD but only about
management.
Actual issue
------------
So when I now issue "ceph orch ls" I get the following output:
% ceph orch ls
Error EINVAL: Traceback (most recent call last):
File "/usr/share/ceph/mgr/mgr_module.py", line 1204, in _handle_command
return self.handle_command(inbuf, cmd)
File "/usr/share/ceph/mgr/orchestrator/_interface.py", line 140, in handle_command
return dispatch[cmd['prefix']].call(self, cmd, inbuf)
File "/usr/share/ceph/mgr/mgr_module.py", line 320, in call
return self.func(mgr, **kwargs)
File "/usr/share/ceph/mgr/orchestrator/_interface.py", line 102, in <lambda>
wrapper_copy = lambda *l_args, **l_kwargs: wrapper(*l_args, **l_kwargs)
File "/usr/share/ceph/mgr/orchestrator/_interface.py", line 91, in wrapper
return func(*args, **kwargs)
File "/usr/share/ceph/mgr/orchestrator/module.py", line 503, in _list_services
raise_if_exception(completion)
File "/usr/share/ceph/mgr/orchestrator/_interface.py", line 642, in raise_if_exception
raise e
AssertionError: not
("ceph orch ps" works fine.)
Similarly the output of "ceph -s" is:
% ceph -s
...
health: HEALTH_ERR
Module 'cephadm' has failed: 'not'
...
The relevant log from the manager, as per the mgr web interface, is:
_Promise failed Traceback (most recent call last):
  File "/usr/share/ceph/mgr/orchestrator/_interface.py", line 294, in _finalize
    next_result = self._on_complete(self._value)
  File "/usr/share/ceph/mgr/cephadm/module.py", line 107, in <lambda>
    return CephadmCompletion(on_complete=lambda _: f(*args, **kwargs))
  File "/usr/share/ceph/mgr/cephadm/module.py", line 1333, in describe_service
    hosts=[dd.hostname]
  File "/lib/python3.6/site-packages/ceph/deployment/service_spec.py", line 429, in __init__
    assert service_type in ServiceSpec.KNOWN_SERVICE_TYPES, service_type
AssertionError: not
I also noticed this seemingly highly relevant bit in my ceph orch ps:
NAME HOST STATUS REFRESHED AGE VERSION IMAGE NAME IMAGE ID CONTAINER ID
not.osd.20 tutu stopped 13h ago 14h <unknown> docker.io/ceph/ceph:v15 <unknown> <unknown>
I'm not quite sure how I ended up with that, but I wouldn't exclude operator
error :) such as entering "cephadm adopt --style legacy --name not.osd.20"
(but WHY..).
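Looking at the assertion in the manager traceback above, my reading is that the service type is derived from everything before the first dot of the daemon name, so "not.osd.20" yields a service type of "not" (illustrative snippet only, this is just my interpretation of the traceback):

# illustrative only: how the bogus daemon name appears to turn into the failing assertion
daemon_name = "not.osd.20"
service_type = daemon_name.split(".", 1)[0]          # -> "not"
# ServiceSpec.__init__ then asserts:
#   assert service_type in ServiceSpec.KNOWN_SERVICE_TYPES, service_type
# which is exactly the "AssertionError: not" shown above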
Sure enough, there is no such docker container running on the host, and the
job ceph-3046312a-e453-11ea-b1f5-b42e993e47fc@osd.20.service has failed with
"RuntimeError: could not find osd.20 with osd_fsid
212c336a-9516-4818-aeaf-2d0c24c4ca65" (this error makes sense, as both OSDs
18 and 20 try to use the same bcache0, but the actual bluestore filesystem
is inside the VG/LV used by 18, whereas 20 tries to use bcache0 directly),
but as I said I won't worry about the OSD at the moment.
I tried the command "ceph orch daemon rm not.osd.20", although I'm not sure
whether it even should work. It nevertheless fails the same way:
% ceph orch daemon rm not.osd.20
Error EINVAL: Traceback (most recent call last):
File "/usr/share/ceph/mgr/mgr_module.py", line 1204, in _handle_command
return self.handle_command(inbuf, cmd)
File "/usr/share/ceph/mgr/orchestrator/_interface.py", line 140, in handle_command
return dispatch[cmd['prefix']].call(self, cmd, inbuf)
File "/usr/share/ceph/mgr/mgr_module.py", line 320, in call
return self.func(mgr, **kwargs)
File "/usr/share/ceph/mgr/orchestrator/_interface.py", line 102, in <lambda>
wrapper_copy = lambda *l_args, **l_kwargs: wrapper(*l_args, **l_kwargs)
File "/usr/share/ceph/mgr/orchestrator/_interface.py", line 91, in wrapper
return func(*args, **kwargs)
File "/usr/share/ceph/mgr/orchestrator/module.py", line 1061, in _daemon_rm
raise_if_exception(completion)
File "/usr/share/ceph/mgr/orchestrator/_interface.py", line 642, in raise_if_exception
raise e
KeyError: 'not'
with the following entries in the mgr log:
5/13/21 1:26:06 PM [ERR] _Promise failed Traceback (most recent call last):
  File "/usr/share/ceph/mgr/orchestrator/_interface.py", line 294, in _finalize
    next_result = self._on_complete(self._value)
  File "/usr/share/ceph/mgr/cephadm/module.py", line 107, in <lambda>
    return CephadmCompletion(on_complete=lambda _: f(*args, **kwargs))
  File "/usr/share/ceph/mgr/cephadm/module.py", line 1515, in remove_daemons
    return self._remove_daemons(args)
  File "/usr/share/ceph/mgr/cephadm/utils.py", line 65, in forall_hosts_wrapper
    return CephadmOrchestrator.instance._worker_pool.map(do_work, vals)
  File "/lib64/python3.6/multiprocessing/pool.py", line 266, in map
    return self._map_async(func, iterable, mapstar, chunksize).get()
  File "/lib64/python3.6/multiprocessing/pool.py", line 644, in get
    raise self._value
  File "/lib64/python3.6/multiprocessing/pool.py", line 119, in worker
    result = (True, func(*args, **kwds))
  File "/lib64/python3.6/multiprocessing/pool.py", line 44, in mapstar
    return list(map(*args))
  File "/usr/share/ceph/mgr/cephadm/utils.py", line 58, in do_work
    return f(self, *arg)
  File "/usr/share/ceph/mgr/cephadm/module.py", line 1804, in _remove_daemons
    return self._remove_daemon(name, host)
  File "/usr/share/ceph/mgr/cephadm/module.py", line 1818, in _remove_daemon
    self.cephadm_services[daemon_type].pre_remove(daemon)
KeyError: 'not'
5/13/21 1:26:06 PM [ERR] executing _remove_daemons((<cephadm.module.CephadmOrchestrator object at 0x7f1f4fec2bd0>, [('not.osd.20', 'tutu')])) failed. Traceback (most recent call last):
  File "/usr/share/ceph/mgr/cephadm/utils.py", line 58, in do_work
    return f(self, *arg)
  File "/usr/share/ceph/mgr/cephadm/module.py", line 1804, in _remove_daemons
    return self._remove_daemon(name, host)
  File "/usr/share/ceph/mgr/cephadm/module.py", line 1818, in _remove_daemon
    self.cephadm_services[daemon_type].pre_remove(daemon)
KeyError: 'not'
I also checked that "ceph orch daemon rm foo.bar.42" gives the error "Error
EINVAL: Unable to find daemon(s) ['foo.bar.42']", so it seems the command itself
is at least partly processed fine.
Thanks for any assistance!
--
_____________________________________________________________________
/ __// /__ ____ __ Erkki Seppälä\ \
/ /_ / // // /\ \/ / \ /
/_/ /_/ \___/ /_/\_\(a)inside.org http://www.inside.org/~flux/
Hi all,
I lost 2 OSDs deployed on a single Kingston SSD in a rather strange way and am wondering if anyone has made similar observations or is aware of a firmware bug with these disks.
Disk model: KINGSTON SEDC500M3840G (it ought to be a DC grade model with super capacitors)
Smartctl does not report any drive errors.
Performance per TB is as expected, OSDs are "ceph-volume lvm batch" bluestore deployed, everything collocated.
Short version: I disable volatile write cache on all OSD disks, but the Kingston disks seem to behave as if this cache is *not* disabled. Smartctl and hdparm report wcache=off though. The OSD loss looks like what unflushed write cache during power loss would result in. I'm afraid now that our cluster might be vulnerable to power loss.
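(For completeness, the kernel's own view of the cache setting can also be queried independently of smartctl/hdparm; a small sketch, values are "write back" vs "write through":)

# Sketch: report the kernel's view of the volatile write cache for every SCSI disk.
import glob
import pathlib

for path in sorted(glob.glob("/sys/block/sd*/queue/write_cache")):
    disk = path.split("/")[3]                          # e.g. "sda"
    state = pathlib.Path(path).read_text().strip()     # "write back" or "write through"
    print(f"{disk}: {state}")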
Long version:
Our disks are on Dell HBA330 Mini controllers and are in state "non-raid". The controller itself has no cache and is HBA-mode only.
Log entry:
The iDRAC log shows that the disk was removed from a drive group:
---
PDR5 Disk 6 in Backplane 2 of Integrated Storage Controller 1 is removed.
Detailed Description: A physical disk has been removed from the disk group. This alert can also be caused by loose or defective cables or by problems with the enclosure.
---
The iDRAC reported the disk neither as failed nor as "removed from drive bay". I reseated the disk and it came back as healthy. I assume it was a problem with connectivity to the backplane (chassis). If I now try to start up the OSDs on this disk, I get the error:
starting osd.581 at - osd_data /var/lib/ceph/osd/ceph-581 /var/lib/ceph/osd/ceph-581/journal
starting osd.580 at - osd_data /var/lib/ceph/osd/ceph-580 /var/lib/ceph/osd/ceph-580/journal
2021-05-06 09:23:47.160 7fead5a1fb80 -1 bluefs mount failed to replay log: (5) Input/output error
2021-05-06 09:23:47.160 7fead5a1fb80 -1 bluestore(/var/lib/ceph/osd/ceph-581) _open_db failed bluefs mount: (5) Input/output error
2021-05-06 09:23:47.630 7fead5a1fb80 -1 osd.581 0 OSD:init: unable to mount object store
2021-05-06 09:23:47.630 7fead5a1fb80 -1 ** ERROR: osd init failed: (5) Input/output error
I have removed disks of active OSDs before without any bluestore corruption happening. While it is very well possible that this particular "disconnect" event may lead to a broken OSD, there is another observation where the Kingston disks stick out compared with other SSD OSDs, which makes me suspicious of this being a disk-cache firmware problem:
The I/O indicator LED lights up with significantly lower frequency than for all other SSD types on the same pool, even though we have 2 instead of 1 OSD deployed on the Kingstons (the other disks are 2 TB Micron Pro). While this could be due to a wiring difference, I'm starting to suspect that this might be an indication of volatile caching.
Does anyone using Kingston DC-M-SSDs have similar or contradicting experience?
How did these disks handle power outages?
Any recommendations?
Thanks and best regards,
=================
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14
Hello,
we have a recurring, funky problem with managers on Nautilus (and
probably also earlier versions): the manager displays incorrect
information.
This is a recurring pattern and it also breaks the Prometheus graphs, as
the I/O is reported wildly incorrectly: "recovery: 43 TiB/s, 3.62k
keys/s, 11.40M objects/s", which basically makes the scale of any
related graph unusable.
The latest example from today shows slow ops for an OSD
that has been down for 17h:
--------------------------------------------------------------------------------
[09:50:31] black2.place6:~# ceph -s
cluster:
id: 1ccd84f6-e362-4c50-9ffe-59436745e445
health: HEALTH_WARN
18 slow ops, oldest one blocked for 975 sec, osd.53 has slow ops
services:
mon: 5 daemons, quorum server9,server2,server8,server6,server4 (age 2w)
mgr: server2(active, since 2w), standbys: server8, server4, server9, server6, ciara3
osd: 108 osds: 107 up (since 17h), 107 in (since 17h)
data:
pools: 4 pools, 2624 pgs
objects: 42.52M objects, 162 TiB
usage: 486 TiB used, 298 TiB / 784 TiB avail
pgs: 2616 active+clean
8 active+clean+scrubbing+deep
io:
client: 522 MiB/s rd, 22 MiB/s wr, 8.18k op/s rd, 689 op/s wr
--------------------------------------------------------------------------------
Killing the manager on server2 changes the status to another temporarily
incorrect one (the rebalance shown below actually finished hours ago), paired with
the incorrect rebalance speed that we see from time to time:
--------------------------------------------------------------------------------
[09:51:59] black2.place6:~# ceph -s
cluster:
id: 1ccd84f6-e362-4c50-9ffe-59436745e445
health: HEALTH_OK
services:
mon: 5 daemons, quorum server9,server2,server8,server6,server4 (age 2w)
mgr: server8(active, since 11s), standbys: server4, server9, server6, ciara3
osd: 108 osds: 107 up (since 17h), 107 in (since 17h)
data:
pools: 4 pools, 2624 pgs
objects: 42.52M objects, 162 TiB
usage: 486 TiB used, 298 TiB / 784 TiB avail
pgs: 2616 active+clean
8 active+clean+scrubbing+deep
io:
client: 214 TiB/s rd, 54 TiB/s wr, 4.86G op/s rd, 1.06G op/s wr
recovery: 43 TiB/s, 3.62k keys/s, 11.40M objects/s
progress:
Rebalancing after osd.53 marked out
[========================......]
--------------------------------------------------------------------------------
Then a bit later, the status on the newly started manager is correct:
--------------------------------------------------------------------------------
[09:52:18] black2.place6:~# ceph -s
cluster:
id: 1ccd84f6-e362-4c50-9ffe-59436745e445
health: HEALTH_OK
services:
mon: 5 daemons, quorum server9,server2,server8,server6,server4 (age 2w)
mgr: server8(active, since 47s), standbys: server4, server9, server6, server2, ciara3
osd: 108 osds: 107 up (since 17h), 107 in (since 17h)
data:
pools: 4 pools, 2624 pgs
objects: 42.52M objects, 162 TiB
usage: 486 TiB used, 298 TiB / 784 TiB avail
pgs: 2616 active+clean
8 active+clean+scrubbing+deep
io:
client: 422 MiB/s rd, 39 MiB/s wr, 7.91k op/s rd, 752 op/s wr
--------------------------------------------------------------------------------
Question: is this a known bug, is anyone else seeing it, or are we doing
something wrong?
Best regards,
Nico
--
Sustainable and modern Infrastructures by ungleich.ch
Hi all,
I wanted to provide an RCA for the outage you may have been affected by yesterday. Some services that went down:
- All CI/testing
- quay.ceph.io
- telemetry.ceph.com (your cluster may have gone into HEALTH_WARN if you report telemetry data)
- lists.ceph.io (so all mailing lists)
All of our critical infra is running in a Red Hat Virtualization (RHV) instance backed by Red Hat Gluster Storage (RHGS) as the storage. Before you go, "wait.. Gluster?" Yes, this cluster was set up before Ceph was supported as backend storage for RHV/RHEV.
The root cause of the outage is that the Gluster volumes got 100% full. Once no writes were possible, RHV paused all the VMs.
Why didn't monitoring catch this? I honestly don't know.
# grep ssdstore01 nagios-05-*2021* | grep Disk
nagios-05-01-2021-00.log:[1619740800] CURRENT SERVICE STATE: ssdstore01;Disk Space;OK;HARD;1;Disks are OK now
nagios-05-02-2021-00.log:[1619827200] CURRENT SERVICE STATE: ssdstore01;Disk Space;OK;HARD;1;Disks are OK now
nagios-05-03-2021-00.log:[1619913600] CURRENT SERVICE STATE: ssdstore01;Disk Space;OK;HARD;1;Disks are OK now
nagios-05-04-2021-00.log:[1620000000] CURRENT SERVICE STATE: ssdstore01;Disk Space;OK;HARD;1;Disks are OK now
nagios-05-05-2021-00.log:[1620086400] CURRENT SERVICE STATE: ssdstore01;Disk Space;OK;HARD;1;Disks are OK now
nagios-05-06-2021-00.log:[1620172800] CURRENT SERVICE STATE: ssdstore01;Disk Space;OK;HARD;1;Disks are OK now
nagios-05-07-2021-00.log:[1620259200] CURRENT SERVICE STATE: ssdstore01;Disk Space;OK;HARD;1;Disks are OK now
nagios-05-08-2021-00.log:[1620345600] CURRENT SERVICE STATE: ssdstore01;Disk Space;OK;HARD;1;Disks are OK now
nagios-05-09-2021-00.log:[1620432000] CURRENT SERVICE STATE: ssdstore01;Disk Space;OK;HARD;1;Disks are OK now
nagios-05-10-2021-00.log:[1620518400] CURRENT SERVICE STATE: ssdstore01;Disk Space;OK;HARD;1;Disks are OK now
nagios-05-11-2021-00.log:[1620604800] CURRENT SERVICE STATE: ssdstore01;Disk Space;OK;HARD;1;Disks are OK now
Yet RHV knew we were running out of space. I don't have e-mail notifications set up in RHV, however.
# zgrep "disk space" engine*202105*.gz | cut -d ',' -f4 | head -n 10
Low disk space. hosted_storage domain has 24 GB of free space.
Low disk space. hosted_storage domain has 24 GB of free space.
Low disk space. hosted_storage domain has 23 GB of free space.
Low disk space. hosted_storage domain has 23 GB of free space.
Low disk space. hosted_storage domain has 23 GB of free space.
Low disk space. hosted_storage domain has 23 GB of free space.
Low disk space. hosted_storage domain has 23 GB of free space.
Low disk space. hosted_storage domain has 21 GB of free space.
Low disk space. hosted_storage domain has 20 GB of free space.
Low disk space. hosted_storage domain has 11 GB of free space.
Our nagios instance runs this to check disk space: https://github.com/ceph/ceph-cm-ansible/blob/master/roles/common/files/libe…
You can ignore the comment about it only working for EXT2.
[root@ssdstore01 ~]# /usr/libexec/diskusage.pl 90 95
Disks are OK now
I ran this manually on one of the storage hosts and intentionally set the WARN level to a number lower than the current usage percentage.
[root@ssdstore01 ~]# df -h | grep 'Size\|gluster'
Filesystem Size Used Avail Use% Mounted on
/dev/md124 8.8T 6.7T 2.1T 77% /gluster
[root@ssdstore01 ~]# /usr/libexec/diskusage.pl 95 70
/gluster is at 77%
[root@ssdstore01 ~]# echo $?
2
When I logged in to the storage hosts yesterday morning, the /gluster mount was at 100%. So nagios should have known.
How'd it get fixed? I happened to have some large capacity drives that fit the storage nodes lying around. They're being installed in a different project soon. However, I was able to add these drives, add "bricks" to the Gluster storage, then rebalance the data. Once that was done, I was able to restart all the VMs and delete old VMs and snapshots I no longer needed.
How do we keep this from happening again? Well, as you may have been able to deduce... we were running out of space at a rate of 1-10 GB/day. As you can see now, the Gluster volume has 2.1TB of space left. So even if we grew by 10GB/day again, we'd be okay for 200ish days.
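In the meantime, a dirt-simple independent backstop check could run from cron on the storage hosts and complain well before we get to 100% again (just a sketch, threshold and mount point are arbitrary):

# Sketch: independent free-space check, intended as a cron backstop alongside nagios.
import os
import sys

MOUNT = "/gluster"
WARN_PCT = 90          # arbitrary threshold

st = os.statvfs(MOUNT)
used_pct = 100 * (1 - st.f_bavail / st.f_blocks)
if used_pct >= WARN_PCT:
    print(f"WARNING: {MOUNT} is at {used_pct:.0f}% used", file=sys.stderr)
    sys.exit(2)        # non-zero exit so cron/nagios-style wrappers notice
print(f"{MOUNT} is at {used_pct:.0f}% used")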
I aim to have some (if not all) of these services moved off this platform and into an Openshift cluster backed by Ceph this year. Sadly, I just don't think I have enough logging enabled to nail down exactly what happened.
--
David Galloway
Senior Systems Administrator
Ceph Engineering