Folks,
Any idea what is going on? I am running a three-node Quincy cluster for
OpenStack, and today I suddenly noticed the following error. I found the
reference below, but I am not sure whether it matches my issue:
https://tracker.ceph.com/issues/51974
root@ceph1:~# ceph -s
  cluster:
    id:     cd748128-a3ea-11ed-9e46-c309158fad32
    health: HEALTH_ERR
            1 mgr modules have recently crashed

  services:
    mon: 3 daemons, quorum ceph1,ceph2,ceph3 (age 2d)
    mgr: ceph1.ckfkeb(active, since 6h), standbys: ceph2.aaptny
    osd: 9 osds: 9 up (since 2d), 9 in (since 2d)

  data:
    pools:   4 pools, 128 pgs
    objects: 1.18k objects, 4.7 GiB
    usage:   17 GiB used, 16 TiB / 16 TiB avail
    pgs:     128 active+clean
root@ceph1:~# ceph health
HEALTH_ERR Module 'devicehealth' has failed: disk I/O error; 1 mgr modules
have recently crashed
root@ceph1:~# ceph crash ls
ID                                                                 ENTITY            NEW
2023-02-07T00:07:12.739187Z_fcb9cbc9-bb55-4e7c-bf00-945b96469035   mgr.ceph1.ckfkeb   *
root@ceph1:~# ceph crash info
2023-02-07T00:07:12.739187Z_fcb9cbc9-bb55-4e7c-bf00-945b96469035
{
    "backtrace": [
        "  File \"/usr/share/ceph/mgr/devicehealth/module.py\", line 373, in serve\n    self.scrape_all()",
        "  File \"/usr/share/ceph/mgr/devicehealth/module.py\", line 425, in scrape_all\n    self.put_device_metrics(device, data)",
        "  File \"/usr/share/ceph/mgr/devicehealth/module.py\", line 500, in put_device_metrics\n    self._create_device(devid)",
        "  File \"/usr/share/ceph/mgr/devicehealth/module.py\", line 487, in _create_device\n    cursor = self.db.execute(SQL, (devid,))",
        "sqlite3.OperationalError: disk I/O error"
    ],
    "ceph_version": "17.2.5",
    "crash_id": "2023-02-07T00:07:12.739187Z_fcb9cbc9-bb55-4e7c-bf00-945b96469035",
    "entity_name": "mgr.ceph1.ckfkeb",
    "mgr_module": "devicehealth",
    "mgr_module_caller": "PyModuleRunner::serve",
    "mgr_python_exception": "OperationalError",
    "os_id": "centos",
    "os_name": "CentOS Stream",
    "os_version": "8",
    "os_version_id": "8",
    "process_name": "ceph-mgr",
    "stack_sig": "7e506cc2729d5a18403f0373447bb825b42aafa2405fb0e5cfffc2896b093ed8",
    "timestamp": "2023-02-07T00:07:12.739187Z",
    "utsname_hostname": "ceph1",
    "utsname_machine": "x86_64",
    "utsname_release": "5.15.0-58-generic",
    "utsname_sysname": "Linux",
    "utsname_version": "#64-Ubuntu SMP Thu Jan 5 11:43:13 UTC 2023"
}
Good morning everyone.
On Thursday night we had an incident: the .data pool of a file system was accidentally renamed, making the file system instantly inaccessible. After renaming it back to the correct name it was possible to mount and list the files, but not to read or write. Writes failed with the FS reporting Read Only, and reads returned Operation not allowed.
After some head-scratching I tried to mount with the ADMIN user and everything worked correctly.
I tried removing the current user's credentials with `ceph auth rm` and creating a new user with `ceph fs authorize <fs_name> client.<user> / rw`, but it behaved exactly the same way. I also tried recreating it with `ceph auth get-or-create`, and nothing changed.
Only after setting `allow *` on mon, mds and osd was I able to mount, read and write again with the new user (see the sequence below).
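For reference, this is roughly the sequence of commands I went through (fs
and client names are placeholders here):

# remove the existing client key
ceph auth rm client.myuser
# recreate it with rw caps on the file system
ceph fs authorize myfs client.myuser / rw
# what finally made it work again: wide-open caps (not what I want long term)
ceph auth caps client.myuser mon 'allow *' mds 'allow *' osd 'allow *'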
I can understand why the file system stopped working after the pool was renamed; what I don't understand is why users could not perform operations on the FS even with RW caps, no matter which user was created.
What could have happened behind the scenes that prevented I/O even with the correct permissions? Or did I apply incorrect permissions that caused this problem?
Right now everything is working, but I would really like to understand what happened, because I couldn't find anything documented about this type of incident.
Hi,
I have a healthy (test) cluster running 17.2.5:
root@cephtest20:~# ceph status
  cluster:
    id:     ba37db20-2b13-11eb-b8a9-871ba11409f6
    health: HEALTH_OK

  services:
    mon:         3 daemons, quorum cephtest31,cephtest41,cephtest21 (age 2d)
    mgr:         cephtest22.lqzdnk(active, since 4d), standbys: cephtest32.ybltym, cephtest42.hnnfaf
    mds:         1/1 daemons up, 1 standby, 1 hot standby
    osd:         48 osds: 48 up (since 4d), 48 in (since 4M)
    rgw:         2 daemons active (2 hosts, 1 zones)
    tcmu-runner: 6 portals active (3 hosts)

  data:
    volumes: 1/1 healthy
    pools:   17 pools, 513 pgs
    objects: 28.25k objects, 4.7 GiB
    usage:   26 GiB used, 4.7 TiB / 4.7 TiB avail
    pgs:     513 active+clean

  io:
    client: 4.3 KiB/s rd, 170 B/s wr, 5 op/s rd, 0 op/s wr
CephFS is mounted and can be used without any issue.
But I get an error when querying its status:
root@cephtest20:~# ceph fs status
Error EINVAL: Traceback (most recent call last):
File "/usr/share/ceph/mgr/mgr_module.py", line 1757, in _handle_command
return CLICommand.COMMANDS[cmd['prefix']].call(self, cmd, inbuf)
File "/usr/share/ceph/mgr/mgr_module.py", line 462, in call
return self.func(mgr, **kwargs)
File "/usr/share/ceph/mgr/status/module.py", line 159, in handle_fs_status
assert metadata
AssertionError
The dashboard's filesystem page shows no error and displays
all information about cephfs.
Where does this AssertionError come from?
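If I read the traceback correctly, the assertion is about the metadata the
mgr holds for an MDS daemon, so my next steps would probably be something
like this (just a guess on my part):

# check whether the mgr can return metadata for every MDS daemon
ceph mds metadata
# fail over to a standby mgr so daemon metadata is re-read
ceph mgr fail cephtest22.lqzdnk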
Regards
--
Robert Sander
Heinlein Support GmbH
Linux: Akademie - Support - Hosting
http://www.heinlein-support.de
Tel: 030-405051-43
Fax: 030-405051-19
Mandatory disclosures per §35a GmbHG:
HRB 93818 B / Amtsgericht Berlin-Charlottenburg,
Managing Director: Peer Heinlein -- Registered office: Berlin
Hi Cephers,
These are the minutes of today's meeting (quicker than usual since some CLT
members were at Ceph Days NYC):
- *[Yuri] Upcoming Releases:*
- Pending PRs for Quincy
- Sepia Lab still absorbing the PR queue after the past issues
- [Ernesto] Github started sending dependabot alerts to developers
(previously they were only sent to org admins)
- https://github.blog/2023-01-17-dependabot-alerts-are-now-visible-to-more-de…
- Most don't necessarily involve a risk (e.g.: Javascript dependency
only exploitable in a back-end/node.js server)...
- ... but it might still cause some unnecessary concern among devs/users
regarding Ceph security status
- Current list of vulnerable dependencies:
https://github.com/ceph/ceph/security/dependabot
- 40% are Dashboard Javascript ones (most could be dismissed since they
only have an impact when used in node.js apps)
- Remaining ones are:
- Python: requirements.txt (not relevant since Python package versions
change with every distro and we assume distro-maintainers will fix those)
- It might become more relevant when we start packaging Python deps (
https://github.com/ceph/ceph/pull/47501/)
- Golang: "/examples/rgw" path (Casey opened
https://tracker.ceph.com/issues/58828, but maybe we should just dismiss
the alert?)
- [Ernesto] Enabling Github Auto-merge feature in the Ceph repo
- https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/i…
- Use case:
- There's a PR with approvals but flaky CI tests (API, make check, ...)
(example: https://github.com/ceph/ceph/pull/50201)
- We could retrigger tests and come back to the PR page multiple times
until all tests pass...
- ... Or we just click the "Auto-merge" button, fill out the merge
message as usual, and let Github merge it when the CI tests pass.
- It'd reduce cognitive load, especially with small PRs (docs, backport
PRs) where the overhead of the PR process is more noticeable.
- There's still one issue:
- Keeping Redmine in sync with Github
- It could be done either when clicking Auto-merge, or by still requiring
reviewers to poll the PR until it passes and then updating Redmine (not ideal)
- A Github action that updates a tracker when Github merges the PR would
be very useful
- Yuri/Ilya: discussion around reversing the order of backport requirements
(needs-qa label vs. approvals vs. CI tests passing).
- Greg pointed out the risks of auto-merge merging PRs with patches
submitted after passing requirements or approvals. Auto-merge status should
be reset on new commits.
- Decision: not to enable it.
- Yuri suggested auto-labeling PRs with passing CI, so they know better
when to start QA testing.
- Separate discussion on CI flakiness & stability and the lack of clear
points of contact (previously Kefu and David handled that). For unit tests
it's clear that the affected teams should take care of them, but for
infrastructure issues there's still a vacuum.
Kind Regards,
Ernesto
Hi all,
we encountered some strange behavior when using storage classes with the S3
protocol. Some objects end up in a different pool than we would expect.
Below is the list of commands we used to create an account with a replicated
storage class, upload some files to the bucket, and check that they were
uploaded to the correct location. Most of the files went to the correct
pool, but some were written to the erasure-code pool instead. Specifically,
certain multipart objects (although those have zero size) and the stub
(head) object. The zero-sized multipart entries probably would not matter
much, but the stub object does: we tried deleting it, and without it
operations on the object no longer go through.
Are there any errors in our settings (see zonegroup and zone settings
below)?
Could someone please share a working setup so we can model ours on it?
We are running Pacific.
=== Create the account ===
radosgw-admin user create --tenant vo_du_test --uid
s3_test_tomash_ec_replicated --display-name s3_test_tomash_ec_replicated
--storage-class "" --placement-id "EC_replicated" --tags ""
radosgw-admin user info --tenant vo_du_test --uid
s3_test_tomash_ec_replicated | jq
'.default_placement,.default_storage_class,.placement_tags'
"EC_replicated"
""
[]
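One thing we are wondering about (just a sketch; we have not verified that
'user modify' accepts these flags the same way 'user create' does) is
whether the empty default_storage_class is what sends objects to the
STANDARD/EC pool, and whether setting it explicitly would change anything:

radosgw-admin user modify --tenant vo_du_test --uid s3_test_tomash_ec_replicated \
  --placement-id EC_replicated --storage-class replicated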
=== Create the bucket with replicated storage class ===
s3cmd mb s3://b7-user-ec-stcl-replicated --storage-class=replicated
Bucket 's3://b7-user-ec-stcl-replicated/' created
s3cmd -c ~/.s3cfg/.s3cfg_clx_vo_du_test_s3_test_tomash_ec_replicated put
20MiB.dat s3://b7-user-ec-stcl-replicated/20MiB_b7_repl.dat
--storage-class=replicated
upload: '20MiB.dat' ->
's3://b7-user-ec-stcl-replicated/20MiB_b7_repl.dat' [part 1 of 2, 15MB]
[1 of 1]
15728640 of 15728640 100% in 0s 20.86 MB/s done
upload: '20MiB.dat' ->
's3://b7-user-ec-stcl-replicated/20MiB_b7_repl.dat' [part 2 of 2, 5MB]
[1 of 1]
5242880 of 5242880 100% in 0s 18.34 MB/s done
s3cmd -c ~/.s3cfg/.s3cfg_clx_vo_du_test_s3_test_tomash_ec_replicated
info s3://b7-user-ec-stcl-replicated/20MiB_b7_repl.dat
s3://b7-user-ec-stcl-replicated/20MiB_b7_repl.dat (object):
File size: 20971520
Storage: replicated
...
ACL: s3_test_tomash_ec_replicated: FULL_CONTROL
...
=== Checking where objects are stored ===
rados -p storage-clx.rgw.replicated.data ls | grep _b7_repl
68618cb9-d006-4b86-b8f0-3722d952fd33.69137019.1__shadow_20MiB_b7_repl.dat.2~AooZ4xMGNV7oOzIQbNdo9fV5qw3vEr4.1_1
68618cb9-d006-4b86-b8f0-3722d952fd33.69137019.1__multipart_20MiB_b7_repl.dat.2~AooZ4xMGNV7oOzIQbNdo9fV5qw3vEr4.2
68618cb9-d006-4b86-b8f0-3722d952fd33.69137019.1__shadow_20MiB_b7_repl.dat.2~AooZ4xMGNV7oOzIQbNdo9fV5qw3vEr4.1_3
68618cb9-d006-4b86-b8f0-3722d952fd33.69137019.1__shadow_20MiB_b7_repl.dat.2~AooZ4xMGNV7oOzIQbNdo9fV5qw3vEr4.1_2
68618cb9-d006-4b86-b8f0-3722d952fd33.69137019.1__shadow_20MiB_b7_repl.dat.2~AooZ4xMGNV7oOzIQbNdo9fV5qw3vEr4.2_1
68618cb9-d006-4b86-b8f0-3722d952fd33.69137019.1__multipart_20MiB_b7_repl.dat.2~AooZ4xMGNV7oOzIQbNdo9fV5qw3vEr4.1
ceph_mon:# rados -p storage-clx.rgw.EC.data ls | grep _b7_repl
68618cb9-d006-4b86-b8f0-3722d952fd33.69137019.1__multipart_20MiB_b7_repl.dat.2~AooZ4xMGNV7oOzIQbNdo9fV5qw3vEr4.2
68618cb9-d006-4b86-b8f0-3722d952fd33.69137019.1__multipart_20MiB_b7_repl.dat.2~AooZ4xMGNV7oOzIQbNdo9fV5qw3vEr4.1
68618cb9-d006-4b86-b8f0-3722d952fd33.69137019.1_20MiB_b7_repl.dat
ceph_mon:# rados stat -p storage-clx.rgw.EC.data
68618cb9-d006-4b86-b8f0-3722d952fd33.69137019.1_20MiB_b7_repl.dat
storage-clx.rgw.EC.data/68618cb9-d006-4b86-b8f0-3722d952fd33.69137019.1_20MiB_b7_repl.dat
mtime 2023-01-24T12:46:48.000000+0100, size 0
ceph_mon:# rados stat -p storage-clx.rgw.EC.data
68618cb9-d006-4b86-b8f0-3722d952fd33.69137019.1__multipart_20MiB_b7_repl.dat.2~AooZ4xMGNV7oOzIQbNdo9fV5qw3vEr4.1
storage-clx.rgw.EC.data/68618cb9-d006-4b86-b8f0-3722d952fd33.69137019.1__multipart_20MiB_b7_repl.dat.2~AooZ4xMGNV7oOzIQbNdo9fV5qw3vEr4.1
mtime 2023-01-24T12:46:48.000000+0100, size 0
ceph_mon:# rados stat -p storage-clx.rgw.EC.data
68618cb9-d006-4b86-b8f0-3722d952fd33.69137019.1__multipart_20MiB_b7_repl.dat.2~AooZ4xMGNV7oOzIQbNdo9fV5qw3vEr4.2
storage-clx.rgw.EC.data/68618cb9-d006-4b86-b8f0-3722d952fd33.69137019.1__multipart_20MiB_b7_repl.dat.2~AooZ4xMGNV7oOzIQbNdo9fV5qw3vEr4.2
mtime 2023-01-24T12:46:48.000000+0100, size 0
ceph_mon:# rados stat -p storage-clx.rgw.replicated.data
68618cb9-d006-4b86-b8f0-3722d952fd33.69137019.1__multipart_20MiB_b7_repl.dat.2~AooZ4xMGNV7oOzIQbNdo9fV5qw3vEr4.1
storage-clx.rgw.replicated.data/68618cb9-d006-4b86-b8f0-3722d952fd33.69137019.1__multipart_20MiB_b7_repl.dat.2~AooZ4xMGNV7oOzIQbNdo9fV5qw3vEr4.1
mtime 2023-01-24T12:46:47.000000+0100, size 4194304
rados stat -p storage-clx.rgw.replicated.data
68618cb9-d006-4b86-b8f0-3722d952fd33.69137019.1__shadow_20MiB_b7_repl.dat.2~AooZ4xMGNV7oOzIQbNdo9fV5qw3vEr4.1_1
storage-clx.rgw.replicated.data/68618cb9-d006-4b86-b8f0-3722d952fd33.69137019.1__shadow_20MiB_b7_repl.dat.2~AooZ4xMGNV7oOzIQbNdo9fV5qw3vEr4.1_1
mtime 2023-01-24T12:46:48.000000+0100, size 4194304
ceph_mon:# rados stat -p storage-clx.rgw.replicated.data
68618cb9-d006-4b86-b8f0-3722d952fd33.69137019.1__shadow_20MiB_b7_repl.dat.2~AooZ4xMGNV7oOzIQbNdo9fV5qw3vEr4.1_2
storage-clx.rgw.replicated.data/68618cb9-d006-4b86-b8f0-3722d952fd33.69137019.1__shadow_20MiB_b7_repl.dat.2~AooZ4xMGNV7oOzIQbNdo9fV5qw3vEr4.1_2
mtime 2023-01-24T12:46:48.000000+0100, size 4194304
ceph_mon:# rados stat -p storage-clx.rgw.replicated.data
68618cb9-d006-4b86-b8f0-3722d952fd33.69137019.1__shadow_20MiB_b7_repl.dat.2~AooZ4xMGNV7oOzIQbNdo9fV5qw3vEr4.1_3
storage-clx.rgw.replicated.data/68618cb9-d006-4b86-b8f0-3722d952fd33.69137019.1__shadow_20MiB_b7_repl.dat.2~AooZ4xMGNV7oOzIQbNdo9fV5qw3vEr4.1_3
mtime 2023-01-24T12:46:48.000000+0100, size 3145728
ceph_mon:# rados stat -p storage-clx.rgw.replicated.data
68618cb9-d006-4b86-b8f0-3722d952fd33.69137019.1__multipart_20MiB_b7_repl.dat.2~AooZ4xMGNV7oOzIQbNdo9fV5qw3vEr4.2
storage-clx.rgw.replicated.data/68618cb9-d006-4b86-b8f0-3722d952fd33.69137019.1__multipart_20MiB_b7_repl.dat.2~AooZ4xMGNV7oOzIQbNdo9fV5qw3vEr4.2
mtime 2023-01-24T12:46:48.000000+0100, size 4194304
ceph_mon:# rados stat -p storage-clx.rgw.replicated.data
68618cb9-d006-4b86-b8f0-3722d952fd33.69137019.1__shadow_20MiB_b7_repl.dat.2~AooZ4xMGNV7oOzIQbNdo9fV5qw3vEr4.2_1
storage-clx.rgw.replicated.data/68618cb9-d006-4b86-b8f0-3722d952fd33.69137019.1__shadow_20MiB_b7_repl.dat.2~AooZ4xMGNV7oOzIQbNdo9fV5qw3vEr4.2_1
mtime 2023-01-24T12:46:48.000000+0100, size 1048576
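If it helps the diagnosis, we could also dump RGW's own view of the object
placement (sketch only; we are not sure about the exact tenant/bucket
syntax for tenanted buckets):

# the manifest in the output should show which placement rule / storage
# class the head and tail of the object were written with
radosgw-admin object stat --bucket=vo_du_test/b7-user-ec-stcl-replicated --object=20MiB_b7_repl.dat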
=== Zonegroup setting ===
radosgw-admin zonegroup get
{
"id": "kda7dadb-7bbc-4197-b070-147f76e74717",
"name": "storage",
"api_name": "storage",
"is_master": "true",
"endpoints": [
"https://s3.clx.domain.cz"
],
"hostnames": [],
"hostnames_s3website": [],
"master_zone": "69922cb9-d006-4b86-b8f0-3722d952fd33",
"zones": [
{
"id": "24654cb9-d006-4b86-b8f0-3722d952fd33",
"name": "storage-clx",
"endpoints": [
"https://s3.clx.domain.cz"
],
"log_meta": "false",
"log_data": "false",
"bucket_index_max_shards": 11,
"read_only": "false",
"tier_type": "",
"sync_from_all": "true",
"sync_from": [],
"redirect_zone": ""
}
],
"placement_targets": [
{
"name": "EC_replicated",
"tags": [],
"storage_classes": [
"EC",
"STANDARD",
"replicated"
]
},
{
"name": "default-placement",
"tags": [],
"storage_classes": [
"STANDARD"
]
},
{
"name": "replicated",
"tags": [],
"storage_classes": [
"STANDARD"
]
}
],
"default_placement": "default-placement",
"realm_id": "87a9217-7e8d-4a1b-b508-eb634cc55647",
"sync_policy": {
"groups": []
}
}
=== Zone setting ===
radosgw-admin zone get
{
"id": "24654cb9-d006-4b86-b8f0-3722d952fd33",
"name": "storage-clx",
"domain_root": "storage-clx.rgw.meta:root",
"control_pool": "storage-clx.rgw.control",
"gc_pool": "storage-clx.rgw.log:gc",
"lc_pool": "storage-clx.rgw.log:lc",
"log_pool": "storage-clx.rgw.log",
"intent_log_pool": "storage-clx.rgw.log:intent",
"usage_log_pool": "storage-clx.rgw.log:usage",
"roles_pool": "storage-clx.rgw.meta:roles",
"reshard_pool": "storage-clx.rgw.log:reshard",
"user_keys_pool": "storage-clx.rgw.meta:users.keys",
"user_email_pool": "storage-clx.rgw.meta:users.email",
"user_swift_pool": "storage-clx.rgw.meta:users.swift",
"user_uid_pool": "storage-clx.rgw.meta:users.uid",
"otp_pool": "storage-clx.rgw.otp",
"system_key": {
"access_key": "",
"secret_key": ""
},
"placement_pools": [
{
"key": "EC_replicated",
"val": {
"index_pool": "storage-clx.rgw.EC.index",
"storage_classes": {
"STANDARD": {
"data_pool": "storage-clx.rgw.EC.data"
},
"replicated": {
"data_pool": "storage-clx.rgw.replicated.data"
}
},
"data_extra_pool": "storage-clx.rgw.EC.non-ec",
"index_type": 0
}
},
{
"key": "default-placement",
"val": {
"index_pool": "storage-clx.rgw.EC.index",
"storage_classes": {
"STANDARD": {
"data_pool": "storage-clx.rgw.EC.data"
}
},
"data_extra_pool": "storage-clx.rgw.EC.non-ec",
"index_type": 0
}
},
{
"key": "replicated",
"val": {
"index_pool": "storage-clx.rgw.replicated.index",
"storage_classes": {
"STANDARD": {
"data_pool": "storage-clx.rgw.replicated.data"
}
},
"data_extra_pool": "",
"index_type": 0
}
}
],
"realm_id": "87a9217-7e8d-4a1b-b508-eb634cc55647",
"notif_pool": "storage-clx.rgw.log:notif"
}
Thank you
Regards,
Michal Strnad
Hi *,
I was playing around on an upgraded test cluster (from N to Q),
current version:
"overall": {
"ceph version 17.2.5
(98318ae89f1a893a6ded3a640405cdbb33e08757) quincy (stable)": 18
}
I tried to replace an OSD after destroying it with 'ceph orch osd rm
osd.5 --replace'. The OSD was drained successfully and marked as
"destroyed" as expected, the zapping also worked. At this point I
didn't have an osd spec in place because all OSDs were adopted during
the upgrade process. So I created a new spec which was not applied
successfully (I'm wondering if there's another/new issue with
ceph-volume, but that's not the focus here), so I tried it manually
with 'cephadm ceph-volume lvm create'. I'll add the output at the end
for better readability. Apparently, there's no bootstrap-osd keyring
available to cephadm, so it can't look up the desired osd_id in the osd
tree; the command it tries is this:
ceph --cluster ceph --name client.bootstrap-osd --keyring
/var/lib/ceph/bootstrap-osd/ceph.keyring osd tree -f json
In the local filesystem the required keyring is present, though:
nautilus:~ # cat /var/lib/ceph/bootstrap-osd/ceph.keyring
[client.bootstrap-osd]
key = AQBOCbpgixIsOBAAgBzShsFg/l1bOze4eTZHug==
caps mgr = "allow r"
caps mon = "profile bootstrap-osd"
Is there something missing during the adoption process? Or are the
docs lacking some upgrade info? I found a section about putting
keyrings under management [1], but I'm not sure if that's what's
missing here.
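For the record, my next attempt will be to hand the host's keyring to
cephadm explicitly (untested so far, and assuming cephadm's
--keyring/--config options for the ceph-volume subcommand get mounted into
the container as I expect):

# point the containerized ceph-volume at the host's bootstrap-osd keyring
cephadm ceph-volume --keyring /var/lib/ceph/bootstrap-osd/ceph.keyring \
  lvm create --osd-id 5 --data /dev/sde --block.db /dev/sdb --block.db-size 5G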
Any insights are highly appreciated!
Thanks,
Eugen
[1]
https://docs.ceph.com/en/quincy/cephadm/operations/#putting-a-keyring-under…
---snip---
nautilus:~ # cephadm ceph-volume lvm create --osd-id 5 --data /dev/sde
--block.db /dev/sdb --block.db-size 5G
Inferring fsid <FSID>
Using recent ceph image
<LOCAL_REGISTRY>/ceph/ceph@sha256:af50ec26db7ee177e1ec1b553a0d6a9dbad2c3cc0da2f8f46d012184a79d4f92
Non-zero exit code 1 from /usr/bin/podman run --rm --ipc=host
--stop-signal=SIGTERM --authfile=/etc/ceph/podman-auth.json --net=host
--entrypoint /usr/sbin/ceph-volume --privileged --group-add=disk
--init -e
CONTAINER_IMAGE=<LOCAL_REGISTRY>/ceph/ceph@sha256:af50ec26db7ee177e1ec1b553a0d6a9dbad2c3cc0da2f8f46d012184a79d4f92 -e NODE_NAME=nautilus -e CEPH_USE_RANDOM_NONCE=1 -e CEPH_VOLUME_SKIP_RESTORECON=yes -e CEPH_VOLUME_DEBUG=1 -v /var/run/ceph/<FSID>:/var/run/ceph:z -v /var/log/ceph/<FSID>:/var/log/ceph:z -v /var/lib/ceph/<FSID>/crash:/var/lib/ceph/crash:z -v /dev:/dev -v /run/udev:/run/udev -v /sys:/sys -v /run/lvm:/run/lvm -v /run/lock/lvm:/run/lock/lvm -v /:/rootfs -v /tmp/ceph-tmpuydvbhuk:/etc/ceph/ceph.conf:z <LOCAL_REGISTRY>/ceph/ceph@sha256:af50ec26db7ee177e1ec1b553a0d6a9dbad2c3cc0da2f8f46d012184a79d4f92 lvm create --osd-id 5 --data /dev/sde --block.db /dev/sdb --block.db-size
5G
/usr/bin/podman: stderr time="2023-02-20T09:02:49+01:00" level=warning
msg="Path \"/etc/SUSEConnect\" from \"/etc/containers/mounts.conf\"
doesn't exist, skipping"
/usr/bin/podman: stderr time="2023-02-20T09:02:49+01:00" level=warning
msg="Path \"/etc/zypp/credentials.d/SCCcredentials\" from
\"/etc/containers/mounts.conf\" doesn't exist, skipping"
/usr/bin/podman: stderr Running command: /usr/bin/ceph-authtool
--gen-print-key
/usr/bin/podman: stderr Running command: /usr/bin/ceph --cluster ceph
--name client.bootstrap-osd --keyring
/var/lib/ceph/bootstrap-osd/ceph.keyring osd tree -f json
/usr/bin/podman: stderr stderr: 2023-02-20T08:02:50.848+0000
7fd255e30700 -1 auth: unable to find a keyring on
/etc/ceph/ceph.client.bootstrap-osd.keyring,/etc/ceph/ceph.keyring,/etc/ceph/keyring,/etc/ceph/keyring.bin: (2) No such file or
directory
/usr/bin/podman: stderr stderr: 2023-02-20T08:02:50.848+0000
7fd255e30700 -1 AuthRegistry(0x7fd250060d50) no keyring found at
/etc/ceph/ceph.client.bootstrap-osd.keyring,/etc/ceph/ceph.keyring,/etc/ceph/keyring,/etc/ceph/keyring.bin, disabling
cephx
/usr/bin/podman: stderr stderr: 2023-02-20T08:02:50.852+0000
7fd255e30700 -1 auth: unable to find a keyring on
/var/lib/ceph/bootstrap-osd/ceph.keyring: (2) No such file or directory
/usr/bin/podman: stderr stderr: 2023-02-20T08:02:50.852+0000
7fd255e30700 -1 AuthRegistry(0x7fd250060d50) no keyring found at
/var/lib/ceph/bootstrap-osd/ceph.keyring, disabling cephx
/usr/bin/podman: stderr stderr: 2023-02-20T08:02:50.856+0000
7fd255e30700 -1 auth: unable to find a keyring on
/var/lib/ceph/bootstrap-osd/ceph.keyring: (2) No such file or directory
/usr/bin/podman: stderr stderr: 2023-02-20T08:02:50.856+0000
7fd255e30700 -1 AuthRegistry(0x7fd250065910) no keyring found at
/var/lib/ceph/bootstrap-osd/ceph.keyring, disabling cephx
/usr/bin/podman: stderr stderr: 2023-02-20T08:02:50.856+0000
7fd255e30700 -1 auth: unable to find a keyring on
/var/lib/ceph/bootstrap-osd/ceph.keyring: (2) No such file or directory
/usr/bin/podman: stderr stderr: 2023-02-20T08:02:50.856+0000
7fd255e30700 -1 AuthRegistry(0x7fd255e2eea0) no keyring found at
/var/lib/ceph/bootstrap-osd/ceph.keyring, disabling cephx
/usr/bin/podman: stderr stderr: [errno 2] RADOS object not found
(error connecting to the cluster)
/usr/bin/podman: stderr Traceback (most recent call last):
/usr/bin/podman: stderr File "/usr/sbin/ceph-volume", line 11, in <module>
/usr/bin/podman: stderr load_entry_point('ceph-volume==1.0.0',
'console_scripts', 'ceph-volume')()
/usr/bin/podman: stderr File
"/usr/lib/python3.6/site-packages/ceph_volume/main.py", line 41, in
__init__
/usr/bin/podman: stderr self.main(self.argv)
/usr/bin/podman: stderr File
"/usr/lib/python3.6/site-packages/ceph_volume/decorators.py", line 59,
in newfunc
/usr/bin/podman: stderr return f(*a, **kw)
/usr/bin/podman: stderr File
"/usr/lib/python3.6/site-packages/ceph_volume/main.py", line 153, in
main
/usr/bin/podman: stderr terminal.dispatch(self.mapper, subcommand_args)
/usr/bin/podman: stderr File
"/usr/lib/python3.6/site-packages/ceph_volume/terminal.py", line 194,
in dispatch
/usr/bin/podman: stderr instance.main()
/usr/bin/podman: stderr File
"/usr/lib/python3.6/site-packages/ceph_volume/devices/lvm/main.py",
line 46, in main
/usr/bin/podman: stderr terminal.dispatch(self.mapper, self.argv)
/usr/bin/podman: stderr File
"/usr/lib/python3.6/site-packages/ceph_volume/terminal.py", line 194,
in dispatch
/usr/bin/podman: stderr instance.main()
/usr/bin/podman: stderr File
"/usr/lib/python3.6/site-packages/ceph_volume/devices/lvm/create.py",
line 77, in main
/usr/bin/podman: stderr self.create(args)
/usr/bin/podman: stderr File
"/usr/lib/python3.6/site-packages/ceph_volume/decorators.py", line 16,
in is_root
/usr/bin/podman: stderr return func(*a, **kw)
/usr/bin/podman: stderr File
"/usr/lib/python3.6/site-packages/ceph_volume/devices/lvm/create.py",
line 26, in create
/usr/bin/podman: stderr prepare_step.safe_prepare(args)
/usr/bin/podman: stderr File
"/usr/lib/python3.6/site-packages/ceph_volume/devices/lvm/prepare.py",
line 252, in safe_prepare
/usr/bin/podman: stderr self.prepare()
/usr/bin/podman: stderr File
"/usr/lib/python3.6/site-packages/ceph_volume/decorators.py", line 16,
in is_root
/usr/bin/podman: stderr return func(*a, **kw)
/usr/bin/podman: stderr File
"/usr/lib/python3.6/site-packages/ceph_volume/devices/lvm/prepare.py",
line 292, in prepare
/usr/bin/podman: stderr self.osd_id =
prepare_utils.create_id(osd_fsid, json.dumps(secrets),
osd_id=self.args.osd_id)
/usr/bin/podman: stderr File
"/usr/lib/python3.6/site-packages/ceph_volume/util/prepare.py", line
166, in create_id
/usr/bin/podman: stderr if osd_id_available(osd_id):
/usr/bin/podman: stderr File
"/usr/lib/python3.6/site-packages/ceph_volume/util/prepare.py", line
204, in osd_id_available
/usr/bin/podman: stderr raise RuntimeError('Unable check if OSD id
exists: %s' % osd_id)
/usr/bin/podman: stderr RuntimeError: Unable check if OSD id exists: 5
Traceback (most recent call last):
File "/usr/sbin/cephadm", line 9170, in <module>
main()
File "/usr/sbin/cephadm", line 9158, in main
r = ctx.func(ctx)
File "/usr/sbin/cephadm", line 1917, in _infer_config
return func(ctx)
File "/usr/sbin/cephadm", line 1877, in _infer_fsid
return func(ctx)
File "/usr/sbin/cephadm", line 1945, in _infer_image
return func(ctx)
File "/usr/sbin/cephadm", line 1835, in _validate_fsid
return func(ctx)
File "/usr/sbin/cephadm", line 5294, in command_ceph_volume
out, err, code = call_throws(ctx, c.run_cmd())
File "/usr/sbin/cephadm", line 1637, in call_throws
raise RuntimeError('Failed command: %s' % ' '.join(command))
RuntimeError: Failed command: /usr/bin/podman run --rm --ipc=host
--stop-signal=SIGTERM --authfile=/etc/ceph/podman-auth.json --net=host
--entrypoint /usr/sbin/ceph-volume --privileged --group-add=disk
--init -e
CONTAINER_IMAGE=<LOCAL_REGISTRY>/ceph/ceph@sha256:af50ec26db7ee177e1ec1b553a0d6a9dbad2c3cc0da2f8f46d012184a79d4f92 -e NODE_NAME=nautilus -e CEPH_USE_RANDOM_NONCE=1 -e CEPH_VOLUME_SKIP_RESTORECON=yes -e CEPH_VOLUME_DEBUG=1 -v /var/run/ceph/<FSID>:/var/run/ceph:z -v /var/log/ceph/<FSID>:/var/log/ceph:z -v /var/lib/ceph/<FSID>/crash:/var/lib/ceph/crash:z -v /dev:/dev -v /run/udev:/run/udev -v /sys:/sys -v /run/lvm:/run/lvm -v /run/lock/lvm:/run/lock/lvm -v /:/rootfs -v /tmp/ceph-tmpuydvbhuk:/etc/ceph/ceph.conf:z <LOCAL_REGISTRY>/ceph/ceph@sha256:af50ec26db7ee177e1ec1b553a0d6a9dbad2c3cc0da2f8f46d012184a79d4f92 lvm create --osd-id 5 --data /dev/sde --block.db /dev/sdb --block.db-size
5G
---snip---
Hello,
Asking for help with an issue. Maybe someone has a clue about what's
going on.
Using Ceph 15.2.17 on Proxmox 7.3. A big VM had a snapshot and I removed
it. A bit later, nearly half of the PGs of the pool entered snaptrim and
snaptrim_wait state, as expected. The problem is that these operations ran
extremely slowly and client I/O dropped to nearly nothing, so all VMs in
the cluster got stuck because they could not do I/O to the storage. Taking
and removing big snapshots is a normal operation that we do often, and this
is the first time I have seen this issue in any of my clusters.
Disks are all Samsung PM1733 and the network is 25G. That gives us plenty
of performance for the use case, and we have never had an issue with the
hardware. Both disk I/O and network I/O were very low. Still, client I/O
seemed to get queued forever. Disabling snaptrim (ceph osd set nosnaptrim)
stops any active snaptrim operation and client I/O returns to normal.
Enabling snaptrim again makes client I/O almost halt again.
I've been playing with some settings:
ceph tell 'osd.*' injectargs '--osd-max-trimming-pgs 1'
ceph tell 'osd.*' injectargs '--osd-snap-trim-sleep 30'
ceph tell 'osd.*' injectargs '--osd-snap-trim-sleep-ssd 30'
ceph tell 'osd.*' injectargs '--osd-pg-max-concurrent-snap-trims 1'
None of them really seemed to help. I also tried restarting the OSD services.
This cluster was upgraded from 14.2.x to 15.2.17 a couple of months ago. Is
there any setting that must be changed after the upgrade that may be
causing this problem?
I have scheduled a maintenance window; what should I look for to
diagnose this problem?
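To frame possible answers, these are the checks I was planning to run
during the window (just a sketch; osd.0 as an example, and the 'ceph
daemon' commands have to run on the host of that OSD):

# how many PGs are currently in snaptrim / snaptrim_wait
ceph pg dump pgs_brief 2>/dev/null | grep -c snaptrim
# what the OSD is busy with while client I/O stalls
ceph daemon osd.0 dump_ops_in_flight
# effective values of the snap trim related settings
ceph daemon osd.0 config show | grep snap_trim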
Any help is very appreciated. Thanks in advance.
Victor
Hi, please see the output below.
ceph-iscsi-gw-1.ipa.pthl.hklocalhost.localdomain is the entry that ended up
with a wrong hostname; I want to delete it.
/iscsi-target...-igw/gateways> ls
o- gateways ..................................................................................................
[Up: 2/3, Portals: 3]
o- ceph-iscsi-gw-1.ipa.pthl.hk
.............................................................................
[172.16.202.251 (UP)]
o- ceph-iscsi-gw-1.ipa.pthl.hklocalhost.localdomain
.............................................. [172.16.202.251
(UNAUTHORIZED)]
o- ceph-iscsi-gw-2.ipa.pthl.hk
.............................................................................
[172.16.202.252 (UP)]
/iscsi-target...-igw/gateways> delete
gateway_name=ceph-iscsi-gw-1.ipa.pthl.hklocalhost.localdomain
confirm=true
Deleting gateway, ceph-iscsi-gw-1.ipa.pthl.hklocalhost.localdomain
Could not contact ceph-iscsi-gw-1.ipa.pthl.hklocalhost.localdomain. If
the gateway is permanently down. Use confirm=true to force removal.
WARNING: Forcing removal of a gateway that can still be reached by an
initiator may result in data corruption.
/iscsi-target...-igw/gateways>
/iscsi-target...-igw/gateways> delete
gateway_name=ceph-iscsi-gw-1.ipa.pthl.hklocalhost.localdomain
confirm=true
Deleting gateway, ceph-iscsi-gw-1.ipa.pthl.hklocalhost.localdomain
Failed : Unhandled exception: list.remove(x): x not in list
However ceph-iscsi-gw-1.ipa.pthl.hklocalhost.localdomain is still there.
Version info is ceph-iscsi-3.5-1.el8cp.noarch on RHEL 8.4.
/iscsi-target...-igw/gateways> ls
o- gateways ..................................................................................................
[Up: 2/3, Portals: 3]
o- ceph-iscsi-gw-1.ipa.pthl.hk
.............................................................................
[172.16.202.251 (UP)]
o- ceph-iscsi-gw-1.ipa.pthl.hklocalhost.localdomain
................................................... [172.16.202.251
(UNKNOWN)]
o- ceph-iscsi-gw-2.ipa.pthl.hk
.............................................................................
[172.16.202.252 (UP)]
/iscsi-target...-igw/gateways> delete
ceph-iscsi-gw-1.ipa.pthl.hklocalhost.localdomain confirm=true
Deleting gateway, ceph-iscsi-gw-1.ipa.pthl.hklocalhost.localdomain
Failed : Unhandled exception: list.remove(x): x not in list
However ceph-iscsi-gw-1.ipa.pthl.hklocalhost.localdomain is still there.
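In case it is useful: I have not touched the stored configuration directly.
As far as I understand, ceph-iscsi keeps its state in a RADOS object
(gateway.conf, by default in the rbd pool), so the stale entry should be
visible with a read-only dump like this:

rados -p rbd get gateway.conf /tmp/gateway.conf.json
# the "gateways" section should still list the bogus hostname
less /tmp/gateway.conf.json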
Please help, thanks.
I have an OSD that is causing slow ops and appears to be backed by a
failing drive according to smartctl output. I am using cephadm and am
wondering what the best way is to remove this drive from the cluster, and
what the proper steps are to replace the disk.
Mark the osd.35 as out.
`sudo ceph osd out osd.35`
Then mark osd.35 as down.
`sudo ceph osd down osd.35`
The OSD is marked as out, but it does come back up after a couple of
seconds. I do not know if that is a problem, or whether to just let the
drive stay online for as long as it lasts while it is being removed from
the cluster.
After the recovery completes, I would then `destroy` the osd:
`ceph osd destroy {id} --yes-i-really-mean-it`
(https://docs.ceph.com/en/latest/rados/operations/add-or-rm-osds/)
Besides checking the steps above, my question now is: if the drive is
acting very slow and causing slow ops, should I be trying to shut down its
OSD and keep it down? There is an example to stop the OSD on the server
using systemctl, outside of cephadm:
ssh {osd-host}
sudo systemctl stop ceph-osd@{osd-num}
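For what it's worth, the cephadm-flavoured sequence I have pieced together
from the docs looks roughly like this (untested on my side; hostname and
device path are placeholders):

# drain the OSD and keep its id reserved for the replacement disk
ceph orch osd rm 35 --replace
# if the drive is too unhealthy to keep serving I/O, stop the daemon and
# accept that recovery happens from the remaining replicas instead
ceph orch daemon stop osd.35
# after the physical swap, zap the new device so an osd spec can pick it up
ceph orch device zap my-host /dev/sdX --force

Does that look right, or am I missing a step?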
Thanks,
Matt
--
Matt Larson, PhD
Madison, WI 53705 U.S.A.