February 2023 - ceph-users

[Quincy] Module 'devicehealth' has failed: disk I/O error

by Satish Patel

Folks, Any idea what is going on, I am running 3 node quincy version of openstack and today suddenly i noticed the following error. I found reference link but not sure if that is my issue or not https://tracker.ceph.com/issues/51974 root@ceph1:~# ceph -s cluster: id: cd748128-a3ea-11ed-9e46-c309158fad32 health: HEALTH_ERR 1 mgr modules have recently crashed services: mon: 3 daemons, quorum ceph1,ceph2,ceph3 (age 2d) mgr: ceph1.ckfkeb(active, since 6h), standbys: ceph2.aaptny osd: 9 osds: 9 up (since 2d), 9 in (since 2d) data: pools: 4 pools, 128 pgs objects: 1.18k objects, 4.7 GiB usage: 17 GiB used, 16 TiB / 16 TiB avail pgs: 128 active+clean root@ceph1:~# ceph health HEALTH_ERR Module 'devicehealth' has failed: disk I/O error; 1 mgr modules have recently crashed root@ceph1:~# ceph crash ls ID ENTITY NEW 2023-02-07T00:07:12.739187Z_fcb9cbc9-bb55-4e7c-bf00-945b96469035 mgr.ceph1.ckfkeb * root@ceph1:~# ceph crash info 2023-02-07T00:07:12.739187Z_fcb9cbc9-bb55-4e7c-bf00-945b96469035 { "backtrace": [ " File \"/usr/share/ceph/mgr/devicehealth/module.py\", line 373, in serve\n self.scrape_all()", " File \"/usr/share/ceph/mgr/devicehealth/module.py\", line 425, in scrape_all\n self.put_device_metrics(device, data)", " File \"/usr/share/ceph/mgr/devicehealth/module.py\", line 500, in put_device_metrics\n self._create_device(devid)", " File \"/usr/share/ceph/mgr/devicehealth/module.py\", line 487, in _create_device\n cursor = self.db.execute(SQL, (devid,))", "sqlite3.OperationalError: disk I/O error" ], "ceph_version": "17.2.5", "crash_id": "2023-02-07T00:07:12.739187Z_fcb9cbc9-bb55-4e7c-bf00-945b96469035", "entity_name": "mgr.ceph1.ckfkeb", "mgr_module": "devicehealth", "mgr_module_caller": "PyModuleRunner::serve", "mgr_python_exception": "OperationalError", "os_id": "centos", "os_name": "CentOS Stream", "os_version": "8", "os_version_id": "8", "process_name": "ceph-mgr", "stack_sig": "7e506cc2729d5a18403f0373447bb825b42aafa2405fb0e5cfffc2896b093ed8", "timestamp": "2023-02-07T00:07:12.739187Z", "utsname_hostname": "ceph1", "utsname_machine": "x86_64", "utsname_release": "5.15.0-58-generic", "utsname_sysname": "Linux", "utsname_version": "#64-Ubuntu SMP Thu Jan 5 11:43:13 UTC 2023"

1 year, 2 months

2
1
0 0

Problem with IO after renaming File System .data pool

by murilo＠evocorp.com.br

Good morning everyone. On this Thursday night we went through an accident, where they accidentally renamed the .data pool of a File System making it instantly inaccessible, when renaming it again to the correct name it was possible to mount and list the files, but could not read or write. When trying to write, the FS returned as Read Only, when trying to read it returned Operation not allowed. After a period of breaking my head I tried to mount with the ADMIN user and everything worked correctly. I tried to remove the authentication of the current user through `ceph auth rm`, I created a new user through `ceph fs authorize <fs_name> client.<user> / rw` and it continued the same way, I also tried to recreate it through `ceph auth get-or-create` and nothing different happened, it stayed exactly the same. After setting `allow *` in mon, mds and osd I was able to mount, read and write again with the new user. I can understand why the File System stopped after renaming the pool, what I don't understand is why users are unable to perform operations on FS even with RW or any other user created. What could have happened behind the scenes to not be able to perform IO even with the correct permissions? Or did I apply incorrect permissions that caused this problem? Right now everything is working, I would really like to understand what happened, because I didn't find anything documented about this type of incident.

1 year, 2 months

3
2
0 0

17.2.5 ceph fs status: AssertionError

by Robert Sander

Hi, I have a healthy (test) cluster running 17.2.5: root@cephtest20:~# ceph status cluster: id: ba37db20-2b13-11eb-b8a9-871ba11409f6 health: HEALTH_OK services: mon: 3 daemons, quorum cephtest31,cephtest41,cephtest21 (age 2d) mgr: cephtest22.lqzdnk(active, since 4d), standbys: cephtest32.ybltym, cephtest42.hnnfaf mds: 1/1 daemons up, 1 standby, 1 hot standby osd: 48 osds: 48 up (since 4d), 48 in (since 4M) rgw: 2 daemons active (2 hosts, 1 zones) tcmu-runner: 6 portals active (3 hosts) data: volumes: 1/1 healthy pools: 17 pools, 513 pgs objects: 28.25k objects, 4.7 GiB usage: 26 GiB used, 4.7 TiB / 4.7 TiB avail pgs: 513 active+clean io: client: 4.3 KiB/s rd, 170 B/s wr, 5 op/s rd, 0 op/s wr CephFS is mounted and can be used without any issue. But I get an error when I when querying its status: root@cephtest20:~# ceph fs status Error EINVAL: Traceback (most recent call last): File "/usr/share/ceph/mgr/mgr_module.py", line 1757, in _handle_command return CLICommand.COMMANDS[cmd['prefix']].call(self, cmd, inbuf) File "/usr/share/ceph/mgr/mgr_module.py", line 462, in call return self.func(mgr, **kwargs) File "/usr/share/ceph/mgr/status/module.py", line 159, in handle_fs_status assert metadata AssertionError The dashboard's filesystem page shows no error and displays all information about cephfs. Where does this AssertionError come from? Regards -- Robert Sander Heinlein Support GmbH Linux: Akademie - Support - Hosting http://www.heinlein-support.de Tel: 030-405051-43 Fax: 030-405051-19 Zwangsangaben lt. §35a GmbHG: HRB 93818 B / Amtsgericht Berlin-Charlottenburg, Geschäftsführer: Peer Heinlein -- Sitz: Berlin

1 year, 2 months

3
3
0 0

Ceph Leadership Team Meeting, Feb 22 2023 Minutes

by Ernesto Puerta

Hi Cephers, These are the minutes of today's meeting (quicker than usual since some CLT members were at Ceph Days NYC): - *[Yuri] Upcoming Releases:* - Pending PRs for Quincy - Sepia Lab still absorbing the PR queue after the past issues - [Ernesto] Github started sending dependabot alerts to devels (previously it was only sent to org admins) - https://github.blog/2023-01-17-dependabot-alerts-are-now-visible-to-more-de… - Most don't necessarily involve a risk (e.g.: Javascript dependency only exploitable in a back-end/node.js server)... - ... but it might still cause some unnecessary concern among devs/users regarding Ceph security status - Current list of vulnerable dependencies: https://github.com/ceph/ceph/security/dependabot - 40% are Dashboard Javascript ones (most could be dismissed since only impact when used on node.js apps) - Remaining ones are: - Python: requirements.txt (not relevant since Python package versions change with every distro and we assume distro-maintainers will fix those) - It might become more relevant when we start packaging Python deps ( https://github.com/ceph/ceph/pull/47501/) - Golang: "/examples/rgw" path (Casey opened https://tracker.ceph.com/issues/58828, but maybe we should just dismiss the alert?) - [Ernesto] Enabling Github Auto-merge feature in the Ceph repo - https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/i… - Use case: - There's a PR with approvals but flaky CI tests (API, make check, ...) (example: https://github.com/ceph/ceph/pull/50201) - We could retrigger tests and come back to the PR page multiple times until all tests pass... - ... Or we just click the "Auto-merge" button, fill out the merge message as usual, and let Github merge it when the CI tests pass. - It'd reduce cognitive load, especially with small PRs (docs, backport PRs) where the overhead of the PR process is more noticeable. - There's still one issue: - Keeping Redmine in sync with Github - It could be done: when clicking the Auto-merge or still requiring reviewers to poll the PR until passed and then updating Redmine (not ideal) - A Github action that update a tracker when Github merges the PR would be very useful - Yuri/Ilya: discussion around backport requirement reverse order (needs-qa label vs. approvals vs. CI tests passing). - Greg pointed out the risks of auto-merge merging PRs with patches submitted after passing requirements or approvals. Auto-merge status should be reset on new commits. - Decision: not to enable it. - Yuri suggested auto-labeling PRs with passing CI, so they better know when to start QA testing. - Separate discussion on CI flakiness & stability and lack of clear points of contact (Kefu and David did that). For unit tests it's clear that affected teams should do that, but for infrastructure issues there's still a vacuum. Kind Regards, Ernesto

1 year, 2 months

1
0
0 0

Strange behavior when using storage classes

by Michal Strnad

Hi all, we encountered some strange behavior when using storage classes for S3 protocol. Some objects end up in a different pool than we would expect. Below is a list of commands used for create an account with replicated storage class, upload some files to the bucket and checked that they were uploaded to the correct location. Most of the files were uploaded to the correct pool, but some were uploaded to the erasure-code pool. Specifically, certain multipart objects (but they have zero size) and stub. The multipart objects probably wouldn't mind so much, but stub is important because we tried to delete it and without it the operations will not go through. Are there any errors in our setting (see zonegroup and zone setttings below)? Can someone please send us a setup that is functional so we can build it accordingly? We have a Pacific version. === Create the account === radosgw-admin user create --tenant vo_du_test --uid s3_test_tomash_ec_replicated --display-name s3_test_tomash_ec_replicated --storage-class "" --placement-id "EC_replicated" --tags "" radosgw-admin user info --tenant vo_du_test --uid s3_test_tomash_ec_replicated | jq '.default_placement,.default_storage_class,.placement_tags' "EC_replicated" "" [] === Create the bucket with replicated storage class === s3cmd mb s3://b7-user-ec-stcl-replicated --storage-class=replicated Bucket 's3://b7-user-ec-stcl-replicated/' created s3cmd -c ~/.s3cfg/.s3cfg_clx_vo_du_test_s3_test_tomash_ec_replicated put 20MiB.dat s3://b7-user-ec-stcl-replicated/20MiB_b7_repl.dat --storage-class=replicated upload: '20MiB.dat' -> 's3://b7-user-ec-stcl-replicated/20MiB_b7_repl.dat' [part 1 of 2, 15MB] [1 of 1] 15728640 of 15728640 100% in 0s 20.86 MB/s done upload: '20MiB.dat' -> 's3://b7-user-ec-stcl-replicated/20MiB_b7_repl.dat' [part 2 of 2, 5MB] [1 of 1] 5242880 of 5242880 100% in 0s 18.34 MB/s done s3cmd -c ~/.s3cfg/.s3cfg_clx_vo_du_test_s3_test_tomash_ec_replicated info s3://b7-user-ec-stcl-replicated/20MiB_b7_repl.dat s3://b7-user-ec-stcl-replicated/20MiB_b7_repl.dat (object): File size: 20971520 Storage: replicated ... ACL: s3_test_tomash_ec_replicated: FULL_CONTROL ... === Checking where objects are stored === rados -p storage-clx.rgw.replicated.data ls | grep _b7_repl 68618cb9-d006-4b86-b8f0-3722d952fd33.69137019.1__shadow_20MiB_b7_repl.dat.2~AooZ4xMGNV7oOzIQbNdo9fV5qw3vEr4.1_1 68618cb9-d006-4b86-b8f0-3722d952fd33.69137019.1__multipart_20MiB_b7_repl.dat.2~AooZ4xMGNV7oOzIQbNdo9fV5qw3vEr4.2 68618cb9-d006-4b86-b8f0-3722d952fd33.69137019.1__shadow_20MiB_b7_repl.dat.2~AooZ4xMGNV7oOzIQbNdo9fV5qw3vEr4.1_3 68618cb9-d006-4b86-b8f0-3722d952fd33.69137019.1__shadow_20MiB_b7_repl.dat.2~AooZ4xMGNV7oOzIQbNdo9fV5qw3vEr4.1_2 68618cb9-d006-4b86-b8f0-3722d952fd33.69137019.1__shadow_20MiB_b7_repl.dat.2~AooZ4xMGNV7oOzIQbNdo9fV5qw3vEr4.2_1 68618cb9-d006-4b86-b8f0-3722d952fd33.69137019.1__multipart_20MiB_b7_repl.dat.2~AooZ4xMGNV7oOzIQbNdo9fV5qw3vEr4.1 ceph_mon:# rados -p storage-clx.rgw.EC.data ls | grep _b7_repl 68618cb9-d006-4b86-b8f0-3722d952fd33.69137019.1__multipart_20MiB_b7_repl.dat.2~AooZ4xMGNV7oOzIQbNdo9fV5qw3vEr4.2 68618cb9-d006-4b86-b8f0-3722d952fd33.69137019.1__multipart_20MiB_b7_repl.dat.2~AooZ4xMGNV7oOzIQbNdo9fV5qw3vEr4.1 68618cb9-d006-4b86-b8f0-3722d952fd33.69137019.1_20MiB_b7_repl.dat ceph_mon:# rados stat -p storage-clx.rgw.EC.data 68618cb9-d006-4b86-b8f0-3722d952fd33.69137019.1_20MiB_b7_repl.dat storage-clx.rgw.EC.data/68618cb9-d006-4b86-b8f0-3722d952fd33.69137019.1_20MiB_b7_repl.dat mtime 2023-01-24T12:46:48.000000+0100, size 0 ceph_mon:# rados stat -p storage-clx.rgw.EC.data 68618cb9-d006-4b86-b8f0-3722d952fd33.69137019.1__multipart_20MiB_b7_repl.dat.2~AooZ4xMGNV7oOzIQbNdo9fV5qw3vEr4.1 storage-clx.rgw.EC.data/68618cb9-d006-4b86-b8f0-3722d952fd33.69137019.1__multipart_20MiB_b7_repl.dat.2~AooZ4xMGNV7oOzIQbNdo9fV5qw3vEr4.1 mtime 2023-01-24T12:46:48.000000+0100, size 0 ceph_mon:# rados stat -p storage-clx.rgw.EC.data 68618cb9-d006-4b86-b8f0-3722d952fd33.69137019.1__multipart_20MiB_b7_repl.dat.2~AooZ4xMGNV7oOzIQbNdo9fV5qw3vEr4.2 storage-clx.rgw.EC.data/68618cb9-d006-4b86-b8f0-3722d952fd33.69137019.1__multipart_20MiB_b7_repl.dat.2~AooZ4xMGNV7oOzIQbNdo9fV5qw3vEr4.2 mtime 2023-01-24T12:46:48.000000+0100, size 0 ceph_mon:# rados stat -p storage-clx.rgw.replicated.data 68618cb9-d006-4b86-b8f0-3722d952fd33.69137019.1__multipart_20MiB_b7_repl.dat.2~AooZ4xMGNV7oOzIQbNdo9fV5qw3vEr4.1 storage-clx.rgw.replicated.data/68618cb9-d006-4b86-b8f0-3722d952fd33.69137019.1__multipart_20MiB_b7_repl.dat.2~AooZ4xMGNV7oOzIQbNdo9fV5qw3vEr4.1 mtime 2023-01-24T12:46:47.000000+0100, size 4194304 rados stat -p storage-clx.rgw.replicated.data 68618cb9-d006-4b86-b8f0-3722d952fd33.69137019.1__shadow_20MiB_b7_repl.dat.2~AooZ4xMGNV7oOzIQbNdo9fV5qw3vEr4.1_1 storage-clx.rgw.replicated.data/68618cb9-d006-4b86-b8f0-3722d952fd33.69137019.1__shadow_20MiB_b7_repl.dat.2~AooZ4xMGNV7oOzIQbNdo9fV5qw3vEr4.1_1 mtime 2023-01-24T12:46:48.000000+0100, size 4194304 ceph_mon:# rados stat -p storage-clx.rgw.replicated.data 68618cb9-d006-4b86-b8f0-3722d952fd33.69137019.1__shadow_20MiB_b7_repl.dat.2~AooZ4xMGNV7oOzIQbNdo9fV5qw3vEr4.1_2 storage-clx.rgw.replicated.data/68618cb9-d006-4b86-b8f0-3722d952fd33.69137019.1__shadow_20MiB_b7_repl.dat.2~AooZ4xMGNV7oOzIQbNdo9fV5qw3vEr4.1_2 mtime 2023-01-24T12:46:48.000000+0100, size 4194304 ceph_mon:# rados stat -p storage-clx.rgw.replicated.data 68618cb9-d006-4b86-b8f0-3722d952fd33.69137019.1__shadow_20MiB_b7_repl.dat.2~AooZ4xMGNV7oOzIQbNdo9fV5qw3vEr4.1_3 storage-clx.rgw.replicated.data/68618cb9-d006-4b86-b8f0-3722d952fd33.69137019.1__shadow_20MiB_b7_repl.dat.2~AooZ4xMGNV7oOzIQbNdo9fV5qw3vEr4.1_3 mtime 2023-01-24T12:46:48.000000+0100, size 3145728 ceph_mon:# rados stat -p storage-clx.rgw.replicated.data 68618cb9-d006-4b86-b8f0-3722d952fd33.69137019.1__multipart_20MiB_b7_repl.dat.2~AooZ4xMGNV7oOzIQbNdo9fV5qw3vEr4.2 storage-clx.rgw.replicated.data/68618cb9-d006-4b86-b8f0-3722d952fd33.69137019.1__multipart_20MiB_b7_repl.dat.2~AooZ4xMGNV7oOzIQbNdo9fV5qw3vEr4.2 mtime 2023-01-24T12:46:48.000000+0100, size 4194304 ceph_mon:# rados stat -p storage-clx.rgw.replicated.data 68618cb9-d006-4b86-b8f0-3722d952fd33.69137019.1__shadow_20MiB_b7_repl.dat.2~AooZ4xMGNV7oOzIQbNdo9fV5qw3vEr4.2_1 storage-clx.rgw.replicated.data/68618cb9-d006-4b86-b8f0-3722d952fd33.69137019.1__shadow_20MiB_b7_repl.dat.2~AooZ4xMGNV7oOzIQbNdo9fV5qw3vEr4.2_1 mtime 2023-01-24T12:46:48.000000+0100, size 1048576 === Zonegroup setting === radosgw-admin zonegroup get { "id": "kda7dadb-7bbc-4197-b070-147f76e74717", "name": "storage", "api_name": "storage", "is_master": "true", "endpoints": [ "https://s3.clx.domain.cz" ], "hostnames": [], "hostnames_s3website": [], "master_zone": "69922cb9-d006-4b86-b8f0-3722d952fd33", "zones": [ { "id": "24654cb9-d006-4b86-b8f0-3722d952fd33", "name": "storage-clx", "endpoints": [ "https://s3.clx.domain.cz" ], "log_meta": "false", "log_data": "false", "bucket_index_max_shards": 11, "read_only": "false", "tier_type": "", "sync_from_all": "true", "sync_from": [], "redirect_zone": "" } ], "placement_targets": [ { "name": "EC_replicated", "tags": [], "storage_classes": [ "EC", "STANDARD", "replicated" ] }, { "name": "default-placement", "tags": [], "storage_classes": [ "STANDARD" ] }, { "name": "replicated", "tags": [], "storage_classes": [ "STANDARD" ] } ], "default_placement": "default-placement", "realm_id": "87a9217-7e8d-4a1b-b508-eb634cc55647", "sync_policy": { "groups": [] } } === Zone setting === radosgw-admin zone get { "id": "24654cb9-d006-4b86-b8f0-3722d952fd33", "name": "storage-clx", "domain_root": "storage-clx.rgw.meta:root", "control_pool": "storage-clx.rgw.control", "gc_pool": "storage-clx.rgw.log:gc", "lc_pool": "storage-clx.rgw.log:lc", "log_pool": "storage-clx.rgw.log", "intent_log_pool": "storage-clx.rgw.log:intent", "usage_log_pool": "storage-clx.rgw.log:usage", "roles_pool": "storage-clx.rgw.meta:roles", "reshard_pool": "storage-clx.rgw.log:reshard", "user_keys_pool": "storage-clx.rgw.meta:users.keys", "user_email_pool": "storage-clx.rgw.meta:users.email", "user_swift_pool": "storage-clx.rgw.meta:users.swift", "user_uid_pool": "storage-clx.rgw.meta:users.uid", "otp_pool": "storage-clx.rgw.otp", "system_key": { "access_key": "", "secret_key": "" }, "placement_pools": [ { "key": "EC_replicated", "val": { "index_pool": "storage-clx.rgw.EC.index", "storage_classes": { "STANDARD": { "data_pool": "storage-clx.rgw.EC.data" }, "replicated": { "data_pool": "storage-clx.rgw.replicated.data" } }, "data_extra_pool": "storage-clx.rgw.EC.non-ec", "index_type": 0 } }, { "key": "default-placement", "val": { "index_pool": "storage-clx.rgw.EC.index", "storage_classes": { "STANDARD": { "data_pool": "storage-clx.rgw.EC.data" } }, "data_extra_pool": "storage-clx.rgw.EC.non-ec", "index_type": 0 } }, { "key": "replicated", "val": { "index_pool": "storage-clx.rgw.replicated.index", "storage_classes": { "STANDARD": { "data_pool": "storage-clx.rgw.replicated.data" } }, "data_extra_pool": "", "index_type": 0 } } ], "realm_id": "87a9217-7e8d-4a1b-b508-eb634cc55647", "notif_pool": "storage-clx.rgw.log:notif" } Thank you Regards, Michal Strnad

1 year, 2 months

1
0
0 0

Re: Stuck OSD service specification - can't remove

by Eugen Block

Hi, did you ever resolve that? I'm stuck with the same "deleting" service in 'ceph orch ls' and found your thread. Thanks, Eugen

1 year, 2 months

1
0
0 0

Missing keyrings on upgraded cluster

by Eugen Block

Hi *, I was playing around on an upgraded test cluster (from N to Q), current version: "overall": { "ceph version 17.2.5 (98318ae89f1a893a6ded3a640405cdbb33e08757) quincy (stable)": 18 } I tried to replace an OSD after destroying it with 'ceph orch osd rm osd.5 --replace'. The OSD was drained successfully and marked as "destroyed" as expected, the zapping also worked. At this point I didn't have an osd spec in place because all OSDs were adopted during the upgrade process. So I created a new spec which was not applied successfully (I'm wondering if there's another/new issue with ceph-volume, but that's not the focus here), so I tried it manually with 'cephadm ceph-volume lvm create'. I'll add the output at the end for a better readability. Apparently, there's no boostrap-osd keyring for cephadm so it can't search the desired osd_id in the osd tree, the command it tries is this: ceph --cluster ceph --name client.bootstrap-osd --keyring /var/lib/ceph/bootstrap-osd/ceph.keyring osd tree -f json In the local filesystem the required keyring is present, though: nautilus:~ # cat /var/lib/ceph/bootstrap-osd/ceph.keyring [client.bootstrap-osd] key = AQBOCbpgixIsOBAAgBzShsFg/l1bOze4eTZHug== caps mgr = "allow r" caps mon = "profile bootstrap-osd" Is there something missing during the adoption process? Or are the docs lacking some upgrade info? I found a section about putting keyrings under management [1], but I'm not sure if that's what's missing here. Any insights are highly appreciated! Thanks, Eugen [1] https://docs.ceph.com/en/quincy/cephadm/operations/#putting-a-keyring-under… ---snip--- nautilus:~ # cephadm ceph-volume lvm create --osd-id 5 --data /dev/sde --block.db /dev/sdb --block.db-size 5G Inferring fsid <FSID> Using recent ceph image <LOCAL_REGISTRY>/ceph/ceph@sha256:af50ec26db7ee177e1ec1b553a0d6a9dbad2c3cc0da2f8f46d012184a79d4f92 Non-zero exit code 1 from /usr/bin/podman run --rm --ipc=host --stop-signal=SIGTERM --authfile=/etc/ceph/podman-auth.json --net=host --entrypoint /usr/sbin/ceph-volume --privileged --group-add=disk --init -e CONTAINER_IMAGE=<LOCAL_REGISTRY>/ceph/ceph@sha256:af50ec26db7ee177e1ec1b553a0d6a9dbad2c3cc0da2f8f46d012184a79d4f92 -e NODE_NAME=nautilus -e CEPH_USE_RANDOM_NONCE=1 -e CEPH_VOLUME_SKIP_RESTORECON=yes -e CEPH_VOLUME_DEBUG=1 -v /var/run/ceph/<FSID>:/var/run/ceph:z -v /var/log/ceph/<FSID>:/var/log/ceph:z -v /var/lib/ceph/<FSID>/crash:/var/lib/ceph/crash:z -v /dev:/dev -v /run/udev:/run/udev -v /sys:/sys -v /run/lvm:/run/lvm -v /run/lock/lvm:/run/lock/lvm -v /:/rootfs -v /tmp/ceph-tmpuydvbhuk:/etc/ceph/ceph.conf:z <LOCAL_REGISTRY>/ceph/ceph@sha256:af50ec26db7ee177e1ec1b553a0d6a9dbad2c3cc0da2f8f46d012184a79d4f92 lvm create --osd-id 5 --data /dev/sde --block.db /dev/sdb --block.db-size 5G /usr/bin/podman: stderr time="2023-02-20T09:02:49+01:00" level=warning msg="Path \"/etc/SUSEConnect\" from \"/etc/containers/mounts.conf\" doesn't exist, skipping" /usr/bin/podman: stderr time="2023-02-20T09:02:49+01:00" level=warning msg="Path \"/etc/zypp/credentials.d/SCCcredentials\" from \"/etc/containers/mounts.conf\" doesn't exist, skipping" /usr/bin/podman: stderr Running command: /usr/bin/ceph-authtool --gen-print-key /usr/bin/podman: stderr Running command: /usr/bin/ceph --cluster ceph --name client.bootstrap-osd --keyring /var/lib/ceph/bootstrap-osd/ceph.keyring osd tree -f json /usr/bin/podman: stderr stderr: 2023-02-20T08:02:50.848+0000 7fd255e30700 -1 auth: unable to find a keyring on /etc/ceph/ceph.client.bootstrap-osd.keyring,/etc/ceph/ceph.keyring,/etc/ceph/keyring,/etc/ceph/keyring.bin: (2) No such file or directory /usr/bin/podman: stderr stderr: 2023-02-20T08:02:50.848+0000 7fd255e30700 -1 AuthRegistry(0x7fd250060d50) no keyring found at /etc/ceph/ceph.client.bootstrap-osd.keyring,/etc/ceph/ceph.keyring,/etc/ceph/keyring,/etc/ceph/keyring.bin, disabling cephx /usr/bin/podman: stderr stderr: 2023-02-20T08:02:50.852+0000 7fd255e30700 -1 auth: unable to find a keyring on /var/lib/ceph/bootstrap-osd/ceph.keyring: (2) No such file or directory /usr/bin/podman: stderr stderr: 2023-02-20T08:02:50.852+0000 7fd255e30700 -1 AuthRegistry(0x7fd250060d50) no keyring found at /var/lib/ceph/bootstrap-osd/ceph.keyring, disabling cephx /usr/bin/podman: stderr stderr: 2023-02-20T08:02:50.856+0000 7fd255e30700 -1 auth: unable to find a keyring on /var/lib/ceph/bootstrap-osd/ceph.keyring: (2) No such file or directory /usr/bin/podman: stderr stderr: 2023-02-20T08:02:50.856+0000 7fd255e30700 -1 AuthRegistry(0x7fd250065910) no keyring found at /var/lib/ceph/bootstrap-osd/ceph.keyring, disabling cephx /usr/bin/podman: stderr stderr: 2023-02-20T08:02:50.856+0000 7fd255e30700 -1 auth: unable to find a keyring on /var/lib/ceph/bootstrap-osd/ceph.keyring: (2) No such file or directory /usr/bin/podman: stderr stderr: 2023-02-20T08:02:50.856+0000 7fd255e30700 -1 AuthRegistry(0x7fd255e2eea0) no keyring found at /var/lib/ceph/bootstrap-osd/ceph.keyring, disabling cephx /usr/bin/podman: stderr stderr: [errno 2] RADOS object not found (error connecting to the cluster) /usr/bin/podman: stderr Traceback (most recent call last): /usr/bin/podman: stderr File "/usr/sbin/ceph-volume", line 11, in <module> /usr/bin/podman: stderr load_entry_point('ceph-volume==1.0.0', 'console_scripts', 'ceph-volume')() /usr/bin/podman: stderr File "/usr/lib/python3.6/site-packages/ceph_volume/main.py", line 41, in __init__ /usr/bin/podman: stderr self.main(self.argv) /usr/bin/podman: stderr File "/usr/lib/python3.6/site-packages/ceph_volume/decorators.py", line 59, in newfunc /usr/bin/podman: stderr return f(*a, **kw) /usr/bin/podman: stderr File "/usr/lib/python3.6/site-packages/ceph_volume/main.py", line 153, in main /usr/bin/podman: stderr terminal.dispatch(self.mapper, subcommand_args) /usr/bin/podman: stderr File "/usr/lib/python3.6/site-packages/ceph_volume/terminal.py", line 194, in dispatch /usr/bin/podman: stderr instance.main() /usr/bin/podman: stderr File "/usr/lib/python3.6/site-packages/ceph_volume/devices/lvm/main.py", line 46, in main /usr/bin/podman: stderr terminal.dispatch(self.mapper, self.argv) /usr/bin/podman: stderr File "/usr/lib/python3.6/site-packages/ceph_volume/terminal.py", line 194, in dispatch /usr/bin/podman: stderr instance.main() /usr/bin/podman: stderr File "/usr/lib/python3.6/site-packages/ceph_volume/devices/lvm/create.py", line 77, in main /usr/bin/podman: stderr self.create(args) /usr/bin/podman: stderr File "/usr/lib/python3.6/site-packages/ceph_volume/decorators.py", line 16, in is_root /usr/bin/podman: stderr return func(*a, **kw) /usr/bin/podman: stderr File "/usr/lib/python3.6/site-packages/ceph_volume/devices/lvm/create.py", line 26, in create /usr/bin/podman: stderr prepare_step.safe_prepare(args) /usr/bin/podman: stderr File "/usr/lib/python3.6/site-packages/ceph_volume/devices/lvm/prepare.py", line 252, in safe_prepare /usr/bin/podman: stderr self.prepare() /usr/bin/podman: stderr File "/usr/lib/python3.6/site-packages/ceph_volume/decorators.py", line 16, in is_root /usr/bin/podman: stderr return func(*a, **kw) /usr/bin/podman: stderr File "/usr/lib/python3.6/site-packages/ceph_volume/devices/lvm/prepare.py", line 292, in prepare /usr/bin/podman: stderr self.osd_id = prepare_utils.create_id(osd_fsid, json.dumps(secrets), osd_id=self.args.osd_id) /usr/bin/podman: stderr File "/usr/lib/python3.6/site-packages/ceph_volume/util/prepare.py", line 166, in create_id /usr/bin/podman: stderr if osd_id_available(osd_id): /usr/bin/podman: stderr File "/usr/lib/python3.6/site-packages/ceph_volume/util/prepare.py", line 204, in osd_id_available /usr/bin/podman: stderr raise RuntimeError('Unable check if OSD id exists: %s' % osd_id) /usr/bin/podman: stderr RuntimeError: Unable check if OSD id exists: 5 Traceback (most recent call last): File "/usr/sbin/cephadm", line 9170, in <module> main() File "/usr/sbin/cephadm", line 9158, in main r = ctx.func(ctx) File "/usr/sbin/cephadm", line 1917, in _infer_config return func(ctx) File "/usr/sbin/cephadm", line 1877, in _infer_fsid return func(ctx) File "/usr/sbin/cephadm", line 1945, in _infer_image return func(ctx) File "/usr/sbin/cephadm", line 1835, in _validate_fsid return func(ctx) File "/usr/sbin/cephadm", line 5294, in command_ceph_volume out, err, code = call_throws(ctx, c.run_cmd()) File "/usr/sbin/cephadm", line 1637, in call_throws raise RuntimeError('Failed command: %s' % ' '.join(command)) RuntimeError: Failed command: /usr/bin/podman run --rm --ipc=host --stop-signal=SIGTERM --authfile=/etc/ceph/podman-auth.json --net=host --entrypoint /usr/sbin/ceph-volume --privileged --group-add=disk --init -e CONTAINER_IMAGE=<LOCAL_REGISTRY>/ceph/ceph@sha256:af50ec26db7ee177e1ec1b553a0d6a9dbad2c3cc0da2f8f46d012184a79d4f92 -e NODE_NAME=nautilus -e CEPH_USE_RANDOM_NONCE=1 -e CEPH_VOLUME_SKIP_RESTORECON=yes -e CEPH_VOLUME_DEBUG=1 -v /var/run/ceph/<FSID>:/var/run/ceph:z -v /var/log/ceph/<FSID>:/var/log/ceph:z -v /var/lib/ceph/<FSID>/crash:/var/lib/ceph/crash:z -v /dev:/dev -v /run/udev:/run/udev -v /sys:/sys -v /run/lvm:/run/lvm -v /run/lock/lvm:/run/lock/lvm -v /:/rootfs -v /tmp/ceph-tmpuydvbhuk:/etc/ceph/ceph.conf:z <LOCAL_REGISTRY>/ceph/ceph@sha256:af50ec26db7ee177e1ec1b553a0d6a9dbad2c3cc0da2f8f46d012184a79d4f92 lvm create --osd-id 5 --data /dev/sde --block.db /dev/sdb --block.db-size 5G ---snip---

1 year, 2 months

2
7
0 0

Very slow snaptrim operations blocking client I/O

by Victor Rodriguez

Hello, Asking for help with an issue. Maybe someone has a clue about what's going on. Using ceph 15.2.17 on Proxmox 7.3. A big VM had a snapshot and I removed it. A bit later, nearly half of the PGs of the pool entered snaptrim and snaptrim_wait state, as expected. The problem is that such operations ran extremely slow and client I/O was nearly nothing, so all VMs in the cluster got stuck as they could not I/O to the storage. Taking and removing big snapshots is a normal operation that we do often and this is the first time I see this issue in any of my clusters. Disks are all Samsung PM1733 and network is 25G. It gives us plenty of performance for the use case and never had an issue with the hardware. Both disk I/O and network I/O was very low. Still, client I/O seemed to get queued forever. Disabling snaptrim (ceph osd set nosnaptrim) stops any active snaptrim operation and client I/O resumes back to normal. Enabling snaptrim again makes client I/O to almost halt again. I've been playing with some settings: ceph tell 'osd.*' injectargs '--osd-max-trimming-pgs 1' ceph tell 'osd.*' injectargs '--osd-snap-trim-sleep 30' ceph tell 'osd.*' injectargs '--osd-snap-trim-sleep-ssd 30' ceph tell 'osd.*' injectargs '--osd-pg-max-concurrent-snap-trims 1' None really seemed to help. Also tried restarting OSD services. This cluster was upgraded from 14.2.x to 15.2.17 a couple of months. Is there any setting that must be changed which may cause this problem? I have scheduled a maintenance window, what should I look for to diagnose this problem? Any help is very appreciated. Thanks in advance. Victor

1 year, 2 months

8
18
0 0

ceph-iscsi-cli: cannot remove duplicated gateways.

by luckydog xf

Hi, please see the output below. ceph-iscsi-gw-1.ipa.pthl.hklocalhost.localdomain is the one who is being messed up with a wrong hostname. I want to delete it. /iscsi-target...-igw/gateways> ls o- gateways .................................................................................................. [Up: 2/3, Portals: 3] o- ceph-iscsi-gw-1.ipa.pthl.hk ............................................................................. [172.16.202.251 (UP)] o- ceph-iscsi-gw-1.ipa.pthl.hklocalhost.localdomain .............................................. [172.16.202.251 (UNAUTHORIZED)] o- ceph-iscsi-gw-2.ipa.pthl.hk ............................................................................. [172.16.202.252 (UP)] /iscsi-target...-igw/gateways> delete gateway_name=ceph-iscsi-gw-1.ipa.pthl.hklocalhost.localdomain confirm=true Deleting gateway, ceph-iscsi-gw-1.ipa.pthl.hklocalhost.localdomain Could not contact ceph-iscsi-gw-1.ipa.pthl.hklocalhost.localdomain. If the gateway is permanently down. Use confirm=true to force removal. WARNING: Forcing removal of a gateway that can still be reached by an initiator may result in data corruption. /iscsi-target...-igw/gateways> /iscsi-target...-igw/gateways> delete gateway_name=ceph-iscsi-gw-1.ipa.pthl.hklocalhost.localdomain confirm=true Deleting gateway, ceph-iscsi-gw-1.ipa.pthl.hklocalhost.localdomain Failed : Unhandled exception: list.remove(x): x not in list However ceph-iscsi-gw-1.ipa.pthl.hklocalhost.localdomain is still there. Version info is ceph-iscsi-3.5-1.el8cp.noarch on RHEL 8.4. /iscsi-target...-igw/gateways> ls o- gateways .................................................................................................. [Up: 2/3, Portals: 3] o- ceph-iscsi-gw-1.ipa.pthl.hk ............................................................................. [172.16.202.251 (UP)] o- ceph-iscsi-gw-1.ipa.pthl.hklocalhost.localdomain ................................................... [172.16.202.251 (UNKNOWN)] o- ceph-iscsi-gw-2.ipa.pthl.hk ............................................................................. [172.16.202.252 (UP)] /iscsi-target...-igw/gateways> delete ceph-iscsi-gw-1.ipa.pthl.hklocalhost.localdomain confirm=true Deleting gateway, ceph-iscsi-gw-1.ipa.pthl.hklocalhost.localdomain Failed : Unhandled exception: list.remove(x): x not in list However ceph-iscsi-gw-1.ipa.pthl.hklocalhost.localdomain is still there. Please help, thanks.

1 year, 2 months

2
8
0 0

Removing failing OSD with cephadm?

by Matt Larson

I have an OSD that is causing slow ops, and appears to be backed by a failing drive according to smartctl outputs. I am using cephadm, and wondering what is the best way to remove this drive from the cluster and proper steps to replace the disk? Mark the osd.35 as out. `sudo ceph osd out osd.35` Then mark osd.35 as down. `sudo ceph osd down osd.35` The OSD is marked as out, but it does come back up after a couple of seconds. I do not know if that is a problem or to just let the drive stay online as long as it lasts during the removal from the cluster. After the recovery completes, I would then `destroy` the osd: `ceph osd destroy {id} --yes-i-really-mean-it` (https://docs.ceph.com/en/latest/rados/operations/add-or-rm-osds/) Besides checking steps above, my question now is ..* If the drive is acting very slow and causing slow ops, should I be trying to shut down its OSD and keep it down? There is an example to stop the OSD on the server using systemctl, outside of cephadm:* ssh {osd-host}sudo systemctl stop ceph-osd@{osd-num} Thanks, Matt -- Matt Larson, PhD Madison, WI 53705 U.S.A.

1 year, 2 months

2
1
0 0

2024

2023

2022

2021

2020

2019

ceph-users February 2023