This is the third and possibly last release candidate for Reef.
The Reef release comes with a new RocksDB version (7.9.2) [0], which
incorporates several performance improvements and features. Our
internal testing doesn't show any side effects from the new version,
but we are very eager to hear community feedback on it. This is the
first release with the ability to tune RocksDB settings per column
family [1], which allows more granular tuning to be applied to the
different kinds of data stored in RocksDB. Reef uses a new set of
settings that optimizes performance for most kinds of workloads, with
a slight penalty in some cases that is outweighed by large
improvements in compactions and write amplification for use cases
such as RGW. We highly encourage community members to give these a
try against their performance benchmarks and use cases. The detailed
list of RocksDB and BlueStore changes can be found at
https://pad.ceph.com/p/reef-rc-relnotes.
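As a starting point for such experiments, here is a minimal sketch
(assuming the bluestore_rocksdb_cfs option, which controls BlueStore's
RocksDB column-family sharding and per-column-family options; the
override value is shown only as a placeholder, not a recommendation):

# show the current column-family sharding/option string used by BlueStore's RocksDB
ceph config get osd bluestore_rocksdb_cfs
# override it with a custom string (placeholder shown); existing OSDs additionally
# need their DB resharded, e.g. with ceph-bluestore-tool's reshard command, to pick it up
ceph config set osd bluestore_rocksdb_cfs '<column-family sharding/option string>'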
If any of our community members would like to help us with performance
investigations or regression testing of the Reef release candidate,
please feel free to provide feedback via email or in
https://pad.ceph.com/p/reef_scale_testing. For more active
discussions, please use the #ceph-at-scale Slack channel at
ceph-storage.slack.com.
This RC has gone through partial testing due to issues we are
experiencing in the sepia lab.
Please try it out and report any issues you encounter. Happy testing!
Thanks,
YuriW
Get the release from
* Git at git://github.com/ceph/ceph.git
* Tarball at https://download.ceph.com/tarballs/ceph-18.1.3.tar.gz
* Containers at https://quay.io/repository/ceph/ceph
* For packages, see https://docs.ceph.com/en/latest/install/get-packages/
* Release git sha1: f594a0802c34733bb06e5993bc4bdb085c9a5f3f
Hi *,
I'm investigating an interesting issue on two customer clusters (used
for mirroring) that I've not solved yet, but today we finally made some
progress. Maybe someone has an idea where to look next; I'd appreciate
any hints or comments.
These are two (latest) Octopus clusters; their main usage currently is
RBD mirroring in snapshot mode (around 500 RBD images are synced every
30 minutes). The customer noticed very long startup times of the MON
daemons after a reboot, between 10 and 30 minutes (reboot time already
subtracted). These delays are present on both sites. Today we got a
maintenance window and started to check in more detail: just
restarting the MON service (it joins quorum within seconds), then
stopping the MON service and waiting a few minutes (it still joins
quorum within seconds). Then we stopped the service and waited for
more than 5 minutes, simulating a reboot, and were able to reproduce
the issue: the sync then takes around 15 minutes. We verified this
with other MONs as well. The MON store is around 2 GB in size (on
HDD); I understand that the sync itself can take some time, but what
is the threshold here? I tried to find a hint in the MON config by
searching for timeouts of 300 seconds; there were only a few matches
(mon_session_timeout is one of them), but I'm not sure they can
explain this behavior.
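For reference, this is roughly how I searched (a sketch; the MON name
is just an example and the command has to run on the MON host):

# list options whose current value is 300 on a running MON
ceph daemon mon.ceph01 config show | grep -w 300
# then check what a candidate option actually does
ceph config help mon_session_timeout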
Investigating the MON store (ceph-monstore-tool dump-keys), I noticed
more than 42 million osd_snap keys, which is quite a lot and would
explain the size of the MON store. But I'm not sure whether that is
related to the long syncing process either.
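In case it helps, this is roughly how I counted the keys (a sketch;
the store path is an example, and the MON was stopped while its store
was inspected):

ceph-monstore-tool /var/lib/ceph/mon/ceph-mon01 dump-keys > keys.txt
# count keys per prefix; osd_snap dominates by far
awk '{print $1}' keys.txt | sort | uniq -c | sort -rn | head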
Does that sound familiar to anyone?
Thanks,
Eugen
Hi,
I have a Ceph cluster running v16.2.13. I'm not sure why this happens or how to clean it up:
[2023-07-12 21:23:13 +07] 299B STANDARD null v18 PUT index.txt
[2023-07-12 21:27:54 +07] 299B STANDARD null v17 PUT index.txt
[2023-07-12 21:48:01 +07] 299B STANDARD null v16 PUT index.txt
[2023-07-12 21:42:24 +07] 299B STANDARD null v15 PUT index.txt
[2023-07-12 21:03:42 +07] 299B STANDARD null v14 PUT index.txt
[2023-07-12 21:16:25 +07] 299B STANDARD null v13 PUT index.txt
[2023-07-12 21:09:27 +07] 299B STANDARD null v12 PUT index.txt
[2023-07-12 22:01:28 +07] 299B STANDARD null v11 PUT index.txt
[2023-07-25 08:33:03 +07] 0B null v10 DEL index.txt
[2023-07-12 21:31:26 +07] 299B STANDARD null v9 PUT index.txt
[2023-07-12 21:08:35 +07] 299B STANDARD null v8 PUT index.txt
[2023-07-12 21:19:28 +07] 299B STANDARD null v7 PUT index.txt
[2023-07-12 21:11:53 +07] 299B STANDARD null v6 PUT index.txt
[2023-07-12 23:13:52 +07] 299B STANDARD null v5 PUT index.txt
[2023-07-12 22:00:38 +07] 299B STANDARD null v4 PUT index.txt
[2023-07-12 23:12:09 +07] 299B STANDARD null v3 PUT index.txt
[2023-07-12 23:20:50 +07] 299B STANDARD null v2 PUT index.txt
[2023-07-12 21:42:00 +07] 299B STANDARD null v1 PUT index.txt
I tried to delete the object, but that only creates a delete marker. I can't even specify the version ID because all of them are null.
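For what it's worth, "null" is the literal version ID that S3 uses for objects written while versioning was off, so normally a delete like this should address it (a sketch with the AWS CLI; endpoint, bucket and key are placeholders), but in my case all 18 versions report the same null ID:

aws --endpoint-url http://rgw.example.com s3api delete-object \
    --bucket <bucket> --key <key> --version-id null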
I also tried to find that object with rados ls, and it returned only 1 object (where I would expect 18):
17a4ce99-009e-40f2-a2d2-2afc218ebd9b.876888518.16_airbnbnova/files/category/index.txt
Running rados rm on this object doesn't help either.
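Maybe checking and repairing the bucket index is the next step (a sketch; the bucket name is a placeholder and --fix should be used with care):

radosgw-admin bucket check --bucket=<bucket>
radosgw-admin bucket check --bucket=<bucket> --check-objects --fix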
Does anyone have any ideas? Thanks
Hi,
we noticed a strange error message in the log files:
The Alertmanager deployed with cephadm receives an HTTP 500 error from
the inactive MGR when trying to call the URI /api/prometheus_receiver:
Jul 25 09:35:25 alert-manager conmon[2426]: level=error ts=2023-07-25T07:35:25.171Z caller=dispatch.go:354 component=dispatcher msg="Notify for alerts failed" num_alerts=45 err="ceph-dashboard/webhook[0]: notify retry canceled after 7 attempts: unexpected status code 500: https://mgr001.example.net:8443/api/prometheus_receiver; ceph-dashboard/webhook[2]: notify retry canceled after 8 attempts: unexpected status code 500: https://mgr003.example.net:8443/api/prometheus_receiver"
Jul 25 09:35:25 alert-manager conmon[2426]: level=warn ts=2023-07-25T07:35:25.175Z caller=notify.go:724 component=dispatcher receiver=ceph-dashboard integration=webhook[2] msg="Notify attempt failed, will retry later" attempts=1 err="unexpected status code 500: https://mgr003.example.net:8443/api/prometheus_receiver"
Jul 25 09:35:25 alert-manager conmon[2426]: level=warn ts=2023-07-25T07:35:25.177Z caller=notify.go:724 component=dispatcher receiver=ceph-dashboard integration=webhook[0] msg="Notify attempt failed, will retry later" attempts=1 err="unexpected status code 500: https://mgr001.example.net:8443/api/prometheus_receiver"
Jul 25 09:35:35 alert-manager conmon[2426]: level=error ts=2023-07-25T07:35:35.171Z caller=dispatch.go:354 component=dispatcher msg="Notify for alerts failed" num_alerts=45 err="ceph-dashboard/webhook[2]: notify retry canceled after 7 attempts: unexpected status code 500: https://mgr003.example.net:8443/api/prometheus_receiver; ceph-dashboard/webhook[0]: notify retry canceled after 8 attempts: unexpected status code 500: https://mgr001.example.net:8443/api/prometheus_receiver"
Jul 25 09:35:35 alert-manager conmon[2426]: level=warn ts=2023-07-25T07:35:35.176Z caller=notify.go:724 component=dispatcher receiver=ceph-dashboard integration=webhook[2] msg="Notify attempt failed, will retry later" attempts=1 err="unexpected status code 500: https://mgr003.example.net:8443/api/prometheus_receiver"
Jul 25 09:35:35 alert-manager conmon[2426]: level=warn ts=2023-07-25T07:35:35.176Z caller=notify.go:724 component=dispatcher receiver=ceph-dashboard integration=webhook[0] msg="Notify attempt failed, will retry later" attempts=1 err="unexpected status code 500: https://mgr001.example.net:8443/api/prometheus_receiver"
This is from the logfile of mgr002, which was passive at first and then
became active. After it became active the errors were gone on this MGR
but showed up on the newly passive MGR.
Jul 25 09:25:25 mgr002 ceph-mgr[1841]: [dashboard INFO request] [::ffff:10.54.226.222:49904] [POST] [500] [0.002s] [513.0B] [581dce66-9c65-4e84-a41a-8d72b450791e] /api/prometheus_receiver
Jul 25 09:25:25 mgr002 ceph-mgr[1841]: [dashboard ERROR request] [::ffff:10.54.226.222:49904] [POST] [500] [0.001s] [513.0B] [26e1854a-3b93-49c4-8afc-1a96426a3dab] /api/prometheus_receiver
Jul 25 09:25:25 mgr002 ceph-mgr[1841]: [dashboard ERROR request] [b'{"status": "500 Internal Server Error", "detail": "The server encountered an unexpected condition which prevented it from fulfilling the request.", "request _id": "26e1854a-3b93-49c4-8afc-1a96426a3dab"}
']
Jul 25 09:25:25 mgr002 ceph-mgr[1841]: [dashboard INFO request] [::ffff:10.54.226.222:49904] [POST] [500] [0.002s] [513.0B] [26e1854a-3b93-49c4-8afc-1a96426a3dab] /api/prometheus_receiver
Jul 25 09:25:26 mgr002 ceph-mgr[1841]: [dashboard ERROR request] [::ffff:10.54.226.222:49904] [POST] [500] [0.001s] [513.0B] [46d7e78c-49d5-4652-9877-973129ad3977] /api/prometheus_receiver
Jul 25 09:25:26 mgr002 ceph-mgr[1841]: [dashboard ERROR request] [b'{"status": "500 Internal Server Error", "detail": "The server encountered an unexpected condition which prevented it from fulfilling the request.", "request_id": "46d7e78c-49d5-4652-9877-973129ad3977"}']
Jul 25 09:25:26 mgr002 ceph-mgr[1841]: [dashboard INFO request] [::ffff:10.54.226.222:49904] [POST] [500] [0.002s] [513.0B] [46d7e78c-49d5-4652-9877-973129ad3977] /api/prometheus_receiver
Jul 25 09:25:27 mgr002 ceph-mgr[1841]: [dashboard ERROR request] [::ffff:10.54.226.222:49904] [POST] [500] [0.002s] [513.0B] [a9b25e54-f1e1-42eb-90b2-af5aa22769cf] /api/prometheus_receiver
Jul 25 09:25:27 mgr002 ceph-mgr[1841]: [dashboard ERROR request] [b'{"status": "500 Internal Server Error", "detail": "The server encountered an unexpected condition which prevented it from fulfilling the request.", "request_id": "a9b25e54-f1e1-42eb-90b2-af5aa22769cf"}']
Jul 25 09:25:27 mgr002 ceph-mgr[1841]: [dashboard INFO request] [::ffff:10.54.226.222:49904] [POST] [500] [0.002s] [513.0B] [a9b25e54-f1e1-42eb-90b2-af5aa22769cf] /api/prometheus_receiver
Jul 25 09:25:28 mgr002 ceph-mgr[1841]: mgr handle_mgr_map Activating!
Jul 25 09:25:28 mgr002 ceph-mgr[1841]: mgr handle_mgr_map I am now activating
We have a test cluster, also running version 17.2.6, where this does
not happen. In that cluster the passive MGRs return HTTP code 204 when
the Alertmanager requests /api/prometheus_receiver.
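To compare the two clusters we can reproduce the status codes directly
(a sketch; hostnames and port are taken from the logs above, -k skips
certificate verification, and the exact headers Alertmanager sends may
matter, so this is only an approximation):

for mgr in mgr001 mgr002 mgr003; do
  curl -k -s -o /dev/null -w "$mgr: %{http_code}\n" \
    -X POST -H 'Content-Type: application/json' -d '{}' \
    "https://$mgr.example.net:8443/api/prometheus_receiver"
done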
What is happening here?
Regards
--
Robert Sander
Heinlein Consulting GmbH
Schwedter Str. 8/9b, 10119 Berlin
https://www.heinlein-support.de
Tel: 030 / 405051-43
Fax: 030 / 405051-19
Amtsgericht Berlin-Charlottenburg - HRB 220009 B
Geschäftsführer: Peer Heinlein - Sitz: Berlin
I had a problem with a server, hardware completely broken.
"ceph orch rm host" hung, even with the force and offline options.
I reinstalled another server with the same IP address and then removed
the OSDs with:
ceph osd purge osd.10
ceph osd purge osd.11
Now I have 0.342% pgs not active.
With
ceph pg <pg.id> query
I can see the PG is blocked by the now non-existent osd.10 (or osd.11
in the other problematic PG).
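For reference, this is what I grep for in the query output (a sketch;
the PG id is an example):

ceph pg 2.1a query | grep -E -A3 'blocked|down_osds_we_would_probe'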
I already tried setting
osd_find_best_info_ignore_history_les = false
on the affected OSDs and restarting them, with some luck (I had 3
non-active PGs, now I have 2).
Also, after that another OSD kept restarting. I worked around that by
setting its reweight to 0 and am still waiting until the OSD is empty
before destroying it.
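The flow I'm using for that flaky OSD is roughly (a sketch; the OSD id
is an example):

ceph osd reweight 12 0                # stop placing data on it
ceph osd safe-to-destroy osd.12       # repeat until it reports the OSD is safe to destroy
ceph osd destroy 12 --yes-i-really-mean-it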
--
Alfrenovsky
cephbot [1] is a project that I've been working on and using for years
now, and it has been added to the github.com/ceph project to increase
visibility for other people who would like to implement Slack-ops for
their Ceph clusters.
The instructions show how to set it up so that only read-only
operations can be performed from Slack, for security purposes, but
there are settings that can lock down who is allowed to communicate
with cephbot, which could make it relatively safe to run
administrative tasks as well.
Ask here or in the Ceph Slack instance if you have any questions about
its uses or implementation, or if you would like to contribute. I hope
you find it as useful as I have.
David Turner
Sony Interactive Entertainment
[1] https://github.com/ceph/cephbot-slack
Welcome to Aviv Caro as new Ceph NVMe-oF lead
Reef status:
* reef 18.1.3 built, gibba cluster upgraded, plan to publish this week
* https://pad.ceph.com/p/reef_final_blockers all resolved except for
bookworm builds https://tracker.ceph.com/issues/61845
* only blocker fixes will be merged to reef so the release matches the final RC
Planning for distribution updates earlier in release process:
* centos 9 testing wasn't enabled for reef until very late
-- partly because of missing python dependencies
-- required fixes to test suites of every component so we couldn't
merge until everything was fixed
* also applies to major dependencies like boost and rocksdb
-- boost upgrade on main disrupted testing on other release branches
-- build containerization in CI would help a lot here. discussion
continues tomorrow in Ceph Infrastructure meeting
Improving the documentation/procedure for deploying a vstart cluster:
* including installation of dependencies and compilation
-- add test coverage on fresh distros to verify that all required
dependencies are installed
* README.md will be the canonical guide
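A rough sketch of the flow the README should cover (exact dependency
and build steps vary by distro and branch):

./install-deps.sh                        # install build dependencies for the current distro
./do_cmake.sh -DCMAKE_BUILD_TYPE=RelWithDebInfo
cd build && ninja                        # or make, depending on the generator
MON=1 OSD=3 MDS=1 ../src/vstart.sh -d -n -x   # bring up a local dev cluster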
CDS concluded yesterday:
* recordings at
https://ceph.io/en/community/events/2023/ceph-developer-summit-squid/
* component leads to update ceph backlog on trello
Hello,
We have an RGW cluster that was recently upgraded from 12.2.11 to 14.2.22. The upgrade went mostly fine, though now several of our RGWs will not start. One RGW is working fine; the rest will not initialize and are stuck in a crash loop. This is part of a multisite configuration and is currently not the master zone; the current master zone is running 14.2.22. These are the only two zones in the zonegroup. After turning debug up to 20, these are the log snippets between each crash:
```
2023-07-20 14:29:56.371 7fd8dec40900 20 RGWRados::pool_iterate: got periods.1b6e1a93-98ba-4378-bc5c-d36cd5542f11.52
2023-07-20 14:29:56.371 7fd8dec40900 20 RGWRados::pool_iterate: got periods.1b6e1a93-98ba-4378-bc5c-d36cd5542f11.54
2023-07-20 14:29:56.371 7fd8dec40900 20 RGWRados::pool_iterate: got realms_names. <redacted>
2023-07-20 14:29:56.371 7fd8dec40900 20 RGWRados::pool_iterate: got <redacted>
2023-07-20 14:29:56.371 7fd8dec40900 20 rados->read ofs=0 len=0
2023-07-20 14:29:56.371 7fd8dec40900 20 rados_obj.operate() r=-2 bl.length=0
2023-07-20 14:29:56.371 7fd8dec40900 20 rados->read ofs=0 len=0
2023-07-20 14:29:56.373 7fd8dec40900 20 rados_obj.operate() r=-2 bl.length=0
2023-07-20 14:29:56.373 7fd8dec40900 20 rados->read ofs=0 len=0
2023-07-20 14:29:56.373 7fd8dec40900 20 rados_obj.operate() r=-2 bl.length=0
2023-07-20 14:29:56.373 7fd8dec40900 20 rados->read ofs=0 len=0
2023-07-20 14:29:56.373 7fd8dec40900 20 rados_obj.operate() r=0 bl.length=46
2023-07-20 14:29:56.373 7fd8dec40900 20 rados->read ofs=0 len=0
2023-07-20 14:29:56.373 7fd8dec40900 20 rados_obj.operate() r=0 bl.length=114
2023-07-20 14:29:56.373 7fd8dec40900 20 rados->read ofs=0 len=0
2023-07-20 14:29:56.373 7fd8dec40900 20 rados_obj.operate() r=0 bl.length=46
2023-07-20 14:29:56.373 7fd8dec40900 20 rados->read ofs=0 len=0
2023-07-20 14:29:56.374 7fd8dec40900 20 rados_obj.operate() r=0 bl.length=686
2023-07-20 14:29:56.374 7fd8dec40900 20 period zonegroup init ret 0
2023-07-20 14:29:56.374 7fd8dec40900 20 period zonegroup name <redacted>
2023-07-20 14:29:56.374 7fd8dec40900 20 using current period zonegroup <redacted>
2023-07-20 14:29:56.374 7fd8dec40900 20 rados->read ofs=0 len=0
2023-07-20 14:29:56.374 7fd8dec40900 20 rados_obj.operate() r=0 bl.length=46
2023-07-20 14:29:56.374 7fd8dec40900 20 rados->read ofs=0 len=0
2023-07-20 14:29:56.375 7fd8dec40900 20 rados_obj.operate() r=0 bl.length=903
2023-07-20 14:29:56.375 7fd8dec40900 10 Cannot find current period zone using local zone
2023-07-20 14:29:56.375 7fd8dec40900 20 rados->read ofs=0 len=0
2023-07-20 14:29:56.375 7fd8dec40900 20 rados_obj.operate() r=0 bl.length=903
2023-07-20 14:29:56.375 7fd8dec40900 20 zone <redacted>
2023-07-20 14:29:56.375 7fd8dec40900 20 generating connection object for zone <redacted> id f10b465f-bf18-47d0-a51c-ca4f17118ee1
2023-07-20 14:34:56.198 7fd8cafe8700 -1 Initialization timeout, failed to initialize
```
I've checked all file permissions and filesystem free space, disabled SELinux and firewalld, tried turning up the initialization timeout to 600, and tried removing all non-essential config from ceph.conf. All produce the same result. I would greatly appreciate any other ideas or insight.
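Given the "Cannot find current period zone" message above, maybe the multisite period/zone configuration is where to look next (a sketch of checks; zone/zonegroup names and the master endpoint are placeholders):

radosgw-admin period get
radosgw-admin zone get --rgw-zone=<zone>
radosgw-admin zonegroup get --rgw-zonegroup=<zonegroup>
# re-pull the current period from the master zone if the local copy looks stale
radosgw-admin period pull --url=http://<master-endpoint> --access-key=<key> --secret=<secret>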
Thanks,
Ben
Assuming you're running systemd-managed OSDs, you can run the following command on the host that OSD 343 resides on:
systemctl restart ceph-osd@343
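If the OSDs are managed by cephadm instead, the rough equivalent would be (assuming the daemon is named osd.343):

ceph orch daemon restart osd.343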
From: siddhit.renake(a)nxtgen.com At: 07/20/23 13:44:36 UTC-4:00 To: ceph-users(a)ceph.io
Subject: [ceph-users] Re: 1 PG stucked in "active+undersized+degraded for long time
What should be appropriate way to restart primary OSD in this case (343) ?