We weren't targeting bullseye; once we discovered the compiler version
problem, the focus shifted to bookworm. If anyone would like to help
maintain the Debian builds, or look into these issues, it would be
welcome:
https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=1030129
https://tracker.ceph.com/issues/61845
On Mon, Aug 21, 2023 at 7:50 AM Matthew Darwin <bugs(a)mdarwin.ca> wrote:
> Thanks for the link to the issue. Any reason it wasn't added to the
> release notes (for bullseye)?
>
> I am also waiting for this to be available to start testing.
> On 2023-08-21 10:25, Josh Durgin wrote:
>
> There was difficulty building on bullseye due to the older version of GCC
> available: https://tracker.ceph.com/issues/61845
>
> On Mon, Aug 21, 2023 at 3:01 AM Chris Palmer <chris.palmer(a)idnet.com> wrote:
>
>
> I'd like to try reef, but we are on debian 11 (bullseye).
> In the ceph repos, there is debian-quincy/bullseye and
> debian-quincy/focal, but under reef there is only focal & jammy.
>
> Is there a reason why there is no reef/bullseye build? I had thought
> that the blocker only affected debian-bookworm builds.
>
> Thanks, Chris
> _______________________________________________
> ceph-users mailing list -- ceph-users(a)ceph.io
> To unsubscribe send an email to ceph-users-leave(a)ceph.io
>
Hi,
I am using 17.2.6 on Rocky Linux 8
In my situation (bare-metal install, upgraded 15 -> 16 -> 17.2.6), the ceph mgr dashboard can no longer reach the ObjectStore->(Daemons, Users, Buckets) pages.
When I try to hit those pages, it gives an error:
RGW REST API failed request with status code 403 {"Code": "AccessDenied", RequestId: "xxxxxxx", HostId: "yyyy-<my zone>"}
The log of the rgw server it hit has:
"GET /admin/metadata/user?myself HTTP/1.1" 403 125
It appears that the mgr dashboard setting RGW_API_HOST is no longer an option that can be set; the name doesn't exist anywhere under /usr/share/ceph/mgr/dashboard, and:
# ceph dashboard set-rgw-api-host <host>
no longer exists in 17.2.6.
However, since my situation is an upgrade, the config value still exists in my config, and I can retrieve it with:
# ceph dashboard get-rgw-api-host
To get this to work in my situation, I have modified /usr/share/ceph/mgr/dashboard/settings.py and re-added RGW_API_HOST to the Options class using
RGW_API_HOST = Settings('', [dict,str])
I then modified /usr/share/ceph/mgr/dashboard/services/rgw_request.py such that each rgw daemon retrieved has its 'host' member set to Settings.RGW_API_HOST.
Then after restarting the mgr, I was able to access the Objectstore->(Daemons,Users,Buckets) pages in the dashboard.
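For clarity, here is a minimal, self-contained sketch of the override logic the hack implements. The real change lives inside the mgr dashboard module files named above; the Daemon class and host names below are stand-ins for illustration only, not the dashboard's actual internals:

```python
# Sketch of the hack described above: force every discovered rgw daemon's
# host to the shared DNS name. Daemon/host names here are invented.
from dataclasses import dataclass

RGW_API_HOST = "s3.my.dom"  # value re-added via Settings in settings.py


@dataclass
class Daemon:
    name: str
    host: str


def override_hosts(daemons):
    """Point every discovered rgw daemon at the shared DNS name."""
    for d in daemons:
        if RGW_API_HOST:
            d.host = RGW_API_HOST
    return daemons


daemons = override_hosts([Daemon("rgw.a", "node1"), Daemon("rgw.b", "node2")])
print([d.host for d in daemons])  # ['s3.my.dom', 's3.my.dom']
```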
HOWEVER, I know this is NOT the right way to fix this; it is a hack. It seems like the dashboard is trying to contact an individual rgw server. For us, RGW_API_HOST is
a name in DNS, s3.my.dom, with multiple A records, one for each of our rgw servers. Each server presents the *same* SSL cert, whose CN and SubjectAltNames allow
the cert to present itself as both s3.my.dom and the individual host name (the SubjectAltName lists ALL the rgw servers). This has worked well for us
since 15.x.y. The endpoint for the zone is set to s3.my.dom, so my users only have a single endpoint to care about, unless there is a failure on an rgw server. (We have other ways of handling that.)
Any thoughts on the CORRECT way to handle this so the ceph dashboard works with the ObjectStore->(Daemons, Users, Buckets) pages? Thanks.
-Chris
Hello Users,
We have the environment described below. Both environments are zones of one RGW multisite zonegroup, where the DC zone is currently the primary and the DR zone the secondary.
DC
Ceph Version: 17.2.6 (d7ff0d10654d2280e08f1ab989c7cdf3064446a5) quincy (stable)
Number of rgw daemons : 25
DR
Ceph Version 17.2.6 (d7ff0d10654d2280e08f1ab989c7cdf3064446a5) quincy (stable)
Number of rgw daemons : 25
Environment description:
Both the mentioned zones are in production and the RGW multisite bandwidth is over MPLS of around 3 Gbps.
Issue description :
We enabled the multisite sync between DC and DR about a month ago. The total data at the DC zone is around 159 TiB, and the sync had been going as expected. But when the sync reached around 120 TiB, the speed fell drastically: it had been around 2 Gbps, and it dropped to below 10 Mbps even though the link is not saturated. "# radosgw-admin sync status" reports "metadata is caught up with master" and "data is caught up with source", yet the DR is still almost 25 TiB behind the DC. "radosgw-admin bucket sync status --bucket=<bucket-name>" also shows the bucket is behind on shards. Attaching the log and the output below.
Resyncing the data from the beginning is not feasible in our case. The "# radosgw-admin sync error list" output is also attached (with some information redacted), and we see errors.
radosgw-admin sync status
realm 6a7fab77-64e3-453e-b54b-066bc8af2f00 (realm0)
zonegroup be660604-d853-4f8e-a576-579cae2e07c2 (zg0)
zone d06a8dd3-5bcb-486c-945b-2a98969ccd5f (fbd)
metadata sync syncing
full sync: 0/64 shards
incremental sync: 64/64 shards
metadata is caught up with master
data sync source: d09d3d16-8601-448b-bf3d-609b8a29647d (ahd)
syncing
full sync: 0/128 shards
incremental sync: 128/128 shards
data is caught up with source
radosgw-admin bucket sync status --bucket=<bucket-name>
realm 6a7fab77-64e3-453e-b54b-066bc8af2f00 (realm0)
zonegroup be660604-d853-4f8e-a576-579cae2e07c2 (zg0)
zone d06a8dd3-5bcb-486c-945b-2a98969ccd5f (fbd)
bucket :tc******rc-b1[d09d3d16-8601-448b-bf3d-609b8a29647d.38987.1])
source zone d09d3d16-8601-448b-bf3d-609b8a29647d (ahd)
source bucket :tc*******arc-b1[d09d3d16-8601-448b-bf3d-609b8a29647d.38987.1])
full sync: 14/9221 shards
full sync: 49448693 objects completed
incremental sync: 9207/9221 shards
bucket is behind on 25 shards
behind shards: [9,111,590,826,1774,2968,3132,3382,3386,3409,3685,3820,4174,4544,4708,4811,5733,6285,6558,7288,7417,7443,7876,8151,8878]
Error: radosgw-admin sync error list
        {
            "id": "1_1690799008.725414_3926410.1",
            "section": "data",
            "name": "bucket0:d09d3d16-8601-448b-bf3d-609b8a29647d.89871.1:1949",
            "timestamp": "2023-07-31T10:23:28.725414Z",
            "info": {
                "source_zone": "d09d3d16-8601-448b-bf3d-609b8a29647d",
                "error_code": 125,
                "message": "failed to sync bucket instance: (125) Operation canceled"
            }
        },
        {
            "id": "1_1690804503.144829_3759212.1",
            "section": "data",
            "name": "bucket1:d09d3d16-8601-448b-bf3d-609b8a29647d.38987.1:1232/S01/1/120/2b7ea802-efad-41d3-9d90-9**************523.txt",
            "timestamp": "2023-07-31T11:54:53.233451Z",
            "info": {
                "source_zone": "d09d3d16-8601-448b-bf3d-609b8a29647d",
                "error_code": 5,
                "message": "failed to sync object(5) Input/output error"
            }
        }
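As a quick way to triage a long error list, one can tally entries by error code. A minimal sketch, using entries shaped like the redacted output quoted above (the field names are taken from the post; the sample data here is invented):

```python
from collections import Counter

# Entries shaped like the redacted `radosgw-admin sync error list` output above.
entries = [
    {"section": "data",
     "info": {"error_code": 125,
              "message": "failed to sync bucket instance: (125) Operation canceled"}},
    {"section": "data",
     "info": {"error_code": 5,
              "message": "failed to sync object(5) Input/output error"}},
]

# Count how often each error code appears, to spot the dominant failure mode.
counts = Counter(e["info"]["error_code"] for e in entries)
print(dict(counts))  # {125: 1, 5: 1}
```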
Thanks
Ankit
Hi,
I have an octopus cluster on the latest octopus version with mgr/mon/rgw/osds on centos 8.
Is it safe to add an ubuntu osd host with the same octopus version?
Thank you
Hi all,
I have this warning the whole day already (octopus latest cluster):
HEALTH_WARN 4 clients failing to respond to capability release; 1 pgs not deep-scrubbed in time
[WRN] MDS_CLIENT_LATE_RELEASE: 4 clients failing to respond to capability release
mds.ceph-24(mds.1): Client sn352.hpc.ait.dtu.dk:con-fs2-hpc failing to respond to capability release client_id: 145698301
mds.ceph-24(mds.1): Client sn463.hpc.ait.dtu.dk:con-fs2-hpc failing to respond to capability release client_id: 189511877
mds.ceph-24(mds.1): Client sn350.hpc.ait.dtu.dk:con-fs2-hpc failing to respond to capability release client_id: 189511887
mds.ceph-24(mds.1): Client sn403.hpc.ait.dtu.dk:con-fs2-hpc failing to respond to capability release client_id: 231250695
If I look at the session info from mds.1 for these clients I see this:
# ceph tell mds.1 session ls | jq -c '[.[] | {id: .id, h: .client_metadata.hostname, addr: .inst, fs: .client_metadata.root, caps: .num_caps, req: .request_load_avg}]|sort_by(.caps)|.[]' | grep -e 145698301 -e 189511877 -e 189511887 -e 231250695
{"id":189511887,"h":"sn350.hpc.ait.dtu.dk","addr":"client.189511887 v1:192.168.57.221:0/4262844211","fs":"/hpc/groups","caps":2,"req":0}
{"id":231250695,"h":"sn403.hpc.ait.dtu.dk","addr":"client.231250695 v1:192.168.58.18:0/1334540218","fs":"/hpc/groups","caps":3,"req":0}
{"id":189511877,"h":"sn463.hpc.ait.dtu.dk","addr":"client.189511877 v1:192.168.58.78:0/3535879569","fs":"/hpc/groups","caps":4,"req":0}
{"id":145698301,"h":"sn352.hpc.ait.dtu.dk","addr":"client.145698301 v1:192.168.57.223:0/2146607320","fs":"/hpc/groups","caps":7,"req":0}
We have mds_min_caps_per_client=4096, so it looks like the limit is well satisfied. Also, the file system is pretty idle at the moment.
Why and what exactly is the MDS complaining about here?
Thanks and best regards.
=================
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14
Hey ceph-users,
1) When configuring Gnocchi to use Ceph storage (see
https://gnocchi.osci.io/install.html#ceph-requirements)
I was wondering if one could use any of the auth profiles like
* simple-rados-client
* simple-rados-client-with-blocklist ?
Or are those for different use cases?
2) I was also wondering why the documentation mentions "(Monitor only)"
but then it says
"Gives a user read-only permissions for monitor, OSD, and PG data."?
3) And are those profiles really for "read-only" users? Why don't they
have "read-only" in their name like the rbd and the corresponding
"rbd-read-only" profile?
Regards
Christian
Hello,
This message does not concern Ceph itself but a hardware vulnerability which can lead to permanent loss of data on a Ceph cluster equipped with the same hardware in separate fault domains.
The DELL / Toshiba PX02SMF020, PX02SMF040, PX02SMF080 and PX02SMB160 SSD drives of the 13G generation of DELL servers are subject to a vulnerability which renders them unusable after 70,000 hours of operation, i.e. approximately 7 years and 11 months of activity.
This topic has been discussed here: https://www.dell.com/community/PowerVault/TOSHIBA-PX02SMF080-has-lost-commu…
The risk is all the greater since these disks may die at the same time in the same server leading to the loss of all data in the server.
To date, DELL has not provided any firmware fixing this vulnerability; the latest firmware version is "A3B3", released on Sept. 12, 2016: https://www.dell.com/support/home/en-us/drivers/driversdetails?driverid=hhd9k
If you have servers running these drives, check their uptime. If they are close to the 70,000-hour limit, replace them immediately.
The smartctl tool does not report the uptime for these SSDs, but if you have HDDs in the server, you can query their SMART status and get their uptime, which should be about the same as the SSDs.
The smartctl command is: smartctl -a -d megaraid,XX /dev/sdc (where XX is the drive's device ID on the MegaRAID controller).
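To automate the check across many hosts, one could parse the Power_On_Hours attribute out of the smartctl output of a neighbouring HDD and compare it to the 70,000-hour figure. A minimal sketch (the sample output line and the 65,000-hour warning margin are invented for illustration):

```python
import re

# One invented SMART attribute line, as smartctl -a prints it for an HDD;
# the raw value in the last column is the power-on hours.
SAMPLE = (
    "  9 Power_On_Hours          0x0012   022   022   000    "
    "Old_age   Always       -       68950"
)


def power_on_hours(smart_output: str) -> int:
    """Return the raw Power_On_Hours value, or -1 if not found."""
    m = re.search(r"Power_On_Hours.*?(\d+)\s*$", smart_output, re.MULTILINE)
    return int(m.group(1)) if m else -1


hours = power_on_hours(SAMPLE)
# Warn well before the 70,000-hour failure threshold described above.
print(hours, "NEAR LIMIT" if hours > 65000 else "ok")
```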
We have informed DELL about this but have no information yet on the arrival of a fix.
We have lost 6 disks, in 3 different servers, in the last few weeks. Our observation shows that the drives don't survive full shutdown and restart of the machine (power off then power on in iDrac), but they may also die during a single reboot (init 6) or even while the machine is running.
Fujitsu released a corrective firmware in June 2021 but this firmware is most certainly not applicable to DELL drives: https://www.fujitsu.com/us/imagesgig5/PY-CIB070-00.pdf
Regards,
Frederic
Sous-direction Infrastructure and Services
Direction du Numérique
Université de Lorraine
I've been having fun today trying to invite a new disk that replaced a
failing one into a cluster.
One of my attempts to apply an OSD spec was clearly wrong, because I now
have this error:
Module 'cephadm' has failed: 'osdspec_affinity'
and this was the traceback in the mgr logs:
Traceback (most recent call last):
  File "/usr/share/ceph/mgr/cephadm/utils.py", line 77, in do_work
    return f(*arg)
  File "/usr/share/ceph/mgr/cephadm/serve.py", line 224, in refresh
    r = self._refresh_host_devices(host)
  File "/usr/share/ceph/mgr/cephadm/serve.py", line 396, in _refresh_host_devices
    self.update_osdspec_previews(host)
  File "/usr/share/ceph/mgr/cephadm/serve.py", line 412, in update_osdspec_previews
    previews.extend(self.mgr.osd_service.get_previews(search_host))
  File "/usr/share/ceph/mgr/cephadm/services/osd.py", line 258, in get_previews
    return self.generate_previews(osdspecs, host)
  File "/usr/share/ceph/mgr/cephadm/services/osd.py", line 291, in generate_previews
    for host, ds in self.prepare_drivegroup(osdspec):
  File "/usr/share/ceph/mgr/cephadm/services/osd.py", line 225, in prepare_drivegroup
    existing_daemons=len(dd_for_spec_and_host))
  File "/lib/python3.6/site-packages/ceph/deployment/drive_selection/selector.py", line 35, in __init__
    self._data = self.assign_devices('data_devices', self.spec.data_devices)
  File "/lib/python3.6/site-packages/ceph/deployment/drive_selection/selector.py", line 19, in wrapper
    return f(self, ds)
  File "/lib/python3.6/site-packages/ceph/deployment/drive_selection/selector.py", line 134, in assign_devices
    if lv['osdspec_affinity'] != self.spec.service_id:
KeyError: 'osdspec_affinity'
This cluster is running 16.2.13.
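The failure is a plain KeyError on LV metadata that lacks the 'osdspec_affinity' tag. A defensive rewrite of the failing comparison would look like this (a sketch only; the actual upstream fix may differ, and the lv dict below is an invented sample):

```python
# LV metadata as ceph-volume might report it, missing the affinity tag
# (invented sample for illustration).
lv = {"lv_name": "osd-block-abc", "tags": {}}
spec_service_id = "osd_spec-0.3"

# Failing form from the traceback:
#     if lv['osdspec_affinity'] != self.spec.service_id:   # KeyError if absent
# Defensive form using dict.get, which yields None instead of raising:
if lv.get("osdspec_affinity") != spec_service_id:
    print("device does not belong to this spec; skipping")
```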
The exported service spec is:
service_type: osd
service_id: osd_spec-0.3
service_name: osd.osd_spec-0.3
placement:
  host_pattern: cepho-*
spec:
  data_devices:
    rotational: true
  db_devices:
    model: SSDPE2KE032T8L
  encrypted: true
  filter_logic: AND
  objectstore: bluestore
Best Wishes,
Adam
Thanks Eugen for following up. Sorry my second response was incomplete. I
can confirm that it works as expected too. My confusion was that the
section from the online documentation seemed to be missing/moved, and when
it initially failed I wrongly thought that the OSD-add process had changed
in the Reef release.
There might still need to be a way that "destroy" does additional clean-up
to clear remnants of LVM fingerprints on the devices as this tripped me up
when the OSDspec apply failed due to "filesystem on device" checks.
Documentation has been improved and OSD spec is now under this heading for
Reef:
https://docs.ceph.com/en/reef/cephadm/services/osd/#advanced-osd-service-sp…