We weren't targeting bullseye; once we discovered the compiler version
problem, the focus shifted to bookworm. If anyone would like to help
maintain the Debian builds, or look into these issues, it would be
welcome:
https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=1030129
https://tracker.ceph.com/issues/61845
On Mon, Aug 21, 2023 at 7:50 AM Matthew Darwin <bugs(a)mdarwin.ca> wrote:
> Thanks for the link to the issue. Any reason it wasn't added to the
> release notes (for bullseye)?
>
> I am also waiting for this to be available to start testing.
> On 2023-08-21 10:25, Josh Durgin wrote:
>
> There was difficulty building on bullseye due to the older version of GCC
> available: https://tracker.ceph.com/issues/61845
>
> On Mon, Aug 21, 2023 at 3:01 AM Chris Palmer <chris.palmer(a)idnet.com> wrote:
>
>
> I'd like to try reef, but we are on debian 11 (bullseye).
> In the ceph repos, there is debian-quincy/bullseye and
> debian-quincy/focal, but under reef there is only focal & jammy.
>
> Is there a reason why there is no reef/bullseye build? I had thought
> that the blocker only affected debian-bookworm builds.
>
> Thanks, Chris
> _______________________________________________
> ceph-users mailing list -- ceph-users(a)ceph.io
> To unsubscribe send an email to ceph-users-leave(a)ceph.io
>
Hi,
I am using 17.2.6 on Rocky Linux 8
In my situation (bare-metal install, upgraded 15 -> 16 -> 17.2.6), the ceph mgr dashboard can no longer reach the ObjectStore->(Daemons, Users, Buckets) pages.
When I try to hit those pages, it gives an error:
RGW REST API failed request with status code 403 {"Code": "AccessDenied", RequestId: "xxxxxxx", HostId: "yyyy-<my zone>"}
The log of the rgw server it hit has:
"GET /admin/metadata/user?myself HTTP/1.1" 403 125
It appears that the mgr dashboard setting RGW_API_HOST is no longer an option that can be set; the name doesn't exist anywhere under /usr/share/ceph/mgr/dashboard, and:
# ceph dashboard set-rgw-api-host <host>
no longer exists in 17.2.6.
However, since my situation is an upgrade, the config value still exists in my config, and I can retrieve it with:
# ceph dashboard get-rgw-api-host
To get this to work in my situation, I have modified /usr/share/ceph/mgr/dashboard/settings.py and re-added RGW_API_HOST to the Options class using
RGW_API_HOST = Settings('', [dict,str])
I then modified /usr/share/ceph/mgr/dashboard/services/rgw_request.py such that each rgw daemon retrieved has its 'host' member set to Settings.RGW_API_HOST.
Then after restarting the mgr, I was able to access the Objectstore->(Daemons,Users,Buckets) pages in the dashboard.
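For clarity, here is a minimal, self-contained sketch of the override logic the hack implements. The real change lives inside the mgr dashboard module files named above; the Daemon class and host names below are stand-ins for illustration only, not the dashboard's actual internals:

```python
# Sketch of the hack described above: force every discovered rgw daemon's
# host to the shared DNS name. Daemon/host names here are invented.
from dataclasses import dataclass

RGW_API_HOST = "s3.my.dom"  # value re-added via Settings in settings.py


@dataclass
class Daemon:
    name: str
    host: str


def override_hosts(daemons):
    """Point every discovered rgw daemon at the shared DNS name."""
    for d in daemons:
        if RGW_API_HOST:
            d.host = RGW_API_HOST
    return daemons


daemons = override_hosts([Daemon("rgw.a", "node1"), Daemon("rgw.b", "node2")])
print([d.host for d in daemons])  # ['s3.my.dom', 's3.my.dom']
```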
HOWEVER, I know this is NOT the right way to fix this; it is a hack. It seems like the dashboard is trying to contact an individual rgw server. For us, RGW_API_HOST is
a name in DNS, s3.my.dom, with multiple A records, one for each of our rgw servers. Each server presents the *same* SSL cert, whose CN and SubjectAltNames allow
the cert to present itself as both s3.my.dom and the individual host name (the SubjectAltName lists ALL the rgw servers). This has worked well for us
since 15.x.y. The endpoint for the zone is set to s3.my.dom, so my users only have a single endpoint to care about, unless there is a failure on an rgw server. (We have other ways of handling that.)
Any thoughts on the CORRECT way to handle this so the ceph dashboard works with the ObjectStore->(Daemons, Users, Buckets) pages? Thanks.
-Chris
Hello Users,
We have the environment described below. Both environments are zones of one RGW multisite zonegroup, where the DC zone is currently the primary and the DR zone the secondary.
DC
Ceph Version: 17.2.6 (d7ff0d10654d2280e08f1ab989c7cdf3064446a5) quincy (stable)
Number of rgw daemons : 25
DR
Ceph Version 17.2.6 (d7ff0d10654d2280e08f1ab989c7cdf3064446a5) quincy (stable)
Number of rgw daemons : 25
Environment description:
Both the mentioned zones are in production and the RGW multisite bandwidth is over MPLS of around 3 Gbps.
Issue description :
We enabled the multisite sync between DC and DR about a month ago. The total data at the DC zone is around 159 TiB, and the sync had been going as expected. But when the sync reached around 120 TiB, the speed fell drastically: it had been around 2 Gbps, and it dropped to below 10 Mbps even though the link is not saturated. "# radosgw-admin sync status" reports "metadata is caught up with master" and "data is caught up with source", yet the DR is still almost 25 TiB behind the DC. "radosgw-admin bucket sync status --bucket=<bucket-name>" also shows the bucket is behind on shards. Attaching the log and the output below.
Resyncing the data from the beginning is not feasible in our case. The "# radosgw-admin sync error list" output is also attached (with some information redacted), and we see errors.
radosgw-admin sync status
realm 6a7fab77-64e3-453e-b54b-066bc8af2f00 (realm0)
zonegroup be660604-d853-4f8e-a576-579cae2e07c2 (zg0)
zone d06a8dd3-5bcb-486c-945b-2a98969ccd5f (fbd)
metadata sync syncing
full sync: 0/64 shards
incremental sync: 64/64 shards
metadata is caught up with master
data sync source: d09d3d16-8601-448b-bf3d-609b8a29647d (ahd)
syncing
full sync: 0/128 shards
incremental sync: 128/128 shards
data is caught up with source
radosgw-admin bucket sync status --bucket=<bucket-name>
realm 6a7fab77-64e3-453e-b54b-066bc8af2f00 (realm0)
zonegroup be660604-d853-4f8e-a576-579cae2e07c2 (zg0)
zone d06a8dd3-5bcb-486c-945b-2a98969ccd5f (fbd)
bucket :tc******rc-b1[d09d3d16-8601-448b-bf3d-609b8a29647d.38987.1])
source zone d09d3d16-8601-448b-bf3d-609b8a29647d (ahd)
source bucket :tc*******arc-b1[d09d3d16-8601-448b-bf3d-609b8a29647d.38987.1])
full sync: 14/9221 shards
full sync: 49448693 objects completed
incremental sync: 9207/9221 shards
bucket is behind on 25 shards
behind shards: [9,111,590,826,1774,2968,3132,3382,3386,3409,3685,3820,4174,4544,4708,4811,5733,6285,6558,7288,7417,7443,7876,8151,8878]
Error: radosgw-admin sync error list
        {
            "id": "1_1690799008.725414_3926410.1",
            "section": "data",
            "name": "bucket0:d09d3d16-8601-448b-bf3d-609b8a29647d.89871.1:1949",
            "timestamp": "2023-07-31T10:23:28.725414Z",
            "info": {
                "source_zone": "d09d3d16-8601-448b-bf3d-609b8a29647d",
                "error_code": 125,
                "message": "failed to sync bucket instance: (125) Operation canceled"
            }
        },
        {
            "id": "1_1690804503.144829_3759212.1",
            "section": "data",
            "name": "bucket1:d09d3d16-8601-448b-bf3d-609b8a29647d.38987.1:1232/S01/1/120/2b7ea802-efad-41d3-9d90-9**************523.txt",
            "timestamp": "2023-07-31T11:54:53.233451Z",
            "info": {
                "source_zone": "d09d3d16-8601-448b-bf3d-609b8a29647d",
                "error_code": 5,
                "message": "failed to sync object(5) Input/output error"
            }
        }
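As a quick way to triage a long error list, one can tally entries by error code. A minimal sketch, using entries shaped like the redacted output quoted above (the field names are taken from the post; the sample data here is invented):

```python
from collections import Counter

# Entries shaped like the redacted `radosgw-admin sync error list` output above.
entries = [
    {"section": "data",
     "info": {"error_code": 125,
              "message": "failed to sync bucket instance: (125) Operation canceled"}},
    {"section": "data",
     "info": {"error_code": 5,
              "message": "failed to sync object(5) Input/output error"}},
]

# Count how often each error code appears, to spot the dominant failure mode.
counts = Counter(e["info"]["error_code"] for e in entries)
print(dict(counts))  # {125: 1, 5: 1}
```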
Thanks
Ankit
Hi,
I have an octopus cluster on the latest octopus version with mgr/mon/rgw/osds on centos 8.
Is it safe to add an ubuntu osd host with the same octopus version?
Thank you
Hi all,
I have this warning the whole day already (octopus latest cluster):
HEALTH_WARN 4 clients failing to respond to capability release; 1 pgs not deep-scrubbed in time
[WRN] MDS_CLIENT_LATE_RELEASE: 4 clients failing to respond to capability release
mds.ceph-24(mds.1): Client sn352.hpc.ait.dtu.dk:con-fs2-hpc failing to respond to capability release client_id: 145698301
mds.ceph-24(mds.1): Client sn463.hpc.ait.dtu.dk:con-fs2-hpc failing to respond to capability release client_id: 189511877
mds.ceph-24(mds.1): Client sn350.hpc.ait.dtu.dk:con-fs2-hpc failing to respond to capability release client_id: 189511887
mds.ceph-24(mds.1): Client sn403.hpc.ait.dtu.dk:con-fs2-hpc failing to respond to capability release client_id: 231250695
If I look at the session info from mds.1 for these clients I see this:
# ceph tell mds.1 session ls | jq -c '[.[] | {id: .id, h: .client_metadata.hostname, addr: .inst, fs: .client_metadata.root, caps: .num_caps, req: .request_load_avg}]|sort_by(.caps)|.[]' | grep -e 145698301 -e 189511877 -e 189511887 -e 231250695
{"id":189511887,"h":"sn350.hpc.ait.dtu.dk","addr":"client.189511887 v1:192.168.57.221:0/4262844211","fs":"/hpc/groups","caps":2,"req":0}
{"id":231250695,"h":"sn403.hpc.ait.dtu.dk","addr":"client.231250695 v1:192.168.58.18:0/1334540218","fs":"/hpc/groups","caps":3,"req":0}
{"id":189511877,"h":"sn463.hpc.ait.dtu.dk","addr":"client.189511877 v1:192.168.58.78:0/3535879569","fs":"/hpc/groups","caps":4,"req":0}
{"id":145698301,"h":"sn352.hpc.ait.dtu.dk","addr":"client.145698301 v1:192.168.57.223:0/2146607320","fs":"/hpc/groups","caps":7,"req":0}
We have mds_min_caps_per_client=4096, so it looks like the limit is well satisfied. Also, the file system is pretty idle at the moment.
Why and what exactly is the MDS complaining about here?
Thanks and best regards.
=================
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14
Hey ceph-users,
1) When configuring Gnocchi to use Ceph storage (see
https://gnocchi.osci.io/install.html#ceph-requirements)
I was wondering if one could use any of the auth profiles like
* simple-rados-client
* simple-rados-client-with-blocklist ?
Or are those for different use cases?
2) I was also wondering why the documentation mentions "(Monitor only)"
but then it says
"Gives a user read-only permissions for monitor, OSD, and PG data."?
3) And are those profiles really for "read-only" users? Why don't they
have "read-only" in their name like the rbd and the corresponding
"rbd-read-only" profile?
Regards
Christian
Hello,
This message does not concern Ceph itself but a hardware vulnerability which can lead to permanent loss of data on a Ceph cluster equipped with the same hardware in separate fault domains.
The DELL / Toshiba PX02SMF020, PX02SMF040, PX02SMF080 and PX02SMB160 SSD drives of the 13G generation of DELL servers are subject to a vulnerability which renders them unusable after 70,000 hours of operation, i.e. approximately 7 years and 11 months of activity.
This topic has been discussed here: https://www.dell.com/community/PowerVault/TOSHIBA-PX02SMF080-has-lost-commu…
The risk is all the greater since these disks may die at the same time in the same server leading to the loss of all data in the server.
To date, DELL has not provided any firmware fixing this vulnerability; the latest firmware version is "A3B3", released on Sept. 12, 2016: https://www.dell.com/support/home/en-us/drivers/driversdetails?driverid=hhd9k
If you have servers running these drives, check their uptime. If they are close to the 70,000-hour limit, replace them immediately.
The smartctl tool does not report the uptime for these SSDs, but if you have HDDs in the server, you can query their SMART status and get their uptime, which should be about the same as the SSDs.
The smartctl command is: smartctl -a -d megaraid,XX /dev/sdc (where XX is the drive's device ID on the MegaRAID controller).
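To automate the check across many hosts, one could parse the Power_On_Hours attribute out of the smartctl output of a neighbouring HDD and compare it to the 70,000-hour figure. A minimal sketch (the sample output line and the 65,000-hour warning margin are invented for illustration):

```python
import re

# One invented SMART attribute line, as smartctl -a prints it for an HDD;
# the raw value in the last column is the power-on hours.
SAMPLE = (
    "  9 Power_On_Hours          0x0012   022   022   000    "
    "Old_age   Always       -       68950"
)


def power_on_hours(smart_output: str) -> int:
    """Return the raw Power_On_Hours value, or -1 if not found."""
    m = re.search(r"Power_On_Hours.*?(\d+)\s*$", smart_output, re.MULTILINE)
    return int(m.group(1)) if m else -1


hours = power_on_hours(SAMPLE)
# Warn well before the 70,000-hour failure threshold described above.
print(hours, "NEAR LIMIT" if hours > 65000 else "ok")
```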
We have informed DELL about this but have no information yet on the arrival of a fix.
We have lost 6 disks, in 3 different servers, in the last few weeks. Our observation shows that the drives don't survive full shutdown and restart of the machine (power off then power on in iDrac), but they may also die during a single reboot (init 6) or even while the machine is running.
Fujitsu released a corrective firmware in June 2021 but this firmware is most certainly not applicable to DELL drives: https://www.fujitsu.com/us/imagesgig5/PY-CIB070-00.pdf
Regards,
Frederic
Sous-direction Infrastructure and Services
Direction du Numérique
Université de Lorraine
I've been having fun today trying to invite a new disk that replaced a
failing one into a cluster.
One of my attempts to apply an OSD spec was clearly wrong, because I now
have this error:
Module 'cephadm' has failed: 'osdspec_affinity'
and this was the traceback in the mgr logs:
Traceback (most recent call last):
  File "/usr/share/ceph/mgr/cephadm/utils.py", line 77, in do_work
    return f(*arg)
  File "/usr/share/ceph/mgr/cephadm/serve.py", line 224, in refresh
    r = self._refresh_host_devices(host)
  File "/usr/share/ceph/mgr/cephadm/serve.py", line 396, in _refresh_host_devices
    self.update_osdspec_previews(host)
  File "/usr/share/ceph/mgr/cephadm/serve.py", line 412, in update_osdspec_previews
    previews.extend(self.mgr.osd_service.get_previews(search_host))
  File "/usr/share/ceph/mgr/cephadm/services/osd.py", line 258, in get_previews
    return self.generate_previews(osdspecs, host)
  File "/usr/share/ceph/mgr/cephadm/services/osd.py", line 291, in generate_previews
    for host, ds in self.prepare_drivegroup(osdspec):
  File "/usr/share/ceph/mgr/cephadm/services/osd.py", line 225, in prepare_drivegroup
    existing_daemons=len(dd_for_spec_and_host))
  File "/lib/python3.6/site-packages/ceph/deployment/drive_selection/selector.py", line 35, in __init__
    self._data = self.assign_devices('data_devices', self.spec.data_devices)
  File "/lib/python3.6/site-packages/ceph/deployment/drive_selection/selector.py", line 19, in wrapper
    return f(self, ds)
  File "/lib/python3.6/site-packages/ceph/deployment/drive_selection/selector.py", line 134, in assign_devices
    if lv['osdspec_affinity'] != self.spec.service_id:
KeyError: 'osdspec_affinity'
This cluster is running 16.2.13.
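The failure is a plain KeyError on LV metadata that lacks the 'osdspec_affinity' tag. A defensive rewrite of the failing comparison would look like this (a sketch only; the actual upstream fix may differ, and the lv dict below is an invented sample):

```python
# LV metadata as ceph-volume might report it, missing the affinity tag
# (invented sample for illustration).
lv = {"lv_name": "osd-block-abc", "tags": {}}
spec_service_id = "osd_spec-0.3"

# Failing form from the traceback:
#     if lv['osdspec_affinity'] != self.spec.service_id:   # KeyError if absent
# Defensive form using dict.get, which yields None instead of raising:
if lv.get("osdspec_affinity") != spec_service_id:
    print("device does not belong to this spec; skipping")
```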
The exported service spec is:
service_type: osd
service_id: osd_spec-0.3
service_name: osd.osd_spec-0.3
placement:
  host_pattern: cepho-*
spec:
  data_devices:
    rotational: true
  db_devices:
    model: SSDPE2KE032T8L
  encrypted: true
  filter_logic: AND
  objectstore: bluestore
Best Wishes,
Adam
Thanks Eugen for following up. Sorry my second response was incomplete. I
can confirm that it works as expected too. My confusion was that the
section from the online documentation seemed to be missing/moved, and when
it initially failed I wrongly thought that the OSD-add process had changed
in the Reef release.
There might still need to be a way that "destroy" does additional clean-up
to clear remnants of LVM fingerprints on the devices as this tripped me up
when the OSDspec apply failed due to "filesystem on device" checks.
Documentation has been improved and OSD spec is now under this heading for
Reef:
https://docs.ceph.com/en/reef/cephadm/services/osd/#advanced-osd-service-sp…