Hi,
Mgr of my cluster logs this every few seconds:
[progress WARNING root] complete: ev 7de5bb74-790b-4fda-8838-e4af4af18c62 does not exist
[progress WARNING root] complete: ev fff93fce-b630-4141-81ee-19e7a3e61483 does not exist
[progress WARNING root] complete: ev a02f6966-5b9f-49e8-89c4-b4fb8e6f4423 does not exist
[progress WARNING root] complete: ev 8d318560-ff1a-477f-9386-43f6b51080bf does not exist
[progress WARNING root] complete: ev ff3740a9-6434-470a-808f-a2762fb542a0 does not exist
[progress WARNING root] complete: ev 7d0589f1-545e-4970-867b-8482ce48d7f0 does not exist
[progress WARNING root] complete: ev 78d57e43-5be5-43f0-8b1a-cdc60e410892 does not exist
I would appreciate advice on what these warnings mean and how they can be resolved.
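The only thing I have found so far are the progress module's own commands; I assume something like this would show and then drop the stale events, but I have not tried it yet:

  ceph progress json     (dump the events the module currently tracks)
  ceph progress clear    (clear all tracked events; my assumption is this also drops the stale ones)

Still, I would like to understand why these events keep being reported as missing in the first place.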
Best regards,
Zakhar
Hi,
Very basic question: two days ago I rebooted the whole cluster. Everything works
fine. But I'm guessing that during the shutdown 4 OSDs were marked as crashed:
[WRN] RECENT_CRASH: 4 daemons have recently crashed
osd.381 crashed on host cthulhu5 at 2024-03-20T18:33:12.017102Z
osd.379 crashed on host cthulhu4 at 2024-03-20T18:47:13.838839Z
osd.376 crashed on host cthulhu3 at 2024-03-20T18:50:00.877536Z
osd.373 crashed on host cthulhu1 at 2024-03-20T18:56:46.887394Z
Is there any way to «clean» that? Because otherwise my Icinga complains,
and I don't like to add a downtime in Icinga.
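From what I've read, I assume the crash module can acknowledge them so the health warning goes away, something like this (not tested on my side yet):

  ceph crash ls
  ceph crash archive <id>      (archive a single crash report)
  ceph crash archive-all       (archive all of them at once)

But I'd like to be sure that's the right way before running it.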
Thanks.
--
Albert SHIH 🦫 🐸
France
Heure locale/Local time:
ven. 22 mars 2024 22:24:35 CET
Hi,
We have 2 clusters (v18.2.1), primarily used for RGW, holding over 2 billion RGW objects. They are in a multisite configuration with 2 zones, and we have around 2 Gbps of bandwidth dedicated (P2P) to the multisite traffic. "radosgw-admin sync status" on zone 2 shows all 128 shards recovering, and unfortunately there is very little data transfer from the primary zone, i.e. the link utilization is barely 100 Mbps out of 2 Gbps. Our objects are quite small as well, averaging about 1 MB in size.
On further inspection, we noticed that the RGW access logs at the primary site mostly show "304 Not Modified" for requests coming from the site-2 RGWs. Is this expected? Here are some of the logs (information is redacted):
root@host-04:~# tail -f /var/log/haproxy-msync.log
Feb 12 05:06:51 host-04 haproxy[971171]: 10.1.85.14:33730 [12/Feb/2024:05:06:51.047] https~ backend/host-04-msync 0/0/0/2/2 304 143 - - ---- 56/55/1/0/0 0/0 "GET /bucket1/object1.jpg?rgwx-zonegroup=71dceb3d-3092-4dc6-897f-a9abf60c9972&rgwx-prepend-metadata=true&rgwx-sync-manifest&rgwx-sync-cloudtiered&rgwx-skip-decrypt&rgwx-if-not-replicated-to=a8204ce2-b69e-4d90-bca1-93edd05a1a29%3Abucket1%3A8b96aea5-c763-40a3-8430-efd67cff0c62.20010.7 HTTP/1.1"
Feb 12 05:06:51 host-04 haproxy[971171]: 10.1.85.14:59730 [12/Feb/2024:05:06:51.048] https~ backend/host-04-msync 0/0/0/2/2 304 143 - - ---- 56/55/3/1/0 0/0 "GET /bucket1/object91.jpg?rgwx-zonegroup=71dceb3d-3092-4dc6-897f-a9abf60c9972&rgwx-prepend-metadata=true&rgwx-sync-manifest&rgwx-sync-cloudtiered&rgwx-skip-decrypt&rgwx-if-not-replicated-to=a8204ce2-b69e-4d90-bca1-93edd05a1a29%3Abucket1%3A8b96aea5-c763-40a3-8430-efd67cff0c62.20010.7 HTTP/1.1"
We also took a look at our Grafana instance: out of 1000 requests per second, about 200 get "200 OK" and 800 get "304 Not Modified". Sync threads run on only 2 RGW daemons per zone, behind a load balancer. "radosgw-admin sync error list" also shows around 20 errors, which are mostly automatically recoverable.
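Besides "radosgw-admin sync status", we plan to drill down per data shard and per bucket on zone 2, along these lines (the zone and bucket names here are only examples):

  radosgw-admin sync status --rgw-zone=zone2
  radosgw-admin data sync status --source-zone=zone1
  radosgw-admin bucket sync status --bucket=bucket1

so any hints on what to look for in that output would also help.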
As we understand it, does this mean that the RGW multisite sync logs in the log pool are yet to be generated, or something along those lines? Please give us some insights and let us know how to resolve this.
Thanks,
Praveen
Hi.
We're currently getting these errors, and I seem to be missing a clear overview of the cause and how to debug it.
3/26/24 9:38:09 PM [ERR] executing _write_files((['dkcphhpcadmin01', 'dkcphhpcmgt028', 'dkcphhpcmgt029', 'dkcphhpcmgt031', 'dkcphhpcosd033', 'dkcphhpcosd034', 'dkcphhpcosd035', 'dkcphhpcosd036', 'dkcphhpcosd037', 'dkcphhpcosd038', 'dkcphhpcosd039', 'dkcphhpcosd040', 'dkcphhpcosd041', 'dkcphhpcosd042', 'dkcphhpcosd043', 'dkcphhpcosd044'],)) failed.
Traceback (most recent call last):
  File "/usr/share/ceph/mgr/cephadm/ssh.py", line 240, in _write_remote_file
    await asyncssh.scp(f.name, (conn, tmp_path))
  File "/lib/python3.6/site-packages/asyncssh/scp.py", line 922, in scp
    await source.run(srcpath)
  File "/lib/python3.6/site-packages/asyncssh/scp.py", line 458, in run
    self.handle_error(exc)
  File "/lib/python3.6/site-packages/asyncssh/scp.py", line 307, in handle_error
    raise exc from None
  File "/lib/python3.6/site-packages/asyncssh/scp.py", line 456, in run
    await self._send_files(path, b'')
  File "/lib/python3.6/site-packages/asyncssh/scp.py", line 438, in _send_files
    self.handle_error(exc)
  File "/lib/python3.6/site-packages/asyncssh/scp.py", line 307, in handle_error
    raise exc from None
  File "/lib/python3.6/site-packages/asyncssh/scp.py", line 434, in _send_files
    await self._send_file(srcpath, dstpath, attrs)
  File "/lib/python3.6/site-packages/asyncssh/scp.py", line 365, in _send_file
    await self._make_cd_request(b'C', attrs, size, srcpath)
  File "/lib/python3.6/site-packages/asyncssh/scp.py", line 343, in _make_cd_request
    self._fs.basename(path))
  File "/lib/python3.6/site-packages/asyncssh/scp.py", line 224, in make_request
    raise exc
asyncssh.sftp.SFTPFailure: scp: /tmp/var/lib/ceph/5c384430-da91-11ed-af9c-c780a5227aff/config/ceph.conf.new: Permission denied

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/share/ceph/mgr/cephadm/utils.py", line 79, in do_work
    return f(*arg)
  File "/usr/share/ceph/mgr/cephadm/serve.py", line 1088, in _write_files
    self._write_client_files(client_files, host)
  File "/usr/share/ceph/mgr/cephadm/serve.py", line 1107, in _write_client_files
    self.mgr.ssh.write_remote_file(host, path, content, mode, uid, gid)
  File "/usr/share/ceph/mgr/cephadm/ssh.py", line 261, in write_remote_file
    host, path, content, mode, uid, gid, addr))
  File "/usr/share/ceph/mgr/cephadm/module.py", line 615, in wait_async
    return self.event_loop.get_result(coro)
  File "/usr/share/ceph/mgr/cephadm/ssh.py", line 56, in get_result
    return asyncio.run_coroutine_threadsafe(coro, self._loop).result()
  File "/lib64/python3.6/concurrent/futures/_base.py", line 432, in result
    return self.__get_result()
  File "/lib64/python3.6/concurrent/futures/_base.py", line 384, in __get_result
    raise self._exception
  File "/usr/share/ceph/mgr/cephadm/ssh.py", line 249, in _write_remote_file
    raise OrchestratorError(msg)
orchestrator._interface.OrchestratorError: Unable to write dkcphhpcmgt028:/var/lib/ceph/5c384430-da91-11ed-af9c-c780a5227aff/config/ceph.conf: scp: /tmp/var/lib/ceph/5c384430-da91-11ed-af9c-c780a5227aff/config/ceph.conf.new: Permission denied

3/26/24 9:38:09 PM [ERR] Unable to write dkcphhpcmgt028:/var/lib/ceph/5c384430-da91-11ed-af9c-c780a5227aff/config/ceph.conf: scp: /tmp/var/lib/ceph/5c384430-da91-11ed-af9c-c780a5227aff/config/ceph.conf.new: Permission denied
Traceback (most recent call last):
  File "/usr/share/ceph/mgr/cephadm/ssh.py", line 240, in _write_remote_file
    await asyncssh.scp(f.name, (conn, tmp_path))
  File "/lib/python3.6/site-packages/asyncssh/scp.py", line 922, in scp
    await source.run(srcpath)
  File "/lib/python3.6/site-packages/asyncssh/scp.py", line 458, in run
    self.handle_error(exc)
  File "/lib/python3.6/site-packages/asyncssh/scp.py", line 307, in handle_error
    raise exc from None
  File "/lib/python3.6/site-packages/asyncssh/scp.py", line 456, in run
    await self._send_files(path, b'')
  File "/lib/python3.6/site-packages/asyncssh/scp.py", line 438, in _send_files
    self.handle_error(exc)
  File "/lib/python3.6/site-packages/asyncssh/scp.py", line 307, in handle_error
    raise exc from None
  File "/lib/python3.6/site-packages/asyncssh/scp.py", line 434, in _send_files
    await self._send_file(srcpath, dstpath, attrs)
  File "/lib/python3.6/site-packages/asyncssh/scp.py", line 365, in _send_file
    await self._make_cd_request(b'C', attrs, size, srcpath)
  File "/lib/python3.6/site-packages/asyncssh/scp.py", line 343, in _make_cd_request
    self._fs.basename(path))
  File "/lib/python3.6/site-packages/asyncssh/scp.py", line 224, in make_request
    raise exc
asyncssh.sftp.SFTPFailure: scp: /tmp/var/lib/ceph/5c384430-da91-11ed-af9c-c780a5227aff/config/ceph.conf.new: Permission denied

3/26/24 9:38:09 PM [INF] Updating dkcphhpcmgt028:/var/lib/ceph/5c384430-da91-11ed-af9c-c780a5227aff/config/ceph.conf
It seems to be related to the permissions the manager writes the files with and the process that copies them around.
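My first guess is a stale or differently-owned temp file left on the target host from an earlier run, so the next thing we plan to check is roughly this (fsid and hostname taken from the error above, commands not run yet):

  ssh dkcphhpcmgt028 ls -ld /tmp/var /tmp/var/lib/ceph/5c384430-da91-11ed-af9c-c780a5227aff/config
  ceph config get mgr mgr/cephadm/ssh_user    (if I read the cephadm docs right, this is the user the orchestrator connects as)

and then compare the owner of those temp paths with that SSH user. Does that sound like the right direction?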
$ sudo ceph -v
[sudo] password for adminjskr:
ceph version 17.2.6 (d7ff0d10654d2280e08f1ab989c7cdf3064446a5) quincy (stable)
Best regards,
Jesper Agerbo Krogh
Director Digitalization
Digitalization
Topsoe A/S
Haldor Topsøes Allé 1
2800 Kgs. Lyngby
Denmark
Phone (direct): 27773240
   
Read more at topsoe.com
On 29/05/2023 20.55, Anthony D'Atri wrote:
> Check the uptime for the OSDs in question
I restarted all my OSDs within the past 10 days or so. Maybe OSD
restarts are somehow breaking these stats?
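(For reference, this is roughly how I'm pulling the counters out of the dump; the field paths are from memory and may differ slightly between versions:

  ceph pg dump_json | jq '.pg_map.pg_stats[]
    | select(.state | contains("backfill"))
    | {pgid, state,
       misplaced: .stat_sum.num_objects_misplaced,
       recovered: .stat_sum.num_objects_recovered}'

and the misplaced count stays at 0 while recovered keeps climbing.)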
>
>> On May 29, 2023, at 6:44 AM, Hector Martin <marcan@marcan.st> wrote:
>>
>> Hi,
>>
>> I'm watching a cluster finish a bunch of backfilling, and I noticed that
>> quite often PGs end up with zero misplaced objects, even though they are
>> still backfilling.
>>
>> Right now the cluster is down to 6 backfilling PGs:
>>
>> data:
>> volumes: 1/1 healthy
>> pools: 6 pools, 268 pgs
>> objects: 18.79M objects, 29 TiB
>> usage: 49 TiB used, 25 TiB / 75 TiB avail
>> pgs: 262 active+clean
>> 6 active+remapped+backfilling
>>
>> But there are no misplaced objects, and the misplaced column in `ceph pg
>> dump` is zero for all PGs.
>>
>> If I do a `ceph pg dump_json`, I can see `num_objects_recovered`
>> increasing for these PGs... but the misplaced count is still 0.
>>
>> Is there something else that would cause recoveries/backfills other than
>> misplaced objects? Or perhaps there is a bug somewhere causing the
>> misplaced object count to be misreported as 0 sometimes?
>>
>> # ceph -v
>> ceph version 17.2.6 (d7ff0d10654d2280e08f1ab989c7cdf3064446a5) quincy
>> (stable)
>>
>> - Hector
>> _______________________________________________
>> ceph-users mailing list -- ceph-users@ceph.io
>> To unsubscribe send an email to ceph-users-leave@ceph.io
>
>
- Hector
Hi,
as the documentation sends mixed signals in
https://docs.ceph.com/en/latest/rados/configuration/network-config-ref/#ipv…
"Note
Binding to IPv4 is enabled by default, so if you just add the option to
bind to IPv6 you’ll actually put yourself into dual stack mode."
and
https://docs.ceph.com/en/latest/rados/configuration/msgr2/#address-formats
"Note
The ability to bind to multiple ports has paved the way for dual-stack
IPv4 and IPv6 support. That said, dual-stack operation is not yet
supported as of Quincy v17.2.0."
just the quick questions:
Is dual-stack networking with IPv4 and IPv6 now supported or not?
From which version on is it considered stable?
Are OSDs now able to register themselves with two IP addresses in the
cluster map? MONs too?
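To make the questions concrete, the kind of configuration I have in mind would be something like this (example networks, untested on my side):

  [global]
      ms_bind_ipv4 = true
      ms_bind_ipv6 = true
      public_network = 192.0.2.0/24, 2001:db8::/64
      cluster_network = 198.51.100.0/24, 2001:db8:1::/64

i.e. both address families bound and both networks listed.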
Regards
--
Robert Sander
Heinlein Consulting GmbH
Schwedter Str. 8/9b, 10119 Berlin
https://www.heinlein-support.de
Tel: 030 / 405051-43
Fax: 030 / 405051-19
Amtsgericht Berlin-Charlottenburg - HRB 220009 B
Geschäftsführer: Peer Heinlein - Sitz: Berlin
Hello,
a user in our Ceph cluster is suddenly not able to write to one of his buckets.
Reading works fine.
All other buckets work fine.
If we copy the bucket to another bucket on the same cluster, the error
stays: writing is not possible in the new bucket either.
Interesting: If we copy the contents of the bucket to a bucket in
another Ceph cluster the error is gone.
So now we know how to work around this, but we cannot find the root cause.
I checked the policies, lifecycle and versioning.
Nothing. The user has FULL_CONTROL. Same settings for the user's other
buckets he can still write to.
When setting debugging to higher levels, all I can see is something like
this while trying to write to the bucket:
s3:put_obj reading permissions
s3:put_obj init op
s3:put_obj verifying op mask
s3:put_obj verifying op permissions
op->ERRORHANDLER: err_no=-13 new_err_no=-13
cache get: name=default.rgw.log++script.postrequest. : hit (negative entry)
s3:put_obj op status=0
s3:put_obj http status=403
1 ====== req done req=0x7fe8bb60a710 op status=0 http_status=403
latency=0.000000000s ======
I still think this is something with a policy or similar. When we copy the
bucket to another bucket in the same cluster, you can write to the new
bucket at first, while the copy is still running, but at some point as the
copy job progresses, writing is no longer possible.
But what is it?
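For completeness, err_no=-13 looks like EACCES, so the next thing I will try is to dump the bucket's metadata and policy directly and diff them against a working bucket, roughly like this (the bucket name is just a placeholder):

  s3cmd info s3://<bucket>                       (shows ACL and any bucket policy)
  radosgw-admin bucket stats --bucket=<bucket>
  radosgw-admin metadata get bucket:<bucket>

Maybe that turns up the difference, but any other ideas are welcome.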
Best,
Malte
Hi everyone,
On behalf of the Ceph Foundation Board, I would like to announce the
creation of, and cordially invite you to, the first of a recurring series
of meetings focused solely on gathering feedback from the users of
Ceph. The overarching goal of these meetings is to elicit feedback from the
users, companies, and organizations who use Ceph in their production
environments. You can find more details about the motivation behind this
effort in our user survey [1] that we highly encourage all of you to take.
This is an extension of the Ceph User Dev Meeting with concerted focus on
Performance (led by Vincent Hsu, IBM) and Orchestration/Deployment (led by
Matt Leonard, Bloomberg), to start off with. We would like to kick off this
series of meetings on March 21, 2024. The survey will be open until March
18, 2024.
Looking forward to hearing from you!
Thanks,
Neha
[1]
https://docs.google.com/forms/d/15aWxoG4wSQz7ziBaReVNYVv94jA0dSNQsDJGqmHCLM…
Hi,
I'm running Ceph Quincy (17.2.6) with a RADOS Gateway. I have multiple tenants,
for example:
- Tenant1$manager
- Tenant1$readwrite
I would like to set a policy on a bucket (backups for example) owned by
*Tenant1$manager* to allow *Tenant1$readwrite* access to that bucket. I
can't find any documentation that discusses this scenario.
Does anyone know how to specify the Principal and Resource sections of a
policy.json file? Or any other configuration that I might be missing?
I've tried some variations on Principal and Resource, including and
excluding tenant information, but no luck yet.
For example:
{
  "Version": "2012-10-17",
  "Statement": [{
    "Effect": "Allow",
    "Principal": {"AWS": ["arn:aws:iam:::user/Tenant1$readwrite"]},
    "Action": ["s3:ListBucket", "s3:GetObject", "s3:PutObject"],
    "Resource": [
      "arn:aws:s3:::Tenant1/backups"
    ]
  }]
}
I'm using s3cmd for testing, so:
s3cmd --config s3cfg.manager setpolicy policy.json s3://backups/
Returns:
s3://backups/: Policy updated
And then testing:
s3cmd --config s3cfg.readwrite ls s3://backups/
ERROR: Access to bucket 'backups' was denied
ERROR: S3 error: 403 (AccessDenied)
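One more variation I still plan to try, based on my (possibly wrong) reading that RGW expects the tenant in the account field of the IAM ARN rather than prefixed to the user name, and that ListBucket needs the bucket ARN while the object operations need bucket/*:

{
  "Version": "2012-10-17",
  "Statement": [{
    "Effect": "Allow",
    "Principal": {"AWS": ["arn:aws:iam::Tenant1:user/readwrite"]},
    "Action": ["s3:ListBucket", "s3:GetObject", "s3:PutObject"],
    "Resource": [
      "arn:aws:s3:::backups",
      "arn:aws:s3:::backups/*"
    ]
  }]
}

If anyone can confirm which ARN form RGW actually expects for tenanted users and buckets, that would already help a lot.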
Thanks,
Tom