Hi.
We're currently getting these errors, and I seem to be missing a clear overview of the cause and how to debug it.
3/26/24 9:38:09 PM [ERR] executing _write_files((['dkcphhpcadmin01', 'dkcphhpcmgt028', 'dkcphhpcmgt029', 'dkcphhpcmgt031', 'dkcphhpcosd033', 'dkcphhpcosd034', 'dkcphhpcosd035', 'dkcphhpcosd036', 'dkcphhpcosd037', 'dkcphhpcosd038', 'dkcphhpcosd039', 'dkcphhpcosd040', 'dkcphhpcosd041', 'dkcphhpcosd042', 'dkcphhpcosd043', 'dkcphhpcosd044'],)) failed.
Traceback (most recent call last):
  File "/usr/share/ceph/mgr/cephadm/ssh.py", line 240, in _write_remote_file
    await asyncssh.scp(f.name, (conn, tmp_path))
  File "/lib/python3.6/site-packages/asyncssh/scp.py", line 922, in scp
    await source.run(srcpath)
  File "/lib/python3.6/site-packages/asyncssh/scp.py", line 458, in run
    self.handle_error(exc)
  File "/lib/python3.6/site-packages/asyncssh/scp.py", line 307, in handle_error
    raise exc from None
  File "/lib/python3.6/site-packages/asyncssh/scp.py", line 456, in run
    await self._send_files(path, b'')
  File "/lib/python3.6/site-packages/asyncssh/scp.py", line 438, in _send_files
    self.handle_error(exc)
  File "/lib/python3.6/site-packages/asyncssh/scp.py", line 307, in handle_error
    raise exc from None
  File "/lib/python3.6/site-packages/asyncssh/scp.py", line 434, in _send_files
    await self._send_file(srcpath, dstpath, attrs)
  File "/lib/python3.6/site-packages/asyncssh/scp.py", line 365, in _send_file
    await self._make_cd_request(b'C', attrs, size, srcpath)
  File "/lib/python3.6/site-packages/asyncssh/scp.py", line 343, in _make_cd_request
    self._fs.basename(path))
  File "/lib/python3.6/site-packages/asyncssh/scp.py", line 224, in make_request
    raise exc
asyncssh.sftp.SFTPFailure: scp: /tmp/var/lib/ceph/5c384430-da91-11ed-af9c-c780a5227aff/config/ceph.conf.new: Permission denied

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/share/ceph/mgr/cephadm/utils.py", line 79, in do_work
    return f(*arg)
  File "/usr/share/ceph/mgr/cephadm/serve.py", line 1088, in _write_files
    self._write_client_files(client_files, host)
  File "/usr/share/ceph/mgr/cephadm/serve.py", line 1107, in _write_client_files
    self.mgr.ssh.write_remote_file(host, path, content, mode, uid, gid)
  File "/usr/share/ceph/mgr/cephadm/ssh.py", line 261, in write_remote_file
    host, path, content, mode, uid, gid, addr))
  File "/usr/share/ceph/mgr/cephadm/module.py", line 615, in wait_async
    return self.event_loop.get_result(coro)
  File "/usr/share/ceph/mgr/cephadm/ssh.py", line 56, in get_result
    return asyncio.run_coroutine_threadsafe(coro, self._loop).result()
  File "/lib64/python3.6/concurrent/futures/_base.py", line 432, in result
    return self.__get_result()
  File "/lib64/python3.6/concurrent/futures/_base.py", line 384, in __get_result
    raise self._exception
  File "/usr/share/ceph/mgr/cephadm/ssh.py", line 249, in _write_remote_file
    raise OrchestratorError(msg)
orchestrator._interface.OrchestratorError: Unable to write dkcphhpcmgt028:/var/lib/ceph/5c384430-da91-11ed-af9c-c780a5227aff/config/ceph.conf: scp: /tmp/var/lib/ceph/5c384430-da91-11ed-af9c-c780a5227aff/config/ceph.conf.new: Permission denied
3/26/24 9:38:09 PM [ERR] Unable to write dkcphhpcmgt028:/var/lib/ceph/5c384430-da91-11ed-af9c-c780a5227aff/config/ceph.conf: scp: /tmp/var/lib/ceph/5c384430-da91-11ed-af9c-c780a5227aff/config/ceph.conf.new: Permission denied
3/26/24 9:38:09 PM [INF] Updating dkcphhpcmgt028:/var/lib/ceph/5c384430-da91-11ed-af9c-c780a5227aff/config/ceph.conf
It seems to be related to the permissions with which the manager writes the files and the process that copies them around.
$ sudo ceph -v
[sudo] password for adminjskr:
ceph version 17.2.6 (d7ff0d10654d2280e08f1ab989c7cdf3064446a5) quincy (stable)
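From the traceback, cephadm stages the file under /tmp/<final path>.new before moving it into place, and the scp of that staging file is what gets "Permission denied". A quick way to test whether the SSH user cephadm connects as can actually create that staging path is a small probe like the following (the path is copied from the log above; the helper itself is a generic sketch, not cephadm's own code):

```python
import os
import tempfile

def can_create_under(path: str) -> bool:
    """Return True if the current user can create a file under `path`,
    creating intermediate directories as an scp to /tmp/<dest>.new would need."""
    try:
        os.makedirs(path, exist_ok=True)
        fd, name = tempfile.mkstemp(dir=path)
        os.close(fd)
        os.unlink(name)
        return True
    except OSError as exc:
        print(f"cannot write under {path}: {exc}")
        return False

# Path taken from the traceback; run this on the affected host as the
# user cephadm connects with.
print(can_create_under("/tmp/var/lib/ceph/5c384430-da91-11ed-af9c-c780a5227aff/config"))
```

If this prints False, check the ownership of the leftover tree with `ls -ld /tmp/var /tmp/var/lib/ceph`; a stale /tmp/var tree created by a different user (e.g. root vs. the cephadm SSH user) would produce exactly this error.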
Best regards,
Jesper Agerbo Krogh
Director Digitalization
Digitalization
Topsoe A/S
Haldor Topsøes Allé 1
2800 Kgs. Lyngby
Denmark
Phone (direct): 27773240
   
Read more at topsoe.com
On 29/05/2023 20.55, Anthony D'Atri wrote:
> Check the uptime for the OSDs in question
I restarted all my OSDs within the past 10 days or so. Maybe OSD
restarts are somehow breaking these stats?
>
>> On May 29, 2023, at 6:44 AM, Hector Martin <marcan(a)marcan.st> wrote:
>>
>> Hi,
>>
>> I'm watching a cluster finish a bunch of backfilling, and I noticed that
>> quite often PGs end up with zero misplaced objects, even though they are
>> still backfilling.
>>
>> Right now the cluster is down to 6 backfilling PGs:
>>
>> data:
>> volumes: 1/1 healthy
>> pools: 6 pools, 268 pgs
>> objects: 18.79M objects, 29 TiB
>> usage: 49 TiB used, 25 TiB / 75 TiB avail
>> pgs: 262 active+clean
>> 6 active+remapped+backfilling
>>
>> But there are no misplaced objects, and the misplaced column in `ceph pg
>> dump` is zero for all PGs.
>>
>> If I do a `ceph pg dump_json`, I can see `num_objects_recovered`
>> increasing for these PGs... but the misplaced count is still 0.
>>
>> Is there something else that would cause recoveries/backfills other than
>> misplaced objects? Or perhaps there is a bug somewhere causing the
>> misplaced object count to be misreported as 0 sometimes?
>>
>> # ceph -v
>> ceph version 17.2.6 (d7ff0d10654d2280e08f1ab989c7cdf3064446a5) quincy
>> (stable)
>>
>> - Hector
>> _______________________________________________
>> ceph-users mailing list -- ceph-users(a)ceph.io
>> To unsubscribe send an email to ceph-users-leave(a)ceph.io
>
>
- Hector
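For cross-checking Hector's observation, the counters he mentions can be pulled out of `ceph pg dump_json` programmatically. A sketch like the following flags PGs that are in a backfilling state while reporting zero misplaced objects (the `pg_map`/`pg_stats`/`stat_sum` field names are taken from the standard pg dump JSON layout; verify them against your release):

```python
def backfilling_with_zero_misplaced(dump: dict) -> list:
    """Return pgids that are backfilling but report 0 misplaced objects."""
    suspicious = []
    for pg in dump["pg_map"]["pg_stats"]:
        if "backfilling" in pg["state"] and pg["stat_sum"]["num_objects_misplaced"] == 0:
            suspicious.append(pg["pgid"])
    return suspicious

# Tiny hand-made sample standing in for `ceph pg dump_json` output:
sample = {"pg_map": {"pg_stats": [
    {"pgid": "2.1a", "state": "active+remapped+backfilling",
     "stat_sum": {"num_objects_misplaced": 0, "num_objects_recovered": 1234}},
    {"pgid": "2.1b", "state": "active+clean",
     "stat_sum": {"num_objects_misplaced": 0, "num_objects_recovered": 0}},
]}}
print(backfilling_with_zero_misplaced(sample))  # ['2.1a']
```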
Hi,
as the documentation sends mixed signals in
https://docs.ceph.com/en/latest/rados/configuration/network-config-ref/#ipv…
"Note
Binding to IPv4 is enabled by default, so if you just add the option to
bind to IPv6 you’ll actually put yourself into dual stack mode."
and
https://docs.ceph.com/en/latest/rados/configuration/msgr2/#address-formats
"Note
The ability to bind to multiple ports has paved the way for dual-stack
IPv4 and IPv6 support. That said, dual-stack operation is not yet
supported as of Quincy v17.2.0."
just the quick questions:
Is dual-stack networking with IPv4 and IPv6 now supported or not?
From which version on is it considered stable?
Are OSDs now able to register themselves with two IP addresses in the
cluster map? MONs too?
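For reference, the bind options the first documentation note refers to are `ms_bind_ipv4` and `ms_bind_ipv6`; the dual-stack configuration it describes (both enabled) would look like this in ceph.conf, and whether that mode is actually supported per release is exactly the question above:

```
[global]
ms_bind_ipv4 = true
ms_bind_ipv6 = true
```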
Regards
--
Robert Sander
Heinlein Consulting GmbH
Schwedter Str. 8/9b, 10119 Berlin
https://www.heinlein-support.de
Tel: 030 / 405051-43
Fax: 030 / 405051-19
Amtsgericht Berlin-Charlottenburg - HRB 220009 B
Geschäftsführer: Peer Heinlein - Sitz: Berlin
Hello,
there is a user in our Ceph cluster who is suddenly unable to write to
one of his buckets.
Reading works fine.
All other buckets work fine.
If we copy the bucket to another bucket on the same cluster, the error
persists: writing is not possible in the new bucket either.
Interesting: If we copy the contents of the bucket to a bucket in
another Ceph cluster the error is gone.
So now we know how to work around it, but we cannot find the root cause.
I checked the policies, lifecycle and versioning.
Nothing. The user has FULL_CONTROL. Same settings for the user's other
buckets he can still write to.
When setting the debug level higher, all I can see is something like
this while trying to write to the bucket:
s3:put_obj reading permissions
s3:put_obj init op
s3:put_obj verifying op mask
s3:put_obj verifying op permissions
op->ERRORHANDLER: err_no=-13 new_err_no=-13
cache get: name=default.rgw.log++script.postrequest. : hit (negative entry)
s3:put_obj op status=0
s3:put_obj http status=403
1 ====== req done req=0x7fe8bb60a710 op status=0 http_status=403
latency=0.000000000s ======
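One small decoding hint: the `err_no=-13` in that trace is a negated POSIX errno, i.e. EACCES ("Permission denied"), which is consistent with RGW rejecting the request at the permission-verification step rather than, say, a quota or I/O problem:

```python
import errno
import os

code = 13  # from "err_no=-13" in the RGW debug log
print(errno.errorcode[code])  # EACCES
print(os.strerror(code))      # the human-readable message for errno 13
```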
I still think this is something with a policy or similar. When we copy
the bucket to another bucket in the same cluster, writing to the new
bucket works at first, but at some point while the copy job progresses,
writing becomes impossible again.
But what is it?
Best,
Malte
Hi everyone,
On behalf of the Ceph Foundation Board, I would like to announce the
creation of, and cordially invite you to, the first of a recurring series
of meetings focused solely on gathering feedback from the users of
Ceph. The overarching goal of these meetings is to elicit feedback from the
users, companies, and organizations who use Ceph in their production
environments. You can find more details about the motivation behind this
effort in our user survey [1] that we highly encourage all of you to take.
This is an extension of the Ceph User Dev Meeting with concerted focus on
Performance (led by Vincent Hsu, IBM) and Orchestration/Deployment (led by
Matt Leonard, Bloomberg), to start off with. We would like to kick off this
series of meetings on March 21, 2024. The survey will be open until March
18, 2024.
Looking forward to hearing from you!
Thanks,
Neha
[1]
https://docs.google.com/forms/d/15aWxoG4wSQz7ziBaReVNYVv94jA0dSNQsDJGqmHCLM…
Hi,
I'm running Ceph Quincy (17.2.6) with a RADOS Gateway. I have multiple
tenants, for example:
- Tenant1$manager
- Tenant1$readwrite
I would like to set a policy on a bucket (backups for example) owned by
*Tenant1$manager* to allow *Tenant1$readwrite* access to that bucket. I
can't find any documentation that discusses this scenario.
Does anyone know how to specify the Principal and Resource sections of a
policy.json file? Or is there any other configuration that I might be missing?
I've tried some variations on Principal and Resource, including and
excluding tenant information, but no luck yet.
For example:
{
  "Version": "2012-10-17",
  "Statement": [{
    "Effect": "Allow",
    "Principal": {"AWS": ["arn:aws:iam:::user/Tenant1$readwrite"]},
    "Action": ["s3:ListBucket", "s3:GetObject", "s3:PutObject"],
    "Resource": [
      "arn:aws:s3:::Tenant1/backups"
    ]
  }]
}
I'm using s3cmd for testing, so:
s3cmd --config s3cfg.manager setpolicy policy.json s3://backups/
Returns:
s3://backups/: Policy updated
And then testing:
s3cmd --config s3cfg.readwrite ls s3://backups/
ERROR: Access to bucket 'backups' was denied
ERROR: S3 error: 403 (AccessDenied)
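Two things worth checking against the RGW bucket-policy documentation: the tenant usually goes into the account field of the IAM ARN rather than into the user name, and GetObject/PutObject need an object-level ARN in addition to the bucket ARN for ListBucket. A variant worth trying (treat the exact ARN forms as an assumption to verify against the docs for your release) would be:

```
{
  "Version": "2012-10-17",
  "Statement": [{
    "Effect": "Allow",
    "Principal": {"AWS": ["arn:aws:iam::Tenant1:user/readwrite"]},
    "Action": ["s3:ListBucket", "s3:GetObject", "s3:PutObject"],
    "Resource": [
      "arn:aws:s3:::backups",
      "arn:aws:s3:::backups/*"
    ]
  }]
}
```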
Thanks,
Tom
Hi All,
I've been battling this for a while and I'm not sure where to go from
here. I have a Ceph health warning as such:
# ceph -s
cluster:
id: 58bde08a-d7ed-11ee-9098-506b4b4da440
health: HEALTH_WARN
1 MDSs report slow requests
1 MDSs behind on trimming
services:
mon: 5 daemons, quorum
pr-md-01,pr-md-02,pr-store-01,pr-store-02,pr-md-03 (age 5d)
mgr: pr-md-01.jemmdf(active, since 3w), standbys: pr-md-02.emffhz
mds: 1/1 daemons up, 2 standby
osd: 46 osds: 46 up (since 9h), 46 in (since 2w)
data:
volumes: 1/1 healthy
pools: 4 pools, 1313 pgs
objects: 260.72M objects, 466 TiB
usage: 704 TiB used, 424 TiB / 1.1 PiB avail
pgs: 1306 active+clean
4 active+clean+scrubbing+deep
3 active+clean+scrubbing
io:
client: 123 MiB/s rd, 75 MiB/s wr, 109 op/s rd, 1.40k op/s wr
And the specifics are:
# ceph health detail
HEALTH_WARN 1 MDSs report slow requests; 1 MDSs behind on trimming
[WRN] MDS_SLOW_REQUEST: 1 MDSs report slow requests
mds.slugfs.pr-md-01.xdtppo(mds.0): 99 slow requests are blocked >
30 secs
[WRN] MDS_TRIM: 1 MDSs behind on trimming
mds.slugfs.pr-md-01.xdtppo(mds.0): Behind on trimming (13884/250)
max_segments: 250, num_segments: 13884
That "num_segments" number slowly keeps increasing. I suspect I just
need to tell the MDS servers to trim faster but after hours of googling
around I just can't figure out the best way to do it. The best I could
come up with was to decrease "mds_cache_trim_decay_rate" from 1.0 to .8
(to start), based on this page:
https://www.suse.com/support/kb/doc/?id=000019740
But it doesn't seem to help, maybe I should decrease it further? I am
guessing this must be a common issue...? I am running Reef on the MDS
servers, but most clients are on Quincy.
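Journal trimming often stalls because the same stuck requests behind the MDS_SLOW_REQUEST warning are pinning old log segments, so chasing the blocked ops can be more productive than tuning the trim rate. A sketch of where to look (the mds name is taken from the output above; verify the exact admin-socket command names against your release):

```
# on the host running the active MDS: list in-flight/slow requests
ceph daemon mds.slugfs.pr-md-01.xdtppo ops

# optionally raise the segment ceiling while investigating, so the
# warning threshold is not the thing you are fighting
ceph config set mds mds_log_max_segments 500
```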
Thanks for any advice!
cheers,
erich
Hello Ceph List,
I'd like to formally let the wider community know of some work I've been
involved with for a while now: adding Managed SMB Protocol Support to Ceph.
SMB being the well known network file protocol native to Windows systems and
supported by macOS (and Linux). The other key word, "managed", means
integrating with Ceph management tooling - in this particular case cephadm for
orchestration and, eventually, a new MGR module for managing SMB shares.
The effort is still in its very early stages. We have a PR adding initial
support for Samba Containers to cephadm [1] and a prototype for an smb MGR
module [2]. We plan on using container images based on the samba-container
project [3] - a team I am already part of. What we're aiming for is a feature
set similar to the current NFS integration in Ceph, but with a focus on
bridging non-Linux/Unix clients to CephFS using a protocol built into those
systems.
A few major features we have planned include:
* Standalone servers (internally defined users/groups)
* Active Directory Domain Member Servers
* Clustered Samba support
* Exporting Samba stats via Prometheus metrics
* A `ceph` cli workflow loosely based on the nfs mgr module
I wanted to share this information in case there's wider community interest in
this effort. I'm happy to take your questions / thoughts / suggestions in this
email thread, via Ceph slack (or IRC), or feel free to attend a Ceph
Orchestration weekly meeting! I try to attend regularly, and we sometimes discuss
design aspects of the smb effort there. It's on the Ceph Community Calendar.
Thanks!
[1] - https://github.com/ceph/ceph/pull/55068
[2] - https://github.com/ceph/ceph/pull/56350
[3] - https://github.com/samba-in-kubernetes/samba-container/
Thanks for reading,
--John Mulligan
I have a virtual Ceph cluster running 17.2.6 with 4 Ubuntu 22.04 hosts in it, each with 4 OSDs attached. The first 2 servers, which host the mgrs, have 32 GB of RAM each, and the remaining hosts have 24 GB.
For some reason I am unable to identify, the first host in the cluster appears to constantly be trying to set the osd_memory_target variable to roughly half of the calculated minimum for the cluster. I see the following spamming the logs constantly:
Unable to set osd_memory_target on my-ceph01 to 480485376: error parsing value: Value '480485376' is below minimum 939524096
Default is set to 4294967296.
I did double-check, and osd_memory_base (805306368) + osd_memory_cache_min (134217728) adds up exactly to the minimum.
osd_memory_target_autotune is currently enabled, but I cannot for the life of me figure out how it is arriving at 480485376 as a value for that particular host, which even has the most RAM. Neither the cluster nor the host is anywhere near maximum memory utilization, so it's not as if processes are competing for resources.
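Since the autotuner's target is roughly (host memory x autotune ratio, minus reservations for non-OSD daemons) divided by the number of OSDs on the host (a simplified sketch of cephadm's formula, not the exact code), it can help to invert the observed value: 480485376 x 4 OSDs is only about 1.8 GiB, far below what 32 GiB at the default ratio of 0.7 should yield, which points at cephadm either detecting much less memory on that host than expected or reserving most of it for the colocated mon/mgr daemons:

```python
GIB = 1024 ** 3

observed_target = 480_485_376   # value from the log
num_osds = 4                    # OSDs on the host

# Budget the autotuner apparently split among the OSDs:
implied_budget = observed_target * num_osds
print(f"implied OSD budget: {implied_budget / GIB:.2f} GiB")  # ~1.79 GiB

# What we'd naively expect before per-daemon reservations
# (0.7 is the default mgr/cephadm/autotune_memory_target_ratio):
host_mem = 32 * GIB
ratio = 0.7
print(f"expected budget: {host_mem * ratio / GIB:.1f} GiB")   # 22.4 GiB
```

Comparing `cephadm gather-facts` memory figures and the list of daemons on that host against this arithmetic should show which of the two inputs is off.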