Hi,
yesterday we changed RGW from civetweb to beast and at 04:02 RGW stopped
working; we had to restart it in the morning.
In one RGW log, for the previous day, we can see:
2023-10-06T04:02:01.105+0200 7fb71d45d700 -1 received signal: Hangup from
killall -q -1 ceph-mon ceph-mgr ceph-mds ceph-osd ceph-fuse radosgw
rbd-mirror cephfs-mirror (PID: 3202663) UID: 0
and in the next day's log we can see:
2023-10-06T04:02:01.133+0200 7fb71d45d700 -1 received signal: Hangup from
(PID: 3202664) UID: 0
and after that no requests came through. We had to restart RGW.
In ceph.conf we have something like
[client.radosgw.ctplmon2]
host = ctplmon2
log_file = /var/log/ceph/client.radosgw.ctplmon2.log
rgw_dns_name = ctplmon2
rgw_frontends = "beast ssl_endpoint=0.0.0.0:4443 ssl_certificate=..."
rgw_max_put_param_size = 15728640
We assume it has something to do with logrotate.
/etc/logrotate.d/ceph:
/var/log/ceph/*.log {
    rotate 90
    daily
    compress
    sharedscripts
    postrotate
        killall -q -1 ceph-mon ceph-mgr ceph-mds ceph-osd ceph-fuse radosgw rbd-mirror cephfs-mirror || pkill -1 -x "ceph-mon|ceph-mgr|ceph-mds|ceph-osd|ceph-fuse|radosgw|rbd-mirror|cephfs-mirror" || true
    endscript
    missingok
    notifempty
    su root ceph
}
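If the SIGHUP from logrotate's postrotate is indeed what makes beast stop serving requests, a possible workaround (untested on our side; the admin socket path below is an assumption based on the defaults) would be to reopen the RGW log via the admin socket instead of signalling radosgw:

  ceph --admin-daemon /var/run/ceph/ceph-client.radosgw.ctplmon2.asok log reopen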
ceph version 16.2.14 (238ba602515df21ea7ffc75c88db29f9e5ef12c9) pacific
(stable)
Any ideas why this happened?
Kind regards,
Rok
Hi,
Our Ceph 16.2.x cluster managed by cephadm is logging a lot of very
detailed messages; the Ceph logs alone, on hosts with monitors and several OSDs,
have already eaten through 50% of the endurance of the flash system drives
over a couple of years.
Cluster logging settings are at their defaults, and it seems that all daemons are
writing lots and lots of debug information to the logs, for
example: https://pastebin.com/ebZq8KZk (it's just a snippet, but there's
much more of the same).
Is there a way to reduce the amount of logging and, for example, limit the
logging to warnings or important messages, so that it doesn't include every
successful authentication attempt, compaction, etc., when the cluster is
healthy and operating normally?
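For reference, these are the kinds of settings I have been considering, though I am not sure they are the right knobs (treat them as a guess rather than a recommendation):

  ceph config set mon mon_cluster_log_file_level info
  ceph config set global debug_ms 0/0
  ceph config set mon debug_mon 1/5
  ceph config set osd debug_osd 1/5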
I would very much appreciate your advice on this.
Best regards,
Zakhar
Hi folks,
I am aware that dynamic resharding isn't supported before Reef with multisite. However, does manual resharding work? It doesn't seem to, either. First of all, running "bucket reshard" has to be done in the master zone. But if the objects of that bucket aren't in the master zone, resharding in the master zone seems to render those objects inaccessible in the zone that actually has them. So, what is the recommended practice for resharding with multisite? No resharding at all?
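For reference, what I tried in the master zone was along these lines (bucket name and shard count are placeholders):

  radosgw-admin bucket reshard --bucket=mybucket --num-shards=101
  radosgw-admin reshard status --bucket=mybucket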
Thanks,
Yixin
Hi
we're still struggling to get our ceph to HEALTH_OK. We're
having compounded issues interfering with recovery, as I understand it.
To summarize, we have a cluster of 22 OSD nodes running ceph 16.2.x.
About a month back we had one of the OSD nodes break down (just the OS disk,
but we didn't have a cold spare available, so it took a week to get it
fixed). Since the failure of the node, ceph has of course been repairing the
situation, but then it became a problem that our OSDs are
really unevenly balanced (lowest below 50%, highest around 85%). So
whenever a disk fails (and there have been 2 since then), the load spreads
over the other OSDs and our fullest OSDs go over the 85% threshold,
slowing down recovery, normal use and rebalancing.
We had issues with degraded PGs that weren't being repaired
(because we had turned on scrubbing during recovery, since we were getting
warnings that lots of PGs weren't being scrubbed in time).
Now there's still one PG degraded because one object is
unfound. This whole error state has been going on far too long, and while it
drags on I was wondering why the balancer wasn't doing its job. It turns
out the balancer depends on the cluster being OK, or at least not having
anything degraded in it. However, the balancer hadn't done its job even when our
cluster was healthy for a long time before this; we added some 8 nodes a few
years ago and the newer nodes still have the lowest-used OSDs.
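For what it's worth, this is roughly how I have been checking and (re)enabling the balancer; whether upmap mode is the right choice for our setup is an assumption on my part:

  ceph balancer status
  ceph balancer mode upmap
  ceph balancer on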
Our cluster is at about 70-71% usage overall, but with this unbalanced
situation we cannot grow any more. Between the single node issue (now
resolved) and the ongoing disk failures (we are seeing a handful of OSDs
with read-repaired messages), it looks like we can't get back to HEALTH_OK
for a while.
I'm trying to mitigate this by reweighting the fullest OSDs, but the
fuller OSDs keep going over the threshold, while the emptiest OSDs have
plenty of space (just 55% full now).
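Concretely, the reweighting I have been doing is along these lines (the OSD id, weight and threshold are just examples):

  ceph osd reweight 42 0.90
  # or, to let ceph pick the most over-full OSDs:
  ceph osd test-reweight-by-utilization 120
  ceph osd reweight-by-utilization 120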
If you read this far ;-) I'm wondering: can I force-repair a PG, bypassing
all the restrictions, so that it doesn't block automatic rebalancing?
It seems to me that would help, but perhaps there are other things
I can do as well?
(Budget wise, adding more OSD nodes is a bit difficult at the moment...)
Thanks for reading!
Cheers
/Simon
Dear All,
I hope you are all well. I would like to introduce new tools I have developed, named "LBA tools", which include hd_write_verify & hd_write_verify_dump.
github: https://github.com/zhangyoujia/hd_write_verify
pdf: https://github.com/zhangyoujia/hd_write_verify/DISK&MEMORY stability testing and DATA consistency verifying tools and system.pdf
ppt: https://github.com/zhangyoujia/hd_write_verify/存储稳定性测试与数据一致性校验工具和系统.pptx
bin: https://github.com/zhangyoujia/hd_write_verify/bin
iso: https://github.com/zhangyoujia/hd_write_verify/iso
Data is a vital asset for many businesses, making storage stability and data consistency the most fundamental requirements in storage technology scenarios.
The purpose of storage stability testing is to ensure that storage devices or systems can operate normally and remain stable over time, while also handling various abnormal situations such as sudden power outages and network failures. This testing typically includes stress testing, load testing, fault tolerance testing, and other evaluations to assess the performance and reliability of the storage system.
Data consistency checking is designed to ensure that the data stored in the system is accurate and consistent. This means that whenever data changes occur, all replicas should be updated simultaneously to avoid data inconsistency. Data consistency checking typically involves aspects such as data integrity, accuracy, consistency, and reliability.
LBA tools are very useful for testing storage stability and verifying data consistency; they are much better than the verification functions of FIO & vdbench.
I believe that LBA tools will have a positive impact on the community and help users handle storage data more effectively. Your feedback and suggestions are greatly appreciated, and I hope you can try using LBA tools and share your experiences and recommendations.
Best regards
Hi
we're still in HEALTH_ERR state with our cluster; this is the top of the
output of `ceph health detail`:
HEALTH_ERR 1/846829349 objects unfound (0.000%); 248 scrub errors;
Possible data damage: 1 pg recovery_unfound, 2 pgs inconsistent;
Degraded data redundancy: 6/7118781559 objects degraded (0.000%), 1 pg
degraded, 1 pg undersized; 63 pgs not deep-scrubbed in time; 657 pgs not
scrubbed in time
[WRN] OBJECT_UNFOUND: 1/846829349 objects unfound (0.000%)
pg 26.323 has 1 unfound objects
[ERR] OSD_SCRUB_ERRORS: 248 scrub errors
[ERR] PG_DAMAGED: Possible data damage: 1 pg recovery_unfound, 2 pgs
inconsistent
pg 26.323 is active+recovery_unfound+degraded+remapped, acting
[92,109,116,70,158,128,243,189,256], 1 unfound
pg 26.337 is active+clean+inconsistent, acting
[139,137,48,126,165,89,237,199,189]
pg 26.3e2 is active+clean+inconsistent, acting
[12,27,24,234,195,173,98,32,35]
[WRN] PG_DEGRADED: Degraded data redundancy: 6/7118781559 objects
degraded (0.000%), 1 pg degraded, 1 pg undersized
pg 13.3a5 is stuck undersized for 4m, current state
active+undersized+remapped+backfilling, last acting
[2,45,32,62,2147483647,55,116,25,225,202,240]
pg 26.323 is active+recovery_unfound+degraded+remapped, acting
[92,109,116,70,158,128,243,189,256], 1 unfound
For the PG_DAMAGED PGs I tried the usual `ceph pg repair 26.323` etc.,
however they fail to get resolved.
osd.116 is already marked out and is slowly being emptied. I've
tried restarting the OSD processes of the first OSD listed for each PG,
but that doesn't get it resolved either.
I guess we should have enough redundancy to get the correct data back,
but how can I tell ceph to fix it in order to get back to a healthy state?
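What I have been looking at, but haven't dared to run yet because I'm unsure about the consequences, is roughly the following (the mark_unfound_lost step is only meant as a last resort):

  ceph pg 26.323 list_unfound
  rados list-inconsistent-obj 26.337 --format=json-pretty
  ceph pg 26.323 mark_unfound_lost revert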
Cheers
/Simon
Hi ceph users,
We have a few clusters with quincy 17.2.6 and we are preparing to migrate from ceph-deploy to cephadm for better management.
We are using Ubuntu 20 with the latest updates (latest OpenSSH).
While testing the migration to cephadm on a test cluster with octopus (v16 latest) we had no issues replacing the ceph-generated cert/key with our own CA-signed certs (ECDSA).
After upgrading the test cluster to quincy and testing the migration again, we cannot add hosts due to the errors below (SSH access errors like those reported a while ago in a tracker).
We use the following type of certs:
Type: ecdsa-sha2-nistp384-cert-v01@openssh.com user certificate
The certificate works every time when using the ssh client from a shell to connect to all hosts in the cluster.
We do a `ceph mgr fail` every time we replace the cert/key so that the mgr is restarted.
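For completeness, the key/cert replacement on the mgr side is done roughly like this; the file names are placeholders, and feeding the signed certificate through set-pub-key is my reading of the mechanism rather than a documented procedure:

  ceph cephadm set-priv-key -i cephadm_ecdsa_key
  ceph cephadm set-pub-key -i cephadm_ecdsa_key-cert.pub
  ceph mgr fail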
----- cephadm logs from mgr ------
Oct 06 09:23:27 ceph-m2 bash[1363]: Log: Opening SSH connection to 10.10.10.232, port 22
Oct 06 09:23:27 ceph-m2 bash[1363]: [conn=3] Connected to SSH server at 10.10.10.232, port 22
Oct 06 09:23:27 ceph-m2 bash[1363]: [conn=3] Local address: 10.10.12.160, port 51870
Oct 06 09:23:27 ceph-m2 bash[1363]: [conn=3] Peer address: 10.10.10.232, port 22
Oct 06 09:23:27 ceph-m2 bash[1363]: [conn=3] Beginning auth for user root
Oct 06 09:23:27 ceph-m2 bash[1363]: [conn=3] Auth failed for user root
Oct 06 09:23:27 ceph-m2 bash[1363]: [conn=3] Connection failure: Permission denied
Oct 06 09:23:27 ceph-m2 bash[1363]: [conn=3] Aborting connection
Oct 06 09:23:27 ceph-m2 bash[1363]: Traceback (most recent call last):
Oct 06 09:23:27 ceph-m2 bash[1363]: File "/usr/share/ceph/mgr/cephadm/ssh.py", line 111, in redirect_log
Oct 06 09:23:27 ceph-m2 bash[1363]: yield
Oct 06 09:23:27 ceph-m2 bash[1363]: File "/usr/share/ceph/mgr/cephadm/ssh.py", line 90, in _remote_connection
Oct 06 09:23:27 ceph-m2 bash[1363]: preferred_auth=['publickey'], options=ssh_options)
Oct 06 09:23:27 ceph-m2 bash[1363]: File "/lib/python3.6/site-packages/asyncssh/connection.py", line 6804, in connect
Oct 06 09:23:27 ceph-m2 bash[1363]: 'Opening SSH connection to')
Oct 06 09:23:27 ceph-m2 bash[1363]: File "/lib/python3.6/site-packages/asyncssh/connection.py", line 303, in _connect
Oct 06 09:23:27 ceph-m2 bash[1363]: await conn.wait_established()
Oct 06 09:23:27 ceph-m2 bash[1363]: File "/lib/python3.6/site-packages/asyncssh/connection.py", line 2243, in wait_established
Oct 06 09:23:27 ceph-m2 bash[1363]: await self._waiter
Oct 06 09:23:27 ceph-m2 bash[1363]: asyncssh.misc.PermissionDenied: Permission denied
Oct 06 09:23:27 ceph-m2 bash[1363]: During handling of the above exception, another exception occurred:
Oct 06 09:23:27 ceph-m2 bash[1363]: Traceback (most recent call last):
Oct 06 09:23:27 ceph-m2 bash[1363]: File "/usr/share/ceph/mgr/orchestrator/_interface.py", line 125, in wrapper
Oct 06 09:23:27 ceph-m2 bash[1363]: return OrchResult(f(*args, **kwargs))
Oct 06 09:23:27 ceph-m2 bash[1363]: File "/usr/share/ceph/mgr/cephadm/module.py", line 2810, in apply
Oct 06 09:23:27 ceph-m2 bash[1363]: results.append(self._apply(spec))
Oct 06 09:23:27 ceph-m2 bash[1363]: File "/usr/share/ceph/mgr/cephadm/module.py", line 2558, in _apply
Oct 06 09:23:27 ceph-m2 bash[1363]: return self._add_host(cast(HostSpec, spec))
Oct 06 09:23:27 ceph-m2 bash[1363]: File "/usr/share/ceph/mgr/cephadm/module.py", line 1434, in _add_host
Oct 06 09:23:27 ceph-m2 bash[1363]: ip_addr = self._check_valid_addr(spec.hostname, spec.addr)
Oct 06 09:23:27 ceph-m2 bash[1363]: File "/usr/share/ceph/mgr/cephadm/module.py", line 1415, in _check_valid_addr
Oct 06 09:23:27 ceph-m2 bash[1363]: error_ok=True, no_fsid=True))
Oct 06 09:23:27 ceph-m2 bash[1363]: File "/usr/share/ceph/mgr/cephadm/module.py", line 615, in wait_async
Oct 06 09:23:27 ceph-m2 bash[1363]: return self.event_loop.get_result(coro)
Oct 06 09:23:27 ceph-m2 bash[1363]: File "/usr/share/ceph/mgr/cephadm/ssh.py", line 56, in get_result
Oct 06 09:23:27 ceph-m2 bash[1363]: return asyncio.run_coroutine_threadsafe(coro, self._loop).result()
Oct 06 09:23:27 ceph-m2 bash[1363]: File "/lib64/python3.6/concurrent/futures/_base.py", line 432, in result
Oct 06 09:23:27 ceph-m2 bash[1363]: return self.__get_result()
Oct 06 09:23:27 ceph-m2 bash[1363]: File "/lib64/python3.6/concurrent/futures/_base.py", line 384, in __get_result
Oct 06 09:23:27 ceph-m2 bash[1363]: raise self._exception
Oct 06 09:23:27 ceph-m2 bash[1363]: File "/usr/share/ceph/mgr/cephadm/serve.py", line 1361, in _run_cephadm
Oct 06 09:23:27 ceph-m2 bash[1363]: await self.mgr.ssh._remote_connection(host, addr)
Oct 06 09:23:27 ceph-m2 bash[1363]: File "/usr/share/ceph/mgr/cephadm/ssh.py", line 96, in _remote_connection
Oct 06 09:23:27 ceph-m2 bash[1363]: raise
Oct 06 09:23:27 ceph-m2 bash[1363]: File "/lib64/python3.6/contextlib.py", line 99, in __exit__
Oct 06 09:23:27 ceph-m2 bash[1363]: self.gen.throw(type, value, traceback)
Oct 06 09:23:27 ceph-m2 bash[1363]: File "/usr/share/ceph/mgr/cephadm/ssh.py", line 123, in redirect_log
Oct 06 09:23:27 ceph-m2 bash[1363]: raise HostConnectionError(msg, host, addr)
Oct 06 09:23:27 ceph-m2 bash[1363]: cephadm.ssh.HostConnectionError: Failed to connect to ceph-m1 (10.10.10.232). Permission denied
Oct 06 09:23:27 ceph-m2 bash[1363]: Log: Opening SSH connection to 10.10.10.232, port 22
Oct 06 09:23:27 ceph-m2 bash[1363]: [conn=3] Connected to SSH server at 10.10.10.232, port 22
Oct 06 09:23:27 ceph-m2 bash[1363]: [conn=3] Local address: 10.10.12.160, port 51870
Oct 06 09:23:27 ceph-m2 bash[1363]: [conn=3] Peer address: 10.10.10.232, port 22
Oct 06 09:23:27 ceph-m2 bash[1363]: [conn=3] Beginning auth for user root
Oct 06 09:23:27 ceph-m2 bash[1363]: [conn=3] Auth failed for user root
Oct 06 09:23:27 ceph-m2 bash[1363]: [conn=3] Connection failure: Permission denied
Oct 06 09:23:27 ceph-m2 bash[1363]: [conn=3] Aborting connection
Oct 06 09:23:27 ceph-m2 bash[1363]: debug 2023-10-06T09:23:27.081+0000 7f78d86d8700 -1 log_channel(cephadm) log [ERR] : Failed to connect to ceph-m1 (10.10.10.232). Permission denied
Oct 06 09:23:27 ceph-m2 bash[1363]: Log: Opening SSH connection to 10.10.10.232, port 22
Oct 06 09:23:27 ceph-m2 bash[1363]: [conn=3] Connected to SSH server at 10.10.10.232, port 22
Oct 06 09:23:27 ceph-m2 bash[1363]: [conn=3] Local address: 10.10.12.160, port 51870
Oct 06 09:23:27 ceph-m2 bash[1363]: [conn=3] Peer address: 10.10.10.232, port 22
Oct 06 09:23:27 ceph-m2 bash[1363]: [conn=3] Beginning auth for user root
Oct 06 09:23:27 ceph-m2 bash[1363]: [conn=3] Auth failed for user root
Oct 06 09:23:27 ceph-m2 bash[1363]: [conn=3] Connection failure: Permission denied
Oct 06 09:23:27 ceph-m2 bash[1363]: [conn=3] Aborting connection
Oct 06 09:23:27 ceph-m2 bash[1363]: Traceback (most recent call last):
Oct 06 09:23:27 ceph-m2 bash[1363]: File "/usr/share/ceph/mgr/cephadm/ssh.py", line 111, in redirect_log
Oct 06 09:23:27 ceph-m2 bash[1363]: yield
Oct 06 09:23:27 ceph-m2 bash[1363]: File "/usr/share/ceph/mgr/cephadm/ssh.py", line 90, in _remote_connection
Oct 06 09:23:27 ceph-m2 bash[1363]: preferred_auth=['publickey'], options=ssh_options)
Oct 06 09:23:27 ceph-m2 bash[1363]: File "/lib/python3.6/site-packages/asyncssh/connection.py", line 6804, in connect
Oct 06 09:23:27 ceph-m2 bash[1363]: 'Opening SSH connection to')
Oct 06 09:23:27 ceph-m2 bash[1363]: File "/lib/python3.6/site-packages/asyncssh/connection.py", line 303, in _connect
Oct 06 09:23:27 ceph-m2 bash[1363]: await conn.wait_established()
Oct 06 09:23:27 ceph-m2 bash[1363]: File "/lib/python3.6/site-packages/asyncssh/connection.py", line 2243, in wait_established
Oct 06 09:23:27 ceph-m2 bash[1363]: await self._waiter
Oct 06 09:23:27 ceph-m2 bash[1363]: asyncssh.misc.PermissionDenied: Permission denied
Oct 06 09:23:27 ceph-m2 bash[1363]: During handling of the above exception, another exception occurred:
Oct 06 09:23:27 ceph-m2 bash[1363]: Traceback (most recent call last):
Oct 06 09:23:27 ceph-m2 bash[1363]: File "/usr/share/ceph/mgr/orchestrator/_interface.py", line 125, in wrapper
Oct 06 09:23:27 ceph-m2 bash[1363]: return OrchResult(f(*args, **kwargs))
Oct 06 09:23:27 ceph-m2 bash[1363]: File "/usr/share/ceph/mgr/cephadm/module.py", line 2810, in apply
Oct 06 09:23:27 ceph-m2 bash[1363]: results.append(self._apply(spec))
Oct 06 09:23:27 ceph-m2 bash[1363]: File "/usr/share/ceph/mgr/cephadm/module.py", line 2558, in _apply
Oct 06 09:23:27 ceph-m2 bash[1363]: return self._add_host(cast(HostSpec, spec))
Oct 06 09:23:27 ceph-m2 bash[1363]: File "/usr/share/ceph/mgr/cephadm/module.py", line 1434, in _add_host
Oct 06 09:23:27 ceph-m2 bash[1363]: ip_addr = self._check_valid_addr(spec.hostname, spec.addr)
Oct 06 09:23:27 ceph-m2 bash[1363]: File "/usr/share/ceph/mgr/cephadm/module.py", line 1415, in _check_valid_addr
Oct 06 09:23:27 ceph-m2 bash[1363]: error_ok=True, no_fsid=True))
Oct 06 09:23:27 ceph-m2 bash[1363]: File "/usr/share/ceph/mgr/cephadm/module.py", line 615, in wait_async
Oct 06 09:23:27 ceph-m2 bash[1363]: return self.event_loop.get_result(coro)
Oct 06 09:23:27 ceph-m2 bash[1363]: File "/usr/share/ceph/mgr/cephadm/ssh.py", line 56, in get_result
Oct 06 09:23:27 ceph-m2 bash[1363]: return asyncio.run_coroutine_threadsafe(coro, self._loop).result()
Oct 06 09:23:27 ceph-m2 bash[1363]: File "/lib64/python3.6/concurrent/futures/_base.py", line 432, in result
Oct 06 09:23:27 ceph-m2 bash[1363]: return self.__get_result()
Oct 06 09:23:27 ceph-m2 bash[1363]: File "/lib64/python3.6/concurrent/futures/_base.py", line 384, in __get_result
Oct 06 09:23:27 ceph-m2 bash[1363]: raise self._exception
Oct 06 09:23:27 ceph-m2 bash[1363]: File "/usr/share/ceph/mgr/cephadm/serve.py", line 1361, in _run_cephadm
Oct 06 09:23:27 ceph-m2 bash[1363]: await self.mgr.ssh._remote_connection(host, addr)
Oct 06 09:23:27 ceph-m2 bash[1363]: File "/usr/share/ceph/mgr/cephadm/ssh.py", line 96, in _remote_connection
Oct 06 09:23:27 ceph-m2 bash[1363]: raise
Oct 06 09:23:27 ceph-m2 bash[1363]: File "/lib64/python3.6/contextlib.py", line 99, in __exit__
Oct 06 09:23:27 ceph-m2 bash[1363]: self.gen.throw(type, value, traceback)
Oct 06 09:23:27 ceph-m2 bash[1363]: File "/usr/share/ceph/mgr/cephadm/ssh.py", line 123, in redirect_log
Oct 06 09:23:27 ceph-m2 bash[1363]: raise HostConnectionError(msg, host, addr)
Oct 06 09:23:27 ceph-m2 bash[1363]: cephadm.ssh.HostConnectionError: Failed to connect to ceph-m1 (10.10.10.232). Permission denied
Oct 06 09:23:27 ceph-m2 bash[1363]: Log: Opening SSH connection to 10.10.10.232, port 22
Oct 06 09:23:27 ceph-m2 bash[1363]: [conn=3] Connected to SSH server at 10.10.10.232, port 22
Oct 06 09:23:27 ceph-m2 bash[1363]: [conn=3] Local address: 10.10.12.160, port 51870
Oct 06 09:23:27 ceph-m2 bash[1363]: [conn=3] Peer address: 10.10.10.232, port 22
Oct 06 09:23:27 ceph-m2 bash[1363]: [conn=3] Beginning auth for user root
Oct 06 09:23:27 ceph-m2 bash[1363]: [conn=3] Auth failed for user root
Oct 06 09:23:27 ceph-m2 bash[1363]: [conn=3] Connection failure: Permission denied
Oct 06 09:23:27 ceph-m2 bash[1363]: [conn=3] Aborting connection
Oct 06 09:23:27 ceph-m2 bash[1363]: debug 2023-10-06T09:23:27.081+0000 7f78d86d8700 -1 mgr handle_command module 'orchestrator' command handler threw exception: __init__() missing 2 required positional arguments: >
Oct 06 09:23:27 ceph-m2 bash[1363]: debug 2023-10-06T09:23:27.093+0000 7f78d86d8700 -1 mgr.server reply reply (22) Invalid argument Traceback (most recent call last):
Oct 06 09:23:27 ceph-m2 bash[1363]: File "/usr/share/ceph/mgr/mgr_module.py", line 1756, in _handle_command
Oct 06 09:23:27 ceph-m2 bash[1363]: return self.handle_command(inbuf, cmd)
Oct 06 09:23:27 ceph-m2 bash[1363]: File "/usr/share/ceph/mgr/orchestrator/_interface.py", line 171, in handle_command
Oct 06 09:23:27 ceph-m2 bash[1363]: return dispatch[cmd['prefix']].call(self, cmd, inbuf)
Oct 06 09:23:27 ceph-m2 bash[1363]: File "/usr/share/ceph/mgr/mgr_module.py", line 462, in call
Oct 06 09:23:27 ceph-m2 bash[1363]: return self.func(mgr, **kwargs)
Oct 06 09:23:27 ceph-m2 bash[1363]: File "/usr/share/ceph/mgr/orchestrator/_interface.py", line 107, in <lambda>
Oct 06 09:23:27 ceph-m2 bash[1363]: wrapper_copy = lambda *l_args, **l_kwargs: wrapper(*l_args, **l_kwargs) # noqa: E731
Oct 06 09:23:27 ceph-m2 bash[1363]: File "/usr/share/ceph/mgr/orchestrator/_interface.py", line 96, in wrapper
Oct 06 09:23:27 ceph-m2 bash[1363]: return func(*args, **kwargs)
Oct 06 09:23:27 ceph-m2 bash[1363]: File "/usr/share/ceph/mgr/orchestrator/module.py", line 356, in _add_host
Oct 06 09:23:27 ceph-m2 bash[1363]: return self._apply_misc([s], False, Format.plain)
Oct 06 09:23:27 ceph-m2 bash[1363]: File "/usr/share/ceph/mgr/orchestrator/module.py", line 1092, in _apply_misc
Oct 06 09:23:27 ceph-m2 bash[1363]: raise_if_exception(completion)
Oct 06 09:23:27 ceph-m2 bash[1363]: File "/usr/share/ceph/mgr/orchestrator/_interface.py", line 225, in raise_if_exception
Oct 06 09:23:27 ceph-m2 bash[1363]: e = pickle.loads(c.serialized_exception)
Oct 06 09:23:27 ceph-m2 bash[1363]: TypeError: __init__() missing 2 required positional arguments: 'hostname' and 'addr'
----- cephadm logs from mgr ------
----- sshd logs DEBUG3 level ------
Oct 6 09:33:09 ceph-m1 sshd[57168]: debug2: input_userauth_request: try method publickey [preauth]
Oct 6 09:33:09 ceph-m1 sshd[57168]: debug2: userauth_pubkey: valid user root querying public key ecdsa-sha2-nistp384 AAAAE2VjZHNhLXNoYTItbmlzdHAzO------------ [preauth]
Oct 6 09:33:09 ceph-m1 sshd[57168]: debug1: userauth_pubkey: test pkalg ecdsa-sha2-nistp384 pkblob ECDSA SHA256:m6Q0ZQVjjDLWxbmCn0hcGQ2---------- [preauth]
Oct 6 09:33:09 ceph-m1 sshd[57168]: debug3: mm_key_allowed entering [preauth]
Oct 6 09:33:09 ceph-m1 sshd[57168]: debug3: mm_request_send entering: type 22 [preauth]
Oct 6 09:33:09 ceph-m1 sshd[57168]: debug3: mm_key_allowed: waiting for MONITOR_ANS_KEYALLOWED [preauth]
Oct 6 09:33:09 ceph-m1 sshd[57168]: debug3: mm_request_receive_expect entering: type 23 [preauth]
Oct 6 09:33:09 ceph-m1 sshd[57168]: debug3: mm_request_receive entering [preauth]
Oct 6 09:33:09 ceph-m1 sshd[57168]: debug3: mm_request_receive entering
Oct 6 09:33:09 ceph-m1 sshd[57168]: debug3: monitor_read: checking request 22
Oct 6 09:33:09 ceph-m1 sshd[57168]: debug3: mm_answer_keyallowed entering
Oct 6 09:33:09 ceph-m1 sshd[57168]: debug3: mm_answer_keyallowed: key_from_blob: 0x5568f0aa7880
Oct 6 09:33:09 ceph-m1 sshd[57168]: debug1: temporarily_use_uid: 0/0 (e=0/0)
Oct 6 09:33:09 ceph-m1 sshd[57168]: debug1: trying public key file /etc/ssh/fake_authorized_keys
Oct 6 09:33:09 ceph-m1 sshd[57168]: debug1: fd 5 clearing O_NONBLOCK
Oct 6 09:33:09 ceph-m1 sshd[57168]: debug1: restore_uid: 0/0
Oct 6 09:33:09 ceph-m1 sshd[57168]: debug3: mm_answer_keyallowed: publickey authentication test: ECDSA key is not allowed
Oct 6 09:33:09 ceph-m1 sshd[57168]: Failed publickey for root from 10.10.12.160 port 40854 ssh2: ECDSA SHA256:m6Q0ZQVjjDLWxbmCn0hcGQ24gbpk-------------
Oct 6 09:33:09 ceph-m1 sshd[57168]: debug3: mm_request_send entering: type 23
Oct 6 09:33:09 ceph-m1 sshd[57168]: debug2: userauth_pubkey: authenticated 0 pkalg ecdsa-sha2-nistp384 [preauth]
Oct 6 09:33:09 ceph-m1 sshd[57168]: debug3: user_specific_delay: user specific delay 0.000ms [preauth]
Oct 6 09:33:09 ceph-m1 sshd[57168]: debug3: ensure_minimum_time_since: elapsed 8.263ms, delaying 8.080ms (requested 8.171ms) [preauth]
Oct 6 09:33:09 ceph-m1 sshd[57168]: debug3: userauth_finish: failure partial=0 next methods="publickey" [preauth]
Oct 6 09:33:09 ceph-m1 sshd[57168]: debug3: send packet: type 51 [preauth]
Oct 6 09:33:09 ceph-m1 sshd[57168]: Connection closed by authenticating user root 10.10.12.160 port 40854 [preauth]
Oct 6 09:33:09 ceph-m1 sshd[57168]: debug1: do_cleanup [preauth]
Oct 6 09:33:09 ceph-m1 sshd[57168]: debug3: PAM: sshpam_thread_cleanup entering [preauth]
Oct 6 09:33:09 ceph-m1 sshd[57168]: debug1: monitor_read_log: child log fd closed
Oct 6 09:33:09 ceph-m1 sshd[57168]: debug3: mm_request_receive entering
Oct 6 09:33:09 ceph-m1 sshd[57168]: debug1: do_cleanup
Oct 6 09:33:09 ceph-m1 sshd[57168]: debug1: PAM: cleanup
Oct 6 09:33:09 ceph-m1 sshd[57168]: debug3: PAM: sshpam_thread_cleanup entering
Oct 6 09:33:09 ceph-m1 sshd[57168]: debug1: Killing privsep child 57169
Oct 6 09:33:09 ceph-m1 sshd[57168]: debug1: audit_event: unhandled event 12
Oct 6 09:33:09 ceph-m1 sshd[757]: debug1: main_sigchld_handler: Child exited
---------------
I get "ECDSA key is not allowed" above.
From sshd logs, it looks like the client is not sending what is required or in the expected format.
Now, what was changed in quincy/mgr on ssh client?
Is anyone else using ECDSA keys and it works with quincy?
I could not find in PRs something specific to this that could block the access, but it might be.
Any suggestion?
Thank you!
Paul
Hello
Short question regarding journal-based rbd mirroring.
IO path with journaling (w/o cache):
a. Create an event to describe the update
b. Asynchronously append event to journal object
c. Asynchronously update image once event is safe
d. Complete IO to client once update is safe
[cf. https://events.static.linuxfound.org/sites/events/files/slides/Disaster%20R…]
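For context, I have been inspecting the journal and mirroring state with something like the following (pool and image names are placeholders):

  rbd journal status --pool rbd --image vm-disk-1
  rbd mirror image status rbd/vm-disk-1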
If a client crashes between b. and c., is there a mechanism to replay the IO from the journal onto the primary image?
If not, then the primary and secondary images would get out of sync (because of the extra write(s) on the secondary), and subsequent writes to the primary would corrupt the secondary. Is that correct?
Cheers
Francois Scheurer
--
EveryWare AG
François Scheurer
Senior Systems Engineer
Zurlindenstrasse 52a
CH-8003 Zürich
tel: +41 44 466 60 00
fax: +41 44 466 60 10
mail: francois.scheurer(a)everyware.ch
web: http://www.everyware.ch