Thank you very much for the hint regarding the log files; I wasn't aware that the logs are still kept on the host even though everything runs in containers nowadays.
There was nothing in the log files, but I did find out that the host (a RasPi4) could not cope with the two external USB SSDs connected to it, probably because it could not supply enough power. The disks disappeared and the OSDs went away with them. After a restart of the host the disks were back, as were the OSD containers. I have now removed that second OSD and will keep only a single OSD per server.
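In case it helps someone else: the orchestrator can drain and remove an OSD, roughly like this (just a sketch; <osd_id> is the ID of the OSD to remove):

    ceph orch osd rm <osd_id>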
For reference, here is the relevant part of the kernel log I saw:
[Thu May 6 15:24:34 2021] blk_update_request: I/O error, dev sda, sector 40063143 op
0x1:(WRITE) flags 0x8800 phys_seg 1 prio class 0
[Thu May 6 15:24:34 2021] usb 1-1-port4: over-current change #1
and of course it did that for both sda and sdb.
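If anyone wants to check a Pi for the same problem, the over-current events end up in the kernel ring buffer, so something like this should show them (assuming the usual dmesg/grep tools are available):

    dmesg -T | grep -iE 'over-current|blk_update_request'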
‐‐‐‐‐‐‐ Original Message ‐‐‐‐‐‐‐
On Thursday, May 6, 2021 4:17 PM, David Caro <dcaro(a)wikimedia.org> wrote:
On 05/06 14:03, mabi wrote:
Hello,
I have a small 6-node Octopus 15.2.11 cluster installed on bare metal with cephadm, and I added a second OSD to one of my 3 OSD nodes. I then started copying data to my CephFS (kernel mount), but both OSDs on that specific node crashed.
On this topic I have the following questions:
1. How can I find out why the two OSDs crashed? Because everything runs in Podman containers, I don't know where the logs are to find out why this happened. From the OS itself everything looks OK; there was no out-of-memory error.
There should be some logs under /var/log/ceph/<cluster_fsid>/osd.<osd_id>/ on
the host/hosts that were running the osds.
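For example, something like this (with the placeholders replaced by your cluster fsid and OSD id; the exact file names may differ):

    ls -l /var/log/ceph/<cluster_fsid>/
    tail -n 200 /var/log/ceph/<cluster_fsid>/osd.<osd_id>/*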
Sometimes, though, I have found myself disabling the '--rm' flag for the pod in the 'unit.run' script under
/var/lib/ceph/<ceph_fsid>/osd.<id>/unit.run to make podman persist the
container, so I can do a 'podman logs' on it.
Though that's probably sensible only when debugging.
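Roughly, what I mean is something like this (a sketch; the exact unit and container names depend on your fsid and OSD id):

    # edit unit.run and drop the --rm flag from the podman run command
    vi /var/lib/ceph/<ceph_fsid>/osd.<id>/unit.run
    # restart the systemd unit so the change takes effect
    systemctl restart ceph-<ceph_fsid>@osd.<id>.service
    # once the OSD dies again the container is kept around, so you can inspect it
    podman ps -a
    podman logs <container name from 'podman ps -a'>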
2. I would assume the two OSD containers would restart on their own, but it looks like this is not the case. How can I manually restart these 2 OSD containers on that node? I believe this should be a "cephadm orch" command?
I think 'ceph orch daemon redeploy' might do it? What is the output of 'ceph
orch ls' and 'ceph orch ps'?
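For example (osd.<id> being the failed daemon as listed by 'ceph orch ps'):

    ceph orch ps
    ceph orch ls
    ceph orch daemon redeploy osd.<id>

'ceph orch daemon restart osd.<id>' might also be enough if the daemon just needs a kick.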
The health of the cluster right now is:
CEPHADM_FAILED_DAEMON: 2 failed cephadm daemon(s)
PG_DEGRADED: Degraded data redundancy: 132518/397554 objects degraded (33.333%), 65
pgs degraded, 65 pgs undersized
Thank you for your hints.
Best regards,
Mabi
--
David Caro
SRE - Cloud Services
Wikimedia Foundation
https://wikimediafoundation.org/
PGP Signature: 7180 83A2 AC8B 314F B4CE 1171 4071 C7E1 D262 69C3
"Imagine a world in which every single human being can freely share in the
sum of all knowledge. That's our commitment."
ceph-users mailing list -- ceph-users(a)ceph.io
To unsubscribe send an email to ceph-users-leave(a)ceph.io