Hello,
'almost all diagnostic ceph subcommands hang!' -> this rang a bell for me.
We've had a similar issue with many ceph commands hanging due to a missing L3 ACL
between MGRs and a new MDS machine that we added to the cluster.
I second Eugen's analysis: a network issue, whatever the OSI layer.
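For what it's worth, a minimal sketch of the layered checks we ran when we hit our ACL problem. Hostnames are taken from the 'ceph osd tree' output further down in this thread; the MON ports (3300/6789) are the Ceph defaults, and nc is assumed to be installed:

```shell
#!/bin/sh
# Quick layered connectivity checks between cluster nodes.
# Run from each node in turn; adjust hostnames for your public and
# cluster networks.
for host in asrv-dev-stor-1 asrv-dev-stor-2 asrv-dev-stor-3; do
    # L3: basic reachability
    ping -c 2 -W 2 "$host" >/dev/null 2>&1 \
        && echo "OK:   ping $host" \
        || echo "FAIL: ping $host"
    # L4: MON ports (3300 msgr2, 6789 msgr1); OSD/MGR daemons bind in
    # the 6800-7300 range, so spot-check one of those too
    for port in 3300 6789 6800; do
        nc -z -w 2 "$host" "$port" >/dev/null 2>&1 \
            && echo "OK:   $host:$port reachable" \
            || echo "FAIL: $host:$port unreachable"
    done
done
```

A FAIL on ping with OK on ports (or vice versa) narrows the layer quickly; in our case ping worked but the daemon ports were filtered.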
Regards,
Frédéric.
----- On 26 Apr 24, at 9:31, Eugen Block eblock(a)nde.ag wrote:
> Hi, it's unlikely that all OSDs would fail at the same time; this
> looks like a network issue. Do you have an active MGR? Just a couple
> of days ago someone reported incorrect OSD stats because no MGR was
> up. Although your 'ceph health detail' output doesn't mention that,
> there are still known issues where MGR processes count as active
> according to ceph but no longer respond.
> I would probably start with basic network debugging, e.g. iperf and
> pings on the public and cluster networks (if present), and so on.
>
> Regards,
> Eugen
>
> Quoting Alexey GERASIMOV <alexey.gerasimov(a)opencascade.com>:
>
>> Colleagues, I have an update.
>>
>> Since yesterday, the cluster's health situation has become much
>> worse than before.
>> We found that:
>> - "ceph -s" reports that some PGs are in the "stale" state
>> - almost all diagnostic ceph subcommands hang! For example, "ceph
>> osd ls", "ceph osd dump", "ceph osd tree", and "ceph health detail"
>> still produce output - but "ceph osd status", all the "ceph pg ..."
>> commands, and others hang.
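[Editorial note: the split between responding and hanging commands is itself diagnostic. "ceph osd ls/dump/tree" and "ceph health detail" are answered by the MONs, while "ceph osd status" and the "ceph pg ..." commands are served through the active MGR, so a hang along exactly that line points at the MGR rather than the OSDs. A sketch for mapping the pattern without blocking the shell (the command list is illustrative):]

```shell
#!/bin/sh
# Probe ceph subcommands with a short timeout so hangs don't block the
# shell. MON-served commands should answer; MGR-served ones will time
# out (coreutils timeout exits 124) if the active MGR is wedged.
for cmd in "osd ls" "osd dump" "osd tree" "health detail" \
           "osd status" "pg stat"; do
    if timeout 10 ceph $cmd >/dev/null 2>&1; then
        echo "OK:       ceph $cmd"
    else
        echo "NO REPLY: ceph $cmd"
    fi
done
```

If only the MGR-served commands time out, restarting or failing over the MGR (ceph mgr fail) is a reasonable next step.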
>>
>> So it looks like the MDS daemon crashes were only the first sign of
>> trouble.
>> I read that the "stale" state for a PG means that all OSDs storing
>> that placement group may be down - but that's not the case here: all
>> OSD daemons are up on all three nodes:
>>
>> ------- ceph osd tree
>> ID  CLASS  WEIGHT    TYPE NAME                 STATUS  REWEIGHT  PRI-AFF
>> -1         68.05609  root default
>> -3         22.68536      host asrv-dev-stor-1
>>  0    hdd   5.45799          osd.0                 up   1.00000  1.00000
>>  1    hdd   5.45799          osd.1                 up   1.00000  1.00000
>>  2    hdd   5.45799          osd.2                 up   1.00000  1.00000
>>  3    hdd   5.45799          osd.3                 up   1.00000  1.00000
>> 12    ssd   0.42670          osd.12                up   1.00000  1.00000
>> 13    ssd   0.42670          osd.13                up   1.00000  1.00000
>> -5         22.68536      host asrv-dev-stor-2
>>  4    hdd   5.45799          osd.4                 up   1.00000  1.00000
>>  5    hdd   5.45799          osd.5                 up   1.00000  1.00000
>>  6    hdd   5.45799          osd.6                 up   1.00000  1.00000
>>  7    hdd   5.45799          osd.7                 up   1.00000  1.00000
>> 14    ssd   0.42670          osd.14                up   1.00000  1.00000
>> 15    ssd   0.42670          osd.15                up   1.00000  1.00000
>> -7         22.68536      host asrv-dev-stor-3
>>  8    hdd   5.45799          osd.8                 up   1.00000  1.00000
>> 10    hdd   5.45799          osd.10                up   1.00000  1.00000
>> 11    hdd   5.45799          osd.11                up   1.00000  1.00000
>> 18    hdd   5.45799          osd.18                up   1.00000  1.00000
>> 16    ssd   0.42670          osd.16                up   1.00000  1.00000
>> 17    ssd   0.42670          osd.17                up   1.00000  1.00000
>>
>> Could it be a physical problem with our drives? "smartctl -a"
>> reports nothing wrong. We have also started a surface check with
>> dd, but that will take at least 7 hours per drive...
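[Editorial note: the read pass can run on all drives in parallel, which cuts the wall time to roughly 7 hours total instead of 7 hours per drive. A sketch, with placeholder device names - take the real ones from 'lsblk' or 'ceph-volume lvm list' on each node:]

```shell
#!/bin/sh
# Non-destructive sequential read pass over each data drive, run in
# parallel. Device names below are placeholders.
for dev in /dev/sda /dev/sdb /dev/sdc /dev/sdd; do
    dd if="$dev" of=/dev/null bs=4M iflag=direct status=none \
        && echo "read OK: $dev" \
        || echo "READ ERROR: $dev" &
done
wait
# Kernel-level I/O errors show up here even when SMART looks clean:
dmesg -T | grep -iE 'i/o error|medium error|blk_update_request' | tail -n 20
```

Alternatively, "smartctl -t long /dev/sdX" schedules the drive's own extended self-test, which exercises the surface without tying up host I/O; check the result later with "smartctl -a".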
>>
>> What else should we do?
>>
>> The output of "ceph health detail":
>>
>> ceph health detail
>> HEALTH_ERR 1 MDSs report damaged metadata; insufficient standby MDS
>> daemons available; Reduced data availability: 50 pgs stale; 90
>> daemons have recently crashed; 3 mgr modules have recently crashed
>> [ERR] MDS_DAMAGE: 1 MDSs report damaged metadata
>> mds.asrv-dev-stor-2(mds.0): Metadata damage detected
>> [WRN] MDS_INSUFFICIENT_STANDBY: insufficient standby MDS daemons available
>> have 0; want 1 more
>> [WRN] PG_AVAILABILITY: Reduced data availability: 50 pgs stale
>>     pg 5.0 is stuck stale for 67m, current state stale+active+clean, last acting [4,1,11]
>>     pg 5.13 is stuck stale for 67m, current state stale+active+clean, last acting [4,0,10]
>>     pg 5.18 is stuck stale for 67m, current state stale+active+clean, last acting [4,11,2]
>>     pg 5.19 is stuck stale for 67m, current state stale+active+clean, last acting [4,3,10]
>>     pg 5.1e is stuck stale for 10h, current state stale+active+clean, last acting [0,7,11]
>>     pg 5.22 is stuck stale for 10h, current state stale+active+clean, last acting [0,6,18]
>>     pg 5.26 is stuck stale for 67m, current state stale+active+clean, last acting [4,1,18]
>>     pg 5.29 is stuck stale for 10h, current state stale+active+clean, last acting [0,11,6]
>>     pg 5.2b is stuck stale for 10h, current state stale+active+clean, last acting [0,18,6]
>>     pg 5.30 is stuck stale for 10h, current state stale+active+clean, last acting [0,8,7]
>>     pg 5.37 is stuck stale for 67m, current state stale+active+clean, last acting [4,10,0]
>>     pg 5.3c is stuck stale for 67m, current state stale+active+clean, last acting [4,10,3]
>>     pg 5.43 is stuck stale for 10h, current state stale+active+clean, last acting [0,6,18]
>>     pg 5.44 is stuck stale for 67m, current state stale+active+clean, last acting [4,2,11]
>>     pg 5.45 is stuck stale for 67m, current state stale+active+clean, last acting [4,11,3]
>>     pg 5.47 is stuck stale for 67m, current state stale+active+clean, last acting [4,10,1]
>>     pg 5.48 is stuck stale for 10h, current state stale+active+clean, last acting [0,5,11]
>>     pg 5.60 is stuck stale for 10h, current state stale+active+clean, last acting [0,10,7]
>>     pg 7.2 is stuck stale for 67m, current state stale+active+clean, last acting [4,2,10]
>>     pg 7.4 is stuck stale for 67m, current state stale+active+clean, last acting [4,18,3]
>>     pg 7.f is stuck stale for 10h, current state stale+active+clean, last acting [0,4,8]
>>     pg 7.13 is stuck stale for 10h, current state stale+active+clean, last acting [0,7,11]
>>     pg 7.18 is stuck stale for 67m, current state stale+active+clean, last acting [4,0,10]
>>     pg 7.1b is stuck stale for 67m, current state stale+active+clean, last acting [4,8,0]
>>     pg 7.1f is stuck stale for 10h, current state stale+active+clean, last acting [0,5,11]
>>     pg 7.2a is stuck stale for 10h, current state stale+active+clean, last acting [0,6,8]
>>     pg 7.35 is stuck stale for 10h, current state stale+active+clean, last acting [0,6,10]
>>     pg 7.36 is stuck stale for 67m, current state stale+active+clean, last acting [4,2,8]
>>     pg 7.37 is stuck stale for 10h, current state stale+active+clean, last acting [0,8,7]
>>     pg 7.38 is stuck stale for 10h, current state stale+active+clean, last acting [0,6,11]
>>     pg 9.10 is stuck stale for 67m, current state stale+active+clean, last acting [4,0,8]
>>     pg 9.16 is stuck stale for 10h, current state stale+active+clean, last acting [0,4,11]
>>     pg 9.20 is stuck stale for 67m, current state stale+active+clean, last acting [4,3,8]
>>     pg 9.2a is stuck stale for 67m, current state stale+active+clean, last acting [4,8,0]
>>     pg 9.33 is stuck stale for 10h, current state stale+active+clean, last acting [0,18,5]
>>     pg 9.3a is stuck stale for 10h, current state stale+active+clean, last acting [0,8,5]
>>     pg 9.48 is stuck stale for 67m, current state stale+active+clean, last acting [4,2,11]
>>     pg 9.4b is stuck stale for 10h, current state stale+active+clean, last acting [0,7,11]
>>     pg 9.4f is stuck stale for 10h, current state stale+active+clean, last acting [0,6,10]
>>     pg 9.52 is stuck stale for 67m, current state stale+active+clean, last acting [4,8,0]
>>     pg 9.53 is stuck stale for 10h, current state stale+active+clean, last acting [0,11,7]
>>     pg 9.56 is stuck stale for 10h, current state stale+active+clean, last acting [0,5,18]
>>     pg 9.5a is stuck stale for 10h, current state stale+active+clean, last acting [0,7,8]
>>     pg 9.5d is stuck stale for 10h, current state stale+active+clean, last acting [0,6,10]
>>     pg 9.6b is stuck stale for 67m, current state stale+active+clean, last acting [4,11,0]
>>     pg 9.6f is stuck stale for 67m, current state stale+active+clean, last acting [4,2,18]
>>     pg 9.73 is stuck stale for 67m, current state stale+active+clean, last acting [4,2,10]
>>     pg 9.76 is stuck stale for 67m, current state stale+active+clean, last acting [4,10,2]
>>     pg 9.79 is stuck stale for 10h, current state stale+active+clean, last acting [0,6,8]
>>     pg 9.7f is stuck stale for 10h, current state stale+active+clean, last acting [0,10,5]
>>
>>
>> _______________________________________________
>> ceph-users mailing list -- ceph-users(a)ceph.io
>> To unsubscribe send an email to ceph-users-leave(a)ceph.io