On 24-03-2024 00:31, Alexander E. Patrakov wrote:
> Hi Torkil,
Hi Alexander
> Thanks for the update. Even though the improvement is small, it is
> still an improvement, consistent with the osd_max_backfills value, and
> it proves that there are still unsolved peering issues.
>
> I have looked at both the old and the new state of the PG, but could
> not find anything else interesting.
>
> I also looked again at the state of PG 37.1. It is known what blocks
> the backfill of this PG; please search for "blocked_by." However, this
> is just one data point, which is insufficient for any conclusions. Try
> looking at other PGs. Is there anything too common in the non-empty
> "blocked_by" blocks?
I'll take a look at that tomorrow, perhaps we can script something
meaningful.
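Something like this rough sketch might be a starting point (untested; it
assumes a release where "ceph pg ls-by-pool ... -f json" returns a
"pg_stats" array and that jq is available - the pool name and PG states
are just examples):

for pg in $(ceph pg ls-by-pool cephfs.hdd.data backfill_wait backfilling -f json \
              | jq -r '.pg_stats[].pgid'); do
    # collect every OSD id that shows up in any "blocked_by" list of the pg query
    ceph pg "$pg" query \
      | jq -r --arg pg "$pg" '[.. | .blocked_by? // empty | .[]] | unique | .[] | "\($pg) osd.\(.)"'
done | tee blocked_by.txt | awk '{print $2}' | sort | uniq -c | sort -rn

That would leave the raw PG/OSD pairs in blocked_by.txt and print a count
per OSD, so any OSD that blocks many PGs should stand out.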
> I think we have to look for patterns in other ways, too. One tool that
> produces good visualizations is TheJJ balancer. Although it is called
> a "balancer," it can also visualize the ongoing backfills.
>
> The tool is available at
> https://raw.githubusercontent.com/TheJJ/ceph-balancer/master/placementoptim…
>
> Run it as follows:
>
> ./placementoptimizer.py showremapped --by-osd | tee remapped.txt
Output attached.
Thanks again.
Mvh.
Torkil
> On Sun, Mar 24, 2024 at 5:50 AM Torkil Svensgaard <torkil(a)drcmr.dk> wrote:
>>
>> Hi Alex
>>
>> New query output attached after restarting both OSDs. OSD 237 is no
>> longer mentioned but it unfortunately made no difference for the number
>> of backfills which went 59->62->62.
>>
>> Mvh.
>>
>> Torkil
>>
>> On 23-03-2024 22:26, Alexander E. Patrakov wrote:
>>
>>> Hi Torkil,
>>>
>>> I have looked at the files that you attached. They were helpful: pool
>>> 11 is problematic, it complains about degraded objects for no obvious
>>> reason. I think that is the blocker.
>>>
>>> I also noted that you mentioned peering problems, and I suspect that
>>> they are not completely resolved. As a somewhat-irrational move, to
>>> confirm this theory, you can restart osd.237 (it is mentioned at the
>>> end of query.11.fff.txt, although I don't understand why it is there)
>>> and then osd.298 (it is the primary for that pg) and see if any
>>> additional backfills are unblocked after that. Also, please re-query
>>> that PG again after the OSD restart.
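For reference, on a cephadm-managed cluster that amounts to roughly the
following (11.fff being the PG from query.11.fff.txt; with plain systemd
units the restart command differs):

ceph orch daemon restart osd.237
ceph orch daemon restart osd.298    # primary for the PG
ceph pg 11.fff query > query.11.fff.after-restart.txt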
>>>
>>> On Sun, Mar 24, 2024 at 4:56 AM Torkil Svensgaard <torkil(a)drcmr.dk> wrote:
>>>>
>>>>
>>>>
>>>> On 23-03-2024 21:19, Alexander E. Patrakov wrote:
>>>>
>>>>> Hi Torkil,
>>>>
>>>> Hi Alexander
>>>>
>>>>> I have looked at the CRUSH rules, and the equivalent rules work on my
>>>>> test cluster. So this cannot be the cause of the blockage.
>>>>
>>>> Thank you for taking the time =)
>>>>
>>>>> What happens if you increase the osd_max_backfills setting temporarily?
>>>>
>>>> We already had the mclock override option in place and I re-enabled our
>>>> babysitter script which sets osd_max_backfills per OSD to 1-3 depending
>>>> on how full they are. Active backfills went from 16 to 53, which is
>>>> probably because the default osd_max_backfills for mclock is 1.
>>>>
>>>> I think 53 is still a low number of active backfills given the large
>>>> percentage misplaced.
>>>>
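For anyone following along, the knobs involved are roughly these (sketch
only; osd.123 is a placeholder id and the override option name is taken
from the mClock documentation):

ceph config set osd osd_mclock_override_recovery_settings true   # let osd_max_backfills take effect under mClock
ceph config set osd.123 osd_max_backfills 3                      # per-OSD limit, the value our script adjusts
ceph config show osd.123 osd_max_backfills                       # confirm what the daemon actually runs with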
>>>>> It may be a good idea to investigate a few of the stalled PGs. Please
>>>>> run commands similar to this one:
>>>>>
>>>>> ceph pg 37.0 query > query.37.0.txt
>>>>> ceph pg 37.1 query > query.37.1.txt
>>>>> ...
>>>>> and the same for the other affected pools.
>>>>
>>>> A few samples attached.
>>>>
>>>>> Still, I must say that some of your rules are actually unsafe.
>>>>>
>>>>> The 4+2 rule as used by rbd_ec_data will not survive a
>>>>> datacenter-offline incident. Namely, for each PG, it chooses OSDs from
>>>>> two hosts in each datacenter, so 6 OSDs total. When a datacenter is
>>>>> offline, you will, therefore, have only 4 OSDs up, which is exactly
>>>>> the number of data chunks. However, the pool requires min_size 5, so
>>>>> all PGs will be inactive (to prevent data corruption) and will stay
>>>>> inactive until the datacenter comes up again. However, please don't
>>>>> set min_size to 4 - then, any additional incident (like a defective
>>>>> disk) will lead to data loss, and the shards in the datacenter which
>>>>> went offline would be useless because they do not correspond to the
>>>>> updated shards written by the clients.
>>>>
>>>> Thanks for the explanation. This is an old pool predating the 3 DC setup
>>>> and we'll migrate the data to a 4+5 pool when we can.
>>>>
>>>>> The 4+5 rule as used by cephfs.hdd.data has min_size equal to the
>>>>> number of data chunks. See above why it is bad. Please set min_size to 5.
>>>>
>>>> Thanks, that was a leftover for getting the PGs to peer (stuck at
>>>> creating+incomplete) when we created the pool. It's back to 5 now.
>>>>
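For the record, that is just the usual pool-level setting, e.g.:

ceph osd pool set cephfs.hdd.data min_size 5
ceph osd pool get cephfs.hdd.data min_size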
>>>>> The rbd.ssd.data pool seems to be OK - and, by the way, its PGs are
>>>>> 100% active+clean.
>>>>
>>>> There is very little data in this pool; that is probably the main reason.
>>>>
>>>>> Regarding the mon_max_pg_per_osd setting, you have a few OSDs that
>>>>> have 300+ PGs, the observed maximum is 347. Please set it to 400.
>>>>
>>>> Copy that. Didn't seem to make a difference though, and we have
>>>> osd_max_pg_per_osd_hard_ratio set to 5.000000.
>>>>
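For completeness, the knob and a quick sanity check (sketch, assuming the
config database is used rather than ceph.conf):

ceph config set global mon_max_pg_per_osd 400
ceph config get osd mon_max_pg_per_osd   # value the OSDs will pick up
ceph osd df tree                         # the PGS column shows the per-OSD placement group count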
>>>> Mvh.
>>>>
>>>> Torkil
>>>>
>>>>> On Sun, Mar 24, 2024 at 3:16 AM Torkil Svensgaard <torkil(a)drcmr.dk> wrote:
>>>>>>
>>>>>>
>>>>>>
>>>>>> On 23-03-2024 19:05, Alexander E. Patrakov wrote:
>>>>>>> Sorry for replying to myself, but "ceph osd pool ls detail" by itself
>>>>>>> is insufficient. For every erasure code profile mentioned in the
>>>>>>> output, please also run something like this:
>>>>>>>
>>>>>>> ceph osd erasure-code-profile get prf-for-ec-data
>>>>>>>
>>>>>>> ...where "prf-for-ec-data" is the name that appears after the words
>>>>>>> "erasure profile" in the "ceph osd pool ls detail" output.
>>>>>>
>>>>>> [root@lazy ~]# ceph osd pool ls detail | grep erasure
>>>>>> pool 11 'rbd_ec_data' erasure profile DRCMR_k4m2 size 6 min_size 5
>>>>>> crush_rule 0 object_hash rjenkins pg_num 4096 pgp_num 4096
>>>>>> autoscale_mode off last_change 2257933 lfor 0/1291190/1755101 flags
>>>>>> hashpspool,ec_overwrites,selfmanaged_snaps,bulk stripe_width 16384
>>>>>> fast_read 1 compression_algorithm snappy compression_mode aggressive
>>>>>> application rbd
>>>>>> pool 37 'cephfs.hdd.data' erasure profile DRCMR_k4m5_datacenter_hdd size
>>>>>> 9 min_size 4 crush_rule 7 object_hash rjenkins pg_num 2048 pgp_num 2048
>>>>>> autoscale_mode off last_change 2257933 lfor 0/0/2139486 flags
>>>>>> hashpspool,ec_overwrites,bulk stripe_width 16384 fast_read 1
>>>>>> compression_algorithm zstd compression_mode aggressive application cephfs
>>>>>> pool 38 'rbd.ssd.data' erasure profile DRCMR_k4m5_datacenter_ssd size 9
>>>>>> min_size 5 crush_rule 8 object_hash rjenkins pg_num 32 pgp_num 32
>>>>>> autoscale_mode warn last_change 2198930 lfor 0/2198930/2198928 flags
>>>>>> hashpspool,ec_overwrites,selfmanaged_snaps stripe_width 16384
>>>>>> compression_algorithm zstd compression_mode aggressive application rbd
>>>>>>
>>>>>> [root@lazy ~]# ceph osd erasure-code-profile get DRCMR_k4m2
>>>>>> crush-device-class=hdd
>>>>>> crush-failure-domain=host
>>>>>> crush-root=default
>>>>>> jerasure-per-chunk-alignment=false
>>>>>> k=4
>>>>>> m=2
>>>>>> plugin=jerasure
>>>>>> technique=reed_sol_van
>>>>>> w=8
>>>>>> [root@lazy ~]# ceph osd erasure-code-profile get DRCMR_k4m5_datacenter_hdd
>>>>>> crush-device-class=hdd
>>>>>> crush-failure-domain=datacenter
>>>>>> crush-root=default
>>>>>> jerasure-per-chunk-alignment=false
>>>>>> k=4
>>>>>> m=5
>>>>>> plugin=jerasure
>>>>>> technique=reed_sol_van
>>>>>> w=8
>>>>>> [root@lazy ~]# ceph osd erasure-code-profile get DRCMR_k4m5_datacenter_ssd
>>>>>> crush-device-class=ssd
>>>>>> crush-failure-domain=datacenter
>>>>>> crush-root=default
>>>>>> jerasure-per-chunk-alignment=false
>>>>>> k=4
>>>>>> m=5
>>>>>> plugin=jerasure
>>>>>> technique=reed_sol_van
>>>>>> w=8
>>>>>>
>>>>>> But as I understand it those profiles are only used to create the
>>>>>> initial crush rule for the pool, and we have manually edited those along
>>>>>> the way. Here are the 3 rules in use for the 3 EC pools:
>>>>>>
>>>>>> rule rbd_ec_data {
>>>>>> id 0
>>>>>> type erasure
>>>>>> step set_chooseleaf_tries 5
>>>>>> step set_choose_tries 100
>>>>>> step take default class hdd
>>>>>> step choose indep 0 type datacenter
>>>>>> step chooseleaf indep 2 type host
>>>>>> step emit
>>>>>> }
>>>>>> rule cephfs.hdd.data {
>>>>>> id 7
>>>>>> type erasure
>>>>>> step set_chooseleaf_tries 5
>>>>>> step set_choose_tries 100
>>>>>> step take default class hdd
>>>>>> step choose indep 0 type datacenter
>>>>>> step chooseleaf indep 3 type host
>>>>>> step emit
>>>>>> }
>>>>>> rule rbd.ssd.data {
>>>>>> id 8
>>>>>> type erasure
>>>>>> step set_chooseleaf_tries 5
>>>>>> step set_choose_tries 100
>>>>>> step take default class ssd
>>>>>> step choose indep 0 type datacenter
>>>>>> step chooseleaf indep 3 type host
>>>>>> step emit
>>>>>> }
>>>>>>
>>>>>> Which should first pick all 3 datacenters in the choose step and then
>>>>>> either 2 or 3 hosts in the chooseleaf step, matching EC 4+2 and 4+5
>>>>>> respectively.
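One way to verify that offline is to run the compiled CRUSH map through
crushtool (sketch; rule id 7 and --num-rep 9 match cephfs.hdd.data above,
and --show-bad-mappings should print nothing if every PG can be given 9
OSDs):

ceph osd getcrushmap -o crush.bin
crushtool -i crush.bin --test --rule 7 --num-rep 9 --show-mappings | head
crushtool -i crush.bin --test --rule 7 --num-rep 9 --show-bad-mappings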
>>>>>>
>>>>>> Mvh.
>>>>>>
>>>>>> Torkil
>>>>>>
>>>>>>> On Sun, Mar 24, 2024 at 1:56 AM Alexander E. Patrakov
>>>>>>> <patrakov(a)gmail.com> wrote:
>>>>>>>>
>>>>>>>
>>>>>>>> Hi Torkil,
>>>>>>>>
>>>>>>>> I take my previous response back.
>>>>>>>>
>>>>>>>> You have an erasure-coded pool with nine shards but only three
>>>>>>>> datacenters. This, in general, cannot work. You need either nine
>>>>>>>> datacenters or a very custom CRUSH rule. The second option may not be
>>>>>>>> available if the current EC setup is already incompatible, as there is
>>>>>>>> no way to change the EC parameters.
>>>>>>>>
>>>>>>>> It would help if you provided the output of "ceph osd pool ls detail".
>>>>>>>>
>>>>>>>> On Sun, Mar 24, 2024 at 1:43 AM Alexander E. Patrakov
>>>>>>>> <patrakov(a)gmail.com> wrote:
>>>>>>>>>
>>>>>>>>
>>>>>>>>> Hi Torkil,
>>>>>>>>>
>>>>>>>>> Unfortunately, your files contain nothing obviously bad or suspicious,
>>>>>>>>> except for two things: more PGs than usual and bad balance.
>>>>>>>>>
>>>>>>>>> What's your "mon max pg per osd" setting?
>>>>>>>>>
>>>>>>>>> On Sun, Mar 24, 2024 at 1:08 AM Torkil Svensgaard <torkil(a)drcmr.dk> wrote:
>>>>>>>>>>
>>>>>>>>>> On 2024-03-23 17:54, Kai Stian Olstad wrote:
>>>>>>>>>>> On Sat, Mar 23, 2024 at 12:09:29PM +0100, Torkil Svensgaard wrote:
>>>>>>>>>>>>
>>>>>>>>>>>> The other output is too big for pastebin and I'm not familiar with
>>>>>>>>>>>> paste services, any suggestion for a preferred way to share such
>>>>>>>>>>>> output?
>>>>>>>>>>>
>>>>>>>>>>> You can attach files to the mail here on the list.
>>>>>>>>>>
>>>>>>>>>> Doh, for some reason I was sure attachments would be stripped. Thanks,
>>>>>>>>>> attached.
>>>>>>>>>>
>>>>>>>>>> Mvh.
>>>>>>>>>>
>>>>>>>>>> Torkil
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> --
>>>>>>>>> Alexander E. Patrakov
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> --
>>>>>>>> Alexander E. Patrakov
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>
>>>>>> --
>>>>>> Torkil Svensgaard
>>>>>> Systems Administrator
>>>>>> Danish Research Centre for Magnetic Resonance DRCMR, Section 714
>>>>>> Copenhagen University Hospital Amager and Hvidovre
>>>>>> Kettegaard Allé 30, 2650 Hvidovre, Denmark
>>>>>>
>>>>>
>>>>>
>>>>
>>>> --
>>>> Torkil Svensgaard
>>>> Systems Administrator
>>>> Danish Research Centre for Magnetic Resonance DRCMR, Section 714
>>>> Copenhagen University Hospital Amager and Hvidovre
>>>> Kettegaard Allé 30, 2650 Hvidovre, Denmark
>>>
>>>
>>>
>>
>> --
>> Torkil Svensgaard
>> Systems Administrator
>> Danish Research Centre for Magnetic Resonance DRCMR, Section 714
>> Copenhagen University Hospital Amager and Hvidovre
>> Kettegaard Allé 30, 2650 Hvidovre, Denmark
>
>
>
--
Torkil Svensgaard
Systems Administrator
Danish Research Centre for Magnetic Resonance DRCMR, Section 714
Copenhagen University Hospital Amager and Hvidovre
Kettegaard Allé 30, 2650 Hvidovre, Denmark