Hey all
We landed in a bad place (tm) with our nvme metadata tier. I'll root cause how we got
here after it's all back up; I suspect a pool got misconfigured and just filled
everything up.
Short version: the OSDs are all full (or full enough) that I can't get them to spin
back up. They crash with enospc. Average fragmentation for block is in the 0.8 range and
bluefs-db is slightly better (measured with ceph-bluestore-tool --path /var/lib/ceph/osd/ceph-412
free-score). I've tried all sorts of things. I was able to get a few to spin up, but
once they came up and rejoined they tried to pull MORE data in and crashed out again.
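For anyone wanting to check their own OSDs the same way: the scores above came from free-score run per OSD while the OSD is down. Something like this (OSD IDs are examples for one node) walks them:

```shell
# BlueStore free-space fragmentation check, one stopped OSD at a time.
# free-score reports 0 (no fragmentation) .. 1 (maximally fragmented).
for osd in 410 411 412; do
    echo -n "osd.${osd} free-score: "
    ceph-bluestore-tool --path /var/lib/ceph/osd/ceph-${osd} free-score
done
```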
I changed the crush_rule for the pool I care about to a much larger (and slower) set of
disks. That way if I get anything else to come up I'm not just making it worse.
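The rule swap itself is just standard CRUSH commands; a sketch of the idea (rule and pool names here are placeholders, not our real ones):

```shell
# Create a replicated rule targeting the larger/slower hdd device class,
# then point the pool we care about at it so any OSD that comes back
# stops being a write target for that pool.
ceph osd crush rule create-replicated rule-hdd default host hdd
ceph osd pool set cephfs-metadata crush_rule rule-hdd
```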
I increased the size of the backing LV for one of the OSDs to see if I could get
ceph-bluestore-tool to expand it, but that too crashes out with enospc.
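The expansion attempt looked roughly like this (VG/LV names are made up; the OSD is stopped first):

```shell
# Grow the LV backing osd.412's block device (172G original, doubled),
# then ask bluefs/bluestore to take up the new space.
lvextend -L 344G /dev/ceph-nvme-vg/osd-412-block
ceph-bluestore-tool --path /var/lib/ceph/osd/ceph-412 bluefs-bdev-expand
```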
In theory, there are a few pools I don't care about as much on there and I could
delete them to make space, but I can't get them up enough -or- get the offline tools
to do it.
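For completeness, the pool deletion I can't currently pull off would normally be (pool name is a placeholder, and the mons have to allow deletion first):

```shell
# Pool deletion is disabled by default; enable it, then delete.
# The pool name must be given twice as a safety check.
ceph config set mon mon_allow_pool_delete true
ceph osd pool delete scratch-pool scratch-pool --yes-i-really-really-mean-it
```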
Some logs from the attempted expansion that fails:
[root@ceph-b-07 ceph-412]# ceph-bluestore-tool --path /var/lib/ceph/osd/ceph-412
bluefs-bdev-expand
inferring bluefs devices from bluestore path
1 : device size 0x44aa000000 : own
0x[520000~20000,23e0000~620000,2ae0000~4d20000,78d0000~f30000,8900000~1600000,9fc0000~30000,a000000~5d00000,fe00000~3b00000,139e0000~5420000,19000000~100000,
::snip::
4f0000~20000,25c17c0000~10000,25c2ea0000~20000,25c9f20000~10000,25d0860000~10000,25d50e0000~20000,25d5170000~10000,25ded20000~20000,25f4fc0000~20000]
= 0x59c5b0000 : using 0x58f220000(22 GiB) : bluestore has 0x10260000(258 MiB) available
Expanding DB/WAL...
Expanding Main...
2021-01-13 16:40:46.481 7f33d1998ec0 -1 bluestore(/var/lib/ceph/osd/ceph-412)
allocate_bluefs_freespace failed to allocate on 0x32c70000 min_size 0xf700000 >
allocated total 0x1e80000 bluefs_shared_alloc_size 0x10000 allocated 0x1e80000 available
0x 90210000
2021-01-13 16:40:46.482 7f33d1998ec0 -1 bluefs _allocate failed to expand slow device to
fit +0xf6f0def
2021-01-13 16:40:46.482 7f33d1998ec0 -1 bluefs _flush_range allocated: 0x0 offset: 0x0
length: 0xf6f0def
/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/gigantic/release/14.2.15/rpm/el7/BUILD/ceph-14.2.15/src/os/bluestore/BlueFS.cc:
In function 'int BlueFS::_flush_range(BlueFS::FileWriter*, uint64_t, uint64_t)'
thread 7f33d1998ec0 time 2021-01-13 16:40:46.482978
/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/gigantic/release/14.2.15/rpm/el7/BUILD/ceph-14.2.15/src/os/bluestore/BlueFS.cc:
2351: ceph_abort_msg("bluefs enospc")
ceph version 14.2.15 (afdd217ae5fb1ed3f60e16bd62357ca58cc650e5) nautilus (stable)
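To put the hex sizes in that output into human units (shell arithmetic, values copied verbatim from the log):

```shell
# Decode the byte counts from the bluefs-bdev-expand output above.
echo "device size:    $(( 0x44aa000000 / 1073741824 )) GiB"  # raw device
echo "bluefs using:   $(( 0x58f220000 / 1073741824 )) GiB"   # the "22 GiB" in the log
echo "bluestore free: $(( 0x10260000 / 1048576 )) MiB"       # the "258 MiB" in the log
```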
The original LV under that is 172 GB and the new LV size is double that.
I'm going to keep poking at this, but I'm really hoping for some new ideas. Increase
the size of the OSDs enough to get them back up so I can rebuild them with a different
layout, delete some data I don't care about, pull the data off and put it back to
defragment... I don't care which, so long as I get it back up.
Thanks
-paul
We may have found a way out of the jam. ceph-bluestore-tool's bluefs-bdev-migrate is
successfully moving data onto another LV, and then we can manually start the OSDs and
get the captive PGs out. It is not a fix I would trust beyond getting out of jail, and I
fully plan on blowing away the NVMe drives and recreating them...
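Roughly what that looks like for one OSD (device paths are examples; the OSD is stopped while migrating):

```shell
# Migrate bluefs data from the full device onto a fresh LV so the OSD
# has room to start; after this the OSD can come up and drain its PGs.
ceph-bluestore-tool bluefs-bdev-migrate \
    --path /var/lib/ceph/osd/ceph-412 \
    --devs-source /var/lib/ceph/osd/ceph-412/block \
    --dev-target /dev/ceph-rescue-vg/osd-412-new
```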
________________________________________
From: Paul Mezzanini <pfmeec(a)rit.edu>
Sent: Wednesday, January 13, 2021 4:56 PM
To: ceph-users(a)ceph.io
Subject: [ceph-users] OSDs in pool full : can't restart to clean