Hi Igor,
I just want to thank you for taking the time to help with this issue.
On 3/18/20 5:30 AM, Igor Fedotov wrote:
>>> Most probably you will need an additional 30GB of free
>>> space per OSD
>>> if going this way. So please let me know if you can afford this.
>> Well, I had already increased 709's initial space from 106GB to 200GB, and
>> now I gave it 10GB more, but it still cannot actually resize. Here is
>> the relevant information, I think, but the full log is here [0]. I then
>> did it with 30G (now a total of 240G) and it still failed [1]. I am out of
>> space without some additional hardware in this node, though I have an
>> idea. If I knew what size it is (and what space it needs for recovery),
>> this would be very helpful.
> There is not much sense in increasing the main device for this specific OSD
> (and similarly failing ones, i.e. OSDs mentioning RocksDB recovery in the
> backtrace) at this point.
> It's in the "deadlock" state I mentioned before, and hence the expand is
> unable to proceed.
> I'm checking some workarounds to get out of this state at the moment.
> Still in progress though.
> What I meant before is that you would need more available space if the
> workaround turns out to be assigning a new standalone DB volume. It's a
> questionable approach, so I'm trying other ways for now.
I had to go forward with this and migrate the DB to a separate partition,
as this was impacting production. I did some testing first by making a copy
(dd'ing to a new LVM volume) of one of the OSDs and performing the steps on
it before trying the real thing. The essential steps (understanding this is
a Nautilus cluster) for anyone who comes across this in the future, run for
each affected OSD:
1) create new partition to hold the db volume
lvcreate -L30G -n db-20-6 /dev/ceph-db-vol04
2) migrate with the ceph-bluestore-tool
ceph-bluestore-tool bluefs-bdev-migrate --path
/var/lib/ceph/osd/ceph-715 --dev-target /dev/ceph-db-vol04/db-20-6
--devs-source /var/lib/ceph/osd/ceph-715/block
3) make sure the db symlink and device are owned by the ceph user
chown -h ceph:ceph /var/lib/ceph/osd/ceph-715/block.db
chown ceph:ceph /dev/ceph-db-vol04/db-20-6
4) run the ceph-bluestore-tool repair
ceph-bluestore-tool --log-level 30 --path /var/lib/ceph/osd/ceph-715
--command repair
5) test compaction
ceph-kvstore-tool bluestore-kv /var/lib/ceph/osd/ceph-715 compact
6) start the OSD
systemctl start ceph-osd@715
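For anyone who needs to repeat this across many OSDs, the six steps above
can be sketched as one small shell function. This is only a sketch, not a
tested tool: the VG name (ceph-db-vol04), the 30G size, and the LV naming
scheme are taken from or modeled on the commands above and will need
adjusting for your cluster. With DRY_RUN=1 (the default here) it only
prints the commands instead of running them.

```shell
#!/bin/sh
# Sketch of the per-OSD DB migration steps above. DRY_RUN=1 (default)
# prints commands; set DRY_RUN=0 only after reviewing them.
DRY_RUN=${DRY_RUN:-1}

run() {
    if [ "$DRY_RUN" = 1 ]; then echo "+ $*"; else "$@"; fi
}

migrate_osd_db() {
    id=$1
    lv="db-20-$id"                       # illustrative LV naming scheme
    osd="/var/lib/ceph/osd/ceph-$id"
    # 1) create the new LV for the DB
    run lvcreate -L30G -n "$lv" /dev/ceph-db-vol04
    # 2) migrate the DB off the main device
    run ceph-bluestore-tool bluefs-bdev-migrate --path "$osd" \
        --dev-target "/dev/ceph-db-vol04/$lv" \
        --devs-source "$osd/block"
    # 3) fix ownership of the symlink and the device
    run chown -h ceph:ceph "$osd/block.db"
    run chown ceph:ceph "/dev/ceph-db-vol04/$lv"
    # 4) repair, 5) compact, 6) restart
    run ceph-bluestore-tool --log-level 30 --path "$osd" --command repair
    run ceph-kvstore-tool bluestore-kv "$osd" compact
    run systemctl start "ceph-osd@$id"
}
```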
The lesson is that BlueStore does not handle hitting capacity like this
well with the simple configuration of co-located RocksDB and data. I
think that even for fast disks such as NVMe you should always create a
separate DB partition, as this deadlock scenario is very problematic if
you don't have additional storage (or can't add some quickly).
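On the sizing question: one way to see how much space RocksDB actually
occupies before carving out the new LV is the BlueFS perf counters on the
OSD's admin socket. A sketch, assuming a running OSD and the Nautilus-era
counter names (`db_used_bytes` / `db_total_bytes` under `bluefs` in
`perf dump`); the helper just formats that JSON:

```shell
# Print BlueFS DB usage from "perf dump bluefs" JSON on stdin.
# Counter names are assumed from Nautilus-era output.
bluefs_db_usage() {
    python3 -c '
import json, sys
b = json.load(sys.stdin)["bluefs"]
gib = 2.0 ** 30
print("db used %.1f GiB of %.1f GiB"
      % (b["db_used_bytes"] / gib, b["db_total_bytes"] / gib))
'
}

# Usage on the OSD host (requires admin socket access):
# ceph daemon osd.715 perf dump bluefs | bluefs_db_usage
```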
It seems that we had about 2 million CephFS log segments that were behind
on trimming. I am not sure where these segments are kept, but I am
guessing it is in the MDS metadata pool, which seems to have driven this
maximum-space issue. We are now down to about 17% used in the nvme
device class, when we were at 100% during this issue.
Thanks,
derek
--
Derek T. Yarnell
Director of Computing Facilities
University of Maryland
Institute for Advanced Computer Studies