crashing OSDs: ceph_assert(h->file->fnode.ino != 1)

List overview All Threads
Download

newer

older

Cephadm Hangs During OSD Apply

Re: ceph orch upgrade stuck at the...

Harald Staub

28 May 2020 28 May '20

11:51 p.m.

This is again about our bad cluster, with too much objects, and the hdd OSDs have a DB device that is (much) too small (e.g. 20 GB, i.e. 3 GB usable). Now several OSDs do not come up any more. Typical error message: /build/ceph-14.2.8/src/os/bluestore/BlueFS.cc: 2261: FAILED ceph_assert(h->file->fnode.ino != 1) Also just tried to add a few GB to the DB device (lvextend, ceph-bluestore-tool bluefs-bdev-expand), but this also crashes, also with this message. Options that helped us before (thanks Wido :-) do not help here, e.g. CEPH_ARGS="--bluestore-rocksdb-options compaction_readahead_size=0" ceph-kvstore-tool bluestore-kv /var/lib/ceph/osd/ceph-$OSD compact Any ideas that I could try to save these OSDs? Cheers Harry

Show replies by date

Simon Leinen

29 May 29 May

3:05 a.m.

Colleague of Harry's here... Harald Staub writes:

...

This is again about our bad cluster, with too much objects, and the hdd OSDs have a DB device that is (much) too small (e.g. 20 GB, i.e. 3 GB usable). Now several OSDs do not come up any more.

...

Typical error message: /build/ceph-14.2.8/src/os/bluestore/BlueFS.cc: 2261: FAILED ceph_assert(h->file->fnode.ino != 1)

The context of that line is "we should never run out of log space here": // previously allocated extents. bool must_dirty = false; if (allocated < offset + length) { // we should never run out of log space here; see the min runway check // in _flush_and_sync_log. ceph_assert(h->file->fnode.ino != 1); So I guess we are violating that "should", and the Bluestore code doesn't handle that case. And the "min runway" check may not be reliable. Should we file a bug? Again, help on how to proceed would be greatly appreciated... -- Simon.

...

Also just tried to add a few GB to the DB device (lvextend, ceph-bluestore-tool bluefs-bdev-expand), but this also crashes, also with this message.

...

Options that helped us before (thanks Wido :-) do not help here, e.g. CEPH_ARGS="--bluestore-rocksdb-options compaction_readahead_size=0" ceph-kvstore-tool bluestore-kv /var/lib/ceph/osd/ceph-$OSD compact

...

Any ideas that I could try to save these OSDs?

> Cheers > Harry

Igor Fedotov

3:36 a.m.

Hi Simon, your analysis is correct, you've stepped into an unexpected state for BlueFS log. This is the second occurrence of the issue, the first one is mentioned at https://tracker.ceph.com/issues/45519 Looking if we can get out of this state and how to fix that... Thanks, Igor On 5/29/2020 1:05 PM, Simon Leinen wrote: > Colleague of Harry's here... > > Harald Staub writes: >> This is again about our bad cluster, with too much objects, and the >> hdd OSDs have a DB device that is (much) too small (e.g. 20 GB, i.e. 3 >> GB usable). Now several OSDs do not come up any more. >> Typical error message: >> /build/ceph-14.2.8/src/os/bluestore/BlueFS.cc: 2261: FAILED >> ceph_assert(h->file->fnode.ino != 1) > The context of that line is "we should never run out of log space here": > > // previously allocated extents. > bool must_dirty = false; > if (allocated < offset + length) { > // we should never run out of log space here; see the min runway check > // in _flush_and_sync_log. > ceph_assert(h->file->fnode.ino != 1); > > So I guess we are violating that "should", and the Bluestore code > doesn't handle that case. And the "min runway" check may not be > reliable. Should we file a bug? > > Again, help on how to proceed would be greatly appreciated...

Igor Fedotov

4:13 a.m.

Simon, Harry, so the log from the ticket I can see a huge ((400+ MB) bluefs log kept over many small non-adjustent extents. Presumably it was caused by either setting small bluefs_alloc_size or high disk space fragmentation or both. Now I'd like more details on your OSDs. Could you please collect OSD startup log with debug_bluefs set to 20? Also please run the following commands for broken OSD (need results only, no need to collect the log unless they're failing): ceph-bluestore-tool --path <path-to-osd> --command bluefs-bdev-sizes ceph-bluestore-tool --path <path-to-osd> --command free-score Thanks, Igor On 5/29/2020 1:05 PM, Simon Leinen wrote: > Colleague of Harry's here... > > Harald Staub writes: >> This is again about our bad cluster, with too much objects, and the >> hdd OSDs have a DB device that is (much) too small (e.g. 20 GB, i.e. 3 >> GB usable). Now several OSDs do not come up any more. >> Typical error message: >> /build/ceph-14.2.8/src/os/bluestore/BlueFS.cc: 2261: FAILED >> ceph_assert(h->file->fnode.ino != 1) > The context of that line is "we should never run out of log space here": > > // previously allocated extents. > bool must_dirty = false; > if (allocated < offset + length) { > // we should never run out of log space here; see the min runway check > // in _flush_and_sync_log. > ceph_assert(h->file->fnode.ino != 1); > > So I guess we are violating that "should", and the Bluestore code > doesn't handle that case. And the "min runway" check may not be > reliable. Should we file a bug? > > Again, help on how to proceed would be greatly appreciated...

Simon Leinen

7:38 a.m.

Dear Igor, thanks a lot for your assistance. We're still trying to bring OSDs back up... the cluster is not in a great shape right now.

...

so the log from the ticket I can see a huge ((400+ MB) bluefs log kept over many small non-adjustent extents.

...

Presumably it was caused by either setting small bluefs_alloc_size or high disk space fragmentation or both. Now I'd like more details on your OSDs.

...

Could you please collect OSD startup log with debug_bluefs set to 20?

Yes, I now have such a log from an OSD that crashed with the assertion in the subject after about 30 seconds. The log file is about 850'000 lines / 100 MB in size. How can I make it available to you?

...

Also please run the following commands for broken OSD (need results only, no need to collect the log unless they're failing):

...

ceph-bluestore-tool --path <path-to-osd> --command bluefs-bdev-sizes

---------------------------------------------------------------------- inferring bluefs devices from bluestore path slot 2 /var/lib/ceph/osd/ceph-46/block -> /dev/dm-7 slot 1 /var/lib/ceph/osd/ceph-46/block.db -> /dev/dm-17 1 : device size 0xa74c00000 : own 0x[2000~6b4bfe000] = 0x6b4bfe000 : using 0x6b4bfe000(27 GiB) 2 : device size 0x74702000000 : own 0x[37e3e600000~4a85400000] = 0x4a85400000 : using 0x4a85400000(298 GiB) ----------------------------------------------------------------------

...

ceph-bluestore-tool --path <path-to-osd> --command free-score

---------------------------------------------------------------------- block: { "fragmentation_rating": 0.84012572151981013 } bluefs-db: { "fragmentation_rating": -nan } failure querying 'bluefs-wal' 2020-05-29 16:31:54.882 7fec3c89cd80 -1 asok(0x55c4ec574000) AdminSocket: request '{"prefix": "bluestore allocator score bluefs-wal"}' not defined ---------------------------------------------------------------------- See anything interesting? -- Simon.

...

Thanks,

...

Igor

...

On 5/29/2020 1:05 PM, Simon Leinen wrote: > Colleague of Harry's here... > > Harald Staub writes: >> This is again about our bad cluster, with too much objects, and the >> hdd OSDs have a DB device that is (much) too small (e.g. 20 GB, i.e. 3 >> GB usable). Now several OSDs do not come up any more. >> Typical error message: >> /build/ceph-14.2.8/src/os/bluestore/BlueFS.cc: 2261: FAILED >> ceph_assert(h->file->fnode.ino != 1) > The context of that line is "we should never run out of log space here": > > // previously allocated extents. > bool must_dirty = false; > if (allocated < offset + length) { > // we should never run out of log space here; see the min runway check > // in _flush_and_sync_log. > ceph_assert(h->file->fnode.ino != 1); > > So I guess we are violating that "should", and the Bluestore code > doesn't handle that case. And the "min runway" check may not be > reliable. Should we file a bug? > > Again, help on how to proceed would be greatly appreciated...

Igor Fedotov

9:20 a.m.

Simon, Harald thanks for the information. Got your log offline too. Here is a brief analysis: 1) Your DB is pretty large - 27GB at DB device (making it full) and 279GB at main spinning one. I.e. RocksDB is experiencing huge spillover to slow main device - expect performance drop. And generally DB is highly under-provisioned. 2) Main device space is highly fragmented - 0.84012572151981013 where 1.0 is the maximum. Can't say for sure but I presume it's pretty full as well. The above are indirect factors resulting in the current failure though, primarily I just want to make you aware since this might cause other issues later on. The major reason preventing OSD from proper starting is BlueFS attempt to claim additional space (~52GB), see in the log: 2020-05-29 16:26:53.507 7f0a9f78ec00 10 bluefs _expand_slow_device expanding slow device by 0xc36040000 which results in a lot of pretty short allocations (remember fragmented space as per above) which in its turn cause pretty large write to bluefs log. The latter has a preallocated spare space of 4MB which isn't enough to keep all the update and hence asserts. I can suggest the following workarounds to start the OSD for now: 1) switch allocator to stupid by setting 'bluestore allocator' parameter to 'stupid'. Presume you have default setting of 'bitmap' now.. This will allow more continuous allocations for bluefs space claim. and hence shorter log write. But given high main disk fragmentation this might be not enough. 'stupid' allocator has some issues (e.g. high RAM utilization over time in some cases) as well but they're rather irrelevant for OSD startup. 2) Increase 'bluefs_max_log_runway' parameter to 8-12 MB (with the default value at 4MB). Suggest to start with 1) and then additionally proceed with 2) if the first one doesn't help. Once OSD is up and cluster is healthy please consider adding more DB space and/or OSDs to your cluster to fight dangerous factors I started with. BTW wondering what payload is the primarily one for you cluster - RGW or something else? Hope this helps. Thanks, Igor On 5/29/2020 5:38 PM, Simon Leinen wrote: > Dear Igor, > > thanks a lot for your assistance. We're still trying to bring OSDs back > up... the cluster is not in a great shape right now. > >> so the log from the ticket I can see a huge ((400+ MB) bluefs log >> kept over many small non-adjustent extents. >> Presumably it was caused by either setting small bluefs_alloc_size or >> high disk space fragmentation or both. Now I'd like more details on >> your OSDs. >> Could you please collect OSD startup log with debug_bluefs set to 20? > Yes, I now have such a log from an OSD that crashed with the assertion > in the subject after about 30 seconds. The log file is about 850'000 > lines / 100 MB in size. How can I make it available to you? > >> Also please run the following commands for broken OSD (need results >> only, no need to collect the log unless they're failing): >> ceph-bluestore-tool --path <path-to-osd> --command bluefs-bdev-sizes > ---------------------------------------------------------------------- > inferring bluefs devices from bluestore path > slot 2 /var/lib/ceph/osd/ceph-46/block -> /dev/dm-7 > slot 1 /var/lib/ceph/osd/ceph-46/block.db -> /dev/dm-17 > 1 : device size 0xa74c00000 : own 0x[2000~6b4bfe000] = 0x6b4bfe000 : using 0x6b4bfe000(27 GiB) > 2 : device size 0x74702000000 : own 0x[37e3e600000~4a85400000] = 0x4a85400000 : using 0x4a85400000(298 GiB) > ---------------------------------------------------------------------- > >> ceph-bluestore-tool --path <path-to-osd> --command free-score > ---------------------------------------------------------------------- > block: > { > "fragmentation_rating": 0.84012572151981013 > } > > bluefs-db: > { > "fragmentation_rating": -nan > } > > failure querying 'bluefs-wal' > 2020-05-29 16:31:54.882 7fec3c89cd80 -1 asok(0x55c4ec574000) AdminSocket: request '{"prefix": "bluestore allocator score bluefs-wal"}' not defined > ---------------------------------------------------------------------- > > See anything interesting?

Simon Leinen

10:19 a.m.

Dear Igor, thanks a lot for the analysis and recommendations.

...

Here is a brief analysis:

...

1) Your DB is pretty large - 27GB at DB device (making it full) and 279GB at main spinning one. I.e. RocksDB is experiencing huge spillover to slow main device - expect performance drop. And generally DB is highly under-provisioned.

Yes, we have known about this issue for a long time. This cluster and in particular its SSD devices were dimensioned in the pre-Bluestore days. We haven't yet found a viable migration path towards something more sensible (with ~1500 OSDs on two separate clusters and quite a bit of user data on them).

...

2) Main device space is highly fragmented - 0.84012572151981013 where 1.0 is the maximum. Can't say for sure but I presume it's pretty full as well.

Not too full: $ ceph osd df | sort -n ID CLASS WEIGHT REWEIGHT SIZE RAW USE DATA OMAP META AVAIL %USE VAR PGS STATUS [...] 3 hdd 7.27699 1.00000 7.3 TiB 4.6 TiB 4.3 TiB 49 MiB 319 GiB 2.7 TiB 63.46 1.10 0 down

...

The above are indirect factors resulting in the current failure though, primarily I just want to make you aware since this might cause other issues later on.

Thanks.

...

The major reason preventing OSD from proper starting is BlueFS attempt to claim additional space (~52GB), see in the log:

[...]

...

I can suggest the following workarounds to start the OSD for now:

...

1) switch allocator to stupid by setting 'bluestore allocator' parameter to 'stupid'. Presume you have default setting of 'bitmap' now.. This will allow more continuous allocations for bluefs space claim. and hence shorter log write. But given high main disk fragmentation this might be not enough. 'stupid' allocator has some issues (e.g. high RAM utilization over time in some cases) as well but they're rather irrelevant for OSD startup.

Thanks, we'll try that & report.

...

2) Increase 'bluefs_max_log_runway' parameter to 8-12 MB (with the default value at 4MB).

...

Suggest to start with 1) and then additionally proceed with 2) if the first one doesn't help.

...

Once OSD is up and cluster is healthy please consider adding more DB space and/or OSDs to your cluster to fight dangerous factors I started with.

...

BTW wondering what payload is the primarily one for you cluster - RGW or something else?

The payload has changed over the lifetime of the cluster (which has been in operation for more than four years, growing and being upgraded). Initially it was almost exclusively RBD (for OpenStack VMs), then we added RadosGW (still all with 3-way replication). As RadosGW/S3 became more popular, we added an EC 8+3 pool. (We also added an NVMe-only pool, which is used for RadosGW indexes.) Lately this EC 8+3 pool has become very popular, and users have been storing hundreds of Terabyte on it. Unfortunately they tend to use a small object size (~1MB per object). That's why we have close to a billion objects in the EC pool now, and things start to fail. As I said, it's a problem of finding a viable migration path to a better configuration. Unfortunately we cannot just throw away the current installation and start from scratch... Cheers, -- Simon.

Simon Leinen

2 Jun 2 Jun

12:55 p.m.

Simon Leinen writes:

...

> I can suggest the following workarounds to start the OSD for now:

...

> 1) switch allocator to stupid by setting 'bluestore allocator' > parameter to 'stupid'. Presume you have default setting of 'bitmap' > now.. This will allow more continuous allocations for bluefs space > claim. and hence shorter log write. But given high main disk > fragmentation this might be not enough. 'stupid' allocator has some > issues (e.g. high RAM utilization over time in some cases) as well but > they're rather irrelevant for OSD startup.

...

Thanks, we'll try that & report.

Using the "stupid" allocator, we never had any crashes with this assert. But the OSDs run more slowly this way. So what we ended up doing was: When an OSD crashed with this assert, we did an offline compaction of the DB, and then started it again with the bitmap allocator. So far the resulting OSDs seem to run fine.

...

> 2) Increase 'bluefs_max_log_runway' parameter to 8-12 MB (with the > default value at 4MB).

That looks helpful too, thanks!

...

> Suggest to start with 1) and then additionally proceed with 2) if the > first one doesn't help.

...

> Once OSD is up and cluster is healthy please consider adding more DB > space and/or OSDs to your cluster to fight dangerous factors I started > with.

Today we added some disks to existing servers (replacing old disks that had failed over the years - we don't replace them right away), and will create additional OSDs to take some load off the existing ones. We'll also try to get rid of some of the EC buckets with very high numbers of objects, again to reduce the load of the OSD DBs. Thanks again for your support, -- Simon.

Simon Leinen

1:01 p.m.

Sorry for following up on myself (again), but I had left out an important detail: Simon Leinen writes:

...

Using the "stupid" allocator, we never had any crashes with this assert. But the OSDs run more slowly this way.

...

So what we ended up doing was: When an OSD crashed with this assert, we did an offline compaction of the DB, and then started it again with the bitmap allocator. So far the resulting OSDs seem to run fine.

For the offline compaction, we used the "stupid" allocator, i.e. sudo env CEPH_ARGS="--bluestore-allocator stupid" ceph-kvstore-tool bluestore-kv /var/lib/ceph/osd/ceph-$OSD compact With the default "bitmap" allocator, the compaction job would fail with the same ceph_assert(). -- Simon.

Simon Leinen

12:59 p.m.

Igor Fedotov writes:

...

2) Main device space is highly fragmented - 0.84012572151981013 where 1.0 is the maximum. Can't say for sure but I presume it's pretty full as well.

As I said, these disks aren't that full as far as bytes are concerned. But they do have a lot of objects on them! As I said we have hundreds of millions of ~1MB objects in an EC 8+3 pool, which means that there are billions of 125KB fragments on the OSDs. About fragmentation: Is there a way to get rid of it? Does scrubbing/deep scrubbing help? -- Simon.

Igor Fedotov

5 Jun 5 Jun

3:07 p.m.

Hi Simon, On 6/2/2020 10:59 PM, Simon Leinen wrote:

...

Igor Fedotov writes:

2) Main device space is highly fragmented - 0.84012572151981013 where 1.0 is the maximum. Can't say for sure but I presume it's pretty full as well.

Can name anything but OSD redeployment. In upcoming pacific release this issue will hopefully be less probably due to switching to hybrid allocator. Which cares more about continuous allocations.

...

Does scrubbing/deep scrubbing help?

No, this definitely wouldn't help. These are just data consistency verification, i.e. reading + csum verification.

1435

days inactive

1442

days old

ceph-users@ceph.io

Manage subscription

10 comments

3 participants

tags (0)

participants (3)

Harald Staub
Igor Fedotov
Simon Leinen