Hi!
We've recently upgraded all our clusters from Mimic to Octopus (15.2.4). Since
then, our largest cluster is experiencing random crashes on OSDs attached to the
mon hosts.
This is the crash we are seeing (cut for brevity, see links in post scriptum):
{
"ceph_version": "15.2.4",
"utsname_release": "4.15.0-72-generic",
"assert_condition": "r == 0",
"assert_func": "void BlueStore::_txc_apply_kv(BlueStore::TransContext*, bool)",
"assert_file": "/build/ceph-15.2.4/src/os/bluestore/BlueStore.cc <http://bluestore.cc/>",
"assert_line": 11430,
"assert_thread_name": "bstore_kv_sync",
"assert_msg": "/build/ceph-15.2.4/src/os/bluestore/BlueStore.cc <http://bluestore.cc/>: In function 'void BlueStore::_txc_apply_kv(BlueStore::TransContext*, bool)' thread 7fc56311a700 time 2020-08-26T08:52:24.917083+0200\n/build/ceph-15.2.4/src/os/bluestore/BlueStore.cc <http://bluestore.cc/>: 11430: FAILED ceph_assert(r == 0)\n",
"backtrace": [
"(()+0x12890) [0x7fc576875890]",
"(gsignal()+0xc7) [0x7fc575527e97]",
"(abort()+0x141) [0x7fc575529801]",
"(ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x1a5) [0x559ef9ae97b5]",
"(ceph::__ceph_assertf_fail(char const*, char const*, int, char const*, char const*, ...)+0) [0x559ef9ae993f]",
"(BlueStore::_txc_apply_kv(BlueStore::TransContext*, bool)+0x3a0) [0x559efa0245b0]",
"(BlueStore::_kv_sync_thread()+0xbdd) [0x559efa07745d]",
"(BlueStore::KVSyncThread::entry()+0xd) [0x559efa09cd3d]",
"(()+0x76db) [0x7fc57686a6db]",
"(clone()+0x3f) [0x7fc57560a88f]"
]
}
Right before the crash occurs, we see the following message in the crash log:
-3> 2020-08-26T08:52:24.787+0200 7fc569b2d700 2 rocksdb: [db/db_impl_compaction_flush.cc:2212] Waiting after background compaction error: Corruption: block checksum mismatch: expected 2548200440, got 2324967102 in db/815839.sst offset 67107066 size 3808, Accumulated background error counts: 1
-2> 2020-08-26T08:52:24.852+0200 7fc56311a700 -1 rocksdb: submit_common error: Corruption: block checksum mismatch: expected 2548200440, got 2324967102 in db/815839.sst offset 67107066 size 3808 code = 2 Rocksdb transaction:
In short, when this happens we see a RocksDB corruption error right after a background compaction.
When an OSD crashes, which happens about 10-15 times a day, it restarts and
resumes work without any further problems.
We are pretty confident that this is not a hardware issue, due to the following facts:
* The crashes occur on 5 different hosts over 3 different racks.
* There is no smartctl/dmesg output that could explain it.
* It usually hits an OSD that has not crashed before.
Still, we checked the following on a few OSDs/hosts (commands sketched below):
* We can do a manual compaction, both offline and online.
* We successfully ran "ceph-bluestore-tool fsck --deep yes" on one of the OSDs.
* We manually compacted a number of OSDs, one of which crashed hours later.
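For reference, this is roughly how we ran those checks (a sketch; the OSD id and
data path are examples, and the OSD is stopped before the offline steps):

    # offline deep fsck and compaction of one OSD
    systemctl stop ceph-osd@12
    ceph-bluestore-tool fsck --path /var/lib/ceph/osd/ceph-12 --deep yes
    ceph-kvstore-tool bluestore-kv /var/lib/ceph/osd/ceph-12 compact
    systemctl start ceph-osd@12
    # online compaction via the admin socket
    ceph daemon osd.12 compact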
The only thing we have noticed so far: It only happens to OSDs that are attached
to a mon host. *None* of the non-mon host OSDs have had a crash!
Does anyone have a hint as to what could be causing this? We currently have no good
theory that explains it, much less a fix or workaround.
Any help would be greatly appreciated.
Denis
Crash: https://public-resources.objects.lpg.cloudscale.ch/osd-crash/meta.txt
Log: https://public-resources.objects.lpg.cloudscale.ch/osd-crash/log.txt
Hi everyone, a bucket went over quota (we have a default quota of 300k objects per bucket), so I enabled the per-bucket object quota and set it to 600k objects.
We are on Luminous (12.2.12) with dynamic resharding disabled, so I manually resharded the bucket from 3 to 6 shards.
Since then, radosgw-admin bucket stats reports an `rgw.none` entry in the usage section for this bucket.
I searched the mailing lists, Bugzilla and GitHub; it looks like I can ignore the rgw.none stats (0-byte objects, entries left in the index marked as cancelled, ...),
but the num_objects in rgw.none counts towards the quota usage.
I bumped the quota to 800k objects to work around the problem (without resharding); the commands are below.
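For reference, this is roughly what I ran (the bucket name is a placeholder):

    # manual reshard from 3 to 6 shards
    radosgw-admin bucket reshard --bucket=mybucket --num-shards=6
    # bump the per-bucket object quota and make sure it is enabled
    radosgw-admin quota set --quota-scope=bucket --bucket=mybucket --max-objects=800000
    radosgw-admin quota enable --quota-scope=bucket --bucket=mybucket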
Is there a way I can garbage collect the rgw.none?
Is this problem fixed in Mimic/Nautilus/Octopus?
"usage": {
"rgw.none": {
"size": 0,
"size_actual": 0,
"size_utilized": 0,
"size_kb": 0,
"size_kb_actual": 0,
"size_kb_utilized": 0,
"num_objects": 417827
},
"rgw.main": {
"size": 1390778138502,
"size_actual": 1391581007872,
"size_utilized": 1390778138502,
"size_kb": 1358181776,
"size_kb_actual": 1358965828,
"size_kb_utilized": 1358181776,
"num_objects": 305637
}
},
Thanks!
Hi all,
I tried to set a bucket quota using the admin API as shown below:
admin/user?quota&uid=bse&bucket=test&quota-type=bucket
with payload in json format:
{
"enabled": true,
"max_size": 1099511627776,
"max_size_kb": 1073741824,
"max_objects": -1
}
It returned success, but the quota change did not happen, as confirmed by the
'radosgw-admin bucket stats --bucket=test' command.
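For completeness, the request is issued roughly like this (a sketch only: it
assumes curl 7.75+ for --aws-sigv4 and an admin user with the appropriate caps;
endpoint, region and credentials are placeholders):

    RGW=http://rgw.example.com:8080
    curl -s -X PUT \
         --user "$ACCESS_KEY:$SECRET_KEY" \
         --aws-sigv4 "aws:amz:us-east-1:s3" \
         -H 'Content-Type: application/json' \
         -d '{"enabled": true, "max_size_kb": 1073741824, "max_objects": -1}' \
         "$RGW/admin/user?quota&uid=bse&bucket=test&quota-type=bucket"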
Am I missing something obvious? Please kindly advise/suggest.
By the way, I am using ceph mimic (v13.2.4). Setting quota by radosgw-admin
quota set --bucket=${BUCK} --max-size=1T --quota-scope=bucket works, but I
want to do it programmatically.
Thanks in advance,
-Youzhong
My 3-node Ceph cluster (14.2.4) has been running fine for months. However,
my data pool came close to full a couple of weeks ago, so I added 12 new
OSDs, roughly doubling the capacity of the cluster. Despite that, the pool size
has not changed, and the health of the cluster has changed for the worse.
The dashboard shows the following cluster status:
- PG_DEGRADED_FULL: Degraded data redundancy (low space): 2 pgs
backfill_toofull
- POOL_NEARFULL: 6 pool(s) nearfull
- OSD_NEARFULL: 1 nearfull osd(s)
Output from ceph -s:
cluster:
id: e5a47160-a302-462a-8fa4-1e533e1edd4e
health: HEALTH_ERR
1 nearfull osd(s)
6 pool(s) nearfull
Degraded data redundancy (low space): 2 pgs backfill_toofull
services:
mon: 3 daemons, quorum ceph01,ceph02,ceph03 (age 5w)
mgr: ceph01(active, since 4w), standbys: ceph03, ceph02
mds: cephfs:1 {0=ceph01=up:active} 2 up:standby
osd: 33 osds: 33 up (since 43h), 33 in (since 43h); 1094 remapped pgs
rgw: 3 daemons active (ceph01, ceph02, ceph03)
data:
pools: 6 pools, 1632 pgs
objects: 134.50M objects, 7.8 TiB
usage: 42 TiB used, 81 TiB / 123 TiB avail
pgs: 213786007/403501920 objects misplaced (52.983%)
1088 active+remapped+backfill_wait
538 active+clean
4 active+remapped+backfilling
2 active+remapped+backfill_wait+backfill_toofull
io:
recovery: 477 KiB/s, 330 keys/s, 29 objects/s
Can someone steer me in the right direction for how to get my cluster
healthy again?
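If it helps, I can share the output of the following (a sketch of what I am
collecting; nothing cluster-specific is assumed):

    ceph health detail       # which OSD is nearfull, which PGs are backfill_toofull
    ceph osd df tree         # per-OSD utilization and weights
    ceph osd pool ls detail  # pool settings and pg_num
    ceph balancer status     # whether the balancer is enabled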
Thanks in advance!
-Dallas
Hi list,
We had some stuck ops on our MDS. In order to figure out why, we looked
up the documentation. The first thing it mentions is the following:
ceph daemon mds.<name> dump cache /tmp/dump.txt
Our MDS had 170 GB in cache at that moment.
Turns out that is a sure way to get your active MDS replaced by a standby.
Is this supposed to work on an MDS with a large cache? If not, then a
big warning against running this on MDSes with large caches
would be appropriate.
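For what it's worth, this is roughly what we now run first instead of a full
cache dump (a sketch; <name> is the MDS daemon name, and the admin socket
commands below are assumed to exist on recent releases):

    # how big is the cache, before deciding whether to dump it
    ceph daemon mds.<name> cache status
    # look at the stuck requests directly
    ceph daemon mds.<name> dump_ops_in_flight
    ceph daemon mds.<name> dump_blocked_ops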
Gr. Stefan
P.s. I think our only option was to get the active restarted at that
point, but still.
Hi,
A few weeks ago several of our rbd images became unresponsive after a few of our OSDs reached a near full state.
Another member of the team rebooted the server that the rbd images are mounted on in an attempt to resolve the issue.
In the meantime I added several more nodes to the cluster in order to get additional space.
Here are some cluster details:
root@rbd1:/var/log# ceph -v
ceph version 12.2.12 (1436006594665279fe734b4c15d7e08c13ebd777) luminous (stable)
root@rbd1:/var/log# uname -va
Linux rbd1 4.15.0-48-generic #51~16.04.1-Ubuntu SMP Fri Apr 5 12:01:12 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux
root@rbd1:/var/log# ceph -s
cluster:
id: 504b5794-34bd-44e7-a8c3-0494cf800c23
health: HEALTH_ERR
crush map has legacy tunables (require argonaut, min is firefly)
full ratio(s) out of order
2091546/274437905 objects misplaced (0.762%)
Reduced data availability: 114 pgs inactive
1 slow requests are blocked > 32 sec. Implicated osds 152
4 stuck requests are blocked > 4096 sec. Implicated osds 6,99
services:
mon: 3 daemons, quorum hqceph1,hqceph2,hqceph3
mgr: hqceph2(active), standbys: hqceph3
osd: 291 osds: 283 up, 265 in; 116 remapped pgs
rgw: 1 daemon active
data:
pools: 17 pools, 8199 pgs
objects: 91.48M objects, 292TiB
usage: 880TiB used, 758TiB / 1.60PiB avail
pgs: 1.390% pgs not active
2091546/274437905 objects misplaced (0.762%)
8040 active+clean
114 activating+remapped
40 active+clean+scrubbing+deep
3 active+clean+scrubbing
2 active+remapped+backfilling
io:
recovery: 41.4MiB/s, 12objects/s
This morning I got on the server in order to map and mount the 5 or so RBD images that are shared out via samba on this server.
After waiting about 10 minutes it was clear something was not 100% correct. Here is what I found in the kern.log file:
Aug 31 11:43:16 rbd1 kernel: [2158818.570948] libceph: mon0 10.35.1.201:6789 session established
Aug 31 11:43:16 rbd1 kernel: [2158818.576306] libceph: client182797617 fsid 504b5794-34bd-44e7-a8c3-0494cf800c23
Aug 31 11:43:16 rbd1 kernel: [2158818.710199] rbd: rbd0: capacity 54975581388800 features 0x0
Aug 31 11:43:22 rbd1 kernel: [2158824.685353] XFS (rbd0): Mounting V4 Filesystem
Aug 31 11:44:19 rbd1 kernel: [2158881.536794] XFS (rbd0): Starting recovery (logdev: internal)
Aug 31 11:47:06 rbd1 kernel: [2159048.202835] INFO: task mount:33177 blocked for more than 120 seconds.
Aug 31 11:47:06 rbd1 kernel: [2159048.203053] Not tainted 4.15.0-48-generic #51~16.04.1-Ubuntu
Aug 31 11:47:06 rbd1 kernel: [2159048.203260] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Aug 31 11:47:06 rbd1 kernel: [2159048.203523] mount D 0 33177 33011 0x00000000
Aug 31 11:47:06 rbd1 kernel: [2159048.203527] Call Trace:
Aug 31 11:47:06 rbd1 kernel: [2159048.203538] __schedule+0x3d6/0x8b0
Aug 31 11:47:06 rbd1 kernel: [2159048.203542] ? __switch_to_asm+0x34/0x70
Aug 31 11:47:06 rbd1 kernel: [2159048.203546] schedule+0x36/0x80
Aug 31 11:47:06 rbd1 kernel: [2159048.203551] schedule_timeout+0x1db/0x370
Aug 31 11:47:06 rbd1 kernel: [2159048.203629] ? xfs_trans_read_buf_map+0xf8/0x330 [xfs]
Aug 31 11:47:06 rbd1 kernel: [2159048.203634] wait_for_completion+0xb4/0x140
Aug 31 11:47:06 rbd1 kernel: [2159048.203637] ? wake_up_q+0x70/0x70
Aug 31 11:47:06 rbd1 kernel: [2159048.203691] ? xfs_trans_read_buf_map+0xf8/0x330 [xfs]
Aug 31 11:47:06 rbd1 kernel: [2159048.203740] ? _xfs_buf_read+0x23/0x30 [xfs]
Aug 31 11:47:06 rbd1 kernel: [2159048.203787] xfs_buf_submit_wait+0x7f/0x220 [xfs]
Aug 31 11:47:06 rbd1 kernel: [2159048.203839] ? xfs_trans_read_buf_map+0xf8/0x330 [xfs]
Aug 31 11:47:06 rbd1 kernel: [2159048.203887] _xfs_buf_read+0x23/0x30 [xfs]
Aug 31 11:47:06 rbd1 kernel: [2159048.203933] xfs_buf_read_map+0x10a/0x190 [xfs]
Aug 31 11:47:06 rbd1 kernel: [2159048.203985] xfs_trans_read_buf_map+0xf8/0x330 [xfs]
Aug 31 11:47:06 rbd1 kernel: [2159048.204022] xfs_read_agf+0x90/0x120 [xfs]
Aug 31 11:47:06 rbd1 kernel: [2159048.204058] xfs_alloc_read_agf+0x49/0x1d0 [xfs]
Aug 31 11:47:06 rbd1 kernel: [2159048.204094] xfs_alloc_pagf_init+0x29/0x60 [xfs]
Aug 31 11:47:06 rbd1 kernel: [2159048.204141] xfs_initialize_perag_data+0x99/0x110 [xfs]
Aug 31 11:47:06 rbd1 kernel: [2159048.204193] xfs_mountfs+0x79b/0x950 [xfs]
Aug 31 11:47:06 rbd1 kernel: [2159048.204243] ? xfs_mru_cache_create+0x12b/0x170 [xfs]
Aug 31 11:47:06 rbd1 kernel: [2159048.204294] xfs_fs_fill_super+0x428/0x5e0 [xfs]
Aug 31 11:47:06 rbd1 kernel: [2159048.204300] mount_bdev+0x246/0x290
Aug 31 11:47:06 rbd1 kernel: [2159048.204349] ? xfs_test_remount_options.isra.16+0x60/0x60 [xfs]
Aug 31 11:47:06 rbd1 kernel: [2159048.204398] xfs_fs_mount+0x15/0x20 [xfs]
Aug 31 11:47:06 rbd1 kernel: [2159048.204402] mount_fs+0x3d/0x150
Aug 31 11:47:06 rbd1 kernel: [2159048.204406] vfs_kern_mount+0x67/0x110
Aug 31 11:47:06 rbd1 kernel: [2159048.204410] do_mount+0x201/0xd00
Aug 31 11:47:06 rbd1 kernel: [2159048.204414] ? __check_object_size+0xfc/0x1a0
Aug 31 11:47:06 rbd1 kernel: [2159048.204418] SyS_mount+0x95/0xe0
Aug 31 11:47:06 rbd1 kernel: [2159048.204423] do_syscall_64+0x73/0x130
Aug 31 11:47:06 rbd1 kernel: [2159048.204426] entry_SYSCALL_64_after_hwframe+0x3d/0xa2
Aug 31 11:47:06 rbd1 kernel: [2159048.204429] RIP: 0033:0x7f77a9a4fc5a
Aug 31 11:47:06 rbd1 kernel: [2159048.204431] RSP: 002b:00007ffda3e83818 EFLAGS: 00000206 ORIG_RAX: 00000000000000a5
Aug 31 11:47:06 rbd1 kernel: [2159048.204435] RAX: ffffffffffffffda RBX: 0000000002424030 RCX: 00007f77a9a4fc5a
Aug 31 11:47:06 rbd1 kernel: [2159048.204436] RDX: 0000000002424210 RSI: 0000000002424310 RDI: 0000000002424230
Aug 31 11:47:06 rbd1 kernel: [2159048.204438] RBP: 0000000000000000 R08: 0000000002424280 R09: 0000000000000012
Aug 31 11:47:06 rbd1 kernel: [2159048.204440] R10: 00000000c0ed0c00 R11: 0000000000000206 R12: 0000000002424230
Aug 31 11:47:06 rbd1 kernel: [2159048.204441] R13: 0000000002424210 R14: 0000000000000000 R15: 0000000000000003
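My next step, unless someone advises otherwise, is roughly the following (a
sketch; the pool/image name is a placeholder):

    # which PGs are stuck inactive, and what they are waiting for
    ceph health detail
    ceph pg dump_stuck inactive
    # state and watchers of the image behind the hung mount
    rbd info rbd/samba_share01
    rbd status rbd/samba_share01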
Any suggestions on what I can/should do next?
Thanks in advance,
Shain
Shain Miley | Director of Platform and Infrastructure | Digital Media | smiley(a)npr.org
Hi,
today I noticed bad performance in our cluster. Running "watch ceph
osd perf | sort -hk 2 -r" I found that all BlueStore OSDs are slow on
commit and that their commit timings are equal to their apply timings:
For example
Every 2.0s: ceph osd perf | sort -hk 2 -r
(columns: osd id, commit latency in ms, apply latency in ms; the header row is swallowed by sort)
440 82 82
430 58 58
435 56 56
449 53 53
442 40 40
441 30 30
439 27 27
99 0 1
98 0 0
97 0 2
96 0 6
95 0 2
94 0 6
93 0 13
The ones with zero commit timings are FileStore OSDs and the others are
BlueStore OSDs.
I did not see this right after installing the new BlueStore OSDs (maybe it
only started later).
Both types of OSDs have NVMe devices for journal/DB. The servers have equal
CPUs/RAM etc.
The only BlueStore tuning we have applied is:
bluestore_block_db_size = 69793218560
bluestore_prefer_deferred_size_hdd = 524288
The intent was to get FileStore-like deferred-write behaviour, but that does
not seem to work.
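For what it's worth, this is how I am pulling more detailed latency counters
from one of the slow OSDs (a sketch; osd.440 is just an example, counter names
may differ slightly between releases, and jq is assumed to be installed):

    ceph daemon osd.440 perf dump | jq '.bluestore | {commit_lat, kv_sync_lat, kv_commit_lat}'
    # deferred-write counters, to see whether the tuning has any effect
    ceph daemon osd.440 perf dump | jq '.bluestore | {deferred_write_ops, deferred_write_bytes}'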
Any tips?
Regards Dennis
Hi friends,
Since the deployment of our Ceph cluster we have been plagued by slow
metadata errors.
Namely, the cluster goes into HEALTH_WARN with messages similar to these:
2 MDSs report slow metadata IOs
1 MDSs report slow requests
1 slow ops, oldest one blocked for 32 sec, daemons [osd.22,osd.4] have
slow ops.
Here is a brief overview of our setup:
- 7 OSD nodes with 6 OSD drives each
- three of those are also monitors, managers and MDS
- there is a single Ceph client (at the moment)
- there is only CephFS being used (at the moment)
- metadata for CephFS was on HDD; we moved it as suggested, but saw no
improvement
We do not expect this to be a RAM issue, as we have 64 GiB of memory and
it is never fully utilized.
It might be a CPU problem, as the issue happens mostly under high load
(a load of ~12 on an 8-core Intel Xeon Bronze 3106).
However, the load is present on all OSD nodes, not just MDS ones.
Cluster is used for (mostly nightly) backups and has no critical
performance requirement.
Interestingly, significant load across all nodes appears when running the
cleanup of outdated backups.
This boils down mostly to truncating files and some removals, but it is
usually a small number of large files.
Below you can find an example of "dump_ops_in_flight" output captured during
the problem (which you may find useful; I couldn't make sense out of it).
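For reference, the dump was captured roughly like this (the MDS name is a
placeholder):

    ceph daemon mds.mds01 dump_ops_in_flight
    # overall health at the time of the slowdown
    ceph health detail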
Should we invest in more powerful CPU hardware (or should we move the
MDS roles to more powerful nodes)?
Please let me know if I can share any more information to help resolve
this thing.
Thanks in advance!
Kind regards,
Momo.
===
{
"ops": [
{
"description": "client_request(client.22661659:706483006
create #0x10000002742/a-random-file 2020-08-23T23:09:33.919740+0200
caller_uid=117, caller_gid=121{})",
"initiated_at": "2020-08-23T23:09:33.926509+0200",
"age": 30.193027896,
"duration": 30.193083934000001,
"type_data": {
"flag_point": "failed to authpin, subtree is being
exported",
"reqid": "client.22661659:706483006",
"op_type": "client_request",
"client_info": {
"client": "client.22661659",
"tid": 706483006
},
"events": [
{
"time": "2020-08-23T23:09:33.926509+0200",
"event": "initiated"
},
{
"time": "2020-08-23T23:09:33.926510+0200",
"event": "throttled"
},
{
"time": "2020-08-23T23:09:33.926509+0200",
"event": "header_read"
},
{
"time": "2020-08-23T23:09:33.926516+0200",
"event": "all_read"
},
{
"time": "2020-08-23T23:09:33.926540+0200",
"event": "dispatched"
},
{
"time": "2020-08-23T23:09:33.926595+0200",
"event": "failed to authpin, subtree is being
exported"
}
]
}
}
],
"num_ops": 1
}
Hello,
On a Nautilus 14.2.8 cluster I am seeing large RocksDB databases with
many slow DB bytes in use.
To investigate this further I marked one OSD as out and waited for all
the backfilling to complete.
Once the backfilling had completed, I exported BlueFS and investigated
the RocksDB using 'ceph-kvstore-tool'. This resulted in 22 GB of data.
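A rough sketch of how the export and inspection were done (the OSD id and path
are examples, and the OSD was stopped and out at that point):

    systemctl stop ceph-osd@10
    # export the BlueFS contents; the RocksDB lives in the db/ subdirectory
    ceph-bluestore-tool bluefs-export --path /var/lib/ceph/osd/ceph-10 --out-dir /tmp/osd.10
    # list the keys and dump RocksDB statistics
    ceph-kvstore-tool rocksdb /tmp/osd.10/db list > /tmp/osd.10.keys
    ceph-kvstore-tool rocksdb /tmp/osd.10/db stats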
Listing all the keys in the RocksDB shows there are about 747,000 keys in
the DB. A small portion are osdmaps, but by far the largest number are keys
prefixed with 'b'.
I dumped the stats of the RocksDB and this shows me:
(level: files total/compacting: size)
L1:    1/0:  439.32 KB
L2:    1/0:    2.65 MB
L3:    5/0:   14.36 MB
L4:  127/0:    7.22 GB
L5:  217/0:   13.73 GB
Sum: 351/0:   20.98 GB
So there is almost 21GB of data in this RocksDB database. Why? Where is
this coming from?
Throughout this cluster, OSDs are suffering from many slow DB bytes used and
I can't figure out why.
Has anybody seen this, or does anyone have a clue what is going on?
I have an external copy of this RocksDB database to do investigations on.
Thank you,
Wido
Hi all,
can anyone help me with this? In Mimic, for any of these commands:
ceph osd [deep-]scrub ID
ceph pg [deep-]scrub ID
ceph pg repair ID
an operation is scheduled asynchronously. How can I check the following states:
1) Operation is pending (scheduled, not started).
2) Operation is running.
3) Operation has completed.
4) Exit code and error messages if applicable.
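For context, this is what I currently do by hand, which is why I am asking for
something better (a sketch; the PG id is just an example):

    # compare the scrub stamps before and after scheduling
    ceph pg 2.1a query | grep -E 'last_(deep_)?scrub_stamp'
    ceph pg deep-scrub 2.1a
    # a running (deep-)scrub or repair shows up in the PG state
    ceph pg dump pgs | grep -E 'scrubbing|repair'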
Many thanks!
=================
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14