On multiple clusters we are seeing the mgr hang frequently when the balancer is enabled. The balancer seems to get caught in some kind of infinite loop that chews up all of the mgr's CPU, which in turn causes problems for other modules like prometheus (we don't have the devicehealth module enabled yet).
I've been able to reproduce the issue doing an offline balance as well using the osdmaptool:
osdmaptool --debug-osd 10 osd.map --upmap balance-upmaps.sh --upmap-pool default.rgw.buckets.data --upmap-max 100
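(For reference, osd.map above is just the cluster's current map, exported the usual way with something like:
ceph osd getmap -o osd.map
so the loop reproduces entirely offline, without the mgr involved.)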
It seems to loop over the same group of ~7,000 PGs over and over again, like this, without finding any new upmaps that can be added:
2019-11-19 16:39:11.131518 7f85a156f300 10 trying 24.d91
2019-11-19 16:39:11.138035 7f85a156f300 10 trying 24.2e3c
2019-11-19 16:39:11.144162 7f85a156f300 10 trying 24.176b
2019-11-19 16:39:11.149671 7f85a156f300 10 trying 24.ac6
2019-11-19 16:39:11.155115 7f85a156f300 10 trying 24.2cb2
2019-11-19 16:39:11.160508 7f85a156f300 10 trying 24.129c
2019-11-19 16:39:11.166287 7f85a156f300 10 trying 24.181f
2019-11-19 16:39:11.171737 7f85a156f300 10 trying 24.3cb1
2019-11-19 16:39:11.177260 7f85a156f300 10 24.2177 already has pg_upmap_items [368,271]
2019-11-19 16:39:11.177268 7f85a156f300 10 trying 24.2177
2019-11-19 16:39:11.182590 7f85a156f300 10 trying 24.a4
2019-11-19 16:39:11.188053 7f85a156f300 10 trying 24.2583
2019-11-19 16:39:11.193545 7f85a156f300 10 24.93e already has pg_upmap_items [80,27]
2019-11-19 16:39:11.193553 7f85a156f300 10 trying 24.93e
2019-11-19 16:39:11.198858 7f85a156f300 10 trying 24.e67
2019-11-19 16:39:11.204224 7f85a156f300 10 trying 24.16d9
2019-11-19 16:39:11.209844 7f85a156f300 10 trying 24.11dc
2019-11-19 16:39:11.215303 7f85a156f300 10 trying 24.1f3d
2019-11-19 16:39:11.221074 7f85a156f300 10 trying 24.2a57
While this cluster is running Luminous (12.2.12), I've reproduced the loop using the same osdmap on Nautilus (14.2.4). Is there somewhere I can privately upload the osdmap for someone to troubleshoot the problem?
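(If ceph-post-file is still the preferred way to get files to the developers privately, I'm happy to use that, e.g. something like:
ceph-post-file -d "osdmap reproducing balancer upmap loop" osd.map
otherwise just point me at the right place.)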
Thanks,
Bryan
Hi everyone,
I'm looking for some advice on diagnosing an OSD issue.
We have a Mimic cluster, not very full, with Bluestore OSDs.
We recently had to bring the cluster down to allow power testing in the host datacentre, and when we brought things up again, 1 OSD daemon would not start.
The log shows (cut to useful context):
-314> 2019-11-21 15:55:15.561 7efdc049dd80 4 rocksdb: Options.ttl: 0
-314> 2019-11-21 15:55:15.563 7efdc049dd80 4 rocksdb: [/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/13.2.6/rpm/el7/BUILD/ceph-13.2.6/src/rocksdb/db/version_set.cc:3362] Recovered from manifest file:db/MANIFEST-000127 succeeded,manifest_file_number is 127, next_file_number is 264, last_sequence is 21956004, log_number is 0,prev_log_number is 0,max_column_family is 0,deleted_log_number is 123
-314> 2019-11-21 15:55:15.563 7efdc049dd80 4 rocksdb: [/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/13.2.6/rpm/el7/BUILD/ceph-13.2.6/src/rocksdb/db/version_set.cc:3370] Column family [default] (ID 0), log number is 255
-314> 2019-11-21 15:55:15.563 7efdc049dd80 4 rocksdb: EVENT_LOG_v1 {"time_micros": 1574351715564768, "job": 1, "event": "recovery_started", "log_files": [252, 255]}
-314> 2019-11-21 15:55:15.563 7efdc049dd80 4 rocksdb: [/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/13.2.6/rpm/el7/BUILD/ceph-13.2.6/src/rocksdb/db/db_impl_open.cc:551] Recovering log #252 mode 0
-314> 2019-11-21 15:55:16.722 7efdc049dd80 4 rocksdb: [/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/13.2.6/rpm/el7/BUILD/ceph-13.2.6/src/rocksdb/db/db_impl_open.cc:551] Recovering log #255 mode 0
-314> 2019-11-21 15:55:17.885 7efdc049dd80 -1 /home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/13.2.6/rpm/el7/BUILD/ceph-13.2.6/src/os/bluestore/KernelDevice.cc: In function 'virtual int KernelDevice::read(uint64_t, uint64_t, ceph::bufferlist*, IOContext*, bool)' thread 7efdc049dd80 time 2019-11-21 15:55:17.870632
/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/13.2.6/rpm/el7/BUILD/ceph-13.2.6/src/os/bluestore/KernelDevice.cc: 825: FAILED assert((uint64_t)r == len)
ceph version 13.2.6 (7b695f835b03642f85998b2ae7b6dd093d9fbce4) mimic (stable)
1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x14b) [0x7efdb788036b]
2: (()+0x26e4f7) [0x7efdb78804f7]
3: (KernelDevice::read(unsigned long, unsigned long, ceph::buffer::list*, IOContext*, bool)+0x4b4) [0x5619ab313144]
4: (BlueFS::_read(BlueFS::FileReader*, BlueFS::FileReaderBuffer*, unsigned long, unsigned long, ceph::buffer::list*, char*)+0x3c2) [0x5619ab2d59a2]
5: (BlueRocksSequentialFile::Read(unsigned long, rocksdb::Slice*, char*)+0x34) [0x5619ab2f88f4]
6: (rocksdb::SequentialFileReader::Read(unsigned long, rocksdb::Slice*, char*)+0x6b) [0x5619ab4e541b]
7: (rocksdb::log::Reader::ReadMore(unsigned long*, int*)+0xd8) [0x5619ab3f3148]
8: (rocksdb::log::Reader::ReadPhysicalRecord(rocksdb::Slice*, unsigned long*)+0x70) [0x5619ab3f3240]
9: (rocksdb::log::Reader::ReadRecord(rocksdb::Slice*, std::string*, rocksdb::WALRecoveryMode)+0x12b) [0x5619ab3f351b]
10: (rocksdb::DBImpl::RecoverLogFiles(std::vector<unsigned long, std::allocator<unsigned long> > const&, unsigned long*, bool)+0xea2) [0x5619ab3a3bf2]
11: (rocksdb::DBImpl::Recover(std::vector<rocksdb::ColumnFamilyDescriptor, std::allocator<rocksdb::ColumnFamilyDescriptor> > const&, bool, bool, bool)+0xa59) [0x5619ab3a54e9]
12: (rocksdb::DBImpl::Open(rocksdb::DBOptions const&, std::string const&, std::vector<rocksdb::ColumnFamilyDescriptor, std::allocator<rocksdb::ColumnFamilyDescriptor> > const&, std::vector<rocksdb::ColumnFamilyHandle*, std::allocator<rocksdb::ColumnFamilyHandle*> >*, rocksdb::DB**, bool)+0x689) [0x5619ab3a6299]
13: (rocksdb::DB::Open(rocksdb::DBOptions const&, std::string const&, std::vector<rocksdb::ColumnFamilyDescriptor, std::allocator<rocksdb::ColumnFamilyDescriptor> > const&, std::vector<rocksdb::ColumnFamilyHandle*, std::allocator<rocksdb::ColumnFamilyHandle*> >*, rocksdb::DB**)+0x22) [0x5619ab3a7ac2]
14: (RocksDBStore::do_open(std::ostream&, bool, std::vector<KeyValueDB::ColumnFamily, std::allocator<KeyValueDB::ColumnFamily> > const*)+0x164e) [0x5619ab27a43e]
15: (BlueStore::_open_db(bool, bool)+0xd6a) [0x5619ab205f9a]
16: (BlueStore::_mount(bool, bool)+0x4d1) [0x5619ab237071]
17: (OSD::init()+0x28f) [0x5619aaddeedf]
18: (main()+0x23a3) [0x5619aacbd7a3]
19: (__libc_start_main()+0xf5) [0x7efdb33f2505]
The disk behind this OSD is very new and hasn't been stressed much, so I am not convinced this is a disk failure.
Is this a known bug in Mimic? (It's hard to find a similar report in the bug tracker.) How should I go about diagnosing this?
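So far I have only glanced at SMART. My guess at a next step (please correct me if this is the wrong direction) would be something like:
smartctl -a /dev/sdX                                          # check the backing disk for read/media errors
ceph-bluestore-tool fsck --path /var/lib/ceph/osd/ceph-NN    # with the OSD stopped
but I'd rather not poke at it blindly.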
Sam
Hi everyone,
We're pleased to announce that the next Cephalocon will be March 3-5 in
Seoul, South Korea!
https://ceph.com/cephalocon/seoul-2020/
The CFP for the conference is now open:
https://linuxfoundation.smapply.io/prog/cephalocon_2020
Main conference: March 4-5
Developer summit: March 3
Mark your calendars, and get your talk proposals in! The CFP will close
in early December in order to get a final schedule published in early
January.
In addition to the two day conference, we will also have a developer
summit on March 3 to take advantage of having so many developers in the
same place at the same time. The developer sessions will include video
conferencing so that remote developers will also be able to participate.
A sponsorship prospectus will be available Real Soon Now.
We hope you can join us!
Hello - We recently upgraded to Luminous 12.2.11. Since then we are seeing
scrub errors, on the object storage pool only, on a daily basis. After a
repair they clear, but they come back the next day once the PG is scrubbed
again.
Is there any known issue with scrub errors in the 12.2.11 version?
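For context, the daily repair is just the standard "ceph pg repair <pgid>"; I assume dumping the inconsistencies first, e.g.:
rados list-inconsistent-obj <pgid> --format=json-pretty
would help pin down whether it's always the same objects, if that's useful to anyone looking at this.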
Thanks
Swami
I've upgraded 7 of our clusters to Nautilus (14.2.4) and noticed that on some of the clusters (3 out of 7) the OSDs aren't using msgr2 at all. Here's the output for osd.0 on 2 clusters of each type:
### Cluster 1 (v1 only):
# ceph osd find 0 | jq -r '.addrs'
{
  "addrvec": [
    {
      "type": "v1",
      "addr": "10.26.0.33:6809",
      "nonce": 4185021
    }
  ]
}
### Cluster 2 (v1 only):
# ceph osd find 0 | jq -r '.addrs'
{
  "addrvec": [
    {
      "type": "v1",
      "addr": "10.197.0.243:6801",
      "nonce": 3802140
    }
  ]
}
### Cluster 3 (v1 & v2):
# ceph osd find 0 | jq -r '.addrs'
{
  "addrvec": [
    {
      "type": "v2",
      "addr": "10.32.0.36:6802",
      "nonce": 3167
    },
    {
      "type": "v1",
      "addr": "10.32.0.36:6804",
      "nonce": 3167
    }
  ]
}
### Cluster 4 (v1 & v2):
# ceph osd find 0 | jq -r '.addrs'
{
  "addrvec": [
    {
      "type": "v2",
      "addr": "10.36.0.12:6820",
      "nonce": 3150
    },
    {
      "type": "v1",
      "addr": "10.36.0.12:6827",
      "nonce": 3150
    }
  ]
}
All of the mon nodes have the same msgr settings of:
# ceph daemon mon.$(hostname -s) config show | grep msgr
"mon_warn_on_msgr2_not_enabled": "true",
"ms_bind_msgr1": "true",
"ms_bind_msgr2": "true",
"ms_msgr2_encrypt_messages": "false",
"ms_msgr2_sign_messages": "false",
What could be causing this? All of the clusters are listening on port 3300 for v2 and 6789 for v1. I can even connect to port 3300 on the mon nodes from the OSD nodes.
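For completeness, this is roughly how I've been checking the mon side on each cluster (I believe msgr2 was enabled via "ceph mon enable-msgr2" as part of the upgrades):
ceph mon dump | grep -E 'v[12]:'
ceph versions
The mons listen on 3300 everywhere; it's only the OSD addrvecs that differ between clusters.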
Thanks,
Bryan
Three OSDs, holding the 3 replicas of a PG here, are only half-starting,
and hence that single PG is stuck as "stale+active+clean".
All of them died of a suicide timeout while walking over a huge omap (pool
7, 'default.rgw.buckets.index') and will not bring PG 7.b back online
again.
From the logs, they try to start normally, do a bit of leveldb recovery,
replay the journal, and then say nothing more.
2019-11-19 15:15:46.967543 7fe644fad840 0 set uid:gid to 167:167 (ceph:ceph)
2019-11-19 15:15:46.967600 7fe644fad840 0 ceph version 10.2.2 (45107e21c568dd033c2f0a3107dec8f0b0e58374), process ceph-osd, pid 5149
2019-11-19 15:15:47.026065 7fe644fad840 0 pidfile_write: ignore empty --pid-file
2019-11-19 15:15:47.078291 7fe644fad840 0 filestore(/var/lib/ceph/osd/ceph-22) backend xfs (magic 0x58465342)
2019-11-19 15:15:47.079317 7fe644fad840 0 genericfilestorebackend(/var/lib/ceph/osd/ceph-22) detect_features: FIEMAP ioctl is disabled via 'filestore fiemap' config option
2019-11-19 15:15:47.079331 7fe644fad840 0 genericfilestorebackend(/var/lib/ceph/osd/ceph-22) detect_features: SEEK_DATA/SEEK_HOLE is disabled via 'filestore seek data hole' config option
2019-11-19 15:15:47.079352 7fe644fad840 0 genericfilestorebackend(/var/lib/ceph/osd/ceph-22) detect_features: splice is supported
2019-11-19 15:15:47.080287 7fe644fad840 0 genericfilestorebackend(/var/lib/ceph/osd/ceph-22) detect_features: syncfs(2) syscall fully supported (by glibc and kernel)
2019-11-19 15:15:47.080529 7fe644fad840 0 xfsfilestorebackend(/var/lib/ceph/osd/ceph-22) detect_feature: extsize is disabled by conf
2019-11-19 15:15:47.095819 7fe644fad840 1 leveldb: Recovering log #2731809
2019-11-19 15:15:47.119792 7fe644fad840 1 leveldb: Level-0 table #2731812: started
2019-11-19 15:15:47.132107 7fe644fad840 1 leveldb: Level-0 table #2731812: 140642 bytes OK
2019-11-19 15:15:47.143782 7fe644fad840 1 leveldb: Delete type=0 #2731809
2019-11-19 15:15:47.147198 7fe644fad840 1 leveldb: Delete type=3 #2731792
2019-11-19 15:15:47.159339 7fe644fad840 0 filestore(/var/lib/ceph/osd/ceph-22) mount: enabling WRITEAHEAD journal mode: checkpoint is not enabled
2019-11-19 15:15:47.243262 7fe644fad840 1 journal _open /var/lib/ceph/osd/ceph-22/journal fd 18: 21472739328 bytes, block size 4096 bytes, directio = 1, aio = 1
At this point they consume a ton of CPU, systemd thinks all is fine, and
this has been going on for some 5 hours.
ceph -s thinks they are down. I can't talk to the OSDs remotely from a
mon, but ceph daemon on the OSD hosts works normally, except that I can't
do anything from there other than fetch config or perf numbers.
Strace shows they all keep looping over the same sequence:
machine1:
stat("/var/lib/ceph/osd/ceph-270/current/7.b_head/DIR_B/DIR_4",
{st_mode=S_IFDIR|0755, st_size=24576, ...}) = 0
stat("/var/lib/ceph/osd/ceph-270/current/7.b_head/DIR_B/DIR_4/DIR_D",
0x7fffd7c98080) = -1 ENOENT (No such file or directory)
stat("/var/lib/ceph/osd/ceph-270/current/7.b_head/DIR_B/DIR_4/\\.dir.31716e6b-28c9-42e6-81ed-d27e3b714a9c.47687923.1711__head_6D57DD4B__7",
{st_mode=S_IFREG|0644, st_size=0, ...}) = 0
stat("/var/lib/ceph/osd/ceph-270/current/7.b_head", {st_mode=S_IFDIR|0755,
st_size=8192, ...}) = 0
stat("/var/lib/ceph/osd/ceph-270/current/7.b_head/DIR_B",
{st_mode=S_IFDIR|0755, st_size=8192, ...}) = 0
stat("/var/lib/ceph/osd/ceph-270/current/7.b_head/DIR_B/DIR_4",
{st_mode=S_IFDIR|0755, st_size=24576, ...}) = 0
stat("/var/lib/ceph/osd/ceph-270/current/7.b_head/DIR_B/DIR_4/DIR_D",
0x7fffd7c98080) = -1 ENOENT (No such file or directory)
stat("/var/lib/ceph/osd/ceph-270/current/7.b_head/DIR_B/DIR_4/\\.dir.31716e6b-28c9-42e6-81ed-d27e3b714a9c.47687923.1711__head_6D57DD4B__7",
{st_mode=S_IFREG|0644, st_size=0, ...}) = 0
machine2:
stat("/var/lib/ceph/osd/ceph-243/current/7.b_head/DIR_B/DIR_4",
{st_mode=S_IFDIR|0755, st_size=24576, ...}) = 0
stat("/var/lib/ceph/osd/ceph-243/current/7.b_head/DIR_B/DIR_4/DIR_D",
0x7ffe0b664240) = -1 ENOENT (No such file or directory)
stat("/var/lib/ceph/osd/ceph-243/current/7.b_head/DIR_B/DIR_4/\\.dir.31716e6b-28c9-42e6-81ed-d27e3b714a9c.47687923.1711__head_6D57DD4B__7",
{st_mode=S_IFREG|0644, st_size=0, ...}) = 0
stat("/var/lib/ceph/osd/ceph-243/current/7.b_head", {st_mode=S_IFDIR|0755,
st_size=8192, ...}) = 0
stat("/var/lib/ceph/osd/ceph-243/current/7.b_head/DIR_B",
{st_mode=S_IFDIR|0755, st_size=8192, ...}) = 0
stat("/var/lib/ceph/osd/ceph-243/current/7.b_head/DIR_B/DIR_4",
{st_mode=S_IFDIR|0755, st_size=24576, ...}) = 0
stat("/var/lib/ceph/osd/ceph-243/current/7.b_head/DIR_B/DIR_4/DIR_D",
0x7ffe0b664240) = -1 ENOENT (No such file or directory)
stat("/var/lib/ceph/osd/ceph-243/current/7.b_head/DIR_B/DIR_4/\\.dir.31716e6b-28c9-42e6-81ed-d27e3b714a9c.47687923.1711__head_6D57DD4B__7",
{st_mode=S_IFREG|0644, st_size=0, ...}) = 0
machine3:
stat("/var/lib/ceph/osd/ceph-22/current/7.b_head/DIR_B/DIR_4",
{st_mode=S_IFDIR|0755, st_size=24576, ...}) = 0
stat("/var/lib/ceph/osd/ceph-22/current/7.b_head/DIR_B/DIR_4/DIR_D",
0x7ffc63518650) = -1 ENOENT (No such file or directory)
stat("/var/lib/ceph/osd/ceph-22/current/7.b_head/DIR_B/DIR_4/\\.dir.31716e6b-28c9-42e6-81ed-d27e3b714a9c.47687923.1711__head_6D57DD4B__7",
{st_mode=S_IFREG|0644, st_size=0, ...}) = 0
stat("/var/lib/ceph/osd/ceph-22/current/7.b_head", {st_mode=S_IFDIR|0755,
st_size=8192, ...}) = 0
stat("/var/lib/ceph/osd/ceph-22/current/7.b_head/DIR_B",
{st_mode=S_IFDIR|0755, st_size=8192, ...}) = 0
stat("/var/lib/ceph/osd/ceph-22/current/7.b_head/DIR_B/DIR_4",
{st_mode=S_IFDIR|0755, st_size=24576, ...}) = 0
stat("/var/lib/ceph/osd/ceph-22/current/7.b_head/DIR_B/DIR_4/DIR_D",
0x7ffc63518650) = -1 ENOENT (No such file or directory)
stat("/var/lib/ceph/osd/ceph-22/current/7.b_head/DIR_B/DIR_4/\\.dir.31716e6b-28c9-42e6-81ed-d27e3b714a9c.47687923.1711__head_6D57DD4B__7",
{st_mode=S_IFREG|0644, st_size=0, ...}) = 0
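One thing I'm tentatively considering (not applied yet) is raising the OSD suicide timeouts so the daemons get a chance to finish walking the huge omap instead of being killed, e.g. in ceph.conf:
[osd]
osd_op_thread_suicide_timeout = 2000
filestore_op_thread_suicide_timeout = 2000
but I'm not certain which timeout is actually firing here, so I'd rather hear from people who have been through this first.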
Help wanted.
--
May the most significant bit of your life be positive.
Hi,
I have a small but impactful error in my crush rules.
For unknown reasons the rule uses osd rather than host as the failure
domain, so some nodes hold all three copies of a PG instead of the copies
being spread over three different nodes.
We noticed this when rebooting a node and a PG became stale.
My crush rule:
{
    "rule_id": 0,
    "rule_name": "replicated_rule",
    "ruleset": 0,
    "type": 1,
    "min_size": 1,
    "max_size": 10,
    "steps": [
        {
            "op": "take",
            "item": -2,
            "item_name": "default~hdd"
        },
        {
            "op": "chooseleaf_firstn",
            "num": 0,
            "type": "osd"
        },
        {
            "op": "emit"
        }
    ]
},
The type should of course be host. I want to change this and have the PGs moved so that everything is placed as it should be.
How can I best proceed in correcting this? I would also like to throttle the remapping of the data so Ceph itself won't become unavailable while the data is redistributed.
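My tentative plan (please correct me if this is the wrong approach) is to create a new rule with host as the failure domain, throttle backfill, and then switch the pools over one at a time, roughly (the rule name is just an example):
ceph osd crush rule create-replicated replicated_host default host hdd
ceph tell osd.* injectargs '--osd_max_backfills 1 --osd_recovery_max_active 1'
ceph osd pool set <pool> crush_rule replicated_host
Does that sound sane, or is it better to edit the existing rule in a decompiled crushmap?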
We are running on Mimic (13.2.6), and this environment has been installed freshly as Mimic while using ceph-ansible.
Current ceph -s output:
  cluster:
    id:     <fsid>
    health: HEALTH_OK
  services:
    mon: 3 daemons, quorum mon01,mon02,mon03
    mgr: mon01(active), standbys: mon02, mon03
    mds: cephfs-2/2/2 up {0=mon03=up:active,1=mon01=up:active}, 1 up:standby
    osd: 502 osds: 502 up, 502 in
  data:
    pools:   18 pools, 8192 pgs
    objects: 28.74 M objects, 100 TiB
    usage:   331 TiB used, 2.3 PiB / 2.6 PiB avail
    pgs:     8192 active+clean
Cheers,
Maarten van Ingen
| Systems Expert | Distributed Data Processing | SURFsara | Science Park 140 | 1098 XG Amsterdam |
| T +31 (0) 20 800 1300 | maarten.vaningen(a)surfsara.nl | https://surfsara.nl |
We are ISO 27001 certified and meet the high requirements for information security.