Hi,
Like I said in an earlier mail to this list, we rebalanced ~60% of the
CephFS metadata pool to NVMe-backed devices: roughly 422 M objects (1.2
billion including replicas), with 512 PGs allocated to the pool. While
rebalancing we suffered from quite a few SLOW_OPS. Memory, CPU and
device IOPS capacity were not a limiting factor as far as we can see
(plenty of headroom ... nowhere near max capacity). Many of the slow
ops showed the following events:
"time": "2019-12-19 09:41:02.712010",
"event": "reached_pg"
},
{
"time": "2019-12-19 09:41:02.712014",
"event": "waiting for rw locks"
},
{
"time": "2019-12-19 09:41:02.881939",
"event": "reached_pg"
... and this repeated hundreds of times, taking ~30 seconds to complete.
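For reference, these per-op event timelines can be pulled from the OSD admin
socket; a minimal sketch (the OSD id is just a placeholder):

# recent ops flagged as slow, with their full event timelines
ceph daemon osd.12 dump_historic_slow_ops

# ops currently in flight, useful while the slowdown is happening
ceph daemon osd.12 dump_ops_in_flight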
Does this indicate PG lock contention?
If so ... would we need to provide more PGs to the metadata pool to avoid this?
The metadata pool is only ~166 MiB in size ... but with loads of OMAP data ...
Most advice on PG planning is concerned with the _amount_ of data ... but the
metadata pool (and this might also be true for RGW index pools) seems to be a
special case.
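Should more PGs turn out to be the answer, I assume it would be the usual
procedure (pool name and target count below are only examples):

# check the current PG count of the metadata pool
ceph osd pool get cephfs_metadata pg_num

# raise it; on pre-Nautilus releases pgp_num has to be bumped as well
ceph osd pool set cephfs_metadata pg_num 1024
ceph osd pool set cephfs_metadata pgp_num 1024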
Thanks for your insights,
Gr. Stefan
--
| BIT BV https://www.bit.nl/ Kamer van Koophandel 09090351
| GPG: 0xD14839C6 +31 318 648 688 / info(a)bit.nl
Hi,
After the upgrade to 13.2.8, deep-scrub has a big impact on client IO:
loads of SLOW_OPS and high latency. We hardly ever had SLOW_OPS before, but
since the upgrade the impact is so big that we even have OSDs marking
each other out (OSD op thread timeout) multiple times during the scrub
window. There is plenty of CPU / RAM / IOPS left and hardly any load on these
OSD servers. Has anything changed in this release that could explain
this behaviour?
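To rule out a configuration change on our side, the scrub-related settings
can be dumped per running OSD; a sketch (the OSD id is a placeholder):

# show the scrub settings the running OSD is actually using
ceph daemon osd.0 config show | grep -E 'osd_scrub|osd_max_scrubs|osd_deep_scrub'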
Besides this, the impact of rebalancing is very severe as well. With only
the balancer remapping a couple of PGs at a time there are loads of
(MDS_)SLOW_OPS. This morning the CephFS metadata pool got rebalanced ...
and that triggered a lot of SLOW_OPS. One particular OSD was pegged at
1000% CPU for more than half an hour (while not doing that much IO): that's 10
cores going full throttle! After a restart this issue was gone.
Thanks,
Stefan
--
| BIT BV https://www.bit.nl/ Kamer van Koophandel 09090351
| GPG: 0xD14839C6 +31 318 648 688 / info(a)bit.nl
After upgrading one of our clusters from Nautilus 14.2.2 to Nautilus 14.2.5, I'm seeing 100% CPU usage by a single ceph-mgr thread (found using 'top -H'). Attaching to the thread with strace shows a lot of mmap and munmap calls. Here's the distribution after watching it for a few minutes:
48.73% - mmap
49.48% - munmap
1.75% - futex
0.05% - madvise
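For anyone who wants to reproduce the measurement, this is roughly how the
distribution above can be collected (the thread id is a placeholder):

# find the busy ceph-mgr thread
top -H -p $(pidof ceph-mgr)

# attach to that thread and summarise its syscalls;
# let it run for a minute, then Ctrl-C to print the summary table
strace -c -p <TID>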
I've upgraded 3 other clusters so far (120 OSDs, 30 OSDs, 200 OSDs), but this is the only one which has seen the problem (355 OSDs). Perhaps it has something to do with its size?
I suspected it might have to do with one of the modules misbehaving, so I disabled all of them:
# ceph mgr module ls | jq -r '.enabled_modules'
[]
But that didn't help (I restarted the mgrs after disabling the modules too).
I also tried setting debug_mgr and debug_mgrc to 20, but nothing popped out at me as being the cause of the problem.
It only seems to affect the active mgr. If I stop the active mgr the problem moves to one of the other mgrs.
Any guesses or tips on what next steps I should take to figure out what's going on?
Thanks,
Bryan
Hi,
I am trying to copy the contents of our storage server into a CephFS,
but am experiencing stability issues with my MDSs. The CephFS sits on
top of an erasure-coded pool with 5 MONs, 5 MDSs and a max_mds setting
of two. My Ceph cluster version is Nautilus, the client is Mimic and
uses the kernel module to mount the FS.
The index of filenames to copy is about 23GB and I am using 16 parallel
rsync processes over a 10G link to copy the files over to Ceph. This
works perfectly for a while, but then the MDSs start reporting oversized
caches (between 20 and 50GB, sometimes more) and an inode count between
1 and 4 million. The inode count in particular seems quite high to me.
Each rsync job has 25k files to work with, so even if all 16 processes opened
all their files at the same time, I should not exceed 400k. Even if I
double this number to account for the client's page cache, I should get
nowhere near that number of inodes (a sync flush takes about 1 second).
Then after a few hours, my MDSs start failing with messages like this:
-21> 2019-07-22 14:00:05.877 7f67eacec700 1 heartbeat_map is_healthy 'MDSRank' had timed out after 15
-20> 2019-07-22 14:00:05.877 7f67eacec700 0 mds.beacon.XXX Skipping beacon heartbeat to monitors (last acked 24.0042s ago); MDS internal heartbeat is not healthy!
The standby nodes try to take over, but take forever to become active
and eventually fail as well.
During my research, I found this related topic:
http://lists.ceph.com/pipermail/ceph-users-ceph.com/2017-January/015959.html,
but I have tried everything in there, from increasing and lowering my cache
size to changing the number of segments, etc. I also played around with the
number of active MDSs: two appears to work best, whereas one cannot keep
up with the load and three seems to be the worst of all choices.
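For reference, these are the kinds of settings I have been varying; a sketch
with example values only (the file system name "cephfs" is a placeholder):

# MDS cache size limit (8 GiB here, purely as an example)
ceph config set mds mds_cache_memory_limit 8589934592

# number of journal segments
ceph config set mds mds_log_max_segments 128

# number of active MDS ranks
ceph fs set cephfs max_mds 2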
Do you have any ideas how I can improve the stability of my MDS daemons
so they handle the load properly? A single 10G link is a toy and we could
query the cluster with a lot more requests per second, but it is already
struggling with 16 rsync processes.
Thanks
Please file a tracker issue with the symptom and examples, and attach your
OSDMap (ceph osd getmap > osdmap.bin).
Note that https://github.com/ceph/ceph/pull/31956 has the Nautilus
version of improved upmap code. It also changes osdmaptool to match the
mgr behavior, so that one can observe the behavior of the upmap balancer
offline.
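For the record, the offline workflow looks roughly like this (the pool name
and change limit are placeholders):

# grab the current OSDMap from the cluster
ceph osd getmap -o osdmap.bin

# let osdmaptool compute upmap entries offline, mimicking the mgr balancer
osdmaptool osdmap.bin --upmap out.txt --upmap-pool mypool --upmap-max 10

# out.txt now contains the "ceph osd pg-upmap-items ..." commands the
# balancer would issue; review before running them
cat out.txt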
Thanks
David
On 12/8/19 11:04 AM, Philippe D'Anjou wrote:
> It's only getting worse after raising PGs now.
>
> Anything between:
> 96 hdd 9.09470 1.00000 9.1 TiB 4.9 TiB 4.9 TiB 97 KiB 13 GiB 4.2 TiB 53.62 0.76 54 up
>
> and
>
> 89 hdd 9.09470 1.00000 9.1 TiB 8.1 TiB 8.1 TiB 88 KiB 21 GiB 1001 GiB 89.25 1.27 87 up
>
> How is that possible? I don't know how much more proof I need to
> present that there's a bug.
Hi,
in this <https://ceph.io/community/the-first-telemetry-results-are-in/>
blog post I find this statement:
"So, in our ideal world so far (assuming equal size OSDs), every OSD now
has the same number of PGs assigned."
My issue is that across all pools the number of PGs per OSD is not equal,
and I conclude that this is causing very unbalanced data placement.
As a matter of fact, the utilisation of my 1.6 TB HDDs in the specific pool
"hdb_backup" ranges from
osd.228 size: 1.6 usage: 52.61 reweight: 1.00000
to
osd.145 size: 1.6 usage: 81.11 reweight: 1.00000
This heavily impacts the amount of data that can be stored in the cluster.
The Ceph balancer is enabled, but it is not solving this issue.
root@ld3955:~# ceph balancer status
{
"active": true,
"plans": [],
"mode": "upmap"
}
Therefore I would like to ask for suggestions on how to address this
unbalanced data distribution.
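As far as I understand, a plan can also be computed and applied manually
(the plan name "myplan" is arbitrary), although the automatic balancer should
already be doing the equivalent:

# score the current distribution (lower is better)
ceph balancer eval

# compute a plan, inspect it and its expected score, then apply it
ceph balancer optimize myplan
ceph balancer show myplan
ceph balancer eval myplan
ceph balancer execute myplan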
I have attached pastebin links for
- ceph osd df sorted by usage <https://pastebin.com/QLQHjA9g>
- ceph osd df tree <https://pastebin.com/SvhP2hp5>
My cluster has multiple CRUSH roots representing the different disk types.
In addition I have defined multiple pools, one pool for each disk type:
hdd, ssd, nvme.
THX
Hi,
We had a strange problem with some buckets. After an s3cmd sync, some objects got ETags with the suffix "#x0e". This rendered the XML output of "GET /" (e.g. s3cmd du) invalid. Unfortunately, this behaviour was not reproducible, but it could be fixed by "GET /{object}" + "PUT /{object}" (s3cmd get + s3cmd put).
I am not sure how this appeared or how to avoid it. Right now we have Nautilus MONs and OSDs with Jewel radosgws. At the time of the first appearance a Nautilus gateway had also been online, but the requests had been handled by both types of gateway.
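In case someone wants to check their own objects, the stored ETag can be
inspected from both sides; bucket and object names below are placeholders:

# S3 side: shows the MD5/ETag as returned to clients
s3cmd info s3://mybucket/myobject

# RGW side: dumps the object's metadata, including the etag attribute
radosgw-admin object stat --bucket=mybucket --object=myobject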
Any ideas?
best regards,
Ingo
--
Ingo Reimann
Teamleiter Technik
[ https://www.dunkel.de/ ]
Dunkel GmbH
Philipp-Reis-Straße 2
65795 Hattersheim
Fon: +49 6190 889-100
Fax: +49 6190 889-399
eMail: support(a)dunkel.de
http://www.Dunkel.de/ Amtsgericht Frankfurt/Main
HRB: 37971
Geschäftsführer: Axel Dunkel
Ust-ID: DE 811622001
Hi,
We have seen several issues (mailed about them earlier to this list)
after the upgrade to Mimic 13.2.8. We decided to downgrade the OSD
servers to 13.2.6 to check whether the issues disappear. However, we ran into
issues with that ...
We have been using the bitmap allocator since Luminous 12.2.12 to combat
latency issues on the OSDs, and also used it successfully on Mimic 13.2.6:
bluestore_allocator = bitmap
bluefs_allocator = bitmap
When downgrading to 13.2.6 we hit the following assert:
2019-12-27 14:14:16.409 7f2ed2dcce00 1 bluefs add_block_device bdev 1 path /var/lib/ceph/osd/ceph-0/block size 3.5 TiB
2019-12-27 14:14:16.409 7f2ed2dcce00 1 bluefs mount
2019-12-27 14:14:16.413 7f2ed2dcce00 -1 /build/ceph-13.2.6/src/os/bluestore/fastbmap_allocator_impl.h: In function 'void AllocatorLevel02<T>::_mark_allocated(uint64_t, uint64_t) [with L1 = AllocatorLevel01Loose; uint64_t = long unsigned int]' thread 7f2ed2dcce00 time 2019-12-27 14:14:16.414793
/build/ceph-13.2.6/src/os/bluestore/fastbmap_allocator_impl.h: 749: FAILED assert(available >= allocated)
ceph version 13.2.6 (7b695f835b03642f85998b2ae7b6dd093d9fbce4) mimic (stable)
1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x14e) [0x7f2eca0a497e]
2: (()+0x2fab07) [0x7f2eca0a4b07]
3: (BitmapAllocator::init_rm_free(unsigned long, unsigned long)+0x44d) [0xc91dbd]
4: (BlueFS::mount()+0x260) [0xc6e6c0]
5: (BlueStore::_open_db(bool, bool)+0x17cd) [0xb8f50d]
6: (BlueStore::_mount(bool, bool)+0x4b7) [0xbbfb77]
7: (OSD::init()+0x295) [0x761fc5]
8: (main()+0x367b) [0x64f23b]
9: (__libc_start_main()+0xf0) [0x7f2ec7c3f830]
10: (_start()+0x29) [0x718929]
We upgraded the node back to 13.2.8 again, which started without issues.
We did a "downgrade test" on a test cluster ... that cluster did not
suffer from this issue. It turned out that the cluster was not using
the bitmap allocator ... after enabling the bitmap allocator there on a
13.2.6 node (that had previously been downgraded but had never run
with the bitmap allocator) and restarting the node, it came online just
fine. However, an upgrade to 13.2.8 with the bitmap allocator enabled,
followed by a downgrade again to 13.2.6, would trigger the same assert.
Switching back to the default (stupid) allocator would initially work
for 2 out of 3 OSDs. One would fail right away with RocksDB corruption:
2019-12-27 15:10:50.945 7fc77fbcbe00 20 osd.6 1952 register_pg 2.16 0x990c800
2019-12-27 15:10:50.945 7fc77fbcbe00 10 osd.6:2._attach_pg 2.16 0x990c800
2019-12-27 15:10:50.945 7fc77fbcbe00 10 osd.6 1952 pgid 2.0 coll 2.0_head
2019-12-27 15:10:50.945 7fc77fbcbe00 10 osd.6 1952 _make_pg 2.0
2019-12-27 15:10:50.945 7fc77fbcbe00 5 osd.6 pg_epoch: 1952 pg[2.0(unlocked)] enter Initial
2019-12-27 15:10:50.945 7fc77fbcbe00 20 osd.6 pg_epoch: 1952 pg[2.0(unlocked)] enter NotTrimming
2019-12-27 15:10:50.945 7fc77fbcbe00 -1 abort: Corruption: block checksum mismatch: expected 1122551773, got 2333355710 in db/000397.sst offset 57741 size 4044
2019-12-27 15:10:50.949 7fc77fbcbe00 -1 *** Caught signal (Aborted) **
in thread 7fc77fbcbe00 thread_name:ceph-osd
ceph version 13.2.6 (7b695f835b03642f85998b2ae7b6dd093d9fbce4) mimic (stable)
1: (()+0x11390) [0x7fc775520390]
2: (gsignal()+0x38) [0x7fc774a53428]
3: (abort()+0x16a) [0x7fc774a5502a]
4: (RocksDBStore::get(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, ceph::buffer::list*)+0x4a8) [0xbff498]
5: (BlueStore::omap_get_values(boost::intrusive_ptr<ObjectStore::CollectionImpl>&, ghobject_t const&, std::set<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::less<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::allocator<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > > const&, std::map<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, ceph::buffer::list, std::less<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::allocator<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, ceph::buffer::list> > >*)+0x201) [0xb852e1]
6: (PG::read_info(ObjectStore*, spg_t, coll_t const&, pg_info_t&, PastIntervals&, unsigned char&)+0x16b) [0x7ecc8b]
7: (PG::read_state(ObjectStore*)+0x56) [0x81aff6]
8: (OSD::load_pgs()+0x566) [0x759516]
9: (OSD::init()+0xcd3) [0x762a03]
10: (main()+0x367b) [0x64f23b]
11: (__libc_start_main()+0xf0) [0x7fc774a3e830]
12: (_start()+0x29) [0x718929]
NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
And after a restart we got rocksdb messages like this:
...
2019-12-27 15:11:32.322 7fd1c0fbbe00 4 rocksdb: [/build/ceph-13.2.6/src/rocksdb/db/version_set.cc:3088] Recovering from manifest file: MANIFEST-000402
...
...
-352> 2019-12-27 15:11:11.598 7fa0f0558e00 -1 abort: Corruption: Bad table magic number: expected 9863518390377041911, found 11124 in db/000397.sst
...
After we set osd.6 out ... osd.7 crashed after a while (while backfilling) and
would fail to restart again with the following message:
2019-12-27 15:27:10.833 7f1bb4701e00 4 rocksdb: [/build/ceph-13.2.6/src/rocksdb/db/db_impl.cc:252] Shutdown: canceling all background work
2019-12-27 15:27:10.833 7f1bb4701e00 4 rocksdb: [/build/ceph-13.2.6/src/rocksdb/db/db_impl.cc:397] Shutdown complete
2019-12-27 15:27:10.833 7f1bb4701e00 -1 rocksdb: Corruption: CURRENT file does not end with newline
2019-12-27 15:27:10.833 7f1bb4701e00 -1 bluestore(/var/lib/ceph/osd/ceph-7) _open_db erroring opening db:
2019-12-27 15:27:10.833 7f1bb4701e00 1 bluefs umount
2019-12-27 15:27:10.833 7f1bb4701e00 1 stupidalloc 0x0x325aee0 shutdown
2019-12-27 15:27:10.833 7f1bb4701e00 1 bdev(0x380a380 /var/lib/ceph/osd/ceph-7/block) close
2019-12-27 15:27:11.093 7f1bb4701e00 1 bdev(0x380a000 /var/lib/ceph/osd/ceph-7/block) close
2019-12-27 15:27:11.345 7f1bb4701e00 -1 osd.7 0 OSD:init: unable to mount object store
2019-12-27 15:27:11.345 7f1bb4701e00 -1 ESC[0;31m ** ERROR: osd init failed: (5) Input/output errorESC[0m
A restart / reboot of the node would not help.
For those of you still running 13.2.6 ... I would not recommend upgrading to
13.2.8 (at least not for storage nodes ... mon / mds still seem to work fine).
Does the bitmap allocator modify the OSD's on-disk data in some way? Are you
supposed to be able to switch between the different allocators?
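For reference, this is how we verify which allocator an OSD is actually
running with (the OSD id is a placeholder):

# query the running OSD for its allocator settings
ceph daemon osd.0 config get bluestore_allocator
ceph daemon osd.0 config get bluefs_allocator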
Thanks,
Stefan
--
| BIT BV https://www.bit.nl/ Kamer van Koophandel 09090351
| GPG: 0xD14839C6 +31 318 648 688 / info(a)bit.nl
I am seeing the following errors on an RGW multisite slave:
1. ERROR: failed to fetch mdlog info
2. failed to fetch local sync status: (5) Input/output error
Data seems to be replicating, but metadata is not. Does anyone have any
ideas on what may be wrong?
-----
# radosgw-admin sync status
realm 8f7fd3fd-f72d-411d-b06b-7b4b579f5f2f (prod)
zonegroup 60a2cb75-6978-46a3-b830-061c8be9dc75 (prod)
zone ffce148e-3b24-462d-98bf-8c212de31de5 (us-east-1)
2019-12-27 12:29:13.329597 7f71c4ec9dc0 0 meta sync: ERROR: failed to
fetch mdlog info
metadata sync syncing
full sync: 0/64 shards
failed to fetch local sync status: (5) Input/output error
data sync source: 7fe96e52-d6f7-4ad6-b66e-ecbbbffbc18e (us-east-2)
syncing
full sync: 0/128 shards
incremental sync: 128/128 shards
data is behind on 12 shards
behind shards: [29,31,33,45,46,48,54,76,87,113,120,127]
oldest incremental change not applied:
2019-12-27 12:28:58.0.107159s
23 shards are recovering
recovering shards:
[1,2,24,26,29,33,35,37,40,41,42,46,48,51,54,66,76,95,100,101,122,123,127]
-----
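For reference, I assume these are the relevant commands on the secondary for
inspecting metadata sync in more detail and, as a last resort, restarting it
(the init triggers a full metadata resync):

# detailed metadata sync state on the secondary zone
radosgw-admin metadata sync status

# list metadata log entries (per shard)
radosgw-admin mdlog list

# last resort: reinitialise metadata sync, then restart the radosgw
# daemons in the secondary zone
radosgw-admin metadata sync init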