December 2019 - ceph-users

PG-upmap offline optimization is not working as expected

by Thomas Schneider

Hi, I have tested the PG-upmap offline optimization with 1 of my pools: ssd This pool is unbalanced; here's the ouput of ceph osd df tree before the optimization: root@ld3955:~# ceph osd df tree class ssd ID CLASS WEIGHT REWEIGHT SIZE RAW USE DATA OMAP META AVAIL %USE VAR PGS STATUS TYPE NAME -1 1532.88501 - 27 TiB 27 GiB 4.4 GiB 816 KiB 23 GiB 27 TiB 0.10 1.00 - root default -46 353.82300 - 1.1 TiB 3.7 GiB 702 MiB 144 KiB 3.0 GiB 1.1 TiB 0.33 3.35 - host ld4257 20 ssd 0.37099 1.00000 371 GiB 1.2 GiB 234 MiB 44 KiB 1024 MiB 370 GiB 0.33 3.35 25 up osd.20 21 ssd 0.37099 1.00000 371 GiB 1.2 GiB 234 MiB 20 KiB 1024 MiB 370 GiB 0.33 3.35 27 up osd.21 22 ssd 0.37099 1.00000 371 GiB 1.2 GiB 234 MiB 80 KiB 1024 MiB 370 GiB 0.33 3.35 26 up osd.22 -34 356.22299 - 4.6 TiB 4.9 GiB 936 MiB 228 KiB 4.0 GiB 4.6 TiB 0.10 1.06 - host ld4464 23 ssd 0.37099 1.00000 371 GiB 1.2 GiB 234 MiB 44 KiB 1024 MiB 370 GiB 0.33 3.35 29 up osd.23 24 ssd 0.37099 1.00000 371 GiB 1.2 GiB 234 MiB 40 KiB 1024 MiB 370 GiB 0.33 3.35 31 up osd.24 25 ssd 0.37099 1.00000 371 GiB 1.2 GiB 234 MiB 52 KiB 1024 MiB 370 GiB 0.33 3.35 31 up osd.25 26 ssd 3.48999 1.00000 3.5 TiB 1.2 GiB 234 MiB 92 KiB 1024 MiB 3.5 TiB 0.03 0.35 265 up osd.26 -37 356.22299 - 4.6 TiB 4.9 GiB 936 MiB 152 KiB 4.0 GiB 4.6 TiB 0.10 1.06 - host ld4465 27 ssd 0.37099 1.00000 371 GiB 1.2 GiB 234 MiB 20 KiB 1024 MiB 370 GiB 0.33 3.35 24 up osd.27 28 ssd 0.37099 1.00000 371 GiB 1.2 GiB 234 MiB 20 KiB 1024 MiB 370 GiB 0.33 3.35 28 up osd.28 29 ssd 0.37099 1.00000 371 GiB 1.2 GiB 234 MiB 20 KiB 1024 MiB 370 GiB 0.33 3.35 22 up osd.29 30 ssd 3.48999 1.00000 3.5 TiB 1.2 GiB 234 MiB 92 KiB 1024 MiB 3.5 TiB 0.03 0.35 258 up osd.30 -3 116.65399 - 4.2 TiB 3.5 GiB 491 MiB 76 KiB 3.0 GiB 4.2 TiB 0.08 0.82 - host ld5505 8 ssd 3.48999 1.00000 3.5 TiB 1.2 GiB 164 MiB 20 KiB 1024 MiB 3.5 TiB 0.03 0.33 288 up osd.8 9 ssd 0.37199 1.00000 372 GiB 1.2 GiB 164 MiB 24 KiB 1024 MiB 371 GiB 0.31 3.16 28 up osd.9 10 ssd 0.37199 1.00000 372 GiB 1.2 GiB 164 MiB 32 KiB 1024 MiB 371 GiB 0.31 3.16 31 up osd.10 -7 116.65399 - 4.2 TiB 3.5 GiB 491 MiB 72 KiB 3.0 GiB 4.2 TiB 0.08 0.82 - host ld5506 11 ssd 0.37199 1.00000 372 GiB 1.2 GiB 164 MiB 24 KiB 1024 MiB 371 GiB 0.31 3.16 36 up osd.11 12 ssd 3.48999 1.00000 3.5 TiB 1.2 GiB 164 MiB 32 KiB 1024 MiB 3.5 TiB 0.03 0.33 260 up osd.12 13 ssd 0.37199 1.00000 372 GiB 1.2 GiB 164 MiB 16 KiB 1024 MiB 371 GiB 0.31 3.16 28 up osd.13 -10 116.65399 - 4.2 TiB 3.5 GiB 491 MiB 80 KiB 3.0 GiB 4.2 TiB 0.08 0.82 - host ld5507 14 ssd 0.37199 1.00000 372 GiB 1.2 GiB 164 MiB 24 KiB 1024 MiB 371 GiB 0.31 3.16 24 up osd.14 15 ssd 0.37199 1.00000 372 GiB 1.2 GiB 164 MiB 32 KiB 1024 MiB 371 GiB 0.31 3.16 26 up osd.15 16 ssd 3.48999 1.00000 3.5 TiB 1.2 GiB 164 MiB 24 KiB 1024 MiB 3.5 TiB 0.03 0.33 259 up osd.16 -13 116.65399 - 4.2 TiB 3.5 GiB 490 MiB 64 KiB 3.0 GiB 4.2 TiB 0.08 0.82 - host ld5508 17 ssd 0.37199 1.00000 372 GiB 1.2 GiB 164 MiB 28 KiB 1024 MiB 371 GiB 0.31 3.16 19 up osd.17 18 ssd 0.37199 1.00000 372 GiB 1.2 GiB 163 MiB 8 KiB 1024 MiB 371 GiB 0.31 3.16 24 up osd.18 19 ssd 3.48999 1.00000 3.5 TiB 1.2 GiB 164 MiB 28 KiB 1024 MiB 3.5 TiB 0.03 0.33 259 up osd.19 TOTAL 27 TiB 27 GiB 4.4 GiB 823 KiB 23 GiB 27 TiB 0.10 MIN/MAX VAR: 0.33/3.35 STDDEV: 0.20 The output of osdmaptool implies many modifications affecting osd.11 and osd.12, means the optimizer wants to shift PGs from 12 to 11. root@ld3955:~# source out_ssd.txt set 66.41 pg_upmap_items mapping to [12->13] set 66.4e pg_upmap_items mapping to [22->20] set 66.7c pg_upmap_items mapping to [28->29] set 66.9f pg_upmap_items mapping to [12->11] set 66.147 pg_upmap_items mapping to [12->11] set 66.1b1 pg_upmap_items mapping to [12->11] set 66.203 pg_upmap_items mapping to [12->11] set 66.257 pg_upmap_items mapping to [28->30] set 66.27d pg_upmap_items mapping to [28->30] set 66.300 pg_upmap_items mapping to [12->11] set 66.354 pg_upmap_items mapping to [28->29] set 66.35b pg_upmap_items mapping to [28->30] set 66.38a pg_upmap_items mapping to [12->11] set 66.3d0 pg_upmap_items mapping to [28->30] However this makes no sense as osd.11 has already more PGs than other OSDs. In fact there's one OSD with least PGs: osd.17 Why is the optimizer not shifting PGs to osd.17? Here's the output of ceph osd df tree after the optimization: root@ld3955:~# ceph osd df tree class ssd ID CLASS WEIGHT REWEIGHT SIZE RAW USE DATA OMAP META AVAIL %USE VAR PGS STATUS TYPE NAME -1 1532.88501 - 27 TiB 27 GiB 3.2 GiB 816 KiB 23 GiB 27 TiB 0.10 1.00 - root default -46 353.82300 - 1.1 TiB 3.6 GiB 502 MiB 144 KiB 3.0 GiB 1.1 TiB 0.33 3.40 - host ld4257 20 ssd 0.37099 1.00000 371 GiB 1.2 GiB 167 MiB 44 KiB 1024 MiB 370 GiB 0.33 3.49 25 up osd.20 21 ssd 0.37099 1.00000 371 GiB 1.2 GiB 167 MiB 20 KiB 1024 MiB 370 GiB 0.33 3.44 27 up osd.21 22 ssd 0.37099 1.00000 371 GiB 1.2 GiB 167 MiB 80 KiB 1024 MiB 370 GiB 0.31 3.28 26 up osd.22 -34 356.22299 - 4.6 TiB 4.8 GiB 668 MiB 228 KiB 4.0 GiB 4.6 TiB 0.10 1.07 - host ld4464 23 ssd 0.37099 1.00000 371 GiB 1.2 GiB 167 MiB 44 KiB 1024 MiB 370 GiB 0.33 3.44 29 up osd.23 24 ssd 0.37099 1.00000 371 GiB 1.2 GiB 167 MiB 40 KiB 1024 MiB 370 GiB 0.31 3.28 31 up osd.24 25 ssd 0.37099 1.00000 371 GiB 1.2 GiB 167 MiB 52 KiB 1024 MiB 370 GiB 0.33 3.49 31 up osd.25 26 ssd 3.48999 1.00000 3.5 TiB 1.2 GiB 167 MiB 92 KiB 1024 MiB 3.5 TiB 0.03 0.34 265 up osd.26 -37 356.22299 - 4.6 TiB 4.8 GiB 669 MiB 152 KiB 4.0 GiB 4.6 TiB 0.10 1.07 - host ld4465 27 ssd 0.37099 1.00000 371 GiB 1.2 GiB 167 MiB 20 KiB 1024 MiB 370 GiB 0.31 3.28 24 up osd.27 28 ssd 0.37099 1.00000 371 GiB 1.2 GiB 167 MiB 20 KiB 1024 MiB 370 GiB 0.33 3.49 28 up osd.28 29 ssd 0.37099 1.00000 371 GiB 1.2 GiB 167 MiB 20 KiB 1024 MiB 370 GiB 0.31 3.28 23 up osd.29 30 ssd 3.48999 1.00000 3.5 TiB 1.2 GiB 167 MiB 92 KiB 1024 MiB 3.5 TiB 0.03 0.36 257 up osd.30 -3 116.65399 - 4.2 TiB 3.3 GiB 350 MiB 76 KiB 3.0 GiB 4.2 TiB 0.08 0.81 - host ld5505 8 ssd 3.48999 1.00000 3.5 TiB 1.1 GiB 117 MiB 20 KiB 1024 MiB 3.5 TiB 0.03 0.33 288 up osd.8 9 ssd 0.37199 1.00000 372 GiB 1.1 GiB 117 MiB 24 KiB 1024 MiB 371 GiB 0.30 3.13 28 up osd.9 10 ssd 0.37199 1.00000 372 GiB 1.1 GiB 117 MiB 32 KiB 1024 MiB 371 GiB 0.30 3.13 31 up osd.10 -7 116.65399 - 4.2 TiB 3.3 GiB 350 MiB 72 KiB 3.0 GiB 4.2 TiB 0.08 0.81 - host ld5506 11 ssd 0.37199 1.00000 372 GiB 1.1 GiB 117 MiB 24 KiB 1024 MiB 371 GiB 0.30 3.13 41 up osd.11 12 ssd 3.48999 1.00000 3.5 TiB 1.1 GiB 117 MiB 32 KiB 1024 MiB 3.5 TiB 0.03 0.33 254 up osd.12 13 ssd 0.37199 1.00000 372 GiB 1.1 GiB 117 MiB 16 KiB 1024 MiB 371 GiB 0.30 3.13 29 up osd.13 -10 116.65399 - 4.2 TiB 3.3 GiB 350 MiB 80 KiB 3.0 GiB 4.2 TiB 0.08 0.81 - host ld5507 14 ssd 0.37199 1.00000 372 GiB 1.1 GiB 117 MiB 24 KiB 1024 MiB 371 GiB 0.30 3.13 24 up osd.14 15 ssd 0.37199 1.00000 372 GiB 1.1 GiB 117 MiB 32 KiB 1024 MiB 371 GiB 0.30 3.13 26 up osd.15 16 ssd 3.48999 1.00000 3.5 TiB 1.1 GiB 117 MiB 24 KiB 1024 MiB 3.5 TiB 0.03 0.33 259 up osd.16 -13 116.65399 - 4.2 TiB 3.3 GiB 350 MiB 64 KiB 3.0 GiB 4.2 TiB 0.08 0.81 - host ld5508 17 ssd 0.37199 1.00000 372 GiB 1.1 GiB 117 MiB 28 KiB 1024 MiB 371 GiB 0.30 3.13 19 up osd.17 18 ssd 0.37199 1.00000 372 GiB 1.1 GiB 117 MiB 8 KiB 1024 MiB 371 GiB 0.30 3.13 24 up osd.18 19 ssd 3.48999 1.00000 3.5 TiB 1.1 GiB 117 MiB 28 KiB 1024 MiB 3.5 TiB 0.03 0.33 259 up osd.19 TOTAL 27 TiB 27 GiB 3.2 GiB 823 KiB 23 GiB 27 TiB 0.10 MIN/MAX VAR: 0.33/3.49 STDDEV: 0.19 Regards Thomas

4 years, 3 months

1
0
0 0

ceph radosgw failed to initialize

by dayong tian

The radosgw can't start normally, the error in log file: ------------------------------- 2019-12-20 14:37:04.058 7fd5b088f700 -1 Initialization timeout, failed to initialize 2019-12-20 14:37:04.304 7fe7148c0780 0 deferred set uid:gid to 167:167 (ceph:ceph) 2019-12-20 14:37:04.304 7fe7148c0780 0 ceph version 14.2.5 (ad5bd132e1492173c85fda2cc863152730b16a92) nautilus (stable), process radosgw, pid 3474 2019-12-20 14:37:04.333 7fe6fe47e700 20 reqs_thread_entry: start 2019-12-20 14:37:04.338 7fe7148c0780 1 librados: starting msgr at 2019-12-20 14:37:04.338 7fe7148c0780 1 librados: starting objecter 2019-12-20 14:37:04.338 7fe7148c0780 1 librados: setting wanted keys 2019-12-20 14:37:04.338 7fe7148c0780 1 librados: calling monclient init 2019-12-20 14:37:04.340 7fe7148c0780 1 librados: init done 2019-12-20 14:37:04.340 7fe7148c0780 20 rados->read ofs=0 len=0 2019-12-20 14:37:04.340 7fe7148c0780 10 librados: wait_for_osdmap waiting 2019-12-20 14:37:04.341 7fe7148c0780 10 librados: wait_for_osdmap done waiting 2019-12-20 14:37:04.341 7fe7148c0780 10 librados: read oid=default.realm nspace= 2019-12-20 14:42:04.304 7fe700f0d700 -1 Initialization timeout, failed to initialize ------------------------------- The config I used: ------------------------------- [client.rgw.ceph-test-f01] rgw_host = ceph-test-f01 keyring = /var/lib/ceph/radosgw/ceph-rgw.ceph-test-f01/keyring rgw_frontends = civetweb port=8099 ------------------------------- The env I set as follows: ------------------------------- # cat /etc/redhat-release CentOS Linux release 7.7.1908 (Core) # ceph -v ceph version 14.2.5 (ad5bd132e1492173c85fda2cc863152730b16a92) nautilus (stable) # ceph -s cluster: id: ac018df0-2e71-4c0f-a8a1-0ea29b8a7eb1 health: HEALTH_WARN Reduced data availability: 8 pgs inactive Degraded data redundancy: 8 pgs undersized services: mon: 1 daemons, quorum ceph-test-f01 (age 2h) mgr: ceph-test-f01(active, since 61m) osd: 1 osds: 1 up (since 2h), 1 in (since 2h) data: pools: 2 pools, 40 pgs objects: 1 objects, 1.6 KiB usage: 1.0 GiB used, 98 GiB / 99 GiB avail pgs: 20.000% pgs not active 32 active+clean 8 undersized+peered # ceph osd pool ls .rgw.root panama ------------------------------- I also tried to follow the ceph document to create the default pools such as .default.rgw.control .default.rgw.gc .default.rgw.buckets .default.rgw.buckets.index .default.rgw.buckets.extra .default.log .default.intent-log .default.usage .default.users .default.users.email .default.users.swift .default.users.default.uid The radosgw service still can't start normally, any advices will be appreciated. And all the radosgw-admin utilities hang like: # radosgw-admin zone get 2019-12-20 15:29:03.322 7f18c65876c0 1 librados: starting msgr at 2019-12-20 15:29:03.322 7f18c65876c0 1 librados: starting objecter 2019-12-20 15:29:03.323 7f18c65876c0 1 librados: setting wanted keys 2019-12-20 15:29:03.323 7f18c65876c0 1 librados: calling monclient init 2019-12-20 15:29:03.325 7f18c65876c0 1 librados: init done 2019-12-20 15:29:03.331 7f18c65876c0 1 librados: starting msgr at 2019-12-20 15:29:03.331 7f18c65876c0 1 librados: starting objecter 2019-12-20 15:29:03.331 7f18c65876c0 1 librados: setting wanted keys 2019-12-20 15:29:03.331 7f18c65876c0 1 librados: calling monclient init 2019-12-20 15:29:03.333 7f18c65876c0 1 librados: init done 2019-12-20 15:29:03.334 7f188ffff700 2 RGWDataChangesLog::ChangesRenewThread: start 2019-12-20 15:29:03.334 7f18c65876c0 20 rados->read ofs=0 len=0 2019-12-20 15:29:03.334 7f188f7fe700 20 reqs_thread_entry: start 2019-12-20 15:29:03.335 7f18c65876c0 10 librados: read oid=default.realm nspace= 2019-12-20 15:29:25.335 7f188ffff700 2 RGWDataChangesLog::ChangesRenewThread: start

4 years, 3 months

1
0
0 0

RGW bucket stats extremely slow to respond

by David Monschein

Hi all! Reaching out again about this issue since I haven't had much luck. We've been seeing some strange behavior with our object storage cluster. While bucket stats (radosgw-admin bucket stats) normally return in a matter of seconds, we frequently observe it taking almost ten minutes, which is not convenient since we use those bucket stats for billing/accounting. Restarting the radosgw process on the RGWs fixes this issue until it crops up again in maybe a few days. Someone mentioned that they think this might have to do with bucket deletions, or more specifically, lifecycle policies to abort incomplete multipart uploads. He mentioned there was an item in the bug tracker for this, but I have not been able to find said bug in the tracker. I have no clue if this is the case or not, but I figured I'd throw it out there to see if anyone else has run into this problem. I have seen many of these messages in my RGW logs: 2019-12-02 13:12:52.882 7faa7018f700 0 abort_bucket_multiparts WARNING : aborted 8553000 incomplete multipart uploads So maybe there is some truth to the aborted multipart uploads causing problems? My cluster has over 200 OSDs, 10 RGWs, about 2200 buckets. Running Nautilus 14.2.5. If anyone has run into this or has any information I'd appreciate it. Merry Christmas & Happy Holidays, - Dave

4 years, 4 months

1
0
0 0

Re: Pool Max Avail and Ceph Dashboard Pool Useage on Nautilus giving different percentages

by Stephan Mueller

Hi, if "MAX AVAIL" displays the wrong data, the bug is just made more visible through the dashboard, as the calculation is correct. To get the right percentage you have to divide the used space through the total, and the total can only consist of two states used and not used space, so both states will be added together to get the total. Or in short: used / (avail + used) Just looked into the C++ code - Max avail will be calculated the following way: avail_res = avail / raw_used_rate ( https://github.com/ceph/ceph/blob/nautilus/src/mon/PGMap.cc#L905) raw_used_rate *= (sum.num_object_copies - sum.num_objects_degraded) / sum.num_object_copies (https://github.com/ceph/ceph/blob/nautilus/src/mon/PGMap.cc#L892) Am Dienstag, den 17.12.2019, 07:07 +0100 schrieb ceph(a)elchaka.de: > I have observed this in the ceph nautilus dashboard too - and Think > it is a Display Bug... but sometimes it Shows tue right values > > > Which nautilus u use? > > > Am 10. Dezember 2019 14:31:05 MEZ schrieb "David Majchrzak, ODERLAND > Webbhotell AB" <david(a)oderland.se>: > > Hi! > > > > While browsing /#/pool in nautilus ceph dashboard I noticed it said > > 93% > > used on the single pool we have (3x replica). > > > > ceph df detal however shows 81% used on the pool and 67% raw > > useage. > > > > # ceph df detail > > RAW STORAGE: > > CLASS SIZE AVAIL USED RAW USED %RAW > > USED > > ssd 478 TiB 153 TiB 324 TiB 325 > > TiB 67.96 > > TOTAL 478 TiB 153 TiB 324 TiB 325 > > TiB 67.96 > > > > POOLS: > > POOL ID STORED OBJECTS USED %USED > > > > MAX AVAIL QUOTA OBJECTS QUOTA BYTES DIRTY USED > > COMPR UNDER COMPR > > echo 3 108 TiB 29.49M 324 > > TiB 81.61 24 > > TiB N/A N/A 29.49M 0 > > B 0 B I manually calculated the used percentage to get "avail" in your case it seems to be 73 TiB. That means the the total space available for your pool would be 397 TiB. I'm not sure why that is, but it's what the math behind those calculations say. (Found a thread regarding that on the new mailing list (ceph- users(a)ceph.io) -> https://lists.ceph.io/hyperkitty/list/ceph-users@ceph.io/thread/NH2LMMX5KVR… ) 0.8161 = used (324) / total => total = 397 Than I looked at the remaining calculations: raw_used_rate *= (sum.num_object_copies - sum.num_objects_degraded) / sum.num_object_copies and avail_res = avail / raw_used_rate First I looked up the init value for "raw_used_rate" for replicated pools. It's their size so we can put in 3 here and for "avail_res" is 24. So I first calculated the final "raw_used_rate" which is 3.042. That means that you have around 4.2% degraded pg's in your pool. > > > > > > I know we're looking at the most full OSD (210PGs, 79% used, 1.17 > > VAR) > > and count max avail from that. But where's the 93% full from in > > dashboard? As said above the calculation is right but the data is wrong... As it uses the real data that can be put in the selected pool, but it uses everywhere else sizes that consider all pool replicas. I created an issue to fix this https://tracker.ceph.com/issues/43384 > > > > My guess is that is comes from calculating: > > > > 1 - Max Avail / (Used + Max avail) = 0.93 > > > > > > Kind Regards, > > > > David Majchrzak > > > > _______________________________________________ > > ceph-users mailing list > > ceph-users(a)lists.ceph.com > > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com > > _______________________________________________ > ceph-users mailing list > ceph-users(a)lists.ceph.com > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com Hope I could clarify some things and thanks for your feedback :) BTW this problem currently still exists as there wasn't any change to these mentioned lines after the nautilus release. Stephan

4 years, 4 months

1
0
0 0

rbd images inaccessible for a longer period of time

by yveskretzschmar＠web.de

Hi, yesterday I had to power off some vm (proxmox) backed by rbd images for maintenance. After the VMs were off, I tried to create a Snapshot which didn't finish even after half an hour. Because it was maintenance I rebooted all VM nodes an all ceph nodes - nothing changed. Powering on the VM was impossible, kvm exited with timeout. This happened to two of about 15 VM. Two of three Images of one VM still had locks, which I did remove but still unable to power on. I tried to access the Image by mapping it with rbd-nbd, which was unsuccessful and logged this: [ 8601.746971] block nbd0: Connection timed out [ 8601.747648] block nbd0: shutting down sockets [ 8601.747653] block nbd0: Connection timed out [...] [ 8601.750419] block nbd0: Connection timed out [ 8601.750831] print_req_error: 121 callbacks suppressed [ 8601.750832] blk_update_request: I/O error, dev nbd0, sector 0 op 0x0:(READ) flags 0x0 phys_seg 1 prio class 0 [ 8601.751261] buffer_io_error: 182 callbacks suppressed [ 8601.751262] Buffer I/O error on dev nbd0, logical block 0, async page read [ 8601.751678] blk_update_request: I/O error, dev nbd0, sector 1 op 0x0:(READ) flags 0x0 phys_seg 1 prio class 0 [...] [ 8601.760283] ldm_validate_partition_table(): Disk read failed. [ 8601.760344] Dev nbd0: unable to read RDB block 0 [ 8601.760985] nbd0: unable to read partition table [ 8601.761282] nbd0: detected capacity change from 0 to 375809638400 [ 8601.761382] ldm_validate_partition_table(): Disk read failed. [ 8601.761461] Dev nbd0: unable to read RDB block 0 [ 8601.762145] nbd0: unable to read partition table The rbd-nbd process kept existing and had to be killed Same thing with qemu-nbd. Exporting the Image via rbd export worked fine, also a rbd copy. Any other operation on the Image (feature dis / enable) took forever so I had to abort it. It seems that every operation leaves a lock on the image. Because it was in the middle of the night, I stopped working on it. Today Morning one of the images was accessible again, the others not. Anybody any hint? Some system information below. Regards, Yves ceph version 14.2.5 (3ce7517553bdd5195b68a6ffaf0bd7f3acad1647) nautilus (stable) Primary Cluster with a backup cluster (rbd mirror) [global] auth client required = none auth cluster required = none auth service required = none auth supported = none cephx_sign_messages = false cephx require signatures = False cluster_network = 172.16.230.0/24 debug asok = 0/0 debug auth = 0/0 debug bdev = 0/0 debug bluefs = 0/0 debug bluestore = 0/0 debug buffer = 0/0 debug civetweb = 0/0 debug client = 0/0 debug compressor = 0/0 debug context = 0/0 debug crush = 0/0 debug crypto = 0/0 debug dpdk = 0/0 debug eventtrace = 0/0 debug filer = 0/0 debug filestore = 0/0 debug finisher = 0/0 debug fuse = 0/0 debug heartbeatmap = 0/0 debug javaclient = 0/0 debug journal = 0/0 debug journaler = 0/0 debug kinetic = 0/0 debug kstore = 0/0 debug leveldb = 0/0 debug lockdep = 0/0 debug mds = 0/0 debug mds balancer = 0/0 debug mds locker = 0/0 debug mds log = 0/0 debug mds log expire = 0/0 debug mds migrator = 0/0 debug memdb = 0/0 debug mgr = 0/0 debug mgrc = 0/0 debug mon = 0/0 debug monc = 0/00 debug ms = 0/0 debug none = 0/0 debug objclass = 0/0 debug objectcacher = 0/0 debug objecter = 0/0 debug optracker = 0/0 debug osd = 0/0 debug paxos = 0/0 debug perfcounter = 0/0 debug rados = 0/0 debug rbd = 0/0 debug rbd mirror = 0/0 debug rbd replay = 0/0 debug refs = 0/0 debug reserver = 0/0 debug rgw = 0/0 debug rocksdb = 0/0 debug striper = 0/0 debug throttle = 0/0 debug timer = 0/0 debug tp = 0/0 debug xio = 0/0 fsid = 27fdf1bb-22a1-4d5e-9729-780cbdcd33fe mon_allow_pool_delete = true mon_host = 172.16.230.142 172.16.230.144 172.16.230.146 mon_osd_down_out_subtree_limit = host osd_backfill_scan_max = 16 osd_backfill_scan_min = 4 osd_deep_scrub_interval = 1209600 osd_journal_size = 5120 osd_max_backfills = 1 osd_max_trimming_pgs = 1 osd_pg_max_concurrent_snap_trims = 1 osd_pool_default_min_size = 2 osd_pool_default_size = 3 osd_recovery_max_active = 1 osd_recovery_max_single_start = 1 osd_recovery_op_priority = 1 osd_recovery_threads = 1 osd_scrub_begin_hour = 19 osd_scrub_chunk_max = 1 osd_scrub_chunk_min = 1 osd_scrub_during_recovery = false osd_scrub_end_hour = 6 osd_scrub_priority = 1 osd_scrub_sleep = 0.5 osd_snap_trim_priority = 1 osd_snap_trim_sleep = 0.005 osd_srub_max_interval = 1209600 public_network = 172.16.230.0/24 max open files = 131072 osd objectstore = bluestore osd op threads = 2 osd crush update on start = true Currently inaccessible image: rbd image 'vm-29009-disk-2': size 200 GiB in 51200 objects order 22 (4 MiB objects) snapshot_count: 2 id: 1abd04da8b9a4d block_name_prefix: rbd_data.1abd04da8b9a4d format: 2 features: layering, exclusive-lock, object-map, fast-diff, deep-flatten, journaling op_features: flags: create_timestamp: Tue Jul 9 13:07:36 2019 access_timestamp: Thu Dec 19 01:35:34 2019 modify_timestamp: Thu Dec 19 00:19:32 2019 journal: 1abd04da8b9a4d mirroring state: enabled mirroring global id: c71ec81f-18be-4d0b-93ed-0cebe3e619bb mirroring primary: true

4 years, 4 months

2
2
0 0

Strange behavior for crush buckets of erasure-profile

by tdados＠hotmail.com

So I wanted to report a crush rule/ec profile strange behaviour regarding radosgw items which i am not sure if it's a bug or it's supposed to work that way. I am trying to implement the below scenario in my home lab: By default there is a "default" erasure-code-profile with the below settings: crush-device-class= crush-failure-domain=host crush-root=default jerasure-per-chunk-alignment=false k=2 m=1 plugin=jerasure technique=reed_sol_van w=8 From the above we see that it uses the root bucket. Now ofcourse you would want to create your own ec-profile with custom algorithm/crush buckets etc Let's say for example we create two new ec profiles. One with specific crush-root = ssd-performance2 and one with the crush-root=default (there are no disks there according to ceph osd tree-> end of page) ceph osd erasure-code-profile set test-ec crush-device-class= crush-failure-domain=host crush-root=ssd-performance2 jerasure-per-chunk-alignment=false k=2 m=1 plugin=jerasure technique=reed_sol_van w=8 ceph osd erasure-code-profile set test-ec2 crush-device-class= crush-failure-domain=host crush-root=default jerasure-per-chunk-alignment=false k=2 m=1 plugin=jerasure technique=reed_sol_van w=8 Now let's create the associated crush rules to use these profiles: ceph osd crush rule create-erasure erasure-test-rule1 test-ec ceph osd crush rule create-erasure erasure-test-rule2 test-ec2 Now let's say you have a radosgw server that has started and by default it creates the 5 default radosgwpools(supposed you have uploaded some data as well): default.rgw.buckets.data default.rgw.buckets.index default.rgw.control default.rgw.log default.rgw.meta Now if you grep these pools with ceph osd dump you will see that all of them are using replicated rules but we want to use erasure for the radosgw data pool. So let's migrate the default.rgw.buckets.data pool to a erasure-coded one. 1) We shutdown the radosgw-server so that we don't allow any requests coming in. 2) ceph osd pool rename default.rgw.buckets.data default.rgw.buckets.data-old 3) ceph osd pool create default.rgw.buckets.data 8 8 erasure test-ec erasure-test-rule - > We use the newly created erasure crush rule with the profile we created and use the ssd-performance2 root bucket 4) rados cppool default.rgw.buckets.data-old default.rgw.buckets.data 5) Start radosgw server again At this point i can see the old objects and i can upload new objects in radosgw and everything is working fine. Now i see this strange behavior after i do the below: We set the default.rgw.buckets.data to use the other erasure crush rule (This is using the root bucket=default which doesn't have any disks): ceph osd pool set default.rgw.buckets.data crush_rule erasure-test-rule2 Bug1? You could still browse the data but any attempt to upload/download hangs there with the below log messages: 2019-12-18 17:07:07.037 7f05a1ece700 0 ERROR: client_io->complete_request() returned Input/output error 2019-12-18 17:07:07.037 7f05a1ece700 2 req 712 0.004s s3:list_buckets op status= Monitor nodes don't display anything and seems that new items cannot be saved (which is correct as it doesn't know where to save them) but at least Monitor nodes should display something as a warning or there must be crush check before to see if the rule can be applied? Reverting back the rule to erasure-test-rule works fine again ================================= Bug 2? If you modify the erasure-test-rule profile to use a null crush bucket (like erasure-test-rule2) then this is not being parsed and identified by the crush rule. Seems crush rules skips that part Example: ceph osd erasure-code-profile set test-ec crush-root=default --force At this point nothing happens and radosgw is working fine. Which it shouldn't as it should see that the data cannot be saved anywhere. Unless it keeps the crush root bucket from the crush rules and not from the erasure coded profiles...even if you force apply/change it to the erasure profile like above. ================================= Bug 3? You don't know which rule is using which erasure-code-profile from ceph osd dump. You only see that this pool is using crush rule number 1 but if you dump this crush rule it doesn't mention which erasure-code profile is using, other than which item_name eg = root bucket Even with the telemetry on with latest release and if you do "ceph telemetry show basic" with below you see there is no crush-root being mentioned. So is the crush rule > erasure_code_profile regarding parsing of the crush_root buckets? { "min_size": 2, "erasure_code_profile": { "crush-failure-domain": "host", "k": "2", "technique": "reed_sol_van", "m": "1", "plugin": "jerasure" }, "pg_autoscale_mode": "warn", "pool": 860, "size": 3, "cache_mode": "none", "target_max_objects": 0, "pg_num": 8, "pgp_num": 8, "target_max_bytes": 0, "type": "erasure" } root@ceph-mon01:~# ceph osd crush rule dump erasure-test-rule { "rule_id": 2, "rule_name": "erasure-test-rule", "ruleset": 2, "type": 3, "min_size": 3, "max_size": 3, "steps": [ { "op": "set_chooseleaf_tries", "num": 5 }, { "op": "set_choose_tries", "num": 100 }, { "op": "take", "item": -2, "item_name": "ssd-performance2" }, { "op": "chooseleaf_indep", "num": 0, "type": "host" }, { "op": "emit" } ] } root@ceph-mon01:~# ceph osd tree ID CLASS WEIGHT TYPE NAME STATUS REWEIGHT PRI-AFF -37 0.18398 root really-low -40 0.09799 host ceph-osd01-really-low 11 hdd 0.09799 osd.11 up 1.00000 1.00000 -41 0.04799 host ceph-osd02-really-low 1 hdd 0.01900 osd.1 up 1.00000 1.00000 9 hdd 0.02899 osd.9 up 1.00000 1.00000 -42 0.03799 host ceph-osd03-really-low 6 hdd 0.01900 osd.6 up 1.00000 1.00000 7 hdd 0.01900 osd.7 up 1.00000 1.00000 -23 10.67598 root spinning-rust -20 2.04900 rack rack1 -3 2.04900 host ceph-osd01 3 hdd 0.04900 osd.3 up 0.95001 1.00000 22 hdd 1.00000 osd.22 up 0.90002 1.00000 17 ssd 1.00000 osd.17 up 1.00000 1.00000 -25 3.07799 rack rack2 -5 3.07799 host ceph-osd02 4 hdd 0.04900 osd.4 up 1.00000 1.00000 8 hdd 0.02899 osd.8 up 1.00000 1.00000 23 hdd 1.00000 osd.23 up 1.00000 1.00000 25 hdd 1.00000 osd.25 up 1.00000 1.00000 12 ssd 1.00000 osd.12 up 1.00000 1.00000 -28 3.54900 rack rack3 -7 3.54900 host ceph-osd03 0 hdd 1.00000 osd.0 up 0.90002 1.00000 5 hdd 0.04900 osd.5 up 1.00000 1.00000 30 hdd 0.50000 osd.30 up 1.00000 1.00000 21 ssd 1.00000 osd.21 up 0.95001 1.00000 24 ssd 1.00000 osd.24 up 1.00000 1.00000 -55 2.00000 rack rack4 -49 2.00000 host ceph-osd04 26 hdd 1.00000 osd.26 up 1.00000 1.00000 27 hdd 1.00000 osd.27 up 1.00000 1.00000 -2 9.10799 root ssd-performance2 -32 2.09799 host ceph-osd01-ssd 2 ssd 0.09799 osd.2 up 1.00000 1.00000 13 ssd 1.00000 osd.13 up 1.00000 1.00000 16 ssd 1.00000 osd.16 up 1.00000 1.00000 -31 3.00000 host ceph-osd02-ssd 14 ssd 1.00000 osd.14 up 1.00000 1.00000 18 ssd 1.00000 osd.18 up 1.00000 1.00000 19 ssd 1.00000 osd.19 up 1.00000 1.00000 -9 2.00999 host ceph-osd03-ssd 10 ssd 0.00999 osd.10 up 0.90002 1.00000 15 ssd 1.00000 osd.15 up 1.00000 1.00000 20 ssd 1.00000 osd.20 up 1.00000 1.00000 -52 2.00000 host ceph-osd04-ssd 28 ssd 1.00000 osd.28 up 1.00000 1.00000 29 ssd 1.00000 osd.29 up 1.00000 1.00000 -1 0 root default root@ceph-mon01:~# Thanks, Anastasios

4 years, 4 months

1
0
0 0

re-balancing resulting in unexpected availability issues

by steve.nolen＠rstudio.com

Hi! We've found ourselves in state with our ceph cluster that we haven't seen before, and are looking for a bit of expertise to chime in. We're running a (potentially unusually laid out) moderately large luminous-based ceph cluster in a public cloud, with 234*8TB OSDs, with a single osd per cloud instance. Here's a snippet of our ceph status: services: mon: 3 daemons, quorum ceph-mon1,ceph-mon2,ceph-mon4 mgr: ceph-mon2(active), standbys: ceph-mon1, ceph-mon4 osd: 234 osds: 234 up, 231 in data: pools: 5 pools, 7968 pgs objects: 136.35M objects, 382TiB usage: 1.09PiB used, 713TiB / 1.79PiB avail pgs: 7924 active+clean 44 active+clean+scrubbing+deep Our layout is spread across 6 availability zones (with the majority of osds in three, us-east-1a,us-east-1b and us-east-1e). We've recently decided that a spread across six azs is unnecessary and potentially a detriment to performance, so we are working towards shifting our workload out of 3 of the 6 azs, so that we are evenly placed in 3 azs. Relevant Recent events: 1. We expanded by 24 osds to handle additional capacity needs as well as to provide the capacity necessary to remove all osds in us-east-1f. 2. After the re-balance related to #1 finished we expanded by an additional 12 osds as a follow-on for the #1 change. 3. On the same day as #2, we also issued `ceph crush move` commands to move the location of oldest 20 osds which had previously not been configured with a "datacenter" bucket denoting their availability zone. The re-balancing related to #3 caused quite a change in our cluster, resulting in hours with degraded pgs, and waves of "slow requests" from many of the osds as data shifted. Three of the osds which had their location moved are also wildly more full than the other osds (being in a nearfull, >85% utilized state where the other 231 osds are in the range of 58-63% utilized). A `ceph balancer optimize` plan shows no changes to be made by the balancer to rectify this. Because of their nearfull status, we have marked those three osds as "out". During the resulting re-balancing we experienced more "slow requests" piling up, with some pgs dipping into a peering or activating state (which obviously causes some user-visible trouble). Where we are headed: * We have yet to set the crush map to use a replicated_datacenter rule now that our osds conform to these buckets. * We need to take action to remove/replace the three osds which were over-utilized. * We need to remove the us-east-1f osds that provoked the event in #1 (of which there are 16). What I'm hoping for a bit of direction on: * A recommendation on the order of these operations that would result in the least user-visible time. * Some hints or thoughts on why we are seeing flapping pg availability since the change in #3. * General thoughts on our layout. It has been rather useful for unbounded rapid growth but it seems very likely the sheer count of osds/instances is optimal at this point. Obviously I'm happy to provide any additional information that might be helpful that I've overlooked. Thanks so much for looking! Steve

4 years, 4 months

1
0
0 0

Use Wireshark to analysis ceph network package

by Xu Chen

Hi guys, I want to use tcpdump and Wireshark to capture and analysis packages between clients and ceph cluster. But the protocol only shows tcp, no ceph, so I can not read the data between client and cluster. The wireshark version is 3.07. Hope your help. Thank you.

4 years, 4 months

1
0
0 0

list CephFS snapshots

by Lars Täuber

Hi! Is there a mean to list all snapshots existing in a (subdir of) Cephfs? I can't use the find dommand to look for the ".snap" dirs. I'd like to remove certain (or all) snapshots within a CephFS. But how do I find them? Thanks, Lars

4 years, 4 months

4
9
0 0

what's meaning of "cache_hit_rate": 0.000000 in "ceph daemon mds.<x> dump loads" output?

by Sang, Oliver

hi all, seek your help for this. we are using luminous 12.2.12 and we enabled 3 active mds. when I running "ceph daemon mds.<x> dump loads" on any active mds, I always saw such like below "mds_load": { "mds.0": { "request_rate": 526.045993, "cache_hit_rate": 0.000000, ... "mds.1": { "request_rate": 169.845956, "cache_hit_rate": 0.000000, ... "mds.2": { "request_rate": 300.511478, "cache_hit_rate": 0.000000, don't understand what's this 'cache_hit_rate' and why it seems always 0? we actually set a bigger mds_cache_memory_limit than default such like 16G in /etc/ceph/ceph.conf. are they related? I checked https://docs.ceph.com/docs/luminous/cephfs/mds-config-ref/ , but didn't get further clue from the description of "mds cache *" settings.

4 years, 4 months

1
0
0 0

2024

2023

2022

2021

2020

2019

ceph-users December 2019