Hello everyone,
A few weeks ago I enabled the Ceph balancer on my cluster, following the instructions here: https://docs.ceph.com/docs/mimic/mgr/balancer/
I am running ceph version:
ceph version 13.2.6 (7b695f835b03642f85998b2ae7b6dd093d9fbce4) mimic (stable)
The cluster has 48 osds (40 osds in hdd pools and 8 osds in ssd pool)
Currently, the balancer status is showing as Active.
# ceph balancer status
{
"active": true,
"plans": [],
"mode": "upmap"
}
The health status of the cluster is:
health: HEALTH_OK
Previously, I used the old REWEIGHT mechanism to change the placement of data, as I was seeing very uneven usage (ranging from about 60% on some OSDs to over 90% on others). So I have a number of OSDs with a reweight of 1 and some going down to 0.75.
At the moment the OSD usage ranges from about 65% to just under 90%, so there is still a huge variation. Since switching on the balancer, I have not actually seen any activity or data migration, so I am not sure if the balancer is working at all. Could someone tell me how I can check whether balancing is doing its job?
The second question is: now that the balancer is switched on, am I supposed to set the reweight values back to their default of 1?
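For reference, these are the commands I have found so far for watching it (not sure they are the right way to check):

# ceph balancer eval          <- distribution score; lower should mean better balanced
# ceph osd df tree            <- per-OSD %USE and PG counts
# ceph -s                     <- remapped PGs / backfill activity would show here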
Many thanks
Hi all!
I exported an RBD image both with the export v2 format (--export-format 2) and without it.
The difference is that the v2-format export is much smaller than the image exported without it.
Can anybody tell me why the sizes differ so hugely?
Thanks very much!
root@controller:/mnt# rbd du images/35d69ca5-b4f7-499e-9719-331eee498bc4
NAME                                       PROVISIONED  USED
35d69ca5-b4f7-499e-9719-331eee498bc4@snap  40GiB        40GiB
35d69ca5-b4f7-499e-9719-331eee498bc4       40GiB        0B
<TOTAL>                                    40GiB        40GiB
root@controller:/mnt# rbd export --export-format 2 images/35d69ca5-b4f7-499e-9719-331eee498bc4 ./v2image
Exporting image: 100% complete...done.
root@controller:/mnt# du -sh ./v2image
9.8G ./v2image
root@controller:/mnt# rbd export images/35d69ca5-b4f7-499e-9719-331eee498bc4 ./image
Exporting image: 100% complete...done.
root@controller:/mnt# du -sh ./image
40G ./image
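I wonder if it is related to sparseness. A plain sparse file shows the same kind of du difference (just a local experiment, possibly unrelated to the rbd export formats):

```shell
# Local experiment (nothing ceph-specific): du reports far less than the
# apparent size for a sparse file, because unwritten extents allocate no blocks.
truncate -s 1G sparse.img
du -h --apparent-size sparse.img   # shows ~1.0G (apparent size)
du -h sparse.img                   # shows ~0 (actually allocated blocks)
```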
Hi
We have a 3-year-old Hadoop cluster - up for refresh - so it is time
to evaluate options. The "only" use case is running an HBase installation
which is important for us, and migrating off HBase would be a hassle.
Our Ceph usage has expanded and in general - we really like what we see.
Thus - can this be "sanely" consolidated somehow? I have seen this:
https://docs.ceph.com/docs/jewel/cephfs/hadoop/
But it seems really, really bogus to me.
It recommends that you set:
pool 3 'hadoop1' rep size 1 min_size 1
Which would - if I understand correctly - be disastrous. The Hadoop end
would replicate 3x across nodes - but within Ceph the replication would be 1.
Replication 1 in Ceph means pulling an OSD node would "guarantee" that
PGs go inactive - which could be OK - but there is nothing
guaranteeing that the other Hadoop replicas are not served out of the same
OSD node/PG. In that case, rebooting an OSD node would make the Hadoop
cluster unavailable.
Is anyone serving HBase out of Ceph - how does the stack and
configuration look? If I went for 3x replication in both Ceph and HDFS
then it would definitely work, but 9x copies of the dataset is a bit more
than what looks feasible at the moment.
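For reference, the wiring that page describes boils down to something like this in core-site.xml (property names as I read them off that page - untested by me):

    <property>
      <name>fs.default.name</name>
      <value>ceph://mon-host:6789/</value>
    </property>
    <property>
      <name>ceph.conf.file</name>
      <value>/etc/ceph/ceph.conf</value>
    </property>
    <property>
      <name>ceph.data.pools</name>
      <value>hadoop1</value>
    </property>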
Thanks for your reflections/input.
Jesper
Hi, I thought I might ask here since I was unable to find anything similar
to my issue. Perhaps someone might have an idea.
In our org we are currently running a few Nautilus clusters (14.2.2,
14.2.4, 14.2.8). But strangely enough, the clusters on 14.2.8 are
reporting weird pool R/W metrics. To be more specific, from time to
time some random huge spikes on the order of TB/s appear in our graphs,
whereas the real usage is GB/s (the clusters are monitored by Prometheus).
First to blame were our exporters, so I tried the one built into the
ceph-dashboard, but the values of those metrics were the same (+- a few
bytes). I went as far as writing a small program to talk directly to the
Ceph API, but after I processed the output, the data was also the same. I
can provide more details (graphs, logs, etc.) if someone has a clue what
the cause might be.
As for the rest of the clusters, their graphs do not contain
any spikes like the ones on the 14.2.8 clusters.
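The graphs come from queries of roughly this shape (assuming the standard mgr/prometheus metric names; simplified):

    # PromQL - simplified; ceph_pool_wr_bytes / ceph_pool_rd_bytes are the
    # per-pool byte counters exposed by the mgr prometheus module
    sum by (pool_id) (rate(ceph_pool_wr_bytes[5m]))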
Thanks for any responses/help in advance.
Completed the migration of an existing Ceph cluster on Octopus to cephadm.
All OSD/MON/MGR daemons moved fine; however, upon running the command to set up some new MDS daemons for CephFS, they both failed to start.
After looking into the cephadm logs I found the following error:
Apr 13 06:26:15 sn-s01 systemd[1]: Started Ceph mds.cephfs.sn-s01.snkhfd for b1db6b36-0c4c-4bce-9cda-18834be0632d.
Apr 13 06:26:16 sn-s01 bash[3520809]: debug 2020-04-13T04:26:16.445+0000 7f380b908700 -1 monclient(hunting): handle_auth_bad_method server allowed_methods [1] but i only support [2]
Apr 13 06:26:16 sn-s01 bash[3520809]: debug 2020-04-13T04:26:16.445+0000 7f380b107700 -1 monclient(hunting): handle_auth_bad_method server allowed_methods [1] but i only support [2]
Apr 13 06:26:16 sn-s01 bash[3520809]: debug 2020-04-13T04:26:16.445+0000 7f380a906700 -1 monclient(hunting): handle_auth_bad_method server allowed_methods [1] but i only support [2]
Apr 13 06:26:16 sn-s01 bash[3520809]: failed to fetch mon config (--no-mon-config to skip
This cluster is running with cephx disabled. I imported the ceph.conf into cephadm fine, and this has worked when the other services started, but from the above error it looks like maybe the MDS is not checking whether cephx is enabled or disabled before trying to communicate with the mons?
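For reference, the auth options I believe are in play (names as in the docs; whether cephadm passes them into the MDS container is exactly what I am unsure about):

# ceph config get mon auth_cluster_required
# ceph config set global auth_cluster_required none
# ceph config set global auth_service_required none
# ceph config set global auth_client_required none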
Hi all,
*** Short version ***
Is there a way to repair a RocksDB hitting the errors "Encountered error while
reading data from compression dictionary block Corruption: block
checksum mismatch" and "_open_db erroring opening db"?
*** Long version ***
We operate a nautilus ceph cluster (with 100 disks of 8TB in 6 servers +
4 mons/mgr + 3 mds).
We recently (Monday 20) upgraded from 14.2.7 to 14.2.8. This triggered a
rebalancing of some data.
Two days later (Wednesday 22) we had a very short power outage. Only one
of the OSD servers went down (and unfortunately died).
This triggered a reconstruction of the lost OSDs. Operations went fine
until Saturday 25, when some OSDs on the 5 remaining servers started to
crash for no apparent reason.
We tried to restart them, but they crashed again. We ended up with 18 OSDs
down (+ 16 in the dead server, so 34 OSDs down out of 100).
Looking at the logs, we found the following for all the crashed OSDs:
-237> 2020-04-25 16:32:51.835 7f1f45527a80 3 rocksdb:
[table/block_based_table_reader.cc:1117] Encountered error while reading
data from compression dictionary block Corruption: block checksum
mismatch: expected 0, got 2729370997 in db/181355.sst offset
18446744073709551615 size 18446744073709551615
and
2020-04-25 16:05:47.251 7fcbd1e46a80 -1
bluestore(/var/lib/ceph/osd/ceph-3) _open_db erroring opening db:
We also noticed that the "Encountered error while reading data from
compression dictionary block Corruption: block checksum mismatch" message was
present a few days before the crash.
We also have some OSDs with this error that are still up.
We tried to repair with:
ceph-kvstore-tool bluestore-kv /var/lib/ceph/osd/ceph-3 repair
But no success (it ends with "_open_db erroring opening db").
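Other invocations we found in the docs and may try next (paths as in our layout; no idea yet whether they can help):

# ceph-bluestore-tool fsck --path /var/lib/ceph/osd/ceph-3
# ceph-bluestore-tool repair --path /var/lib/ceph/osd/ceph-3
# ceph-bluestore-tool bluefs-export --path /var/lib/ceph/osd/ceph-3 --out-dir /root/bluefs-3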
So, does somebody have an idea how to fix this, or at least know if it is
possible to repair and correct the "Encountered error while reading data
from compression dictionary block Corruption: block checksum mismatch"
and "_open_db erroring opening db" errors?
Thanks for your help (we are desperate because we will lose data and
are fighting to save something)!
F.
RBD is never a workable solution unless you want to pay the cost of
double-replication in both HDFS and Ceph.
I think the right approach is to think about other implementations of the
Hadoop FileSystem interface, like s3a and LocalFS.
s3a is straightforward: Ceph RGW provides an S3 interface, and s3a is stable
and well tested in the Hadoop ecosystem - just run it. There are also a few
in-house solutions offered by some vendors that integrate librgw into the
s3a driver, which saves one extra hop and the management/LB cost of
maintaining an RGW cluster.
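A minimal s3a-against-RGW wiring is roughly the following core-site.xml fragment (standard Hadoop s3a property names; the endpoint and credentials are placeholders):

    <property>
      <name>fs.s3a.endpoint</name>
      <value>http://rgw.example.com:8080</value>
    </property>
    <property>
      <name>fs.s3a.path.style.access</name>
      <value>true</value>
    </property>
    <property>
      <name>fs.s3a.access.key</name>
      <value>YOUR_ACCESS_KEY</value>
    </property>
    <property>
      <name>fs.s3a.secret.key</name>
      <value>YOUR_SECRET_KEY</value>
    </property>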
The local filesystem option is a bit tricky. We just tried a POC that mounts
CephFS on every Hadoop node and configures Hadoop to use LocalFS with
replication = 1. This ends up with each piece of data written only once into
CephFS, and CephFS takes care of the data durability.
There was a libcephfs-jni, but it is significantly out of date and seems
to be abandoned, which is a pity.
With both solutions you certainly lose data locality, but you trade it for
better scalability and compute/storage separation.
-Xiaoxi
Marc Roos <M.Roos(a)f1-outsourcing.eu> wrote on Fri, 24 Apr 2020 at 16:00:
>
> I think the idea behind pool size of 1, is that hadoop already writes
> copies to 2 other pools(?).
>
> However that leaves the possibility that pg's of these 3 pools can maybe
> share an osd, and if that osd fails, you loose data in these pools. I
> have no idea what the chances are that the same data of different pools
> can end up on the same osd.
>
>
> -----Original Message-----
> To: ceph-users(a)ceph.io
> Subject: [ceph-users] HBase/HDFS on Ceph/CephFS
>
> Hi
>
> We have an 3 year old Hadoop cluster - up for refresh - so it is time to
> evaluate options. The "only" usecase is running an HBase installation
> which is important for us and migrating out of HBase would be a hazzle.
>
> Our Ceph usage has expanded and in general - we really like what we see.
>
> Thus - Can this be "sanely" consolidated somehow? I have seen this:
> https://docs.ceph.com/docs/jewel/cephfs/hadoop/
> But it seem really-really bogus to me.
>
> It recommends that you set:
> pool 3 'hadoop1' rep size 1 min_size 1
>
> Which would - if I understand correct - be disastrous. The Hadoop end
> would replicated in 3 across - but within Ceph the replication would be
> 1.
> The 1 replication in ceph means pulling the OSD node would "gaurantee"
> the pg's to go inactive - which could be ok - but there is nothing
> gauranteeing that the other Hadoop replicas are not served out of the
> same OSD-node/pg? In which case - rebooting an OSD node would bring the
> hadoop cluster unavailable.
>
> Is anyone serving HBase out of Ceph - how does the stadck and
> configuration look? If I went for 3 x replication in both Ceph and HDFS
> then it would definately work, but 9x copies of the dataset is a bit
> more than what looks feasible at the moment.
>
> Thanks for your reflections/input.
>
> Jesper
> _______________________________________________
> ceph-users mailing list -- ceph-users(a)ceph.io To unsubscribe send an
> email to ceph-users-leave(a)ceph.io
>
On Mon, Apr 27, 2020 at 7:38 AM Marc Roos <M.Roos(a)f1-outsourcing.eu> wrote:
>
> I guess this is not good for ssd (samsung sm863)? Or do I need to devide
> 14.8 by 40?
>
The 14.8 ms number is the average latency coming from the OSDs, so no need
to divide the number by anything. What is the size of your writes? At 40
writes/sec against an SSD-backed cluster, I can only hope they are large
IOs.
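For context, the iostat line below does work out to a fairly large average write size (simple arithmetic, nothing cluster-specific):

```shell
# Back-of-envelope arithmetic from the iostat line:
# 5.0 MiB/s spread over 40 writes/s gives the average write size.
wr_per_sec=40
wr_bytes_per_sec=$((5 * 1024 * 1024))             # 5.0 MiB/s in bytes
avg_write_kib=$(( wr_bytes_per_sec / wr_per_sec / 1024 ))
echo "average write size: ${avg_write_kib} KiB"   # -> 128 KiB per write
```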
>
> rbd perf image iostat
>
> NAME WR RD WR_BYTES RD_BYTES WR_LAT
> RD_LAT
> rbd.ssd/vps-test 40/s 0/s 5.0 MiB/s 0 B/s 14.84 ms
> 0.00 ns
> _______________________________________________
> ceph-users mailing list -- ceph-users(a)ceph.io
> To unsubscribe send an email to ceph-users-leave(a)ceph.io
>
>
--
Jason
I guess this is not good for an SSD (Samsung SM863)? Or do I need to divide
14.8 by 40?
rbd perf image iostat
NAME              WR    RD   WR_BYTES   RD_BYTES  WR_LAT    RD_LAT
rbd.ssd/vps-test  40/s  0/s  5.0 MiB/s  0 B/s     14.84 ms  0.00 ns