Hello everyone,
A few weeks ago I enabled the Ceph balancer on my cluster, following the instructions here: https://docs.ceph.com/docs/mimic/mgr/balancer/
I am running ceph version:
ceph version 13.2.6 (7b695f835b03642f85998b2ae7b6dd093d9fbce4) mimic (stable)
The cluster has 48 osds (40 osds in hdd pools and 8 osds in ssd pool)
Currently, the balancer status is showing as Active.
# ceph balancer status
{
"active": true,
"plans": [],
"mode": "upmap"
}
The health status of the cluster is:
health: HEALTH_OK
Previously, I used the old REWEIGHT mechanism to change the placement of data, as I was seeing very uneven usage (ranging from about 60% on some OSDs to over 90% on others). So I have a number of OSDs with a reweight of 1 and some going down to 0.75.
At the moment the OSD usage ranges from about 65% to just under 90%, so there is still a huge variation. Since switching on the balancer, I have not actually seen any activity or data migration, so I am not sure if the balancer is working at all. Could someone tell me how I can check whether balancing is doing its job?
The second question is: now that the balancer is switched on, am I supposed to set the reweight values back to their default of 1?
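For reference, these are the commands I have found so far for watching it (not sure they are the right way to check):

# ceph balancer eval          <- distribution score; lower should mean better balanced
# ceph osd df tree            <- per-OSD %USE and PG counts
# ceph -s                     <- remapped PGs / backfill activity would show here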
Many thanks
Hi all!
I exported an RBD image both with the export v2 format (--export-format 2) and without it.
The difference is that the v2-format export is much smaller than the image exported without it.
Can anybody tell me why the sizes differ so hugely?
Thanks very much!
root@controller:/mnt# rbd du images/35d69ca5-b4f7-499e-9719-331eee498bc4
NAME                                       PROVISIONED  USED
35d69ca5-b4f7-499e-9719-331eee498bc4@snap  40GiB        40GiB
35d69ca5-b4f7-499e-9719-331eee498bc4       40GiB        0B
<TOTAL>                                    40GiB        40GiB
root@controller:/mnt# rbd export --export-format 2 images/35d69ca5-b4f7-499e-9719-331eee498bc4 ./v2image
Exporting image: 100% complete...done.
root@controller:/mnt# du -sh ./v2image
9.8G ./v2image
root@controller:/mnt# rbd export images/35d69ca5-b4f7-499e-9719-331eee498bc4 ./image
Exporting image: 100% complete...done.
root@controller:/mnt# du -sh ./image
40G ./image
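I wonder if it is related to sparseness. A plain sparse file shows the same kind of du difference (just a local experiment, possibly unrelated to the rbd export formats):

```shell
# Local experiment (nothing ceph-specific): du reports far less than the
# apparent size for a sparse file, because unwritten extents allocate no blocks.
truncate -s 1G sparse.img
du -h --apparent-size sparse.img   # shows ~1.0G (apparent size)
du -h sparse.img                   # shows ~0 (actually allocated blocks)
```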
Hi
We have a 3-year-old Hadoop cluster - up for refresh - so it is time
to evaluate options. The "only" use case is running an HBase installation
which is important for us, and migrating off HBase would be a hassle.
Our Ceph usage has expanded and in general - we really like what we see.
Thus - can this be "sanely" consolidated somehow? I have seen this:
https://docs.ceph.com/docs/jewel/cephfs/hadoop/
But it seems really, really bogus to me.
It recommends that you set:
pool 3 'hadoop1' rep size 1 min_size 1
Which would - if I understand correctly - be disastrous. The Hadoop end
would replicate 3x across nodes - but within Ceph the replication would be 1.
Replication 1 in Ceph means pulling an OSD node would "guarantee" that
PGs go inactive - which could be OK - but there is nothing
guaranteeing that the other Hadoop replicas are not served out of the same
OSD node/PG. In that case, rebooting an OSD node would make the Hadoop
cluster unavailable.
Is anyone serving HBase out of Ceph - how does the stack and
configuration look? If I went for 3x replication in both Ceph and HDFS
then it would definitely work, but 9x copies of the dataset is a bit more
than what looks feasible at the moment.
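For reference, the wiring that page describes boils down to something like this in core-site.xml (property names as I read them off that page - untested by me):

    <property>
      <name>fs.default.name</name>
      <value>ceph://mon-host:6789/</value>
    </property>
    <property>
      <name>ceph.conf.file</name>
      <value>/etc/ceph/ceph.conf</value>
    </property>
    <property>
      <name>ceph.data.pools</name>
      <value>hadoop1</value>
    </property>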
Thanks for your reflections/input.
Jesper
Hi, I thought I might ask here since I was unable to find anything similar
to my issue. Perhaps someone might have an idea.
In our org we are currently running a few Nautilus clusters (14.2.2,
14.2.4, 14.2.8). But strangely enough, the clusters on 14.2.8 are
reporting weird pool R/W metrics. To be more specific, from time to
time some random huge spikes on the order of TB/s appear in our graphs,
whereas the real usage is GB/s (the clusters are monitored by Prometheus).
First to blame were our exporters, so I tried the one built into the
ceph-dashboard, but the values of those metrics were the same (+- a few
bytes). I went as far as writing a small program to talk directly to the
Ceph API, but after I processed the output, the data was also the same. I
can provide more details (graphs, logs, etc.) if someone has a clue what
the cause might be.
As for the rest of the clusters, their graphs do not contain
any spikes like the ones on the 14.2.8 clusters.
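The graphs come from queries of roughly this shape (assuming the standard mgr/prometheus metric names; simplified):

    # PromQL - simplified; ceph_pool_wr_bytes / ceph_pool_rd_bytes are the
    # per-pool byte counters exposed by the mgr prometheus module
    sum by (pool_id) (rate(ceph_pool_wr_bytes[5m]))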
Thanks for any responses/help in advance.
Completed the migration of an existing Ceph cluster on Octopus to cephadm.
All OSD/MON/MGR daemons moved fine; however, upon running the command to set up some new MDS daemons for CephFS, they both failed to start.
After looking into the cephadm logs I found the following error:
Apr 13 06:26:15 sn-s01 systemd[1]: Started Ceph mds.cephfs.sn-s01.snkhfd for b1db6b36-0c4c-4bce-9cda-18834be0632d.
Apr 13 06:26:16 sn-s01 bash[3520809]: debug 2020-04-13T04:26:16.445+0000 7f380b908700 -1 monclient(hunting): handle_auth_bad_method server allowed_methods [1] but i only support [2]
Apr 13 06:26:16 sn-s01 bash[3520809]: debug 2020-04-13T04:26:16.445+0000 7f380b107700 -1 monclient(hunting): handle_auth_bad_method server allowed_methods [1] but i only support [2]
Apr 13 06:26:16 sn-s01 bash[3520809]: debug 2020-04-13T04:26:16.445+0000 7f380a906700 -1 monclient(hunting): handle_auth_bad_method server allowed_methods [1] but i only support [2]
Apr 13 06:26:16 sn-s01 bash[3520809]: failed to fetch mon config (--no-mon-config to skip
This cluster is running with cephx disabled. I imported the ceph.conf into cephadm fine, and this has worked when the other services started, but from the above error it looks like maybe the MDS is not checking whether cephx is enabled or disabled before trying to communicate with the mons?
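For reference, the auth options I believe are in play (names as in the docs; whether cephadm passes them into the MDS container is exactly what I am unsure about):

# ceph config get mon auth_cluster_required
# ceph config set global auth_cluster_required none
# ceph config set global auth_service_required none
# ceph config set global auth_client_required none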
Hi all,
*** Short version ***
Is there a way to repair a RocksDB hitting the errors "Encountered error while
reading data from compression dictionary block Corruption: block
checksum mismatch" and "_open_db erroring opening db"?
*** Long version ***
We operate a nautilus ceph cluster (with 100 disks of 8TB in 6 servers +
4 mons/mgr + 3 mds).
We recently (Monday 20) upgraded from 14.2.7 to 14.2.8. This triggered a
rebalancing of some data.
Two days later (Wednesday 22) we had a very short power outage. Only one
of the OSD servers went down (and unfortunately died).
This triggered a reconstruction of the lost OSDs. Operations went fine
until Saturday 25, when some OSDs on the 5 remaining servers started to
crash for no apparent reason.
We tried to restart them, but they crashed again. We ended up with 18 OSDs
down (+ 16 in the dead server, so 34 OSDs down out of 100).
Looking at the logs, we found the following for all the crashed OSDs:
-237> 2020-04-25 16:32:51.835 7f1f45527a80 3 rocksdb:
[table/block_based_table_reader.cc:1117] Encountered error while reading
data from compression dictionary block Corruption: block checksum
mismatch: expected 0, got 2729370997 in db/181355.sst offset
18446744073709551615 size 18446744073709551615
and
2020-04-25 16:05:47.251 7fcbd1e46a80 -1
bluestore(/var/lib/ceph/osd/ceph-3) _open_db erroring opening db:
We also noticed that the "Encountered error while reading data from
compression dictionary block Corruption: block checksum mismatch" message was
present a few days before the crash.
We also have some OSDs with this error that are still up.
We tried to repair with:
ceph-kvstore-tool bluestore-kv /var/lib/ceph/osd/ceph-3 repair
But no success (it ends with "_open_db erroring opening db").
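Other invocations we found in the docs and may try next (paths as in our layout; no idea yet whether they can help):

# ceph-bluestore-tool fsck --path /var/lib/ceph/osd/ceph-3
# ceph-bluestore-tool repair --path /var/lib/ceph/osd/ceph-3
# ceph-bluestore-tool bluefs-export --path /var/lib/ceph/osd/ceph-3 --out-dir /root/bluefs-3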
So, does somebody have an idea how to fix this, or at least know if it is
possible to repair and correct the "Encountered error while reading data
from compression dictionary block Corruption: block checksum mismatch"
and "_open_db erroring opening db" errors?
Thanks for your help (we are desperate because we will lose data and
are fighting to save something)!
F.
RBD is never a workable solution unless you want to pay the cost of
double-replication in both HDFS and Ceph.
I think the right approach is to think about other implementations of the
Hadoop FileSystem interface, like s3a and LocalFS.
s3a is straightforward: Ceph RGW provides an S3 interface, and s3a is stable
and well tested in the Hadoop ecosystem - just run it. There are also a few
in-house solutions offered by some vendors that integrate librgw into the
s3a driver, which saves one extra hop and the management/LB cost of
maintaining an RGW cluster.
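A minimal s3a-against-RGW wiring is roughly the following core-site.xml fragment (standard Hadoop s3a property names; the endpoint and credentials are placeholders):

    <property>
      <name>fs.s3a.endpoint</name>
      <value>http://rgw.example.com:8080</value>
    </property>
    <property>
      <name>fs.s3a.path.style.access</name>
      <value>true</value>
    </property>
    <property>
      <name>fs.s3a.access.key</name>
      <value>YOUR_ACCESS_KEY</value>
    </property>
    <property>
      <name>fs.s3a.secret.key</name>
      <value>YOUR_SECRET_KEY</value>
    </property>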
The local filesystem option is a bit tricky. We just tried a POC that mounts
CephFS on every Hadoop node and configures Hadoop to use LocalFS with
replication = 1. This ends up with each piece of data written only once into
CephFS, and CephFS takes care of the data durability.
There was a libcephfs-jni, but it is significantly out of date and seems
to be abandoned, which is a pity.
With both solutions you certainly lose data locality, but you trade it for
better scalability and compute/storage separation.
-Xiaoxi
Marc Roos <M.Roos(a)f1-outsourcing.eu> wrote on Fri, 24 Apr 2020 at 16:00:
>
> I think the idea behind pool size of 1, is that hadoop already writes
> copies to 2 other pools(?).
>
> However that leaves the possibility that pg's of these 3 pools can maybe
> share an osd, and if that osd fails, you loose data in these pools. I
> have no idea what the chances are that the same data of different pools
> can end up on the same osd.
>
>
> -----Original Message-----
> To: ceph-users(a)ceph.io
> Subject: [ceph-users] HBase/HDFS on Ceph/CephFS
>
> Hi
>
> We have an 3 year old Hadoop cluster - up for refresh - so it is time to
> evaluate options. The "only" usecase is running an HBase installation
> which is important for us and migrating out of HBase would be a hazzle.
>
> Our Ceph usage has expanded and in general - we really like what we see.
>
> Thus - Can this be "sanely" consolidated somehow? I have seen this:
> https://docs.ceph.com/docs/jewel/cephfs/hadoop/
> But it seem really-really bogus to me.
>
> It recommends that you set:
> pool 3 'hadoop1' rep size 1 min_size 1
>
> Which would - if I understand correct - be disastrous. The Hadoop end
> would replicated in 3 across - but within Ceph the replication would be
> 1.
> The 1 replication in ceph means pulling the OSD node would "gaurantee"
> the pg's to go inactive - which could be ok - but there is nothing
> gauranteeing that the other Hadoop replicas are not served out of the
> same OSD-node/pg? In which case - rebooting an OSD node would bring the
> hadoop cluster unavailable.
>
> Is anyone serving HBase out of Ceph - how does the stadck and
> configuration look? If I went for 3 x replication in both Ceph and HDFS
> then it would definately work, but 9x copies of the dataset is a bit
> more than what looks feasible at the moment.
>
> Thanks for your reflections/input.
>
> Jesper
> _______________________________________________
> ceph-users mailing list -- ceph-users(a)ceph.io To unsubscribe send an
> email to ceph-users-leave(a)ceph.io
>
On Mon, Apr 27, 2020 at 7:38 AM Marc Roos <M.Roos(a)f1-outsourcing.eu> wrote:
>
> I guess this is not good for ssd (samsung sm863)? Or do I need to devide
> 14.8 by 40?
>
The 14.8 ms number is the average latency coming from the OSDs, so no need
to divide the number by anything. What is the size of your writes? At 40
writes/sec against an SSD-backed cluster, I can only hope they are large
IOs.
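For context, the iostat line below does work out to a fairly large average write size (simple arithmetic, nothing cluster-specific):

```shell
# Back-of-envelope arithmetic from the iostat line:
# 5.0 MiB/s spread over 40 writes/s gives the average write size.
wr_per_sec=40
wr_bytes_per_sec=$((5 * 1024 * 1024))             # 5.0 MiB/s in bytes
avg_write_kib=$(( wr_bytes_per_sec / wr_per_sec / 1024 ))
echo "average write size: ${avg_write_kib} KiB"   # -> 128 KiB per write
```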
>
> rbd perf image iostat
>
> NAME WR RD WR_BYTES RD_BYTES WR_LAT
> RD_LAT
> rbd.ssd/vps-test 40/s 0/s 5.0 MiB/s 0 B/s 14.84 ms
> 0.00 ns
> _______________________________________________
> ceph-users mailing list -- ceph-users(a)ceph.io
> To unsubscribe send an email to ceph-users-leave(a)ceph.io
>
>
--
Jason
I guess this is not good for an SSD (Samsung SM863)? Or do I need to divide
14.8 by 40?
rbd perf image iostat
NAME              WR    RD   WR_BYTES   RD_BYTES  WR_LAT    RD_LAT
rbd.ssd/vps-test  40/s  0/s  5.0 MiB/s  0 B/s     14.84 ms  0.00 ns