Hello,
I have created a small EC pool (16 PGs) with k=4, m=2.
Then I applied the following crush rule to it:
rule test_ec {
        id 99
        type erasure
        min_size 5
        max_size 6
        step set_chooseleaf_tries 5
        step set_choose_tries 100
        step take default
        step choose indep 3 type host
        step chooseleaf indep 2 type osd
        step emit
}
The OSD tree looks as follows:
 -1       43.38448 root default
 -9       43.38448     region lab1
 -7       43.38448         room dc1.lab1
 -5       43.38448             rack r1.dc1.lab1
 -3       14.44896                 host host1.r1.dc1.lab1
  6   hdd  3.63689                     osd.6                  up  1.00000 1.00000
  8   hdd  3.63689                     osd.8                  up  1.00000 1.00000
  7   hdd  3.63689                     osd.7                  up  1.00000 1.00000
 11   hdd  3.53830                     osd.11                 up  1.00000 1.00000
-11       14.44896                 host host2.r1.dc1.lab1
  4   hdd  3.63689                     osd.4                  up  1.00000 1.00000
  9   hdd  3.63689                     osd.9                  up  1.00000 1.00000
  5   hdd  3.63689                     osd.5                  up  1.00000 1.00000
 10   hdd  3.53830                     osd.10                 up  1.00000 1.00000
-13       14.48656                 host host3.r1.dc1.lab1
  0   hdd  3.57590                     osd.0                  up  1.00000 1.00000
  1   hdd  3.63689                     osd.1                  up  1.00000 1.00000
  2   hdd  3.63689                     osd.2                  up  1.00000 1.00000
  3   hdd  3.63689                     osd.3                  up  1.00000 1.00000
My expectation was that each host would hold 2 shards of any PG of the pool.
When I dumped the PGs this was mostly true, but one PG has three of its
shards on host3 (OSDs 0, 2 and 3), which would cause downtime if host3 fails.
root@host1:~/mkw # ceph pg dump|grep "^66\."|awk '{print $17}'
dumped all
[4,5,7,6,1,2]
[8,11,9,3,0,2] <<< - this one is problematic
[6,7,10,9,2,0]
[2,3,7,6,5,9]
[7,8,10,5,3,1]
[4,5,8,6,0,2]
[7,11,9,4,1,2]
[5,9,0,2,7,11]
[9,5,3,1,7,8]
[8,11,2,0,5,9]
[2,0,8,6,10,9]
[3,2,5,9,7,11]
[6,7,9,5,1,2]
[10,5,1,3,11,8]
[4,5,7,8,2,0]
[7,8,3,2,9,10]
Is there a way to ensure that a host failure is not disruptive to the cluster?
During the experiment I used info from this thread:
http://lists.ceph.com/pipermail/ceph-users-ceph.com/2018-October/030227.html
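For anyone who wants to reproduce this without touching the cluster, the
mappings should also be checkable offline with crushtool (a sketch, using
rule id 99 and the 6 shards from above):

# export the compiled crush map from the cluster
ceph osd getcrushmap -o crushmap.bin
# print the OSD sets the rule computes for each input
crushtool -i crushmap.bin --test --rule 99 --num-rep 6 --show-mappings
# print only mappings where fewer than 6 OSDs could be chosen
crushtool -i crushmap.bin --test --rule 99 --num-rep 6 --show-bad-mappings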
Kind regards,
Maks Kowalik
I have a small cluster with a single crush map. I use 3 pools: "one" (OpenNebula VMs on RBD), plus cephfs_data and cephfs_metadata for CephFS. Here is my ceph df output:
RAW STORAGE:
CLASS SIZE AVAIL USED RAW USED %RAW USED
ssd 94 TiB 78 TiB 17 TiB 17 TiB 17.75
TOTAL 94 TiB 78 TiB 17 TiB 17 TiB 17.75
POOLS:
POOL ID STORED OBJECTS USED %USED MAX AVAIL
cephfs_data 1 3.3 TiB 6.62M 10 TiB 12.36 24 TiB
cephfs_metadata 2 2.1 GiB 447.63k 2.5 GiB 0 24 TiB
one 5 2.2 TiB 598.12k 6.6 TiB 8.42 24 TiB
What confuses me is that MAX AVAIL shows the same value for all of those pools. When I mount cephfs on a client host, df -h shows me the pool utilization:
28T 3.4T 24T 13%
I also have an old Hammer cluster where I see a similar picture in ceph df for a single crush map (covering rbd, cephfs-data, cephfs-metadata):
GLOBAL:
SIZE AVAIL RAW USED %RAW USED
87053G 31306G 55747G 64.04
POOLS:
NAME ID USED %USED MAX AVAIL OBJECTS
rbd 0 12907G 69.37 5700G 3312474
cephfs-data 12 2873G 33.52 5700G 5859947
cephfs-meta 13 90035k 0 5700G 443961
cloud12g 14 2857G 43.41 3726G 623737
However, df -h on clients shows the total cluster utilization:
86T 55T 31T 65%
It seems that Hammer dynamically shares the available space between the pools on the same crush map as needed. Does Nautilus do the same? In that case, does 24 TiB actually mean the available raw space divided by 3 (all my pools are set with 3/2 replication)?
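My own back-of-the-envelope check, assuming MAX AVAIL is simply the remaining
raw space divided by the replication factor (my guess, not something I have
confirmed):

    78 TiB raw AVAIL / 3 replicas ≈ 26 TiB

That is close to the 24 TiB shown, so the difference would have to come from
headroom for the full ratio and/or OSD imbalance.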
Thank you and sorry for the confusion
Since the spread of the corona virus is taking such drastic proportions
that flights between Europe and the US are being halted, I would suggest
we show some support and temporarily use only :) and not :D on the
mailing list.
I have a default CentOS 7 setup with Nautilus. I have been asked to install
kernel 5.5 to check a 'bug'. Where should I get this from? I read that the
elrepo kernel is not compiled like the RHEL one.
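What I have found so far is the usual ELRepo mainline route (a sketch; I
have not verified that kernel-ml currently packages 5.5):

# add the ELRepo repository on CentOS 7
rpm --import https://www.elrepo.org/RPM-GPG-KEY-elrepo.org
yum install https://www.elrepo.org/elrepo-release-7.el7.elrepo.noarch.rpm
# install the mainline kernel from the elrepo-kernel repo
yum --enablerepo=elrepo-kernel install kernel-ml
# make the new kernel the default boot entry (assumes it is entry 0)
grub2-set-default 0
reboot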
Ok, I think that answers my question then, thanks! Too risky to be playing with patterns that will get increasingly difficult to support over time.
> On Mar 12, 2020, at 12:48 PM, Anthony D'Atri <anthony.datri(a)gmail.com> wrote:
>
> They won’t be AFAIK. Few people ever did this.
>
>> On Mar 12, 2020, at 11:08 AM, Brian Topping <brian.topping(a)gmail.com> wrote:
>>
>> If the ceph roadmap is getting rid of named clusters, how will multiple clusters be supported? How (for instance) would `/var/lib/ceph/mon/{name}` directories be resolved?
> On Mar 11, 2020, at 8:29 PM, Brian Topping <brian.topping(a)gmail.com> wrote:
>
>> On Mar 11, 2020, at 7:59 PM, Anthony D'Atri <anthony.datri(a)gmail.com> wrote:
>>
>>> This is all possible with a single cluster, but this limited node also needs storage.
>>
>> Are you saying that the limited node needs to access Ceph-based storage? Is this some sort of converged architecture?
>
> It is a converged architecture in that all three boxes are running Kubernetes. There is one k8s cluster on each side of the link, let’s call them “primary” and “secondary”:
> * The primary k8s cluster will only access storage from the primary Ceph cluster, secondary k8s only accesses storage from secondary Ceph.
> * Primary Ceph gets monitors on both sides of the link. Secondary Ceph only has monitors on the secondary side.
>
> In a netsplit situation, the primary Ceph will maintain quorum with both nodes on the primary side. The secondary Ceph cluster only exists separately for this netsplit situation and the secondary k8s cluster can continue unaffected.
>
> With this in place, the primary side can continue operating with either primary node downed for maintenance via a suboptimal quorum over the WAN link. I cannot do that today.
>
> I am sacrificing the case where there is a netsplit at the same time I am doing maintenance.
>
> Thanks for your input!
> Brian
Hi,
Currently running Mimic 13.2.5.
We had reports this morning of timeouts and failures with PUT and GET
requests to our Ceph RGW cluster. I found these messages in the RGW
log:
RGWReshardLock::lock failed to acquire lock on
bucket_name:bucket_instance ret=-16
NOTICE: resharding operation on bucket index detected, blocking
block_while_resharding ERROR: bucket is still resharding, please retry
These were preceded by many of the following, which I think are normal/expected:
check_bucket_shards: resharding needed: stats.num_objects=6415879
shard max_objects=6400000
Our RGW cluster sits behind haproxy, which notified me approx. 90
seconds after the first 'resharding needed' message that no backends
were available. It appears this dynamic reshard process caused the
RGWs to lock up for a period of time. Roughly 2 minutes later the
reshard error messages stopped and operation returned to normal.
Looking back through previous RGW logs, I see a similar event from
about a week ago, on the same bucket. We have several buckets with
shard counts exceeding 1k (this one has only 128) and much larger
object counts, so clearly this isn't the first time dynamic resharding
has been invoked on this cluster.
Has anyone seen this? I expect it will come up again, and can turn up
debugging if that'll help. Thanks for any assistance!
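In case it's useful for anyone looking at this, the reshard state can be
inspected with (a sketch; the bucket name is a placeholder):

# list pending/in-progress resharding operations
radosgw-admin reshard list
# show the reshard status of the affected bucket
radosgw-admin reshard status --bucket=bucket_name
# compare per-bucket object counts against the shard limits
radosgw-admin bucket limit check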
Josh
Hi,
I'm (still) testing upgrading from Luminous to Nautilus and ran into the
following situation:
The lab-setup I'm testing in has three OSD-Hosts.
If one of those hosts dies, the store.db in /var/lib/ceph/mon/ on all my
mon nodes starts to grow rapidly until either the OSD host comes back up
or the disks are full.
On another cluster that's still on Luminous I don't see any growth at all.
Is that a difference in behaviour between Luminous and Nautilus, or is it
caused by the lab setup only having three hosts, so that a single lost host
makes all PGs degraded at the same time?
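For reference, the growth can be watched and a compaction triggered like
this (a sketch; the mon id is a placeholder):

# watch the mon store size on each mon node
du -sh /var/lib/ceph/mon/*/store.db
# ask a mon to compact its store (only reclaims space the mon may trim)
ceph tell mon.<id> compact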
--
Cheers,
Hardy
Hi, I’m getting conflicting information from the documentation. It seems that by using the “cluster name”[1], multiple clusters can be run in parallel on the same hardware.
In trying to set this up with `ceph-deploy`, I see the man page[2] says "if it finds the distro.init to be sysvinit (Fedora, CentOS/RHEL etc), it doesn't allow installation with custom cluster name and uses the default name ceph for the cluster”.
Is it possible to run multiple clusters on the same hardware with CentOS 7 as the base OS?
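For context, the kind of invocation I mean is (a sketch; the cluster name
"site2" and host "mon1" are hypothetical):

# bootstrap a second, separately named cluster on the same hosts
ceph-deploy --cluster site2 new mon1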
Thanks, Brian
[1] https://docs.ceph.com/docs/nautilus/install/manual-deployment/#monitor-boot…
[2] https://docs.ceph.com/docs/nautilus/man/8/ceph-deploy/?highlight=ceph-deplo…
Hi,
I'm trying to create a namespace in rados, create a user that has
access to this namespace, and then read and write objects in it with
the rados command line utility using that user.
I can't find an example of how to do this.
Can someone point me to such an example or show me how to do it?
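Here is what I've pieced together so far, in case it is close (a sketch;
"mypool", "ns1" and "client.ns1user" are placeholder names, and my
understanding is that namespaces are created implicitly on first write
rather than with a separate command):

# create a user restricted to namespace ns1 in pool mypool
ceph auth get-or-create client.ns1user mon 'allow r' osd 'allow rw pool=mypool namespace=ns1'
# write, read and list objects in that namespace as the new user
rados --id ns1user -p mypool --namespace ns1 put obj1 ./localfile
rados --id ns1user -p mypool --namespace ns1 get obj1 ./obj1.out
rados --id ns1user -p mypool --namespace ns1 ls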
Regards,
Rodrigo Severo