Hello everyone
I've got a fresh Ceph Octopus installation and I'm trying to set up a CephFS with an erasure-coded data pool.
The metadata pool was set up as default.
The erasure code pool was set up with this command:
-> ceph osd pool create ec-data_fs 128 erasure default
Enabled overwrites:
-> ceph osd pool set ec-data_fs allow_ec_overwrites true
And created the fs:
-> ceph fs new ec-data_fs meta_fs ec-data_fs --force
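As an aside, the --force is needed there because an EC pool as the default data pool is discouraged; the layout the CephFS docs usually suggest keeps a small replicated default data pool and attaches the EC pool as an additional data pool. A rough sketch (pool names and the mount path are placeholders, not from the setup above):

```shell
# Small replicated pool used as the fs default data pool
ceph osd pool create cephfs-data-default 64

# Create the fs on the replicated pool, then attach the EC pool
ceph fs new ec-data_fs meta_fs cephfs-data-default
ceph fs add_data_pool ec-data_fs ec-data_fs

# Direct files under a directory to the EC pool via a file layout
setfattr -n ceph.dir.layout.pool -v ec-data_fs /mnt/cephfs/ec-dir
```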
Then I tried deploying the mds, but this fails:
-> ceph orch daemon add mds ec-data_fs magma01
returns:
-> Deployed mds.ec-data_fs.magma01.ujpcly on host 'magma01'
The mds daemon is not there.
Apparently the container dies without leaving any information, as seen in the journal:
May 25 16:11:56 magma01 podman[9348]: 2020-05-25 16:11:56.670510456 +0200 CEST m=+0.186462913 container create 0fdf8c508b330adac713ffb04c72b5df770277ad191d844888f7387f28e3cc90 (image=docker.io/ceph/ceph:v15, name=competent_cori)
May 25 16:11:56 magma01 systemd[1]: Started libpod-conmon-0fdf8c508b330adac713ffb04c72b5df770277ad191d844888f7387f28e3cc90.scope.
May 25 16:11:56 magma01 systemd[1]: Started libcontainer container 0fdf8c508b330adac713ffb04c72b5df770277ad191d844888f7387f28e3cc90.
May 25 16:11:57 magma01 podman[9348]: 2020-05-25 16:11:57.112182262 +0200 CEST m=+0.628134873 container init 0fdf8c508b330adac713ffb04c72b5df770277ad191d844888f7387f28e3cc90 (image=docker.io/ceph/ceph:v15, name=competent_cori)
May 25 16:11:57 magma01 podman[9348]: 2020-05-25 16:11:57.137011897 +0200 CEST m=+0.652964354 container start 0fdf8c508b330adac713ffb04c72b5df770277ad191d844888f7387f28e3cc90 (image=docker.io/ceph/ceph:v15, name=competent_cori)
May 25 16:11:57 magma01 podman[9348]: 2020-05-25 16:11:57.137110412 +0200 CEST m=+0.653062853 container attach 0fdf8c508b330adac713ffb04c72b5df770277ad191d844888f7387f28e3cc90 (image=docker.io/ceph/ceph:v15, name=competent_cori)
May 25 16:11:57 magma01 systemd[1]: libpod-0fdf8c508b330adac713ffb04c72b5df770277ad191d844888f7387f28e3cc90.scope: Consumed 327ms CPU time
May 25 16:11:57 magma01 podman[9348]: 2020-05-25 16:11:57.182968802 +0200 CEST m=+0.698921275 container died 0fdf8c508b330adac713ffb04c72b5df770277ad191d844888f7387f28e3cc90 (image=docker.io/ceph/ceph:v15, name=competent_cori)
May 25 16:11:57 magma01 podman[9348]: 2020-05-25 16:11:57.413743787 +0200 CEST m=+0.929696266 container remove 0fdf8c508b330adac713ffb04c72b5df770277ad191d844888f7387f28e3cc90 (image=docker.io/ceph/ceph:v15, name=competent_cori)
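The podman lines above only trace the container lifecycle. To see the daemon's own output, something along these lines may help (daemon name taken from the deploy output above; the fsid is a placeholder):

```shell
# Ask cephadm for the daemon's journald-backed logs
cephadm logs --name mds.ec-data_fs.magma01.ujpcly

# Or query journald directly for the unit cephadm created
journalctl -u ceph-<fsid>@mds.ec-data_fs.magma01.ujpcly.service

# What the orchestrator itself thinks happened
ceph orch ps
ceph log last cephadm
```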
Can someone help me debug this?
Cheers
Simon
Hi,
Following on from various woes, we see an odd and unhelpful behaviour with some OSDs on our cluster currently.
A minority of OSDs seem to have runaway memory usage, rising to tens of GB, whilst other OSDs on the same host behave sensibly. As far as we can tell, this started when we moved from Mimic to Nautilus.
In the best case, this causes some nodes to start swapping (which reduces their performance); in the worst case, it triggers the OOM killer.
I have dumped the mempool for these OSDs, which shows that almost all the memory is in the buffer_anon pool.
The perf dump shows that the OSD is targeting the 4 GB limit that's set for it, but for some reason it cannot get down to it because of data tracked by the priority cache (which seems to be mostly what is filling buffer_anon).
Can anyone advise on what we should do next?
(mempool dump and excerpt of perf dump at end of email).
Thanks for any help,
Sam Skipsey
MEMPOOL DUMP
{
    "mempool": {
        "by_pool": {
            "bloom_filter": {
                "items": 0,
                "bytes": 0
            },
            "bluestore_alloc": {
                "items": 5629372,
                "bytes": 45034976
            },
            "bluestore_cache_data": {
                "items": 127,
                "bytes": 65675264
            },
            "bluestore_cache_onode": {
                "items": 8275,
                "bytes": 4634000
            },
            "bluestore_cache_other": {
                "items": 2967913,
                "bytes": 62469216
            },
            "bluestore_fsck": {
                "items": 0,
                "bytes": 0
            },
            "bluestore_txc": {
                "items": 145,
                "bytes": 100920
            },
            "bluestore_writing_deferred": {
                "items": 335,
                "bytes": 13160884
            },
            "bluestore_writing": {
                "items": 1406,
                "bytes": 5379120
            },
            "bluefs": {
                "items": 1105,
                "bytes": 24376
            },
            "buffer_anon": {
                "items": 13705143,
                "bytes": 40719040439
            },
            "buffer_meta": {
                "items": 6820143,
                "bytes": 600172584
            },
            "osd": {
                "items": 96,
                "bytes": 1138176
            },
            "osd_mapbl": {
                "items": 59,
                "bytes": 7022524
            },
            "osd_pglog": {
                "items": 491049,
                "bytes": 156701043
            },
            "osdmap": {
                "items": 107885,
                "bytes": 1723616
            },
            "osdmap_mapping": {
                "items": 0,
                "bytes": 0
            },
            "pgmap": {
                "items": 0,
                "bytes": 0
            },
            "mds_co": {
                "items": 0,
                "bytes": 0
            },
            "unittest_1": {
                "items": 0,
                "bytes": 0
            },
            "unittest_2": {
                "items": 0,
                "bytes": 0
            }
        },
        "total": {
            "items": 29733053,
            "bytes": 41682277138
        }
    }
}
PERF DUMP excerpt:
"prioritycache": {
"target_bytes": 4294967296,
"mapped_bytes": 38466584576,
"unmapped_bytes": 425984,
"heap_bytes": 38467010560,
"cache_bytes": 134217728
},
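For reference, the numbers above can be pulled live from the admin socket; a hedged sketch (the osd id is a placeholder, and jq is only used for readability):

```shell
# Per-pool allocations inside the OSD process
ceph daemon osd.0 dump_mempools

# Priority-cache view: target vs mapped heap
ceph daemon osd.0 perf dump | jq '.prioritycache'

# Confirm the memory target the OSD is actually running with
ceph config get osd.0 osd_memory_target

# tcmalloc can hold on to freed memory; ask it to return unmapped pages
ceph tell osd.0 heap release
```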
Folks,
I am running into a very strange issue with a brand new Ceph cluster during initial testing. The cluster
consists of 12 nodes; 4 of them have SSDs only, the other eight have a mixture of SSDs and HDDs.
The latter nodes are configured so that three or four HDDs share one SSD for their block DB.
The Ceph version is Nautilus.
When writing to the cluster, clients will, at regular intervals, run into I/O stalls (i.e. writes can take up
to 25 minutes to complete). Deleting RBD images often takes forever as well. After several weeks
of debugging, what I can say from looking at the log files is that what appears to take a lot of time
is writing to the OSDs:
"time": "2020-05-20 10:52:23.211006",
"event": "reached_pg"
},
{
"time": "2020-05-20 10:52:23.211047",
"event": "waiting for ondisk"
},
{
"time": "2020-05-20 10:53:35.369081",
"event": "done"
}
But these machines are idling on I/O: according to sysstat there is almost no I/O happening at all.
I am slowly growing a bit desperate over this, so I wonder whether anybody has ever
seen a similar issue? Or are there any tips on where to carry on with debugging?
Servers are from Dell with PERC controllers in HBA mode.
The primary purpose of this Ceph cluster is to serve as backing storage for OpenStack, and to
this point, I was not able to reproduce the issue with the SSD-only nodes.
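When the stalls happen, the OSD-side view of the slow ops can narrow down which OSD and which event the time is being spent in; a sketch, assuming admin-socket access on the OSD hosts (the osd id is a placeholder):

```shell
# Ops currently stuck in an OSD, with their event timelines
ceph daemon osd.12 dump_ops_in_flight

# Recently completed slow ops (long "waiting for ondisk" gaps show up here)
ceph daemon osd.12 dump_historic_ops

# Cluster-wide: which OSDs are reporting slow requests
ceph health detail
```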
Best regards
Martin
Hi,
I am trying to set up a multisite configuration with 2 sites. I created the master zonegroup and zone by following the instructions in the documentation. On the secondary zone's cluster I was able to pull the master zone, and I created the secondary zone. When I try to commit the period, I get the following error:
2020-05-25 16:16:46.054 7f4ad25596c0 1 Cannot find zone id=2f272093-3712-45a7-8a63-b17f12ccd07c (name=testsite2), switching to local zonegroup configuration
Sending period to new master zone 6d8d5ffa-2034-4717-978e-3ab4ba4349c5
request failed: (5) Input/output error
failed to commit period: (5) Input/output error
Can someone please help me solve this issue?
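An EIO on period commit often points at the secondary not being able to reach the master zone's endpoint with valid system-user credentials, or at a stale local period. Some things to compare, as a hedged sketch (URL and credentials are placeholders):

```shell
# On the secondary: what the local period and zonegroup actually contain
radosgw-admin period get
radosgw-admin zonegroup get

# Verify the master endpoint answers from the secondary's network
curl -s http://master-host:8080

# Re-pull the current period from the master, then retry the commit
radosgw-admin period pull --url=http://master-host:8080 \
    --access-key=<key> --secret=<secret>
radosgw-admin period update --commit
```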
Regards,
Sailaja
Hi,
The object counts from "rados df" and "rados ls" differ
in my testing environment. I think they may be zero-byte or unclean
objects, since I removed all RBD images on top of the pool a few days ago.
How can I fix this, or find out where those ghost objects are? Or should
I ignore it, since the numbers are not that high?
$ rados -p rbd df
POOL_NAME  USED    OBJECTS  CLONES  COPIES   MISSING_ON_PRIMARY  UNFOUND  DEGRADED  RD_OPS     RD       WR_OPS    WR      USED COMPR  UNDER COMPR
rbd        18 MiB  430107   0       1290321  0                   0        0         141243877  6.9 TiB  42395431  11 TiB  0 B         0 B
$ rados -p rbd ls | wc -l
4
$ rados -p rbd ls
gateway.conf
rbd_directory
rbd_info
rbd_trash
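One thing worth ruling out: plain `rados ls` lists only the default namespace of the pool, while the object count in `rados df` covers all namespaces. A hedged sketch (the namespace name is a placeholder):

```shell
# List objects across all namespaces (each line is prefixed with its namespace)
rados -p rbd ls --all | wc -l

# Or inspect one specific namespace
rados -p rbd -N some-namespace ls
```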
Regs,
Icy
Hi, I am new to RGW and am trying to deploy a multisite configuration in order to sync
data from one cluster to another.
My source zone is the default zone in the default zonegroup, structured as
below:
realm: big-realm
|
zonegroup: default
/ \
master zone: default secondary zone: backup
*STEP*:
on source cluster:
1. radosgw-admin realm create --rgw-realm=big-realm --default
2. radosgw-admin zonegroup modify --rgw-realm big-realm --rgw-zonegroup
default --master --endpoints "http://172.24.29.26:7480"
3. radosgw-admin zone modify --rgw-zonegroup default --rgw-zone default
--master --endpoints "http://172.24.29.26:7480"
4. radosgw-admin user create --uid=sync-user
--display-name="Synchronization User" --access-key=redhat --secret=redhat
--system
5. radosgw-admin zone modify --rgw-zone=default --access-key=redhat
--secret=redhat
6. radosgw-admin period update --commit
on destination cluster:
1. radosgw-admin realm pull --url="http://172.24.29.26:7480"
--access-key=redhat --secret=redhat --rgw-realm=big-realm
2. radosgw-admin realm default --rgw-realm=big-realm
3. radosgw-admin period pull --url="http://172.24.29.26:7480"
--access-key=redhat --secret=redhat
4. radosgw-admin zonegroup default --rgw-zonegroup=default
5. radosgw-admin zone create --rgw-zonegroup=default --rgw-zone=backup
--endpoints="http://172.24.29.29:7480" --access-key=redhat --secret=redhat
--default
6. radosgw-admin period update --commit
Committing the period on the secondary zone gives this error:
2020-04-02 14:36:04.707 7fd8ee9376c0 1 Cannot find zone
id=8c75360a-c0cf-4772-b85e-ff74726396c2 (name=backup), switching to local
zonegroup configuration
Sending period to new master zone 5fba7cae-47f1-4c8e-9a34-1b499c9c27f8
request failed: (2202) Unknown error 2202
failed to commit period: (2202) Unknown error 2202
radosgw-admin sync status:
2020-04-02 14:37:18.330 7f27c60676c0 1 Cannot find zone
id=8c75360a-c0cf-4772-b85e-ff74726396c2 (name=backup), switching to local
zonegroup configuration
realm fec73799-36be-4418-abb2-9804cc83d83d (big-realm)
zonegroup fc61ac2f-dc1d-421b-90af-ffe9113b9935 (default)
zone 8c75360a-c0cf-4772-b85e-ff74726396c2 (backup)
metadata sync failed to read sync status: (2) No such file or directory
data sync source: 5fba7cae-47f1-4c8e-9a34-1b499c9c27f8 (default)
syncing
full sync: 0/128 shards
incremental sync: 128/128 shards
data is caught up with source
My source cluster's version is 13.2.8 and the destination cluster's is 14.2.8.
I also tried syncing between two 13.2.8 clusters and got the same
error.
Is there any step I got wrong, or is it that the default zone cannot be synced?
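When the commit fails with an opaque code like 2202, comparing what each side believes the current period and zonegroup are will sometimes expose the mismatch; a sketch to run on both clusters (endpoints and credentials taken from the steps above):

```shell
# Current period and zonegroup membership, per cluster
radosgw-admin period get
radosgw-admin zonegroup get --rgw-zonegroup=default

# Verify the master endpoint answers from the secondary's network
curl -s http://172.24.29.26:7480

# After any fix, pull the period again before retrying the commit
radosgw-admin period pull --url=http://172.24.29.26:7480 \
    --access-key=redhat --secret=redhat
```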
Thanks
Hi,
I am using a Ceph Nautilus cluster with the configuration below:
3 nodes (Ubuntu 18.04), each with 12 OSDs; the MDS, MON and MGR daemons
run co-located on the same nodes.
The client mounts the filesystem through the CephFS kernel client.
I was trying to emulate a node failure while reads and writes were going on
against a replica-2 pool.
I was expecting reads and writes to continue after a small pause due to the node
failure, but I/O halts and never resumes until the failed node is back up.
I remember testing the same scenario on Ceph Mimic, where I/O
continued after a small pause.
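One setting worth checking for this behaviour: with size=2 and min_size=2, a pool blocks I/O as soon as one replica is unavailable and resumes only when it returns. A hedged sketch (the pool name is a placeholder; lowering min_size to 1 trades away safety and is generally only advisable as a temporary measure):

```shell
# Current replication settings for the pool
ceph osd pool get mypool size
ceph osd pool get mypool min_size

# With size 2, min_size 1 lets I/O continue on a single surviving copy
ceph osd pool set mypool min_size 1
```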
regards
Amudhan P
Hi,
I have the following Ceph Mimic setup :
- a bunch of old servers with 3-4 SATA drives each (74 OSDs in total)
- index/leveldb is stored on each OSD (so no SSD drives, just SATA)
- the current usage is :
GLOBAL:
    SIZE     AVAIL    RAW USED  %RAW USED
    542 TiB  105 TiB  437 TiB   80.67
POOLS:
    NAME                        ID  USED     %USED  MAX AVAIL  OBJECTS
    .rgw.root                   1   1.1 KiB  0      26 TiB     4
    default.rgw.control         2   0 B      0      26 TiB     8
    default.rgw.meta            3   20 MiB   0      26 TiB     75357
    default.rgw.log             4   0 B      0      26 TiB     4271
    default.rgw.buckets.data    5   290 TiB  85.05  51 TiB     78067284
    default.rgw.buckets.non-ec  6   0 B      0      26 TiB     0
    default.rgw.buckets.index   7   0 B      0      26 TiB     603008
- rgw_override_bucket_index_max_shards = 16. Clients are accessing RGW
via Swift, not S3.
- the replication schema is EC 4+2.
We are using this Ceph cluster as secondary storage for another,
more expensive storage infrastructure, and we are offloading
cold data to it (big files with a low number of downloads/reads from our
customers). This way we can lower the TCO. So most of the files are big
(a few GB at least).
So far Ceph is doing well, considering that I don't have big
expectations from the current hardware. I'm a bit worried, however, that we
have 78M objects with max_shards=16 and we will probably reach 100M in
the next few months. Do I need to increase the max shards to ensure the
stability of the cluster? I read that storing more than 1M objects
in a single bucket can lead to OSDs flapping or hitting I/O timeouts
during deep-scrub, or even to OSD failures due to leveldb
compacting all the time if there is a large number of DELETEs.
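On the sharding question, RGW can report how close each bucket is to the recommended objects-per-shard limit, and buckets can be resharded in Mimic; a hedged sketch (bucket name and shard count are placeholders):

```shell
# Per-bucket warn/over status based on objects per shard
radosgw-admin bucket limit check

# Object count and current shard layout for one bucket
radosgw-admin bucket stats --bucket=mybucket

# Reshard a bucket (docs suggest sizing for roughly 100k objects per shard)
radosgw-admin bucket reshard --bucket=mybucket --num-shards=101
```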
Any advice would be appreciated.
Thank you,
Adrian Nicolae
Hi
We have some clusters which are RBD-only. Each time someone uses
radosgw-admin by mistake on those clusters, the rgw pools are auto-created.
Is there a way to disable that? I mean this part of the documentation:
"When radosgw first tries to operate on a zone pool that does not exist, it
will create that pool with the default values from osd pool default pg num
and osd pool default pgp num"
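I'm not aware of a per-zone switch for this; one hedged workaround, on releases recent enough to have the option (it was added in newer Ceph versions, so this is an assumption for older clusters), is to disable pool creation cluster-wide and only lift it when pools legitimately need creating:

```shell
# Block all pool creation (assumption: mon_allow_pool_creation is
# available on your release)
ceph config set global mon_allow_pool_creation false

# Temporarily re-enable when a pool really needs to be created
ceph config set global mon_allow_pool_creation true
```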
Thanks,
Kate