Hello everyone
I've got a fresh Ceph Octopus installation and I'm trying to set up a CephFS with an erasure-coded data pool.
The metadata pool was set up as default.
The erasure code pool was set up with this command:
-> ceph osd pool create ec-data_fs 128 erasure default
Enabled overwrites:
-> ceph osd pool set ec-data_fs allow_ec_overwrites true
And created the fs:
-> ceph fs new ec-data_fs meta_fs ec-data_fs --force
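As an aside, the --force is needed there because an EC pool as the default data pool is discouraged; the layout the CephFS docs usually suggest keeps a small replicated default data pool and attaches the EC pool as an additional data pool. A rough sketch (pool names and the mount path are placeholders, not from the setup above):

```shell
# Small replicated pool used as the fs default data pool
ceph osd pool create cephfs-data-default 64

# Create the fs on the replicated pool, then attach the EC pool
ceph fs new ec-data_fs meta_fs cephfs-data-default
ceph fs add_data_pool ec-data_fs ec-data_fs

# Direct files under a directory to the EC pool via a file layout
setfattr -n ceph.dir.layout.pool -v ec-data_fs /mnt/cephfs/ec-dir
```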
Then I tried deploying the mds, but this fails:
-> ceph orch daemon add mds ec-data_fs magma01
returns:
-> Deployed mds.ec-data_fs.magma01.ujpcly on host 'magma01'
The mds daemon is not there.
Apparently the container dies without leaving any information, as seen in the journal:
May 25 16:11:56 magma01 podman[9348]: 2020-05-25 16:11:56.670510456 +0200 CEST m=+0.186462913 container create 0fdf8c508b330adac713ffb04c72b5df770277ad191d844888f7387f28e3cc90 (image=docker.io/ceph/ceph:v15, name=competent_cori)
May 25 16:11:56 magma01 systemd[1]: Started libpod-conmon-0fdf8c508b330adac713ffb04c72b5df770277ad191d844888f7387f28e3cc90.scope.
May 25 16:11:56 magma01 systemd[1]: Started libcontainer container 0fdf8c508b330adac713ffb04c72b5df770277ad191d844888f7387f28e3cc90.
May 25 16:11:57 magma01 podman[9348]: 2020-05-25 16:11:57.112182262 +0200 CEST m=+0.628134873 container init 0fdf8c508b330adac713ffb04c72b5df770277ad191d844888f7387f28e3cc90 (image=docker.io/ceph/ceph:v15, name=competent_cori)
May 25 16:11:57 magma01 podman[9348]: 2020-05-25 16:11:57.137011897 +0200 CEST m=+0.652964354 container start 0fdf8c508b330adac713ffb04c72b5df770277ad191d844888f7387f28e3cc90 (image=docker.io/ceph/ceph:v15, name=competent_cori)
May 25 16:11:57 magma01 podman[9348]: 2020-05-25 16:11:57.137110412 +0200 CEST m=+0.653062853 container attach 0fdf8c508b330adac713ffb04c72b5df770277ad191d844888f7387f28e3cc90 (image=docker.io/ceph/ceph:v15, name=competent_cori)
May 25 16:11:57 magma01 systemd[1]: libpod-0fdf8c508b330adac713ffb04c72b5df770277ad191d844888f7387f28e3cc90.scope: Consumed 327ms CPU time
May 25 16:11:57 magma01 podman[9348]: 2020-05-25 16:11:57.182968802 +0200 CEST m=+0.698921275 container died 0fdf8c508b330adac713ffb04c72b5df770277ad191d844888f7387f28e3cc90 (image=docker.io/ceph/ceph:v15, name=competent_cori)
May 25 16:11:57 magma01 podman[9348]: 2020-05-25 16:11:57.413743787 +0200 CEST m=+0.929696266 container remove 0fdf8c508b330adac713ffb04c72b5df770277ad191d844888f7387f28e3cc90 (image=docker.io/ceph/ceph:v15, name=competent_cori)
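The podman lines above only trace the container lifecycle. To see the daemon's own output, something along these lines may help (daemon name taken from the deploy output above; the fsid is a placeholder):

```shell
# Ask cephadm for the daemon's journald-backed logs
cephadm logs --name mds.ec-data_fs.magma01.ujpcly

# Or query journald directly for the unit cephadm created
journalctl -u ceph-<fsid>@mds.ec-data_fs.magma01.ujpcly.service

# What the orchestrator itself thinks happened
ceph orch ps
ceph log last cephadm
```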
Can someone help me debug this?
Cheers
Simon
Hi,
Following on from various woes, we see an odd and unhelpful behaviour with some OSDs on our cluster currently.
A minority of OSDs seem to have runaway memory usage, rising to tens of GB, whilst other OSDs on the same host behave sensibly. As far as we can tell, this started when we moved from Mimic to Nautilus.
In the best case, this causes some nodes to start swapping (which reduces their performance); in the worst case, it triggers the OOM killer.
I have dumped the mempool for these OSDs, which shows that almost all the memory is in the buffer_anon pool.
The perf dump shows that the OSD is targeting the 4 GB limit that's set for it, but for some reason it cannot get down to it because of data tracked by the priority cache (which seems to be mostly what is filling buffer_anon).
Can anyone advise on what we should do next?
(mempool dump and excerpt of perf dump at end of email).
Thanks for any help,
Sam Skipsey
MEMPOOL DUMP
{
    "mempool": {
        "by_pool": {
            "bloom_filter": {
                "items": 0,
                "bytes": 0
            },
            "bluestore_alloc": {
                "items": 5629372,
                "bytes": 45034976
            },
            "bluestore_cache_data": {
                "items": 127,
                "bytes": 65675264
            },
            "bluestore_cache_onode": {
                "items": 8275,
                "bytes": 4634000
            },
            "bluestore_cache_other": {
                "items": 2967913,
                "bytes": 62469216
            },
            "bluestore_fsck": {
                "items": 0,
                "bytes": 0
            },
            "bluestore_txc": {
                "items": 145,
                "bytes": 100920
            },
            "bluestore_writing_deferred": {
                "items": 335,
                "bytes": 13160884
            },
            "bluestore_writing": {
                "items": 1406,
                "bytes": 5379120
            },
            "bluefs": {
                "items": 1105,
                "bytes": 24376
            },
            "buffer_anon": {
                "items": 13705143,
                "bytes": 40719040439
            },
            "buffer_meta": {
                "items": 6820143,
                "bytes": 600172584
            },
            "osd": {
                "items": 96,
                "bytes": 1138176
            },
            "osd_mapbl": {
                "items": 59,
                "bytes": 7022524
            },
            "osd_pglog": {
                "items": 491049,
                "bytes": 156701043
            },
            "osdmap": {
                "items": 107885,
                "bytes": 1723616
            },
            "osdmap_mapping": {
                "items": 0,
                "bytes": 0
            },
            "pgmap": {
                "items": 0,
                "bytes": 0
            },
            "mds_co": {
                "items": 0,
                "bytes": 0
            },
            "unittest_1": {
                "items": 0,
                "bytes": 0
            },
            "unittest_2": {
                "items": 0,
                "bytes": 0
            }
        },
        "total": {
            "items": 29733053,
            "bytes": 41682277138
        }
    }
}
PERF DUMP excerpt:
"prioritycache": {
"target_bytes": 4294967296,
"mapped_bytes": 38466584576,
"unmapped_bytes": 425984,
"heap_bytes": 38467010560,
"cache_bytes": 134217728
},
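For reference, the numbers above can be pulled live from the admin socket; a hedged sketch (the osd id is a placeholder, and jq is only used for readability):

```shell
# Per-pool allocations inside the OSD process
ceph daemon osd.0 dump_mempools

# Priority-cache view: target vs mapped heap
ceph daemon osd.0 perf dump | jq '.prioritycache'

# Confirm the memory target the OSD is actually running with
ceph config get osd.0 osd_memory_target

# tcmalloc can hold on to freed memory; ask it to return unmapped pages
ceph tell osd.0 heap release
```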
Folks,
I am running into a very strange issue with a brand new Ceph cluster during initial testing. The cluster
consists of 12 nodes; 4 of them have SSDs only, the other eight have a mixture of SSDs and HDDs.
The latter nodes are configured so that three or four HDDs share one SSD for their block DB.
The Ceph version is Nautilus.
When writing to the cluster, clients will, at regular intervals, run into I/O stalls (i.e. writes can take up
to 25 minutes to complete). Deleting RBD images often takes forever as well. After several weeks
of debugging, what I can say from looking at the log files is that what appears to take a lot of time
is writing to the OSDs:
"time": "2020-05-20 10:52:23.211006",
"event": "reached_pg"
},
{
"time": "2020-05-20 10:52:23.211047",
"event": "waiting for ondisk"
},
{
"time": "2020-05-20 10:53:35.369081",
"event": "done"
}
But these machines are idling on I/O: according to sysstat there is almost no I/O happening at all.
I am slowly growing a bit desperate over this, so I wonder whether anybody has ever
seen a similar issue? Or are there any tips on where to carry on with debugging?
Servers are from Dell with PERC controllers in HBA mode.
The primary purpose of this Ceph cluster is to serve as backing storage for OpenStack, and to
this point, I was not able to reproduce the issue with the SSD-only nodes.
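When the stalls happen, the OSD-side view of the slow ops can narrow down which OSD and which event the time is being spent in; a sketch, assuming admin-socket access on the OSD hosts (the osd id is a placeholder):

```shell
# Ops currently stuck in an OSD, with their event timelines
ceph daemon osd.12 dump_ops_in_flight

# Recently completed slow ops (long "waiting for ondisk" gaps show up here)
ceph daemon osd.12 dump_historic_ops

# Cluster-wide: which OSDs are reporting slow requests
ceph health detail
```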
Best regards
Martin
Hi,
I am trying to set up a multisite configuration with 2 sites. I created the master zonegroup and zone by following the instructions in the documentation. On the secondary zone's cluster I was able to pull the master zone, and I created the secondary zone. When I try to commit the period, I get the following error:
2020-05-25 16:16:46.054 7f4ad25596c0 1 Cannot find zone id=2f272093-3712-45a7-8a63-b17f12ccd07c (name=testsite2), switching to local zonegroup configuration
Sending period to new master zone 6d8d5ffa-2034-4717-978e-3ab4ba4349c5
request failed: (5) Input/output error
failed to commit period: (5) Input/output error
Can someone please help me solve this issue?
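An EIO on period commit often points at the secondary not being able to reach the master zone's endpoint with valid system-user credentials, or at a stale local period. Some things to compare, as a hedged sketch (URL and credentials are placeholders):

```shell
# On the secondary: what the local period and zonegroup actually contain
radosgw-admin period get
radosgw-admin zonegroup get

# Verify the master endpoint answers from the secondary's network
curl -s http://master-host:8080

# Re-pull the current period from the master, then retry the commit
radosgw-admin period pull --url=http://master-host:8080 \
    --access-key=<key> --secret=<secret>
radosgw-admin period update --commit
```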
Regards,
Sailaja
Hi,
The object counts from "rados df" and "rados ls" differ
in my testing environment. I think they may be zero-byte or unclean
objects, since I removed all RBD images on top of the pool a few days ago.
How can I fix this, or find out where those ghost objects are? Or should
I ignore it, since the numbers are not that high?
$ rados -p rbd df
POOL_NAME  USED    OBJECTS  CLONES  COPIES   MISSING_ON_PRIMARY  UNFOUND  DEGRADED  RD_OPS     RD       WR_OPS    WR      USED COMPR  UNDER COMPR
rbd        18 MiB  430107   0       1290321  0                   0        0         141243877  6.9 TiB  42395431  11 TiB  0 B         0 B
$ rados -p rbd ls | wc -l
4
$ rados -p rbd ls
gateway.conf
rbd_directory
rbd_info
rbd_trash
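One thing worth ruling out: plain `rados ls` lists only the default namespace of the pool, while the object count in `rados df` covers all namespaces. A hedged sketch (the namespace name is a placeholder):

```shell
# List objects across all namespaces (each line is prefixed with its namespace)
rados -p rbd ls --all | wc -l

# Or inspect one specific namespace
rados -p rbd -N some-namespace ls
```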
Regs,
Icy
Hi, I am new to RGW and am trying to deploy a multisite configuration in order to sync
data from one cluster to another.
My source zone is the default zone in the default zonegroup, structured as
below:
realm: big-realm
|
zonegroup: default
/ \
master zone: default secondary zone: backup
*STEP*:
on source cluster:
1. radosgw-admin realm create --rgw-realm=big-realm --default
2. radosgw-admin zonegroup modify --rgw-realm big-realm --rgw-zonegroup
default --master --endpoints "http://172.24.29.26:7480"
3. radosgw-admin zone modify --rgw-zonegroup default --rgw-zone default
--master --endpoints "http://172.24.29.26:7480"
4. radosgw-admin user create --uid=sync-user
--display-name="Synchronization User" --access-key=redhat --secret=redhat
--system
5. radosgw-admin zone modify --rgw-zone=default --access-key=redhat
--secret=redhat
6. radosgw-admin period update --commit
on destination cluster:
1. radosgw-admin realm pull --url="http://172.24.29.26:7480"
--access-key=redhat --secret=redhat --rgw-realm=big-realm
2. radosgw-admin realm default --rgw-realm=big-realm
3. radosgw-admin period pull --url="http://172.24.29.26:7480"
--access-key=redhat --secret=redhat
4. radosgw-admin zonegroup default --rgw-zonegroup=default
5. radosgw-admin zone create --rgw-zonegroup=default --rgw-zone=backup
--endpoints="http://172.24.29.29:7480" --access-key=redhat --secret=redhat
--default
6. radosgw-admin period update --commit
Committing the period on the secondary zone gives this error:
2020-04-02 14:36:04.707 7fd8ee9376c0 1 Cannot find zone
id=8c75360a-c0cf-4772-b85e-ff74726396c2 (name=backup), switching to local
zonegroup configuration
Sending period to new master zone 5fba7cae-47f1-4c8e-9a34-1b499c9c27f8
request failed: (2202) Unknown error 2202
failed to commit period: (2202) Unknown error 2202
radosgw-admin sync status:
2020-04-02 14:37:18.330 7f27c60676c0 1 Cannot find zone
id=8c75360a-c0cf-4772-b85e-ff74726396c2 (name=backup), switching to local
zonegroup configuration
realm fec73799-36be-4418-abb2-9804cc83d83d (big-realm)
zonegroup fc61ac2f-dc1d-421b-90af-ffe9113b9935 (default)
zone 8c75360a-c0cf-4772-b85e-ff74726396c2 (backup)
metadata sync failed to read sync status: (2) No such file or directory
data sync source: 5fba7cae-47f1-4c8e-9a34-1b499c9c27f8 (default)
syncing
full sync: 0/128 shards
incremental sync: 128/128 shards
data is caught up with source
My source cluster's version is 13.2.8 and the destination cluster's is 14.2.8.
I also tried syncing between two 13.2.8 clusters and got the same
error.
Is there any step I got wrong, or is it that the default zone cannot be synced?
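When the commit fails with an opaque code like 2202, comparing what each side believes the current period and zonegroup are will sometimes expose the mismatch; a sketch to run on both clusters (endpoints and credentials taken from the steps above):

```shell
# Current period and zonegroup membership, per cluster
radosgw-admin period get
radosgw-admin zonegroup get --rgw-zonegroup=default

# Verify the master endpoint answers from the secondary's network
curl -s http://172.24.29.26:7480

# After any fix, pull the period again before retrying the commit
radosgw-admin period pull --url=http://172.24.29.26:7480 \
    --access-key=redhat --secret=redhat
```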
Thanks
Hi,
I am using a Ceph Nautilus cluster with the configuration below:
3 nodes (Ubuntu 18.04), each with 12 OSDs; the MDS, MON and MGR daemons
run co-located on the same nodes.
The client mounts the filesystem through the CephFS kernel client.
I was trying to emulate a node failure while reads and writes were going on
against a replica-2 pool.
I was expecting reads and writes to continue after a small pause due to the node
failure, but I/O halts and never resumes until the failed node is back up.
I remember testing the same scenario on Ceph Mimic, where I/O
continued after a small pause.
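One setting worth checking for this behaviour: with size=2 and min_size=2, a pool blocks I/O as soon as one replica is unavailable and resumes only when it returns. A hedged sketch (the pool name is a placeholder; lowering min_size to 1 trades away safety and is generally only advisable as a temporary measure):

```shell
# Current replication settings for the pool
ceph osd pool get mypool size
ceph osd pool get mypool min_size

# With size 2, min_size 1 lets I/O continue on a single surviving copy
ceph osd pool set mypool min_size 1
```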
regards
Amudhan P
Hi,
I have the following Ceph Mimic setup :
- a bunch of old servers with 3-4 SATA drives each (74 OSDs in total)
- index/leveldb is stored on each OSD (so no SSD drives, just SATA)
- the current usage is :
GLOBAL:
    SIZE     AVAIL    RAW USED  %RAW USED
    542 TiB  105 TiB  437 TiB   80.67
POOLS:
    NAME                        ID  USED     %USED  MAX AVAIL  OBJECTS
    .rgw.root                   1   1.1 KiB  0      26 TiB     4
    default.rgw.control         2   0 B      0      26 TiB     8
    default.rgw.meta            3   20 MiB   0      26 TiB     75357
    default.rgw.log             4   0 B      0      26 TiB     4271
    default.rgw.buckets.data    5   290 TiB  85.05  51 TiB     78067284
    default.rgw.buckets.non-ec  6   0 B      0      26 TiB     0
    default.rgw.buckets.index   7   0 B      0      26 TiB     603008
- rgw_override_bucket_index_max_shards = 16. Clients are accessing RGW
via Swift, not S3.
- the replication schema is EC 4+2.
We are using this Ceph cluster as secondary storage for another,
more expensive storage infrastructure, and we are offloading
cold data to it (big files with a low number of downloads/reads from our
customers). This way we can lower the TCO. So most of the files are big
(a few GB at least).
So far Ceph is doing well, considering that I don't have big
expectations from the current hardware. I'm a bit worried, however, that we
have 78M objects with max_shards=16 and we will probably reach 100M in
the next few months. Do I need to increase the max shards to ensure the
stability of the cluster? I read that storing more than 1M objects
in a single bucket can lead to OSDs flapping or hitting I/O timeouts
during deep-scrub, or even to OSD failures due to leveldb
compacting all the time if there is a large number of DELETEs.
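On the sharding question, RGW can report how close each bucket is to the recommended objects-per-shard limit, and buckets can be resharded in Mimic; a hedged sketch (bucket name and shard count are placeholders):

```shell
# Per-bucket warn/over status based on objects per shard
radosgw-admin bucket limit check

# Object count and current shard layout for one bucket
radosgw-admin bucket stats --bucket=mybucket

# Reshard a bucket (docs suggest sizing for roughly 100k objects per shard)
radosgw-admin bucket reshard --bucket=mybucket --num-shards=101
```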
Any advice would be appreciated.
Thank you,
Adrian Nicolae
Hi
We have some clusters which are RBD-only. Each time someone uses
radosgw-admin by mistake on those clusters, the rgw pools are auto-created.
Is there a way to disable that? I mean this part of the documentation:
"When radosgw first tries to operate on a zone pool that does not exist, it
will create that pool with the default values from osd pool default pg num
and osd pool default pgp num"
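I'm not aware of a per-zone switch for this; one hedged workaround, on releases recent enough to have the option (it was added in newer Ceph versions, so this is an assumption for older clusters), is to disable pool creation cluster-wide and only lift it when pools legitimately need creating:

```shell
# Block all pool creation (assumption: mon_allow_pool_creation is
# available on your release)
ceph config set global mon_allow_pool_creation false

# Temporarily re-enable when a pool really needs to be created
ceph config set global mon_allow_pool_creation true
```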
Thanks,
Kate