Hello,
Looks like the original chain got deleted, but thank you to everyone who responded. Just to keep any newcomers in the loop, I have pasted the original posting below. To all the original contributors to this chain: I feel much more confident in my design theory for the storage nodes. However, I wanted to narrow the focus and see if I can get elaborated comments on the two topics below.
Does anyone have any real-world data on metrics I can use to size MONs?
When are they active?
When do they utilize CPU, RAM, and storage (i.e., do larger storage pools require more resources, are resources used during recovery, etc.)?
For anyone who commented or has opinions on storage node sizing:
How does choosing EC vs. 3x replication affect your sizing of CPU / RAM?
Is there some kind of overhead generalization I can use if assuming EC (i.e., add an extra core per OSD)? I understand that recoveries are where this matters most, so I am looking for sizing metrics based on living through worst-case scenarios.
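For reference on what drives that overhead: the per-write CPU cost under EC scales roughly with the k+m chunk count of the pool's erasure-code profile, while 3x replication just ships three full copies. A minimal sketch of the profile and pool involved (the names and k/m values are assumptions for illustration, not a recommendation):

    ceph osd erasure-code-profile set ec42 k=4 m=2 crush-failure-domain=host
    ceph osd pool create ecpool 1024 1024 erasure ec42

Recovery is where the difference bites hardest: a replicated OSD simply copies objects back, whereas an EC OSD must read k shards and recompute each missing chunk.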
---------------------------
ORIGINAL POSTING:
Hey Folks,
This is my first ever post here in the Ceph user group, and I will preface it with
the fact that I know a lot of this is asked frequently. Unlike what I assume to be a
large majority of Ceph “users” in this forum, I am more of a Ceph “distributor.” My
interests lie in how to build a Ceph environment to best fill an organization’s needs. I am
here for the real-world experience and expertise so that I can learn to build Ceph
“right.” I have spent the last couple of years collecting data on general “best practices”
through forum posts, the Ceph documentation, Cephalocon, etc. I wanted to post my findings
to the forum to see where I can harden my stance.
Below are two example designs that I might currently use when architecting a solution. I
have specific questions about design elements in each, and I would like you to tell me
whether they hold water. I want to focus on the hardware, so I am asking for
generalizations where possible. Let’s assume in all scenarios that we are using Luminous
and that the data type is mixed use.
I am not expecting anyone to run through every question, so please feel free to comment on
any piece you can. Tell me what is overkill and what is lacking!
Example 1:
8x 60-Bay (8TB) Storage nodes (480x 8TB SAS Drives)
Storage Node Spec:
2x 32C 2.9GHz AMD EPYC
- Documentation mentions 0.5 cores per OSD for throughput-optimized builds. Are they talking about 0.5 physical cores or 0.5 logical cores?
- Is it better to pick my processors based on a total GHz measurement, like 2GHz per OSD?
- Would a theoretical 8C at 2GHz serve the same number of OSDs as a 16C at 1GHz? Would threads be included in this calculation?
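For what it's worth, here is the arithmetic under the assumption (mine, not the documentation's) that the guideline means physical cores: 60 OSDs x 0.5 cores = 30 cores per node, against 2x 32C = 64 physical cores (128 threads), so this spec carries roughly 2x headroom before SMT is even counted.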
512GB Memory
- I know this is the hot topic because of its role in recoveries. Basically, I am looking for the most generalized practice I can use as a safe number, plus a metric I can use as a nice-to-have.
- Is it 1GB of RAM per TB of raw OSD capacity?
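One hedged data point: on releases after Luminous (Mimic onward, if I remember correctly), BlueStore aims at a fixed per-daemon memory target via osd_memory_target (4GiB is the shipped default) rather than scaling strictly with raw TB; the 1GB-per-TB rule then becomes more of a floor for recovery headroom. A sketch, where the 6GiB figure is purely an assumption for this 60-OSD / 512GB box:

    # ask each OSD daemon to stay near 6 GiB; BlueStore shrinks its caches to fit
    ceph config set osd osd_memory_target 6442450944

On Luminous itself the equivalent knobs are the bluestore_cache_size* options in ceph.conf.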
2x 3.2TB NVMe WAL/DB / Log Drives
- Another hot topic that I am sure will bring many “it depends.” All I am looking for is experience on this. I know people have mentioned having at least 70GB of flash for WAL/DB / logs.
- Can I use 70GB as a flat calculation per OSD, or does it depend on the size of the OSD?
- I know more is better, but what is a number I can use to get started with minimal
issues?
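For what it's worth, the arithmetic on this spec: 2x 3.2TB of NVMe across 60 OSDs is roughly 106GB of DB space per OSD, comfortably above the 70GB floor mentioned above. If each OSD gets its own slice of the DB device, a hedged provisioning sketch looks like this (device paths are placeholders):

    # create a BlueStore OSD with data on HDD and its DB/WAL on an NVMe slice
    ceph-volume lvm create --bluestore --data /dev/sdX --block.db /dev/nvme0n1pY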
2x 56Gbit Links
- I think this should be enough given the rule of thumb of 10Gbit for every 12 OSDs.
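Spelling that rule out for this node (plain arithmetic, not a benchmark): 60 OSDs / 12 OSDs per 10Gbit = 50Gbit of nominal demand, so one 56Gbit link already covers it and the pair adds failover headroom. Note that a separate cluster/replication network, if used, roughly doubles the requirement during recovery.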
3x MON Node
MON Node Spec:
1x 8C 3.2GHz AMD EPYC
- I can’t really find good practices around when to increase your core count. Any
suggestions?
128GB Memory
- What do I need memory for in a MON?
- When do I need to expand?
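One concrete thing to watch, offered as my understanding rather than an official sizing rule: the MON's RocksDB store grows whenever PGs are not active+clean, because old maps cannot be trimmed, and that is exactly when MON memory and disk pressure appears. A quick hedged check (the monitor id in the path is a placeholder; it is often the short hostname):

    # size of the monitor's store.db; growth during recovery is expected
    du -sh /var/lib/ceph/mon/ceph-$(hostname -s)/store.db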
2x 480GB Boot SSDs
- Any reason to look more closely into the sizing of these drives?
2x 25Gbit Uplinks
- Should these match the output of the storage nodes for any reason?
Example 2:
8x 12-Bay NVMe Storage nodes (96x 1.6TB NVMe Drives)
Storage Node Spec:
2x 32C 2.9GHz AMD EPYC
- I have read that each NVMe OSD should have 10 cores. I am not splitting physical drives into multiple OSDs, so let’s assume I have 12 OSDs per node.
- Would threads count toward my 10 core quota or just physical cores?
- Can I do a similar calculation as I mentioned before and just use 20GHz per OSD
instead of focusing on cores specifically?
512GB Memory
- I assume there is some reason I can’t use the same methodology of 1GB per TB of OSD, since this is NVMe storage.
2x 100Gbit Links
- This is assuming about 1 gigabyte per second of real-world speed per disk.
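Checking that assumption with plain arithmetic: 12 drives x 1GB/s = 12GB/s ≈ 96Gbit/s per node, so a single 100Gbit link is saturated at almost exactly the drives' aggregate speed; the second link is what keeps public plus cluster traffic (or a link failure) from becoming the bottleneck.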
3x MON Node – What differences should MONs serving NVMe have compared to large NLSAS
pools?
MON Node Spec:
1x 8C 3.2GHz AMD Epyc
128GB Memory
2x 480GB Boot SSDs
2x 25Gbit Uplinks
Dear Cephers,
A few days ago disaster struck the erasure-coded Ceph cluster I am administrating: the UPS power was pulled from the cluster, causing a power outage.
After rebooting the system, 6 OSDs were lost (spread over 5 OSD nodes) as they could not be mounted anymore, and several others were damaged. This was more than the host failure domain was set up to handle; auto-recovery failed and OSDs started going down in a cascading manner.
When the dust settled, there were 8 PGs (of 2048) inactive and a bunch of OSDs down. I managed to recover 5 PGs, mainly via ceph-objectstore-tool export/import/repair commands, but now I am left with 3 PGs that are inactive and incomplete.
One of the PGs seems unsalvageable, as I cannot get it to become active at all (repair/import/export/lowering min_size), but the other two I can get active if I export/import one of the PG shards and restart the OSD.
Rebuilding then starts, but after a while one of the OSDs holding the PGs goes down, with a "FAILED ceph_assert(clone_size.count(clone))" message in the log.
If I set the OSDs to noout/nodown, I can see that only rather few objects, e.g. 161 out of a PG of >100000, fail to be remapped.
Since most of the objects in the two PGs seem intact, it would be sad to delete the whole PG (force-create-pg) and lose all that data.
Is there a way to show and delete the failing objects?
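A hedged sketch of what "show and delete" might look like with ceph-objectstore-tool, run with the affected OSD stopped (the OSD id, PG id, and object JSON below are placeholders):

    # list the objects held by the PG shard, one JSON entry per object
    ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-12 --pgid 4.2a --op list > pg-objects.txt
    # then remove a specific damaged object, quoting its JSON from the list output
    ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-12 --pgid 4.2a '<object-json>' remove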
I have thought of a recovery plan and want to share it with you, so you can comment on whether it sounds doable or not:
* Stop osds from recovering: ceph osd set norecover
* bring back pgs active: ceph-objectstore-tool export/import and restart osd
* find files in pgs: cephfs-data-scan pg_files <path> <pg id>
* pull out as many of those files as possible to another location.
* recreate pgs: ceph osd force-create-pg <pgid>
* restart recovery: ceph osd unset norecover
* copy back in the recovered files
Would that work or do you have a better suggestion?
Cheers,
Jesper
--------------------------
Jesper Lykkegaard Karlsen
Scientific Computing
Centre for Structural Biology
Department of Molecular Biology and Genetics
Aarhus University
Gustav Wieds Vej 10
8000 Aarhus C
E-mail: jelka(a)mbg.au.dk
Tlf: +45 50906203
Hello List,
I'm trying to create an (S3) bucket notification into RabbitMQ via
AMQP, on Ceph v15.2.1 Octopus, using the official .deb packages on
Debian Buster.
I've created the following topic (directly via S3, not via pubsub REST API):
<ListTopicsResponse xmlns="https://sns.amazonaws.com/doc/2010-03-31/">
<ListTopicsResult>
<Topics>
<member>
<User>testuser</User>
<Name>testtopic</Name>
<EndPoint>
<EndpointAddress>amqp://rabbitmquser:rabbitmqpass@rabbitmq.example.com:5672</EndpointAddress>
<EndpointArgs>Attributes.entry.1.key=amqp-exchange&Attributes.entry.1.value=amqp.direct&push-endpoint=amqp://rabbitmquser:rabbitmqpass@rabbitmq.example.com:5672</EndpointArgs>
<EndpointTopic>testtopic</EndpointTopic>
</EndPoint>
<TopicArn>arn:aws:sns:de::testtopic</TopicArn>
<OpaqueData></OpaqueData>
</member>
</Topics>
</ListTopicsResult>
...
</ListTopicsResponse>
Then I've created the following bucket notification:
<NotificationConfiguration>
<TopicConfiguration>
<Id>notify-psapp</Id>
<Topic>arn:aws:sns:de::testtopic</Topic>
<Event>s3:ObjectCreated:*</Event>
<Event>s3:ObjectRemoved:*</Event>
</TopicConfiguration>
</NotificationConfiguration>
When I upload a file into the bucket, the event itself seems to get
fired, but radosgw keeps telling me that amqp-exchange is not set:
2020-04-20T12:24:29.935+0200 7ff01c5d3700 1 ====== starting new
request req=0x7ff01c5cad50 =====
2020-04-20T12:24:30.019+0200 7ff01c5d3700 1 ERROR: failed to
create push endpoint:
amqp://rabbitmquser:rabbitmqpass@rabbitmq.example.com:5672 due to:
pubsub endpoint configuration error: AMQP: missing amqp-exchange
But it's there in the EndpointArgs, right?
Or am I missing it somewhere else?
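For comparison, a hedged sketch of how the upstream examples pass these settings, namely as topic attributes at creation time via an SNS-style call against the RGW endpoint (the URL and credentials are placeholders, and this is a pattern to compare against, not a verified fix):

    aws --endpoint-url http://rgw.example.com:8000 sns create-topic --name testtopic \
      --attributes='{"push-endpoint": "amqp://rabbitmquser:rabbitmqpass@rabbitmq.example.com:5672", "amqp-exchange": "amqp.direct", "amqp-ack-level": "broker"}'

It may also be worth checking whether the '&' separators in the hand-built EndpointArgs string survived without being escaped.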
Best Regards,
Andreas
Hello list,
I somehow have this "mgr.cph02 ceph02 stopped" line here.
root@ceph01:~# ceph orch ps
NAME        HOST    STATUS        REFRESHED  AGE  VERSION    IMAGE NAME               IMAGE ID      CONTAINER ID
mgr.ceph02  ceph02  running (2w)  2w ago     -    15.2.0     docker.io/ceph/ceph:v15  204a01f9b0b6  4e349a382c6b
mgr.ceph03  ceph03  running (2w)  2w ago     -    15.2.0     docker.io/ceph/ceph:v15  204a01f9b0b6  2a9a037e5e2d
mgr.cph02   ceph02  stopped       2w ago     -    <unknown>  <unknown>                <unknown>     <unknown>
mon.ceph02  ceph02  running (2w)  2w ago     -    15.2.0     docker.io/ceph/ceph:v15  204a01f9b0b6  cf66ca51c0dd
mon.ceph03  ceph03  running (2w)  2w ago     -    15.2.0     docker.io/ceph/ceph:v15  204a01f9b0b6  fceaaa03b41f
I actually can't remember how I did that. How can I remove that wrong
"mgr.cph02" entry?
Thanks,
Michael
Fixed
On 22/04/20 6:57 pm, Bobby wrote:
>
> Thanks! When will it be back?
>
> On Wed, Apr 22, 2020 at 3:03 PM ulrich.weigand(a)de.ibm.com wrote:
>
> Hello,
>
> trying to access the documentation on docs.ceph.com now results in
> an error: The certificate expired on April 22, 2020, 8:46 AM.
>
> Bye,
> Ulrich
Hi there,
I've a bunch of hosts where I migrated HDD-only OSDs to hybrid ones using:
sudo -E -u ceph -- bash -c 'ceph-bluestore-tool --path
/var/lib/ceph/osd/ceph-${OSD} bluefs-bdev-new-db --dev-target
/dev/bluefs_db1/db-osd${OSD}'
While this worked fine and each OSD ran fine afterwards, each OSD
loses its block.db symlink after reboot.
If I manually recreate the block.db symlink inside:
/var/lib/ceph/osd/ceph-*
all OSDs start fine. Can anybody tell me who creates those symlinks, and why
they're not created automatically in the case of a migrated DB?
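A hedged guess at the mechanism: on boot, ceph-volume rebuilds /var/lib/ceph/osd/ceph-* (including the block.db symlink) from the LVM tags on the OSD's logical volume, and bluefs-bdev-new-db does not add the DB tags. Something along these lines might persist the change (the tag names follow ceph-volume's ceph.* scheme, the LV paths and UUID are placeholders, and lvs -o lv_tags on a natively hybrid OSD is the safest reference):

    # tag the OSD's data LV so ceph-volume activation knows about the new DB device
    lvchange --addtag ceph.db_device=/dev/bluefs_db1/db-osd${OSD} /dev/<osd-vg>/<osd-lv>
    lvchange --addtag ceph.db_uuid=<uuid-of-the-db-lv> /dev/<osd-vg>/<osd-lv>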
Greets,
Stefan
Hello to all confined people (and the others too)!
On one of my Ceph clusters (Nautilus 14.2.3), I previously set up 3 MDS
daemons in an active/standby-replay/standby configuration.
For design reasons, I would like to replace this configuration with an
active/active/standby one, i.e. replace the standby-replay daemon with an active one.
I didn't find any clear procedure for this operation, and my
question is whether I can add an active rank directly or whether I have to
unset the standby-replay status first.
I was thinking of the second option, that is:
$ sudo ceph fs set my_fs allow_standby_replay false
$ sudo ceph fs set my_fs max_mds 2
Is that the correct way?
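If that is indeed the right order, a quick hedged way to verify afterwards (assuming the filesystem is really named my_fs):

    # should show ranks 0 and 1 active plus one standby, with no standby-replay
    ceph fs status my_fs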
Thanks in advance,
Hervé
Hi,
I upgraded a Luminous cluster to Nautilus and migrated the Filestore OSDs to
BlueStore using the ceph-ansible playbook.
I have migrated 6 OSDs so far, and for the last 3 days the progress
section of ceph status has been stuck as below.
Can anyone please help me check what is going wrong?
progress:
> Rebalancing after osd.6 marked out
> [========================......]
>
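One hedged thing to try: the progress entries live in the active mgr, so failing over to a standby mgr makes the module recompute them, which may clear an entry that is stuck (the daemon name is a placeholder; take it from ceph -s or ceph mgr dump):

    # hand the mgr role to a standby; the progress list is rebuilt
    ceph mgr fail <active-mgr-name>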
Regards,
Vasishta Shastry