Hello,
Looks like the original chain got deleted, but thank you to everyone who responded. Just to keep any newcomers in the loop, I have pasted the original posting below. To all the original contributors to this chain: I feel much more confident in my design theory for the storage nodes. However, I wanted to narrow the focus and see if I can get elaborated comments on the two topics below.
Does anyone have any real-world data on metrics I can use to size MONs?
When are they active?
When do they utilize CPU, RAM, and storage (i.e., do larger storage pools require more resources, are resources used during recovery, etc.)?
For anyone who commented or has opinions on storage node sizing:
How does choosing EC vs. 3x replication affect your sizing of CPU / RAM?
Is there some kind of overhead generalization I can use if assuming EC (i.e., add an extra core per OSD)? I understand that recoveries are where this matters most, so I am looking for sizing metrics based on living through worst-case scenarios.
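For reference on what drives that overhead: the per-write CPU cost under EC scales roughly with the k+m chunk count of the pool's erasure-code profile, while 3x replication just ships three full copies. A minimal sketch of the profile and pool involved (the names and k/m values are assumptions for illustration, not a recommendation):

    ceph osd erasure-code-profile set ec42 k=4 m=2 crush-failure-domain=host
    ceph osd pool create ecpool 1024 1024 erasure ec42

Recovery is where the difference bites hardest: a replicated OSD simply copies objects back, whereas an EC OSD must read k shards and recompute each missing chunk.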
---------------------------
ORIGINAL POSTING:
Hey Folks,
This is my first ever post here in the Ceph user group, and I will preface it with
the fact that I know a lot of this is asked frequently. Unlike what I assume to be a
large majority of Ceph “users” in this forum, I am more of a Ceph “distributor.” My
interests lie in how to build a Ceph environment to best fill an organization’s needs. I am
here for the real-world experience and expertise so that I can learn to build Ceph
“right.” I have spent the last couple of years collecting data on general “best practices”
through forum posts, the Ceph documentation, Cephalocon, etc. I wanted to post my findings
to the forum to see where I can harden my stance.
Below are two example designs that I might currently use when architecting a solution. I
have specific questions about design elements in each, and I would like you to tell me
whether they hold water. I want to focus on the hardware, so I am asking for
generalizations where possible. Let’s assume in all scenarios that we are using Luminous
and that the data type is mixed use.
I am not expecting anyone to run through every question, so please feel free to comment on
any piece you can. Tell me what is overkill and what is lacking!
Example 1:
8x 60-Bay (8TB) Storage nodes (480x 8TB SAS Drives)
Storage Node Spec:
2x 32C 2.9GHz AMD EPYC
- Documentation mentions 0.5 cores per OSD for throughput-optimized builds. Are they talking about 0.5 physical cores or 0.5 logical cores?
- Is it better to pick my processors based on a total GHz measurement, like 2GHz per OSD?
- Would a theoretical 8C at 2GHz serve the same number of OSDs as a 16C at 1GHz? Would threads be included in this calculation?
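For what it's worth, here is the arithmetic under the assumption (mine, not the documentation's) that the guideline means physical cores: 60 OSDs x 0.5 cores = 30 cores per node, against 2x 32C = 64 physical cores (128 threads), so this spec carries roughly 2x headroom before SMT is even counted.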
512GB Memory
- I know this is the hot topic because of its role in recoveries. Basically, I am looking for the most generalized practice I can use as a safe number, plus a metric I can use as a nice-to-have.
- Is it 1GB of RAM per TB of raw OSD capacity?
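One hedged data point: on releases after Luminous (Mimic onward, if I remember correctly), BlueStore aims at a fixed per-daemon memory target via osd_memory_target (4GiB is the shipped default) rather than scaling strictly with raw TB; the 1GB-per-TB rule then becomes more of a floor for recovery headroom. A sketch, where the 6GiB figure is purely an assumption for this 60-OSD / 512GB box:

    # ask each OSD daemon to stay near 6 GiB; BlueStore shrinks its caches to fit
    ceph config set osd osd_memory_target 6442450944

On Luminous itself the equivalent knobs are the bluestore_cache_size* options in ceph.conf.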
2x 3.2TB NVMe WAL/DB / Log Drives
- Another hot topic that I am sure will bring many “it depends.” All I am looking for is experience on this. I know people have mentioned having at least 70GB of flash for WAL/DB / logs.
- Can I use 70GB as a flat calculation per OSD, or does it depend on the size of the OSD?
- I know more is better, but what is a number I can use to get started with minimal
issues?
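For what it's worth, the arithmetic on this spec: 2x 3.2TB of NVMe across 60 OSDs is roughly 106GB of DB space per OSD, comfortably above the 70GB floor mentioned above. If each OSD gets its own slice of the DB device, a hedged provisioning sketch looks like this (device paths are placeholders):

    # create a BlueStore OSD with data on HDD and its DB/WAL on an NVMe slice
    ceph-volume lvm create --bluestore --data /dev/sdX --block.db /dev/nvme0n1pY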
2x 56Gbit Links
- I think this should be enough given the rule of thumb of 10Gbit for every 12 OSDs.
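Spelling that rule out for this node (plain arithmetic, not a benchmark): 60 OSDs / 12 OSDs per 10Gbit = 50Gbit of nominal demand, so one 56Gbit link already covers it and the pair adds failover headroom. Note that a separate cluster/replication network, if used, roughly doubles the requirement during recovery.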
3x MON Node
MON Node Spec:
1x 8C 3.2GHz AMD EPYC
- I can’t really find good practices around when to increase your core count. Any
suggestions?
128GB Memory
- What do I need memory for in a MON?
- When do I need to expand?
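One concrete thing to watch, offered as my understanding rather than an official sizing rule: the MON's RocksDB store grows whenever PGs are not active+clean, because old maps cannot be trimmed, and that is exactly when MON memory and disk pressure appears. A quick hedged check (the monitor id in the path is a placeholder; it is often the short hostname):

    # size of the monitor's store.db; growth during recovery is expected
    du -sh /var/lib/ceph/mon/ceph-$(hostname -s)/store.db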
2x 480GB Boot SSDs
- Any reason to look more closely into the sizing of these drives?
2x 25Gbit Uplinks
- Should these match the output of the storage nodes for any reason?
Example 2:
8x 12-Bay NVMe Storage nodes (96x 1.6TB NVMe Drives)
Storage Node Spec:
2x 32C 2.9GHz AMD EPYC
- I have read that each NVMe OSD should have 10 cores. I am not splitting physical drives into multiple OSDs, so let’s assume I have 12 OSDs per node.
- Would threads count toward my 10 core quota or just physical cores?
- Can I do a similar calculation as I mentioned before and just use 20GHz per OSD
instead of focusing on cores specifically?
512GB Memory
- I assume there is some reason I can’t use the same methodology of 1GB per TB of OSD, since this is NVMe storage.
2x 100Gbit Links
- This is assuming about 1 gigabyte per second of real-world speed per disk.
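Checking that assumption with plain arithmetic: 12 drives x 1GB/s = 12GB/s ≈ 96Gbit/s per node, so a single 100Gbit link is saturated at almost exactly the drives' aggregate speed; the second link is what keeps public plus cluster traffic (or a link failure) from becoming the bottleneck.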
3x MON Node – What differences should MONs serving NVMe have compared to large NLSAS
pools?
MON Node Spec:
1x 8C 3.2GHz AMD Epyc
128GB Memory
2x 480GB Boot SSDs
2x 25Gbit Uplinks
Dear Cephers,
A few days ago disaster struck the erasure-coded Ceph cluster I am administrating: the UPS power was pulled from the cluster, causing a power outage.
After rebooting the system, 6 OSDs were lost (spread over 5 OSD nodes) as they could not be mounted anymore, and several others were damaged. This was more than the host failure domain was set up to handle; auto-recovery failed and OSDs started going down in a cascading manner.
When the dust settled, there were 8 PGs (of 2048) inactive and a bunch of OSDs down. I managed to recover 5 PGs, mainly via ceph-objectstore-tool export/import/repair commands, but now I am left with 3 PGs that are inactive and incomplete.
One of the PGs seems unsalvageable, as I cannot get it to become active at all (repair/import/export/lowering min_size), but the other two I can get active if I export/import one of the PG shards and restart the OSD.
Rebuilding then starts, but after a while one of the OSDs holding the PGs goes down, with a "FAILED ceph_assert(clone_size.count(clone))" message in the log.
If I set the OSDs to noout/nodown, I can see that only rather few objects, e.g. 161 out of a PG of >100000, fail to be remapped.
Since most of the objects in the two PGs seem intact, it would be sad to delete the whole PG (force-create-pg) and lose all that data.
Is there a way to show and delete the failing objects?
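A hedged sketch of what "show and delete" might look like with ceph-objectstore-tool, run with the affected OSD stopped (the OSD id, PG id, and object JSON below are placeholders):

    # list the objects held by the PG shard, one JSON entry per object
    ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-12 --pgid 4.2a --op list > pg-objects.txt
    # then remove a specific damaged object, quoting its JSON from the list output
    ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-12 --pgid 4.2a '<object-json>' remove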
I have thought of a recovery plan and want to share it with you, so you can comment on whether it sounds doable or not:
* Stop osds from recovering: ceph osd set norecover
* bring back pgs active: ceph-objectstore-tool export/import and restart osd
* find files in pgs: cephfs-data-scan pg_files <path> <pg id>
* pull out as many of those files as possible to another location.
* recreate pgs: ceph osd force-create-pg <pgid>
* restart recovery: ceph osd unset norecover
* copy back in the recovered files
Would that work or do you have a better suggestion?
Cheers,
Jesper
--------------------------
Jesper Lykkegaard Karlsen
Scientific Computing
Centre for Structural Biology
Department of Molecular Biology and Genetics
Aarhus University
Gustav Wieds Vej 10
8000 Aarhus C
E-mail: jelka(a)mbg.au.dk
Tlf: +45 50906203
Hello List,
I'm trying to create an (S3) bucket notification into RabbitMQ via
AMQP, on Ceph v15.2.1 Octopus, using the official .deb packages on
Debian Buster.
I've created the following topic (directly via S3, not via pubsub REST API):
<ListTopicsResponse xmlns="https://sns.amazonaws.com/doc/2010-03-31/">
<ListTopicsResult>
<Topics>
<member>
<User>testuser</User>
<Name>testtopic</Name>
<EndPoint>
<EndpointAddress>amqp://rabbitmquser:rabbitmqpass@rabbitmq.example.com:5672</EndpointAddress>
<EndpointArgs>Attributes.entry.1.key=amqp-exchange&Attributes.entry.1.value=amqp.direct&push-endpoint=amqp://rabbitmquser:rabbitmqpass@rabbitmq.example.com:5672</EndpointArgs>
<EndpointTopic>testtopic</EndpointTopic>
</EndPoint>
<TopicArn>arn:aws:sns:de::testtopic</TopicArn>
<OpaqueData></OpaqueData>
</member>
</Topics>
</ListTopicsResult>
...
</ListTopicsResponse>
Then I've created the following bucket notification:
<NotificationConfiguration>
<TopicConfiguration>
<Id>notify-psapp</Id>
<Topic>arn:aws:sns:de::testtopic</Topic>
<Event>s3:ObjectCreated:*</Event>
<Event>s3:ObjectRemoved:*</Event>
</TopicConfiguration>
</NotificationConfiguration>
When I upload a file into the bucket, the event itself seems to get
fired, but radosgw keeps telling me that amqp-exchange is not set:
2020-04-20T12:24:29.935+0200 7ff01c5d3700 1 ====== starting new
request req=0x7ff01c5cad50 =====
2020-04-20T12:24:30.019+0200 7ff01c5d3700 1 ERROR: failed to
create push endpoint:
amqp://rabbitmquser:rabbitmqpass@rabbitmq.example.com:5672 due to:
pubsub endpoint configuration error: AMQP: missing amqp-exchange
But it's there in the EndpointArgs, right?
Or am I missing it somewhere else?
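For comparison, a hedged sketch of how the upstream examples pass these settings, namely as topic attributes at creation time via an SNS-style call against the RGW endpoint (the URL and credentials are placeholders, and this is a pattern to compare against, not a verified fix):

    aws --endpoint-url http://rgw.example.com:8000 sns create-topic --name testtopic \
      --attributes='{"push-endpoint": "amqp://rabbitmquser:rabbitmqpass@rabbitmq.example.com:5672", "amqp-exchange": "amqp.direct", "amqp-ack-level": "broker"}'

It may also be worth checking whether the '&' separators in the hand-built EndpointArgs string survived without being escaped.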
Best Regards,
Andreas
Hello list,
I somehow have this "mgr.cph02 ceph02 stopped" line here.
root@ceph01:~# ceph orch ps
NAME        HOST    STATUS        REFRESHED  AGE  VERSION    IMAGE NAME               IMAGE ID      CONTAINER ID
mgr.ceph02  ceph02  running (2w)  2w ago     -    15.2.0     docker.io/ceph/ceph:v15  204a01f9b0b6  4e349a382c6b
mgr.ceph03  ceph03  running (2w)  2w ago     -    15.2.0     docker.io/ceph/ceph:v15  204a01f9b0b6  2a9a037e5e2d
mgr.cph02   ceph02  stopped       2w ago     -    <unknown>  <unknown>                <unknown>     <unknown>
mon.ceph02  ceph02  running (2w)  2w ago     -    15.2.0     docker.io/ceph/ceph:v15  204a01f9b0b6  cf66ca51c0dd
mon.ceph03  ceph03  running (2w)  2w ago     -    15.2.0     docker.io/ceph/ceph:v15  204a01f9b0b6  fceaaa03b41f
I actually can't remember how I did that. How can I remove that wrong
"mgr.cph02" entry?
Thanks,
Michael
Fixed
On 22/04/20 6:57 pm, Bobby wrote:
>
> Thanks! When will it be back?
>
> On Wed, Apr 22, 2020 at 3:03 PM ulrich.weigand(a)de.ibm.com wrote:
>
> Hello,
>
> trying to access the documentation on docs.ceph.com now results in
> an error: The certificate expired on April 22, 2020, 8:46 AM.
>
> Bye,
> Ulrich
Hi there,
I've a bunch of hosts where I migrated HDD-only OSDs to hybrid ones using:
sudo -E -u ceph -- bash -c 'ceph-bluestore-tool --path
/var/lib/ceph/osd/ceph-${OSD} bluefs-bdev-new-db --dev-target
/dev/bluefs_db1/db-osd${OSD}'
While this worked fine and each OSD ran fine afterwards, each OSD
loses its block.db symlink after reboot.
If I manually recreate the block.db symlink inside:
/var/lib/ceph/osd/ceph-*
all OSDs start fine. Can anybody tell me who creates those symlinks, and why
they're not created automatically in the case of a migrated DB?
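A hedged guess at the mechanism: on boot, ceph-volume rebuilds /var/lib/ceph/osd/ceph-* (including the block.db symlink) from the LVM tags on the OSD's logical volume, and bluefs-bdev-new-db does not add the DB tags. Something along these lines might persist the change (the tag names follow ceph-volume's ceph.* scheme, the LV paths and UUID are placeholders, and lvs -o lv_tags on a natively hybrid OSD is the safest reference):

    # tag the OSD's data LV so ceph-volume activation knows about the new DB device
    lvchange --addtag ceph.db_device=/dev/bluefs_db1/db-osd${OSD} /dev/<osd-vg>/<osd-lv>
    lvchange --addtag ceph.db_uuid=<uuid-of-the-db-lv> /dev/<osd-vg>/<osd-lv>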
Greets,
Stefan
Hello to all confined people (and the others too)!
On one of my Ceph clusters (Nautilus 14.2.3), I previously set up 3 MDS
daemons in an active/standby-replay/standby configuration.
For design reasons, I would like to replace this configuration with an
active/active/standby one, i.e. replace the standby-replay daemon with an active one.
I didn't find any clear procedure for this operation, and my
question is whether I can add an active rank directly or whether I have to
unset the standby-replay status first.
I was thinking of the second option, that is:
$ sudo ceph fs set my_fs allow_standby_replay false
$ sudo ceph fs set my_fs max_mds 2
Is that the correct way?
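If that is indeed the right order, a quick hedged way to verify afterwards (assuming the filesystem is really named my_fs):

    # should show ranks 0 and 1 active plus one standby, with no standby-replay
    ceph fs status my_fs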
Thanks in advance,
Hervé
Hi,
I upgraded a Luminous cluster to Nautilus and migrated the Filestore OSDs to
BlueStore using the ceph-ansible playbook.
I have migrated 6 OSDs so far, and for the last 3 days the progress
section of ceph status has been stuck as below.
Can anyone please help me check what is going wrong?
progress:
> Rebalancing after osd.6 marked out
> [========================......]
>
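One hedged thing to try: the progress entries live in the active mgr, so failing over to a standby mgr makes the module recompute them, which may clear an entry that is stuck (the daemon name is a placeholder; take it from ceph -s or ceph mgr dump):

    # hand the mgr role to a standby; the progress list is rebuilt
    ceph mgr fail <active-mgr-name>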
Regards,
Vasishta Shastry