Hi.
We need "bulk" storage, but with decent write latencies.
Normally we would do this with DAS and a RAID5 with a 2 GB battery-backed
write cache in front: as cheap as possible, but still getting the
scalability features of Ceph.
In our "first" Ceph cluster we did the same: just put BBWC in the OSD
nodes, and we're fine. But now we're onto the next one, and systems like
this:
https://www.supermicro.com/en/products/system/1U/6119/SSG-6119P-ACR12N4L.cfm
do not support such a RAID controller, yet are branded for "Ceph
Storage Solutions".
They do, however, have 4 NVMe slots in the front, so some level of
"tiering" using the NVMe drives seems to be what is suggested. But what
do people actually do? What is recommended? I see multiple options:
Ceph tiering at the pool layer:
https://docs.ceph.com/docs/master/rados/operations/cache-tiering/
And rumors that it is deprecated:
https://access.redhat.com/documentation/en-us/red_hat_ceph_storage/2.0/html…
Pro: Abstract layer
Con: Deprecated? - Lots of warnings?
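For reference, setting up such a tier involves roughly the following (pool names here are placeholders, not from our cluster):

```shell
# Sketch: attach a hypothetical NVMe pool "hot-nvme" as a writeback
# cache tier in front of a hypothetical HDD pool "cold-hdd".
ceph osd tier add cold-hdd hot-nvme
ceph osd tier cache-mode hot-nvme writeback
ceph osd tier set-overlay cold-hdd hot-nvme

# A hit-set configuration is required before the tier does anything useful.
ceph osd pool set hot-nvme hit_set_type bloom
```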
Offloading the block.db on NVMe / SSD:
https://docs.ceph.com/docs/mimic/rados/configuration/bluestore-config-ref/
Pro: Easy to deal with; seems well supported.
Con: As far as I can tell, this will only benefit the metadata of the
OSD, not the actual data. A data commit to the OSD will thus still be
dominated by the write latency of the underlying, very slow HDD.
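For completeness, placing block.db on NVMe is done at OSD creation time with ceph-volume; a sketch (device paths are examples only):

```shell
# Example device paths; adjust per node. This puts the RocksDB metadata
# on the NVMe partition while the actual object data stays on the HDD.
ceph-volume lvm create --bluestore \
    --data /dev/sdb \
    --block.db /dev/nvme0n1p1
```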
Bcache:
http://lists.ceph.com/pipermail/ceph-users-ceph.com/2018-June/027713.html
Pro: Closest to the BBWC mentioned above, but with way, way larger
cache sizes.
Con: It is hard to tell whether I would end up being the only one on
the planet using this solution.
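For the record, the bcache setup itself looks straightforward (bcache-tools; device paths are examples, and the cache-set UUID placeholder is left as-is):

```shell
# Create a cache device on the NVMe and a backing device on the HDD.
make-bcache -C /dev/nvme0n1p1
make-bcache -B /dev/sdb

# Attach the cache set to the new bcache device; the UUID comes from
# "bcache-super-show /dev/nvme0n1p1". The OSD is then built on /dev/bcache0.
echo <cset-uuid> > /sys/block/bcache0/bcache/attach
echo writeback   > /sys/block/bcache0/bcache/cache_mode
```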
Eat it: writes will be as slow as hitting dead rust, and anything that
cannot live with that needs to be entirely on SSD/NVMe.
Other?
Thanks for your input.
Jesper
Hi again. This appears, after all, to be an MTU issue:
Baseline:
1) Two of the nodes have a straight ethernet with 1500MTU, the third (problem) node is on a WAN tunnel with a restricted MTU. It appears that the MTUs were not set up correctly, so no surprise some software has problems.
2) I decided I knew Ceph well enough that I could handle recovery from disaster cases in Rook and it has advantages I can use. So please keep that in mind as I discuss this issue. (For those who aren’t familiar, Rook just orchestrates containers that are built by the Ceph team.)
3) In Rook, monitors run as pods under a CNI. The CNI adds additional overhead for transit, in my case a VxLAN overlay network. This overhead is apparently not enough to cause problems when running between nodes on a full 1500MTU local net. So the first two monitors come up cleanly.
After spending a lot of time looking at the logs, I could see the mon map of all three nodes properly distributed, but when it came to an election, all nodes knew the election epoch but the third was not joining. Comparing the logs of the second node as peon with the troubled third node on the other side of the restricted MTU, the difference appeared to be that the third node was not providing a feature proposal when in fact it probably was and it was being dropped. So the election would end without the third node being a part of the quorum. The third node stopped asking for a new election and that’s how things ended.
What I did this morning was figure out the MTU of the WAN tunnel and then change the entire CNI to that number. My expectation was that everything would start working and the necessary fragmentation would be generated by the client end of any connection.
Instead, the second node that was previously able to join as peon was no longer able to do so. It seems to follow that the smaller MTU (1340 to be exact) set on the overall CNI causes the elections to fail.
There are a number of things that I can do to improve the behavior of the cluster (such as PMTUD), but if Ceph is not going to work with a small MTU, all bets are off.
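For anyone who wants to check their own path: a don't-fragment ping sized to the tunnel MTU shows whether the path really carries full-size packets (the peer address is a placeholder):

```shell
# A 1340-byte path MTU leaves this much ICMP payload after the
# 20-byte IPv4 header and the 8-byte ICMP header:
mtu=1340
payload=$((mtu - 20 - 8))
echo "$payload"   # 1312

# On Linux, -M do sets the don't-fragment bit, so an oversized probe
# fails loudly instead of being silently fragmented:
# ping -M do -s "$payload" -c 3 <peer-address>
```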
I tried looking for issues in tracker.ceph.com, but apparently I haven’t logged in there for a while and my account was deleted. I applied for a new one.
Any ideas what I can do here?
Thanks! Brian
Hi, everyone:
What is the purpose of a LogEvent with an empty metablob?
For example, in a link/unlink operation across two active MDSs, the
procedure may look like this:
1. master sends PREPARE to slave
2. slave receives PREPARE and journals an ESlaveUpdate::OP_PREPARE,
then responds with ACK to master
3. master receives ACK and journals an EUpdate, then sends OP_FINISH to slave
4. slave receives OP_FINISH and journals an ESlaveUpdate::OP_COMMIT, then
sends OP_COMMIT to master
5. master receives OP_COMMIT and journals an ECommitted
Why are the log events in steps 4 and 5 necessary? Both of them have an
empty metablob. I guess these two log events were originally intended
for crash scenarios, but they actually seem unnecessary. For example,
if a crash happens, then after the failed MDS is brought up again, the
master will resend OP_FINISH to the slave during the resolve stage, and
things will continue as expected.
Any tips about this question are appreciated. Thanks!
I am trying to install a fresh Ceph cluster on CentOS 8.
Using the latest Ceph repo for el8, it is still not possible because of missing dependencies:
libleveldb.so.1 needed by ceph-osd.
Even after manually downloading and installing the leveldb-1.20-1.el8.x86_64.rpm package, there are still dependencies:
Problem: package ceph-mgr-2:15.2.1-0.el8.x86_64 requires ceph-mgr-modules-core = 2:15.2.1-0.el8, but none of the providers can be installed
- conflicting requests
- nothing provides python3-cherrypy needed by ceph-mgr-modules-core-2:15.2.1-0.el8.noarch
- nothing provides python3-pecan needed by ceph-mgr-modules-core-2:15.2.1-0.el8.noarch
Is there a way to perform a fresh Ceph install on CentOS 8?
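One thing I considered but have not verified: python3-cherrypy and python3-pecan may be available from EPEL, so enabling that repository could resolve the remaining dependencies:

```shell
# Untested idea: pull the missing Python 3 modules from EPEL 8,
# then retry the Ceph install.
dnf install -y epel-release
dnf install -y python3-cherrypy python3-pecan
dnf install -y ceph
```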
Thanks in advance for your answer.
Hopefully someone can sanity-check me here, but I'm getting the feeling that MAX AVAIL in ceph df isn't reporting the correct value in 14.2.8 (mon/mgr/mds are .8, most OSDs are .7).
> RAW STORAGE:
> CLASS SIZE AVAIL USED RAW USED %RAW USED
> hdd 530 TiB 163 TiB 366 TiB 367 TiB 69.19
> ssd 107 TiB 37 TiB 70 TiB 70 TiB 65.33
> TOTAL 637 TiB 201 TiB 436 TiB 437 TiB 68.54
>
> POOLS:
> POOL ID STORED OBJECTS USED %USED MAX AVAIL
> fs-metadata 16 44 GiB 4.16M 44 GiB 0.25 5.6 TiB
> cephfs-hdd-3x 17 46 TiB 109.54M 144 TiB 61.81 30 TiB
> objects-hybrid 20 46 TiB 537.08M 187 TiB 91.71 5.6 TiB
> objects-hdd 24 224 GiB 50.81k 676 GiB 0.74 30 TiB
> rbd-hybrid 29 3.8 TiB 1.19M 11 TiB 40.38 5.6 TiB
> device_health_metrics 33 270 MiB 327 270 MiB 0 30 TiB
> rbd-ssd 34 4.2 TiB 1.19M 12 TiB 41.55 5.6 TiB
> cephfs-hdd-ec73 37 42 TiB 30.35M 72 TiB 44.78 62 TiB
I have a few pools that don't look like they are calculating the available storage for that pool correctly.
Specifically, any of my hybrid pools (20,29) or all-SSD pools (16,34).
For my hybrid pools, I have a crush rule of take 1 of host in the ssd root, take -1 chassis in the hdd root.
For my ssd pools, I have a crush rule of take 0 of host in the ssd root.
Now I have 60 ssd OSDs of 1.92 TB each, and sadly distribution is imperfect (leaving those issues out of this): I have plenty of underfull and overfull OSDs, which I am trying to manually reweight to bring the most-full ones down and free up space:
> $ ceph osd df class ssd | sort -k17
> ID CLASS WEIGHT REWEIGHT SIZE RAW USE DATA OMAP META AVAIL %USE VAR PGS STATUS
> MIN/MAX VAR: 0.80/1.21 STDDEV: 5.56
> TOTAL 107 TiB 70 TiB 68 TiB 163 GiB 431 GiB 37 TiB 65.33
> 28 ssd 1.77879 1.00000 1.8 TiB 951 GiB 916 GiB 7.1 GiB 5.8 GiB 871 GiB 52.20 0.80 68 up
> 33 ssd 1.77879 1.00000 1.8 TiB 1.0 TiB 1010 GiB 6.0 MiB 5.9 GiB 777 GiB 57.33 0.88 74 up
> 47 ssd 1.77879 1.00000 1.8 TiB 1.0 TiB 1011 GiB 6.7 GiB 6.4 GiB 776 GiB 57.38 0.88 75 up
> [SNIP]
> 57 ssd 1.77879 0.98000 1.8 TiB 1.4 TiB 1.3 TiB 6.2 GiB 8.6 GiB 417 GiB 77.08 1.18 102 up
> 107 ssd 1.80429 1.00000 1.8 TiB 1.4 TiB 1.3 TiB 7.0 GiB 8.7 GiB 422 GiB 77.15 1.18 102 up
> 50 ssd 1.77879 1.00000 1.8 TiB 1.4 TiB 1.4 TiB 5.5 MiB 8.6 GiB 381 GiB 79.10 1.21 105 up
> 60 ssd 1.77879 0.92000 1.8 TiB 1.4 TiB 1.4 TiB 6.2 MiB 9.0 GiB 379 GiB 79.17 1.21 105 up
That said, as a straw-man estimate: ~380 GiB free, times 60 OSDs, should be ~22.8 TiB free if all OSDs grew evenly (which they won't), and that is still far short of the 37 TiB raw free, as expected.
However, what doesn't track is the 5.6 TiB MAX AVAIL at the pool level, even for a 3x replicated pool (5.6 × 3 = 16.8 TiB), which falls well short of my napkin math of 22.8 / 3 = 7.6 TiB.
But what tracks even less is the hybrid pools, which consume 1/3 of what the 3x-replicated data does; if my napkin math is right, they should show ~22.8 TiB free.
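Writing the napkin math out explicitly (numbers taken from the listing above; decimal figures, matching the rough numbers in the text):

```shell
# ~380 GiB avail on the fullest ssd OSDs, times 60 OSDs, at 3x replication.
awk 'BEGIN {
    raw = 380 * 60 / 1000                 # ~22.8 TiB raw if filled evenly
    printf "raw free: %.1f TiB, usable at 3x: %.1f TiB\n", raw, raw / 3
}'
```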
Am I grossly mis-understanding how this is calculated?
Maybe this is fixed in Octopus?
Just trying to get a grasp on what I'm seeing not matching expectations.
Thanks,
Reed
Hi,
is it possible to remove an S3 bucket and its S3 objects with the rados
CLI tool (removing the low level Ceph objects)?
The situation is a nearfull cluster with OSDs filled more than 80% and
default.rgw.buckets.data being reported as 100% full. Read operations
are still possible, but no writes. s3cmd rb … says the operation halted,
and radosgw-admin bucket rm also waits for a healthy cluster.
The other option would be to tune the nearfull_ratio and full_ratio
temporarily to allow the cluster to use more space, correct?
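For reference, the ratio adjustment would look like this (the values are examples and would be reverted once space is freed):

```shell
# Temporarily raise the thresholds so deletes can proceed on a full cluster.
ceph osd set-nearfull-ratio 0.90
ceph osd set-full-ratio 0.97

# ... delete the bucket, wait for space to be reclaimed ...

# Restore the defaults afterwards.
ceph osd set-full-ratio 0.95
ceph osd set-nearfull-ratio 0.85
```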
Regards
--
Robert Sander
Heinlein Support GmbH
Schwedter Str. 8/9b, 10119 Berlin
http://www.heinlein-support.de
Tel: 030 / 405051-43
Fax: 030 / 405051-19
Mandatory disclosures per §35a GmbHG:
HRB 93818 B / Amtsgericht Berlin-Charlottenburg,
Managing Director: Peer Heinlein -- Registered office: Berlin
Hi experts, question about monitors and latency.
I am setting up a new cluster and I’d like to have more than one monitor. Unfortunately, the primary site only has two chassis, so to get the third mon, I’ve been trying to bring it up remotely. So far, it’s not working and I wonder if someone knows how I can fit the proverbial square peg in this round hole.
This gist[1] has the logs that I have from the third (remote) monitor and a diagram of the network. The logs keep repeating the same errors about slow ops from the first log entry.
For comparison, I’ve also included a performance test of etcd that is running on the same three nodes. The secondary site node is slower by an order of magnitude.
As for the use case, I mostly expect the primary hardware to be stable and the third mon provides uptime when I need to upgrade either primary node. There will never be any OSD traffic over the site-to-site link. The storage profiles are primarily RBD with static sizes and limited changes.
My understanding is that the monitors are only keeping track of PG locations for pools. If that’s the case and the storage is primarily fixed size RBDs, the amount of inter-monitor communication should also be limited since changes are few. Am I mistaken here?
Thanks for any input that can be provided!
Brian
[1] https://gist.github.com/briantopping/3638d665bc60c19e6f3f2005eb555a9d
Hi all,
I am a novice Ceph user and I am going through Ceph CRUSH algorithm. I
would like to do some profiling of CRUSH computations on a CPU.
Are there any CRUSH benchmarks?
Any profiling tool for CRUSH?
I would like to examine which parts of CRUSH are compute-intensive on
the CPU.
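As a possible starting point, crushtool (shipped with Ceph) has a test mode that exercises mappings against a compiled CRUSH map; timing a large range of inputs gives a crude benchmark:

```shell
# Export the cluster's CRUSH map (or compile a hand-written one),
# then time how long mapping a range of object IDs takes.
ceph osd getcrushmap -o crushmap.bin
time crushtool -i crushmap.bin --test \
    --min-x 0 --max-x 1000000 \
    --num-rep 3 --show-statistics
```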
Maybe there is already a research paper on this? The goal is
hardware/software partitioning.
I am really looking forward to any helpful tips.
Bobby !