Ceph will be present at DevConf.CZ, February 18-20 in a joint booth
with the Rook Community!
If you're interested in more information about being present at the
booth to provide expertise/content/presentations to our audience,
please let me know privately.
We are in the process of growing our Nautilus Ceph cluster. Currently
we have 6 nodes: 3 nodes with 2x5.5TB and 6x11TB disks plus 8x186GB
SSDs, and 3 nodes with 6x5.5TB and 6x7.5TB disks, all with dual-link
10GbE NICs. The SSDs are used for the CephFS metadata pool; the hard
drives are used for the CephFS data pool. All OSD journals are kept on
the drives themselves. The replication level is 3 for both the data and
metadata pools.
The new servers have 12x12TB disks and one 1.5TB NVMe drive. We expect
to get another 3 similar nodes in the near future.
My question is: what is the most sensible thing to do with the NVMe
drives? I would like to increase the replication level of the metadata
pool, so my idea was to split each NVMe into, say, 4 partitions and add
them to the metadata pool.
Given the size of the drives and the metadata pool usage (~35GB), that
seems like overkill. Would it make more sense to partition the drives
further and put the OSD journals on the NVMes?
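For concreteness, the alternative I keep coming back to is using the
NVMe as a shared DB/WAL device for the 12 HDDs rather than as extra
metadata OSDs. Something along these lines, I think (device names are
placeholders, not our real layout, and I'd want to sanity-check the
batch behaviour on our ceph-volume version first):

```shell
# Sketch only: /dev/sd[a-l] and /dev/nvme0n1 are placeholder device
# names for one of the new 12x12TB nodes.
# "lvm batch" should carve the NVMe into equal DB LVs, one per HDD OSD
# (roughly 125GB each when a 1.5TB device is shared 12 ways).
ceph-volume lvm batch --bluestore \
    /dev/sda /dev/sdb /dev/sdc /dev/sdd /dev/sde /dev/sdf \
    /dev/sdg /dev/sdh /dev/sdi /dev/sdj /dev/sdk /dev/sdl \
    --db-devices /dev/nvme0n1 --report
# Drop --report to actually create the OSDs once the plan looks right.
```

The --report flag only prints the intended layout, so it seems like a
safe way to check the split before committing.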
The University of Edinburgh is a charitable body, registered in Scotland, with registration number SC005336.
For the past several months I have been building a sizable Ceph cluster
that will grow to up to 10PB, with between 20 and 40 OSD servers, this year.
A few weeks ago I was informed that SUSE is shutting down SES and will no
longer be selling it. We haven't licensed our proof of concept cluster
that is currently at 14 OSD nodes, but it looks like SUSE is not going to
be the answer here.
I'm seeking recommendations for consulting help on this project since SUSE
has let me down.
I have Ceph installed and operating; however, I've been struggling to
get the pool configured properly for CephFS and am getting very poor
performance. The OSD servers have TLC NVMe for the DB and Optane NVMe
for the WAL, so I should be seeing decent performance from the current
cluster.
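For what it's worth, I did try to confirm that the DB and WAL actually
landed on the NVMe and Optane devices, roughly like this (osd.0 is just
a placeholder id; the exact metadata field names may differ by release):

```shell
# Show where each OSD's block, block.db and block.wal devices live:
ceph-volume lvm list
# Or query a single OSD's metadata (osd.0 is a placeholder):
ceph osd metadata 0 | grep -E 'bluefs_db|bluefs_wal|devices'
```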
I'm not opposed to completely switching OS distributions. Ceph on SUSE was
our first SUSE installation. Almost everything else we run is on CentOS,
but that may change thanks to IBM cannibalizing CentOS.
Please reach out to me if you can recommend someone to sell us consulting
hours and/or a support contract.
Washington University School of Medicine
Looking at the Octopus upgrade instructions, I see "the first time each
OSD starts, it will do a format conversion to improve the accounting for
“omap” data. This may take a few minutes to as much as a few hours (for
an HDD with lots of omap data)." and that I can disable this by setting
bluestore_fsck_quick_fix_on_mount to false.
A couple of questions about this:
i) what are the consequences of turning off this "quick fix"? Is it
possible to have it run in the background or similar?
ii) is there any way to narrow down the time estimate? Our production
cluster has 3060 OSDs on HDD (with block.db on NVMe), and obviously 3000
lots of "a few hours" is an awful lot of time...
I'll be doing some testing on our test cluster (by putting 10M objects
into an S3 bucket before trying the upgrade), but it'd be useful to have
some idea of how this is likely to work at scale...
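For context, the plan I'm considering would be something along these
lines — disable the on-mount conversion for the upgrade itself, then
convert OSDs offline one at a time afterwards. I'm assuming the offline
repair performs the same conversion; please correct me if that's wrong
(osd id and path below are placeholders):

```shell
# Defer the on-mount omap conversion so OSDs restart quickly
# during the Octopus upgrade:
ceph config set osd bluestore_fsck_quick_fix_on_mount false
# Later, convert one (stopped) OSD at a time with the offline tool
# (osd.0 and its path are placeholders):
systemctl stop ceph-osd@0
ceph-bluestore-tool repair --path /var/lib/ceph/osd/ceph-0
systemctl start ceph-osd@0
```

That would at least let us control the pace across the 3060 OSDs
instead of paying the conversion cost on every restart at once.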
The Wellcome Sanger Institute is operated by Genome Research
Limited, a charity registered in England with number 1021457 and a
company registered in England with number 2742969, whose registered
office is 215 Euston Road, London, NW1 2BE.
We just installed a Ceph cluster running Luminous (12.2.11) on servers
running Debian Buster (10.8)
using ceph-deploy, and we are trying to upgrade it to Mimic but can't
find a way to do it.
We tried ceph-deploy install --release mimic mon1 mon2 mon3 (after
having modified /etc/apt/sources.list.d/ceph.list),
but this does nothing because the packages are reported to be up to date.
Could someone help us, please?
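For reference, here is roughly what we have been doing by hand to see
what apt can actually find (the repo line is what we put in ceph.list;
the codename may be the problem):

```shell
# Point apt at the upstream Mimic repository for Buster
# (we are not sure such builds even exist -- see below):
echo "deb https://download.ceph.com/debian-mimic/ buster main" \
    | sudo tee /etc/apt/sources.list.d/ceph.list
sudo apt update
# Show which candidate versions apt sees, and from which repository:
apt-cache policy ceph-osd
```

If no 13.2.x candidate shows up there, we suspect download.ceph.com
simply does not provide Mimic packages built for Buster, and we would
have to go straight to a release that does (Nautilus, as far as we can
tell) — can anyone confirm?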
(sorry if this gets posted twice. I forgot a subject in the first mail)
We experienced an outage this morning on a Jewel cluster with 1559 OSDs.
It appeared that a switch uplink in a rack misbehaved, and once we shut
that interface down, Ceph health recovered quickly. I have some
questions, though, on OSD behaviour that I hope someone can answer.
1 - In a lot of OSD logs I saw that neighbours reported the OSD down
(while the process was still running and obviously still logging). Then,
after a while, the logs show
* Got signal Interrupt
* prepare_to_stop starting shutdown
and the OSD process stops.
Why does the OSD process stop? Is it instructed to do so by the monitor
because neighbours reported it down and Ceph wants to avoid flapping?
2 - The OSDs reported a lot of
* heartbeat_check: no reply from #ip:#port
When I telnet to the IP and port I get a connection just fine. Is there
a way to run a heartbeat check from the command line so that we can try
to capture the traffic and determine why it fails?
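What I have tried so far, on the assumption that heartbeats use
dedicated ports (so a successful telnet to the OSD's public port may
not prove much) — osd id, interface and port numbers below are
placeholders:

```shell
# The OSD map lists each OSD's heartbeat addresses; in the JSON dump
# they appear as hb_back_addr / hb_front_addr:
ceph osd dump --format json-pretty | grep -E '"osd"|hb_'
# Capture traffic on those ports to see whether heartbeat pings go
# unanswered (interface and ports are placeholders):
tcpdump -i eth0 -nn 'port 6810 or port 6811'
```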
Hello, I am using the rados bench tool. Currently I am running it
against a development cluster built with the vstart.sh script. It works
fine, and I am interested in benchmarking the cluster. However, I am
struggling to achieve good bandwidth (MB/sec). My target throughput is
at least 50 MB/sec, but mostly I am achieving around 15-20 MB/sec —
so, very poor.
I am quite sure I am missing something: either I have to change my
cluster setup through the vstart.sh script, or I am not fully utilizing
the rados bench tool, or maybe both — the wrong cluster and the wrong
use of the rados bench tool.
Some of the shell commands I have been using to build the cluster are
MDS=0 RGW=1 ../src/vstart.sh -d -l -n --bluestore
MDS=0 RGW=1 MON=1 OSD=4 ../src/vstart.sh -d -l -n --bluestore
While using the rados bench tool I have been trying different block
sizes: 4K, 8K, 16K, 32K, 64K, 128K, 256K, 512K. I have also been
changing the -t parameter to increase the number of concurrent IOs.
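For reference, a typical run on my side looks like the following (pool
name, block size and thread count are just examples I have been
varying). I notice the small block sizes are presumably IOPS-bound, so
maybe 15-20 MB/sec at 4K is expected and MB/sec only climbs with large
blocks?

```shell
# Build a 4-OSD vstart cluster from the build directory:
MDS=0 RGW=1 MON=1 OSD=4 ../src/vstart.sh -d -l -n --bluestore
# Create a test pool and benchmark writes for 60 seconds with
# 4MB objects (4194304 bytes) and 16 concurrent ops:
./bin/ceph osd pool create bench 32 32
./bin/rados bench -p bench 60 write -b 4194304 -t 16 --no-cleanup
# Then read the same objects back sequentially:
./bin/rados bench -p bench 60 seq -t 16
```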
Looking forward to your help.
A question that has probably been asked by many other users before: I
want to do a POC, and for the POC I can use old decommissioned hardware.
Currently I have 3 x IBM X3550 M5. One has:
1 dual-port 10G NIC
Intel(R) Xeon(R) CPU E5-2637 v3 @ 3.50GHz
The other two have a slower CPU but more RAM:
Intel(R) Xeon(R) CPU E5-2680 v4 @ 2.40GHz
Of course I can re-arrange the RAM.
The switches are not LACP-capable, so I'm planning to use bonding in
active-active mode. For the disks I'm planning to buy 12 x Samsung
PM883 1.9TB and use them in an EC pool.
My questions are:
1. Which bonding mode should I choose? balance-alb?
2. Are the disks OK for a POC? Or should I rather go with more, smaller disks (960GB), e.g. 24 in total?
3. Are there any drawbacks when using EC pools?
Workload will be mostly VMs (vSphere / OpenStack), but also CephFS with a Samba gateway.
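For question 3, what I have in mind for the EC pool is roughly the
following (profile and pool names are placeholders). My assumption is
that with only 3 hosts and host as the failure domain, k=2/m=1 is about
the only shape that fits — please correct me if that's wrong:

```shell
# Sketch for a 3-host POC: k=2,m=1 with host failure domain
# (tolerates one host down, no headroom beyond that):
ceph osd erasure-code-profile set poc-ec k=2 m=1 crush-failure-domain=host
ceph osd pool create ecpool 64 64 erasure poc-ec
# RBD/CephFS data on EC pools needs overwrites enabled:
ceph osd pool set ecpool allow_ec_overwrites true
```

As I understand it, RBD images would still need a replicated pool for
their metadata (rbd create --data-pool ecpool ...), and k=2/m=1 leaves
no margin during maintenance, which is partly why I'm asking about the
drawbacks.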