Just in case anybody is interested: Using dm-cache works and boosts
performance -- at least for my use case.
The "challenge" was to get 100 (identical) Linux-VMs started on a three
node hyperconverged cluster. The hardware is nothing special, each node
has a Supermicro server board with a single CPU with 24 cores and 4 x 4
TB hard disks. And there's that extra 1 TB NVMe...
I know that the general recommendation is to use the NVMe for WAL and
metadata, but this didn't seem appropriate for my use case, and I'm still
not quite sure about failure scenarios with that configuration. So instead
I made each drive a logical volume (managed by an OSD) and added 85 GiB of
NVMe to each LV as a read-only cache.
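Roughly, the per-drive setup was along these lines (a sketch from memory;
device and VG names are made up, and the exact cache-mode flag is an
assumption -- writethrough is the closest lvmcache mode to "read-only"
caching, since the HDD stays authoritative):

pvcreate /dev/sda
vgcreate vg_sda /dev/sda
lvcreate -l 100%FREE -n osd_sda vg_sda
# one NVMe partition per HDD-VG; writethrough means losing the
# NVMe loses no data
pvcreate /dev/nvme0n1p1
vgextend vg_sda /dev/nvme0n1p1
lvcreate -L 85G -n cache_sda vg_sda /dev/nvme0n1p1
lvconvert --type cache --cachevol cache_sda --cachemode writethrough vg_sda/osd_sda
# the OSD then sits on the cached LV, e.g.:
ceph-volume lvm create --data vg_sda/osd_sda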
Each VM uses as its system disk an RBD image cloned from a snapshot of the
master image. The idea was that with this configuration, all VMs would share
most (actually almost all) of the data on their system disks, and that this
shared data would be served from the cache.
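The clone chain is just the usual RBD snapshot/clone mechanism, along these
lines (pool and image names made up):

rbd snap create vms/master@gold
rbd snap protect vms/master@gold
for i in $(seq -w 1 100); do rbd clone vms/master@gold vms/vm-$i-disk; done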
Well, it works. When booting the 100 VMs, almost all read operations are
satisfied from the cache. So I get close to NVMe speed but have paid for
conventional hard drives only (well, SSDs aren't that much more expensive
nowadays, but the hardware is 4 years old).
So, nothing sophisticated, but as I couldn't find anything about this
kind of setup, it might be of interest nevertheless.
- Michael
We're happy to announce the 15th backport release in the Pacific series,
which is expected to be the last.
https://ceph.io/en/news/blog/2024/v16-2-15-pacific-released/
Notable Changes
---------------
* `ceph config dump --format <json|xml>` output will display the localized
  option names instead of their normalized version. For example,
  "mgr/prometheus/x/server_port" will be displayed instead of
  "mgr/prometheus/server_port". This matches the output of the
  non-pretty-printed version of the command.
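  A hedged illustration (exact JSON field names may vary by release):

    ceph config dump --format json | jq '.[] | select(.name | test("server_port"))'
    # before: "name": "mgr/prometheus/server_port"    (normalized)
    # after:  "name": "mgr/prometheus/x/server_port"  (localized)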
* CephFS: The MDS now evicts clients which are not advancing their request
  tids; such clients cause a large buildup of session metadata, resulting in
  the MDS going read-only because the RADOS operation exceeds the size
  threshold. The `mds_session_metadata_threshold` config option controls the
  maximum size to which the (encoded) session metadata can grow.
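  The option can be adjusted like any other MDS setting; the value here is
  illustrative, not necessarily the shipped default:

    ceph config set mds mds_session_metadata_threshold 16777216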
* RADOS: The `get_pool_is_selfmanaged_snaps_mode` C++ API has been deprecated
due to its susceptibility to false negative results. Its safer replacement is
`pool_is_in_selfmanaged_snaps_mode`.
* RBD: When diffing against the beginning of time (`fromsnapname == NULL`) in
fast-diff mode (`whole_object == true` with `fast-diff` image feature enabled
and valid), diff-iterate is now guaranteed to execute locally if exclusive
lock is available. This brings a dramatic performance improvement for QEMU
live disk synchronization and backup use cases.
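  The CLI analogue of a beginning-of-time diff is simply omitting
  --from-snap (image spec made up):

    rbd diff --whole-object vms/vm-01-disk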
Getting Ceph
------------
* Git at git://github.com/ceph/ceph.git
* Tarball at https://download.ceph.com/tarballs/ceph-16.2.15.tar.gz
* Containers at https://quay.io/repository/ceph/ceph
* For packages, see https://docs.ceph.com/en/latest/install/get-packages/
* Release git sha1: 618f440892089921c3e944a991122ddc44e60516
Hi,
ceph dashboard fails to listen on all IPs.
log_channel(cluster) log [ERR] : Unhandled exception from module 'dashboard'
while running on mgr.controllera: OSError("No socket could be created --
(('0.0.0.0', 8443): [Errno -2] Name or service not known) -- (('::', 8443,
0, 0):
ceph version 17.2.7 quincy (stable)
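For reference, the bind address the dashboard uses can be inspected and
overridden like this (a sketch; I haven't confirmed this fixes the error
above):

ceph config get mgr mgr/dashboard/server_addr
ceph config set mgr mgr/dashboard/server_addr 0.0.0.0
ceph mgr module disable dashboard
ceph mgr module enable dashboard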
Regards.
Hey ceph-users,
I just noticed issues with ceph-crash when using the Debian/Ubuntu packages
(package: ceph-base): while the /var/lib/ceph/crash/posted folder is created
by the package install, it is not properly chowned to ceph:ceph by the
postinst script.
This might also affect RPM based installs somehow, but I did not look
into that.
I opened a bug report with all the details and two ideas to fix this:
https://tracker.ceph.com/issues/64548
The wrong ownership causes ceph-crash to NOT work at all. I myself
missed quite a few crash reports. All of them were just sitting around
on the machines, but were reported right after I did
chown ceph:ceph /var/lib/ceph/crash/posted
systemctl restart ceph-crash.service
You might want to check whether you are affected as well.
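A quick check (default path; it should print ceph:ceph):

stat -c '%U:%G' /var/lib/ceph/crash/posted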
Failing to post crashes to the local cluster results in them not being
reported back via telemetry.
Regards
Christian
Hi,
I finished the conversion from ceph-ansible to cephadm yesterday. Everything
seemed to be working until this morning, when I wanted to redeploy the rgw
service to specify the network to be used. So I deleted the rgw services
with ceph orch rm, then prepared a yml file with the new conf. I applied the
file and the new rgw service was started, but it was launched with an
external image. So I wanted to redeploy using my local image, and I did a
redeploy ... and then nothing happened: I got the "rescheduled" message but
nothing happened. Then I restarted one of the controllers, and the
orchestrator doesn't seem to be aware that some services have restarted???
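For context, the spec and the commands were roughly as follows (service name,
hosts, network and image are placeholders, not my real values):

service_type: rgw
service_id: myrgw
placement:
  hosts:
    - controllera
    - controllerb
networks:
  - 10.1.0.0/24

ceph orch apply -i rgw.yml
ceph orch daemon redeploy rgw.myrgw.controllera.abcdef my-registry.local/ceph/ceph:v17.2.7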
PS: I haven't fully mastered the cephadm command line and its usage.
Regards.
Hi,
I tried to create an NFS cluster using this command:
[root@controllera ceph]# ceph nfs cluster create mynfs "3 controllera controllerb controllerc" --ingress --virtual_ip 20.1.0.201 --ingress-mode haproxy-protocol
And I got this error:
Invalid command: haproxy-protocol not in default|keepalive-only
I am using Quincy: ceph version 17.2.7 (...) quincy (stable)
Is it not supported yet?
Regards.
Hi Y'all,
We have a new ceph cluster online that looks like this:
md-01 : monitor, manager, mds
md-02 : monitor, manager, mds
md-03 : monitor, manager
store-01 : twenty 30TB NVMe OSDs
store-02 : twenty 30TB NVMe OSDs
The cephfs storage is using erasure coding at 4:2. The crush domain is
set to "osd".
(I know that's not optimal but let me get to that in a minute)
We have a current regular single NFS server (nfs-01) with the same
storage as the OSD servers above (twenty 30TB NVMe disks). We want to
wipe the NFS server and integrate it into the above ceph cluster as
"store-03". When we do that, we would then have three OSD servers. We
would then switch the crush domain to "host".
My question is this: given that we have 4:2 erasure coding, would the data
rebalance evenly across the three OSD servers after we add store-03, such
that if a single OSD server went down, the other two would be enough to keep
the system online? That is, with 4:2 erasure coding, would 2 shards go on
store-01, 2 shards on store-02, and 2 shards on store-03? Do I understand
that correctly?
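From what I've read, forcing exactly two shards per host would take a CRUSH
rule along these lines (just a sketch; rule name and id made up, and not
something we currently run):

rule ec42_two_per_host {
    id 2
    type erasure
    step set_chooseleaf_tries 5
    step set_choose_tries 100
    step take default
    step choose indep 3 type host
    step chooseleaf indep 2 type osd
    step emit
}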
Thanks for any insight!
-erich
Is there any update on this? Has anyone tested the option and has
performance values from before and after?
Is there any good documentation regarding this option?
Please don't drop the list from your response.
The first question coming to mind is: why do you have a cache tier if all
your pools are on NVMe devices anyway? I don't see any benefit here.
Did you try the suggested workaround and disable the cache-tier?
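If not, the workaround boils down to something like this (pool names taken
from your output; a sketch -- make sure the flush actually completes before
removing the tier):

ceph osd pool set vms_cache hit_set_count 0
ceph osd tier cache-mode vms_cache readproxy
rados -p vms_cache cache-flush-evict-all
ceph osd tier remove-overlay vms
ceph osd tier remove vms vms_cache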
Quoting Cedric <yipikai7(a)gmail.com>:
> Thanks Eugen, see attached infos.
>
> Some more details:
>
> - commands that actually hang: ceph balancer status ; rbd -p vms ls ;
> rados -p vms_cache cache-flush-evict-all
> - all scrubs running on vms_cache pgs stall / restart in a loop
> without actually doing anything
> - all io is 0, both in ceph status and in iostat on the nodes
>
> On Tue, Feb 20, 2024 at 10:00 AM Eugen Block <eblock(a)nde.ag> wrote:
>>
>> Hi,
>>
>> some more details would be helpful, for example what's the pool size
>> of the cache pool? Did you issue a PG split before or during the
>> upgrade? This thread [1] deals with the same problem, the described
>> workaround was to set hit_set_count to 0 and disable the cache layer
>> until that is resolved. Afterwards you could enable the cache layer
>> again. But keep in mind that the code for cache tier is entirely
>> removed in Reef (IIRC).
>>
>> Regards,
>> Eugen
>>
>> [1]
>> https://ceph-users.ceph.narkive.com/zChyOq5D/ceph-strange-issue-after-addin…
>>
>> Quoting Cedric <yipikai7(a)gmail.com>:
>>
>> > Hello,
>> >
>> > Following an upgrade from Nautilus (14.2.22) to Pacific (16.2.13), we
>> > encountered an issue with a cache pool becoming completely stuck;
>> > relevant messages below:
>> >
>> > pg xx.x has invalid (post-split) stats; must scrub before tier agent
>> > can activate
>> >
>> > In OSD logs, scrubs are starting in a loop without succeeding for all
>> > pg of this pool.
>> >
>> > What we already tried without luck so far:
>> >
>> > - shutdown / restart OSD
>> > - rebalance pg between OSD
>> > - raise the memory on OSD
>> > - repeer PG
>> >
>> > Any idea what is causing this? any help will be greatly appreciated
>> >
>> > Thanks
>> >
>> > Cédric