Hey,
Just to confirm my understanding: if I quickly set up a 3-host cluster
with an EC 4+2 pool, and I set the CRUSH failure domain to osd, the data
will be distributed among the OSDs, and of course there won't be any
protection against host failure. And yes, I know that's a bad idea, but I
need the extra storage really fast, and it's a backup of other data, so
availability is important, but not critical.
If I then add 5 more hosts a week later, I can just edit the CRUSH map,
change the failure domain from osd to host, inject the CRUSH map again,
and Ceph should automatically redistribute all the PGs over the OSDs to
be fully host-fault tolerant, right?
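For reference, this is roughly the workflow I have in mind (a sketch only;
the file names are placeholders):
ceph osd getcrushmap -o crushmap.bin
crushtool -d crushmap.bin -o crushmap.txt
# in crushmap.txt, change the EC rule's step from
#   step chooseleaf indep 0 type osd
# to
#   step chooseleaf indep 0 type host
crushtool -c crushmap.txt -o crushmap.new
ceph osd setcrushmap -i crushmap.new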
Am I understanding this correctly?
Angelo.
Hello Users,
We have a big cluster (Quincy) with almost 1.7 billion RGW objects, and
we've enabled SSE as per
https://docs.ceph.com/en/quincy/radosgw/encryption/#automatic-encryption-fo…
(yes, we've chosen this insecure method of storing the key).
We're now in the process of implementing RGW multisite, but we are stuck
due to https://tracker.ceph.com/issues/46062 and the thread at
https://lists.ceph.io/hyperkitty/list/ceph-users@ceph.io/thread/PQW66JJ5DCR…
I was wondering if there is a way to decrypt the objects in place with the
applied symmetric key. I tried removing
rgw_crypt_default_encryption_key from the mon configuration database
(on a test cluster), but, as expected, the RGW daemons throw 500 server
errors, as they cannot work on the encrypted objects.
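For context, the key was set (and, on the test cluster, removed) roughly
like this; the client.rgw config section is an assumption based on our
setup:
# set as per the documentation above (insecure: key lives in the config db)
ceph config set client.rgw rgw_crypt_default_encryption_key <base64-encoded-key>
# what I tried on the test cluster:
ceph config rm client.rgw rgw_crypt_default_encryption_key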
There is a PR in progress that introduces a command option for this, at
https://github.com/ceph/ceph/pull/51842, but it appears it will take some
time to be merged.
Cheers,
Jayanth Reddy
The documentation very briefly explains a few core commands for restarting
things:
https://docs.ceph.com/en/quincy/cephadm/operations/#starting-and-stopping-d…
but I feel I'm missing quite a few details about what is safe to do.
I have a system in production, with clusters connected via CephFS and some
shared block devices.
We would like to restart some things due to some new network
configurations. Going daemon by daemon would take forever, so I'm curious
about what happens if one tries the command:
ceph orch restart osd
Will that try to be smart and restart just a few OSDs at a time to keep
things up and available, or will it trigger a restart everywhere
simultaneously?
I guess that in my current scenario, restarting one host at a time makes
the most sense, with a
systemctl restart ceph-{fsid}.target
and then checking that "ceph -s" says OK before proceeding to the next
host, but I'm still curious what the "ceph orch restart xxx" command
would do (though not curious enough to try it out in production).
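Roughly what I have in mind, as a sketch (the hostnames and the noout flag
are my own assumptions, not from the docs):
ceph osd set noout    # avoid rebalancing while OSDs bounce
for host in ceph01 ceph02 ceph03; do    # placeholder hostnames
    ssh "$host" "systemctl restart ceph-{fsid}.target"
    # wait until the cluster reports healthy before moving on
    until ceph health | grep -q HEALTH_OK; do sleep 10; done
done
ceph osd unset noout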
Best regards, Mikael
Chalmers University of Technology
Hi,
we are running a cluster that has been alive for a long time, and we tread carefully regarding updates. We are still lagging a bit: our cluster (which started around Firefly) is currently at Nautilus. We are updating, and we know we're still behind, but we keep running into challenges along the way that are typically still unfixed on main, so, as I said, we have to tread carefully.
Nevertheless, mistakes happen, and we found ourselves in this situation: we converted our RGW data pool from replicated (n=3) to erasure coded (k=10, m=3, with 17 hosts). When selecting the EC profile we missed that our hosts are not evenly sized (this is a growing cluster: some machines have around 20 TiB of capacity for the RGW data pool, whereas newer machines have around 160 TiB), and we should rather have gone with k=4, m=3. In any case, having 13 chunks causes too many hosts to participate in each object. Going for k+m=7 will allow distribution to be more effective, as we have 7 hosts with the 160 TiB sizing.
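For the record, a sketch of the profile we should have chosen instead (the profile name is a placeholder):
ceph osd erasure-code-profile set rgw-ec-4-3 k=4 m=3 crush-failure-domain=host
ceph osd erasure-code-profile get rgw-ec-4-3    # verify the resulting profile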
Our original migration used the "cache tiering" approach, but that only works once, when moving from replicated to EC, and cannot be used for further migrations.
The amount of data, at 215 TiB, is somewhat significant, so we need an approach that scales when copying data[1], to avoid ending up with months of migration.
I’ve run out of ideas doing this on a low-level (i.e. trying to fix it on a rados/pool level) and I guess we can only fix this on an application level using multi-zone replication.
I have the setup nailed in general, but I’m running into issues with buckets in our staging and production environment that have `explicit_placement` pools attached, AFAICT is this an outdated mechanisms but there are no migration tools around. I’ve seen some people talk about patched versions of the `radosgw-admin metadata put` variant that (still) prohibits removing explicit placements.
AFAICT those explicit placements will be synced to the secondary zone and the effect that I’m seeing underpins that theory: the sync runs for a while and only a few hundred objects show up in the new zone, as the buckets/objects are already found in the old pool that the new zone uses due to the explicit placement rule.
I’m currently running out of ideas, but open for any other options.
Looking at https://lists.ceph.io/hyperkitty/list/ceph-users@ceph.io/thread/ULKK5RU2VXL… I’m wondering whether the relevant patch is available somewhere, or whether I’ll have to try building that patch again on my own.
Going through the docs and the code, I'm wondering whether `explicit_placement` is actually a really crufty residual piece that won't get used in newer clusters, while older clusters don't really have an option to get away from it?
In my specific case, the placement rules are identical to the explicit placements stored on the (apparently older) buckets, and the only thing I need to do is remove them. I can accept a bit of downtime to avoid any race conditions if needed, so maybe a small tool that just removes those entries while all RGWs are down would be fine. A call to `radosgw-admin bucket stats` takes about 18s for all buckets in production, and I guess that is a good baseline for the timing to expect when running an update on the metadata.
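To illustrate, the kind of workflow I imagine such a tool would perform (a sketch only; it assumes a patched radosgw-admin that doesn't re-add the placement, and the bucket/instance names are placeholders):
# dump the bucket instance metadata
radosgw-admin metadata get bucket.instance:<bucket>:<instance-id> > bi.json
# blank out the explicit_placement data_pool/index_pool entries in bi.json,
# then, with all RGWs stopped, write it back:
radosgw-admin metadata put bucket.instance:<bucket>:<instance-id> < bi.json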
I’ll also be in touch with colleagues from Heinlein and 42on but I’m open to other suggestions.
Hugs,
Christian
[1] We currently have 215 TiB of data in 230M objects. Using the "official" "cache-flush-evict-all" approach was unfeasible here, as it only yielded around 50 MiB/s. Using cache limits and targeting the cache sizes towards 0 caused proper parallelization and was able to flush/evict at an almost constant 1 GiB/s in the cluster.
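Roughly the kind of settings involved (a sketch; the pool name and exact values are placeholders, and note the very small non-zero targets, since 0 disables a limit):
ceph osd pool set rgw-cache target_max_bytes 1
ceph osd pool set rgw-cache target_max_objects 1
ceph osd pool set rgw-cache cache_target_dirty_ratio 0.0
ceph osd pool set rgw-cache cache_target_full_ratio 0.01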
--
Christian Theune · ct(a)flyingcircus.io · +49 345 219401 0
Flying Circus Internet Operations GmbH · https://flyingcircus.io
Leipziger Str. 70/71 · 06108 Halle (Saale) · Deutschland
HR Stendal HRB 21169 · Geschäftsführer: Christian Theune, Christian Zagrodnick
Hello,
is it possible to recover an OSD after it was removed?
The systemd service was removed, but the block device is still listed by
lsblk
and the config files are still available under
/var/lib/ceph/{fsid}/removed
It is a containerized cluster, so I think we need to re-add the cephx
entries, use ceph-volume, fix the CRUSH entries, and so on.
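In case it helps: my rough guess at the steps, as a sketch (the host name
is a placeholder, and I'm not sure this is complete):
# let ceph-volume rediscover the still-present LVM OSD on the host
cephadm ceph-volume lvm list
# then have the orchestrator re-activate existing OSDs on that host
ceph cephadm osd activate <host>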
Best regards,
Malte
Hi ceph gurus,
I am experimenting with the RGW multisite sync feature on the Quincy release (17.2.5). I am using zone-level sync, not a bucket-level sync policy. During my experiment, my setup somehow got into a situation that it doesn't seem to get out of: one zone is perpetually behind the other, although there are no ongoing client requests.
Here is the output of my "sync status":
root@mon1-z1:~# radosgw-admin sync status
          realm f90e4356-3aa7-46eb-a6b7-117dfa4607c4 (test-realm)
      zonegroup a5f23c9c-0640-41f2-956f-a8523eccecb3 (zg)
           zone bbe3e2a1-bdba-4977-affb-80596a6fe2b9 (z1)
  metadata sync no sync (zone is master)
      data sync source: 9645a68b-012e-4889-bf24-096e7478f786 (z2)
                syncing
                full sync: 0/128 shards
                incremental sync: 128/128 shards
                data is behind on 14 shards
                behind shards: [56,61,63,107,108,109,110,111,112,113,114,115,116,117]
It stays behind forever while the RGW daemons are almost completely idle
(1% CPU).
Any suggestions on how to drill deeper to see what happened?
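I'm guessing the per-shard status and the error list would be the place to
start, something like this (shard id taken from the list above):
# per-shard detail for one of the shards that is behind
radosgw-admin data sync status --source-zone=z2 --shard-id=56
# check for accumulated sync errors
radosgw-admin sync error list
but I'm not sure how to interpret or act on the output.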
Thanks,
Yixin