Hi,
I have a fresh Nautilus Ceph cluster with radosgw as a front end. I've been
testing with a slightly modified version of https://github.com/wasabi-tech/s3-benchmark/
I have 5 storage nodes with 4 osds each, for a total of 20 osds. I am
testing locally on a single rgw node. First, I uploaded a bunch of 1GB
objects. Now I'm attempting to download them in random order and measure
the time it takes to fetch them.
My problem is that during the download phase rgw will hang and the process
will consume 100% CPU on the civetweb-worker thread (according to top).
The logs show that it downloads segments of the object but then stops part
way through and never continues.
I tried using beast instead of civetweb as a front-end, but it still hangs
in the same way, leading me to believe that this is a back-end issue.
This is the end of the logs; as you can see, the first three lines show a
successful read, and the last line shows that it starts a read attempt but
never completes:
2019-10-08 13:35:42.673 7fc6cec40700 20 rados->get_obj_iterate_cb oid=2217f6c8-5a9f-4cfc-a1a7-1ced740afb81.127425.2__shadow_.SCoV2VuKnMkiOqi2n3FcWgveOJYu4Io_18 obj-ofs=75497472 read_ofs=0 len=4194304
2019-10-08 13:35:42.673 7fc6cec40700 20 RGWObjManifest::operator++(): rule->part_size=0 rules.size()=1
2019-10-08 13:35:42.673 7fc6cec40700 20 RGWObjManifest::operator++(): result: ofs=79691776 stripe_ofs=79691776 part_ofs=0 rule->part_size=0
2019-10-08 13:35:42.673 7fc6cec40700 20 rados->get_obj_iterate_cb oid=2217f6c8-5a9f-4cfc-a1a7-1ced740afb81.127425.2__shadow_.SCoV2VuKnMkiOqi2n3FcWgveOJYu4Io_19 obj-ofs=79691776 read_ofs=0 len=4194304
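For what it's worth, this is roughly how I've been poking at the gateway
while it sits in that state (the daemon name "client.rgw.gw1" is just what
it's called in my setup, so adjust accordingly):

    # bump rgw and messenger logging on the running gateway (hypothetical daemon name)
    ceph daemon client.rgw.gw1 config set debug_rgw 20
    ceph daemon client.rgw.gw1 config set debug_ms 1
    # list the RADOS reads the gateway still has in flight while it appears hung
    ceph daemon client.rgw.gw1 objecter_requests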
Can someone advise whether I've misconfigured something, or whether I've
stumbled on a bug?
Thanks,
Mike
Hi Everyone,
So it recently came to my attention that on one of our clusters, running
the command "radosgw-admin usage show" returns a blank response. What is
going on behind the scenes with this command, and why might it not be
seeing any of the buckets properly? The data is still accessible over S3
via the rgw service; it's just not showing us either the index or metadata
of the buckets.
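In case it's relevant, these are the sorts of commands I've been looking at
(the rgw instance name below is just an example from our setup):

    radosgw-admin usage show
    # verify that usage logging is actually enabled on the gateway (hypothetical instance name)
    ceph daemon client.rgw.gw1 config get rgw_enable_usage_log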
Greatly appreciate everyone's help in advance.
Thanks,
Mac
Hi all,
I'm evaluating CephFS to serve our business as a file share that spans
across our 3 datacenters. One concern I have is that when using CephFS
with OpenStack Manila, all guest VMs need access to the public
storage net. This feels like a security concern to me. One suggestion
I've seen is to put NFS gateways in between to prevent this, but I would
prefer not having to use NFS. Is there another way to solve this, or is this
not a concern to others, regarding both the network exposure and NFS? We are
a small cloud provider, and having different customers exposed to each other
on the same storage net seems risky to me.
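For what it's worth, the only partial mitigation I've come up with so far is
scoping each share's cephx key to its own path, roughly like this (the
filesystem, client and path names are just examples):

    # hypothetical names: restrict a tenant's key to a single share path
    ceph fs authorize cephfs client.tenant-a /volumes/share-a rw

That limits what each tenant can reach on the filesystem itself, but the
guests still sit on the storage network, which is the part that worries me.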
Regards
Jaan
>If the journal is no longer readable: the safe variant is to
>completely re-create the OSDs after replacing the journal disk. (The
>unsafe way to go is to just skip the --flush-journal part, not
>recommended)
Hello Paul,
Thanks for your reply. We have replaced the journal disk.
Last week we were on vacation, so this email is delayed.
My confusion is: why was the PG stuck?
The PG should be repaired automatically when the OSD is down, shouldn't it?
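For reference, this is roughly what we are looking at on our side (the pg id
below is only a placeholder):

    ceph health detail              # lists the stuck/degraded PGs
    ceph pg 2.1f query | less       # placeholder pg id; shows why peering/recovery stalled
    ceph osd tree down              # confirm which OSDs are still marked down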
Hi,
I'm following the discussion for a tracker issue [1] about spillover
warnings that affect our upgraded Nautilus cluster.
Just to clarify, would resizing the RocksDB volume (and expanding it
with 'ceph-bluestore-tool bluefs-bdev-expand...') resolve that, or do
we have to recreate every OSD?
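For reference, the procedure I had in mind is roughly the following (the OSD
id and LV names are only placeholders from our layout):

    systemctl stop ceph-osd@12                         # placeholder OSD id
    lvextend -L +30G /dev/ceph-db/db-osd12             # grow the underlying DB volume first
    ceph-bluestore-tool bluefs-bdev-expand --path /var/lib/ceph/osd/ceph-12
    systemctl start ceph-osd@12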
Regards,
Eugen
[1] https://tracker.ceph.com/issues/38745
For performance questions like this, you’re better off setting up a test environment and benchmarking it yourself.
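For example, something along these lines would populate a batch of test users to benchmark against (the count, naming and pool are just examples):

    # create 1000 rbd-capable cephx users for a test (hypothetical names/pool)
    for i in $(seq 1 1000); do
      ceph auth get-or-create client.bench-$i \
        mon 'profile rbd' osd 'profile rbd pool=testpool'
    done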
Regards,
Wesley Peng
> Am Oct 7, 2019 - 1:17 AM schrieb frankaritchie(a)gmail.com:
>
>
> Would RBD performance be hurt by having thousands of cephx users defined?
>
It’s not that the limit is *ignored*; sometimes the failure of the subtree isn’t *detected*. E.g., I’ve seen this happen when a node experienced kernel weirdness or OOM conditions such that the OSDs didn’t all get marked down at the same time, so the PGs all started recovering. Admittedly it’s been a while since I’ve seen this; my sense is that the detection became a *lot* better with Luminous.
> On Oct 3, 2019, at 9:55 AM, Darrell Enns <darrelle(a)knowledge.ca> wrote:
>
> Thanks for the reply Anthony.
>
> Those are all considerations I am very much aware of. I'm very curious about this though:
>
>> mon_osd_down_out_subtree_limit. There are cases where it doesn’t kick in and a whole node will attempt to rebalance
>
> In what cases is the limit ignored? Do these exceptions also apply to mon_osd_min_in_ratio? Is this in the docs somewhere?
>
[ good Cephers trim their quoted text ]
This is in part a question of *how many* of those dense OSD nodes you have. If you have a hundred of them, then most likely they’re spread across a decent number of racks and the loss of one or two is a tolerable *fraction* of the whole cluster.
If you have a cluster of just, say, 3-4 of these dense nodes, component failure, network glitches, and even maintenance become problematic.
You can *mostly* forestall whole-node rebalancing by careful alignment of fault domains with the value of mon_osd_down_out_subtree_limit. There are cases where it doesn’t kick in and a whole node will attempt to rebalance, which — assuming the CRUSH rules and topology are fault-tolerant — may cause surviving OSDs to reach full or backfillfull states, potentially resulting in an outage.
If the limit does kick in, you’ll have reduced or no redundancy until you either bring the host/OSDs back up, or manually cause the recovery to proceed.
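For reference, the knobs in question can be checked or adjusted like this (the values shown are only examples, not recommendations):

    ceph config get mon mon_osd_down_out_subtree_limit   # CRUSH unit type that won't be auto-marked out (default: rack)
    ceph config set mon mon_osd_down_out_subtree_limit host
    ceph config get mon mon_osd_min_in_ratio              # don't auto-mark OSDs out below this "in" fraction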
As was already mentioned as well, having a small number of fault domains also limits the EC strategies you can safely use.
> Thanks Paul. I was speaking more about total OSDs and RAM, rather than a single node. However, I am considering building a cluster with a large OSD/node count. This would be for archival use, with reduced performance and availability requirements. What issues would you anticipate with a large OSD/node count? Is the concern just the large rebalance if a node fails and takes out a large portion of the OSDs at once?