I'm running a CephFS with an 8+2 EC data pool. The disks are spread over 10 hosts and the failure domain is host. The version is Mimic 13.2.2. Today I added a few OSDs to one of the hosts and observed that a lot of PGs became inactive, even though 9 out of 10 hosts were up the whole time. After getting the 10th host and all its disks back up, I still ended up with a large number of undersized PGs and degraded objects, which I don't understand, as no OSD was removed.
Here are some details about the steps taken on the host with the new disks (a rough shell version of the sequence follows the list); the main questions are at the end:
- shut down OSDs (systemctl stop docker)
- reboot host (this is necessary due to OS deployment via warewulf)
Devices got renamed and not all disks came back up (4 OSDs remained down). This is expected; I need to re-deploy the containers to adjust for the device name changes. Around this point PGs started peering and some got stuck waiting for one of the down OSDs. I don't understand why they didn't just remain active with 9 out of 10 OSDs. Up to the moment some of the OSDs came back up, all PGs were active. With min_size=9 I would expect all PGs to remain active as long as 9 out of the 10 hosts are untouched.
- redeploy docker containers
- all disks/OSDs come up, including the 4 OSDs from above
- inactive PGs complete peering and become active
- now I have a lot of degraded objects and undersized PGs, even though not a single OSD was removed
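In shell terms, the sequence on the host was roughly this (simplified, the container redeploy step is specific to our setup):

    systemctl stop docker     # stop all OSD containers on this host
    reboot                    # required, the OS is deployed via warewulf
    # after the reboot: redeploy the OSD containers so they pick up the new device names
    ceph -s                   # watch peering; this is where PGs went inactive
    ceph health detail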
I don't understand why I have degraded objects. I should just have misplaced objects:
HEALTH_ERR
22995992/145698909 objects misplaced (15.783%)
Degraded data redundancy: 5213734/145698909 objects degraded (3.578%), 208 pgs degraded, 208 pgs undersized
Degraded data redundancy (low space): 169 pgs backfill_toofull
Note: The backfill_toofull with low utilization (usage: 38 TiB used, 1.5 PiB / 1.5 PiB avail) is a known issue in ceph (https://tracker.ceph.com/issues/39555)
Also, I should be able to do whatever I want with 1 out of 10 hosts without losing data access. What could be the problem here?
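For what it's worth, I have been poking at individual PGs roughly like this (the PG id and pool name below are just placeholders):

    ceph pg dump_stuck undersized                 # list the undersized PGs
    ceph pg <pgid> query                          # check up/acting sets and peering state of one of them
    ceph osd pool get <ec-data-pool> min_size     # confirms min_size=9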
Questions summary:
Why does peering fail to keep all PGs active with 9 out of 10 OSDs up and in?
Why do undersized PGs arise even though all OSDs are up?
Why do degraded objects arise even though no OSD was removed?
Thanks!
=================
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14
I have one or two more stability issues that I'm trying to solve in a
cluster I inherited and just can't seem to figure out. One issue may be
the cause of the other.
This is a Jewel 10.2.11 cluster with ~760 x 10 TB HDDs and 5 GB journals on SSD.
When a large number of files are deleted from CephFS (and possibly
when leveldb compacts), an OSD will stop responding to heartbeats and
get marked down, then come back and start recovery, and then other OSDs
hit the same issue, until client load on the cluster eases up and it
settles down.
Is there a way to have leveldb compact more frequently, or to make it
come up for air more often so it can respond to heartbeats and process
some IO? I thought splitting PGs would help, but we are still seeing
the problem (previously ~20 PGs per OSD, now ~150). I still have some
space on the SSDs, so I could double or almost triple the journal size,
but I'm not sure whether that will help in this situation.
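For reference, these are the knobs I have been looking at; I have not verified that they are the right ones for Jewel, so please treat this as a guess rather than something tested:

    [osd]
        # compact the omap leveldb when the OSD starts (slower OSD start, smaller db afterwards)
        leveldb_compact_on_mount = true
        # give OSDs more slack before they are marked down while compaction blocks them (default 20)
        osd_heartbeat_grace = 60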
The other issue I'm seeing is that some IO just gets stuck while the
OSDs are getting marked down and coming back up across the cluster.
Thanks,
Robert LeBlanc
----------------
Robert LeBlanc
PGP Fingerprint 79A2 9CA4 6CC4 45DD A904 C70E E654 3BB2 FA62 B9F1
There is a lock on the object. If the file has not been closed by the writer, the others can only read it.
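If you need to enforce this from the applications, advisory locks also work over CephFS as far as I know; a minimal sketch with flock (the path is just an example):

    exec 9>>/mnt/cephfs/shared/data.txt        # open the shared file and keep fd 9 around
    if flock -n -x 9; then
        echo "first opener: got the write lock"
    else
        echo "write lock already held: open read-only"
    fi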
Regards
> On Oct 2, 2019 - 12:09 AM, khaled.atteya(a)gmail.com wrote:
>
>
> Hi,
>
> Is it possible to do this scenario :
> If one opens a file first, he will get read/write permission, and others will get read-only permission if they open the file after the first one.
>
> Thanks
>
The problem with lots of OSDs per node is that this usually means you
have too few nodes. It's perfectly fine to run 60 OSDs per node if you
have a total of 1000 OSDs or so.
But I've seen too many setups with 3-5 nodes where each node runs 60
OSDs, which makes no sense (and usually isn't even cheaper than more
nodes, especially once you consider the lost opportunity for running
erasure coding).
The usual backup cluster we are seeing is in the single-digit petabyte
range with about 12 to 24 disks per server running ~8+3 erasure
coding.
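For example, an 8+3 profile with a host failure domain needs at least k+m = 11 hosts, which is exactly what a 3-5 node cluster cannot provide (profile and pool names here are arbitrary):

    ceph osd erasure-code-profile set ec-8-3 k=8 m=3 crush-failure-domain=host
    ceph osd pool create backup 1024 1024 erasure ec-8-3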
Paul
--
Paul Emmerich
Looking for help with your Ceph cluster? Contact us at https://croit.io
croit GmbH
Freseniusstr. 31h
81247 München
www.croit.io
Tel: +49 89 1896585 90
On Wed, Oct 2, 2019 at 12:53 AM Darrell Enns <darrelle(a)knowledge.ca> wrote:
>
> Thanks Paul. I was speaking more about total OSDs and RAM, rather than a single node. However, I am considering building a cluster with a large OSD/node count. This would be for archival use, with reduced performance and availability requirements. What issues would you anticipate with a large OSD/node count? Is the concern just the large rebalance if a node fails and takes out a large portion of the OSDs at once?
On Tue, Oct 1, 2019 at 6:12 PM Darrell Enns <darrelle(a)knowledge.ca> wrote:
>
> The standard advice is “1GB RAM per 1TB of OSD”. Does this actually still hold with large OSDs on bluestore?
No
> Can it be reasonably reduced with tuning?
Yes
> From the docs, it looks like bluestore should target the “osd_memory_target” value by default. This is a fixed value (4GB by default), which does not depend on OSD size. So shouldn’t the advice really be “4GB per OSD”, rather than “1GB per TB”? Would it also be reasonable to reduce osd_memory_target for further RAM savings?
Yes
> For example, suppose we have 90 12TB OSD drives:
Please don't put 90 drives in one node, that's not a good idea in
99.9% of the use cases.
>
> “1GB per TB” rule: 1080GB RAM
> “4GB per OSD” rule: 360GB RAM
> “2GB per OSD” (osd_memory_target reduced to 2GB): 180GB RAM
>
>
>
> Those are some massively different RAM values. Perhaps the old advice was for filestore? Or there is something to consider beyond the bluestore memory target? What about when using very dense nodes (for example, 60 12TB OSDs on a single node)?
Keep in mind that it's only a target value; the OSD will use more during
recovery if you set a low value.
We usually set a target of 3 GB per OSD and recommend 4 GB of RAM per OSD.
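For example, a 3 GB target can be set like this (the central config store needs Mimic or newer, ceph.conf works everywhere):

    ceph config set osd osd_memory_target 3221225472
    # or per host in ceph.conf:
    [osd]
        osd_memory_target = 3221225472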
RAM saving trick: use fewer PGs than recommended.
Paul
--
Paul Emmerich
Looking for help with your Ceph cluster? Contact us at https://croit.io
croit GmbH
Freseniusstr. 31h
81247 München
www.croit.io
Tel: +49 89 1896585 90
Hi,
Is it possible to do this scenario :
If one opens a file first, he will get read/write permission, and others
will get read-only permission if they open the file after the first one.
Thanks
Hi,
I'm testing Ceph with VMware, using the ceph-iscsi gateway. I am reading the
documentation* and have doubts about some points:
- If I understood correctly, in general terms each VMFS datastore in VMware
will map to one RBD image (consequently, in one RBD image I will possibly
have many VMware disks). Is that correct?
- The documentation says: "gwcli requires a pool with the name rbd, so it
can store metadata like the iSCSI configuration". Part 4 of
"Configuration" says: "Add a RBD image with the name disk_1 in the pool
rbd". Is the use of the "rbd" pool here just an example, so I could use any
pool to store images, or does the pool have to be "rbd"?
In short: does gwcli require the "rbd" pool only for its metadata while I can
use any pool for images, or must I use the "rbd" pool for both images and
metadata? (The documentation step I mean is quoted below the questions.)
- How much memory does ceph-iscsi use? What is a good amount of RAM?
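The step from the documentation I am referring to is roughly this (quoting from memory, the size is from the example there):

    /> cd /disks
    /disks> create pool=rbd image=disk_1 size=90G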
Regards
Gesiel
* https://docs.ceph.com/docs/master/rbd/iscsi-target-cli/
Hi. I am new to Ceph but have set it up on my homelab and started using it. It seemed very good until I decided to try PG autoscaling.
After enabling autoscaling on 3 of my pools, the autoscaler tried(?) to reduce the number of PGs and the pools are now inaccessible.
I have tried to turn it off again, but no luck! Please help.
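What I tried for turning it off was roughly this (pool names replaced with a placeholder):

    ceph osd pool set <pool> pg_autoscale_mode off                    # for each of the 3 pools
    ceph config set global osd_pool_default_pg_autoscale_mode off     # so new pools do not get it either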
ceph status:
https://pastebin.com/88qNivJi (I do not know why it lists 4 pools, I have 3. Maybe one of the pools I created later and deleted is in limbo?)
ceph osd pool ls detail:
https://pastebin.com/HZLz6yHL
ceph health detail:
https://pastebin.com/Kqd2YMtm
I need to move a 6+2 EC pool from HDDs to SSDs while the storage remains accessible. All SSDs and HDDs are within the same failure domains. The crush rule in question is
rule sr-rbd-data-one {
        id 5
        type erasure
        min_size 3
        max_size 8
        step set_chooseleaf_tries 50
        step set_choose_tries 1000
        step take ServerRoom class hdd
        step chooseleaf indep 0 type host
        step emit
}
and I would be inclined just to change the entry "step take ServerRoom class hdd" to "step take ServerRoom class ssd" and wait for the dust to settle.
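Concretely, I was planning to do it by editing the crush map offline, roughly like this (untested, file names arbitrary):

    ceph osd getcrushmap -o crushmap.bin
    crushtool -d crushmap.bin -o crushmap.txt
    # edit crushmap.txt: "step take ServerRoom class hdd" -> "step take ServerRoom class ssd"
    crushtool -c crushmap.txt -o crushmap-new.bin
    # sanity-check the mappings for rule id 5 with 8 shards before injecting
    crushtool --test -i crushmap-new.bin --rule 5 --num-rep 8 --show-mappings | head
    ceph osd setcrushmap -i crushmap-new.bin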
However, this will almost certainly lead to all PGs being undersized and inaccessible, as all objects will then be in the wrong place. I noticed that this is not an issue with PGs created by replicated rules, as they can contain more OSDs than the replication factor while objects are moved. The same does not apply to EC rules. I suspect this is due to the setting "max_size 8", which does not allow more than 6+2=8 OSDs to be members of a PG.
What is the correct way to do what I need to do? Can I just set "max_size 16" and go? Will this work with EC rules? If not, what are my options?
Thanks!
=================
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14