Hi,
I have a 4-node cluster with 13x 15TB 7.2k RPM OSDs per node and around 300TB of data. I'm having issues with scrubs/deep scrubs not finishing in time; any tips for handling these operations with disks this large?
osd pool default size = 2
osd deep scrub interval = 2592000
osd scrub begin hour = 23
osd scrub end hour = 5
osd scrub sleep = 0.1
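For completeness, on Mimic or later the same settings can also be applied at runtime via the centralized config (just a sketch, mirroring the ceph.conf values above):
ceph config set osd osd_deep_scrub_interval 2592000
ceph config set osd osd_scrub_begin_hour 23
ceph config set osd osd_scrub_end_hour 5
ceph config set osd osd_scrub_sleep 0.1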
Cheers,
Kamil
On Mon, 25 May 2020 at 10:03, Marc Roos <M.Roos(a)f1-outsourcing.eu> wrote:
>
> I am interested. I always set the MTU to 9000. To be honest, I cannot
> imagine there is no optimization, since you get fewer interrupt requests
> and can move several times as much data per packet. Every time something
> is written about optimizing, the first thing mentioned is changing to
> MTU 9000, because it is a quick and easy win.
>
>
This sort of assumes you are not using interrupt-coalescing network cards,
because if you are, you can get something like hundreds of packets in one
single IRQ*, already checksummed and stripped, and recent cards
(10/25/40GbE) even deliver them into the CPU's L3 cache by the time you get
the interrupt, so whether they were 1500 or 9000 bytes on the wire doesn't
matter much by then. Even in the bad old days, when software handled all
parts of packet processing, many things (like mbuf allocations) were
optimized for 1500, so 9k packets just became a multiple of 1500-byte
chunks taken from a pool of network buffers anyhow.
I'm not trying to shoot down the 9k-vs-1500 idea, but running a benchmark
will give you far more facts than airing things that are easy to imagine
but don't actually have a huge impact, because hardware manufacturers
worked around issues like this a long time ago. If your tests say you win
x%, then by all means use it. I'm just not convinced that 10/25/40G
networks are so saturated that the frame overhead really matters as a
percentage of packet size, and the cards offload most of the work of
stripping that overhead, so the computer never notices it was there.
*) SysKonnect cards had this around 2003, just to give a sense of what
"modern ethernet cards" means in this context.
--
May the most significant bit of your life be positive.
Hi all,
I have a Nautilus cluster mostly used for RBD (openstack) and CephFS.
I have been using the rbd perf command from time to time, but it doesn't
work anymore. I have tried several images in different pools, but there's
no output at all except for:
client:~ $ rbd perf image iostat --format json
volumes-ssd/volume-358cd6c5-6fb0-424f-93d9-990ea1963472
rbd: waiting for initial image stats
It never updates, no matter how long I wait. It stopped working while we
were still on version 14.2.3; last Friday we updated to 14.2.9, but it
still doesn't work.
The only relevant mgr log output I'm seeing in debug mode (debug_mgr
5/5) is this:
---snip---
2020-05-25 10:53:07.072 7fedd5f59700 4 mgr.server _handle_command decoded 4
2020-05-25 10:53:07.072 7fedd5f59700 4 mgr.server _handle_command
prefix=rbd perf image stats
2020-05-25 10:53:07.072 7fedd5f59700 0 log_channel(audit) log [DBG] :
from='client.710971242 v1:192.168.103.13:0/693257394'
entity='client.admin' cmd=[
2020-05-25 10:53:07.072 7fedd5f59700 0 log_channel(audit) log [DBG] : {
2020-05-25 10:53:07.072 7fedd5f59700 0 log_channel(audit) log [DBG] :
"prefix": "rbd perf image stats",
2020-05-25 10:53:07.072 7fedd5f59700 0 log_channel(audit) log [DBG] :
"pool_spec":
"volumes-ssd/volume-358cd6c5-6fb0-424f-93d9-990ea1963472",
2020-05-25 10:53:07.072 7fedd5f59700 0 log_channel(audit) log [DBG] :
"sort_by": "write_ops",
2020-05-25 10:53:07.072 7fedd5f59700 0 log_channel(audit) log [DBG] :
"format": "json"
2020-05-25 10:53:07.072 7fedd5f59700 0 log_channel(audit) log [DBG] :
}"]: dispatch
2020-05-25 10:53:07.072 7fedd675a700 4 mgr.server reply reply success
2020-05-25 10:53:07.104 7fedd5f59700 4 mgr.server handle_report from
0x555e60f31200 osd,33
2020-05-25 10:53:07.172 7fedd5f59700 4 mgr.server handle_report from
0x555e5c6c6d80 osd,15
2020-05-25 10:53:07.224 7fedf10c1700 4 mgr send_beacon active
---snip---
What I'm also wondering about is that "format": "json" doesn't change even
if I run with --format plain or xml.
Does anyone experience the same? The missing output also applies to
rbd perf image iotop.
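For what it's worth, one generic step I still plan to try (my own guess,
not something confirmed anywhere) is to check the rbd_support module and
fail over to a standby mgr to reset stats collection:
ceph mgr module ls | grep rbd_support   # should appear in always_on_modules
ceph mgr fail <active-mgr-name>         # force a standby to take over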
Any hints are appreciated.
Regards,
Eugen
Hi Manuel,
rgw_gc_obj_min_wait -- yes, this is how you control how long rgw waits
before removing the stripes of deleted objects.
The following are more about gc performance and its share of available iops:
rgw_gc_processor_max_time -- controls how long gc runs once scheduled;
a large value might be 3600
rgw_gc_processor_period -- sets the gc cycle; smaller is more frequent
If you want to make gc more aggressive while it is running, set the
following (which double the defaults, and can be raised further):
rgw_gc_max_concurrent_io = 20
rgw_gc_max_trim_chunk = 32
If you want to increase gc's fraction of total rgw i/o, increase these
(mostly concurrent_io).
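For example, via the centralized config it might look something like this
(values are illustrative, and a radosgw restart may be needed for some of
them to take effect):
ceph config set client.rgw rgw_gc_processor_max_time 3600
ceph config set client.rgw rgw_gc_processor_period 600
ceph config set client.rgw rgw_gc_max_concurrent_io 20
ceph config set client.rgw rgw_gc_max_trim_chunk 32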
regards,
Matt
On Sun, May 24, 2020 at 4:02 PM EDH - Manuel Rios
<mriosfer(a)easydatahost.com> wrote:
>
> Hi,
>
> I'm looking for any experience optimizing the garbage collector with the following configs:
>
> global advanced rgw_gc_obj_min_wait
> global advanced rgw_gc_processor_max_time
> global advanced rgw_gc_processor_period
>
> By default gc expires objects within 2 hours; we're looking to set expiry to 10 minutes, as our S3 cluster gets heavy uploads and deletes.
>
> Are those params usable? For us it doesn't make sense to keep deleted objects in gc for 2 hours.
>
> Regards
> Manuel
--
Matt Benjamin
Red Hat, Inc.
315 West Huron Street, Suite 140A
Ann Arbor, Michigan 48103
http://www.redhat.com/en/technologies/storage
tel. 734-821-5101
fax. 734-769-8938
cel. 734-216-5309
Hi,
Since my upgrade from 15.2.1 to 15.2.2, I've been getting this error
message in the "Object Gateway" section of the dashboard:
RGW REST API failed request with status code 403
(b'{"Code":"InvalidAccessKeyId","RequestId":"tx000000000000000000017-005ecac06c'
b'-e349-eu-west-1","HostId":"e349-eu-west-1-default"}')
I tried changing my secret key and access key, without success. I took a
tcpdump and didn't see anything unusual, like JSON escape characters.
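For reference, the key reset I attempted looked something like this (the
key values and user id are placeholders):
ceph dashboard set-rgw-api-access-key <access_key>
ceph dashboard set-rgw-api-secret-key <secret_key>
radosgw-admin user info --uid=<dashboard-user>   # to verify the keys match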
Has anybody had the same issue?
Regards.
Yep, my fault, I meant replication = 3...
> > but aren't PGs checksummed, so that from the remaining PG (given its
> > checksum is right) two new copies could be created?
>
> Assuming again 3R on 5 nodes, failure domain of host, if 2 nodes go down, there will be 1/3 copies available. Normally a 3R pool has min_size set to 2.
>
> You can set min_size to 1 temporarily, then those PGs will become active and copies will be created to restore redundancy, but if that remaining OSD is damaged, if there’s a DIMM flake, a cosmic ray, if the wrong OSD crashes or restarts at the wrong time, you can find yourself without the most recent copy of data and be unable to recover. It’s Russian Roulette.
I see, but wouldn't ceph try to recreate redundancy on its own (unless I
explicitly tell it not to)? And if the I/O load on the cluster isn't too
high, and disk speed and network connectivity are good, wouldn't it recover
fairly quickly into a healthy redundant state?
Anyhow, I'm not planning on crashing two nodes ;-) I just wanted to get a
feeling for how much more secure/robust a five-node setup is compared to a
four-node one.
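For reference, the temporary min_size change quoted above would look
something like this (the pool name is just an example):
ceph osd pool set mypool min_size 1   # allow I/O and recovery from one copy
# ...wait for recovery to restore redundancy...
ceph osd pool set mypool min_size 2   # back to the safe setting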
Hello,
I came across a section of the documentation that I don't quite
understand. The section about inconsistent PGs says that if one of the
shards listed in `rados list-inconsistent-obj` has a read_error, the disk
is probably bad.
Quote from documentation:
https://docs.ceph.com/docs/master/rados/troubleshooting/troubleshooting-pg/…
`If read_error is listed in the errors attribute of a shard, the
inconsistency is likely due to disk errors. You might want to check your
disk used by that OSD.`
I determined that the disk is bad by looking at the output of smartctl. I
would think that replacing the disk, i.e. removing the OSD from the cluster
and allowing the cluster to recover, would fix this inconsistency without
having to run `ceph pg repair`.
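(For context, the check was roughly the following; the device name is just
an example:
smartctl -a /dev/sdX
and I looked for growing Reallocated_Sector_Ct, Current_Pending_Sector, or
Offline_Uncorrectable counts.)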
Can I just replace the OSD and have the inconsistency resolved by the
recovery? Or would it be better to run `ceph pg repair` first and then
replace the OSD associated with the bad disk?
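In other words, would something like the following sketch be enough on its
own (OSD and PG ids are placeholders)?
ceph osd out <osd-id>    # let data rebalance off the bad disk
# wait for recovery, then remove and replace the OSD
ceph pg repair <pg-id>   # ...or is this still needed first?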
Thanks!