Hello,
if I understand correctly:
if we upgrade a running Nautilus cluster to Octopus, we will have
downtime while the MDS daemons are updated.
Is this correct?
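My understanding of the documented procedure is roughly the following (only a sketch from my reading of the upgrade notes; <fs_name> and <original_value> are placeholders):

ceph status                          # note the current number of active MDS ranks
ceph fs set <fs_name> max_mds 1      # reduce to a single active MDS before upgrading
# wait until only rank 0 remains active, upgrade/restart that MDS and the standbys,
# then restore the original number of ranks:
ceph fs set <fs_name> max_mds <original_value>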
Mit freundlichen Grüßen / Kind regards
Andreas Schiefer
Leiter Systemadministration / Head of System Administration
---
HOME OF LOYALTY
CRM- & Customer Loyalty Solution
by UW Service
Gesellschaft für Direktwerbung und Marketingberatung mbH
Alter Deutzer Postweg 221
51107 Koeln (Rath/Heumar)
Deutschland
Telefon : +49 221 98696 0
Telefax : +49 221 98696 5222
info(a)uw-service.de
www.hooloy.de
Amtsgericht Koeln HRB 24 768
UST-ID: DE 164 191 706
Geschäftsführer: Ralf Heim
---
FYI. Hope to see some awesome CephFS submissions for our virtual IO500 BoF!
Thanks,
John
---------- Forwarded message ---------
From: committee--- via IO-500 <io-500(a)vi4io.org>
Date: Fri, May 22, 2020 at 1:53 PM
Subject: [IO-500] IO500 ISC20 Call for Submission
To: <io-500(a)vi4io.org>
*Deadline*: 08 June 2020 AoE
The IO500 <http://io500.org/> is now accepting and encouraging submissions
for the upcoming 6th IO500 list. Once again, we are also accepting
submissions to the 10 Node Challenge to encourage the submission of small
scale results. The new ranked lists will be announced via live-stream at a
virtual session. We hope to see many new results.
The benchmark suite is designed to be easy to run and the community has
multiple active support channels to help with any questions. Please note
that submissions of all sizes are welcome; the site has customizable
sorting, so it is possible, for example, to submit on a small system and
still get a very good per-client score. Additionally, the list is about much
more than just the raw rank; all submissions help the community by
collecting and publishing a wider corpus of data. More details below.
Following the success of the Top500 in collecting and analyzing historical
trends in supercomputer technology and evolution, the IO500
<http://io500.org/> was created in 2017, published its first list at SC17,
and has grown exponentially since then. The need for such an initiative has
long been known within High-Performance Computing; however, defining
appropriate benchmarks had long been challenging. Despite this challenge,
the community, after long and spirited discussion, finally reached
consensus on a suite of benchmarks and a metric for resolving the scores
into a single ranking.
The multi-fold goals of the benchmark suite are as follows:
1. Maximizing simplicity in running the benchmark suite
2. Encouraging optimization and documentation of tuning parameters for
performance
3. Allowing submitters to highlight their “hero run” performance numbers
4. Forcing submitters to simultaneously report performance for
challenging IO patterns.
Specifically, the benchmark suite includes a hero-run of both IOR and mdtest
configured however possible to maximize performance and establish an
upper-bound for performance. It also includes an IOR and mdtest run with
highly constrained parameters forcing a difficult usage pattern in an
attempt to determine a lower-bound. Finally, it includes a namespace search
as this has been determined to be a highly sought-after feature in HPC
storage systems that has historically not been well-measured. Submitters
are encouraged to share their tuning insights for publication.
The goals of the community are also multi-fold:
1. Gather historical data for the sake of analysis and to aid
predictions of storage futures
2. Collect tuning data to share valuable performance optimizations
across the community
3. Encourage vendors and designers to optimize for workloads beyond
“hero runs”
4. Establish bounded expectations for users, procurers, and
administrators
*10 Node I/O Challenge*
The 10 Node Challenge is conducted using the regular IO500 benchmark,
however, with the rule that exactly *10 client nodes* must be used to run
the benchmark. You may use any shared storage with, e.g., any number of
servers. When submitting for the IO500 list, you can opt in to
“Participate in the 10 compute node challenge only”, in which case we will not
include the results in the ranked list. Other 10-node submissions
will be included in the full list and in the ranked list. We will announce
the result in a separate derived list and in the full list but not on the
ranked IO500 list at https://io500.org/.
This information and rules for ISC20 submissions are available here:
https://www.vi4io.org/io500/rules/submission
Thanks,
The IO500 Committee
_______________________________________________
IO-500 mailing list
IO-500(a)vi4io.org
https://www.vi4io.org/mailman/listinfo/io-500
Hello,
we are currently experiencing problems with ceph pg repair not working
on Ceph Nautilus 14.2.8.
ceph health detail is showing us an inconsistent pg:
[aaaaax-yyyy ~]# ceph health detail
HEALTH_ERR 1 scrub errors; Possible data damage: 1 pg inconsistent
OSD_SCRUB_ERRORS 1 scrub errors
PG_DAMAGED Possible data damage: 1 pg inconsistent
pg 18.19a is active+clean+inconsistent+snaptrim_wait, acting
[21,15,39,18,0,9]
When we try to repair it, nothing happens:
[aaaaax-yyyy ~]# ceph pg repair 18.19a
instructing pg 18.19as0 on osd.21 to repair
There are no new entries in OSD 21's log file.
We have no trouble repairing pgs in our other clusters, so I assume it
might be related to this cluster using erasure coding, but this is just
a wild guess.
I found a similar problem in this mailing list -
http://lists.ceph.com/pipermail/ceph-users-ceph.com/2018-April/026304.html
Unfortunately the solution of waiting more than a week until it fixes
itself isn't quite satisfying.
Is there anyone who has had similar issues and knows how to repair these
inconsistent pgs or what is causing the delay?
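For reference, these are the commands I was planning to run next to gather more detail (just a sketch, nothing conclusive yet):

rados list-inconsistent-obj 18.19a --format=json-pretty   # show which objects/shards are inconsistent
ceph pg deep-scrub 18.19a                                 # force a fresh deep scrub
ceph pg repair 18.19a                                     # retry the repair afterwards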
--
Kind regards
Daniel Aberger
Your Profihost Team
-------------------------------
Profihost AG
Expo Plaza 1
30539 Hannover
Deutschland
Tel.: +49 (511) 5151 8181 | Fax.: +49 (511) 5151 8282
URL: http://www.profihost.com | E-Mail: info(a)profihost.com
Sitz der Gesellschaft: Hannover, USt-IdNr. DE813460827
Registergericht: Amtsgericht Hannover, Register-Nr.: HRB 202350
Vorstand: Cristoph Bluhm, Sebastian Bluhm, Stefan Priebe
Aufsichtsrat: Prof. Dr. iur. Winfried Huck (Vorsitzender)
Hello all,
I hope you can help me with some very strange problems which arose
suddenly today. I tried to search, including in this mailing list, but could
not find anything relevant.
At some point today, without any action from my side, I noticed some
OSDs in my production cluster would go down and never come up.
I am on Luminous 12.2.13, CentOS7, kernel 3.10: my setup is non-standard
as OSD disks are served off a SAN (which is for sure OK now, although I
cannot exclude some glitch).
I tried rebooting the OSD servers a few times, ran "activate --all", and added
bluestore_ignore_data_csum=true to the [osd] section in ceph.conf...
the number of "down" OSDs changed for a while but now seems rather stable.
There are actually two classes of problems (bit more details right below):
- ERROR: osd init failed: (5) Input/output error
- failed to load OSD map for epoch 141282, got 0 bytes
*First problem*
This affects 50 OSDs (all disks of this kind, on all but one server):
these OSDs are reserved for object storage but I am not yet using them
so I may in principle recreate them. But I would be interested in
understanding what the problem is, and in learning how to solve it for future
reference.
Here is what I see in logs:
.....
2020-05-21 21:17:48.661348 7fa2e9a95ec0 1 bluefs add_block_device bdev
1 path /var/lib/ceph/osd/cephpa1-72/block size 14.5TiB
2020-05-21 21:17:48.661428 7fa2e9a95ec0 1 bluefs mount
2020-05-21 21:17:48.662040 7fa2e9a95ec0 1 bluefs _init_alloc id 1
alloc_size 0x10000 size 0xe83a3400000
2020-05-21 21:52:43.858464 7fa2e9a95ec0 -1 bluefs mount failed to replay
log: (5) Input/output error
2020-05-21 21:52:43.858589 7fa2e9a95ec0 1 fbmap_alloc 0x55c6bba92e00
shutdown
2020-05-21 21:52:43.858728 7fa2e9a95ec0 -1
bluestore(/var/lib/ceph/osd/cephpa1-72) _open_db failed bluefs mount:
(5) Input/output error
2020-05-21 21:52:43.858790 7fa2e9a95ec0 1 bdev(0x55c6bbdb6600
/var/lib/ceph/osd/cephpa1-72/block) close
2020-05-21 21:52:44.103536 7fa2e9a95ec0 1 bdev(0x55c6bbdb8600
/var/lib/ceph/osd/cephpa1-72/block) close
2020-05-21 21:52:44.352899 7fa2e9a95ec0 -1 osd.72 0 OSD:init: unable to
mount object store
2020-05-21 21:52:44.352956 7fa2e9a95ec0 -1 ** ERROR: osd init
failed: (5) Input/output error
*Second problem*
This affects 11 OSDs, which I use *in production* for Cinder block
storage: looks like all PGs for this pool are currently OK.
Here is an excerpt from the logs.
.....
-5> 2020-05-21 20:52:06.756469 7fd2ccc19ec0 0 _get_class not
permitted to load kvs
-4> 2020-05-21 20:52:06.759686 7fd2ccc19ec0 1 <cls>
/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/12.2.13/rpm/el7/BUILD/ceph-12.2.13/src/cls/rgw/cls_rgw.cc:3869:
Loaded rgw class!
-3> 2020-05-21 20:52:06.760021 7fd2ccc19ec0 1 <cls>
/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/12.2.13/rpm/el7/BUILD/ceph-12.2.13/src/cls/log/cls_log.cc:299:
Loaded log class!
-2> 2020-05-21 20:52:06.760730 7fd2ccc19ec0 1 <cls>
/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/12.2.13/rpm/el7/BUILD/ceph-12.2.13/src/cls/replica_log/cls_replica_log.cc:135:
Loaded replica log class!
-1> 2020-05-21 20:52:06.760873 7fd2ccc19ec0 -1 osd.63 0 failed to
load OSD map for epoch 141282, got 0 bytes
0> 2020-05-21 20:52:06.763277 7fd2ccc19ec0 -1
/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/12.2.13/rpm/el7/BUILD/ceph-12.2.13/src/osd/OSD.h:
In function 'OSDMapRef OSDService::get_map(epoch_t)' thread 7fd2ccc19ec0
time 2020-05-21 20:52:06.760916
/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/12.2.13/rpm/el7/BUILD/ceph-12.2.13/src/osd/OSD.h:
994: FAILED assert(ret)
Does anyone have any idea how I could fix these problems, or what I could
do to try to shed some light on them? Also, what caused them, and is there
some magic configuration flag I could use to protect my cluster?
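In case it is useful, this is what I was planning to try next on one of the affected OSDs (only a sketch, I have not run it yet; the path is just the one from my logs):

# with the OSD daemon stopped:
ceph-bluestore-tool fsck --path /var/lib/ceph/osd/cephpa1-72
ceph-bluestore-tool repair --path /var/lib/ceph/osd/cephpa1-72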
Thanks a lot for your help!
Fulvio
Hi,
We are experiencing random and relatively high latency spikes (around 0.5-10 sec)
in our Ceph cluster, which consists of 6 OSD nodes; each OSD node has 6 OSDs.
Each OSD is built from one spinning disk and two NVMe devices:
we use a bcache device as the OSD back end (the HDD mixed with an NVMe
partition as the caching device) and one NVMe partition for the journal.
This synthetic command can be used to check IO and latency:
rados bench -p rbd 10 write -b 4000 -t 64
With these parameters we often get about 1.5 sec or higher for maximum
latency.
We cannot decide whether our cluster is misconfigured or whether this is just
natural Ceph behavior.
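To see whether a single OSD is the outlier, we also plan to watch per-OSD latencies while the bench is running (just a sketch):

watch -n 1 ceph osd perf    # per-OSD commit/apply latencies during the bench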
Any help, suggestion would be appreciated.
Regards,
Bence
--
--Szabo Bence
--<szabo.bence(a)gmail.com>
hi there,
we are seeing osds occasionally getting kicked out of our cluster after
having been marked down by other osds. most of the time, the affected
osd rejoins the cluster after ~5 minutes, but sometimes this takes
much longer. during that time, the osd seems to run just fine.
this happens more often than we'd like it to … is "OSD::osd_op_tp thread
… had timed out" a real error condition or just a warning about certain
operations on the osd taking a long time? i already set
osd_op_thread_timeout to 120 (it was 60 before; the default should be 15
according to the docs), but apparently that doesn't make any difference.
are there any other settings that prevent this kind of behaviour?
mon_osd_report_timeout maybe, as in frank schilder's case?
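for completeness, this is roughly how i applied the timeout change mentioned above at runtime (a sketch; the values are just what i tried, not recommendations):

ceph config set osd osd_op_thread_timeout 120
# possibly also relevant, not tried yet (defaults in parentheses):
#   osd_op_thread_suicide_timeout (150), mon_osd_report_timeout (900)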
the cluster runs nautilus 14.2.7, osds are backed by spinning platters
with their rocksdb and wals on nvmes. in general, there seems to be the
following pattern:
- it happens under moderate to heavy load, e.g. while creating pools with
a lot of pgs
- the affected osd logs a lot of:
"heartbeat_map is_healthy 'OSD::osd_op_tp thread ${thread-id}' had timed
out after 60"
… and finally something along the lines of:
May 18 21:12:34 ceph-osd-05 ceph-osd[2356578]: 2020-05-18 21:12:34.211
7fb25cc80700 0 bluestore(/var/lib/ceph/osd/ceph-293) log_latency_fn
slow operation observed for _collection_list, latency = 96.337s, lat =
96s cid =2.0s2_head start GHMAX end GHMAX max 30
May 18 21:12:34 ceph-osd-05 ceph-osd[2356578]: 2020-05-18 21:12:34.219
7fb25cc80700 1 heartbeat_map clear_timeout 'OSD::osd_op_tp thread
0x7fb25cc80700' had timed out after 60
May 18 21:12:34 ceph-osd-05 ceph-osd[2356578]: osd.293 osd.293 2 :
Monitor daemon marked osd.293 down, but it is still running
May 18 21:12:34 ceph-osd-05 ceph-osd[2356578]: 2020-05-18 21:12:34.315
7fb267c96700 0 log_channel(cluster) log [WRN] : Monitor daemon marked
osd.293 down, but it is still running
May 18 21:12:34 ceph-osd-05 ceph-osd[2356578]: 2020-05-18 21:12:34.315
7fb267c96700 0 log_channel(cluster) do_log log to syslog
May 18 21:12:34 ceph-osd-05 ceph-osd[2356578]: 2020-05-18 21:12:34.315
7fb267c96700 0 log_channel(cluster) log [DBG] : map e646639 wrongly
marked me down at e646638
May 18 21:12:34 ceph-osd-05 ceph-osd[2356578]: 2020-05-18 21:12:34.315
7fb267c96700 0 log_channel(cluster) do_log log to syslog
May 18 21:12:34 ceph-osd-05 ceph-osd[2356578]: 2020-05-18 21:12:34.371
7fb272cac700 -1 osd.293 646639 set_numa_affinity unable to identify
public interface 'br-bond0' numa node: (2) No such file or directory
May 18 21:12:34 ceph-osd-05 ceph-osd[2356578]: 2020-05-18 21:12:34.371
7fb272cac700 -1 osd.293 646639 set_numa_affinity unable to identify
public interface 'br-bond0' numa node: (2) No such file or directory
- meanwhile on the mon:
2020-05-18 21:12:16.440 7f08f7933700 0 mon.ceph-mon-01@0(leader) e4
handle_command mon_command({"prefix": "status"} v 0) v1
entity='client.admin' cmd=[{"prefix": "status"}]: dispatch
2020-05-18 21:12:18.436 7f08f7933700 0 log_channel(cluster) log [DBG] :
osd.293 reported failed by osd.101
2020-05-18 21:12:18.848 7f08f7933700 0 log_channel(cluster) log [DBG] :
osd.293 reported failed by osd.533
[… lots of these from various osds]
2020-05-18 21:12:24.992 7f08f7933700 0 log_channel(cluster) log [DBG] :
osd.293 reported failed by osd.421
2020-05-18 21:12:26.124 7f08f7933700 0 log_channel(cluster) log [DBG] :
osd.293 reported failed by osd.504
2020-05-18 21:12:26.132 7f08f7933700 0 log_channel(cluster) log [INF] :
osd.293 failed (root=tuberlin,datacenter=barz,host=ceph-osd-05) (16
reporters from different host after 27.137527 >= grace 26.361774)
2020-05-18 21:12:26.236 7f08fa138700 0 log_channel(cluster) log [WRN] :
Health check failed: 1 osds down (OSD_DOWN)
2020-05-18 21:12:26.280 7f08f6130700 0 log_channel(cluster) log [DBG] :
osdmap e646638: 604 total, 603 up, 604 in
2020-05-18 21:12:27.336 7f08f6130700 0 log_channel(cluster) log [DBG] :
osdmap e646639: 604 total, 603 up, 604 in
2020-05-18 21:12:28.248 7f08fa138700 0 log_channel(cluster) log [WRN] :
Health check failed: Reduced data availability: 17 pgs peering
(PG_AVAILABILITY)
2020-05-18 21:12:29.392 7f08fa138700 0 log_channel(cluster) log [WRN] :
Health check failed: Degraded data redundancy: 80091/181232010 objects
degraded (0.044%), 18 pgs degraded (PG_DEGRADED)
2020-05-18 21:12:33.927 7f08fa138700 0 log_channel(cluster) log [INF] :
Health check cleared: PG_AVAILABILITY (was: Reduced data availability: 1
pg inactive, 22 pgs peering)
2020-05-18 21:12:35.095 7f08fa138700 0 log_channel(cluster) log [INF] :
Health check cleared: OSD_DOWN (was: 1 osds down)
2020-05-18 21:12:35.119 7f08f6130700 0 log_channel(cluster) log [INF] :
osd.293 [v2:172.28.9.26:6936/2356578,v1:172.28.9.26:6937/2356578] boot
2020-05-18 21:12:35.119 7f08f6130700 0 log_channel(cluster) log [DBG] :
osdmap e646640: 604 total, 604 up, 604 in
2020-05-18 21:12:36.175 7f08f6130700 0 log_channel(cluster) log [DBG] :
osdmap e646641: 604 total, 604 up, 604 in
i can happily provide more detailed logs, if that helps.
thank you very much & with kind regards,
thoralf.
Don't optimize stuff without benchmarking *before and after*, and don't apply
random tuning tips from the Internet without benchmarking them.
My experience with jumbo frames: 3% performance gain, on an NVMe-only setup with
a 100 Gbit/s network.
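For a quick before/after comparison, something as simple as this is usually enough (the parameters are only an example, match them to your workload):

iperf3 -c <other-osd-node>                  # raw network throughput
rados bench -p <test-pool> 60 write -t 16   # run the same command before and after the change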
Paul
--
Paul Emmerich
Looking for help with your Ceph cluster? Contact us at https://croit.io
croit GmbH
Freseniusstr. 31h
81247 München
www.croit.io
Tel: +49 89 1896585 90
On Tue, May 26, 2020 at 7:02 PM Marc Roos <M.Roos(a)f1-outsourcing.eu> wrote:
>
>
> Look what I have found!!! :)
> https://ceph.com/geen-categorie/ceph-loves-jumbo-frames/
>
>
>
> -----Original Message-----
> From: Anthony D'Atri [mailto:anthony.datri@gmail.com]
> Sent: maandag 25 mei 2020 22:12
> To: Marc Roos
> Cc: kdhall; martin.verges; sstkadu; amudhan83; ceph-users; doustar
> Subject: Re: [ceph-users] Re: [External Email] Re: Ceph Nautius not
> working after setting MTU 9000
>
> Quick and easy depends on your network infrastructure. Sometimes it is
> difficult or impossible to retrofit a live cluster without disruption.
>
>
> > On May 25, 2020, at 1:03 AM, Marc Roos <M.Roos(a)f1-outsourcing.eu>
> wrote:
> >
> >
> > I am interested. I am always setting the MTU to 9000. To be honest, I
> > cannot imagine there is no optimization, since you have fewer interrupt
> > requests and you can move x times as much data per packet. Every time
> > something is written about optimizing, the first thing mentioned is
> > changing to MTU 9000, because it is a quick and easy win.
> >
> >
> >
> >
> > -----Original Message-----
> > From: Dave Hall [mailto:kdhall@binghamton.edu]
> > Sent: maandag 25 mei 2020 5:11
> > To: Martin Verges; Suresh Rama
> > Cc: Amudhan P; Khodayar Doustar; ceph-users
> > Subject: [ceph-users] Re: [External Email] Re: Ceph Nautius not
> > working after setting MTU 9000
> >
> > All,
> >
> > Regarding Martin's observations about Jumbo Frames....
> >
> > I have recently been gathering some notes from various internet
> > sources regarding Linux network performance, and Linux performance in
> > general, to be applied to a Ceph cluster I manage but also to the rest
>
> > of the Linux server farm I'm responsible for.
> >
> > In short, enabling Jumbo Frames without also tuning a number of other
> > kernel and NIC attributes will not provide the performance increases
> > we'd like to see. I have not yet had a chance to go through the rest
> > of the testing I'd like to do, but I can confirm (via iperf3) that
> > only enabling Jumbo Frames didn't make a significant difference.
> >
> > Some of the other attributes I'm referring to are incoming and
> > outgoing buffer sizes at the NIC, IP, and TCP levels, interrupt
> > coalescing, NIC offload functions that should or shouldn't be turned
> > on, packet queuing disciplines (tc), the best choice of TCP slow-start
>
> > algorithms, and other TCP features and attributes.
> >
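> > For example, these are the kinds of knobs I mean (the values here are just
> > placeholders from my notes, not recommendations):
> >
> > sysctl -w net.core.rmem_max=268435456         # socket receive buffer ceiling
> > sysctl -w net.core.wmem_max=268435456         # socket send buffer ceiling
> > sysctl -w net.ipv4.tcp_congestion_control=bbr # TCP congestion control choice
> > ethtool -C eth0 rx-usecs 50                   # interrupt coalescing
> > ethtool -K eth0 tso on gro on                 # NIC offload functions
> >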
> > The most off-beat item I saw was something about adding IPTABLES rules
>
> > to bypass CONNTRACK table lookups.
> >
> > In order to do anything meaningful to assess the effect of all of
> > these settings I'd like to figure out how to set them all via Ansible
> > - so more to learn before I can give opinions.
> >
> > --> If anybody has added this type of configuration to Ceph Ansible,
> > I'd be glad for some pointers.
> >
> > I have started to compile a document containing my notes. It's rough,
>
> > but I'd be glad to share if anybody is interested.
> >
> > -Dave
> >
> > Dave Hall
> > Binghamton University
> >
> >> On 5/24/2020 12:29 PM, Martin Verges wrote:
> >>
> >> Just save yourself the trouble. You won't have any real benefit from
> >> MTU 9000. It has some smallish benefit, but it is not worth the effort,
> >> problems, and loss of reliability for most environments.
> >> Try it yourself and do some benchmarks, especially with your regular
> >> workload on the cluster (not the maximum peak performance), then drop
> > the
> >> MTU to default ;).
> >>
> >> Please if anyone has other real world benchmarks showing huge
> > differences
> >> in regular Ceph clusters, please feel free to post it here.
> >>
> >> --
> >> Martin Verges
> >> Managing director
> >>
> >> Mobile: +49 174 9335695
> >> E-Mail: martin.verges(a)croit.io
> >> Chat: https://t.me/MartinVerges
> >>
> >> croit GmbH, Freseniusstr. 31h, 81247 Munich
> >> CEO: Martin Verges - VAT-ID: DE310638492 Com. register: Amtsgericht
> >> Munich HRB 231263
> >>
> >> Web: https://croit.io
> >> YouTube: https://goo.gl/PGE1Bx
> >>
> >>
> >>> Am So., 24. Mai 2020 um 15:54 Uhr schrieb Suresh Rama
> >> <sstkadu(a)gmail.com>:
> >>
> >>> Ping with 9000 MTU won't get a response, as I said; the ping payload
> >>> should be 8972 bytes (9000 minus the 20-byte IP and 8-byte ICMP headers).
> >>> Glad it is working, but you should know what happened so you can avoid
> >>> this issue later.
> >>>
> >>>> On Sun, May 24, 2020, 3:04 AM Amudhan P <amudhan83(a)gmail.com>
> wrote:
> >>>
> >>>> No, ping with MTU size 9000 didn't work.
> >>>>
> >>>> On Sun, May 24, 2020 at 12:26 PM Khodayar Doustar
> > <doustar(a)rayanexon.ir>
> >>>> wrote:
> >>>>
> >>>>> Does your ping work or not?
> >>>>>
> >>>>>
> >>>>> On Sun, May 24, 2020 at 6:53 AM Amudhan P <amudhan83(a)gmail.com>
> > wrote:
> >>>>>
> >>>>>> Yes, I have set setting on the switch side also.
> >>>>>>
> >>>>>> On Sat 23 May, 2020, 6:47 PM Khodayar Doustar,
> > <doustar(a)rayanexon.ir>
> >>>>>> wrote:
> >>>>>>
> >>>>>>> The problem should be with the network. When you change the MTU it
> >>>>>>> should be changed all over the network; every single hop on your
> >>>>>>> network should speak and accept 9000 MTU packets. You can check it on
> >>>>>>> your hosts with the "ifconfig" command, and there are also equivalent
> >>>>>>> commands for other network/security devices.
> >>>>>>>
> >>>>>>> If you have just one node which is not correctly configured for
> >>>>>>> MTU 9000, it won't work.
> >>>>>>>
> >>>>>>> On Sat, May 23, 2020 at 2:30 PM sinan(a)turka.nl <sinan(a)turka.nl>
> >>> wrote:
> >>>>>>>> Can the servers/nodes ping each other using large packet sizes? I
> >>>>>>>> guess not.
> >>>>>>>>
> >>>>>>>> Sinan Polat
> >>>>>>>>
> >>>>>>>>> Op 23 mei 2020 om 14:21 heeft Amudhan P <amudhan83(a)gmail.com>
> > het
> >>>>>>>> volgende geschreven:
> >>>>>>>>> In OSD logs "heartbeat_check: no reply from OSD"
> >>>>>>>>>
> >>>>>>>>>> On Sat, May 23, 2020 at 5:44 PM Amudhan P
> > <amudhan83(a)gmail.com>
> >>>>>>>> wrote:
> >>>>>>>>>> Hi,
> >>>>>>>>>>
> >>>>>>>>>> I have set Network switch with MTU size 9000 and also in my
> >>> netplan
> >>>>>>>>>> configuration.
> >>>>>>>>>>
> >>>>>>>>>> What else needs to be checked?
> >>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>>> On Sat, May 23, 2020 at 3:39 PM Wido den Hollander <
> >>> wido(a)42on.com
> >>>>>>>> wrote:
> >>>>>>>>>>>
> >>>>>>>>>>>
> >>>>>>>>>>>> On 5/23/20 12:02 PM, Amudhan P wrote:
> >>>>>>>>>>>> Hi,
> >>>>>>>>>>>>
> >>>>>>>>>>>> I am using Ceph Nautilus on Ubuntu 18.04, working fine with MTU
> >>>>>>>>>>>> size 1500 (the default); recently I tried to update the MTU size
> >>>>>>>>>>>> to 9000. After setting jumbo frames, running ceph -s times out.
> >>>>>>>>>>> Ceph can run just fine with an MTU of 9000. But there is
> >>> probably
> >>>>>>>>>>> something else wrong on the network which is causing this.
> >>>>>>>>>>>
> >>>>>>>>>>> Check the Jumbo Frames settings on all the switches as well
> > to
> >>>> make
> >>>>>>>> sure
> >>>>>>>>>>> they forward all the packets.
> >>>>>>>>>>>
> >>>>>>>>>>> This is definitely not a Ceph issue.
> >>>>>>>>>>>
> >>>>>>>>>>> Wido
> >>>>>>>>>>>
> >>>>>>>>>>>> regards
> >>>>>>>>>>>> Amudhan P
Hello,
I didn't find any information about the replication factor in the zone group. Assume I have three Ceph clusters with RADOS Gateway in one zonegroup, each with replica size 3. How many replicas of an object will I get in total?
Is it possible to define several regions, each with several datacenters, and define a maximum replication factor at the region scope?
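For context, this is how I was going to inspect the current setup on each cluster (the pool name below is just the default one from a fresh RGW deployment, treat it as an example):

radosgw-admin zonegroup get                        # zonegroup/zone layout
ceph osd pool get default.rgw.buckets.data size    # replica count of the RGW data pool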
Alexander Vysochin
Senior Developer
Technical Innovation and Infrastructure - Development and Deployment of Business Service Platforms
Technical Innovation and Infrastructure, PJSC MegaFon
________________________________
The information contained in this communication is intended solely for the use of the individual or entity to whom it is addressed and others authorized to receive it. It may contain confidential or legally privileged information. The contents may not be disclosed or used by anyone other than the addressee. If you are not the intended recipient(s), any use, disclosure, copying, distribution or any action taken or omitted to be taken in reliance on it is prohibited and may be unlawful. If you have received this communication in error please notify us immediately by responding to this email and then delete the e-mail and all attachments and any copies thereof. This communication and attachments hereto are for the informational purposes only and do not create or modify any our obligations and shall not be deemed as our admission or confirmation of any circumstances. Such consequences may occur only after duly authorized persons have signed the originals of the agreements, acts or other documents. If any conditions are unacceptable for us we reserve a right to terminate negotiations in respect of any issues at any time. Entering into any correspondence with us you are considered to be informed on all that is stated above.
-----
Hello Everyone,
I have installed both Prometheus and Grafana on one of my manager nodes (Ubuntu 18.04) and have configured both according to the documentation. The Grafana dashboards are visible when visiting http://mon1:3000, but they show no data. Python errors are shown for the job_name: ceph target in Prometheus.
Below is my prometheus.yaml configuration:
global:
  scrape_interval: 5s

scrape_configs:
  - job_name: prometheus
    static_configs:
      - targets: ['localhost:9090']

  - job_name: 'ceph-exporter'
    static_configs:
      - targets: ['localhost:9100']
        labels:
          alias: ceph-exporter

  - job_name: 'ceph'
    static_configs:
      - targets: ['localhost:9283']
        labels:
          alias: ceph
And, these are the Python errors shown when I view the details of the targets in Prometheus (http://mon1:9090/targets)
Traceback (most recent call last):
  File "/usr/lib/python2.7/dist-packages/cherrypy/_cprequest.py", line 670, in respond
    response.body = self.handler()
  File "/usr/lib/python2.7/dist-packages/cherrypy/lib/encoding.py", line 220, in __call__
    self.body = self.oldhandler(*args, **kwargs)
  File "/usr/lib/python2.7/dist-packages/cherrypy/_cpdispatch.py", line 60, in __call__
    return self.callable(*self.args, **self.kwargs)
  File "/usr/share/ceph/mgr/prometheus/module.py", line 1060, in metrics
    return self._metrics(instance)
  File "/usr/share/ceph/mgr/prometheus/module.py", line 1074, in _metrics
    instance.collect_cache = instance.collect()
  File "/usr/share/ceph/mgr/prometheus/module.py", line 975, in collect
    self.get_rbd_stats()
  File "/usr/share/ceph/mgr/prometheus/module.py", line 734, in get_rbd_stats
    'rbd_stats_pools_refresh_interval', 300)
TypeError: unsupported operand type(s) for +: 'int' and 'str'
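One thing I have not yet tried (so please treat it as a guess rather than a known fix) is explicitly setting the module option that the traceback points at, in case it is currently stored as a string, and then restarting the module:

ceph config set mgr mgr/prometheus/rbd_stats_pools_refresh_interval 300
ceph mgr module disable prometheus
ceph mgr module enable prometheus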
If anyone has experienced this issue, and might have a solution, I would appreciate any assistance.
Thank you,
Todd