- ceph-users - lists.ceph.io

Nautilus to Octopus Upgrade mds without downtime

by Andreas Schiefer

Hello, if I understand correctly: if we upgrade from an running nautilus cluster to octopus we have a downtime on an update of MDS. Is this correct? Mit freundlichen Grüßen / Kind regards Andreas Schiefer Leiter Systemadministration / Head of systemadministration --- HOME OF LOYALTY CRM- & Customer Loyalty Solution by UW Service Gesellschaft für Direktwerbung und Marketingberatung mbH Alter Deutzer Postweg 221 51107 Koeln (Rath/Heumar) Deutschland Telefon : +49 221 98696 0 Telefax : +49 221 98696 5222 info(a)uw-service.de www.hooloy.de Amtsgericht Koeln HRB 24 768 UST-ID: DE 164 191 706 Geschäftsführer: Ralf Heim ---

3 years, 11 months

2
1
0 0

Fwd: [IO-500] IO500 ISC20 Call for Submission

by John Bent

FYI. Hope to see some awesome CephFS submissions for our virtual IO500 BoF! Thanks, John ---------- Forwarded message --------- From: committee--- via IO-500 <io-500(a)vi4io.org> Date: Fri, May 22, 2020 at 1:53 PM Subject: [IO-500] IO500 ISC20 Call for Submission To: <io-500(a)vi4io.org> *Deadline*: 08 June 2020 AoE The IO500 <http://io500.org/> is now accepting and encouraging submissions for the upcoming 6th IO500 list. Once again, we are also accepting submissions to the 10 Node Challenge to encourage the submission of small scale results. The new ranked lists will be announced via live-stream at a virtual session. We hope to see many new results. The benchmark suite is designed to be easy to run and the community has multiple active support channels to help with any questions. Please note that submissions of all sizes are welcome; the site has customizable sorting so it is possible to submit on a small system and still get a very good per-client score for example. Additionally, the list is about much more than just the raw rank; all submissions help the community by collecting and publishing a wider corpus of data. More details below. Following the success of the Top500 in collecting and analyzing historical trends in supercomputer technology and evolution, the IO500 <http://io500.org/> was created in 2017, published its first list at SC17, and has grown exponentially since then. The need for such an initiative has long been known within High-Performance Computing; however, defining appropriate benchmarks had long been challenging. Despite this challenge, the community, after long and spirited discussion, finally reached consensus on a suite of benchmarks and a metric for resolving the scores into a single ranking. The multi-fold goals of the benchmark suite are as follows: 1. Maximizing simplicity in running the benchmark suite 2. Encouraging optimization and documentation of tuning parameters for performance 3. Allowing submitters to highlight their “hero run” performance numbers 4. Forcing submitters to simultaneously report performance for challenging IO patterns. Specifically, the benchmark suite includes a hero-run of both IOR and mdtest configured however possible to maximize performance and establish an upper-bound for performance. It also includes an IOR and mdtest run with highly constrained parameters forcing a difficult usage pattern in an attempt to determine a lower-bound. Finally, it includes a namespace search as this has been determined to be a highly sought-after feature in HPC storage systems that has historically not been well-measured. Submitters are encouraged to share their tuning insights for publication. The goals of the community are also multi-fold: 1. Gather historical data for the sake of analysis and to aid predictions of storage futures 2. Collect tuning data to share valuable performance optimizations across the community 3. Encourage vendors and designers to optimize for workloads beyond “hero runs” 4. Establish bounded expectations for users, procurers, and administrators *10 Node I/O Challenge* The 10 Node Challenge is conducted using the regular IO500 benchmark, however, with the rule that exactly *10 client nodes* must be used to run the benchmark. You may use any shared storage with, e.g., any number of servers. When submitting for the IO500 list, you can opt-in for “Participate in the 10 compute node challenge only”, then we will not include the results into the ranked list. Other 10-node node submissions will be included in the full list and in the ranked list. We will announce the result in a separate derived list and in the full list but not on the ranked IO500 list at https://io500.org/. This information and rules for ISC20 submissions are available here: https://www.vi4io.org/io500/rules/submission Thanks, The IO500 Committee _______________________________________________ IO-500 mailing list IO-500(a)vi4io.org https://www.vi4io.org/mailman/listinfo/io-500

3 years, 11 months

1
0
0 0

Cannot repair inconsistent PG

by Daniel Aberger - Profihost AG

Hello, we are currently experiencing problems with ceph pg repair not working on Ceph Nautilus 14.2.8. ceph health detail is showing us an inconsistent pg: [aaaaax-yyyy ~]# ceph health detail HEALTH_ERR 1 scrub errors; Possible data damage: 1 pg inconsistent OSD_SCRUB_ERRORS 1 scrub errors PG_DAMAGED Possible data damage: 1 pg inconsistent pg 18.19a is active+clean+inconsistent+snaptrim_wait, acting [21,15,39,18,0,9] when we try to repair it, nothing happens. [aaaaax-yyyy ~]# ceph pg repair 18.19a instructing pg 18.19as0 on osd.21 to repair There are no new entries in OSD 21's log file. We have no trouble repairing pgs in our other clusters so I assume it might have to be something related to this cluster using Erasure Codings. But this is just a wild guess. I found a similar problem in this mailing list - http://lists.ceph.com/pipermail/ceph-users-ceph.com/2018-April/026304.html Unfortunately the solution of waiting more than a week until it fixes itself isn't quite satisfying. Is there anyone who has had similar issues and knows how to repair these inconsistent pgs or what is causing the delay? -- Mit freundlichen Grüßen Daniel Aberger Ihr Profihost Team ------------------------------- Profihost AG Expo Plaza 1 30539 Hannover Deutschland Tel.: +49 (511) 5151 8181 | Fax.: +49 (511) 5151 8282 URL: http://www.profihost.com | E-Mail: info(a)profihost.com Sitz der Gesellschaft: Hannover, USt-IdNr. DE813460827 Registergericht: Amtsgericht Hannover, Register-Nr.: HRB 202350 Vorstand: Cristoph Bluhm, Sebastian Bluhm, Stefan Priebe Aufsichtsrat: Prof. Dr. iur. Winfried Huck (Vorsitzender)

3 years, 11 months

3
4
0 0

Luminous, OSDs down: "osd init failed" and "failed to load OSD map for epoch ... got 0 bytes"

by Fulvio Galeazzi

Hallo all, hope you can help me with very strange problems which arose suddenly today. Tried to search, also in this mailing list, but could not find anything relevant. At some point today, without any action from my side, I noticed some OSDs in my production cluster would go down and never come up. I am on Luminous 12.2.13, CentOS7, kernel 3.10: my setup is non-standard as OSD disks are served off a SAN (which is for sure OK now, although I cannot exclude some glitch). Tried to reboot OSD servers a few times, ran "activate --all", added bluestore_ignore_data_csum=true in the [osd] section in ceph.conf... the number of "down" OSDs changed for a while but now seems rather stable. There are actually two classes of problems (bit more details right below): - ERROR: osd init failed: (5) Input/output error - failed to load OSD map for epoch 141282, got 0 bytes *First problem* This affects 50 OSDs (all disks of this kind, on all but one server): these OSDs are reserved for object storage but I am not yet using them so I may in principle recreate them. But would be interested in understanding what the problem is, and learn how to solve it for future reference. Here is what I see in logs: ..... 2020-05-21 21:17:48.661348 7fa2e9a95ec0 1 bluefs add_block_device bdev 1 path /var/lib/ceph/osd/cephpa1-72/block size 14.5TiB 2020-05-21 21:17:48.661428 7fa2e9a95ec0 1 bluefs mount 2020-05-21 21:17:48.662040 7fa2e9a95ec0 1 bluefs _init_alloc id 1 alloc_size 0x10000 size 0xe83a3400000 2020-05-21 21:52:43.858464 7fa2e9a95ec0 -1 bluefs mount failed to replay log: (5) Input/output error 2020-05-21 21:52:43.858589 7fa2e9a95ec0 1 fbmap_alloc 0x55c6bba92e00 shutdown 2020-05-21 21:52:43.858728 7fa2e9a95ec0 -1 bluestore(/var/lib/ceph/osd/cephpa1-72) _open_db failed bluefs mount: (5) Input/output error 2020-05-21 21:52:43.858790 7fa2e9a95ec0 1 bdev(0x55c6bbdb6600 /var/lib/ceph/osd/cephpa1-72/block) close 2020-05-21 21:52:44.103536 7fa2e9a95ec0 1 bdev(0x55c6bbdb8600 /var/lib/ceph/osd/cephpa1-72/block) close 2020-05-21 21:52:44.352899 7fa2e9a95ec0 -1 osd.72 0 OSD:init: unable to mount object store 2020-05-21 21:52:44.352956 7fa2e9a95ec0 -1 ESC[0;31m ** ERROR: osd init failed: (5) Input/output errorESC[0m *Second problem* This affects 11 OSDs, which I use *in production* for Cinder block storage: looks like all PGs for this pool are currently OK. Here is the excerpt from the logs. ..... -5> 2020-05-21 20:52:06.756469 7fd2ccc19ec0 0 _get_class not permitted to load kvs -4> 2020-05-21 20:52:06.759686 7fd2ccc19ec0 1 <cls> /home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/12.2.13/rpm/el7/BUILD/ceph-12.2.13/src/cls/rgw/cls_rgw.cc:3869: Loaded rgw class! -3> 2020-05-21 20:52:06.760021 7fd2ccc19ec0 1 <cls> /home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/12.2.13/rpm/el7/BUILD/ceph-12.2.13/src/cls/log/cls_log.cc:299: Loaded log class! -2> 2020-05-21 20:52:06.760730 7fd2ccc19ec0 1 <cls> /home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/12.2.13/rpm/el7/BUILD/ceph-12.2.13/src/cls/replica_log/cls_replica_log.cc:135: Loaded replica log class! -1> 2020-05-21 20:52:06.760873 7fd2ccc19ec0 -1 osd.63 0 failed to load OSD map for epoch 141282, got 0 bytes 0> 2020-05-21 20:52:06.763277 7fd2ccc19ec0 -1 /home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/12.2.13/rpm/el7/BUILD/ceph-12.2.13/src/osd/OSD.h: In function 'OSDMapRef OSDService::get_map(epoch_t)' thread 7fd2ccc19ec0 time 2020-05-21 20:52:06.760916 /home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/12.2.13/rpm/el7/BUILD/ceph-12.2.13/src/osd/OSD.h: 994: FAILED assert(ret) Has anyone any idea how I could fix these problems, or what I could do to try and shed some light? And also, what caused them, and whether there is some magic configuration flag I could use to protect my cluster? Thanks a lot for your help! Fulvio

3 years, 11 months

2
5
0 0

High latency spikes under jewel

by Bence Szabo

Hi, We experienced random and relative high latency spikes (around 0.5-10 sec) in our ceph cluster which consists 6 osd nodes, all osd nodes have 6 osd-s. One osd built with one spinning disk and two nvme device. We use a bcache device for osd back end (mixed with hdd and an nvme partition as caching device) and one nvme partition for journal. This synthetic command can be use for check io and latency: rados bench -p rbd 10 write -b 4000 -t 64 With this parameters we often got about 1.5 sec or higher for maximum latency. We cannot decide if our cluster is misconfigured or just this is a natural ceph behavior. Any help, suggestion would be appreciated. Regards, Bence -- --Szabo Bence --<szabo.bence(a)gmail.com>

3 years, 11 months

2
1
0 0

osds dropping out of the cluster w/ "OSD::osd_op_tp thread … had timed out"

by thoralf schulze

hi there, we are seeing osd occasionally getting kicked out of our cluster, after having been marked down by other osds. most of the time, the affected osd rejoins the cluster after about ~5 minutes, but sometimes this takes much longer. during that time, the osd seems to run just fine. this happens more often that we'd like it to … is "OSD::osd_op_tp thread … had timed out" a real error condition or just a warning about certain operations on the osd taking a long time? i already set osd_op_thread_timeout to 120 (was 60 before, default should be 15 according to the docs), but apparently that doesn't make any difference. are there any other settings that prevent this kind of behaviour? mon_osd_report_timeout maybe, as in frank schilder's case? the cluster runs nautilus 14.2.7, osds are backed by spinning platters with their rocksdb and wals on nvmes. in general, there seems to be the following pattern: - it happens under moderate to heavy load, eg. while creating pools with a lot of pgs - the affected osd logs a lot of: "heartbeat_map is_healthy 'OSD::osd_op_tp thread ${thread-id}' had timed out after 60" … and finally something along the lines of: May 18 21:12:34 ceph-osd-05 ceph-osd[2356578]: 2020-05-18 21:12:34.211 7fb25cc80700 0 bluestore(/var/lib/ceph/osd/ceph-293) log_latency_fn slow operation observed for _collection_list, latency = 96.337s, lat = 96s cid =2.0s2_head start GHMAX end GHMAX max 30 May 18 21:12:34 ceph-osd-05 ceph-osd[2356578]: 2020-05-18 21:12:34.219 7fb25cc80700 1 heartbeat_map clear_timeout 'OSD::osd_op_tp thread 0x7fb25cc80700' had timed out after 60 May 18 21:12:34 ceph-osd-05 ceph-osd[2356578]: osd.293 osd.293 2 : Monitor daemon marked osd.293 down, but it is still running May 18 21:12:34 ceph-osd-05 ceph-osd[2356578]: 2020-05-18 21:12:34.315 7fb267c96700 0 log_channel(cluster) log [WRN] : Monitor daemon marked osd.293 down, but it is still running May 18 21:12:34 ceph-osd-05 ceph-osd[2356578]: 2020-05-18 21:12:34.315 7fb267c96700 0 log_channel(cluster) do_log log to syslog May 18 21:12:34 ceph-osd-05 ceph-osd[2356578]: 2020-05-18 21:12:34.315 7fb267c96700 0 log_channel(cluster) log [DBG] : map e646639 wrongly marked me down at e646638 May 18 21:12:34 ceph-osd-05 ceph-osd[2356578]: 2020-05-18 21:12:34.315 7fb267c96700 0 log_channel(cluster) do_log log to syslog May 18 21:12:34 ceph-osd-05 ceph-osd[2356578]: 2020-05-18 21:12:34.371 7fb272cac700 -1 osd.293 646639 set_numa_affinity unable to identify public interface 'br-bond0' numa node: (2) No such file or directory May 18 21:12:34 ceph-osd-05 ceph-osd[2356578]: 2020-05-18 21:12:34.371 7fb272cac700 -1 osd.293 646639 set_numa_affinity unable to identify public interface 'br-bond0' numa node: (2) No such file or directory - meanwhile on the mon: 2020-05-18 21:12:16.440 7f08f7933700 0 mon.ceph-mon-01@0(leader) e4 handle_command mon_command({"prefix": "status"} v 0) v1 entity='client.admin' cmd=[{"prefix": "status"}]: dispatch 2020-05-18 21:12:18.436 7f08f7933700 0 log_channel(cluster) log [DBG] : osd.293 reported failed by osd.101 2020-05-18 21:12:18.848 7f08f7933700 0 log_channel(cluster) log [DBG] : osd.293 reported failed by osd.533 [… lots of these from various osds] 2020-05-18 21:12:24.992 7f08f7933700 0 log_channel(cluster) log [DBG] : osd.293 reported failed by osd.421 2020-05-18 21:12:26.124 7f08f7933700 0 log_channel(cluster) log [DBG] : osd.293 reported failed by osd.504 2020-05-18 21:12:26.132 7f08f7933700 0 log_channel(cluster) log [INF] : osd.293 failed (root=tuberlin,datacenter=barz,host=ceph-osd-05) (16 reporters from different host after 27.137527 >= grace 26.361774) 2020-05-18 21:12:26.236 7f08fa138700 0 log_channel(cluster) log [WRN] : Health check failed: 1 osds down (OSD_DOWN) 2020-05-18 21:12:26.280 7f08f6130700 0 log_channel(cluster) log [DBG] : osdmap e646638: 604 total, 603 up, 604 in 2020-05-18 21:12:27.336 7f08f6130700 0 log_channel(cluster) log [DBG] : osdmap e646639: 604 total, 603 up, 604 in 2020-05-18 21:12:28.248 7f08fa138700 0 log_channel(cluster) log [WRN] : Health check failed: Reduced data availability: 17 pgs peering (PG_AVAILABILITY) 2020-05-18 21:12:29.392 7f08fa138700 0 log_channel(cluster) log [WRN] : Health check failed: Degraded data redundancy: 80091/181232010 objects degraded (0.044%), 18 pgs degraded (PG_DEGRADED) 2020-05-18 21:12:33.927 7f08fa138700 0 log_channel(cluster) log [INF] : Health check cleared: PG_AVAILABILITY (was: Reduced data availability: 1 pg inactive, 22 pgs peering) 2020-05-18 21:12:35.095 7f08fa138700 0 log_channel(cluster) log [INF] : Health check cleared: OSD_DOWN (was: 1 osds down) 2020-05-18 21:12:35.119 7f08f6130700 0 log_channel(cluster) log [INF] : osd.293 [v2:172.28.9.26:6936/2356578,v1:172.28.9.26:6937/2356578] boot 2020-05-18 21:12:35.119 7f08f6130700 0 log_channel(cluster) log [DBG] : osdmap e646640: 604 total, 604 up, 604 in 2020-05-18 21:12:36.175 7f08f6130700 0 log_channel(cluster) log [DBG] : osdmap e646641: 604 total, 604 up, 604 in i can happily provide more detailed logs, if that helps. thank you very much & with kind regards, thoralf.

3 years, 11 months

3
7
0 0

Re: [External Email] Re: Ceph Nautius not working after setting MTU 9000

by Paul Emmerich

Don't optimize stuff without benchmarking *before and after*, don't apply random tuning tipps from the Internet without benchmarking them. My experience with Jumbo frames: 3% performance. On a NVMe-only setup with 100 Gbit/s network. Paul -- Paul Emmerich Looking for help with your Ceph cluster? Contact us at https://croit.io croit GmbH Freseniusstr. 31h 81247 München www.croit.io Tel: +49 89 1896585 90 On Tue, May 26, 2020 at 7:02 PM Marc Roos <M.Roos(a)f1-outsourcing.eu> wrote: > > > Look what I have found!!! :) > https://ceph.com/geen-categorie/ceph-loves-jumbo-frames/ > > > > -----Original Message----- > From: Anthony D'Atri [mailto:anthony.datri@gmail.com] > Sent: maandag 25 mei 2020 22:12 > To: Marc Roos > Cc: kdhall; martin.verges; sstkadu; amudhan83; ceph-users; doustar > Subject: Re: [ceph-users] Re: [External Email] Re: Ceph Nautius not > working after setting MTU 9000 > > Quick and easy depends on your network infrastructure. Sometimes it is > difficult or impossible to retrofit a live cluster without disruption. > > > > On May 25, 2020, at 1:03 AM, Marc Roos <M.Roos(a)f1-outsourcing.eu> > wrote: > > > > > > I am interested. I am always setting mtu to 9000. To be honest I > > cannot imagine there is no optimization since you have less interrupt > > requests, and you are able x times as much data. Every time there > > something written about optimizing the first thing mention is changing > > > to the mtu 9000. Because it is quick and easy win. > > > > > > > > > > -----Original Message----- > > From: Dave Hall [mailto:kdhall@binghamton.edu] > > Sent: maandag 25 mei 2020 5:11 > > To: Martin Verges; Suresh Rama > > Cc: Amudhan P; Khodayar Doustar; ceph-users > > Subject: [ceph-users] Re: [External Email] Re: Ceph Nautius not > > working after setting MTU 9000 > > > > All, > > > > Regarding Martin's observations about Jumbo Frames.... > > > > I have recently been gathering some notes from various internet > > sources regarding Linux network performance, and Linux performance in > > general, to be applied to a Ceph cluster I manage but also to the rest > > > of the Linux server farm I'm responsible for. > > > > In short, enabling Jumbo Frames without also tuning a number of other > > kernel and NIC attributes will not provide the performance increases > > we'd like to see. I have not yet had a chance to go through the rest > > of the testing I'd like to do, but I can confirm (via iperf3) that > > only enabling Jumbo Frames didn't make a significant difference. > > > > Some of the other attributes I'm referring to are incoming and > > outgoing buffer sizes at the NIC, IP, and TCP levels, interrupt > > coalescing, NIC offload functions that should or shouldn't be turned > > on, packet queuing disciplines (tc), the best choice of TCP slow-start > > > algorithms, and other TCP features and attributes. > > > > The most off-beat item I saw was something about adding IPTABLES rules > > > to bypass CONNTRACK table lookups. > > > > In order to do anything meaningful to assess the effect of all of > > these settings I'd like to figure out how to set them all via Ansible > > - so more to learn before I can give opinions. > > > > --> If anybody has added this type of configuration to Ceph Ansible, > > I'd be glad for some pointers. > > > > I have started to compile a document containing my notes. It's rough, > > > but I'd be glad to share if anybody is interested. > > > > -Dave > > > > Dave Hall > > Binghamton University > > > >> On 5/24/2020 12:29 PM, Martin Verges wrote: > >> > >> Just save yourself the trouble. You won't have any real benefit from > > MTU > >> 9000. It has some smallish, but it is not worth the effort, problems, > > and > >> loss of reliability for most environments. > >> Try it yourself and do some benchmarks, especially with your regular > >> workload on the cluster (not the maximum peak performance), then drop > > the > >> MTU to default ;). > >> > >> Please if anyone has other real world benchmarks showing huge > > differences > >> in regular Ceph clusters, please feel free to post it here. > >> > >> -- > >> Martin Verges > >> Managing director > >> > >> Mobile: +49 174 9335695 > >> E-Mail: martin.verges(a)croit.io > >> Chat: https://t.me/MartinVerges > >> > >> croit GmbH, Freseniusstr. 31h, 81247 Munich > >> CEO: Martin Verges - VAT-ID: DE310638492 Com. register: Amtsgericht > >> Munich HRB 231263 > >> > >> Web: https://croit.io > >> YouTube: https://goo.gl/PGE1Bx > >> > >> > >>> Am So., 24. Mai 2020 um 15:54 Uhr schrieb Suresh Rama > >> <sstkadu(a)gmail.com>: > >> > >>> Ping with 9000 MTU won't get response as I said and it should be > > 8972. Glad > >>> it is working but you should know what happened to avoid this issue > > later. > >>> > >>>> On Sun, May 24, 2020, 3:04 AM Amudhan P <amudhan83(a)gmail.com> > wrote: > >>> > >>>> No, ping with MTU size 9000 didn't work. > >>>> > >>>> On Sun, May 24, 2020 at 12:26 PM Khodayar Doustar > > <doustar(a)rayanexon.ir> > >>>> wrote: > >>>> > >>>>> Does your ping work or not? > >>>>> > >>>>> > >>>>> On Sun, May 24, 2020 at 6:53 AM Amudhan P <amudhan83(a)gmail.com> > > wrote: > >>>>> > >>>>>> Yes, I have set setting on the switch side also. > >>>>>> > >>>>>> On Sat 23 May, 2020, 6:47 PM Khodayar Doustar, > > <doustar(a)rayanexon.ir> > >>>>>> wrote: > >>>>>> > >>>>>>> Problem should be with network. When you change MTU it should be > >>>> changed > >>>>>>> all over the network, any single hup on your network should > >>>>>>> speak > > and > >>>>>>> accept 9000 MTU packets. you can check it on your hosts with > >>> "ifconfig" > >>>>>>> command and there is also equivalent commands for other > >>>> network/security > >>>>>>> devices. > >>>>>>> > >>>>>>> If you have just one node which it not correctly configured for > > MTU > >>>> 9000 > >>>>>>> it wouldn't work. > >>>>>>> > >>>>>>> On Sat, May 23, 2020 at 2:30 PM sinan(a)turka.nl <sinan(a)turka.nl> > >>> wrote: > >>>>>>>> Can the servers/nodes ping eachother using large packet sizes? > >>>>>>>> I > >>> guess > >>>>>>>> not. > >>>>>>>> > >>>>>>>> Sinan Polat > >>>>>>>> > >>>>>>>>> Op 23 mei 2020 om 14:21 heeft Amudhan P <amudhan83(a)gmail.com> > > het > >>>>>>>> volgende geschreven: > >>>>>>>>> In OSD logs "heartbeat_check: no reply from OSD" > >>>>>>>>> > >>>>>>>>>> On Sat, May 23, 2020 at 5:44 PM Amudhan P > > <amudhan83(a)gmail.com> > >>>>>>>> wrote: > >>>>>>>>>> Hi, > >>>>>>>>>> > >>>>>>>>>> I have set Network switch with MTU size 9000 and also in my > >>> netplan > >>>>>>>>>> configuration. > >>>>>>>>>> > >>>>>>>>>> What else needs to be checked? > >>>>>>>>>> > >>>>>>>>>> > >>>>>>>>>>> On Sat, May 23, 2020 at 3:39 PM Wido den Hollander < > >>> wido(a)42on.com > >>>>>>>> wrote: > >>>>>>>>>>> > >>>>>>>>>>> > >>>>>>>>>>>> On 5/23/20 12:02 PM, Amudhan P wrote: > >>>>>>>>>>>> Hi, > >>>>>>>>>>>> > >>>>>>>>>>>> I am using ceph Nautilus in Ubuntu 18.04 working fine wit > > MTU > >>>> size > >>>>>>>> 1500 > >>>>>>>>>>>> (default) recently i tried to update MTU size to 9000. > >>>>>>>>>>>> After setting Jumbo frame running ceph -s is timing out. > >>>>>>>>>>> Ceph can run just fine with an MTU of 9000. But there is > >>> probably > >>>>>>>>>>> something else wrong on the network which is causing this. > >>>>>>>>>>> > >>>>>>>>>>> Check the Jumbo Frames settings on all the switches as well > > to > >>>> make > >>>>>>>> sure > >>>>>>>>>>> they forward all the packets. > >>>>>>>>>>> > >>>>>>>>>>> This is definitely not a Ceph issue. > >>>>>>>>>>> > >>>>>>>>>>> Wido > >>>>>>>>>>> > >>>>>>>>>>>> regards > >>>>>>>>>>>> Amudhan P > >>>>>>>>>>>> _______________________________________________ > >>>>>>>>>>>> ceph-users mailing list -- ceph-users(a)ceph.io To > >>>>>>>>>>>> unsubscribe send an email to ceph-users-leave(a)ceph.io > >>>>>>>>>>>> > >>>>>>>>>>> _______________________________________________ > >>>>>>>>>>> ceph-users mailing list -- ceph-users(a)ceph.io To unsubscribe > > >>>>>>>>>>> send an email to ceph-users-leave(a)ceph.io > >>>>>>>>>>> > >>>>>>>>> _______________________________________________ > >>>>>>>>> ceph-users mailing list -- ceph-users(a)ceph.io To unsubscribe > >>>>>>>>> send an email to ceph-users-leave(a)ceph.io > >>>>>>>> _______________________________________________ > >>>>>>>> ceph-users mailing list -- ceph-users(a)ceph.io To unsubscribe > >>>>>>>> send an email to ceph-users-leave(a)ceph.io > >>>>>>>> > >>>> _______________________________________________ > >>>> ceph-users mailing list -- ceph-users(a)ceph.io To unsubscribe send > >>>> an email to ceph-users-leave(a)ceph.io > >>>> > >>> _______________________________________________ > >>> ceph-users mailing list -- ceph-users(a)ceph.io To unsubscribe send an > > >>> email to ceph-users-leave(a)ceph.io > >>> > >> _______________________________________________ > >> ceph-users mailing list -- ceph-users(a)ceph.io To unsubscribe send an > >> email to ceph-users-leave(a)ceph.io > > _______________________________________________ > > ceph-users mailing list -- ceph-users(a)ceph.io To unsubscribe send an > > email to ceph-users-leave(a)ceph.io > > > > _______________________________________________ > > ceph-users mailing list -- ceph-users(a)ceph.io To unsubscribe send an > > email to ceph-users-leave(a)ceph.io > > _______________________________________________ > ceph-users mailing list -- ceph-users(a)ceph.io > To unsubscribe send an email to ceph-users-leave(a)ceph.io >

3 years, 11 months

4
4
0 0

looking for telegram group in English or Chinese

by Zhenshi Zhou

Hi all, Is there any telegram group for communicating with ceph users?

3 years, 11 months

3
3
0 0

Multisite RADOS Gateway replication factor in zonegroup

by alexander.vysochin＠Megafon.ru

Hello, I didn’t find any information about the replication factor in the zone group. Assume I have three ceph clusters with Rados Gateway in one zone group each with replica size 3. How many replicas of an object I’ll get in total? Is it possible to define several regions, each with several datacenters, and define maximum replication factor at region scope? Александр Высочин Старший разработчик Технические инновации и инфраструктура - Разработка и развертывание платформ бизнес-сервисов Технические инновации и инфраструктура ПАО «МегаФон» <tel:> [cid:image001.png@01D632C0.86CD4F00] ________________________________ Информация в этом сообщении предназначена исключительно для конкретных лиц, которым она адресована. В сообщении может содержаться конфиденциальная информация, которая не может быть раскрыта или использована кем-либо кроме адресатов. Если вы не адресат этого сообщения, то использование, переадресация, копирование или распространение содержания сообщения или его части - незаконно и запрещено. Если Вы получили это сообщение ошибочно, пожалуйста, незамедлительно сообщите отправителю об этом и удалите со всем содержимым само сообщение и любые возможные его копии и приложения. Настоящее сообщение и вложения в него носят исключительно информационный характер и не влекут для нас каких-либо обязательств, не означают признания или подтверждения нами каких-либо обстоятельств. Такие последствия наступают только после подписания уполномоченными лицами оригинальных версий соглашений, актов, уведомлений или иных документов. В случае неприемлемости для нас каких-либо условий мы сохраняем за собой право в любое время прекратить переговоры в отношении любых вопросов. Вступая с нами в переписку, Вы считаетесь проинформированными обо всем, что указано выше. The information contained in this communication is intended solely for the use of the individual or entity to whom it is addressed and others authorized to receive it. It may contain confidential or legally privileged information. The contents may not be disclosed or used by anyone other than the addressee. If you are not the intended recipient(s), any use, disclosure, copying, distribution or any action taken or omitted to be taken in reliance on it is prohibited and may be unlawful. If you have received this communication in error please notify us immediately by responding to this email and then delete the e-mail and all attachments and any copies thereof. This communication and attachments hereto are for the informational purposes only and do not create or modify any our obligations and shall not be deemed as our admission or confirmation of any circumstances. Such consequences may occur only after duly authorized persons have signed the originals of the agreements, acts or other documents. If any conditions are unacceptable for us we reserve a right to terminate negotiations in respect of any issues at any time. Entering into any correspondence with us you are considered to be informed on all that is stated above. -----

3 years, 11 months

2
1
0 0

Prometheus Python Errors

by support＠orbitingcode.com

Hello Everyone, I have installed both Prometheus and Grafana on one of my manager nodes (Ubuntu 18.04), and have configured both according to the documentation. I have visible Grafana dashboards when visiting http://mon1:3000, but no data exists on the dashboard. Python errors are shown for the job_name: ceph in Prometheus. Below is my prometheus.yaml configuration global: scrape_interval: 5s scrape_configs: - job_name: prometheus static_configs: - targets: ['localhost:9090'] - job_name: 'ceph-exporter' static_configs: - targets: ['localhost:9100'] labels: alias: ceph-exporter - job_name: 'ceph' static_configs: - targets: ['localhost:9283'] labels: alias: ceph And, these are the Python errors shown when I view the details of the targets in Prometheus (http://mon1:9090/targets) Traceback (most recent call last): File "/usr/lib/python2.7/dist-packages/cherrypy/_cprequest.py", line 670, in respond response.body = self.handler() File "/usr/lib/python2.7/dist-packages/cherrypy/lib/encoding.py", line 220, in __call__ self.body = self.oldhandler(*args, **kwargs) File "/usr/lib/python2.7/dist-packages/cherrypy/_cpdispatch.py", line 60, in __call__ return self.callable(*self.args, **self.kwargs) File "/usr/share/ceph/mgr/prometheus/module.py", line 1060, in metrics return self._metrics(instance) File "/usr/share/ceph/mgr/prometheus/module.py", line 1074, in _metrics instance.collect_cache = instance.collect() File "/usr/share/ceph/mgr/prometheus/module.py", line 975, in collect self.get_rbd_stats() File "/usr/share/ceph/mgr/prometheus/module.py", line 734, in get_rbd_stats 'rbd_stats_pools_refresh_interval', 300) TypeError: unsupported operand type(s) for +: 'int' and 'str' If anyone has experienced this issue, and might have a solution, I would appreciate any assistance. Thank you, Todd

3 years, 11 months

2
1
0 0

2024

2023

2022

2021

2020

2019

ceph-users