Hello
I know that nfs on octopus is still a bit under development.
I'm trying to deploy nfs daemons and have some issues with the orchestrator.
For the other daemons, for example monitors, I can issue the command "ceph orch apply mon 3"
This will tell the orchestrator to deploy or remove monitor daemons until there are three of them.
The command does not work with nfs, and now the orchestrator is a bit misconfigured...
And by misconfigured I mean that I now have an nfs daemon on node 1, and the orchestrator wants to create another one on node 1 but with the wrong settings (it fails).
Also a "ceph orch apply nfs –unconfigured" does not work, so I can't manually manage the nfs containers.
Is there a manual way to tell ceph orch not to create or remove nfs daemons? Then I would be able to set them up manually.
Or is there a manual way of configuring the orchestrator so it does the right thing?
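For illustration, this is the kind of service spec I was hoping I could apply so that the orchestrator leaves the nfs daemons alone (just a sketch; I'm not sure the unmanaged flag is honoured for nfs in this release, and the id/pool/host names are made up):

service_type: nfs
service_id: mynfs
placement:
  hosts:
    - node1
spec:
  pool: nfs-ganesha
  namespace: mynfs
unmanaged: true

applied with something like "ceph orch apply -i nfs.yaml".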
Thanks in advance
Simon
Hi all,
I'm going to deploy rbd-mirror in order to sync an image from clusterA to
clusterB.
The image will be in use while syncing. I'm not sure whether rbd-mirror will
sync the image continuously or not. If not, I will tell the clients not to write data to it.
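For context, this is roughly what I plan to run on the primary side (journal-based mirroring; pool/image names are placeholders, please correct me if this is the wrong approach):
# rbd mirror pool enable rbd image
# rbd feature enable rbd/myimage journaling
# rbd mirror image enable rbd/myimage
and on clusterB an rbd-mirror daemon, with the peer added via "rbd mirror pool peer add".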
Thanks. Regards
Hi all,
I have a multisite config with one zonegroup and 3 zones; one zone can't fully sync metadata from the master zone. I have restarted the rgw to restart the sync process, but that did not help.
Can I run the "metadata sync init" or "metadata sync run" command to resolve this problem?
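I mean something like this on the zone that is behind, just to confirm the sequence is right before I run it:
# radosgw-admin metadata sync init
# radosgw-admin metadata sync run      (or restart the rgw instead)
# radosgw-admin sync status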
This means it has been applied:
# ceph osd dump -f json | jq .require_osd_release
"nautilus"
-- dan
On Mon, Feb 17, 2020 at 11:10 AM Marc Roos <M.Roos(a)f1-outsourcing.eu> wrote:
>
>
> How do you check if you issued this command in the past?
>
>
> -----Original Message-----
> To: ceph-users(a)ceph.io
> Subject: [ceph-users] Re: Excessive write load on mons after upgrade
> from 12.2.13 -> 14.2.7
>
> Hi Peter,
>
> could be a totally different problem but did you run the command "ceph
> osd require-osd-release nautilus" after the upgrade?
> We had poor performance after upgrading to nautilus and running this
> command fixed it. The same was reported by others for previous updates.
> Here is my original message regarding this issue:
> https://lists.ceph.io/hyperkitty/list/ceph-users@ceph.io/message/OYFRWSJXPV…
>
> We did not observe the master election problem though.
>
> _______________________________________________
> ceph-users mailing list -- ceph-users(a)ceph.io
> To unsubscribe send an email to ceph-users-leave(a)ceph.io
Hi,
I am using Ceph octopus in a small cluster.
I have enabled the Ceph dashboard, and when I go to the inventory page I only
see the OSDs running on the mgr node; the OSDs on the other 3 nodes are not listed.
I don't see any issue in the log.
How do I list the other OSDs?
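For reference, is cross-checking from the CLI with the commands below the right way to see what the orchestrator knows about (assuming the cephadm backend)?
# ceph osd tree
# ceph orch ps --daemon-type osd
# ceph orch device ls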
Regards
Amudhan P
Hello Ceph users,
We've been doing some tests with Ceph RGW. Mostly wanted to see how
Ceph will do with a large number of objects in a single bucket.
For the test we had a cluster with 3 nodes, running collocated OSDs,
MON, MGR, and RGW.
CPU: 2x Intel(R) Xeon(R) CPU E5-2650 v4 @ 2.20GHz (48 threads in total)
RAM: 128 GB
Network: 4 x 10Gbps in a single LACP bond.
OSD drives: 2 x 800GB NVMe Write Intensive
Ceph version: Nautilus (14.2.9)
The data pool uses replica 3. Ceph has only default configuration values.
We ran a COSBench test with 200 threads, writing 100 Million objects
with a size of 4KB and noticed that performance started at ~500
ops/sec, then, multiple times, jumped up to more than 7000 ops/sec and
back down to less than 500 ops/sec.
https://i.imgur.com/TopM6sw.png
https://i.imgur.com/y5Mu9F3.png
All the collected data from the COSBench test:
https://docs.google.com/spreadsheets/d/1wAwrg9nE2e_MItQB5wVrmLIO-KH7hkUBtz0…
We have noticed that during the low-performance time, the cluster is
doing read IO and the response time is very high:
https://i.imgur.com/tgZ5WLF.png
https://i.imgur.com/2PiEGZB.png
Here are the IO stats for the index and data pools:
https://i.imgur.com/hC3HZ1R.png
https://i.imgur.com/TwsXghv.png
We did the test multiple times on clean clusters, with similar results.
We also ran a second test, writing 50M new objects to the same bucket
we had previously filled with 100M objects, and everything seemed to be
working perfectly.
Does anyone know why this is happening?
The response times are huge and would be a disaster in a production environment!
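If it helps with diagnosing, we can also share the bucket and resharding state, e.g. (the bucket name is a placeholder):
# radosgw-admin bucket stats --bucket=testbucket
# radosgw-admin reshard list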
Thank you,
---
Alex Cucu
On Wed, May 27, 2020 at 10:09 PM Dylan McCulloch <dmc(a)unimelb.edu.au> wrote:
>
> Hi all,
>
> The single active MDS on one of our Ceph clusters is close to running out of RAM.
>
> MDS total system RAM = 528GB
> MDS current free system RAM = 4GB
> mds_cache_memory_limit = 451GB
> current mds cache usage = 426GB
This mds_cache_memory_limit is way too high for the available RAM. We
normally recommend that your RAM be 150% of your cache limit (with 528GB
of RAM, that guideline puts the limit at roughly 350GB), but we lack data
for cache sizes this large.
> Presumably we need to reduce our mds_cache_memory_limit and/or mds_max_caps_per_client, but would like some guidance on whether it’s possible to do that safely on a live production cluster when the MDS is already pretty close to running out of RAM.
>
> Cluster is Luminous - 12.2.12
> Running single active MDS with two standby.
> 890 clients
> Mix of kernel client (4.19.86) and ceph-fuse.
> Clients are 12.2.12 (398) and 12.2.13 (3)
v12.2.12 has the changes necessary to throttle MDS cache size
reduction. You should be able to reduce mds_cache_memory_limit to any
lower value without destabilizing the cluster.
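For example, something along these lines on the live MDS (just a sketch; the daemon name is a placeholder and the byte value, here 100 GiB, is arbitrary):
# ceph tell mds.<name> injectargs '--mds_cache_memory_limit=107374182400'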
> The kernel clients have stayed under “mds_max_caps_per_client”: “1048576". But the ceph-fuse clients appear to hold very large numbers according to the ceph-fuse asok.
> e.g.
> “num_caps”: 1007144398,
> “num_caps”: 1150184586,
> “num_caps”: 1502231153,
> “num_caps”: 1714655840,
> “num_caps”: 2022826512,
This data from the ceph-fuse asok is actually the number of caps ever
received, not the current number. I've created a ticket for this:
https://tracker.ceph.com/issues/45749
Look at the data from `ceph tell mds.foo session ls` instead.
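For example (assuming jq is available; "foo" is the MDS name):
# ceph tell mds.foo session ls | jq '.[] | {id, num_caps}'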
> Dropping caches on the clients appears to reduce their cap usage but does not free up RAM on the MDS.
The MDS won't free up RAM until the cache memory limit is reached.
> What is the safest method to free cache and reduce RAM usage on the MDS in this situation (without having to evict or remount clients)?
reduce mds_cache_memory_limit
> I’m concerned that reducing mds_cache_memory_limit even in very small increments may trigger a large recall of caps and overwhelm the MDS.
That used to be the case in older versions of Luminous but not any longer.
--
Patrick Donnelly, Ph.D.
He / Him / His
Senior Software Engineer
Red Hat Sunnyvale, CA
GPG: 19F28A586F808C2402351B93C3301A3E258DD79D
Hello,
I have strange issues with radosgw:
When trying to PUT an object with "Transfer-Encoding: chunked", I see high request latencies.
When trying to PUT the same object non-chunked, the latency is much lower and the requests/s performance is better as well.
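For reference, the two cases look roughly like this (a sketch; host/object are placeholders and the auth/signing parameters are omitted):
# curl -X PUT -H "Transfer-Encoding: chunked" -T ./obj.bin "http://rgw.example.com/bucket/obj"   <- slow
# curl -X PUT -T ./obj.bin "http://rgw.example.com/bucket/obj"                                    <- fast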
Perhaps anyone had the same issue?
Hi all,
We have a ceph nautilus cluster (14.2.8) with two cephfs filesystems and
3 mds (1 active for each fs + one standby).
We are transferring all the data (~600M files) from one FS (which was in
EC 3+2) to the other FS (in R3).
On the old FS we first removed the snapshots (to avoid strays problems
when removing files) and then ran some rsyncs, deleting the files after the
transfer.
The operation should last a few weeks more to complete.
But a few days ago, we started to get "mds behind on trimming" warnings
from the mds managing the old FS.
Yesterday, I restarted the active mds service to force a takeover by
the standby mds (basically because the standby is more powerful and
has more memory, i.e. 48GB instead of 32).
The standby mds took rank 0 and started to replay... the "mds behind
on trimming" warning came back and the number of segments rose, as did the
memory usage of the server. Finally, it exhausted the memory of the mds,
the service stopped, and the previous mds took rank 0 and started to
replay... until memory exhaustion and a new mds switch, etc.
It thus seems that we are in a never-ending loop! And of course, as the
mds is always in replay, the data is not accessible and the transfers
are blocked.
I stopped all the rsyncs and unmounted the clients.
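In case it is relevant, this is how I am watching the replay (just a sketch; the daemon name is a placeholder and I am not sure the mds_log counters are the best indicator):
# ceph fs status
# ceph daemon mds.<name> perf dump mds_log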
My questions are:
- Does the mds trim during replay, so that we can hope it will eventually
purge everything and manage to become active in the end?
- Is there a way to accelerate the operation or to fix this situation?
Thanks for your help.
F.
Dear fellow cephers,
we have a nasty and at the same time mysterious problem with our mimic 13.2.8 cluster. The most prominent observations are:
- benign operations like adding disks sometimes leads to timeouts and OSDs marked down for no apparent reason, other times everything is normal (see https://lists.ceph.io/hyperkitty/list/ceph-users@ceph.io/message/VO7FGPDKRD…)
- out of the blue, OSDs or other daemons get marked down for a period of time after which everything suddenly goes back to normal (see https://lists.ceph.io/hyperkitty/list/ceph-users@ceph.io/thread/QVQTAL6ZK6S…)
All this happens while everything is physically healthy, no network problems, no disk problems, nothing. No daemon actually crashes and the logs do not contain any trace of a malfunction.
At first I thought that some clients overloaded the cluster, but now I'm leaning towards a very rarely hit bug in the MON daemons. My current hypothesis is that a MON daemon can enter an ill state without notice, there is nothing in the logs and the peers don't recognize anything either. An affected MON daemon keeps running and continues to communicate with everything else, yet every now and then it goes bonkers for a while taking an entire cluster out of service. A single ill MON is enough for this to happen, making this a rare high impact problem.
Please find below a collection of observations and attempts to address the periodic service outages we experienced. Unfortunately, the observations are somewhat inconclusive and I hope someone here recognizes a smoking gun. The information was collected over a period of 6 weeks.
I tried to organize the information below by relevance/significance, but also aim for providing a complete re-collection just in case. The structure is as follows:
1. Description of incident
2. Actions taken to get the cluster under control
3. Potential (temporary) resolution
4. Observations collected during various occurrences
==========================
1. Description of incident
After setting mon_osd_report_timeout and mon_mgr_beacon_grace to 24h I managed to isolate a signal in our monitoring data that clearly marks the window of an incident; see https://imgur.com/a/cg9Yf4E . The images show the network packet reports collected with ganglia on the server- and client side. The blue curve is packets out and the green curve is packets in. The incident started at ca. 10:15 and is clearly visible as a plateau. Note that we had another incident just shortly before that one, ending around 9:30, the ending of which is also visible.
During an incident, typically two MONs start sending an insane amount of packets out. The number goes up to 40000 pkts/s. For comparison, normal operation is only 200 pkts/s. During an incident, the leader stops processing OSD and MGR beacons, but happily marks everything down. I don't remember any other operation stuck, in particular, not MDS beacons. The CPU load of MONs sending packets is slightly above 100%. Memory consumption is normal.
It looks like the packets are sent to a subset of ceph fs clients. I'm not sure if this is connected to actual I/O the clients are doing; I have seen incidents in both cases, with heavy and with absolutely no I/O load on the file system. The images show an incident with no correlated I/O going on. However, it is remarkable that all affected clients were running a specific piece of software of a specific user, so clients seem to play a role, although it looks unlikely that ordinary file operations are causing this.
During an incident window, one has very little time to react. During the first 4-5 minutes the cluster still responds to admin commands like setting OSD flags. As soon as the MONs start reporting slow ops, everything gets stuck and one can only wait for it to be over. It's a complete loss of control. Typically, OSDs and MGRs get marked down en masse and service is lost without a chance to do anything. The cluster is physically healthy the entire time, I did not observe any actual crashes or spontaneous restarts.
The incident is not (easily?) reproducible. I observed two software packages running while incidents happened. However, the software was also running without any issues at other times. The frequency of incident windows was 1-2 per day during the worst periods. It looks like certain client software (operations) increases the chance of this happening, but doesn't always trigger it.
I collected a lot of information during a one-week period with several incidents, such as
- MON and OSD logs
- full aggregated system log
- dumps of ops and historic ops during incidents
- tcpdumps in both promiscuous and non-promiscuous mode
- a full tcp dump of 10000 packets (ca. 0.25s) on the leader during an incident window
- more graphs with other data
- results of experiments with different actions
So far I couldn't find anything, probably because I don't know what to look for. I'm happy to make any information available if it helps.
==========================
2. Actions taken to get the cluster under control
The first incident of long ping times (https://lists.ceph.io/hyperkitty/list/ceph-users@ceph.io/message/VO7FGPDKRD…), which I believe is related, happened while I added a small number of OSDs. The problem went away by itself and I didn't observe it again when adding more OSDs later.
I discovered the second incident of total cluster outage (https://lists.ceph.io/hyperkitty/list/ceph-users@ceph.io/thread/QVQTAL6ZK6S…) very shortly after it started. By coincidence I was looking at the dashboard at the right moment. However, I came too late to the command line. About 2/3 of all OSDs were marked down. The cluster still responded to admin commands and I managed to set nodown and noout, which helped a little. I got a very important hint in the ceph-users thread, and after setting mon_osd_report_timeout to 1h the cluster became healthy immediately. I'm not entirely sure though if this was due to setting the timeout or the incident stopping by itself and the command only returning after the incident was over - it might have been coincidence.
Thereafter things gradually got worse and worse. The incident windows became longer and occurred more and more often. Instead of OSDs being marked down, it now hit the MGRs. I therefore also increased mon_mgr_beacon_grace to larger and larger values to get the cluster stable and to have a chance to debug and collect information.
At some point the incident window exceeded 1h and OSDs were affected again. We had a couple more total outages, and eventually I set the timeouts to 24h, which meant I only had to check twice a day. The longest incident window I observed was something like 1h:15m.
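For reference, the timeouts were raised roughly like this (a sketch; values are in seconds):
# ceph config set mon mon_osd_report_timeout 86400
# ceph config set mon mon_mgr_beacon_grace 86400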
After this, I was trying to catch incidents and experiment with different actions to see what happens. I potentially discovered a way to resolve the problem at least temporarily, see next section. Results of other experiments are listed thereafter.
==========================
3. Potential (temporary) resolution
Restarting the "ill" monitor resolves the issue. I don't know if this issue arises at MON startup or over time under certain conditions. Therefore, I cannot say if a clean restart resolves it permanently or if the monitor can fall ill again. I still have the 24h beacon timeouts just to be sure.
In my case, the leader was probably the culprit. After restarting the leader about 3 weeks ago I have not seen a single incident again.
A possible clue for whether or not a cluster is affected could be the distribution of client connections the monitors hold. I observed an incredibly uneven distribution like ceph-01(leader) 1500, ceph-02 70, ceph-03 300. After restarting ceph-02 nothing happened. After restarting ceph-01 the MONs immediately started re-balancing client connections and ended up with something like ceph-01 650, ceph-02 850, ceph-03 650. This converged over the last 3 weeks to around ceph-01 725, ceph-02 815, ceph-03 625 and varies only little.
It seems that a highly imbalanced distribution of client connections is an early indicator. I have been watching the distribution since then.
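A rough sketch of how I count the connections per MON, in case someone wants to compare (the output format of the sessions command may differ between releases):
# ceph daemon mon.ceph-01 sessions | grep -c 'client\.'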
Another early sign might be that the active MGR no longer likes to run on a specific host. Our MGR was always running on ceph-01 (together with the MON leader). At some point, however, I observed that it started failing over to ceph-02 for, again, no apparent reason. There was no actual MGR restart. Even if I forced the MGR on ceph-01 to be active, after a while a different MGR would become active again.
After re-starting the "ill" MON, this also stopped.
==========================
4. Observations collected during various occurrences
- setting nodown and noout in time will prevent service outage
- if OSDs remain up, client I/O is unaffected (throughput-wise); the incident does not add critical load
- restarting a MON with slow OPS does not help, probably unless it is the right one; I never dared to restart the leader during an incident though
- restarting MGRs makes things worse
- disabling the dashboard does not help (I suspected a bug in the dashboard for a while)
- during an incident, the general network load on the entire cluster increases; I could see a very large increase of packets between all (?, many?) servers with tcpdump in promiscuous mode; network hardware was never challenged though
Hope that any of this makes sense to someone and helps to isolate the root cause.
Best regards,
=================
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14