October 2020 - ceph-users

by Brian Topping

Hello experts, I have accidentally created a situation where the only monitor in a cluster has been moved to a new node without it’s /var/lib/ceph contents. Not realizing what I had done, I decommissioned the original node, but still have the contents of it’s /var/lib/ceph. Can I shut down the monitor running on the new node, copy monitor data from the original node to the new node and restart the monitor? Or is there information in the monitor database that is tied to the original node? If that’s the case, I suspect I need to somehow recommission the original node. Thanks for any feedback on this situation! Brian

3 years, 6 months

2
2
0 0

How to clear Health Warning status?

by Tecnología CHARNE.NET

Hello! Today, I started the morning with a WARNING STATUS on our Ceph cluster. # ceph health detail HEALTH_WARN Too many repaired reads on 1 OSDs [WRN] OSD_TOO_MANY_REPAIRS: Too many repaired reads on 1 OSDs osd.67 had 399911 reads repaired I made "ceph osd out 67" and PGs where migrated to another OSDs. I stopped the osd.67 daemon, inspected the logs, etc... Then I started the daemon and made "# ceph osd in 67". OSD started backfilling with some PGs and no other error appeared in the rest of the day, but Warning status still remains. Can I clear it? Shoud I remove the osd and start with a new one? Thanks in advance for your time! Javier.-

3 years, 6 months

2
2
0 0

Re: another osd_pglog memory usage incident

by Dan van der Ster

On Fri, Oct 9, 2020 at 3:12 PM Marc Roos <M.Roos(a)f1-outsourcing.eu> wrote: > > > > > > >1. The pg log contains 3000 entries by default (on nautilus). These > >3000 entries can legitimately consume gigabytes of ram for some > >use-cases. (I haven't determined exactly which ops triggered this > >today). > > How can I check how much ram my pg_logs are using? ceph daemon osd.x dump_mempools | jq .mempool.by_pool.osd_pglog > > > > -----Original Message----- > Cc: ceph-users > Subject: [ceph-users] Re: another osd_pglog memory usage incident > > On 09.10.20 13:55, Dan van der Ster wrote: > [...] > > I also noticed a possible relationship with scrubbing -- One week ago > > we increased to osd_max_scrubs=5 to clear out a scrubbing backlog; I > > wonder if the increased read/write ratio somehow led to an exploding > > buffer_anon. Do things stabilize on your side if you temporarily > > disable scrubbing? > > During the worst periods, we had disabled scrubbing. When we re-enabled, > we had our write-job to mitigate the problems. And currently, scrub load > is low. So I cannot tell, but it is very plausible. > > Cheers > Harry > _______________________________________________ > ceph-users mailing list -- ceph-users(a)ceph.io To unsubscribe send an > email to ceph-users-leave(a)ceph.io > >

3 years, 6 months

1
0
0 0

Multisite replication speed

by Nicolas Moal

Hello everybody, We have two Ceph object clusters replicating over a very long-distance WAN link. Our version of Ceph is 14.2.10. Currently, replication speed seems to be capped around 70 MiB/s even if there's a 10Gb WAN link between the two clusters. The clusters themselves don't seem to suffer from any performance issue. The replication traffic leverages HAProxy VIPs, which means there's a single endpoint (the HAProxy VIP) in the multisite replication configuration. So, my questions are: - Is it possible to improve replication speed by adding more endpoints in the multisite replication configuration? The issue we are facing is that the secondary cluster is way behind the master cluster because of the relatively slow speed. - Is there anything else I can do to optimize replication speed ? Thanks for your comments ! Nicolas

3 years, 6 months

3
4
0 0

another osd_pglog memory usage incident

by Dan van der Ster

Hi all, This morning some osds in our S3 cluster started going OOM, after restarting them I noticed that the osd_pglog is using >1.5GB per osd. (This is on an osd with osd_memory_target = 2GB, hosting 112PGs, all PGs are active+clean). After reading through this list and trying a few things, I'd like to share the following observations for your feedback: 1. The pg log contains 3000 entries by default (on nautilus). These 3000 entries can legitimately consume gigabytes of ram for some use-cases. (I haven't determined exactly which ops triggered this today). 2. The pg log length is decided by the primary osd -- setting osd_max_pg_log_entries/osd_min_pg_log_entries on one single OSD does not have a big effect (because most of the PGs are primaried somewhere else). You need to set it on all the osds for it to be applied to all PGs. 3. We eventually set osd_max_pg_log_entries = 500 everywhere. This decreased the osd_pglog mempool from more than 1.5GB on our largest osds to less that 500MB. 4. The osd_pglog mempool is not accounted for in the osd_memory_target (in nautilus). 5. I have opened a feature request to limit the pg_log length by memory size (https://tracker.ceph.com/issues/47775). This way we could allocate a fraction of memory to the pg log and it would shorten the pglog length (budget) accordingly. 6. Would it be feasible to add an osd option to 'trim pg log at boot' ? This way we could avoid the cumbersome ceph-objectstore-tool trim-pg-log in cases of disaster (osds going oom at boot). For those that had pglog memory usage incidents -- does this match your experience? Thanks! Dan

3 years, 6 months

4
7
0 0

Nautilus RGW fails to open Jewel buckets (400 Bad Request)

by Wido den Hollander

Hi, Most of it is described here: https://tracker.ceph.com/issues/22928 Buckets created under Jewel don't always have the *placement_rule* set in their bucket metadata and this causes Nautilus RGWs to not serve requests for them. Snippet from the metadata: { "key": "bucket.instance:pbx:ams02.446941181.1", "ver": { "tag": "86lc3iVtQpPiJYkh95YCTnhu", "ver": 2 }, "mtime": "2020-10-09 09:12:04.744423Z", "data": { "bucket_info": { "bucket": { "name": "pbx", "marker": "ams02.241978.4", "bucket_id": "ams02.446941181.1", "tenant": "", "explicit_placement": { "data_pool": ".rgw.buckets", "data_extra_pool": "", "index_pool": ".rgw.buckets" } }, "creation_time": "2014-02-16 12:32:15.000000Z", "owner": "vdvm", "flags": 0, "zonegroup": "eu", "placement_rule": "", Notice that *placement_rule* is empty and that this bucket has *explicit_placement* set. There is no way to update the bucket.instance metadata as far as I know, otherwise I could have set a placement rule for the bucket. Earlier on the ML this has been discussed: https://lists.ceph.io/hyperkitty/list/ceph-users@ceph.io/thread/ULKK5RU2VXL… People there compiled a manual version of RGW, something I'd rather stay away from. Has anybody seen this and if so: Have you found a solution? The commit that breaks these buckets is this one: https://github.com/ceph/ceph/commit/2a8e8a98d8c56cc374ec671846a20e2b0484bc75 14.2.0 was the first release with that code in there. So two things I'm thinking about and I don't know which one is best: - Update RGW and modify the if-statement added by commit 2a8e8a - Enhance 'bucket check --fix' to update the placement_rule if none is set for a bucket Any hints or suggestions? Wido

3 years, 6 months

1
0
0 0

Weird performance issue with long heartbeat and slow ops warnings

by Void Star Nill

Hello, I have a ceph cluster running 14.2.11. I am running benchmark tests with FIO concurrently on ~2000 volumes of 10G each. During the time initial warm-up FIO creates a 10G file on each volume before it runs the actual read/write I/O operations. During this time, I start seeing the Ceph cluster reporting about 35GiB/s write throughput for a while, but after some time I start seeing "long heartbeat" and "slow ops" warnings and in a few mins the throughput drops to ~1GB/s and stays there until all FIO runs complete. The cluster has 5 monitor nodes and 10 data nodes - each with 10x3.2TB NVME drives. I have setup 3 OSD for each NVME, so there are a total of 300 OSDs. Each server has 200GB uplink and there's no apparent network bottleneck as the network is set up to support over 1Tbps bandwidth. I dont see any CPU or memory issues also on the servers. There is a single manager instance running on one of the mons. The pool is configured for 3 replication factor with min_size of 2. I tried to use pg_num of 8192 and 16384 and saw the issue with both settings. Could you please suggest if this is a known issue or if I can tune any parameters? Long heartbeat ping times on back interface seen, longest is 1202.120 msec Long heartbeat ping times on front interface seen, longest is 1535.191 msec 35 slow ops, oldest one blocked for 122 sec, daemons [osd.135,osd.14,osd.141,osd.143,osd.149,osd.15,osd.151,osd.153,osd.157,osd.162]... have slow ops. Regards, Shridhar

3 years, 6 months

2
2
0 0

pg active+clean but can not handle io

by 古轶特

Hi all: I have a ceph cluster, the version is 12.2.12. this is my ceph osd tree: [root@node-1 ~]# ceph osd tree ID CLASS WEIGHT TYPE NAME STATUS REWEIGHT PRI-AFF -25 2.78760 root rack-test -26 0.92920 rack rack_1 -37 0.92920 host host_1 12 ssd 0.92920 osd.12 up 1.00000 1.00000 -27 0.92920 rack rack_2 -38 0.92920 host host_2 6 ssd 0.92920 osd.6 up 1.00000 1.00000 -28 0.92920 rack rack_3 -39 0.92920 host host_3 18 ssd 0.92920 osd.18 up 1.00000 1.00000 I have a pool in cluster: pool 14 'gyt-test' replicated size 2 min_size 1 crush_rule 2 object_hash rjenkins pg_num 128 pgp_num 128 last_change 5864 lfor 0/5828 flags hashpspool stripe_width 0 removed_snaps [1~3] crush_rule 2 dump: { "rule_id": 2, "rule_name": "replicated_rule_rack", "ruleset": 2, "type": 1, "min_size": 1, "max_size": 10, "steps": [ { "op": "take", "item": -25, "item_name": "rack-test" }, { "op": "chooseleaf_firstn", "num": 0, "type": "rack" }, { "op": "emit" } ] } I have a rbd images in gyt-test pool: [root@node-1 ~]# rbd ls gyt-test gyt-test now, I use fio tool to test this rbd images: [root@node-1 ~]# fio --ioengine=rbd --pool=gyt-test --rbdname=gyt-test --rw=randwrite --bs=4k --numjobs=1 --runtime=120 --iodepth=128 --clientname=admin --direct=1 --name=test --time_based=1 --eta-newline 1 test: (g=0): rw=randwrite, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=rbd, iodepth=128 fio-3.1 Starting 1 process Jobs: 1 (f=1): [w(1)][2.5%][r=0KiB/s,w=42.2MiB/s][r=0,w=10.8k IOPS][eta 01m:57s] Jobs: 1 (f=1): [w(1)][4.2%][r=0KiB/s,w=57.7MiB/s][r=0,w=14.8k IOPS][eta 01m:55s] Jobs: 1 (f=1): [w(1)][5.8%][r=0KiB/s,w=52.4MiB/s][r=0,w=13.4k IOPS][eta 01m:53s] Jobs: 1 (f=1): [w(1)][7.5%][r=0KiB/s,w=61.1MiB/s][r=0,w=15.6k IOPS][eta 01m:51s] Jobs: 1 (f=1): [w(1)][9.2%][r=0KiB/s,w=30.0MiB/s][r=0,w=7927 IOPS][eta 01m:49s] Jobs: 1 (f=1): [w(1)][10.8%][r=0KiB/s,w=59.1MiB/s][r=0,w=15.1k IOPS][eta 01m:47s] Jobs: 1 (f=1): [w(1)][12.5%][r=0KiB/s,w=51.6MiB/s][r=0,w=13.2k IOPS][eta 01m:45s] Jobs: 1 (f=1): [w(1)][14.2%][r=0KiB/s,w=58.3MiB/s][r=0,w=14.9k IOPS][eta 01m:43s] Jobs: 1 (f=1): [w(1)][15.8%][r=0KiB/s,w=56.1MiB/s][r=0,w=14.4k IOPS][eta 01m:41s] Jobs: 1 (f=1): [w(1)][17.5%][r=0KiB/s,w=44.8MiB/s][r=0,w=11.5k IOPS][eta 01m:39s] This is normal And then, I move host_1 bucket to rack-test: [root@node-1 ~]# ceph osd crush move host_1 root=rack-test moved item id -37 name 'host_1' to location {root=gyt-test} in crush map [root@node-1 ~]# ceph osd tree -25 2.78760 root rack-test -37 0.92920 host host_1 12 ssd 0.92920 osd.12 up 1.00000 1.00000 -26 0 rack rack_1 -27 0.92920 rack rack_2 -38 0.92920 host host_2 6 ssd 0.92920 osd.6 up 1.00000 1.00000 -28 0.92920 rack rack_3 -39 0.92920 host host_3 18 ssd 0.92920 osd.18 up 1.00000 1.00000 use fio tool test gyt-test rbd again: [root@node-1 ~]# fio --ioengine=rbd --pool=gyt-test --rbdname=gyt-test --rw=randwrite --bs=4k --numjobs=1 --runtime=120 --iodepth=64 --clientname=admin --direct=1 --name=test --time_based=1 --eta-newline 1 test: (g=0): rw=randwrite, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=rbd, iodepth=64 fio-3.1 Starting 1 process Jobs: 1 (f=1): [w(1)][2.5%][r=0KiB/s,w=0KiB/s][r=0,w=0 IOPS][eta 01m:57s] Jobs: 1 (f=1): [w(1)][4.2%][r=0KiB/s,w=0KiB/s][r=0,w=0 IOPS][eta 01m:55s] Jobs: 1 (f=1): [w(1)][5.8%][r=0KiB/s,w=0KiB/s][r=0,w=0 IOPS][eta 01m:53s] Jobs: 1 (f=1): [w(1)][7.5%][r=0KiB/s,w=0KiB/s][r=0,w=0 IOPS][eta 01m:51s] Jobs: 1 (f=1): [w(1)][9.2%][r=0KiB/s,w=0KiB/s][r=0,w=0 IOPS][eta 01m:49s] Jobs: 1 (f=1): [w(1)][10.8%][r=0KiB/s,w=0KiB/s][r=0,w=0 IOPS][eta 01m:47s] Jobs: 1 (f=1): [w(1)][12.5%][r=0KiB/s,w=0KiB/s][r=0,w=0 IOPS][eta 01m:45s] Jobs: 1 (f=1): [w(1)][14.2%][r=0KiB/s,w=0KiB/s][r=0,w=0 IOPS][eta 01m:43s] Jobs: 1 (f=1): [w(1)][15.8%][r=0KiB/s,w=0KiB/s][r=0,w=0 IOPS][eta 01m:41s] Jobs: 1 (f=1): [w(1)][17.5%][r=0KiB/s,w=0KiB/s][r=0,w=0 IOPS][eta 01m:39s] Jobs: 1 (f=1): [w(1)][19.2%][r=0KiB/s,w=0KiB/s][r=0,w=0 IOPS][eta 01m:37s] Jobs: 1 (f=1): [w(1)][20.8%][r=0KiB/s,w=0KiB/s][r=0,w=0 IOPS][eta 01m:35s] Jobs: 1 (f=1): [w(1)][22.5%][r=0KiB/s,w=0KiB/s][r=0,w=0 IOPS][eta 01m:33s] Jobs: 1 (f=1): [w(1)][24.2%][r=0KiB/s,w=0KiB/s][r=0,w=0 IOPS][eta 01m:31s] Jobs: 1 (f=1): [w(1)][25.8%][r=0KiB/s,w=0KiB/s][r=0,w=0 IOPS][eta 01m:29s] Jobs: 1 (f=1): [w(1)][27.5%][r=0KiB/s,w=0KiB/s][r=0,w=0 IOPS][eta 01m:27s] Jobs: 1 (f=1): [w(1)][29.2%][r=0KiB/s,w=0KiB/s][r=0,w=0 IOPS][eta 01m:25s] Jobs: 1 (f=1): [w(1)][30.8%][r=0KiB/s,w=0KiB/s][r=0,w=0 IOPS][eta 01m:23s] Jobs: 1 (f=1): [w(1)][32.5%][r=0KiB/s,w=0KiB/s][r=0,w=0 IOPS][eta 01m:21s] Jobs: 1 (f=1): [w(1)][34.2%][r=0KiB/s,w=0KiB/s][r=0,w=0 IOPS][eta 01m:19s] Jobs: 1 (f=1): [w(1)][35.8%][r=0KiB/s,w=0KiB/s][r=0,w=0 IOPS][eta 01m:17s] Jobs: 1 (f=1): [w(1)][37.5%][r=0KiB/s,w=0KiB/s][r=0,w=0 IOPS][eta 01m:15s] Jobs: 1 (f=1): [w(1)][39.2%][r=0KiB/s,w=0KiB/s][r=0,w=0 IOPS][eta 01m:13s] Jobs: 1 (f=1): [w(1)][40.8%][r=0KiB/s,w=0KiB/s][r=0,w=0 IOPS][eta 01m:11s] rbd gyt-test can not handle io ceph status at this time: [root@node-1 ~]# ceph -s cluster: id: 896eeeed-6f14-4fb0-8d0d-00adda67a36c health: HEALTH_WARN 8782/299948 objects misplaced (2.928%) services: mon: 3 daemons, quorum node-1,node-2,node-3 mgr: node-1(active) osd: 18 osds: 18 up, 18 in; 75 remapped pgs flags nodeep-scrub rbd-mirror: 1 daemon active rgw: 1 daemon active data: pools: 14 pools, 1024 pgs objects: 104.25k objects, 371GiB usage: 641GiB used, 11.5TiB / 12.1TiB avail pgs: 8782/299948 objects misplaced (2.928%) 949 active+clean 75 active+clean+remapped io: client: 106KiB/s rd, 567KiB/s wr, 127op/s rd, 52op/s wr my all pgs is active+clean, remapped pg is already stop recovery, this is no normal. [root@node-1 ~]# ceph pg dump | grep ^14 | grep "\[12" dumped all 14.37 106 0 0 0 0 441110528 1733 1733 active+clean 2020-09-29 00:52:21.808536 6226'33265 6278:34441 [12,18] 12 [12,18] 12 0'0 2020-09-28 11:35:49.973497 0'0 2020-09-28 11:31:12.912584 0 14.1c 93 0 0 0 0 383131648 1512 1512 active+clean 2020-09-29 00:45:18.475265 6226'28692 6278:29883 [12,6] 12 [12,6] 12 0'0 2020-09-28 11:34:55.543085 0'0 2020-09-28 11:31:12.912584 0 14.10 82 0 0 0 0 340963328 1505 1505 active+clean 2020-09-29 00:45:18.474759 6226'26031 6278:27257 [12,6] 12 [12,6] 12 0'0 2020-09-28 11:34:37.400091 0'0 2020-09-28 11:31:12.912584 0 14.2c 85 0 0 0 0 353435648 1677 1677 active+clean 2020-09-29 00:51:24.528368 6226'25262 6278:26344 [12,18] 12 [12,18] 12 0'0 2020-09-28 11:35:28.806418 0'0 2020-09-28 11:31:12.912584 0 14.26 106 0 0 0 0 440807424 1534 1534 active+clean 2020-09-29 00:45:18.474631 6226'33003 6278:34306 [12,6] 12 [12,6] 12 0'0 2020-09-28 11:35:01.590900 0'0 2020-09-28 11:31:12.912584 0 14.6a 78 0 0 0 0 323850240 1671 1671 active+clean 2020-09-29 00:50:57.336812 6226'23954 6278:25009 [12,18] 12 [12,18] 12 0'0 2020-09-28 11:36:06.101612 0'0 2020-09-28 11:31:12.912584 0 these pg primary osd is 12, I feel it unable to understand that osd 12 already move out gyt-test pool crush rule. it seems a BUG?

3 years, 6 months

1
0
0 0

Re: el6 / centos6 rpm's for luminous?

by Dan van der Ster

We had built some rpms locally for ceph-fuse, but AFAIR luminous needs systemd so the server rpms would be difficult. -- dan On Thu, Oct 8, 2020 at 11:12 AM Marc Roos <M.Roos(a)f1-outsourcing.eu> wrote: > > > Nobody ever used luminous on el6? > > > _______________________________________________ > ceph-users mailing list -- ceph-users(a)ceph.io > To unsubscribe send an email to ceph-users-leave(a)ceph.io

3 years, 6 months

2
1
0 0

Wipe an Octopus install

by Samuel Taylor Liston

Wondering if anyone knows or has put together a way to wipe an Octopus install? I’ve looked for documentation on the process, but if it exists, I haven’t found it yet. I’m going through some test installs - working through the ins and outs of cephadm and containers and would love an easy way to tear things down and start over. In previous releases managed through ceph-deploy there were three very convenient commands that nuked the world. I am looking for something as complete for Octopus. Thanks, Sam Liston (sam.liston(a)utah.edu) ========================================== Center for High Performance Computing - Univ. of Utah 155 S. 1452 E. Rm 405 Salt Lake City, Utah 84112 (801)232-6932 ==========================================

3 years, 6 months

3
6
0 0

2024

2023

2022

2021

2020

2019

ceph-users October 2020