Thanks to @Anthony:
Digging further, I see that I was probably blinded by the CPU load...
I see that some disks are very slow (so my first observations were
incorrect), and the latency seen with iostat is more or less the same
as what we see in dump_historic_ops (plus ~3 s for r_await).
So, it looks like a few OSDs are causing a bottleneck in the whole system.
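For reference, this is roughly how I am comparing iostat with
dump_historic_ops; osd.12 and /dev/sdX are just placeholders for one of the
slow OSDs and its backing device, and the commands run on the host carrying
that OSD:
# ceph daemon osd.12 dump_historic_ops | grep -E '"description"|"duration"'
# iostat -x 5 /dev/sdX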
I'm now wondering what my options are to improve performance... The
main goal is to get the system usable again and to make sure write
operations are not affected.
- Temporarily setting the weight of the slow OSDs to 0? That way recovery
can go on, but new data is not written to those disks? (A sketch of the
command is below the list.)
- ....
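If I go the weight-to-0 route, I assume it looks something like this (osd.12
is just a placeholder id for one of the slow OSDs; I would note its current
weight with 'ceph osd df tree' first so it can be restored later):
# ceph osd crush reweight osd.12 0
Note that dropping the crush weight to 0 also makes the cluster move the
existing data off that OSD, so it adds backfill traffic on top of the
ongoing recovery.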
Still investigating...
Hi,
Is there any documentation about mapping usernames, user IDs,
group names and group IDs between hosts sharing the same CephFS storage?
Thanks for any hint,
Renne
The important metric is the difference between these two values:
# ceph report | grep osdmap | grep committed
report 3324953770
"osdmap_first_committed": 3441952,
"osdmap_last_committed": 3442452,
The mon stores osdmaps on disk, and trims the older versions whenever
the PGs are clean. Trimming brings osdmap_first_committed closer to
osdmap_last_committed.
In a cluster with no PGs backfilling or recovering, the mon should
trim that difference to be within 500-750 epochs.
If there are any PGs backfilling or recovering, then the mon will not
trim beyond the osdmap epoch when the pools were clean.
So if you are accumulating gigabytes of data in the mon dir, it
suggests that you have unclean PGs/Pools.
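A quick way to watch that difference (assuming jq is installed; ceph report
prints its checksum line to stderr, hence the redirect):
# ceph report 2>/dev/null | jq '.osdmap_last_committed - .osdmap_first_committed'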
Cheers, dan
On Fri, Oct 2, 2020 at 4:14 PM Marc Roos <M.Roos(a)f1-outsourcing.eu> wrote:
>
>
> Does this also apply if your cluster is not healthy because of errors
> like '2 pool(s) have no replicas configured'? I sometimes use these pools
> for testing; they are empty.
>
>
>
>
> -----Original Message-----
> Cc: ceph-users
> Subject: [ceph-users] Re: Massive Mon DB Size with noout on 14.2.11
>
> As long as the cluster is not healthy, the OSD will require much more
> space, depending on the cluster size and other factors. Yes, this is
> somewhat normal.
>
Hi everybody,
Our need is to do VM failover using a disk image on RBD, to avoid data loss. We want to limit the downtime as much as
possible.
We have:
- Two hypervisors, each with a Ceph Monitor and a Ceph OSD.
- A third machine with a Ceph Monitor and a Ceph Manager.
VMs are running on QEMU. The VM disks are on a "replicated" RBD pool formed by the two OSDs.
Ceph version: Nautilus
Distribution: Yocto Zeus
The following test is performed: we electrically turn off one hypervisor (and therefore a Ceph Monitor and a Ceph OSD),
which causes its VMs to switch to the second hypervisor.
My main issue is that the mount time of a partition in rw is very slow in the case of a failover (after the loss of an
OSD and its monitor).
With failover we can write to the device after ~25 s:
[ 25.609074] EXT4-fs (vda3): mounted filesystem with ordered data mode. Opts: (null)
In a normal boot we can write to the device after ~4 s:
[ 3.087412] EXT4-fs (vda3): mounted filesystem with ordered data mode. Opts: (null)
I wasn't able to reduce this time by tweaking Ceph settings. I am wondering if someone could help me on that.
Here is our configuration.
ceph.conf:
[global]
    fsid = fa7a17d1-5351-459e-bf0e-07e7edc9a625
    mon initial members = hypervisor1,hypervisor2,observer
    mon host = 192.168.217.131,192.168.217.132,192.168.217.133
    public network = 192.168.217.0/24
    auth cluster required = cephx
    auth service required = cephx
    auth client required = cephx
    osd journal size = 1024
    osd pool default size = 2
    osd pool default min size = 1
    osd crush chooseleaf type = 1
    mon osd adjust heartbeat grace = false
    mon osd min down reporters = 1
[mon.hypervisor1]
    host = hypervisor1
    mon addr = 192.168.217.131:6789
[mon.hypervisor2]
    host = hypervisor2
    mon addr = 192.168.217.132:6789
[mon.observer]
    host = observer
    mon addr = 192.168.217.133:6789
[osd.0]
    host = hypervisor1
    public_addr = 192.168.217.131
    cluster_addr = 192.168.217.131
[osd.1]
    host = hypervisor2
    public_addr = 192.168.217.132
    cluster_addr = 192.168.217.13
# ceph config dump
WHO     MASK  LEVEL     OPTION                            VALUE     RO
global        advanced  mon_osd_adjust_down_out_interval  false
global        advanced  mon_osd_adjust_heartbeat_grace    false
global        advanced  mon_osd_down_out_interval         5
global        advanced  mon_osd_report_timeout            4
global        advanced  osd_beacon_report_interval        1
global        advanced  osd_heartbeat_grace               2
global        advanced  osd_heartbeat_interval            1
global        advanced  osd_mon_ack_timeout               1.000000
global        advanced  osd_mon_heartbeat_interval        2
global        advanced  osd_mon_report_interval           3
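For reference, a sketch of how these can be changed at runtime in case other
values are worth trying (same option names as in the dump; the values shown
are simply our current ones):
# ceph config set global osd_heartbeat_grace 2
# ceph config set global osd_heartbeat_interval 1
# ceph config set global mon_osd_down_out_interval 5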
Thanks
Hi Anthony,
Thanks for the reply.
Average values:
User: 3.5 %
Idle: 78.4 %
Wait: 20 %
System: 1.2 %
/K.
Op di 6 okt. 2020 om 10:18 schreef Anthony D'Atri <anthony.datri(a)gmail.com>:
>
>
> >
> > Diving onto the nodes we could see that the OSD daemons are consuming the
> > CPU power, resulting in average CPU loads going near 10 (!).
>
>
> FWIW, the load average doesn’t really tell you much on a multi-core
> system. I’ve run 24x SSD OSD nodes with load averages routinely >30 that
> hummed along just fine.
>
> What are the percentages for user/idle/wait/system ?
>
>
>
>
Hi,
Has anybody tried Consul as a load balancer?
Any experience?
Thank you
Hi All :),
I would like to get your feedback about the components below to build a PoC
OSD Node (I will build 3 of these).
SSD for OS.
NVMe for cache.
HDD for storage.
The Supermicro motherboard has two 10 Gb NICs, and I will use ECC memory.
[image: image.png]
Thanks for your feedback!
--
Ignacio Ocampo
Hi,
Does anyone here use Ceph iSCSI with VMware ESXi? It seems that we are hitting the 5 second timeout limit on the software iSCSI HBA in ESXi. The problem appears whenever there is increased load on the cluster, like a deep scrub or rebalance. Is this normal behaviour in production, or is there something special we need to tune?
We are on the latest Nautilus, 12 x 10 TB OSDs (4 servers), 25 Gbit/s Ethernet, an erasure-coded RBD pool with 128 PGs, around 200 PGs per OSD in total.
ESXi Log:
2020-10-04T01:57:04.314Z cpu34:2098959)WARNING: iscsi_vmk: iscsivmk_ConnReceiveAtomic:517: vmhba64:CH:1 T:0 CN:0: Failed to receive data: Connection closed by peer
2020-10-04T01:57:04.314Z cpu34:2098959)iscsi_vmk: iscsivmk_ConnRxNotifyFailure:1235: vmhba64:CH:1 T:0 CN:0: Connection rx notifying failure: Failed to Receive. State=Bound
2020-10-04T01:57:04.566Z cpu19:2098979)WARNING: iscsi_vmk: iscsivmk_StopConnection:741: vmhba64:CH:1 T:0 CN:0: iSCSI connection is being marked "OFFLINE" (Event:4)
2020-10-04T01:57:04.654Z cpu7:2097866)WARNING: VMW_SATP_ALUA: satp_alua_issueCommandOnPath:788: Probe cmd 0xa3 failed for path "vmhba64:C2:T0:L0" (0x5/0x20/0x0). Check if failover mode is still ALUA.
OSD Log:
[303088.450088] Did not receive response to NOPIN on CID: 0, failing connection for I_T Nexus iqn.1994-05.com.redhat:esxi1,i,0x00023d000002,iqn.2003-01.com.redhat.iscsi-gw:iscsi-igw,t,0x01
[324926.694077] Did not receive response to NOPIN on CID: 0, failing connection for I_T Nexus iqn.1994-05.com.redhat:esxi2,i,0x00023d000001,iqn.2003-01.com.redhat.iscsi-gw:iscsi-igw,t,0x01
[407067.404538] ABORT_TASK: Found referenced iSCSI task_tag: 5891
[407076.077175] ABORT_TASK: Sending TMR_FUNCTION_COMPLETE for ref_tag: 5891
[411677.887690] ABORT_TASK: Found referenced iSCSI task_tag: 6722
[411683.297425] ABORT_TASK: Sending TMR_FUNCTION_COMPLETE for ref_tag: 6722
[481459.755876] ABORT_TASK: Found referenced iSCSI task_tag: 7930
[481460.787968] ABORT_TASK: Sending TMR_FUNCTION_COMPLETE for ref_tag: 7930
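If throttling the background work is the right direction, this is a sketch of
what we would try next (standard osd options; the values are guesses and
would need tuning for our cluster):
# ceph config set osd osd_max_backfills 1
# ceph config set osd osd_recovery_max_active 1
# ceph config set osd osd_scrub_begin_hour 23
# ceph config set osd osd_scrub_end_hour 6
The last two would confine (deep) scrubs to off-hours instead of letting them
collide with the VM I/O.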
Cheers,
Martin