Thanks to @Anthony:
Digging further, I see that I was probably blinded by the CPU load...
I see that some disks are very slow (so my first observations were
incorrect), and the latency seen with iostat is more or less the same
as what we see in dump_historic_ops (plus ~3 s for r_await).
So, it looks like a few OSDs are causing a bottleneck in the whole system.
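For reference, this is roughly how I am comparing iostat with
dump_historic_ops; osd.12 and /dev/sdX are just placeholders for one of the
slow OSDs and its backing device, and the commands run on the host carrying
that OSD:
# ceph daemon osd.12 dump_historic_ops | grep -E '"description"|"duration"'
# iostat -x 5 /dev/sdX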
I'm now wondering what my options are to improve performance... The
main goal is to get the system usable again and to make sure write
operations are not affected.
- Temporarily setting the weight of the slow OSDs to 0? That way recovery
can go on, but new data is not written to those disks? (A sketch of the
command is below the list.)
- ....
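If I go the weight-to-0 route, I assume it looks something like this (osd.12
is just a placeholder id for one of the slow OSDs; I would note its current
weight with 'ceph osd df tree' first so it can be restored later):
# ceph osd crush reweight osd.12 0
Note that dropping the crush weight to 0 also makes the cluster move the
existing data off that OSD, so it adds backfill traffic on top of the
ongoing recovery.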
Still investigating...
Hi,
Is there any documentation about mapping usernames, user IDs,
group names and group IDs between hosts sharing the same CephFS storage?
Thanks for any hint,
Renne
The important metric is the difference between these two values:
# ceph report | grep osdmap | grep committed
report 3324953770
"osdmap_first_committed": 3441952,
"osdmap_last_committed": 3442452,
The mon stores osdmaps on disk, and trims the older versions whenever
the PGs are clean. Trimming brings osdmap_first_committed closer to
osdmap_last_committed.
In a cluster with no PGs backfilling or recovering, the mon should
trim that difference to be within 500-750 epochs.
If there are any PGs backfilling or recovering, then the mon will not
trim beyond the osdmap epoch when the pools were clean.
So if you are accumulating gigabytes of data in the mon dir, it
suggests that you have unclean PGs/Pools.
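A quick way to watch that difference (assuming jq is installed; ceph report
prints its checksum line to stderr, hence the redirect):
# ceph report 2>/dev/null | jq '.osdmap_last_committed - .osdmap_first_committed'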
Cheers, dan
On Fri, Oct 2, 2020 at 4:14 PM Marc Roos <M.Roos(a)f1-outsourcing.eu> wrote:
>
>
> Does this also apply if your cluster is not healthy because of errors
> like '2 pool(s) have no replicas configured'? I sometimes use these pools
> for testing; they are empty.
>
>
>
>
> -----Original Message-----
> Cc: ceph-users
> Subject: [ceph-users] Re: Massive Mon DB Size with noout on 14.2.11
>
> As long as the cluster is not healthy, the OSD will require much more
> space, depending on the cluster size and other factors. Yes, this is
> somewhat normal.
>
Hi everybody,
Our need is to do VM failover using a disk image on RBD, to avoid data loss. We want to limit the downtime as much as
possible.
We have:
- Two hypervisors, each with a Ceph Monitor and a Ceph OSD.
- A third machine with a Ceph Monitor and a Ceph Manager.
VMs are running on QEMU. The VM disks are on a "replicated" RBD pool formed by the two OSDs.
Ceph version: Nautilus
Distribution: Yocto Zeus
The following test is performed: we electrically turn off one hypervisor (and therefore a Ceph Monitor and a Ceph OSD),
which causes its VMs to switch to the second hypervisor.
My main issue is that the mount time of a partition in rw is very slow in the case of a failover (after the loss of an
OSD and its monitor).
With failover we can write to the device after ~25 s:
[ 25.609074] EXT4-fs (vda3): mounted filesystem with ordered data mode. Opts: (null)
In a normal boot we can write to the device after ~4 s:
[ 3.087412] EXT4-fs (vda3): mounted filesystem with ordered data mode. Opts: (null)
I wasn't able to reduce this time by tweaking Ceph settings. I am wondering if someone could help me on that.
Here is our configuration.
ceph.conf:
[global]
    fsid = fa7a17d1-5351-459e-bf0e-07e7edc9a625
    mon initial members = hypervisor1,hypervisor2,observer
    mon host = 192.168.217.131,192.168.217.132,192.168.217.133
    public network = 192.168.217.0/24
    auth cluster required = cephx
    auth service required = cephx
    auth client required = cephx
    osd journal size = 1024
    osd pool default size = 2
    osd pool default min size = 1
    osd crush chooseleaf type = 1
    mon osd adjust heartbeat grace = false
    mon osd min down reporters = 1
[mon.hypervisor1]
    host = hypervisor1
    mon addr = 192.168.217.131:6789
[mon.hypervisor2]
    host = hypervisor2
    mon addr = 192.168.217.132:6789
[mon.observer]
    host = observer
    mon addr = 192.168.217.133:6789
[osd.0]
    host = hypervisor1
    public_addr = 192.168.217.131
    cluster_addr = 192.168.217.131
[osd.1]
    host = hypervisor2
    public_addr = 192.168.217.132
    cluster_addr = 192.168.217.13
# ceph config dump
WHO     MASK  LEVEL     OPTION                            VALUE     RO
global        advanced  mon_osd_adjust_down_out_interval  false
global        advanced  mon_osd_adjust_heartbeat_grace    false
global        advanced  mon_osd_down_out_interval         5
global        advanced  mon_osd_report_timeout            4
global        advanced  osd_beacon_report_interval        1
global        advanced  osd_heartbeat_grace               2
global        advanced  osd_heartbeat_interval            1
global        advanced  osd_mon_ack_timeout               1.000000
global        advanced  osd_mon_heartbeat_interval        2
global        advanced  osd_mon_report_interval           3
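For reference, a sketch of how these can be changed at runtime in case other
values are worth trying (same option names as in the dump; the values shown
are simply our current ones):
# ceph config set global osd_heartbeat_grace 2
# ceph config set global osd_heartbeat_interval 1
# ceph config set global mon_osd_down_out_interval 5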
Thanks
Hi Anthony,
Thanks for the reply.
Average values:
User: 3.5 %
Idle: 78.4 %
Wait: 20 %
System: 1.2 %
/K.
Op di 6 okt. 2020 om 10:18 schreef Anthony D'Atri <anthony.datri(a)gmail.com>:
>
>
> >
> > Diving onto the nodes we could see that the OSD daemons are consuming the
> > CPU power, resulting in average CPU loads going near 10 (!).
>
>
> FWIW, the load average doesn’t really tell you much on a multi-core
> system. I’ve run 24x SSD OSD nodes with load averages routinely >30 that
> hummed along just fine.
>
> What are the percentages for user/idle/wait/system ?
>
>
>
>
Hi,
Has anybody tried Consul as a load balancer?
Any experience?
Thank you
Hi All :),
I would like to get your feedback about the components below to build a PoC
OSD Node (I will build 3 of these).
SSD for OS.
NVMe for cache.
HDD for storage.
The Supermicro motherboard has two 10 Gb NICs, and I will use ECC memory.
[image: image.png]
Thanks for your feedback!
--
Ignacio Ocampo
Hi,
Does anyone here use Ceph iSCSI with VMware ESXi? It seems that we are hitting the 5 second timeout limit on the software iSCSI HBA in ESXi. The problem appears whenever there is increased load on the cluster, like a deep scrub or rebalance. Is this normal behaviour in production, or is there something special we need to tune?
We are on the latest Nautilus, 12 x 10 TB OSDs (4 servers), 25 Gbit/s Ethernet, an erasure-coded RBD pool with 128 PGs, around 200 PGs per OSD in total.
ESXi Log:
2020-10-04T01:57:04.314Z cpu34:2098959)WARNING: iscsi_vmk: iscsivmk_ConnReceiveAtomic:517: vmhba64:CH:1 T:0 CN:0: Failed to receive data: Connection closed by peer
2020-10-04T01:57:04.314Z cpu34:2098959)iscsi_vmk: iscsivmk_ConnRxNotifyFailure:1235: vmhba64:CH:1 T:0 CN:0: Connection rx notifying failure: Failed to Receive. State=Bound
2020-10-04T01:57:04.566Z cpu19:2098979)WARNING: iscsi_vmk: iscsivmk_StopConnection:741: vmhba64:CH:1 T:0 CN:0: iSCSI connection is being marked "OFFLINE" (Event:4)
2020-10-04T01:57:04.654Z cpu7:2097866)WARNING: VMW_SATP_ALUA: satp_alua_issueCommandOnPath:788: Probe cmd 0xa3 failed for path "vmhba64:C2:T0:L0" (0x5/0x20/0x0). Check if failover mode is still ALUA.
OSD Log:
[303088.450088] Did not receive response to NOPIN on CID: 0, failing connection for I_T Nexus iqn.1994-05.com.redhat:esxi1,i,0x00023d000002,iqn.2003-01.com.redhat.iscsi-gw:iscsi-igw,t,0x01
[324926.694077] Did not receive response to NOPIN on CID: 0, failing connection for I_T Nexus iqn.1994-05.com.redhat:esxi2,i,0x00023d000001,iqn.2003-01.com.redhat.iscsi-gw:iscsi-igw,t,0x01
[407067.404538] ABORT_TASK: Found referenced iSCSI task_tag: 5891
[407076.077175] ABORT_TASK: Sending TMR_FUNCTION_COMPLETE for ref_tag: 5891
[411677.887690] ABORT_TASK: Found referenced iSCSI task_tag: 6722
[411683.297425] ABORT_TASK: Sending TMR_FUNCTION_COMPLETE for ref_tag: 6722
[481459.755876] ABORT_TASK: Found referenced iSCSI task_tag: 7930
[481460.787968] ABORT_TASK: Sending TMR_FUNCTION_COMPLETE for ref_tag: 7930
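If throttling the background work is the right direction, this is a sketch of
what we would try next (standard osd options; the values are guesses and
would need tuning for our cluster):
# ceph config set osd osd_max_backfills 1
# ceph config set osd osd_recovery_max_active 1
# ceph config set osd osd_scrub_begin_hour 23
# ceph config set osd osd_scrub_end_hour 6
The last two would confine (deep) scrubs to off-hours instead of letting them
collide with the VM I/O.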
Cheers,
Martin