Hi,
All of a sudden, we are experiencing very concerning MON behaviour. We have five MONs, and all of them have thousands to tens of thousands of slow ops, with the oldest one blocking basically indefinitely (at least the timer keeps creeping up). Additionally, the MON stores keep inflating heavily. Under normal circumstances we have about 450-550 MB there; right now it's 27 GB and growing (rapidly).
I tried restarting all MONs, I disabled auto-scaling (just in case) and checked the system load and hardware. I also restarted the MGR and MDS daemons, but to no avail.
Is there any way I can debug this properly? I can’t seem to find how I can actually view what ops are causing this and what client (if any) may be responsible for it.
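For reference, the angles I've found so far (just a sketch, assuming the mon admin sockets under /var/run/ceph are available; I'm not sure these are the right tools):

```shell
# Dump the slow/in-flight ops on one monitor (shows each op's age and type):
ceph daemon mon.$(hostname -s) ops
# List the client sessions currently connected to this monitor:
ceph daemon mon.$(hostname -s) sessions
# Watch how fast the mon store is actually growing on disk:
du -sh /var/lib/ceph/mon/*/store.db
```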
Thanks
Janek
Hi,
I've tried to save some PGs from a dead OSD. Here's what I did: on the same server I picked an OSD that is not really used, stopped it, and imported the PG exported from the dead one.
root@server:~# ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-33 --no-mon-config --pgid 44.c0s0 --op export --file ./pg44c0s0
Exporting 44.c0s0 info 44.c0s0( empty local-lis/les=0/0 n=0 ec=192123/175799 lis/c=4865474/4851556 les/c/f=4865475/4851557/0 sis=4865493)
Export successful
root@server:~# ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-34 --no-mon-config --op import --file ./pg44c0s0
get_pg_num_history pg_num_history pg_num_history(e5583546 pg_nums {20={173213=256},21={219434=64},22={220991=64},24={219240=32},25={1446965=128},42={175793=32},43={197388=64},44={192123=512}} deleted_pools )
Importing pgid 44.c0s0
write_pg epoch 4865498 info 44.c0s0( empty local-lis/les=0/0 n=0 ec=192123/175799 lis/c=4865474/4851556 les/c/f=4865475/4851557/0 sis=4865493)
Import successful
I started osd.34 back up, and systemd says the OSD is running, but in the cluster map it is still down :/
root@server:~# systemctl status ceph-osd@34 -l
● ceph-osd@34.service - Ceph object storage daemon osd.34
Loaded: loaded (/lib/systemd/system/ceph-osd@.service; enabled-runtime; vendor preset: enabled)
Active: active (running) since Thu 2021-03-18 10:38:00 CET; 8min ago
Process: 45388 ExecStartPre=/usr/lib/ceph/ceph-osd-prestart.sh --cluster ${CLUSTER} --id 34 (code=exited, sta>
Main PID: 45392 (ceph-osd)
Tasks: 60
Memory: 856.2M
CGroup: /system.slice/system-ceph\x2dosd.slice/ceph-osd@34.service
└─45392 /usr/bin/ceph-osd -f --cluster ceph --id 34 --setuser ceph --setgroup ceph
Mar 18 10:38:00 server systemd[1]: Starting Ceph object storage daemon osd.34...
Mar 18 10:38:00 server systemd[1]: Started Ceph object storage daemon osd.34.
Mar 18 10:38:21 server ceph-osd[45392]: 2021-03-18T10:38:21.817+0100 7f41738d5dc0 -1 osd.34 5583546 log_to_mon>
Mar 18 10:38:21 server ceph-osd[45392]: 2021-03-18T10:38:21.825+0100 7f41738d5dc0 -1 osd.34 5583546 mon_cmd_ma>
Any idea?
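In case it helps, these are the checks I'm planning next (a sketch, using osd.34 from the output above):

```shell
ceph osd tree | grep -w 'osd.34'   # is it marked down, and where in the CRUSH map?
ceph osd dump | grep -w 'osd.34'   # any flags such as destroyed or out?
# The systemd log lines above are truncated; the full text may name the cause:
journalctl -u ceph-osd@34 --no-pager -n 50
```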
________________________________
I have a Ceph cluster with 5 nodes: 3 in one building and 2 in the other. I put this information into the CRUSH map so that Ceph places one copy of each object on the nodes of one building and the other copy on the nodes of the other building; that is, I set replicas=2 in order to have the same information in both locations. But I know a Ceph cluster needs half + 1 nodes up to keep quorum. I need at least a manual procedure to recover one of the two buildings if the other goes down, or even if the link between them goes down. I don't need 100% uptime, just a way to block and unblock some nodes, and to bring the two-node side up if the building that went down is the one with 3 nodes.
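As a starting point for such a manual procedure, my understanding is that the surviving monitors can be forced into quorum by editing the monmap. A rough, untested sketch (mon IDs a-e are placeholders, with a/b in the surviving building):

```shell
systemctl stop ceph-mon@a                       # stop a surviving mon first
ceph-mon -i a --extract-monmap /tmp/monmap      # pull the current monmap from its store
monmaptool /tmp/monmap --rm c --rm d --rm e     # drop the monitors in the unreachable building
ceph-mon -i a --inject-monmap /tmp/monmap       # inject the reduced map
systemctl start ceph-mon@a                      # the remaining mon(s) can now form quorum
```

The inject step would have to be repeated on each surviving mon, and it is destructive, so only for a real disaster.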
Hi,
What use is made of the ident data in the telemetry module? It's
disabled by default, and the docs don't seem to say what it's used for...
Thanks,
Matthew
Hi Guys,
So, new issue (I'm gonna get the hang of this if it kills me :-) ).
I have a working/healthy Ceph (Octopus) cluster (with qemu-img, libvirt,
etc., installed), and an erasure-coded pool called "my_pool". I now need
to create a "my_data" image within the "my_pool" pool. As this is for a
KVM host / block device (hence the qemu-img et al.), I'm attempting to
use qemu-img, so the command I am using is:
```
qemu-img create -f rbd rbd:my_pool/my_data 1T
```
The error message I received was:
```
qemu-img: rbd:my_pool/my_data: error rbd create: Operation not supported
```
So, I tried the 'raw' rbd command:
```
rbd create -s 1T my_pool/my_data
```
and got the error:
```
_add_image_to_directory: error adding image to directory: (95) Operation
not supported
rbd: create error: (95) Operation not supported
```
So I don't believe the issue is with the 'qemu-img' command - but I may
be wrong.
After doing some research I *think* I need to specify a replicated (as
opposed to erasure-coded) pool for my_pool's metadata (e.g.
'my_pool_metadata'), and thus use the command:
```
rbd create -s 1T --data-pool my_pool my_pool_metadata/my_data
```
First Question: Is this correct?
Second Question: What is the qemu-img equivalent command - is it:
```
qemu-img create -f rbd rbd:--data-pool my_pool my_pool_metadata/my_data 1T
```
or something similar?
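For what it's worth, here is what I'm planning to try next based on my reading (pool names as above; completely untested by me): enable overwrites on the EC pool, keep the image metadata in a replicated pool, and let qemu-img pick the data pool up from the client config rather than the command line.

```shell
# Allow RBD to do partial overwrites on the erasure-coded pool:
ceph osd pool set my_pool allow_ec_overwrites true
# Metadata in the replicated pool, data in the EC pool:
rbd create --size 1T --data-pool my_pool my_pool_metadata/my_data
# For qemu-img, set the data pool in ceph.conf instead of on the command line:
#   [client]
#   rbd default data pool = my_pool
qemu-img create -f rbd rbd:my_pool_metadata/my_data 1T
```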
Thanks in advance
Dulux-Oz
Hi all
When setting a quota on a pool (or on a directory in CephFS), is it the amount of client data written, or the client data × the number of replicas, that counts toward the quota?
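For concreteness, these are the two kinds of quota I mean (a sketch; 'mypool' and the mount path are placeholders):

```shell
# Pool quota, set in bytes (100 GB here):
ceph osd pool set-quota mypool max_bytes 107374182400
# CephFS directory quota, set as an xattr on a mounted directory:
setfattr -n ceph.quota.max_bytes -v 107374182400 /mnt/cephfs/somedir
```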
Cheers
A
Hello,
If anybody out there has tried this or thought about it, I'd like to know...
I've been thinking about ways to squeeze as much performance as possible
from the NICs on a Ceph OSD node. The nodes in our cluster (6 x OSD, 3
x MGR/MON/MDS/RGW) each have 2 x 10Gb ports. Currently, one port
is assigned to the front-side network, and one to the back-side
network. However, there are times when the traffic on one side or the
other is more intense and might benefit from a bit more bandwidth.
The idea I had was to bond the two ports together, and to run the
back-side network in a tagged VLAN on the combined 20Gb LACP port. In
order to keep the balance and prevent starvation on either side, it
would be necessary to apply some sort of weighted fair queuing
mechanism via the 'tc' command. The idea is that if the client side
isn't using up its full 10Gb/node and there is a burst of re-balancing
activity, the bandwidth consumed by the back-side traffic could swell to
15Gb or more. Or vice versa.
From what I have read and studied, these algorithms are fairly
responsive to changes in load and would thus adjust rapidly if the
demand from either side suddenly changed.
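Concretely, the kind of thing I have in mind (untested; bond0 and back-side VLAN 100 are assumptions) is an HTB root where each side is guaranteed 10Gb but can borrow up to the full 20Gb when the other is idle:

```shell
tc qdisc add dev bond0 root handle 1: htb default 10
tc class add dev bond0 parent 1:  classid 1:1  htb rate 20gbit
tc class add dev bond0 parent 1:1 classid 1:10 htb rate 10gbit ceil 20gbit  # front side
tc class add dev bond0 parent 1:1 classid 1:20 htb rate 10gbit ceil 20gbit  # back side
# Steer the tagged back-side VLAN into its own class:
tc filter add dev bond0 parent 1: protocol 802.1q flower vlan_id 100 classid 1:20
```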
Maybe this is a crazy idea, or maybe it's really cool. Your thoughts?
Thanks.
-Dave
--
Dave Hall
Binghamton University
kdhall@binghamton.edu
Hi Guys,
Is the below "ceph -s" output normal?
This is a brand-new cluster with (at the moment) a single Monitor and 7
OSDs (each 6 TB) that has no data in it (yet), and yet it's taking
almost a day to "heal itself" after adding in the 2nd OSD.
~~~
  cluster:
    id:     [REDACTED]
    health: HEALTH_WARN
            Reduced data availability: 256 pgs inactive, 256 pgs incomplete
            Degraded data redundancy: 12 pgs undersized

  services:
    mon: 1 daemons, quorum [REDACTED] (age 22h)
    mgr: [REDACTED](active, since 22h)
    osd: 7 osds: 7 up (since 21h), 7 in (since 21h); 32 remapped pgs

  data:
    pools:   5 pools, 288 pgs
    objects: 7 objects, 0 B
    usage:   7.1 GiB used, 38 TiB / 38 TiB avail
    pgs:     88.889% pgs not active
             6/21 objects misplaced (28.571%)
             256 creating+incomplete
             18  active+clean
             12  active+undersized+remapped
             2   active+clean+remapped

  progress:
    Rebalancing after osd.1 marked in (22h)
      [............................]
    PG autoscaler decreasing pool 1 PGs from 32 to 1 (19h)
      [............................]
~~~
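For reference, these are the commands I've been poking at to see why the 256 PGs stay creating+incomplete (a sketch):

```shell
ceph osd pool ls detail      # size / min_size / crush rule for each pool
ceph osd crush rule dump     # failure domain (e.g. 'host' vs 'osd')
ceph pg dump_stuck inactive  # which PGs are stuck, and on which OSDs
ceph health detail
```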
Thanks in advance
Matthew J
Hi,
I hope someone here can help me out with contact details, an email address, or a phone number for Samsung datacenter SSD support? When I contact the standard Samsung datacenter support, they tell me they are not there to support PM1735 drives.
We are planning a new Ceph cluster and are considering Samsung PM1735 NVMe U.2 SSDs.
Unfortunately the PM1735 is not available with a U.2 interface, but the PM1733 is.
A manager from Samsung once told me that the PM1733 and PM1735 are exactly the same hardware, only provisioned differently, but he did not know whom to ask. Any idea whom I could contact at Samsung, or how to provision a PM1733 (7.6 TB) down to a PM1735 (6.4 TB)?
I want the over-provisioning for better endurance (3 DWPD instead of 1 DWPD).
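In case it's relevant to the provisioning question: my understanding is that over-provisioning can also be done in software via NVMe namespace management, something like the following (a rough, untested sketch; /dev/nvme0 is a placeholder, this destroys all data, and the drive must actually support namespace management):

```shell
nvme id-ctrl /dev/nvme0 | grep -i oacs      # does the controller support ns management?
nvme delete-ns /dev/nvme0 -n 1              # DESTROYS the existing namespace
# ~6.4 TB expressed in 512-byte blocks (illustrative value only):
nvme create-ns /dev/nvme0 --nsze=12500000000 --ncap=12500000000 --flbas=0
nvme attach-ns /dev/nvme0 -n 1 -c 0
```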