Hi guys, I deployed an EFK cluster that uses Ceph as block storage in Kubernetes, but the RBD write IOPS sometimes drops to zero and stays there for a few minutes. I want to check the RBD logs, so I added some config to ceph.conf and restarted Ceph.
Here is my ceph.conf:
[global]
fsid = 53f4e1d5-32ce-4e9c-bf36-f6b54b009962
mon_initial_members = db-16-4-hzxs, db-16-5-hzxs, db-16-6-hzxs
mon_host = 10.25.16.4,10.25.16.5,10.25.16.6
auth_cluster_required = cephx
auth_service_required = cephx
auth_client_required = cephx
osd pool default size = 3
[client]
debug rbd = 20
debug rbd mirror = 20
debug rbd replay = 20
log file = /var/log/ceph/client_rbd.log
I cannot get any logs in /var/log/ceph/client_rbd.log. I also tried executing 'ceph daemon osd.* config set debug_rbd 20', but there are no related logs in ceph-osd.log either.
How can I get useful logs for this issue, or how should I analyze the problem? Looking forward to your reply.
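As a sanity check, I am also thinking of forcing librbd logging from the command line to confirm the settings can work at all (just a sketch; the pool name 'kube' below is a placeholder for my real pool):

$ rbd --debug-rbd 20 --log-file /tmp/rbd-test.log ls kube

If that writes to /tmp/rbd-test.log but the Kubernetes clients still log nothing, my guess is the volumes are mapped with the kernel RBD client (krbd), which does not go through librbd and therefore ignores these debug/log settings.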
Thanks
ceph version 14.2.4 (75f4de193b3ea58512f204623e6c5a16e6c1e1ba) nautilus (stable)
OS: CentOS Linux release 7.7.1908 (Core)
Single-node Ceph cluster with 1 mon, 1 mgr, 1 mds, 1 rgw, and 12 OSDs, but only CephFS is used.
'ceph -s' hangs after shutting down the machine (192.168.0.104); its IP address then changed to 192.168.1.6.
I recreated the monmap with monmaptool, updated ceph.conf and the hosts file, and then started ceph-mon.
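Roughly what I did was the following (a sketch from memory; the exact paths may have differed):

$ systemctl stop ceph-mon@ceph-node1
$ monmaptool --create --add ceph-node1 192.168.1.6 --fsid e384e8e6-94d5-4812-bfbb-d1b0468bdef5 --clobber /tmp/monmap
$ ceph-mon -i ceph-node1 --inject-monmap /tmp/monmap
$ systemctl start ceph-mon@ceph-node1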
Here is the ceph-mon log:
...
2019-12-11 08:57:45.170 7f952cdac700 1 mon.ceph-node1@0(leader).mds e34 no beacon from mds.0.10 (gid: 4384 addr: [v2:192.168.0.104:6898/4084823750,v1:192.168.0.104:6899/4084823750] state: up:active) since 1285.14s
2019-12-11 08:57:50.170 7f952cdac700 1 mon.ceph-node1@0(leader).mds e34 no beacon from mds.0.10 (gid: 4384 addr: [v2:192.168.0.104:6898/4084823750,v1:192.168.0.104:6899/4084823750] state: up:active) since 1290.14s
2019-12-11 08:57:55.171 7f952cdac700 1 mon.ceph-node1@0(leader).mds e34 no beacon from mds.0.10 (gid: 4384 addr: [v2:192.168.0.104:6898/4084823750,v1:192.168.0.104:6899/4084823750] state: up:active) since 1295.14s
2019-12-11 08:58:00.171 7f952cdac700 1 mon.ceph-node1@0(leader).mds e34 no beacon from mds.0.10 (gid: 4384 addr: [v2:192.168.0.104:6898/4084823750,v1:192.168.0.104:6899/4084823750] state: up:active) since 1300.14s
2019-12-11 08:58:05.172 7f952cdac700 1 mon.ceph-node1@0(leader).mds e34 no beacon from mds.0.10 (gid: 4384 addr: [v2:192.168.0.104:6898/4084823750,v1:192.168.0.104:6899/4084823750] state: up:active) since 1305.14s
2019-12-11 08:58:10.171 7f952cdac700 1 mon.ceph-node1@0(leader).mds e34 no beacon from mds.0.10 (gid: 4384 addr: [v2:192.168.0.104:6898/4084823750,v1:192.168.0.104:6899/4084823750] state: up:active) since 1310.14s
2019-12-11 08:58:15.173 7f952cdac700 1 mon.ceph-node1@0(leader).mds e34 no beacon from mds.0.10 (gid: 4384 addr: [v2:192.168.0.104:6898/4084823750,v1:192.168.0.104:6899/4084823750] state: up:active) since 1315.14s
2019-12-11 08:58:20.173 7f952cdac700 1 mon.ceph-node1@0(leader).mds e34 no beacon from mds.0.10 (gid: 4384 addr: [v2:192.168.0.104:6898/4084823750,v1:192.168.0.104:6899/4084823750] state: up:active) since 1320.14s
2019-12-11 08:58:25.174 7f952cdac700 1 mon.ceph-node1@0(leader).mds e34 no beacon from mds.0.10 (gid: 4384 addr: [v2:192.168.0.104:6898/4084823750,v1:192.168.0.104:6899/4084823750] state: up:active) since 1325.14s
...
I changed the IP back to 192.168.0.104 yesterday, but the result is the same.
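To double-check what the monitor store actually contains now, I guess I can dump the current monmap (a sketch; run with ceph-mon stopped):

$ systemctl stop ceph-mon@ceph-node1
$ ceph-mon -i ceph-node1 --extract-monmap /tmp/monmap.cur
$ monmaptool --print /tmp/monmap.cur
$ systemctl start ceph-mon@ceph-node1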
# cat /etc/ceph/ceph.conf
[client.libvirt]
admin socket = /var/run/ceph/$cluster-$type.$id.$pid.$cctid.asok # must be writable by QEMU and allowed by SELinux or AppArmor
log file = /var/log/ceph/qemu-guest-$pid.log # must be writable by QEMU and allowed by SELinux or AppArmor
[client.rgw.ceph-node1.rgw0]
host = ceph-node1
keyring = /var/lib/ceph/radosgw/ceph-rgw.ceph-node1.rgw0/keyring
log file = /var/log/ceph/ceph-rgw-ceph-node1.rgw0.log
rgw frontends = beast endpoint=192.168.1.6:8080
rgw thread pool size = 512
# Please do not change this file directly since it is managed by Ansible and will be overwritten
[global]
cluster network = 192.168.1.0/24
fsid = e384e8e6-94d5-4812-bfbb-d1b0468bdef5
mon host = [v2:192.168.1.6:3300,v1:192.168.1.6:6789]
mon initial members = ceph-node1
osd crush chooseleaf type = 0
osd pool default crush rule = -1
public network = 192.168.1.0/24
[osd]
osd memory target = 7870655146
The last two days we've experienced a couple of short outages shortly after
setting both 'noscrub' and 'nodeep-scrub' on one of our largest Ceph clusters
(~2,200 OSDs). This cluster is running Nautilus (14.2.6) and setting/unsetting
these flags has been done many times in the past without a problem.
One thing I've noticed is that on both days, right after setting 'noscrub' or
'nodeep-scrub', a do_prune message shows up in the monitor logs, followed by
a timeout. About 30 seconds later we start seeing OSDs getting marked down:
2020-06-03 08:06:53.914 7fcc3ed57700 0 mon.p3cephmon004@0(leader) e11 handle_command mon_command({"prefix": "osd set", "key": "noscrub"} v 0) v1
2020-06-03 08:06:53.914 7fcc3ed57700 0 log_channel(audit) log [INF] : from='client.5773023471 10.2.128.8:0/523139029' entity='client.admin' cmd=[{"prefix": "osd set", "key": "noscrub"}]: dispatch
2020-06-03 08:06:54.231 7fcc4155c700 1 mon.p3cephmon004@0(leader).osd e1535232 do_prune osdmap full prune enabled
2020-06-03 08:06:54.318 7fcc3f558700 1 heartbeat_map reset_timeout 'Monitor::cpu_tp thread 0x7fcc3f558700' had timed out after 0
2020-06-03 08:06:54.319 7fcc4055a700 1 heartbeat_map reset_timeout 'Monitor::cpu_tp thread 0x7fcc4055a700' had timed out after 0
2020-06-03 08:06:54.319 7fcc40d5b700 1 heartbeat_map reset_timeout 'Monitor::cpu_tp thread 0x7fcc40d5b700' had timed out after 0
2020-06-03 08:06:54.319 7fcc3fd59700 1 heartbeat_map reset_timeout 'Monitor::cpu_tp thread 0x7fcc3fd59700' had timed out after 0
...
2020-06-03 08:07:16.049 7fcc3ed57700 1 mon.p3cephmon004@0(leader).osd e1535234 prepare_failure osd.736 [v2:10.6.170.130:6816/1294580,v1:10.6.170.130:6817/1294580] from osd.1165 is reporting failure:1
2020-06-03 08:07:16.049 7fcc3ed57700 0 log_channel(cluster) log [DBG] : osd.736 reported failed by osd.1165
2020-06-03 08:07:16.304 7fcc3ed57700 1 mon.p3cephmon004@0(leader).osd e1535234 prepare_failure osd.736 [v2:10.6.170.130:6816/1294580,v1:10.6.170.130:6817/1294580] from osd.127 is reporting failure:1
2020-06-03 08:07:16.304 7fcc3ed57700 0 log_channel(cluster) log [DBG] : osd.736 reported failed by osd.127
2020-06-03 08:07:16.693 7fcc3ed57700 1 mon.p3cephmon004@0(leader).osd e1535234 prepare_failure osd.736 [v2:10.6.170.130:6816/1294580,v1:10.6.170.130:6817/1294580] from osd.1455 is reporting failure:1
2020-06-03 08:07:16.693 7fcc3ed57700 0 log_channel(cluster) log [DBG] : osd.736 reported failed by osd.1455
2020-06-03 08:07:16.695 7fcc3ed57700 1 mon.p3cephmon004@0(leader).osd e1535234 we have enough reporters to mark osd.736 down
2020-06-03 08:07:16.696 7fcc3ed57700 0 log_channel(cluster) log [INF] : osd.736 failed (root=default,rack=S06-06,chassis=S06-06-17,host=p3cephosd386) (3 reporters from different host after 20.389591 >= grace 20.025280)
2020-06-03 08:07:16.696 7fcc3ed57700 1 mon.p3cephmon004@0(leader).osd e1535234 prepare_failure osd.1463 [v2:10.7.208.30:6824/3947672,v1:10.7.208.30:6825/3947672] from osd.1455 is reporting failure:1
2020-06-03 08:07:16.696 7fcc3ed57700 0 log_channel(cluster) log [DBG] : osd.1463 reported failed by osd.1455
2020-06-03 08:07:16.758 7fcc3ed57700 1 mon.p3cephmon004@0(leader).osd e1535234 prepare_failure osd.1463 [v2:10.7.208.30:6824/3947672,v1:10.7.208.30:6825/3947672] from osd.2108 is reporting failure:1
2020-06-03 08:07:16.758 7fcc3ed57700 0 log_channel(cluster) log [DBG] : osd.1463 reported failed by osd.2108
2020-06-03 08:07:16.800 7fcc3ed57700 1 mon.p3cephmon004@0(leader).osd e1535234 prepare_failure osd.1463 [v2:10.7.208.30:6824/3947672,v1:10.7.208.30:6825/3947672] from osd.1166 is reporting failure:1
2020-06-03 08:07:16.800 7fcc3ed57700 0 log_channel(cluster) log [DBG] : osd.1463 reported failed by osd.1166
2020-06-03 08:07:16.835 7fcc4155c700 1 mon.p3cephmon004@0(leader).osd e1535234 do_prune osdmap full prune enabled
...
Does anyone know why setting the no-scrub flags would cause such an issue?
Or is this a known issue with a fix in 14.2.9 or 14.2.10 (when it comes out)?
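Since the do_prune line points at osdmap full pruning on the monitor, my plan is to start by looking at the pruning settings and the mon state on the leader (a sketch; mon.p3cephmon004 is our leader, run on that host via the admin socket):

$ ceph daemon mon.p3cephmon004 config show | grep mon_osdmap_full_prune
$ ceph daemon mon.p3cephmon004 mon_status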
Thanks,
Bryan
Hi,
I've been using pg-upmap items both in the ceph balancer and by hand
running osdmaptool for a while now (on Ceph 12.2.13).
But I've noticed a side effect of upmap items which can sometimes lead to
some unnecessary data movement.
My understanding is that the ceph osdmap keeps track of upmap-items that I
undo (in my case using the CERN script upmap-remapped.py).
These can be seen in the osdmap (or osd dump) json output. It looks, for
example, like this:
"pg_upmap_items": [
{
"pgid": "9.10",
"mappings": [
{
"from": 1761,
"to": 6
}
]
},
When upmapping pg 9.10, I first need to clear this pg_upmap_item by
executing an rm-pg-upmap-items command:
ceph osd rm-pg-upmap-items 9.10
All this does is remove the existing from/to mapping (so the PG moves back
from osd.6 to osd.1761), which is sometimes not useful.
I would prefer to just "forget" this upmap, i.e. remove it permanently
from pg_upmap_items. Is there any way to do this?
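For reference, this is how I currently inspect and re-apply the entries (a sketch; it assumes jq is available, and pg 9.10 with osds 1761/6 are just the example from above):

$ ceph osd dump -f json | jq '.pg_upmap_items[] | select(.pgid == "9.10")'
$ ceph osd rm-pg-upmap-items 9.10
$ ceph osd pg-upmap-items 9.10 1761 6    # re-creates the osd.1761 -> osd.6 mapping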
Cheers,
Toms
Hi All,
I am trying to upgrade Ceph 15.2.1 to 15.2.3. I have a two-node setup in a small environment, for testing only. I ran the following commands:
$ ceph mon ok-to-stop mon.vx-rg23-rk65-u43-130
>> quorum should be preserved (vx-rg23-rk65-u43-130,vx-rg23-rk65-u43-130-1) after stopping [mon.vx-rg23-rk65-u43-130]
$ ceph orch upgrade start --ceph-version 15.2.3
However, Ceph says it is NOT safe to stop mon.vx-rg23-rk65-u43-130.
Debug log:
2020-06-01T22:09:21.967310+0000 mgr.vx-rg23-rk65-u43-130-1.pxmyie [INF] Upgrade: Checking mon daemons...
2020-06-01T22:09:21.967426+0000 mgr.vx-rg23-rk65-u43-130-1.pxmyie [DBG] daemon mon.vx-rg23-rk65-u43-130 not correct (docker.io/ceph/ceph:v15, bc83a388465f0568dab4501fb7684398dca8b50ca12a342a57f21815721723c2, 15.2.1)
2020-06-01T22:09:21.967563+0000 mgr.vx-rg23-rk65-u43-130-1.pxmyie [DBG] Have connection to vx-rg23-rk65-u43-130
2020-06-01T22:09:21.967668+0000 mgr.vx-rg23-rk65-u43-130-1.pxmyie [DBG] None container image docker.io/ceph/ceph:v15.2.3
2020-06-01T22:09:21.967778+0000 mgr.vx-rg23-rk65-u43-130-1.pxmyie [DBG] args:
--image docker.io/ceph/ceph:v15.2.3 inspect-image
2020-06-01T22:09:23.400842+0000 mgr.vx-rg23-rk65-u43-130-1.pxmyie [DBG] code:
0
2020-06-01T22:09:23.401062+0000 mgr.vx-rg23-rk65-u43-130-1.pxmyie [DBG] out: {
"ceph_version": "ceph version 15.2.3 (d289bbdec69ed7c1f516e0a093594580a76b78d0) octopus (stable)",
"image_id": "d72755c420bcbdae08d063de6035d060ea0487f8a43f777c75bdbfcd9fd907fa"
}
2020-06-01T22:09:23.404700+0000 mgr.vx-rg23-rk65-u43-130-1.pxmyie [DBG] mon_command: 'mon ok-to-stop' -> -16 in 0.002s
2020-06-01T22:09:23.405002+0000 mgr.vx-rg23-rk65-u43-130-1.pxmyie [INF] Upgrade: It is NOT safe to stop mon.vx-rg23-rk65-u43-130
2020-06-01T22:09:38.416475+0000 mgr.vx-rg23-rk65-u43-130-1.pxmyie [DBG] mon_command: 'mon ok-to-stop' -> -16 in 0.003s
2020-06-01T22:09:38.417296+0000 mgr.vx-rg23-rk65-u43-130-1.pxmyie [INF] Upgrade: It is NOT safe to stop mon.vx-rg23-rk65-u43-130
2020-06-01T22:09:53.421473+0000 mgr.vx-rg23-rk65-u43-130-1.pxmyie [DBG] mon_command: 'mon ok-to-stop' -> -16 in 0.003s
2020-06-01T22:09:53.422350+0000 mgr.vx-rg23-rk65-u43-130-1.pxmyie [INF] Upgrade: It is NOT safe to stop mon.vx-rg23-rk65-u43-130
2020-06-01T22:10:08.440422+0000 mgr.vx-rg23-rk65-u43-130-1.pxmyie [DBG] mon_command: 'mon ok-to-stop' -> -16 in 0.003s
2020-06-01T22:10:08.441122+0000 mgr.vx-rg23-rk65-u43-130-1.pxmyie [INF] Upgrade: It is NOT safe to stop mon.vx-rg23-rk65-u43-130
How can I solve this and upgrade?
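In case it helps with the diagnosis, the -16 in the mgr log is EBUSY, and I guess I should double-check that both mons are really in quorum before retrying, e.g.:

$ ceph mon stat
$ ceph quorum_status --format json-pretty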
Thanks,
Gencer.
Kevin, Ignazio, Marc,
Thanks for the information. I now consider myself well-advised.
-Patrick
On Tue, Jun 2, 2020 at 1:21 PM Marc Roos <M.Roos(a)f1-outsourcing.eu> wrote:
>
> Ceph is from Red Hat and Red Hat is owned by IBM. I think the best
> training you could get would be from Red Hat.
>
> I would not advise learning how to use a mouse with a web interface, nor
> this Ansible or some other deploy tool. Do it from scratch manually so
> you know the basics. Once you know the basics, go for some tools that make your
> life easier. (And never install the newest stable release ;))
>
>
>
>
> -----Original Message-----
> Subject: [ceph-users] Re: professional services and support for newest
> Ceph
>
> Hello, I am testing ceph from croit and it works fine: very easy web
> interface for installing and managing ceph and very clear support
> pricing.
> Ignazio
>
> On Tue, Jun 2, 2020, at 19:36 <response(a)ifastnet.com> wrote:
>
> > and theres
> >
> > https://croit.io/consulting
> >
> > best regards
> > Kevin M
> >
> > ----- Original Message -----
> > From: "Patrick Calhoun" <phineas(a)ou.edu>
> > To: ceph-users(a)ceph.io
> > Sent: Tuesday, June 2, 2020 5:29:11 PM
> > Subject: [ceph-users] professional services and support for newest
> > Ceph
> >
> > Are there reputable training/support options for Ceph that are not
> > geared toward a specific commercial product (e.g. "Red Hat Ceph
> > Storage,") but instead would cover the newest open source stable
> release?
> >
> > Thanks,
> > Patrick
>
>
>
--
Patrick Calhoun, RHCE
Petascale Storage Administrator
OU Supercomputing Center for Education and Research
Department of Information Technology
University of Oklahoma
(405) 325-4210
Are there reputable training/support options for Ceph that are not geared
toward a specific commercial product (e.g. "Red Hat Ceph Storage,") but
instead would cover the newest open source stable release?
Thanks,
Patrick
We are rebuilding servers, and before Luminous our process was:
1. Reweight the OSD to 0
2. Wait for rebalance to complete
3. Out the osd
4. Crush remove osd
5. Auth del osd
6. Ceph osd rm #
It seems the Luminous documentation says that you should:
1. Out the osd
2. Wait for the cluster rebalance to finish
3. Stop the osd
4. Osd purge #
Is reweighting to 0 no longer suggested?
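For reference, my reading of the new procedure as actual commands is roughly this (a sketch; osd 12 is just an example ID):

$ ceph osd out 12
$ ceph osd safe-to-destroy osd.12    # repeat until it reports the OSD is safe to destroy/stop
$ systemctl stop ceph-osd@12
$ ceph osd purge 12 --yes-i-really-mean-it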
Side note: I tried our existing process, and even after the reweight, the entire
cluster restarted the rebalance again after step 4 (crush remove osd) of the
old process. I should also note that after reweighting to 0, when I tried to run
"ceph osd out #", it said it was already marked out.
I assume the docs are correct, but just want to make sure since reweighting
had been previously recommended.
Regards,
-Brent
Existing Clusters:
Test: Nautilus 14.2.2 with 3 osd servers, 1 mon/mgr, 1 gateway, 2 iscsi
gateways ( all virtual on nvme )
US Production(HDD): Nautilus 14.2.2 with 11 osd servers, 3 mons, 4 gateways,
2 iscsi gateways
UK Production(HDD): Nautilus 14.2.2 with 12 osd servers, 3 mons, 4 gateways
US Production(SSD): Nautilus 14.2.2 with 6 osd servers, 3 mons, 3 gateways,
2 iscsi gateways