Hi all,
We have a Ceph (version 12.2.4) cluster that uses EC pools, and it consists
of 10 hosts for OSDs.
The commands used to create the EC pool are as follows:
ceph osd erasure-code-profile set profile_jerasure_4_3_reed_sol_van \
plugin=jerasure \
k=4 \
m=3 \
technique=reed_sol_van \
packetsize=2048 \
crush-device-class=hdd \
crush-failure-domain=host
ceph osd pool create pool_jerasure_4_3_reed_sol_van 2048 2048 erasure \
profile_jerasure_4_3_reed_sol_van
Since the EC pool's crush-failure-domain is configured as "host", we simply
disable the network interfaces of some hosts (using the "ifdown" command) to
verify the functionality of the EC pool.
Here is what we have observed:
First, the IO rate (measured with "rados bench", which we use for
benchmarking) drops to 0 immediately when one host goes offline.
Second, it takes a long time (around 100 seconds) for Ceph to detect that the
corresponding OSDs on that host are down.
Finally, once Ceph has detected all the offline OSDs, the EC pool seems to
behave normally and is ready for IO operations again.
So, here are my questions:
1. Is it normal that the IO rate drops to 0 immediately even though only one
host goes offline?
2. How can we make Ceph reduce the time needed to detect failed OSDs?
Thanks for any help.
Best regards,
Majia Xiao
Hi everyone,
We've identified a data corruption bug[1], first introduced[2] (by yours
truly) in 14.2.3 and affecting both 14.2.3 and 14.2.4. The corruption
appears as an assertion that looks like
os/bluestore/fastbmap_allocator_impl.h: 750: FAILED ceph_assert(available >= allocated)
or, in some cases, as a rocksdb checksum error. It only affects BlueStore OSDs
that have a separate 'db' or 'wal' device.
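A quick way to check whether a given OSD uses a dedicated db/wal device is
something along these lines (a sketch from memory; the exact metadata field
names may vary by release):
ceph osd metadata <osd-id> | grep -E 'bluefs_dedicated_(db|wal)'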
We have a fix[3] that is working its way through testing, and will
expedite the next Nautilus point release (14.2.5) once it is ready.
If you are running 14.2.2 or 14.2.1 and use BlueStore OSDs with
separate 'db' volumes, you should consider waiting to upgrade
until 14.2.5 is released.
A big thank you to Igor Fedotov and several *extremely* helpful users who
managed to reproduce and track down this problem!
sage
[1] https://tracker.ceph.com/issues/42223
[2] https://github.com/ceph/ceph/commit/096033b9d931312c0688c2eea7e14626bfde0ad…
[3] https://github.com/ceph/ceph/pull/31621
Hi,
I have a CephFS instance and I am also planning on deploying an
Object Storage interface.
My servers have two network cards each. I would like to use the current
local one to talk to Ceph clients (both CephFS and Object Storage)
and use the second one for all the Ceph processes to talk to each
other.
I'm quite sure that Ceph supports this kind of setup, but I can't find
out how to do it.
I already have a "public network" setting that, AFAIU, sets the
interface used by Ceph to talk to clients, but I don't know whether it
covers the CephFS MDS, RGW, or all of them.
And I can't see how to make the different Ceph processes talk to
each other through the other interface. I found the "mon_host" setting, but
should I change it to get what I want? And what else?
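From what I've pieced together so far, I imagine something like the following
in ceph.conf (the subnets below are just placeholders, and I'm assuming
"cluster network" is the option that carries the inter-daemon traffic):
[global]
public network = 192.168.1.0/24
cluster network = 192.168.2.0/24
But I don't know whether this covers MDS and RGW traffic as well, or only the
OSD replication traffic.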
Is there some example of such a setup that I could learn from?
Or maybe someone would be kind enough to sketch a basic config to
achieve this setup?
Or maybe there is some documentation that deals with such a scenario
that I haven't found.
Well, I'm in search of more info, of any kind.
Thanks in advance for your help and attention,
Rodrigo Severo
I am not sure since when, but I am no longer able to create or delete
snapshots. I am getting a permission denied error. I recently upgraded
from Luminous to Nautilus and set allow_new_snaps as mentioned
in [1].
[@ .snap]# ls
snap-1 snap-2 snap-3 snap-4 snap-5 snap-6 snap-7
[@ .snap]# rmdir snap-7
rmdir: failed to remove ‘snap-7’: Permission denied
[@ .snap]# mkdir snap-8
mkdir: cannot create directory ‘snap-8’: Permission denied
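Could it be that my client's MDS caps are missing the 's' flag that is
required to create and delete snapshots? I imagine something like the
following would grant it (client name and path are just placeholders):
ceph fs authorize cephfs client.myclient / rws
I'm not sure whether that is actually my issue, though.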
[1]
https://docs.ceph.com/docs/nautilus/dev/cephfs-snapshots/
Is there a ceph auth command that just lists all clients, without dumping
keys to the console? I would recommend making 'ceph auth ls' display
only client names/IDs. If you want a key, there are already other
commands for that.
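As a stopgap, something along these lines works, since entity names start at
the beginning of a line in the plain output while keys and caps are indented
(a rough sketch, not a polished solution):
ceph auth ls | grep '^client\.'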
Hello,
We have a Ceph cluster (version 12.2.4) with 10 hosts, and there are 21
OSDs on each host.
An EC pool is created with the following commands:
ceph osd erasure-code-profile set profile_jerasure_4_3_reed_sol_van \
plugin=jerasure \
k=4 \
m=3 \
technique=reed_sol_van \
packetsize=2048 \
crush-device-class=hdd \
crush-failure-domain=host
ceph osd pool create pool_jerasure_4_3_reed_sol_van 2048 2048 erasure \
profile_jerasure_4_3_reed_sol_van
Here are my questions:
1. The EC pool is created with k=4, m=3, and crush-failure-domain=host, so
we simply disable the network interfaces of some hosts (using the "ifdown"
command) to verify the functionality of the EC pool while running the 'rados
bench' command.
However, the IO rate drops to 0 immediately when a single host goes
offline, and it takes a long time (~100 seconds) for the IO rate to return to
normal.
As far as I know, the default value of min_size is k+1, i.e. 5, which means
that the EC pool should still be able to serve IO even if two hosts are offline.
Is there something wrong with my understanding?
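For reference, the pool's current min_size can be checked with:
ceph osd pool get pool_jerasure_4_3_reed_sol_van min_size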
2. According to our observations, the IO rate returns to normal once Ceph
has detected that all the OSDs on the failed host are down.
Is there any way to reduce the time needed for Ceph to detect all failed
OSDs?
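For reference, these are the heartbeat-related options we are considering
lowering, assuming they are the right knobs (the values below are only
examples, not recommendations):
[osd]
# default 6: how often an OSD pings its peers
osd heartbeat interval = 3
# default 20: how long without a reply before a peer is reported down
osd heartbeat grace = 10
[mon]
# default 2: how many OSDs must report a peer down before the mon marks it down
mon osd min down reporters = 1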
Thanks for any help.
Best regards,
Majia Xiao
If it were a network issue, the counters should explode (as I said,
with a log level of 5 on the messenger we observed more than 80,000
lossy channels per minute), but nothing abnormal shows up in the
counters (on the switches or the servers).
On the switches: no drops, no CRC errors, no packet loss, only some
output discards, but not enough to be significant. On the server NICs,
checked via ethtool -S, nothing stands out.
And as I said, another Mimic cluster with different hardware shows the
same behavior.
Ceph uses connection pools from host to host, but how does it check the
availability of these connections over time?
And since the network doesn't seem to be at fault, what can explain these
broken channels?
On Wed, Nov 27, 2019 at 19:05, Anthony D'Atri <aad(a)dreamsnake.net> wrote:
>
> Are you bonding NIC ports? If so, do you have the correct hash policy defined? Have you looked at the *switch* side for packet loss, CRC errors, etc.? What you report could be consistent with this. Since the host interface for a given connection will vary by the bond hash, some OSD connections will use one port and some the other. So if one port has switch-side errors, or is blackholed on the switch, you could see some heartbeating impacted but not others.
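> For what it's worth, a quick way to confirm the bond mode and transmit hash policy on a Linux host is something like the following (assuming the bond interface is named bond0):
>
> grep -iE 'bonding mode|hash policy' /proc/net/bonding/bond0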
>
> Also make sure you have the optimal reporters value.
>
> > On Nov 27, 2019, at 7:31 AM, Vincent Godin <vince.mlist(a)gmail.com> wrote:
> >
> > Since I submitted the mail below a few days ago, we have found some clues.
> > We observed a lot of lossy connections like:
> > ceph-osd.9.log:2019-11-27 11:03:49.369 7f6bb77d0700 0 --
> > 192.168.4.181:6818/2281415 >> 192.168.4.41:0/1962809518
> > conn(0x563979a9f600 :6818 s=STATE_ACCEPTING_WAIT_CONNECT_MSG_AUTH
> > pgs=0 cs=0 l=1).handle_connect_msg accept replacing existing (lossy)
> > channel (new one lossy=1)
> > We raised the messenger log level to 5/5 and observed, for the whole
> > cluster, more than 80,000 lossy connections per minute!
> > We adjusted "ms_tcp_read_timeout" from 900 to 60 seconds, and then saw no
> > more lossy connections in the logs and no more failed health checks.
> > It's just a workaround, but there is a real problem with these broken
> > sessions, and it leads to two assertions:
> > - Ceph takes too much time to detect broken sessions and should recycle them more quickly.
> > - What are the reasons for these broken sessions?
> >
> > We have another Mimic cluster on different hardware and have observed the
> > same behavior: lots of lossy sessions, slow ops, and so on.
> > The symptoms are the same:
> > - some OSDs on one host get no response from an OSD on a different host
> > - after some time, slow ops are detected
> > - sometimes it leads to blocked IO
> > - after about 15 minutes the problem vanishes
> >
> > -----------
> >
> > Help on diag needed: heartbeat_failed
> >
> > We encounter a strange behavior on our Mimic 13.2.6 cluster. At any
> > time, and without any load, some OSDs become unreachable from only
> > some hosts. This lasts about 10 minutes and then the problem vanishes.
> > It's not always the same OSDs and the same hosts. There is no network
> > failure on any of the hosts (because only some OSDs become unreachable),
> > nor any disk freeze, as far as we can see in our Grafana dashboards. The
> > log messages are:
> > first msg:
> > 2019-11-24 09:19:43.292 7fa9980fc700 -1 osd.596 146481
> > heartbeat_check: no reply from 192.168.6.112:6817 osd.394 since back
> > 2019-11-24 09:19:22.761142 front 2019-11-24 09:19:39.769138 (cutoff
> > 2019-11-24 09:19:23.293436)
> > last msg:
> > 2019-11-24 09:30:33.735 7f632354f700 -1 osd.591 146481
> > heartbeat_check: no reply from 192.168.6.123:6828 osd.600 since back
> > 2019-11-24 09:27:05.269330 front 2019-11-24 09:30:33.214874 (cutoff
> > 2019-11-24 09:30:13.736517)
> > During this time, 3 hosts were involved: host-18, host-20, and host-30:
> > host-30 is the only one that can't see OSDs 346, 356, and 352 on host-18
> > host-30 is the only one that can't see OSDs 387 and 394 on host-20
> > host-18 is the only one that can't see OSDs 583, 585, 591, and 597 on host-30
> > We can't see any strange behavior on hosts 18, 20, and 30 in our node
> > exporter data during this time.
> > Any ideas or advice?
Hello,
I am new to Ceph and I am currently working on setting up a CephFS and RBD
environment. I have successfully set up a Ceph cluster with 4 OSDs (2 OSDs
of 50 GB and 2 OSDs of 300 GB).
But while setting up CephFS, the size I see allocated for the CephFS data
and metadata pools is 55 GB, whereas I want 300 GB assigned to CephFS.
I tried using the "target_size_bytes" flag while creating the pool but it is
not working (it says invalid command). I get the same result when I
use target_size_bytes with "ceph osd pool set" after creating the pool.
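For reference, what I tried looks roughly like the following (the pool name is
just an example, and 322122547200 is 300 GiB expressed in bytes), plus the
equivalent flag at pool creation time:
ceph osd pool set cephfs_data target_size_bytes 322122547200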
I am not sure if I am doing something silly here.
Can someone please guide me on this?
Thanks in advance!
Thanks for the information, I'll take a look at this PR and think it over.
On Wed, Nov 27, 2019 at 6:50 PM, Jeff Layton <jlayton(a)redhat.com> wrote:
> On Wed, 2019-11-27 at 15:14 +0800, j j wrote:
> > Hi all,
> >
> > Recently I encountered a situation that requires reliable file storage
> > with CephFS, and the point is that the data must not be modified or
> > deleted.
> > After some learning I found that the WORM (write once, read many) feature
> > is exactly what I need. Unfortunately, as far as I know, there is no WORM
> > feature in CephFS.
> > So I was wondering, is there any plan or design for this feature?
> >
> > Thanks.
>
> There's a pull request for this that has been stalled since spring:
>
> https://github.com/ceph/ceph/pull/26691
>
> Personally, I don't see how we can get away with making file data 100%
> immutable. We'll need to allow _some_ entity to un-WORM the thing, and
> it was never clear to me how that would work.
> --
> Jeff Layton <jlayton(a)redhat.com>
>
>
Since I submitted the mail below a few days ago, we have found some clues.
We observed a lot of lossy connections like:
ceph-osd.9.log:2019-11-27 11:03:49.369 7f6bb77d0700 0 --
192.168.4.181:6818/2281415 >> 192.168.4.41:0/1962809518
conn(0x563979a9f600 :6818 s=STATE_ACCEPTING_WAIT_CONNECT_MSG_AUTH
pgs=0 cs=0 l=1).handle_connect_msg accept replacing existing (lossy)
channel (new one lossy=1)
We raised the messenger log level to 5/5 and observed, for the whole
cluster, more than 80,000 lossy connections per minute!
We adjusted "ms_tcp_read_timeout" from 900 to 60 seconds, and then saw no
more lossy connections in the logs and no more failed health checks.
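For anyone wanting to try the same workaround, a minimal sketch of how it can
be applied (whether injectargs takes effect at runtime for this option or an
OSD restart is needed, I am not completely sure):
ceph tell osd.* injectargs '--ms_tcp_read_timeout 60'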
It's just a workaround, but there is a real problem with these broken
sessions, and it leads to two assertions:
- Ceph takes too much time to detect broken sessions and should recycle them more quickly.
- What are the reasons for these broken sessions?
We have another Mimic cluster on different hardware and have observed the
same behavior: lots of lossy sessions, slow ops, and so on.
The symptoms are the same:
- some OSDs on one host get no response from an OSD on a different host
- after some time, slow ops are detected
- sometimes it leads to blocked IO
- after about 15 minutes the problem vanishes
-----------
Help on diag needed: heartbeat_failed
We encounter a strange behavior on our Mimic 13.2.6 cluster. At any
time, and without any load, some OSDs become unreachable from only
some hosts. This lasts about 10 minutes and then the problem vanishes.
It's not always the same OSDs and the same hosts. There is no network
failure on any of the hosts (because only some OSDs become unreachable),
nor any disk freeze, as far as we can see in our Grafana dashboards. The
log messages are:
first msg:
2019-11-24 09:19:43.292 7fa9980fc700 -1 osd.596 146481
heartbeat_check: no reply from 192.168.6.112:6817 osd.394 since back
2019-11-24 09:19:22.761142 front 2019-11-24 09:19:39.769138 (cutoff
2019-11-24 09:19:23.293436)
last msg:
2019-11-24 09:30:33.735 7f632354f700 -1 osd.591 146481
heartbeat_check: no reply from 192.168.6.123:6828 osd.600 since back
2019-11-24 09:27:05.269330 front 2019-11-24 09:30:33.214874 (cutoff
2019-11-24 09:30:13.736517)
During this time, 3 hosts were involved: host-18, host-20, and host-30:
host-30 is the only one that can't see OSDs 346, 356, and 352 on host-18
host-30 is the only one that can't see OSDs 387 and 394 on host-20
host-18 is the only one that can't see OSDs 583, 585, 591, and 597 on host-30
We can't see any strange behavior on hosts 18, 20, and 30 in our node
exporter data during this time.
Any ideas or advice?