As a follow-up: our MDS was locking up under load, so I went ahead and tried
it. Some directories were getting bounced around between the MDS servers,
and load would transition from one to the other. Initially my guess was that
some of these old clients were sending all requests to one MDS server, but
performance wasn't stabilizing or improving, so I went back to one rank.
A number of clients stopped being able to work in some directories while
others were fine, and we had to reboot all the nodes affected in this
manner. We are going to move low-performance clients to FUSE and are
working to upgrade the kernel on the rest. Until then, we will just have to
fight with these clients that won't release their caps.

I've set the cache memory limit to 200 GB and the cache reservation to 50%,
and that seems to help a bit. Are there any other tricks that might help
these stubborn clients release their caps?
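For reference, the knobs mentioned above, plus one way to see which sessions are hoarding caps, would look something like the following. The option names are the standard MDS settings; the values are just the ones from this thread, and `<name>` is a placeholder for the MDS id:

```shell
# Raise the MDS cache memory limit to 200 GB (value is in bytes)
ceph config set mds mds_cache_memory_limit 214748364800

# Reserve 50% of the cache as headroom (the default is 0.05)
ceph config set mds mds_cache_reservation 0.5

# On the MDS host, list sessions to see which clients hold the
# most caps ("num_caps" in the JSON output)
ceph daemon mds.<name> session ls | grep '"num_caps"'
```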
Thanks,
Robert LeBlanc
----------------
Robert LeBlanc
PGP Fingerprint 79A2 9CA4 6CC4 45DD A904 C70E E654 3BB2 FA62 B9F1
On Mon, May 4, 2020 at 7:51 PM Gregory Farnum <gfarnum(a)redhat.com> wrote:
On Sat, May 2, 2020 at 3:12 PM Robert LeBlanc <robert(a)leblancnet.us> wrote:

If there was a network blip and a client was having trouble reconnecting,
do you think reducing the ranks to 1 would allow them to connect? At which
point the ranks could be increased again. Or is it a matter of the client
kernel panicking, so any kind of reconnection won't work?
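For what it's worth, the rank reduction being asked about here is just the max_mds setting; a sketch, assuming the filesystem is named `cephfs`:

```shell
# Drop to a single active MDS; on Nautilus the surplus ranks are
# stopped automatically once max_mds is lowered
ceph fs set cephfs max_mds 1

# Later, scale back out to two active MDS daemons
ceph fs set cephfs max_mds 2
```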
It's client issues. It shouldn't kernel panic, but it also won't
connect with the existing mount, and IIRC you can't do anything on the
server side to make it do so once it gets stuck. You'd have to reboot
the server or possibly do a umount -l -f and then mount again.
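Sketching the client-side recovery Greg describes, with the mount point, monitor address, and credentials as illustrative placeholders:

```shell
# Lazy, forced unmount of the stuck CephFS mount
umount -f -l /mnt/cephfs

# Mount again to establish a fresh client session
mount -t ceph 192.168.1.1:6789:/ /mnt/cephfs \
    -o name=admin,secretfile=/etc/ceph/admin.secret
```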
-Greg
Thanks
----------------
Robert LeBlanc
PGP Fingerprint 79A2 9CA4 6CC4 45DD A904 C70E E654 3BB2 FA62 B9F1
On Fri, May 1, 2020 at 7:37 PM Robert LeBlanc <robert(a)leblancnet.us> wrote:
>
> Thanks guys. We are so close to the edge that we may just take that
> chance; usually the only reason an active client has to reconnect is
> because we have to bounce the MDS because it's overwhelmed.
> ----------------
> Robert LeBlanc
> PGP Fingerprint 79A2 9CA4 6CC4 45DD A904 C70E E654 3BB2 FA62 B9F1
>
>
On Fri, May 1, 2020 at 4:00 AM Paul Emmerich <paul.emmerich(a)croit.io> wrote:
>>
>> I've seen issues with client reconnects on older kernels, yeah. They
>> sometimes get stuck after a network failure.
>>
>> Paul
>>
>> --
>> Paul Emmerich
>>
>> Looking for help with your Ceph cluster? Contact us at https://croit.io
>>
>> croit GmbH
>> Freseniusstr. 31h
>> 81247 München
>> www.croit.io
>> Tel: +49 89 1896585 90
>>
>>
>> On Thu, Apr 30, 2020 at 10:19 PM Gregory Farnum <gfarnum(a)redhat.com> wrote:
>>>
>>> On Tue, Apr 28, 2020 at 11:52 AM Robert LeBlanc <robert(a)leblancnet.us> wrote:
>>> >
>>> > In the Nautilus manual it recommends a >= 4.14 kernel for multiple
>>> > active MDSes. What are the potential issues with running the 4.4
>>> > kernel with multiple MDSes? We are in the process of upgrading the
>>> > clients, but at times we overrun the capacity of a single MDS server.
>>>
>>> I don't think this is documented specifically; you'd have to go
>>> through the git logs. I talked with the team, and 4.14 was the
>>> upstream kernel when we marked multi-MDS as stable, with the general
>>> stream of ongoing fixes that always applies there.
>>>
>>> There aren't any known issues that will cause file consistency to
>>> break or anything; I'd be more worried about clients having issues
>>> reconnecting when their network blips or an MDS fails over.
>>> -Greg
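One way to check which kernels the connected clients are actually running: kernel clients report their version in the session metadata. A sketch, with `<name>` standing in for the MDS id:

```shell
# On the MDS host: each kernel client's session metadata includes
# its kernel version, so old 4.4 clients stand out
ceph daemon mds.<name> session ls | grep '"kernel_version"'
```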
>>>
>>> >
>>> > MULTIPLE ACTIVE METADATA SERVERS
>>> > <https://docs.ceph.com/docs/nautilus/cephfs/kernel-features/#multiple-active…>
>>> >
>>> > The feature has been supported since the Luminous release. It is
>>> > recommended to use Linux kernel clients >= 4.14 when there are
>>> > multiple active MDS.
>>> >
>>> > Thank you,
>>> > Robert LeBlanc
>>> > ----------------
>>> > Robert LeBlanc
>>> > PGP Fingerprint 79A2 9CA4 6CC4 45DD A904 C70E E654 3BB2 FA62 B9F1
>>> > _______________________________________________
>>> > ceph-users mailing list -- ceph-users(a)ceph.io
>>> > To unsubscribe send an email to ceph-users-leave(a)ceph.io
>>> >