"No space left on device" errors

List overview All Threads
Download

newer

older

directory fragments and stray...

10/17/2019 perf meeting is on!

Yuri Weinstein

16 Oct 2019 16 Oct '19

10:12 p.m.

"No space left on device" errors https://tracker.ceph.com/issues/42313 Suspect https://github.com/ceph/ceph/pull/30792 This issue is blocking releases testing FYI and looking for help. Thx YuriW

Show replies by date

Gregory Farnum

16 Oct 16 Oct

10:16 p.m.

Have you tried reproducing it after reverting that commit in a teuthology branch? I don’t believe it was a critical change, just part of Kyr’s ongoing python3 work, so we can put it off until we understand the issue if that fixes things. -Greg On Wed, Oct 16, 2019 at 12:13 PM Yuri Weinstein <yweinste(a)redhat.com> wrote:

...

"No space left on device" errors https://tracker.ceph.com/issues/42313 Suspect https://github.com/ceph/ceph/pull/30792 This issue is blocking releases testing FYI and looking for help. Thx YuriW _______________________________________________ Dev mailing list -- dev(a)ceph.io To unsubscribe send an email to dev-leave(a)ceph.io

David Galloway

10:20 p.m.

I'm having Yuri schedule a run using 7f80f5de91777d958f36fd6cce943a1d68cefdf7 which is the sha1 right before the commit I suspect broke things. On 10/16/19 3:16 PM, Gregory Farnum wrote: > Have you tried reproducing it after reverting that commit in a > teuthology branch? > I don’t believe it was a critical change, just part of Kyr’s ongoing > python3 work, so we can put it off until we understand the issue if that > fixes things. > -Greg > > On Wed, Oct 16, 2019 at 12:13 PM Yuri Weinstein <yweinste(a)redhat.com > <mailto:yweinste@redhat.com>> wrote: > > "No space left on device" errors > https://tracker.ceph.com/issues/42313 > > Suspect https://github.com/ceph/ceph/pull/30792 > > This issue is blocking releases testing FYI and looking for help. > > Thx > YuriW

David Galloway

10:24 p.m.

Yuri just reminded me that he's seeing this problem on the mimic branch. Does that mean this PR just needs to be backported to all branches? https://github.com/ceph/ceph/pull/30792 On 10/16/19 3:20 PM, David Galloway wrote: > I'm having Yuri schedule a run using > 7f80f5de91777d958f36fd6cce943a1d68cefdf7 which is the sha1 right before > the commit I suspect broke things. > > On 10/16/19 3:16 PM, Gregory Farnum wrote: >> Have you tried reproducing it after reverting that commit in a >> teuthology branch? >> I don’t believe it was a critical change, just part of Kyr’s ongoing >> python3 work, so we can put it off until we understand the issue if that >> fixes things. >> -Greg >> >> On Wed, Oct 16, 2019 at 12:13 PM Yuri Weinstein <yweinste(a)redhat.com >> <mailto:yweinste@redhat.com>> wrote: >> >> "No space left on device" errors >> https://tracker.ceph.com/issues/42313 >> >> Suspect https://github.com/ceph/ceph/pull/30792 >> >> This issue is blocking releases testing FYI and looking for help. >> >> Thx >> YuriW

Gregory Farnum

10:43 p.m.

On Wed, Oct 16, 2019 at 12:24 PM David Galloway <dgallowa(a)redhat.com> wrote:

...

Yuri just reminded me that he's seeing this problem on the mimic branch. Does that mean this PR just needs to be backported to all branches? https://github.com/ceph/ceph/pull/30792

I'd be surprised if that one (changing iteritems() to items()) could cause this, and it's not a fix for any known bugs, just ongoing py3 work. When I said "that commit" I was referring to https://github.com/ceph/teuthology/commit/41a13eca480e38cfeeba7a180b4516b90…, which is in the teuthology repo and thus hits every test run. Looking at the comments across https://github.com/ceph/teuthology/pull/1332 and https://tracker.ceph.com/issues/42313 it sounds like that teuthology commit accidentally fixed a bug which triggered another bug that we're not sure how to resolve, but perhaps I'm misunderstanding? > > On 10/16/19 3:20 PM, David Galloway wrote: > > I'm having Yuri schedule a run using > > 7f80f5de91777d958f36fd6cce943a1d68cefdf7 which is the sha1 right before > > the commit I suspect broke things. > > > > On 10/16/19 3:16 PM, Gregory Farnum wrote: > >> Have you tried reproducing it after reverting that commit in a > >> teuthology branch? > >> I don’t believe it was a critical change, just part of Kyr’s ongoing > >> python3 work, so we can put it off until we understand the issue if that > >> fixes things. > >> -Greg > >> > >> On Wed, Oct 16, 2019 at 12:13 PM Yuri Weinstein <yweinste(a)redhat.com > >> <mailto:yweinste@redhat.com>> wrote: > >> > >> "No space left on device" errors > >> https://tracker.ceph.com/issues/42313 > >> > >> Suspect https://github.com/ceph/ceph/pull/30792 > >> > >> This issue is blocking releases testing FYI and looking for help. > >> > >> Thx > >> YuriW

Nathan Cutler

17 Oct 17 Oct

12:14 a.m.

On Wed, Oct 16, 2019 at 12:43:32PM -0700, Gregory Farnum wrote:

...

On Wed, Oct 16, 2019 at 12:24 PM David Galloway <dgallowa(a)redhat.com> wrote:

Yuri just reminded me that he's seeing this problem on the mimic branch. Does that mean this PR just needs to be backported to all branches? https://github.com/ceph/ceph/pull/30792

I think I understand what's going on. Here's an interim fix: https://github.com/ceph/teuthology/pull/1334 Assuming this PR really does fix the issue, the "real" fix will be to drop get_wwn_id_map altogether, since it has long outlived its usefulness ( see https://tracker.ceph.com/issues/14855 ). Nathan

kyr

12:39 a.m.

I hope the nathans fix can probably do the thing, however it does not cover the the log referenced in the description of https://tracker.ceph.com/issues/42313 <https://tracker.ceph.com/issues/42313> because teuthology worker does not include that fix which is supposed to be cause for "No space left on device" issue. Can some one give one-job teuthology-suite command that 100% reproducing the issue? Kyrylo Shatskyy -- SUSE Software Solutions Germany GmbH Maxfeldstr. 5 90409 Nuremberg Germany

...

On Oct 16, 2019, at 11:14 PM, Nathan Cutler <ncutler(a)suse.com> wrote: On Wed, Oct 16, 2019 at 12:43:32PM -0700, Gregory Farnum wrote:

On Wed, Oct 16, 2019 at 12:24 PM David Galloway <dgallowa(a)redhat.com> wrote:

Yuri just reminded me that he's seeing this problem on the mimic branch. Does that mean this PR just needs to be backported to all branches? https://github.com/ceph/ceph/pull/30792

I think I understand what's going on. Here's an interim fix: https://github.com/ceph/teuthology/pull/1334 <https://github.com/ceph/teuthology/pull/1334> Assuming this PR really does fix the issue, the "real" fix will be to drop get_wwn_id_map altogether, since it has long outlived its usefulness ( see https://tracker.ceph.com/issues/14855 <https://tracker.ceph.com/issues/14855> ). Nathan _______________________________________________ Dev mailing list -- dev(a)ceph.io <mailto:dev@ceph.io> To unsubscribe send an email to dev-leave(a)ceph.io <mailto:dev-leave@ceph.io>

Jason Dillaman

12:53 a.m.

On Wed, Oct 16, 2019 at 5:39 PM kyr <kshatskyy(a)suse.de> wrote:

...

If teuthology is picking the wrong devices for the past week (as described by Nathan's PR), why wouldn't you expect that would impact OSDs running out of space?

...

Can some one give one-job teuthology-suite command that 100% reproducing the issue?

I believe that's the goal to test Nathan's change against the RBD suite since it has been hitting it across all branches since the end of last week at least. The logs prior to this breakage show ... 2019-10-02T21:37:45.198 INFO:tasks.ceph:fs option selected, checking for scratch devs 2019-10-02T21:37:45.199 INFO:tasks.ceph:found devs: ['/dev/vg_nvme/lv_4', '/dev/vg_nvme/lv_3', '/dev/vg_nvme/lv_2', '/dev/vg_nvme/lv_1'] 2019-10-02T21:37:45.199 INFO:teuthology.orchestra.run.smithi197:Running: 2019-10-02T21:37:45.199 INFO:teuthology.orchestra.run.smithi197:> ls -l '/dev/disk/by-id/wwn-*' 2019-10-02T21:37:45.265 INFO:teuthology.orchestra.run.smithi197.stderr:ls: cannot access /dev/disk/by-id/wwn-*: No such file or directory 2019-10-02T21:37:45.265 DEBUG:teuthology.orchestra.run:got remote process result: 2 2019-10-02T21:37:45.266 INFO:teuthology.misc:Failed to get wwn devices! Using /dev/sd* devices... ... and after ... 2019-10-10T19:42:32.759 INFO:tasks.ceph:fs option selected, checking for scratch devs 2019-10-10T19:42:32.759 INFO:tasks.ceph:found devs: ['/dev/vg_nvme/lv_4', '/dev/vg_nvme/lv_3', '/dev/vg_nvme/lv_2', '/dev/vg_nvme/lv_1'] 2019-10-10T19:42:32.759 INFO:teuthology.orchestra.run.smithi177:Running: 2019-10-10T19:42:32.759 INFO:teuthology.orchestra.run.smithi177:> ls -l /dev/disk/by-id/wwn-* 2019-10-10T19:42:32.813 INFO:teuthology.orchestra.run.smithi177.stdout:lrwxrwxrwx. 1 root root 9 Oct 10 19:33 /dev/disk/by-id/wwn-0x5000c5009294113e -> ../../sda 2019-10-10T19:42:32.813 INFO:teuthology.orchestra.run.smithi177.stdout:lrwxrwxrwx. 1 root root 10 Oct 10 19:33 /dev/disk/by-id/wwn-0x5000c5009294113e-part1 -> ../../sda1 2019-10-10T19:42:32.813 INFO:tasks.ceph:dev map: {} ... so it seems like a good candidate fix.

...

Kyrylo Shatskyy -- SUSE Software Solutions Germany GmbH Maxfeldstr. 5 90409 Nuremberg Germany On Oct 16, 2019, at 11:14 PM, Nathan Cutler <ncutler(a)suse.com> wrote: On Wed, Oct 16, 2019 at 12:43:32PM -0700, Gregory Farnum wrote: On Wed, Oct 16, 2019 at 12:24 PM David Galloway <dgallowa(a)redhat.com> wrote: Yuri just reminded me that he's seeing this problem on the mimic branch. Does that mean this PR just needs to be backported to all branches? https://github.com/ceph/ceph/pull/30792 I'd be surprised if that one (changing iteritems() to items()) could cause this, and it's not a fix for any known bugs, just ongoing py3 work. When I said "that commit" I was referring to https://github.com/ceph/teuthology/commit/41a13eca480e38cfeeba7a180b4516b90…, which is in the teuthology repo and thus hits every test run. Looking at the comments across https://github.com/ceph/teuthology/pull/1332 and https://tracker.ceph.com/issues/42313 it sounds like that teuthology commit accidentally fixed a bug which triggered another bug that we're not sure how to resolve, but perhaps I'm misunderstanding? I think I understand what's going on. Here's an interim fix: https://github.com/ceph/teuthology/pull/1334 Assuming this PR really does fix the issue, the "real" fix will be to drop get_wwn_id_map altogether, since it has long outlived its usefulness ( see https://tracker.ceph.com/issues/14855 ). Nathan _______________________________________________ Dev mailing list -- dev(a)ceph.io To unsubscribe send an email to dev-leave(a)ceph.io

-- Jason

Gregory Farnum

1:35 a.m.

On Wed, Oct 16, 2019 at 2:39 PM kyr <kshatskyy(a)suse.de> wrote:

...

I'm not quite sure what you mean here. I think one of these addresses your statement? 1) we were creating very small OSDs on the root device since the partitions weren't being mounted, and so these jobs actually filled them up as a consequence of that. 2) most of the teuthology repo is pulled fresh from master on every run. The workers themselves require restarting to get updates but that's pretty rare. (See https://github.com/ceph/teuthology/blob/master/teuthology/worker.py#L82) > > Can some one give one-job teuthology-suite command that 100% reproducing the issue? > > Kyrylo Shatskyy > -- > SUSE Software Solutions Germany GmbH > Maxfeldstr. 5 > 90409 Nuremberg > Germany > > > On Oct 16, 2019, at 11:14 PM, Nathan Cutler <ncutler(a)suse.com> wrote: > > On Wed, Oct 16, 2019 at 12:43:32PM -0700, Gregory Farnum wrote: > > On Wed, Oct 16, 2019 at 12:24 PM David Galloway <dgallowa(a)redhat.com> wrote: > > > Yuri just reminded me that he's seeing this problem on the mimic branch. > > Does that mean this PR just needs to be backported to all branches? > > https://github.com/ceph/ceph/pull/30792 > > > I'd be surprised if that one (changing iteritems() to items()) could > cause this, and it's not a fix for any known bugs, just ongoing py3 > work. > > When I said "that commit" I was referring to > https://github.com/ceph/teuthology/commit/41a13eca480e38cfeeba7a180b4516b90…, > which is in the teuthology repo and thus hits every test run. Looking > at the comments across https://github.com/ceph/teuthology/pull/1332 > and https://tracker.ceph.com/issues/42313 it sounds like that > teuthology commit accidentally fixed a bug which triggered another bug > that we're not sure how to resolve, but perhaps I'm misunderstanding? > > > I think I understand what's going on. Here's an interim fix: > https://github.com/ceph/teuthology/pull/1334 > > Assuming this PR really does fix the issue, the "real" fix will be to drop > get_wwn_id_map altogether, since it has long outlived its usefulness ( see > https://tracker.ceph.com/issues/14855 ). > > Nathan > _______________________________________________ > Dev mailing list -- dev(a)ceph.io > To unsubscribe send an email to dev-leave(a)ceph.io > >

Kyrylo Shatskyy

2:21 a.m.

See responds below Kyrylo Shatskyy -- SUSE Software Solutions Germany GmbH Maxfeldstr. 5 90409 Nuremberg Germany On Oct 17, 2019, at 12:35 AM, Gregory Farnum <gfarnum@redhat.com<mailto:gfarnum@redhat.com>> wrote: On Wed, Oct 16, 2019 at 2:39 PM kyr <kshatskyy@suse.de<mailto:kshatskyy@suse.de>> wrote: I hope the nathans fix can probably do the thing, however it does not cover the the log referenced in the description of https://tracker.ceph.com/issues/42313 because teuthology worker does not include that fix which is supposed to be cause for "No space left on device" issue. I'm not quite sure what you mean here. I think one of these addresses your statement? when you go to the ticket, you can see a reference to the teutholog in the bug description: http://qa-proxy.ceph.com/teuthology/yuriw-2019-10-09_15:42:09-rbd-wip-yuri5… You can easily figure out the teuthology version is used there: 2019-10-09T20:24:23.144 INFO:root:teuthology version: 1.0.0-139780c Which means there teuthology code base is used with git sha 139780c, which means it does not include any 41a13eca. teuthology> git log --oneline master | grep 41a13ec || echo ABSENT 41a13eca misc: use remote.sh instead of remote.run teuthology> git log --oneline 139780c | grep 41a13ec || echo ABSENT ABSENT As a confirmation of this, there is in the log: ls -l '/dev/disk/by-id/wwn-*‘ That is misguiding, so probably it is just a coincidence? Or we are trying to resolving another issue which has similar no space left on device errors. 1) we were creating very small OSDs on the root device since the partitions weren't being mounted, and so these jobs actually filled them up as a consequence of that. 2) most of the teuthology repo is pulled fresh from master on every run. The workers themselves require restarting to get updates but that's pretty rare. (See https://github.com/ceph/teuthology/blob/master/teuthology/worker.py#L82) I meant not the worker.py instance running in the background, but the worker user in the opposite to scheduler user. And of course the worker.py script checkouts the teuthology branch whichever you request, and this version (corresponding sha1) is printed in the teuthology.log. Can some one give one-job teuthology-suite command that 100% reproducing the issue? Kyrylo Shatskyy -- SUSE Software Solutions Germany GmbH Maxfeldstr. 5 90409 Nuremberg Germany On Oct 16, 2019, at 11:14 PM, Nathan Cutler <ncutler@suse.com<mailto:ncutler@suse.com>> wrote: On Wed, Oct 16, 2019 at 12:43:32PM -0700, Gregory Farnum wrote: On Wed, Oct 16, 2019 at 12:24 PM David Galloway <dgallowa@redhat.com<mailto:dgallowa@redhat.com>> wrote: Yuri just reminded me that he's seeing this problem on the mimic branch. Does that mean this PR just needs to be backported to all branches? https://github.com/ceph/ceph/pull/30792 I'd be surprised if that one (changing iteritems() to items()) could cause this, and it's not a fix for any known bugs, just ongoing py3 work. When I said "that commit" I was referring to https://github.com/ceph/teuthology/commit/41a13eca480e38cfeeba7a180b4516b90…, which is in the teuthology repo and thus hits every test run. Looking at the comments across https://github.com/ceph/teuthology/pull/1332 and https://tracker.ceph.com/issues/42313 it sounds like that teuthology commit accidentally fixed a bug which triggered another bug that we're not sure how to resolve, but perhaps I'm misunderstanding? I think I understand what's going on. Here's an interim fix: https://github.com/ceph/teuthology/pull/1334 Assuming this PR really does fix the issue, the "real" fix will be to drop get_wwn_id_map altogether, since it has long outlived its usefulness ( see https://tracker.ceph.com/issues/14855 ). Nathan _______________________________________________ Dev mailing list -- dev(a)ceph.io To unsubscribe send an email to dev-leave(a)ceph.io _______________________________________________ Dev mailing list -- dev@ceph.io<mailto:dev@ceph.io> To unsubscribe send an email to dev-leave@ceph.io<mailto:dev-leave@ceph.io>

kyr

2:35 a.m.

...

On Oct 17, 2019, at 12:35 AM, Gregory Farnum <gfarnum(a)redhat.com> wrote: On Wed, Oct 16, 2019 at 2:39 PM kyr <kshatskyy(a)suse.de <mailto:kshatskyy@suse.de>> wrote:

Can some one give one-job teuthology-suite command that 100% reproducing the issue? Kyrylo Shatskyy -- SUSE Software Solutions Germany GmbH Maxfeldstr. 5 90409 Nuremberg Germany On Oct 16, 2019, at 11:14 PM, Nathan Cutler <ncutler(a)suse.com> wrote: On Wed, Oct 16, 2019 at 12:43:32PM -0700, Gregory Farnum wrote: On Wed, Oct 16, 2019 at 12:24 PM David Galloway <dgallowa(a)redhat.com> wrote: Yuri just reminded me that he's seeing this problem on the mimic branch. Does that mean this PR just needs to be backported to all branches? https://github.com/ceph/ceph/pull/30792 I'd be surprised if that one (changing iteritems() to items()) could cause this, and it's not a fix for any known bugs, just ongoing py3 work. When I said "that commit" I was referring to https://github.com/ceph/teuthology/commit/41a13eca480e38cfeeba7a180b4516b90…, which is in the teuthology repo and thus hits every test run. Looking at the comments across https://github.com/ceph/teuthology/pull/1332 and https://tracker.ceph.com/issues/42313 it sounds like that teuthology commit accidentally fixed a bug which triggered another bug that we're not sure how to resolve, but perhaps I'm misunderstanding? I think I understand what's going on. Here's an interim fix: https://github.com/ceph/teuthology/pull/1334 Assuming this PR really does fix the issue, the "real" fix will be to drop get_wwn_id_map altogether, since it has long outlived its usefulness ( see https://tracker.ceph.com/issues/14855 ). Nathan _______________________________________________ Dev mailing list -- dev(a)ceph.io To unsubscribe send an email to dev-leave(a)ceph.io

_______________________________________________ Dev mailing list -- dev(a)ceph.io <mailto:dev@ceph.io> To unsubscribe send an email to dev-leave(a)ceph.io <mailto:dev-leave@ceph.io>

Yuri Weinstein

2:39 a.m.

Kyr Here is how I did it: RERUN=yuriw-2019-10-15_22:08:48-rbd-wip-yuri8-testing-2019-10-11-1347-luminous-distro-basic-smithi CEPH_QA_MAIL="ceph-qa(a)ceph.io"h.io"; MACHINE_NAME=smithi; CEPH_BRANCH=wip-yuri8-testing-2019-10-11-1347-luminous teuthology-suite -v -c $CEPH_BRANCH -m $MACHINE_NAME -r $RERUN --suite-repo https://github.com/ceph/ceph-ci.git --ceph-repo https://github.com/ceph/ceph-ci.git --suite-branch $CEPH_BRANCH -p 70 -R fail,dead,running,waiting to test the fix add "-t wip-wwn-fix" On Wed, Oct 16, 2019 at 4:36 PM kyr <kshatskyy(a)suse.de> wrote: > > So I ran a job on smithi against teuthology code which is supposed to have "No space left on device": > > http://qa-proxy.ceph.com/teuthology/kyr-2019-10-16_22:55:36-smoke:basic-mas… > > And it passed, has not this issue. Which exact suite does reproduce the issue? > > Kyrylo Shatskyy > -- > SUSE Software Solutions Germany GmbH > Maxfeldstr. 5 > 90409 Nuremberg > Germany > > > On Oct 17, 2019, at 12:35 AM, Gregory Farnum <gfarnum(a)redhat.com> wrote: > > On Wed, Oct 16, 2019 at 2:39 PM kyr <kshatskyy(a)suse.de> wrote: > > > I hope the nathans fix can probably do the thing, however it does not cover the the log referenced in the description of https://tracker.ceph.com/issues/42313 because teuthology worker does not include that fix which is supposed to be cause for "No space left on device" issue. > > > I'm not quite sure what you mean here. I think one of these addresses > your statement? > 1) we were creating very small OSDs on the root device since the > partitions weren't being mounted, and so these jobs actually filled > them up as a consequence of that. > 2) most of the teuthology repo is pulled fresh from master on every > run. The workers themselves require restarting to get updates but > that's pretty rare. (See > https://github.com/ceph/teuthology/blob/master/teuthology/worker.py#L82) > > > > Can some one give one-job teuthology-suite command that 100% reproducing the issue? > > Kyrylo Shatskyy > -- > SUSE Software Solutions Germany GmbH > Maxfeldstr. 5 > 90409 Nuremberg > Germany > > > On Oct 16, 2019, at 11:14 PM, Nathan Cutler <ncutler(a)suse.com> wrote: > > On Wed, Oct 16, 2019 at 12:43:32PM -0700, Gregory Farnum wrote: > > On Wed, Oct 16, 2019 at 12:24 PM David Galloway <dgallowa(a)redhat.com> wrote: > > > Yuri just reminded me that he's seeing this problem on the mimic branch. > > Does that mean this PR just needs to be backported to all branches? > > https://github.com/ceph/ceph/pull/30792 > > > I'd be surprised if that one (changing iteritems() to items()) could > cause this, and it's not a fix for any known bugs, just ongoing py3 > work. > > When I said "that commit" I was referring to > https://github.com/ceph/teuthology/commit/41a13eca480e38cfeeba7a180b4516b90…, > which is in the teuthology repo and thus hits every test run. Looking > at the comments across https://github.com/ceph/teuthology/pull/1332 > and https://tracker.ceph.com/issues/42313 it sounds like that > teuthology commit accidentally fixed a bug which triggered another bug > that we're not sure how to resolve, but perhaps I'm misunderstanding? > > > I think I understand what's going on. Here's an interim fix: > https://github.com/ceph/teuthology/pull/1334 > > Assuming this PR really does fix the issue, the "real" fix will be to drop > get_wwn_id_map altogether, since it has long outlived its usefulness ( see > https://tracker.ceph.com/issues/14855 ). > > Nathan > _______________________________________________ > Dev mailing list -- dev(a)ceph.io > To unsubscribe send an email to dev-leave(a)ceph.io > > > _______________________________________________ > Dev mailing list -- dev(a)ceph.io > To unsubscribe send an email to dev-leave(a)ceph.io > >

kyr

3:03 a.m.

So Yuri, is it reproducible only for luminous? or you have seen this on master or any other branches? Kyrylo Shatskyy -- SUSE Software Solutions Germany GmbH Maxfeldstr. 5 90409 Nuremberg Germany

...

On Oct 17, 2019, at 1:39 AM, Yuri Weinstein <yweinste(a)redhat.com> wrote: Kyr Here is how I did it: RERUN=yuriw-2019-10-15_22:08:48-rbd-wip-yuri8-testing-2019-10-11-1347-luminous-distro-basic-smithi CEPH_QA_MAIL="ceph-qa(a)ceph.io"h.io"; MACHINE_NAME=smithi; CEPH_BRANCH=wip-yuri8-testing-2019-10-11-1347-luminous teuthology-suite -v -c $CEPH_BRANCH -m $MACHINE_NAME -r $RERUN --suite-repo https://github.com/ceph/ceph-ci.git --ceph-repo https://github.com/ceph/ceph-ci.git --suite-branch $CEPH_BRANCH -p 70 -R fail,dead,running,waiting to test the fix add "-t wip-wwn-fix" On Wed, Oct 16, 2019 at 4:36 PM kyr <kshatskyy(a)suse.de> wrote:

So I ran a job on smithi against teuthology code which is supposed to have "No space left on device": http://qa-proxy.ceph.com/teuthology/kyr-2019-10-16_22:55:36-smoke:basic-mas… And it passed, has not this issue. Which exact suite does reproduce the issue? Kyrylo Shatskyy -- SUSE Software Solutions Germany GmbH Maxfeldstr. 5 90409 Nuremberg Germany On Oct 17, 2019, at 12:35 AM, Gregory Farnum <gfarnum(a)redhat.com> wrote: On Wed, Oct 16, 2019 at 2:39 PM kyr <kshatskyy(a)suse.de> wrote: I hope the nathans fix can probably do the thing, however it does not cover the the log referenced in the description of https://tracker.ceph.com/issues/42313 because teuthology worker does not include that fix which is supposed to be cause for "No space left on device" issue. I'm not quite sure what you mean here. I think one of these addresses your statement? 1) we were creating very small OSDs on the root device since the partitions weren't being mounted, and so these jobs actually filled them up as a consequence of that. 2) most of the teuthology repo is pulled fresh from master on every run. The workers themselves require restarting to get updates but that's pretty rare. (See https://github.com/ceph/teuthology/blob/master/teuthology/worker.py#L82) Can some one give one-job teuthology-suite command that 100% reproducing the issue? Kyrylo Shatskyy -- SUSE Software Solutions Germany GmbH Maxfeldstr. 5 90409 Nuremberg Germany On Oct 16, 2019, at 11:14 PM, Nathan Cutler <ncutler(a)suse.com> wrote: On Wed, Oct 16, 2019 at 12:43:32PM -0700, Gregory Farnum wrote: On Wed, Oct 16, 2019 at 12:24 PM David Galloway <dgallowa(a)redhat.com> wrote: Yuri just reminded me that he's seeing this problem on the mimic branch. Does that mean this PR just needs to be backported to all branches? https://github.com/ceph/ceph/pull/30792 I'd be surprised if that one (changing iteritems() to items()) could cause this, and it's not a fix for any known bugs, just ongoing py3 work. When I said "that commit" I was referring to https://github.com/ceph/teuthology/commit/41a13eca480e38cfeeba7a180b4516b90…, which is in the teuthology repo and thus hits every test run. Looking at the comments across https://github.com/ceph/teuthology/pull/1332 and https://tracker.ceph.com/issues/42313 it sounds like that teuthology commit accidentally fixed a bug which triggered another bug that we're not sure how to resolve, but perhaps I'm misunderstanding? I think I understand what's going on. Here's an interim fix: https://github.com/ceph/teuthology/pull/1334 Assuming this PR really does fix the issue, the "real" fix will be to drop get_wwn_id_map altogether, since it has long outlived its usefulness ( see https://tracker.ceph.com/issues/14855 ). Nathan _______________________________________________ Dev mailing list -- dev(a)ceph.io To unsubscribe send an email to dev-leave(a)ceph.io _______________________________________________ Dev mailing list -- dev(a)ceph.io To unsubscribe send an email to dev-leave(a)ceph.io

_______________________________________________ Dev mailing list -- dev(a)ceph.io To unsubscribe send an email to dev-leave(a)ceph.io

Yuri Weinstein

3:04 a.m.

I think David confirmed the same issue on master On Wed, Oct 16, 2019 at 5:03 PM kyr <kshatskyy(a)suse.de> wrote: > > So Yuri, > > is it reproducible only for luminous? or you have seen this on master or any other branches? > > Kyrylo Shatskyy > -- > SUSE Software Solutions Germany GmbH > Maxfeldstr. 5 > 90409 Nuremberg > Germany > > > > On Oct 17, 2019, at 1:39 AM, Yuri Weinstein <yweinste(a)redhat.com> wrote: > > > > Kyr > > > > Here is how I did it: > > > > RERUN=yuriw-2019-10-15_22:08:48-rbd-wip-yuri8-testing-2019-10-11-1347-luminous-distro-basic-smithi > > CEPH_QA_MAIL="ceph-qa(a)ceph.io"h.io"; MACHINE_NAME=smithi; > > CEPH_BRANCH=wip-yuri8-testing-2019-10-11-1347-luminous > > teuthology-suite -v -c $CEPH_BRANCH -m $MACHINE_NAME -r $RERUN > > --suite-repo https://github.com/ceph/ceph-ci.git --ceph-repo > > https://github.com/ceph/ceph-ci.git --suite-branch $CEPH_BRANCH -p 70 > > -R fail,dead,running,waiting > > > > to test the fix add "-t wip-wwn-fix" > > > > On Wed, Oct 16, 2019 at 4:36 PM kyr <kshatskyy(a)suse.de> wrote: > >> > >> So I ran a job on smithi against teuthology code which is supposed to have "No space left on device": > >> > >> http://qa-proxy.ceph.com/teuthology/kyr-2019-10-16_22:55:36-smoke:basic-mas… > >> > >> And it passed, has not this issue. Which exact suite does reproduce the issue? > >> > >> Kyrylo Shatskyy > >> -- > >> SUSE Software Solutions Germany GmbH > >> Maxfeldstr. 5 > >> 90409 Nuremberg > >> Germany > >> > >> > >> On Oct 17, 2019, at 12:35 AM, Gregory Farnum <gfarnum(a)redhat.com> wrote: > >> > >> On Wed, Oct 16, 2019 at 2:39 PM kyr <kshatskyy(a)suse.de> wrote: > >> > >> > >> I hope the nathans fix can probably do the thing, however it does not cover the the log referenced in the description of https://tracker.ceph.com/issues/42313 because teuthology worker does not include that fix which is supposed to be cause for "No space left on device" issue. > >> > >> > >> I'm not quite sure what you mean here. I think one of these addresses > >> your statement? > >> 1) we were creating very small OSDs on the root device since the > >> partitions weren't being mounted, and so these jobs actually filled > >> them up as a consequence of that. > >> 2) most of the teuthology repo is pulled fresh from master on every > >> run. The workers themselves require restarting to get updates but > >> that's pretty rare. (See > >> https://github.com/ceph/teuthology/blob/master/teuthology/worker.py#L82) > >> > >> > >> > >> Can some one give one-job teuthology-suite command that 100% reproducing the issue? > >> > >> Kyrylo Shatskyy > >> -- > >> SUSE Software Solutions Germany GmbH > >> Maxfeldstr. 5 > >> 90409 Nuremberg > >> Germany > >> > >> > >> On Oct 16, 2019, at 11:14 PM, Nathan Cutler <ncutler(a)suse.com> wrote: > >> > >> On Wed, Oct 16, 2019 at 12:43:32PM -0700, Gregory Farnum wrote: > >> > >> On Wed, Oct 16, 2019 at 12:24 PM David Galloway <dgallowa(a)redhat.com> wrote: > >> > >> > >> Yuri just reminded me that he's seeing this problem on the mimic branch. > >> > >> Does that mean this PR just needs to be backported to all branches? > >> > >> https://github.com/ceph/ceph/pull/30792 > >> > >> > >> I'd be surprised if that one (changing iteritems() to items()) could > >> cause this, and it's not a fix for any known bugs, just ongoing py3 > >> work. > >> > >> When I said "that commit" I was referring to > >> https://github.com/ceph/teuthology/commit/41a13eca480e38cfeeba7a180b4516b90…, > >> which is in the teuthology repo and thus hits every test run. Looking > >> at the comments across https://github.com/ceph/teuthology/pull/1332 > >> and https://tracker.ceph.com/issues/42313 it sounds like that > >> teuthology commit accidentally fixed a bug which triggered another bug > >> that we're not sure how to resolve, but perhaps I'm misunderstanding? > >> > >> > >> I think I understand what's going on. Here's an interim fix: > >> https://github.com/ceph/teuthology/pull/1334 > >> > >> Assuming this PR really does fix the issue, the "real" fix will be to drop > >> get_wwn_id_map altogether, since it has long outlived its usefulness ( see > >> https://tracker.ceph.com/issues/14855 ). > >> > >> Nathan > >> _______________________________________________ > >> Dev mailing list -- dev(a)ceph.io > >> To unsubscribe send an email to dev-leave(a)ceph.io > >> > >> > >> _______________________________________________ > >> Dev mailing list -- dev(a)ceph.io > >> To unsubscribe send an email to dev-leave(a)ceph.io > >> > >> > > _______________________________________________ > > Dev mailing list -- dev(a)ceph.io > > To unsubscribe send an email to dev-leave(a)ceph.io > > >

Jason Dillaman

3:06 a.m.

On Wed, Oct 16, 2019 at 8:03 PM kyr <kshatskyy(a)suse.de> wrote:

...

So Yuri, is it reproducible only for luminous? or you have seen this on master or any other branches?

It's on all branches -- as of at least last week.

...

Kyrylo Shatskyy -- SUSE Software Solutions Germany GmbH Maxfeldstr. 5 90409 Nuremberg Germany

_______________________________________________ Dev mailing list -- dev(a)ceph.io To unsubscribe send an email to dev-leave(a)ceph.io

-- Jason

Kyrylo Shatskyy

3:36 a.m.

Here is the log which supposed to have this issue: http://qa-proxy.ceph.com/teuthology/yuriw-2019-10-09_15:42:09-rbd-wip-yuri5… $ curl -s http://qa-proxy.ceph.com/teuthology/yuriw-2019-10-09_15:42:09-rbd-wip-yuri5… | grep -c "No space left" 90 The 41a13ec is merged after this job even executed: 0456e3e 2019-10-10 19:01 +0200 kshtsk M─┤ Merge pull request #1318 from kshtsk/wip-misc-use-remote-sh 41a13ec 2019-10-09 00:04 +0200 Kyr Shatskyy │ o {origin/wip-misc-use-remote-sh} misc: use remote.sh instead of remote.run It is another proof it is not the cause of the failure. So my question, why do we still consider it is as the only reason? Do we have any other ideas? Kyrylo Shatskyy -- SUSE Software Solutions Germany GmbH Maxfeldstr. 5 90409 Nuremberg Germany On Oct 17, 2019, at 2:06 AM, Jason Dillaman <jdillama@redhat.com<mailto:jdillama@redhat.com>> wrote: On Wed, Oct 16, 2019 at 8:03 PM kyr <kshatskyy@suse.de<mailto:kshatskyy@suse.de>> wrote: So Yuri, is it reproducible only for luminous? or you have seen this on master or any other branches? It's on all branches -- as of at least last week. Kyrylo Shatskyy -- SUSE Software Solutions Germany GmbH Maxfeldstr. 5 90409 Nuremberg Germany On Oct 17, 2019, at 1:39 AM, Yuri Weinstein <yweinste@redhat.com<mailto:yweinste@redhat.com>> wrote: Kyr Here is how I did it: RERUN=yuriw-2019-10-15_22:08:48-rbd-wip-yuri8-testing-2019-10-11-1347-luminous-distro-basic-smithi CEPH_QA_MAIL="ceph-qa@ceph.io<mailto:ceph-qa@ceph.io>"; MACHINE_NAME=smithi; CEPH_BRANCH=wip-yuri8-testing-2019-10-11-1347-luminous teuthology-suite -v -c $CEPH_BRANCH -m $MACHINE_NAME -r $RERUN --suite-repo https://github.com/ceph/ceph-ci.git --ceph-repo https://github.com/ceph/ceph-ci.git --suite-branch $CEPH_BRANCH -p 70 -R fail,dead,running,waiting to test the fix add "-t wip-wwn-fix" On Wed, Oct 16, 2019 at 4:36 PM kyr <kshatskyy@suse.de<mailto:kshatskyy@suse.de>> wrote: So I ran a job on smithi against teuthology code which is supposed to have "No space left on device": http://qa-proxy.ceph.com/teuthology/kyr-2019-10-16_22:55:36-smoke:basic-mas… And it passed, has not this issue. Which exact suite does reproduce the issue? Kyrylo Shatskyy -- SUSE Software Solutions Germany GmbH Maxfeldstr. 5 90409 Nuremberg Germany On Oct 17, 2019, at 12:35 AM, Gregory Farnum <gfarnum(a)redhat.com> wrote: On Wed, Oct 16, 2019 at 2:39 PM kyr <kshatskyy(a)suse.de> wrote: I hope the nathans fix can probably do the thing, however it does not cover the the log referenced in the description of https://tracker.ceph.com/issues/42313 because teuthology worker does not include that fix which is supposed to be cause for "No space left on device" issue. I'm not quite sure what you mean here. I think one of these addresses your statement? 1) we were creating very small OSDs on the root device since the partitions weren't being mounted, and so these jobs actually filled them up as a consequence of that. 2) most of the teuthology repo is pulled fresh from master on every run. The workers themselves require restarting to get updates but that's pretty rare. (See https://github.com/ceph/teuthology/blob/master/teuthology/worker.py#L82) Can some one give one-job teuthology-suite command that 100% reproducing the issue? Kyrylo Shatskyy -- SUSE Software Solutions Germany GmbH Maxfeldstr. 5 90409 Nuremberg Germany On Oct 16, 2019, at 11:14 PM, Nathan Cutler <ncutler(a)suse.com> wrote: On Wed, Oct 16, 2019 at 12:43:32PM -0700, Gregory Farnum wrote: On Wed, Oct 16, 2019 at 12:24 PM David Galloway <dgallowa(a)redhat.com> wrote: Yuri just reminded me that he's seeing this problem on the mimic branch. Does that mean this PR just needs to be backported to all branches? https://github.com/ceph/ceph/pull/30792 I'd be surprised if that one (changing iteritems() to items()) could cause this, and it's not a fix for any known bugs, just ongoing py3 work. When I said "that commit" I was referring to https://github.com/ceph/teuthology/commit/41a13eca480e38cfeeba7a180b4516b90…, which is in the teuthology repo and thus hits every test run. Looking at the comments across https://github.com/ceph/teuthology/pull/1332 and https://tracker.ceph.com/issues/42313 it sounds like that teuthology commit accidentally fixed a bug which triggered another bug that we're not sure how to resolve, but perhaps I'm misunderstanding? I think I understand what's going on. Here's an interim fix: https://github.com/ceph/teuthology/pull/1334 Assuming this PR really does fix the issue, the "real" fix will be to drop get_wwn_id_map altogether, since it has long outlived its usefulness ( see https://tracker.ceph.com/issues/14855 ). Nathan _______________________________________________ Dev mailing list -- dev(a)ceph.io To unsubscribe send an email to dev-leave(a)ceph.io _______________________________________________ Dev mailing list -- dev(a)ceph.io To unsubscribe send an email to dev-leave(a)ceph.io _______________________________________________ Dev mailing list -- dev@ceph.io<mailto:dev@ceph.io> To unsubscribe send an email to dev-leave@ceph.io<mailto:dev-leave@ceph.io> -- Jason

Jason Dillaman

4:07 a.m.

On Wed, Oct 16, 2019 at 8:37 PM Kyrylo Shatskyy <kyrylo.shatskyy(a)suse.com> wrote:

...

Here is the log which supposed to have this issue:

Negative -- stepping back here. I brought this issue up for these runs: http://pulpito.ceph.com/yuriw-2019-10-11_12:58:22-rbd-wip-yuri6-testing-201… http://pulpito.ceph.com/yuriw-2019-10-11_19:41:35-rbd-wip-yuri8-testing-201… In these runs, you can see the OSDs crashing because it runs out of space on the OSD block devices. The log you are looking at is a test where the OSDs at debug log level 20 most likely actually filled up "/var/log" -- but they didn't fill up their block device Under the runs above, when the OSD crashes, we can start to see logs like: 2019-10-13T09:31:17.097 INFO:tasks.ceph.mon.a.smithi167.stderr:2019-10-13 09:31:17.095566 7f18a21be700 -1 log_channel(cluster) log [ERR] : Health check failed: 4 full osd(s) (OSD_FULL) 2019-10-13T09:31:22.117 INFO:tasks.ceph.mon.a.smithi167.stderr:2019-10-13 09:31:22.115362 7f18a21be700 -1 log_channel(cluster) log [ERR] : Health check update: 1 full osd(s) (OSD_FULL) 2019-10-13T09:31:27.632 INFO:tasks.ceph.mon.a.smithi167.stderr:2019-10-13 09:31:27.630483 7f189f9b9700 -1 log_channel(cluster) log [ERR] : Health check failed: mon b is very low on available space (MON_DISK_CRIT) 2019-10-13T09:31:28.672 INFO:tasks.ceph.mon.a.smithi167.stderr:2019-10-13 09:31:28.670676 7f18a21be700 -1 log_channel(cluster) log [ERR] : Health check failed: 4 full osd(s) (OSD_FULL) 2019-10-13T09:31:32.980 INFO:tasks.ceph.mon.a.smithi167.stderr:2019-10-13 09:31:32.978369 7f18a21be700 -1 log_channel(cluster) log [ERR] : Health check update: mons a,b,c are very low on available space (MON_DISK_CRIT) 2019-10-13T09:31:34.061 INFO:tasks.ceph.mon.a.smithi167.stderr:2019-10-13 09:31:34.053935 7f18a21be700 -1 log_channel(cluster) log [ERR] : Health check update: 2 full osd(s) (OSD_FULL) 2019-10-13T09:31:40.153 INFO:tasks.ceph.osd.0.smithi167.stderr:2019-10-13 09:31:40.151323 7fd400753700 -1 log_channel(cluster) log [ERR] : full status failsafe engaged, dropping updates, now 97% full 2019-10-13T09:31:40.328 INFO:tasks.ceph.mon.a.smithi167.stderr:2019-10-13 09:31:40.326090 7f18a21be700 -1 log_channel(cluster) log [ERR] : Health check failed: 1 full osd(s) (OSD_FULL) 2019-10-13T09:31:40.494 INFO:tasks.ceph.osd.3.smithi167.stderr:2019-10-13 09:31:40.492169 7fda29d9c700 -1 log_channel(cluster) log [ERR] : full status failsafe engaged, dropping updates, now 97% full 2019-10-13T09:31:40.900 INFO:tasks.ceph.osd.1.smithi167.stderr:2019-10-13 09:31:40.898979 7f10785bb700 -1 log_channel(cluster) log [ERR] : full status failsafe engaged, dropping updates, now 98% full 2019-10-13T09:31:41.279 INFO:tasks.ceph.osd.4.smithi174.stderr:2019-10-13 09:31:41.277991 7f8651337700 -1 log_channel(cluster) log [ERR] : full status failsafe engaged, dropping updates, now 98% full 2019-10-13T09:31:41.525 INFO:tasks.ceph.osd.2.smithi167.stderr:2019-10-13 09:31:41.523075 7f1df905a700 -1 log_channel(cluster) log [ERR] : full status failsafe engaged, dropping updates, now 98% full 2019-10-13T09:31:42.493 INFO:tasks.ceph.osd.7.smithi174.stderr:2019-10-13 09:31:42.491132 7febd19f8700 -1 log_channel(cluster) log [ERR] : full status failsafe engaged, dropping updates, now 98% full 2019-10-13T09:31:43.281 INFO:tasks.ceph.osd.5.smithi174.stderr:2019-10-13 09:31:43.279310 7f9c80f21700 -1 log_channel(cluster) log [ERR] : full status failsafe engaged, dropping updates, now 98% full 2019-10-13T09:31:44.399 INFO:tasks.ceph.osd.6.smithi174.stderr:2019-10-13 09:31:44.398004 7f4c2bf61700 -1 log_channel(cluster) log [ERR] : full status failsafe engaged, dropping updates, now 98% full And why to the OSDs get full? Teuthology is running w/ 41a13ec included ... 2019-10-13T07:26:11.215 INFO:root:teuthology version: 1.0.0-471e0e3 ... as verified by ... $ git log --oneline 471e0e30 | grep 41a13ec || echo ABSENT 41a13eca misc: use remote.sh instead of remote.run While teuthology is actually formatting the NVMe devices, it never actually mounts them since logs like the following are missing ... 2019-10-09T20:53:48.276 INFO:teuthology.orchestra.run.smithi136:> sudo mount -t xfs -o noatime /dev/vg_nvme/lv_4 /var/lib/ceph/osd/ceph-0 ... and therefore the OSDs are filling up the 1TiB root partition:.

...

http://qa-proxy.ceph.com/teuthology/yuriw-2019-10-09_15:42:09-rbd-wip-yuri5… $ curl -s http://qa-proxy.ceph.com/teuthology/yuriw-2019-10-09_15:42:09-rbd-wip-yuri5… | grep -c "No space left" 90 The 41a13ec is merged after this job even executed: 0456e3e 2019-10-10 19:01 +0200 kshtsk M─┤ Merge pull request #1318 from kshtsk/wip-misc-use-remote-sh 41a13ec 2019-10-09 00:04 +0200 Kyr Shatskyy │ o {origin/wip-misc-use-remote-sh} misc: use remote.sh instead of remote.run It is another proof it is not the cause of the failure. So my question, why do we still consider it is as the only reason? Do we have any other ideas? Kyrylo Shatskyy -- SUSE Software Solutions Germany GmbH Maxfeldstr. 5 90409 Nuremberg Germany On Oct 17, 2019, at 2:06 AM, Jason Dillaman <jdillama(a)redhat.com> wrote: On Wed, Oct 16, 2019 at 8:03 PM kyr <kshatskyy(a)suse.de> wrote: So Yuri, is it reproducible only for luminous? or you have seen this on master or any other branches? It's on all branches -- as of at least last week. Kyrylo Shatskyy -- SUSE Software Solutions Germany GmbH Maxfeldstr. 5 90409 Nuremberg Germany On Oct 17, 2019, at 1:39 AM, Yuri Weinstein <yweinste(a)redhat.com> wrote: Kyr Here is how I did it: RERUN=yuriw-2019-10-15_22:08:48-rbd-wip-yuri8-testing-2019-10-11-1347-luminous-distro-basic-smithi CEPH_QA_MAIL="ceph-qa(a)ceph.io"h.io"; MACHINE_NAME=smithi; CEPH_BRANCH=wip-yuri8-testing-2019-10-11-1347-luminous teuthology-suite -v -c $CEPH_BRANCH -m $MACHINE_NAME -r $RERUN --suite-repo https://github.com/ceph/ceph-ci.git --ceph-repo https://github.com/ceph/ceph-ci.git --suite-branch $CEPH_BRANCH -p 70 -R fail,dead,running,waiting to test the fix add "-t wip-wwn-fix" On Wed, Oct 16, 2019 at 4:36 PM kyr <kshatskyy(a)suse.de> wrote: So I ran a job on smithi against teuthology code which is supposed to have "No space left on device": http://qa-proxy.ceph.com/teuthology/kyr-2019-10-16_22:55:36-smoke:basic-mas… And it passed, has not this issue. Which exact suite does reproduce the issue? Kyrylo Shatskyy -- SUSE Software Solutions Germany GmbH Maxfeldstr. 5 90409 Nuremberg Germany On Oct 17, 2019, at 12:35 AM, Gregory Farnum <gfarnum(a)redhat.com> wrote: On Wed, Oct 16, 2019 at 2:39 PM kyr <kshatskyy(a)suse.de> wrote: I hope the nathans fix can probably do the thing, however it does not cover the the log referenced in the description of https://tracker.ceph.com/issues/42313 because teuthology worker does not include that fix which is supposed to be cause for "No space left on device" issue. I'm not quite sure what you mean here. I think one of these addresses your statement? 1) we were creating very small OSDs on the root device since the partitions weren't being mounted, and so these jobs actually filled them up as a consequence of that. 2) most of the teuthology repo is pulled fresh from master on every run. The workers themselves require restarting to get updates but that's pretty rare. (See https://github.com/ceph/teuthology/blob/master/teuthology/worker.py#L82) Can some one give one-job teuthology-suite command that 100% reproducing the issue? Kyrylo Shatskyy -- SUSE Software Solutions Germany GmbH Maxfeldstr. 5 90409 Nuremberg Germany On Oct 16, 2019, at 11:14 PM, Nathan Cutler <ncutler(a)suse.com> wrote: On Wed, Oct 16, 2019 at 12:43:32PM -0700, Gregory Farnum wrote: On Wed, Oct 16, 2019 at 12:24 PM David Galloway <dgallowa(a)redhat.com> wrote: Yuri just reminded me that he's seeing this problem on the mimic branch. Does that mean this PR just needs to be backported to all branches? https://github.com/ceph/ceph/pull/30792 I'd be surprised if that one (changing iteritems() to items()) could cause this, and it's not a fix for any known bugs, just ongoing py3 work. When I said "that commit" I was referring to https://github.com/ceph/teuthology/commit/41a13eca480e38cfeeba7a180b4516b90…, which is in the teuthology repo and thus hits every test run. Looking at the comments across https://github.com/ceph/teuthology/pull/1332 and https://tracker.ceph.com/issues/42313 it sounds like that teuthology commit accidentally fixed a bug which triggered another bug that we're not sure how to resolve, but perhaps I'm misunderstanding? I think I understand what's going on. Here's an interim fix: https://github.com/ceph/teuthology/pull/1334 Assuming this PR really does fix the issue, the "real" fix will be to drop get_wwn_id_map altogether, since it has long outlived its usefulness ( see https://tracker.ceph.com/issues/14855 ). Nathan _______________________________________________ Dev mailing list -- dev(a)ceph.io To unsubscribe send an email to dev-leave(a)ceph.io _______________________________________________ Dev mailing list -- dev(a)ceph.io To unsubscribe send an email to dev-leave(a)ceph.io _______________________________________________ Dev mailing list -- dev(a)ceph.io To unsubscribe send an email to dev-leave(a)ceph.io -- Jason

-- Jason

David Galloway

4:28 p.m.

On 10/16/19 9:07 PM, Jason Dillaman wrote:

...

On Wed, Oct 16, 2019 at 8:37 PM Kyrylo Shatskyy <kyrylo.shatskyy(a)suse.com> wrote:

Here is the log which supposed to have this issue:

Almost perfectly described. lv_5 gets mounted at /var/lib/ceph by ceph-cm-ansible and the other 4 LVs on each NVMe are *supposed* to get mounted at a /var/lib/ceph/osd/$id directory. So the jobs were filling up the 15GB lv_5 partition if enough data was written because the OSDs were raw directories on lv_5 instead of mountpoints on the other LVs. Thank you much for chiming in and helping clarify the issue(s). > >> http://qa-proxy.ceph.com/teuthology/yuriw-2019-10-09_15:42:09-rbd-wip-yuri5… >> >> $ curl -s http://qa-proxy.ceph.com/teuthology/yuriw-2019-10-09_15:42:09-rbd-wip-yuri5… | grep -c "No space left" >> 90 >> >> The 41a13ec is merged after this job even executed: >> 0456e3e 2019-10-10 19:01 +0200 kshtsk M─┤ Merge pull request #1318 from kshtsk/wip-misc-use-remote-sh >> 41a13ec 2019-10-09 00:04 +0200 Kyr Shatskyy │ o {origin/wip-misc-use-remote-sh} misc: use remote.sh instead of remote.run >> >> It is another proof it is not the cause of the failure. >> So my question, why do we still consider it is as the only reason? Do we have any other ideas? >> >> Kyrylo Shatskyy >> -- >> SUSE Software Solutions Germany GmbH >> Maxfeldstr. 5 >> 90409 Nuremberg >> Germany >> >> >> On Oct 17, 2019, at 2:06 AM, Jason Dillaman <jdillama(a)redhat.com> wrote: >> >> On Wed, Oct 16, 2019 at 8:03 PM kyr <kshatskyy(a)suse.de> wrote: >> >> >> So Yuri, >> >> is it reproducible only for luminous? or you have seen this on master or any other branches? >> >> >> It's on all branches -- as of at least last week. >> >> >> Kyrylo Shatskyy >> -- >> SUSE Software Solutions Germany GmbH >> Maxfeldstr. 5 >> 90409 Nuremberg >> Germany >> >> >> On Oct 17, 2019, at 1:39 AM, Yuri Weinstein <yweinste(a)redhat.com> wrote: >> >> Kyr >> >> Here is how I did it: >> >> RERUN=yuriw-2019-10-15_22:08:48-rbd-wip-yuri8-testing-2019-10-11-1347-luminous-distro-basic-smithi >> CEPH_QA_MAIL="ceph-qa(a)ceph.io"h.io"; MACHINE_NAME=smithi; >> CEPH_BRANCH=wip-yuri8-testing-2019-10-11-1347-luminous >> teuthology-suite -v -c $CEPH_BRANCH -m $MACHINE_NAME -r $RERUN >> --suite-repo https://github.com/ceph/ceph-ci.git --ceph-repo >> https://github.com/ceph/ceph-ci.git --suite-branch $CEPH_BRANCH -p 70 >> -R fail,dead,running,waiting >> >> to test the fix add "-t wip-wwn-fix" >> >> On Wed, Oct 16, 2019 at 4:36 PM kyr <kshatskyy(a)suse.de> wrote: >> >> >> So I ran a job on smithi against teuthology code which is supposed to have "No space left on device": >> >> http://qa-proxy.ceph.com/teuthology/kyr-2019-10-16_22:55:36-smoke:basic-mas… >> >> And it passed, has not this issue. Which exact suite does reproduce the issue? >> >> Kyrylo Shatskyy >> -- >> SUSE Software Solutions Germany GmbH >> Maxfeldstr. 5 >> 90409 Nuremberg >> Germany >> >> >> On Oct 17, 2019, at 12:35 AM, Gregory Farnum <gfarnum(a)redhat.com> wrote: >> >> On Wed, Oct 16, 2019 at 2:39 PM kyr <kshatskyy(a)suse.de> wrote: >> >> >> I hope the nathans fix can probably do the thing, however it does not cover the the log referenced in the description of https://tracker.ceph.com/issues/42313 because teuthology worker does not include that fix which is supposed to be cause for "No space left on device" issue. >> >> >> I'm not quite sure what you mean here. I think one of these addresses >> your statement? >> 1) we were creating very small OSDs on the root device since the >> partitions weren't being mounted, and so these jobs actually filled >> them up as a consequence of that. >> 2) most of the teuthology repo is pulled fresh from master on every >> run. The workers themselves require restarting to get updates but >> that's pretty rare. (See >> https://github.com/ceph/teuthology/blob/master/teuthology/worker.py#L82) >> >> >> >> Can some one give one-job teuthology-suite command that 100% reproducing the issue? >> >> Kyrylo Shatskyy >> -- >> SUSE Software Solutions Germany GmbH >> Maxfeldstr. 5 >> 90409 Nuremberg >> Germany >> >> >> On Oct 16, 2019, at 11:14 PM, Nathan Cutler <ncutler(a)suse.com> wrote: >> >> On Wed, Oct 16, 2019 at 12:43:32PM -0700, Gregory Farnum wrote: >> >> On Wed, Oct 16, 2019 at 12:24 PM David Galloway <dgallowa(a)redhat.com> wrote: >> >> >> Yuri just reminded me that he's seeing this problem on the mimic branch. >> >> Does that mean this PR just needs to be backported to all branches? >> >> https://github.com/ceph/ceph/pull/30792 >> >> >> I'd be surprised if that one (changing iteritems() to items()) could >> cause this, and it's not a fix for any known bugs, just ongoing py3 >> work. >> >> When I said "that commit" I was referring to >> https://github.com/ceph/teuthology/commit/41a13eca480e38cfeeba7a180b4516b90…, >> which is in the teuthology repo and thus hits every test run. Looking >> at the comments across https://github.com/ceph/teuthology/pull/1332 >> and https://tracker.ceph.com/issues/42313 it sounds like that >> teuthology commit accidentally fixed a bug which triggered another bug >> that we're not sure how to resolve, but perhaps I'm misunderstanding? >> >> >> I think I understand what's going on. Here's an interim fix: >> https://github.com/ceph/teuthology/pull/1334 >> >> Assuming this PR really does fix the issue, the "real" fix will be to drop >> get_wwn_id_map altogether, since it has long outlived its usefulness ( see >> https://tracker.ceph.com/issues/14855 ). >> >> Nathan >> _______________________________________________ >> Dev mailing list -- dev(a)ceph.io >> To unsubscribe send an email to dev-leave(a)ceph.io >> >> >> _______________________________________________ >> Dev mailing list -- dev(a)ceph.io >> To unsubscribe send an email to dev-leave(a)ceph.io >> >> >> _______________________________________________ >> Dev mailing list -- dev(a)ceph.io >> To unsubscribe send an email to dev-leave(a)ceph.io >> >> >> >> >> -- >> Jason >> >> > >

Yuri Weinstein

7:07 p.m.

We are back in business, thank you all ! On Thu, Oct 17, 2019 at 6:28 AM David Galloway <dgallowa(a)redhat.com> wrote: > > > On 10/16/19 9:07 PM, Jason Dillaman wrote: > > On Wed, Oct 16, 2019 at 8:37 PM Kyrylo Shatskyy > > <kyrylo.shatskyy(a)suse.com> wrote: > >> > >> Here is the log which supposed to have this issue: > > > > Negative -- stepping back here. I brought this issue up for these runs: > > > > http://pulpito.ceph.com/yuriw-2019-10-11_12:58:22-rbd-wip-yuri6-testing-201… > > http://pulpito.ceph.com/yuriw-2019-10-11_19:41:35-rbd-wip-yuri8-testing-201… > > > > In these runs, you can see the OSDs crashing because it runs out of > > space on the OSD block devices. The log you are looking at is a test > > where the OSDs at debug log level 20 most likely actually filled up > > "/var/log" -- but they didn't fill up their block device > > > > Under the runs above, when the OSD crashes, we can start to see logs like: > > > > 2019-10-13T09:31:17.097 > > INFO:tasks.ceph.mon.a.smithi167.stderr:2019-10-13 09:31:17.095566 > > 7f18a21be700 -1 log_channel(cluster) log [ERR] : Health check failed: > > 4 full osd(s) (OSD_FULL) > > 2019-10-13T09:31:22.117 > > INFO:tasks.ceph.mon.a.smithi167.stderr:2019-10-13 09:31:22.115362 > > 7f18a21be700 -1 log_channel(cluster) log [ERR] : Health check update: > > 1 full osd(s) (OSD_FULL) > > 2019-10-13T09:31:27.632 > > INFO:tasks.ceph.mon.a.smithi167.stderr:2019-10-13 09:31:27.630483 > > 7f189f9b9700 -1 log_channel(cluster) log [ERR] : Health check failed: > > mon b is very low on available space (MON_DISK_CRIT) > > 2019-10-13T09:31:28.672 > > INFO:tasks.ceph.mon.a.smithi167.stderr:2019-10-13 09:31:28.670676 > > 7f18a21be700 -1 log_channel(cluster) log [ERR] : Health check failed: > > 4 full osd(s) (OSD_FULL) > > 2019-10-13T09:31:32.980 > > INFO:tasks.ceph.mon.a.smithi167.stderr:2019-10-13 09:31:32.978369 > > 7f18a21be700 -1 log_channel(cluster) log [ERR] : Health check update: > > mons a,b,c are very low on available space (MON_DISK_CRIT) > > 2019-10-13T09:31:34.061 > > INFO:tasks.ceph.mon.a.smithi167.stderr:2019-10-13 09:31:34.053935 > > 7f18a21be700 -1 log_channel(cluster) log [ERR] : Health check update: > > 2 full osd(s) (OSD_FULL) > > 2019-10-13T09:31:40.153 > > INFO:tasks.ceph.osd.0.smithi167.stderr:2019-10-13 09:31:40.151323 > > 7fd400753700 -1 log_channel(cluster) log [ERR] : full status failsafe > > engaged, dropping updates, now 97% full > > 2019-10-13T09:31:40.328 > > INFO:tasks.ceph.mon.a.smithi167.stderr:2019-10-13 09:31:40.326090 > > 7f18a21be700 -1 log_channel(cluster) log [ERR] : Health check failed: > > 1 full osd(s) (OSD_FULL) > > 2019-10-13T09:31:40.494 > > INFO:tasks.ceph.osd.3.smithi167.stderr:2019-10-13 09:31:40.492169 > > 7fda29d9c700 -1 log_channel(cluster) log [ERR] : full status failsafe > > engaged, dropping updates, now 97% full > > 2019-10-13T09:31:40.900 > > INFO:tasks.ceph.osd.1.smithi167.stderr:2019-10-13 09:31:40.898979 > > 7f10785bb700 -1 log_channel(cluster) log [ERR] : full status failsafe > > engaged, dropping updates, now 98% full > > 2019-10-13T09:31:41.279 > > INFO:tasks.ceph.osd.4.smithi174.stderr:2019-10-13 09:31:41.277991 > > 7f8651337700 -1 log_channel(cluster) log [ERR] : full status failsafe > > engaged, dropping updates, now 98% full > > 2019-10-13T09:31:41.525 > > INFO:tasks.ceph.osd.2.smithi167.stderr:2019-10-13 09:31:41.523075 > > 7f1df905a700 -1 log_channel(cluster) log [ERR] : full status failsafe > > engaged, dropping updates, now 98% full > > 2019-10-13T09:31:42.493 > > INFO:tasks.ceph.osd.7.smithi174.stderr:2019-10-13 09:31:42.491132 > > 7febd19f8700 -1 log_channel(cluster) log [ERR] : full status failsafe > > engaged, dropping updates, now 98% full > > 2019-10-13T09:31:43.281 > > INFO:tasks.ceph.osd.5.smithi174.stderr:2019-10-13 09:31:43.279310 > > 7f9c80f21700 -1 log_channel(cluster) log [ERR] : full status failsafe > > engaged, dropping updates, now 98% full > > 2019-10-13T09:31:44.399 > > INFO:tasks.ceph.osd.6.smithi174.stderr:2019-10-13 09:31:44.398004 > > 7f4c2bf61700 -1 log_channel(cluster) log [ERR] : full status failsafe > > engaged, dropping updates, now 98% full > > > > And why to the OSDs get full? Teuthology is running w/ 41a13ec included ... > > > > 2019-10-13T07:26:11.215 INFO:root:teuthology version: 1.0.0-471e0e3 > > > > ... as verified by ... > > > > $ git log --oneline 471e0e30 | grep 41a13ec || echo ABSENT > > 41a13eca misc: use remote.sh instead of remote.run > > > > While teuthology is actually formatting the NVMe devices, it never > > actually mounts them since logs like the following are missing ... > > > > 2019-10-09T20:53:48.276 INFO:teuthology.orchestra.run.smithi136:> sudo > > mount -t xfs -o noatime /dev/vg_nvme/lv_4 /var/lib/ceph/osd/ceph-0 > > > > ... and therefore the OSDs are filling up the 1TiB root partition:. > > Almost perfectly described. lv_5 gets mounted at /var/lib/ceph by > ceph-cm-ansible and the other 4 LVs on each NVMe are *supposed* to get > mounted at a /var/lib/ceph/osd/$id directory. So the jobs were filling > up the 15GB lv_5 partition if enough data was written because the OSDs > were raw directories on lv_5 instead of mountpoints on the other LVs. > > Thank you much for chiming in and helping clarify the issue(s). > > > > >> http://qa-proxy.ceph.com/teuthology/yuriw-2019-10-09_15:42:09-rbd-wip-yuri5… > >> > >> $ curl -s http://qa-proxy.ceph.com/teuthology/yuriw-2019-10-09_15:42:09-rbd-wip-yuri5… | grep -c "No space left" > >> 90 > >> > >> The 41a13ec is merged after this job even executed: > >> 0456e3e 2019-10-10 19:01 +0200 kshtsk M─┤ Merge pull request #1318 from kshtsk/wip-misc-use-remote-sh > >> 41a13ec 2019-10-09 00:04 +0200 Kyr Shatskyy │ o {origin/wip-misc-use-remote-sh} misc: use remote.sh instead of remote.run > >> > >> It is another proof it is not the cause of the failure. > >> So my question, why do we still consider it is as the only reason? Do we have any other ideas? > >> > >> Kyrylo Shatskyy > >> -- > >> SUSE Software Solutions Germany GmbH > >> Maxfeldstr. 5 > >> 90409 Nuremberg > >> Germany > >> > >> > >> On Oct 17, 2019, at 2:06 AM, Jason Dillaman <jdillama(a)redhat.com> wrote: > >> > >> On Wed, Oct 16, 2019 at 8:03 PM kyr <kshatskyy(a)suse.de> wrote: > >> > >> > >> So Yuri, > >> > >> is it reproducible only for luminous? or you have seen this on master or any other branches? > >> > >> > >> It's on all branches -- as of at least last week. > >> > >> > >> Kyrylo Shatskyy > >> -- > >> SUSE Software Solutions Germany GmbH > >> Maxfeldstr. 5 > >> 90409 Nuremberg > >> Germany > >> > >> > >> On Oct 17, 2019, at 1:39 AM, Yuri Weinstein <yweinste(a)redhat.com> wrote: > >> > >> Kyr > >> > >> Here is how I did it: > >> > >> RERUN=yuriw-2019-10-15_22:08:48-rbd-wip-yuri8-testing-2019-10-11-1347-luminous-distro-basic-smithi > >> CEPH_QA_MAIL="ceph-qa(a)ceph.io"h.io"; MACHINE_NAME=smithi; > >> CEPH_BRANCH=wip-yuri8-testing-2019-10-11-1347-luminous > >> teuthology-suite -v -c $CEPH_BRANCH -m $MACHINE_NAME -r $RERUN > >> --suite-repo https://github.com/ceph/ceph-ci.git --ceph-repo > >> https://github.com/ceph/ceph-ci.git --suite-branch $CEPH_BRANCH -p 70 > >> -R fail,dead,running,waiting > >> > >> to test the fix add "-t wip-wwn-fix" > >> > >> On Wed, Oct 16, 2019 at 4:36 PM kyr <kshatskyy(a)suse.de> wrote: > >> > >> > >> So I ran a job on smithi against teuthology code which is supposed to have "No space left on device": > >> > >> http://qa-proxy.ceph.com/teuthology/kyr-2019-10-16_22:55:36-smoke:basic-mas… > >> > >> And it passed, has not this issue. Which exact suite does reproduce the issue? > >> > >> Kyrylo Shatskyy > >> -- > >> SUSE Software Solutions Germany GmbH > >> Maxfeldstr. 5 > >> 90409 Nuremberg > >> Germany > >> > >> > >> On Oct 17, 2019, at 12:35 AM, Gregory Farnum <gfarnum(a)redhat.com> wrote: > >> > >> On Wed, Oct 16, 2019 at 2:39 PM kyr <kshatskyy(a)suse.de> wrote: > >> > >> > >> I hope the nathans fix can probably do the thing, however it does not cover the the log referenced in the description of https://tracker.ceph.com/issues/42313 because teuthology worker does not include that fix which is supposed to be cause for "No space left on device" issue. > >> > >> > >> I'm not quite sure what you mean here. I think one of these addresses > >> your statement? > >> 1) we were creating very small OSDs on the root device since the > >> partitions weren't being mounted, and so these jobs actually filled > >> them up as a consequence of that. > >> 2) most of the teuthology repo is pulled fresh from master on every > >> run. The workers themselves require restarting to get updates but > >> that's pretty rare. (See > >> https://github.com/ceph/teuthology/blob/master/teuthology/worker.py#L82) > >> > >> > >> > >> Can some one give one-job teuthology-suite command that 100% reproducing the issue? > >> > >> Kyrylo Shatskyy > >> -- > >> SUSE Software Solutions Germany GmbH > >> Maxfeldstr. 5 > >> 90409 Nuremberg > >> Germany > >> > >> > >> On Oct 16, 2019, at 11:14 PM, Nathan Cutler <ncutler(a)suse.com> wrote: > >> > >> On Wed, Oct 16, 2019 at 12:43:32PM -0700, Gregory Farnum wrote: > >> > >> On Wed, Oct 16, 2019 at 12:24 PM David Galloway <dgallowa(a)redhat.com> wrote: > >> > >> > >> Yuri just reminded me that he's seeing this problem on the mimic branch. > >> > >> Does that mean this PR just needs to be backported to all branches? > >> > >> https://github.com/ceph/ceph/pull/30792 > >> > >> > >> I'd be surprised if that one (changing iteritems() to items()) could > >> cause this, and it's not a fix for any known bugs, just ongoing py3 > >> work. > >> > >> When I said "that commit" I was referring to > >> https://github.com/ceph/teuthology/commit/41a13eca480e38cfeeba7a180b4516b90…, > >> which is in the teuthology repo and thus hits every test run. Looking > >> at the comments across https://github.com/ceph/teuthology/pull/1332 > >> and https://tracker.ceph.com/issues/42313 it sounds like that > >> teuthology commit accidentally fixed a bug which triggered another bug > >> that we're not sure how to resolve, but perhaps I'm misunderstanding? > >> > >> > >> I think I understand what's going on. Here's an interim fix: > >> https://github.com/ceph/teuthology/pull/1334 > >> > >> Assuming this PR really does fix the issue, the "real" fix will be to drop > >> get_wwn_id_map altogether, since it has long outlived its usefulness ( see > >> https://tracker.ceph.com/issues/14855 ). > >> > >> Nathan > >> _______________________________________________ > >> Dev mailing list -- dev(a)ceph.io > >> To unsubscribe send an email to dev-leave(a)ceph.io > >> > >> > >> _______________________________________________ > >> Dev mailing list -- dev(a)ceph.io > >> To unsubscribe send an email to dev-leave(a)ceph.io > >> > >> > >> _______________________________________________ > >> Dev mailing list -- dev(a)ceph.io > >> To unsubscribe send an email to dev-leave(a)ceph.io > >> > >> > >> > >> > >> -- > >> Jason > >> > >> > > > >

Kyrylo Shatskyy

12:41 a.m.

I hope the nathans fix can probably do the thing, however it does not cover the the log referenced in the description of https://tracker.ceph.com/issues/42313 because teuthology worker does not include that fix which is supposed to be cause for "No space left on device" issue. Can some one give one-job teuthology-suite command that 100% reproducing the issue? Kyrylo Shatskyy -- SUSE Software Solutions Germany GmbH Maxfeldstr. 5 90409 Nuremberg Germany On Oct 16, 2019, at 11:14 PM, Nathan Cutler <NCutler@suse.com<mailto:NCutler@suse.com>> wrote: On Wed, Oct 16, 2019 at 12:43:32PM -0700, Gregory Farnum wrote: On Wed, Oct 16, 2019 at 12:24 PM David Galloway <dgallowa@redhat.com<mailto:dgallowa@redhat.com>> wrote: Yuri just reminded me that he's seeing this problem on the mimic branch. Does that mean this PR just needs to be backported to all branches? https://github.com/ceph/ceph/pull/30792 I'd be surprised if that one (changing iteritems() to items()) could cause this, and it's not a fix for any known bugs, just ongoing py3 work. When I said "that commit" I was referring to https://github.com/ceph/teuthology/commit/41a13eca480e38cfeeba7a180b4516b90…, which is in the teuthology repo and thus hits every test run. Looking at the comments across https://github.com/ceph/teuthology/pull/1332 and https://tracker.ceph.com/issues/42313 it sounds like that teuthology commit accidentally fixed a bug which triggered another bug that we're not sure how to resolve, but perhaps I'm misunderstanding? I think I understand what's going on. Here's an interim fix: https://github.com/ceph/teuthology/pull/1334 Assuming this PR really does fix the issue, the "real" fix will be to drop get_wwn_id_map altogether, since it has long outlived its usefulness ( see https://tracker.ceph.com/issues/14855 ). Nathan _______________________________________________ Dev mailing list -- dev(a)ceph.io To unsubscribe send an email to dev-leave(a)ceph.io

1660

days inactive

1661

days old

dev@ceph.io

Manage subscription

19 comments

7 participants

tags (0)

participants (7)

David Galloway
Gregory Farnum
Jason Dillaman
kyr
Kyrylo Shatskyy
Nathan Cutler
Yuri Weinstein