Dear Cephers,
This is a 14.2.4 cluster with device health metrics enabled. Since about a day ago, all mgr daemons go "silent" on me after a few hours, i.e. "ceph -s" shows:
cluster:
id: 269cf2b2-7e7c-4ceb-bd1b-a33d915ceee9
health: HEALTH_WARN
no active mgr
1/3 mons down, quorum mon001,mon002
services:
mon: 3 daemons, quorum mon001,mon002 (age 57m), out of quorum: mon003
mgr: no daemons active (since 56m)
...
(the third mon has a planned outage and will come back in a few days)
Checking the logs of the mgr daemons, I find "reset" messages at the time each one goes "silent". First, on the first mgr:
2019-11-01 21:34:40.286 7f2df6a6b700 0 log_channel(cluster) log [DBG] : pgmap v1798: 1585 pgs: 1585 active+clean; 1.1 TiB data, 2.3 TiB used, 136 TiB / 138 TiB avail
2019-11-01 21:34:41.458 7f2e0d59b700 0 client.0 ms_handle_reset on v2:10.160.16.1:6800/401248
2019-11-01 21:34:42.287 7f2df6a6b700 0 log_channel(cluster) log [DBG] : pgmap v1799: 1585 pgs: 1585 active+clean; 1.1 TiB data, 2.3 TiB used, 136 TiB / 138 TiB avail
and a bit later, on the standby mgr:
2019-11-01 22:18:14.892 7f7bcc8ae700 0 log_channel(cluster) log [DBG] : pgmap v1798: 1585 pgs: 166 active+clean+snaptrim, 858 active+clean+snaptrim_wait, 561 active+clean; 1.1 TiB data, 2.3 TiB used, 136 TiB / 138 TiB avail
2019-11-01 22:18:16.022 7f7be9e72700 0 client.0 ms_handle_reset on v2:10.160.16.2:6800/352196
2019-11-01 22:18:16.893 7f7bcc8ae700 0 log_channel(cluster) log [DBG] : pgmap v1799: 1585 pgs: 166 active+clean+snaptrim, 858 active+clean+snaptrim_wait, 561 active+clean; 1.1 TiB data, 2.3 TiB used, 136 TiB / 138 TiB avail
Interestingly, the dashboard still works, but presents outdated information, for example showing zero I/O going on.
I believe this started happening after the third mon went into its known downtime, but I am not fully sure that this was the trigger, since the cluster is still growing. It may also have been the addition of 24 more OSDs.
I also find other messages in the mgr logs which seem problematic, but I am not sure they are related:
------------------------------
2019-11-01 21:17:09.849 7f2df4266700 0 mgr[devicehealth] Error reading OMAP: [errno 22] Failed to operate read op for oid
Traceback (most recent call last):
File "/usr/share/ceph/mgr/devicehealth/module.py", line 396, in put_device_metrics
ioctx.operate_read_op(op, devid)
File "rados.pyx", line 516, in rados.requires.wrapper.validate_func (/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/14.2.4/rpm/el7/BUIL
D/ceph-14.2.4/build/src/pybind/rados/pyrex/rados.c:4721)
File "rados.pyx", line 3474, in rados.Ioctx.operate_read_op (/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/14.2.4/rpm/el7/BUILD/ceph-14.2.4/build/src/pybind/rados/pyrex/rados.c:36554)
InvalidArgumentError: [errno 22] Failed to operate read op for oid
------------------------------
or:
------------------------------
2019-11-01 21:33:53.977 7f7bd38bc700 0 mgr[devicehealth] Fail to parse JSON result from daemon osd.51 ()
2019-11-01 21:33:53.978 7f7bd38bc700 0 mgr[devicehealth] Fail to parse JSON result from daemon osd.52 ()
2019-11-01 21:33:53.979 7f7bd38bc700 0 mgr[devicehealth] Fail to parse JSON result from daemon osd.53 ()
------------------------------
The reason why I am cautious about the health metrics is that I observed a crash when trying to query them:
------------------------------
2019-11-01 20:21:23.661 7fa46314a700 0 log_channel(audit) log [DBG] : from='client.174136 -' entity='client.admin' cmd=[{"prefix": "device get-health-metrics", "devid": "osd.11", "target": ["mgr", ""]}]: dispatch
2019-11-01 20:21:23.661 7fa46394b700 0 mgr[devicehealth] handle_command
2019-11-01 20:21:23.663 7fa46394b700 -1 *** Caught signal (Segmentation fault) **
in thread 7fa46394b700 thread_name:mgr-fin
ceph version 14.2.4 (75f4de193b3ea58512f204623e6c5a16e6c1e1ba) nautilus (stable)
1: (()+0xf5f0) [0x7fa488cee5f0]
2: (PyEval_EvalFrameEx()+0x1a9) [0x7fa48aeb50f9]
3: (PyEval_EvalFrameEx()+0x67bd) [0x7fa48aebb70d]
4: (PyEval_EvalFrameEx()+0x67bd) [0x7fa48aebb70d]
5: (PyEval_EvalFrameEx()+0x67bd) [0x7fa48aebb70d]
6: (PyEval_EvalCodeEx()+0x7ed) [0x7fa48aebe08d]
7: (()+0x709c8) [0x7fa48ae479c8]
8: (PyObject_Call()+0x43) [0x7fa48ae22ab3]
9: (()+0x5aaa5) [0x7fa48ae31aa5]
10: (PyObject_Call()+0x43) [0x7fa48ae22ab3]
11: (()+0x4bb95) [0x7fa48ae22b95]
12: (PyObject_CallMethod()+0xbb) [0x7fa48ae22ecb]
13: (ActivePyModule::handle_command(std::map<std::string, boost::variant<std::string, bool, long, double, std::vector<std::string, std::allocator<std::string> >, std::vector<long, std::allocator<long> >, std::vector<double, std::allocator<double> > >, std::less<void>, std::allocator<std::pair<std::string const, boost::variant<std::string, bool, long, double, std::vector<std::string, std::allocator<std::string> >, std::vector<long, std::allocator<long> >, std::vector<double, std::allocator<double> > > > > > const&, ceph::buffer::v14_2_0::list const&, std::basic_stringstream<char, std::char_traits<char>, std::allocator<char> >*, std::basic_stringstream<char, std::char_traits<char>, std::allocator<char> >*)+0x20e) [0x55c3c1fefc5e]
14: (()+0x16c23d) [0x55c3c204023d]
15: (FunctionContext::finish(int)+0x2c) [0x55c3c2001eac]
16: (Context::complete(int)+0x9) [0x55c3c1ffe659]
17: (Finisher::finisher_thread_entry()+0x156) [0x7fa48b439cc6]
18: (()+0x7e65) [0x7fa488ce6e65]
19: (clone()+0x6d) [0x7fa48799488d]
NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
------------------------------
I have issued:
ceph device monitoring off
for now and will wait to see whether the mgrs go silent again. If this issue is already known, or if there are better ideas, please let me know.
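In case it helps anybody, kicking an already-silent mgr can be done like this (the mgr name is a placeholder, and the systemd unit name may differ per setup):

ceph mgr fail <mgr-name>                     # hand over to a standby mgr
systemctl restart ceph-mgr@$(hostname -s)    # or restart the daemon on its host
ceph device monitoring on                    # re-enables the metrics later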
Cheers,
Oliver
Today I tried enabling RGW compression on a Nautilus 14.2.4 test cluster and found it wasn't doing any compression at all. I figure I must have missed something in the docs, but I haven't been able to find out what that is and could use some help.
This is the command I used to enable zlib-based compression:
# radosgw-admin zone placement modify --rgw-zone=default --placement-id=default-placement --compression=zlib
I then restarted the radosgw process to activate the change (there's only 1 RGW in this test cluster):
# systemctl restart ceph-radosgw@radosgw.$(hostname -s)
I verified it was enabled correctly with:
# radosgw-admin zone get | jq -r '.placement_pools'
[
{
"key": "default-placement",
"val": {
"index_pool": "default.rgw.buckets.index",
"storage_classes": {
"STANDARD": {
"data_pool": "default.rgw.buckets.data",
"compression_type": "zlib"
}
},
"data_extra_pool": "default.rgw.buckets.non-ec",
"index_type": 0
}
}
]
Before starting the test I had nothing in the default.rgw.buckets.data pool:
# ceph df | grep default.rgw.buckets.data
default.rgw.buckets.data 16 0 B 0 0 B 0 230 TiB
I then tried uploading a 1 GiB file consisting of all 0s from /dev/zero with both S3 (boto) and Swift (swiftclient), and each time it consumed a full 1 GiB of data on the cluster:
# ceph df -f json | jq -r '.' | grep -A9 default.rgw.buckets.data
"name": "default.rgw.buckets.data",
"id": 16,
"stats": {
"stored": 1073741824,
"objects": 256,
"kb_used": 1048576,
"bytes_used": 1073741824,
"percent_used": 1.4138463484414387e-06,
"max_avail": 253148744646656
}
The same thing was reported by bucket stats:
# radosgw-admin bucket stats --bucket=bs-test | jq -r '.usage'
{
"rgw.main": {
"size": 1073741824,
"size_actual": 1073741824,
"size_utilized": 1073741824,
"size_kb": 1048576,
"size_kb_actual": 1048576,
"size_kb_utilized": 1048576,
"num_objects": 1
}
}
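If it matters: I would expect the pool-level compression counters to show something once compression kicks in. Assuming the Nautilus "ceph df detail" output is the right place to look (the USED COMPR / UNDER COMPR columns), this is the check I would use:

# non-zero USED COMPR / UNDER COMPR would indicate compression is active
ceph df detail | grep -E 'POOL|default.rgw.buckets.data'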
What am I missing?
Thanks,
Bryan
Dear Cephers,
I went through some of the OSD logs of our 14.2.4 nodes and found this:
----------------------------------
Nov 01 01:22:25 sudo[1087697]: ceph : TTY=unknown ; PWD=/ ; USER=root ; COMMAND=/sbin/smartctl -a --json /dev/sds
Nov 01 01:22:51 sudo[1087729]: pam_unix(sudo:auth): conversation failed
Nov 01 01:22:51 sudo[1087729]: pam_unix(sudo:auth): auth could not identify password for [ceph]
Nov 01 01:22:51 sudo[1087729]: pam_succeed_if(sudo:auth): requirement "uid >= 1000" not met by user "ceph"
Nov 01 01:22:53 sudo[1087729]: ceph : command not allowed ; TTY=unknown ; PWD=/ ; USER=root ; COMMAND=nvme lvm smart-log-add --json /dev/sds
----------------------------------
It seems that with device health metrics enabled, the OSDs try to run smartctl via "sudo", which, as expected, fails, since the ceph user (a system user) has a UID smaller than 1000.
Also, it is of course not listed in /etc/sudoers.
Does somebody have a working setup with device health metrics that could be shared (and documented, or made part of future packaging ;-) )?
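My untested guess, based on the commands in the log above, is that a sudoers drop-in along these lines (plus relaxing the PAM uid requirement, if it still bites) might do it; the nvme binary path is an assumption:

# /etc/sudoers.d/ceph-smart -- untested sketch
# allow the ceph system user to run the probes the devicehealth module issues
ceph ALL=(root) NOPASSWD: /sbin/smartctl -a --json /dev/*
ceph ALL=(root) NOPASSWD: /usr/sbin/nvme *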
Cheers,
Oliver
Hi,
I activated the balancer in order to even out the data distribution:
root@ld3955:~# ceph balancer status
{
"active": true,
"plans": [],
"mode": "upmap"
}
However, the data stored on the 1.6 TB HDDs in the pool "hdb_backup" is not balanced; usage ranges from
osd.265 size: 1.6 usage: 52.83 reweight: 1.00000
to
osd.145 size: 1.6 usage: 80.19 reweight: 1.00000
The affected drives are located on 4 nodes.
As a result, not all of the available disk space is actually usable.
I have attached pastebin <https://pastebin.com/dNyEwNR0> with
- ceph osd df sorted by usage
- ceph osd df tree
My cluster has multiple CRUSH roots representing different disk types.
In addition, I have defined multiple pools, one pool for each disk type: hdd, ssd, nvme.
I need the distribution to become more even on a per-pool basis.
Please advise how to run the balancer so that it corrects the data distribution.
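From the documentation, the manual plan workflow looks like the following; I am unsure whether this is needed while automatic mode is active, or whether it respects per-pool boundaries on a multi-root CRUSH map:

ceph balancer eval                  # score the current distribution
ceph balancer optimize myplan       # "myplan" is an arbitrary plan name
ceph balancer eval myplan           # score the distribution the plan would produce
ceph balancer execute myplan        # apply the plan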
THX
Hi all,
We are experiencing some sporadic 502's on our traefik/rgw setup. It's
similar to the issue described here:
https://github.com/containous/traefik/issues/3237
The solution seems to be to disable keep-alive in the traefik and rgw
configurations.
We found the option for civetweb (enable_keep_alive=no) but didn't
find the equivalent for beast.
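For reference, this is the civetweb variant we found (the rgw instance name and port here are examples):

# ceph.conf -- disables HTTP keep-alive on the civetweb frontend
[client.rgw.gateway1]
rgw frontends = civetweb port=7480 enable_keep_alive=no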
Any ideas?
Thanks!
Dan
-----Original message-----
From: Yan, Zheng <ukernel(a)gmail.com>
Sent: Thu 07-11-2019 07:21
Subject: Re: [ceph-users] Re: mds crash loop
To: Karsten Nielsen <karsten(a)foo-bar.dk>;
CC: ceph-users(a)ceph.io;
> On Thu, Nov 7, 2019 at 5:50 AM Karsten Nielsen <karsten(a)foo-bar.dk> wrote:
> >
> > -----Original message-----
> > From: Yan, Zheng <ukernel(a)gmail.com>
> > Sent: Wed 06-11-2019 14:16
> > Subject: Re: [ceph-users] mds crash loop
> > To: Karsten Nielsen <karsten(a)foo-bar.dk>;
> > CC: ceph-users(a)ceph.io;
> > > On Wed, Nov 6, 2019 at 4:42 PM Karsten Nielsen <karsten(a)foo-bar.dk> wrote:
> > > >
> > > > -----Original message-----
> > > > From: Yan, Zheng <ukernel(a)gmail.com>
> > > > Sent: Wed 06-11-2019 08:15
> > > > Subject: Re: [ceph-users] mds crash loop
> > > > To: Karsten Nielsen <karsten(a)foo-bar.dk>;
> > > > CC: ceph-users(a)ceph.io;
> > > > > On Tue, Nov 5, 2019 at 5:29 PM Karsten Nielsen <karsten(a)foo-bar.dk> wrote:
> > > > > >
> > > > > > Hi,
> > > > > >
> > > > > > Last week I upgraded my ceph cluster from luminous to mimic 13.2.6.
> > > > > > It was running fine for a while, but yesterday my mds went into a crash loop.
> > > > > >
> > > > > > I have 1 active and 1 standby mds for my cephfs, both of which are running the same crash loop.
> > > > > > I am running ceph based on https://hub.docker.com/r/ceph/daemon version v3.2.7-stable-3.2-mimic-centos-7-x86_64 with an etcd kv store.
> > > > > >
> > > > > > Log details are: https://paste.debian.net/1113943/
> > > > > >
> > > > >
> > > > > please try again with debug_mds=20. Thanks
> > > > >
> > > > > Yan, Zheng
> > > >
> > > > Yes, I have set that and had to move to pastebin.com, as paste.debian.net apparently only supports 150k.
> > > >
> > > > https://pastebin.com/Gv7c5h54
> > > >
> > >
> > > Looks like the on-disk root inode is corrupted. Have you encountered
> > > anything unusual during the upgrade?
> > >
> > > Please run 'rados -p <cephfs metadata pool> stat 1.00000000.inode' and
> > > check whether the object was modified before or after the 'luminous ->
> > > 13.2.6' upgrade.
> > > To fix the corrupted object, run 'cephfs-data-scan init
> > > --force-init', then restart the mds. After the mds becomes active, run
> > > 'ceph daemon mds.x scrub_path / force repair'.
> > >
> >
> > I followed the steps and got the mds started, but now a lot of files (24283) are in lost+found, and I have these errors in the mds log:
> >
> 'cephfs-data-scan init --force-init' does not move files into
> lost+found. Have you ever run any other 'cephfs-data-scan' or
> 'cephfs-journal-tool' command?
I have had a similar problem with this cluster before, where I went through the cycle described in:
https://docs.ceph.com/docs/mimic/cephfs/disaster-recovery/ -> "Using an alternate metadata pool for recovery"
I did run the 'cephfs-journal-tool journal reset' command, mostly because cephfs is not heavily utilized here and the cluster had not been used much after the upgrade, so I thought it was safe and any data loss would be minimal. Apparently I was wrong.
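(For the record, the recovery sequence I followed, per the suggestion quoted above, was roughly the following; the metadata pool placeholder is as in the quote, and the mds name is from my setup:)

rados -p <cephfs metadata pool> stat 1.00000000.inode   # inspect the root inode object
cephfs-data-scan init --force-init                      # re-initialize the corrupted root inode
# restart the mds; once it becomes active again:
ceph daemon mds.k8s-node-01 scrub_path / force repair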
>
> > 2019-11-06 20:20:18.215 7f0bd9090700 1 mds.0.32011 cluster recovered.
> > 2019-11-06 20:20:19.019 7f0bd2dfa700 0 mds.0.cache.dir(0x100013acfcb) _fetched missing object for [dir 0x100013acfcb /nextcloud/custom_apps/carnet/ [2,head] auth v=0 cv=0/0 ap=1+0+0 state=1073741888|fetching f() n() hs=0+0,ss=0+0 | waiter=1 authpin=1 0x55d4dc4f5100]
> > 2019-11-06 20:20:19.019 7f0bd2dfa700 -1 log_channel(cluster) log [ERR] : dir 0x100013acfcb object missing on disk; some files may be lost (/nextcloud/custom_apps/carnet)
> > 2019-11-06 20:20:19.275 7f0bd2dfa700 0 mds.0.cache.dir(0x100013a3156) _fetched missing object for [dir 0x100013a3156 /nextcloud/custom_apps/mail/ [2,head] auth v=0 cv=0/0 ap=1+0+0 state=1073741888|fetching f() n() hs=0+0,ss=0+0 | waiter=1 authpin=1 0x55d4dcc40000]
> > 2019-11-06 20:20:19.275 7f0bd2dfa700 -1 log_channel(cluster) log [ERR] : dir 0x100013a3156 object missing on disk; some files may be lost (/nextcloud/custom_apps/mail)
> > 2019-11-06 20:20:19.371 7f0bd2dfa700 0 mds.0.cache.dir(0x100013abb3c) _fetched missing object for [dir 0x100013abb3c /nextcloud/custom_apps/passwords/ [2,head] auth v=0 cv=0/0 ap=1+0+0 state=1073741888|fetching f() n() hs=0+0,ss=0+0 | waiter=1 authpin=1 0x55d4dcc40700]
> > 2019-11-06 20:20:19.371 7f0bd2dfa700 -1 log_channel(cluster) log [ERR] : dir 0x100013abb3c object missing on disk; some files may be lost (/nextcloud/custom_apps/passwords)
> > 2019-11-06 20:20:19.383 7f0bd2dfa700 0 mds.0.cache.dir(0x100013a9b9b) _fetched missing object for [dir 0x100013a9b9b /nextcloud/custom_apps/phonetrack/ [2,head] auth v=0 cv=0/0 ap=1+0+0 state=1073741888|fetching f() n() hs=0+0,ss=0+0 | waiter=1 authpin=1 0x55d4dcc40e00]
> > 2019-11-06 20:20:19.383 7f0bd2dfa700 -1 log_channel(cluster) log [ERR] : dir 0x100013a9b9b object missing on disk; some files may be lost (/nextcloud/custom_apps/phonetrack)
> > 2019-11-06 20:20:19.431 7f0bd2dfa700 0 mds.0.cache.dir(0x100013a2659) _fetched missing object for [dir 0x100013a2659 /nextcloud/custom_apps/richdocuments/ [2,head] auth v=0 cv=0/0 ap=1+0+0 state=1073741888|fetching f() n() hs=0+0,ss=0+0 | waiter=1 authpin=1 0x55d4dcc41500]
> > 2019-11-06 20:20:19.431 7f0bd2dfa700 -1 log_channel(cluster) log [ERR] : dir 0x100013a2659 object missing on disk; some files may be lost (/nextcloud/custom_apps/richdocuments)
> > 2019-11-06 20:20:22.360 7f0bd9090700 1 mds.k8s-node-01 Updating MDS map to version 32015 from mon.1
> >
> >
> > >
> > > > - Karsten
> > > >
> > > > >
> > > > > > Thanks for any hints
> > > > > > - Karsten
In Nautilus (Ubuntu Cloud Archive Train version), the osd caps profile rbd-read-only seems broken.
It is impossible to map an RBD if the user has the following caps:
[client.yyy]
key = AQBYL8NdHDpnERAAhk8XOKgFNwhUpCo3EMaW3g==
caps mgr = "profile rbd"
caps mon = "profile rbd"
caps osd = "profile rbd-read-only"
(also tried with 'caps osd = "profile rbd-read-only pool=rbd"')
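(For completeness, the caps were applied like this; the user name is just an example:)

ceph auth caps client.yyy \
    mon 'profile rbd' \
    mgr 'profile rbd' \
    osd 'profile rbd-read-only pool=rbd'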
This is what I do and what I get back:
root@maas:~# sudo rbd map rbd/ltsp-01 --read-only --name client.yyy -m 10.101.0.24 --keyfile /etc/ceph/ceph.client.yyy.key
rbd: sysfs write failed
In some cases useful info is found in syslog - try "dmesg | tail".
rbd: map failed: (1) Operation not permitted
I can see no reason in dmesg:
root@maas:~# dmesg | tail
[ 4442.395179] rbd0: p1
[ 4442.396140] rbd: rbd0: capacity 21474836480 features 0x5
[ 4900.499466] libceph: mon0 (1)10.101.0.24:3300 socket closed (con state CONNECTING)
[ 4900.884945] libceph: mon0 (1)10.101.0.24:3300 socket closed (con state CONNECTING)
[ 4901.876882] libceph: mon0 (1)10.101.0.24:3300 socket closed (con state CONNECTING)
[ 4903.892861] libceph: mon0 (1)10.101.0.24:3300 socket closed (con state CONNECTING)
[ 4903.959079] libceph: mon1 (1)10.101.0.24:6789 session established
[ 4903.959624] libceph: client15954 fsid afdda66a-000c-11ea-824b-00163e31d42c
[ 4923.847328] libceph: mon1 (1)10.101.0.24:6789 session established
[ 4923.847749] libceph: client15969 fsid afdda66a-000c-11ea-824b-00163e31d42c
Using /sys/bus/rbd/add_single_major directly does not work either:
root@maas:~# echo "10.101.0.24 name=client.yyy,secret=AQBYL8NdHDpnERAAhk8XOKgFNwhUpCo3EMaW3g== rbd ltsp-01" | sudo tee /sys/bus/rbd/add_single_major
10.101.0.24 name=client.yyy,secret=AQBYL8NdHDpnERAAhk8XOKgFNwhUpCo3EMaW3g== rbd ltsp-01
tee: /sys/bus/rbd/add_single_major: Operation not permitted
Using a client with 'caps osd = "profile rbd"' does work without any problems.
Client system information:
Ubuntu 18.04.3
ceph version 14.2.2 (4f8fa0a0024755aae7d95567c63f11d6862d55be) nautilus (stable)
Kernel: 5.3.0-19-generic (but also tested with 5.0.0-32-generic and
4.15.0-66-generic)
The cluster was built with MAAS and JUJU from openstack-charmers stable charms:
Ubuntu 18.04.3
ceph version 14.2.2 (4f8fa0a0024755aae7d95567c63f11d6862d55be) nautilus (stable)
Kernel: 4.15.0-66-generic
I wanted to file this as a bug report, but my account (trickkiste) has
not been approved yet.
Probably due to this bug according to jdillaman on IRC:
https://tracker.ceph.com/issues/42667
Regards,
Markus