November 2020 - ceph-users

by Ramanathan S

Hi all, I just had created a ceph cluster to use cephfs. When i create the a ceph fs pool i get the filesystem below error. # ceph osd pool create cephfs_data 128 pool 'cephfs_data' created # ceph osd pool create cephfs_metadata 128 pool 'cephfs_metadata' created # ceph fs new cephfs cephfs_metadata cephfs_data new fs with metadata pool 6 and data pool 5 # ceph -s cluster: id: 1c27def45-f0f9-494d-sfke-eb4323432fd health: HEALTH_ERR 1 filesystem is offline 1 filesystem is online with fewer MDS than max_mds services: mon: 2 daemons, quorum ceph-mon01,ceph-mon02 mgr: ceph-adm01(active) mds: cephfs-0/0/1 up osd: 12 osds: 12 up, 12 in data: pools: 2 pools, 256 pgs objects: 0 objects, 0 B usage: 12 GiB used, 588 GiB / 600 GiB avail pgs: 256 active+clean but when i check the max_mds for the ceph fs it says 1 # ceph fs get cephfs | grep max_mds max_mds 1 Let anyone know what am i missing here? Any inputs is much appreciated. Regards, Ram Ceph-explorer..

3 weeks, 3 days

3
3
0 0

Re: NoSuchKey on key that is visible in s3 list/radosgw bk

by Eric Ivancich

I have some questions for those who’ve experienced this issue. 1. It seems like those reporting this issue are seeing it strictly after upgrading to Octopus. From what version did each of these sites upgrade to Octopus? From Nautilus? Mimic? Luminous? 2. Does anyone have any lifecycle rules on a bucket experiencing this issue? If so, please describe. 3. Is anyone making copies of the affected objects (to same or to a different bucket) prior to seeing the issue? And if they are making copies, does the destination bucket have lifecycle rules? And if they are making copies, are those copies ever being removed? 4. Is anyone experiencing this issue willing to run their RGWs with 'debug_ms=1'? That would allow us to see a request from an RGW to either remove a tail object or decrement its reference counter (and when its counter reaches 0 it will be deleted). Thanks, Eric > On Nov 12, 2020, at 4:54 PM, huxiaoyu(a)horebdata.cn wrote: > > Looks like this is a very dangerous bug for data safety. Hope the bug would be quickly identified and fixed. > > best regards, > > Samuel > > > > huxiaoyu(a)horebdata.cn <mailto:huxiaoyu@horebdata.cn> > > From: Janek Bevendorff > Date: 2020-11-12 18:17 > To: huxiaoyu(a)horebdata.cn <mailto:huxiaoyu@horebdata.cn>; EDH - Manuel Rios; Rafael Lopez > CC: Robin H. Johnson; ceph-users > Subject: Re: [ceph-users] Re: NoSuchKey on key that is visible in s3 list/radosgw bk > I have never seen this on Luminous. I recently upgraded to Octopus and the issue started occurring only few weeks later. > > On 12/11/2020 16:37, huxiaoyu(a)horebdata.cn wrote: > which Ceph versions are affected by this RGW bug/issues? Luminous, Mimic, Octupos, or the latest? > > any idea? > > samuel > > > > huxiaoyu(a)horebdata.cn > > From: EDH - Manuel Rios > Date: 2020-11-12 14:27 > To: Janek Bevendorff; Rafael Lopez > CC: Robin H. Johnson; ceph-users > Subject: [ceph-users] Re: NoSuchKey on key that is visible in s3 list/radosgw bk > This same error caused us to wipe a full cluster of 300TB... will be related to some rados index/database bug not to s3. > > As Janek exposed is a mayor issue, because the error silent happend and you can only detect it with S3, when you're going to delete/purge a S3 bucket. Dropping NoSuchKey. Error is not related to S3 logic .. > > Hope this time dev's can take enought time to find and resolve the issue. Error happens with low ec profiles, even with replica x3 in some cases. > > Regards > > > > -----Mensaje original----- > De: Janek Bevendorff <janek.bevendorff(a)uni-weimar.de <mailto:janek.bevendorff@uni-weimar.de>> > Enviado el: jueves, 12 de noviembre de 2020 14:06 > Para: Rafael Lopez <rafael.lopez(a)monash.edu <mailto:rafael.lopez@monash.edu>> > CC: Robin H. Johnson <robbat2(a)gentoo.org <mailto:robbat2@gentoo.org>>; ceph-users <ceph-users(a)ceph.io <mailto:ceph-users@ceph.io>> > Asunto: [ceph-users] Re: NoSuchKey on key that is visible in s3 list/radosgw bk > > Here is a bug report concerning (probably) this exact issue: > https://tracker.ceph.com/issues/47866 <https://tracker.ceph.com/issues/47866> > > I left a comment describing the situation and my (limited) experiences with it. > > > On 11/11/2020 10:04, Janek Bevendorff wrote: >> >> Yeah, that seems to be it. There are 239 objects prefixed >> .8naRUHSG2zfgjqmwLnTPvvY1m6DZsgh in my dump. However, there are none >> of the multiparts from the other file to be found and the head object >> is 0 bytes. >> >> I checked another multipart object with an end pointer of 11. >> Surprisingly, it had way more than 11 parts (39 to be precise) named >> .1, .1_1 .1_2, .1_3, etc. Not sure how Ceph identifies those, but I >> could find them in the dump at least. >> >> I have no idea why the objects disappeared. I ran a Spark job over all >> buckets, read 1 byte of every object and recorded errors. Of the 78 >> buckets, two are missing objects. One bucket is missing one object, >> the other 15. So, luckily, the incidence is still quite low, but the >> problem seems to be expanding slowly. >> >> >> On 10/11/2020 23:46, Rafael Lopez wrote: >>> Hi Janek, >>> >>> What you said sounds right - an S3 single part obj won't have an S3 >>> multipart string as part of the prefix. S3 multipart string looks >>> like "2~m5Y42lPMIeis5qgJAZJfuNnzOKd7lme". >>> >>> From memory, single part S3 objects that don't fit in a single rados >>> object are assigned a random prefix that has nothing to do with >>> the object name, and the rados tail/data objects (not the head >>> object) have that prefix. >>> As per your working example, the prefix for that would be >>> '.8naRUHSG2zfgjqmwLnTPvvY1m6DZsgh'. So there would be (239) "shadow" >>> objects with names containing that prefix, and if you add up the >>> sizes it should be the size of your S3 object. >>> >>> You should look at working and non working examples of both single >>> and multipart S3 objects, as they are probably all a bit different >>> when you look in rados. >>> >>> I agree it is a serious issue, because once objects are no longer in >>> rados, they cannot be recovered. If it was a case that there was a >>> link broken or rados objects renamed, then we could work to >>> recover...but as far as I can tell, it looks like stuff is just >>> vanishing from rados. The only explanation I can think of is some >>> (rgw or rados) background process is incorrectly doing something with >>> these objects (eg. renaming/deleting). I had thought perhaps it was a >>> bug with the rgw garbage collector..but that is pure speculation. >>> >>> Once you can articulate the problem, I'd recommend logging a bug >>> tracker upstream. >>> >>> >>> On Wed, 11 Nov 2020 at 06:33, Janek Bevendorff >>> <janek.bevendorff(a)uni-weimar.de <mailto:janek.bevendorff@uni-weimar.de> >>> <mailto:janek.bevendorff@uni-weimar.de <mailto:janek.bevendorff@uni-weimar.de>>> wrote: >>> >>> Here's something else I noticed: when I stat objects that work >>> via radosgw-admin, the stat info contains a "begin_iter" JSON >>> object with RADOS key info like this >>> >>> >>> "key": { >>> "name": >>> "29/items/WIDE-20110924034843-crawl420/WIDE-20110924065228-02544.warc.gz", >>> "instance": "", >>> "ns": "" >>> } >>> >>> >>> and then "end_iter" with key info like this: >>> >>> >>> "key": { >>> "name": >>> ".8naRUHSG2zfgjqmwLnTPvvY1m6DZsgh_239", >>> "instance": "", >>> "ns": "shadow" >>> } >>> >>> However, when I check the broken 0-byte object, the "begin_iter" >>> and "end_iter" keys look like this: >>> >>> >>> "key": { >>> "name": >>> "29/items/WIDE-20110903143858-crawl428/WIDE-20110903143858-01166.warc.gz.2~m5Y42lPMIeis5qgJAZJfuNnzOKd7lme.1", >>> "instance": "", >>> "ns": "multipart" >>> } >>> >>> [...] >>> >>> >>> "key": { >>> "name": >>> "29/items/WIDE-20110903143858-crawl428/WIDE-20110903143858-01166.warc.gz.2~m5Y42lPMIeis5qgJAZJfuNnzOKd7lme.19", >>> "instance": "", >>> "ns": "multipart" >>> } >>> >>> So, it's the full name plus a suffix and the namespace is >>> multipart, not shadow (or empty). This in itself may just be an >>> artefact of whether the object was uploaded in one go or as a >>> multipart object, but the second difference is that I cannot find >>> any of the multipart objects in my pool's object name dump. I >>> can, however, find the shadow RADOS object of the intact S3 object. >>> >>> >>> >>> >>> -- >>> *Rafael Lopez* >>> Devops Systems Engineer >>> Monash University eResearch Centre >>> >>> T: +61 3 9905 9118 <tel:%2B61%203%209905%209118> >>> E: rafael.lopez(a)monash.edu <mailto:rafael.lopez@monash.edu> >>> > _______________________________________________ > ceph-users mailing list -- ceph-users(a)ceph.io > To unsubscribe send an email to ceph-users-leave(a)ceph.io > _______________________________________________ > ceph-users mailing list -- ceph-users(a)ceph.io > To unsubscribe send an email to ceph-users-leave(a)ceph.io > _______________________________________________ > ceph-users mailing list -- ceph-users(a)ceph.io > To unsubscribe send an email to ceph-users-leave(a)ceph.io

3 weeks, 3 days

6
22
0 0

kernel client osdc ops stuck and mds slow reqs

by Dan van der Ster

Hi all, We are quite regularly (a couple times per week) seeing: HEALTH_WARN 1 clients failing to respond to capability release; 1 MDSs report slow requests MDS_CLIENT_LATE_RELEASE 1 clients failing to respond to capability release mdshpc-be143(mds.0): Client hpc-be028.cern.ch: failing to respond to capability release client_id: 52919162 MDS_SLOW_REQUEST 1 MDSs report slow requests mdshpc-be143(mds.0): 1 slow requests are blocked > 30 secs Which is being caused by osdc ops stuck in a kernel client, e.g.: 10:57:18 root hpc-be028 /root → cat /sys/kernel/debug/ceph/4da6fd06-b069-49af-901f-c9513baabdbd.client52919162/osdc REQUESTS 9 homeless 0 46559317 osd243 3.ee6ffcdb 3.cdb [243,501,92]/243 [243,501,92]/243 e678697 fsvolumens_355f485c-6319-4ffe-acd6-94a07f2a14b4/10003f09a01.00000057 0x400014 1 read 46559322 osd243 3.ee6ffcdb 3.cdb [243,501,92]/243 [243,501,92]/243 e678697 fsvolumens_355f485c-6319-4ffe-acd6-94a07f2a14b4/10003f09a01.00000057 0x400014 1 read 46559323 osd243 3.969cc573 3.573 [243,330,226]/243 [243,330,226]/243 e678697 fsvolumens_355f485c-6319-4ffe-acd6-94a07f2a14b4/10003f09a56.00000056 0x400014 1 read 46559341 osd243 3.969cc573 3.573 [243,330,226]/243 [243,330,226]/243 e678697 fsvolumens_355f485c-6319-4ffe-acd6-94a07f2a14b4/10003f09a56.00000056 0x400014 1 read 46559342 osd243 3.969cc573 3.573 [243,330,226]/243 [243,330,226]/243 e678697 fsvolumens_355f485c-6319-4ffe-acd6-94a07f2a14b4/10003f09a56.00000056 0x400014 1 read 46559345 osd243 3.969cc573 3.573 [243,330,226]/243 [243,330,226]/243 e678697 fsvolumens_355f485c-6319-4ffe-acd6-94a07f2a14b4/10003f09a56.00000056 0x400014 1 read 46559621 osd243 3.6313e8ef 3.8ef [243,330,521]/243 [243,330,521]/243 e678697 fsvolumens_355f485c-6319-4ffe-acd6-94a07f2a14b4/10003f09a45.0000007a 0x400014 1 read 46559629 osd243 3.b280c852 3.852 [243,113,539]/243 [243,113,539]/243 e678697 fsvolumens_355f485c-6319-4ffe-acd6-94a07f2a14b4/10003f09a3a.0000007f 0x400014 1 read 46559928 osd243 3.1ee7bab4 3.ab4 [243,332,94]/243 [243,332,94]/243 e678697 fsvolumens_355f485c-6319-4ffe-acd6-94a07f2a14b4/10003f099ff.0000073f 0x400024 1 write LINGER REQUESTS BACKOFFS We can unblock those requests by doing `ceph osd down osd.243` (or restarting osd.243). This is ceph v14.2.6 and the client kernel is el7 3.10.0-957.27.2.el7.x86_64. Are there a better way to debug this? Best Regards, Dan

1 year, 2 months

4
12
0 0

cephfs forward scrubbing docs

by Dan van der Ster

Hi, Today while debugging something we had a few questions that might lead to improving the cephfs forward scrub docs: https://docs.ceph.com/en/latest/cephfs/scrub/ tldr: 1. Should we document which sorts of issues that the forward scrub is able to fix? 2. Can we make it more visible (in docs) that scrubbing is not supported with multi-mds? 3. Isn't the new `ceph -s` scrub task status misleading with multi-mds? Details here: 1) We found a CephFS directory with a number of zero sized files: # ls -l ... -rw-r--r-- 1 1001890000 1001890000 0 Nov 3 11:58 upload_fc501199e3e7abe6b574101cf34aeefb.png -rw-r--r-- 1 1001890000 1001890000 0 Nov 3 12:23 upload_fce4f55348185fefa0abdd8d11095ba8.gif -rw-r--r-- 1 1001890000 1001890000 0 Nov 3 11:54 upload_fd95b8358851f0dac22fb775046a6163.png ... The user claims that those files were non-zero sized last week. The sequence of zero sized files includes *all* files written between Nov 2 and 9. The user claims that his client was running out of memory, but this is now fixed. So I suspect that his ceph client (kernel 3.10.0-1127.19.1.el7.x86_64) was not behaving well. Anyway, I noticed that even though the dentries list 0 bytes, the underlying rados objects have data, and the data looks good. E.g. # rados get -p cephfs_data 200212e68b5.00000000 --namespace=xxx 200212e68b5.00000000 # file 200212e68b5.00000000 200212e68b5.00000000: PNG image data, 960 x 815, 8-bit/color RGBA, non-interlaced So I managed to recover the files doing something like this (using an input file mapping inode to filename) [see PS 0]. But I'm wondering if a forward scrub is able to fix this sort of problem directly? Should we document which sorts of issues that the forward scrub is able to fix? I anyway tried to scrub it, which led to: # ceph tell mds.cephflax-mds-xxx scrub start /volumes/_nogroup/xxx recursive repair Scrub is not currently supported for multiple active MDS. Please reduce max_mds to 1 and then scrub. So ... 2) Shouldn't we update the doc to mention loud and clear that scrub is not currently supported for multiple active MDS? 3) I was somehow surprised by this, because I had thought that the new `ceph -s` multi-mds scrub status implied that multi-mds scrubbing was now working: task status: scrub status: mds.x: idle mds.y: idle mds.z: idle Is it worth reporting this task status for cephfs if we can't even scrub them? Thanks!! Dan [0] mkdir -p recovered while read -r a b; do for i in {0..9} do echo "rados stat --cluster=flax --pool=cephfs_data --namespace=xxx" $(printf "%x" $a).0000000$i "&&" "rados get --cluster=flax --pool=cephfs_data --namespace=xxx" $(printf "%x" $a).0000000$i $(printf "%x" $a).0000000$i done echo cat $(printf "%x" $a).* ">" $(printf "%x" $a) echo mv $(printf "%x" $a) recovered/$b done < inones_fnames.txt

2 years, 10 months

2
2
0 0

does ceph rgw has any option to limit bandwidth

by Zhenshi Zhou

Hi, Is there any option of rados gateway that limit bandwidth?

2 years, 10 months

5
6
0 0

Zabbix module Octopus 15.2.3

by Gert Wieberdink

Trying to configure Zabbix module in Octopus 15.2.3. CentOS 8.1 environment. Installed zabbix40-agent for CentOS 8.1 (from epel repository). This will also install zabbix_sender. After enabling the Zabbix module in Ceph, I configured my Zabbix host and Zabbix identifier. # ceph zabbix config-set zabbix_host <zabbix-fqdn> # ceph zabbix config-set zabbix_identifier <ident> # ceph zabbix config-show Error EINVAL: Traceback (most recent call last): File "/usr/share/ceph/mgr/mgr_module.py", line 1153, in _handle_command return self.handle_command(inbuf, cmd) File "/usr/share/ceph/mgr/zabbix/module.py", line 407, in handle_command return 0, json.dumps(self.config, index=4, sort_keys=True), '' File "/lib64/python3.6/json/__init__.py", line 238, in dumps **kw).encode(obj) TypeError: __init__() got an unexpected keyword argument 'index' # ceph -v ceph version 15.2.3 (d289bbdec69ed7c1f516e0a093594580a76b78d0) octopus (stable) # ceph health detail HEALTH_OK Anyone found a solution? rgds, -gw

2 years, 11 months

5
4
0 0

fixing future rctimes

by Dan van der Ster

Hi all, We have a few subdirs with an rctime in the future. # getfattr -n ceph.dir.rctime session # file: session ceph.dir.rctime="2576387188.090" I can't find any subdir or item in that directory with that rctime, so I presume that there was previously a file and that rctime cannot go backwards [1] Is there any way to fix these rctimes so they show the latest ctime of the subtree? Also -- are we still relying on the client clock to set the rctime / ctime of a file? Would it make sense to limit ctime/rctime for any update to the current time on the MDS ? Best Regards, Dan [1] https://github.com/ceph/ceph/pull/24023/commits/920ef964311a61fcc6c0d6671b7…

3 years, 1 month

2
3
0 0

Remapped PGs

by David Orman

Hi, We see that we have 5 'remapped' PGs, but are unclear why/what to do about it. We shifted some target ratios for the autobalancer and it resulted in this state. When adjusting ratio, we noticed two OSDs go down, but we just restarted the container for those OSDs with podman, and they came back up. Here's status output: ################### root@ceph01:~# ceph status INFO:cephadm:Inferring fsid x INFO:cephadm:Inferring config x INFO:cephadm:Using recent ceph image docker.io/ceph/ceph:v15 cluster: id: 41bb9256-c3bf-11ea-85b9-9e07b0435492 health: HEALTH_OK services: mon: 5 daemons, quorum ceph01,ceph04,ceph02,ceph03,ceph05 (age 2w) mgr: ceph03.ytkuyr(active, since 2w), standbys: ceph01.aqkgbl, ceph02.gcglcg, ceph04.smbdew, ceph05.yropto osd: 168 osds: 168 up (since 2d), 168 in (since 2d); 5 remapped pgs data: pools: 3 pools, 1057 pgs objects: 18.00M objects, 69 TiB usage: 119 TiB used, 2.0 PiB / 2.1 PiB avail pgs: 1056 active+clean 1 active+clean+scrubbing+deep io: client: 859 KiB/s rd, 212 MiB/s wr, 644 op/s rd, 391 op/s wr root@ceph01:~# ################### When I look at ceph pg dump, I don't see any marked as remapped: ################### root@ceph01:~# ceph pg dump |grep remapped INFO:cephadm:Inferring fsid x INFO:cephadm:Inferring config x INFO:cephadm:Using recent ceph image docker.io/ceph/ceph:v15 dumped all root@ceph01:~# ################### Any idea what might be going on/how to recover? All OSDs are up. Health is 'OK'. This is Ceph 15.2.4 deployed using Cephadm in containers, on Podman 2.0.3.

3 years, 1 month

2
4
0 0

Re: mds lost very frequently

by Stefan Kooman

Hi, After setting: ceph config set mds mds_recall_max_caps 10000 (5000 before change) and ceph config set mds mds_recall_max_decay_rate 1.0 (2.5 before change) And the: ceph tell 'mds.*' injectargs '--mds_recall_max_caps 10000' ceph tell 'mds.*' injectargs '--mds_recall_max_decay_rate 1.0' our up:active MDS stopped responding and the standby-replay stepped in ... and hit an assert (same as in this thread): 2020-02-06 16:42:16.712 7ff76a528700 1 heartbeat_map reset_timeout 'MDSRank' had timed out after 15 2020-02-06 16:42:17.616 7ff76ff1b700 0 mds.beacon.mds2 MDS is no longer laggy 2020-02-06 16:42:20.348 7ff76d716700 -1 /build/ceph-13.2.8/src/mds/Locker.cc: In function 'void Locker::file_recover(ScatterLock*)' thread 7ff76d716700 time 2020-02-06 16:42:20.351124 /build/ceph-13.2.8/src/mds/Locker.cc: 5307: FAILED assert(lock->get_state() == LOCK_PRE_SCAN) ceph version 13.2.8 (5579a94fafbc1f9cc913a0f5d362953a5d9c3ae0) mimic (stable) 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x14e) [0x7ff7759939de] 2: (()+0x287b67) [0x7ff775993b67] 3: (()+0x28a9ea) [0x5585eb2b79ea] 4: (MDCache::start_files_to_recover()+0xbb) [0x5585eb1f897b] 5: (MDSRank::active_start()+0x135) [0x5585eb146be5] 6: (MDSRankDispatcher::handle_mds_map(MMDSMap*, MDSMap*)+0x4e5) [0x5585eb151ea5] 7: (MDSDaemon::handle_mds_map(MMDSMap*)+0xca8) [0x5585eb134608] 8: (MDSDaemon::handle_core_message(Message*)+0x6c) [0x5585eb138bbc] 9: (MDSDaemon::ms_dispatch(Message*)+0xbb) [0x5585eb13929b] 10: (DispatchQueue::entry()+0xb92) [0x7ff775a56e52] 11: (DispatchQueue::DispatchThread::entry()+0xd) [0x7ff775af3e2d] 12: (()+0x76db) [0x7ff7752846db] 13: (clone()+0x3f) [0x7ff77446a88f] 2020-02-06 16:42:20.348 7ff76d716700 -1 *** Caught signal (Aborted) ** in thread 7ff76d716700 thread_name:ms_dispatch ceph version 13.2.8 (5579a94fafbc1f9cc913a0f5d362953a5d9c3ae0) mimic (stable) 1: (()+0x12890) [0x7ff77528f890] 2: (gsignal()+0xc7) [0x7ff774387e97] 3: (abort()+0x141) [0x7ff774389801] 4: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x256) [0x7ff775993ae6] 5: (()+0x287b67) [0x7ff775993b67] 6: (()+0x28a9ea) [0x5585eb2b79ea] 7: (MDCache::start_files_to_recover()+0xbb) [0x5585eb1f897b] 8: (MDSRank::active_start()+0x135) [0x5585eb146be5] 9: (MDSRankDispatcher::handle_mds_map(MMDSMap*, MDSMap*)+0x4e5) [0x5585eb151ea5] 10: (MDSDaemon::handle_mds_map(MMDSMap*)+0xca8) [0x5585eb134608] 11: (MDSDaemon::handle_core_message(Message*)+0x6c) [0x5585eb138bbc] 12: (MDSDaemon::ms_dispatch(Message*)+0xbb) [0x5585eb13929b] 13: (DispatchQueue::entry()+0xb92) [0x7ff775a56e52] 14: (DispatchQueue::DispatchThread::entry()+0xd) [0x7ff775af3e2d] 15: (()+0x76db) [0x7ff7752846db] 16: (clone()+0x3f) [0x7ff77446a88f] NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this. Quoting Yan, Zheng (ukernel(a)gmail.com): > Please try below patch if you can compile ceph from source. If you > can't compile ceph or the issue still happens, please set debug_mds = > 10 for standby mds (change debug_mds to 0 after mds becomes active). > > Regards > Yan, Zheng > > diff --git a/src/mds/MDSRank.cc b/src/mds/MDSRank.cc > index 1e8b024b8a..d1150578f1 100644 > --- a/src/mds/MDSRank.cc > +++ b/src/mds/MDSRank.cc > @@ -1454,8 +1454,8 @@ void MDSRank::rejoin_done() > void MDSRank::clientreplay_start() > { > dout(1) << "clientreplay_start" << dendl; > - finish_contexts(g_ceph_context, waiting_for_replay); // kick waiters > mdcache->start_files_to_recover(); > + finish_contexts(g_ceph_context, waiting_for_replay); // kick waiters > queue_one_replay(); > } > > @@ -1487,8 +1487,8 @@ void MDSRank::active_start() > > mdcache->clean_open_file_lists(); > mdcache->export_remaining_imported_caps(); > - finish_contexts(g_ceph_context, waiting_for_replay); // kick waiters > mdcache->start_files_to_recover(); > + finish_contexts(g_ceph_context, waiting_for_replay); // kick waiters > > mdcache->reissue_all_caps(); > mdcache->activate_stray_manager(); AFAICT this patch has never been tested and never commited. Do you still think this might fix the issue? Any hints on how we might reproduce this issue: failing active mds and hitting this specific recovery scenario We will happily apply this patch and do testing to check if it really fixes the issue. Gr. Stefan P.s. For my understanding: the MDS should never stop responding by setting these parameters, right? -- | BIT BV https://www.bit.nl/ Kamer van Koophandel 09090351 | GPG: 0xD14839C6 +31 318 648 688 / info(a)bit.nl

3 years, 2 months

1
1
0 0

Default data pool in CEPH

by Gabriel Medve

Hi, I have a CEPH 15.2.4 running in a docker. How to configure for use a specific data pool? i try put the follow line in the ceph.conf but the changes not working. . [client.myclient] rbd default data pool = Mydatapool I need it to configure for erasure pool with cloudstack Can help me? , where is the ceph conf we i need configure? Thanks. -- Untitled Document

3 years, 2 months

4
6
0 0

2024

2023

2022

2021

2020

2019

ceph-users November 2020