Hi all,
I just created a Ceph cluster to use CephFS. When I create the CephFS pools and
the filesystem, I get the error below.
# ceph osd pool create cephfs_data 128
pool 'cephfs_data' created
# ceph osd pool create cephfs_metadata 128
pool 'cephfs_metadata' created
# ceph fs new cephfs cephfs_metadata cephfs_data
new fs with metadata pool 6 and data pool 5
# ceph -s
  cluster:
    id:     1c27def45-f0f9-494d-sfke-eb4323432fd
    health: HEALTH_ERR
            1 filesystem is offline
            1 filesystem is online with fewer MDS than max_mds

  services:
    mon: 2 daemons, quorum ceph-mon01,ceph-mon02
    mgr: ceph-adm01(active)
    mds: cephfs-0/0/1 up
    osd: 12 osds: 12 up, 12 in

  data:
    pools:   2 pools, 256 pgs
    objects: 0 objects, 0 B
    usage:   12 GiB used, 588 GiB / 600 GiB avail
    pgs:     256 active+clean
But when I check max_mds for the filesystem, it says 1:
# ceph fs get cephfs | grep max_mds
max_mds 1
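For completeness, I am also checking the MDS side with the standard status
commands (output omitted; the ceph -s output above already shows 0 MDS daemons up):
# ceph mds stat
# ceph fs status cephfs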
Does anyone know what I am missing here? Any input is much appreciated.
Regards,
Ram
Ceph-explorer..
I have some questions for those who’ve experienced this issue.
1. It seems like those reporting this issue are seeing it strictly after upgrading to Octopus. From what version did each of these sites upgrade to Octopus? From Nautilus? Mimic? Luminous?
2. Does anyone have any lifecycle rules on a bucket experiencing this issue? If so, please describe.
3. Is anyone making copies of the affected objects (to same or to a different bucket) prior to seeing the issue? And if they are making copies, does the destination bucket have lifecycle rules? And if they are making copies, are those copies ever being removed?
4. Is anyone experiencing this issue willing to run their RGWs with 'debug_ms=1'? That would allow us to see a request from an RGW to either remove a tail object or decrement its reference counter (and when its counter reaches 0 it will be deleted).
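For reference, one way to enable that would be the following (the daemon name is a placeholder; adjust for your deployment, and remember to set it back to 0 afterwards since the logs grow quickly):
    ceph config set client.rgw debug_ms 1
or per daemon via the admin socket:
    ceph daemon client.rgw.<name> config set debug_ms 1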
Thanks,
Eric
> On Nov 12, 2020, at 4:54 PM, huxiaoyu(a)horebdata.cn wrote:
>
> Looks like this is a very dangerous bug for data safety. Hope the bug would be quickly identified and fixed.
>
> best regards,
>
> Samuel
>
>
>
> huxiaoyu(a)horebdata.cn
>
> From: Janek Bevendorff
> Date: 2020-11-12 18:17
> To: huxiaoyu(a)horebdata.cn; EDH - Manuel Rios; Rafael Lopez
> CC: Robin H. Johnson; ceph-users
> Subject: Re: [ceph-users] Re: NoSuchKey on key that is visible in s3 list/radosgw bk
> I have never seen this on Luminous. I recently upgraded to Octopus and the issue started occurring only a few weeks later.
>
> On 12/11/2020 16:37, huxiaoyu(a)horebdata.cn wrote:
> Which Ceph versions are affected by this RGW bug/issue? Luminous, Mimic, Octopus, or the latest?
>
> any idea?
>
> samuel
>
>
>
> huxiaoyu(a)horebdata.cn
>
> From: EDH - Manuel Rios
> Date: 2020-11-12 14:27
> To: Janek Bevendorff; Rafael Lopez
> CC: Robin H. Johnson; ceph-users
> Subject: [ceph-users] Re: NoSuchKey on key that is visible in s3 list/radosgw bk
> This same error caused us to wipe a full cluster of 300 TB... it is probably related to some RADOS index/database bug, not to S3.
>
> As Janek explained, it is a major issue, because the error happens silently and you can only detect it via S3, when you go to delete/purge an S3 bucket and it returns NoSuchKey. The error is not related to the S3 logic.
>
> I hope this time the devs can take enough time to find and resolve the issue. The error happens with low EC profiles, and even with replica x3 in some cases.
>
> Regards
>
>
>
> -----Original Message-----
> From: Janek Bevendorff <janek.bevendorff(a)uni-weimar.de>
> Sent: Thursday, 12 November 2020 14:06
> To: Rafael Lopez <rafael.lopez(a)monash.edu>
> CC: Robin H. Johnson <robbat2(a)gentoo.org>; ceph-users <ceph-users(a)ceph.io>
> Subject: [ceph-users] Re: NoSuchKey on key that is visible in s3 list/radosgw bk
>
> Here is a bug report concerning (probably) this exact issue:
> https://tracker.ceph.com/issues/47866
>
> I left a comment describing the situation and my (limited) experiences with it.
>
>
> On 11/11/2020 10:04, Janek Bevendorff wrote:
>>
>> Yeah, that seems to be it. There are 239 objects prefixed
>> .8naRUHSG2zfgjqmwLnTPvvY1m6DZsgh in my dump. However, there are none
>> of the multiparts from the other file to be found and the head object
>> is 0 bytes.
>>
>> I checked another multipart object with an end pointer of 11.
>> Surprisingly, it had way more than 11 parts (39 to be precise) named
>> .1, .1_1 .1_2, .1_3, etc. Not sure how Ceph identifies those, but I
>> could find them in the dump at least.
>>
>> I have no idea why the objects disappeared. I ran a Spark job over all
>> buckets, read 1 byte of every object and recorded errors. Of the 78
>> buckets, two are missing objects. One bucket is missing one object,
>> the other 15. So, luckily, the incidence is still quite low, but the
>> problem seems to be expanding slowly.
>>
>>
>> On 10/11/2020 23:46, Rafael Lopez wrote:
>>> Hi Janek,
>>>
>>> What you said sounds right - an S3 single part obj won't have an S3
>>> multipart string as part of the prefix. S3 multipart string looks
>>> like "2~m5Y42lPMIeis5qgJAZJfuNnzOKd7lme".
>>>
>>> From memory, single part S3 objects that don't fit in a single rados
>>> object are assigned a random prefix that has nothing to do with
>>> the object name, and the rados tail/data objects (not the head
>>> object) have that prefix.
>>> As per your working example, the prefix for that would be
>>> '.8naRUHSG2zfgjqmwLnTPvvY1m6DZsgh'. So there would be (239) "shadow"
>>> objects with names containing that prefix, and if you add up the
>>> sizes it should be the size of your S3 object.
>>>
>>> You should look at working and non working examples of both single
>>> and multipart S3 objects, as they are probably all a bit different
>>> when you look in rados.
>>>
>>> I agree it is a serious issue, because once objects are no longer in
>>> rados, they cannot be recovered. If it was a case that there was a
>>> link broken or rados objects renamed, then we could work to
>>> recover...but as far as I can tell, it looks like stuff is just
>>> vanishing from rados. The only explanation I can think of is some
>>> (rgw or rados) background process is incorrectly doing something with
>>> these objects (eg. renaming/deleting). I had thought perhaps it was a
>>> bug with the rgw garbage collector..but that is pure speculation.
>>>
>>> Once you can articulate the problem, I'd recommend logging a bug
>>> tracker upstream.
>>>
>>>
>>> On Wed, 11 Nov 2020 at 06:33, Janek Bevendorff
>>> <janek.bevendorff(a)uni-weimar.de> wrote:
>>>
>>> Here's something else I noticed: when I stat objects that work
>>> via radosgw-admin, the stat info contains a "begin_iter" JSON
>>> object with RADOS key info like this
>>>
>>>
>>> "key": {
>>> "name":
>>> "29/items/WIDE-20110924034843-crawl420/WIDE-20110924065228-02544.warc.gz",
>>> "instance": "",
>>> "ns": ""
>>> }
>>>
>>>
>>> and then "end_iter" with key info like this:
>>>
>>>
>>> "key": {
>>> "name":
>>> ".8naRUHSG2zfgjqmwLnTPvvY1m6DZsgh_239",
>>> "instance": "",
>>> "ns": "shadow"
>>> }
>>>
>>> However, when I check the broken 0-byte object, the "begin_iter"
>>> and "end_iter" keys look like this:
>>>
>>>
>>> "key": {
>>> "name":
>>> "29/items/WIDE-20110903143858-crawl428/WIDE-20110903143858-01166.warc.gz.2~m5Y42lPMIeis5qgJAZJfuNnzOKd7lme.1",
>>> "instance": "",
>>> "ns": "multipart"
>>> }
>>>
>>> [...]
>>>
>>>
>>> "key": {
>>> "name":
>>> "29/items/WIDE-20110903143858-crawl428/WIDE-20110903143858-01166.warc.gz.2~m5Y42lPMIeis5qgJAZJfuNnzOKd7lme.19",
>>> "instance": "",
>>> "ns": "multipart"
>>> }
>>>
>>> So, it's the full name plus a suffix and the namespace is
>>> multipart, not shadow (or empty). This in itself may just be an
>>> artefact of whether the object was uploaded in one go or as a
>>> multipart object, but the second difference is that I cannot find
>>> any of the multipart objects in my pool's object name dump. I
>>> can, however, find the shadow RADOS object of the intact S3 object.
>>>
>>>
>>>
>>>
>>> --
>>> *Rafael Lopez*
>>> Devops Systems Engineer
>>> Monash University eResearch Centre
>>>
>>> T: +61 3 9905 9118
>>> E: rafael.lopez(a)monash.edu
>>>
> _______________________________________________
> ceph-users mailing list -- ceph-users(a)ceph.io
> To unsubscribe send an email to ceph-users-leave(a)ceph.io
Hi,
Today while debugging something we had a few questions that might lead
to improving the cephfs forward scrub docs:
https://docs.ceph.com/en/latest/cephfs/scrub/
tldr:
1. Should we document which sorts of issues the forward scrub is
able to fix?
2. Can we make it more visible (in docs) that scrubbing is not
supported with multi-mds?
3. Isn't the new `ceph -s` scrub task status misleading with multi-mds?
Details here:
1) We found a CephFS directory with a number of zero sized files:
# ls -l
...
-rw-r--r-- 1 1001890000 1001890000 0 Nov 3 11:58
upload_fc501199e3e7abe6b574101cf34aeefb.png
-rw-r--r-- 1 1001890000 1001890000 0 Nov 3 12:23
upload_fce4f55348185fefa0abdd8d11095ba8.gif
-rw-r--r-- 1 1001890000 1001890000 0 Nov 3 11:54
upload_fd95b8358851f0dac22fb775046a6163.png
...
The user claims that those files were non-zero sized last week. The
sequence of zero sized files includes *all* files written between Nov
2 and 9.
The user claims that his client was running out of memory, but this is
now fixed. So I suspect that his ceph client (kernel
3.10.0-1127.19.1.el7.x86_64) was not behaving well.
Anyway, I noticed that even though the dentries list 0 bytes, the
underlying rados objects have data, and the data looks good. E.g.
# rados get -p cephfs_data 200212e68b5.00000000 --namespace=xxx
200212e68b5.00000000
# file 200212e68b5.00000000
200212e68b5.00000000: PNG image data, 960 x 815, 8-bit/color RGBA,
non-interlaced
So I managed to recover the files with something like the script at [0] below
(using an input file mapping inode to filename).
But I'm wondering: is a forward scrub able to fix this sort of problem directly?
Should we document which sorts of issues the forward scrub is able to fix?
Anyway, I tried to scrub it, which led to:
# ceph tell mds.cephflax-mds-xxx scrub start /volumes/_nogroup/xxx
recursive repair
Scrub is not currently supported for multiple active MDS. Please
reduce max_mds to 1 and then scrub.
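(For the record, I assume the procedure would be roughly the following, though
we haven't tried it yet; <fsname> stands in for our filesystem name:
# ceph fs set <fsname> max_mds 1
# ... wait for the extra MDS ranks to stop ...
# ceph tell mds.cephflax-mds-xxx scrub start /volumes/_nogroup/xxx recursive repair
)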
So ...
2) Shouldn't we update the doc to mention loud and clear that scrub is
not currently supported for multiple active MDS?
3) I was somehow surprised by this, because I had thought that the new
`ceph -s` multi-mds scrub status implied that multi-mds scrubbing was
now working:
  task status:
    scrub status:
        mds.x: idle
        mds.y: idle
        mds.z: idle
Is it worth reporting this task status for cephfs if we can't even scrub them?
Thanks!!
Dan
[0]
mkdir -p recovered
while read -r a b; do
  for i in {0..9}; do
    echo "rados stat --cluster=flax --pool=cephfs_data --namespace=xxx" $(printf "%x" $a).0000000$i "&&" "rados get --cluster=flax --pool=cephfs_data --namespace=xxx" $(printf "%x" $a).0000000$i $(printf "%x" $a).0000000$i
  done
  echo cat $(printf "%x" $a).* ">" $(printf "%x" $a)
  echo mv $(printf "%x" $a) recovered/$b
done < inones_fnames.txt
Hi Patrick,
Any updates? Looking forward to your reply :D
On Thu, Dec 17, 2020 at 11:39 AM Patrick Donnelly <pdonnell(a)redhat.com> wrote:
>
> On Wed, Dec 16, 2020 at 5:46 PM Alex Taylor <alexu4993(a)gmail.com> wrote:
> >
> > Hi Cephers,
> >
> > I'm using VSCode remote development with a docker server. It worked OK
> > but fails to start the debugger after /root is mounted by ceph-fuse. The
> > log shows that the binary passes the access X_OK check but cannot
> > actually be executed; see:
> >
> > ```
> > strace_log: access("/root/.vscode-server/extensions/ms-vscode.cpptools-1.1.3/debugAdapters/OpenDebugAD7",
> > X_OK) = 0
> >
> > root@develop:~# ls -alh
> > .vscode-server/extensions/ms-vscode.cpptools-1.1.3/debugAdapters/OpenDebugAD7
> > -rw-r--r-- 1 root root 978 Dec 10 13:06
> > .vscode-server/extensions/ms-vscode.cpptools-1.1.3/debugAdapters/OpenDebugAD7
> > ```
> >
> > I also test the access syscall on ext4, xfs and even cephfs kernel
> > client, all of them return -EACCES, which is expected (the extension
> > will then explicitly call chmod +x).
> >
> > After some digging in the code, I found it is probably caused by
> > https://github.com/ceph/ceph/blob/master/src/client/Client.cc#L5549-L5550.
> > So here come two questions:
> > 1. Is this a bug or is there any concern I missed?
>
> I tried reproducing it with the master branch and could not. It might
> be due to an older fuse/ceph. I suggest you upgrade!
>
I tried the master branch (332a188d9b3c4eb5c5ad2720b7299913c5a772ee) as well
and the issue still exists. My test program is:
```
#include <stdio.h>
#include <unistd.h>
int main() {
    int r;
    const char path[] = "test";
    r = access(path, F_OK);
    printf("file exists: %d\n", r);
    r = access(path, X_OK);
    printf("file executable: %d\n", r);
    return 0;
}
```
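I compile it in the directory under test, roughly like this (the source file
name is arbitrary):
```
gcc access_test.c -o a.out
```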
And the test result:
```
# local filesystem: ext4
root@f626800a6e85:~# ls -l test
-rw-r--r-- 1 root root 6 Dec 19 06:13 test
root@f626800a6e85:~# ./a.out
file exists: 0
file executable: -1
root@f626800a6e85:~# findmnt -t fuse.ceph-fuse
TARGET SOURCE FSTYPE OPTIONS
/root/mnt ceph-fuse fuse.ceph-fuse
rw,nosuid,nodev,relatime,user_id=0,group_id=0,allow_other
root@f626800a6e85:~# cd mnt
# ceph-fuse
root@f626800a6e85:~/mnt# ls -l test
-rw-r--r-- 1 root root 6 Dec 19 06:10 test
root@f626800a6e85:~/mnt# ./a.out
file exists: 0
file executable: 0
root@f626800a6e85:~/mnt# ./test
bash: ./test: Permission denied
```
Again, ceph-fuse says file `test` is executable but in fact it can't
be executed.
The kernel version I'm testing on is:
```
root@f626800a6e85:~/mnt# uname -ar
Linux f626800a6e85 4.9.0-7-amd64 #1 SMP Debian 4.9.110-1 (2018-07-05)
x86_64 GNU/Linux
```
Please try the program above and make sure you're running it as root
user, thank you. And if the reproduction still fails, please let me
know the kernel version.
> > 2. It works again with fuse_default_permissions=true, any drawbacks if
> > this option is set?
>
> Correctness (ironically, for you) and performance.
>
> --
> Patrick Donnelly, Ph.D.
> He / Him / His
> Principal Software Engineer
> Red Hat Sunnyvale, CA
> GPG: 19F28A586F808C2402351B93C3301A3E258DD79D
>
Hi,
There is a default limit of 1 TiB for max_file_size in CephFS. I raised it to 2 TiB, but I have now received a request to store a file of up to 7 TiB.
I'd expect the limit to be there for a reason, but what is the risk of setting that value to, say, 10 TiB?
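For reference, I changed it with something like this (the value is in bytes,
i.e. 2 * 2^40; our fs name substituted for <fsname>):
# ceph fs set <fsname> max_file_size 2199023255552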
--
Mark Schouten <mark(a)tuxis.nl>
Tuxis, Ede, https://www.tuxis.nl
T: +31 318 200208
Hi all,
We have a few subdirs with an rctime in the future.
# getfattr -n ceph.dir.rctime session
# file: session
ceph.dir.rctime="2576387188.090"
I can't find any subdir or item in that directory with that rctime, so
I presume that there was previously a file with it and that rctime cannot
go backwards [1].
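(For reference, this is roughly how I searched; GNU find's %C@ prints the
ctime, and ceph.dir.rctime only exists on directories:
# find session -type d -exec getfattr -n ceph.dir.rctime {} +
# find session -printf '%C@ %p\n' | sort -n | tail
)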
Is there any way to fix these rctimes so they show the latest ctime of
the subtree?
Also -- are we still relying on the client clock to set the rctime /
ctime of a file? Would it make sense to cap the ctime/rctime of any
update at the current time on the MDS?
Best Regards,
Dan
[1] https://github.com/ceph/ceph/pull/24023/commits/920ef964311a61fcc6c0d6671b7…
I have a cephfs secondary (non-root) data pool with unfound and degraded
objects that I have not been able to recover [1]. I created an
additional data pool and used 'setfattr -n ceph.dir.layout.pool' and a
very long rsync to move the files off of the degraded pool and onto the
new pool. This has completed, and using find + 'getfattr -n
ceph.file.layout.pool', I verified that no files are using the old pool
anymore. No ceph.dir.layout.pool attributes point to the old pool either.
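The verification was roughly like this (the mount point is a placeholder; I
grepped for the old pool name and expected zero matches):
# find /mnt/cephfs -type f -exec getfattr -n ceph.file.layout.pool --absolute-names {} + | grep -c fs.data.archive.frames
# find /mnt/cephfs -type d -exec getfattr -n ceph.dir.layout.pool --absolute-names {} + 2>/dev/null | grep -c fs.data.archive.frames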
However, the old pool still reports that there are objects in the old
pool, likely the same ones that were unfound/degraded from before:
https://pastebin.com/qzVA7eZr
Based on a old message from the mailing list[2], I checked the MDS for
stray objects (ceph daemon mds.ceph4 dump cache file.txt ; grep -i stray
file.txt) and found 36 stray entries in the cache:
https://pastebin.com/MHkpw3DV. However, I'm not certain how to map
these stray cache objects to clients that may be accessing them.
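(One thing I've been poking at, in case it's the right direction, is dumping an
individual inode from the MDS to see which clients hold caps on it, e.g.
# ceph daemon mds.ceph4 dump inode 0x10000020fa1
# ceph daemon mds.ceph4 session ls
but I'm not sure that's the intended way to tie strays back to clients.)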
'rados -p fs.data.archive.frames ls' shows 145 objects. Looking at the
parent of each object shows 2 strays:
for obj in $(cat rados.ls.txt) ; do echo $obj ; rados -p fs.data.archive.frames getxattr $obj parent | strings ; done
[...]
10000020fa1.00000000
10000020fa1
stray6
10000020fbc.00000000
10000020fbc
stray6
[...]
...before getting stuck on one object for over 5 minutes (then I gave up):
1000005b1af.00000083
What can I do to make sure this pool is ready to be safely deleted from
cephfs (ceph fs rm_data_pool archive fs.data.archive.frames)?
--Mike
[1] https://lists.ceph.io/hyperkitty/list/ceph-users@ceph.io/thread/QHFOGEKXK7VDNNSKR74BA6IIMGGIXBXA/#7YQ6SSTESM5LTFVLQK3FSYFW5FDXJ5CF
[2] http://lists.ceph.com/pipermail/ceph-users-ceph.com/2015-October/005233.h…