On Mon, 3 Feb 2020, Paul Emmerich wrote:
On Sun, Feb 2, 2020 at 9:35 PM Håkan T Johansson
<f96hajo(a)chalmers.se> wrote:
> Changing cp (or whatever standard tool is used) to call fsync() before
> each close() is not an option for a user. Also, doing that would lead to
> terrible performance generally. Just tested - a recursive copy of a
> 70k-file Linux source tree went from 15 s to 6 minutes on a local
> filesystem I have at hand.
Don't do it for every file: cp foo bar; sync
Does not help:
$ md5sum ~/rnd100M
2e6c0b54748fa04dfcc54c1705e11a20 /home/htj/rnd100M
$ for i in `seq --format="%05.0f" 1 1000` ; do cp ~/rnd100M rnd1_$i ; done
$ sync
$ for i in `seq --format="%05.0f" 1 50 1000` ; do md5sum rnd1_$i ; done
2e6c0b54748fa04dfcc54c1705e11a20 rnd1_00001
2f282b84e7e608d5852449ed940bfc51 rnd1_00051
2f282b84e7e608d5852449ed940bfc51 rnd1_00101
2f282b84e7e608d5852449ed940bfc51 rnd1_00151
2f282b84e7e608d5852449ed940bfc51 rnd1_00201
2f282b84e7e608d5852449ed940bfc51 rnd1_00251
2f282b84e7e608d5852449ed940bfc51 rnd1_00301
2f282b84e7e608d5852449ed940bfc51 rnd1_00351
2f282b84e7e608d5852449ed940bfc51 rnd1_00401
2f282b84e7e608d5852449ed940bfc51 rnd1_00451
2f282b84e7e608d5852449ed940bfc51 rnd1_00501
2f282b84e7e608d5852449ed940bfc51 rnd1_00551
2f282b84e7e608d5852449ed940bfc51 rnd1_00601
2f282b84e7e608d5852449ed940bfc51 rnd1_00651
2f282b84e7e608d5852449ed940bfc51 rnd1_00701
2f282b84e7e608d5852449ed940bfc51 rnd1_00751
2e6c0b54748fa04dfcc54c1705e11a20 rnd1_00801
2e6c0b54748fa04dfcc54c1705e11a20 rnd1_00851
2e6c0b54748fa04dfcc54c1705e11a20 rnd1_00901
2e6c0b54748fa04dfcc54c1705e11a20 rnd1_00951
$ for i in `seq --format="%05.0f" 1 50 1000` ; do ls -l rnd1_$i ; done
-rw-r--r-- 1 htj htj 104857600 feb 5 23:18 rnd1_00001
-rw-r--r-- 1 htj htj 104857600 feb 5 23:18 rnd1_00051
-rw-r--r-- 1 htj htj 104857600 feb 5 23:18 rnd1_00101
-rw-r--r-- 1 htj htj 104857600 feb 5 23:18 rnd1_00151
-rw-r--r-- 1 htj htj 104857600 feb 5 23:18 rnd1_00201
-rw-r--r-- 1 htj htj 104857600 feb 5 23:18 rnd1_00251
-rw-r--r-- 1 htj htj 104857600 feb 5 23:18 rnd1_00301
-rw-r--r-- 1 htj htj 104857600 feb 5 23:18 rnd1_00351
-rw-r--r-- 1 htj htj 104857600 feb 5 23:18 rnd1_00401
-rw-r--r-- 1 htj htj 104857600 feb 5 23:18 rnd1_00451
-rw-r--r-- 1 htj htj 104857600 feb 5 23:18 rnd1_00501
-rw-r--r-- 1 htj htj 104857600 feb 5 23:18 rnd1_00551
-rw-r--r-- 1 htj htj 104857600 feb 5 23:18 rnd1_00601
-rw-r--r-- 1 htj htj 104857600 feb 5 23:18 rnd1_00651
-rw-r--r-- 1 htj htj 104857600 feb 5 23:18 rnd1_00701
-rw-r--r-- 1 htj htj 104857600 feb 5 23:18 rnd1_00751
-rw-r--r-- 1 htj htj 104857600 feb 5 23:18 rnd1_00801
-rw-r--r-- 1 htj htj 104857600 feb 5 23:18 rnd1_00851
-rw-r--r-- 1 htj htj 104857600 feb 5 23:18 rnd1_00901
-rw-r--r-- 1 htj htj 104857600 feb 5 23:18 rnd1_00951
(2f282... is the md5sum of a 100 MiB file of 0s)
md5sums at the transition to the filesystem becoming full:
2e6c0b54748fa04dfcc54c1705e11a20 rnd1_00018
2e6c0b54748fa04dfcc54c1705e11a20 rnd1_00019
2e6c0b54748fa04dfcc54c1705e11a20 rnd1_00020
29a396ece342d8b2bc8ca509d961bd02 rnd1_00021
ee7e0deeb6c817bddf7930c3984da83d rnd1_00022
c07ad8b66905d90fc37183b2bc3ba3ee rnd1_00023
42f2a55b82642632fcee8d521038e531 rnd1_00024
2f282b84e7e608d5852449ed940bfc51 rnd1_00025
2f282b84e7e608d5852449ed940bfc51 rnd1_00026
2f282b84e7e608d5852449ed940bfc51 rnd1_00027
-rw-r--r-- 1 htj htj 104857600 feb 5 23:18 rnd1_00018
-rw-r--r-- 1 htj htj 104857600 feb 5 23:18 rnd1_00019
-rw-r--r-- 1 htj htj 104857600 feb 5 23:18 rnd1_00020
-rw-r--r-- 1 htj htj 104857600 feb 5 23:18 rnd1_00021
-rw-r--r-- 1 htj htj 104857600 feb 5 23:18 rnd1_00022
-rw-r--r-- 1 htj htj 104857600 feb 5 23:18 rnd1_00023
-rw-r--r-- 1 htj htj 104857600 feb 5 23:18 rnd1_00024
-rw-r--r-- 1 htj htj 104857600 feb 5 23:18 rnd1_00025
-rw-r--r-- 1 htj htj 104857600 feb 5 23:18 rnd1_00026
-rw-r--r-- 1 htj htj 104857600 feb 5 23:18 rnd1_00027
This was done on a filesystem that had about 3 GB of free space.
Writing 100 GB in total here forced much of the data out of the client
cache again, so the subsequent md5sum reads back different (all-zero) data.
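As an aside, hashing is not needed to spot such silently zeroed files; a
minimal sketch (function names are mine, just for illustration):

```c
/* Check whether a buffer, or a whole file, consists entirely of zero
 * bytes - enough to detect the all-0 file contents seen above without
 * computing an md5sum. */
#include <stdio.h>
#include <stddef.h>

int is_all_zero_buf(const unsigned char *buf, size_t n)
{
    for (size_t i = 0; i < n; i++)
        if (buf[i] != 0)
            return 0;
    return 1;
}

/* Returns 1 if the file is all zeros, 0 if not, -1 on I/O error. */
int is_all_zero_file(const char *path)
{
    FILE *f = fopen(path, "rb");
    if (!f)
        return -1;
    unsigned char buf[1 << 16];
    size_t n;
    int zero = 1;
    while ((n = fread(buf, 1, sizeof buf, f)) > 0) {
        if (!is_all_zero_buf(buf, n)) {
            zero = 0;
            break;
        }
    }
    if (ferror(f))
        zero = -1;
    fclose(f);
    return zero;
}
```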
Note that POSIX only says that sync() shall schedule the updates to the
filesystem, not necessarily wait for their completion. fsync(), however,
shall wait for completion.
https://pubs.opengroup.org/onlinepubs/9699919799/functions/sync.html
https://pubs.opengroup.org/onlinepubs/9699919799/functions/fsync.html
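This is why per-file fsync() (expensive as it is) catches the error that
cp misses: a deferred ENOSPC surfaces from fsync() or close(), whose
return values must then be checked. A minimal sketch of such a copy
(function name and buffer size are mine, just for illustration):

```c
/* Minimal sketch of a copy that reports delayed write errors.
 * Checking the return values of fsync() and close() is what catches
 * e.g. ENOSPC that only surfaces at writeback time. */
#include <fcntl.h>
#include <unistd.h>

/* Returns 0 on success, -1 on any error. */
int copy_with_fsync(const char *src, const char *dst)
{
    int in = open(src, O_RDONLY);
    if (in < 0)
        return -1;
    int out = open(dst, O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (out < 0) {
        close(in);
        return -1;
    }
    char buf[1 << 16];
    ssize_t n;
    int ret = 0;
    while ((n = read(in, buf, sizeof buf)) > 0) {
        if (write(out, buf, (size_t) n) != n) {  /* short write => error */
            ret = -1;
            break;
        }
    }
    if (n < 0)                 /* read error */
        ret = -1;
    /* fsync() must wait for completion (POSIX), so a deferred
     * out-of-space error is reported here, not silently dropped. */
    if (fsync(out) != 0)
        ret = -1;
    if (close(out) != 0)
        ret = -1;
    close(in);
    return ret;
}
```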
Side note: when I used a larger (1 GiB) file as the copy source,
out-of-space was sometimes reported, but the results were still not
reliable.
I do not see how to fulfill the requirement that a read() after a
successful write() shall return the written data, without the cephfs
client asking the prospective OSD node of each block for a space
reservation before returning success from write(). Clients could request
such reservations speculatively as soon as they have opened a file in
write mode. The alternative would be that clients cannot drop dirty data
from their caches until the out-of-space condition has been cleared.
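For reference, the consistency requirement I mean is just the following
(a trivial local demonstration; it says nothing about how cephfs should
implement it, and the temp-file path is only for illustration):

```c
/* Minimal demonstration of the POSIX read-after-write requirement:
 * a read() after a successful write() must return the written data,
 * even before anything has reached stable storage. */
#include <string.h>
#include <stdlib.h>
#include <unistd.h>
#include <fcntl.h>

/* Returns 1 if the read-back matches what was written, 0 otherwise. */
int read_after_write_check(void)
{
    char tmpl[] = "/tmp/rawXXXXXX";   /* illustrative temp path */
    int fd = mkstemp(tmpl);
    if (fd < 0)
        return 0;
    const char data[] = "not zeros";
    char back[sizeof data];
    int ok = 0;
    if (write(fd, data, sizeof data) == (ssize_t) sizeof data &&
        lseek(fd, 0, SEEK_SET) == 0 &&
        read(fd, back, sizeof back) == (ssize_t) sizeof back &&
        memcmp(data, back, sizeof data) == 0)
        ok = 1;                       /* read() saw the written bytes */
    close(fd);
    unlink(tmpl);
    return ok;
}
```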
Cheers,
Håkan
>>> Paul
>>>
>>> --
>>> Paul Emmerich
>>>
>>> Looking for help with your Ceph cluster? Contact us at https://croit.io
>>>
>>> croit GmbH
>>> Freseniusstr. 31h
>>> 81247 München
>>> www.croit.io
>>> Tel: +49 89 1896585 90
>>>
>>> On Mon, Jan 27, 2020 at 9:11 PM Håkan T Johansson <f96hajo(a)chalmers.se> wrote:
>>>>
>>>>
>>>> Hi,
>>>>
>>>> for test purposes, I have set up two 100 GB OSDs, one
>>>> taking a data pool and the other metadata pool for cephfs.
>>>>
>>>> Am running 14.2.6-1-gffd69200ad-1 with packages from
>>>> https://mirror.croit.io/debian-nautilus
>>>>
>>>> Am then running a program that creates a lot of 1 MiB files by calling
>>>> fopen()
>>>> fwrite()
>>>> fclose()
>>>> for each of them. Error codes are checked.
>>>>
>>>> This works successfully for ~100 GB of data, and then strangely also
>>>> succeeds for many more 100 GB of data... ??
>>>>
>>>> All written files have size 1 MiB with 'ls', and thus should contain
>>>> the data written. However, on inspection, the files written after the
>>>> first ~100 GiB are full of just 0s. (hexdump -C)
>>>>
>>>>
>>>> To further test this, I use the standard tool 'cp' to copy a few
>>>> random-content files into the full cephfs filesystem. cp reports no
>>>> complaints, and after the copy operations, content is seen with
>>>> hexdump -C. However, after forcing the data out of cache on the client
>>>> by reading other earlier created files, hexdump -C shows all-0 content
>>>> for the files copied with 'cp'. Data that was there is suddenly gone...?
>>>>
>>>>
>>>> I am new to ceph. Is there an option I have missed to avoid this
>>>> behaviour? (I could not find one in
>>>> https://docs.ceph.com/docs/master/man/8/mount.ceph/ )
>>>>
>>>> Is this behaviour related to
>>>> https://docs.ceph.com/docs/mimic/cephfs/full/
>>>> ?
>>>>
>>>> (That page states 'sometime after a write call has already returned 0'.
>>>> But if write returns 0, then no data has been written, so the user
>>>> program would not assume any kind of success.)
>>>>
>>>> Best regards,
>>>>
>>>> Håkan
>>>> _______________________________________________
>>>> ceph-users mailing list -- ceph-users(a)ceph.io
>>>> To unsubscribe send an email to ceph-users-leave(a)ceph.io
>>>
>