In the week since upgrading one of our clusters from Nautilus 14.2.21 to Pacific 16.2.4 I've seen four spurious read errors that always have the same bad checksum of 0x6706be76. I've never seen this in any of our clusters before. Here's an example of what I'm seeing in the logs:
ceph-osd.132.log:2021-06-20T22:53:20.584-0400 7fde2e4fc700 -1 bluestore(/var/lib/ceph/osd/ceph-132) _verify_csum bad crc32c/0x1000 checksum at blob offset 0x0, got 0x6706be76, expected 0xee74a56a, device location [0x18c81b40000~1000], logical extent 0x200000~1000, object #29:2d8210bf:::rbd_data.94f4232ae8944a.0000000000026c57:head#
I'm not seeing any indication of inconsistent PGs, only the spurious read error. I don't see an explicit indication of a retry in the logs following the above message. Bluestore code to retry three times was introduced in 2018 following a similar issue with the same checksum: https://tracker.ceph.com/issues/22464
Here's an example of what my health detail looks like:
HEALTH_WARN 1 OSD(s) have spurious read errors
[WRN] BLUESTORE_SPURIOUS_READ_ERRORS: 1 OSD(s) have spurious read errors
osd.117 reads with retries: 1
I followed this (unresolved) thread, too: https://lists.ceph.io/hyperkitty/list/ceph-users@ceph.io/thread/DRBVFQLZ5ZY…
I do have swap enabled, but I don't think memory pressure is an issue with 30GB available out of 96GB (and no sign I've been close to summoning the OOMkiller). The OSDs that have thrown the cluster into HEALTH_WARN with the spurious read errors are busy 12TB rotational HDDs and I _think_ it's only happening during a deep scrub. We're on Ubuntu 18.04; uname: 5.4.0-74-generic #83~18.04.1-Ubuntu SMP Tue May 11 16:01:00 UTC 2021 x86_64 x86_64 x86_64 GNU/Linux.
Does Pacific retry three times on a spurious read error? Would I see an indication of a retry in the logs?
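In case it's useful, this is what I was planning to check next. I'm assuming the retry logic from that tracker is governed by the bluestore_retry_disk_reads option, so please correct me if that's the wrong knob:
# what the OSD is actually running with; 3 should mean up to three re-reads
ceph config get osd.132 bluestore_retry_disk_reads
# temporarily raise BlueStore logging on that OSD to try to catch the retry path during the next deep scrub
ceph tell osd.132 config set debug_bluestore 10/20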
Thanks!
~Jay
Hi Reed,
To add to this comment by Weiwen:
On 28.05.21 13:03, 胡 玮文 wrote:
> Have you tried just starting multiple rsync processes simultaneously to transfer different directories? Distributed systems like Ceph often benefit from more parallelism.
When I migrated from XFS on iSCSI (legacy system, no Ceph) to CephFS a
few months ago, I used msrsync [1] and was quite happy with the speed.
For your use case, I would start with -p 12 but might experiment with up
to -p 24 (as you only have 6C/12T in your CPU). With many small files,
you also might want to increase -s from the default 1000.
Note that msrsync does not work with the --delete rsync flag. As I was
syncing a live system, I ended up with this workflow:
- Initial sync with msrsync (something like ./msrsync -p 12 --progress
--stats --rsync "-aS --numeric-ids" ...)
- Second sync with msrsync (to sync changes during the first sync)
- Take old storage off-line for users / read-only
- Final rsync with --delete (i.e. rsync -aS --numeric-ids --delete ...)
- Mount cephfs at location of old storage, adjust /etc/exports with fsid
entries where necessary, turn system back on-line / read-write
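For completeness, the whole sequence looked roughly like this (source and target paths are placeholders, and the msrsync options should be checked against the version you have):
# initial bulk copy with 12 workers
./msrsync -p 12 --progress --stats --rsync "-aS --numeric-ids" /srv/old_storage/ /mnt/cephfs/storage/
# second pass to pick up what changed during the first run
./msrsync -p 12 --progress --stats --rsync "-aS --numeric-ids" /srv/old_storage/ /mnt/cephfs/storage/
# after the old storage is read-only: final pass with plain rsync so --delete works
rsync -aS --numeric-ids --delete /srv/old_storage/ /mnt/cephfs/storage/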
Cheers
Sebastian
[1] https://github.com/jbd/msrsync
Hello
We tried to use cephadm with Podman to start 44 OSDs per host, but the deployment consistently stalls after 24 OSDs have been added on a host.
We looked into cephadm.log on the problematic host and saw that the command `cephadm ceph-volume lvm list --format json` got stuck.
We also saw that the output of that command was not complete. We therefore tried using compact JSON output, which let us get up to 36 OSDs per host.
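For reference, the hang is easy to see by running the same inventory call by hand on an affected host:
# this is the call from cephadm.log; with ~24 OSDs on the host it never returns for us
cephadm ceph-volume lvm list --format json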
If you need more information just ask.
Podman version: 3.2.1
Ceph version: 16.2.4
OS version: Suse Leap 15.3
Greetings,
Jan
Hi,
Today while debugging something we had a few questions that might lead
to improving the cephfs forward scrub docs:
https://docs.ceph.com/en/latest/cephfs/scrub/
tldr:
1. Should we document which sorts of issues the forward scrub is able to fix?
2. Can we make it more visible (in docs) that scrubbing is not
supported with multi-mds?
3. Isn't the new `ceph -s` scrub task status misleading with multi-mds?
Details here:
1) We found a CephFS directory with a number of zero sized files:
# ls -l
...
-rw-r--r-- 1 1001890000 1001890000 0 Nov 3 11:58 upload_fc501199e3e7abe6b574101cf34aeefb.png
-rw-r--r-- 1 1001890000 1001890000 0 Nov 3 12:23 upload_fce4f55348185fefa0abdd8d11095ba8.gif
-rw-r--r-- 1 1001890000 1001890000 0 Nov 3 11:54 upload_fd95b8358851f0dac22fb775046a6163.png
...
The user claims that those files were non-zero sized last week. The
sequence of zero sized files includes *all* files written between Nov
2 and 9.
The user claims that his client was running out of memory, but this is
now fixed. So I suspect that his ceph client (kernel
3.10.0-1127.19.1.el7.x86_64) was not behaving well.
Anyway, I noticed that even though the dentries list 0 bytes, the
underlying rados objects have data, and the data looks good. E.g.
# rados get -p cephfs_data 200212e68b5.00000000 --namespace=xxx 200212e68b5.00000000
# file 200212e68b5.00000000
200212e68b5.00000000: PNG image data, 960 x 815, 8-bit/color RGBA, non-interlaced
So I managed to recover the files doing something like this (using an
input file mapping inode to filename) [see PS 0].
But I'm wondering if a forward scrub is able to fix this sort of
problem directly?
Should we document which sorts of issues the forward scrub is able to fix?
I anyway tried to scrub it, which led to:
# ceph tell mds.cephflax-mds-xxx scrub start /volumes/_nogroup/xxx recursive repair
Scrub is not currently supported for multiple active MDS. Please reduce max_mds to 1 and then scrub.
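(For anyone else hitting this, I assume the intended workaround is to temporarily drop to a single active MDS and restore max_mds afterwards; the fs name below is guessed from our MDS names:
ceph fs set cephflax max_mds 1
# wait for the extra ranks to stop, then re-run the scrub:
ceph tell mds.cephflax-mds-xxx scrub start /volumes/_nogroup/xxx recursive repair
ceph fs set cephflax max_mds 3)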
So ...
2) Shouldn't we update the doc to mention loud and clear that scrub is
not currently supported for multiple active MDS?
3) I was somehow surprised by this, because I had thought that the new
`ceph -s` multi-mds scrub status implied that multi-mds scrubbing was
now working:
  task status:
    scrub status:
      mds.x: idle
      mds.y: idle
      mds.z: idle
Is it worth reporting this task status for cephfs if we can't even scrub them?
Thanks!!
Dan
[0]
# Reads "<decimal inode> <filename>" pairs and prints the rados/cat/mv
# commands needed to rebuild each file from its first 10 data objects
# (object names are "<hex inode>.<8-digit stripe index>").
mkdir -p recovered
while read -r a b; do
  hexino=$(printf "%x" "$a")
  for i in {0..9}; do
    echo "rados stat --cluster=flax --pool=cephfs_data --namespace=xxx ${hexino}.0000000$i &&" \
         "rados get --cluster=flax --pool=cephfs_data --namespace=xxx ${hexino}.0000000$i ${hexino}.0000000$i"
  done
  echo "cat ${hexino}.* > ${hexino}"
  echo "mv ${hexino} recovered/$b"
done < inodes_fnames.txt
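For clarity, the loop above only prints the rados/cat/mv commands; I dumped them into a file, eyeballed the list, and then ran it with sh. Something like this (the script and file names are just what I used locally):
bash gen_recover.sh > recover_cmds.sh
sh recover_cmds.sh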
Dear all,
I'm sorry if I'm asking about the obvious or missing a previous discussion of
this, but I could not find the answer to my question online. I'd be happy
just to be pointed in the right direction.
The cephfs-mirror tool in pacific looks extremely promising. How does it
work exactly? Is it based on files and (recursive) ctime or rather based on
object information? Does it handle incremental changes (only) between
snapshots?
There is an issue related to this that mentions recursive ctime. But that
would mean that users could "rsync -a" data to the file system and this
would not get synchronized.
I have good experience with ZFS, which is able to identify the changes between
two snapshots A and B and then transfer only those changes (at a sub-file
level, on the ZFS equivalent of blocks, to my understanding) to another server
whose copy of the file system is exactly in the state of snapshot A. Does
cephfs-mirror work the same way?
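For comparison, the ZFS workflow I have in mind looks roughly like this (pool/dataset and host names are made up):
# take a new snapshot and send only the delta between A and B to a host that already has A
zfs snapshot tank/data@B
zfs send -i tank/data@A tank/data@B | ssh backuphost zfs receive tank/data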
Best wishes,
Manuel
Sounds exactly like testing to me...
On 30.06.2021 17:46, Teoman Onay wrote:
> What do you mean by different?
>
> RHEL is a supported product while CentOS is not. You get bug/security
> fixes sooner on RHEL than on CentOS since, depending on their severity
> level, they are released during the z-stream releases.
>
> Stream is one minor release ahead of RHEL, which means it already
> contains part of the fixes that will be released a few months later
> in RHEL. It could be considered even more stable as it already
> contains part of the fixes.
>
>
>
>
> On Wed, 30 Jun 2021, 15:15 Radoslav Milanov
> <radoslav.milanov@gmail.com> wrote:
>
> If stream is so great why is RHEL different ?
>
> On 30.06.2021 03:49, Teoman Onay wrote:
> >>
> >> For similar reasons, CentOS 8 stream, as opposed to every other CentOS
> >> released before, is very experimental. I would never go in production with
> >> CentOS 8 stream.
> >>
> >>
> > Experimental?? Looks like you still don't understand what CentOS stream is.
> > If you have some time just read this:
> >
> https://www.linkedin.com/pulse/why-you-should-have-already-been-centos-stre…
> >
> > He summarized quite well what CentOS stream is.
We're happy to announce the 22nd and likely final backport release in the Nautilus series. Ultimately, we recommend all users upgrade to newer Ceph releases.
For detailed release notes with links and the changelog, please refer to the official blog entry at https://ceph.io/en/news/blog/2021/v14-2-22-nautilus-released
Notable Changes
---------------
* This release sets `bluefs_buffered_io` to true by default to improve performance
for metadata heavy workloads. Enabling this option has been reported to
occasionally cause excessive kernel swapping under certain workloads.
Currently, the most consistent performing combination is to enable
bluefs_buffered_io and disable system level swap.
* The default value of `bluestore_cache_trim_max_skip_pinned` has been
increased to 1000 to control memory growth due to onodes.
* Several other bug fixes in BlueStore, including a fix for an unexpected
ENOSPC bug in Avl/Hybrid allocators.
* The trimming logic in the monitor has been made dynamic, with the
introduction of `paxos_service_trim_max_multiplier`, a factor by which
`paxos_service_trim_max` is multiplied to make trimming faster,
when required. Setting it to 0 disables the upper bound check for trimming
and makes the monitors trim at the maximum rate.
* A `--max <n>` option is available with the `osd ok-to-stop` command to
provide up to N OSDs that can be stopped together without making PGs
unavailable.
* OSD: the option `osd_fast_shutdown_notify_mon` has been introduced to allow
the OSD to notify the monitor it is shutting down even if `osd_fast_shutdown`
is enabled. This helps with the monitor logs on larger clusters, that may get
many 'osd.X reported immediately failed by osd.Y' messages, and confuse tools.
* A long-standing bug that prevented 32-bit and 64-bit client/server
interoperability under msgr v2 has been fixed. In particular, mixing armv7l
(armhf) and x86_64 or aarch64 servers in the same cluster now works.
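As a quick illustration of the bluefs_buffered_io and `osd ok-to-stop` items above (example commands only, double-check against your own cluster):
# confirm the new default on a running OSD
ceph config get osd.0 bluefs_buffered_io
# ask for a set of up to 4 OSDs, including osd.12, that can be stopped together without making PGs unavailable
ceph osd ok-to-stop 12 --max 4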
Getting Ceph
------------
* Git at git://github.com/ceph/ceph.git
* Tarball at https://download.ceph.com/tarballs/ceph-14.2.22.tar.gz
* For packages, see https://docs.ceph.com/docs/master/install/get-packages/
* Release git sha1: ca74598065096e6fcbd8433c8779a2be0c889351
Hi,
Is there any proper documentation on how to connect Ceph with OpenStack?