Hi,
I keep getting scrub errors in my index pool and log pool, and I have to repair them every time.
HEALTH_ERR 2 scrub errors; Possible data damage: 1 pg inconsistent
[ERR] OSD_SCRUB_ERRORS: 2 scrub errors
[ERR] PG_DAMAGED: Possible data damage: 1 pg inconsistent
pg 20.19 is active+clean+inconsistent, acting [39,41,37]
Why is this happening?
I have no clue at all: there is no log entry, nothing ☹
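
A minimal sketch, not a diagnosis: one way to get more detail than the health summary is to pull the inconsistent PG IDs out of the `ceph health detail` output and ask RADOS which objects the scrub flagged. The helper name `inconsistent_pgs` is made up for this example, and the commands are only printed, not executed.

```shell
#!/bin/sh
# Print the ID of every PG that a `ceph health detail` report (on stdin)
# flags as inconsistent. Hypothetical helper, named for this example only.
inconsistent_pgs() {
    awk '$1 == "pg" && $4 ~ /inconsistent/ { print $2 }'
}

# Sample input copied from the health output quoted above.
sample='[ERR] PG_DAMAGED: Possible data damage: 1 pg inconsistent
pg 20.19 is active+clean+inconsistent, acting [39,41,37]'

printf '%s\n' "$sample" | inconsistent_pgs | while read -r pg; do
    # Printed rather than run: review before touching the cluster.
    echo "rados list-inconsistent-obj $pg --format=json-pretty"
done
```

On a live cluster the input would come from `ceph health detail | inconsistent_pgs`. The listing shows which replica and which kind of error the scrub recorded, which may be a better starting point than the OSD logs.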
Hi,
We've done our fair share of Ceph cluster upgrades since Hammer and
have not seen many problems with them. I'm now at the point where I have
to upgrade a rather large cluster running Luminous, and I would like to
hear from other users what issues I can expect,
so that I can anticipate them beforehand.
As said, the cluster is running Luminous (12.2.13) and has the following
services active:
services:
mon: 3 daemons, quorum osdnode01,osdnode02,osdnode04
mgr: osdnode01(active), standbys: osdnode02, osdnode03
mds: pmrb-3/3/3 up {0=osdnode06=up:active,1=osdnode08=up:active,2=osdnode07=up:active}, 1 up:standby
osd: 116 osds: 116 up, 116 in;
rgw: 3 daemons active
Of the OSDs, 11 are SSDs and 105 are HDDs. The capacity of the cluster
is 1.01PiB.
We have 2 active CRUSH rules on 18 pools. All pools have a size of 3, and there is a total of 5760 PGs.
{
    "rule_id": 1,
    "rule_name": "hdd-data",
    "ruleset": 1,
    "type": 1,
    "min_size": 1,
    "max_size": 10,
    "steps": [
        {
            "op": "take",
            "item": -10,
            "item_name": "default~hdd"
        },
        {
            "op": "chooseleaf_firstn",
            "num": 0,
            "type": "host"
        },
        {
            "op": "emit"
        }
    ]
},
{
    "rule_id": 2,
    "rule_name": "ssd-data",
    "ruleset": 2,
    "type": 1,
    "min_size": 1,
    "max_size": 10,
    "steps": [
        {
            "op": "take",
            "item": -21,
            "item_name": "default~ssd"
        },
        {
            "op": "chooseleaf_firstn",
            "num": 0,
            "type": "host"
        },
        {
            "op": "emit"
        }
    ]
}
rbd -> crush_rule: hdd-data
.rgw.root -> crush_rule: hdd-data
default.rgw.control -> crush_rule: hdd-data
default.rgw.data.root -> crush_rule: ssd-data
default.rgw.gc -> crush_rule: ssd-data
default.rgw.log -> crush_rule: ssd-data
default.rgw.users.uid -> crush_rule: hdd-data
default.rgw.usage -> crush_rule: ssd-data
default.rgw.users.email -> crush_rule: hdd-data
default.rgw.users.keys -> crush_rule: hdd-data
default.rgw.meta -> crush_rule: hdd-data
default.rgw.buckets.index -> crush_rule: ssd-data
default.rgw.buckets.data -> crush_rule: hdd-data
default.rgw.users.swift -> crush_rule: hdd-data
default.rgw.buckets.non-ec -> crush_rule: ssd-data
DB0475 -> crush_rule: hdd-data
cephfs_pmrb_data -> crush_rule: hdd-data
cephfs_pmrb_metadata -> crush_rule: ssd-data
All but four clients are running Luminous; the remaining four are running Jewel
(and need to be upgraded before I proceed with this upgrade).
So, normally, I would 'just' upgrade all Ceph packages on the
monitor nodes and restart the mons and then the mgrs.
After that, I would upgrade all Ceph packages on the OSD nodes and
restart all the OSDs. Then, after that, the MDSes and RGWs. Restarting
the OSDs will probably take a while.
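
The order described above can be sketched as a dry run that only prints the commands. Two things here are assumptions and not part of the plan as written: setting `noout` for the duration of the restarts, and using the systemd `ceph-*.target` units to restart each daemon type. The function name `upgrade_plan` is made up for this example.

```shell
#!/bin/sh
# Dry run: print the restart order (mon, mgr, osd, mds, rgw) without executing it.
upgrade_plan() {
    # Assumption: keep the cluster from rebalancing while OSDs restart.
    echo "ceph osd set noout"
    # Packages on the relevant nodes are upgraded before each restart.
    for daemon in mon mgr osd mds radosgw; do
        echo "systemctl restart ceph-$daemon.target"
    done
    # Check that every daemon reports the new version, then clear the flag.
    echo "ceph versions"
    echo "ceph osd unset noout"
}

upgrade_plan
```

Running `ceph versions` between stages is a cheap way to confirm that no daemon of the previous type was missed before moving on.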
If anyone has a hint on what I should expect to cause some extra load or
waiting time, that would be great.
Obviously, we have read
https://ceph.com/releases/v14-2-0-nautilus-released/ , but I'm looking
for real world experiences.
Thanks!
--
Mark Schouten | Tuxis B.V.
KvK: 74698818 | http://www.tuxis.nl/
T: +31 318 200208 | info(a)tuxis.nl
In the week since upgrading one of our clusters from Nautilus 14.2.21 to Pacific 16.2.4 I've seen four spurious read errors that always have the same bad checksum of 0x6706be76. I've never seen this in any of our clusters before. Here's an example of what I'm seeing in the logs:
ceph-osd.132.log:2021-06-20T22:53:20.584-0400 7fde2e4fc700 -1 bluestore(/var/lib/ceph/osd/ceph-132) _verify_csum bad crc32c/0x1000 checksum at blob offset 0x0, got 0x6706be76, expected 0xee74a56a, device location [0x18c81b40000~1000], logical extent 0x200000~1000, object #29:2d8210bf:::rbd_data.94f4232ae8944a.0000000000026c57:head#
I'm not seeing any indication of inconsistent PGs, only the spurious read error. I don't see an explicit indication of a retry in the logs following the above message. Bluestore code to retry three times was introduced in 2018 following a similar issue with the same checksum: https://tracker.ceph.com/issues/22464
Here's an example of what my health detail looks like:
HEALTH_WARN 1 OSD(s) have spurious read errors
[WRN] BLUESTORE_SPURIOUS_READ_ERRORS: 1 OSD(s) have spurious read errors
    osd.117 reads with retries: 1
I followed this (unresolved) thread, too: https://lists.ceph.io/hyperkitty/list/ceph-users@ceph.io/thread/DRBVFQLZ5ZY…
I do have swap enabled, but I don't think memory pressure is an issue with 30GB available out of 96GB (and no sign I've been close to summoning the OOMkiller). The OSDs that have thrown the cluster into HEALTH_WARN with the spurious read errors are busy 12TB rotational HDDs and I _think_ it's only happening during a deep scrub. We're on Ubuntu 18.04; uname: 5.4.0-74-generic #83~18.04.1-Ubuntu SMP Tue May 11 16:01:00 UTC 2021 x86_64 x86_64 x86_64 GNU/Linux.
Does Pacific retry three times on a spurious read error? Would I see an indication of a retry in the logs?
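
A small sketch for tallying these events across OSD logs. It does not answer the retry question; it only extracts the "got" and "expected" checksums from `_verify_csum` lines so that recurring values stand out. The helper name `csum_pairs` is made up for this example, and the log location in the usage note is an assumption.

```shell
#!/bin/sh
# Extract "<got> <expected>" checksum pairs from BlueStore _verify_csum
# log lines read on stdin. Lines that do not match are dropped.
csum_pairs() {
    sed -n 's/.*_verify_csum bad .* got \(0x[0-9a-f]*\), expected \(0x[0-9a-f]*\),.*/\1 \2/p'
}

# Sample input: the relevant part of the log line quoted above.
sample='bluestore(/var/lib/ceph/osd/ceph-132) _verify_csum bad crc32c/0x1000 checksum at blob offset 0x0, got 0x6706be76, expected 0xee74a56a, device location [0x18c81b40000~1000]'

printf '%s\n' "$sample" | csum_pairs
```

Assuming the logs live under /var/log/ceph, something like `grep -h _verify_csum /var/log/ceph/ceph-osd.*.log | csum_pairs | sort | uniq -c` would show how often each pair occurs.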
Thanks!
~Jay
Hi Reed,
To add to this comment by Weiwen:
On 28.05.21 13:03, 胡 玮文 wrote:
> Have you tried just starting multiple rsync processes simultaneously to transfer different directories? Distributed systems like Ceph often benefit from more parallelism.
When I migrated from XFS on iSCSI (legacy system, no Ceph) to CephFS a
few months ago, I used msrsync [1] and was quite happy with the speed.
For your use case, I would start with -p 12 but might experiment with up
to -p 24 (as you only have 6C/12T in your CPU). With many small files,
you also might want to increase -s from the default 1000.
Note that msrsync does not work with the --delete rsync flag. As I was
syncing a live system, I ended up with this workflow:
- Initial sync with msrsync (something like ./msrsync -p 12 --progress --stats --rsync "-aS --numeric-ids" ...)
- Second sync with msrsync (to sync changes during the first sync)
- Take old storage off-line for users / read-only
- Final rsync with --delete (i.e. rsync -aS --numeric-ids --delete ...)
- Mount cephfs at location of old storage, adjust /etc/exports with fsid entries where necessary, turn system back on-line / read-write
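
The sync passes of the workflow above can be sketched as a dry run. The source and destination paths are placeholders invented for this example, as is the function name `migration_plan`; the flags are the ones quoted in the list. Commands are printed, not executed.

```shell
#!/bin/sh
# Placeholder paths (assumptions): override SRC and DST for a real migration.
SRC=${SRC:-/mnt/old-storage/}
DST=${DST:-/mnt/cephfs/}

migration_plan() {
    # Pass 1: parallel bulk copy while the old storage is still live.
    echo "./msrsync -p 12 --progress --stats --rsync \"-aS --numeric-ids\" $SRC $DST"
    # Pass 2: same command again to catch changes made during pass 1.
    echo "./msrsync -p 12 --progress --stats --rsync \"-aS --numeric-ids\" $SRC $DST"
    # Pass 3: plain rsync with --delete, run once the old storage is
    # read-only, because msrsync cannot be combined with --delete.
    echo "rsync -aS --numeric-ids --delete $SRC $DST"
}

migration_plan
```

Keeping the final pass as a single plain rsync means deletions are handled exactly once, after writes to the old storage have stopped.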
Cheers
Sebastian
[1] https://github.com/jbd/msrsync
Hello
We tried to use cephadm with Podman to start 44 OSDs per host, but deployment consistently stops after 24 OSDs have been added on a host.
We looked into cephadm.log on the problematic host and saw that the command `cephadm ceph-volume lvm list --format json` got stuck.
We noticed that the output of the command was incomplete. We therefore tried compacted JSON, which let us increase the number to 36 OSDs per host.
If you need more information, just ask.
Podman version: 3.2.1
Ceph version: 16.2.4
OS version: Suse Leap 15.3
Greetings,
Jan
Hi,
Today while debugging something we had a few questions that might lead
to improving the cephfs forward scrub docs:
https://docs.ceph.com/en/latest/cephfs/scrub/
tldr:
1. Should we document which sorts of issues the forward scrub is
able to fix?
2. Can we make it more visible (in docs) that scrubbing is not
supported with multi-mds?
3. Isn't the new `ceph -s` scrub task status misleading with multi-mds?
Details here:
1) We found a CephFS directory with a number of zero-sized files:
# ls -l
...
-rw-r--r-- 1 1001890000 1001890000 0 Nov 3 11:58 upload_fc501199e3e7abe6b574101cf34aeefb.png
-rw-r--r-- 1 1001890000 1001890000 0 Nov 3 12:23 upload_fce4f55348185fefa0abdd8d11095ba8.gif
-rw-r--r-- 1 1001890000 1001890000 0 Nov 3 11:54 upload_fd95b8358851f0dac22fb775046a6163.png
...
The user claims that those files were non-zero sized last week. The
sequence of zero sized files includes *all* files written between Nov
2 and 9.
The user claims that his client was running out of memory, but this is
now fixed. So I suspect that his ceph client (kernel
3.10.0-1127.19.1.el7.x86_64) was not behaving well.
Anyway, I noticed that even though the dentries list 0 bytes, the
underlying rados objects have data, and the data looks good. E.g.
# rados get -p cephfs_data 200212e68b5.00000000 --namespace=xxx 200212e68b5.00000000
# file 200212e68b5.00000000
200212e68b5.00000000: PNG image data, 960 x 815, 8-bit/color RGBA, non-interlaced
So I managed to recover the files doing something like this (using an
input file mapping inode to filename) [see PS 0].
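
For reference, the object names used in this recovery follow the CephFS data object pattern: the file's inode number in hex, a dot, then the object index as eight hex digits. A minimal sketch of that mapping (the function name `cephfs_object_name` is made up for this example):

```shell
#!/bin/sh
# Build the RADOS object name for one chunk of a CephFS file:
#   <inode number in hex>.<object index as 8 hex digits>
# $1 = inode number (decimal), $2 = object index (decimal, starting at 0)
cephfs_object_name() {
    printf '%x.%08x\n' "$1" "$2"
}

# 2199579945141 is the decimal form of inode 0x200212e68b5 from the example above.
cephfs_object_name 2199579945141 0   # prints 200212e68b5.00000000
```

Note that the index is hexadecimal, so the object after `.00000009` is `.0000000a`, not `.00000010`.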
But I'm wondering if a forward scrub is able to fix this sort of
problem directly?
Should we document which sorts of issues the forward scrub is able to fix?
I tried to scrub it anyway, which led to:
# ceph tell mds.cephflax-mds-xxx scrub start /volumes/_nogroup/xxx recursive repair
Scrub is not currently supported for multiple active MDS. Please reduce max_mds to 1 and then scrub.
So ...
2) Shouldn't we update the doc to mention loud and clear that scrub is
not currently supported for multiple active MDS?
3) I was somewhat surprised by this, because I had thought that the new
`ceph -s` multi-mds scrub status implied that multi-mds scrubbing was
now working:
task status:
scrub status:
mds.x: idle
mds.y: idle
mds.z: idle
Is it worth reporting this task status for cephfs if we can't even scrub them?
Thanks!!
Dan
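[0]
mkdir -p recovered
# Input: one "<decimal inode> <filename>" pair per line.
# This only prints the recovery commands; review them, then pipe to a shell.
while read -r a b; do
    ino=$(printf "%x" "$a")    # RADOS object names use the inode in hex
    # Only the first ten objects (suffixes 00000000 to 00000009) are tried.
    for i in {0..9}; do
        obj="$ino.0000000$i"
        echo "rados stat --cluster=flax --pool=cephfs_data --namespace=xxx $obj && rados get --cluster=flax --pool=cephfs_data --namespace=xxx $obj $obj"
    done
    # Concatenate the fetched objects in order and move the result into place.
    echo "cat $ino.* > $ino"
    echo "mv $ino recovered/$b"
done < inones_fnames.txt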
Dear all,
I'm sorry if I'm asking the obvious or missing a previous discussion of
this, but I could not find the answer to my question online. Even a
pointer in the right direction would be appreciated.
The cephfs-mirror tool in Pacific looks extremely promising. How does it
work exactly? Is it based on files and (recursive) ctime or rather based on
object information? Does it handle incremental changes (only) between
snapshots?
There is an issue related to this that mentions recursive ctime. But that
would mean that users could "rsync -a" data to the file system and this
would not get synchronized.
I have good experience with ZFS, which is able to identify changes between
two snapshots A and B and then transfer only these changes (at a
sub-file level, on the ZFS equivalent of blocks, as I understand it) to
another server with the same file system that is in the exact state of
snapshot A. Does cephfs-mirror work the same way?
Best wishes,
Manuel