Hi, I am seeing an issue on one of our older ceph clusters (mimic 13.2.1) in an erasure
coded pool on bluestore OSDs in which we are seeing 1 inconsistent pg and 1 scrub error.
It should be noted that we have an ongoing rebalance of misplaced data that predates this
issue; it came from flapping OSDs caused by OSD_NEARFULL/OSD_TOOFULL warnings/errors,
which we corrected by removing some user data through ceph's rgw/s3 api interface (users'
s3 objects were deleted via the s3 api).
If anyone has suggestions or guidance for dealing with this, it would be very much
appreciated. I've included all the relevant/helpful information I can think of below;
if there is any additional information you think would be helpful, please let me know.
$ sudo ceph -s
  cluster:
    id:     6fa7ec72-79fb-4f45-8b9f-ea5cdc7ab18d
    health: HEALTH_ERR
            248317/437145405 objects misplaced (0.057%)
            1 scrub errors
            Possible data damage: 1 pg inconsistent

  services:
    mon: 3 daemons, quorum HW-CEPHM-AT01,HW-CEPHM-AT02,HW-CEPHM-AT03
    mgr: HW-CEPHM-AT02(active)
    osd: 109 osds: 107 up, 106 in; 2 remapped pgs
    rgw: 3 daemons active

  data:
    pools:   10 pools, 1380 pgs
    objects: 54.70 M objects, 68 TiB
    usage:   116 TiB used, 169 TiB / 285 TiB avail
    pgs:     248317/437145405 objects misplaced (0.057%)
             1374 active+clean
             3    active+clean+scrubbing+deep
             2    active+remapped+backfilling
             1    active+clean+inconsistent

  io:
    client:   28 KiB/s rd, 306 KiB/s wr, 26 op/s rd, 30 op/s wr
    recovery: 6.2 MiB/s, 4 objects/s
$ sudo ceph health detail
HEALTH_ERR 247241/437143405 objects misplaced (0.057%); 1 scrub errors; Possible data
damage: 1 pg inconsistent
OBJECT_MISPLACED 247241/437143405 objects misplaced (0.057%)
OSD_SCRUB_ERRORS 1 scrub errors
PG_DAMAGED Possible data damage: 1 pg inconsistent
pg 7.1 is active+clean+inconsistent, acting [2,57,51,15,20,28,9,39]
Examination of the OSD logs shows the error is on osd.2:
$ zgrep -Hn 'ERR' ceph-osd.2.log-20200614.gz
ceph-osd.2.log-20200614.gz:1292:2020-06-14 03:31:06.572 7f94591a9700 -1
log_channel(cluster) log [ERR] : 7.1s0 deep-scrub stat mismatch, got 213029/213030
objects, 0/0 clones, 213029/213030 dirty, 0/0 omap, 0/0 pinned, 0/0 hit_set_archive, 0/0
whiteouts, 292308615921/292308670959 bytes, 0/0 manifest objects, 0/0 hit_set_archive
bytes.
ceph-osd.2.log-20200614.gz:1293:2020-06-14 03:31:06.572 7f94591a9700 -1
log_channel(cluster) log [ERR] : 7.1 deep-scrub 1 errors
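For what it's worth, the mismatch in that log line works out to exactly one object: the
stats disagree by one object and 55,038 bytes (presumably the size of that one object).
Quick arithmetic on the numbers from the log:

```shell
# Deltas from the deep-scrub stat mismatch line above
echo $((213030 - 213029))              # object count delta: 1
echo $((292308670959 - 292308615921))  # byte delta: 55038
```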
All other OSDs appear to be clean of errors
The pg in question (7.1) has been instructed to repair/scrub/deep-scrub, but I do not see
any indication in its logs that it has run a scrub or repair (it does log a deep-scrub,
which comes back OK), and listing inconsistent objects seems to indicate no issues:
$ sudo rados list-inconsistent-pg default.rgw.buckets.data
["7.1"]
$ sudo ceph pg repair 7.1
instructing pg 7.1s0 on osd.2 to repair
$ sudo ceph pg scrub 7.1
instructing pg 7.1s0 on osd.2 to scrub
$ sudo ceph pg deep-scrub 7.1
instructing pg 7.1s0 on osd.2 to deep-scrub
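One thing that may be worth checking (this is an assumption on my part, not something the
output above confirms): OSDs will not start new scrubs while recovery/backfill is active
unless osd_scrub_during_recovery is enabled, and this cluster is still backfilling, which
could explain why the requested scrub/repair never shows up in osd.2's log. A sketch of
how to check the scrub stamps and that setting:

```shell
# When was pg 7.1 last (deep-)scrubbed? The stamps appear in the pg query JSON.
sudo ceph pg 7.1 query | grep -E '"last_(deep_)?scrub_stamp"'

# Is the primary allowed to scrub during recovery?
# (run on the host where osd.2 lives, via its admin socket)
sudo ceph daemon osd.2 config get osd_scrub_during_recovery
```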
$ grep -HnEi 'scrub|repair|deep-scrub' ceph-osd.2.log
ceph-osd.2.log:118:2020-06-14 07:28:10.139 7f94599aa700 0 log_channel(cluster) log
[DBG] : 7.91 deep-scrub starts
ceph-osd.2.log:177:2020-06-14 08:39:11.404 7f94599aa700 0 log_channel(cluster) log
[DBG] : 7.91 deep-scrub ok
ceph-osd.2.log:322:2020-06-14 12:17:31.405 7f94579a6700 0 log_channel(cluster) log
[DBG] : 13.135 deep-scrub starts
ceph-osd.2.log:323:2020-06-14 12:17:32.744 7f94579a6700 0 log_channel(cluster) log
[DBG] : 13.135 deep-scrub ok
ceph-osd.2.log:387:2020-06-14 13:40:35.941 7f94591a9700 0 log_channel(cluster) log
[DBG] : 7.d8 deep-scrub starts
ceph-osd.2.log:441:2020-06-14 14:49:06.111 7f94591a9700 0 log_channel(cluster) log
[DBG] : 7.d8 deep-scrub ok
Only the last deep-scrub was manually triggered
$ sudo rados list-inconsistent-obj 7.1 --format=json-pretty
{
    "epoch": 30869,
    "inconsistents": []
}
$ sudo rados list-inconsistent-obj 7.1s0 --format=json-pretty
{
    "epoch": 30869,
    "inconsistents": []
}
I'm not sure why no inconsistencies (an empty set) are reported in the above.
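My reading (unconfirmed, so treat it as an assumption) is that a stat mismatch is recorded
against the PG as a whole rather than against any individual object, so an empty
list-inconsistent-obj result may be expected here; a repair that actually runs should
recompute the stats. If scrubs are in fact being blocked by the ongoing backfill, one
option would be to temporarily allow scrubbing during recovery and retry the repair:

```shell
# Temporarily let the primary scrub while recovery is in progress
# (osd_scrub_during_recovery defaults to false), then retry the repair.
sudo ceph tell osd.2 injectargs '--osd_scrub_during_recovery=true'
sudo ceph pg repair 7.1

# Revert once the repair has run
sudo ceph tell osd.2 injectargs '--osd_scrub_during_recovery=false'
```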
Chris Shultz
Global Systems Architect
1 Stiles Road
Suite 202
Salem, NH 03079
United States
cshultz(a)korewireless.com
(m) 774.270.2679
korewireless.com