Good day
I'm currently decommissioning a cluster that runs EC3+1 (rack failure
domain, with 5 racks); however, the cluster still has some production data
on it, since I'm in the process of moving it to our new EC8+2 cluster.
Running Luminous 12.2.13 on Ubuntu 16 HWE, containerized with ceph-ansible
3.2.
I currently get the following errors after we lost one OSD (osd.195).
I've tried repair, scrub, deep scrub, restarting OSDs, etc., everything
mentioned in the troubleshooting docs and suggested on IRC, but cannot for
the life of me get it to resolve.
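For reference, the sort of thing I've been cycling through looks roughly
like this (PG and OSD ids taken from the output below; the restart unit
name is just an example and depends on the deployment):

ceph pg scrub 9.3dd
ceph pg deep-scrub 9.3dd
ceph pg repair 9.3dd
ceph pg force-recovery 9.3dd
systemctl restart ceph-osd@316   # restarting the OSDs in the PG, e.g. osd.316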
What I'm seeing is that pg 9.3dd (volume_images) reports one OSD as "osd
is down" (which I know about), but another OSD, 316, shows "not queried".
Also, pg 9.3dd has four functioning OSDs in its "up" set but still
references the missing OSD in "acting".
Regards
OBJECT_UNFOUND 1/501815192 objects unfound (0.000%)
pg 9.3dd has 1 unfound objects
OSD_SCRUB_ERRORS 1 scrub errors
PG_DAMAGED Possible data damage: 1 pg inconsistent
pg 9.3d1 is active+clean+inconsistent, acting [347,316,307,249]
PG_DEGRADED Degraded data redundancy: 1219/2001837265 objects degraded
(0.000%), 1 pg degraded, 1 pg undersized
pg 9.3dd is stuck undersized for 55486.439002, current state
active+recovery_wait+forced_recovery+undersized+degraded+remapped, last
acting [355,2147483647,64,367]
ceph pg 9.3dd query
"up": [
355,
139,
64,
367
],
"acting": [
355,
2147483647,
64,
367
"recovery_state": [
{
"name": "Started/Primary/Active",
"enter_time": "2021-03-08 16:07:51.239010",
"might_have_unfound": [
{
"osd": "64(2)",
"status": "already probed"
},
{
"osd": "139(1)",
"status": "already probed"
},
{
"osd": "195(1)",
"status": "osd is down"
},
{
"osd": "316(2)",
"status": "not queried"
},
{
"osd": "367(3)",
"status": "already probed"
}
],
"recovery_progress": {
"backfill_targets": [
"139(1)"
ceph pg 9.3d1 query
{
"state": "active+clean+inconsistent",
"snap_trimq": "[]",
"snap_trimq_len": 0,
"epoch": 168488,
"up": [
347,
316,
307,
249
],
"acting": [
347,
316,
307,
249
],
"actingbackfill": [
"249(3)",
"307(2)",
"316(1)",
"347(0)"
--
CLUSTER STATS:
cluster:
id: 1ea59fbe-46a4-474e-8225-a66b32ca86b7
health: HEALTH_ERR
1/490166525 objects unfound (0.000%)
1 scrub errors
Possible data damage: 1 pg inconsistent
Degraded data redundancy: 1259/1956055233 objects degraded
(0.000%), 1 pg degraded, 1 pg undersized
services:
mon: 3 daemons, quorum B-04-11-cephctl,B-05-11-cephctl,B-03-11-cephctl
mgr: B-03-11-cephctl(active), standbys: B-04-11-cephctl, B-05-11-cephctl
mds: cephfs-1/1/1 up {0=B-04-11-cephctl=up:active}, 2 up:standby
osd: 384 osds: 383 up, 383 in; 1 remapped pgs
data:
pools: 11 pools, 13440 pgs
objects: 490.17M objects, 1.35PiB
usage: 1.88PiB used, 2.33PiB / 4.21PiB avail
pgs: 1259/1956055233 objects degraded (0.000%)
1/490166525 objects unfound (0.000%)
13332 active+clean
96 active+clean+scrubbing+deep
10 active+clean+scrubbing
1 active+clean+inconsistent
1     active+recovery_wait+forced_recovery+undersized+degraded+remapped
Jeremi-Ernst Avenant, Mr.
Cloud Infrastructure Specialist
Inter-University Institute for Data Intensive Astronomy
5th Floor, Department of Physics and Astronomy,
University of Cape Town
Tel: 021 959 4137
Web: www.idia.ac.za
E-mail (IDIA): jeremi(a)idia.ac.za
Rondebosch, Cape Town, 7600, South Africa
Hi All,
We've been dealing with what seems to be a pretty annoying bug for a while
now. We are unable to delete a customer's bucket that seems to have an
extremely large number of aborted multipart uploads. I've had $(radosgw-admin
bucket rm --bucket=pusulax --purge-objects) running in a screen session for
almost 3 weeks now and it's still not finished; it's most likely stuck in a
loop or something. The screen session with debug-rgw=10 spams billions of
these messages:
2021-02-23 15:38:58.667 7f9b55704840 10
RGWRados::cls_bucket_list_unordered: got
_multipart_04/d3/04d33e18-3f13-433c-b924-56602d702d60-31.msg.2~0DTalUjTHsnIiKraN1klwIFO88Vc2E3.meta[]
2021-02-23 15:38:58.667 7f9b55704840 10
RGWRados::cls_bucket_list_unordered: got
_multipart_04/d7/04d7ad26-c8ec-4a39-9938-329acd6d9da7-102.msg.2~K_gAeTpfEongNvaOMNa0IFwSGPpQ1iA.meta[]
2021-02-23 15:38:58.667 7f9b55704840 10
RGWRados::cls_bucket_list_unordered: got
_multipart_04/da/04da4147-c949-4c3a-aca6-e63298f5ff62-102.msg.2~-hXBSFcjQKbMkiyEqSgLaXMm75qFzEp.meta[]
2021-02-23 15:38:58.667 7f9b55704840 10
RGWRados::cls_bucket_list_unordered: got
_multipart_04/db/04dbb0e6-dfb0-42fb-9d0f-49cceb18457f-102.msg.2~B5EhGgBU5U_U7EA5r8IhVpO3Aj2OvKg.meta[]
2021-02-23 15:38:58.667 7f9b55704840 10
RGWRados::cls_bucket_list_unordered: got
_multipart_04/df/04df39be-06ab-4c72-bc63-3fac1d2700a9-11.msg.2~_8h5fWlkNrIMqcrZgNbAoJfc8BN1Xx-.meta[]
This is probably the 2nd or 3rd time I've been unable to delete this
bucket. I also tried running $(radosgw-admin bucket check --fix
--check-objects --bucket=pusulax) before kicking off the delete job, but
that didn't work either. Here is the bucket in question; the num_objects
counter never decreases after trying to delete the bucket:
[root@os5 ~]# radosgw-admin bucket stats --bucket=pusulax
{
"bucket": "pusulax",
"num_shards": 144,
"tenant": "",
"zonegroup": "dbb69c5b-b33f-4af2-950c-173d695a4d2c",
"placement_rule": "default-placement",
"explicit_placement": {
"data_pool": "",
"data_extra_pool": "",
"index_pool": ""
},
"id": "c0d0b8a5-c63c-4c24-9dab-8deee88dbf0b.3209338.4",
"marker": "c0d0b8a5-c63c-4c24-9dab-8deee88dbf0b.3292800.7",
"index_type": "Normal",
"owner": "REDACTED",
"ver":
"0#115613,1#115196,2#115884,3#115497,4#114649,5#114150,6#116127,7#114269,8#115220,9#115092,10#114003,11#114538,12#115235,13#113463,14#114928,15#115135,16#115535,17#114867,18#116010,19#115766,20#115274,21#114818,22#114805,23#114853,24#114099,25#114359,26#114966,27#115790,28#114572,29#114826,30#114767,31#115614,32#113995,33#115305,34#114227,35#114342,36#114144,37#114704,38#114088,39#114738,40#114133,41#114520,42#114420,43#114168,44#113820,45#115093,46#114788,47#115522,48#114713,49#115315,50#115055,51#114513,52#114086,53#114401,54#114079,55#113649,56#114089,57#114157,58#114064,59#115224,60#114753,61#114686,62#115169,63#114321,64#114949,65#115075,66#115003,67#114993,68#115320,69#114392,70#114893,71#114219,72#114190,73#114868,74#113432,75#114882,76#115300,77#114755,78#114598,79#114221,80#114895,81#114031,82#114566,83#113849,84#115155,85#113790,86#113334,87#113800,88#114856,89#114841,90#115073,91#113849,92#114554,93#114820,94#114256,95#113840,96#114838,97#113784,98#114876,99#115524,100#115686,101#112969,102#112156,103#112635,104#112732,105#112933,106#112412,107#113090,108#112239,109#112697,110#113444,111#111730,112#112446,113#114479,114#113318,115#113032,116#112048,117#112404,118#114545,119#112563,120#112341,121#112518,122#111719,123#112273,124#112014,125#112979,126#112209,127#112830,128#113186,129#112944,130#111991,131#112865,132#112688,133#113819,134#112586,135#113275,136#112172,137#113019,138#112872,139#113130,140#112716,141#112091,142#111859,143#112773",
"master_ver":
"0#0,1#0,2#0,3#0,4#0,5#0,6#0,7#0,8#0,9#0,10#0,11#0,12#0,13#0,14#0,15#0,16#0,17#0,18#0,19#0,20#0,21#0,22#0,23#0,24#0,25#0,26#0,27#0,28#0,29#0,30#0,31#0,32#0,33#0,34#0,35#0,36#0,37#0,38#0,39#0,40#0,41#0,42#0,43#0,44#0,45#0,46#0,47#0,48#0,49#0,50#0,51#0,52#0,53#0,54#0,55#0,56#0,57#0,58#0,59#0,60#0,61#0,62#0,63#0,64#0,65#0,66#0,67#0,68#0,69#0,70#0,71#0,72#0,73#0,74#0,75#0,76#0,77#0,78#0,79#0,80#0,81#0,82#0,83#0,84#0,85#0,86#0,87#0,88#0,89#0,90#0,91#0,92#0,93#0,94#0,95#0,96#0,97#0,98#0,99#0,100#0,101#0,102#0,103#0,104#0,105#0,106#0,107#0,108#0,109#0,110#0,111#0,112#0,113#0,114#0,115#0,116#0,117#0,118#0,119#0,120#0,121#0,122#0,123#0,124#0,125#0,126#0,127#0,128#0,129#0,130#0,131#0,132#0,133#0,134#0,135#0,136#0,137#0,138#0,139#0,140#0,141#0,142#0,143#0",
"mtime": "2020-06-17 20:27:16.685833Z",
"max_marker":
"0#,1#,2#,3#,4#,5#,6#,7#,8#,9#,10#,11#,12#,13#,14#,15#,16#,17#,18#,19#,20#,21#,22#,23#,24#,25#,26#,27#,28#,29#,30#,31#,32#,33#,34#,35#,36#,37#,38#,39#,40#,41#,42#,43#,44#,45#,46#,47#,48#,49#,50#,51#,52#,53#,54#,55#,56#,57#,58#,59#,60#,61#,62#,63#,64#,65#,66#,67#,68#,69#,70#,71#,72#,73#,74#,75#,76#,77#,78#,79#,80#,81#,82#,83#,84#,85#,86#,87#,88#,89#,90#,91#,92#,93#,94#,95#,96#,97#,98#,99#,100#,101#,102#,103#,104#,105#,106#,107#,108#,109#,110#,111#,112#,113#,114#,115#,116#,117#,118#,119#,120#,121#,122#,123#,124#,125#,126#,127#,128#,129#,130#,131#,132#,133#,134#,135#,136#,137#,138#,139#,140#,141#,142#,143#",
"usage": {
"rgw.multimeta": {
"size": 0,
"size_actual": 0,
"size_utilized": 97009704,
"size_kb": 0,
"size_kb_actual": 0,
"size_kb_utilized": 94737,
"num_objects": 3592952
}
},
"bucket_quota": {
"enabled": false,
"check_on_raw": false,
"max_size": -1,
"max_size_kb": 0,
"max_objects": -1
}
}
We're running 14.2.16 on our RGWs and OSD nodes. Anyone have any ideas? Is
it possible to target this bucket via rados directly to try and delete it?
I'm wary of doing stuff like that, though. Thanks in advance.
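To be concrete about the rados idea, what I had in mind is roughly this
(not actually run; the data pool name is just our default, and the grep
pattern is the bucket marker from the stats above, which the data objects
should carry as a prefix):

rados -p default.rgw.buckets.data ls | grep c0d0b8a5-c63c-4c24-9dab-8deee88dbf0b.3292800.7
rados -p default.rgw.buckets.data rm <object>   # per object from the listing

Given the object counts above, even the listing would take a very long time.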
- Dave Monschein
Howdy,
After the IBM acquisition of Red Hat, the landscape for CentOS quickly changed.
As I understand it, right now Ceph 14 is the last version that will run on CentOS/EL7, and CentOS 8 was "killed off".
So given that, if you were going to build a Ceph cluster today, would you even bother doing it using a non-commercial distribution, or would you just use RHEL 8 (or even their commercial Ceph product)?
Secondly, are we expecting IBM to "kill off" Ceph as well?
Thanks,
-Drew
Hi!
I was upgrading from 15.2.8 to 15.2.9 via `ceph orch upgrade` (Ubuntu Bionic). One OSD seemed to have failed to upgrade, so I just redeployed it.
The OSD is up/in, but this warning is not clearing:
UPGRADE_REDEPLOY_DAEMON: Upgrading daemon osd.3 on host ceph-osd4 failed.
It seems the warning is not listed in the docs, so I'm wondering how to go about getting the cluster back into HEALTH_OK.
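The only workaround I've come up with so far is muting the check, e.g.:

ceph health mute UPGRADE_REDEPLOY_DAEMON

but that feels like hiding the warning rather than actually clearing it.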
Hope you can help,
Samy
Hi Drew,
On Friday, March 5th, 2021 at 20:22, Drew Weaver <drew.weaver(a)thenap.com> wrote:
> Sorry for multi-reply, I got that command to run:
>
> for obj in $(rados -p default.rgw.buckets.index ls | grep 2b67ef7c-2015-4ca0-bf50-b7595d01e46e.74194.637); do printf "%-60s %7d\n" $obj $(rados -p default.rgw.buckets.index listomapkeys $obj | wc -l); done;
> .dir.2b67ef7c-2015-4ca0-bf50-b7595d01e46e.74194.637.0 10423
> .dir.2b67ef7c-2015-4ca0-bf50-b7595d01e46e.74194.637.15 10445
> .dir.2b67ef7c-2015-4ca0-bf50-b7595d01e46e.74194.637.3 10542
> .dir.2b67ef7c-2015-4ca0-bf50-b7595d01e46e.74194.637.6 10511
> .dir.2b67ef7c-2015-4ca0-bf50-b7595d01e46e.74194.637.13 10414
> .dir.2b67ef7c-2015-4ca0-bf50-b7595d01e46e.74194.637.12 10479
> .dir.2b67ef7c-2015-4ca0-bf50-b7595d01e46e.74194.637.2 10486
> .dir.2b67ef7c-2015-4ca0-bf50-b7595d01e46e.74194.637.5 10448
> .dir.2b67ef7c-2015-4ca0-bf50-b7595d01e46e.74194.637.4 10470
> .dir.2b67ef7c-2015-4ca0-bf50-b7595d01e46e.74194.637.8 10474
> .dir.2b67ef7c-2015-4ca0-bf50-b7595d01e46e.74194.637.1 10470
> .dir.2b67ef7c-2015-4ca0-bf50-b7595d01e46e.74194.637.10 10411
> .dir.2b67ef7c-2015-4ca0-bf50-b7595d01e46e.74194.637.7 10445
> .dir.2b67ef7c-2015-4ca0-bf50-b7595d01e46e.74194.637.14 10413
> .dir.2b67ef7c-2015-4ca0-bf50-b7595d01e46e.74194.637.9 10356
> .dir.2b67ef7c-2015-4ca0-bf50-b7595d01e46e.74194.637.11 10410
So this looks fine; it means that this bucket is sharded properly and the index keys are evenly distributed.
However, the problematic object is `.dir.2b67ef7c-2015-4ca0-bf50-b7595d01e46e.74194.213.0`; you said there's no bucket with that ID, which means it's probably a stale instance. Does it still exist in `default.rgw.buckets.index`? If so, you can try running `radosgw-admin reshard stale-instances rm`, it should get rid of it (you can verify afterwards with `rados -p default.rgw.buckets.index ls`).
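Concretely, something along these lines (using the instance ID from your earlier output):

# is the stale instance's index object still there?
rados -p default.rgw.buckets.index ls | grep 2b67ef7c-2015-4ca0-bf50-b7595d01e46e.74194.213

# list and remove stale bucket instances left behind by resharding
radosgw-admin reshard stale-instances list
radosgw-admin reshard stale-instances rm

# verify it is gone
rados -p default.rgw.buckets.index ls | grep 2b67ef7c-2015-4ca0-bf50-b7595d01e46e.74194.213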
--
Ben
Hi,
if the host that the grafana-api-url points to fails (in the example
below, ceph01.hostxyz.tld:3000), the Ceph Dashboard can't display Grafana data:
# ceph dashboard get-grafana-api-url
https://ceph01.hostxyz.tld:3000
Is it possible to automagically switch to another host?
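Right now the only workaround I can think of is re-pointing it by hand
whenever ceph01 goes down, e.g. (assuming a second Grafana instance is
running on ceph02.hostxyz.tld):

ceph dashboard set-grafana-api-url https://ceph02.hostxyz.tld:3000

but I'd prefer something that fails over automatically.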
Thanks, Erich