On 04/09/2020 13:50, Eugen Block wrote:
Hi,
Wido had an idea in a different thread [1], you could try to advise the
OSDs to compact at boot:
[osd]
osd_compact_on_start = true
This is in master only, not yet in any release.
Can you give that a shot?
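Until that option is available, you could also trigger a compaction manually. For an OSD that won't stay up, an offline compaction with the OSD stopped, something like:

  # ceph-kvstore-tool bluestore-kv /var/lib/ceph/osd/ceph-34 compact

or, for OSDs that are still running:

  # ceph tell osd.34 compact

(osd.34 is just an example id here.)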
Wido also reported something about large OSD memory in [2], but no one
has commented yet.
Still seeing that problem indeed, I haven't been able to solve it.
Wido
> Regards,
> Eugen
>
>
> [1]
>
https://lists.ceph.io/hyperkitty/list/ceph-users@ceph.io/thread/EDL7U5EWFHS…
>
> [2]
>
https://lists.ceph.io/hyperkitty/list/ceph-users@ceph.io/thread/F5MOI47FIVS…
>
>
>
> Zitat von Vahideh Alinouri <vahideh.alinouri(a)gmail.com>:
>
>> Is there any solution or advice?
>>
>> On Tue, Sep 1, 2020, 11:53 AM Vahideh Alinouri
>> <vahideh.alinouri(a)gmail.com>
>> wrote:
>>
>>> One of the failed osds with the 3G memory target was started, and
>>> dump_mempools shows total RAM usage of 18G, with buffer_anon using 17G!
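>>>
>>> The numbers come from the osd admin socket, roughly:
>>>
>>>   # ceph daemon osd.34 dump_mempools
>>>
>>> (osd.34 being just an example id)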
>>>
>>> On Mon, Aug 31, 2020 at 6:24 PM Vahideh Alinouri <
>>> vahideh.alinouri(a)gmail.com> wrote:
>>>
>>>> The osd_memory_target of the failed osd on one ceph-osd node was changed
>>>> to 6G while the other osds' memory_target stayed at 3G. Starting the
>>>> failed osd with the 6G memory_target causes other osds on that ceph-osd
>>>> node to go "down", and the failed osd is still down.
>>>>
>>>> On Mon, Aug 31, 2020 at 2:19 PM Eugen Block <eblock(a)nde.ag> wrote:
>>>>
>>>>> Can you try the opposite and turn up the memory_target and only try to
>>>>> start a single OSD?
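>>>>>
>>>>> For example (osd.34 as one of the failed osds and 6 GiB as the target,
>>>>> both just examples), something like:
>>>>>
>>>>>   # ceph config set osd.34 osd_memory_target 6442450944
>>>>>   # systemctl start ceph-osd@34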
>>>>>
>>>>>
>>>>> Zitat von Vahideh Alinouri <vahideh.alinouri(a)gmail.com>:
>>>>>
>>>>> > osd_memory_target was changed to 3G; starting the failed osd causes the
>>>>> > ceph-osd nodes to crash, and the failed osd is still "down".
>>>>> >
>>>>> > On Fri, Aug 28, 2020 at 1:13 PM Vahideh Alinouri <
>>>>> vahideh.alinouri(a)gmail.com>
>>>>> > wrote:
>>>>> >
>>>>> >> Yes, each osd node has 7 osds with 4 GB memory_target.
>>>>> >>
>>>>> >>
>>>>> >> On Fri, Aug 28, 2020, 12:48 PM Eugen Block <eblock(a)nde.ag> wrote:
>>>>> >>
>>>>> >>> Just to confirm, each OSD node has 7 OSDs with 4 GB memory_target?
>>>>> >>> That leaves only 4 GB RAM for the rest, and under heavy load the OSDs
>>>>> >>> use even more. I would suggest reducing the memory_target to 3 GB and
>>>>> >>> seeing if they start successfully.
>>>>> >>>
>>>>> >>>
>>>>> >>> Zitat von Vahideh Alinouri <vahideh.alinouri(a)gmail.com>:
>>>>> >>>
>>>>> >>> > osd_memory_target is 4294967296.
>>>>> >>> > Cluster setup:
>>>>> >>> > 3 mon, 3 mgr, 21 osds on 3 ceph-osd nodes in an lvm scenario. Each
>>>>> >>> > ceph-osd node has 32G RAM, a 4-core CPU and 4TB osd disks; 9 osds
>>>>> >>> > have block.wal on SSDs. The public network is 1G and the cluster
>>>>> >>> > network is 10G. The cluster was installed and upgraded using
>>>>> >>> > ceph-ansible.
>>>>> >>> >
>>>>> >>> > On Thu, Aug 27, 2020 at 7:01 PM Eugen Block <eblock(a)nde.ag> wrote:
>>>>> >>> >
>>>>> >>> >> What is the memory_target for your OSDs? Can you share more details
>>>>> >>> >> about your setup? You write about high memory usage; are the OSD
>>>>> >>> >> nodes affected by the OOM killer? You could try to reduce the
>>>>> >>> >> osd_memory_target and see if that helps bring the OSDs back up.
>>>>> >>> >> Splitting the PGs is a very heavy operation.
>>>>> >>> >>
>>>>> >>> >>
>>>>> >>> >> Zitat von Vahideh Alinouri <vahideh.alinouri(a)gmail.com>:
>>>>> >>> >>
>>>>> >>> >> > The Ceph cluster was upgraded from nautilus to octopus. On the
>>>>> >>> >> > ceph-osd nodes we have high I/O wait.
>>>>> >>> >> >
>>>>> >>> >> > After increasing one pool's pg_num from 64 to 128 according to the
>>>>> >>> >> > warning message (more objects per pg), cpu load and ram usage on
>>>>> >>> >> > the ceph-osd nodes went up and finally the whole cluster crashed.
>>>>> >>> >> > Three osds, one on each host, are stuck in the down state (osd.34,
>>>>> >>> >> > osd.35, osd.40).
>>>>> >>> >> >
>>>>> >>> >> > Starting the down osd's service causes high ram usage and cpu load
>>>>> >>> >> > and makes the ceph-osd node crash, until the osd service fails.
>>>>> >>> >> >
>>>>> >>> >> > The active mgr service on each mon host will crash after consuming
>>>>> >>> >> > almost all available ram on the physical hosts.
>>>>> >>> >> >
>>>>> >>> >> > I need to recover the pgs and fix the corruption. How can I recover
>>>>> >>> >> > the unknown and down pgs? Is there any way to start up the failed
>>>>> >>> >> > osds?
>>>>> >>> >> >
>>>>> >>> >> >
>>>>> >>> >> > The following steps have been done:
>>>>> >>> >> >
>>>>> >>> >> > 1- The osd nodes' kernel was upgraded to 5.4.2 before the ceph
>>>>> >>> >> > cluster upgrade. Reverting to the previous kernel 4.2.1 was tested
>>>>> >>> >> > to decrease the iowait, but it had no effect.
>>>>> >>> >> >
>>>>> >>> >> > 2- Recovering 11 pgs from the failed osds by exporting them with
>>>>> >>> >> > the ceph-objectstore-tool utility and importing them on other osds
>>>>> >>> >> > (see the example commands below). The result: 9 pgs are "down" and
>>>>> >>> >> > 2 pgs are "unknown".
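>>>>> >>> >> >
>>>>> >>> >> > (The export/import was done roughly like this, with the source osd
>>>>> >>> >> > stopped; the osd ids and the file path are just examples:
>>>>> >>> >> >
>>>>> >>> >> > # ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-34 \
>>>>> >>> >> >     --pgid 2.39 --op export --file /tmp/pg2.39.export
>>>>> >>> >> > # ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-37 \
>>>>> >>> >> >     --op import --file /tmp/pg2.39.export )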
>>>>> >>> >> >
>>>>> >>> >> > 2-1) 9 pgs were exported and imported successfully, but their status
>>>>> >>> >> > is "down" because peering is blocked by the 3 failed osds
>>>>> >>> >> > ("peering_blocked_by"). I cannot mark the osds lost, to keep the
>>>>> >>> >> > unknown pgs from getting lost. These pgs are only KBs and MBs in
>>>>> >>> >> > size.
>>>>> >>> >> >
>>>>> >>> >> > "peering_blocked_by": [
>>>>> >>> >> >
>>>>> >>> >> > {
>>>>> >>> >> >
>>>>> >>> >> > "osd": 34,
>>>>> >>> >> >
>>>>> >>> >> > "current_lost_at": 0,
>>>>> >>> >> >
>>>>> >>> >> > "comment": "starting or
marking this osd lost may let us
>>>>> proceed"
>>>>> >>> >> >
>>>>> >>> >> > },
>>>>> >>> >> >
>>>>> >>> >> > {
>>>>> >>> >> >
>>>>> >>> >> > "osd": 35,
>>>>> >>> >> >
>>>>> >>> >> > "current_lost_at": 0,
>>>>> >>> >> >
>>>>> >>> >> > "comment": "starting or
marking this osd lost may let us
>>>>> proceed"
>>>>> >>> >> >
>>>>> >>> >> > },
>>>>> >>> >> >
>>>>> >>> >> > {
>>>>> >>> >> >
>>>>> >>> >> > "osd": 40,
>>>>> >>> >> >
>>>>> >>> >> > "current_lost_at": 0,
>>>>> >>> >> >
>>>>> >>> >> > "comment": "starting or
marking this osd lost may let us
>>>>> proceed"
>>>>> >>> >> >
>>>>> >>> >> > }
>>>>> >>> >> >
>>>>> >>> >> > ]
>>>>> >>> >> >
>>>>> >>> >> >
>>>>> >>> >> > 2-2) 1 pg (2.39) was exported and imported successfully, but after
>>>>> >>> >> > starting the osd service (the pg was imported into that osd), the
>>>>> >>> >> > ceph-osd node's RAM and CPU consumption increase and crash the node
>>>>> >>> >> > until the osd service fails. The other osds on that ceph-osd node
>>>>> >>> >> > become "down". The pg status is "unknown". I cannot use
>>>>> >>> >> > "force-create-pg" because of data loss. pg 2.39 is 19G in size.
>>>>> >>> >> >
>>>>> >>> >> > # ceph pg map 2.39
>>>>> >>> >> >
>>>>> >>> >> > osdmap e40347 pg 2.39 (2.39) -> up [32,37] acting [32,37]
>>>>> >>> >> >
>>>>> >>> >> > # ceph pg 2.39 query
>>>>> >>> >> >
>>>>> >>> >> > Error ENOENT: i don't have pgid 2.39
>>>>> >>> >> >
>>>>> >>> >> >
>>>>> >>> >> > * pg 2.39 info on failed osd:
>>>>> >>> >> >
>>>>> >>> >> > # ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-34 --op info --pgid 2.39
>>>>> >>> >> >
>>>>> >>> >> > {
>>>>> >>> >> >     "pgid": "2.39",
>>>>> >>> >> >     "last_update": "35344'6456084",
>>>>> >>> >> >     "last_complete": "35344'6456084",
>>>>> >>> >> >     "log_tail": "35344'6453182",
>>>>> >>> >> >     "last_user_version": 10595821,
>>>>> >>> >> >     "last_backfill": "MAX",
>>>>> >>> >> >     "purged_snaps": [],
>>>>> >>> >> >     "history": {
>>>>> >>> >> >         "epoch_created": 146,
>>>>> >>> >> >         "epoch_pool_created": 79,
>>>>> >>> >> >         "last_epoch_started": 25208,
>>>>> >>> >> >         "last_interval_started": 25207,
>>>>> >>> >> >         "last_epoch_clean": 25208,
>>>>> >>> >> >         "last_interval_clean": 25207,
>>>>> >>> >> >         "last_epoch_split": 370,
>>>>> >>> >> >         "last_epoch_marked_full": 0,
>>>>> >>> >> >         "same_up_since": 8347,
>>>>> >>> >> >         "same_interval_since": 25207,
>>>>> >>> >> >         "same_primary_since": 8321,
>>>>> >>> >> >         "last_scrub": "35328'6440139",
>>>>> >>> >> >         "last_scrub_stamp": "2020-08-19T12:00:59.377593+0430",
>>>>> >>> >> >         "last_deep_scrub": "35261'6031075",
>>>>> >>> >> >         "last_deep_scrub_stamp": "2020-08-17T01:59:26.606037+0430",
>>>>> >>> >> >         "last_clean_scrub_stamp": "2020-08-19T12:00:59.377593+0430",
>>>>> >>> >> >         "prior_readable_until_ub": 0
>>>>> >>> >> >     },
>>>>> >>> >> >     "stats": {
>>>>> >>> >> >         "version": "35344'6456082",
>>>>> >>> >> >         "reported_seq": "11733156",
>>>>> >>> >> >         "reported_epoch": "35344",
>>>>> >>> >> >         "state": "active+clean",
>>>>> >>> >> >         "last_fresh": "2020-08-19T14:16:18.587435+0430",
>>>>> >>> >> >         "last_change": "2020-08-19T12:00:59.377747+0430",
>>>>> >>> >> >         "last_active": "2020-08-19T14:16:18.587435+0430",
>>>>> >>> >> >         "last_peered": "2020-08-19T14:16:18.587435+0430",
>>>>> >>> >> >         "last_clean": "2020-08-19T14:16:18.587435+0430",
>>>>> >>> >> >         "last_became_active": "2020-08-06T00:23:51.016769+0430",
>>>>> >>> >> >         "last_became_peered": "2020-08-06T00:23:51.016769+0430",
>>>>> >>> >> >         "last_unstale": "2020-08-19T14:16:18.587435+0430",
>>>>> >>> >> >         "last_undegraded": "2020-08-19T14:16:18.587435+0430",
>>>>> >>> >> >         "last_fullsized": "2020-08-19T14:16:18.587435+0430",
>>>>> >>> >> >         "mapping_epoch": 8347,
>>>>> >>> >> >         "log_start": "35344'6453182",
>>>>> >>> >> >         "ondisk_log_start": "35344'6453182",
>>>>> >>> >> >         "created": 146,
>>>>> >>> >> >         "last_epoch_clean": 25208,
>>>>> >>> >> >         "parent": "0.0",
>>>>> >>> >> >         "parent_split_bits": 7,
>>>>> >>> >> >         "last_scrub": "35328'6440139",
>>>>> >>> >> >         "last_scrub_stamp": "2020-08-19T12:00:59.377593+0430",
>>>>> >>> >> >         "last_deep_scrub": "35261'6031075",
>>>>> >>> >> >         "last_deep_scrub_stamp": "2020-08-17T01:59:26.606037+0430",
>>>>> >>> >> >         "last_clean_scrub_stamp": "2020-08-19T12:00:59.377593+0430",
>>>>> >>> >> >         "log_size": 2900,
>>>>> >>> >> >         "ondisk_log_size": 2900,
>>>>> >>> >> >         "stats_invalid": false,
>>>>> >>> >> >         "dirty_stats_invalid": false,
>>>>> >>> >> >         "omap_stats_invalid": false,
>>>>> >>> >> >         "hitset_stats_invalid": false,
>>>>> >>> >> >         "hitset_bytes_stats_invalid": false,
>>>>> >>> >> >         "pin_stats_invalid": false,
>>>>> >>> >> >         "manifest_stats_invalid": false,
>>>>> >>> >> >         "snaptrimq_len": 0,
>>>>> >>> >> >         "stat_sum": {
>>>>> >>> >> >             "num_bytes": 19749578960,
>>>>> >>> >> >             "num_objects": 2442,
>>>>> >>> >> >             "num_object_clones": 20,
>>>>> >>> >> >             "num_object_copies": 7326,
>>>>> >>> >> >             "num_objects_missing_on_primary": 0,
>>>>> >>> >> >             "num_objects_missing": 0,
>>>>> >>> >> >             "num_objects_degraded": 0,
>>>>> >>> >> >             "num_objects_misplaced": 0,
>>>>> >>> >> >             "num_objects_unfound": 0,
>>>>> >>> >> >             "num_objects_dirty": 2442,
>>>>> >>> >> >             "num_whiteouts": 0,
>>>>> >>> >> >             "num_read": 16120686,
>>>>> >>> >> >             "num_read_kb": 82264126,
>>>>> >>> >> >             "num_write": 19731882,
>>>>> >>> >> >             "num_write_kb": 379030181,
>>>>> >>> >> >             "num_scrub_errors": 0,
>>>>> >>> >> >             "num_shallow_scrub_errors": 0,
>>>>> >>> >> >             "num_deep_scrub_errors": 0,
>>>>> >>> >> >             "num_objects_recovered": 2861,
>>>>> >>> >> >             "num_bytes_recovered": 21673259070,
>>>>> >>> >> >             "num_keys_recovered": 32,
>>>>> >>> >> >             "num_objects_omap": 2,
>>>>> >>> >> >             "num_objects_hit_set_archive": 0,
>>>>> >>> >> >             "num_bytes_hit_set_archive": 0,
>>>>> >>> >> >             "num_flush": 0,
>>>>> >>> >> >             "num_flush_kb": 0,
>>>>> >>> >> >             "num_evict": 0,
>>>>> >>> >> >             "num_evict_kb": 0,
>>>>> >>> >> >             "num_promote": 0,
>>>>> >>> >> >             "num_flush_mode_high": 0,
>>>>> >>> >> >             "num_flush_mode_low": 0,
>>>>> >>> >> >             "num_evict_mode_some": 0,
>>>>> >>> >> >             "num_evict_mode_full": 0,
>>>>> >>> >> >             "num_objects_pinned": 0,
>>>>> >>> >> >             "num_legacy_snapsets": 0,
>>>>> >>> >> >             "num_large_omap_objects": 0,
>>>>> >>> >> >             "num_objects_manifest": 0,
>>>>> >>> >> >             "num_omap_bytes": 152,
>>>>> >>> >> >             "num_omap_keys": 16,
>>>>> >>> >> >             "num_objects_repaired": 0
>>>>> >>> >> >         },
>>>>> >>> >> >         "up": [
>>>>> >>> >> >             40,
>>>>> >>> >> >             35,
>>>>> >>> >> >             34
>>>>> >>> >> >         ],
>>>>> >>> >> >         "acting": [
>>>>> >>> >> >             40,
>>>>> >>> >> >             35,
>>>>> >>> >> >             34
>>>>> >>> >> >         ],
>>>>> >>> >> >         "avail_no_missing": [],
>>>>> >>> >> >         "object_location_counts": [],
>>>>> >>> >> >         "blocked_by": [],
>>>>> >>> >> >         "up_primary": 40,
>>>>> >>> >> >         "acting_primary": 40,
>>>>> >>> >> >         "purged_snaps": []
>>>>> >>> >> >     },
>>>>> >>> >> >     "empty": 0,
>>>>> >>> >> >     "dne": 0,
>>>>> >>> >> >     "incomplete": 0,
>>>>> >>> >> >     "last_epoch_started": 25208,
>>>>> >>> >> >     "hit_set_history": {
>>>>> >>> >> >         "current_last_update": "0'0",
>>>>> >>> >> >         "history": []
>>>>> >>> >> >     }
>>>>> >>> >> > }
>>>>> >>> >> >
>>>>> >>> >> >
>>>>> >>> >> > * pg 2.39 info on the osd it was imported to:
>>>>> >>> >> >
>>>>> >>> >> > # ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-37 --op info --pgid 2.39
>>>>> >>> >> >
>>>>> >>> >> > PG '2.39' not found
>>>>> >>> >> >
>>>>> >>> >> >
>>>>> >>> >> > 2-3) 1 pg (2.79) is lost! This pg is not found on any of the three
>>>>> >>> >> > failed osds (osd.34, osd.35, osd.40)! Its status is "unknown".
>>>>> >>> >> > Exporting pg 2.79 fails with "PG '2.79' not found".
>>>>> >>> >> >
>>>>> >>> >> >
>>>>> >>> >> >
>>>>> >>> >> > # ceph pg map 2.79
>>>>> >>> >> >
>>>>> >>> >> > Error ENOENT: i don't have pgid 2.79
>>>>> >>> >> >
>>>>> >>> >> > # ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-34 --op info --pgid 2.79
>>>>> >>> >> >
>>>>> >>> >> > PG '2.79' not found
>>>>> >>> >> >
>>>>> >>> >> >
>>>>> >>> >> > 3- Using https://gitlab.lbader.de/kryptur/ceph-recovery/tree/master,
>>>>> >>> >> > but it does not work with recent ceph versions and was only tested
>>>>> >>> >> > on the "hammer" release.
>>>>> >>> >> >
>>>>> >>> >> > 4- Using https://ceph.io/planet/recovering-from-a-complete-node-failure/,
>>>>> >>> >> > but in the lvm scenario I could not mount the failed osd's lv to a
>>>>> >>> >> > new /var/lib/ceph/osd/ceph-x. I could not prepare and activate a new
>>>>> >>> >> > osd on the failed osd's disk.
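>>>>> >>> >> >
>>>>> >>> >> > (For the lvm case, activating the existing osd would normally be
>>>>> >>> >> > attempted with something like the following; the id and fsid come
>>>>> >>> >> > from the list output:
>>>>> >>> >> >
>>>>> >>> >> > # ceph-volume lvm list
>>>>> >>> >> > # ceph-volume lvm activate <osd-id> <osd-fsid> )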
>>>>> >>> >> >
>>>>> >>> >> > 5- Setting min_size=1 on the pool the down pgs belong to and
>>>>> >>> >> > restarting the osds the pgs were imported to, but no change.
>>>>> >>> >> >
>>>>> >>> >> > 6- Setting min_size=1 on the pool pg 2.39 belongs to and restarting
>>>>> >>> >> > the osds the pg was imported to, but no change.
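>>>>> >>> >> >
>>>>> >>> >> > (min_size was changed with the usual pool command, e.g.
>>>>> >>> >> > # ceph osd pool set <pool-name> min_size 1 )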
>>>>> >>> >> >
>>>>> >>> >> > 7- Repairing the failed osds using ceph-objectstore-tool, marking
>>>>> >>> >> > them "in" and starting them, but no change.
>>>>> >>> >> >
>>>>> >>> >> > # ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-x --op repair
>>>>> >>> >> >
>>>>> >>> >> >
>>>>> >>> >> > 8- Repairing 2 unknown pgs, but no changes.
>>>>> >>> >> >
>>>>> >>> >> > # ceph pg repair 2.39
>>>>> >>> >> >
>>>>> >>> >> > # ceph pg repair 2.79
>>>>> >>> >> >
>>>>> >>> >> > 9- Forcing recovery of the 2 unknown pgs, but no changes.
>>>>> >>> >> >
>>>>> >>> >> > # ceph pg force-recovery 2.39
>>>>> >>> >> >
>>>>> >>> >> > # ceph pg force-recovery 2.79
>>>>> >>> >> >
>>>>> >>> >> > 10- Checked the PID limit on the ceph-osd nodes, since the osd
>>>>> >>> >> > services failed to start.
>>>>> >>> >> >
>>>>> >>> >> > kernel.pid.max = 4194304
>>>>> >>> >> >
>>>>> >>> >> > 11- Raising osd_op_thread_suicide_timeout=900, but no change.
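>>>>> >>> >> >
>>>>> >>> >> > (set e.g. via the central config, roughly:
>>>>> >>> >> > # ceph config set osd osd_op_thread_suicide_timeout 900
>>>>> >>> >> > or in the [osd] section of ceph.conf)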
>>>>> >>> >> >
>>>>> >>> >>
>>>>> >>>
>>>>> >>>
>>>>> >>>
>>>>> >>>
>>>>>
>>>>>
>>>>>
>>>>>
>
>
> _______________________________________________
> ceph-users mailing list -- ceph-users(a)ceph.io
> To unsubscribe send an email to ceph-users-leave(a)ceph.io