Hello everybody,
Can somebody add support for Debian buster and ceph-deploy:
https://tracker.ceph.com/issues/42870
Highly appreciated,
Regards,
Jelle de Jong
Our Ceph cluster was upgraded from Nautilus to Octopus. Since then we have
high I/O wait on the ceph-osd nodes.
After increasing one pool's pg_num from 64 to 128 in response to the
warning message (too many objects per pg), CPU load and RAM usage on the
ceph-osd nodes rose sharply and the whole cluster eventually crashed. Three
osds, one on each host, are stuck in the down state (osd.34 osd.35 osd.40).
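(For reference, the pg_num change presumably amounted to something like the
following; the pool name is a placeholder:)
# ceph osd pool set <pool-name> pg_num 128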
Starting one of the down osd services drives RAM usage and CPU load up
until the ceph-osd node crashes or the osd service fails.
The active mgr service on each mon host also crashes after consuming almost
all of the available RAM on the physical host.
I need to recover the pgs and resolve the corruption. How can I recover the
unknown and down pgs? Is there any way to bring the failed osds back up?
The following steps have been tried:
1- The osd nodes' kernel was upgraded to 5.4.2 before the Ceph cluster
upgrade. Reverting to the previous kernel (4.2.1) was tested to see whether
iowait decreased, but it had no effect.
2- Recovered 11 pgs from the failed osds by exporting them with the
ceph-objectstore-tool utility and importing them on other osds (a rough
sketch of the commands is given after the peering_blocked_by output below).
The result: 9 pgs are "down" and 2 pgs are "unknown".
2-1) 9 pgs were exported and imported successfully, but their status is
"down" because peering is blocked by the 3 failed osds. I cannot mark the
osds lost, because that would risk losing the unknown pgs. These pgs are
small (sizes in the KB-MB range).
"peering_blocked_by": [
{
"osd": 34,
"current_lost_at": 0,
"comment": "starting or marking this osd lost may let us proceed"
},
{
"osd": 35,
"current_lost_at": 0,
"comment": "starting or marking this osd lost may let us proceed"
},
{
"osd": 40,
"current_lost_at": 0,
"comment": "starting or marking this osd lost may let us proceed"
}
]
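(For reference, a rough sketch of the export/import procedure from step 2,
with placeholder pg ids and file names; each osd must be stopped while
ceph-objectstore-tool runs against it:)
# systemctl stop ceph-osd@34
# ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-34 --pgid <pgid> --op export --file /tmp/<pgid>.export
# systemctl stop ceph-osd@37
# ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-37 --op import --file /tmp/<pgid>.export
# systemctl start ceph-osd@37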
2-2) 1 pg (2.39) was exported and imported successfully, but after starting
the osd service it was imported into, RAM and CPU consumption on the
ceph-osd node climb until the node crashes or the osd service fails. The
other osds on that ceph-osd node then go "down". The pg status is
"unknown". I cannot use "force-create-pg" because it would lose data. pg
2.39 is 19G in size.
# ceph pg map 2.39
osdmap e40347 pg 2.39 (2.39) -> up [32,37] acting [32,37]
# ceph pg 2.39 query
Error ENOENT: i don't have pgid 2.39
pg 2.39 info on the failed osd:
# ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-34 --op info --pgid 2.39
{
"pgid": "2.39",
"last_update": "35344'6456084",
"last_complete": "35344'6456084",
"log_tail": "35344'6453182",
"last_user_version": 10595821,
"last_backfill": "MAX",
"purged_snaps": [],
"history": {
"epoch_created": 146,
"epoch_pool_created": 79,
"last_epoch_started": 25208,
"last_interval_started": 25207,
"last_epoch_clean": 25208,
"last_interval_clean": 25207,
"last_epoch_split": 370,
"last_epoch_marked_full": 0,
"same_up_since": 8347,
"same_interval_since": 25207,
"same_primary_since": 8321,
"last_scrub": "35328'6440139",
"last_scrub_stamp": "2020-08-19T12:00:59.377593+0430",
"last_deep_scrub": "35261'6031075",
"last_deep_scrub_stamp": "2020-08-17T01:59:26.606037+0430",
"last_clean_scrub_stamp": "2020-08-19T12:00:59.377593+0430",
"prior_readable_until_ub": 0
},
"stats": {
"version": "35344'6456082",
"reported_seq": "11733156",
"reported_epoch": "35344",
"state": "active+clean",
"last_fresh": "2020-08-19T14:16:18.587435+0430",
"last_change": "2020-08-19T12:00:59.377747+0430",
"last_active": "2020-08-19T14:16:18.587435+0430",
"last_peered": "2020-08-19T14:16:18.587435+0430",
"last_clean": "2020-08-19T14:16:18.587435+0430",
"last_became_active": "2020-08-06T00:23:51.016769+0430",
"last_became_peered": "2020-08-06T00:23:51.016769+0430",
"last_unstale": "2020-08-19T14:16:18.587435+0430",
"last_undegraded": "2020-08-19T14:16:18.587435+0430",
"last_fullsized": "2020-08-19T14:16:18.587435+0430",
"mapping_epoch": 8347,
"log_start": "35344'6453182",
"ondisk_log_start": "35344'6453182",
"created": 146,
"last_epoch_clean": 25208,
"parent": "0.0",
"parent_split_bits": 7,
"last_scrub": "35328'6440139",
"last_scrub_stamp": "2020-08-19T12:00:59.377593+0430",
"last_deep_scrub": "35261'6031075",
"last_deep_scrub_stamp": "2020-08-17T01:59:26.606037+0430",
"last_clean_scrub_stamp": "2020-08-19T12:00:59.377593+0430",
"log_size": 2900,
"ondisk_log_size": 2900,
"stats_invalid": false,
"dirty_stats_invalid": false,
"omap_stats_invalid": false,
"hitset_stats_invalid": false,
"hitset_bytes_stats_invalid": false,
"pin_stats_invalid": false,
"manifest_stats_invalid": false,
"snaptrimq_len": 0,
"stat_sum": {
"num_bytes": 19749578960,
"num_objects": 2442,
"num_object_clones": 20,
"num_object_copies": 7326,
"num_objects_missing_on_primary": 0,
"num_objects_missing": 0,
"num_objects_degraded": 0,
"num_objects_misplaced": 0,
"num_objects_unfound": 0,
"num_objects_dirty": 2442,
"num_whiteouts": 0,
"num_read": 16120686,
"num_read_kb": 82264126,
"num_write": 19731882,
"num_write_kb": 379030181,
"num_scrub_errors": 0,
"num_shallow_scrub_errors": 0,
"num_deep_scrub_errors": 0,
"num_objects_recovered": 2861,
"num_bytes_recovered": 21673259070,
"num_keys_recovered": 32,
"num_objects_omap": 2,
"num_objects_hit_set_archive": 0,
"num_bytes_hit_set_archive": 0,
"num_flush": 0,
"num_flush_kb": 0,
"num_evict": 0,
"num_evict_kb": 0,
"num_promote": 0,
"num_flush_mode_high": 0,
"num_flush_mode_low": 0,
"num_evict_mode_some": 0,
"num_evict_mode_full": 0,
"num_objects_pinned": 0,
"num_legacy_snapsets": 0,
"num_large_omap_objects": 0,
"num_objects_manifest": 0,
"num_omap_bytes": 152,
"num_omap_keys": 16,
"num_objects_repaired": 0
},
"up": [
40,
35,
34
],
"acting": [
40,
35,
34
],
"avail_no_missing": [],
"object_location_counts": [],
"blocked_by": [],
"up_primary": 40,
"acting_primary": 40,
"purged_snaps": []
},
"empty": 0,
"dne": 0,
"incomplete": 0,
"last_epoch_started": 25208,
"hit_set_history": {
"current_last_update": "0'0",
"history": []
}
}
pg 2.39 info on the osd it was imported to:
# ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-37 --op info --pgid 2.39
PG '2.39' not found
2-3) 1 pg (2.79) is lost! This pg is not found on any of the three failed
osds (osd.34, osd.35, osd.40)! Its status is "unknown". Exporting pg 2.79
fails with "PG '2.79' not found".
# ceph pg map 2.79
Error ENOENT: i don't have pgid 2.79
# ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-34 --op info --pgid 2.79
PG '2.79' not found
3- Tried https://gitlab.lbader.de/kryptur/ceph-recovery/tree/master, but it
does not work with recent Ceph versions; it was only tested on the "hammer"
release.
4- Tried https://ceph.io/planet/recovering-from-a-complete-node-failure/,
but in our LVM scenario I could not mount the failed osd's LV on a new
/var/lib/ceph/osd/ceph-x, and could not prepare and activate a new osd on
the failed osd's disk.
5- Set min_size=1 on the pool the down pgs belong to and restarted the osds
the pgs were imported to, but no change.
6- Set min_size=1 on the pool pg 2.39 belongs to and restarted the osds the
pg was imported to, but no change (see the command sketch below).
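(For reference, a sketch of steps 5/6, with a placeholder pool name and osd
id:)
# ceph osd pool set <pool-name> min_size 1
# systemctl restart ceph-osd@<id>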
7- Repaired the failed osds using ceph-objectstore-tool, marked them "in"
and started them, but no change (the mark-in/start commands are sketched
below).
# ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-x --op repair
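(The mark-in/start part of step 7 presumably looked roughly like this, with
a placeholder osd id:)
# ceph osd in osd.<id>
# systemctl start ceph-osd@<id>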
8- Repaired the 2 unknown pgs, but no change.
# ceph pg repair 2.39
# ceph pg repair 2.79
9- Forced recovery of the 2 unknown pgs, but no change.
# ceph pg force-recovery 2.39
# ceph pg force-recovery 2.79
10- Checked the PID limit on the ceph-osd nodes, because the osd services
failed to start.
kernel.pid_max = 4194304
11- Raised osd_op_thread_suicide_timeout to 900, but no change (a sketch of
how this can be set follows).
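(On Octopus this can be set centrally; a sketch, not necessarily how it was
done here:)
# ceph config set osd osd_op_thread_suicide_timeout 900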
What are the current OSD sizes and the pool's pg_num? Are you using OSDs of
different sizes?
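(For reference, that information can be gathered with, e.g.:)
# ceph osd df tree
# ceph osd pool ls detail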
On 6/9/2020 at 1:34 AM, huxiaoyu(a)horebdata.cn wrote:
> Dear Ceph folks,
>
> As the capacity of one HDD (OSD) is growing bigger and bigger, e.g. from 6TB up to 18TB or even more, should the number of PGs per OSD increase as well, e.g. from 200 to 800? As far as I know, the capacity of each PG should be kept smaller for performance reasons due to the existence of PG locks, so shall I set the number of PGs per OSD to 1000 or even 2000? What is the actual reason for not setting the number of PGs per OSD higher? Are there any practical limitations on the number of PGs?
>
> thanks a lot,
>
> Samuel
>
>
>
>
> huxiaoyu(a)horebdata.cn
> _______________________________________________
> ceph-users mailing list -- ceph-users(a)ceph.io
> To unsubscribe send an email to ceph-users-leave(a)ceph.io
I have a production cluster with 60 OSDs and no extra journals. It performs okay. I have now added an extra SSD pool with 16 Micron 5100 MAX drives, and its performance is slightly slower than or equal to the 60-HDD pool, for both 4K random and sequential reads. Everything is on a dedicated 2x10G network. The HDDs are still on Filestore; the SSDs are on BlueStore. Ceph Luminous.
What performance should be possible with 16 SSDs vs. 60 HDDs with no extra journals?
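(It may help to post raw benchmark numbers for comparison; a hedged sketch,
assuming the SSD pool is named "ssd-pool", which is a placeholder:)
# rados bench -p ssd-pool 30 write -b 4096 -t 16 --no-cleanup
# rados bench -p ssd-pool 30 rand -t 16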
Hi there,
we reconfigured our ceph cluster yesterday to remove the cluster
network and things didn't quite go to plan. I am trying to figure out
what went wrong and also what to do next.
We are running nautilus 14.2.10 on Scientific Linux 7.8.
So, we are using a mixture of RBDs and cephfs. For the transition we
switched off all machines that are using the RBDs and switched off the
cephfs using
ceph fs set one down true
Once no more MDS were running we reconfigured ceph to remove the
cluster network and set various flags
ceph osd set noout
ceph osd set nodown
ceph osd set pause
ceph osd set nobackfill
ceph osd set norebalance
ceph osd set norecover
We then restarted the OSDs one host at a time. During this process ceph
was mostly happy, except for two PGs. After all OSDs had been restarted
we switched off the cluster network switches to make sure it was
totally gone. Ceph was still happy, and the PG error also disappeared. We
then unset all those flags and re-enabled cephfs (the unset commands are
sketched below).
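(For reference, the corresponding unset commands:)
# ceph osd unset noout
# ceph osd unset nodown
# ceph osd unset pause
# ceph osd unset nobackfill
# ceph osd unset norebalance
# ceph osd unset norecover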
We then switched on the servers using the RBDs with no issues. So far
so good.
We then started using the cephfs (we keep VM images on the cephfs). The
MDS were showing an error. I restarted the MDS but they didn't come
back. We then followed the instructions here:
https://docs.ceph.com/docs/nautilus/cephfs/disaster-recovery-experts/#disas…
up to truncating the journal. The MDS started again. However, as soon as we
started writing to the cephfs, the MDS crashed. A scrub of the cephfs
revealed backtrace damage.
We have now followed the remaining steps of the disaster recovery
procedure and are waiting for the cephfs-data-scan scan_extents to
complete.
It would be really helpful if you could give an indication of how long
this process will take (we have ~40TB in our cephfs) and how many
workers to use.
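(For reference, the disaster-recovery docs describe running scan_extents
with multiple parallel workers; a sketch assuming four workers and a data
pool named "cephfs_data", which is a placeholder -- each command runs in its
own shell:)
# cephfs-data-scan scan_extents --worker_n 0 --worker_m 4 cephfs_data
# cephfs-data-scan scan_extents --worker_n 1 --worker_m 4 cephfs_data
# cephfs-data-scan scan_extents --worker_n 2 --worker_m 4 cephfs_data
# cephfs-data-scan scan_extents --worker_n 3 --worker_m 4 cephfs_data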
The other missing bit of documentation is the cephfs scrubbing. Is that
something we should run routinely?
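(For reference, a forward scrub on Nautilus can be started with something
like the following; the exact MDS addressing is a best guess from the docs:)
# ceph tell mds.<rank-or-name> scrub start / recursive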
Regards
magnus
Hello, list.
Has anybody been in the situation where, after "ceph fs reset", the
filesystem becomes blank (it mounts OK, but ls shows no files/directories),
while the data and metadata pools still hold something (698G and 400M
respectively according to "ceph fs status")?
I would be grateful for pointers to documentation and/or suggestions.
Maybe I remember wrongly, but a few times in the past the same "ceph fs
reset" produced minor corruption of recent filesystem changes.
I think there are multiple variables there.
My advice for HDDs is to aim for an average of 150-200 PGs per OSD, as I wrote before. The limitation is the speed of the device: throw a thousand PGs on there and you won't get any more out of it, you'll just have more peering and more RAM used.
NVMe is a different story.
>
> Are there any rules for computing RAM requirements in terms of the number of PGs?
>
> Just curious about what the fundamental limitations are on the number of PGs per OSD for bigger-capacity HDDs
>
> best regards,
>
> Samuel
>
>
>
> huxiaoyu(a)horebdata.cn
>
> From: Anthony D'Atri
> Date: 2020-09-05 20:00
> To: huxiaoyu(a)horebdata.cn
> CC: ceph-users
> Subject: Re: [ceph-users] PG number per OSD
> One factor is RAM usage, that was IIRC the motivation for the lowering of the recommendation of the ratio from 200 to 100. Memory needs also increase during recovery and backfill.
>
> When calculating, be sure to consider replicas.
>
> ratio = (pgp_num x replication) / num_osds
>
> As HDDs grow the interface though isn’t becoming faster (with SATA at least), and there are only so many IOPS and MB/s that you’re going to get out of one no matter how you slice it. Everything always depends on your use-case and workload, but I suspect that often the bottleneck is the drive, not PG or OSD serialization.
>
> For example, do you prize IOPS more, latency, or MB/s? If you don’t care about latency, then you can drive your HDDs harder and get more MB/s throughput out of them, though your average latency might climb to 100ms. Which eg. RBD VM clients probably wouldn’t be too happy about, but which an object service *might* tolerate.
>
> Basically in the absence of more info, I would personally suggest aiming at the 150-200 average range, with pgp_num a power of 2. If you aim a bit high, the ratio will come down a bit when you add nodes/OSDs to your cluster to gain capacity. Be sure to balance usage and watch your mon_max_pg_per_osd setting — allowing some headroom for natural variation and for when components fail.
>
> YMMV.
>
> — aad
>
>> On Sep 5, 2020, at 10:34 AM, huxiaoyu(a)horebdata.cn wrote:
>>
>> Dear Ceph folks,
>>
>> As the capacity of one HDD (OSD) is growing bigger and bigger, e.g. from 6TB up to 18TB or even more, should the number of PGs per OSD increase as well, e.g. from 200 to 800? As far as I know, the capacity of each PG should be kept smaller for performance reasons due to the existence of PG locks, so shall I set the number of PGs per OSD to 1000 or even 2000? What is the actual reason for not setting the number of PGs per OSD higher? Are there any practical limitations on the number of PGs?
>>
>> thanks a lot,
>>
>> Samuel
>>
>>
>>
>>
>> huxiaoyu(a)horebdata.cn
>> _______________________________________________
>> ceph-users mailing list -- ceph-users(a)ceph.io
>> To unsubscribe send an email to ceph-users-leave(a)ceph.io
>
>
> _______________________________________________
> ceph-users mailing list -- ceph-users(a)ceph.io
> To unsubscribe send an email to ceph-users-leave(a)ceph.io
Good question!
Did you already observe some performance impact of very large PGs?
Which PG locks are you speaking of? Is there perhaps some way to
improve this with the op queue shards?
(I'm cc'ing Mark in case this is something that the performance team
has already looked into).
With a 20TB osd, we'll have up to 200GB PGs following the current
suggestions -- but even then, backfilling those huge PGs would still
be done in under an hour, which seems pretty reasonable IMHO.
-- dan
On Sat, Sep 5, 2020 at 7:35 PM huxiaoyu(a)horebdata.cn
<huxiaoyu(a)horebdata.cn> wrote:
>
> Dear Ceph folks,
>
> As the capacity of one HDD (OSD) is growing bigger and bigger, e.g. from 6TB up to 18TB or even more, should the number of PGs per OSD increase as well, e.g. from 200 to 800? As far as I know, the capacity of each PG should be kept smaller for performance reasons due to the existence of PG locks, so shall I set the number of PGs per OSD to 1000 or even 2000? What is the actual reason for not setting the number of PGs per OSD higher? Are there any practical limitations on the number of PGs?
>
> thanks a lot,
>
> Samuel
>
>
>
>
> huxiaoyu(a)horebdata.cn
> _______________________________________________
> ceph-users mailing list -- ceph-users(a)ceph.io
> To unsubscribe send an email to ceph-users-leave(a)ceph.io
One factor is RAM usage; that was, IIRC, the motivation for lowering the recommended ratio from 200 to 100. Memory needs also increase during recovery and backfill.
When calculating, be sure to consider replicas.
ratio = (pgp_num x replication) / num_osds
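(An illustrative calculation with made-up numbers: pgp_num = 2048 summed over all pools, 3x replication, and 60 OSDs gives (2048 x 3) / 60 ≈ 102 PGs per OSD.)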
As HDDs grow the interface though isn’t becoming faster (with SATA at least), and there are only so many IOPS and MB/s that you’re going to get out of one no matter how you slice it. Everything always depends on your use-case and workload, but I suspect that often the bottleneck is the drive, not PG or OSD serialization.
For example, do you prize IOPS more, latency, or MB/s? If you don’t care about latency, then you can drive your HDDs harder and get more MB/s throughput out of them, though your average latency might climb to 100ms. Which eg. RBD VM clients probably wouldn’t be too happy about, but which an object service *might* tolerate.
Basically in the absence of more info, I would personally suggest aiming at the 150-200 average range, with pgp_num a power of 2. If you aim a bit high, the ratio will come down a bit when you add nodes/OSDs to your cluster to gain capacity. Be sure to balance usage and watch your mon_max_pg_per_osd setting — allowing some headroom for natural variation and for when components fail.
YMMV.
— aad
> On Sep 5, 2020, at 10:34 AM, huxiaoyu(a)horebdata.cn wrote:
>
> Dear Ceph folks,
>
> As the capacity of one HDD (OSD) is growing bigger and bigger, e.g. from 6TB up to 18TB or even more, should the number of PGs per OSD increase as well, e.g. from 200 to 800? As far as I know, the capacity of each PG should be kept smaller for performance reasons due to the existence of PG locks, so shall I set the number of PGs per OSD to 1000 or even 2000? What is the actual reason for not setting the number of PGs per OSD higher? Are there any practical limitations on the number of PGs?
>
> thanks a lot,
>
> Samuel
>
>
>
>
> huxiaoyu(a)horebdata.cn
> _______________________________________________
> ceph-users mailing list -- ceph-users(a)ceph.io
> To unsubscribe send an email to ceph-users-leave(a)ceph.io