Hello everybody,
Can somebody add support for Debian buster and ceph-deploy:
https://tracker.ceph.com/issues/42870
Highly appreciated,
Regards,
Jelle de Jong
Our Ceph cluster was upgraded from Nautilus to Octopus. Since then we have
high I/O wait on the ceph-osd nodes.
After increasing one pool's pg_num from 64 to 128 in response to the
warning message (too many objects per pg), CPU load and RAM usage on the
ceph-osd nodes rose sharply and the whole cluster eventually crashed. Three
osds, one on each host, are stuck in the down state (osd.34 osd.35 osd.40).
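(For reference, the pg_num change presumably amounted to something like the
following; the pool name is a placeholder:)
# ceph osd pool set <pool-name> pg_num 128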
Starting one of the down osd services drives RAM usage and CPU load up
until the ceph-osd node crashes or the osd service fails.
The active mgr service on each mon host also crashes after consuming almost
all of the available RAM on the physical host.
I need to recover the pgs and resolve the corruption. How can I recover the
unknown and down pgs? Is there any way to bring the failed osds back up?
The following steps have been tried:
1- The osd nodes' kernel was upgraded to 5.4.2 before the Ceph cluster
upgrade. Reverting to the previous kernel (4.2.1) was tested to see whether
iowait decreased, but it had no effect.
2- Recovered 11 pgs from the failed osds by exporting them with the
ceph-objectstore-tool utility and importing them on other osds (a rough
sketch of the commands is given after the peering_blocked_by output below).
The result: 9 pgs are "down" and 2 pgs are "unknown".
2-1) 9 pgs were exported and imported successfully, but their status is
"down" because peering is blocked by the 3 failed osds. I cannot mark the
osds lost, because that would risk losing the unknown pgs. These pgs are
small (sizes in the KB-MB range).
"peering_blocked_by": [
{
"osd": 34,
"current_lost_at": 0,
"comment": "starting or marking this osd lost may let us proceed"
},
{
"osd": 35,
"current_lost_at": 0,
"comment": "starting or marking this osd lost may let us proceed"
},
{
"osd": 40,
"current_lost_at": 0,
"comment": "starting or marking this osd lost may let us proceed"
}
]
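(For reference, a rough sketch of the export/import procedure from step 2,
with placeholder pg ids and file names; each osd must be stopped while
ceph-objectstore-tool runs against it:)
# systemctl stop ceph-osd@34
# ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-34 --pgid <pgid> --op export --file /tmp/<pgid>.export
# systemctl stop ceph-osd@37
# ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-37 --op import --file /tmp/<pgid>.export
# systemctl start ceph-osd@37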
2-2) 1 pg (2.39) was exported and imported successfully, but after starting
the osd service it was imported into, RAM and CPU consumption on the
ceph-osd node climb until the node crashes or the osd service fails. The
other osds on that ceph-osd node then go "down". The pg status is
"unknown". I cannot use "force-create-pg" because it would lose data. pg
2.39 is 19G in size.
# ceph pg map 2.39
osdmap e40347 pg 2.39 (2.39) -> up [32,37] acting [32,37]
# ceph pg 2.39 query
Error ENOENT: i don't have pgid 2.39
pg 2.39 info on the failed osd:
# ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-34 --op info --pgid 2.39
{
"pgid": "2.39",
"last_update": "35344'6456084",
"last_complete": "35344'6456084",
"log_tail": "35344'6453182",
"last_user_version": 10595821,
"last_backfill": "MAX",
"purged_snaps": [],
"history": {
"epoch_created": 146,
"epoch_pool_created": 79,
"last_epoch_started": 25208,
"last_interval_started": 25207,
"last_epoch_clean": 25208,
"last_interval_clean": 25207,
"last_epoch_split": 370,
"last_epoch_marked_full": 0,
"same_up_since": 8347,
"same_interval_since": 25207,
"same_primary_since": 8321,
"last_scrub": "35328'6440139",
"last_scrub_stamp": "2020-08-19T12:00:59.377593+0430",
"last_deep_scrub": "35261'6031075",
"last_deep_scrub_stamp": "2020-08-17T01:59:26.606037+0430",
"last_clean_scrub_stamp": "2020-08-19T12:00:59.377593+0430",
"prior_readable_until_ub": 0
},
"stats": {
"version": "35344'6456082",
"reported_seq": "11733156",
"reported_epoch": "35344",
"state": "active+clean",
"last_fresh": "2020-08-19T14:16:18.587435+0430",
"last_change": "2020-08-19T12:00:59.377747+0430",
"last_active": "2020-08-19T14:16:18.587435+0430",
"last_peered": "2020-08-19T14:16:18.587435+0430",
"last_clean": "2020-08-19T14:16:18.587435+0430",
"last_became_active": "2020-08-06T00:23:51.016769+0430",
"last_became_peered": "2020-08-06T00:23:51.016769+0430",
"last_unstale": "2020-08-19T14:16:18.587435+0430",
"last_undegraded": "2020-08-19T14:16:18.587435+0430",
"last_fullsized": "2020-08-19T14:16:18.587435+0430",
"mapping_epoch": 8347,
"log_start": "35344'6453182",
"ondisk_log_start": "35344'6453182",
"created": 146,
"last_epoch_clean": 25208,
"parent": "0.0",
"parent_split_bits": 7,
"last_scrub": "35328'6440139",
"last_scrub_stamp": "2020-08-19T12:00:59.377593+0430",
"last_deep_scrub": "35261'6031075",
"last_deep_scrub_stamp": "2020-08-17T01:59:26.606037+0430",
"last_clean_scrub_stamp": "2020-08-19T12:00:59.377593+0430",
"log_size": 2900,
"ondisk_log_size": 2900,
"stats_invalid": false,
"dirty_stats_invalid": false,
"omap_stats_invalid": false,
"hitset_stats_invalid": false,
"hitset_bytes_stats_invalid": false,
"pin_stats_invalid": false,
"manifest_stats_invalid": false,
"snaptrimq_len": 0,
"stat_sum": {
"num_bytes": 19749578960,
"num_objects": 2442,
"num_object_clones": 20,
"num_object_copies": 7326,
"num_objects_missing_on_primary": 0,
"num_objects_missing": 0,
"num_objects_degraded": 0,
"num_objects_misplaced": 0,
"num_objects_unfound": 0,
"num_objects_dirty": 2442,
"num_whiteouts": 0,
"num_read": 16120686,
"num_read_kb": 82264126,
"num_write": 19731882,
"num_write_kb": 379030181,
"num_scrub_errors": 0,
"num_shallow_scrub_errors": 0,
"num_deep_scrub_errors": 0,
"num_objects_recovered": 2861,
"num_bytes_recovered": 21673259070,
"num_keys_recovered": 32,
"num_objects_omap": 2,
"num_objects_hit_set_archive": 0,
"num_bytes_hit_set_archive": 0,
"num_flush": 0,
"num_flush_kb": 0,
"num_evict": 0,
"num_evict_kb": 0,
"num_promote": 0,
"num_flush_mode_high": 0,
"num_flush_mode_low": 0,
"num_evict_mode_some": 0,
"num_evict_mode_full": 0,
"num_objects_pinned": 0,
"num_legacy_snapsets": 0,
"num_large_omap_objects": 0,
"num_objects_manifest": 0,
"num_omap_bytes": 152,
"num_omap_keys": 16,
"num_objects_repaired": 0
},
"up": [
40,
35,
34
],
"acting": [
40,
35,
34
],
"avail_no_missing": [],
"object_location_counts": [],
"blocked_by": [],
"up_primary": 40,
"acting_primary": 40,
"purged_snaps": []
},
"empty": 0,
"dne": 0,
"incomplete": 0,
"last_epoch_started": 25208,
"hit_set_history": {
"current_last_update": "0'0",
"history": []
}
}
pg 2.39 info on the osd it was imported to:
# ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-37 --op info --pgid 2.39
PG '2.39' not found
2-3) 1 pg (2.79) is lost! This pg is not found on any of the three failed
osds (osd.34, osd.35, osd.40)! Its status is "unknown". Exporting pg 2.79
fails with "PG '2.79' not found".
# ceph pg map 2.79
Error ENOENT: i don't have pgid 2.79
# ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-34 --op info --pgid 2.79
PG '2.79' not found
3- Tried https://gitlab.lbader.de/kryptur/ceph-recovery/tree/master, but it
does not work with recent Ceph versions; it was only tested on the "hammer"
release.
4- Tried https://ceph.io/planet/recovering-from-a-complete-node-failure/,
but in our LVM scenario I could not mount the failed osd's LV on a new
/var/lib/ceph/osd/ceph-x, and could not prepare and activate a new osd on
the failed osd's disk.
5- Set min_size=1 on the pool the down pgs belong to and restarted the osds
the pgs were imported to, but no change.
6- Set min_size=1 on the pool pg 2.39 belongs to and restarted the osds the
pg was imported to, but no change (see the command sketch below).
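(For reference, a sketch of steps 5/6, with a placeholder pool name and osd
id:)
# ceph osd pool set <pool-name> min_size 1
# systemctl restart ceph-osd@<id>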
7- Repaired the failed osds using ceph-objectstore-tool, marked them "in"
and started them, but no change (the mark-in/start commands are sketched
below).
# ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-x --op repair
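(The mark-in/start part of step 7 presumably looked roughly like this, with
a placeholder osd id:)
# ceph osd in osd.<id>
# systemctl start ceph-osd@<id>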
8- Repaired the 2 unknown pgs, but no change.
# ceph pg repair 2.39
# ceph pg repair 2.79
9- Forced recovery of the 2 unknown pgs, but no change.
# ceph pg force-recovery 2.39
# ceph pg force-recovery 2.79
10- Checked the PID limit on the ceph-osd nodes, because the osd services
failed to start.
kernel.pid_max = 4194304
11- Raised osd_op_thread_suicide_timeout to 900, but no change (a sketch of
how this can be set follows).
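(On Octopus this can be set centrally; a sketch, not necessarily how it was
done here:)
# ceph config set osd osd_op_thread_suicide_timeout 900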
What are the current OSD sizes and the pool's pg_num? Are you using OSDs of
different sizes?
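(For reference, that information can be gathered with, e.g.:)
# ceph osd df tree
# ceph osd pool ls detail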
On 6/9/2020 at 1:34 AM, huxiaoyu(a)horebdata.cn wrote:
> Dear Ceph folks,
>
> As the capacity of one HDD (OSD) is growing bigger and bigger, e.g. from 6TB up to 18TB or even more, should the number of PGs per OSD increase as well, e.g. from 200 to 800? As far as I know, the capacity of each PG should be kept smaller for performance reasons due to the existence of PG locks, so shall I set the number of PGs per OSD to 1000 or even 2000? What is the actual reason for not setting the number of PGs per OSD higher? Are there any practical limitations on the number of PGs?
>
> thanks a lot,
>
> Samuel
>
>
>
>
> huxiaoyu(a)horebdata.cn
> _______________________________________________
> ceph-users mailing list -- ceph-users(a)ceph.io
> To unsubscribe send an email to ceph-users-leave(a)ceph.io
I have a production cluster with 60 OSDs and no extra journals. It performs okay. I have now added an extra SSD pool with 16 Micron 5100 MAX drives, and its performance is slightly slower than or equal to the 60-HDD pool, for both 4K random and sequential reads. Everything is on a dedicated 2x10G network. The HDDs are still on Filestore; the SSDs are on BlueStore. Ceph Luminous.
What performance should be possible with 16 SSDs vs. 60 HDDs with no extra journals?
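(It may help to post raw benchmark numbers for comparison; a hedged sketch,
assuming the SSD pool is named "ssd-pool", which is a placeholder:)
# rados bench -p ssd-pool 30 write -b 4096 -t 16 --no-cleanup
# rados bench -p ssd-pool 30 rand -t 16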
Hi there,
we reconfigured our ceph cluster yesterday to remove the cluster
network and things didn't quite go to plan. I am trying to figure out
what went wrong and also what to do next.
We are running nautilus 14.2.10 on Scientific Linux 7.8.
So, we are using a mixture of RBDs and cephfs. For the transition we
switched off all machines that are using the RBDs and switched off the
cephfs using
ceph fs set one down true
Once no more MDS were running we reconfigured ceph to remove the
cluster network and set various flags
ceph osd set noout
ceph osd set nodown
ceph osd set pause
ceph osd set nobackfill
ceph osd set norebalance
ceph osd set norecover
We then restarted the OSDs one host at a time. During this process ceph
was mostly happy, except for two PGs. After all OSDs had been restarted
we switched off the cluster network switches to make sure it was
totally gone. Ceph was still happy, and the PG error also disappeared. We
then unset all those flags and re-enabled cephfs (the unset commands are
sketched below).
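(For reference, the corresponding unset commands:)
# ceph osd unset noout
# ceph osd unset nodown
# ceph osd unset pause
# ceph osd unset nobackfill
# ceph osd unset norebalance
# ceph osd unset norecover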
We then switched on the servers using the RBDs with no issues. So far
so good.
We then started using the cephfs (we keep VM images on the cephfs). The
MDS were showing an error. I restarted the MDS but they didn't come
back. We then followed the instructions here:
https://docs.ceph.com/docs/nautilus/cephfs/disaster-recovery-experts/#disas…
up to truncating the journal. The MDS started again. However, as soon as we
started writing to the cephfs, the MDS crashed. A scrub of the cephfs
revealed backtrace damage.
We have now followed the remaining steps of the disaster recovery
procedure and are waiting for the cephfs-data-scan scan_extents to
complete.
It would be really helpful if you could give an indication of how long
this process will take (we have ~40TB in our cephfs) and how many
workers to use.
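(For reference, the disaster-recovery docs describe running scan_extents
with multiple parallel workers; a sketch assuming four workers and a data
pool named "cephfs_data", which is a placeholder -- each command runs in its
own shell:)
# cephfs-data-scan scan_extents --worker_n 0 --worker_m 4 cephfs_data
# cephfs-data-scan scan_extents --worker_n 1 --worker_m 4 cephfs_data
# cephfs-data-scan scan_extents --worker_n 2 --worker_m 4 cephfs_data
# cephfs-data-scan scan_extents --worker_n 3 --worker_m 4 cephfs_data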
The other missing bit of documentation is the cephfs scrubbing. Is that
something we should run routinely?
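(For reference, a forward scrub on Nautilus can be started with something
like the following; the exact MDS addressing is a best guess from the docs:)
# ceph tell mds.<rank-or-name> scrub start / recursive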
Regards
magnus
Hello, list.
Has anybody been in the situation where, after "ceph fs reset", the
filesystem becomes blank (it mounts OK, but ls shows no files/directories),
while the data and metadata pools still hold something (698G and 400M
respectively according to "ceph fs status")?
I would be grateful for pointers to documentation and/or suggestions.
Maybe I remember wrongly, but a few times in the past the same "ceph fs
reset" produced minor corruption of recent filesystem changes.
I think there are multiple variables there.
My advice for HDDs is to aim for an average of 150-200 PGs per OSD, as I wrote before. The limitation is the speed of the device: throw a thousand PGs on there and you won't get any more out of it, you'll just have more peering and more RAM used.
NVMe is a different story.
>
> Are there any rules for computing RAM requirements in terms of the number of PGs?
>
> Just curious about what the fundamental limitations are on the number of PGs per OSD for bigger-capacity HDDs
>
> best regards,
>
> Samuel
>
>
>
> huxiaoyu(a)horebdata.cn
>
> From: Anthony D'Atri
> Date: 2020-09-05 20:00
> To: huxiaoyu(a)horebdata.cn
> CC: ceph-users
> Subject: Re: [ceph-users] PG number per OSD
> One factor is RAM usage, that was IIRC the motivation for the lowering of the recommendation of the ratio from 200 to 100. Memory needs also increase during recovery and backfill.
>
> When calculating, be sure to consider replicas.
>
> ratio = (pgp_num x replication) / num_osds
>
> As HDDs grow the interface though isn’t becoming faster (with SATA at least), and there are only so many IOPS and MB/s that you’re going to get out of one no matter how you slice it. Everything always depends on your use-case and workload, but I suspect that often the bottleneck is the drive, not PG or OSD serialization.
>
> For example, do you prize IOPS more, latency, or MB/s? If you don’t care about latency, then you can drive your HDDs harder and get more MB/s throughput out of them, though your average latency might climb to 100ms. Which eg. RBD VM clients probably wouldn’t be too happy about, but which an object service *might* tolerate.
>
> Basically in the absence of more info, I would personally suggest aiming at the 150-200 average range, with pgp_num a power of 2. If you aim a bit high, the ratio will come down a bit when you add nodes/OSDs to your cluster to gain capacity. Be sure to balance usage and watch your mon_max_pg_per_osd setting — allowing some headroom for natural variation and for when components fail.
>
> YMMV.
>
> — aad
>
>> On Sep 5, 2020, at 10:34 AM, huxiaoyu(a)horebdata.cn wrote:
>>
>> Dear Ceph folks,
>>
>> As the capacity of one HDD (OSD) is growing bigger and bigger, e.g. from 6TB up to 18TB or even more, should the number of PGs per OSD increase as well, e.g. from 200 to 800? As far as I know, the capacity of each PG should be kept smaller for performance reasons due to the existence of PG locks, so shall I set the number of PGs per OSD to 1000 or even 2000? What is the actual reason for not setting the number of PGs per OSD higher? Are there any practical limitations on the number of PGs?
>>
>> thanks a lot,
>>
>> Samuel
>>
>>
>>
>>
>> huxiaoyu(a)horebdata.cn
>> _______________________________________________
>> ceph-users mailing list -- ceph-users(a)ceph.io
>> To unsubscribe send an email to ceph-users-leave(a)ceph.io
>
>
> _______________________________________________
> ceph-users mailing list -- ceph-users(a)ceph.io
> To unsubscribe send an email to ceph-users-leave(a)ceph.io
Good question!
Did you already observe some performance impact of very large PGs?
Which PG locks are you speaking of? Is there perhaps some way to
improve this with the op queue shards?
(I'm cc'ing Mark in case this is something that the performance team
has already looked into).
With a 20TB osd, we'll have up to 200GB PGs following the current
suggestions -- but even then, backfilling those huge PGs would still
be done in under an hour, which seems pretty reasonable IMHO.
-- dan
On Sat, Sep 5, 2020 at 7:35 PM huxiaoyu(a)horebdata.cn
<huxiaoyu(a)horebdata.cn> wrote:
>
> Dear Ceph folks,
>
> As the capacity of one HDD (OSD) is growing bigger and bigger, e.g. from 6TB up to 18TB or even more, should the number of PGs per OSD increase as well, e.g. from 200 to 800? As far as I know, the capacity of each PG should be kept smaller for performance reasons due to the existence of PG locks, so shall I set the number of PGs per OSD to 1000 or even 2000? What is the actual reason for not setting the number of PGs per OSD higher? Are there any practical limitations on the number of PGs?
>
> thanks a lot,
>
> Samuel
>
>
>
>
> huxiaoyu(a)horebdata.cn
> _______________________________________________
> ceph-users mailing list -- ceph-users(a)ceph.io
> To unsubscribe send an email to ceph-users-leave(a)ceph.io
One factor is RAM usage; that was, IIRC, the motivation for lowering the recommended ratio from 200 to 100. Memory needs also increase during recovery and backfill.
When calculating, be sure to consider replicas.
ratio = (pgp_num x replication) / num_osds
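(An illustrative calculation with made-up numbers: pgp_num = 2048 summed over all pools, 3x replication, and 60 OSDs gives (2048 x 3) / 60 ≈ 102 PGs per OSD.)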
As HDDs grow the interface though isn’t becoming faster (with SATA at least), and there are only so many IOPS and MB/s that you’re going to get out of one no matter how you slice it. Everything always depends on your use-case and workload, but I suspect that often the bottleneck is the drive, not PG or OSD serialization.
For example, do you prize IOPS more, latency, or MB/s? If you don’t care about latency, then you can drive your HDDs harder and get more MB/s throughput out of them, though your average latency might climb to 100ms. Which eg. RBD VM clients probably wouldn’t be too happy about, but which an object service *might* tolerate.
Basically in the absence of more info, I would personally suggest aiming at the 150-200 average range, with pgp_num a power of 2. If you aim a bit high, the ratio will come down a bit when you add nodes/OSDs to your cluster to gain capacity. Be sure to balance usage and watch your mon_max_pg_per_osd setting — allowing some headroom for natural variation and for when components fail.
YMMV.
— aad
> On Sep 5, 2020, at 10:34 AM, huxiaoyu(a)horebdata.cn wrote:
>
> Dear Ceph folks,
>
> As the capacity of one HDD (OSD) is growing bigger and bigger, e.g. from 6TB up to 18TB or even more, should the number of PGs per OSD increase as well, e.g. from 200 to 800? As far as I know, the capacity of each PG should be kept smaller for performance reasons due to the existence of PG locks, so shall I set the number of PGs per OSD to 1000 or even 2000? What is the actual reason for not setting the number of PGs per OSD higher? Are there any practical limitations on the number of PGs?
>
> thanks a lot,
>
> Samuel
>
>
>
>
> huxiaoyu(a)horebdata.cn
> _______________________________________________
> ceph-users mailing list -- ceph-users(a)ceph.io
> To unsubscribe send an email to ceph-users-leave(a)ceph.io