July 2020 - ceph-users - lists.ceph.io

by Mike Dawson

Tonight an old Ceph cluster we run suffered a hardware failure that resulted in the loss of Ceph journal SSDs on 7 nodes out of 36. Overview of this old setup: - Super-old Ceph Dumpling v0.67 - 3x replication for RBD w/ 3 failure domains in replication hierarchy - OSDs on XFS on spinning disks with Journals on SSD In total we lost 7 SSDs hosting Journals for 21 OSDs (3 each). The lost nodes span all three failure domains which makes me nervous that there are likely missing Placement Groups in the pool. Due to how Ceph shards data across the Placement Groups, I'm concerned I may have lost all the RBD volumes in this pool. The obvious solution is to attempt to bring the OSDs back online (for at least one failure domain) to ensure there is at least one complete copy of the data then rebuild everything else. The issue is I lost the journals when the SSDs died. I don't see much published about recovering OSDs in the event of a lost journal except: https://ceph.io/geen-categorie/ceph-recover-osds-after-ssd-journal-failure/ And that doesn't mention if the data is valid afterwards. I think I recall Inktank used to deal with this situation and may have had a potential solution. At this point, I'll take any constructive advice. Thanks you in advance, Mike

3 years, 9 months

1
0
0 0

Re: RBD thin provisioning and time to format a volume

by Void Star Nill

On Thu, Jul 9, 2020 at 10:33 AM Marc Roos <M.Roos(a)f1-outsourcing.eu> wrote: > > What about ntfs? You have there a not quick option. Maybe it writes to > the whole disk some random pattern. Why do you ask? > I am writing an API layer to plug into our platform, so I want to know if the format times can be deterministic or unbounded. From what I saw with ext3, ext4 and xfs volumes, the format time is actually not dependent on the size of the volume, so I just wanted to confirm if we can assume that, or if I am missing something. Thanks, Shridhar > > > -----Original Message----- > Cc: ceph-users > Subject: [ceph-users] Re: RBD thin provisioning and time to format a > volume > > Thanks Jason. > > Do you mean to say some filesystems will initialize the entire disk > during format? Does that mean we will see the entire size of the volume > getting allocated during formatting? > Or do you mean to say, some filesystem formatting just takes longer than > others, as it does more initialization? > > I am just trying to understand if there are cases where Ceph will > allocate all the blocks for a filesystem during formation operations, OR > if they continue to be thin provisioned (allocate as you go based on > real data). So far I have tried with ext3, ext4 and xfs and none of them > dont allocate all the blocks during format. > > -Shridhar > > > On Thu, 9 Jul 2020 at 06:58, Jason Dillaman <jdillama(a)redhat.com> wrote: > > > On Thu, Jul 9, 2020 at 12:02 AM Void Star Nill > > <void.star.nill(a)gmail.com> > > wrote: > > > > > > > > > > > > On Wed, Jul 8, 2020 at 4:56 PM Jason Dillaman <jdillama(a)redhat.com> > > wrote: > > >> > > >> On Wed, Jul 8, 2020 at 3:28 PM Void Star Nill > > >> <void.star.nill(a)gmail.com> > > wrote: > > >> > > > >> > Hello, > > >> > > > >> > My understanding is that the time to format an RBD volume is not > > dependent > > >> > on its size as the RBD volumes are thin provisioned. Is this > correct? > > >> > > > >> > For example, formatting a 1G volume should take almost the same > > >> > time > > as > > >> > formatting a 1TB volume - although accounting for differences in > > latencies > > >> > due to load on the Ceph cluster. Is that a fair assumption? > > >> > > >> Yes, that is a fair comparison when creating the RBD image. > > >> However, a format operation might initialize and discard extents on > > > >> the disk, so a larger disk will take longer to format. > > > > > > > > > Thanks for the response Jason. Could you please explain a bit more > > > on > > the the format operation? > > > > I'm not sure what else there is to explain. When you create a file > > system on top of any block device, it needs to initialize the block > > device. Depending on the file system, it might take more time for > > larger block devices because it's doing more work. > > > > > Is there a relative time that we can determine based on the volume > size? > > > > > > Thanks > > > Shridhar > > > > > > > > >> > > >> > > >> > Thanks, > > >> > Shridhar > > >> > _______________________________________________ > > >> > ceph-users mailing list -- ceph-users(a)ceph.io To unsubscribe send > > > >> > an email to ceph-users-leave(a)ceph.io > > >> > > > >> > > >> > > >> -- > > >> Jason > > >> > > > > > > -- > > Jason > > > > > _______________________________________________ > ceph-users mailing list -- ceph-users(a)ceph.io To unsubscribe send an > email to ceph-users-leave(a)ceph.io > > >

3 years, 9 months

1
0
0 0

post - bluestore default vs tuned performance comparison

by Frank Ritchie

Hi, For this post: https://ceph.io/community/bluestore-default-vs-tuned-performance-comparison/ I don't see a way to contact the authors so I thought I would try here. Does anyone know how the rocksdb tuning parameters of: " bluestore_rocksdb_options = compression=kNoCompression,max_write_buffer_number=32,min_write_buffer_number_to_merge=2,recycle_log_file_num=32,compaction_style=kCompactionStyleLevel,write_buffer_size=67108864,target_file_size_base=67108864,max_background_compactions=31,level0_file_num_compaction_trigger=8,level0_slowdown_writes_trigger=32,level0_stop_writes_trigger=64,max_bytes_for_level_base=536870912,compaction_threads=32,max_bytes_for_level_multiplier=8,flusher_threads=8,compaction_readahead_size=2MB " were chosen? Some of the settings seem to not be in line with the rocksdb tuning guide: https://github.com/facebook/rocksdb/wiki/RocksDB-Tuning-Guide thx Frank

3 years, 9 months

2
1
0 0

RBD thin provisioning and time to format a volume

by Void Star Nill

Hello, My understanding is that the time to format an RBD volume is not dependent on its size as the RBD volumes are thin provisioned. Is this correct? For example, formatting a 1G volume should take almost the same time as formatting a 1TB volume - although accounting for differences in latencies due to load on the Ceph cluster. Is that a fair assumption? Thanks, Shridhar

3 years, 9 months

3
5
0 0

Bucket index logs (bilogs) not being trimmed automatically (multisite, ceph nautilus 14.2.9)

by david.piper＠metaswitch.com

Hi all, We're seeing a problem in our multisite Ceph deployment, where bilogs aren't being trimmed for several buckets. This is causing bilogs to accumulate over time, leading to large OMAP object warnings for the indexes on these buckets. In every case, Ceph reports that the bucket is in sync and the data is consistent across both sites, so we're perplexed as to why the logs aren't being trimmed. It's not affecting all of our buckets, we're not sure what's 'different' in the affected cases causing them to accumulate. We're seeing this in both unsharded and sharded buckets. Some buckets with heavy activity (lots of object updates) have accumulated millions of bilogs, but this does not affect all of our very active buckets. I've tried running 'radosgw-admin bilog autotrim' against an affected bucket, and it doesn't appear to do anything. I've used 'radosgw-admin bilog trim' with a suitable 'end-marker' to trim all of the bilogs, but the implications of doing this aren't clear to me, and the logs continue to accumulate afterwards. We're running Ceph Nautilus 14.2.9 and running the services in containers, we have 3 hosts on each site with 1 OSD on each host. We're adjusting a fairly minimal set of config and I don't think it includes anything that would affect bilog trimming. Checking the running config on the mon service, I think these defaulted parameters are relevant: "rgw_sync_log_trim_concurrent_buckets": "4", "rgw_sync_log_trim_interval": "1200", "rgw_sync_log_trim_max_buckets": "16", "rgw_sync_log_trim_min_cold_buckets": "4", I can't find any documentation on these parameters but we have more than 16 buckets, so is it possible that some buckets are just never being selected for trimming? Any other ideas as to what might be causing this, or anything else we could try to help diagnose or fix this? Thanks in advance! I've included an example below for one such affected bucket, showing its current state. Zone details (as per 'radosgw-admin zonegroup get') are at the bottom. $ radosgw-admin bucket sync status --bucket=edin2z6-sharedconfig realm b7f31089-0879-4fa2-9cbc-cfdf5f866a35 (geored_realm) zonegroup 5d74eb0e-5d99-481f-ae33-43483f6cebc0 (geored_zg) zone c48f33ad-6d79-4b9f-a22f-78589f67526e (siteA) bucket edin2z6-sharedconfig[033709fc-924a-4582-b00d-97c90e9e61b6.3634407.1] source zone 0a3c29b7-1a2c-432d-979b-d324a05cc831 (siteApubsub) full sync: 0/1 shards incremental sync: 0/1 shards bucket is caught up with source source zone 9f5fba56-4a32-46a6-8695-89253be81614 (siteB) full sync: 0/1 shards incremental sync: 1/1 shards bucket is caught up with source source zone c72b3aa8-a051-4665-9421-909510702412 (siteBpubsub) full sync: 0/1 shards incremental sync: 0/1 shards bucket is caught up with source $ radosgw-admin bilog list --bucket edin2z6-sharedconfig --max-entries 600000000 | grep op_id | wc -l 1299392 $ rados -p siteA.rgw.buckets.index listomapkeys .dir.033709fc-924a-4582-b00d-97c90e9e61b6.3634407.1 | wc -l 1299083 $ radosgw-admin bucket stats --bucket=edin2z6-sharedconfig { "bucket": "edin2z6-sharedconfig", "num_shards": 0, "tenant": "", "zonegroup": "5d74eb0e-5d99-481f-ae33-43483f6cebc0", "placement_rule": "default-placement", "explicit_placement": { "data_pool": "", "data_extra_pool": "", "index_pool": "" }, "id": "033709fc-924a-4582-b00d-97c90e9e61b6.3634407.1", "marker": "033709fc-924a-4582-b00d-97c90e9e61b6.3634407.1", "index_type": "Normal", "owner": "edin2z6", "ver": "0#1622676", "master_ver": "0#0", "mtime": "2020-01-14 14:30:18.606142Z", "max_marker": "0#00001622675.2115836.5", "usage": { "rgw.main": { "size": 15209, "size_actual": 40960, "size_utilized": 15209, "size_kb": 15, "size_kb_actual": 40, "size_kb_utilized": 15, "num_objects": 7 } }, "bucket_quota": { "enabled": false, "check_on_raw": false, "max_size": -1, "max_size_kb": 0, "max_objects": -1 } } $ radosgw-admin bucket limit check ... { "bucket": "edin2z6-sharedconfig", "tenant": "", "num_objects": 7, "num_shards": 0, "objects_per_shard": 7, "fill_status": "OK" }, ... $ radosgw-admin zonegroup get ++ sudo docker ps --filter name=ceph-rgw-.*rgw -q ++ sudo docker exec d2c999b1f3f8 radosgw-admin { "id": "5d74eb0e-5d99-481f-ae33-43483f6cebc0", "name": "geored_zg", "api_name": "geored_zg", "is_master": "true", "endpoints": [ "https://10.254.2.93:7480" ], "hostnames": [], "hostnames_s3website": [], "master_zone": "c48f33ad-6d79-4b9f-a22f-78589f67526e", "zones": [ { "id": "0a3c29b7-1a2c-432d-979b-d324a05cc831", "name": "siteApubsub", "endpoints": [ "https://10.254.2.93:7481", "https://10.254.2.94:7481", "https://10.254.2.95:7481" ], "log_meta": "false", "log_data": "true", "bucket_index_max_shards": 0, "read_only": "false", "tier_type": "pubsub", "sync_from_all": "false", "sync_from": [ "siteA" ], "redirect_zone": "" }, { "id": "9f5fba56-4a32-46a6-8695-89253be81614", "name": "siteB", "endpoints": [ "https://10.254.2.224:7480", "https://10.254.2.225:7480", "https://10.254.2.226:7480" ], "log_meta": "false", "log_data": "true", "bucket_index_max_shards": 0, "read_only": "false", "tier_type": "", "sync_from_all": "true", "sync_from": [], "redirect_zone": "" }, { "id": "c48f33ad-6d79-4b9f-a22f-78589f67526e", "name": "siteA", "endpoints": [ "https://10.254.2.93:7480", "https://10.254.2.94:7480", "https://10.254.2.95:7480" ], "log_meta": "false", "log_data": "true", "bucket_index_max_shards": 0, "read_only": "false", "tier_type": "", "sync_from_all": "true", "sync_from": [], "redirect_zone": "" }, { "id": "c72b3aa8-a051-4665-9421-909510702412", "name": "siteBpubsub", "endpoints": [ "https://10.254.2.224:7481", "https://10.254.2.225:7481", "https://10.254.2.226:7481" ], "log_meta": "false", "log_data": "true", "bucket_index_max_shards": 0, "read_only": "false", "tier_type": "pubsub", "sync_from_all": "false", "sync_from": [ "siteB" ], "redirect_zone": "" } ], "placement_targets": [ { "name": "default-placement", "tags": [], "storage_classes": [ "STANDARD" ] } ], "default_placement": "default-placement", "realm_id": "b7f31089-0879-4fa2-9cbc-cfdf5f866a35" }

3 years, 9 months

1
0
0 0

default.rgw.data.root pool

by Seena Fallah

Hi all. Is there any docs related to default.rgw.data.root pool? I have this pool and there are no objects in default.rgw.meta pool. Thanks for your help.

3 years, 9 months

1
0
0 0

Ceph df Vs Dashboard pool usage mismatch

by Richard Kearsley

Hi My ceph dashboard reports 64% usage for rgw.buckets.data: [image: cephdashboard.png] But "ceph df" command shows 56.81% used: RAW STORAGE: CLASS SIZE AVAIL USED RAW USED %RAW USED hdd 611 TiB 282 TiB 328 TiB 329 TiB 53.81 TOTAL 611 TiB 282 TiB 328 TiB 329 TiB 53.81 POOLS: POOL ID STORED OBJECTS USED %USED MAX AVAIL .rgw.root 1 12 KiB 25 7.5 MiB 0 50 TiB rgw.control 5 0 B 15 0 B 0 50 TiB rgw.meta 6 13 KiB 51 14 MiB 0 50 TiB rgw.log 7 7.5 MiB 783 69 MiB 0 50 TiB rgw.buckets.index 8 57 MiB 16 57 MiB 0 50 TiB buckets.non-ec 10 0 B 0 0 B 0 50 TiB rgw.buckets.data 11 217 TiB 57.48M 328 TiB 56.81 187 TiB Does ceph df have the correct one? 🤔 ceph version 14.2.8 (2d095e947a02261ce61424021bb43bd3022d35cb) nautilus (stable) Thanks Richard

3 years, 9 months

3
2
0 0

Questions on Ceph on ARM

by norman

Hi all, I'm using Ceph on X86 and ARM, is it safe make x86 and arm64 in the same cluster?

3 years, 9 months

5
10
0 0

bluestore: osd bluestore_allocated is much larger than bluestore_stored

by Jerry Pu

Hi: We have a cluster (v13.2.4), and we do some tests on a EC k=2, m=1 pool "VMPool0". We deploy some VMs (Windows, CentOS7) on the pool and then use IOMeter to write data to these VMs. After a period of time, we observe a strange thing that pool actual usage is much larger than stored data * 1.5 (stored_raw). [root@Sim-str-R6-4 ~]# ceph df GLOBAL: CLASS SIZE AVAIL USED RAW USED %RAW USED hdd 3.5 TiB 1.8 TiB 1.7 TiB 1.7 TiB 48.74 TOTAL 3.5 TiB 1.8 TiB 1.7 TiB 1.7 TiB 48.74 POOLS: NAME ID USED %USED MAX AVAIL OBJECTS cephfs_data 1 29 GiB 100.00 0 B 2597 cephfs_md 2 831 MiB 100.00 0 B 133 erasure_meta_hdd 3 22 MiB 100.00 0 B 170 VMPool0 4 1.2 TiB 56.77 644 GiB 116011 stresspool 5 2.6 MiB 100.00 0 B 32 [root@Sim-str-R6-4 ~]# ceph df detail -f json-pretty -----snippet----- { "name": "VMPool0", "id": 4, "stats": { "kb_used": 1328888832, "bytes_used": 1360782163968, <---------------- "percent_used": 0.567110, "max_avail": 692481687552, "objects": 116011, "quota_objects": 0, "quota_bytes": 0, "dirty": 116011, "rd": 27449034, "rd_bytes": 126572760064, "wr": 20675381, "wr_bytes": 1006460652544, "comp_ratio": 1.000000, "stored": 497657610240, "stored_raw": 746486431744, <---------------- } }, The perf counters of all osds (all hdd) used by VMPool0 also show that bluestore_allocated is much larger than bluestore_stored. [root@Sim-str-R6-4 ~]# for i in {0..3}; do echo $i; ceph daemon osd.$i perf dump | grep bluestore | head -6; done 0 "bluestore": { "bluestore_allocated": 175032369152, <---------------- "bluestore_stored": 83557936482, <---------------- "bluestore_compressed": 958795770, "bluestore_compressed_allocated": 6431965184, "bluestore_compressed_original": 18576584704, 1 "bluestore": { "bluestore_allocated": 119943593984, <---------------- "bluestore_stored": 53325238866, <---------------- "bluestore_compressed": 670158436, "bluestore_compressed_allocated": 4751818752, "bluestore_compressed_original": 13752328192, 2 "bluestore": { "bluestore_allocated": 155444707328, <---------------- "bluestore_stored": 69067116553, <---------------- "bluestore_compressed": 565170876, "bluestore_compressed_allocated": 4614324224, "bluestore_compressed_original": 13469696000, 3 "bluestore": { "bluestore_allocated": 128179240960, <---------------- "bluestore_stored": 60884752114, <---------------- "bluestore_compressed": 1653455847, "bluestore_compressed_allocated": 9741795328, "bluestore_compressed_original": 27878768640, [root@Sim-str-R6-5 osd]# for i in {4..7}; do echo $i; sh -c "ceph daemon osd.$i perf dump | grep bluestore | head -6"; done 4 "bluestore": { "bluestore_allocated": 165950652416, <---------------- "bluestore_stored": 80255191687, <---------------- "bluestore_compressed": 1526871060, "bluestore_compressed_allocated": 8900378624, "bluestore_compressed_original": 25324142592, 5 admin_socket: exception getting command descriptions: [Errno 111] Connection refused 6 "bluestore": { "bluestore_allocated": 166022152192, <---------------- "bluestore_stored": 84645390708, <---------------- "bluestore_compressed": 1169055606, "bluestore_compressed_allocated": 8647278592, "bluestore_compressed_original": 25135091712, 7 "bluestore": { "bluestore_allocated": 204633604096, <---------------- "bluestore_stored": 100116382041, <---------------- "bluestore_compressed": 1081260422, "bluestore_compressed_allocated": 6510018560, "bluestore_compressed_original": 18654052352, [root@Sim-str-R6-6 osd]# for i in {8..12}; do echo $i; ceph daemon osd.$i perf dump | grep bluestore | head -6; done 8 "bluestore": { "bluestore_allocated": 106330193920, <---------------- "bluestore_stored": 45282848089, <---------------- "bluestore_compressed": 1136610231, "bluestore_compressed_allocated": 7248609280, "bluestore_compressed_original": 20882960384, 9 "bluestore": { "bluestore_allocated": 120657412096, <---------------- "bluestore_stored": 52550745942, <---------------- "bluestore_compressed": 1321632665, "bluestore_compressed_allocated": 7401504768, "bluestore_compressed_original": 21073027072, 10 "bluestore": { "bluestore_allocated": 155985772544, <---------------- "bluestore_stored": 73236910054, <---------------- "bluestore_compressed": 98351920, "bluestore_compressed_allocated": 772210688, "bluestore_compressed_original": 2242043904, 11 "bluestore": { "bluestore_allocated": 106040524800, <---------------- "bluestore_stored": 45353612134, <---------------- "bluestore_compressed": 874216443, "bluestore_compressed_allocated": 4962844672, "bluestore_compressed_original": 14160310272, 12 "bluestore": { "bluestore_allocated": 118751363072, <---------------- "bluestore_stored": 52194408691, <---------------- "bluestore_compressed": 782919969, "bluestore_compressed_allocated": 5546311680, "bluestore_compressed_original": 16043233280, The config min_alloc_size_hdd of all osds are 64K and rados objects of VMPool0 are all 4M rbd_data.x.xxxxxxx.xxxxxxx objects. It's kind of strange that allocation space is much larger than stored data. Can anyone explains this?

3 years, 9 months

2
7
0 0

bucket index nvme

by Szabo, Istvan (Agoda)

Hello, Can someone explain me a bit about the objectstore indexing? It's not really clear, when redhat says one of the important tuning for objectstore is to put the indexes on a fast drive, when I check our current ceph cluster and I see petabytes of read operations but the size of the index pool is 0, so how I can size 1 nvme drive in each server with 4 osd to host the index pool? Also thinking to share the nvme with different partition for journaling and bucket index, but no problem to order 1 more for it. Thank you ________________________________ This message is confidential and is for the sole use of the intended recipient(s). It may also be privileged or otherwise protected by copyright or other legal rules. If you have received it by mistake please let us know by reply email and delete it from your system. It is prohibited to copy this message or disclose its content to anyone. Any confidentiality or privilege is not waived or lost by any mistaken delivery or unauthorized disclosure of the message. All messages sent to and from Agoda may be monitored to ensure compliance with company policies, to protect the company's interests and to remove potential malware. Electronic messages may be intercepted, amended, lost or deleted, or contain viruses.

3 years, 9 months

1
0
0 0

2024

2023

2022

2021

2020

2019

ceph-users July 2020