Tonight an old Ceph cluster we run suffered a hardware failure that
resulted in the loss of Ceph journal SSDs on 7 nodes out of 36. Overview
of this old setup:
- Super-old Ceph Dumpling v0.67
- 3x replication for RBD w/ 3 failure domains in replication hierarchy
- OSDs on XFS on spinning disks with Journals on SSD
In total we lost 7 SSDs hosting journals for 21 OSDs (3 each). The lost
nodes span all three failure domains, which makes me nervous that there
are likely missing Placement Groups in the pool. Given how Ceph shards
data across the Placement Groups, I'm concerned I may have lost all the
RBD volumes in this pool.
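Before touching anything, I've been trying to enumerate the at-risk PGs
with the commands below (Dumpling-era syntax, so treat this as a rough
sketch):
$ ceph health detail | grep -Ei 'incomplete|down|stale'
$ ceph pg dump_stuck inactive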
The obvious solution is to attempt to bring the OSDs back online (for at
least one failure domain) to ensure there is at least one complete copy
of the data, then rebuild everything else. The issue is that I lost the
journals when the SSDs died.
I don't see much published about recovering OSDs in the event of a lost
journal except:
https://ceph.io/geen-categorie/ceph-recover-osds-after-ssd-journal-failure/
And that doesn't mention whether the data is valid afterwards. I seem to
recall that Inktank used to deal with this situation and may have had a
solution. At this point, I'll take any constructive advice.
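For concreteness, my reading of that post boils down to something like
the following (a sketch I have not verified on Dumpling;
<new-journal-partition> is a placeholder, and any writes acked by the
lost journal but not yet flushed to the filestore are gone for good,
which is exactly why I doubt the data afterwards):
$ service ceph stop osd.N
$ ln -sf /dev/disk/by-partuuid/<new-journal-partition> /var/lib/ceph/osd/ceph-N/journal
$ ceph-osd -i N --mkjournal    # write a fresh, empty journal
$ service ceph start osd.N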
Thank you in advance,
Mike
On Thu, Jul 9, 2020 at 10:33 AM Marc Roos <M.Roos(a)f1-outsourcing.eu> wrote:
>
> What about ntfs? There you have a non-quick (full) format option. Maybe
> it writes some random pattern to the whole disk. Why do you ask?
>
I am writing an API layer to plug into our platform, so I want to know
whether format times are deterministic or unbounded. From what I saw with
ext3, ext4, and xfs volumes, the format time is actually not dependent on
the size of the volume, so I just wanted to confirm whether we can assume
that, or whether I am missing something.
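For reference, this is roughly how I've been checking the allocation (a
minimal sketch, assuming a pool and image named rbd/test):
$ rbd create rbd/test --size 1T    # thin provisioned; nothing allocated yet
$ rbd map rbd/test
$ mkfs.xfs /dev/rbd0
$ rbd du rbd/test    # 'used' stays tiny; only metadata extents get allocated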
Thanks,
Shridhar
>
>
> -----Original Message-----
> Cc: ceph-users
> Subject: [ceph-users] Re: RBD thin provisioning and time to format a
> volume
>
> Thanks Jason.
>
> Do you mean to say some filesystems will initialize the entire disk
> during format? Does that mean we will see the entire size of the volume
> getting allocated during formatting?
> Or do you mean that some filesystem formats just take longer than
> others because they do more initialization?
>
> I am just trying to understand whether there are cases where Ceph will
> allocate all the blocks for a filesystem during format operations, or
> whether they continue to be thin provisioned (allocated as you go based
> on real data). So far I have tried ext3, ext4, and xfs, and none of them
> allocates all the blocks during format.
>
> -Shridhar
>
>
> On Thu, 9 Jul 2020 at 06:58, Jason Dillaman <jdillama(a)redhat.com> wrote:
>
> > On Thu, Jul 9, 2020 at 12:02 AM Void Star Nill
> > <void.star.nill(a)gmail.com>
> > wrote:
> > >
> > >
> > >
> > > On Wed, Jul 8, 2020 at 4:56 PM Jason Dillaman <jdillama(a)redhat.com>
> > > wrote:
> > >>
> > >> On Wed, Jul 8, 2020 at 3:28 PM Void Star Nill
> > >> <void.star.nill(a)gmail.com> wrote:
> > >> >
> > >> > Hello,
> > >> >
> > >> > My understanding is that the time to format an RBD volume is not
> > >> > dependent on its size as the RBD volumes are thin provisioned. Is
> > >> > this correct?
> > >> >
> > >> > For example, formatting a 1G volume should take almost the same
> > >> > time as formatting a 1TB volume - although accounting for
> > >> > differences in latencies due to load on the Ceph cluster. Is that
> > >> > a fair assumption?
> > >>
> > >> Yes, that is a fair comparison when creating the RBD image.
> > >> However, a format operation might initialize and discard extents on
> > >> the disk, so a larger disk will take longer to format.
> > >
> > >
> > > Thanks for the response, Jason. Could you please explain a bit more
> > > about the format operation?
> >
> > I'm not sure what else there is to explain. When you create a file
> > system on top of any block device, it needs to initialize the block
> > device. Depending on the file system, it might take more time for
> > larger block devices because it's doing more work.
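> > For illustration (a rough sketch; mkfs defaults vary by distro):
> > mkfs.ext4 defers inode table initialization by default, so it runs in
> > near-constant time, while forcing full initialization makes the time
> > grow with the device size:
> >
> > $ mkfs.ext4 /dev/rbd0    # lazy init, near-constant time
> > $ mkfs.ext4 -E lazy_itable_init=0,lazy_journal_init=0 /dev/rbd0    # full init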
> >
> > > Is there a relative time that we can determine based on the volume
> > > size?
> > >
> > > Thanks
> > > Shridhar
> > >
> > >
> > >>
> > >>
> > >> > Thanks,
> > >> > Shridhar
> > >> > _______________________________________________
> > >> > ceph-users mailing list -- ceph-users(a)ceph.io
> > >> > To unsubscribe send an email to ceph-users-leave(a)ceph.io
> > >> >
> > >>
> > >>
> > >> --
> > >> Jason
> > >>
> >
> >
> > --
> > Jason
> >
> >
> _______________________________________________
> ceph-users mailing list -- ceph-users(a)ceph.io
> To unsubscribe send an email to ceph-users-leave(a)ceph.io
>
>
>
Hi,
For this post:
https://ceph.io/community/bluestore-default-vs-tuned-performance-comparison/
I don't see a way to contact the authors so I thought I would try here.
Does anyone know how the rocksdb tuning parameters of:
"
bluestore_rocksdb_options =
compression=kNoCompression,max_write_buffer_number=32,min_write_buffer_number_to_merge=2,recycle_log_file_num=32,compaction_style=kCompactionStyleLevel,write_buffer_size=67108864,target_file_size_base=67108864,max_background_compactions=31,level0_file_num_compaction_trigger=8,level0_slowdown_writes_trigger=32,level0_stop_writes_trigger=64,max_bytes_for_level_base=536870912,compaction_threads=32,max_bytes_for_level_multiplier=8,flusher_threads=8,compaction_readahead_size=2MB
"
were chosen?
Some of the settings seem not to be in line with the RocksDB Tuning Guide:
https://github.com/facebook/rocksdb/wiki/RocksDB-Tuning-Guide
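For context, I'm planning to apply the same string and verify it took
effect, along these lines (a sketch; osd.0 is just an example, and the
option value is the string quoted above):
# ceph.conf on the OSD hosts, followed by an OSD restart
[osd]
bluestore_rocksdb_options = <string above>
# confirm the running value via the admin socket
$ ceph daemon osd.0 config get bluestore_rocksdb_options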
thx
Frank
Hello,
My understanding is that the time to format an RBD volume is not dependent
on its size as the RBD volumes are thin provisioned. Is this correct?
For example, formatting a 1G volume should take almost the same time as
formatting a 1TB volume - although accounting for differences in latencies
due to load on the Ceph cluster. Is that a fair assumption?
Thanks,
Shridhar
Hi all,
We're seeing a problem in our multisite Ceph deployment, where bilogs aren't being trimmed for several buckets. This is causing bilogs to accumulate over time, leading to large OMAP object warnings for the indexes on these buckets.
In every case, Ceph reports that the bucket is in sync and the data is consistent across both sites, so we're perplexed as to why the logs aren't being trimmed. It's not affecting all of our buckets, and we're not sure what's 'different' about the affected cases that causes them to accumulate. We're seeing this in both unsharded and sharded buckets. Some buckets with heavy activity (lots of object updates) have accumulated millions of bilogs, but this does not affect all of our very active buckets.
I've tried running 'radosgw-admin bilog autotrim' against an affected bucket, and it doesn't appear to do anything. I've used 'radosgw-admin bilog trim' with a suitable 'end-marker' to trim all of the bilogs, but the implications of doing this aren't clear to me, and the logs continue to accumulate afterwards.
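For reference, the manual trim was along these lines (end-marker derived from max_marker in 'bucket stats' with the shard prefix stripped; treat the exact marker format as my assumption):
$ radosgw-admin bilog autotrim --bucket edin2z6-sharedconfig
$ radosgw-admin bilog trim --bucket edin2z6-sharedconfig --end-marker 00001622675.2115836.5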
We're running Ceph Nautilus 14.2.9 with the services in containers; we have 3 hosts on each site with 1 OSD on each host.
We're adjusting a fairly minimal set of config options, and I don't think it includes anything that would affect bilog trimming. Checking the running config on the mon service, I think these defaulted parameters are relevant:
"rgw_sync_log_trim_concurrent_buckets": "4",
"rgw_sync_log_trim_interval": "1200",
"rgw_sync_log_trim_max_buckets": "16",
"rgw_sync_log_trim_min_cold_buckets": "4",
I can't find any documentation on these parameters, but we have more than 16 buckets, so is it possible that some buckets are just never being selected for trimming?
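If so, one thing we're considering is raising the caps and watching whether trimming catches up, e.g. (hedged; assumes these options are picked up from the mon config store, and that the rgw daemons need a restart afterwards):
$ ceph config set client.rgw rgw_sync_log_trim_max_buckets 64
$ ceph config set client.rgw rgw_sync_log_trim_concurrent_buckets 8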
Any other ideas as to what might be causing this, or anything else we could try to help diagnose or fix this? Thanks in advance!
I've included an example below for one such affected bucket, showing its current state. Zone details (as per 'radosgw-admin zonegroup get') are at the bottom.
$ radosgw-admin bucket sync status --bucket=edin2z6-sharedconfig
realm b7f31089-0879-4fa2-9cbc-cfdf5f866a35 (geored_realm)
zonegroup 5d74eb0e-5d99-481f-ae33-43483f6cebc0 (geored_zg)
zone c48f33ad-6d79-4b9f-a22f-78589f67526e (siteA)
bucket edin2z6-sharedconfig[033709fc-924a-4582-b00d-97c90e9e61b6.3634407.1]
source zone 0a3c29b7-1a2c-432d-979b-d324a05cc831 (siteApubsub)
full sync: 0/1 shards
incremental sync: 0/1 shards
bucket is caught up with source
source zone 9f5fba56-4a32-46a6-8695-89253be81614 (siteB)
full sync: 0/1 shards
incremental sync: 1/1 shards
bucket is caught up with source
source zone c72b3aa8-a051-4665-9421-909510702412 (siteBpubsub)
full sync: 0/1 shards
incremental sync: 0/1 shards
bucket is caught up with source
$ radosgw-admin bilog list --bucket edin2z6-sharedconfig --max-entries 600000000 | grep op_id | wc -l
1299392
$ rados -p siteA.rgw.buckets.index listomapkeys .dir.033709fc-924a-4582-b00d-97c90e9e61b6.3634407.1 | wc -l
1299083
$ radosgw-admin bucket stats --bucket=edin2z6-sharedconfig
{
"bucket": "edin2z6-sharedconfig",
"num_shards": 0,
"tenant": "",
"zonegroup": "5d74eb0e-5d99-481f-ae33-43483f6cebc0",
"placement_rule": "default-placement",
"explicit_placement": {
"data_pool": "",
"data_extra_pool": "",
"index_pool": ""
},
"id": "033709fc-924a-4582-b00d-97c90e9e61b6.3634407.1",
"marker": "033709fc-924a-4582-b00d-97c90e9e61b6.3634407.1",
"index_type": "Normal",
"owner": "edin2z6",
"ver": "0#1622676",
"master_ver": "0#0",
"mtime": "2020-01-14 14:30:18.606142Z",
"max_marker": "0#00001622675.2115836.5",
"usage": {
"rgw.main": {
"size": 15209,
"size_actual": 40960,
"size_utilized": 15209,
"size_kb": 15,
"size_kb_actual": 40,
"size_kb_utilized": 15,
"num_objects": 7
}
},
"bucket_quota": {
"enabled": false,
"check_on_raw": false,
"max_size": -1,
"max_size_kb": 0,
"max_objects": -1
}
}
$ radosgw-admin bucket limit check
...
{
"bucket": "edin2z6-sharedconfig",
"tenant": "",
"num_objects": 7,
"num_shards": 0,
"objects_per_shard": 7,
"fill_status": "OK"
},
...
$ radosgw-admin zonegroup get
++ sudo docker ps --filter name=ceph-rgw-.*rgw -q
++ sudo docker exec d2c999b1f3f8 radosgw-admin
{
"id": "5d74eb0e-5d99-481f-ae33-43483f6cebc0",
"name": "geored_zg",
"api_name": "geored_zg",
"is_master": "true",
"endpoints": [
"https://10.254.2.93:7480"
],
"hostnames": [],
"hostnames_s3website": [],
"master_zone": "c48f33ad-6d79-4b9f-a22f-78589f67526e",
"zones": [
{
"id": "0a3c29b7-1a2c-432d-979b-d324a05cc831",
"name": "siteApubsub",
"endpoints": [
"https://10.254.2.93:7481",
"https://10.254.2.94:7481",
"https://10.254.2.95:7481"
],
"log_meta": "false",
"log_data": "true",
"bucket_index_max_shards": 0,
"read_only": "false",
"tier_type": "pubsub",
"sync_from_all": "false",
"sync_from": [
"siteA"
],
"redirect_zone": ""
},
{
"id": "9f5fba56-4a32-46a6-8695-89253be81614",
"name": "siteB",
"endpoints": [
"https://10.254.2.224:7480",
"https://10.254.2.225:7480",
"https://10.254.2.226:7480"
],
"log_meta": "false",
"log_data": "true",
"bucket_index_max_shards": 0,
"read_only": "false",
"tier_type": "",
"sync_from_all": "true",
"sync_from": [],
"redirect_zone": ""
},
{
"id": "c48f33ad-6d79-4b9f-a22f-78589f67526e",
"name": "siteA",
"endpoints": [
"https://10.254.2.93:7480",
"https://10.254.2.94:7480",
"https://10.254.2.95:7480"
],
"log_meta": "false",
"log_data": "true",
"bucket_index_max_shards": 0,
"read_only": "false",
"tier_type": "",
"sync_from_all": "true",
"sync_from": [],
"redirect_zone": ""
},
{
"id": "c72b3aa8-a051-4665-9421-909510702412",
"name": "siteBpubsub",
"endpoints": [
"https://10.254.2.224:7481",
"https://10.254.2.225:7481",
"https://10.254.2.226:7481"
],
"log_meta": "false",
"log_data": "true",
"bucket_index_max_shards": 0,
"read_only": "false",
"tier_type": "pubsub",
"sync_from_all": "false",
"sync_from": [
"siteB"
],
"redirect_zone": ""
}
],
"placement_targets": [
{
"name": "default-placement",
"tags": [],
"storage_classes": [
"STANDARD"
]
}
],
"default_placement": "default-placement",
"realm_id": "b7f31089-0879-4fa2-9cbc-cfdf5f866a35"
}
Hi all.
Are there any docs related to the default.rgw.data.root pool? I have this
pool, and there are no objects in the default.rgw.meta pool.
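For what it's worth, this is how I checked (note that plain 'rados ls'
only lists the default namespace, and my understanding - which may be
wrong - is that the meta pool keeps its objects in namespaces, so --all
may be needed):
$ rados -p default.rgw.data.root ls | head
$ rados -p default.rgw.meta ls --all | head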
Thanks for your help.
Hello,
Can someone explain object store indexing to me a bit? It's not really clear: Red Hat says one of the important tunings for the object store is to put the indexes on a fast drive, but when I check our current Ceph cluster I see petabytes of read operations while the size of the index pool is 0. So how can I size one NVMe drive in each server (with 4 OSDs) to host the index pool? I'm also thinking of sharing the NVMe, with separate partitions for journaling and the bucket index, but it's no problem to order one more for it.
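For reference, this is what I've been looking at so far (a sketch; my understanding - which may be wrong - is that bucket index data lives in omap, so the pool can show ~0 bytes even when the index is large):
$ ceph df    # index pool size looks like 0 because omap isn't counted here
$ ceph osd df    # on recent releases, the OMAP column shows real omap usage per OSD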
Thank you