Dear All,
We are "in a bit of a pickle"...
No reply to my message (23/03/2020), subject "OSD: FAILED
ceph_assert(clone_size.count(clone))"
So I'm presuming it's not possible to recover the crashed OSD.
This is bad news, as one pg may be lost (we are using EC 8+2, and pg dump
shows [NONE,NONE,NONE,388,125,25,427,226,77,154]).
Without this pg we have 1.8PB of broken cephfs.
I could rebuild the cluster from scratch, but this means no user backups
for a couple of weeks.
The cluster has 10 nodes, uses an EC 8+2 pool for cephfs data
(with a replicated NVMe metadata pool) and is running Nautilus 14.2.8.
Clearly, it would be nicer if we could fix the OSD, but if this isn't
possible, can someone confirm that the right procedure to recover from a
corrupt pg is:
1) Stop all client access
2) find all files that store data on the bad pg, with:
# cephfs-data-scan pg_files /backup 5.750 2> /dev/null > /root/bad_files
3) delete all of these bad files - presumably using truncate? or is "rm"
fine?
4) destroy the bad pg
# ceph osd force-create-pg 5.750
5) Copy the missing files back with rsync or similar...
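In case it helps anyone spot a mistake, here is the same recipe sketched end
to end (completely untested; the client mount point and the restore source
path are invented for illustration):
# 1) stop all client access (evict user clients, keep one admin mount at /mnt/cephfs)
# 2) list every file with data on the bad pg
cephfs-data-scan pg_files /backup 5.750 2> /dev/null > /root/bad_files
# 3) remove the damaged files via the admin mount
#    (assuming pg_files prints paths rooted at the top of the filesystem)
while read -r f; do rm -f "/mnt/cephfs${f}"; done < /root/bad_files
# 4) recreate the lost pg, accepting that its data is gone
ceph osd force-create-pg 5.750
# 5) copy the missing files back in from the original source
rsync -a /original/source/backup/ /mnt/cephfs/backup/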
A better "recipe" or other advice gratefully received,
best regards,
Jake
****
Note: I am working from home until further notice.
For help, contact unixadmin(a)mrc-lmb.cam.ac.uk
--
Dr Jake Grimmett
Head Of Scientific Computing
MRC Laboratory of Molecular Biology
Francis Crick Avenue,
Cambridge CB2 0QH, UK.
Phone 01223 267019
Mobile 0776 9886539
hi folks,
if you are upgrading from luminous to octopus, or you plan to do so,
please read on.
in octopus, an OSD will crash if it processes an osdmap whose
require_osd_release flag is still luminous.
this only happens if a cluster upgrades very quickly from luminous to
nautilus and then to octopus. in that case, there is a good chance that an
octopus OSD will need to consume osdmaps which were created back in
luminous. because we assumed that a cluster does not jump across major
releases this quickly, an octopus OSD will panic when it sees an osdmap
created by luminous.
this is a known bug[0], and it is already fixed in master; the next
octopus release will include the fix. as a workaround, you need to wait
a while after running
ceph osd require-osd-release nautilus
and optionally inject lots of osdmaps into the cluster to ensure that the
old luminous osdmaps are trimmed:
for i in `seq 500`; do
    ceph osd blacklist add 192.168.0.1
    ceph osd blacklist rm 192.168.0.1
done
once the whole cluster is active+clean, upgrade to octopus.
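fwiw, a quick sanity check before starting the octopus upgrade might look
like this (the jq field names are from memory of `ceph report` output, so
please verify against your own cluster first):
# make sure the flag has really moved on from luminous
ceph osd dump | grep require_osd_release
# check that the oldest committed osdmap epoch has advanced, i.e. the
# luminous-era maps have been trimmed
ceph report 2>/dev/null | jq '.osdmap_first_committed, .osdmap_last_committed'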
happy upgrading!
cheers,
--
[0] https://tracker.ceph.com/issues/44759
--
Regards
Kefu Chai
Hello, I am trying to initialize the secondary zone by pulling the realm
defined in the primary zone:
radosgw-admin realm pull --rgw-realm=nivola --url=http://10.102.184.190:8080
--access-key=access --secret=secret
The following error appears:
request failed: (16) Device or resource busy
Could you help me, please?
Ignazio
Hi Brett,
> Our concern with Ceph is the cost of having three replicas. Storage
> may be cheap but I’d rather not buy ANOTHER 5pb for a third replica
> if there are ways to do this more efficiently. Site-level redundancy
> is important to us so we can’t simply create an erasure-coded volume
> across two buildings – if we lose power to a building, the entire
> array would become unavailable.
Can you elaborate on that? Why is EC not an option? We have installed
several clusters with two datacenters that are resilient to losing a whole
DC (and additional disks on top, if required). So it's basically a matter
of choosing the right EC profile. Or did I misunderstand something?
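As a rough illustration only (profile name, k/m values and the rule below
are made up for this example, not something to copy verbatim): with k=4,
m=4 and a CRUSH rule that puts four shards into each of the two
datacenters, losing a whole DC still leaves the k shards needed to serve
the data.
# hypothetical EC profile; the CRUSH rule below does the actual placement
ceph osd erasure-code-profile set ec-2dc k=4 m=4 crush-failure-domain=host
# rule added to a decompiled crushmap: pick 2 datacenters, 4 hosts in each
rule ec_two_dc {
        id 10
        type erasure
        min_size 8
        max_size 8
        step set_chooseleaf_tries 5
        step set_choose_tries 100
        step take default
        step choose indep 2 type datacenter
        step chooseleaf indep 4 type host
        step emit
}
The usual caveat: after a DC failure exactly k shards remain, so the pool's
min_size has to be lowered to k for I/O to continue, which temporarily
removes any further redundancy.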
Quoting Brett Randall <brett.randall(a)gmail.com>:
> Hi all
>
> Had a fun time trying to join this list, hopefully you don’t get
> this message 3 times!
>
> On to Ceph… We are looking at setting up our first ever Ceph cluster
> to replace Gluster as our media asset storage and production system.
> The Ceph cluster will have 5pb of usable storage. Whether we use it
> as object-storage, or put CephFS in front of it, is still TBD.
>
> Obviously we’re keen to protect this data well. Our current Gluster
> setup utilises RAID-6 on each of the nodes and then we have a single
> replica of each brick. The Gluster bricks are split between
> buildings so that the replica is guaranteed to be in another
> premises. By doing it this way, we guarantee that we can have a
> decent number of disk or node failures (even an entire building)
> before we lose both connectivity and data.
>
> Our concern with Ceph is the cost of having three replicas. Storage
> may be cheap but I’d rather not buy ANOTHER 5pb for a third replica
> if there are ways to do this more efficiently. Site-level redundancy
> is important to us so we can’t simply create an erasure-coded volume
> across two buildings – if we lose power to a building, the entire
> array would become unavailable. Likewise, we can’t simply have a
> single replica – our fault tolerance would drop way down on what it
> is right now.
>
> Is there a way to use both erasure coding AND replication at the
> same time in Ceph to mimic the architecture we currently have in
> Gluster? I know we COULD just create RAID6 volumes on each node and
> use the entire volume as a single OSD, but that this is not the
> recommended way to use Ceph. So is there some other way?
>
> Apologies if this is a nonsensical question, I’m still trying to
> wrap my head around Ceph, CRUSH maps, placement rules, volume types,
> etc etc!
>
> TIA
>
> Brett
>
> _______________________________________________
> ceph-users mailing list -- ceph-users(a)ceph.io
> To unsubscribe send an email to ceph-users-leave(a)ceph.io
Hi Brett,
I'm far from being an expert, but you may consider rbd-mirroring between EC-pools.
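Very roughly, something like this (pool and image names are placeholders,
and I'm writing the journal-based variant from memory, so please check the
docs before relying on it):
# EC pool holds the data, a small replicated pool ("rbd-meta") holds the
# image headers/journals; the EC pool needs overwrites enabled for rbd
ceph osd pool set ec-data allow_ec_overwrites true
rbd create --size 100G --data-pool ec-data rbd-meta/asset01
rbd feature enable rbd-meta/asset01 journaling
# enable mirroring on the replicated pool and add the remote cluster as a peer
rbd mirror pool enable rbd-meta image
rbd mirror pool peer add rbd-meta client.rbd-mirror-peer@remote
rbd mirror image enable rbd-meta/asset01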
Cheers,
Lars
On Fri, 27 Mar 2020 06:28:02 +0000,
Brett Randall <brett.randall(a)gmail.com> wrote:
> Hi all
>
> Had a fun time trying to join this list, hopefully you don’t get this message 3 times!
>
> On to Ceph… We are looking at setting up our first ever Ceph cluster to replace Gluster as our media asset storage and production system. The Ceph cluster will have 5pb of usable storage. Whether we use it as object-storage, or put CephFS in front of it, is still TBD.
>
> Obviously we’re keen to protect this data well. Our current Gluster setup utilises RAID-6 on each of the nodes and then we have a single replica of each brick. The Gluster bricks are split between buildings so that the replica is guaranteed to be in another premises. By doing it this way, we guarantee that we can have a decent number of disk or node failures (even an entire building) before we lose both connectivity and data.
>
> Our concern with Ceph is the cost of having three replicas. Storage may be cheap but I’d rather not buy ANOTHER 5pb for a third replica if there are ways to do this more efficiently. Site-level redundancy is important to us so we can’t simply create an erasure-coded volume across two buildings – if we lose power to a building, the entire array would become unavailable. Likewise, we can’t simply have a single replica – our fault tolerance would drop way down on what it is right now.
>
> Is there a way to use both erasure coding AND replication at the same time in Ceph to mimic the architecture we currently have in Gluster? I know we COULD just create RAID6 volumes on each node and use the entire volume as a single OSD, but that this is not the recommended way to use Ceph. So is there some other way?
>
> Apologies if this is a nonsensical question, I’m still trying to wrap my head around Ceph, CRUSH maps, placement rules, volume types, etc etc!
>
> TIA
>
> Brett
>
> _______________________________________________
> ceph-users mailing list -- ceph-users(a)ceph.io
> To unsubscribe send an email to ceph-users-leave(a)ceph.io
--
Informationstechnologie
Berlin-Brandenburgische Akademie der Wissenschaften
Jägerstraße 22-23 10117 Berlin
Tel.: +49 30 20370-352 http://www.bbaw.de
Hi all
Had a fun time trying to join this list, hopefully you don’t get this message 3 times!
On to Ceph… We are looking at setting up our first ever Ceph cluster to replace Gluster as our media asset storage and production system. The Ceph cluster will have 5pb of usable storage. Whether we use it as object-storage, or put CephFS in front of it, is still TBD.
Obviously we’re keen to protect this data well. Our current Gluster setup utilises RAID-6 on each of the nodes and then we have a single replica of each brick. The Gluster bricks are split between buildings so that the replica is guaranteed to be in another premises. By doing it this way, we guarantee that we can have a decent number of disk or node failures (even an entire building) before we lose both connectivity and data.
Our concern with Ceph is the cost of having three replicas. Storage may be cheap but I’d rather not buy ANOTHER 5pb for a third replica if there are ways to do this more efficiently. Site-level redundancy is important to us so we can’t simply create an erasure-coded volume across two buildings – if we lose power to a building, the entire array would become unavailable. Likewise, we can’t simply have a single replica – our fault tolerance would drop way down on what it is right now.
Is there a way to use both erasure coding AND replication at the same time in Ceph to mimic the architecture we currently have in Gluster? I know we COULD just create RAID6 volumes on each node and use the entire volume as a single OSD, but that this is not the recommended way to use Ceph. So is there some other way?
Apologies if this is a nonsensical question, I’m still trying to wrap my head around Ceph, CRUSH maps, placement rules, volume types, etc etc!
TIA
Brett
Hello All,
I am going to test rbd mirroring and object storage multisite.
I would like to know which network is used by rbd-mirror (the ceph public
or the cluster network?).
Same question for object storage multisite...
What about firewalls?
What about bandwidth?
Our sites are connected with a 1Gb/s network. How many volumes can I mirror
in this scenario?
Could anyone help me?
Best Regards
Ignazio
Hello,
I am observing non-intuitive results for a performance test using the S3 API to RGW. I am wondering if others have similar experiences or knowledge here.
Our application is using the “if-none-match” header on S3-API requests. This header is set by the application if it already has a copy of the object in question but wishes to check if there is a newer version. If the etag of the current object matches then RGW sends a 304 response, and if it doesn’t it sends the updated content of the object.
We’re observing that the response time of requests resulting in “304 Not Modified” is typically slower than those for normal object retrieval. This wasn’t intuitive to me – in the 304 case there is no content to transfer over the network and I would expect the request can be satisfied just by looking at the RGW index (I was under the impression that metadata including etag is in the index). Anecdotally, HEAD requests see similar results but I haven't yet analysed in full.
Does anyone else have data or experience regarding the expected performance of this scenario? Are there any potential avenues for configuration optimization? What commands can I use to debug this further?
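For reference, this is roughly how I time an individual request pair from the client side (the endpoint and object key are placeholders, and I'm assuming the awscli flags are an acceptable stand-in for what our application actually sends):
# ranged GET of a fresh object, expect 206
time aws s3api get-object --endpoint-url http://rgw.example.com:8080 \
    --bucket albansstack-scsdata --key some/object \
    --range bytes=0-102399 /tmp/obj
# identical request with the cached etag, expect the CLI to report "Not Modified" (304)
time aws s3api get-object --endpoint-url http://rgw.example.com:8080 \
    --bucket albansstack-scsdata --key some/object \
    --if-none-match '"<etag-from-previous-response>"' \
    --range bytes=0-102399 /tmp/obj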
Some details of the current setup:
=> ceph version 14.2.5 (ad5bd132e1492173c85fda2cc863152730b16a92) nautilus (stable)
=> Objects are typically 80-100KB.
=> Versioning is enabled on the bucket.
=> Our requests specify a Range header (hence will generate 206 not 200).
=> Multisite features are enabled.
=> Bucket has 20 shards – I’ve put a dump of "bucket limits" below.
Performance results
Response,          Request Count,  Median,  75th percentile,  90th percentile,  95th percentile,
206 Partial,       20473,          3,       3,                16,               129,              1200
304 Not Modified,  15644,          9,       16,               46,               212,              1192
Bucket details
{
    "bucket": "albansstack-scsdata",
    "tenant": "",
    "num_objects": 465780,
    "num_shards": 20,
    "objects_per_shard": 23289,
    "fill_status": "OK"
},
Many thanks,
Alistair.