Dear All,
We are "in a bit of a pickle"...
No reply to my message (23/03/2020), subject "OSD: FAILED
ceph_assert(clone_size.count(clone))"
So I'm presuming it's not possible to recover the crashed OSD.
This is bad news, as one pg may be lost (we are using EC 8+2, and pg dump
shows [NONE,NONE,NONE,388,125,25,427,226,77,154]).
Without this pg we have 1.8PB of broken cephfs.
I could rebuild the cluster from scratch, but this means no user backups
for a couple of weeks.
The cluster has 10 nodes, uses an EC 8+2 pool for cephfs data
(with a replicated NVMe metadata pool) and is running Nautilus 14.2.8.
Clearly, it would be nicer if we could fix the OSD, but if this isn't
possible, can someone confirm that the right procedure to recover from a
corrupt pg is:
1) Stop all client access
2) find all files that store data on the bad pg, with:
# cephfs-data-scan pg_files /backup 5.750 2> /dev/null > /root/bad_files
3) delete all of these bad files - presumably using truncate? or is "rm"
fine?
4) destroy the bad pg
# ceph osd force-create-pg 5.750
5) Copy the missing files back with rsync or similar...
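In case it helps anyone spot a mistake, here is the same recipe sketched end
to end (completely untested; the client mount point and the restore source
path are invented for illustration):
# 1) stop all client access (evict user clients, keep one admin mount at /mnt/cephfs)
# 2) list every file with data on the bad pg
cephfs-data-scan pg_files /backup 5.750 2> /dev/null > /root/bad_files
# 3) remove the damaged files via the admin mount
#    (assuming pg_files prints paths rooted at the top of the filesystem)
while read -r f; do rm -f "/mnt/cephfs${f}"; done < /root/bad_files
# 4) recreate the lost pg, accepting that its data is gone
ceph osd force-create-pg 5.750
# 5) copy the missing files back in from the original source
rsync -a /original/source/backup/ /mnt/cephfs/backup/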
A better "recipe" or other advice gratefully received,
best regards,
Jake
****
Note: I am working from home until further notice.
For help, contact unixadmin(a)mrc-lmb.cam.ac.uk
--
Dr Jake Grimmett
Head Of Scientific Computing
MRC Laboratory of Molecular Biology
Francis Crick Avenue,
Cambridge CB2 0QH, UK.
Phone 01223 267019
Mobile 0776 9886539
hi folks,
if you are upgrading from luminous to octopus, or you plan to do so,
please read on.
in octopus, an OSD will crash if it processes an osdmap whose
require_osd_release flag is still luminous.
this only happens if a cluster upgrades very quickly from luminous to
nautilus and then to octopus. in that case, there is a good chance that an
octopus OSD will need to consume osdmaps which were created back in
luminous. because we assumed that a cluster does not jump across major
releases this quickly, an octopus OSD will panic when it sees an osdmap
created by luminous.
this is a known bug[0], and it is already fixed in master; the next
octopus release will include the fix. as a workaround, you need to wait
a while after running
ceph osd require-osd-release nautilus
and optionally inject lots of osdmaps into the cluster to ensure that the
old luminous osdmaps are trimmed:
for i in `seq 500`; do
    ceph osd blacklist add 192.168.0.1
    ceph osd blacklist rm 192.168.0.1
done
once the whole cluster is active+clean, upgrade to octopus.
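fwiw, a quick sanity check before starting the octopus upgrade might look
like this (the jq field names are from memory of `ceph report` output, so
please verify against your own cluster first):
# make sure the flag has really moved on from luminous
ceph osd dump | grep require_osd_release
# check that the oldest committed osdmap epoch has advanced, i.e. the
# luminous-era maps have been trimmed
ceph report 2>/dev/null | jq '.osdmap_first_committed, .osdmap_last_committed'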
happy upgrading!
cheers,
--
[0] https://tracker.ceph.com/issues/44759
--
Regards
Kefu Chai
Hello, I am trying to initialize the secondary zone by pulling the realm
defined in the primary zone:
radosgw-admin realm pull --rgw-realm=nivola --url=http://10.102.184.190:8080
--access-key=access --secret=secret
The following error appears:
request failed: (16) Device or resource busy
Could you help me, please?
Ignazio
Hi Brett,
> Our concern with Ceph is the cost of having three replicas. Storage
> may be cheap but I’d rather not buy ANOTHER 5pb for a third replica
> if there are ways to do this more efficiently. Site-level redundancy
> is important to us so we can’t simply create an erasure-coded volume
> across two buildings – if we lose power to a building, the entire
> array would become unavailable.
Can you elaborate on that? Why is EC not an option? We have installed
several clusters with two datacenters that are resilient to losing a whole
DC (and additional disks on top, if required). So it's basically a matter
of choosing the right EC profile. Or did I misunderstand something?
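As a rough illustration only (profile name, k/m values and the rule below
are made up for this example, not something to copy verbatim): with k=4,
m=4 and a CRUSH rule that puts four shards into each of the two
datacenters, losing a whole DC still leaves the k shards needed to serve
the data.
# hypothetical EC profile; the CRUSH rule below does the actual placement
ceph osd erasure-code-profile set ec-2dc k=4 m=4 crush-failure-domain=host
# rule added to a decompiled crushmap: pick 2 datacenters, 4 hosts in each
rule ec_two_dc {
        id 10
        type erasure
        min_size 8
        max_size 8
        step set_chooseleaf_tries 5
        step set_choose_tries 100
        step take default
        step choose indep 2 type datacenter
        step chooseleaf indep 4 type host
        step emit
}
The usual caveat: after a DC failure exactly k shards remain, so the pool's
min_size has to be lowered to k for I/O to continue, which temporarily
removes any further redundancy.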
Quoting Brett Randall <brett.randall(a)gmail.com>:
> Hi all
>
> Had a fun time trying to join this list, hopefully you don’t get
> this message 3 times!
>
> On to Ceph… We are looking at setting up our first ever Ceph cluster
> to replace Gluster as our media asset storage and production system.
> The Ceph cluster will have 5pb of usable storage. Whether we use it
> as object-storage, or put CephFS in front of it, is still TBD.
>
> Obviously we’re keen to protect this data well. Our current Gluster
> setup utilises RAID-6 on each of the nodes and then we have a single
> replica of each brick. The Gluster bricks are split between
> buildings so that the replica is guaranteed to be in another
> premises. By doing it this way, we guarantee that we can have a
> decent number of disk or node failures (even an entire building)
> before we lose both connectivity and data.
>
> Our concern with Ceph is the cost of having three replicas. Storage
> may be cheap but I’d rather not buy ANOTHER 5pb for a third replica
> if there are ways to do this more efficiently. Site-level redundancy
> is important to us so we can’t simply create an erasure-coded volume
> across two buildings – if we lose power to a building, the entire
> array would become unavailable. Likewise, we can’t simply have a
> single replica – our fault tolerance would drop way down on what it
> is right now.
>
> Is there a way to use both erasure coding AND replication at the
> same time in Ceph to mimic the architecture we currently have in
> Gluster? I know we COULD just create RAID6 volumes on each node and
> use the entire volume as a single OSD, but that this is not the
> recommended way to use Ceph. So is there some other way?
>
> Apologies if this is a nonsensical question, I’m still trying to
> wrap my head around Ceph, CRUSH maps, placement rules, volume types,
> etc etc!
>
> TIA
>
> Brett
>
> _______________________________________________
> ceph-users mailing list -- ceph-users(a)ceph.io
> To unsubscribe send an email to ceph-users-leave(a)ceph.io
Hi Brett,
I'm far from being an expert, but you may consider rbd-mirroring between EC-pools.
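Very roughly, something like this (pool and image names are placeholders,
and I'm writing the journal-based variant from memory, so please check the
docs before relying on it):
# EC pool holds the data, a small replicated pool ("rbd-meta") holds the
# image headers/journals; the EC pool needs overwrites enabled for rbd
ceph osd pool set ec-data allow_ec_overwrites true
rbd create --size 100G --data-pool ec-data rbd-meta/asset01
rbd feature enable rbd-meta/asset01 journaling
# enable mirroring on the replicated pool and add the remote cluster as a peer
rbd mirror pool enable rbd-meta image
rbd mirror pool peer add rbd-meta client.rbd-mirror-peer@remote
rbd mirror image enable rbd-meta/asset01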
Cheers,
Lars
On Fri, 27 Mar 2020 06:28:02 +0000,
Brett Randall <brett.randall(a)gmail.com> wrote:
> Hi all
>
> Had a fun time trying to join this list, hopefully you don’t get this message 3 times!
>
> On to Ceph… We are looking at setting up our first ever Ceph cluster to replace Gluster as our media asset storage and production system. The Ceph cluster will have 5pb of usable storage. Whether we use it as object-storage, or put CephFS in front of it, is still TBD.
>
> Obviously we’re keen to protect this data well. Our current Gluster setup utilises RAID-6 on each of the nodes and then we have a single replica of each brick. The Gluster bricks are split between buildings so that the replica is guaranteed to be in another premises. By doing it this way, we guarantee that we can have a decent number of disk or node failures (even an entire building) before we lose both connectivity and data.
>
> Our concern with Ceph is the cost of having three replicas. Storage may be cheap but I’d rather not buy ANOTHER 5pb for a third replica if there are ways to do this more efficiently. Site-level redundancy is important to us so we can’t simply create an erasure-coded volume across two buildings – if we lose power to a building, the entire array would become unavailable. Likewise, we can’t simply have a single replica – our fault tolerance would drop way down on what it is right now.
>
> Is there a way to use both erasure coding AND replication at the same time in Ceph to mimic the architecture we currently have in Gluster? I know we COULD just create RAID6 volumes on each node and use the entire volume as a single OSD, but that this is not the recommended way to use Ceph. So is there some other way?
>
> Apologies if this is a nonsensical question, I’m still trying to wrap my head around Ceph, CRUSH maps, placement rules, volume types, etc etc!
>
> TIA
>
> Brett
>
> _______________________________________________
> ceph-users mailing list -- ceph-users(a)ceph.io
> To unsubscribe send an email to ceph-users-leave(a)ceph.io
--
Informationstechnologie
Berlin-Brandenburgische Akademie der Wissenschaften
Jägerstraße 22-23 10117 Berlin
Tel.: +49 30 20370-352 http://www.bbaw.de
Hi all
Had a fun time trying to join this list, hopefully you don’t get this message 3 times!
On to Ceph… We are looking at setting up our first ever Ceph cluster to replace Gluster as our media asset storage and production system. The Ceph cluster will have 5pb of usable storage. Whether we use it as object-storage, or put CephFS in front of it, is still TBD.
Obviously we’re keen to protect this data well. Our current Gluster setup utilises RAID-6 on each of the nodes and then we have a single replica of each brick. The Gluster bricks are split between buildings so that the replica is guaranteed to be in another premises. By doing it this way, we guarantee that we can have a decent number of disk or node failures (even an entire building) before we lose both connectivity and data.
Our concern with Ceph is the cost of having three replicas. Storage may be cheap but I’d rather not buy ANOTHER 5pb for a third replica if there are ways to do this more efficiently. Site-level redundancy is important to us so we can’t simply create an erasure-coded volume across two buildings – if we lose power to a building, the entire array would become unavailable. Likewise, we can’t simply have a single replica – our fault tolerance would drop way down on what it is right now.
Is there a way to use both erasure coding AND replication at the same time in Ceph to mimic the architecture we currently have in Gluster? I know we COULD just create RAID6 volumes on each node and use the entire volume as a single OSD, but that this is not the recommended way to use Ceph. So is there some other way?
Apologies if this is a nonsensical question, I’m still trying to wrap my head around Ceph, CRUSH maps, placement rules, volume types, etc etc!
TIA
Brett
Hello All,
I am going to test rbd mirroring and object storage multisite.
I would like to know which network is used by rbd-mirror (the ceph public
or the cluster network?).
Same question for object storage multisite...
What about firewalls?
What about bandwidth?
Our sites are connected with a 1Gb/s network. How many volumes can I mirror
in this scenario?
Could anyone help me?
Best Regards
Ignazio
Hello,
I am observing non-intuitive results for a performance test using the S3 API to RGW. I am wondering if others have similar experiences or knowledge here.
Our application is using the “if-none-match” header on S3-API requests. This header is set by the application if it already has a copy of the object in question but wishes to check if there is a newer version. If the etag of the current object matches then RGW sends a 304 response, and if it doesn’t it sends the updated content of the object.
We’re observing that the response time of requests resulting in “304 Not Modified” is typically slower than those for normal object retrieval. This wasn’t intuitive to me – in the 304 case there is no content to transfer over the network and I would expect the request can be satisfied just by looking at the RGW index (I was under the impression that metadata including etag is in the index). Anecdotally, HEAD requests see similar results but I haven't yet analysed in full.
Does anyone else have data or experience regarding the expected performance of this scenario? Are there any potential avenues for configuration optimization? What commands can I use to debug this further?
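For reference, this is roughly how I time an individual request pair from the client side (the endpoint and object key are placeholders, and I'm assuming the awscli flags are an acceptable stand-in for what our application actually sends):
# ranged GET of a fresh object, expect 206
time aws s3api get-object --endpoint-url http://rgw.example.com:8080 \
    --bucket albansstack-scsdata --key some/object \
    --range bytes=0-102399 /tmp/obj
# identical request with the cached etag, expect the CLI to report "Not Modified" (304)
time aws s3api get-object --endpoint-url http://rgw.example.com:8080 \
    --bucket albansstack-scsdata --key some/object \
    --if-none-match '"<etag-from-previous-response>"' \
    --range bytes=0-102399 /tmp/obj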
Some details of the current setup:
=> ceph version 14.2.5 (ad5bd132e1492173c85fda2cc863152730b16a92) nautilus (stable)
=> Objects are typically 80-100KB.
=> Versioning is enabled on the bucket.
=> Our requests specify a Range header (hence will generate 206 not 200).
=> Multisite features are enabled.
=> Bucket has 20 shards – I’ve put a dump of "bucket limits" below.
Performance results
Response,          Request Count,  Median,  75th percentile,  90th percentile,  95th percentile,
206 Partial,       20473,          3,       3,                16,               129,              1200
304 Not Modified,  15644,          9,       16,               46,               212,              1192
Bucket details
{
    "bucket": "albansstack-scsdata",
    "tenant": "",
    "num_objects": 465780,
    "num_shards": 20,
    "objects_per_shard": 23289,
    "fill_status": "OK"
},
Many thanks,
Alistair.