Hi all,
I'm planning to upgrade one of my Ceph clusters, currently on Luminous
12.2.13 / Debian Stretch (fully updated).
On this cluster, Luminous is packaged from the official Ceph repo (deb
https://download.ceph.com/debian-luminous/ stretch main)
I would like to upgrade it to Debian Buster and Nautilus using the
croit.io repository (deb https://mirror.croit.io/debian-nautilus/ buster
main)
I have already prepared a step-by-step procedure, but I want to verify
one point regarding the upgrade of the Ceph packages: do I upgrade Ceph
at the same time as Debian, or do I upgrade Ceph after the Debian
upgrade from Stretch to Buster?
1) In the first case (upgrade Debian and Ceph at the same time):
* Replace stretch with buster in /etc/apt/sources.list
* Replace the ceph.list repo with the croit.io one
* Upgrade the entire node in one pass
2) In the second case (upgrade Debian first, then Ceph; a rough command
sketch follows this list):
* Replace stretch with buster in /etc/apt/sources.list
* Keep /etc/apt/sources.list.d/ceph.list as it is
* Upgrade and reboot the nodes
* Replace the ceph.list file with the croit.io one
* Upgrade the ceph packages
* Restart the Ceph services (in the right order: MON -> MGR -> OSD -> MDS)
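For what it's worth, here is a minimal sketch of what option 2 could
look like on a single node. It assumes the repo lines quoted above and
that the croit.io repository key is already installed; it is only a
sketch, not a validated procedure:

sed -i 's/stretch/buster/g' /etc/apt/sources.list
apt update && apt full-upgrade -y        # Debian Stretch -> Buster
reboot

# after the reboot, switch the Ceph repo and upgrade the Ceph packages
echo "deb https://mirror.croit.io/debian-nautilus/ buster main" > /etc/apt/sources.list.d/ceph.list
apt update && apt full-upgrade -y        # Luminous -> Nautilus packages

# restart the daemons in order, node by node: MON -> MGR -> OSD -> MDS
systemctl restart ceph-mon.target
systemctl restart ceph-mgr.target
systemctl restart ceph-osd.target
systemctl restart ceph-mds.target

# (the Nautilus release notes also mention running
#  "ceph osd require-osd-release nautilus" once every OSD has been upgraded)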
Thanks a lot for your advice.
Regards,
Hervé
Hi all,
During a disaster recovery we destroyed a few OSDs and reinstalled them
under different ids.
For example, osd.3 was destroyed and recreated as osd.101 with the
commands ceph osd purge 3 --yes-i-really-mean-it, then ceph osd create
(to block id 3), then ceph-deploy osd create --data /dev/sdxx <server>,
and finally ceph osd rm 3.
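Spelled out, the per-OSD sequence we ran was roughly the following
(osd.3 and /dev/sdxx stand for the actual id and device in each case):

ceph osd purge 3 --yes-i-really-mean-it           # remove osd.3 completely
ceph osd create                                   # reserve the freed id so 3 stays blocked
ceph-deploy osd create --data /dev/sdxx <server>  # recreate the OSD (it came back as osd.101)
ceph osd rm 3                                     # drop the placeholder entry for id 3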
Some of our PGs are now incomplete (which is understandable), but they
are blocked by some of the removed OSDs.
For example, here is part of the output of ceph pg 30.3 query:
{
"state": "incomplete",
"snap_trimq": "[]",
"snap_trimq_len": 0,
"epoch": 384075,
"up": [
103,
43,
29,
2,
66
],
"acting": [
103,
43,
29,
2,
66
],
........
"peer_info": [
{
"peer": "2(3)",
"pgid": "30.3s3",
"last_update": "373570'105925965",
"last_complete": "373570'105925965",
.......
},
"up": [
103,
43,
29,
2,
66
],
"acting": [
103,
43,
29,
2,
66
],
"avail_no_missing": [],
"object_location_counts": [],
*"blocked_by": [**
** 3,**
** 49**
** ],*
............
"down_osds_we_would_probe": [
3
],
"peering_blocked_by": [],
"peering_blocked_by_detail": [
{
* "detail": "peering_blocked_by_history_les_bound"*
}
]
I don't understand why the removed OSDs are still referenced in the PG
info.
Is there a way to get rid of that?
Moreover, we have tons of slow ops (more than 15,000), but I guess the
two problems are linked.
Thanks for your help.
F.
Do I misunderstand this script, or does it not _quite_ do what’s desired here?
I fully get the scenario of applying a full-cluster map to allow incremental topology changes.
To be clear, if this is run to effectively freeze backfill during / following a traumatic event, it will freeze that adapted state, not strictly return to the pre-event state? And thus the pg-upmap balancer would still need to be run to revert to the prior state? And this would also hold true for a failed/replaced OSD?
> On May 1, 2020, at 7:37 AM, Dylan McCulloch <dmc(a)unimelb.edu.au> wrote:
>
> Thanks Dan, that looks like a really neat method & script for a few use-cases. We've actually used several of the scripts in that repo over the years, so, many thanks for sharing.
>
> That method will definitely help in the scenario in which a set of unnecessary pg remaps have been triggered and can be caught early and reverted. I'm still a little concerned about the possibility of, for example, a brief network glitch occurring at night and then waking up to a full unbalanced cluster. Especially with NVMe clusters that can rapidly remap and rebalance (and for which we also have a greater impetus to squeeze out as much available capacity as possible with upmap due to cost per TB). It's just a risk I hadn't previously considered and was wondering if others have either run into it or felt any need to plan around it.
>
> Cheers,
> Dylan
>
>
>> From: Dan van der Ster <dan(a)vanderster.com>
>> Sent: Friday, 1 May 2020 5:53 PM
>> To: Dylan McCulloch <dmc(a)unimelb.edu.au>
>> Cc: ceph-users <ceph-users(a)ceph.io>
>>
>> Subject: Re: [ceph-users] upmap balancer and consequences of osds briefly marked out
>>
>> Hi,
>>
>> You're correct that all the relevant upmap entries are removed when an
>> OSD is marked out.
>> You can try to use this script which will recreate them and get the
>> cluster back to HEALTH_OK quickly:
>> https://github.com/cernceph/ceph-scripts/blob/master/tools/upmap/upmap-rema…
>>
>> Cheers, Dan
>>
>>
>> On Fri, May 1, 2020 at 9:36 AM Dylan McCulloch <dmc(a)unimelb.edu.au> wrote:
>>>
>>> Hi all,
>>>
>>> We're using upmap balancer which has made a huge improvement in evenly distributing data on our osds and has provided a substantial increase in usable capacity.
>>>
>>> Currently on ceph version: 12.2.13 luminous
>>>
>>> We ran into a firewall issue recently which led to a large number of osds being briefly marked 'down' & 'out'. The osds came back 'up' & 'in' after about 25 mins and the cluster was fine but had to perform a significant amount of backfilling/recovery despite
>>> there being no end-user client I/O during that period.
>>>
>>> Presumably the large number of remapped pgs and backfills were due to pg_upmap_items being removed from the osdmap when osds were marked out and subsequently those pgs were redistributed using the default crush algorithm.
>>> As a result of the brief outage our cluster became significantly imbalanced again with several osds very close to full.
>>> Is there any reasonable mitigation for that scenario?
>>>
>>> The auto-balancer will not perform optimizations while there are degraded pgs, so it would only start reapplying pg upmap exceptions after initial recovery is complete (at which point capacity may be dangerously reduced).
>>> Similarly, as admins, we normally only apply changes when the cluster is in a healthy state, but if the same issue were to occur again would it be advisable to manually apply balancer plans while initial recovery is still taking place?
>>>
>>> I guess my concern from this experience is that making use of the capacity gained by using upmap balancer appears to carry some risk. i.e. it's possible for a brief outage to remove those space efficiencies relatively quickly and potentially result in full
>>> osds/cluster before the automatic balancer is able to resume and redistribute pgs using upmap.
>>>
>>> Curious whether others have any thoughts or experience regarding this.
>>>
>>> Cheers,
>>> Dylan
Hi Dylan,
The backfillfull_ratio, which defaults to 0.9, prevents backfilling
into an osd which is getting too full.
So the worst-case scenario is that some OSDs in your cluster get up to
90% full, after which the upmap balancer should start putting things
back into place.
Also, check that your "mon osd down out subtree limit" is set
appropriately for your cluster. In our case, we set it to "host" -- we
don't want to automatically "out" all the osds from an entire host,
because this is normally something that we can quickly fix with a
manual intervention.
But I fear that wouldn't have helped in your case, because the
firewall issue probably downed a random subset of osds from several
hosts all at once.
We've also had this happen a couple of times, and now set "mon osd down
out interval = 3600" so that we have time to notice the network outage
and set noout on the cluster to prevent lots of rebalancing carnage.
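For reference, a rough sketch of those settings (on Luminous these
would typically live in ceph.conf on the mons; the commands below are
just how we'd check and set things, adapt as needed):

# ceph.conf, [mon] section
mon osd down out subtree limit = host
mon osd down out interval = 3600

# backfillfull ratio (defaults to 0.9); check and adjust if needed
ceph osd dump | grep -E 'full_ratio|backfillfull_ratio'
ceph osd set-backfillfull-ratio 0.9

# during a known network/firewall incident, prevent automatic out-marking
ceph osd set noout
ceph osd unset noout   # once things are back to normal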
Hope it helps,
Dan
On Fri, May 1, 2020 at 4:37 PM Dylan McCulloch <dmc(a)unimelb.edu.au> wrote:
>
> Thanks Dan, that looks like a really neat method & script for a few use-cases. We've actually used several of the scripts in that repo over the years, so, many thanks for sharing.
>
> That method will definitely help in the scenario in which a set of unnecessary pg remaps have been triggered and can be caught early and reverted. I'm still a little concerned about the possibility of, for example, a brief network glitch occurring at night and then waking up to a full unbalanced cluster. Especially with NVMe clusters that can rapidly remap and rebalance (and for which we also have a greater impetus to squeeze out as much available capacity as possible with upmap due to cost per TB). It's just a risk I hadn't previously considered and was wondering if others have either run into it or felt any need to plan around it.
>
> Cheers,
> Dylan
>
>
> >From: Dan van der Ster <dan(a)vanderster.com>
> >Sent: Friday, 1 May 2020 5:53 PM
> >To: Dylan McCulloch <dmc(a)unimelb.edu.au>
> >Cc: ceph-users <ceph-users(a)ceph.io>
> >
> >Subject: Re: [ceph-users] upmap balancer and consequences of osds briefly marked out
> >
> >Hi,
> >
> >You're correct that all the relevant upmap entries are removed when an
> >OSD is marked out.
> >You can try to use this script which will recreate them and get the
> >cluster back to HEALTH_OK quickly:
> >https://github.com/cernceph/ceph-scripts/blob/master/tools/upmap/upmap-rema…
> >
> >Cheers, Dan
> >
> >
> >On Fri, May 1, 2020 at 9:36 AM Dylan McCulloch <dmc(a)unimelb.edu.au> wrote:
> >>
> >> Hi all,
> >>
> >> We're using upmap balancer which has made a huge improvement in evenly distributing data on our osds and has provided a substantial increase in usable capacity.
> >>
> >> Currently on ceph version: 12.2.13 luminous
> >>
> >> We ran into a firewall issue recently which led to a large number of osds being briefly marked 'down' & 'out'. The osds came back 'up' & 'in' after about 25 mins and the cluster was fine but had to perform a significant amount of backfilling/recovery despite
> >> there being no end-user client I/O during that period.
> >>
> >> Presumably the large number of remapped pgs and backfills were due to pg_upmap_items being removed from the osdmap when osds were marked out and subsequently those pgs were redistributed using the default crush algorithm.
> >> As a result of the brief outage our cluster became significantly imbalanced again with several osds very close to full.
> >> Is there any reasonable mitigation for that scenario?
> >>
> >> The auto-balancer will not perform optimizations while there are degraded pgs, so it would only start reapplying pg upmap exceptions after initial recovery is complete (at which point capacity may be dangerously reduced).
> >> Similarly, as admins, we normally only apply changes when the cluster is in a healthy state, but if the same issue were to occur again would it be advisable to manually apply balancer plans while initial recovery is still taking place?
> >>
> >> I guess my concern from this experience is that making use of the capacity gained by using upmap balancer appears to carry some risk. i.e. it's possible for a brief outage to remove those space efficiencies relatively quickly and potentially result in full
> >> osds/cluster before the automatic balancer is able to resume and redistribute pgs using upmap.
> >>
> >> Curious whether others have any thoughts or experience regarding this.
> >>
> >> Cheers,
> >> Dylan
>
Hello,
We use only CephFS in our Ceph cluster (version 14.2.7). The CephFS
data pool is an EC pool (k=4, m=2) on hdd OSDs using bluestore. The
default file layout (i.e. 4MB object size) is used.
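(For what it's worth, the layout can be confirmed on any file in the
filesystem; the mount point and file name below are just placeholders:)

getfattr -n ceph.file.layout /mnt/cephfs/some_file
# expected output along these lines:
# ceph.file.layout="stripe_unit=4194304 stripe_count=1 object_size=4194304 pool=cephfs-data"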
We see the following output of ceph df:
---
RAW STORAGE:
    CLASS     SIZE        AVAIL       USED        RAW USED     %RAW USED
    hdd       951 TiB     888 TiB     63 TiB      63 TiB            6.58
    ssd       9.6 TiB     9.6 TiB     1.4 GiB     16 GiB            0.17
    TOTAL     961 TiB     898 TiB     63 TiB      63 TiB            6.52

POOLS:
    POOL                ID     STORED      OBJECTS     USED        %USED     MAX AVAIL
    cephfs-data          2     34 TiB      12.51M      52 TiB       5.93       553 TiB
    cephfs-metadata      4     994 MiB     98.61k      1.5 GiB      0.02       3.0 TiB
---
What caught my attention is the discrepancy between the reported
"USED" (52 TiB) and "STORED" (34 TiB) sizes on the cephfs-data pool.
According to this document (
https://docs.ceph.com/docs/master/releases/nautilus/#upgrade-compatibility-…
):
- "USED" represents the amount of space allocated purely for data by
all OSD nodes, in KB
- "STORED" represents the amount of data stored by the user.
My understanding is that the "USED" size can be roughly taken as the
number of objects (12.51M) times the object size (4 MB) of the file
layout; and since there are many files smaller than 4 MB in our
system, the actual stored data is less.
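As a quick back-of-the-envelope check (nothing authoritative; the
second line only notes that plain k=4,m=2 EC overhead on the STORED
figure lands in the same ballpark):

# 12.51M objects x 4 MiB each, expressed in TiB
python3 -c "print(12.51e6 * 4 / 1024**2)"   # ~47.7 TiB from the object count
# STORED x (k+m)/k for the EC pool, in TiB
python3 -c "print(34 * 6 / 4)"              # ~51 TiB, close to the reported 52 TiB USED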
Is my interpretation correct? If so, does it mean that we will be
wasting a lot of space when we have a lot of files smaller than the
4 MB object size? Thanks for the help!
Cheers, Hong
--
Hurng-Chun (Hong) Lee, PhD
ICT manager
Donders Institute for Brain, Cognition and Behaviour,
Centre for Cognitive Neuroimaging
Radboud University Nijmegen
e-mail: h.lee(a)donders.ru.nl
tel: +31(0) 243610977
web: http://www.ru.nl/donders/
Hello,
Hoping you can help me.
Ceph had been largely problem free for us for the better part of a year.
We have a high file count in a single CephFS filesystem, and are seeing
this error in the logs:
/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/gigantic/release/14.2.9/rpm/el7/BUILD/ceph-14.2.9/src/mds/OpenFileTable.cc:
777: FAILED ceph_assert(omap_num_objs == num_objs)
The issue seemed to start this morning, and restarting the MDS as well as
rebooting the servers doesn't correct the problem.
Not really sure where to look next, as the MDS daemons keep crashing.
Appreciate any help you can provide
Marco
Hi,
We had a major crash which ended with ~1/3 of our OSDs down.
Trying to fix it, we reinstalled a few of the down OSDs (that was a
mistake, I agree) and destroyed the data on them.
Finally, we could fix the problem (thanks to Igor Fedotov) and restart
almost all of our OSDs, except one whose rocksdb seems corrupted
(at least one file).
Unfortunately, we now have 4 PGs down (all involving the dead OSD) and 8
incomplete PGs (some of them also involving the down OSD).
Before accepting data loss, we would like to try to restart the down
OSDs, hoping to recover the down PGs and maybe some of the incomplete ones.
Does someone have an idea how to do that (maybe by removing the file
that corrupts the rocksdb, or by forcing it to ignore the bad data)?
If that is not possible, how can we fix (even with data loss) the down
and incomplete PGs?
Thanks for your advice.
F.
Hi,
I have installed Ceph on Ubuntu Focal Fossa using the Ubuntu repo
instead of ceph-deploy (as 'ceph-deploy install' does not work for
Focal Fossa yet). Instead I ran:
sudo apt-get install -y ceph ceph-mds radosgw ceph-mgr-dashboard
The rest of the setup was the same as the quickstart on ceph.io <http://ceph.io/> with ceph-deploy.
It installed ceph version 15.2.1 (octopus).
If I do a 'ceph -s' I get the warning:
health: HEALTH_WARN
2 mgr modules have failed dependencies
If I run 'ceph mgr module ls', for enabled and active modules I get:
"always_on_modules": [
"balancer",
"crash",
"devicehealth",
"orchestrator",
"osd_support",
"pg_autoscaler",
"progress",
"rbd_support",
"status",
"telemetry",
"volumes"
],
"enabled_modules": [
"iostat",
"restful”
Then when I run 'ceph mgr module enable dashboard' I get the error:
Error ENOENT: module 'dashboard' reports that it cannot run on the active manager daemon: No module named 'yaml' (pass --force to force enablement)
I have tried searching, including searching with apt, but cannot find any 'yaml' package that might be used by Ceph.
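For reference, this is roughly what I have been checking; python3-yaml
is only my guess at the Ubuntu package that might provide the module,
I have not confirmed it is the one ceph-mgr actually needs:

# does the system python3 have the yaml module ceph-mgr complains about?
python3 -c "import yaml; print(yaml.__version__)"

# my unverified guess at the missing dependency:
sudo apt-get install -y python3-yaml
sudo systemctl restart ceph-mgr.target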
Duncan
Hi,
You're correct that all the relevant upmap entries are removed when an
OSD is marked out.
You can try to use this script which will recreate them and get the
cluster back to HEALTH_OK quickly:
https://github.com/cernceph/ceph-scripts/blob/master/tools/upmap/upmap-rema…
Cheers, Dan
On Fri, May 1, 2020 at 9:36 AM Dylan McCulloch <dmc(a)unimelb.edu.au> wrote:
>
> Hi all,
>
> We're using upmap balancer which has made a huge improvement in evenly distributing data on our osds and has provided a substantial increase in usable capacity.
>
> Currently on ceph version: 12.2.13 luminous
>
> We ran into a firewall issue recently which led to a large number of osds being briefly marked 'down' & 'out'. The osds came back 'up' & 'in' after about 25 mins and the cluster was fine but had to perform a significant amount of backfilling/recovery despite there being no end-user client I/O during that period.
>
> Presumably the large number of remapped pgs and backfills were due to pg_upmap_items being removed from the osdmap when osds were marked out and subsequently those pgs were redistributed using the default crush algorithm.
> As a result of the brief outage our cluster became significantly imbalanced again with several osds very close to full.
> Is there any reasonable mitigation for that scenario?
>
> The auto-balancer will not perform optimizations while there are degraded pgs, so it would only start reapplying pg upmap exceptions after initial recovery is complete (at which point capacity may be dangerously reduced).
> Similarly, as admins, we normally only apply changes when the cluster is in a healthy state, but if the same issue were to occur again would it be advisable to manually apply balancer plans while initial recovery is still taking place?
>
> I guess my concern from this experience is that making use of the capacity gained by using upmap balancer appears to carry some risk. i.e. it's possible for a brief outage to remove those space efficiencies relatively quickly and potentially result in full osds/cluster before the automatic balancer is able to resume and redistribute pgs using upmap.
>
> Curious whether others have any thoughts or experience regarding this.
>
> Cheers,
> Dylan