hdd pg's migrating when converting ssd class osd's


Marc Roos

27 Sep 2020

2:05 p.m.

I have been converting SSD OSDs to dmcrypt, and I have noticed that PGs of pools are being migrated that should be (and are?) on the hdd class.

On a healthy (HEALTH_OK) cluster, when I set the crush reweight of an SSD OSD to 0.0, I get this:

    17.35 10415 0 0 9907 0 36001743890 0 0 3045 3045 active+remapped+backfilling 2020-09-27 12:55:49.093054 83758'20725398 83758:100379720 [8,14,23] 8 [3,14,23] 3 83636'20718129 2020-09-27 00:58:07.098096 83300'20689151 2020-09-24 21:42:07.385360 0

However, OSDs 3, 14, 23 and 8 are all hdd OSDs.

Since this cluster dates from Kraken/Luminous, I am not sure whether the device class of replicated_ruleset [1] was already set when pool 17 was created. The weird thing is that all PGs of this pool seem to be on hdd OSDs [2].

Q. How can I display the definition of 'crush_rule 0' at the time of the pool creation? (To be sure it already had this device class hdd configured.)

[1]

    [@~]# ceph osd pool ls detail | grep 'pool 17'
    pool 17 'rbd' replicated size 3 min_size 2 crush_rule 0 object_hash rjenkins pg_num 64 pgp_num 64 autoscale_mode warn last_change 83712 flags hashpspool,selfmanaged_snaps stripe_width 0 application rbd

    [@~]# ceph osd crush rule dump replicated_ruleset
    {
        "rule_id": 0,
        "rule_name": "replicated_ruleset",
        "ruleset": 0,
        "type": 1,
        "min_size": 1,
        "max_size": 10,
        "steps": [
            {
                "op": "take",
                "item": -10,
                "item_name": "default~hdd"
            },
            {
                "op": "chooseleaf_firstn",
                "num": 0,
                "type": "host"
            },
            {
                "op": "emit"
            }
        ]
    }

[2]

    [@~]# for osd in `ceph pg dump pgs | grep '^17\.' | awk '{print $17" "$19}' | grep -oE '[0-9]+' | sort -u -n`; do ceph osd crush get-device-class osd.$osd ; done | sort -u
    dumped pgs
    hdd
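To make the pg dump line above easier to read, here is a minimal Python sketch that compares the up and acting sets of PG 17.35 and tags each OSD that joins or leaves with its device class. The two sets are copied from the post. The `device_class` dictionary is a hypothetical stand-in for what `ceph osd crush get-device-class` reports, filled in from the statement that all four OSDs are hdd. The function name `remap_summary` is made up for this illustration.

```python
# Up and acting sets copied from the pg dump line for PG 17.35.
up = [8, 14, 23]
acting = [3, 14, 23]

# Hypothetical lookup table: the post states that all four OSDs are hdd.
# On a live cluster this would come from `ceph osd crush get-device-class`.
device_class = {3: "hdd", 8: "hdd", 14: "hdd", 23: "hdd"}


def remap_summary(up, acting, device_class):
    """Return the OSDs joining and leaving a PG, each tagged with its class."""
    joining = sorted(set(up) - set(acting))   # in up, not yet in acting
    leaving = sorted(set(acting) - set(up))   # still acting, no longer in up

    def tag(osds):
        return [(osd, device_class.get(osd, "unknown")) for osd in osds]

    return {"joining": tag(joining), "leaving": tag(leaving)}


summary = remap_summary(up, acting, device_class)
print(summary)
```

Under these assumptions the only change is hdd OSD 8 replacing hdd OSD 3, which is exactly the surprising part of the report: an SSD reweight triggered a move between two hdd OSDs.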


Stefan Kooman

27 Sep

9:34 p.m.

On 2020-09-27 14:05, Marc Roos wrote:

> Q. How can I display the definition of 'crush_rule 0' at the time of the pool creation? (To be sure it had already this device class hdd configured)

Interesting question. I don't think that information is stored in Ceph somewhere. But it would be very useful. Similar to what zfs is doing with "zpool history". I find that very helpful to look back and see what has and hasn't been done in the past, i.e.

    ceph pool $pool history

/me heads off to issue tracker to file a feature request ...

Gr. Stefan

Eugen Block

28 Sep

10:06 a.m.

Are all the OSDs in the same crush root? I would think that since the crush weight of a host changes as soon as OSDs are out, it impacts the whole crush tree. If you separate the SSDs from the HDDs logically (e.g. different bucket type in the crush tree), the remapping wouldn't affect the HDDs.

Quoting Marc Roos <M.Roos(a)f1-outsourcing.eu>:

...

Marc Roos

29 Sep

9:39 a.m.

I have practically a default setup. If I do a 'ceph osd crush tree --show-shadow' I get a listing like this [1]. I would assume that, since the hosts are listed within default~ssd and default~hdd, they are separate (enough)?

[1]

    root default~ssd
        host c01~ssd
        ..
        ..
        host c02~ssd
        ..
    root default~hdd
        host c01~hdd
        ..
        host c02~hdd
        ..
    root default


_______________________________________________ ceph-users mailing list -- ceph-users(a)ceph.io To unsubscribe send an email to ceph-users-leave(a)ceph.io

Eugen Block

10:06 a.m.

They're still in the same root (default), and each host is a member of both device classes. I guess you have a mixed setup (hosts c01/c02 have both HDDs and SSDs)? I don't think this separation is enough to avoid remapping even if a different device class is affected (your report confirms that).

Dividing the crush tree into different subtrees might help here, but I'm not sure if that's really something you need. You might also just deal with the remapping as long as it doesn't happen too often, I guess. On the other hand, if your setup won't change (except for adding more OSDs), you might as well think about a different crush tree. It really depends on your actual requirements.

We created two different subtrees when we got new hardware, and it helped us a lot: we moved the data only once to the new hardware and avoided multiple remappings. Now the older hardware is our EC environment, except for some SSDs on those old hosts that had to stay in the main subtree. So our setup is also very individual, but it works quite nicely. :-)

Quoting Marc Roos <M.Roos(a)f1-outsourcing.eu>:

...


Marc Roos

8:54 p.m.

Yes, correct, the hosts do indeed have both SSDs and HDDs. Is this not more of a bug then? I would assume the goal of using device classes is that you separate these so that one does not affect the other. Even the per-host weights of the ssd and hdd classes are already available; the algorithm should just use those instead of the weight of the whole host. Or is there some specific use case where combining these classes is required?


Frank Schilder

9:09 p.m.

Are these crush maps inherited from pre-mimic versions? I have re-balanced SSD and HDD pools in mimic (deployed as mimic) where one device class never influenced the placement of the other. I have mixed hosts and went as far as introducing rbd_meta, rbd_data and similar classes to sub-divide even further (all these devices have different perf specs). This worked like a charm. When adding devices of one class, only pools in this class were ever affected.

As far as I understand, starting with mimic, every shadow class defines a separate tree (not just leaves/OSDs). Thus, device classes are independent of each other.

Best regards,
=================
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14


Marc Roos

10:19 p.m.

Yes, correct, this is coming from Luminous or maybe even Kraken. What does a default crush tree look like in mimic or octopus? Or is there a manual on how to bring this to the new 'default'?


Frank Schilder

11:48 p.m.

Somebody on this list posted a script that can convert pre-mimic crush trees with buckets for different types of devices to crush trees with device classes with minimal data movement (trying to maintain IDs as much as possible). I don't have the thread name right now, but I could try to find it tomorrow.

I can check tomorrow how our crush tree unfolds. Basically, for every device class there is a full copy of the tree (a shadow hierarchy) with its own weights etc.

Best regards,
=================
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14
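The idea of one shadow hierarchy per device class, each with its own weights, can be sketched in a few lines of Python. This is a toy model under stated assumptions, not Ceph's implementation: the hosts, OSD ids and weights are invented, and `shadow_host_weights` is a hypothetical helper that only sums weights per (class, host), mimicking buckets such as c01~hdd. It does not run CRUSH placement.

```python
from collections import defaultdict

# Hypothetical mixed hosts: (host, osd id, device class, crush weight).
osds = [
    ("c01", 0, "ssd", 1.0), ("c01", 1, "hdd", 4.0), ("c01", 2, "hdd", 4.0),
    ("c02", 3, "ssd", 1.0), ("c02", 4, "hdd", 4.0), ("c02", 5, "hdd", 4.0),
]


def shadow_host_weights(osds):
    """Sum OSD weights per (class, host), like the c01~hdd / c01~ssd buckets."""
    weights = defaultdict(float)
    for host, _osd, cls, weight in osds:
        weights[(cls, host)] += weight
    return dict(weights)


before = shadow_host_weights(osds)

# Set the crush weight of the SSD OSD on c01 (osd 0) to 0.0, as in the report.
reweighted = [(h, o, c, 0.0 if o == 0 else w) for h, o, c, w in osds]
after = shadow_host_weights(reweighted)

# Only buckets whose summed weight differs should see new placements.
changed = {key for key in before if before[key] != after[key]}
print(changed)
```

In this model only the ("ssd", "c01") bucket changes weight, while both hdd buckets stay at 8.0. That is the behaviour described above for shadow hierarchies, and it is why the hdd movement reported at the start of the thread is unexpected.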


Eugen Block

30 Sep

8:43 a.m.

Interesting, I also did this test on an upgraded cluster (Luminous to Nautilus). I'll repeat the test on a native Nautilus cluster to see it for myself.

Quoting Frank Schilder <frans(a)dtu.dk>:

...


Frank Schilder

8:59 a.m.

This is what my crush tree including shadow hierarchies looks like (a mess :): https://pastebin.com/iCLbi4Up

Every device class has its own tree. Starting with mimic, this is automatic when creating new device classes.

Best regards,
=================
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14


Nico Schottelius

9:12 a.m.

Hey Frank,

I uploaded our Kraken-created and Nautilus-upgraded crush map at [0]. To me it looks like the structure of both maps is pretty much the same - or am I mistaken?

Best regards,
Nico

[0] https://www.nico.schottelius.org/temp/ceph-shadowtree20200930

Frank Schilder <frans(a)dtu.dk> writes:

...


-- Modern, affordable, Swiss Virtual Machines. Visit www.datacenterlight.ch

Frank Schilder

9:59 a.m.

> To me it looks like the structure of both maps is pretty much the same - or am I mistaken?

Yes, but you are not Marc Roos. Do you work on the same cluster or do you observe the same problem?

In any case, here is a thread pointing to the crush tree/rule conversion I mentioned: https://lists.ceph.io/hyperkitty/list/ceph-users@ceph.io/thread/675QZ2JXXX4…

The tool is "crushtool reclassify", and it is recommended when upgrading from luminous to a newer release to convert crush rules to use device classes.

Best regards,
=================
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14


Nico Schottelius

11:26 a.m.

Frank Schilder <frans(a)dtu.dk> writes:

>> To me it looks like the structure of both maps is pretty much the same - or am I mistaken?
>
> Yes, but you are not Marc Roos. Do you work on the same cluster or do you observe the same problem?

No, but we recently also noticed that rebuilding one pool ("ssd") influenced speed on other pools, which was unexpected.

> In any case, here is a thread pointing to the crush tree/rule conversion I mentioned: https://lists.ceph.io/hyperkitty/list/ceph-users@ceph.io/thread/675QZ2JXXX4…

Thanks, will check it out.

> The tool is "crushtool reclassify" and is recommended to use when upgrading from luminous to newer to convert crush rules to use device classes.

Same here - thanks a lot for the pointers!

Cheers,
Nico

-- 
Modern, affordable, Swiss Virtual Machines. Visit www.datacenterlight.ch

Marc Roos

2:59 p.m.

Hi Frank,

Thanks, this 'root default' indeed looks different with these 0 there. I have also uploaded mine [1] because it looks very similar to Nico's. I guess his hdd PGs can also start moving on some occasions.

Thanks for the 'crushtool reclassify' hint; I guess I missed this in the release notes or so.

[1] https://pastebin.com/PFx0V3S7

...

Somebody on this list posted a script that can convert pre-mimic crush

...

trees with buckets for different types of devices to crush trees with device classes with minimal data movement (trying to maintain IDs as much as possible). Don't have a thread name right now, but could try to find it tomorrow. I can check tomorrow how our crush tree unfolds. Basically, for every device class there is a full copy (shadow hierarchy) for each device class with its own weights etc. Best regards, ================= Frank Schilder AIT Risø Campus Bygning 109, rum S14 ________________________________________ From: Marc Roos Sent: 29 September 2020 22:19:33 To: eblock; Frank Schilder Cc: ceph-users Subject: RE: [ceph-users] Re: hdd pg's migrating when converting ssd class osd's Yes correct this is coming from Luminous or maybe even Kraken. How does a default crush tree look like in mimic or octopus? Or is there some manual how to bring this to the new 'default'? -----Original Message----- Cc: ceph-users Subject: Re: [ceph-users] Re: hdd pg's migrating when converting ssd class osd's Are these crush maps inherited from pre-mimic versions? I have re-balanced SSD and HDD pools in mimic (mimic deployed) where one device class never influenced the placement of the other. I have mixed

...

hosts and went as far as introducing rbd_meta, rbd_data and such classes to sub-divide even further (all these devices have different

perf specs).

...

This worked like a charm. When adding devices of one class, only pools

...

in this class were ever affected. As far as I understand, starting with mimic, every shadow class defines a separate tree (not just leafs/OSDs). Thus, device classes are independent of each other. ________________________________________ Sent: 29 September 2020 20:54:48 To: eblock Cc: ceph-users Subject: [ceph-users] Re: hdd pg's migrating when converting ssd class

...

osd's Yes correct, hosts have indeed both ssd's and hdd's combined. Is this not more of a bug then? I would assume the goal of using device classes is that you separate these and one does not affect the other, even the host weight of the ssd and hdd class are already available. The algorithm should just use that instead of the weight of the whole

host.

...

Or is there some specific use case where combining these classes is required?

-----Original Message-----
Cc: ceph-users
Subject: *****SPAM***** Re: [ceph-users] Re: hdd pg's migrating when converting ssd class osd's

They're still in the same root (default) and each host is a member of both device-classes, I guess you have a mixed setup (hosts c01/c02 have both HDDs and SSDs)? I don't think this separation is enough to avoid remapping even if a different device-class is affected (your report confirms that).

Dividing the crush tree into different subtrees might help here but I'm not sure if that's really something you need. You might also just deal with the remapping as long as it doesn't happen too often, I guess. On the other hand, if your setup won't change (except adding more OSDs) you might as well think about a different crush tree. It really depends on your actual requirements.

We created two different subtrees when we got new hardware and it helped us a lot, moving the data only once to the new hardware and avoiding multiple remappings. Now the older hardware is our EC environment, except for some SSDs on those old hosts that had to stay in the main subtree. So our setup is also very individual but it works quite nicely. :-)

Quoting:

> I have practically a default setup. If I do a 'ceph osd crush tree --show-shadow' I have a listing like this[1]. I would assume from the hosts being listed within the default~ssd and default~hdd, they are separate (enough)?
>
> [1]
> root default~ssd
>     host c01~ssd
>     ..
>     ..
>     host c02~ssd
>     ..
> root default~hdd
>     host c01~hdd
>     ..
>     host c02~hdd
>     ..
> root default
>
> -----Original Message-----
> To: ceph-users(a)ceph.io
> Subject: [ceph-users] Re: hdd pg's migrating when converting ssd class osd's
>
> > Are all the OSDs in the same crush root? I would think that since the crush weight of hosts change as soon as OSDs are out it impacts the whole crush tree. If you separate the SSDs from the HDDs logically (e.g. different bucket type in the crush tree) the remapping wouldn't affect the HDDs.
> >
> >> I have been converting ssd's osd's to dmcrypt, and I have noticed that pg's of pools are migrated that should be (and are?) on hdd class.
> >>
> >> On a healthy ok cluster I am getting, when I set the crush reweight to 0.0 of a ssd osd this:
> >>
> >> 17.35 10415 0 0 9907 0 36001743890 0 0 3045 3045 active+remapped+backfilling 2020-09-27 12:55:49.093054 83758'20725398 83758:100379720 [8,14,23] 8 [3,14,23] 3 83636'20718129 2020-09-27 00:58:07.098096 83300'20689151 2020-09-24 21:42:07.385360 0
> >>
> >> However osds 3,14,23,8 are all hdd osd's
> >>
> >> Since this is a cluster from Kraken/Luminous, I am not sure if the device class of the replicated_ruleset[1] was set when the pool 17 was created.
> >> Weird thing is that all pg's of this pool seem to be on hdd osd[2]
> >>
> >> Q. How can I display the definition of 'crush_rule 0' at the time of the pool creation? (To be sure it had already this device class hdd configured)
> >>
> >> [1]
> >> [@~]# ceph osd pool ls detail | grep 'pool 17'
> >> pool 17 'rbd' replicated size 3 min_size 2 crush_rule 0 object_hash rjenkins pg_num 64 pgp_num 64 autoscale_mode warn last_change 83712 flags hashpspool,selfmanaged_snaps stripe_width 0 application rbd
> >>
> >> [@~]# ceph osd crush rule dump replicated_ruleset
> >> {
> >>     "rule_id": 0,
> >>     "rule_name": "replicated_ruleset",
> >>     "ruleset": 0,
> >>     "type": 1,
> >>     "min_size": 1,
> >>     "max_size": 10,
> >>     "steps": [
> >>         { "op": "take", "item": -10, "item_name": "default~hdd" },
> >>         { "op": "chooseleaf_firstn", "num": 0, "type": "host" },
> >>         { "op": "emit" }
> >>     ]
> >> }
> >>
> >> [2]
> >> [@~]# for osd in `ceph pg dump pgs| grep '^17' | awk '{print $17" "$19}' | grep -oE '[0-9]{1,2}'| sort -u -n`; do ceph osd crush get-device-class osd.$osd ; done | sort -u
> >> dumped pgs
> >> hdd

_______________________________________________ ceph-users mailing list -- ceph-users(a)ceph.io To unsubscribe send an email to ceph-users-leave(a)ceph.io
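The PG line quoted in the original post carries both the "up" set and the "acting" set, and comparing the two shows which OSDs a backfilling PG is moving between. The following is only a minimal sketch: it assumes the whitespace-separated column layout that the awk command in the thread relies on (fields 17 and 19), and the function names `parse_osd_set` and `up_vs_acting` are hypothetical helpers, not part of any Ceph tooling.

```python
def parse_osd_set(field):
    """Turn a field such as '[8,14,23]' into a list of OSD ids."""
    inner = field.strip("[]")
    return [int(x) for x in inner.split(",") if x]


def up_vs_acting(pg_line, up_col=17, acting_col=19):
    """Compare the up and acting sets of one 'ceph pg dump pgs' line.

    up_col / acting_col are 1-based field numbers, matching the
    awk '{print $17" "$19}' used in the thread (an assumption about
    the column layout of this particular Ceph release).
    """
    fields = pg_line.split()
    up = parse_osd_set(fields[up_col - 1])
    acting = parse_osd_set(fields[acting_col - 1])
    return {
        "pgid": fields[0],
        "up": up,
        "acting": acting,
        # in the up set but not yet acting: data is moving here
        "joining": sorted(set(up) - set(acting)),
        # still acting but no longer in the up set: data is moving away
        "leaving": sorted(set(acting) - set(up)),
    }


# The line reported in the thread for pg 17.35.
PG_LINE = (
    "17.35 10415 0 0 9907 0 36001743890 0 0 3045 3045 "
    "active+remapped+backfilling 2020-09-27 12:55:49.093054 "
    "83758'20725398 83758:100379720 [8,14,23] 8 [3,14,23] 3 "
    "83636'20718129 2020-09-27 00:58:07.098096 83300'20689151 "
    "2020-09-24 21:42:07.385360 0"
)

if __name__ == "__main__":
    print(up_vs_acting(PG_LINE))
```

For the quoted line this reports osd.8 as joining and osd.3 as leaving, both of which the post identifies as hdd OSDs; feeding the resulting ids to `ceph osd crush get-device-class` would then confirm the class per OSD.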

Frank Schilder

3:54 p.m.

Hi Nico and Marc,

your crush trees indeed look like they have already been converted properly to using device classes. Changing something within one device class should not influence placement in another. Maybe I'm overlooking something?

The only other place I know of where such a mix-up could occur is the crush rules. Do your rules look like this:

    {
        "rule_id": 5,
        "rule_name": "sr-rbd-data-one",
        "ruleset": 5,
        "type": 3,
        "min_size": 3,
        "max_size": 8,
        "steps": [
            { "op": "set_chooseleaf_tries", "num": 50 },
            { "op": "set_choose_tries", "num": 1000 },
            { "op": "take", "item": -185, "item_name": "ServerRoom~rbd_data" },
            { "op": "chooseleaf_indep", "num": 0, "type": "host" },
            { "op": "emit" }
        ]
    }

Notice the "~rbd_data" qualifier. It is important that the device class is specified at the root selection.

I'm really surprised that with your crush tree you observe changes in SSD implying changes in HDD placements. I was really rough on our mimic cluster with moving disks in and out and between servers and I have never seen this problem. Could it be a regression in nautilus? Is the auto-balancer interfering?
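The check described here (is the device class present on the root that the rule takes?) can be done mechanically on the JSON printed by `ceph osd crush rule dump`. The sketch below is an illustration only: the helper name `take_device_classes` is hypothetical, and it assumes that a shadow root is always named `<bucket>~<class>`, as in the dumps posted in this thread.

```python
import json


def take_device_classes(rule):
    """Return (bucket, device_class) for every 'take' step of a rule dump.

    device_class is None when the item name has no '~<class>' suffix,
    i.e. the rule takes the plain root and is not limited to one class.
    """
    result = []
    for step in rule.get("steps", []):
        if step.get("op") != "take":
            continue
        bucket, sep, device_class = step.get("item_name", "").partition("~")
        result.append((bucket, device_class if sep else None))
    return result


# The rule posted in this message, as printed by 'ceph osd crush rule dump'.
RULE_JSON = """
{ "rule_id": 5, "rule_name": "sr-rbd-data-one", "ruleset": 5, "type": 3,
  "min_size": 3, "max_size": 8,
  "steps": [
    { "op": "set_chooseleaf_tries", "num": 50 },
    { "op": "set_choose_tries", "num": 1000 },
    { "op": "take", "item": -185, "item_name": "ServerRoom~rbd_data" },
    { "op": "chooseleaf_indep", "num": 0, "type": "host" },
    { "op": "emit" }
  ] }
"""

if __name__ == "__main__":
    rule = json.loads(RULE_JSON)
    print(take_device_classes(rule))  # [('ServerRoom', 'rbd_data')]
```

A rule whose take step names only `default` would come back as `('default', None)`, which is the case to look out for when a pool is supposed to stay within one device class.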

> [...]
> we recently also noticed that rebuilding one pool ("ssd") influenced speed on other pools, which was unexpected.

Could this be something else? Was PG/object placement influenced or performance only? I'm asking because during one of our service windows we observed something very strange. We have a multi-location cluster with pools with completely isolated storage devices in different locations. On one of these sub-clusters we run a ceph fs. During maintenance we needed to shut down the ceph-fs. When our admin issued the umount command (ca. 1500 clients), we noticed that RBD pools seemed to have problems even though there is absolutely no overlap in disks (disjoint crush trees), they are not even in the same physical location and sit on their own switches. The fs and RBD only share the MONs/MGRs.

I'm not entirely sure if we observed something real or only a network blip. However, nagios went crazy on our VM environment for a few minutes. Maybe there is another issue that causes unexpected cross-dependencies that affect performance?

Best regards,
=================
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14

________________________________________
From: Marc Roos <M.Roos(a)f1-outsourcing.eu>
Sent: 30 September 2020 14:59:50
To: eblock; Frank Schilder
Cc: ceph-users; nico.schottelius
Subject: RE: [ceph-users] Re: hdd pg's migrating when converting ssd class osd's

Hi Frank, thanks, this 'root default' indeed looks different with these 0 there. I have also uploaded mine[1] because it looks very similar to Nico's. I guess his hdd pg's can also start moving on some occasions. Thanks for the 'crushtool reclassify' hint, I guess I have missed this in the release notes or so.

[1] https://pastebin.com/PFx0V3S7

-----Original Message-----
To: Eugen Block
Cc: Marc Roos; ceph-users
Subject: Re: [ceph-users] Re: hdd pg's migrating when converting ssd class osd's

This is what my crush tree including shadow hierarchies looks like (a mess :): https://pastebin.com/iCLbi4Up

Every device class has its own tree. Starting with mimic, this is automatic when creating new device classes.

Best regards,
=================
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14

________________________________________
From: Eugen Block <eblock(a)nde.ag>
Sent: 30 September 2020 08:43:47
To: Frank Schilder
Cc: Marc Roos; ceph-users
Subject: Re: [ceph-users] Re: hdd pg's migrating when converting ssd class osd's

Interesting, I also did this test on an upgraded cluster (L to N). I'll repeat the test on a native Nautilus to see it for myself.

Quoting Frank Schilder:

> [...]
> Somebody on this list posted a script that can convert pre-mimic crush [...]
> hosts and went as far as introducing rbd_meta, rbd_data and such classes to sub-divide even further (all these devices have different perf specs).
> [...]
> This worked like a charm. When adding devices of one class, only pools [...]
> host.


Marc Roos

6:07 p.m.

I still have not switched to upmap; what I have is crush-compat[1]. I just set the crush reweight of ssd osd 33 to 0.0 (changing from ceph-disk to ceph-volume with dmcrypt). First I see only ssd pools remapping, then a bit later 2 hdd pools. I thought at first it could maybe be related to the time the crush rule was adapted for hdd classes, but now a pool fs_data.ec21 is remapping, and I know for sure the hdd ec21 rule existed when this pool was created.

[1]
[@ceph]# ceph balancer status
{
    "last_optimize_duration": "0:00:00.647219",
    "plans": [],
    "mode": "crush-compat",
    "active": true,
    "optimize_result": "Unable to find further optimization, change balancer mode and retry might help",
    "last_optimize_started": "Wed Sep 30 17:10:27 2020"
}

[@ceph]# ceph osd crush rule dump replicated_ruleset
{
    "rule_id": 0,
    "rule_name": "replicated_ruleset",
    "ruleset": 0,
    "type": 1,
    "min_size": 1,
    "max_size": 10,
    "steps": [
        { "op": "take", "item": -10, "item_name": "default~hdd" },
        { "op": "chooseleaf_firstn", "num": 0, "type": "host" },
        { "op": "emit" }
    ]
}

[@ceph]# ceph osd crush rule dump replicated_ruleset_ssd
{
    "rule_id": 5,
    "rule_name": "replicated_ruleset_ssd",
    "ruleset": 5,
    "type": 1,
    "min_size": 1,
    "max_size": 10,
    "steps": [
        { "op": "take", "item": -15, "item_name": "default~ssd" },
        { "op": "chooseleaf_firstn", "num": 0, "type": "host" },
        { "op": "emit" }
    ]
}

-----Original Message-----
To: Marc Roos; eblock
Cc: ceph-users; nico.schottelius
Subject: Re: [ceph-users] Re: hdd pg's migrating when converting ssd class osd's

Hi Nico and Marc,

your crush trees indeed look like they have already been converted properly to using device classes. Changing something within one device class should not influence placement in another. Maybe I'm overlooking something?

The only other place I know of where such a mix-up could occur is the crush rules. Do your rules look like this:

    {
        "rule_id": 5,
        "rule_name": "sr-rbd-data-one",
        "ruleset": 5,
        "type": 3,
        "min_size": 3,
        "max_size": 8,
        "steps": [
            { "op": "set_chooseleaf_tries", "num": 50 },
            { "op": "set_choose_tries", "num": 1000 },
            { "op": "take", "item": -185, "item_name": "ServerRoom~rbd_data" },
            { "op": "chooseleaf_indep", "num": 0, "type": "host" },
            { "op": "emit" }
        ]
    }

Notice the "~rbd_data" qualifier. It is important that the device class is specified at the root selection.

I'm really surprised that with your crush tree you observe changes in SSD implying changes in HDD placements. I was really rough on our mimic cluster with moving disks in and out and between servers and I have never seen this problem. Could it be a regression in nautilus? Is the auto-balancer interfering?
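Since the question of whether the auto-balancer is interfering comes up here, the `ceph balancer status` output quoted in this message can be reduced to the two facts that matter for that question: whether the balancer is active and which mode it runs in. This is a small sketch only; `balancer_summary` is a hypothetical helper, and it assumes the JSON keys shown in the output above (`active`, `mode`, `plans`). It says nothing about whether the balancer actually caused the hdd remapping.

```python
import json


def balancer_summary(status_text):
    """Summarise the JSON printed by 'ceph balancer status'."""
    status = json.loads(status_text)
    return {
        "active": bool(status.get("active")),
        "mode": status.get("mode"),
        "pending_plans": len(status.get("plans", [])),
    }


# The status output posted in this message.
BALANCER_STATUS = """
{ "last_optimize_duration": "0:00:00.647219",
  "plans": [],
  "mode": "crush-compat",
  "active": true,
  "optimize_result": "Unable to find further optimization, change balancer mode and retry might help",
  "last_optimize_started": "Wed Sep 30 17:10:27 2020" }
"""

if __name__ == "__main__":
    print(balancer_summary(BALANCER_STATUS))
```

For the posted output the summary shows an active balancer in crush-compat mode with no pending plans, which is the information needed to repeat the reweight test once with the balancer switched off.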


Marc Roos

10:26 p.m.

I am not sure, but it looks like this remapping on the hdd's does not happen when adding back the same ssd osd.

Frank Schilder

1 Oct

9:02 a.m.

Dear Marc and Nico,

I think this might be the time to file a tracker report. As far as I can see, your set-up is as it should be; OSD operations on your clusters should behave exactly as on ours. I don't know of any other configuration option that influences placement calculation.

The problems you (Nico in particular) describe seem serious enough. I have also heard other reports of admin operations killing a cluster starting with Nautilus, most notably this one: https://lists.ceph.io/hyperkitty/list/ceph-users@ceph.io/thread/W4M5XQRDBLX… . Maybe there are regressions in the crush placement computations (and elsewhere)? I will add this to the list of tests before considering an upgrade from mimic.

Best regards,
=================
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14

________________________________________
From: Marc Roos <M.Roos(a)f1-outsourcing.eu>
Sent: 30 September 2020 22:26:11
To: eblock; Frank Schilder
Cc: ceph-users; nico.schottelius
Subject: RE: [ceph-users] Re: hdd pg's migrating when converting ssd class osd's

I am not sure, but it looks like this remapping at hdd's is not being done when adding back the same ssd osd.

Nico Schottelius

30 Sep

9:38 p.m.

Good evening Frank,

Frank Schilder <frans(a)dtu.dk> writes:

> [...]

That's the same question we were asking ourselves last week (and still do).

> [...]
> The only other place I know of where such a mix-up could occur are the crush rules. Do your rules look like this:

Ours are slightly simpler due to fewer OSDs and less hierarchy, but besides that I think they should be fine:

    {
        "rule_id": 2,
        "rule_name": "ssd",
        "ruleset": 2,
        "type": 1,
        "min_size": 1,
        "max_size": 10,
        "steps": [
            { "op": "take", "item": -12, "item_name": "default~ssd" },
            { "op": "chooseleaf_firstn", "num": 0, "type": "host" },
            { "op": "emit" }
        ]
    }

The difference I see is chooseleaf_indep vs. chooseleaf_firstn, which comes from replicated vs. EC pools. But either way we are already below the correct root.

> [...]
> I'm really surprised that with your crush tree you observe changes in SSD implying changes in HDD placements. I was really rough on our mimic cluster with moving disks in and out and between servers and I have never seen this problem. Could it be a regression in nautilus? Is the auto-balancer interfering?

The auto balancer was off during our last big rebalance, however I am also wondering if this is a nautilus regression, as we have never seen it in Luminous. We migrated luminous -> nautilus with a baby-step (1 day or so) of mimic in between, so I am not able to say whether this behaviour already changed somehow in mimic or just in Nautilus.

Our case last week was:

- Move 4 SSDs from one host to another
- The whole cluster - all pools - became unresponsive, slow ops everywhere.
- Network bandwidth was never exhausted, enough CPU cores idle, RAM available

While we do co-host SSD & HDD on the same hosts, we were not able to detect any resource exhaustion that would have prevented the other pools from functioning properly.

Best regards,

Nico

--
Modern, affordable, Swiss Virtual Machines. Visit www.datacenterlight.ch


participants (5)

Eugen Block
Frank Schilder
Marc Roos
Nico Schottelius
Stefan Kooman