Sorry to post this to the list, but does this lists.ceph.io password
reset work for anyone?
https://lists.ceph.io/accounts/password/reset/
For my accounts, which are receiving list mail, I get "The e-mail address
is not assigned to any user account".
Best Regards, Dan
Hi Dominic,
I just created a feature ticket in the Ceph tracker to keep track of
this issue.
Here's the ticket: https://tracker.ceph.com/issues/41537
Cheers,
Ricardo Dias
On 17/07/19 20:06, DHilsbos(a)performair.com wrote:
> All;
>
> I'm trying to firm up my understanding of how Ceph works, and of its ease-of-management tools and capabilities.
>
> I stumbled upon this: http://docs.ceph.com/docs/nautilus/rados/configuration/mon-lookup-dns/
>
> It got me wondering; how do you convey protocol version 2 capabilities in this format?
>
> The examples all list port 6789, which is the port for protocol version 1. Would I add SRV records for port 3300? How does the client distinguish v1 from v2 in this case?
>
> Thank you,
>
> Dominic L. Hilsbos, MBA
> Director - Information Technology
> Perform Air International, Inc.
> DHilsbos(a)PerformAir.com
> www.PerformAir.com
>
>
--
Ricardo Dias
Senior Software Engineer - Storage Team
SUSE Linux GmbH, GF: Felix Imendörffer, Jane Smithard, Graham Norton,
HRB 21284
(AG Nürnberg)
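For reference on the SRV question quoted above, a sketch of what records
advertising both messenger ports might look like (hostnames and TTLs are
placeholders, and whether the client tells v1 from v2 purely by the port
number should be double-checked against the msgr2 / mon-lookup-dns docs):

    _ceph-mon._tcp.example.com. 3600 IN SRV 10 20 6789 mon1.example.com.
    _ceph-mon._tcp.example.com. 3600 IN SRV 10 20 3300 mon1.example.com.
    mon1.example.com.           3600 IN A   192.0.2.11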
Hi,
We use all-SSD disks as Ceph's backend storage.
Considering the cost factor, can we set up the cluster to keep only two
replicas of each object?
thanks & regards
Wesley
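In case it helps, a minimal sketch of the commands involved (the pool name
is a placeholder). Keep in mind that size=2 protects data far less well than
size=3, and min_size=1 in particular risks data loss after a single failure:

    ceph osd pool set mypool size 2
    ceph osd pool set mypool min_size 2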
It seems that with Linux kernel 4.16.10, krbd clients are seen as Jewel
rather than Luminous. Can someone tell me which kernel version will be seen
as Luminous, as I want to enable the upmap balancer?
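A sketch of how to check what release the cluster thinks each client speaks,
and how to require luminous clients before turning the upmap balancer on
(commands as in current releases; verify against your version):

    ceph features                                     # shows the release each connected client reports
    ceph osd set-require-min-compat-client luminous   # refuses if pre-luminous clients are still connected
    ceph balancer mode upmap
    ceph balancer on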
Hi everybody,
I'm new to Ceph and I have a question related to active+remapped+backfilling PGs and misplaced objects.
Recently I copied more than 10 million objects to a new cluster with 3 nodes and 6 OSDs. During this migration one of my OSDs became full and the health check went to ERR. I don't know why, but Ceph started to write every object to only one OSD (can I change this behaviour?), and after it filled up I tried to reweight it by utilization and increased the PG count for one pool.
The cluster became accessible again with a warning status and recovery started. I checked the cluster status for two days and found that I always had 1 PG in active+remapped+backfilling and more than 5% misplaced objects. I thought the recovery process would take a few more days, so I left the cluster to do the recovery in the background; by now I have more active+remapped+backfill_wait PGs and more misplaced objects (about 10%).
The questions are: what should I do? Wait for recovery to finish? Can I speed up this process?
These servers are in a production environment; am I in trouble or not?
Kind Regards
Thanks
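In case it's useful, a sketch of the usual knobs for speeding up backfill
(the values are illustrative only, and raising them steals I/O from clients):

    ceph tell 'osd.*' injectargs '--osd-max-backfills 4 --osd-recovery-max-active 4'
    ceph -s    # watch the misplaced percentage trend down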
Hi everyone,
there are a couple of bug reports about this in Redmine but only one
(unanswered) mailing list message[1] that I could find. So I figured I'd
raise the issue here again and copy the original reporters of the bugs
(they are BCC'd, because in case they are no longer subscribed it
wouldn't be appropriate to share their email addresses with the list).
This is about https://tracker.ceph.com/issues/40029, and
https://tracker.ceph.com/issues/39978 (the latter of which was recently
closed as a duplicate of the former).
In short, it appears that at least in luminous and mimic (I haven't
tried nautilus yet), it's possible to crash a mon when attempting to add
a new OSD as it's trying to inject itself into the crush map under its
host bucket, when that host bucket does not exist yet.
What's worse is that when the OSD's "ceph osd new" process has thus
crashed the leader mon, a new leader is elected and in case the "ceph
osd new" process is still running on the OSD node, it will promptly
connect to that mon, and kill it too. This then continues until
sufficiently many mons have died for quorum to be lost.
The recovery steps appear to involve
- killing the "ceph osd new" process,
- restarting mons until you regain quorum,
- and then running "ceph osd purge" to drop the problematic OSD entry
from the crushmap and osdmap.
The issue can apparently be worked around by adding the host buckets to
the crushmap manually before adding the new OSDs, but surely this isn't
intended to be a prerequisite, at least not to the point of mons
crashing otherwise?
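For the record, the manual workaround boils down to something like this
(hostname and root are placeholders), plus "ceph osd purge" if a
half-created OSD has already taken a mon down:

    ceph osd crush add-bucket node1 host
    ceph osd crush move node1 root=default
    # and, if needed, to drop the problematic OSD entry afterwards:
    ceph osd purge <osd-id> --yes-i-really-mean-it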
Also, I am guessing that this is some weird corner case rooted in an
unusual combination of contributing factors, because otherwise more people
would presumably have been bitten by this problem.
Anyone able to share their thoughts on this one? Have more people run
into this?
Cheers,
Florian
[1]
http://lists.ceph.com/pipermail/ceph-users-ceph.com/2019-May/034880.html
— interestingly I could find this message in the pipermail archive but
none in the one that my MUA keeps for me. So perhaps that message wasn't
delivered to all subscribers, which might be why it has gone unanswered.
Hi all!
I use Ceph as the OpenStack VM disk backend. I have a VM running PostgreSQL.
I found that the disk on the VM running PostgreSQL is very busy and slow,
but the Ceph cluster is very healthy and shows no slow requests.
Even while the VM disk is very busy, the Ceph cluster looks almost idle.
My Ceph version is 12.2.8 and the VM disk uses an ext4 file system.
The PostgreSQL VM's disk is very busy; see below:
avg-cpu: %user %nice %system %iowait %steal %idle
0.53 0.00 1.6 16.55 0.00 81.31
Device: rrqm/s wrqm/s r/s w/s rsec/s wsec/s avgrq-sz avgqu-sz await r_await w_await svctm %util
vdb 0.00 7425.00 2.0 65.5 40.00 63904.00 940.54 134.27 66966.54 12553.40 69042.94 14.71 100.05
The Ceph cluster itself is very idle:
osd commit_latency(ms) apply_latency(ms)
39 0 1
38 0 2
37 0 0
36 0 1
35 0 0
34 0 0
33 0 0
32 0 0
31 0 0
30 0 1
29 0 1
28 0 0
27 0 1
26 0 1
25 0 1
24 0 1
23 0 0
22 0 0
9 0 0
8 0 0
7 0 0
6 0 1
5 0 1
4 0 7
0 0 1
1 0 3
2 0 2
3 0 1
10 0 1
11 0 0
13 0 0
14 0 0
15 0 1
16 0 0
17 0 1
18 0 0
19 0 0
20 0 0
21 0 0
Can anybody tell me why?
Thanks in advance!
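One way to narrow down whether the bottleneck is in the guest or in the
cluster is to benchmark the RBD layer directly from a client node, outside
the VM (a sketch; pool and image names are placeholders, and rbd bench
writes to the image, so use a scratch image):

    rbd bench --io-type write --io-size 4K --io-threads 16 rbd/scratchimage
    rados bench -p rbd 30 write --no-cleanup    # raw cluster write throughput, for comparison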
On Fri, Aug 23, 2019 at 7:38 AM Ajitha Robert <ajitharobert01(a)gmail.com> wrote:
>
> Sir,
>
> I have a running DR setup with Ceph, but when I did the same for another two sites (it's actually a direct L2 connectivity link between the sites) I am getting a repeated error:
>
> rbd::mirror::InstanceWatcher: C_NotifyInstanceRequestfinish: resending after timeout
That is just indicative that you are having issues talking to your
local cluster. Assuming you only have a single rbd-mirror daemon
running, it seems like it cannot even send a message through the OSD
to itself within 5 seconds. Perhaps your cluster is too slow to
respond?
>
> It's coming continuously, so nothing is getting replicated to the other site. Is direct L2 connectivity a concern? Does rbd-mirror expect an L3 link between the two sites?
>
> On Wed, Jul 24, 2019 at 12:42 AM Ajitha Robert <ajitharobert01(a)gmail.com> wrote:
>>
>> Thanks for your reply.
>>
>> Regarding RBD mirroring, can you please check the logs for RBD image creation? The second one [2] started syncing but made no further progress.
>>
>> 1) Log for manual RBD image creation
>>
>> http://paste.openstack.org/show/754766/
>>
>>
>> 2) Log for a 16 GB volume created from Cinder; the volume status in Cinder is available
>>
>> http://paste.openstack.org/show/754767/
>>
>>
>> 3) Log for a 100 GB volume created from Cinder; the volume status in Cinder is error
>>
>> http://paste.openstack.org/show/754769/
>>
>>
>> On Tue, Jul 23, 2019 at 1:13 AM Jason Dillaman <jdillama(a)redhat.com> wrote:
>>>
>>> On Mon, Jul 22, 2019 at 3:26 PM Ajitha Robert <ajitharobert01(a)gmail.com> wrote:
>>> >
>>> > Thanks for your reply
>>> >
>>> > 1) In scenario 1, I didn't attempt to delete the Cinder volume. Please find the Cinder volume log.
>>> > http://paste.openstack.org/show/754731/
>>>
>>> It might be better to ping Cinder folks about that one. It doesn't
>>> really make sense to me from a quick glance.
>>>
>>> >
>>> > 2) In scenario 2 I will try with debug. But I have a test setup with one OSD in the primary and one OSD in the secondary, and the distance between the two Ceph clusters is 300 km.
>>> >
>>> >
>>> > 3) I have disabled Ceph authentication entirely for everything, including the rbd-mirror daemon. Also, I deployed the Ceph cluster using ceph-ansible. Could either of these cause any issue for the entire setup?
>>>
>>> Not to my knowledge.
>>>
>>> > 4) The image which was in syncing mode showed read-only status on the secondary.
>>>
>>> Mirrored images are either primary or non-primary. It is the expected
>>> (documented) behaviour that non-primary images are read-only.
>>>
>>> > 5) In a presentation I found that the journaling feature causes poor I/O performance and that we can skip the journaling process for mirroring. Is that possible, by enabling mirroring on the entire Cinder pool (pool mode) instead of per-image mirror mode, so that we can skip the replication_enabled=true spec in the Cinder volume type?
>>>
>>> Journaling is required for RBD mirroring.
>>>
>>> >
>>> >
>>> >
>>> > On Mon, Jul 22, 2019 at 11:13 PM Jason Dillaman <jdillama(a)redhat.com> wrote:
>>> >>
>>> >> On Mon, Jul 22, 2019 at 10:49 AM Ajitha Robert <ajitharobert01(a)gmail.com> wrote:
>>> >> >
>>> >> > There is no error in the rbd-mirror log except a connection timeout that appeared once.
>>> >> > Scenario 1:
>>> >> > When I create a bootable volume of 100 GB from a Glance image, the image gets downloaded and the Cinder volume log throws "volume is busy deleting volume that has snapshot". The image was enabled with exclusive-lock, journaling, layering, object-map, fast-diff and deep-flatten.
>>> >> > The Cinder volume ends up in an error state, and the RBD image is created on the primary but not on the secondary.
>>> >>
>>> >> Any chance you know where in Cinder that error is being thrown? A
>>> >> quick grep of the code doesn't reveal that error message. If the image
>>> >> is being synced to the secondary site when you attempt to delete it,
>>> >> it's possible you could hit this issue. Providing debug log messages
>>> >> from librbd on the Cinder controller might also be helpful for this.
>>> >>
>>> >> > Scenario 2:
>>> >> > But when I create a 50 GB volume with another Glance image, the volume gets created, and in the backend I can see the RBD images on both the primary and the secondary.
>>> >> >
>>> >> > From "rbd mirror image status" I found that the secondary cluster starts copying, but syncing got stuck at around 14%. It stays at 14% with no progress at all. Should I set any parameters for this, like a timeout?
>>> >> >
>>> >> > I manually ran "rbd --cluster primary object-map check <object-name>". No results came back for the objects and the command just hung. That's why I got worried about the "failed to map object key" log. I couldn't even rebuild the object map.
>>> >>
>>> >> It sounds like one or more of your primary OSDs are not reachable from
>>> >> the secondary site. If you run w/ "debug rbd-mirror = 20" and "debug
>>> >> rbd = 20", you should be able to see the last object it attempted to
>>> >> copy. From that, you could use "ceph osd map" to figure out the
>>> >> primary OSD for that object.
>>> >>
>>> >> > The image which was in syncing mode showed read-only status on the secondary.
>>> >> >
>>> >> >
>>> >> >
>>> >> > On Mon, 22 Jul 2019, 17:36 Jason Dillaman, <jdillama(a)redhat.com> wrote:
>>> >> >>
>>> >> >> On Sun, Jul 21, 2019 at 8:25 PM Ajitha Robert <ajitharobert01(a)gmail.com> wrote:
>>> >> >> >
>>> >> >> > I have an RBD mirroring setup with primary and secondary clusters as peers, and a pool with image-mode mirroring enabled. In this pool I created an RBD image with journaling enabled.
>>> >> >> >
>>> >> >> > But whenever I enable mirroring on the image, I get errors in osd.log. I couldn't trace them down; please guide me to solve this error.
>>> >> >> >
>>> >> >> > I think it initially worked fine, but after a Ceph process restart these errors started appearing.
>>> >> >> >
>>> >> >> >
>>> >> >> > Secondary.osd.0.log
>>> >> >> >
>>> >> >> > 2019-07-22 05:36:17.371771 7ffbaa0e9700 0 <cls> /build/ceph-12.2.12/src/cls/journal/cls_journal.cc:61: failed to get omap key: client_a5c76849-ba16-480a-a96b-ebfdb7f6ac65
>>> >> >> > 2019-07-22 05:36:17.388552 7ffbaa0e9700 0 <cls> /build/ceph-12.2.12/src/cls/journal/cls_journal.cc:472: active object set earlier than minimum: 0 < 1
>>> >> >> > 2019-07-22 05:36:17.413102 7ffbaa0e9700 0 <cls> /build/ceph-12.2.12/src/cls/journal/cls_journal.cc:61: failed to get omap key: order
>>> >> >> > 2019-07-22 05:36:23.341490 7ffbab8ec700 0 <cls> /build/ceph-12.2.12/src/cls/rbd/cls_rbd.cc:4125: error retrieving image id for global id '9e36b9f8-238e-4a54-a055-19b19447855e': (2) No such file or directory
>>> >> >> >
>>> >> >> >
>>> >> >> > primary-osd.0.log
>>> >> >> >
>>> >> >> > 2019-07-22 05:16:49.287769 7fae12db1700 0 log_channel(cluster) log [DBG] : 1.b deep-scrub ok
>>> >> >> > 2019-07-22 05:16:54.078698 7fae125b0700 0 log_channel(cluster) log [DBG] : 1.1b scrub starts
>>> >> >> > 2019-07-22 05:16:54.293839 7fae125b0700 0 log_channel(cluster) log [DBG] : 1.1b scrub ok
>>> >> >> > 2019-07-22 05:17:04.055277 7fae12db1700 0 <cls> /build/ceph-12.2.12/src/cls/journal/cls_journal.cc:472: active object set earlier than minimum: 0 < 1
>>> >> >> >
>>> >> >> > 2019-07-22 05:33:21.540986 7fae135b2700 0 <cls> /build/ceph-12.2.12/src/cls/journal/cls_journal.cc:472: active object set earlier than minimum: 0 < 1
>>> >> >> > 2019-07-22 05:35:27.447820 7fae12db1700 0 <cls> /build/ceph-12.2.12/src/cls/rbd/cls_rbd.cc:4125: error retrieving image id for global id '8a61f694-f650-4ba1-b768-c5e7629ad2e0': (2) No such file or directory
>>> >> >>
>>> >> >> Those don't look like errors, but the log level should probably be
>>> >> >> reduced for those OSD cls methods. If you look at your rbd-mirror
>>> >> >> daemon log, do you see any errors? That would be the important place
>>> >> >> to look.
>>> >> >>
>>> >> >> >
>>> >> >> > --
>>> >> >> > Regards,
>>> >> >> > Ajitha R
>>> >> >>
>>> >> >>
>>> >> >>
>>> >> >> --
>>> >> >> Jason
>>> >>
>>> >>
>>> >>
>>> >> --
>>> >> Jason
>>> >
>>> >
>>> >
>>> > --
>>> > Regards,
>>> > Ajitha R
>>>
>>>
>>>
>>> --
>>> Jason
>>
>>
>>
>> --
>> Regards,
>> Ajitha R
>
>
>
> --
> Regards,
> Ajitha R
--
Jason
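For anyone following the thread, a sketch of the debugging steps discussed
above (cluster, pool and object names are placeholders):

    # in the ceph.conf used by the rbd-mirror daemon (or via injectargs):
    #   debug rbd-mirror = 20
    #   debug rbd = 20
    # then map the last object the mirror attempted to copy back to its primary OSD:
    ceph --cluster primary osd map mypool rbd_data.1234abcdef.0000000000000000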
Hi everyone,
apologies in advance; this will be long. It's also been through a bunch
of edits and rewrites, so I don't know how well I'm expressing myself at
this stage — please holler if anything is unclear and I'll be happy to
try to clarify.
I am currently in the process of investigating the behavior of OpenStack
Nova instances when being snapshotted and suspended, in conjunction with
qemu-guest-agent (qemu-ga). I realize that RBD-backed Nova/libvirt
instances are expected to behave differently from file-backed ones, but
I think I might have reason to believe that the RBD-backed ones are
indeed behaving incorrectly, and I'd like to verify that.
So first up, for comparison, let's recap how a Nova/libvirt/KVM instance
behaves when it is *not* backed by RBD (such as, it's using a qcow2 file
that is on a Nova compute node in /var/lib/nova/instances), is booted
from an image with the hw_qemu_guest_agent=yes meta property set, and
runs qemu-guest-agent within the guest:
- User issues "nova suspend" or "openstack server suspend".
- If nova-compute on the compute node decides that the instance has
qemu-guest-agent running (which is the case if it's qemu or kvm, and its
image has hw_qemu_guest_agent=yes), it sends a guest-sync command over
the guest agent VirtIO serial port. This command registers in the
qemu-ga log file in the guest.
- nova-compute on the compute node sends a libvirt managed-save command.
- Nova reports the instance as suspended.
- User issues "nova resume" or "openstack server resume".
- nova-compute on the compute node sends a libvirt start command.
- Again, if nova-compute on the compute node knows that the instance has
qemu-guest-agent running, it sends another command over the serial port,
namely guest-set-time. This, too, registers in the guest's qemu-ga log.
- Nova reports the instance as active (running normally) again.
Now, when I instead use a Nova environment that is fully RBD-backed, I
see exactly the same behavior as described above. So I know that in
principle, nova-compute/qemu-ga communication works in both an
RBD-backed and a non-RBD-backed environment.
However, things appear to get very different when it comes to snapshots.
Again, starting with a file-backed environment:
- User issues "nova image-create" or "openstack server image create".
- If nova-compute on the compute node decides that the instance can be
quiesced (which is the case if it's qemu or kvm, and its image has
hw_qemu_guest_agent=yes), then it sends a "guest-fsfreeze-freeze"
command over the guest agent VirtIO serial port.
- The guest agent inside the guest loops over all mounted filesystems,
and issues the FIFREEZE ioctl (which maps to the kernel freeze_super()
function). This can be seen in the qemu-ga log file in the guest, and it
is also verifiable by using ftrace on the qemu-ga PID and checking for
the freeze_super() function call.
- nova-compute then takes a live snapshot of the instance.
- Once complete, the guest gets a "guest-fsfreeze-thaw" command, and
again I can see this in the qemu-ga log, and with ftrace.
And now with RBD:
- User issues "nova image-create" or "openstack server image create".
- The guest-fsfreeze-freeze agent command never happens.
Now I can see the info message from
https://opendev.org/openstack/nova/src/commit/7bf75976016aae5d458eca9f6ddac…
in my nova-compute log, which confirms that we're attempting a live
snapshot.
I also do *not* see the warning from
https://opendev.org/openstack/nova/src/commit/7bf75976016aae5d458eca9f6ddac…,
so it looks like the direct_snapshot() call from
https://opendev.org/openstack/nova/src/commit/7bf75976016aae5d458eca9f6ddac…
succeeds. This is defined in
https://opendev.org/openstack/nova/src/commit/7bf75976016aae5d458eca9f6ddac…
and it uses RBD functionality only. Importantly, it never interacts with
qemu-ga, so it appears to not worry at all about freezing the filesystem.
(Which does seem to contradict
https://docs.ceph.com/docs/master/rbd/rbd-openstack/?highlight=uuid#image-p…,
by the way, so that may be a documentation bug.)
Now here's another interesting part. Were the direct snapshot to fail,
if I read
https://opendev.org/openstack/nova/src/commit/7bf75976016aae5d458eca9f6ddac…
and
https://opendev.org/openstack/nova/src/commit/7bf75976016aae5d458eca9f6ddac…
correctly, the fallback behavior would be as follows: The domain would
next be "suspended" (note, again this is Nova suspend, which maps to
libvirt managed-save per
https://opendev.org/openstack/nova/src/commit/7bf75976016aae5d458eca9f6ddac…),
then snapshotted using a libvirt call and resumed again post-snapshot.
In which case there would be a guest-sync call on suspend.
And it's this part that has me a bit worried. If an RBD backed instance,
on a successful snapshot, never freezes its filesystem *and* never does
any kind of sync, either, doesn't that mean that such an instance can't
be made to produce consistent snapshots? (Particularly in the case of
write-back caching, which is recommended and normally safe for
RBD/virtio devices.) Or is there some magic within the Qemu RBD storage
driver that I am unaware of, that makes any such contortions unnecessary?
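For what it's worth, one manual way to get a flushed, consistent snapshot
today would be to drive the agent directly around the Nova snapshot (a
sketch; the libvirt domain and server names are placeholders, and keeping
the filesystem frozen for the whole image upload may not be practical):

    virsh qemu-agent-command instance-0000abcd '{"execute": "guest-fsfreeze-freeze"}'
    openstack server image create --name db-snap myinstance
    virsh qemu-agent-command instance-0000abcd '{"execute": "guest-fsfreeze-thaw"}'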
Thanks in advance for your insights!
Cheers,
Florian
It's certainly possible. It makes things a little more complex though. Some
questions you may want to consider during the design..
- Is the customer aware this won't preserve any data on the LUNs they are
hoping to reuse?
- Is the plan to eventually replace the SAN with JBOD, in the same systems?
If so, you may want to make your LUNs look like the eventual drive size and
count.
- Is the plan to use a few systems with SAN and add standalone systems
later? Then you need to calculate expected speeds and divide between
failure domains.
- Is the plan to use a couple of hosts with SAN to save money, and have the
rest be traditional Ceph storage? If so consider putting the SAN hosts all
in one failure domain.
- Depending on the SAN you may consider aligning your failure domains to
different arrays, switches, or even array directors.
- Remember to take the host's network speed into consideration when
calculating how many LUNs to put on each host (see the rough arithmetic below).
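As a rough worked example (all numbers are illustrative): a host with
2 x 10 GbE can move on the order of 2 GB/s, so if each LUN sustains roughly
250 MB/s, about 8 LUNs will saturate the host's network before the SAN does;
more LUNs per host past that point mostly concentrates failure-domain risk
rather than adding throughput.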
Hope that helps.
-Brett
On Thu, Aug 22, 2019, 4:14 AM Mohsen Mottaghi <mohsenmottaghi(a)outlook.com>
wrote:
> Hi
>
>
> Yesterday one of our customers came to us with a strange request. He asked
> us to use SAN as the Ceph storage space, adding the SAN arrays he currently
> has to the cluster to reduce further disk purchase costs.
>
>
> Does anybody know whether we can do this or not? And if it is possible, how
> should we start to architect this strange Ceph? Is it a good idea or not?
>
>
>
> Thanks for your help.
>
> Mohsen Mottaghi