Hi,
We have had a situation three times where rbd images seem to be corrupt after restoring a snapshot, and I'm looking for advice on how to investigate this.
We're running Proxmox 7 with Ceph Octopus (Proxmox build, 15.2.17-pve1). Every time the problem has happened, it has happened after these actions were done with the VM:
(Yesterday)
- VM stopped
- Snapshot created
- VM started
- VM stopped
- Snapshot restored
- VM started (OK)
- Nightly backup with vzdump to Proxmox Backup Server
(Today)
- VM stopped
- Snapshot restored
- VM does not start
On previous occasions we tried to find a solution and, when we couldn't, we restored the VM from backup, which solved the problem. This time it happened on a test system, so we've left the situation as is in the hope of getting to the root cause.
Some observations:
* We're using krbd
* The PBS backups don't allow file restore if the backup was made from a "broken" image
* After mapping the current image, it doesn't seem to contain a partition table (the checks I ran are sketched below)
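For completeness, the kind of checks I've been running on the mapped device look roughly like this (pool/image/snapshot names are placeholders, not our real ones):
rbd map <pool>/vm-100-disk-0                      # mapped as e.g. /dev/rbd0
fdisk -l /dev/rbd0                                # no partition table is shown
file -s /dev/rbd0                                 # reports plain "data" instead of a boot sector
# map the snapshot read-only for comparison with the rolled-back image
rbd map --read-only <pool>/vm-100-disk-0@<snapname>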
There's a thread on the Proxmox forum about this issue as well[1].
If anyone could give some advice about how to proceed from here, I'd be very grateful.
Best regards,
Roel
PS: An upgrade to Pacific has already been planned.
[1] https://forum.proxmox.com/threads/vm-disks-corrupt-after-reverting-to-snaps…
--
We are ISO 27001 certified
1A First Alternative BV
T: +31 (0)88 0016405
W: https://www.1afa.com
Dear all
I have just changed the crush rule for all the replicated pools in the
following way:
ceph osd crush rule create-replicated replicated_hdd default host hdd
ceph osd pool set <poolname> crush_rule replicated_hdd
See also this [*] thread
Before applying this change, these pools were all using
the replicated_ruleset rule where the class is not specified.
I am now noticing a problem with the autoscaler: "ceph osd pool
autoscale-status" doesn't report any output, and the mgr log complains about
overlapping roots:
[pg_autoscaler ERROR root] pool xyz has overlapping roots: {-18, -1}
Indeed:
# ceph osd crush tree --show-shadow
ID   CLASS  WEIGHT      TYPE NAME
-18  hdd    1329.26501  root default~hdd
-17  hdd     329.14154      rack Rack11-PianoAlto~hdd
-15  hdd      54.56085          host ceph-osd-04~hdd
 30  hdd       5.45609              osd.30
 31  hdd       5.45609              osd.31
...
...
 -1         1329.26501  root default
 -7          329.14154      rack Rack11-PianoAlto
 -8           54.56085          host ceph-osd-04
 30  hdd       5.45609              osd.30
 31  hdd       5.45609              osd.31
...
I have already read about this behaviour, but I have no clear idea how to
fix the problem.
I read somewhere that it happens when some rules force certain pools to
use only one device class while other pools make no distinction between
device classes.
All the replicated pools are now using the replicated_hdd rule, but I also have
some EC pools which are using a profile where the device class is not specified.
As far as I understand, I can't force these pools to use only the hdd class:
according to the docs I can't change this profile to specify the hdd class
(or at least the change wouldn't be applied to the existing EC pools).
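The only workaround I could think of (untested, so I may well be wrong) is to leave the existing profile alone, create a new profile with the same k/m but restricted to hdd, build an erasure rule from it, and then switch the existing pools to that rule. The k/m values below are just an example:
ceph osd erasure-code-profile set ecprofile-hdd k=4 m=2 crush-device-class=hdd
ceph osd crush rule create-erasure ecrule-hdd ecprofile-hdd
ceph osd pool set <ec-poolname> crush_rule ecrule-hdd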
Any suggestions ?
The crush map is available at https://cernbox.cern.ch/s/gIyjbQbmoTFHCrr, if
you want to have a look
Many thanks, Massimo
[*] https://www.mail-archive.com/ceph-users@ceph.io/msg18534.html
My Ceph IOPS are very low: more than 48 SSD-backed OSDs, with NVMe devices for DB/WAL, across four physical servers, yet the whole cluster does only about 20K IOPS in total, so it looks like IO is being throttled by a bottleneck somewhere. Dstat shows a lot of context switches and over 150K interrupts while I am running a FIO 4K, 128-queue-depth benchmark.
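For reference, the fio job I run is roughly the following (testing against an RBD image with fio's rbd engine; pool/image names are placeholders):
fio --name=4k-randwrite --ioengine=rbd --clientname=admin \
    --pool=<pool> --rbdname=<image> \
    --rw=randwrite --bs=4k --iodepth=128 --numjobs=1 \
    --direct=1 --runtime=60 --time_based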
Each SSD shows only about 40 MB/s of throughput at ~250 IOPS. The network is 20 Gb total and nowhere near saturated. CPUs are around 50% idle with 2x E5 2950v2 per node.
Is it normal for the context switches and interrupts to be that high, and how can I reduce them? Where else could the bottleneck be?
Happy New Year all!
This release remains in "in progress"/"on hold" status while we sort
out the infrastructure-related issues.
Unless I hear objections, I suggest doing a full rebase/retest QE
cycle (adding the PRs merged lately) once sepia is back online, since
this is taking much longer than anticipated.
Objections?
Thx
YuriW
On Thu, Dec 15, 2022 at 9:14 AM Yuri Weinstein <yweinste(a)redhat.com> wrote:
>
> Details of this release are summarized here:
>
> https://tracker.ceph.com/issues/58257#note-1
> Release Notes - TBD
>
> Seeking approvals for:
>
> rados - Neha (https://github.com/ceph/ceph/pull/49431 is still being
> tested and will be merged soon)
> rook - Sébastien Han
> cephadm - Adam
> dashboard - Ernesto
> rgw - Casey (rgw will be rerun on the latest SHA1)
> rbd - Ilya, Deepika
> krbd - Ilya, Deepika
> fs - Venky, Patrick
> upgrade/nautilus-x (pacific) - Neha, Laura
> upgrade/octopus-x (pacific) - Neha, Laura
> upgrade/pacific-p2p - Neha, Laura
> powercycle - Brad
> ceph-volume - Guillaume, Adam K
>
> Thx
> YuriW
Dear all,
We have started to use CephFS more intensively for some WLCG-related workloads.
We have 3 active MDS instances spread across 3 servers, with mds_cache_memory_limit=12G; most of the other settings are at their defaults.
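For reference, the non-default bits were applied more or less like this (from memory):
ceph fs set cephfs max_mds 3
ceph config set mds mds_cache_memory_limit 12884901888   # 12 GiB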
One of them crashed last night, leaving the log below.
Do you have any hint on what could be the cause and how to avoid it?
Regards,
Giuseppe
[root@naret-monitor03 ~]# journalctl -u ceph-63334166-d991-11eb-99de-40a6b72108d0(a)mds.cephfs.naret-monitor03.lqppte.service
...
Jan 19 04:49:40 naret-monitor03 ceph-63334166-d991-11eb-99de-40a6b72108d0-mds-cephfs-naret-monitor03-lqppte[4397]: ceph version 16.2.7 (dd0603118f56ab514f133c8d2e3adfc983942503) pacific >
Jan 19 04:49:40 naret-monitor03 ceph-63334166-d991-11eb-99de-40a6b72108d0-mds-cephfs-naret-monitor03-lqppte[4397]: 1: /lib64/libpthread.so.0(+0x12ce0) [0x7fe291e4fce0]
Jan 19 04:49:40 naret-monitor03 ceph-63334166-d991-11eb-99de-40a6b72108d0-mds-cephfs-naret-monitor03-lqppte[4397]: 2: abort()
Jan 19 04:49:40 naret-monitor03 ceph-63334166-d991-11eb-99de-40a6b72108d0-mds-cephfs-naret-monitor03-lqppte[4397]: 3: /lib64/libstdc++.so.6(+0x987ba) [0x7fe2912567ba]
Jan 19 04:49:40 naret-monitor03 ceph-63334166-d991-11eb-99de-40a6b72108d0-mds-cephfs-naret-monitor03-lqppte[4397]: 4: /lib64/libstdc++.so.6(+0x9653c) [0x7fe29125453c]
Jan 19 04:49:40 naret-monitor03 ceph-63334166-d991-11eb-99de-40a6b72108d0-mds-cephfs-naret-monitor03-lqppte[4397]: 5: /lib64/libstdc++.so.6(+0x95559) [0x7fe291253559]
Jan 19 04:49:40 naret-monitor03 ceph-63334166-d991-11eb-99de-40a6b72108d0-mds-cephfs-naret-monitor03-lqppte[4397]: 6: __gxx_personality_v0()
Jan 19 04:49:40 naret-monitor03 ceph-63334166-d991-11eb-99de-40a6b72108d0-mds-cephfs-naret-monitor03-lqppte[4397]: 7: /lib64/libgcc_s.so.1(+0x10b03) [0x7fe290c34b03]
Jan 19 04:49:40 naret-monitor03 ceph-63334166-d991-11eb-99de-40a6b72108d0-mds-cephfs-naret-monitor03-lqppte[4397]: 8: _Unwind_Resume()
Jan 19 04:49:40 naret-monitor03 ceph-63334166-d991-11eb-99de-40a6b72108d0-mds-cephfs-naret-monitor03-lqppte[4397]: 9: /usr/bin/ceph-mds(+0x18c104) [0x5638351e7104]
Jan 19 04:49:40 naret-monitor03 ceph-63334166-d991-11eb-99de-40a6b72108d0-mds-cephfs-naret-monitor03-lqppte[4397]: 10: /lib64/libpthread.so.0(+0x12ce0) [0x7fe291e4fce0]
Jan 19 04:49:40 naret-monitor03 ceph-63334166-d991-11eb-99de-40a6b72108d0-mds-cephfs-naret-monitor03-lqppte[4397]: 11: gsignal()
Jan 19 04:49:40 naret-monitor03 ceph-63334166-d991-11eb-99de-40a6b72108d0-mds-cephfs-naret-monitor03-lqppte[4397]: 12: abort()
Jan 19 04:49:40 naret-monitor03 ceph-63334166-d991-11eb-99de-40a6b72108d0-mds-cephfs-naret-monitor03-lqppte[4397]: 13: /lib64/libstdc++.so.6(+0x9009b) [0x7fe29124e09b]
Jan 19 04:49:40 naret-monitor03 ceph-63334166-d991-11eb-99de-40a6b72108d0-mds-cephfs-naret-monitor03-lqppte[4397]: 14: /lib64/libstdc++.so.6(+0x9653c) [0x7fe29125453c]
Jan 19 04:49:40 naret-monitor03 ceph-63334166-d991-11eb-99de-40a6b72108d0-mds-cephfs-naret-monitor03-lqppte[4397]: 15: /lib64/libstdc++.so.6(+0x96597) [0x7fe291254597]
Jan 19 04:49:40 naret-monitor03 ceph-63334166-d991-11eb-99de-40a6b72108d0-mds-cephfs-naret-monitor03-lqppte[4397]: 16: /lib64/libstdc++.so.6(+0x967f8) [0x7fe2912547f8]
Jan 19 04:49:40 naret-monitor03 ceph-63334166-d991-11eb-99de-40a6b72108d0-mds-cephfs-naret-monitor03-lqppte[4397]: 17: /lib64/libtcmalloc.so.4(+0x19fa4) [0x7fe29bae6fa4]
Jan 19 04:49:40 naret-monitor03 ceph-63334166-d991-11eb-99de-40a6b72108d0-mds-cephfs-naret-monitor03-lqppte[4397]: 18: (tcmalloc::ThreadCache::FetchFromCentralCache(unsigned int, int, vo>
Jan 19 04:49:40 naret-monitor03 ceph-63334166-d991-11eb-99de-40a6b72108d0-mds-cephfs-naret-monitor03-lqppte[4397]: 19: (std::shared_ptr<inode_t<mempool::mds_co::pool_allocator> > InodeSt>
Jan 19 04:49:40 naret-monitor03 ceph-63334166-d991-11eb-99de-40a6b72108d0-mds-cephfs-naret-monitor03-lqppte[4397]: 20: (CInode::_decode_base(ceph::buffer::v15_2_0::list::iterator_impl<tr>
Jan 19 04:49:40 naret-monitor03 ceph-63334166-d991-11eb-99de-40a6b72108d0-mds-cephfs-naret-monitor03-lqppte[4397]: 21: (CInode::decode_import(ceph::buffer::v15_2_0::list::iterator_impl<t>
Jan 19 04:49:40 naret-monitor03 ceph-63334166-d991-11eb-99de-40a6b72108d0-mds-cephfs-naret-monitor03-lqppte[4397]: 22: (Migrator::decode_import_inode(CDentry*, ceph::buffer::v15_2_0::lis>
Jan 19 04:49:40 naret-monitor03 ceph-63334166-d991-11eb-99de-40a6b72108d0-mds-cephfs-naret-monitor03-lqppte[4397]: 23: (Migrator::decode_import_dir(ceph::buffer::v15_2_0::list::iterator_>
Jan 19 04:49:40 naret-monitor03 ceph-63334166-d991-11eb-99de-40a6b72108d0-mds-cephfs-naret-monitor03-lqppte[4397]: 24: (Migrator::handle_export_dir(boost::intrusive_ptr<MExportDir const>>
Jan 19 04:49:40 naret-monitor03 ceph-63334166-d991-11eb-99de-40a6b72108d0-mds-cephfs-naret-monitor03-lqppte[4397]: 25: (Migrator::dispatch(boost::intrusive_ptr<Message const> const&)+0x1>
Jan 19 04:49:40 naret-monitor03 ceph-63334166-d991-11eb-99de-40a6b72108d0-mds-cephfs-naret-monitor03-lqppte[4397]: 26: (MDSRank::handle_message(boost::intrusive_ptr<Message const> const&>
Jan 19 04:49:40 naret-monitor03 ceph-63334166-d991-11eb-99de-40a6b72108d0-mds-cephfs-naret-monitor03-lqppte[4397]: 27: (MDSRank::_dispatch(boost::intrusive_ptr<Message const> const&, boo>
Jan 19 04:49:40 naret-monitor03 ceph-63334166-d991-11eb-99de-40a6b72108d0-mds-cephfs-naret-monitor03-lqppte[4397]: 28: (MDSRankDispatcher::ms_dispatch(boost::intrusive_ptr<Message const>>
Jan 19 04:49:40 naret-monitor03 ceph-63334166-d991-11eb-99de-40a6b72108d0-mds-cephfs-naret-monitor03-lqppte[4397]: 29: (MDSDaemon::ms_dispatch2(boost::intrusive_ptr<Message> const&)+0x10>
Jan 19 04:49:40 naret-monitor03 ceph-63334166-d991-11eb-99de-40a6b72108d0-mds-cephfs-naret-monitor03-lqppte[4397]: 30: (DispatchQueue::entry()+0x126a) [0x7fe2930a5aba]
Jan 19 04:49:40 naret-monitor03 ceph-63334166-d991-11eb-99de-40a6b72108d0-mds-cephfs-naret-monitor03-lqppte[4397]: 31: (DispatchQueue::DispatchThread::entry()+0x11) [0x7fe2931575d1]
Jan 19 04:49:40 naret-monitor03 ceph-63334166-d991-11eb-99de-40a6b72108d0-mds-cephfs-naret-monitor03-lqppte[4397]: 32: /lib64/libpthread.so.0(+0x81cf) [0x7fe291e451cf]
Jan 19 04:49:40 naret-monitor03 ceph-63334166-d991-11eb-99de-40a6b72108d0-mds-cephfs-naret-monitor03-lqppte[4397]: 33: clone()
Jan 19 04:49:40 naret-monitor03 ceph-63334166-d991-11eb-99de-40a6b72108d0-mds-cephfs-naret-monitor03-lqppte[4397]: NOTE: a copy of the executable, or `objdump -rdS <executable>` is neede>
Jan 19 04:49:40 naret-monitor03 ceph-63334166-d991-11eb-99de-40a6b72108d0-mds-cephfs-naret-monitor03-lqppte[4397]:
Jan 19 04:49:40 naret-monitor03 ceph-63334166-d991-11eb-99de-40a6b72108d0-mds-cephfs-naret-monitor03-lqppte[4397]: --- begin dump of recent events ---
Jan 19 04:49:40 naret-monitor03 ceph-63334166-d991-11eb-99de-40a6b72108d0-mds-cephfs-naret-monitor03-lqppte[4397]: terminate called recursively
Jan 19 04:49:43 naret-monitor03 systemd[1]: ceph-63334166-d991-11eb-99de-40a6b72108d0(a)mds.cephfs.naret-monitor03.lqppte.service: Main process exited, code=exited, status=127/n/a
Jan 19 04:49:43 naret-monitor03 systemd[1]: ceph-63334166-d991-11eb-99de-40a6b72108d0(a)mds.cephfs.naret-monitor03.lqppte.service: Failed with result 'exit-code'.
We use rbd-mirror as a way to migrate volumes between clusters.
The process is: enable mirroring on the image to be migrated, demote it on
the primary cluster, promote it on the secondary cluster, and then disable
mirroring on the image.
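Concretely, per image the steps look roughly like this (pool/image names are just examples):
# on the source cluster (current primary)
rbd mirror image enable images/volume-1
rbd mirror image demote images/volume-1
# on the destination cluster, once the image has synced
rbd mirror image promote images/volume-1
rbd mirror image disable images/volume-1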
When we started using `rbd_mirroring_delete_delay` so we could retain a
backup of the source image, we noticed volumes with unprotected snaps do
not get purged from the trash. Previously, the image and all its snaps
would be successfully removed after disabling mirroring.
I would expect a similar function when using `rbd_mirroring_delete_delay`
as well. Is rbd trash just overly cautious here?
--
Tyler Brekke
Senior Engineer I
tbrekke(a)digitalocean.com
Dear all
I have a Ceph cluster where, so far, all OSDs have been rotational HDDs
(there are actually some SSDs, but they are used only for block.db and WAL).
I now want to add some SSD disks to be used as OSD. My use case is:
1) for the existing pools keep using only hdd disks
2) create some new pools using only ssd disks (a sketch of what I have in mind for this is just below)
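For case 2, once the SSD OSDs are in place, I expect to do something along these lines (the pool name and PG counts are just examples):
ceph osd crush rule create-replicated replicated_ssd default host ssd
ceph osd pool create <ssd-poolname> 128 128 replicated replicated_ssd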
Let's start with 1 (I haven't added the ssd disks to the cluster yet).
I have some replicated pools and some ec pools. The replicated pools are
using a replicated_ruleset rule [*].
I created a new "replicated_hdd" rule [**] using the command:
ceph osd crush rule create-replicated replicated_hdd default host hdd
I then changed the crush rule of an existing pool (that was using
'replicated_ruleset') using the command:
ceph osd pool set <poolname> crush_rule replicated_hdd
This triggered the remapping of some PGs and therefore some data movement.
Is this normal/expected, given that for the time being I have only hdd OSDs?
Thanks, Massimo
[*]
rule replicated_ruleset {
        id 0
        type replicated
        min_size 1
        max_size 10
        step take default
        step chooseleaf firstn 0 type host
        step emit
}
[**]
rule replicated_hdd {
        id 7
        type replicated
        min_size 1
        max_size 10
        step take default class hdd
        step chooseleaf firstn 0 type host
        step emit
}
Hi.
I'm new to Ceph and have been toying around in a virtual environment (for now), trying to understand how to manage it. I made 3 VMs in Proxmox and provisioned a bunch of virtual drives to each, then bootstrapped following the official quincy-branch documentation.
These are the drives:
> /dev/sdb 128.00 GB sdb True False QEMU HARDDISK (HDD)
> /dev/sdc 128.00 GB sdc True False QEMU HARDDISK (HDD)
> /dev/sdd 32.00 GB sdd False False QEMU HARDDISK (SSD)
This is the lvdisplay on /dev/sdd after creating two lvs:
> db-0 dev0-db-0 -wi-a----- 16.00g
>
> db-1 dev0-db-0 -wi-a----- <16.00g
My curiosity was to have OSDs with data=raw + block.db=lv created like this:
> ceph-volume raw prepare --bluestore --data /dev/sdd --block.db /dev/mapper/dev0--db--0--db--0
This required tinkering with permissions and temporarily modifying /etc/ceph/ceph.keyring, because by default access wasn't allowed; RADOS complained about an unauthorized client.bootstrap-osd or something, but I got it to work eventually.
(By the way, in a real environment, would raw be of any benefit vs LVM everywhere?)
So now I have created 2 OSDs, each with its block.db on the SSD and the data on the HDD.
I repeated the steps on my other two boxes (by the way, can't this be done from a single box via the ceph CLI instead of repeating it on each host? my guess is below).
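I'm guessing the orchestrator can do this remotely with something along these lines (hostname/device are placeholders, and I haven't verified it):
# run from any host with the admin keyring; cephadm creates the OSD on the remote host
ceph orch daemon add osd <host>:/dev/sdb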
Now I am trying (and failing) to start OSD daemons on this host. I tried "ceph orch apply osd --all-available-devices"; it tells me "Scheduled osd.all-available-devices update..." but nothing happens.
I'm also not sure how to apply OSDs from a YAML file, since that would provision them and... they're already provisioned using the ceph-volume command above... right?
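For what it's worth, the kind of spec file I was thinking of looks roughly like this (the filters are just my guess at something that would match my HDD+SSD layout):
service_type: osd
service_id: hdd-with-ssd-db
placement:
  host_pattern: '*'
spec:
  data_devices:
    rotational: 1    # HDDs hold the data
  db_devices:
    rotational: 0    # SSDs hold block.db
I also stumbled across "ceph cephadm osd activate <host>", which sounds like it could make cephadm adopt OSDs that were already prepared with ceph-volume, but I haven't tried it yet.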
I'm having trouble getting a lot of things to work; this is just one of them, and even if I feel nostalgic using mailing lists, it's inefficient. Is there any interactive community where I can find people who are usually online and talk to them in real time, like Discord/Slack etc.? I tried IRC but most people are AFK.
Thanks