Hi,
I'm working on a curious case that looks like a bug in PG merging,
maybe related to FileStore.
The setup is 14.2.1, half BlueStore and half FileStore (being
migrated), and the number of PGs on an RGW index pool was reduced;
now one of the PGs (3 FileStore OSDs) seems to be corrupted. There are
29 affected objects (~20% of the PG), and the issue looks
like this for one of them, which I'll call .dir.A here:
# object seems to exist according to rados
rados -p default.rgw.buckets.index ls | grep .dir.A
.dir.A
# or doesn't it?
rados -p default.rgw.buckets.index get .dir.A -
error getting default.rgw.buckets.index/.dir.A: (2) No such file or directory
Running a deep-scrub reports that everything is okay with the affected PG.
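For reference, the scrub was triggered with something like:
ceph pg deep-scrub 18.2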
This is what the OSD logs when trying to access the object; nothing
really relevant even with debug 20:
10 osd.57 pg_epoch: 1149030 pg[18.2( v 1148996'1422066
(1144429'1418988,1148996'1422066] local-lis/les=1149021/1149022 n=135
ec=49611/596 lis/c 1149021/1149021 les/c/f 1149022/1149022/0
1149015/1149021/1149021) [57,0,31] r=0 lpr=1149021 crt=1148996'1422066
lcod 1148996'1422065 mlcod 0'0 active+clean] get_object_context: no
obc for soid 18:764060e4:::.dir.A:head and !can_create
So going one level deeper with ceph-objectstore-tool:
# --op list
(29 messages like this)
error getting default.rgw.buckets.index/.dir.A: (2) No such file or directory
followed by the complete JSON output for the objects, including the
broken ones
# .dir.A dump
dump
Error stat on : 18.2_head,#18:73996afb:::.dir.A:head#, (2) No such
file or directory
Error getting snapset on : 18.2_head,#18:73996afb:::.dir.A:head#, (2)
No such file or directory
{
"id": {
"oid": ".dir.A",
"key": "",
"snapid": -2,
"hash": 3746994638,
"max": 0,
"pool": 18,
"namespace": "",
"max": 0
}
}
# --op export
stops after encountering a bad object with 'export_files error -2'
This is the same for all 3 OSDs in that PG.
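For reference, the ceph-objectstore-tool invocations behind the output above were roughly the following, run against the stopped OSDs (paths are examples for osd.57):
systemctl stop ceph-osd@57
ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-57 --journal-path /var/lib/ceph/osd/ceph-57/journal --pgid 18.2 --op list
ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-57 --journal-path /var/lib/ceph/osd/ceph-57/journal '.dir.A' dump
ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-57 --journal-path /var/lib/ceph/osd/ceph-57/journal --pgid 18.2 --op export --file /tmp/pg18.2.export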
Has anyone encountered something similar? I'll probably just nuke the
affected bucket indices tomorrow and re-create them.
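Roughly what I have in mind for that (bucket name is a placeholder):
radosgw-admin bi purge --bucket=<bucket>
radosgw-admin bucket check --bucket=<bucket> --check-objects --fix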
Paul
--
Paul Emmerich
Looking for help with your Ceph cluster? Contact us at https://croit.io
croit GmbH
Freseniusstr. 31h
81247 München
www.croit.io
Tel: +49 89 1896585 90
Hello,
We are having an issue with bucket index consistency between two zones in a multisite environment. The master zone (originally a single-zone implementation) is running 12.2.5, and the secondary zone is running 12.2.11. We implemented a multisite configuration to migrate to new hardware (among other tasks that justified the use of multisite over simply introducing the new nodes).
We have one bucket in particular that has ~40k objects missing (of 330k total) in the bucket index on the secondary cluster after sync completed. The objects are present in the data pool and can be accessed directly, but they cannot be listed. We've attempted bucket check --bucket=bucket1 --fix and also --check-objects; none of the missing objects are listed in the output of --fix. We've tried using the Ceph documentation to reshard the bucket in master after purging the bucket in secondary, with the same result: all objects sync, but none of the ~40k are listed in the index. We've searched the user list history and see occurrences of this issue with other users, but those threads seem to end before a resolution is reached. We are simply looking to get our bucket indexes in sync so we can cut over to the newer cluster as master and destroy the old cluster.
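For reference, the commands we have been running on the secondary look roughly like this (bucket1 is a placeholder for the real bucket name):
radosgw-admin sync status
radosgw-admin bucket sync status --bucket=bucket1
radosgw-admin bucket check --bucket=bucket1 --check-objects --fix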
Any help would be greatly appreciated.
Thanks,
Ben
Getting decent RBD performance is not a trivial exercise. While at first glance 61 SSDs for 245 clients sounds more or less OK, it does come down to a bit more than that.
The first thing is how to get SSD performance out of SSDs with ceph. This post will provide very good clues and might already point out the bottleneck: https://yourcmc.ru/wiki/index.php?title=Ceph_performance . Do you have good enterprise SSDs?
The next thing to look at: what kind of data pool, replicated or erasure coded? If erasure coded, has the profile been benchmarked? There are very poor choices; good ones are 4+m and 8+m with m>=2. 4+m gives better IOPS, 8+m better throughput.
More complications: do you need to deploy more than one OSD per SSD to boost performance? This is indicated by the iodepth required in an fio benchmark to get full IOPS. Good SSDs already deliver spec performance with 1 OSD; more common ones require 2-4 OSDs per disk. Are you already using ceph-volume? Its default is 2 OSDs per SSD (batch mode).
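As a rough sketch, the kind of fio run I mean looks like this (the device path is an example and the test destroys data on it), and multiple OSDs per device can be deployed with ceph-volume batch:
fio --name=write-iops --filename=/dev/sdX --direct=1 --sync=1 --ioengine=libaio --rw=randwrite --bs=4k --numjobs=1 --iodepth=1 --runtime=60 --time_based --group_reporting
# repeat with --iodepth=2, 4, 8, ... and note where the IOPS stop scaling
ceph-volume lvm batch --osds-per-device 2 /dev/sdX /dev/sdY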
To give a baseline: after extensive testing and working through all the required tuning steps, I could run about 250 VMs on a 6+2 EC data pool on 33 enterprise SAS SSDs with 1 OSD per disk, each VM getting 50 IOPS write performance. This is probably what you would like to see as well.
If you use a replicated data pool, this should be relatively easy. With an EC data pool, it is a bit of a battle.
Good luck,
=================
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14
________________________________________
From: ceph-users <ceph-users-bounces(a)lists.ceph.com> on behalf of Void Star Nill <void.star.nill(a)gmail.com>
Sent: 22 October 2019 03:00
To: ceph-users
Subject: [ceph-users] Fwd: large concurrent rbd operations block for over 15 mins!
Apparently the graph is too big, so my last post is stuck. Resending without the graph.
Thanks
---------- Forwarded message ---------
From: Void Star Nill <void.star.nill(a)gmail.com<mailto:void.star.nill@gmail.com>>
Date: Mon, Oct 21, 2019 at 4:41 PM
Subject: large concurrent rbd operations block for over 15 mins!
To: ceph-users <ceph-users(a)lists.ceph.com<mailto:ceph-users@lists.ceph.com>>
Hello,
I have been running some benchmark tests on a mid-size cluster and I am seeing some issues. I wanted to know whether this is a bug or something that can be tuned. I appreciate any help on this.
- I have a 15-node Ceph cluster, with 3 monitors and 12 data nodes with a total of 61 OSDs on SSDs, running 14.2.4 nautilus (stable). Each node has a 100G link.
- I have 245 client machines from which I am triggering rbd operations. Each client has a 25G link.
- The rbd operations include creating a 50G RBD image with the layering feature, mapping the image to the client machine, formatting the device as ext4, mounting it, running dd to write the full disk, and cleaning up (unmount, unmap and remove).
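Roughly, each client runs a sequence like this (pool/image names are placeholders):
rbd create --size 50G --image-feature layering rbd_bench/img01
dev=$(rbd map rbd_bench/img01)
mkfs.ext4 "$dev"
mkdir -p /mnt/rbd_test && mount "$dev" /mnt/rbd_test
dd if=/dev/zero of=/mnt/rbd_test/fill bs=1M oflag=direct   # runs until the image is full
umount /mnt/rbd_test
rbd unmap "$dev"
rbd rm rbd_bench/img01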
If I run these RBD operations concurrently on a small number of machines (say 16-20), they run very well and I see good throughput. All image operations (except for dd) take less than 2 seconds.
However, when I scale it up to 245 clients, each running these operations concurrently, I see a lot of operations getting hung for a long time and the overall throughput reduces drastically.
For example, some of the format operations take over 10-15 mins!!!
Note that all operations do complete - so it's most likely not a deadlock kind of situation.
I don't see any errors in ceph.log on the monitor nodes. However, the clients do report "hung_task_timeout" in dmesg logs.
As you can see from the graph I originally attached (y axis in seconds), half of the format operations complete in less than a second, while the other half take over 10 minutes.
[11117.113618] INFO: task umount:9902 blocked for more than 120 seconds.
[11117.113677] Tainted: G OE 4.15.0-51-generic #55~16.04.1-Ubuntu
[11117.113731] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[11117.113787] umount D 0 9902 9901 0x00000000
[11117.113793] Call Trace:
[11117.113804] __schedule+0x3d6/0x8b0
[11117.113810] ? _raw_spin_unlock_bh+0x1e/0x20
[11117.113814] schedule+0x36/0x80
[11117.113821] wb_wait_for_completion+0x64/0x90
[11117.113828] ? wait_woken+0x80/0x80
[11117.113831] __writeback_inodes_sb_nr+0x8e/0xb0
[11117.113835] writeback_inodes_sb+0x27/0x30
[11117.113840] __sync_filesystem+0x51/0x60
[11117.113844] sync_filesystem+0x26/0x40
[11117.113850] generic_shutdown_super+0x27/0x120
[11117.113854] kill_block_super+0x2c/0x80
[11117.113858] deactivate_locked_super+0x48/0x80
[11117.113862] deactivate_super+0x5a/0x60
[11117.113866] cleanup_mnt+0x3f/0x80
[11117.113868] __cleanup_mnt+0x12/0x20
[11117.113874] task_work_run+0x8a/0xb0
[11117.113881] exit_to_usermode_loop+0xc4/0xd0
[11117.113885] do_syscall_64+0x100/0x130
[11117.113887] entry_SYSCALL_64_after_hwframe+0x3d/0xa2
[11117.113891] RIP: 0033:0x7f0094384487
[11117.113893] RSP: 002b:00007fff4199efc8 EFLAGS: 00000246 ORIG_RAX: 00000000000000a6
[11117.113897] RAX: 0000000000000000 RBX: 0000000000944030 RCX: 00007f0094384487
[11117.113899] RDX: 0000000000000001 RSI: 0000000000000000 RDI: 0000000000944210
[11117.113900] RBP: 0000000000944210 R08: 0000000000000000 R09: 0000000000000014
[11117.113902] R10: 00000000000006b2 R11: 0000000000000246 R12: 00007f009488d83c
[11117.113903] R13: 0000000000000000 R14: 0000000000000000 R15: 00007fff4199f250
On Wed, 23 Oct 2019 at 3:12 am, <ceph-users-request(a)ceph.io> wrote:
>
> Today's Topics:
>
> 1. Re: Replace ceph osd in a container (Sasha Litvak)
> 2. Re: Fwd: large concurrent rbd operations block for over 15 mins!
> (Mark Nelson)
> 3. Re: rgw multisite failover (Ed Fisher)
>
>
> ----------------------------------------------------------------------
>
> Date: Tue, 22 Oct 2019 08:52:54 -0500
> From: Sasha Litvak <alexander.v.litvak(a)gmail.com>
> Subject: [ceph-users] Re: Replace ceph osd in a container
> To: Frank Schilder <frans(a)dtu.dk>
> Cc: ceph-users <ceph-users(a)ceph.io>
>
> Frank,
>
> Thank you for your suggestion. It sounds very promising. I will
> definitely try it.
>
> Best,
>
> On Tue, Oct 22, 2019, 2:44 AM Frank Schilder <frans(a)dtu.dk> wrote:
>
> > > I am suspecting that mon or mgr have no access to /dev or /var/lib
> whil=
> e
> > osd containers do.
> > > Cluster configured originally by ceph-ansible (nautilus 14.2.2)
> >
> > They don't, because they don't need to.
> >
> > > The question is if I want to replace all disks on a single node, and I
> > have 6 nodes with pools
> > > replication 3, is it safe to restart mgr mounting /dev and
> /var/lib/cep=
> h
> > volumes (not configured right now).
> >
> > Restarting mons is safe in the sense that data will not get lost.
> However=
> ,
> > access might get lost temporarily.
> >
> > The question is, how many mons do you have? If you have only 1 or 2, it
> > will mean downtime. If you can bear the downtime, it doesn't matter. If
> y=
> ou
> > have at least 3, you can restart one after the other.
> >
> > However, I would not do that. Having to restart a mon container every
> tim=
> e
> > some minor container config changes for reasons that have nothing to do
> > with a mon sounds like calling for trouble.
> >
> > I also use containers and would recommend a different approach. I created
> > an additional type of container (ceph-adm) that I use for all admin
> tasks=
> .
> > Its the same image and the entry point simply executes a sleep infinity.
> =
> In
> > this container I make all relevant hardware visible. You might also want
> =
> to
> > expose /var/run/ceph to be able to use admin sockets without hassle. This
> > way, I separated admin operations from actual storage daemons and can
> > modify and restart the admin container as I like.
> >
> > Best regards,
> >
> > =3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D
> > Frank Schilder
> > AIT Ris=C3=B8 Campus
> > Bygning 109, rum S14
> >
> > ________________________________________
> > From: ceph-users <ceph-users-bounces(a)lists.ceph.com> on behalf of Alex
> > Litvak <alexander.v.litvak(a)gmail.com>
> > Sent: 22 October 2019 08:04
> > To: ceph-users(a)lists.ceph.com
> > Subject: [ceph-users] Replace ceph osd in a container
> >
> > Hello cephers,
> >
> > So I am having trouble with a new hardware systems with strange OSD
> > behavior and I want to replace a disk with a brand new one to test the
> > theory.
> >
> > I run all daemons in containers and on one of the nodes I have mon, mgr,
> > and 6 osds. So following
> >
> https://docs.ceph.com/docs/master/rados/operations/add-or-rm-osds/#replac=
> ing-an-osd
> <https://docs.ceph.com/docs/master/rados/operations/add-or-rm-osds/#replac=i…>
> >
> > I stopped container with osd.23, waited until it is down and out, ran
> > safe-to-destroy loop and then destroyed the osd all using the monitor
> fro=
> m
> > the container on this node. All good.
> >
> > Then I swapped the SSDs and started running additional steps (from step
> 3=
> )
> > using the same mon container. I have no ceph packages installed on the
> > bare metal box. It looks like mon container doesn't
> > see the disk.
> >
> > podman exec -it ceph-mon-storage2n2-la ceph-volume lvm zap /dev/sdh
> > stderr: lsblk: /dev/sdh: not a block device
> > stderr: error: /dev/sdh: No such file or directory
> > stderr: Unknown device, --name=3D, --path=3D, or absolute path in
> /dev/=
> or
> > /sys expected.
> > usage: ceph-volume lvm zap [-h] [--destroy] [--osd-id OSD_ID]
> > [--osd-fsid OSD_FSID]
> > [DEVICES [DEVICES ...]]
> > ceph-volume lvm zap: error: Unable to proceed with non-existing device:
> > /dev/sdh
> > Error: exit status 2
> > root@storage2n2-la:~# ls -l /dev/sd
> > sda sdc sdd sde sdf sdg sdg1 sdg2 sdg5 sdh
> > root@storage2n2-la:~# podman exec -it ceph-mon-storage2n2-la ceph-volume
> > lvm zap sdh
> > stderr: lsblk: sdh: not a block device
> > stderr: error: sdh: No such file or directory
> > stderr: Unknown device, --name=3D, --path=3D, or absolute path in
> /dev/=
> or
> > /sys expected.
> > usage: ceph-volume lvm zap [-h] [--destroy] [--osd-id OSD_ID]
> > [--osd-fsid OSD_FSID]
> > [DEVICES [DEVICES ...]]
> > ceph-volume lvm zap: error: Unable to proceed with non-existing device:
> s=
> dh
> > Error: exit status 2
> >
> > I execute lsblk and it sees device sdh
> > root@storage2n2-la:~# podman exec -it ceph-mon-storage2n2-la lsblk
> > lsblk: dm-1: failed to get device path
> > lsblk: dm-2: failed to get device path
> > lsblk: dm-4: failed to get device path
> > lsblk: dm-6: failed to get device path
> > lsblk: dm-4: failed to get device path
> > lsblk: dm-2: failed to get device path
> > lsblk: dm-1: failed to get device path
> > lsblk: dm-0: failed to get device path
> > lsblk: dm-0: failed to get device path
> > lsblk: dm-7: failed to get device path
> > lsblk: dm-5: failed to get device path
> > lsblk: dm-7: failed to get device path
> > lsblk: dm-6: failed to get device path
> > lsblk: dm-5: failed to get device path
> > lsblk: dm-3: failed to get device path
> > NAME MAJ:MIN RM SIZE RO TYPE MOUNTPOINT
> > sdf 8:80 0 1.8T 0 disk
> > sdd 8:48 0 1.8T 0 disk
> > sdg 8:96 0 223.5G 0 disk
> > |-sdg5 8:101 0 223G 0 part
> > |-sdg1 8:97 487M 0 part
> > `-sdg2 8:98 1K 0 part
> > sde 8:64 0 1.8T 0 disk
> > sdc 8:32 0 3.5T 0 disk
> > sda 8:0 0 3.5T 0 disk
> > sdh 8:112 0 3.5T 0 disk
> >
> > So I use a fellow osd container (osd.5) on the same node and run all of
> > the operations (zap and prepare) successfully.
> >
> > I am suspecting that mon or mgr have no access to /dev or /var/lib while
> > osd containers do. Cluster configured originally by ceph-ansible
> (nautil=
> us
> > 14.2.2)
> >
> > The question is if I want to replace all disks on a single node, and I
> > have 6 nodes with pools replication 3, is it safe to restart mgr mounting
> > /dev and /var/lib/ceph volumes (not configured right now).
> >
> > I cannot use other osd containers on the same box because my controller
> > reverts from raid to non-raid mode with all disks lost and not just a
> > single one. So I need to replace all 6 osds to run back
> > in containers and the only things will remain operational on node are mon
> > and mgr containers.
> >
> > I prefer not to install a full cluster or client on the bare metal node
> i=
> f
> > possible.
> >
> > Thank you for your help,
> >
> > _______________________________________________
> > ceph-users mailing list
> > ceph-users(a)lists.ceph.com
> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> >
>
>
> ------------------------------
>
> Date: Tue, 22 Oct 2019 08:59:21 -0500
> From: Mark Nelson <mnelson(a)redhat.com>
> Subject: [ceph-users] Re: Fwd: large concurrent rbd operations block
> for over 15 mins!
> To: ceph-users(a)ceph.io
>
> Out of curiosity, when you chose EC over replication how did you weigh
> IOPS vs space amplification in your decision making process? I'm
> wondering if we should prioritize EC latency vs other tasks in future
> tuning efforts (it's always a tradeoff deciding what to focus on).
>
>
> Thanks,
>
> Mark
>
>
> ------------------------------
>
> Date: Tue, 22 Oct 2019 11:10:38 -0500
> From: Ed Fisher <ed(a)debacle.org>
> Subject: [ceph-users] Re: rgw multisite failover
> To: Frank R <frankaritchie(a)gmail.com>
> Cc: ceph-users <ceph-users(a)ceph.com>
>
> > On Oct 18, 2019, at 10:40 PM, Frank R <frankaritchie(a)gmail.com> wrote:
> >
> > I am looking to change an RGW multisite deployment so that the secondary will become master. This is meant to be a permanent change.
> >
> > Per:
> > https://docs.ceph.com/docs/mimic/radosgw/multisite/
> >
> > I need to:
> >
> > 1. Stop RGW daemons on the current master end.
> >
> > On a secondary RGW node:
> > 2. radosgw-admin zone modify --rgw-zone={zone-name} --master --default
> > 3. radosgw-admin period update --commit
> > 4. systemctl restart ceph-radosgw(a)rgw.`hostname -s`
> >
> > Since I want the former master to be secondary permanently do I need to do anything after restarting the RGW daemons on the old master end?
>
>
> Before you restart the RGW daemons on the old master you want to make sure you pull the current realm from the new master. Beyond that there should be no changes needed.
>
>
> ------------------------------
>
> End of ceph-users Digest, Vol 81, Issue 56
> ******************************************
>
I am looking to change an RGW multisite deployment so that the secondary
will become master. This is meant to be a permanent change.
Per:
https://docs.ceph.com/docs/mimic/radosgw/multisite/
I need to:
1. Stop RGW daemons on the current master end.
On a secondary RGW node:
2. radosgw-admin zone modify --rgw-zone={zone-name} --master --default
3. radosgw-admin period update --commit
4. systemctl restart ceph-radosgw(a)rgw.`hostname -s`
Since I want the former master to be secondary permanently do I need to do
anything after restarting the RGW daemons on the old master end?
> I am suspecting that mon or mgr have no access to /dev or /var/lib while osd containers do.
> Cluster configured originally by ceph-ansible (nautilus 14.2.2)
They don't, because they don't need to.
> The question is if I want to replace all disks on a single node, and I have 6 nodes with pools
> replication 3, is it safe to restart mgr mounting /dev and /var/lib/ceph volumes (not configured right now).
Restarting mons is safe in the sense that data will not get lost. However, access might get lost temporarily.
The question is, how many mons do you have? If you have only 1 or 2, it will mean downtime. If you can bear the downtime, it doesn't matter. If you have at least 3, you can restart one after the other.
However, I would not do that. Having to restart a mon container every time some minor container config changes for reasons that have nothing to do with a mon sounds like calling for trouble.
I also use containers and would recommend a different approach. I created an additional type of container (ceph-adm) that I use for all admin tasks. It's the same image, and the entry point simply executes a sleep infinity. In this container I make all relevant hardware visible. You might also want to expose /var/run/ceph to be able to use admin sockets without hassle. This way, I separated admin operations from actual storage daemons and can modify and restart the admin container as I like.
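A rough sketch of how such an admin container can be started (the image name and mount list are examples, adjust them to your deployment):
podman run -d --name ceph-adm --privileged \
  -v /dev:/dev -v /etc/ceph:/etc/ceph \
  -v /var/lib/ceph:/var/lib/ceph -v /var/run/ceph:/var/run/ceph \
  --entrypoint /bin/sleep docker.io/ceph/daemon:latest-nautilus infinity
# admin commands then run inside it, e.g.
podman exec -it ceph-adm ceph status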
Best regards,
=================
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14
________________________________________
From: ceph-users <ceph-users-bounces(a)lists.ceph.com> on behalf of Alex Litvak <alexander.v.litvak(a)gmail.com>
Sent: 22 October 2019 08:04
To: ceph-users(a)lists.ceph.com
Subject: [ceph-users] Replace ceph osd in a container
Hello cephers,
So I am having trouble with new hardware systems showing strange OSD behavior, and I want to replace a disk with a brand new one to test the theory.
I run all daemons in containers and on one of the nodes I have mon, mgr, and 6 osds. So following https://docs.ceph.com/docs/master/rados/operations/add-or-rm-osds/#replacin…
I stopped the container with osd.23, waited until it was down and out, ran the safe-to-destroy loop, and then destroyed the OSD, all using the monitor container on this node. All good.
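The safe-to-destroy loop was roughly this, run via the mon container on this node:
podman exec ceph-mon-storage2n2-la ceph osd out 23
while ! podman exec ceph-mon-storage2n2-la ceph osd safe-to-destroy osd.23 ; do sleep 10 ; done
podman exec ceph-mon-storage2n2-la ceph osd destroy 23 --yes-i-really-mean-it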
Then I swapped the SSDs and started running additional steps (from step 3) using the same mon container. I have no ceph packages installed on the bare metal box. It looks like the mon container doesn't
see the disk.
podman exec -it ceph-mon-storage2n2-la ceph-volume lvm zap /dev/sdh
stderr: lsblk: /dev/sdh: not a block device
stderr: error: /dev/sdh: No such file or directory
stderr: Unknown device, --name=, --path=, or absolute path in /dev/ or /sys expected.
usage: ceph-volume lvm zap [-h] [--destroy] [--osd-id OSD_ID]
[--osd-fsid OSD_FSID]
[DEVICES [DEVICES ...]]
ceph-volume lvm zap: error: Unable to proceed with non-existing device: /dev/sdh
Error: exit status 2
root@storage2n2-la:~# ls -l /dev/sd
sda sdc sdd sde sdf sdg sdg1 sdg2 sdg5 sdh
root@storage2n2-la:~# podman exec -it ceph-mon-storage2n2-la ceph-volume lvm zap sdh
stderr: lsblk: sdh: not a block device
stderr: error: sdh: No such file or directory
stderr: Unknown device, --name=, --path=, or absolute path in /dev/ or /sys expected.
usage: ceph-volume lvm zap [-h] [--destroy] [--osd-id OSD_ID]
[--osd-fsid OSD_FSID]
[DEVICES [DEVICES ...]]
ceph-volume lvm zap: error: Unable to proceed with non-existing device: sdh
Error: exit status 2
When I execute lsblk, it sees device sdh:
root@storage2n2-la:~# podman exec -it ceph-mon-storage2n2-la lsblk
lsblk: dm-1: failed to get device path
lsblk: dm-2: failed to get device path
lsblk: dm-4: failed to get device path
lsblk: dm-6: failed to get device path
lsblk: dm-4: failed to get device path
lsblk: dm-2: failed to get device path
lsblk: dm-1: failed to get device path
lsblk: dm-0: failed to get device path
lsblk: dm-0: failed to get device path
lsblk: dm-7: failed to get device path
lsblk: dm-5: failed to get device path
lsblk: dm-7: failed to get device path
lsblk: dm-6: failed to get device path
lsblk: dm-5: failed to get device path
lsblk: dm-3: failed to get device path
NAME MAJ:MIN RM SIZE RO TYPE MOUNTPOINT
sdf 8:80 0 1.8T 0 disk
sdd 8:48 0 1.8T 0 disk
sdg 8:96 0 223.5G 0 disk
|-sdg5 8:101 0 223G 0 part
|-sdg1 8:97 487M 0 part
`-sdg2 8:98 1K 0 part
sde 8:64 0 1.8T 0 disk
sdc 8:32 0 3.5T 0 disk
sda 8:0 0 3.5T 0 disk
sdh 8:112 0 3.5T 0 disk
So I use a fellow osd container (osd.5) on the same node and run all of the operations (zap and prepare) successfully.
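Concretely, that was along these lines (the osd.5 container name is an example; I reuse the destroyed OSD's id):
podman exec -it ceph-osd-5 ceph-volume lvm zap /dev/sdh --destroy
podman exec -it ceph-osd-5 ceph-volume lvm prepare --osd-id 23 --data /dev/sdh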
I am suspecting that mon or mgr have no access to /dev or /var/lib while osd containers do. Cluster configured originally by ceph-ansible (nautilus 14.2.2)
The question is if I want to replace all disks on a single node, and I have 6 nodes with pools replication 3, is it safe to restart mgr mounting /dev and /var/lib/ceph volumes (not configured right now).
I cannot use other osd containers on the same box because my controller reverts from raid to non-raid mode with all disks lost, not just a single one. So I need to replace all 6 OSDs to bring them back up
in containers, and the only things that will remain operational on the node are the mon and mgr containers.
I prefer not to install a full cluster or client on the bare metal node if possible.
Thank you for your help,
Hi,
Our cluster had an unexpected power outage, and the Ceph mon cannot start after that. The log shows:
Running command: '/usr/bin/ceph-mon -f -i 10.10.198.11 --public-addr 10.10.198.11:6789'
Corruption: 15 missing files; e.g.: /var/lib/ceph/mon/ceph-10.10.198.11/store.db/2676107.sst
Is there any way to fix this problem? Thank you very much!
We are running ceph 10.2.10.
Br,
Xu Yun