Hi,
I'm working on a curious case that looks like a bug in PG merging,
maybe related to FileStore.
The setup is 14.2.1, half BlueStore and half FileStore (being
migrated), and the number of PGs on an RGW index pool was reduced;
now one of the PGs (3 FileStore OSDs) seems to be corrupted. There are
29 affected objects (~20% of the PG), and the issue looks
like this for one of them, which I'll call .dir.A here:
# object seems to exist according to rados
rados -p default.rgw.buckets.index ls | grep .dir.A
.dir.A
# or doesn't it?
rados -p default.rgw.buckets.index get .dir.A -
error getting default.rgw.buckets.index/.dir.A: (2) No such file or directory
Running a deep-scrub reports that everything is okay with the affected PG.
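For reference, the scrub was triggered with something like:
ceph pg deep-scrub 18.2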
This is what the OSD logs when trying to access the object; nothing
really relevant even with debug 20:
10 osd.57 pg_epoch: 1149030 pg[18.2( v 1148996'1422066
(1144429'1418988,1148996'1422066] local-lis/les=1149021/1149022 n=135
ec=49611/596 lis/c 1149021/1149021 les/c/f 1149022/1149022/0
1149015/1149021/1149021) [57,0,31] r=0 lpr=1149021 crt=1148996'1422066
lcod 1148996'1422065 mlcod 0'0 active+clean] get_object_context: no
obc for soid 18:764060e4:::.dir.A:head and !can_create
So going one level deeper with ceph-objectstore-tool:
# --op list
(29 messages like this)
error getting default.rgw.buckets.index/.dir.A: (2) No such file or directory
followed by the complete JSON output for the objects, including the
broken ones
# .dir.A dump
dump
Error stat on : 18.2_head,#18:73996afb:::.dir.A:head#, (2) No such
file or directory
Error getting snapset on : 18.2_head,#18:73996afb:::.dir.A:head#, (2)
No such file or directory
{
"id": {
"oid": ".dir.A",
"key": "",
"snapid": -2,
"hash": 3746994638,
"max": 0,
"pool": 18,
"namespace": "",
"max": 0
}
}
# --op export
stops after encountering a bad object with 'export_files error -2'
This is the same for all 3 OSDs in that PG.
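For reference, the ceph-objectstore-tool invocations behind the output above were roughly the following, run against the stopped OSDs (paths are examples for osd.57):
systemctl stop ceph-osd@57
ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-57 --journal-path /var/lib/ceph/osd/ceph-57/journal --pgid 18.2 --op list
ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-57 --journal-path /var/lib/ceph/osd/ceph-57/journal '.dir.A' dump
ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-57 --journal-path /var/lib/ceph/osd/ceph-57/journal --pgid 18.2 --op export --file /tmp/pg18.2.export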
Has anyone encountered something similar? I'll probably just nuke the
affected bucket indices tomorrow and re-create them.
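Roughly what I have in mind for that (bucket name is a placeholder):
radosgw-admin bi purge --bucket=<bucket>
radosgw-admin bucket check --bucket=<bucket> --check-objects --fix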
Paul
--
Paul Emmerich
Looking for help with your Ceph cluster? Contact us at https://croit.io
croit GmbH
Freseniusstr. 31h
81247 München
www.croit.io
Tel: +49 89 1896585 90
Hello,
We are having an issue with bucket index consistency between two zones in a multisite environment. The master zone (originally a single-zone implementation) is running 12.2.5, and the secondary zone is running 12.2.11. We implemented a multisite configuration to migrate to new hardware (among other tasks that justified the use of multisite over simply introducing the new nodes).
We have one bucket in particular that has ~40k objects missing (of 330k total) in the bucket index on the secondary cluster after sync completed. The objects are present in the data pool and can be accessed directly, but they cannot be listed. We've attempted bucket check --bucket=bucket1 --fix and also --check-objects; none of the missing objects are listed in the output of --fix. We've tried using the Ceph documentation to reshard the bucket in master after purging the bucket in secondary, with the same result: all objects sync, but none of the ~40k are listed in the index. We've searched the user list history and see occurrences of this issue with other users, but those threads seem to end before a resolution is reached. We are simply looking to get our bucket indexes in sync so we can cut over to the newer cluster as master and destroy the old cluster.
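For reference, the commands we have been running on the secondary look roughly like this (bucket1 is a placeholder for the real bucket name):
radosgw-admin sync status
radosgw-admin bucket sync status --bucket=bucket1
radosgw-admin bucket check --bucket=bucket1 --check-objects --fix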
Any help would be greatly appreciated.
Thanks,
Ben
Getting decent RBD performance is not a trivial exercise. While at first glance 61 SSDs for 245 clients sounds more or less OK, it does come down to a bit more than that.
The first thing is how to get SSD performance out of SSDs with ceph. This post will provide very good clues and might already point out the bottleneck: https://yourcmc.ru/wiki/index.php?title=Ceph_performance . Do you have good enterprise SSDs?
The next thing to look at: what kind of data pool, replicated or erasure coded? If erasure coded, has the profile been benchmarked? There are very poor choices; good ones are 4+m and 8+m with m>=2. 4+m gives better IOPS, 8+m better throughput.
More complications: do you need to deploy more than one OSD per SSD to boost performance? This is indicated by the iodepth required in an fio benchmark to get full IOPS. Good SSDs already deliver spec performance with 1 OSD; more common ones require 2-4 OSDs per disk. Are you already using ceph-volume? Its default is 2 OSDs per SSD (batch mode).
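As a rough sketch, the kind of fio run I mean looks like this (the device path is an example and the test destroys data on it), and multiple OSDs per device can be deployed with ceph-volume batch:
fio --name=write-iops --filename=/dev/sdX --direct=1 --sync=1 --ioengine=libaio --rw=randwrite --bs=4k --numjobs=1 --iodepth=1 --runtime=60 --time_based --group_reporting
# repeat with --iodepth=2, 4, 8, ... and note where the IOPS stop scaling
ceph-volume lvm batch --osds-per-device 2 /dev/sdX /dev/sdY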
To give a baseline: after extensive testing and working through all the required tuning steps, I could run about 250 VMs on a 6+2 EC data pool on 33 enterprise SAS SSDs with 1 OSD per disk, each VM getting 50 IOPS write performance. This is probably what you would like to see as well.
If you use a replicated data pool, this should be relatively easy. With an EC data pool, it is a bit of a battle.
Good luck,
=================
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14
________________________________________
From: ceph-users <ceph-users-bounces(a)lists.ceph.com> on behalf of Void Star Nill <void.star.nill(a)gmail.com>
Sent: 22 October 2019 03:00
To: ceph-users
Subject: [ceph-users] Fwd: large concurrent rbd operations block for over 15 mins!
Apparently the graph is too big, so my last post is stuck. Resending without the graph.
Thanks
---------- Forwarded message ---------
From: Void Star Nill <void.star.nill(a)gmail.com<mailto:void.star.nill@gmail.com>>
Date: Mon, Oct 21, 2019 at 4:41 PM
Subject: large concurrent rbd operations block for over 15 mins!
To: ceph-users <ceph-users(a)lists.ceph.com<mailto:ceph-users@lists.ceph.com>>
Hello,
I have been running some benchmark tests on a mid-size cluster and I am seeing some issues. I wanted to know whether this is a bug or something that can be tuned. I appreciate any help on this.
- I have a 15-node Ceph cluster, with 3 monitors and 12 data nodes with a total of 61 OSDs on SSDs, running 14.2.4 nautilus (stable). Each node has a 100G link.
- I have 245 client machines from which I am triggering rbd operations. Each client has a 25G link.
- The rbd operations include creating a 50G RBD image with the layering feature, mapping the image to the client machine, formatting the device as ext4, mounting it, running dd to write the full disk, and cleaning up (unmount, unmap and remove).
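Roughly, each client runs a sequence like this (pool/image names are placeholders):
rbd create --size 50G --image-feature layering rbd_bench/img01
dev=$(rbd map rbd_bench/img01)
mkfs.ext4 "$dev"
mkdir -p /mnt/rbd_test && mount "$dev" /mnt/rbd_test
dd if=/dev/zero of=/mnt/rbd_test/fill bs=1M oflag=direct   # runs until the image is full
umount /mnt/rbd_test
rbd unmap "$dev"
rbd rm rbd_bench/img01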
If I run these RBD operations concurrently on a small number of machines (say 16-20), they run very well and I see good throughput. All image operations (except for dd) take less than 2 seconds.
However, when I scale it up to 245 clients, each running these operations concurrently, I see a lot of operations getting hung for a long time and the overall throughput reduces drastically.
For example, some of the format operations take over 10-15 mins!!!
Note that all operations do complete - so it's most likely not a deadlock kind of situation.
I don't see any errors in ceph.log on the monitor nodes. However, the clients do report "hung_task_timeout" in dmesg logs.
As you can see from the graph I originally attached (y axis in seconds), half of the format operations complete in less than a second, while the other half take over 10 minutes.
[11117.113618] INFO: task umount:9902 blocked for more than 120 seconds.
[11117.113677] Tainted: G OE 4.15.0-51-generic #55~16.04.1-Ubuntu
[11117.113731] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[11117.113787] umount D 0 9902 9901 0x00000000
[11117.113793] Call Trace:
[11117.113804] __schedule+0x3d6/0x8b0
[11117.113810] ? _raw_spin_unlock_bh+0x1e/0x20
[11117.113814] schedule+0x36/0x80
[11117.113821] wb_wait_for_completion+0x64/0x90
[11117.113828] ? wait_woken+0x80/0x80
[11117.113831] __writeback_inodes_sb_nr+0x8e/0xb0
[11117.113835] writeback_inodes_sb+0x27/0x30
[11117.113840] __sync_filesystem+0x51/0x60
[11117.113844] sync_filesystem+0x26/0x40
[11117.113850] generic_shutdown_super+0x27/0x120
[11117.113854] kill_block_super+0x2c/0x80
[11117.113858] deactivate_locked_super+0x48/0x80
[11117.113862] deactivate_super+0x5a/0x60
[11117.113866] cleanup_mnt+0x3f/0x80
[11117.113868] __cleanup_mnt+0x12/0x20
[11117.113874] task_work_run+0x8a/0xb0
[11117.113881] exit_to_usermode_loop+0xc4/0xd0
[11117.113885] do_syscall_64+0x100/0x130
[11117.113887] entry_SYSCALL_64_after_hwframe+0x3d/0xa2
[11117.113891] RIP: 0033:0x7f0094384487
[11117.113893] RSP: 002b:00007fff4199efc8 EFLAGS: 00000246 ORIG_RAX: 00000000000000a6
[11117.113897] RAX: 0000000000000000 RBX: 0000000000944030 RCX: 00007f0094384487
[11117.113899] RDX: 0000000000000001 RSI: 0000000000000000 RDI: 0000000000944210
[11117.113900] RBP: 0000000000944210 R08: 0000000000000000 R09: 0000000000000014
[11117.113902] R10: 00000000000006b2 R11: 0000000000000246 R12: 00007f009488d83c
[11117.113903] R13: 0000000000000000 R14: 0000000000000000 R15: 00007fff4199f250
On Wed, 23 Oct 2019 at 3:12 am, <ceph-users-request(a)ceph.io> wrote:
>
> Today's Topics:
>
> 1. Re: Replace ceph osd in a container (Sasha Litvak)
> 2. Re: Fwd: large concurrent rbd operations block for over 15 mins!
> (Mark Nelson)
> 3. Re: rgw multisite failover (Ed Fisher)
>
>
> ----------------------------------------------------------------------
>
> Date: Tue, 22 Oct 2019 08:52:54 -0500
> From: Sasha Litvak <alexander.v.litvak(a)gmail.com>
> Subject: [ceph-users] Re: Replace ceph osd in a container
> To: Frank Schilder <frans(a)dtu.dk>
> Cc: ceph-users <ceph-users(a)ceph.io>
>
> Frank,
>
> Thank you for your suggestion. It sounds very promising. I will
> definitely try it.
>
> Best,
>
> On Tue, Oct 22, 2019, 2:44 AM Frank Schilder <frans(a)dtu.dk> wrote:
>
> > > I am suspecting that mon or mgr have no access to /dev or /var/lib
> whil=
> e
> > osd containers do.
> > > Cluster configured originally by ceph-ansible (nautilus 14.2.2)
> >
> > They don't, because they don't need to.
> >
> > > The question is if I want to replace all disks on a single node, and I
> > have 6 nodes with pools
> > > replication 3, is it safe to restart mgr mounting /dev and
> /var/lib/cep=
> h
> > volumes (not configured right now).
> >
> > Restarting mons is safe in the sense that data will not get lost.
> However=
> ,
> > access might get lost temporarily.
> >
> > The question is, how many mons do you have? If you have only 1 or 2, it
> > will mean downtime. If you can bear the downtime, it doesn't matter. If
> y=
> ou
> > have at least 3, you can restart one after the other.
> >
> > However, I would not do that. Having to restart a mon container every
> tim=
> e
> > some minor container config changes for reasons that have nothing to do
> > with a mon sounds like calling for trouble.
> >
> > I also use containers and would recommend a different approach. I created
> > an additional type of container (ceph-adm) that I use for all admin
> tasks=
> .
> > Its the same image and the entry point simply executes a sleep infinity.
> =
> In
> > this container I make all relevant hardware visible. You might also want
> =
> to
> > expose /var/run/ceph to be able to use admin sockets without hassle. This
> > way, I separated admin operations from actual storage daemons and can
> > modify and restart the admin container as I like.
> >
> > Best regards,
> >
> > =3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D
> > Frank Schilder
> > AIT Ris=C3=B8 Campus
> > Bygning 109, rum S14
> >
> > ________________________________________
> > From: ceph-users <ceph-users-bounces(a)lists.ceph.com> on behalf of Alex
> > Litvak <alexander.v.litvak(a)gmail.com>
> > Sent: 22 October 2019 08:04
> > To: ceph-users(a)lists.ceph.com
> > Subject: [ceph-users] Replace ceph osd in a container
> >
> > Hello cephers,
> >
> > So I am having trouble with a new hardware systems with strange OSD
> > behavior and I want to replace a disk with a brand new one to test the
> > theory.
> >
> > I run all daemons in containers and on one of the nodes I have mon, mgr,
> > and 6 osds. So following
> >
> https://docs.ceph.com/docs/master/rados/operations/add-or-rm-osds/#replac=
> ing-an-osd
> <https://docs.ceph.com/docs/master/rados/operations/add-or-rm-osds/#replac=i…>
> >
> > I stopped container with osd.23, waited until it is down and out, ran
> > safe-to-destroy loop and then destroyed the osd all using the monitor
> fro=
> m
> > the container on this node. All good.
> >
> > Then I swapped the SSDs and started running additional steps (from step
> 3=
> )
> > using the same mon container. I have no ceph packages installed on the
> > bare metal box. It looks like mon container doesn't
> > see the disk.
> >
> > podman exec -it ceph-mon-storage2n2-la ceph-volume lvm zap /dev/sdh
> > stderr: lsblk: /dev/sdh: not a block device
> > stderr: error: /dev/sdh: No such file or directory
> > stderr: Unknown device, --name=3D, --path=3D, or absolute path in
> /dev/=
> or
> > /sys expected.
> > usage: ceph-volume lvm zap [-h] [--destroy] [--osd-id OSD_ID]
> > [--osd-fsid OSD_FSID]
> > [DEVICES [DEVICES ...]]
> > ceph-volume lvm zap: error: Unable to proceed with non-existing device:
> > /dev/sdh
> > Error: exit status 2
> > root@storage2n2-la:~# ls -l /dev/sd
> > sda sdc sdd sde sdf sdg sdg1 sdg2 sdg5 sdh
> > root@storage2n2-la:~# podman exec -it ceph-mon-storage2n2-la ceph-volume
> > lvm zap sdh
> > stderr: lsblk: sdh: not a block device
> > stderr: error: sdh: No such file or directory
> > stderr: Unknown device, --name=3D, --path=3D, or absolute path in
> /dev/=
> or
> > /sys expected.
> > usage: ceph-volume lvm zap [-h] [--destroy] [--osd-id OSD_ID]
> > [--osd-fsid OSD_FSID]
> > [DEVICES [DEVICES ...]]
> > ceph-volume lvm zap: error: Unable to proceed with non-existing device:
> s=
> dh
> > Error: exit status 2
> >
> > I execute lsblk and it sees device sdh
> > root@storage2n2-la:~# podman exec -it ceph-mon-storage2n2-la lsblk
> > lsblk: dm-1: failed to get device path
> > lsblk: dm-2: failed to get device path
> > lsblk: dm-4: failed to get device path
> > lsblk: dm-6: failed to get device path
> > lsblk: dm-4: failed to get device path
> > lsblk: dm-2: failed to get device path
> > lsblk: dm-1: failed to get device path
> > lsblk: dm-0: failed to get device path
> > lsblk: dm-0: failed to get device path
> > lsblk: dm-7: failed to get device path
> > lsblk: dm-5: failed to get device path
> > lsblk: dm-7: failed to get device path
> > lsblk: dm-6: failed to get device path
> > lsblk: dm-5: failed to get device path
> > lsblk: dm-3: failed to get device path
> > NAME MAJ:MIN RM SIZE RO TYPE MOUNTPOINT
> > sdf 8:80 0 1.8T 0 disk
> > sdd 8:48 0 1.8T 0 disk
> > sdg 8:96 0 223.5G 0 disk
> > |-sdg5 8:101 0 223G 0 part
> > |-sdg1 8:97 487M 0 part
> > `-sdg2 8:98 1K 0 part
> > sde 8:64 0 1.8T 0 disk
> > sdc 8:32 0 3.5T 0 disk
> > sda 8:0 0 3.5T 0 disk
> > sdh 8:112 0 3.5T 0 disk
> >
> > So I use a fellow osd container (osd.5) on the same node and run all of
> > the operations (zap and prepare) successfully.
> >
> > I am suspecting that mon or mgr have no access to /dev or /var/lib while
> > osd containers do. Cluster configured originally by ceph-ansible
> (nautil=
> us
> > 14.2.2)
> >
> > The question is if I want to replace all disks on a single node, and I
> > have 6 nodes with pools replication 3, is it safe to restart mgr mounting
> > /dev and /var/lib/ceph volumes (not configured right now).
> >
> > I cannot use other osd containers on the same box because my controller
> > reverts from raid to non-raid mode with all disks lost and not just a
> > single one. So I need to replace all 6 osds to run back
> > in containers and the only things will remain operational on node are mon
> > and mgr containers.
> >
> > I prefer not to install a full cluster or client on the bare metal node
> i=
> f
> > possible.
> >
> > Thank you for your help,
> >
> > _______________________________________________
> > ceph-users mailing list
> > ceph-users(a)lists.ceph.com
> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> >
>
>
> ------------------------------
>
> Date: Tue, 22 Oct 2019 08:59:21 -0500
> From: Mark Nelson <mnelson(a)redhat.com>
> Subject: [ceph-users] Re: Fwd: large concurrent rbd operations block
> for over 15 mins!
> To: ceph-users(a)ceph.io
>
> Out of curiosity, when you chose EC over replication how did you weigh
> IOPS vs space amplification in your decision making process? I'm
> wondering if we should prioritize EC latency vs other tasks in future
> tuning efforts (it's always a tradeoff deciding what to focus on).
>
>
> Thanks,
>
> Mark
>
>
> ------------------------------
>
> Date: Tue, 22 Oct 2019 11:10:38 -0500
> From: Ed Fisher <ed(a)debacle.org>
> Subject: [ceph-users] Re: rgw multisite failover
> To: Frank R <frankaritchie(a)gmail.com>
> Cc: ceph-users <ceph-users(a)ceph.com>
>
> > On Oct 18, 2019, at 10:40 PM, Frank R <frankaritchie(a)gmail.com> wrote:
> >
> > I am looking to change an RGW multisite deployment so that the secondary will become master. This is meant to be a permanent change.
> >
> > Per:
> > https://docs.ceph.com/docs/mimic/radosgw/multisite/
> >
> > I need to:
> >
> > 1. Stop RGW daemons on the current master end.
> >
> > On a secondary RGW node:
> > 2. radosgw-admin zone modify --rgw-zone={zone-name} --master --default
> > 3. radosgw-admin period update --commit
> > 4. systemctl restart ceph-radosgw(a)rgw.`hostname -s`
> >
> > Since I want the former master to be secondary permanently do I need to do anything after restarting the RGW daemons on the old master end?
>
>
> Before you restart the RGW daemons on the old master you want to make sure you pull the current realm from the new master. Beyond that there should be no changes needed.
>
>
> ------------------------------
>
> End of ceph-users Digest, Vol 81, Issue 56
> ******************************************
>
I am looking to change an RGW multisite deployment so that the secondary
will become master. This is meant to be a permanent change.
Per:
https://docs.ceph.com/docs/mimic/radosgw/multisite/
I need to:
1. Stop RGW daemons on the current master end.
On a secondary RGW node:
2. radosgw-admin zone modify --rgw-zone={zone-name} --master --default
3. radosgw-admin period update --commit
4. systemctl restart ceph-radosgw(a)rgw.`hostname -s`
Since I want the former master to be secondary permanently do I need to do
anything after restarting the RGW daemons on the old master end?
> I am suspecting that mon or mgr have no access to /dev or /var/lib while osd containers do.
> Cluster configured originally by ceph-ansible (nautilus 14.2.2)
They don't, because they don't need to.
> The question is if I want to replace all disks on a single node, and I have 6 nodes with pools
> replication 3, is it safe to restart mgr mounting /dev and /var/lib/ceph volumes (not configured right now).
Restarting mons is safe in the sense that data will not get lost. However, access might get lost temporarily.
The question is, how many mons do you have? If you have only 1 or 2, it will mean downtime. If you can bear the downtime, it doesn't matter. If you have at least 3, you can restart one after the other.
However, I would not do that. Having to restart a mon container every time some minor container config changes for reasons that have nothing to do with a mon sounds like calling for trouble.
I also use containers and would recommend a different approach. I created an additional type of container (ceph-adm) that I use for all admin tasks. It's the same image, and the entry point simply executes a sleep infinity. In this container I make all relevant hardware visible. You might also want to expose /var/run/ceph to be able to use admin sockets without hassle. This way, I separated admin operations from actual storage daemons and can modify and restart the admin container as I like.
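A rough sketch of how such an admin container can be started (the image name and mount list are examples, adjust them to your deployment):
podman run -d --name ceph-adm --privileged \
  -v /dev:/dev -v /etc/ceph:/etc/ceph \
  -v /var/lib/ceph:/var/lib/ceph -v /var/run/ceph:/var/run/ceph \
  --entrypoint /bin/sleep docker.io/ceph/daemon:latest-nautilus infinity
# admin commands then run inside it, e.g.
podman exec -it ceph-adm ceph status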
Best regards,
=================
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14
________________________________________
From: ceph-users <ceph-users-bounces(a)lists.ceph.com> on behalf of Alex Litvak <alexander.v.litvak(a)gmail.com>
Sent: 22 October 2019 08:04
To: ceph-users(a)lists.ceph.com
Subject: [ceph-users] Replace ceph osd in a container
Hello cephers,
So I am having trouble with new hardware systems showing strange OSD behavior, and I want to replace a disk with a brand new one to test the theory.
I run all daemons in containers and on one of the nodes I have mon, mgr, and 6 osds. So following https://docs.ceph.com/docs/master/rados/operations/add-or-rm-osds/#replacin…
I stopped the container with osd.23, waited until it was down and out, ran the safe-to-destroy loop, and then destroyed the OSD, all using the monitor container on this node. All good.
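The safe-to-destroy loop was roughly this, run via the mon container on this node:
podman exec ceph-mon-storage2n2-la ceph osd out 23
while ! podman exec ceph-mon-storage2n2-la ceph osd safe-to-destroy osd.23 ; do sleep 10 ; done
podman exec ceph-mon-storage2n2-la ceph osd destroy 23 --yes-i-really-mean-it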
Then I swapped the SSDs and started running additional steps (from step 3) using the same mon container. I have no ceph packages installed on the bare metal box. It looks like the mon container doesn't
see the disk.
podman exec -it ceph-mon-storage2n2-la ceph-volume lvm zap /dev/sdh
stderr: lsblk: /dev/sdh: not a block device
stderr: error: /dev/sdh: No such file or directory
stderr: Unknown device, --name=, --path=, or absolute path in /dev/ or /sys expected.
usage: ceph-volume lvm zap [-h] [--destroy] [--osd-id OSD_ID]
[--osd-fsid OSD_FSID]
[DEVICES [DEVICES ...]]
ceph-volume lvm zap: error: Unable to proceed with non-existing device: /dev/sdh
Error: exit status 2
root@storage2n2-la:~# ls -l /dev/sd
sda sdc sdd sde sdf sdg sdg1 sdg2 sdg5 sdh
root@storage2n2-la:~# podman exec -it ceph-mon-storage2n2-la ceph-volume lvm zap sdh
stderr: lsblk: sdh: not a block device
stderr: error: sdh: No such file or directory
stderr: Unknown device, --name=, --path=, or absolute path in /dev/ or /sys expected.
usage: ceph-volume lvm zap [-h] [--destroy] [--osd-id OSD_ID]
[--osd-fsid OSD_FSID]
[DEVICES [DEVICES ...]]
ceph-volume lvm zap: error: Unable to proceed with non-existing device: sdh
Error: exit status 2
When I execute lsblk, it sees device sdh:
root@storage2n2-la:~# podman exec -it ceph-mon-storage2n2-la lsblk
lsblk: dm-1: failed to get device path
lsblk: dm-2: failed to get device path
lsblk: dm-4: failed to get device path
lsblk: dm-6: failed to get device path
lsblk: dm-4: failed to get device path
lsblk: dm-2: failed to get device path
lsblk: dm-1: failed to get device path
lsblk: dm-0: failed to get device path
lsblk: dm-0: failed to get device path
lsblk: dm-7: failed to get device path
lsblk: dm-5: failed to get device path
lsblk: dm-7: failed to get device path
lsblk: dm-6: failed to get device path
lsblk: dm-5: failed to get device path
lsblk: dm-3: failed to get device path
NAME MAJ:MIN RM SIZE RO TYPE MOUNTPOINT
sdf 8:80 0 1.8T 0 disk
sdd 8:48 0 1.8T 0 disk
sdg 8:96 0 223.5G 0 disk
|-sdg5 8:101 0 223G 0 part
|-sdg1 8:97 487M 0 part
`-sdg2 8:98 1K 0 part
sde 8:64 0 1.8T 0 disk
sdc 8:32 0 3.5T 0 disk
sda 8:0 0 3.5T 0 disk
sdh 8:112 0 3.5T 0 disk
So I use a fellow osd container (osd.5) on the same node and run all of the operations (zap and prepare) successfully.
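Concretely, that was along these lines (the osd.5 container name is an example; I reuse the destroyed OSD's id):
podman exec -it ceph-osd-5 ceph-volume lvm zap /dev/sdh --destroy
podman exec -it ceph-osd-5 ceph-volume lvm prepare --osd-id 23 --data /dev/sdh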
I am suspecting that mon or mgr have no access to /dev or /var/lib while osd containers do. Cluster configured originally by ceph-ansible (nautilus 14.2.2)
The question is if I want to replace all disks on a single node, and I have 6 nodes with pools replication 3, is it safe to restart mgr mounting /dev and /var/lib/ceph volumes (not configured right now).
I cannot use other osd containers on the same box because my controller reverts from raid to non-raid mode with all disks lost, not just a single one. So I need to replace all 6 OSDs to bring them back up
in containers, and the only things that will remain operational on the node are the mon and mgr containers.
I prefer not to install a full cluster or client on the bare metal node if possible.
Thank you for your help,
Hi,
Our cluster had an unexpected power outage, and the Ceph mon cannot start after that. The log shows:
Running command: '/usr/bin/ceph-mon -f -i 10.10.198.11 --public-addr 10.10.198.11:6789'
Corruption: 15 missing files; e.g.: /var/lib/ceph/mon/ceph-10.10.198.11/store.db/2676107.sst
Is there any way to fix this problem? Thank you very much!
We are running ceph 10.2.10.
Br,
Xu Yun