June 2020 - ceph-users - lists.ceph.io

Many osds down , ceph mon has a lot of scrub logs

by hoannv46＠gmail.com

Hi all. My cluster has many osds down, 1 mon log has many line : 2020-06-15 18:00:22.575 7fa2deffe700 0 log_channel(cluster) log [DBG] : scrub ok on 0,1,2,3,4: ScrubResult(keys {osd_snap=100} crc {osd_snap=2176495218}) 2020-06-15 18:00:22.661 7fa2deffe700 0 log_channel(cluster) log [DBG] : scrub ok on 0,1,2,3,4: ScrubResult(keys {osd_snap=100} crc {osd_snap=472876490}) 2020-06-15 18:00:22.747 7fa2deffe700 0 log_channel(cluster) log [DBG] : scrub ok on 0,1,2,3,4: ScrubResult(keys {osd_snap=100} crc {osd_snap=2793143323}) 2020-06-15 18:00:22.830 7fa2deffe700 0 log_channel(cluster) log [DBG] : scrub ok on 0,1,2,3,4: ScrubResult(keys {osd_snap=100} crc {osd_snap=3517702147}) 2020-06-15 18:00:22.916 7fa2deffe700 0 log_channel(cluster) log [DBG] : scrub ok on 0,1,2,3,4: ScrubResult(keys {osd_snap=100} crc {osd_snap=2566175247}) 2020-06-15 18:00:22.999 7fa2deffe700 0 log_channel(cluster) log [DBG] : scrub ok on 0,1,2,3,4: ScrubResult(keys {osd_snap=100} crc {osd_snap=1643204334}) 2020-06-15 18:00:23.087 7fa2deffe700 0 log_channel(cluster) log [DBG] : scrub ok on 0,1,2,3,4: ScrubResult(keys {osd_snap=100} crc {osd_snap=220430164}) 2020-06-15 18:00:23.170 7fa2deffe700 0 log_channel(cluster) log [DBG] : scrub ok on 0,1,2,3,4: ScrubResult(keys {osd_snap=100} crc {osd_snap=1336353918}) 2020-06-15 18:00:23.296 7fa2deffe700 0 log_channel(cluster) log [DBG] : scrub ok on 0,1,2,3,4: ScrubResult(keys {osd_snap=100} crc {osd_snap=2573498114}) 2020-06-15 18:00:23.421 7fa2deffe700 0 log_channel(cluster) log [DBG] : scrub ok on 0,1,2,3,4: ScrubResult(keys {osd_snap=100} crc {osd_snap=1132070786}) After 10 minutes, 1 mon and 1 mgr restart. My cluster health ok. what happened with my cluster? How i can debug this. Thanks.

3 years, 11 months

2
1
0 0

Re: help with failed osds after reboot

by Eugen Block

Hi, which ceph release are you using? You mention ceph-disk so your OSDs are not LVM based, I assume? I've seen these messages a lot when testing in my virtual lab environment although I don't believe it's the cluster's fsid but the OSD's fsid that's in the error message (the OSDs have their own ID, too, take a look in /var/lib/ceph/osd/ceph-<ID>/fsid). When I did several re-installs of the whole cluster I had to make sure to properly wipe the disks but sometimes only a reboot did the trick. Of course, this is not an option in your situation. If your OSDs are systemd units check for orphaned units that need to be to disabled before restarting the correct ones. Did you re-deploy some of those disks? Regards, Eugen Zitat von Seth Duncan <Seth.Duncan2(a)bd.com>: > I had 5 of 10 osds fail on one of my nodes, after reboot the other 5 > osds failed to start. > > I have tried running ceph-disk activate-all and get back and error > message about the cluster fsid not matching in /etc/ceph/ceph.conf > > Has anyone experienced an issue such as this? > > > > ******************************************************************* > IMPORTANT MESSAGE FOR RECIPIENTS IN THE U.S.A.: > This message may constitute an advertisement of a BD group's > products or services or a solicitation of interest in them. If this > is such a message and you would like to opt out of receiving future > advertisements or solicitations from this BD group, please forward > this e-mail to optoutbygroup(a)bd.com. [BD.v1.0] > ******************************************************************* > This message (which includes any attachments) is intended only for > the designated recipient(s). It may contain confidential or > proprietary information and may be subject to the attorney-client > privilege or other confidentiality protections. If you are not a > designated recipient, you may not review, use, copy or distribute > this message. If you received this in error, please notify the > sender by reply e-mail and delete this message. Thank you. > ******************************************************************* > Corporate Headquarters Mailing Address: BD (Becton, Dickinson and > Company) 1 Becton Drive Franklin Lakes, NJ 07417 U.S.A. > _______________________________________________ > ceph-users mailing list -- ceph-users(a)ceph.io > To unsubscribe send an email to ceph-users-leave(a)ceph.io

3 years, 11 months

4
3
0 0

Ganesha rados recovery on NFS 3

by Maged Mokhtar

Hello all, can the NFS ganesha rados recovery for multi headed active/active setup work with NFS 3 or it requires NFS 4/4.1 specifics ? Thanks for any help /Maged

3 years, 11 months

1
0
0 0

Re: mount cephfs with autofs

by Zhenshi Zhou

The systemd autofs will mount cephfs successfully, with both kernel and fuse clients. Marc Roos <M.Roos(a)f1-outsourcing.eu> 于2020年6月15日周一下午6:44写道： > > > Thanks for these I was missing the x-systemd. entries. I assume these > are necessary so booting does not 'hang' on trying to mount these? I > thought the _netdev was for this and sufficient? > > > > > > -----Original Message----- > To: Derrick Lin > Cc: ceph-users > Subject: [ceph-users] Re: mount cephfs with autofs > > Hi, > > With CentOS 7.8 you can use the systemd autofs options in /etc/fstab. > Here are two examples from our clusters, first with fuse and second > with kernel: > > none /cephfs fuse.ceph > ceph.id=admin,ceph.conf=/etc/ceph/dwight.conf,ceph.client_mountpoint=/,x > -systemd.device-timeout=30,x-systemd.mount-timeout=30,noatime,_netdev,no > auto,x-systemd.automount,x-systemd.idle-timeout=30,ro > 0 2 > > cephflax.cern.ch:6789:/ /cephfs2 ceph > name=admin,secretfile=/etc/ceph/flax.admin.secret,x-systemd.device-timeo > ut=30,x-systemd.mount-timeout=30,noatime,_netdev,noauto,x-systemd.automo > unt,x-systemd.idle-timeout=30,ro > 0 2 > > Cheers, Dan > > On Mon, Jun 15, 2020 at 9:27 AM Derrick Lin <klin938(a)gmail.com> wrote: > > > > Hi guys, > > > > I can mount my cephfs via mount command and access it without any > problem. > > > > Now I want to integrate it in autofs which is used on our cluster. > > > > It seems this is not a popular approach and I found only this link: > > > > https://drupal.star.bnl.gov/STAR/blog/mpoat/how-mount-cephfs > > > > I followed the link but could not get it to work. I am wondering if > this is > > possible at all? > > > > We are using CentOS 7.8 and the ceph cluster is running nautilus > 14.2.9 > > > > Regards, > > Derrick > > _______________________________________________ > > ceph-users mailing list -- ceph-users(a)ceph.io > > To unsubscribe send an email to ceph-users-leave(a)ceph.io > _______________________________________________ > ceph-users mailing list -- ceph-users(a)ceph.io > To unsubscribe send an email to ceph-users-leave(a)ceph.io > > _______________________________________________ > ceph-users mailing list -- ceph-users(a)ceph.io > To unsubscribe send an email to ceph-users-leave(a)ceph.io >

3 years, 11 months

1
0
0 0

Re: Fwd: Re-run ansible to add monitor and RGWs

by Khodayar Doustar

Yes, it's faster but I'd like to continue managing the cluster with Ansible, is that possible? On Mon, Jun 15, 2020 at 12:02 PM Marc Roos <M.Roos(a)f1-outsourcing.eu> wrote: > > Just do manual install that is faster. > > > > -----Original Message----- > To: ceph-users > Subject: [ceph-users] Fwd: Re-run ansible to add monitor and RGWs > > Any ideas on this? > > ---------- Forwarded message --------- > From: Khodayar Doustar <doustar(a)rayanexon.ir> > Date: Sun, Jun 14, 2020 at 6:07 PM > Subject: Re-run ansible to add monitor and RGWs > To: ceph-users <ceph-users(a)ceph.io> > > > Hi, > > > I've installed my ceph cluster with ceph-ansible a few months ago. I've > just added one monitor and one rgw at that time. > > So I have 3 nodes, from which one is monitor and rgw and two others only > OSD. > > Now I want to add the other two nodes as monitor and rgw. > > Can I just modify the ansible host file and re-run the site.yml? > > I've done some modification in Storage classes, I've added some OSD and > uploaded a lot of data up to now. Is it safe to re-run ansible site.yml > playbook? > > I don't want to end with a fresh new cluster! :D > > > Thanks a lot, > > Khodayar > _______________________________________________ > ceph-users mailing list -- ceph-users(a)ceph.io To unsubscribe send an > email to ceph-users-leave(a)ceph.io > > _______________________________________________ > ceph-users mailing list -- ceph-users(a)ceph.io > To unsubscribe send an email to ceph-users-leave(a)ceph.io >

3 years, 11 months

1
0
0 0

Re: poor cephFS performance on Nautilus 14.2.9 deployed by ceph_ansible

by Derrick Lin

Hi guys, I tried to mount via kernel driver, it works beautifully. I was surprised, below is one of the FIO test, which wasn't able to run at all in FUSE mount: # /usr/bin/fio --randrepeat=1 --ioengine=libaio --direct=1 --gtod_reduce=1 --name=FIO --filename=fio.test --bs=4M --iodepth=16 --size=50G --readwrite=randrw --rwmixread=75 FIO: (g=0): rw=randrw, bs=(R) 4096KiB-4096KiB, (W) 4096KiB-4096KiB, (T) 4096KiB-4096KiB, ioengine=libaio, iodepth=16 fio-3.7 Starting 1 process FIO: Laying out IO file (1 file / 51200MiB) Jobs: 1 (f=1): [m(1)][100.0%][r=1021MiB/s,w=340MiB/s][r=255,w=85 IOPS][eta 00m:00s] FIO: (groupid=0, jobs=1): err= 0: pid=131431: Thu Jun 11 17:13:22 2020 read: IOPS=249, BW=999MiB/s (1047MB/s)(37.5GiB/38408msec) bw ( KiB/s): min=819200, max=1171456, per=100.00%, avg=1023387.46, stdev=69360.06, samples=76 iops : min= 200, max= 286, avg=249.83, stdev=16.96, samples=76 write: IOPS=83, BW=334MiB/s (351MB/s)(12.5GiB/38408msec) bw ( KiB/s): min=229376, max=475136, per=99.96%, avg=342204.45, stdev=40407.55, samples=76 iops : min= 56, max= 116, avg=83.51, stdev= 9.87, samples=76 cpu : usr=1.56%, sys=4.44%, ctx=12050, majf=0, minf=24 IO depths : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=99.9%, 32=0.0%, >=64=0.0% submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0% complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.1%, 32=0.0%, 64=0.0%, >=64=0.0% issued rwts: total=9590,3210,0,0 short=0,0,0,0 dropped=0,0,0,0 latency : target=0, window=0, percentile=100.00%, depth=16 Run status group 0 (all jobs): READ: bw=999MiB/s (1047MB/s), 999MiB/s-999MiB/s (1047MB/s-1047MB/s), io=37.5GiB (40.2GB), run=38408-38408msec WRITE: bw=334MiB/s (351MB/s), 334MiB/s-334MiB/s (351MB/s-351MB/s), io=12.5GiB (13.5GB), run=38408-38408msec On Tue, Jun 9, 2020 at 6:16 PM Marc Roos <M.Roos(a)f1-outsourcing.eu> wrote: > > Hi Derrick, > > I am not sure what this 200-300MB/s on hdd is. But it is probably not > really relevant. I am testing native disk performance before I use them > with ceph with this fio script. It is a bit lengthy, that is because I > want to be able to have data for possible future use cases. > > Furthermore since I upgraded to Nautilus I have been having issues with > the kernel mount cephfs on osd nodes and had to revert back to fuse. > Even when having 88GB free memory. > > https://tracker.ceph.com/issues/45663 > https://tracker.ceph.com/issues/44100 > > > [global] > ioengine=libaio > #ioengine=posixaio > invalidate=1 > ramp_time=30 > iodepth=1 > runtime=180 > time_based > direct=1 > filename=/dev/sdX > #filename=/mnt/cephfs/ssd/fio-bench.img > > [write-4k-seq] > stonewall > bs=4k > rw=write > #write_bw_log=sdx-4k-write-seq.results > #write_iops_log=sdx-4k-write-seq.results > > [randwrite-4k-seq] > stonewall > bs=4k > rw=randwrite > #write_bw_log=sdx-4k-randwrite-seq.results > #write_iops_log=sdx-4k-randwrite-seq.results > > [read-4k-seq] > stonewall > bs=4k > rw=read > #write_bw_log=sdx-4k-read-seq.results > #write_iops_log=sdx-4k-read-seq.results > > [randread-4k-seq] > stonewall > bs=4k > rw=randread > #write_bw_log=sdx-4k-randread-seq.results > #write_iops_log=sdx-4k-randread-seq.results > > [rw-4k-seq] > stonewall > bs=4k > rw=rw > #write_bw_log=sdx-4k-rw-seq.results > #write_iops_log=sdx-4k-rw-seq.results > > [randrw-4k-seq] > stonewall > bs=4k > rw=randrw > #write_bw_log=sdx-4k-randrw-seq.results > #write_iops_log=sdx-4k-randrw-seq.results > > [write-128k-seq] > stonewall > bs=128k > rw=write > #write_bw_log=sdx-128k-write-seq.results > #write_iops_log=sdx-128k-write-seq.results > > [randwrite-128k-seq] > stonewall > bs=128k > rw=randwrite > #write_bw_log=sdx-128k-randwrite-seq.results > #write_iops_log=sdx-128k-randwrite-seq.results > > [read-128k-seq] > stonewall > bs=128k > rw=read > #write_bw_log=sdx-128k-read-seq.results > #write_iops_log=sdx-128k-read-seq.results > > [randread-128k-seq] > stonewall > bs=128k > rw=randread > #write_bw_log=sdx-128k-randread-seq.results > #write_iops_log=sdx-128k-randread-seq.results > > [rw-128k-seq] > stonewall > bs=128k > rw=rw > #write_bw_log=sdx-128k-rw-seq.results > #write_iops_log=sdx-128k-rw-seq.results > > [randrw-128k-seq] > stonewall > bs=128k > rw=randrw > #write_bw_log=sdx-128k-randrw-seq.results > #write_iops_log=sdx-128k-randrw-seq.results > > [write-1024k-seq] > stonewall > bs=1024k > rw=write > #write_bw_log=sdx-1024k-write-seq.results > #write_iops_log=sdx-1024k-write-seq.results > > [randwrite-1024k-seq] > stonewall > bs=1024k > rw=randwrite > #write_bw_log=sdx-1024k-randwrite-seq.results > #write_iops_log=sdx-1024k-randwrite-seq.results > > [read-1024k-seq] > stonewall > bs=1024k > rw=read > #write_bw_log=sdx-1024k-read-seq.results > #write_iops_log=sdx-1024k-read-seq.results > > [randread-1024k-seq] > stonewall > bs=1024k > rw=randread > #write_bw_log=sdx-1024k-randread-seq.results > #write_iops_log=sdx-1024k-randread-seq.results > > [rw-1024k-seq] > stonewall > bs=1024k > rw=rw > #write_bw_log=sdx-1024k-rw-seq.results > #write_iops_log=sdx-1024k-rw-seq.results > > [randrw-1024k-seq] > stonewall > bs=1024k > rw=randrw > #write_bw_log=sdx-1024k-randrw-seq.results > #write_iops_log=sdx-1024k-randrw-seq.results > > [write-4096k-seq] > stonewall > bs=4096k > rw=write > #write_bw_log=sdx-4096k-write-seq.results > #write_iops_log=sdx-4096k-write-seq.results > > [randwrite-4096k-seq] > stonewall > bs=4096k > rw=randwrite > #write_bw_log=sdx-4096k-randwrite-seq.results > #write_iops_log=sdx-4096k-randwrite-seq.results > > [read-4096k-seq] > stonewall > bs=4096k > rw=read > #write_bw_log=sdx-4096k-read-seq.results > #write_iops_log=sdx-4096k-read-seq.results > > [randread-4096k-seq] > stonewall > bs=4096k > rw=randread > #write_bw_log=sdx-4096k-randread-seq.results > #write_iops_log=sdx-4096k-randread-seq.results > > [rw-4096k-seq] > stonewall > bs=4096k > rw=rw > #write_bw_log=sdx-4096k-rw-seq.results > #write_iops_log=sdx-4096k-rw-seq.results > > [randrw-4096k-seq] > stonewall > bs=4096k > rw=randrw > #write_bw_log=sdx-4096k-randrw-seq.results > #write_iops_log=sdx-4096k-randrw-seq.results > > > > > > > -----Original Message----- > From: Derrick Lin [mailto:klin938@gmail.com] > Sent: dinsdag 9 juni 2020 4:12 > To: Mark Nelson > Cc: ceph-users(a)ceph.io > Subject: [ceph-users] Re: poor cephFS performance on Nautilus 14.2.9 > deployed by ceph_ansible > > Thanks Mark & Marc > > We will do more testing inc kernel client as well as testing the block > storage performance first. > > We just did some direct raw performance test on a single spinning disk > (format as ext4) and it could delivery 200-300MB/s throughput in various > writing and mix testings. But FUSE client could only give ~50MB/s. > > Cheers, > D > > On Thu, Jun 4, 2020 at 1:27 PM Mark Nelson <mnelson(a)redhat.com> wrote: > > > Try using the kernel client instead of the FUSE client. The FUSE > > client is known to be slow for a variety of reasons and I suspect you > > may see faster performance with the kernel client. > > > > > > Thanks, > > > > Mark > > > > > > On 6/2/20 8:00 PM, Derrick Lin wrote: > > > Hi guys, > > > > > > We just deployed a CEPH 14.2.9 cluster with the following hardware: > > > > > > MDSS x 1 > > > Xeon Gold 5122 3.6Ghz > > > 192GB > > > Mellanox ConnectX-4 Lx 25GbE > > > > > > > > > MON x 3 > > > Xeon Bronze 3103 1.7Ghz > > > 48GB > > > Mellanox ConnectX-4 Lx 25GbE > > > 6 x 600GB 10K SAS > > > > > > OSD x 5 > > > Xeon Silver 4110 2.1Ghz x 2 > > > 192GB > > > Mellanox ConnectX-4 Lx 25GbE > > > 16 x 10TB 7.2K NLSAS (block) > > > 2 x 2TB Intel P4600 NVMe (block.db) > > > > > > Network is all Mellanox SN2410/SN2700 configured at 25GbE for both > > > front and back network. > > > > > > Just for POC at this stage, the cluster was deployed by ceph_ansible > > > > without much customization and the initial test on its cephFS FUSE > > > mount performance seems to be very low. We did some test with iozone > > > > the result as follow: > > > > > > ]# /opt/iozone/bin/iozone -i 0 -i 1-r 128k -s 5G -t 20 > > > Iozone: Performance Test of File I/O > > > Version $Revision: 3.465 $ > > > Compiled for 64 bit mode. > > > Build: linux-AMD64 > > > > > > Contributors:William Norcott, Don Capps, Isom Crawford, > > > Kirby Collins > > > Al Slater, Scott Rhine, Mike Wisner, Ken Goss > > > Steve Landherr, Brad Smith, Mark Kelly, Dr. > > > Alain > > CYR, > > > Randy Dunlap, Mark Montague, Dan Million, > > > Gavin Brebner, > > > Jean-Marc Zucconi, Jeff Blomberg, Benny > > > Halevy, > > Dave > > > Boone, > > > Erik Habbinga, Kris Strecker, Walter Wong, > > > Joshua > > Root, > > > Fabrice Bacchella, Zhenghua Xue, Qin Li, > > > Darren > > Sawyer, > > > Vangel Bojaxhi, Ben England, Vikentsi Lapa, > > > Alexey Skidanov. > > > > > > Run began: Tue Jun 2 16:40:53 2020 > > > > > > File size set to 5242880 kB > > > Command line used: /opt/iozone/bin/iozone -i 0 -i 1-r -s 5G > > > > -t > > 20 > > > 128k > > > Output is in kBytes/sec > > > Time Resolution = 0.000001 seconds. > > > Processor cache size set to 1024 kBytes. > > > Processor cache line size set to 32 bytes. > > > File stride size set to 17 * record size. > > > Throughput test with 20 processes > > > Each process writes a 5242880 kByte file in 4 kByte records > > > > > > Children see throughput for 20 initial writers = > 35001.12 > > kB/sec > > > Parent sees throughput for 20 initial writers = > 34967.65 > > kB/sec > > > Min throughput per process = > 1748.22 > > kB/sec > > > Max throughput per process = > 1751.62 > > kB/sec > > > Avg throughput per process = > 1750.06 > > kB/sec > > > Min xfer = > 5232724.00 kB > > > > > > Children see throughput for 20 rewriters = > 35704.79 > > kB/sec > > > Parent sees throughput for 20 rewriters = > 35704.30 > > kB/sec > > > Min throughput per process = > 1783.44 > > kB/sec > > > Max throughput per process = > 1786.29 > > kB/sec > > > Avg throughput per process = > 1785.24 > > kB/sec > > > Min xfer = > 5234532.00 kB > > > > > > Children see throughput for 20 readers = > 49368539.50 > > kB/sec > > > Parent sees throughput for 20 readers = > 49317231.38 > > kB/sec > > > Min throughput per process = > 2414424.00 > > kB/sec > > > Max throughput per process = > 2599996.25 > > kB/sec > > > Avg throughput per process = > 2468426.98 > > kB/sec > > > Min xfer = > 4868708.00 kB > > > > > > Children see throughput for 20 re-readers = > 48675891.50 > > kB/sec > > > Parent sees throughput for 20 re-readers = > 48617335.67 > > kB/sec > > > Min throughput per process = > 2316395.25 > > kB/sec > > > Max throughput per process = > 2703868.75 > > kB/sec > > > Avg throughput per process = > 2433794.58 > > kB/sec > > > Min xfer = > 4491704.00 kB > > > > > > We also did some dd tests, the write speed on a single test on our > > standard > > > server is ~50MB/s but on a very big memory server, the speed is > > > double ~ 80-90MB/s. > > > > > > We have zero experience on ceph and as said we haven't done more > > > tuning > > at > > > this stage. But if this sort of performance is way too low from > > > those hardware spec? > > > > > > Any hints will be appreciated. > > > > > > Cheers > > > D > > > _______________________________________________ > > > ceph-users mailing list -- ceph-users(a)ceph.io To unsubscribe send an > > > > email to ceph-users-leave(a)ceph.io > > > > > _______________________________________________ > > ceph-users mailing list -- ceph-users(a)ceph.io To unsubscribe send an > > email to ceph-users-leave(a)ceph.io > > > _______________________________________________ > ceph-users mailing list -- ceph-users(a)ceph.io To unsubscribe send an > email to ceph-users-leave(a)ceph.io > > >

3 years, 11 months

2
1
0 0

MAX AVAIL goes up when I reboot an OSD node

by Boris Behrens

Dear people on this mailing list, I've got the "problem" that our MAX AVAIL value increases by about 5-10 TB when I reboot a whole OSD node. After the reboot the value goes back to normal. I would love to know WHY. Under normal circumstances I would ignore this behavior, but because I am very new to the whole ceph software I would like to know why stuff like this happens. What I read is, that this value will be calculated by the most filled OSD. I've set noout and norebalance while the node is offline and I unset both values after the reboot. We are currently on nautilus. Cheers and thanks in advance Boris

3 years, 11 months

6
9
0 0

OSD SCRUB Error recovery

by Chris Shultz

Hi, I am seeing an issue on one of our older ceph clusters (mimic 13.2.1) in an erasure coded pool on bluestore OSDs in which we are seeing 1 inconsistent pg and 1 scrub error. It should be noted that we have an ongoing rebalance of misplaced data that predates this issue which came from flapping OSDs due to OSD_NEARFULL OSD_TOOFULL warnings/errors we corrected by removing some user data from ceph's rgw/s3 api interface (users "s3 objects" where deleted via the s3 api). If anyone has any suggestions or guidance for dealing with this it would be very much appreciated. I've included all the relevant / helpful information I can think of below, if there is any additional information that you think would be helpful to me or you in providing suggestions please let me know. $ sudo ceph -s cluster: id: 6fa7ec72-79fb-4f45-8b9f-ea5cdc7ab18d health: HEALTH_ERR 248317/437145405 objects misplaced (0.057%) 1 scrub errors Possible data damage: 1 pg inconsistent services: mon: 3 daemons, quorum HW-CEPHM-AT01,HW-CEPHM-AT02,HW-CEPHM-AT03 mgr: HW-CEPHM-AT02(active) osd: 109 osds: 107 up, 106 in; 2 remapped pgs rgw: 3 daemons active data: pools: 10 pools, 1380 pgs objects: 54.70 M objects, 68 TiB usage: 116 TiB used, 169 TiB / 285 TiB avail pgs: 248317/437145405 objects misplaced (0.057%) 1374 active+clean 3 active+clean+scrubbing+deep 2 active+remapped+backfilling 1 active+clean+inconsistent io: client: 28 KiB/s rd, 306 KiB/s wr, 26 op/s rd, 30 op/s wr recovery: 6.2 MiB/s, 4 objects/s $ sudo ceph health detail HEALTH_ERR 247241/437143405 objects misplaced (0.057%); 1 scrub errors; Possible data damage: 1 pg inconsistent OBJECT_MISPLACED 247241/437143405 objects misplaced (0.057%) OSD_SCRUB_ERRORS 1 scrub errors PG_DAMAGED Possible data damage: 1 pg inconsistent pg 7.1 is active+clean+inconsistent, acting [2,57,51,15,20,28,9,39] Examination of osd logs shows the error is in osd.2 zgrep -Hn 'ERR' ceph-osd.2.log-20200614.gz ceph-osd.2.log-20200614.gz:1292:2020-06-14 03:31:06.572 7f94591a9700 -1 log_channel(cluster) log [ERR] : 7.1s0 deep-scrub stat mismatch, got 213029/213030 objects, 0/0 clones, 213029/213030 dirty, 0/0 omap, 0/0 pinned, 0/0 hit_set_archive, 0/0 whiteouts, 292308615921/292308670959 bytes, 0/0 manifest objects, 0/0 hit_set_archive bytes. ceph-osd.2.log-20200614.gz:1293:2020-06-14 03:31:06.572 7f94591a9700 -1 log_channel(cluster) log [ERR] : 7.1 deep-scrub 1 errors All other OSDs appear to be clean of errors The pg in question (7.1) has been instructed to repair/scrub/deep-scrub but I do not see any indication in it's logs that it has done a scrub or repair (it does log a deep-scrub which comes back OK) and listing inconsistent objects seems to indicate no issues $ sudo rados list-inconsistent-pg default.rgw.buckets.data ["7.1"] $ sudo ceph pg repair 7.1 instructing pg 7.1s0 on osd.2 to repair $ sudo ceph pg scrub 7.1 instructing pg 7.1s0 on osd.2 to scrub $ sudo ceph pg deep-scrub 7.1 instructing pg 7.1s0 on osd.2 to deep-scrub grep -HnEi 'scrub|repair|deep-scrub' ceph-osd.2.log ceph-osd.2.log:118:2020-06-14 07:28:10.139 7f94599aa700 0 log_channel(cluster) log [DBG] : 7.91 deep-scrub starts ceph-osd.2.log:177:2020-06-14 08:39:11.404 7f94599aa700 0 log_channel(cluster) log [DBG] : 7.91 deep-scrub ok ceph-osd.2.log:322:2020-06-14 12:17:31.405 7f94579a6700 0 log_channel(cluster) log [DBG] : 13.135 deep-scrub starts ceph-osd.2.log:323:2020-06-14 12:17:32.744 7f94579a6700 0 log_channel(cluster) log [DBG] : 13.135 deep-scrub ok ceph-osd.2.log:387:2020-06-14 13:40:35.941 7f94591a9700 0 log_channel(cluster) log [DBG] : 7.d8 deep-scrub starts ceph-osd.2.log:441:2020-06-14 14:49:06.111 7f94591a9700 0 log_channel(cluster) log [DBG] : 7.d8 deep-scrub ok Only the last deep-scrub was manually triggered $ sudo rados list-inconsistent-obj 7.1 --format=json-pretty { "epoch": 30869, "inconsistents": [] } $ sudo rados list-inconsistent-obj 7.1s0 --format=json-pretty { "epoch": 30869, "inconsistents": [] } I'm not sure why no inconsistents (empty set) are reported in the above Chris Shultz Global Systems Architect 1 Stiles Road Suite 202 SalemNH03079 United States cshultz(a)korewireless.com (m) 774.270.2679 korewireless.com Disclaimer The information contained in this communication from the sender is confidential. It is intended solely for use by the recipient and others authorized to receive it. If you are not the recipient, you are hereby notified that any disclosure, copying, distribution or taking action in relation of the contents of this information is strictly prohibited and may be unlawful. This email has been scanned for viruses and malware, and may have been automatically archived by Mimecast Ltd, an innovator in Software as a Service (SaaS) for business. Providing a safer and more useful place for your human generated data. Specializing in; Security, archiving and compliance. To find out more visit the Mimecast website.

3 years, 11 months

1
0
0 0

Re: Sizing your MON storage with a large cluster

by Milan Kupcevic

Hi, Please see below. On Sat, 3 Feb 2018, Sage Weil wrote: > On Sat, 3 Feb 2018, Wido den Hollander wrote: >> Hi, >> >> I just wanted to inform people about the fact that Monitor databases can grow >> quite big when you have a large cluster which is performing a very long >> rebalance. >> >> I'm posting this on ceph-users and ceph-large as it applies to both, but >> you'll see this sooner on a cluster with a lof of OSDs. >> >> Some information: >> >> - Version: Luminous 12.2.2 >> - Number of OSDs: 2175 >> - Data used: ~2PB >> >> We are in the middle of migrating from FileStore to BlueStore and this is >> causing a lot of PGs to backfill at the moment: >> >> 33488 active+clean >> 4802 active+undersized+degraded+remapped+backfill_wait >> 1670 active+remapped+backfill_wait >> 263 active+undersized+degraded+remapped+backfilling >> 250 active+recovery_wait+degraded >> 54 active+recovery_wait+degraded+remapped >> 27 active+remapped+backfilling >> 13 active+recovery_wait+undersized+degraded+remapped >> 2 active+recovering+degraded >> >> This has been running for a few days now and it has caused this warning: >> >> MON_DISK_BIG mons >> srv-zmb03-05,srv-zmb04-05,srv-zmb05-05,srv-zmb06-05,srv-zmb07-05 are using a >> lot of disk space >> mon.srv-zmb03-05 is 31666 MB >= mon_data_size_warn (15360 MB) >> mon.srv-zmb04-05 is 31670 MB >= mon_data_size_warn (15360 MB) >> mon.srv-zmb05-05 is 31670 MB >= mon_data_size_warn (15360 MB) >> mon.srv-zmb06-05 is 31897 MB >= mon_data_size_warn (15360 MB) >> mon.srv-zmb07-05 is 31891 MB >= mon_data_size_warn (15360 MB) >> >> This is to be expected as MONs do not trim their store if one or more PGs is >> not active+clean. >> >> In this case we expected this and the MONs are each running on a 1TB Intel >> DC-series SSD to make sure we do not run out of space before the backfill >> finishes. >> >> The cluster is spread out over racks and in CRUSH we replicate over racks. >> Rack by rack we are wiping/destroying the OSDs and bringing them back as >> BlueStore OSDs and letting the backfill handle everything. >> >> In between we wait for the cluster to become HEALTH_OK (all PGs active+clean) >> so that the Monitors can trim their database before we start with the next >> rack. >> >> I just want to warn and inform people about this. Under normal circumstances a >> MON database isn't that big, but if you have a very long period of >> backfills/recoveries and also have a large number of OSDs you'll see the DB >> grow quite big. >> >> This has improved significantly going to Jewel and Luminous, but it is still >> something to watch out for. >> >> Make sure your MONs have enough free space to handle this! > > Yes! > > Just a side note that Joao has an elegant fix for this that allows the mon > to trim most of the space-consuming full osdmaps. It's still work in > progress but is likely to get backported to luminous. > > sage Hi Sage, Has this issue ever been sorted out. I've added a batch of new nodes a couple of days ago to our Nautilus (14.2.9) cluster and the mon db is growing at about 50GB per day. Cluster state: osd: 1515 osds: 1494 up (since 2d), 1492 in (since 2d); 8740 remapped pgs data: pools: 15 pools, 17048 pgs objects: 483.21M objects, 1.3 PiB usage: 1.9 PiB used, 12 PiB / 14 PiB avail pgs: 0.012% pgs not active 1612355425/4675115461 objects misplaced (34.488%) 8305 active+clean 4372 active+remapped+backfill_wait+backfill_toofull 4348 active+remapped+backfill_wait 19 active+remapped+backfilling 2 active+clean+remapped 2 peering Health state: SLOW_OPS 63640 slow ops, oldest one blocked for 1402 sec, daemons [osd.477,osd.571,osd.589,osd.707,osd.786,mon.mon01,mon.mon02,mon.mon03,mon.mon04,mon.mon05] have slow ops. MON_DISK_BIG mons mon01,mon02,mon03,mon04,mon05 are using a lot of disk space mon.mon02 is 126 GiB >= mon_data_size_warn (15 GiB) mon.mon03 is 126 GiB >= mon_data_size_warn (15 GiB) mon.mon04 is 126 GiB >= mon_data_size_warn (15 GiB) mon.mon05 is 127 GiB >= mon_data_size_warn (15 GiB) mon.mon01 is 127 GiB >= mon_data_size_warn (15 GiB) How large can this grow? If it continues to grow at this rate our SSDs will not be able to ride it out. Is the only way to deal with this to stop the whole cluster, put larger SSD drives in the monitors and then let it continue? Milan -- Milan Kupcevic Senior Cyberinfrastructure Engineer at Project NESE Harvard University FAS Research Computing

3 years, 11 months

2
1
0 0

Ceph deployment and Managing suite

by Amudhan P

Hi, I am looking for a Software suite to deploy Ceph Storage Node and Gateway server (SMB & NFS) and also dashboard Showing entire Cluster status, Individual node health, disk identification or maintenance activity, network utilization. Simple user manageable dashboard. Please suggest any Paid or Community based you have been using or you recommend to others. regards Amudhan

3 years, 11 months

3
2
0 0

2024

2023

2022

2021

2020

2019

ceph-users June 2020