Hi,
As we all know, the default replica setting 'size' is 3, which means there are three copies of each object. What are the disadvantages if I set it to 2, other than getting fewer copies?
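As a rough way to see the cost: with independent failures, the chance of losing every copy shrinks geometrically with 'size'. This is a hypothetical back-of-the-envelope model (real failures are correlated, and the per-OSD probability below is made up), not a durability calculator:

```python
def p_all_replicas_lost(size: int, p_osd_fail: float) -> float:
    """Probability that all `size` replicas fail within the same
    recovery window, assuming independent OSD failures."""
    return p_osd_fail ** size

# Hypothetical 1% chance that an OSD dies before recovery completes.
p = 0.01
print(f"size=3: {p_all_replicas_lost(3, p):.0e}")
print(f"size=2: {p_all_replicas_lost(2, p):.0e}")
```

Beyond the raw copy count, also consider min_size: with size=2 and min_size=2, losing a single OSD blocks I/O on the affected PGs until recovery completes; with min_size=1 you accept writes with a single surviving copy, which is how data loss typically happens with size=2. That trade-off is why size=3/min_size=2 is the usual recommendation.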
Thanks
Hi all,
rocksdb failed to open when the ceph-osd process was restarted after unplugging the OSD data disk, on Ceph 14.2.5 / CentOS 7.6.
1) After unplugging the OSD data disk, the ceph-osd process aborted:
-3> 2020-07-13 15:25:35.912 7f1ad7254700 -1 bdev(0x559d1134f880 /var/lib/ceph/osd/ceph-10/block) _sync_write sync_file_range error: (5) Input/output error
-2> 2020-07-13 15:25:35.912 7f1ad9c5f700 -1 bdev(0x559d1134f880 /var/lib/ceph/osd/ceph-10/block) _aio_thread got r=-5 ((5) Input/output error)
-1> 2020-07-13 15:25:35.917 7f1ad9c5f700 -1 /root/rpmbuild/BUILD/ceph-14.2.5-1.0.9/src/os/bluestore/KernelDevice.cc: In function 'void KernelDevice::_aio_thread()' thread 7f1ad9c5f700 time 2020-07-13 15:25:35.913821
/root/rpmbuild/BUILD/ceph-14.2.5-1.0.9/src/os/bluestore/KernelDevice.cc: 534: ceph_abort_msg("Unexpected IO error. This may suggest a hardware issue. Please check your kernel log!")
ceph version 14.2.5-93-g9a4f93e (9a4f93e7143bcdd5fadc88eb58bb730ae97b89c5) nautilus (stable)
1: (ceph::__ceph_abort(char const*, int, char const*, std::string const&)+0xdd) [0x559d05b6069a]
2: (KernelDevice::_aio_thread()+0xebe) [0x559d061a54ee]
3: (KernelDevice::AioCompletionThread::entry()+0xd) [0x559d061a7add]
4: (()+0x7dd5) [0x7f1ae66aedd5]
5: (clone()+0x6d) [0x7f1ae5572ead]
2) After plugging the disk back in and restarting the ceph-osd process, rocksdb found that incomplete records existed and stopped working:
2020-07-13 15:51:38.305 7f9801ef5a80 4 rocksdb: [db/db_impl_open.cc:583] Recovering log #9 mode 0
2020-07-13 15:51:38.748 7f9801ef5a80 3 rocksdb: [db/db_impl_open.cc:518] db.wal/000009.log: dropping 2922 bytes; Corruption: missing start of fragmented record(2)
2020-07-13 15:51:38.748 7f9801ef5a80 4 rocksdb: [db/db_impl.cc:390] Shutdown: canceling all background work
2020-07-13 15:51:38.748 7f9801ef5a80 4 rocksdb: [db/db_impl.cc:563] Shutdown complete
2020-07-13 15:51:38.748 7f9801ef5a80 -1 rocksdb: Corruption: missing start of fragmented record(2)
2020-07-13 15:51:38.748 7f9801ef5a80 -1 bluestore(/var/lib/ceph/osd/ceph-10) _open_db erroring opening db:
2020-07-13 15:51:38.748 7f9801ef5a80 1 bluefs umount
2020-07-13 15:51:38.776 7f9801ef5a80 1 fbmap_alloc 0x55c897e0a900 shutdown
2020-07-13 15:51:38.776 7f9801ef5a80 1 bdev(0x55c898a6ce00 /var/lib/ceph/osd/ceph-10/block) close
Why does rocksdb not automatically delete these incomplete records and continue working?
In addition, once this situation has occurred, what method should be used to recover?
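On the first question: as far as I understand, RocksDB *can* drop a torn tail record, but only when opened with a tolerant `wal_recovery_mode`; BlueStore deliberately opens its DB in a strict mode, because silently discarding the WAL tail would throw away writes the OSD had already acknowledged. A toy illustration of the two policies (hypothetical record format, not RocksDB's actual log layout):

```python
from enum import Enum

class RecoveryMode(Enum):
    TOLERATE_CORRUPTED_TAIL = 1   # drop trailing garbage, keep going
    ABSOLUTE_CONSISTENCY = 2      # any corruption aborts recovery

def recover(records, mode):
    """Replay WAL records until the first corrupt one.
    `records` is a list of (payload, is_valid) pairs."""
    replayed = []
    for payload, ok in records:
        if ok:
            replayed.append(payload)
        elif mode is RecoveryMode.TOLERATE_CORRUPTED_TAIL:
            break                 # silently discard the torn tail
        else:
            raise IOError("Corruption: missing start of fragmented record")
    return replayed

log = [("put a=1", True), ("put b=2", True), ("put b=", False)]
print(recover(log, RecoveryMode.TOLERATE_CORRUPTED_TAIL))
```

On recovery: since the disk returned I/O errors mid-write, I would first check the device health, then try `ceph-bluestore-tool fsck --path /var/lib/ceph/osd/ceph-10` to assess the damage; if the DB cannot be opened, the usual route is to redeploy the OSD and let the surviving replicas backfill, rather than trying to hand-edit the WAL.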
I tested osd bench with different block sizes: 1MB, 512KB, 256KB, 128KB, 64KB, 32KB, and 16KB. osd.2 is from the cluster whose OSDs show the better 4KB osd bench results, and osd.30 is from the cluster whose OSDs show the lower ones. At block sizes above 32KB, osd.30 was better than osd.2; at 32KB, however, osd.30 dropped sharply.
root@cmn01:~# ceph tell osd.2 bench 1073741824 1048576
{
"bytes_written": 1073741824,
"blocksize": 1048576,
"bytes_per_sec": 188747963
}
root@cmn01:~# ceph tell osd.2 bench 1073741824 524288
{
"bytes_written": 1073741824,
"blocksize": 524288,
"bytes_per_sec": 181071543
}
root@cmn01:~# ceph tell osd.2 bench 786432000 262144
{
"bytes_written": 786432000,
"blocksize": 262144,
"bytes_per_sec": 159007035
}
root@cmn01:~# ceph tell osd.2 bench 393216000 131072
{
"bytes_written": 393216000,
"blocksize": 131072,
"bytes_per_sec": 127179122
}
root@cmn01:~# ceph tell osd.2 bench 196608000 65536
{
"bytes_written": 196608000,
"blocksize": 65536,
"bytes_per_sec": 83365482
}
root@cmn01:~# ceph tell osd.2 bench 98304000 32768
{
"bytes_written": 98304000,
"blocksize": 32768,
"bytes_per_sec": 48351258
}
root@cmn01:~# ceph tell osd.2 bench 49152000 16384
{
"bytes_written": 49152000,
"blocksize": 16384,
"bytes_per_sec": 31725841
}
------------------------------------------------------------------------------------------------------------------
root@stor-mgt01:~# ceph tell osd.30 bench 1073741824 1048576
{
"bytes_written": 1073741824,
"blocksize": 1048576,
"elapsed_sec": 5.344805,
"bytes_per_sec": 200894474.890259,
"iops": 191.587901
}
root@stor-mgt01:~# ceph tell osd.30 bench 1073741824 524288
{
"bytes_written": 1073741824,
"blocksize": 524288,
"elapsed_sec": 5.303052,
"bytes_per_sec": 202476205.680661,
"iops": 386.192714
}
root@stor-mgt01:~# ceph tell osd.30 bench 786432000 262144
{
"bytes_written": 786432000,
"blocksize": 262144,
"elapsed_sec": 3.878248,
"bytes_per_sec": 202780204.655892,
"iops": 773.545092
}
root@stor-mgt01:~# ceph tell osd.30 bench 393216000 131072
{
"bytes_written": 393216000,
"blocksize": 131072,
"elapsed_sec": 1.939532,
"bytes_per_sec": 202737591.242988,
"iops": 1546.765070
}
root@stor-mgt01:~# ceph tell osd.30 bench 196608000 65536
{
"bytes_written": 196608000,
"blocksize": 65536,
"elapsed_sec": 1.081617,
"bytes_per_sec": 181772360.338257,
"iops": 2773.626104
}
root@stor-mgt01:~# ceph tell osd.30 bench 98304000 32768
{
"bytes_written": 98304000,
"blocksize": 32768,
"elapsed_sec": 2.908703,
"bytes_per_sec": 33796507.598640,
"iops": 1031.387561
}
root@stor-mgt01:~# ceph tell osd.30 bench 49152000 16384
{
"bytes_written": 49152000,
"blocksize": 16384,
"elapsed_sec": 3.907744,
"bytes_per_sec": 12578102.861185,
"iops": 767.706473
}
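The osd.30 figures are internally consistent — `iops` is `bytes_per_sec / blocksize` and `elapsed_sec` is `bytes_written / bytes_per_sec` — so the 32KB cliff is real, not a reporting glitch. A quick cross-check (values copied from the output above):

```python
def check(bytes_written, blocksize, elapsed_sec, bytes_per_sec, iops):
    """Verify a `ceph tell osd.N bench` report is self-consistent."""
    assert abs(bytes_per_sec - bytes_written / elapsed_sec) / bytes_per_sec < 0.01
    assert abs(iops - bytes_per_sec / blocksize) / iops < 0.01

# osd.30, 1 MiB blocks
check(1073741824, 1048576, 5.344805, 200894474.890259, 191.587901)
# osd.30, 32 KiB blocks -- throughput collapses but iops sits near ~1000
check(98304000, 32768, 2.908703, 33796507.598640, 1031.387561)
print("bench outputs are self-consistent")
```

Note that at 32KB osd.30 is pinned near ~1000 iops, while osd.2 still manages roughly 1475 (48351258/32768). One hypothesis worth checking with iostat during the run is that 32KB writes stop taking the deferred/WAL path on one of the configurations; that relates to the COW/RMW point raised later in the thread.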
------------------ Original Message ------------------
From: "rainning" <tweetypie(a)qq.com>
Date: Thursday, Jul 16, 2020, 9:42 AM
To: "Zhenshi Zhou" <deaderzzs(a)gmail.com>
Cc: "ceph-users" <ceph-users(a)ceph.io>
Subject: Re: [ceph-users] Re: osd bench with or without a separate WAL device deployed
Hi Zhenshi,
I did try with bigger block sizes. Interestingly, the one whose 4KB osd bench was lower performed slightly better in the 4MB osd bench.
Let me try some other block sizes, e.g. 16K, 64K, 128K, 1M, etc., to see if there is any pattern.
Moreover, I did compare the two SSDs; they are an INTEL SSDSC2KB480G8 and an INTEL SSDSC2KB960G8, respectively. Performance-wise, there is not much difference.
Thanks,
Ning
------------------ Original Message ------------------
From: "Zhenshi Zhou" <deaderzzs(a)gmail.com>
Date: Thursday, Jul 16, 2020, 9:24 AM
To: "rainning" <tweetypie(a)qq.com>
Cc: "ceph-users" <ceph-users(a)ceph.io>
Subject: [ceph-users] Re: osd bench with or without a separate WAL device deployed
Maybe you can try writing with bigger block size and compare the results.
For bluestore, write operations come in two modes: one is COW (copy-on-write), the
other is RMW (read-modify-write). AFAIK only RMW uses the wal, in order to
protect data against interrupted writes.
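To make the COW/RMW distinction concrete, here is a simplified, hypothetical dispatcher — not actual BlueStore code; the real threshold option is `bluestore_prefer_deferred_size_hdd`/`_ssd`, and its default varies by release:

```python
# Hypothetical sketch of BlueStore's write-path choice, not actual Ceph code.
DEFERRED_THRESHOLD = 32 * 1024  # e.g. bluestore_prefer_deferred_size_hdd; check your release

def write_path(write_len: int, is_overwrite: bool) -> str:
    """Pick a simplified write strategy.
    - New data (or a full-blob rewrite) can be written copy-on-write
      to fresh space and committed with a metadata update only.
    - A small overwrite of existing data is read-modify-write: the new
      bytes are journaled in the RocksDB WAL first, then flushed in place.
    """
    if not is_overwrite:
        return "cow"              # no WAL round-trip for the payload
    if write_len < DEFERRED_THRESHOLD:
        return "rmw-deferred"     # payload goes through the WAL
    return "cow"                  # large overwrite: write out of place

print(write_path(4096, True))
print(write_path(65536, True))
```

If small overwrites take the deferred/WAL path on one cluster but not the other (different thresholds, or WAL on SSD vs colocated with data), that alone could plausibly explain a 2x gap at 4KB.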
rainning <tweetypie(a)qq.com> wrote on Wed, Jul 15, 2020 at 11:04 PM:
> Hi Zhenshi, thanks very much for the reply.
>
> Yes, I know it is odd that the bluestore is deployed only with a separate
> db device but not a WAL device. The cluster was deployed in k8s using Rook.
> I was told it was because the Rook version we used didn't support that.
>
> Moreover, the comparison was made using osd bench, so the network should not
> be a factor. As for the storage node hardware, although the two clusters are
> indeed different, their CPUs and HDDs have almost the same performance
> numbers. I haven't compared the SSDs used as db/WAL devices; that might
> make a difference, but I am not sure it could account for a 2x difference.
>
> ---Original---
> *From:* "Zhenshi Zhou" <deaderzzs(a)gmail.com>
> *Date:* Wed, Jul 15, 2020 18:39
> *To:* "rainning" <tweetypie(a)qq.com>
> *Cc:* "ceph-users" <ceph-users(a)ceph.io>
> *Subject:* [ceph-users] Re: osd bench with or without a separate WAL
> device deployed
>
> I have deployed clusters either with a separate db/wal or with db/wal/data
> together, but never with only a separate db.
> AFAIK the wal does have an effect on writes, but I'm not sure it could
> account for two times the bench value. Hardware and
> network environment are also important factors.
>
> rainning <tweetypie(a)qq.com> wrote on Wed, Jul 15, 2020 at 4:35 PM:
>
> > Hi all,
> >
> >
> > I am wondering if there is any performance comparison done on osd bench
> > with and without a separate WAL device deployed given that there is
> always
> > a separate db device deployed on SSD in both cases.
> >
> >
> > The reason I am asking this question is that we have two clusters and
> osds
> > in one have separate db and WAL device deployed on SSD but osds in
> another
> > only have a separate db device deployed. And we found 4KB osd bench (i.e.
> > ceph tell osd.X bench 12288000 4096) for the ones having a separate WAL
> > device was two times of the ones without a separate WAL device. Is the
> > performance difference caused by the separate WAL device?
> >
> >
> > Thanks,
> > Ning
> > _______________________________________________
> > ceph-users mailing list -- ceph-users(a)ceph.io
> > To unsubscribe send an email to ceph-users-leave(a)ceph.io
> >
Hi all,
I am wondering whether any performance comparison has been done on osd bench with and without a separate WAL device deployed, given that in both cases there is always a separate db device deployed on SSD.
The reason I am asking is that we have two clusters: the OSDs in one have separate db and WAL devices deployed on SSD, while the OSDs in the other have only a separate db device. We found that the 4KB osd bench (i.e. ceph tell osd.X bench 12288000 4096) for the ones with a separate WAL device was two times that of the ones without. Is the performance difference caused by the separate WAL device?
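For anyone reproducing this: the runs earlier in the thread follow a simple pattern — 3000 blocks per run, capped at 1 GiB total — which a small helper can generate (hypothetical script; the commands must of course be run against a live cluster):

```python
def bench_cmds(osd_id: int):
    """Build the `ceph tell osd.N bench` command lines used in this thread:
    3000 blocks per run, capped at 1 GiB total, so every block size
    writes for a comparable duration."""
    cmds = []
    for bs in (1048576, 524288, 262144, 131072, 65536, 32768, 16384, 4096):
        total = min(3000 * bs, 1 << 30)   # 1 GiB cap
        cmds.append(f"ceph tell osd.{osd_id} bench {total} {bs}")
    return cmds

for c in bench_cmds(2):
    print(c)
```

The last entry reproduces the 4KB invocation quoted above (12288000 = 3000 * 4096).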
Thanks,
Ning
Hi Liam, All,
We have also run into this bug:
https://lists.ceph.io/hyperkitty/list/ceph-users@ceph.io/thread/PCYY2MKRPCP…
Like you, we are also running Octopus 15.2.3.
Downgrading the RGWs at this point is not ideal, but if a fix isn't found
soon we might have to.
Has a bug report been filed for this yet?
- Dave
Hi all,
Does anyone know when we can expect Crimson/SeaStore to be "production
ready", and/or what level of performance increase can be expected?
thx
Frank
Hello everyone,
Below is our current setup: the master zone is in the master datacenter and the backup zone is in the standby datacenter, which has a very slow internet connection. The client application connects to the master zone load balancer to upload objects to the cluster. Later on, the gateway nodes in the backup zone synchronize new objects into the backup cluster by requesting object data from the load balancer of the master zone.
[screenshot]
The current connections on the gateway nodes of the backup zone:
[screenshot]
From the image above, you can see that the first gateway node creates a lot of connections to the master zone load balancer (cephmm-03) for checking object status. That means this gateway will mostly take on the role of downloading new objects from the master zone to the backup zone (very high load on this node). On the other hand, the second gateway node has only 2 connections to the master zone load balancer (it is mostly idle). This second node only increases its connections if the first node (or maybe the third node) is down…
Can you explain this behavior of Ceph in this case, and how can we put all the gateway nodes into active mode?
I appreciate any comments from you!
--
Nghia Viet Tran (Mr)
Yes. After the 600-second time-out the OSDs got marked out, all PGs got remapped, and recovery/rebalancing started as usual. In the past, I did service on servers with the noout flag set and would expect that mon_osd_down_out_subtree_limit=host has the same effect when shutting down an entire host. Unfortunately, in my case these two settings behave differently.
If I understand the documentation correctly, the OSDs should not get marked out automatically.
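As I read the documentation, the intended semantics can be sketched like this (a toy model of the documented behavior, not actual monitor code):

```python
# Toy model of the documented auto-out semantics, not actual Ceph code.
def should_auto_mark_out(osd, down_osds, host_of, subtree_limit="host"):
    """An OSD is auto-marked out after mon_osd_down_out_interval unless
    its entire containing subtree (here: host) is down, in which case the
    failure is treated as host-level/maintenance and auto-out is skipped."""
    if subtree_limit == "host":
        host = host_of[osd]
        peers = [o for o, h in host_of.items() if h == host]
        if all(o in down_osds for o in peers):
            return False   # whole host down: leave the OSDs 'in'
    return True

host_of = {0: "a", 1: "a", 2: "b"}
print(should_auto_mark_out(0, {0}, host_of))      # one OSD of host a down
print(should_auto_mark_out(0, {0, 1}, host_of))   # all of host a down
```

Since your OSDs did get marked out, either the option isn't being applied (worth checking with `ceph config get mon mon_osd_down_out_subtree_limit` on the mons) or the behavior differs from the docs, in which case a tracker issue may be warranted.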
Best regards,
=================
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14
________________________________________
From: Anthony D'Atri <anthony.datri(a)gmail.com>
Sent: 14 July 2020 04:32:05
To: Frank Schilder
Subject: Re: [ceph-users] mon_osd_down_out_subtree_limit not working?
Did it start rebalancing?
> On Jul 13, 2020, at 4:29 AM, Frank Schilder <frans(a)dtu.dk> wrote:
>
> if I shut down all OSDs on this host, these OSDs should not be marked out automatically after mon_osd_down_out_interval (=600) seconds. I did a test today and, unfortunately, the OSDs do get marked out. Ceph status was showing 1 host down, as expected.