Our test cluster is seeing a problem where peering goes incredibly slowly, starting shortly after we upgraded it from Luminous (12.2.12) to Nautilus (14.2.2).
From what I can tell, it seems to be caused by "wait for new map" taking a long time. Looking at dump_historic_slow_ops on pretty much any OSD, I see entries like this:
# ceph daemon osd.112 dump_historic_slow_ops
[...snip...]
    {
        "description": "osd_pg_create(e180614 287.4b:177739 287.75:177739 287.1c3:177739 287.1cf:177739 287.1e1:177739 287.2dd:177739 287.2fc:177739 287.342:177739 287.382:177739)",
        "initiated_at": "2019-09-03 15:12:41.366514",
        "age": 4800.8847047119998,
        "duration": 4780.0579745630002,
        "type_data": {
            "flag_point": "started",
            "events": [
                {
                    "time": "2019-09-03 15:12:41.366514",
                    "event": "initiated"
                },
                {
                    "time": "2019-09-03 15:12:41.366514",
                    "event": "header_read"
                },
                {
                    "time": "2019-09-03 15:12:41.366501",
                    "event": "throttled"
                },
                {
                    "time": "2019-09-03 15:12:41.366547",
                    "event": "all_read"
                },
                {
                    "time": "2019-09-03 15:39:03.379456",
                    "event": "dispatched"
                },
                {
                    "time": "2019-09-03 15:39:03.379477",
                    "event": "wait for new map"
                },
                {
                    "time": "2019-09-03 15:39:03.522376",
                    "event": "wait for new map"
                },
                {
                    "time": "2019-09-03 15:53:55.912499",
                    "event": "wait for new map"
                },
                {
                    "time": "2019-09-03 15:59:37.909063",
                    "event": "wait for new map"
                },
                {
                    "time": "2019-09-03 16:00:43.356023",
                    "event": "wait for new map"
                },
                {
                    "time": "2019-09-03 16:20:50.575498",
                    "event": "wait for new map"
                },
                {
                    "time": "2019-09-03 16:31:48.689415",
                    "event": "started"
                },
                {
                    "time": "2019-09-03 16:32:21.424489",
                    "event": "done"
                }
            ]
        }
    }
It always seems to be an osd_pg_create() op with multiple "wait for new map" events before it finally does something. What could be causing it to take so long to get the new OSD map? The mons don't appear to be overloaded in any way.
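For reference, the only digging I've done so far is to compare map epochs between the mons and an OSD (assuming admin-socket access on the OSD host), roughly:

# cluster-wide osdmap epoch according to the mons
ceph osd dump | head -1
# oldest_map/newest_map that this particular OSD actually has
ceph daemon osd.112 status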
Thanks,
Bryan
This is the third bugfix release of the Ceph Nautilus release series. This
release fixes a security issue. We recommend that all Nautilus users upgrade
to this release. When upgrading from older releases of Ceph, the general
guidelines for upgrading to Nautilus must be followed.
Notable Changes
---------------
* CVE-2019-10222 - Fixed a denial of service vulnerability where an
unauthenticated client of Ceph Object Gateway could trigger a crash from an
uncaught exception
* Nautilus-based librbd clients can now open images on Jewel clusters.
* The RGW `num_rados_handles` option has been removed. If you were using a
  value of `num_rados_handles` greater than 1, multiply your current
  `objecter_inflight_ops` and `objecter_inflight_op_bytes` parameters by the
  old `num_rados_handles` to get the same throttle behavior (see the example
  after this list).
* The secure mode of the Messenger v2 protocol is no longer experimental with
  this release. This mode is now the preferred mode of connection for monitors.
* "osd_deep_scrub_large_omap_object_key_threshold" has been lowered to more
  easily detect objects with a large number of omap keys.
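As an illustration of the `num_rados_handles` change above (a sketch only; it
assumes the default values `objecter_inflight_ops = 1024` and
`objecter_inflight_op_bytes = 104857600`, and a hypothetical RGW instance
name), a configuration that previously used `num_rados_handles = 4` would
become:

    [client.rgw.gateway1]
    # was: num_rados_handles = 4 (option removed in this release)
    objecter_inflight_ops = 4096            # 1024 * 4
    objecter_inflight_op_bytes = 419430400  # 104857600 * 4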
For a detailed changelog, please refer to the official release notes
entry on the Ceph blog: https://ceph.io/releases/v14-2-3-nautilus-released/
Getting Ceph
------------
* Git at git://github.com/ceph/ceph.git
* Tarball at http://download.ceph.com/tarballs/ceph-14.2.3.tar.gz
* For packages, see http://docs.ceph.com/docs/master/install/get-packages/
* Release git sha1: 0f776cf838a1ae3130b2b73dc26be9c95c6ccc39
--
Abhishek Lekshmanan
SUSE Software Solutions Germany GmbH
Good day,
We have a Ceph cluster and make use of its object storage, integrated
with OpenStack. Each OpenStack project/tenant is given a radosgw user,
which allows all Keystone users of that project to access the
object storage as that single radosgw user. The radosgw user name is the
project ID of the OpenStack project/tenant.
Sometimes we have use cases where we want to access the object storage
outside of the Swift API, using tools like the aws-cli or home-grown
Java applications. For this use case, what we do is generate an S3
access/secret key pair for the specific radosgw user, and it then has full
access to the object storage for that OpenStack project/tenant.
What we want to know is whether it is possible to provide granular access
to containers within a single OpenStack project using S3 access keys
or S3 subusers. I know that the Swift API has ACLs that can limit access by
Keystone user, but we are exploring the possibility of doing this using
S3 and S3 bucket policies, so that the tools our team is developing
(open source) are more transferable between AWS S3 and RADOS GW. A sketch
of what we have in mind follows.
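To make the question concrete, this is the kind of thing we are hoping will
work (the bucket name, user name, and endpoint are made up, and I have not
verified that RGW honors exactly this form):

# policy.json - give a second radosgw user read-only access to one bucket
cat > policy.json <<'EOF'
{
  "Version": "2012-10-17",
  "Statement": [{
    "Effect": "Allow",
    "Principal": {"AWS": ["arn:aws:iam:::user/readonly-user"]},
    "Action": ["s3:GetObject", "s3:ListBucket"],
    "Resource": ["arn:aws:s3:::project-bucket", "arn:aws:s3:::project-bucket/*"]
  }]
}
EOF
aws --endpoint-url https://rgw.example.com s3api put-bucket-policy \
    --bucket project-bucket --policy file://policy.json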
Thanks all,
Jared Baker
Cloud Architect, OICR
Hi!
I understand that this question is not quite right for this mailing list, but nonetheless, experts who may have encountered this have gathered here.
I have 24 servers, and on each of them, after six months of uptime, the following began to happen:
[root@S-26-5-1-2 cph]# uname -a
Linux S-26-5-1-2 5.2.11-1.el7.elrepo.x86_64 #1 SMP Thu Aug 29 08:10:52 EDT 2019 x86_64 x86_64 x86_64 GNU/Linux
[root@S-26-5-1-2 cph]# dd if=/dev/zero of=/dev/sdc bs=1M count=1000 oflag=sync
1048576000 bytes (1.0 GB) copied, 3.76334 s, 279 MB/s
[root@S-26-5-1-2 cph]# dd if=/dev/zero of=/dev/sdd bs=1M count=1000 oflag=sync
1048576000 bytes (1.0 GB) copied, 4.54834 s, 231 MB/s
sdc is an SSD disk; sdd is an HDD.
As you can see, the SSD is somehow slow, and the HDD is implausibly fast (231 MB/s of sync writes should not be possible on a spinning disk).
A reboot changes nothing.
Only a poweroff/poweron cycle restores normal behavior:
[root@S-26-5-1-2 cph]# dd if=/dev/zero of=/dev/sdc bs=1M count=1000 oflag=sync
1048576000 bytes (1.0 GB) copied, 3.24042 s, 324 MB/s
[root@S-26-5-1-2 cph]# dd if=/dev/zero of=/dev/sdd bs=1M count=1000 oflag=sync
1048576000 bytes (1.0 GB) copied, 13.7709 s, 76.1 MB/s
There is absolutely nothing about this in the system or Ceph logs (these servers are used for OSDs).
Perhaps someone has encountered similar behavior? The checks I plan to run next are below.
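In case it matters, the next things I plan to check (hdparm assumed to be
installed) are the drive's volatile write cache and a direct-I/O run that
bypasses the page cache:

# Is the drive's volatile write cache enabled?
hdparm -W /dev/sdd
# Destructive! Writes to the raw device, like the tests above.
dd if=/dev/zero of=/dev/sdd bs=1M count=1000 oflag=direct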
WBR,
Fyodor.
Hello,
I'm trying to install Nautilus on stretch following the directions here: https://docs.ceph.com/docs/master/install/get-packages/ . However, it seems the stretch repo only includes ceph-deploy. Are the rest of the packages missing on purpose, or have I missed something obvious?
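For reference, this is the repo line I added, following the pattern from that
page (adjust the release/distro names for your setup):

# /etc/apt/sources.list.d/ceph.list
deb https://download.ceph.com/debian-nautilus/ stretch main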
Thanks
Hi all,
I am using the AWS S3 Java SDK. When I create a new bucket using the hostname "s3.my-self.mydomain.com", I get an auth error.
But when I use the hostname "s3.us-east-1.mydomain.com", it works. Why?
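My unconfirmed guess is that the SDK derives the SigV4 signing region from
hostnames of the form s3.<region>.<domain>, which the first name does not
match. To test that theory, I plan to pin the endpoint and region explicitly;
with the aws-cli the equivalent would be something like (hypothetical bucket
name):

aws --endpoint-url http://s3.my-self.mydomain.com --region us-east-1 \
    s3 mb s3://my-new-bucket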
Huang Mingyou
IT Infrastructure Department Manager
V.Photos Cloud Photography
Mobile: +86 13540630430
Customer service: 400 - 806 - 5775
Email: hmy(a)v.photos
Website: www.v.photos
Shanghai: 2/F, Building F, Bund SOHO 3Q, 88 Zhongshan East 2nd Road, Huangpu District
Beijing: 1/F, SOHO 3Q, South Gate 2, Guanghua Road SOHO Phase II, 9 Guanghua Road, Chaoyang District
Guangzhou: 3W Coffice, Tianyu Garden Phase II, 136 Linhe Middle Road, Tianhe District
Shenzhen: 1/F, 102 Wanggu Shuangchuang Street, Building A, Netvalley Technology Building Phase II, Shekou, Nanshan District
Chengdu: 7/F, Shimao Plaza, Jianshe Road, Chenghua District
Hello,
I have an old Ceph 0.94.10 cluster that had 10 storage nodes, with one extra
management node used for running commands on the cluster. Over time we
had some hardware failures on some of the storage nodes, so we're down to
6, with ceph-mon running on the management server and on 4 of the storage
nodes. We attempted to deploy a ceph.conf change and restarted the ceph-mon
and ceph-osd services, but the cluster went down on us. We found that all the
ceph-mons are stuck in the electing state. I can't get any response from
any ceph commands, but I found I can contact the daemon directly and get
this information (hostnames removed for privacy reasons):
root@<mgmt1>:~# ceph daemon mon.<mgmt1> mon_status
{
    "name": "<mgmt1>",
    "rank": 0,
    "state": "electing",
    "election_epoch": 4327,
    "quorum": [],
    "outside_quorum": [],
    "extra_probe_peers": [],
    "sync_provider": [],
    "monmap": {
        "epoch": 10,
        "fsid": "69611c75-200f-4861-8709-8a0adc64a1c9",
        "modified": "2019-08-23 08:20:57.620147",
        "created": "0.000000",
        "mons": [
            {
                "rank": 0,
                "name": "<mgmt1>",
                "addr": "[fdc4:8570:e14c:132d::15]:6789\/0"
            },
            {
                "rank": 1,
                "name": "<mon1>",
                "addr": "[fdc4:8570:e14c:132d::16]:6789\/0"
            },
            {
                "rank": 2,
                "name": "<mon2>",
                "addr": "[fdc4:8570:e14c:132d::28]:6789\/0"
            },
            {
                "rank": 3,
                "name": "<mon3>",
                "addr": "[fdc4:8570:e14c:132d::29]:6789\/0"
            },
            {
                "rank": 4,
                "name": "<mon4>",
                "addr": "[fdc4:8570:e14c:132d::151]:6789\/0"
            }
        ]
    }
}
Is there any way to force the cluster back into quorum, even if it's just
one mon running, so it can start up? I've tried exporting the mgmt node's
monmap and injecting it into the other nodes, but it didn't make any
difference. The monmap surgery I was considering next is sketched below.
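What I was considering trying next is the documented procedure for removing
monitors from an unhealthy cluster, shrinking the monmap down to a single mon
(a sketch using my hostnames, if I understand the docs right; all ceph-mon
daemons must be stopped first, and I haven't run this yet):

# On <mgmt1>, with every ceph-mon stopped:
ceph-mon -i <mgmt1> --extract-monmap /tmp/monmap
monmaptool /tmp/monmap --rm <mon1> --rm <mon2> --rm <mon3> --rm <mon4>
ceph-mon -i <mgmt1> --inject-monmap /tmp/monmap
# then start only mon.<mgmt1> and check for quorum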
Thanks!
Hi,
On Thu, 29 Aug 2019 at 22:32, fengyd <fengyd81(a)gmail.com> wrote:
> Hi,
>
> Is the issue still there?
>
Yes, it still is.
> I ran into an I/O performance issue recently and found that the max fd
> limit for Qemu/KVM was not big enough; the fds for Qemu/KVM were
> exhausted. The issue was solved after increasing the max fd limit.
>
How do I check and increase the max fd limit for QEMU? Can you show me how?
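The only thing I have found so far is how to inspect the current usage and
limit (a sketch, assuming a libvirt-managed qemu process; max_files in
/etc/libvirt/qemu.conf is what the libvirt docs mention for raising it, if I
read them right):

# Count open fds and show the limit for a running qemu process
pid=$(pgrep -f qemu | head -1)
ls /proc/$pid/fd | wc -l
grep 'open files' /proc/$pid/limits
# To raise it for libvirt-managed guests, set in /etc/libvirt/qemu.conf:
#   max_files = 32768
# then restart libvirtd.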
Regards,
Gesiel
>
> On Wed, 21 Aug 2019 at 20:53, Gesiel Galvão Bernardes <
> gesiel.bernardes(a)gmail.com> wrote:
>
>> Hi Eliza,
>>
>> On Wed, 21 Aug 2019 at 09:30, Eliza <eli(a)chinabuckets.com>
>> wrote:
>>
>>> Hi
>>>
>>> On 2019/8/21 20:25, Gesiel Galvão Bernardes wrote:
>>> > I'm using Qemu/KVM (OpenNebula) with Ceph/RBD for running VMs, and I'm
>>> > having problems with slowness in applications that often are not
>>> > consuming much CPU or RAM. This problem affects mostly Windows.
>>> > Apparently the problem is that the application normally loads many
>>> > small files (e.g. DLLs), and these files take a long time to load,
>>> > causing the slowness.
>>>
>>> Did you check/test your network connection?
>>> Do you have a fast network setup?
>>
>>
>> I have a bond of two 10Gb interfaces, with little utilization.
>>
>>>
>>>
>> regards.