Hi,
I regularly see the following error message in the MGR log:
2019-11-18 14:25:48.847 7fd9e6a3a700 0 mgr[dashboard] [18/Nov/2019:14:25:48] ENGINE Error in HTTPServer.tick
Traceback (most recent call last):
  File "/usr/lib/python2.7/dist-packages/cherrypy/wsgiserver/__init__.py", line 2021, in start
    self.tick()
  File "/usr/lib/python2.7/dist-packages/cherrypy/wsgiserver/__init__.py", line 2090, in tick
    s, ssl_env = self.ssl_adapter.wrap(s)
  File "/usr/lib/python2.7/dist-packages/cherrypy/wsgiserver/ssl_builtin.py", line 67, in wrap
    server_side=True)
  File "/usr/lib/python2.7/ssl.py", line 369, in wrap_socket
    _context=self)
  File "/usr/lib/python2.7/ssl.py", line 599, in __init__
    self.do_handshake()
  File "/usr/lib/python2.7/ssl.py", line 828, in do_handshake
    self._sslobj.do_handshake()
error: [Errno 0] Error
2019-11-18 14:25:49.027 7fd9e6a3a700 0 mgr[dashboard] [18/Nov/2019:14:25:49] ENGINE Error in HTTPServer.tick
Traceback (most recent call last):
  File "/usr/lib/python2.7/dist-packages/cherrypy/wsgiserver/__init__.py", line 2021, in start
    self.tick()
  File "/usr/lib/python2.7/dist-packages/cherrypy/wsgiserver/__init__.py", line 2090, in tick
    s, ssl_env = self.ssl_adapter.wrap(s)
  File "/usr/lib/python2.7/dist-packages/cherrypy/wsgiserver/ssl_builtin.py", line 67, in wrap
    server_side=True)
  File "/usr/lib/python2.7/ssl.py", line 369, in wrap_socket
    _context=self)
  File "/usr/lib/python2.7/ssl.py", line 599, in __init__
    self.do_handshake()
  File "/usr/lib/python2.7/ssl.py", line 828, in do_handshake
    self._sslobj.do_handshake()
SSLError: [SSL: SSLV3_ALERT_CERTIFICATE_UNKNOWN] sslv3 alert certificate unknown (_ssl.c:727)
In many cases this error triggers a failover of the active MGR node. It also affects the Ceph Dashboard directly: the dashboard hangs completely when this error is logged.
Any advice on how to fix this issue is appreciated.
THX
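For reference: SSLV3_ALERT_CERTIFICATE_UNKNOWN means the connecting client (a browser or a monitoring probe hitting the dashboard port) aborted the TLS handshake because it did not accept the dashboard's certificate. Assuming the dashboard is still running its default self-signed certificate, one possible direction is to regenerate it or install a certificate the clients trust (Nautilus-style commands; the file names are illustrative):

ceph dashboard create-self-signed-cert          # regenerate the built-in self-signed cert
# or install a CA-signed certificate and key:
ceph dashboard set-ssl-certificate -i dashboard.crt
ceph dashboard set-ssl-certificate-key -i dashboard.key
# restart the dashboard module so it picks up the change:
ceph mgr module disable dashboard
ceph mgr module enable dashboard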
Hi Daniel,
I am able to mount the buckets with your config; however, when I try to write something, my logs fill up with errors like this:
svc_732] nfs4_Errno_verbose :NFS4 :CRIT :Error I/O error in nfs4_write_cb converted to NFS4ERR_IO but was set non-retryable
Any chance you know how to resolve this?
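Not an authoritative answer, but NFS4ERR_IO from nfs4_write_cb means the write failed in the underlying FSAL. Since these are bucket exports (FSAL_RGW), one thing worth ruling out is that the RGW user Ganesha is configured with lacks write permission; a quick check (the uid is illustrative):

radosgw-admin user info --uid=nfsuser       # confirm the user exists and has full (rw) perms
# librgw runs inside the Ganesha process, so the underlying write error
# should also surface in the Ganesha log while you retry, e.g.:
tail -f /var/log/ganesha/ganesha.log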
Hello,
I'm going to deploy a new cluster soon based on 6.4TB NVMe PCIe cards; I will have only 1 NVMe card per node, across 38 nodes.
The use case is to offer CephFS volumes for a k8s platform. I plan to use an EC pool (8+3) for the cephfs_data pool.
Do you have recommendations for the setup, or mistakes to avoid? I use ceph-ansible to deploy all my clusters.
Best regards,
--
Yoann Moulin
EPFL IC-IT
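For reference, a minimal sketch of creating such an 8+3 data pool (profile name, pool name, and PG counts are illustrative). Note that CephFS requires overwrites to be enabled on an erasure-coded pool, and with crush-failure-domain=host an 8+3 profile needs at least 11 OSD hosts, which 38 nodes easily satisfies:

ceph osd erasure-code-profile set ec-8-3 k=8 m=3 crush-failure-domain=host
ceph osd pool create cephfs_data 1024 1024 erasure ec-8-3
ceph osd pool set cephfs_data allow_ec_overwrites true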
Hello Nathan,
>>>> I'm going to deploy a new cluster soon based on 6.4TB NVMe PCIe cards; I will have only 1 NVMe card per node, across 38 nodes.
>>>>
>>>> The use case is to offer CephFS volumes for a k8s platform. I plan to use an EC pool (8+3) for the cephfs_data pool.
>>>>
>>>> Do you have recommendations for the setup, or mistakes to avoid? I use ceph-ansible to deploy all my clusters.
>>>
>>> In order to get optimal performance out of NVMe, you will want very
>>> fast cores, and you will probably have to split each NVMe card into
>>> 2-4 OSD partitions in order to throw enough cores at it.
That's a good idea! If I have enough time, I'll try to do some benchmarks with 2 and 4 OSD partitions.
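If it helps, ceph-ansible already exposes this via ceph-volume's batch mode; a sketch of the relevant group_vars entries, assuming a ceph-volume-based deployment (device path illustrative):

# group_vars/osds.yml
osds_per_device: 4        # carve each NVMe card into 4 OSDs
devices:
  - /dev/nvme0n1

The manual equivalent is: ceph-volume lvm batch --osds-per-device 4 /dev/nvme0n1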
>> I’ve been trying unsuccessfully to convince some folks of the need for fast cores; there’s the idea that the effect would be slight. Do
>> you have any numbers? I’ve also read a claim that each BlueStore OSD will use 3-4 cores. They’re listening to me, though, about splitting the
>> card into multiple OSDs.
>
> Bluestore will use about 4 cores, but in my experience, the maximum
> utilization I've seen has been something like: 100%, 100%, 50%, 50%
>
> So those first 2 cores are the bottleneck for pure OSD IOPS. This sort
> of pattern isn't uncommon in multithreaded programs. This was on HDD
> OSDs with DB/WAL on NVMe, as well as some small metadata OSDs on pure
> NVMe. SSD OSDs default to 2 threads per shard, and HDD to 1, but we
> had to set HDD to 2 as well when we enabled NVMe WAL/DB. Otherwise the
> OSDs ran out of CPU and failed to heartbeat when under load. I believe
> that if we had 50% faster cores, we might not have needed to do this.
>
> On SSDs/NVMe you can compensate for slower cores with more OSDs, but
> of course only for parallel operations. Anything that is
> serial+synchronous, not so much. I would expect something like 4 OSDs
> per NVMe, 4 cores per OSD. That's already 16 cores per node just for
> OSDs.
>
> Our bottleneck in practice is the Ceph MDS, which seems to use exactly
> 2 cores and has no setting to change this. As far as I can tell, if we
> had 50% faster cores just for the MDS, I would expect roughly +50%
> performance in terms of metadata ops/second. Each filesystem has its
> own rank-0 MDS, so this load will be split across daemons. The MDS can
> also use a ton of RAM (32GB) if the clients have a working set of 1
> million+ files. Multi-mds exists to further split the load, but is
> quite new and I would not trust it. CephFS in general is likely where
> you will have the most issues, as it is both new and complex compared to
> a simple object store. Having an MDS in standby-replay mode keeps its
> RAM cache synced with the active, so you get far faster failover (
> O(seconds) rather than O(minutes) with a few million file caps) but
> you use the same RAM again.
>
> So, IMHO, you will want at least:
> CPU:
> 16 cores per 1-card NVMe OSD node. 2 cores per filesystem (maybe 1 if
> you don't expect a lot of simultaneous load?)
>
> RAM:
> The Bluestore default is 4GB per OSD, so 16GB per node.
> ~32GB of RAM per active and standby-replay MDS if you expect file
> counts in the millions, so 64GB per filesystem.
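For what it's worth, the knobs behind those RAM numbers are tunable; a sketch assuming Nautilus-style central config (the fs name and sizes are illustrative, and note that mds_cache_memory_limit caps only the cache — actual MDS RSS runs noticeably higher):

ceph config set osd osd_memory_target 4294967296        # 4 GiB per OSD (the default)
ceph config set mds mds_cache_memory_limit 17179869184  # 16 GiB MDS cache
ceph fs set cephfs allow_standby_replay true            # fast failover via standby-replay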
The context is:
3 Intel Server 1U for MONs/MDSs/MGRs services + K8s daemons
CPU : 2 x Intel(R) Xeon(R) CPU E5-2680 v3 @ 2.50GHz (24c/48t)
Memory : 64GB
Disk OS : 2x Intel SSD DC S3520 240GB
38 Dell C4140 1U for OSD nodes :
CPU : 2 x Intel(R) Xeon(R) Gold 6132 CPU @ 2.60GHz (28c/56t)
Memory : 384GB
GPU : 4 Nvidia V100 32GB NVLink
Disk OS : M.2 240G
NVME : Dell 6.4TB NVME PCI-E Drive (Samsung PM1725b), only 1 slot available
Each server is used in a k8s cluster to give access to GPUs and CPUs for X-learning labs.
Ceph has to share CPU and memory with the K8s compute cluster.
> 128GB of RAM per node ought to do, if you have less than 14 filesystems?
I plan to have only 1 filesystem.
Thanks for all this useful information.
Best regards,
--
Yoann Moulin
EPFL IC-IT
Hi,
ceph health is reporting: pg 59.1c is creating+down, acting [426,438]
root@ld3955:~# ceph health detail
HEALTH_WARN 1 MDSs report slow metadata IOs; noscrub,nodeep-scrub flag(s) set; Reduced data availability: 1 pg inactive, 1 pg down; 1 subtrees have overcommitted pool target_size_bytes; 1 subtrees have overcommitted pool target_size_ratio; mons ld5505,ld5506 are low on available space
MDS_SLOW_METADATA_IO 1 MDSs report slow metadata IOs
mdsld4465(mds.0): 8 slow metadata IOs are blocked > 30 secs, oldest blocked for 120721 secs
OSDMAP_FLAGS noscrub,nodeep-scrub flag(s) set
PG_AVAILABILITY Reduced data availability: 1 pg inactive, 1 pg down
pg 59.1c is creating+down, acting [426,438]
MON_DISK_LOW mons ld5505,ld5506 are low on available space
mon.ld5505 has 22% avail
mon.ld5506 has 29% avail
root@ld3955:~# ceph pg dump_stuck inactive
ok
PG_STAT STATE UP UP_PRIMARY ACTING ACTING_PRIMARY
59.1c creating+down [426,438] 426 [426,438] 426
How can I fix this?
THX
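A PG stuck in creating+down usually means peering cannot complete on the acting set. A first-pass triage sketch, before anything destructive:

ceph pg 59.1c query      # inspect recovery_state for what peering is waiting on
ceph osd find 426        # check both acting OSDs (426, 438) are up and reachable
# Restarting the acting OSDs sometimes kicks peering. As an absolute last
# resort, and only if the PG is known to hold no data (destructive!):
# ceph osd force-create-pg 59.1c --yes-i-really-mean-it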
Hi, cool guys,
Recently we encountered a problem: the journal of the MDS daemon couldn't be trimmed, resulting in a large amount of space occupied by the metadata pool. The only thing we could think of was to flush the journal via the admin socket command. That made things worse: the admin thread of the MDS got stuck as well, and after that we couldn't even change the log level. After analyzing the code, we found that some segments never get out of the expiring queue, but we don't know why, or where execution gets stuck inside void LogSegment::try_to_expire(MDSRank *mds, MDSGatherBuilder &gather_bld, int op_prio). Any ideas or advice? Thanks a lot. Here is some cluster information:
Version:
luminous(v12.2.12)
MDS debug log:
5 mds.0.log trim already expiring segment 3658103659/11516554553473, 980 events
5 mds.0.log trim already expiring segment 3658104639/11516556356904, 1024 events
5 mds.0.log trim already expiring segment 3658105663/11516558241475, 1024 events
cephfs-journal-tool:
{
  "magic": "ceph fs volume v011",
  "write_pos": 11836049063598,
  "expire_pos": 11516554553473,
  "trimmed_pos": 11516552151040,
  "stream_format": 1,
  "layout": {
    "stripe_unit": 4194304,
    "stripe_count": 1,
    "object_size": 4194304,
    "pool_id": 2,
    "pool_ns": ""
  }
}
locallocal
locallocal(a)163.com
radosgw-admin4j is an admin client in Java that allows provisioning and control of a Ceph object store. Version 2.0.2 adds support for Java 11 and Ceph Nautilus. See https://github.com/twonote/radosgw-admin4j for more details.