Hello,
I noticed a couple unanswered questions on this topic from a while back.
It seems, however, worth asking whether adjusting either or both of the
subject attributes could improve performance with large HDD OSDs (mine are
12TB SAS).
In the previous posts on this topic the writers indicated that they had
experimented with increasing either or both of osd_op_num_shards and
osd_op_num_threads_per_shard and had seen performance improvements. Like
me, the writers wondered about any limitations or pitfalls relating to
such adjustments.
Since I would rather not take chances with a 500TB production cluster I am
asking for guidance from this list.
BTW, my cluster is currently running Nautilus 14.2.6 (stock Debian
packages).
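For reference, the sort of adjustment I have in mind looks roughly like the
following (untested on my end; I am assuming the _hdd variants are what apply
to spinning OSDs and that new values are only picked up when an OSD restarts):

  # current effective values on one OSD
  ceph config show osd.0 osd_op_num_shards_hdd
  ceph config show osd.0 osd_op_num_threads_per_shard_hdd

  # cluster-wide override, e.g. raising the shard count
  ceph config set osd osd_op_num_shards_hdd 8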
Thank you.
-Dave
--
Dave Hall
Binghamton University
kdhall(a)binghamton.edu
Have you checked for disk failure? dmesg, smartctl etc. ?
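For example (device names below are just placeholders, adjust to your setup):

  dmesg -T | grep -iE 'error|nvme|ata'
  smartctl -a /dev/nvme0
  smartctl -a /dev/sda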
Quoting "Robert W. Eckert" <rob(a)rob.eckert.name>:
> I worked through that workflow- but it seems like the one monitor
> will run for a while - anywhere from an hour to a day, then just stop.
>
> This machine is running on AMD hardware (3600X CPU on X570 chipset)
> while my other two are running on old intel.
>
> I did find this in the service logs
>
> 2021-04-30T16:02:40.135+0000 7f5d0a94f700 -1 rocksdb: submit_common
> error: Corruption: block checksum mismatch: expected 395334538, got
> 4289108204 in /var/lib/ceph/mon/ceph-cube/store.db/073501.sst
> offset 36769734 size 84730 code = 2 Rocksdb transaction:
>
> I am attaching the output of
> journalctl -u ceph-fe3a7cb0-69ca-11eb-8d45-c86000d08867(a)mon.cube.service
>
> The error appears to be here:
> Apr 30 12:02:40 cube.robeckert.us conmon[41474]: debug -61>
> 2021-04-30T16:02:38.700+0000 7f5d21332700 4 mon.cube(a)-1(???).mgr
> e702 active server:
> [v2:192.168.2.199:6834/1641928541,v1:192.168.2.199:6835/1641928541](2184157)
> Apr 30 12:02:40 cube.robeckert.us conmon[41474]: debug -60>
> 2021-04-30T16:02:38.700+0000 7f5d21332700 4 mon.cube(a)-1(???).mgr
> e702 mkfs or daemon transitioned to available, loading commands
> Apr 30 12:02:40 cube.robeckert.us conmon[41474]: debug -59>
> 2021-04-30T16:02:38.701+0000 7f5d21332700 4 set_mon_vals no
> callback set
> Apr 30 12:02:40 cube.robeckert.us conmon[41474]: debug -58>
> 2021-04-30T16:02:38.701+0000 7f5d21332700 10 set_mon_vals
> client_cache_size = 32768
> Apr 30 12:02:40 cube.robeckert.us conmon[41474]: debug -57>
> 2021-04-30T16:02:38.701+0000 7f5d21332700 10 set_mon_vals
> container_image =
> docker.io/ceph/ceph@sha256:15b15fb7a708970f1b734285ac08aef45dcd76e86866af37412d041e00853743
> Apr 30 12:02:40 cube.robeckert.us conmon[41474]: debug -56>
> 2021-04-30T16:02:38.701+0000 7f5d21332700 10 set_mon_vals
> log_to_syslog = true
> Apr 30 12:02:40 cube.robeckert.us conmon[41474]: debug -55>
> 2021-04-30T16:02:38.701+0000 7f5d21332700 10 set_mon_vals
> mon_data_avail_warn = 10
> Apr 30 12:02:40 cube.robeckert.us conmon[41474]: debug -54>
> 2021-04-30T16:02:38.701+0000 7f5d21332700 10 set_mon_vals
> mon_warn_on_insecure_global_id_reclaim_allowed = true
> Apr 30 12:02:40 cube.robeckert.us conmon[41474]: debug -53>
> 2021-04-30T16:02:38.701+0000 7f5d21332700 4 set_mon_vals no
> callback set
> Apr 30 12:02:40 cube.robeckert.us conmon[41474]: debug -52>
> 2021-04-30T16:02:38.702+0000 7f5d21332700 2 auth: KeyRing::load:
> loaded key file /var/lib/ceph/mon/ceph-cube/keyring
> Apr 30 12:02:40 cube.robeckert.us conmon[41474]: debug -51>
> 2021-04-30T16:02:38.702+0000 7f5d1095b700 3 rocksdb:
> [db_impl/db_impl_compaction_flush.cc:2808] Compaction error:
> Corruption: block checksum mismatch: expected 395334538, got
> 4289108204 in /var/lib/ceph/mon/ceph-cube/store.db/073501.sst
> offset 36769734 size 84730
> Apr 30 12:02:40 cube.robeckert.us conmon[41474]: debug -50>
> 2021-04-30T16:02:38.702+0000 7f5d21332700 5 asok(0x56327d226000)
> register_command compact hook 0x56327e028700
> Apr 30 12:02:40 cube.robeckert.us conmon[41474]: debug -49>
> 2021-04-30T16:02:38.702+0000 7f5d1095b700 4 rocksdb: (Original Log
> Time 2021/04/30-16:02:38.703267) [compaction/compaction_job.cc:760]
> [default] compacted to: base level 6 level multiplier 10.00 max
> bytes base 268435456 files[5 0 0 0 0 0 2] max score 0.00, MB/sec:
> 11035.6 rd, 0.0 wr, level 6, files in(5, 2) out(1) MB in(32.1,
> 126.7) out(0.0), read-write-amplify(5.0) write-amplify(0.0)
> Corruption: block checksum mismatch: expected 395334538, got
> 4289108204 in /var/lib/ceph/mon/ceph-cube/store.db/073501.sst
> offset 36769734 size 84730, records in: 7670, records dropped: 6759
> output_compres
> Apr 30 12:02:40 cube.robeckert.us conmon[41474]: debug -48>
> 2021-04-30T16:02:38.702+0000 7f5d1095b700 4 rocksdb: (Original Log
> Time 2021/04/30-16:02:38.703283) EVENT_LOG_v1 {"time_micros":
> 1619798558703277, "job": 3, "event": "compaction_finished",
> "compaction_time_micros": 15085, "compaction_time_cpu_micros":
> 11937, "output_level": 6, "num_output_files": 1,
> "total_output_size": 12627499, "num_input_records": 7670,
> "num_output_records": 911, "num_subcompactions": 1,
> "output_compression": "NoCompression",
> "num_single_delete_mismatches": 0, "num_single_delete_fallthrough":
> 0, "lsm_state": [5, 0, 0, 0, 0, 0, 2]}
> Apr 30 12:02:40 cube.robeckert.us conmon[41474]: debug -47>
> 2021-04-30T16:02:38.702+0000 7f5d1095b700 2 rocksdb:
> [db_impl/db_impl_compaction_flush.cc:2344] Waiting after background
> compaction error: Corruption: block checksum mismatch: expected
> 395334538, got 4289108204 in
> /var/lib/ceph/mon/ceph-cube/store.db/073501.sst offset 36769734
> size 84730, Accumulated background error counts: 1
> Apr 30 12:02:40 cube.robeckert.us conmon[41474]: debug -46>
> 2021-04-30T16:02:38.702+0000 7f5d21332700 5 asok(0x56327d226000)
> register_command smart hook 0x56327e028700
>
>
> This is running the latest pacific container, but I was seeing the
> same issue in octopus.
>
> The container runs under podman on RHEL 8, and
> /var/lib/ceph/mon/ceph-cube is mapped to
> /var/lib/ceph/fe3a7cb0-69ca-11eb-8d45-c86000d08867/mon.cube.service
> on the NVMe boot drive, which has plenty of space.
>
> To recover I run a script that will stop the monitor on another
> host, copy the store.db directory then start up, and it syncs right
> up.
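> Roughly, the script does something like this (hostnames and paths here are
> illustrative, not the exact script):
>
>   FSID=fe3a7cb0-69ca-11eb-8d45-c86000d08867
>   # stop the broken mon and a healthy donor mon
>   systemctl stop ceph-$FSID@mon.cube.service
>   ssh story systemctl stop ceph-$FSID@mon.story.service
>   # replace the corrupted store.db with a copy from the donor
>   rm -rf /var/lib/ceph/$FSID/mon.cube/store.db
>   scp -r root@story:/var/lib/ceph/$FSID/mon.story/store.db /var/lib/ceph/$FSID/mon.cube/
>   # start both again; mon.cube then resyncs from the quorum
>   ssh story systemctl start ceph-$FSID@mon.story.service
>   systemctl start ceph-$FSID@mon.cube.service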
>
>
>
> Thanks,
> Rob
>
>
>
>
>
> -----Original Message-----
> From: Sebastian Wagner <sewagner(a)redhat.com>
> Sent: Thursday, April 29, 2021 7:44 AM
> To: Eugen Block <eblock(a)nde.ag>; ceph-users(a)ceph.io
> Subject: [ceph-users] Re: one of 3 monitors keeps going down
>
> Right, here are the docs for that workflow:
>
> https://docs.ceph.com/en/latest/cephadm/mon/#mon-service
>
> On 29.04.21 at 13:13, Eugen Block wrote:
>> Hi,
>>
>> instead of copying MON data to this one did you also try to redeploy
>> the MON container entirely so it gets a fresh start?
>>
>>
>> Quoting "Robert W. Eckert" <rob(a)rob.eckert.name>:
>>
>>> Hi,
>>> On a daily basis, one of my monitors goes down
>>>
>>> [root@cube ~]# ceph health detail
>>> HEALTH_WARN 1 failed cephadm daemon(s); 1/3 mons down, quorum
>>> rhel1.robeckert.us,story [WRN] CEPHADM_FAILED_DAEMON: 1 failed
>>> cephadm daemon(s)
>>> daemon mon.cube on cube.robeckert.us is in error state [WRN]
>>> MON_DOWN: 1/3 mons down, quorum rhel1.robeckert.us,story
>>> mon.cube (rank 2) addr
>>> [v2:192.168.2.142:3300/0,v1:192.168.2.142:6789/0] is down (out of
>>> quorum) [root@cube ~]# ceph --version ceph version 15.2.11
>>> (e3523634d9c2227df9af89a4eac33d16738c49cb)
>>> octopus (stable)
>>>
>>> I have a script that will copy the mon data from another server and
>>> it restarts and runs well for a while.
>>>
>>> It is always the same monitor, and when I look at the logs the only
>>> thing I really see is the cephadm log showing it down
>>>
>>> 2021-04-28 10:07:26,173 DEBUG Running command: /usr/bin/podman
>>> --version
>>> 2021-04-28 10:07:26,217 DEBUG /usr/bin/podman: stdout podman version
>>> 2.2.1
>>> 2021-04-28 10:07:26,222 DEBUG Running command: /usr/bin/podman
>>> inspect --format
>>> {{.Id}},{{.Config.Image}},{{.Image}},{{.Created}},{{index
>>> .Config.Labels "io.ceph.version"}}
>>> ceph-fe3a7cb0-69ca-11eb-8d45-c86000d08867-osd.2
>>> 2021-04-28 10:07:26,326 DEBUG /usr/bin/podman: stdout
>>> fab17e5242eb4875e266df19ca89b596a2f2b1d470273a99ff71da2ae81eeb3c,docker.io/ceph/ceph:v15,5b724076c58f97872fc2f7701e8405ec809047d71528f79da452188daf2af72e,2021-04-26 17:13:15.54183375 -0400 EDT,
>>> 2021-04-28 10:07:26,328 DEBUG Running command: systemctl is-enabled
>>> ceph-fe3a7cb0-69ca-11eb-8d45-c86000d08867(a)mon.cube
>>>
>>> 2021-04-28 10:07:26,334 DEBUG systemctl: stdout enabled
>>> 2021-04-28 10:07:26,335 DEBUG Running command: systemctl is-active
>>> ceph-fe3a7cb0-69ca-11eb-8d45-c86000d08867(a)mon.cube
>>>
>>> 2021-04-28 10:07:26,340 DEBUG systemctl: stdout failed
>>> 2021-04-28 10:07:26,340 DEBUG Running command: /usr/bin/podman
>>> --version
>>> 2021-04-28 10:07:26,395 DEBUG /usr/bin/podman: stdout podman version
>>> 2.2.1
>>> 2021-04-28 10:07:26,402 DEBUG Running command: /usr/bin/podman
>>> inspect --format
>>> {{.Id}},{{.Config.Image}},{{.Image}},{{.Created}},{{index
>>> .Config.Labels "io.ceph.version"}}
>>> ceph-fe3a7cb0-69ca-11eb-8d45-c86000d08867-mon.cube
>>> 2021-04-28 10:07:26,526 DEBUG /usr/bin/podman: stdout
>>> 04e7c673cbacf5160427b0c3eb2f0948b2f15d02c58bd1d9dd14f975a84cfc6f,docker.io/ceph/ceph:v15,5b724076c58f97872fc2f7701e8405ec809047d71528f79da452188daf2af72e,2021-04-28 08:54:57.614847512 -0400 EDT,
>>>
>>> I don't know if it matters, but this server is an AMD 3600XT while
>>> my other two servers which have had no issues are intel based.
>>>
>>> The root file system was originally on a SSD, and I switched to NVME,
>>> so I eliminated controller or drive issues. (I didn't see anything
>>> in dmesg anyway)
>>>
>>> If someone could point me in the right direction on where to
>>> troubleshoot next, I would appreciate it.
>>>
>>> Thanks,
>>> Rob Eckert
>>> _______________________________________________
>>> ceph-users mailing list -- ceph-users(a)ceph.io To unsubscribe send an
>>> email to ceph-users-leave(a)ceph.io
>>
>>
>> _______________________________________________
>> ceph-users mailing list -- ceph-users(a)ceph.io To unsubscribe send an
>> email to ceph-users-leave(a)ceph.io
>>
> _______________________________________________
> ceph-users mailing list -- ceph-users(a)ceph.io To unsubscribe send an
> email to ceph-users-leave(a)ceph.io
Hello folks,
I am new to ceph and at the moment I am doing some performance tests with a 4 node ceph-cluster (pacific, 16.2.1).
Node hardware (4 identical nodes):
* DELL 3620 workstation
* Intel Quad-Core i7-6700 @ 3.4 GHz
* 8 GB RAM
* Debian Buster (base system, installed on a dedicated Patriot Burst 120 GB SATA SSD)
* HP 530SFP+ 10 GBit dual-port NIC (tested with iperf at 9.4 GBit/s from node to node)
* 1 x Kingston KC2500 M.2 NVMe PCIe SSD (500 GB, NO power loss protection!)
* 3 x Seagate Barracuda SATA disk drives (7200 rpm, 500 GB)
After bootstrapping a containerized (Docker) Ceph cluster, I did some performance tests on the NVMe storage by creating a storage pool called "ssdpool", backed by 4 OSDs on the (single) NVMe device of each node. A first write-performance test yields
=============
root@ceph1:~# rados bench -p ssdpool 10 write -b 4M -t 16 --no-cleanup
hints = 1
Maintaining 16 concurrent writes of 4194304 bytes to objects of size 4194304 for up to 10 seconds or 0 objects
Object prefix: benchmark_data_ceph1_78
sec Cur ops started finished avg MB/s cur MB/s last lat(s) avg lat(s)
0 0 0 0 0 0 - 0
1 16 30 14 55.997 56 0.0209977 0.493427
2 16 53 37 73.9903 92 0.0264305 0.692179
3 16 76 60 79.9871 92 0.559505 0.664204
4 16 99 83 82.9879 92 0.609332 0.721016
5 16 116 100 79.9889 68 0.686093 0.698084
6 16 132 116 77.3224 64 1.19715 0.731808
7 16 153 137 78.2741 84 0.622646 0.755812
8 16 171 155 77.486 72 0.25409 0.764022
9 16 192 176 78.2076 84 0.968321 0.775292
10 16 214 198 79.1856 88 0.401339 0.766764
11 1 214 213 77.4408 60 0.969693 0.784002
Total time run: 11.0698
Total writes made: 214
Write size: 4194304
Object size: 4194304
Bandwidth (MB/sec): 77.3272
Stddev Bandwidth: 13.7722
Max bandwidth (MB/sec): 92
Min bandwidth (MB/sec): 56
Average IOPS: 19
Stddev IOPS: 3.44304
Max IOPS: 23
Min IOPS: 14
Average Latency(s): 0.785372
Stddev Latency(s): 0.49011
Max latency(s): 2.16532
Min latency(s): 0.0144995
=============
... and I think that 80 MB/s throughput is a very poor result in conjunction with NVMe devices and 10 GBit NICs.
A bare write test (with the fsync=0 option) of the NVMe drives yields a write throughput of roughly 800 MB/s per device ... the second test (with fsync=1) drops performance to 200 MB/s.
=============
root@ceph1:/home/mschmid# fio --rw=randwrite --name=IOPS-write --bs=1024k --direct=1 --filename=/dev/nvme0n1 --numjobs=4 --ioengine=libaio --iodepth=32 --refill_buffers --group_reporting --runtime=30 --time_based --fsync=0
IOPS-write: (g=0): rw=randwrite, bs=(R) 1024KiB-1024KiB, (W) 1024KiB-1024KiB, (T) 1024KiB-1024KiB, ioengine=libaio, iodepth=32...
fio-3.12
Starting 4 processes
Jobs: 4 (f=4): [w(4)][100.0%][w=723MiB/s][w=722 IOPS][eta 00m:00s]
IOPS-write: (groupid=0, jobs=4): err= 0: pid=31585: Thu Apr 29 15:15:03 2021
write: IOPS=740, BW=740MiB/s (776MB/s)(21.8GiB/30206msec); 0 zone resets
slat (usec): min=16, max=810, avg=106.48, stdev=30.48
clat (msec): min=7, max=1110, avg=172.09, stdev=120.18
lat (msec): min=7, max=1110, avg=172.19, stdev=120.18
clat percentiles (msec):
| 1.00th=[ 32], 5.00th=[ 48], 10.00th=[ 53], 20.00th=[ 63],
| 30.00th=[ 115], 40.00th=[ 161], 50.00th=[ 169], 60.00th=[ 178],
| 70.00th=[ 190], 80.00th=[ 220], 90.00th=[ 264], 95.00th=[ 368],
| 99.00th=[ 667], 99.50th=[ 751], 99.90th=[ 894], 99.95th=[ 986],
| 99.99th=[ 1036]
bw ( KiB/s): min=22528, max=639744, per=25.02%, avg=189649.94, stdev=113845.69, samples=240
iops : min= 22, max= 624, avg=185.11, stdev=111.18, samples=240
lat (msec) : 10=0.01%, 20=0.19%, 50=6.43%, 100=20.29%, 250=61.52%
lat (msec) : 500=8.21%, 750=2.85%, 1000=0.47%
cpu : usr=11.87%, sys=2.05%, ctx=13141, majf=0, minf=45
IO depths : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.3%, 32=99.4%, >=64=0.0%
submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.1%, 64=0.0%, >=64=0.0%
issued rwts: total=0,22359,0,0 short=0,0,0,0 dropped=0,0,0,0
latency : target=0, window=0, percentile=100.00%, depth=32
Run status group 0 (all jobs):
WRITE: bw=740MiB/s (776MB/s), 740MiB/s-740MiB/s (776MB/s-776MB/s), io=21.8GiB (23.4GB), run=30206-30206msec
Disk stats (read/write):
nvme0n1: ios=0/89150, merge=0/0, ticks=0/15065724, in_queue=15118720, util=99.75%
=============
Furthermore, an IOPS test on the NVMe device with block size 4k shows roughly 1000 IOPS with fsync=1 and 35000 IOPS with fsync=0.
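The 4k tests were of the same form as the fio run above, roughly like this (reconstructed from memory, so treat the exact flags as approximate):
=============
fio --rw=randwrite --name=IOPS-write-4k --bs=4k --direct=1 --filename=/dev/nvme0n1 --numjobs=4 --ioengine=libaio --iodepth=32 --refill_buffers --group_reporting --runtime=30 --time_based --fsync=1
=============
(and the same command with --fsync=0 for the second number)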
To my question: as CPU and network load seem to be low during my tests, I would like to know which bottleneck can cause such a huge performance drop between the bare hardware performance of the NVMe drives and the write speeds in the rados benchmark. Could the missing power loss protection (fsync=1) be the problem, or what throughput should one expect to be normal in such a setup?
Thanks for every advice!
Best regards,
Michael
Dear gents,
to get familiar with the cephadm upgrade path and with cephadm in general (we
heavily use old-style "ceph-deploy" Octopus-based production clusters), we
decided to do some tests with a vanilla cluster running 15.2.11 on CentOS 8 on
top of vSphere. Deployment of the Octopus cluster went very well and we are
excited about this new technique and all the possibilities. No errors, no
clues... :-)
Unfortunately the upgrade to Pacific (16.2.0 or 16.2.1) fails every time, with
either the original Docker images or quay.ceph.io/ceph-ci/ceph:pacific. We use
a small setup (3 mons, 2 mgrs, some OSDs). This is the upgrade behaviour:
The upgrade of both MGRs seems to be OK, but we get this:
2021-04-29T15:35:19.903111+0200 mgr.c0n00.vnxaqu [DBG] daemon
mgr.c0n00.vnxaqu container digest correct
2021-04-29T15:35:19.903206+0200 mgr.c0n00.vnxaqu [DBG] daemon
mgr.c0n00.vnxaqu deployed by correct version
2021-04-29T15:35:19.903298+0200 mgr.c0n00.vnxaqu [DBG] daemon
mgr.c0n01.gstlmw container digest correct
2021-04-29T15:35:19.903378+0200 mgr.c0n00.vnxaqu [DBG] daemon
mgr.c0n01.gstlmw *not deployed by correct version*
After this the upgrade process gets stuck completely, although we still have a
running cluster (minus one monitor daemon):
[root@c0n00 ~]# ceph -s
cluster:
id: 5541c866-a8fe-11eb-b604-005056b8f1bf
health: HEALTH_WARN
* 3 hosts fail cephadm check*
services:
mon: 2 daemons, quorum c0n00,c0n02 (age 68m)
mgr: c0n00.bmtvpr(active, since 68m), standbys: c0n01.jwfuca
osd: 4 osds: 4 up (since 63m), 4 in (since 62m)
[..]
progress:
Upgrade to 16.2.1-257-g717ce59b (0s)
[=...........................]
{
"target_image": "
quay.ceph.io/ceph-ci/ceph@sha256:d0f624287378fe63fc4c30bccc9f82bfe0e42e62381c0a3d0d3d86d985f5d788",
"in_progress": true,
"services_complete": [
"mgr"
],
"progress": "2/19 ceph daemons upgraded",
"message": "Error: UPGRADE_EXCEPTION: Upgrade: failed due to an
unexpected exception"
}
[root@c0n00 ~]# ceph orch ps
NAME HOST PORTS STATUS REFRESHED AGE
VERSION IMAGE ID CONTAINER ID
alertmanager.c0n00 c0n00 running (56m) 4m ago 16h
0.20.0 0881eb8f169f 30d9eff06ce2
crash.c0n00 c0n00 running (56m) 4m ago 16h
15.2.11 9d01da634b8f 91d3e4d0e14d
crash.c0n01 c0n01 host is offline 16h ago 16h
15.2.11 9d01da634b8f 0ff4a20021df
crash.c0n02 c0n02 host is offline 16h ago 16h
15.2.11 9d01da634b8f 0253e6bb29a0
crash.c0n03 c0n03 host is offline 16h ago 16h
15.2.11 9d01da634b8f 291ce4f8b854
grafana.c0n00 c0n00 running (56m) 4m ago 16h
6.7.4 80728b29ad3f 46d77b695da5
mgr.c0n00.bmtvpr c0n00 *:8443,9283 running (56m) 4m ago 16h
16.2.1-257-g717ce59b 3be927f015dd 94a7008ccb4f
mgr.c0n01.jwfuca c0n01 host is offline 16h ago 16h
16.2.1-257-g717ce59b 3be927f015dd 766ada65efa9
mon.c0n00 c0n00 running (56m) 4m ago 16h
15.2.11 9d01da634b8f b9f270cd99e2
mon.c0n02 c0n02 host is offline 16h ago 16h
15.2.11 9d01da634b8f a90c21bfd49e
node-exporter.c0n00 c0n00 running (56m) 4m ago 16h
0.18.1 e5a616e4b9cf eb1306811c6c
node-exporter.c0n01 c0n01 host is offline 16h ago 16h
0.18.1 e5a616e4b9cf 093a72542d3e
node-exporter.c0n02 c0n02 host is offline 16h ago 16h
0.18.1 e5a616e4b9cf 785531f5d6cf
node-exporter.c0n03 c0n03 host is offline 16h ago 16h
0.18.1 e5a616e4b9cf 074fac77e17c
osd.0 c0n02 host is offline 16h ago 16h
15.2.11 9d01da634b8f c075bd047c0a
osd.1 c0n01 host is offline 16h ago 16h
15.2.11 9d01da634b8f 616aeda28504
osd.2 c0n03 host is offline 16h ago 16h
15.2.11 9d01da634b8f b36453730c83
osd.3 c0n00 running (56m) 4m ago 16h
15.2.11 9d01da634b8f e043abf53206
prometheus.c0n00 c0n00 running (56m) 4m ago 16h
2.18.1 de242295e225 7cb50c04e26a
After some digging into the daemon logs we found tracebacks (please see below).
We also noticed that we can successfully reach each host via ssh -F .... !!!
We've done tcpdumps while upgrading and every SYN gets its SYN-ACK... ;-)
Because we get no errors when deploying a fresh Octopus cluster with cephadm
(from https://github.com/ceph/ceph/raw/octopus/src/cephadm/cephadm, and cephadm
prepare-host is always OK), could it be a missing Python lib or something that
cephadm itself doesn't check?
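In case it helps, these are the checks we plan to run next (just what we assume
cephadm relies on, so please correct us if there is a better way):

  # from the active mgr host
  ceph cephadm check-host c0n02
  ceph health detail

  # on the affected host itself
  cephadm check-host
  python3 --version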
Thank you for any hint.
Christoph Ackermann
Traceback:
Traceback (most recent call last):
File "/lib/python3.6/site-packages/execnet/gateway_bootstrap.py", line
48, in bootstrap_exec
s = io.read(1)
File "/lib/python3.6/site-packages/execnet/gateway_base.py", line 402,
in read
raise EOFError("expected %d bytes, got %d" % (numbytes, len(buf)))
EOFError: expected 1 bytes, got 0
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/usr/share/ceph/mgr/cephadm/serve.py", line 1166, in
_remote_connection
conn, connr = self.mgr._get_connection(addr)
File "/usr/share/ceph/mgr/cephadm/module.py", line 1202, in
_get_connection
sudo=True if self.ssh_user != 'root' else False)
File "/lib/python3.6/site-packages/remoto/backends/__init__.py", line
34, in __init__
self.gateway = self._make_gateway(hostname)
File "/lib/python3.6/site-packages/remoto/backends/__init__.py", line
44, in _make_gateway
self._make_connection_string(hostname)
File "/lib/python3.6/site-packages/execnet/multi.py", line 134, in
makegateway
gw = gateway_bootstrap.bootstrap(io, spec)
File "/lib/python3.6/site-packages/execnet/gateway_bootstrap.py", line
102, in bootstrap
bootstrap_exec(io, spec)
File "/lib/python3.6/site-packages/execnet/gateway_bootstrap.py", line
53, in bootstrap_exec
raise HostNotFound(io.remoteaddress)
execnet.gateway_bootstrap.HostNotFound: -F /tmp/cephadm-conf-61otabz_ -i
/tmp/cephadm-identity-rt2nm0t4 root@c0n02
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "/usr/share/ceph/mgr/cephadm/utils.py", line 73, in do_work
return f(*arg)
File "/usr/share/ceph/mgr/cephadm/services/osd.py", line 60, in
create_from_spec_one
replace_osd_ids=osd_id_claims.get(host, []), env_vars=env_vars
File "/usr/share/ceph/mgr/cephadm/services/osd.py", line 75, in
create_single_host
out, err, code = self._run_ceph_volume_command(host, cmd,
env_vars=env_vars)
File "/usr/share/ceph/mgr/cephadm/services/osd.py", line 295, in
_run_ceph_volume_command
error_ok=True)
File "/usr/share/ceph/mgr/cephadm/serve.py", line 1003, in _run_cephadm
with self._remote_connection(host, addr) as tpl:
File "/lib64/python3.6/contextlib.py", line 81, in __enter__
return next(self.gen)
File "/usr/share/ceph/mgr/cephadm/serve.py", line 1197, in
_remote_connection
raise OrchestratorError(msg) from e
orchestrator._interface.OrchestratorError: Failed to connect to c0n02
(c0n02).
Please make sure that the host is reachable and accepts connections using
the cephadm SSH key
To add the cephadm SSH key to the host:
> ceph cephadm get-pub-key > ~/ceph.pub
> ssh-copy-id -f -i ~/ceph.pub root@c0n02
To check that the host is reachable:
> ceph cephadm get-ssh-config > ssh_config
> ceph config-key get mgr/cephadm/ssh_identity_key > ~/cephadm_private_key
> chmod 0600 ~/cephadm_private_key
> ssh -F ssh_config -i ~/cephadm_private_key root@c0n02
Hello,
I upgraded my Octopus test cluster, which has 5 hosts, because one of the nodes (a mon/mgr node) was still on version 15.2.10 while all the others were on 15.2.11.
For the upgrade I used the following command:
ceph orch upgrade start --ceph-version 15.2.11
The upgrade worked correctly and I did not see any errors in the logs, but the host version in the Ceph dashboard (under the navigation Cluster -> Hosts) still shows 15.2.10 for that specific node.
The output of "ceph versions", shows that every component is on 15.2.11 as you can see below:
{
"mon": {
"ceph version 15.2.11 (e3523634d9c2227df9af89a4eac33d16738c49cb) octopus (stable)": 3
},
"mgr": {
"ceph version 15.2.11 (e3523634d9c2227df9af89a4eac33d16738c49cb) octopus (stable)": 2
},
"osd": {
"ceph version 15.2.11 (e3523634d9c2227df9af89a4eac33d16738c49cb) octopus (stable)": 2
},
"mds": {},
"overall": {
"ceph version 15.2.11 (e3523634d9c2227df9af89a4eac33d16738c49cb) octopus (stable)": 7
}
}
So why is it still stuck on 15.2.10 in the dashboard?
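Or, asked differently: is there a way to force a refresh of the host metadata?
I was thinking of something along these lines, but that is just a guess on my
side:

  ceph orch ps --refresh
  ceph mgr fail <active-mgr>   # <active-mgr> being a placeholder for the active mgr name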
Best regards,
Mabi
Good thought. The storage for the monitor data is a RAID-0 over three
NVMe devices. Watching iostat, they are completely idle, maybe 0.8% to
1.4% for a second every minute or so.
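For completeness, this is roughly what I'm watching and what I plan to try next
(the manual compaction is just an idea at this point, not something I've
confirmed helps):

  iostat -x 1                                # per-device utilization on the mon host
  du -sh /var/lib/ceph/mon/ceph-*/store.db   # mon DB size
  ceph tell mon.<id> compact                 # trigger a manual rocksdb compaction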
----------------
Robert LeBlanc
PGP Fingerprint 79A2 9CA4 6CC4 45DD A904 C70E E654 3BB2 FA62 B9F1
On Thu, Apr 8, 2021 at 7:48 PM Zizon Qiu <zzdtsv(a)gmail.com> wrote:
>
> Could it be related to some kind of disk issue on the node that mon is located on, which may occasionally
> slow down IO and, in turn, rocksdb?
>
>
> On Fri, Apr 9, 2021 at 4:29 AM Robert LeBlanc <robert(a)leblancnet.us> wrote:
>>
>> I found this thread that matches a lot of what I'm seeing. I see the
>> ms_dispatch thread going to 100%, but I'm at a single MON, the
>> recovery is done and the rocksdb MON database is ~300MB. I've tried
>> all the settings mentioned in that thread with no noticeable
>> improvement. I was hoping that once the recovery was done (backfills
>> to reformatted OSDs) that it would clear up, but not yet. So any other
>> ideas would be really helpful. Our MDS is functioning, but stalls a
>> lot because the mons miss heartbeats.
>>
>> mon_compact_on_start = true
>> rocksdb_cache_size = 1342177280
>> mon_lease = 30
>> mon_osd_cache_size = 200000
>> mon_sync_max_payload_size = 4096
>>
>> ----------------
>> Robert LeBlanc
>> PGP Fingerprint 79A2 9CA4 6CC4 45DD A904 C70E E654 3BB2 FA62 B9F1
>>
>> On Thu, Apr 8, 2021 at 1:11 PM Stefan Kooman <stefan(a)bit.nl> wrote:
>> >
>> > On 4/8/21 6:22 PM, Robert LeBlanc wrote:
>> > > I upgraded our Luminous cluster to Nautilus a couple of weeks ago and
>> > > converted the last batch of FileStore OSDs to BlueStore about 36 hours
>> > > ago. Yesterday our monitor cluster went nuts and started constantly
>> > > calling elections because monitor nodes were at 100% and wouldn't
>> > > respond to heartbeats. I reduced the monitor cluster to one to prevent
>> > > the constant elections and that let the system limp along until the
>> > > backfills finished. There are large amounts of time where ceph commands
>> > > hang with the CPU is at 100%, when the CPU drops I see a lot of work
>> > > getting done in the monitor logs which stops as soon as the CPU is at
>> > > 100% again.
>> >
>> >
>> > Try reducing mon_sync_max_payload_size=4096. I have seen Frank Schilder
>> > advise this several times because of monitor issues. Also recently for a
>> > cluster that got upgraded from Luminous -> Mimic -> Nautilus.
>> >
>> > Worth a shot.
>> >
>> > Otherwise I'll try to look in depth and see if I can come up with
>> > something smart (for now I need to go catch some sleep).
>> >
>> > Gr. Stefan
ceph pool size 1 (for temporary and expendable data) still using 2X storage?
Hey Ceph Users!
With all the buzz around chia coin, I want to dedicate a few TB to
storage mining, really just to play with the chia CLI tool, and learn
how it all works.
As the whole concept is about dedicating disk space to large
calculation outputs, the data is meaningless.
For this reason, I am hoping to use a pool with size 1 and min_size 1,
and I did set one up.
However, as a proxmox user, I noticed that this pool appears to still
use 2X storage space, or at a minimum, the pool's maximum size is
limited to 50% of total storage space (not that I plan on maxing out
my storage for this.)
I suspect there is a novice-user failsafe which ensures foolishly
configured size=1 is automatically treated as size=2...
Can anyone point me towards how best to leverage my ceph cluster to
store expendable data at size=1 without wasting x2 actual disk space?
My cluster is perfectly balanced, so I am reluctant to pull an osd
out, generally don't have any other disks on hand, and don't plan to
spend money on additional storage for this endeavour. I do want to
ensure I am not wasting more space than I am expecting though.
(Small hobby cluster, if it matters)
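For reference, this is roughly how I set the pool up and how I'm reading the
numbers (the pool name is made up for this example; I believe the
mon_allow_pool_size_one / --yes-i-really-mean-it parts are only needed on
Octopus and newer):

  ceph config set global mon_allow_pool_size_one true
  ceph osd pool create chia 64 64
  ceph osd pool set chia size 1 --yes-i-really-mean-it
  ceph osd pool set chia min_size 1
  ceph osd pool get chia size   # reports size: 1
  ceph df detail                # to cross-check what the Proxmox UI shows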
Josh
Hi Reed,
Thank you so much for the input and support. We have tried using the setting
you suggested, but could not see any impact on the current system:
*"ceph fs set cephfs allow_standby_replay true"* did not create any *impact
on the failover time*.
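If it helps, I assume the effect of the flag can be verified with something
like the following (one MDS should show up in the standby-replay state);
please correct me if that check is wrong:

  ceph fs get cephfs | grep standby
  ceph fs status cephfs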
Furthermore, we have tried some more scenarios in our test setup:
*Scenario 1:*
[image: image.png]
- In this scenario we looked at the logs on the new node that the MDS
fails over to, i.e. in this case, if we reboot cephnode2, the new active
MDS will be cephnode1. We checked the logs on cephnode1 in two cases:
- 1. *Normal reboot of cephnode2 while keeping the I/O operation in
progress:*
- We see that logging at cephnode1 starts immediately but then waits for
some time (around 15 seconds, apparently a beacon timeout) plus an
additional 6-7 seconds, during which the MDS on cephnode1 becomes active
and I/O resumes. Refer to the logs:
- 2021-04-29T15:49:42.480+0530 7fa747690700 1 mds.cephnode1 Updating
MDS map to version 505 from mon.2
2021-04-29T15:49:42.482+0530 7fa747690700 1 mds.0.505 handle_mds_map
i am now mds.0.505
2021-04-29T15:49:42.482+0530 7fa747690700 1 mds.0.505 handle_mds_map
state change up:boot --> up:replay
2021-04-29T15:49:42.482+0530 7fa747690700 1 mds.0.505 replay_start
2021-04-29T15:49:42.482+0530 7fa747690700 1 mds.0.505 recovery set
is
2021-04-29T15:49:42.482+0530 7fa747690700 1 mds.0.505 waiting for
osdmap 486 (which blacklists prior instance)
2021-04-29T15:49:55.686+0530 7fa74568c700 1 mds.beacon.cephnode1 MDS
connection to Monitors appears to be laggy; 15.9769s since last
acked beacon
2021-04-29T15:49:55.686+0530 7fa74568c700 1 mds.0.505 skipping
upkeep work because connection to Monitors appears laggy
2021-04-29T15:49:57.533+0530 7fa749e95700 0 mds.beacon.cephnode1
MDS is no longer laggy
2021-04-29T15:49:59.599+0530 7fa740e83700 0 mds.0.cache creating
system inode with ino:0x100
2021-04-29T15:49:59.599+0530 7fa740e83700 0 mds.0.cache creating
system inode with ino:0x1
2021-04-29T15:50:00.456+0530 7fa73f680700 1 mds.0.505 Finished
replaying journal
2021-04-29T15:50:00.456+0530 7fa73f680700 1 mds.0.505 making mds
journal writeable
2021-04-29T15:50:00.959+0530 7fa747690700 1 mds.cephnode1 Updating
MDS map to version 506 from mon.2
2021-04-29T15:50:00.959+0530 7fa747690700 1 mds.0.505 handle_mds_map
i am now mds.0.505
2021-04-29T15:50:00.959+0530 7fa747690700 1 mds.0.505 handle_mds_map
state change up:replay --> up:reconnect
2021-04-29T15:50:00.959+0530 7fa747690700 1 mds.0.505 reconnect_start
2021-04-29T15:50:00.959+0530 7fa747690700 1 mds.0.505 reopen_log
2021-04-29T15:50:00.959+0530 7fa747690700 1 mds.0.server
reconnect_clients -- 2 sessions
2021-04-29T15:50:00.964+0530 7fa747690700 0 log_channel(cluster) log
[DBG] : reconnect by client.6892 v1:10.0.4.96:0/1646469259 after
0.00499997
2021-04-29T15:50:00.972+0530 7fa747690700 0 log_channel(cluster) log
[DBG] : reconnect by client.6990 v1:10.0.4.115:0/2776266880 after
0.0129999
2021-04-29T15:50:00.972+0530 7fa747690700 1 mds.0.505 reconnect_done
2021-04-29T15:50:02.005+0530 7fa747690700 1 mds.cephnode1 Updating
MDS map to version 507 from mon.2
2021-04-29T15:50:02.005+0530 7fa747690700 1 mds.0.505 handle_mds_map
i am now mds.0.505
2021-04-29T15:50:02.005+0530 7fa747690700 1 mds.0.505 handle_mds_map
state change up:reconnect --> up:rejoin
2021-04-29T15:50:02.005+0530 7fa747690700 1 mds.0.505 rejoin_start
2021-04-29T15:50:02.008+0530 7fa747690700 1 mds.0.505
rejoin_joint_start
2021-04-29T15:50:02.040+0530 7fa740e83700 1 mds.0.505 rejoin_done
2021-04-29T15:50:03.050+0530 7fa747690700 1 mds.cephnode1 Updating
MDS map to version 508 from mon.2
2021-04-29T15:50:03.050+0530 7fa747690700 1 mds.0.505 handle_mds_map
i am now mds.0.505
2021-04-29T15:50:03.050+0530 7fa747690700 1 mds.0.505 handle_mds_map
state change up:rejoin --> up:clientreplay
2021-04-29T15:50:03.050+0530 7fa747690700 1 mds.0.505 recovery_done
-- successful recovery!
2021-04-29T15:50:03.050+0530 7fa747690700 1 mds.0.505
clientreplay_start
2021-04-29T15:50:03.094+0530 7fa740e83700 1 mds.0.505
clientreplay_done
2021-04-29T15:50:04.081+0530 7fa747690700 1 mds.cephnode1 Updating
MDS map to version 509 from mon.2
2021-04-29T15:50:04.081+0530 7fa747690700 1 mds.0.505 handle_mds_map
i am now mds.0.505
2021-04-29T15:50:04.081+0530 7fa747690700 1 mds.0.505 handle_mds_map
state change up:clientreplay --> up:active
2021-04-29T15:50:04.081+0530 7fa747690700 1 mds.0.505 active_start
2021-04-29T15:50:04.085+0530 7fa747690700 1 mds.0.505 cluster
recovered.
- 2. *Hard reset/power-off of cephnode2 while keeping the I/O operation in
progress:*
- In this case we see that the logs at cephnode1 (on which the new MDS
will be activated) only start appearing 15+ seconds after the power-off.
- Time at which the power-off was done: 2021-04-29-16-17-37
- *Time at which the logs started to show on cephnode1* (see the logs
below), i.e. logging started roughly 15 seconds after the hardware reset:
- 2021-04-29T16:17:51.983+0530 7f5ba3a38700 1 mds.cephnode1 Updating
MDS map to version 518 from mon.0
2021-04-29T16:17:51.984+0530 7f5ba3a38700 1 mds.0.518
handle_mds_map i am now mds.0.518
2021-04-29T16:17:51.984+0530 7f5ba3a38700 1 mds.0.518
handle_mds_map state change up:boot --> up:replay
2021-04-29T16:17:51.984+0530 7f5ba3a38700 1 mds.0.518
replay_start
2021-04-29T16:17:51.984+0530 7f5ba3a38700 1 mds.0.518
recovery set is
2021-04-29T16:17:51.984+0530 7f5ba3a38700 1 mds.0.518 waiting
for osdmap 504 (which blacklists prior instance)
2021-04-29T16:17:54.044+0530 7f5b9ca2a700 0 mds.0.cache
creating system inode with ino:0x100
2021-04-29T16:17:54.045+0530 7f5b9ca2a700 0 mds.0.cache
creating system inode with ino:0x1
2021-04-29T16:17:55.025+0530 7f5b9ba28700 1 mds.0.518 Finished
replaying journal
2021-04-29T16:17:55.025+0530 7f5b9ba28700 1 mds.0.518 making
mds journal writeable
2021-04-29T16:17:56.060+0530 7f5ba3a38700 1 mds.cephnode1
Updating MDS map to version 519 from mon.0
2021-04-29T16:17:56.060+0530 7f5ba3a38700 1 mds.0.518
handle_mds_map i am now mds.0.518
2021-04-29T16:17:56.060+0530 7f5ba3a38700 1 mds.0.518
handle_mds_map state change up:replay --> up:reconnect
2021-04-29T16:17:56.060+0530 7f5ba3a38700 1 mds.0.518
reconnect_start
2021-04-29T16:17:56.060+0530 7f5ba3a38700 1 mds.0.518
reopen_log
2021-04-29T16:17:56.060+0530 7f5ba3a38700 1 mds.0.server
reconnect_clients -- 2 sessions
2021-04-29T16:17:56.068+0530 7f5ba3a38700 0
log_channel(cluster) log [DBG] : reconnect by client.6990 v1:
10.0.4.115:0/2776266880 after 0.00799994
2021-04-29T16:17:56.069+0530 7f5ba3a38700 0
log_channel(cluster) log [DBG] : reconnect by client.6892 v1:
10.0.4.96:0/1646469259 after 0.00899994
2021-04-29T16:17:56.069+0530 7f5ba3a38700 1 mds.0.518
reconnect_done
2021-04-29T16:17:57.099+0530 7f5ba3a38700 1 mds.cephnode1
Updating MDS map to version 520 from mon.0
2021-04-29T16:17:57.099+0530 7f5ba3a38700 1 mds.0.518
handle_mds_map i am now mds.0.518
2021-04-29T16:17:57.099+0530 7f5ba3a38700 1 mds.0.518
handle_mds_map state change up:reconnect --> up:rejoin
2021-04-29T16:17:57.099+0530 7f5ba3a38700 1 mds.0.518
rejoin_start
2021-04-29T16:17:57.103+0530 7f5ba3a38700 1 mds.0.518
rejoin_joint_start
2021-04-29T16:17:57.472+0530 7f5b9d22b700 1 mds.0.518
rejoin_done
2021-04-29T16:17:58.138+0530 7f5ba3a38700 1 mds.cephnode1
Updating MDS map to version 521 from mon.0
2021-04-29T16:17:58.138+0530 7f5ba3a38700 1 mds.0.518
handle_mds_map i am now mds.0.518
2021-04-29T16:17:58.138+0530 7f5ba3a38700 1 mds.0.518
handle_mds_map state change up:rejoin --> up:clientreplay
2021-04-29T16:17:58.138+0530 7f5ba3a38700 1 mds.0.518
recovery_done -- successful recovery!
2021-04-29T16:17:58.138+0530 7f5ba3a38700 1 mds.0.518
clientreplay_start
2021-04-29T16:17:58.157+0530 7f5b9d22b700 1 mds.0.518
clientreplay_done
2021-04-29T16:17:59.178+0530 7f5ba3a38700 1 mds.cephnode1
Updating MDS map to version 522 from mon.0
2021-04-29T16:17:59.178+0530 7f5ba3a38700 1 mds.0.518
handle_mds_map i am now mds.0.518
2021-04-29T16:17:59.178+0530 7f5ba3a38700 1 mds.0.518
handle_mds_map state change up:clientreplay --> up:active
2021-04-29T16:17:59.178+0530 7f5ba3a38700 1 mds.0.518
active_start
2021-04-29T16:17:59.181+0530 7f5ba3a38700 1 mds.0.518 cluster
recovered.
*In both test cases above* we saw an extra delay of *around 15 seconds*
plus 8-10 seconds (a total of 21-25 seconds for failover in case of
power-off/reboot).
*Query:* Is there any specific config that can be tweaked/tried to reduce
the time it takes for the cluster to notice the failure and activate the
standby MDS node?
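Our own guess, for what it is worth: the ~15 seconds looks suspiciously close
to the default mds_beacon_grace of 15 seconds, so we assume something like the
following might reduce the detection time, but we have not tried it yet and do
not know the side effects:

  ceph config get mon mds_beacon_grace    # default 15, matching the delay we see
  ceph config set mon mds_beacon_grace 10
  ceph config set mds mds_beacon_grace 10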
*Scenario 2:*
- *Only stop MDS Daemon Service on Active Node*
- In this scenario, when we only stopped the systemctl service for the MDS
daemon on the active node, we got a very good *reading of around 5-7
seconds* for failover.
- Deployment Mode: 2 Node MDS Setup with max_mds=1
- CEPH MDS Setup: Active-Standby MDS
- Test Case: Active Node MDS Daemon stop
- I/O Resume Duration (Seconds): 5-7
- Node affected: cephnode1
Please suggest/advise whether we can change any configuration to achieve a
minimal failover duration in the first two scenarios.
Best Regards,
Lokendra
On Thu, Apr 29, 2021 at 1:47 AM Reed Dier <reed.dier(a)focusvq.com> wrote:
> I don't have anything of merit to add to this, but it would be an
> interesting addition to your testing to see if active+standby-replay makes
> any difference with test-case1.
>
> I don't think it would be applicable to any of the other use-cases, as a
> standby-replay MDS is bound to a single rank, meaning its bound to a single
> active MDS, and can't function as a standby for active:active.
>
> https://docs.ceph.com/en/latest/cephfs/standby/#configuring-standby-replay
>
>
> https://access.redhat.com/documentation/en-us/red_hat_ceph_storage/2/html/c…
>
> Good luck and look forward to hearing feedback/more results.
>
> Reed
>
> On Apr 27, 2021, at 8:40 AM, Lokendra Rathour <lokendrarathour(a)gmail.com>
> wrote:
>
> Hi Team,
> We have set up a two-node Ceph cluster using the *Native CephFS Driver*, with *details
> as follows:*
>
> - 3 Node / 2 Node MDS Cluster
> - 3 Node Monitor Quorum
> - 2 Node OSD
> - 2 Nodes for Manager
>
>
> Cephnode3 has only mon and MDS (only for test cases 4-7); the other two nodes,
> i.e. cephnode1 and cephnode2, have (mgr, mds, mon, rgw).
>
>
> We have tested the following failover scenarios for the Native CephFS Driver by
> mounting one sub-volume on a VM or client with continuous I/O
> operations (directory creation every 1 second)*:*
>
> <image.png>
>
>
> In the table above we have a few queries:
>
> - Refer to test case 2 and test case 7: both are similar test cases, the
> only difference being the number of Ceph MDS daemons, yet the time for the two
> test cases is different. It should be zero, but the time comes out as 17
> seconds for test case 7.
> - Is there any configurable parameter/configuration change we need
> to make in the Ceph cluster to get the failover time reduced to a few
> seconds?
>
> In the current default deployment we are getting around 35-40
> seconds.
>
>
>
>
>
>
> Best Regards,
>
> --
> ~ Lokendra
> www.inertiaspeaks.com
> www.inertiagroups.com
> skype: lokendrarathour
>
>
> _______________________________________________
> ceph-users mailing list -- ceph-users(a)ceph.io
> To unsubscribe send an email to ceph-users-leave(a)ceph.io
>
>
>
--
~ Lokendra
www.inertiaspeaks.com
www.inertiagroups.com
skype: lokendrarathour