I have noticed that RBD-Mirror snapshot mode can only manage to take 1 snapshot per second. For example, I have 21 images in a single pool; when the schedule is triggered it takes the mirror snapshot of each image one at a time. It doesn't feel or look like a performance issue, as the OSDs are Micron 9300 PRO NVMes and each server has 2x Intel Platinum 8268 CPUs.
I was hoping that adding more RBD-Mirror instances would help, but that only seems to improve overall throughput. As it stands I have 3 RBD-Mirror instances running on each cluster.
We run a 30-minute snapshot schedule to our remote site, so at 1 snapshot per second I can only squeeze in 1800 mirror snaps every 30 minutes.
I was hoping there might be something I am missing with RBD-Mirror as far as scaling goes.
Maybe splitting the images across multiple pools would be a solution, and perhaps have other benefits?
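If multiple pools are the way to go, I imagine the schedules would just be set per pool, something like this (a sketch only; the pool names are hypothetical and I'm assuming each pool's schedule is processed independently):

rbd mirror snapshot schedule add --pool rbd-pool-a 30m
rbd mirror snapshot schedule add --pool rbd-pool-b 30m
# confirm which schedules are in place
rbd mirror snapshot schedule ls --recursive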
I have an rbd-mirror snapshot on one image that failed to replicate, and now it's not getting cleaned up.
The cause was my own fault, given the steps I took; I'm just trying to understand how to clean up/handle the situation.
Here is how I got into this situation (a rough command sketch follows the list):
- Created manual rbd snapshot on the image
- On the remote cluster I cloned the snapshot
- While the clone existed on the secondary cluster, I made the mistake of deleting the snapshot on the primary
- The subsequent mirror snapshot failed
- I then removed the clone
- The next mirror snapshot was successful but I was left with this mirror snapshot on the primary that I can't seem to get rid of
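Roughly, the command sequence was the following (a sketch; the manual snapshot and clone names here are made up, only the pool/image is real):

# on the primary cluster: manual snapshot
rbd snap create CephTestPool1/vm-100-disk-0@manual-snap
# on the remote cluster, once the snapshot was available there: clone it
rbd clone CephTestPool1/vm-100-disk-0@manual-snap CephTestPool1/vm-100-clone
# back on the primary: the mistake, removing the parent snapshot
rbd snap rm CephTestPool1/vm-100-disk-0@manual-snap
# later, on the remote cluster: remove the clone again
rbd rm CephTestPool1/vm-100-clone

And this is the current snapshot listing on the primary: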
root@Ccscephtest1:/var/log/ceph# rbd snap ls --all CephTestPool1/vm-100-disk-0
SNAPID NAME SIZE PROTECTED TIMESTAMP NAMESPACE
10082 .mirror.primary.90c53c21-6951-4218-9f07-9e983d490993.e0c63479-b09e-4c66-a65b-085b67a19907 2 TiB Thu Jan 21 07:10:09 2021 mirror (primary peer_uuids:[])
10243 .mirror.primary.90c53c21-6951-4218-9f07-9e983d490993.483e55aa-2f64-4bb0-ac0f-7b5aac59830e 2 TiB Thu Jan 21 07:30:08 2021 mirror (primary peer_uuids:[debf975b-ebb8-432c-a94a-d3b101e0f770])
I have tried deleting the snap with "rbd snap rm" like normal user-created snaps, but no luck. Is there any way to force the deletion?
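For reference, the invocation I tried looked roughly like this, using the leftover snapshot name from the listing above:

rbd snap rm CephTestPool1/vm-100-disk-0@.mirror.primary.90c53c21-6951-4218-9f07-9e983d490993.e0c63479-b09e-4c66-a65b-085b67a19907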
I have a hell of a question: how do I put a cluster into HEALTH_ERR status
without consequences?
I'm working on CI tests and I need to check whether our reaction to
HEALTH_ERR is correct. For this I need to take an empty cluster with an
empty pool and do something to it, preferably something quick and reversible.
For HEALTH_WARN the best thing I found is to change the pool size to 1; it
raises the "1 pool(s) have no replicas configured" warning almost instantly,
and it can be reverted very quickly for an empty pool (sketched below).
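Something like this, assuming a hypothetical empty pool called "testpool":

# drop to a single replica -> HEALTH_WARN "1 pool(s) have no replicas configured"
ceph osd pool set testpool size 1
ceph health detail
# revert; for an empty pool the cluster returns to HEALTH_OK almost immediately
ceph osd pool set testpool size 3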
But HEALTH_ERR is a bit more tricky. Any ideas?
Hi all,
we noticed a massive drop in requests per second a cephfs client is able
to perform when we do a recursive chown over a directory with millions
of files. As soon as we see about 170k caps on the MDS, the client
performance drops from about 660 reqs/sec to 70 reqs/sec.
When we then clear dentries and inodes using "sync; echo 2 >
/proc/sys/vm/drop_caches" on the client, the requests go up to ~660 again,
only to drop once more when reaching about 170k caps.
See the attached screenshots.
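For context, this is roughly how we watch the cap counts and reset them (a sketch; "mds1" is a placeholder for the actual MDS daemon name):

# on the MDS host: list client sessions and their cap counts
ceph daemon mds.mds1 session ls | grep -E '"id"|num_caps'
# on the client: drop dentries/inodes, which releases the caps
sync; echo 2 > /proc/sys/vm/drop_caches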
When we stop the chown process for a while and restart it ~25 min later,
it still performs very slowly and the MDS reqs/sec remain low
(~60/sec). Clearing the cache (dentries and inodes) on the client
restores the performance again.
When we run the same chown on another client in parallel, it starts off
with reasonably good performance (while the first client is still performing
poorly), but eventually it gets slow again, just like the first client.
Can someone comment on this and explain it?
How can this be solved, so that the performance remains stable?
We are running ceph version 14.2.16
(762032d6f509d5e7ee7dc008d80fe9c87086603c) nautilus (stable) on all ceph
cluster nodes and on all clients.
The OS on all ceph cluster nodes and client nodes is CentOS 7.9. The
filesystem is mounted via CentOS kernel client (latest official version).
Thanks in advance.
~Best
Dietmar
--
_________________________________________
D i e t m a r R i e d e r, Mag.Dr.
Innsbruck Medical University
Biocenter - Institute of Bioinformatics
Innrain 80, 6020 Innsbruck
Phone: +43 512 9003 71402
Fax: +43 512 9003 73100
Email: dietmar.rieder(a)i-med.ac.at
Web: http://www.icbi.at
Hi,
I'm looking at the SUSE documentation regarding their option to run RBD on Windows.
I want to try it on a Windows Server 2019 VM, but I got this error:
PS C:\Users\$admin$> rbd create image01 --size 4096 --pool windowstest -m 10.118.199.248,10.118.199.249,10.118.199.250 --id windowstest --keyring C:/ProgramData/ceph/keyring
2021-01-20T11:15:29.066SE Asia Standard Time 1 -1 auth: error parsing file C:/ProgramData/ceph/keyring: cannot parse buffer: Malformed input
2021-01-20T11:15:29.066SE Asia Standard Time 1 -1 auth: failed to load C:/ProgramData/ceph/keyring: (5) Input/output error
2021-01-20T11:15:29.066SE Asia Standard Time 1 -1 auth: error parsing file C:/ProgramData/ceph/keyring: cannot parse buffer: Malformed input
2021-01-20T11:15:29.066SE Asia Standard Time 1 -1 auth: failed to load C:/ProgramData/ceph/keyring: (5) Input/output error
2021-01-20T11:15:29.066SE Asia Standard Time 1 -1 auth: error parsing file C:/ProgramData/ceph/keyring: cannot parse buffer: Malformed input
rbd: couldn't connect to the cluster!
2021-01-20T11:15:29.066SE Asia Standard Time 1 -1 auth: failed to load C:/ProgramData/ceph/keyring: (5) Input/output error
2021-01-20T11:15:29.066SE Asia Standard Time 1 -1 monclient: keyring not found
This is the keyring file:
[client.windowstest]
key = AQBJ7wdgdWLIMhAAle+/pg+26XvWsDv8PyPcvw==
caps mon = "allow rw"
caps osd = "allow rwx pool=windowstest"
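For reference, a keyring entry like this would typically be generated on the cluster side with something along these lines (a sketch; I'm not certain this is exactly how it was created here):

# create/fetch the client and write its keyring to a file
ceph auth get-or-create client.windowstest mon 'allow rw' osd 'allow rwx pool=windowstest' -o keyring
# then copy the resulting file to C:\ProgramData\ceph\keyring as plain ASCII/UTF-8 (no BOM)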
And this is the ceph.conf file on the windows client:
[global]
log to stderr = true
run dir = C:/ProgramData/ceph
crash dir = C:/ProgramData/ceph
[client]
keyring = C:/ProgramData/ceph/keyring
log file = C:/ProgramData/ceph/$name.$pid.log
admin socket = C:/ProgramData/ceph/$name.$pid.asok
[global]
mon host = [v2:10.118.199.231:3300,v1:10.118.199.231:6789] [v2:10.118.199.232:3300,v1:10.118.199.232:6789] [v2:10.118.199.233:3300,v1:10.118.199.233:6789]
Commands I've tried:
rbd create image01 --size 4096 --pool windowstest -m 10.118.199.248,10.118.199.249,10.118.199.250 --id windowstest --keyring C:/ProgramData/ceph/keyring
rbd create image01 --size 4096 --pool windowstest -m 10.118.199.248,10.118.199.249,10.118.199.250 --id windowstest --keyring C:\ProgramData\ceph\keyring
rbd create image01 --size 4096 --pool windowstest -m 10.118.199.248,10.118.199.249,10.118.199.250 --id windowstest --keyring "C:/ProgramData/ceph/keyring"
rbd create image01 --size 4096 --pool windowstest -m 10.118.199.248,10.118.199.249,10.118.199.250 --id windowstest --keyring "C:\ProgramData\ceph\keyring"
rbd create blank_image --size=1G
The ceph version is luminous 12.2.8.
I don't know why it can't find or parse the keyring.
Thank you.
Dear Ceph users,
We have a slightly dated Luminous cluster in which dynamic
bucket resharding was accidentally enabled due to a misconfiguration (we don't use
this feature, since the number of objects per bucket is capped).
This resulted in the creation of the RGW reshard pool with lots of bucket
reshard lock objects (we have thousands of buckets), which is leading to
clutter. Also, we've run into a malloc failure issue (similar to
https://tracker.ceph.com/issues/21826 but not the same, since we already use
tcmalloc) on the OSDs on which these reshard lock objects are located, and
we'd like to reduce the number of objects that have to be copied out.
My question to the community is: "Is it safe to discard the bucket reshard
lock objects if we know that we'll never use the reshard feature on the
cluster again?".
The RGWs performed resharding several months ago due to a misconfiguration
and we already have stale bucket instances which are due for cleanup on
this cluster.
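In case it helps frame the question, this is roughly how we look at the reshard state (a sketch; the pool name default.rgw.reshard is an assumption based on default zone naming, and the stale-instances subcommand needs a recent enough Luminous point release):

radosgw-admin reshard list                    # pending reshard entries, if any
radosgw-admin reshard stale-instances list    # stale bucket instances left behind
rados -p default.rgw.reshard ls | head        # the reshard lock objects in question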
Thanks,
Prasad Krishnan
Hello Cephers,
On a new cluster, I only have 2 RBD block images, and the Dashboard
doesn't manage to list them correctly.
I get this message:
Warning
Displaying previously cached data for pool veeam-repos.
Sometimes it disappears, but as soon as I reload or return to the listing
page, it's back.
What I've seen is a high CPU load from ceph-mgr on the active
manager.
And also stack traces like this:
2021-01-15T14:41:12.061+0100 7f7f3fec4700 0 [dashboard ERROR exception]
Dashboard Exception
Traceback (most recent call last):
File "/usr/share/ceph/mgr/dashboard/services/exception.py", line 94,
in dashboard_exception_handler
return handler(*args, **kwargs)
File "/usr/lib/python3/dist-packages/cherrypy/_cpdispatch.py", line
60, in __call__
return self.callable(*self.args, **self.kwargs)
File "/usr/share/ceph/mgr/dashboard/controllers/__init__.py", line
666, in inner
ret = func(*args, **kwargs)
File "/usr/share/ceph/mgr/dashboard/controllers/__init__.py", line
861, in wrapper
return func(*vpath, **params)
File "/usr/lib/python3.6/contextlib.py", line 52, in inner
return func(*args, **kwds)
File "/usr/lib/python3.6/contextlib.py", line 52, in inner
return func(*args, **kwds)
File "/usr/share/ceph/mgr/dashboard/controllers/rbd.py", line 86, in
list
return self._rbd_list(pool_name)
File "/usr/share/ceph/mgr/dashboard/controllers/rbd.py", line 76, in
_rbd_list
status, value = RbdService.rbd_pool_list(pool)
File "/usr/share/ceph/mgr/dashboard/tools.py", line 254, in wrapper
return rvc.run(fn, args, kwargs)
File "/usr/share/ceph/mgr/dashboard/tools.py", line 242, in run
raise ViewCacheNoDataException()
dashboard.exceptions.ViewCacheNoDataException: ViewCache: unable to
retrieve data
Also this one, since I changed some features back and forth on one image:
2021-01-18T11:13:26.383+0100 7f00199ca700 0 [dashboard ERROR
frontend.error]
(https://fidcl-mrs4-sto-sds.fidcl.cloud:8443/#/block/rbd/edit/veeam-
repos%252Fveeam-repo2-vol1): Cannot read property 'features_name' of
undefined
TypeError: Cannot read property 'features_name' of undefined
at
https://fidcl-mrs4-sto-sds.fidcl.cloud:8443/1.9e79c41bbaed982a50af.js:1:121…
at Array.forEach (<anonymous>)
at R.deepBoxCheck
(https://fidcl-mrs4-sto-sds.fidcl.cloud:8443/1.9e79c41bbaed982a50af.js:1:120…)
at R.featureFormUpdate
(https://fidcl-mrs4-sto-sds.fidcl.cloud:8443/1.9e79c41bbaed982a50af.js:1:121…)
at
https://fidcl-mrs4-sto-sds.fidcl.cloud:8443/1.9e79c41bbaed982a50af.js:1:119…
at d.a [as _next]
(https://fidcl-mrs4-sto-sds.fidcl.cloud:8443/main.c43d13b597196a5f022f.js:2:…)
at d.__tryOrUnsub
(https://fidcl-mrs4-sto-sds.fidcl.cloud:8443/main.c43d13b597196a5f022f.js:2:…)
at d.next
(https://fidcl-mrs4-sto-sds.fidcl.cloud:8443/main.c43d13b597196a5f022f.js:2:…)
at l._next
(https://fidcl-mrs4-sto-sds.fidcl.cloud:8443/main.c43d13b597196a5f022f.js:2:…)
at l.next
(https://fidcl-mrs4-sto-sds.fidcl.cloud:8443/main.c43d13b597196a5f022f.js:2:…)
But that's perhaps just because I opened an Edit window on the image and it
does not have the data.
The Edit window is empty and I can't edit anything; in particular, I want
to resize the image.
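(For reference, the equivalent CLI operation would presumably be something like the following, on the image from the error above; the 10T target size is just a made-up example value:

rbd resize veeam-repos/veeam-repo2-vol1 --size 10T
)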
Finally, I found a similar bug already registered here, but it seems
resolved for the reporter:
https://tracker.ceph.com/issues/45308
Hmm, as I read it more carefully, it's about CephFS, not RBD, in that
stack trace...
Am I the only one?
Is there a workaround?
I really need the dashboard to be usable, because I want to delegate as
many operations as possible to people who shouldn't need the rights to
connect to the machines and use the CLI, and who don't have the skills to
do so anyway.
Have a good day,
--
Gilles