Hello,
I'm trying to install nautilus on stretch following the directions here https://docs.ceph.com/docs/master/install/get-packages/ . However, it seems the stretch repo only includes ceph-deploy. Are the rest of the packages missing on purpose or have I missed something obvious?
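For reference, this is what I added, following the deb https://download.ceph.com/debian-{release}/ {codename} main pattern from that page:
echo deb https://download.ceph.com/debian-nautilus/ stretch main | sudo tee /etc/apt/sources.list.d/ceph.list
sudo apt update
apt-cache madison ceph ceph-deploy   # only ceph-deploy resolves to download.ceph.com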
Thanks
hi,all:
I'm using the AWS S3 Java SDK. When I create a new bucket with the hostname "s3.my-self.mydomain.com", I get an auth error.
But when I use the hostname "s3.us-east-1.mydomain.com", it works. Why?
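My guess (untested) is that the v1 SDK tries to derive the SigV4 signing region from the hostname, so "s3.us-east-1.…" parses to a valid region name while "s3.my-self.…" does not, producing a signature mismatch. A minimal sketch that pins the endpoint and signing region explicitly (credentials setup omitted; the bucket name and region are placeholders):

import com.amazonaws.client.builder.AwsClientBuilder;
import com.amazonaws.services.s3.AmazonS3;
import com.amazonaws.services.s3.AmazonS3ClientBuilder;

public class MakeBucket {
    public static void main(String[] args) {
        AmazonS3 s3 = AmazonS3ClientBuilder.standard()
                // Pin the signing region instead of letting the SDK guess it
                // from the hostname; use whatever region name your RGW/zonegroup expects.
                .withEndpointConfiguration(new AwsClientBuilder.EndpointConfiguration(
                        "http://s3.my-self.mydomain.com", "us-east-1"))
                // Path-style access avoids bucket-name DNS tricks on custom endpoints.
                .withPathStyleAccessEnabled(true)
                .build();
        s3.createBucket("mybucket");
    }
}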
Huang Mingyou
IT Infrastructure Manager
V.Photos Cloud Photography
Mobile: +86 13540630430
Customer service: 400 - 806 - 5775
Email: hmy(a)v.photos
Website: www.v.photos
Shanghai: 2F, Building F, Bund SOHO 3Q, 88 Zhongshan East 2nd Road, Huangpu District
Beijing: 1F, SOHO 3Q, South Gate 2 of Guanghua Road SOHO II, 9 Guanghua Road, Chaoyang District
Guangzhou: 3Wcoffice, Tianyu Garden Phase II, 136 Linhe Middle Road, Tianhe District
Shenzhen: 1F, Wanggu Shuangchuang Street, Room 102, Block A, Wanggu Technology Building Phase II, Shekou, Nanshan District
Chengdu: 7F, Shimao Plaza, Jianshe Road, Chenghua District
Hello,
I have an old ceph 0.94.10 cluster that had 10 storage nodes with one extra
management node used for running commands on the cluster. Over time we'd
had some hardware failures on some of the storage nodes, so we're down to
6, with ceph-mon running on the management server and 4 of the storage
nodes. We deployed a ceph.conf change and restarted the ceph-mon and
ceph-osd services, but the cluster went down on us. We found all the
ceph-mons stuck in the electing state. I can't get any response from
any ceph commands, but I found I can contact the daemons directly and get
this information (hostnames removed for privacy reasons):
root@<mgmt1>:~# ceph daemon mon.<mgmt1> mon_status
{
    "name": "<mgmt1>",
    "rank": 0,
    "state": "electing",
    "election_epoch": 4327,
    "quorum": [],
    "outside_quorum": [],
    "extra_probe_peers": [],
    "sync_provider": [],
    "monmap": {
        "epoch": 10,
        "fsid": "69611c75-200f-4861-8709-8a0adc64a1c9",
        "modified": "2019-08-23 08:20:57.620147",
        "created": "0.000000",
        "mons": [
            {
                "rank": 0,
                "name": "<mgmt1>",
                "addr": "[fdc4:8570:e14c:132d::15]:6789\/0"
            },
            {
                "rank": 1,
                "name": "<mon1>",
                "addr": "[fdc4:8570:e14c:132d::16]:6789\/0"
            },
            {
                "rank": 2,
                "name": "<mon2>",
                "addr": "[fdc4:8570:e14c:132d::28]:6789\/0"
            },
            {
                "rank": 3,
                "name": "<mon3>",
                "addr": "[fdc4:8570:e14c:132d::29]:6789\/0"
            },
            {
                "rank": 4,
                "name": "<mon4>",
                "addr": "[fdc4:8570:e14c:132d::151]:6789\/0"
            }
        ]
    }
}
Is there any way to force the cluster back into a quorum even if it's just
one mon running to start it up? I've tried exporting the mgmt's monmap and
injecting it into the other nodes, but it didn't make any difference.
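For reference, the export/inject sequence I used was along these lines (mons stopped while extracting/injecting; names and paths are placeholders):

ceph-mon -i <mgmt1> --extract-monmap /tmp/monmap    # on the management node
monmaptool --print /tmp/monmap
ceph-mon -i <mon1> --inject-monmap /tmp/monmap      # repeated on each mon

# Presumably, forcing a single-mon quorum would mean stripping the other
# entries from the map before injecting it back into mgmt1 only, e.g.:
monmaptool /tmp/monmap --rm <mon1> --rm <mon2> --rm <mon3> --rm <mon4>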
Thanks!
Hi,
On Thu, 29 Aug 2019 at 22:32, fengyd <fengyd81(a)gmail.com> wrote:
> Hi,
>
> The issue is still there?
>
Yes, it still is.
> I ran into an IO performance issue recently and found that the max fd
> count for Qemu/KVM was not big enough; the fds for Qemu/KVM were
> exhausted. The issue was solved after increasing the max fd count.
>
> How do I check and increase the max fd count for qemu? Can you tell me how?
Regards,
Gesiel
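(A rough sketch of one way to check and raise it, assuming a libvirt-managed qemu; the binary name, pid and values are illustrative:)

# limit and current fd usage of a running qemu process
# (the binary may be qemu-kvm or qemu-system-x86_64 depending on distro)
pid=$(pidof -s qemu-kvm)
grep 'open files' /proc/$pid/limits
ls /proc/$pid/fd | wc -l
# for VMs started by libvirt, raise the cap in /etc/libvirt/qemu.conf:
#   max_files = 32768
# then restart libvirtd so newly started VMs pick it up
systemctl restart libvirtd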
>
> On Wed, 21 Aug 2019 at 20:53, Gesiel Galvão Bernardes <
> gesiel.bernardes(a)gmail.com> wrote:
>
>> Hi Eliza,
>>
>> On Wed, 21 Aug 2019 at 09:30, Eliza <eli(a)chinabuckets.com>
>> wrote:
>>
>>> Hi
>>>
>>> on 2019/8/21 20:25, Gesiel Galvão Bernardes wrote:
>>> > I'm using Qemu/KVM (OpenNebula) with Ceph/RBD for running VMs, and I'm
>>> > having problems with slowness in applications that often are not
>>> > consuming much CPU or RAM. This problem affects mostly Windows.
>>> > Apparently the problem is that the application loads many small files
>>> > (e.g. DLLs) and these files take a long time to load, causing the slowness.
>>>
>>> Did you check/test your network connection?
>>> Do you have a fast network setup?
>>
>>
>> I have a bond of two 10Gb interfaces, which is lightly used.
>>
>>>
>>>
>> regards.
"ceph osd down" will mark an OSD down once, but not shut it down. Hence, it will continue to send heartbeats and request to be marked up again after a couple of seconds. To keep it down, there are 2 ways:
- either set "ceph osd set noup",
- or actually shut the OSD down.
The first option will allow the OSD to keep running so you can talk to the daemon while it is marked "down". Be aware that the OSD will be marked "out" after a while; you might need to mark it "in" manually when you are done with maintenance.
I believe with nautilus it is possible to set the noup flag on a specific OSD, which is much safer.
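From memory, the syntax is along these lines (untested; it may differ between minor releases):

ceph osd set noup                  # cluster-wide flag
ceph osd set-group noup osd.48     # nautilus: flag a single OSD
ceph osd unset-group noup osd.48
ceph osd unset noup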
Best regards,
=================
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14
________________________________________
From: ceph-users <ceph-users-bounces(a)lists.ceph.com> on behalf of solarflow99 <solarflow99(a)gmail.com>
Sent: 03 September 2019 19:40:59
To: Ceph Users
Subject: [ceph-users] forcing an osd down
I've noticed this happen before; this time I can't get it to stay down at all, it just keeps coming back up:
# ceph osd down osd.48
marked down osd.48.
# ceph osd tree |grep osd.48
48 3.64000 osd.48 down 0 1.00000
# ceph osd tree |grep osd.48
48 3.64000 osd.48 up 0 1.00000
 health HEALTH_WARN
        2 pgs backfilling
        1 pgs degraded
        2 pgs stuck unclean
        recovery 18/164089686 objects degraded (0.000%)
        recovery 1467405/164089686 objects misplaced (0.894%)
 monmap e1: 3 mons at {0=192.168.4.10:6789/0,1=192.168.4.11:6789/0,2=192.168.4.12:6789/0}
        election epoch 210, quorum 0,1,2 0,1,2
 mdsmap e166: 1/1/1 up {0=0=up:active}, 2 up:standby
 osdmap e25733: 45 osds: 45 up, 44 in; 2 remapped pgs
Hi,
I encountered a problem with blocked MDS operations and a client becoming unresponsive. I dumped the MDS cache, ops, blocked ops, and some further log information here:
https://files.dtu.dk/u/peQSOY1kEja35BI5/2010-09-03-mds-blocked-ops?l
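(The dumps were produced roughly like this, via the MDS admin socket:)

ceph daemon mds.<name> dump cache /tmp/mds-cache.txt
ceph daemon mds.<name> ops
ceph daemon mds.<name> dump_blocked_ops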
A user of our HPC system was running a job that creates a somewhat stressful MDS load. This workload tends to lead to MDS warnings like "slow metadata ops" and "client does not respond to caps release", which usually disappear without intervention after a while.
He cancelled the job, and one operation from one of the clients remained stuck in the MDS. We had a health warning about 1 blocked metadata operation and one client failing to respond to caps release. I should mention that we execute "echo 3 > /proc/sys/vm/drop_caches" in the epilogue script run after every job, which usually cleans up all unused caps without problems. So, by the time I was looking at the number of client caps, these were already down to below 100 for the client in question due to the epilogue script. It looks like there might be a race condition between the drop_caches and MDS requests.
In addition, backfill was going on while this happened. All PGs were active (plus various other recovery states). All storage was r/w-accessible.
On the client side, this was in the logs:
Sep 3 09:15:57 sn110 kernel: INFO: task kworker/0:1:79782 blocked for more than 120 seconds.
Sep 3 09:15:57 sn110 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Sep 3 09:15:57 sn110 kernel: kworker/0:1 D ffff995cf4614100 0 79782 2 0x00000000
Sep 3 09:15:57 sn110 kernel: Workqueue: ceph-pg-invalid ceph_invalidate_work [ceph]
Sep 3 09:15:57 sn110 kernel: Call Trace:
[... see link above ...]
I did not see slow ops on any of the OSDs. All other information in the link above.
We had to reboot the client to resolve this problem. It seems like the MDS does not clean up blocked requests in certain situations where it ought to be possible. I hope the cache and ops dumps help pinpoint the reason.
Best regards,
Frank
Hi
We have a couple of RHEL 7.6 (3.10.0-957.21.3.el7.x86_64) clients that
have a number of uninterruptible threads and I'm wondering if we're
looking at the issue fixed by
https://www.spinics.net/lists/ceph-devel/msg45467.html (the fix hasn't
made it into RHEL 7.7 3.10.0-1062).
Stack traces of the hung threads are at http://p.ip.fi/9pQA
There are a number of entries listed in
/sys/kernel/debug/ceph/*/{osdc,mdsc} at http://p.ip.fi/VVzx
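For the record, those were collected with something like:

for f in /sys/kernel/debug/ceph/*/osdc /sys/kernel/debug/ceph/*/mdsc; do echo "== $f"; cat "$f"; done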
Unfortunately, the issue isn't consistently reproducible.
Cheers
Toby
--
Toby Darling, Scientific Computing (2N249)
MRC Laboratory of Molecular Biology
Francis Crick Avenue
Cambridge Biomedical Campus
Cambridge CB2 0QH
Phone 01223 267070
Is there no ceph wiki page with examples of manual repairs with the
ceph-objectstore-tool (e.g. for cases where pg repair and pg scrub don't work)?
I have been having this issue for quite some time.
2019-09-02 14:17:34.175139 7f9b3f061700 -1 log_channel(cluster) log
[ERR] : deep-scrub 17.36
17:6ca1f70a:::rbd_data.1f114174b0dc51.0000000000000974:head : expected
clone 17:6ca1f70a:::rbd_data.1f114174b0dc51.0000000000000974:4 1 missing
I tried to resolve it according to this procedure [0], but now I am getting this message:
ceph-objectstore-tool --dry-run --type bluestore \
    --data-path /var/lib/ceph/osd/ceph-29 --pgid 17.36 \
    '{"oid":"rbd_data.1f114174b0dc51.0000000000000974","key":"","snapid":-2,"hash":1357874486,"pool":17,"namespace":"","max":0}' \
    remove
Snapshots are present, use removeall to delete everything
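(For reference, the JSON object spec above can be obtained, with the OSD stopped, from a listing along the lines of the following; snapid -2 denotes the head object:)

ceph-objectstore-tool --type bluestore --data-path /var/lib/ceph/osd/ceph-29 \
    --pgid 17.36 --op list rbd_data.1f114174b0dc51.0000000000000974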
I am not sure about this removeall, and I do not want to start deleting
snapshots just hoping it will amount to something. Besides, if only a ~4MB
block is damaged, do you really need to purge 40GB snapshots? I would
rather have a 40GB snapshot missing 4MB than no snapshot at all.
[0]
https://www.mail-archive.com/ceph-users@lists.ceph.com/msg47218.html
PS. Is there a record of who has had the longest unhealthy cluster
state? Because I would not like it to be me ;)