Is there a way to clean up the sync shards and start from scratch?
We have a Red Hat installation of Luminous (full package version: 12.2.8-128.1). We're experiencing an issue where the ceph-radosgw service times out during initialization and cycles through restart attempts every five minutes until it seems to just give up. Every other Ceph service starts successfully.
I tried looking at the health of the cluster, but any command I run, whether ceph or radosgw-admin (even just to list users), seems to time out as well.
I've used strace when attempting to start radosgw directly and was presented with a missing keyring error. I would be inclined to think that might be the problem, but wouldn't that also impact all of the other services?
I haven't been able to find anything in the logs that would lead me down any paths. Everything I've looked at (journalctl, /var/log/messages, /var/log/ceph/ceph-rgw-server.log) all just say the same thing: the service attempted to start, it failed to initialize, entered a failed state, service stopped. This repeats.
I only get pulled in to look at Ceph every now and then so I don't have enough knowledge to know how the various components interact or if anything external is having an impact. Is there anywhere that might be holding a bit of information?
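Since strace pointed at a missing keyring, one first sanity check could be to confirm the keyring file the RGW expects actually exists and matches the cluster's auth database. A rough sketch — the unit name, client name, and keyring path below are assumptions; substitute whatever your systemd unit and ceph.conf actually use:

```shell
# What client name / keyring does the RGW unit reference? (assumed unit name)
systemctl cat ceph-radosgw@rgw.$(hostname -s) | grep -iE 'rgw|keyring'

# Does the keyring file exist where the unit expects it? (assumed path)
ls -l /var/lib/ceph/radosgw/ceph-rgw.$(hostname -s)/keyring

# Does the monitor know this key? (run from a node where 'ceph -s' works)
ceph auth get client.rgw.$(hostname -s)
```

Each daemon type uses its own keyring, so a missing or mismatched RGW keyring would break only RGW while OSDs, MONs, etc. start fine — which would be consistent with what you're seeing.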
Is it possible to disable the check for 'x pool(s) have no replicas
configured', so I don't have this HEALTH_WARN constantly?
Or is there some other disadvantage to keeping some empty 1x-replicated
pools around?
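If I remember correctly, recent releases have a monitor option specifically for this warning, `mon_warn_on_pool_no_redundancy` — please verify it exists in your version before relying on it. A sketch:

```shell
# Silence the 'pool(s) have no replicas configured' health warning
ceph config set mon mon_warn_on_pool_no_redundancy false

# Confirm the setting took effect
ceph config get mon mon_warn_on_pool_no_redundancy
```

Note this only hides the warning; a size=1 pool still has no redundancy, so any OSD loss under it means data loss.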
Recently I deployed a small ceph cluster using cephadm.
In this cluster, I have 3 OSD nodes, each with 8 Hitachi HDDs (9.1 TiB), 4
Micron_9300 NVMes (2.9 TiB), and 2 Intel Optane P4800X NVMes (375 GiB). I
want to use the spinning disks for the data block, the 2.9 TiB NVMes for
block.db, and the Intel Optane drives for block.wal.
I tried with a spec file and also via the Ceph dashboard, but I encountered
the same problem both ways. I would expect 1 LV on every data disk, 4 LVs on
each WAL disk, and 2 LVs on each DB disk. The problem arises on the DB
disks, where only 1 LV gets created.
After some debugging, I think the problem occurs when the VG gets divided
in two. I have 763089 total PE, and the first LV was created using 381545
PE (round-up of 763089/2). As a result, the creation of the second LV
fails: Volume group "ceph-c7078851-d3c1-4745-96b6-f98a45d3da93" has
insufficient free space (381544 extents): 381545 required.
Is this expected behavior? Should I create the LVs myself?
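For what it's worth, the numbers in that error message are consistent with a round-up in the split. A sketch of the arithmetic, assuming the tool really does divide the VG's extents in half with a round-up, as the message suggests:

```shell
# Reproducing the extent math from the error message
total_pe=763089                             # total PE reported for the VG
first_lv_pe=$(( (total_pe + 1) / 2 ))       # round-up of 763089/2 -> 381545
remaining_pe=$(( total_pe - first_lv_pe ))  # 381544 free extents remain

# The second LV also asks for 381545 PE, but only 381544 are free -> failure.
echo "first LV: $first_lv_pe PE, remaining: $remaining_pe PE"
```

Flooring the division instead (763089 / 2 = 381544) would leave enough room for a second LV of the same size, which is why this looks like a rounding bug rather than expected behavior.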
Gheorghe Asachi Technical University of Iasi
OK, I added the ceph-users again :)
Thanks for your reply, those are a lot of useful pointers. Yes, it's Dell EMC switches running OS9, and I believe they support per-VLAN bandwidth reservations. That would be the easiest to configure and test. At the moment, I always see the slow ping times on both the front and back interfaces at the same time, on exactly the same OSD pairs. If I reserve bandwidth for the replication VLAN and the slow ping times on the back interface disappear, that would be a really strong clue.
I will go through everything after the weekend.
AIT Risø Campus
Bygning 109, rum S14
From: Stefan Kooman <stefan(a)bit.nl>
Sent: 12 February 2021 18:18
To: Frank Schilder
Subject: Re: [ceph-users] Network design issues
On 2/12/21 5:27 PM, Frank Schilder wrote:
> Hi Stefan,
> do you want to keep this out of the ceph-users list or was it a click-and-miss?
^^ This, I recently switched to Thunderbird because of mail migration
(from Mutt) ... and I'm not used to it yet. I *tried* to reply to all
(incl. list) but might have screwed up.
I would consider this as of general interest.
> Thanks for your detailed reply. I take it that I need to provide more info and will try to make a few sketches of the architecture. I think it will help explain the problem. Some quick replies:
>> I'm curious what you changed. Want to share it?
> # ceph config set mds mds_max_caps_per_client 65536
> Thread "cephfs: massive drop in MDS requests per second with increasing number of caps"
Ah yes, I've read that thread. Interesting. I haven't tested it out
yet, but will do so.
> There are a number of config values with significantly too large defaults, this is one of them. Another one is mon_sync_max_payload_size.
Quite a few people have run into issues with that setting. We haven't had
any issues with it yet, but perhaps I should downscale it as well.
>> Do you know what causes the slow ops?
> I don't care about slow ops under high load, these are to be expected. I worry about "slow ping times". These are not expected and are almost certainly caused by congestion of a link.
Yeah sure, I would suspect that as well. Or "discards" from a switch
because of errors, but those are less likely.
>> I don't quite get the 10G bottleneck. Sure, a client can saturate a 10
>> Gb/s link, but how does this affect storage <-> storage (replication)
>> traffic and / or other clients?
> Because it all happens on the same physical link. We don't have a dedicated replication network. It's all mixed on the same hardware. If a 10G link is saturated, nothing moves any more through this particular link, and the clients are so superior in capacity that they can easily starve parts of the internal Ceph traffic this way.
> Basically, we started out with a dedicated replication VLAN and decided to merge this with the access VLAN for simplicity of the set-up. Our networking is currently equivalent to having a single network only. Here the interfaces:
> ceph0: <BROADCAST,MULTICAST,MASTER,UP,LOWER_UP> mtu 9000 qdisc noqueue state UP group default qlen 1000
> ceph0.81@ceph0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 9000 qdisc noqueue state UP group default qlen 1000
> ceph0.82@ceph0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 9000 qdisc noqueue state UP group default qlen 1000
> $ ethtool ceph0
> Settings for ceph0:
> Supported ports: [ ]
> Supported link modes: Not reported
> Supported pause frame use: No
> Supports auto-negotiation: No
> Supported FEC modes: Not reported
> Advertised link modes: Not reported
> Advertised pause frame use: No
> Advertised auto-negotiation: No
> Advertised FEC modes: Not reported
> Speed: 60000Mb/s
> Duplex: Full
> Port: Other
> PHYAD: 0
> Transceiver: internal
> Auto-negotiation: off
> Cannot get wake-on-lan settings: Operation not permitted
> Link detected: yes
> the bond is 6x10G active-active; VLAN 81 is the access VLAN and 82 is the replication network. It all goes over the same lines. This config is very convenient for maintenance, but seems to suffer from not physically reserving bandwidth for VLAN 82. Maybe such a bandwidth-reservation QoS definition could already help?
Are these Dell / EMC switches? You might be able to give priority on a
VLAN level, or "shape" bandwidth based on VLANs. I know that Aristas
(that we use) have support for that in newish firmware. You might also
want to support "pause" frames (ethernet flow control), as that might
help during congestion (a back-off protocol), see:
Just to note: we don't have a separate replication network / interfaces.
We only have one network. Wido (den Hollander) and I don't see any added
benefit of a separate network. You only waste bandwidth if you split
them up, and it makes debugging more complex in certain failure
scenarios. Do you know what hashing is in use for the LACP port-channel?
You want to use mac, ip and port (5-tuple). We use OpenvSwitch (OVS) a
lot, and with OVS you can balance the load between the LACP links (by
default it evaluates every 10 seconds if it should move flows around).
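On plain Linux bonds (without OVS), the transmit hash policy can be inspected and changed like this — a sketch only; the bond name `ceph0` is taken from the interface listing above, and `layer3+4` is the IP+port hash. Check that the switch side hashes compatibly:

```shell
# Show the current transmit hash policy of the bond
grep 'Transmit Hash Policy' /proc/net/bonding/ceph0

# Switch to layer3+4 (source/dest IP + port) hashing at runtime
echo layer3+4 > /sys/class/net/ceph0/bonding/xmit_hash_policy
```

With the default `layer2` (MAC-only) policy, all traffic between two hosts rides one physical member of the bond, which would match the symptom of specific OSD pairs seeing slow pings while the aggregate has headroom.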
I doubt there is a silver bullet, but hey, you never know. Do change one
thing at a time, otherwise it will be hard to know what the effect is of
each of the changes (they might even cancel each other out).
> I will provide a sketch of the set-up, I think this will make things more clear. I don't think we have an aggregated bandwidth problem, I believe what we have is a load distribution/priority problem over physical link members in the aggregation group "ceph0" on the storage servers.
Yes, your issue makes more sense to me now. Do you have any metrics from
the load on the individual links? Even bmon might be a useful tool. You
might want to capture metrics (like every second or so) to detect
"bursts" of traffic that might cause issues. Just to make sure you are
on the right track. We use telegraf as metric collecting agent sending
them to influxdb, but there are many more options.
And then there are also other things to tune: tcp checksum offload et
al. You might also hit IRQ balance issue, and there are also ways to
overcome those. Are those single CPU systems? And / or AMD? NUMA might
be a thing as well, and ideally you have the Ceph OSD daemons pinned to
the CPU with the network / storage adapters connected.
Finally, this might be of use:
I have a Ceph cluster on Nautilus 14.2.10; each node has 3 SSDs and 4 HDDs.
Each node also has two NVMes as cache devices (nvme0n1 caches SSDs 0-2 and
nvme1n1 caches HDDs 3-7).
One node's nvme0n1 keeps hitting the issues below (see the "I/O ... timeout,
aborting" kernel messages), and then the device suddenly disappears.
After that, I need to reboot the node to recover.
Has anyone hit the same issue, and how can I solve it? Any suggestions are
welcome. Thanks in advance!
I once googled the issue and found a link, but it didn't help.
Feb 19 01:31:52 ip kernel: [1275313.393211] nvme 0000:03:00.0: I/O 949 QID 12 timeout, aborting
Feb 19 01:31:53 ip kernel: [1275314.389232] nvme 0000:03:00.0: I/O 728 QID 5 timeout, aborting
Feb 19 01:31:53 ip kernel: [1275314.389247] nvme 0000:03:00.0: I/O 515 QID 7 timeout, aborting
Feb 19 01:31:53 ip kernel: [1275314.389252] nvme 0000:03:00.0: I/O 516 QID 7 timeout, aborting
Feb 19 01:31:53 ip kernel: [1275314.389257] nvme 0000:03:00.0: I/O 517 QID 7 timeout, aborting
Feb 19 01:31:53 ip kernel: [1275314.389263] nvme 0000:03:00.0: I/O 82 QID 9 timeout, aborting
Feb 19 01:31:53 ip kernel: [1275314.389271] nvme 0000:03:00.0: I/O 853 QID 13 timeout, aborting
Feb 19 01:31:53 ip kernel: [1275314.389275] nvme 0000:03:00.0: I/O 854 QID 13 timeout, aborting
Feb 19 01:32:23 ip kernel: [1275344.401708] nvme 0000:03:00.0: I/O 728 QID 5 timeout, reset controller
Feb 19 01:32:52 ip kernel: [1275373.394112] nvme 0000:03:00.0: I/O 0 QID 0 timeout, reset controller
Feb 19 01:33:53 ip ceph-osd: /build/ceph-14.2.10/src/common/HeartbeatMap.cc: In function 'bool ceph::HeartbeatMap::_check(const ceph::heartbeat_handle_d*, const char*, ceph::time_detail::coarse_mono_clock::rep)' thread 7f36c03fb700 time 2021-02-19 01:33:53.436018
Feb 19 01:33:53 ip ceph-osd: /build/ceph-14.2.10/src/common/HeartbeatMap.cc: 82: ceph_abort_msg("hit suicide timeout")
Feb 19 01:33:53 ip ceph-osd: ceph version 14.2.10 (b340acf629a010a74d90da5782a2c5fe0b54ac20) nautilus (stable)
Feb 19 01:33:53 ip ceph-osd: 1: (ceph::__ceph_abort(char const*, int, char const*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&)+0xdf) [0x83eb8c]
Feb 19 01:33:53 ip ceph-osd: 2: (ceph::HeartbeatMap::_check(ceph::heartbeat_handle_d const*, char const*, unsigned long)+0x4a5) [0xec56f5]
Feb 19 01:33:53 ip ceph-osd: 3: (ceph::HeartbeatMap::is_healthy()+0x106) [0xec6846]
Feb 19 01:33:53 ip ceph-osd: 4: (OSD::handle_osd_ping(MOSDPing*)+0x67c) [0x8aaf0c]
Feb 19 01:33:53 ip ceph-osd: 5: (OSD::heartbeat_dispatch(Message*)+0x1eb) [0x8b3f4b]
Feb 19 01:33:53 ip ceph-osd: 6: (DispatchQueue::fast_dispatch(boost::intrusive_ptr<Message> const&)+0x27d) [0x12456bd]
Feb 19 01:33:53 ip ceph-osd: 7: (ProtocolV2::handle_message()+0x9d6) [0x129b4e6]
Feb 19 01:33:53 ip ceph-osd: 8: (ProtocolV2::handle_read_frame_dispatch()+0x160) [0x12ad330]
Feb 19 01:33:53 ip ceph-osd: 9: (ProtocolV2::handle_read_frame_epilogue_main(std::unique_ptr<ceph::buffer::v14_2_0::ptr_node, ceph::buffer::v14_2_0::ptr_node::disposer>&&, int)+0x178) [0x12ad598]
Feb 19 01:33:53 ip ceph-osd: 10: (ProtocolV2::run_continuation(Ct<ProtocolV2>&)+0x34) [0x12956b4]
Feb 19 01:33:53 ip ceph-osd: 11: (AsyncConnection::process()+0x186) [0x126f446]
Feb 19 01:33:53 ip ceph-osd: 12: (EventCenter::process_events(unsigned int, std::chrono::duration<unsigned long, std::ratio<1l, 1000000000l> >*)+0x7cd) [0x10b14cd]
Feb 19 01:33:53 ip ceph-osd: 13: /usr/bin/ceph-osd() [0x10b3fd8]
Feb 19 01:33:53 ip ceph-osd: 14: /usr/bin/ceph-osd() [0x162b59f]
Feb 19 01:33:53 ip ceph-osd: 15: (()+0x76ba) [0x7f36c2ed46ba]
Feb 19 01:33:53 ip ceph-osd: 16: (clone()+0x6d) [0x7f36c24db4dd]
Feb 19 01:33:53 ip ceph-osd: *** Caught signal (Aborted) **
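Not a fix for the underlying problem — a controller that disappears from the bus smells like firmware or hardware — but one mitigation sometimes suggested for "I/O ... timeout, aborting" messages is raising the kernel's NVMe I/O timeout via the `nvme_core.io_timeout` module parameter. A sketch; the value is illustrative:

```shell
# Check the current NVMe I/O timeout (in seconds; default is typically 30)
cat /sys/module/nvme_core/parameters/io_timeout

# Raise it at boot by adding a kernel command-line parameter, e.g. in GRUB:
#   nvme_core.io_timeout=300

# Also check the drive's health and error log for clues
smartctl -a /dev/nvme0
```

It would also be worth checking for a firmware update for that NVMe model, since the other node with the same layout apparently doesn't show the problem.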
Happy to report that we recently upgraded our three-host 24-OSD cluster
from HDD filestore to SSD BlueStore. After a few months of use, their
WEAR is still at 1%, and the cluster performance ("rados bench" etc) has
dramatically improved. So all in all: yes, we're happy Samsung PM883
ceph users. :-)
We currently have a "meshed" ceph setup, with the three hosts connected
directly to each other over 10G ethernet, as described here:
As we would like to be able to add more storage hosts, we need to lose
the meshed network setup.
My idea is to add two stacked 10G ethernet switches to the setup, so we
can start using lacp bonded networking over two physical switches.
Looking around, we can get refurb Cisco Small Business 550X for around
1300 euro. We also noticed that mikrotik and TP-Link have some even
nicer-priced 10G switches, but those all lack bonding. :-(
Therefore I'm asking here: anyone with suggestions on what to look at
for nice-priced 10G stackable switches?
We would like to continue using ethernet, as we use that everywhere, and
also performance-wise we're happy with what we currently have.
Last December I wrote to MikroTik support, asking if they will support
stacking / LACP any time soon, and their answer was: probably the 2nd half.
So, anyone here with interesting insights to share on Ceph 10G ethernet
setups?
We have a very weird issue with rbd-mirror replication. As per the command
output we are in sync, but the OSD usage on the DR side doesn't match the
Prod side. On Prod, we are using close to 52TB, but on the DR side only 22TB.
We took a snapshot on Prod, mounted it on the DR side, compared the data,
and found a lot of missing data. Please see the output below.
Please help us resolve this issue or point us in the right direction.
DR# rbd --cluster cephdr mirror pool status cifs --verbose
images: 1 total
description: replaying, master_position=[object_number=390133, tag_tid=4,
entry_tid=447832541], mirror_position=[object_number=390133, tag_tid=4,
last_update: 2021-01-29 15:10:13
DR# ceph osd pool ls detail
pool 5 'cifs' replicated size 2 min_size 1 crush_rule 0 object_hash rjenkins pg_num 128 pgp_num 128 last_change 1294 flags hashpspool stripe_width 0
PROD# ceph df detail
NAME  ID  QUOTA OBJECTS  QUOTA BYTES  USED     %USED  MAX AVAIL  OBJECTS  DIRTY  READ    WRITE   RAW USED
cifs  17  N/A            N/A          26.0TiB  30.10  60.4TiB    6860550  6.86M  873MiB  509MiB  52.1TiB
DR# ceph df detail
NAME  ID  QUOTA OBJECTS  QUOTA BYTES  USED     %USED  MAX AVAIL  OBJECTS  DIRTY  READ     WRITE   RAW USED
cifs  5   N/A            N/A          11.4TiB  15.78  60.9TiB    3043260  3.04M  2.65MiB  431MiB  22.8TiB
PROD#:/vol/research_data# du -sh *
DR#:/vol/research_data# du -sh *
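One thing worth checking: pool usage on an RBD pool reflects allocated objects, so thin provisioning and discard/fstrim behaviour can legitimately make USED differ between sites even when image contents match. Comparing per-image allocation on both clusters might narrow it down — a sketch, where `cifs/<image>` is a placeholder for your actual image name:

```shell
# Per-image provisioned vs actually-used space, on each cluster
rbd du cifs/<image>
rbd --cluster cephdr du cifs/<image>

# Verify the mirror image status on the DR side
rbd --cluster cephdr mirror image status cifs/<image>
```

If `rbd du` agrees on both sides but mounted data differs, that would point away from a space-accounting artifact and toward a genuine replication problem.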
I want to invite you to apply to an internship program called Outreachy!
Outreachy provides three-month internships to work in Free and Open
Source Software (FOSS). Outreachy internship projects may include
programming, user experience, documentation, illustration, graphical
design, or data science. Interns often find employment after their
internship with Outreachy sponsors or jobs that use the skills they
learned during their internship.
Ceph has had ten projects submitted in Outreachy since 2018. Now we can
submit more projects for our May-August 2021 round!
The project can be coordinated on the etherpad:
Projects need to be submitted by the mentor here for approval:
Outreachy internships run twice a year. The internships run from May to
August and December to March. Interns are paid a stipend of $6,000 USD
for the three months of work.
Outreachy internships are entirely remote and are open to applicants
around the world. Interns work remotely with experienced mentors. We
expressly invite women (both cis and trans), trans men, and genderqueer
people to apply. We also expressly invite applications from residents
and nationals of the United States of any gender who are Black/African
American, Hispanic/Latin@, Native American/American Indian, Alaska
Native, Native Hawaiian, or Pacific Islander. Anyone who faces
under-representation, systematic bias, or discrimination in their
country's technology industry is invited to apply. More details and
eligibility criteria can be found here:
The next Outreachy internship round runs from May 24, 2021, until Aug. 24, 2021.
Initial applications are currently open. Initial applications are due on
Feb. 22, 2021, at 4 pm UTC. Apply today:
Applying to Outreachy is a little different than other internship
programs. You'll fill out an initial application. If your initial
application is approved, you'll move onto the five-week contribution
phase. During the contribution phase, you'll make contact with project
mentors and contribute to the project. Outreachy organizers have found
that the most vital applicants contact mentors early, ask many
questions, and continually submit contributions throughout the
contribution phase.
Please let Ali or me know if you have any questions about the program.
The Outreachy organizers (Karen Sandler, Sage Sharp, Marina
Zhurakhinskaya, Cindy Pallares, and Tony Sebro) can all be reached
through our contact form:
We hope you'll help us spread the word about Outreachy internships!
We've recently run into an issue where our single Ceph RBD pool is throwing errors for nearfull OSDs. The OSDs themselves vary in PGs / %full, with a low of 64 PGs / 78% and a high of 73 PGs / 86%. Are there any suggestions on how to get this to balance a little more cleanly? Currently we have 360 drives in a single pool with 8192 PGs. I think we may be able to double the PG num and that will balance things a bit more evenly, but I wanted to see if the community suggests anything else. Let me know if there's any further info I can provide to help sort this out.
CLASS SIZE AVAIL USED RAW USED %RAW USED
ssd 741 TiB 135 TiB 606 TiB 607 TiB 81.85
TOTAL 741 TiB 135 TiB 606 TiB 607 TiB 81.85
POOL ID STORED OBJECTS USED %USED MAX AVAIL
pool 1 162 TiB 46.81M 494 TiB 89.02 20 TiB
85 nearfull osd(s)
1 pool(s) nearfull
osd: 360 osds: 360 up (since 7d), 360 in (since 7d)
pools: 1 pools, 8192 pgs
objects: 46.81M objects, 169 TiB
usage: 607 TiB used, 135 TiB / 741 TiB avail
pgs: 8192 active+clean
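Before splitting PGs, it may be worth checking whether the upmap balancer is enabled; with 360 OSDs in a single pool it can usually even out a 64-73 PG spread without changing pg_num. A sketch — this requires that all clients speak Luminous or newer, so verify that first:

```shell
# upmap needs Luminous+ clients; this fails if older clients are connected
ceph osd set-require-min-compat-client luminous

# Enable the balancer in upmap mode
ceph balancer mode upmap
ceph balancer on

# Watch progress and the resulting per-OSD spread
ceph balancer status
ceph osd df
```

Upmap moves individual PGs to specific OSDs, so it typically achieves a much tighter utilization spread than reweighting, and the data movement happens gradually.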