Really not sure where to go with this one. Firstly, a description of my cluster. Yes, I know there are a lot of "not ideals" here, but this is what I inherited.
The cluster is running Jewel and has two storage/mon nodes and an additional mon-only node, with a pool size of 2. Today we had some power issues in the data centre and very ungracefully lost both storage servers at the same time. Node 1 came back online before node 2, but I could see there were a few OSDs down. When node 2 came back, I started trying to get OSDs up. Each node has 14 OSDs, and I managed to get all OSDs up and in on node 2, but one of the OSDs on node 1 keeps starting and crashing and just won't stay up. I'm not finding the OSD log output to be much use. The current health status looks like this:
# ceph health
HEALTH_ERR 26 pgs are stuck inactive for more than 300 seconds; 26 pgs down; 26 pgs peering; 26 pgs stuck inactive; 26 pgs stuck unclean; 5 requests are blocked > 32 sec
# ceph status
    cluster e2391bbf-15e0-405f-af12-943610cb4909
     health HEALTH_ERR
            26 pgs are stuck inactive for more than 300 seconds
            26 pgs down
            26 pgs peering
            26 pgs stuck inactive
            26 pgs stuck unclean
            5 requests are blocked > 32 sec
Any clues as to what I should be looking for or what sort of action I should be taking to troubleshoot this? Unfortunately, I'm a complete novice with Ceph.
Here's a snippet from the OSD log that means little to me...
--- begin dump of recent events ---
0> 2021-04-16 12:25:10.169340 7f2e23921ac0 -1 *** Caught signal (Aborted) **
in thread 7f2e23921ac0 thread_name:ceph-osd
ceph version 10.2.11 (e4b061b47f07f583c92a050d9e84b1813a35671e)
1: (()+0x9f1c2a) [0x7f2e24330c2a]
2: (()+0xf5d0) [0x7f2e21ee95d0]
3: (gsignal()+0x37) [0x7f2e2049f207]
4: (abort()+0x148) [0x7f2e204a08f8]
5: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x267) [0x7f2e2442fd47]
6: (FileJournal::read_entry(ceph::buffer::list&, unsigned long&, bool*)+0x90c) [0x7f2e2417bc7c]
7: (JournalingObjectStore::journal_replay(unsigned long)+0x1ee) [0x7f2e240c8dce]
8: (FileStore::mount()+0x3cd6) [0x7f2e240a0546]
9: (OSD::init()+0x27d) [0x7f2e23d5828d]
10: (main()+0x2c18) [0x7f2e23c71088]
11: (__libc_start_main()+0xf5) [0x7f2e2048b3d5]
12: (()+0x3c8847) [0x7f2e23d07847]
NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
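From what I can tell, the assert in FileJournal::read_entry during journal_replay suggests the on-disk journal for that OSD is corrupt, so it dies every time it tries to replay the journal on mount. Would something like the following be a sane way forward (osd.5 is just a placeholder for the broken OSD's id), given that the second copy of the data should be on node 2? My understanding is that recreating the journal throws away any writes that were only in the journal:
# ceph health detail
# ceph pg dump_stuck inactive
# ceph-osd -i 5 --mkjournal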
Thanks in advance,
Mark
Hi,
I am trying to follow this URL, https://docs.ceph.com/en/latest/radosgw/s3/bucketops/#create-notification,
to create a notification on my bucket that publishes to a topic.
My curl:
curl -v -H 'Date: Fri, 16 Apr 2021 05:21:14 +0000' -H 'Authorization: AWS accessid:secretkey' -L -H 'content-type: text/xml' -H 'Content-MD5: pBRX39Oo7aAUYbilIYMoAw==' -T notif.xml http://ceph:8080/vig-test?notification
and it returns this error:
<?xml version="1.0" encoding="UTF-8"?>
<Error>
<Code>NoSuchKey</Code>
<BucketName>vig-test</BucketName>
<RequestId>tx0000000000000016ac570-0060791ecb-1c7e96b-hkg</RequestId>
<HostId>1c7e96b-hkg-data</HostId>
</Error>
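In case the request body matters, notif.xml follows the NotificationConfiguration format from that docs page and looks roughly like this (the id and topic names here are placeholders):
<NotificationConfiguration xmlns="http://s3.amazonaws.com/doc/2006-03-01/">
    <TopicConfiguration>
        <Id>notif1</Id>
        <Topic>arn:aws:sns:default::mytopic</Topic>
        <Event>s3:ObjectCreated:*</Event>
    </TopicConfiguration>
</NotificationConfiguration>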
Does anybody know what this error means in Ceph? How can I proceed?
Thank you
Hello,
I want to deploy a new Ceph Octopus cluster using cephadm on the arm64 architecture, but unfortunately the ceph/ceph-grafana docker image for arm64 is missing.
Is this mailing list the right place to report this, or where should I report it?
Best regards,
Mabi
Hi,
I have several clusters running Nautilus that are pending an upgrade to Octopus.
I am now testing the upgrade steps from Nautilus to Octopus using cephadm adopt in a lab, following the link below:
- https://docs.ceph.com/en/octopus/cephadm/adoption/
Lab environment:
3 all-in-one nodes.
OS: CentOS 7.9.2009 with podman 1.6.4.
After the adoption, ceph health keeps warning that tcmu-runner is not managed by cephadm.
# ceph health detail
HEALTH_WARN 12 stray daemon(s) not managed by cephadm; 1 pool(s) have no replicas configured
[WRN] CEPHADM_STRAY_DAEMON: 12 stray daemon(s) not managed by cephadm
    stray daemon tcmu-runner.ceph-aio1:iSCSI/iscsi_image_01 on host ceph-aio1 not managed by cephadm
    stray daemon tcmu-runner.ceph-aio1:iSCSI/iscsi_image_02 on host ceph-aio1 not managed by cephadm
    stray daemon tcmu-runner.ceph-aio1:iSCSI/iscsi_image_03 on host ceph-aio1 not managed by cephadm
    stray daemon tcmu-runner.ceph-aio1:iSCSI/iscsi_image_test on host ceph-aio1 not managed by cephadm
    stray daemon tcmu-runner.ceph-aio2:iSCSI/iscsi_image_01 on host ceph-aio2 not managed by cephadm
    stray daemon tcmu-runner.ceph-aio2:iSCSI/iscsi_image_02 on host ceph-aio2 not managed by cephadm
    stray daemon tcmu-runner.ceph-aio2:iSCSI/iscsi_image_03 on host ceph-aio2 not managed by cephadm
    stray daemon tcmu-runner.ceph-aio2:iSCSI/iscsi_image_test on host ceph-aio2 not managed by cephadm
    stray daemon tcmu-runner.ceph-aio3:iSCSI/iscsi_image_01 on host ceph-aio3 not managed by cephadm
    stray daemon tcmu-runner.ceph-aio3:iSCSI/iscsi_image_02 on host ceph-aio3 not managed by cephadm
    stray daemon tcmu-runner.ceph-aio3:iSCSI/iscsi_image_03 on host ceph-aio3 not managed by cephadm
    stray daemon tcmu-runner.ceph-aio3:iSCSI/iscsi_image_test on host ceph-aio3 not managed by cephadm
And tcmu-runner is still running the old version:
# ceph versions
{
    "mon": {
        "ceph version 15.2.10 (27917a557cca91e4da407489bbaa64ad4352cc02) octopus (stable)": 3
    },
    "mgr": {
        "ceph version 15.2.10 (27917a557cca91e4da407489bbaa64ad4352cc02) octopus (stable)": 1
    },
    "osd": {
        "ceph version 15.2.10 (27917a557cca91e4da407489bbaa64ad4352cc02) octopus (stable)": 9
    },
    "mds": {},
    "tcmu-runner": {
        "ceph version 14.2.18 (befbc92f3c11eedd8626487211d200c0b44786d9) nautilus (stable)": 12
    },
    "overall": {
        "ceph version 14.2.18 (befbc92f3c11eedd8626487211d200c0b44786d9) nautilus (stable)": 12,
        "ceph version 15.2.10 (27917a557cca91e4da407489bbaa64ad4352cc02) octopus (stable)": 13
    }
}
I didn't find any ceph-iscsi-related upgrade steps in the reference link above.
Can anyone point me in the right direction for upgrading ceph-iscsi?
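For now, if I read the cephadm docs right, I could silence the warning with something like:
# ceph config set mgr mgr/cephadm/warn_on_stray_daemons false
but that would only hide the problem; tcmu-runner would still be running the Nautilus version.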
Thanks.
Regs,
Icy
In the thread "s3 requires twice the space it should use", Boris pointed out that the fragmentation on his OSDs is around 0.8-0.9:
> On Thu, Apr 15, 2021 at 8:06 PM Boris Behrens <bb(a)kervyn.de> wrote:
>> I also checked the fragmentation on the bluestore OSDs and it is around
>> 0.80 - 0.89 on most OSDs. yikes.
>> [root@s3db1 ~]# ceph daemon osd.23 bluestore allocator score block
>> {
>> "fragmentation_rating": 0.85906054329923576
>> }
And that made me wonder: what is the currently recommended (and not recommended) way to handle and reduce fragmentation on existing OSDs?
Reading around, I would think of tweaking min_alloc_size_{ssd,hdd} and redeploying those OSDs, but I was unable to find much else. What do people do?
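Concretely, I assume the min_alloc_size route would look roughly like the following (4096 is just an example value, and the setting only takes effect when an OSD's BlueStore is re-created), repeated for each OSD in turn:
# ceph config set osd bluestore_min_alloc_size_hdd 4096
# ceph osd out 23
(wait for the data to drain, then destroy and redeploy osd.23)
But maybe there is something less invasive?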
P.S. There was another thread asking something similar (and a bunch of other things) that got no replies:
https://lists.ceph.io/hyperkitty/list/ceph-users@ceph.io/message/3PITWZRNX7…