Hello,
I made a mistake while deploying a new node on Octopus.
The node is a freshly installed CentOS 8 machine.
Before running "ceph orch host add node08" I pasted the wrong command:
ceph orch daemon add osd node08:cl_node08/ceph
That did not return anything, so I tried to add the node first with the host add command, but now I get an error:
Error ENOENT: New host node08 (node08) failed check: ['Traceback (most recent call last):', ' File "<stdin>", line 4580, in <module>', ' File "<stdin>", line 3592, in command_check_host', "UnboundLocalError: local variable 'container_path' referenced before assignment"]
I'm not a developer, so I don't know where to look or how to fix this.
I tried rebooting every node to see if it was just a cached problem, but no luck there.
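In case it helps anyone reproduce: I believe the same check can be run directly on the new node (a sketch; assuming the cephadm binary is present there):

# run the host check that "ceph orch host add" also performs
cephadm check-host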
Do any of you know how to fix this?
Thanks in advance,
Simon
Hello All,
We saw that the mon services on all nodes restarted at the same time after
enabling msgr2. Could this have an impact on a running production cluster?
We are upgrading from Luminous to Nautilus.
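For reference, these are the relevant steps we ran (a sketch):

# enable the msgr2 protocol once all mons run Nautilus
ceph mon enable-msgr2
# verify that both v1 and v2 addresses now appear in the monmap
ceph mon dump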
Thanks,
AmitG
Hi,
I have a single-node lab cluster with Octopus installed via ceph-ansible.
Both v1 and v2 were enabled in ceph-ansible vars with the correct suffixes.
The configuration was generated correctly and both ports were included in
the mon array.
[global]
cluster network = 172.16.6.0/24
fsid = bb204a5c-957d-4a06-a372-redacted
mon_host = [v2:172.16.6.210:3300/0,v1:172.16.6.210:6789/0]
mon initial members = aio1
mon_pg_warn_max_per_osd = 0
osd pool default crush rule = -1
osd_pool_default_min_size = 1
osd_pool_default_size = 1
public network = 172.16.6.0/24
I can also see that `ms_bind_msgr1` is enabled in the live config.
root@aio1 ~ # ceph daemon mon.aio1 config show | grep msgr
"mon_warn_on_msgr2_not_enabled": "true",
"ms_bind_msgr1": "true",
"ms_bind_msgr2": "true",
However, only v2 is bound:
netstat -tlnp | grep mon
tcp 0 0 172.16.6.210:3300 0.0.0.0:* LISTEN 2039098/ceph-mon
I have a client that only speaks v1 (ceph-csi), which can't talk to the v2
port:
2020-06-15T09:49:51.330+0100 7f8776038700 -1 --2- v2:172.16.6.210:3300/0 >>
conn(0x563bfd6b2000 0x563bde5ff600 unknown :-1 s=BANNER_ACCEPTING pgs=0
cs=0 l=0 rx=0 tx=0)._handle_peer_banner peer is using msgr V1 protocol
2020-06-15T09:49:52.258+0100 7f8776038700 -1 --2- v2:172.16.6.210:3300/0 >>
conn(0x563bfd6b2000 0x563bde5ff600 unknown :-1 s=BANNER_ACCEPTING pgs=0
cs=0 l=0 rx=0 tx=0)._handle_peer_banner peer is using msgr V1 protocol
What could be the reason for the mon not binding to port 6789?
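My understanding is that the mon binds to the addresses recorded in the monmap rather than to mon_host, so I checked it like this (a sketch; I haven't actually run the set-addrs step yet):

# show the addresses the monmap actually holds for the mon
ceph mon dump
# if only the v2 address is listed, this should add the v1 address as well
ceph mon set-addrs aio1 [v2:172.16.6.210:3300,v1:172.16.6.210:6789]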
Thanks
Miguel
Yep, you also need to tell the mgr which pools' RBD statistics you want to
export.
Follow this, https://ceph.io/rbd/new-in-nautilus-rbd-performance-monitoring/
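Roughly like this (a sketch; replace the pool name with your own):

# export per-image RBD stats for the listed pools via the prometheus module
ceph config set mgr mgr/prometheus/rbd_stats_pools "mypool"

After that, series like ceph_rbd_write_ops should appear in Prometheus.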
Marc Roos <M.Roos(a)f1-outsourcing.eu> wrote on Fri, Jun 12, 2020 at 10:33 PM:
>
> The grafana dashboard 'rbd overview' is empty. Queries have measurements
> 'ceph_rbd_write_ops' that do not exist in prometheus (I think). Should I
> enable something more than just 'ceph mgr module enable prometheus'?
>
> I am on Nautilus
>
Hi guys,
we have a Ceph cluster running Luminous 12.2.13, and recently we encountered a problem. Here is some log information:
2020-06-08 12:33:52.706070 7f4097e2d700 0 log_channel(cluster) log [WRN] : slow request 30.518930 seconds old, received at 2020-06-08 12:33:22.186924: client_request(client.48978906:941633993 create #0x100028cab8a/.filename 2020-06-08 12:33:22.197434 caller_uid=0, caller_gid=0{}) currently submit entry: journal_and_reply
...
2020-06-08 13:12:17.826727 7f4097e2d700 0 log_channel(cluster) log [WRN] : slow request 2220.991833 seconds old, received at 2020-06-08 12:35:16.764233: client_request(client.42390705:788369155 create #0x1000224f999/.filename 2020-06-08 12:35:16.774553 caller_uid=0, caller_gid=0{}) currently submit entry: journal_and_reply
It looks like the MDS can't flush its journal to the OSDs of the metadata pool, but those OSDs are SSDs and their load is very low. This problem means clients can't mount and the MDS can't trim its log.
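In case it helps, this is roughly what we checked on the active MDS (a sketch; replace <name> with the daemon name):

# list client requests currently stuck in the MDS
ceph daemon mds.<name> dump_ops_in_flight
# see whether journal writes to the metadata pool are still pending
ceph daemon mds.<name> objecter_requests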
Has anyone encountered this problem? Please help!
Hi all,
we have a cluster that started on Jewel and runs Octopus nowadays. We
would like to enable upmap, but unfortunately there are some old Jewel
clients still active. We cannot force it with "ceph osd
set-require-min-compat-client luminous" because the cluster is in
production and we must not lose any client. ;-)
"client": [
{
"features": "0x27018fb86aa42ada",
"release": "jewel",
"num": 7
},
{
"features": "0x3f01cfb8ffadffff",
"release": "luminous",
"num": 6
}
The cluster and all clients are v15.2.3, and my assumption was that
CentOS 7 with kernel 3.10 has backported kernel modules. Am I wrong?
I also checked a CentOS 7 client with a 4.20-ml kernel, without success.
Clients always appear as Jewel clients... A fresh CentOS 8 client runs as
a Luminous client, as expected.
BTW: Is there a trick to identify Jewel clients by IP address / Hostname?
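(What I tried so far for that, as a sketch, assuming admin socket access on a mon:)

# list mon sessions, including client addresses and feature bits
ceph daemon mon.<id> sessions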
Thank you much,
Christoph
--
Christoph Ackermann | System Engineer
INFOSERVE GmbH | Am Felsbrunnen 15 | D-66119 Saarbrücken
Fon +49 (0)681 88008-59 | Fax +49 (0)681 88008-33 | mailto:C.Ackermann@infoserve.de | https://www.infoserve.de
INFOSERVE Datenschutzhinweise: https://infoserve.de/datenschutz
Handelsregister: Amtsgericht Saarbrücken, HRB 11001 | Erfüllungsort: Saarbrücken
Geschäftsführer: Dr. Stefan Leinenbach | Ust-IdNr.: DE168970599
Hi, I have a Luminous (12.2.25) cluster with several OSDs down. The daemons start, but they are reported as down. I did see in some OSD logs that heartbeats were failing, but when I checked, the ports used for the heartbeats were wrong for that OSD, although another OSD was listening on them. How does an OSD know which ports to ping other OSDs on? Is there any way to force an update?
The reason this happened is that someone took a VM snapshot of this cluster and restored it, so the OSDs aren't up. I know this isn't a good implementation or a good idea, and this will change going forward.
Anyway, I was just wondering about the heartbeat issue and whether attempting to ping on the right ports might bring them up.
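For what it's worth, my understanding is that the heartbeat addresses each OSD advertises are recorded in the OSD map, which can be checked like this (a sketch; osd.12 is an example id):

# show the addresses, including heartbeat ports, registered for an OSD
ceph osd find 12
ceph osd dump | grep "osd.12"

Restarting an OSD should make it re-register its current ports with the mons.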
Thanks,
Neil.
Hello
I'm running ceph 14.2.9.
During heavy backfilling due to rebalancing, one OSD crashed.
I want to recover the data from the lost OSD before continuing the
backfilling, so I out'ed the lost OSD and ran "ceph osd set norebalance".
But I'm noticing that with the norebalance flag set, the system does not
backfill the undersized PGs, only the degraded ones. So now I have plenty
of undersized PGs and the system is idle.
How can I recover the undersized PGs before resuming normal
backfilling/rebalancing?
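(One thing I'm considering, as a sketch, though I'm not sure whether these override the norebalance flag:)

# bump the recovery/backfill priority of specific undersized PGs
ceph pg force-recovery <pgid>
ceph pg force-backfill <pgid>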
Regards,
Kári
You can calculate the difference in PG counts per OSD before and after a change to estimate the amount of data that will be migrated.
Using the CRUSH algorithm, that difference can be computed without having to actually add or remove an OSD; a sketch follows.
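(Roughly like this; osd.12 and the path are examples, and I believe osdmaptool supports --mark-out:)

# grab the current osdmap
ceph osd getmap -o /tmp/osdmap
# PG-per-OSD distribution as it is now
osdmaptool /tmp/osdmap --test-map-pgs
# the same distribution with osd.12 marked out, computed offline
osdmaptool /tmp/osdmap --mark-out 12 --test-map-pgs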
> Date: Thu, 18 Jun 2020 01:18:30 +0430
> From: Seena Fallah <seenafallah(a)gmail.com>
> Subject: [ceph-users] Re: Calculate recovery time
> To: Janne Johansson <icepic.dz(a)gmail.com>
> Cc: ceph-users <ceph-users(a)ceph.io>
> Message-ID:
> <CAK3+OmWxDZf_g0Ok5AEgtLWP+EujrwAQjauxx6J=xANmM7xchA(a)mail.gmail.com>
> Content-Type: text/plain; charset="UTF-8"
>
> Yes, I know, but is there any point of view on the backfill or recovery
> priorities used in Ceph when recovering?
>
> On Wed, Jun 17, 2020 at 11:00 AM Janne Johansson <icepic.dz(a)gmail.com>
> wrote:
>
> > On Wed, 17 Jun 2020 at 02:14, Seena Fallah <seenafallah(a)gmail.com> wrote:
> >
> >> Hi all.
> >> Is there any way that I could calculate how much time it takes to add
> >> OSD to my cluster and get rebalanced or how much it takes to out OSD
> >> from my cluster?
> >>
> >
> > This is very dependent on all the variables of a cluster, from controller
> > & disk speeds, network speeds, cpu/bus speeds, ram availability and/or ram
> > allocation, the amount of copies the PGs and the pools are using, how many
> > other OSDs there are in the same crush rules as the missing/new one, how
> > full the OSDs are in general and the out'ed one specifically, and of course
> > on if you have few huge objects in your datasets or if you have millions of
> > small ones. On top of that, it would be affected by the amount of client IO
> > being done at the same time, and in some small sense, might even depend
> > ever so slightly on the ability of the mons to react to changes for its own
> > database in case the mons are super slow.
> >
> > This would probably be why you will not just find a fixed number saying
> > "it will always take 5h45m for a 4TB drive". It is a problem that has 10 or
> > more dimensions.
> > But, you could always just out one. The cluster must be able to handle a
> > broken drive, so you might as well test it now, instead of some weekend
> > night before that important database run someone at work needs done.
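> >
> > For example (a sketch; osd.12 is an example id):
> >
> > ceph osd out 12   # simulate losing a drive
> > ceph -w           # watch and time the recovery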
> >
> > You will see drives that break at some point, and if your dataset is
> > anything like everyone else's over the last 50 or so years, your data will grow
> > so you just might want to get used to the "replace disk" and "add disk"
> > procedures right now.
> >