Hi,
I just wanted to ask: is it intentional that
http://ceph.com/pgcalc/
results in a 404 error?
Is there an alternative URL? It is still linked from the official docs.
with kind regards
Dominik
Is this happening to anyone else? After this command:
ceph health mute AUTH_INSECURE_GLOBAL_ID_RECLAIM_ALLOWED 2w
The dashboard shows 'Health OK', then after a few hours (perhaps after a
mon leadership change) it's back to 'degraded' with
'AUTH_INSECURE_GLOBAL_ID_RECLAIM_ALLOWED: mons are allowing insecure
global_id reclaim'.
Pacific 16.2.4, all in Docker containers.
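For reference, these are the two things I plan to try next, though I haven't confirmed that either stops the flapping: re-applying the mute with --sticky so it isn't cleared automatically, or (once every client is patched) turning off insecure reclaim altogether:
ceph health mute AUTH_INSECURE_GLOBAL_ID_RECLAIM_ALLOWED 2w --sticky
ceph config set mon auth_allow_insecure_global_id_reclaim false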
Hi,
I want to get the cluster logs into Graylog, but it seems Ceph sends an empty
"host" field. Can anyone help?
Ceph 16.2.3
# ceph config dump | grep graylog
global advanced clog_to_graylog true
global advanced clog_to_graylog_host xx.xx.xx.xx
global basic err_to_graylog true
global basic log_graylog_host xx.xx.xx.xx *
global basic log_to_graylog true
I can see that Graylog is receiving traffic from Ceph on UDP port 12201 and
that it is parsed by the GELF UDP input.
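For what it's worth, this is roughly how I intend to capture a few of those packets to confirm the empty field on the wire (assuming the GELF payload is plain or zlib-compressed JSON):
tcpdump -i any -c 5 -X 'udp and dst port 12201'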
Graylog logs:
2021-07-01 12:16:57,355 ERROR: org.graylog2.shared.buffers.processors.DecodingProcessor - Error processing message RawMessage{id=3ad5c6a1-da66-11eb-a55c-0242ac120005, messageQueueId=2810784, codec=gelf, payloadSize=340, timestamp=2021-07-01T12:16:57.354Z, remoteAddress=/xx.xx.xx.xx:34049}
java.lang.IllegalArgumentException: GELF message <3ad5c6a1-da66-11eb-a55c-0242ac120005> (received from <xx.xx.xx.xx:34049>) has empty mandatory "host" field.
    at org.graylog2.inputs.codecs.GelfCodec.validateGELFMessage(GelfCodec.java:247) ~[graylog.jar:?]
    at org.graylog2.inputs.codecs.GelfCodec.decode(GelfCodec.java:140) ~[graylog.jar:?]
    at org.graylog2.shared.buffers.processors.DecodingProcessor.processMessage(DecodingProcessor.java:153) ~[graylog.jar:?]
    at org.graylog2.shared.buffers.processors.DecodingProcessor.onEvent(DecodingProcessor.java:94) [graylog.jar:?]
    at org.graylog2.shared.buffers.processors.ProcessBufferProcessor.onEvent(ProcessBufferProcessor.java:90) [graylog.jar:?]
    at org.graylog2.shared.buffers.processors.ProcessBufferProcessor.onEvent(ProcessBufferProcessor.java:47) [graylog.jar:?]
    at com.lmax.disruptor.WorkProcessor.run(WorkProcessor.java:143) [graylog.jar:?]
    at com.codahale.metrics.InstrumentedThreadFactory$InstrumentedRunnable.run(InstrumentedThreadFactory.java:66) [graylog.jar:?]
    at java.lang.Thread.run(Thread.java:748) [?:1.8.0_292]
Best regards, Milosz
I'm still attempting to build a Ceph cluster, and I'm currently getting
nowhere very, very quickly. From what I can tell I have a slightly unstable
setup, and I've yet to work out why.
I currently have 24 servers and I'm planning to increase this to around 48.
The servers are in three groups, each group with a different type and number
of disks.
Currently I'm having an issue where, every time I add a new server, the OSDs
get created on the new node and then a few random OSDs on the existing hosts
all fall over; I can only get them up again by restarting the daemons.
I'm using cephadm, and the network is a QDR-based InfiniBand network running
IP over IB, so it should be 40G, but when I've tested it it behaves more like
10G. It's still faster than the 1G management network I've also got.
The machines are mostly running Debian. There are a few machines running
CentOS 7 that I mean to redeploy when I get the time (so I can upgrade to
Pacific).
I'm running Octopus 15.2.13. I'm more than happy to change things; I'm still
learning, so there is no data I care about quite yet, and I was looking for
more stability before I get to that point.
I really just want to know where to look for the problem rather than get
exact answers; I've yet to see any clues that might help.
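For context, these are the commands I've been using so far to look for clues (happy to be pointed at better ones; osd.12 and <fsid> below are just placeholders):
ceph crash ls                      # any daemon crashes recorded by the crash module
ceph -w                            # watch the cluster log while adding a host
cephadm logs --name osd.12         # container logs for one OSD, run on its host
journalctl -u ceph-<fsid>@osd.12   # the same unit via systemd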
Thanks in advance
Peter Childs
Hello,
After a rolling reboot of my 8-node Octopus 15.2.13 cluster, cephadm no longer finds python3 on the nodes, and hence I get quite a few of the following warnings:
[WRN] CEPHADM_HOST_CHECK_FAILED: 7 hosts fail cephadm check
host ceph1f failed check: Can't communicate with remote host `ceph1f`, possibly because python3 is not installed there: [Errno 32] Broken pipe
Here is the full stack trace from cephadm:
2021-07-06T06:03:20.798410+0000 mgr.ceph1a.xxqpph [ERR] Failed to apply osd.all-available-devices spec DriveGroupSpec(name=all-available-devices->placement=PlacementSpec(host_pattern='*'), service_id='all-available-devices', service_type='osd', data_devices=DeviceSelection(all=True), osd_id_claims={}, unmanaged=False, filter_logic='AND', preview_only=False): Can't communicate with remote host `ceph1d`, possibly because python3 is not installed there: [Errno 32] Broken pipe
Traceback (most recent call last):
File "/usr/share/ceph/mgr/cephadm/module.py", line 1015, in _remote_connection
conn, connr = self._get_connection(addr)
File "/usr/share/ceph/mgr/cephadm/module.py", line 978, in _get_connection
sudo=True if self.ssh_user != 'root' else False)
File "/lib/python3.6/site-packages/remoto/backends/__init__.py", line 34, in __init__
self.gateway = self._make_gateway(hostname)
File "/lib/python3.6/site-packages/remoto/backends/__init__.py", line 44, in _make_gateway
self._make_connection_string(hostname)
File "/lib/python3.6/site-packages/execnet/multi.py", line 134, in makegateway
gw = gateway_bootstrap.bootstrap(io, spec)
File "/lib/python3.6/site-packages/execnet/gateway_bootstrap.py", line 102, in bootstrap
bootstrap_exec(io, spec)
File "/lib/python3.6/site-packages/execnet/gateway_bootstrap.py", line 46, in bootstrap_exec
"serve(io, id='%s-slave')" % spec.id,
File "/lib/python3.6/site-packages/execnet/gateway_bootstrap.py", line 78, in sendexec
io.write((repr(source) + "\n").encode("ascii"))
File "/lib/python3.6/site-packages/execnet/gateway_base.py", line 409, in write
self._write(data)
BrokenPipeError: [Errno 32] Broken pipe
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/usr/share/ceph/mgr/cephadm/module.py", line 1019, in _remote_connection
raise execnet.gateway_bootstrap.HostNotFound(msg)
execnet.gateway_bootstrap.HostNotFound: Can't communicate with remote host `ceph1d`, possibly because python3 is not installed there: [Errno 32] Broken pipe
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "/usr/share/ceph/mgr/cephadm/serve.py", line 412, in _apply_all_services
if self._apply_service(spec):
File "/usr/share/ceph/mgr/cephadm/serve.py", line 450, in _apply_service
self.mgr.osd_service.create_from_spec(cast(DriveGroupSpec, spec))
File "/usr/share/ceph/mgr/cephadm/services/osd.py", line 51, in create_from_spec
ret = create_from_spec_one(self.prepare_drivegroup(drive_group))
File "/usr/share/ceph/mgr/cephadm/utils.py", line 65, in forall_hosts_wrapper
return CephadmOrchestrator.instance._worker_pool.map(do_work, vals)
File "/lib64/python3.6/multiprocessing/pool.py", line 266, in map
return self._map_async(func, iterable, mapstar, chunksize).get()
File "/lib64/python3.6/multiprocessing/pool.py", line 644, in get
raise self._value
File "/lib64/python3.6/multiprocessing/pool.py", line 119, in worker
result = (True, func(*args, **kwds))
File "/lib64/python3.6/multiprocessing/pool.py", line 44, in mapstar
return list(map(*args))
File "/usr/share/ceph/mgr/cephadm/utils.py", line 59, in do_work
return f(*arg)
File "/usr/share/ceph/mgr/cephadm/services/osd.py", line 47, in create_from_spec_one
host, cmd, replace_osd_ids=osd_id_claims.get(host, []), env_vars=env_vars
File "/usr/share/ceph/mgr/cephadm/services/osd.py", line 56, in create_single_host
out, err, code = self._run_ceph_volume_command(host, cmd, env_vars=env_vars)
File "/usr/share/ceph/mgr/cephadm/services/osd.py", line 271, in _run_ceph_volume_command
error_ok=True)
File "/usr/share/ceph/mgr/cephadm/module.py", line 1100, in _run_cephadm
with self._remote_connection(host, addr) as tpl:
File "/lib64/python3.6/contextlib.py", line 81, in __enter__
return next(self.gen)
File "/usr/share/ceph/mgr/cephadm/module.py", line 1046, in _remote_connection
raise OrchestratorError(msg) from e
orchestrator._interface.OrchestratorError: Can't communicate with remote host `ceph1d`, possibly because python3 is not installed there: [Errno 32] Broken pipe
I checked directly on the nodes: I can execute the "python3" command, and I can also SSH into all nodes with the following test command:
ssh -F ssh_config -i ~/cephadm_private_key root@nodeX
So I don't really understand what could have broken the cephadm orchestrator... Any ideas? The CephFS itself is still working.
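In case it's relevant, these are the orchestrator-side checks I plan to run next (I'm not sure they will show more than the traceback already does):
ceph cephadm check-host ceph1d     # let the orchestrator probe the host itself
ceph cephadm get-ssh-config        # confirm the SSH config the mgr actually uses
ceph cephadm get-pub-key           # and the public key it presents
ceph mgr fail ceph1a.xxqpph        # fail over the active mgr to force fresh connections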
Best regards,
Mabi
Hi All,
I have an already created and functional Ceph cluster (latest Luminous
release) with two networks: one for the public network (layer 2+3) and the
other for the cluster network. The public network uses VLANs on 10GbE and the
cluster network uses InfiniBand at 56Gb/s; the cluster works fine. The public
network runs on Juniper QFX5100 switches in a layer 2+3 VLAN configuration,
but the network team needs to move to full layer 3 and wants to use BGP. So
the question is: how can we move to that scheme? What are the considerations?
Is it possible? Is there any step-by-step way to move to that scheme? And is
there anything better than BGP, or other alternatives?
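For reference, the relevant part of our ceph.conf looks roughly like this (the subnets below are placeholders). My assumption is that as long as these subnets stay reachable end to end, the switch-side move from layer 2 to layer 3/BGP should be transparent to Ceph, but I would like that confirmed:
[global]
public_network  = 192.168.10.0/24   # placeholder for the 10GbE VLAN subnet
cluster_network = 192.168.20.0/24   # placeholder for the IPoIB subnet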
Any information would be really helpful.
Thanks in advance,
Cheers,
German
In the week since upgrading one of our clusters from Nautilus 14.2.21 to Pacific 16.2.4, I've seen four spurious read errors that always have the same bad checksum of 0x6706be76. I've never seen this in any of our clusters before. Here's an example of what I'm seeing in the logs:
ceph-osd.132.log:2021-06-20T22:53:20.584-0400 7fde2e4fc700 -1 bluestore(/var/lib/ceph/osd/ceph-132) _verify_csum bad crc32c/0x1000 checksum at blob offset 0x0, got 0x6706be76, expected 0xee74a56a, device location [0x18c81b40000~1000], logical extent 0x200000~1000, object #29:2d8210bf:::rbd_data.94f4232ae8944a.0000000000026c57:head#
I'm not seeing any indication of inconsistent PGs, only the spurious read error. I don't see an explicit indication of a retry in the logs following the above message. Bluestore code to retry three times was introduced in 2018 following a similar issue with the same checksum: https://tracker.ceph.com/issues/22464
Here's an example of what my health detail looks like:
HEALTH_WARN 1 OSD(s) have spurious read errors
[WRN] BLUESTORE_SPURIOUS_READ_ERRORS: 1 OSD(s) have spurious read errors
osd.117 reads with retries: 1
I followed this (unresolved) thread, too: https://lists.ceph.io/hyperkitty/list/ceph-users@ceph.io/thread/DRBVFQLZ5ZY…
I do have swap enabled, but I don't think memory pressure is an issue with 30GB available out of 96GB (and no sign I've been close to summoning the OOMkiller). The OSDs that have thrown the cluster into HEALTH_WARN with the spurious read errors are busy 12TB rotational HDDs and I _think_ it's only happening during a deep scrub. We're on Ubuntu 18.04; uname: 5.4.0-74-generic #83~18.04.1-Ubuntu SMP Tue May 11 16:01:00 UTC 2021 x86_64 x86_64 x86_64 GNU/Linux.
Does Pacific retry three times on a spurious read error? Would I see an indication of a retry in the logs?
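In case it helps anyone answer, this is how I've been checking the retry counter on the flagged OSD (assuming the BlueStore perf counter is still called bluestore_reads_with_retries in Pacific):
ceph daemon osd.117 perf dump | grep -i retries   # run on the host where osd.117 lives
ceph config set osd.117 debug_bluestore 10        # temporarily, to try to catch the retry path in the log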
Thanks!
~Jay