Hi,
I am trying to set up a new cluster with cephadm using a docker backend.
The initial bootstrap did not finish cleanly; it errored out waiting for the mon IP. I used the command:
cephadm bootstrap --mon-ip 192.168.0.1
with 192.168.0.1 being the IP address of this first host.
I tried the command again, but it failed because the new Ceph daemons were actually already running, so it could not bind to the ports.
After a bit of searching I was able to use "sudo cephadm shell --" commands to change the username and password for the dashboard and log in to it.
I then used cephadm to add a new host with "sudo cephadm shell -- ceph orch host add host2".
Now, both in the dashboard inventory and in "ceph orch device ls", only devices on host2 are listed, not those on host1.
In the Cluster/Hosts section of the dashboard host1 has its root volume drive listed in devices, and host2 has the root volume drive and drive for the OSD listed.
I successfully added an OSD on a drive on host2; trying the same command adjusted for host1, I get the following in the log:
Dec 23 08:55:47 localhost systemd[1]: var-lib-docker-overlay2-91e9dffa86c333353dd6b445021c852d7ce8da6237d0d4d95909d68ef3d4fe23\x2dinit-merged.mount: Succeeded.
Dec 23 08:55:47 localhost systemd[24638]: var-lib-docker-overlay2-91e9dffa86c333353dd6b445021c852d7ce8da6237d0d4d95909d68ef3d4fe23\x2dinit-merged.mount: Succeeded.
Dec 23 08:55:47 localhost containerd[1470]: time="2020-12-23T08:55:47.369773808Z" level=info msg="shim containerd-shim started" address=/containerd-shim/80f876072532ebebdfef341a5c793654e27766f2d1708991a6f25599b24b6557.sock debug=false pid=28597
Dec 23 08:55:47 localhost bash[8745]: debug 2020-12-23T08:55:47.517+0000 ffff73d7a200 1 mon.host1(a)0(leader).osd e12 _set_new_cache_sizes cache_size:1020054731 inc_alloc: 71303168 full_alloc: 71303168 kv_alloc: 876609536
Dec 23 08:55:47 localhost containerd[1470]: time="2020-12-23T08:55:47.621748606Z" level=info msg="shim reaped" id=69a786e4a61605c1e6eca5a6e0e5ed0900635a214b0f1c96a4f26ea7911a12ff
Dec 23 08:55:47 localhost dockerd[2930]: time="2020-12-23T08:55:47.631479207Z" level=info msg="ignoring event" module=libcontainerd namespace=moby topic=/tasks/delete type="*events.TaskDelete"
Dec 23 08:55:47 localhost systemd[24638]: var-lib-docker-overlay2-91e9dffa86c333353dd6b445021c852d7ce8da6237d0d4d95909d68ef3d4fe23-merged.mount: Succeeded.
Dec 23 08:55:47 localhost systemd[1]: var-lib-docker-overlay2-91e9dffa86c333353dd6b445021c852d7ce8da6237d0d4d95909d68ef3d4fe23-merged.mount: Succeeded.
Dec 23 08:55:47 localhost systemd[24638]: var-lib-docker-overlay2-64bb135bc0cdab187566992dc9870068dee1430062e1a2b484381c19e03da895\x2dinit-merged.mount: Succeeded.
Dec 23 08:55:47 localhost systemd[1]: var-lib-docker-overlay2-64bb135bc0cdab187566992dc9870068dee1430062e1a2b484381c19e03da895\x2dinit-merged.mount: Succeeded.
Dec 23 08:55:47 localhost containerd[1470]: time="2020-12-23T08:55:47.972437378Z" level=info msg="shim containerd-shim started" address=/containerd-shim/4a61d63e1f46722ffa7a950c31145d167c5c69087d003e5928a6aa3a4831f031.sock debug=false pid=28659
Dec 23 08:55:48 localhost bash[8745]: cluster 2020-12-23T08:55:46.892633+0000 mgr.host1.kkssvi (mgr.24098) 24278 : cluster [DBG] pgmap v24212: 1 pgs: 1 undersized+peered; 0 B data, 112 KiB used, 931 GiB / 932 GiB avail
Dec 23 08:55:48 localhost bash[8756]: debug 2020-12-23T08:55:48.889+0000 ffff93573700 0 log_channel(cluster) log [DBG] : pgmap v24213: 1 pgs: 1 undersized+peered; 0 B data, 112 KiB used, 931 GiB / 932 GiB avail
Dec 23 08:55:49 localhost bash[8756]: debug 2020-12-23T08:55:49.085+0000 ffff9056f700 0 log_channel(audit) log [DBG] : from='client.24206 -' entity='client.admin' cmd=[{"prefix": "orch daemon add osd", "svc_arg": "host1:/dev/nvme0n1", "target": ["mon-mgr", ""]}]: dispatch
Dec 23 08:55:49 localhost bash[8745]: debug 2020-12-23T08:55:49.085+0000 ffff71575200 0 mon.host1@0(leader) e2 handle_command mon_command({"prefix": "osd tree", "states": ["destroyed"], "format": "json"} v 0) v1
Dec 23 08:55:49 localhost bash[8745]: debug 2020-12-23T08:55:49.085+0000 ffff71575200 0 log_channel(audit) log [DBG] : from='mgr.24098 192.168.0.1:0/2486989775' entity='mgr.host1.kkssvi' cmd=[{"prefix": "osd tree", "states": ["destroyed"], "format": "json"}]: dispatch
Dec 23 08:55:49 localhost bash[8756]: debug 2020-12-23T08:55:49.089+0000 ffff8ed6d700 0 log_channel(cephadm) log [INF] : Found osd claims -> {}
Dec 23 08:55:49 localhost bash[8756]: debug 2020-12-23T08:55:49.089+0000 ffff8ed6d700 0 log_channel(cephadm) log [INF] : Found osd claims for drivegroup None -> {}
Dec 23 08:55:49 localhost containerd[1470]: time="2020-12-23T08:55:49.331868093Z" level=info msg="shim reaped" id=780a38dd49fce4a823c4c3d834abdd1cc17bbe0c0aa4f2dd7caeddf8dce1708e
Dec 23 08:55:49 localhost dockerd[2930]: time="2020-12-23T08:55:49.341765820Z" level=info msg="ignoring event" module=libcontainerd namespace=moby topic=/tasks/delete type="*events.TaskDelete"
Dec 23 08:55:49 localhost systemd[24638]: var-lib-docker-overlay2-64bb135bc0cdab187566992dc9870068dee1430062e1a2b484381c19e03da895-merged.mount: Succeeded.
Dec 23 08:55:49 localhost systemd[1]: var-lib-docker-overlay2-64bb135bc0cdab187566992dc9870068dee1430062e1a2b484381c19e03da895-merged.mount: Succeeded.
Dec 23 08:55:49 localhost bash[8745]: audit 2020-12-23T08:55:49.091014+0000 mon.host1 (mon.0) 1093 : audit [DBG] from='mgr.24098 192.168.0.1:0/2486989775' entity='mgr.host1.kkssvi' cmd=[{"prefix": "osd tree", "states": ["destroyed"], "format": "json"}]: dispatch
Dec 23 08:55:50 localhost bash[8745]: cluster 2020-12-23T08:55:48.893433+0000 mgr.host1.kkssvi (mgr.24098) 24279 : cluster [DBG] pgmap v24213: 1 pgs: 1 undersized+peered; 0 B data, 112 KiB used, 931 GiB / 932 GiB avail
Dec 23 08:55:50 localhost bash[8745]: audit 2020-12-23T08:55:49.087597+0000 mgr.host1.kkssvi (mgr.24098) 24280 : audit [DBG] from='client.24206 -' entity='client.admin' cmd=[{"prefix": "orch daemon add osd", "svc_arg": "host1:/dev/nvme0n1", "target": ["mon-mgr", ""]}]: dispatch
Dec 23 08:55:50 localhost bash[8745]: cephadm 2020-12-23T08:55:49.093552+0000 mgr.host1.kkssvi (mgr.24098) 24281 : cephadm [INF] Found osd claims -> {}
Dec 23 08:55:50 localhost bash[8745]: cephadm 2020-12-23T08:55:49.093933+0000 mgr.host1.kkssvi (mgr.24098) 24282 : cephadm [INF] Found osd claims for drivegroup None -> {}
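For reference, the command dispatched in the audit entries above is of the form below (assuming the same "sudo cephadm shell --" prefix as the earlier commands; the host2 variant that worked would be the same, with host2 and its device path):

sudo cephadm shell -- ceph orch daemon add osd host1:/dev/nvme0n1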
The other problem is that logging is set to debug for both hosts. I tried "sudo cephadm shell -- ceph daemon mon.host1 config set mon_cluster_log_file_level info", which reports success, but logging remains at debug level.
If I try the same command with mon.host2 I get
INFO:cephadm:Inferring fsid ae111111-1111-1111-1111-f1111a11111a
INFO:cephadm:Inferring config /var/lib/ceph/ae147088-4486-11eb-9044-f1337a55707a/mon.host1/config
INFO:cephadm:Using recent ceph image ceph/ceph:v15
admin_socket: exception getting command descriptions: [Errno 2] No such file or directory
Which looks like it is trying to use the config for host1 on host2?
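(Side note, as an assumption rather than a verified fix: the same log level can also be set cluster-wide via the mon config database rather than the per-daemon admin socket, along the lines of:

sudo cephadm shell -- ceph config set mon mon_cluster_log_file_level info
sudo cephadm shell -- ceph config get mon mon_cluster_log_file_level)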
Thanks,
Duncan
Dear ceph folks,
rbd_cache can be set up as a read/write cache for librbd, and is widely used with OpenStack Cinder. Does krbd have a similar cache control mechanism or not? I am using krbd for iSCSI and NFS backend storage, and wonder whether a cache setting exists for krbd.
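For context, the librbd cache mentioned above is driven by client-side ceph.conf options; a minimal sketch (values are only illustrative, not recommendations) looks like:

[client]
rbd cache = true
rbd cache writethrough until flush = true
rbd cache size = 33554432
rbd cache max dirty = 25165824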
thanks in advance,
Samuel
huxiaoyu(a)horebdata.cn
Hi,
Using Ceph Octopus installed with cephadm here. Version running currently
is 15.2.6. There are 3 machines running the cluster. Machine names are
introduced in /etc/hosts in long (FQDN) & short forms, but the ceph hostnames of
the servers are in short form (not sure if this affects anything). The rbd side
is working nicely, tested with a Linux client.
I am trying to get the object gateway to be visible in the dashboard, but I get an error
when selecting "Object Gateway -> Daemons".
Error:
RGW REST API failed request with status code 403
(b'{"Code":"AccessDenied","RequestId":"tx000000000000000000040-005fe20384-8ecbc'
b'-ou","HostId":"8ecbc-ou-default"}')
What am I doing wrong here?
Thanks a lot,
-Mika
---- Procedure what I have done ----
1) ceph orch apply rgw default ou --placement="1 ceph1"
2) radosgw-admin user create --uid=test --display-name=test
--access-key=test --secret-key=test
3) radosgw-admin period update --rgw-realm=default --commit
4) aws configure --profile=default
aws configure --profile=default
AWS Access Key ID [None]: test
AWS Secret Access Key [None]: test
Default region name [None]: default
Default output format [None]: json
5) aws s3 mb s3://test1 --endpoint-url http://ceph1
make_bucket: test1
5.1) radosgw-admin bucket list
[
"test1"
]
6) ceph dashboard --help | grep reset-rgw | awk '{print $2}' | xargs -n 1 ceph dashboard
Option RGW_API_ACCESS_KEY reset to default value ""
Option RGW_API_ADMIN_RESOURCE reset to default value "admin"
Option RGW_API_HOST reset to default value ""
Option RGW_API_PORT reset to default value "80"
Option RGW_API_SCHEME reset to default value "http"
Option RGW_API_SECRET_KEY reset to default value ""
Option RGW_API_SSL_VERIFY reset to default value "True"
Option RGW_API_USER_ID reset to default value ""
7) ceph dashboard set-rgw-api-user-id "test"
Option RGW_API_USER_ID updated
8) ceph dashboard set-rgw-api-access-key test
Option RGW_API_ACCESS_KEY updated
9) ceph dashboard set-rgw-api-secret-key test
Option RGW_API_SECRET_KEY updated
10) ceph mgr module disable dashboard
11) ceph mgr module enable dashboard
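(One hedged guess, not a confirmed diagnosis: the user handed to the dashboard usually needs the RGW system flag. Checking and setting it would look roughly like:

radosgw-admin user info --uid=test        # look for "system": "true"
radosgw-admin user modify --uid=test --system

followed by the dashboard module disable/enable from steps 10 and 11.)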
Hello
the mgr module diskprediction_local fails under Ubuntu 20.04 focal with
python3-sklearn version 0.22.2.
Ceph version is 15.2.3.
When the module is enabled I get the following error:
File "/usr/share/ceph/mgr/diskprediction_local/module.py", line 112, in
serve
self.predict_all_devices()
File "/usr/share/ceph/mgr/diskprediction_local/module.py", line 279, in
predict_all_devices
result = self._predict_life_expentancy(devInfo['devid'])
File "/usr/share/ceph/mgr/diskprediction_local/module.py", line 222, in
_predict_life_expentancy
predicted_result = obj_predictor.predict(predict_datas)
File "/usr/share/ceph/mgr/diskprediction_local/predictor.py", line 457,
in predict
pred = clf.predict(ordered_data)
File "/usr/lib/python3/dist-packages/sklearn/svm/_base.py", line 585, in
predict
if self.break_ties and self.decision_function_shape == 'ovo':
AttributeError: 'SVC' object has no attribute 'break_ties'
Best Regards
Eric
Hello,
I had some faulty power cables on some OSDs in one server, which caused lots of IO issues with disks appearing/disappearing. This has been corrected now; 2 of the 10 OSDs are working, however 8 are failing to start due to what looks to be a corrupt DB.
When running a ceph-bluestore-tool fsck I get the following output:
rocksdb: [db/db_impl_open.cc:516] db.wal/002221.log: dropping 1302 bytes; Corruption: missing start of fragmented record(2)
2020-12-22T16:21:52.715+0100 7f7b6a1500c0 4 rocksdb: [db/db_impl.cc:389] Shutdown: canceling all background work
2020-12-22T16:21:52.715+0100 7f7b6a1500c0 4 rocksdb: [db/db_impl.cc:563] Shutdown complete
2020-12-22T16:21:52.715+0100 7f7b6a1500c0 -1 rocksdb: Corruption: missing start of fragmented record(2)
2020-12-22T16:21:52.715+0100 7f7b6a1500c0 -1 bluestore(/var/lib/ceph/b1db6b36-0c4c-4bce-9cda-18834be0632d/osd.28) opendb erroring opening db:
Trying to start the OSD leads to:
ceph_abort_msg("Bad table magic number: expected 9863518390377041911, found 9372993859750765257 in db/002442.sst")
It looks like the last write to these OSDs never fully completed. Sadly, as I was adding this new node to move from OSD to host redundancy (EC pool), I currently have 20% of PGs down. Is there anything I can do to remove the last entry in the DB or somehow clean up the RocksDB to get these OSDs at least started? I understand I may end up with some corrupted files.
Thanks
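(For reference, the fsck quoted above would have been run along these lines; the path is taken from the bluestore log line and will differ per OSD:

ceph-bluestore-tool fsck --path /var/lib/ceph/b1db6b36-0c4c-4bce-9cda-18834be0632d/osd.28)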
I could use some input from more experienced folks…
First time seeing this behavior. I've been running ceph in production
(replicated) since 2016 or earlier.
This, however, is a small 3-node cluster for testing EC. Crush map rules
should sustain the loss of an entire node.
Here's the EC rule:
rule cephfs425 {
        id 6
        type erasure
        min_size 3
        max_size 6
        step set_chooseleaf_tries 40
        step set_choose_tries 400
        step take default
        step choose indep 3 type host
        step choose indep 2 type osd
        step emit
}
I had actual hardware failure on one node. Interestingly, this appears to
have resulted in data loss. OSDs began to crash in a cascade on other nodes
(i.e., nodes with no known hardware failure). Not a low RAM problem.
I could use some pointers about how to get the down PGs back up — I *think*
there are enough EC shards, even disregarding the OSDs that crash on start.
nautilus 14.2.15
ceph osd tree
ID CLASS WEIGHT TYPE NAME STATUS REWEIGHT PRI-AFF
-1 54.75960 root default
-10 16.81067 host sumia
1 hdd 5.57719 osd.1 up 1.00000 1.00000
5 hdd 5.58469 osd.5 up 1.00000 1.00000
6 hdd 5.64879 osd.6 up 1.00000 1.00000
-7 16.73048 host sumib
0 hdd 5.57899 osd.0 up 1.00000 1.00000
2 hdd 5.56549 osd.2 up 1.00000 1.00000
3 hdd 5.58600 osd.3 up 1.00000 1.00000
-3 21.21844 host tower1
4 hdd 3.71680 osd.4 up 0 1.00000
7 hdd 1.84799 osd.7 up 1.00000 1.00000
8 hdd 3.71680 osd.8 up 1.00000 1.00000
9 hdd 1.84929 osd.9 up 1.00000 1.00000
10 hdd 2.72899 osd.10 up 1.00000 1.00000
11 hdd 3.71989 osd.11 down 0 1.00000
12 hdd 3.63869 osd.12 down 0 1.00000
cluster:
id: d0b4c175-02ba-4a64-8040-eb163002cba6
health: HEALTH_ERR
1 MDSs report slow requests
4/4239345 objects unfound (0.000%)
Too many repaired reads on 3 OSDs
Reduced data availability: 7 pgs inactive, 7 pgs down
Possible data damage: 4 pgs recovery_unfound
Degraded data redundancy: 95807/24738783 objects degraded
(0.387%), 4 pgs degraded, 3 pgs undersized
7 pgs not deep-scrubbed in time
7 pgs not scrubbed in time
services:
mon: 3 daemons, quorum sumib,tower1,sumia (age 4d)
mgr: sumib(active, since 7d), standbys: sumia, tower1
mds: cephfs:1 {0=sumib=up:active} 2 up:standby
osd: 13 osds: 11 up (since 3d), 10 in (since 4d); 3 remapped pgs
data:
pools: 5 pools, 256 pgs
objects: 4.24M objects, 15 TiB
usage: 24 TiB used, 24 TiB / 47 TiB avail
pgs: 2.734% pgs not active
95807/24738783 objects degraded (0.387%)
47910/24738783 objects misplaced (0.194%)
4/4239345 objects unfound (0.000%)
245 active+clean
7 down
3 active+recovery_unfound+undersized+degraded+remapped
1 active+recovery_unfound+degraded+repair
progress:
Rebalancing after osd.12 marked out
[============================..]
Rebalancing after osd.4 marked out
[=============================.]
A snippet from an example down PG:
"up": [
3,
2,
5,
1,
8,
9
],
"acting": [
3,
2,
5,
1,
8,
9
],
<snip>
],
"blocked": "peering is blocked due to down osds",
"down_osds_we_would_probe": [
11,
12
],
"peering_blocked_by": [
{
"osd": 11,
"current_lost_at": 0,
"comment": "starting or marking this osd lost may let
us proceed"
},
{
"osd": 12,
"current_lost_at": 0,
"comment": "starting or marking this osd lost may let
us proceed"
}
]
},
{
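(The snippet above is the sort of output a PG query returns; assuming PG 15.10 from the down list further below, it would come from something like:

ceph pg 15.10 query

with the relevant parts under "recovery_state", including "peering_blocked_by".)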
Oddly, these OSDs possibly did NOT experience hardware failure. However,
they won't start -- see pastebin for ceph-osd.11.log
https://pastebin.com/6U6sQJuJ
HEALTH_ERR 1 MDSs report slow requests; 4/4239345 objects unfound (0.000%);
Too many repaired reads on 3 OSDs; Reduced data availability
: 7 pgs inactive, 7 pgs down; Possible data damage: 4 pgs recovery_unfound;
Degraded data redundancy: 95807/24738783 objects degraded (0
.387%), 4 pgs degraded, 3 pgs undersized; 7 pgs not deep-scrubbed in time;
7 pgs not scrubbed in time
MDS_SLOW_REQUEST 1 MDSs report slow requests
mdssumib(mds.0): 42 slow requests are blocked > 30 secs
OBJECT_UNFOUND 4/4239345 objects unfound (0.000%)
pg 19.5 has 1 unfound objects
pg 15.2f has 1 unfound objects
pg 15.41 has 1 unfound objects
pg 15.58 has 1 unfound objects
OSD_TOO_MANY_REPAIRS Too many repaired reads on 3 OSDs
osd.9 had 9664 reads repaired
osd.7 had 9665 reads repaired
osd.4 had 12 reads repaired
PG_AVAILABILITY Reduced data availability: 7 pgs inactive, 7 pgs down
pg 15.10 is down, acting [3,2,5,1,8,9]
pg 15.1e is down, acting [5,1,9,8,2,3]
pg 15.40 is down, acting [7,10,1,5,3,2]
pg 15.4a is down, acting [0,3,5,6,9,10]
pg 15.6a is down, acting [3,2,6,1,10,8]
pg 15.71 is down, acting [3,2,1,6,8,10]
pg 15.76 is down, acting [2,0,6,5,10,9]
PG_DAMAGED Possible data damage: 4 pgs recovery_unfound
pg 15.2f is active+recovery_unfound+undersized+degraded+remapped,
acting [5,1,0,3,2147483647,7], 1 unfound
pg 15.41 is active+recovery_unfound+undersized+degraded+remapped,
acting [5,1,0,3,2147483647,2147483647], 1 unfound
pg 15.58 is active+recovery_unfound+undersized+degraded+remapped,
acting [10,2147483647,2,3,1,5], 1 unfound
pg 19.5 is active+recovery_unfound+degraded+repair, acting
[3,2,5,1,8,10], 1 unfound
PG_DEGRADED Degraded data redundancy: 95807/24738783 objects degraded
(0.387%), 4 pgs degraded, 3 pgs undersized
pg 15.2f is stuck undersized for 635305.932075, current state
active+recovery_unfound+undersized+degraded+remapped, last acting
[5,1,0,3,2147483647,7]
pg 15.41 is stuck undersized for 364298.836902, current state
active+recovery_unfound+undersized+degraded+remapped, last acting
[5,1,0,3,2147483647,2147483647]
pg 15.58 is stuck undersized for 384461.110229, current state
active+recovery_unfound+undersized+degraded+remapped, last acting
[10,2147483647,2,3,1,5]
pg 19.5 is active+recovery_unfound+degraded+repair, acting
[3,2,5,1,8,10], 1 unfound
PG_NOT_DEEP_SCRUBBED 7 pgs not deep-scrubbed in time
pg 15.76 not deep-scrubbed since 2020-10-21 14:30:03.935228
pg 15.71 not deep-scrubbed since 2020-10-21 12:20:46.235792
pg 15.6a not deep-scrubbed since 2020-10-21 07:52:33.914083
pg 15.10 not deep-scrubbed since 2020-10-22 03:24:40.465367
pg 15.1e not deep-scrubbed since 2020-10-22 10:37:36.169959
pg 15.40 not deep-scrubbed since 2020-10-23 05:33:35.208748
pg 15.4a not deep-scrubbed since 2020-10-22 05:14:06.981035
PG_NOT_SCRUBBED 7 pgs not scrubbed in time
pg 15.76 not scrubbed since 2020-10-24 08:12:40.090831
pg 15.71 not scrubbed since 2020-10-25 05:22:40.573572
pg 15.6a not scrubbed since 2020-10-24 15:03:09.189964
pg 15.10 not scrubbed since 2020-10-24 16:25:08.826981
pg 15.1e not scrubbed since 2020-10-24 16:05:03.080127
pg 15.40 not scrubbed since 2020-10-24 11:58:04.290488
pg 15.4a not scrubbed since 2020-10-24 11:32:44.573551
--
Jeremy Austin
jhaustin(a)gmail.com
Hello all,
wrt: https://lists.ceph.io/hyperkitty/list/ceph-users@ceph.io/thread/7IMIWCKIHXN…
Yesterday we hit a problem with osd_pglog memory, similar to the thread above.
We have a 56-node object storage (S3+SWIFT) cluster with 25 OSD disks per node. We run 8+3 EC for the data pool (metadata is on a replicated NVMe pool).
The cluster has been running fine, and (as relevant to the post) the memory usage has been stable at 100 GB / node. We've had the default pg_log of 3000. The user traffic doesn't seem to have been exceptional lately.
Last Thursday we updated the OSDs from 14.2.8 -> 14.2.13. On Friday the memory usage on OSD nodes started to grow. On each node it grew steadily about 30 GB/day, until the servers started OOM killing OSD processes.
After a lot of debugging we found that the pg_logs were huge. Each OSD process pg_log had grown to ~22 GB, which we naturally didn't have memory for, and then the cluster was in an unstable situation. This is significantly more than the 1.5 GB in the post above. We do have ~20k PGs, which may directly affect the size.
We've reduced the pg_log to 500, and started offline trimming it where we can, and also just waited. The pg_log size dropped to ~1.2 GB on at least some nodes, but we're still recovering, and still have a lot of OSDs down and out.
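For anyone following along, the runtime setting and the offline trim mentioned above are roughly as follows (option names as in Nautilus; the OSD path and pgid are placeholders, and the OSD must be stopped for the offline trim):

ceph config set osd osd_max_pg_log_entries 500
ceph config set osd osd_min_pg_log_entries 500
ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-123 --pgid 11.2f --op trim-pg-log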
We're unsure if version 14.2.13 triggered this, or if the osd restarts triggered this (or something unrelated we don't see).
This mail is mostly to figure out if there are good guesses as to why the pg_log size per OSD process exploded. Any technical (and moral) support is appreciated. Also, since we're currently not sure whether 14.2.13 triggered this, this is also to put a data point out there for other debuggers.
Cheers,
Kalle Happonen
I have a VM on an OSD node (which can reach the host and other nodes via the
macvtap interface used by both host and guest). I just did a simple
bonnie++ test and everything seems to be fine. Yesterday, however, the
dovecot process apparently caused problems (I am only using cephfs for an
archive namespace; the inbox is on rbd SSD, and the fs metadata is also on SSD).
How can I recover from such a lock-up? If I have a similar situation with
an nfs-ganesha mount, I have the option to do a umount -l, and clients
recover quickly without any issues.
Having to reset the VM is not really an option. What is the best way to
resolve this?
Ceph cluster: 14.2.11 (the vm has 14.2.16)
There is nothing special in my ceph.conf, just these two entries in the mds section:
mds bal fragment size max = 120000
# maybe for nfs-ganesha problems?
# http://docs.ceph.com/docs/master/cephfs/eviction/
#mds_session_blacklist_on_timeout = false
#mds_session_blacklist_on_evict = false
mds_cache_memory_limit = 17179860387
All running:
CentOS Linux release 7.9.2009 (Core)
Linux mail04 3.10.0-1160.6.1.el7.x86_64 #1 SMP Tue Nov 17 13:59:11 UTC
2020 x86_64 x86_64 x86_64 GNU/Linux
I was having horrible problems getting my test ceph cluster reinitialized.
All kinds of annoying things were happening, including getting differing output from
ceph orch device ls
vs
ceph device ls
Being new-ish to ceph, I was going nuts, wondering what kind of init options I was missing.
Turns out, nothing I was doing was wrong, per se.
I had ended up with differing container versions.
Even after doing "cephadm rm-cluster", the old versions were sticking.
The really annoying thing is, the difference was tiny.
15.2.8 on the master.
15.2.5 on other nodes.
but device-related things were failing, with
2020-12-21 09:37:09,134 INFO /bin/podman:stderr ceph-volume inventory: error: unrecognized arguments: --filter-for-batch
in /var/log/ceph/cephadm.log
Sighhh...
To save anyone else some research, the end-user fix is:
run
ceph orch upgrade start --ceph-version 15.2.8
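For the record, this kind of version skew shows up with:

ceph versions    # daemon counts per ceph version
ceph orch ps     # per-daemon container image and version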
--
Philip Brown| Sr. Linux System Administrator | Medata, Inc.
5 Peters Canyon Rd Suite 250
Irvine CA 92606
Office 714.918.1310| Fax 714.918.1325
pbrown(a)medata.com| www.medata.com