Hi,
I have a cluster with the Rook operator running version 1.6, and I upgraded first the Rook operator and then the CephCluster definition. Everything went fine and every component was upgraded except the OSDs. Below is the reason given for the OSDs not being upgraded:
not updating OSD 1 on node "some-node-name". node no longer exists in the
storage spec. if the user wishes to remove OSDs from the node, they must do
so manually. Rook will not remove OSDs from nodes that are removed from the
storage spec in order to prevent accidental data loss
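For reference, the node names Rook compares against come from the nodes list under the storage section of the CephCluster CR. A quick, hedged way to check whether "some-node-name" is still listed there (assuming the default rook-ceph namespace and cluster name) is:
# assumes the default namespace/name; adjust to your deployment
kubectl -n rook-ceph get cephcluster rook-ceph -o jsonpath='{.spec.storage.nodes[*].name}'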
Any ideas, or has anyone seen this before?
Regards.
--
Oğuz Yarımtepe
http://about.me/oguzy
Hello. We are trying to resolve an issue with Ceph. Our OpenShift cluster is blocked and we have tried almost everything.
The current state is:
MDS_ALL_DOWN: 1 filesystem is offline
MDS_DAMAGE: 1 mds daemon damaged
FS_DEGRADED: 1 filesystem is degraded
MON_DISK_LOW: mon be is low on available space
RECENT_CRASH: 1 daemons have recently crashed
We tried to run:
cephfs-journal-tool --rank=gml-okd-cephfs:all event recover_dentries summary
cephfs-journal-tool --rank=gml-okd-cephfs:all journal reset
cephfs-table-tool gml-okd-cephfs:all reset session
ceph mds repaired 0
ceph config rm mds mds_verify_scatter
ceph config rm mds mds_debug_scatterstat
ceph tell gml-okd-cephfs scrub start / recursive repair force
After these commands, the MDS comes up, but an error appears:
MDS_READ_ONLY: 1 MDSs are read only
We also tried creating a new filesystem with a new metadata pool, and deleting and recreating the old filesystem with the same name, using both the old and the new metadata pool.
That got rid of the errors, but the OpenShift cluster did not want to work with the old persistent volumes: the pods reported that they could not find the volume, even though it was present and bound to a PVC.
Now we have rolled back the cluster and are trying to clear the MDS error. Any ideas what to try?
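In case it helps with diagnosis, one hedged way to see what the MDS has recorded as damaged (assuming rank 0 of gml-okd-cephfs) and to clear entries once they have been repaired is:
# list damage entries recorded by rank 0
ceph tell mds.gml-okd-cephfs:0 damage ls
# remove a specific entry by the id reported in damage ls (<damage_id> is a placeholder)
ceph tell mds.gml-okd-cephfs:0 damage rm <damage_id>
# then mark the rank repaired again
ceph mds repaired gml-okd-cephfs:0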
Thanks
Hi,
I have some orchestrator issues on our cluster running 16.2.9 with RGW-only services.
We first noticed these issues a few weeks ago when adding new hosts to the cluster: the orchestrator was not detecting the new drives to build the OSD containers for them. While debugging the mgr logs, I noticed that the mgr was crashing because of the dashboard module. I disabled the dashboard module and the new drives were detected and added to the cluster.
Now we have other, similar issues: we have a failed drive. The failure was detected, the OSD was marked down, and the rebalancing has finished. I want to remove the failed OSD from the cluster, but it looks like the orchestrator is not doing anything:
- I launched the OSD removal with 'ceph orch osd rm 92 --force', where 92 is the OSD id in question.
- I checked the progress, but nothing happens even after a few days:
ceph orch osd rm status
OSD  HOST    STATE    PGS  REPLACE  FORCE  ZAP    DRAIN STARTED AT
92   node10  started  0    False    True   False
- the OSD process is stopped on that host, and from the orchestrator side I can see this:
ceph orch ps --daemon_type osd --daemon_id 92
NAME    HOST    PORTS  STATUS  REFRESHED  AGE  MEM USE  MEM LIM  VERSION    IMAGE ID
osd.92  node10         error   11h ago    4w   -        4096M    <unknown>  <unknown>
- I see the same long refresh interval on other OSDs as well; I know it should be 10 minutes or so:
NAME    HOST    PORTS  STATUS        REFRESHED  AGE  MEM USE  MEM LIM  VERSION  IMAGE ID      CONTAINER ID
osd.93  node09         running (4w)  11h ago    4w   5573M    4096M    16.2.9   3520ead5eb19  d2f658e9e37b
- 'ceph status' and 'ceph osd df' show the failed OSD as down and empty:
osd: 116 osds: 115 up (since 11d), 115 in (since 11d)
90  hdd  16.37109  1.00000  16 TiB  4.0 TiB  4.0 TiB  0 B  17 GiB  12 TiB  24.66  0.97  146  up
92  hdd         0        0     0 B      0 B      0 B  0 B     0 B     0 B      0     0     0  down
94  hdd  16.37109  1.00000  16 TiB  4.0 TiB  4.0 TiB  0 B  17 GiB  12 TiB  24.66  0.97  146  up
- I activated debug level 20 on the mgr, but I can't see any errors or other clues regarding the OSD removal. I also switched to the standby mgr with 'ceph mgr fail'. The failover works, but still nothing happens (see the cephadm log sketch after this list).
- It's not only the OSD removal: I also tried to deploy new RGW services by applying RGW labels to 2 new hosts (we have specs that build RGW containers when the label is detected). Again, nothing happens.
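Since all of these symptoms point at the cephadm module inside the mgr rather than at the OSDs themselves, a hedged way to see what cephadm is (or isn't) doing is to raise its own log level and watch its cluster log channel:
# confirm the orchestrator backend is enabled and responding
ceph orch status
# raise cephadm's logging to the cluster log and watch it live
ceph config set mgr mgr/cephadm/log_to_cluster_level debug
ceph -W cephadm
# or dump the recent cephadm channel messages
ceph log last cephadm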
I'm planning to upgrade to 16.2.11 to see if that solves the issues, but I'm not very confident since I didn't see anything related to this in the changelogs. Is there anything else I can try to debug this issue?
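If the orchestrator queue stays stuck, a hedged fallback (not cephadm's normal removal path) would be to remove the dead OSD with the classic commands and only then let cephadm clean up the daemon entry; the device path below is a placeholder:
# classic removal of an OSD that is already down and drained
ceph osd purge 92 --yes-i-really-mean-it
# remove the leftover daemon entry from cephadm's inventory
ceph orch daemon rm osd.92 --force
# zap the replacement drive when it is in place, e.g. /dev/sdX (placeholder)
ceph orch device zap node10 /dev/sdX --force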
Thanks.
Ceph 16.2.11:
is it safe to enable scrub and deep scrub during backfilling?
I have a long recovery/backfill running due to a new CRUSH map; the backfilling is going slowly and the deep scrub interval has expired, so I have many PGs not deep-scrubbed in time.
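A hedged sketch of the knobs involved, assuming defaults: OSDs normally skip scrubbing PGs that are recovering (governed by osd_scrub_during_recovery, default false), so allowing scrubs during backfill mainly costs extra I/O rather than correctness, and the noscrub/nodeep-scrub flags can pause them again if the cluster slows down too much:
# allow scrubbing on PGs involved in recovery/backfill (default is false)
ceph config set osd osd_scrub_during_recovery true
# if the extra load hurts the backfill, pause scrubs cluster-wide again
ceph osd set noscrub
ceph osd set nodeep-scrub
# and re-enable once backfill has finished
ceph osd unset noscrub
ceph osd unset nodeep-scrub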
Best regards
Alessandro
Hi, we have a cluster with this 'ceph df' output:
--- RAW STORAGE ---
CLASS  SIZE     AVAIL    USED     RAW USED  %RAW USED
hdd    240 GiB  205 GiB  29 GiB   35 GiB    14.43
hddvm  1.6 TiB  1.2 TiB  277 GiB  332 GiB   20.73
TOTAL  1.8 TiB  1.4 TiB  305 GiB  366 GiB   19.91
--- POOLS ---
POOL                   ID  PGS  STORED   (DATA)   (OMAP)   OBJECTS  USED     (DATA)   (OMAP)   %USED  MAX AVAIL  QUOTA OBJECTS  QUOTA BYTES  DIRTY    USED COMPR  UNDER COMPR
device_health_metrics   1    1      0 B      0 B      0 B        0      0 B      0 B      0 B      0    308 GiB            N/A          N/A        0         0 B          0 B
rbd-pool                2   32    539 B     19 B    520 B        9    539 B     19 B    520 B      0    462 GiB            N/A          N/A        9         0 B          0 B
cephfs.sharedfs.meta    3   32  299 MiB  190 MiB  109 MiB   87.10k  299 MiB  190 MiB  109 MiB   0.03    308 GiB            N/A          N/A   87.10k         0 B          0 B
cephfs.sharedfs.data    4   32  2.2 GiB  2.2 GiB      0 B  121.56k  2.2 GiB  2.2 GiB      0 B   0.23    308 GiB            N/A          N/A  121.56k         0 B          0 B
rbd-pool-proddeb02      5   32  712 MiB  712 MiB    568 B      201  712 MiB  712 MiB    568 B   0.08    308 GiB            N/A          N/A      201         0 B          0 B
So, as you can see, we have 332 GiB RAW USED, but the data really is only 299 MiB + 2.2 GiB + 712 MiB:
POOL                   ID  PGS  STORED   OBJECTS  USED     %USED  MAX AVAIL
device_health_metrics   1    1      0 B        0      0 B      0    308 GiB
rbd-pool                2   32    539 B        9    539 B      0    462 GiB
cephfs.sharedfs.meta    3   32  299 MiB   87.10k  299 MiB   0.03    308 GiB
cephfs.sharedfs.data    4   32  2.2 GiB  121.56k  2.2 GiB   0.23    308 GiB
rbd-pool-proddeb02      5   32  712 MiB      201  712 MiB   0.08    308 GiB
How can we clean up DIRTY? How is that even possible? Is it a cache issue, or an uncommitted flush from a client?
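A hedged observation based only on the output above: DIRTY equals OBJECTS for every pool, which is what 'ceph df detail' reports for pools that are not cache tiers; the column only tracks genuinely dirty objects when cache tiering is in use. A quick way to confirm that no cache tiering is configured:
# show per-pool details, including any tier / cache_mode settings
ceph osd pool ls detail
ceph osd dump | grep -Ei 'tier|cache_mode'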
Best regards
Alessandro
The conditional policy for List operations does not work as expected for a bucket under a tenant. With buckets and users without a tenant, everything is fine.
The owner of t1\bucket1 is t1$user1.
Bucket Policy:
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "PolicyForUser2Prefix",
      "Effect": "Allow",
      "Principal": {"AWS": ["arn:aws:iam::t1:user/user2"]},
      "Action": ["s3:List*", "s3:Get*"],
      "Condition": {
        "StringLike": {
          "s3:prefix": ["obj*"],
          "s3:delimiter": ["/"]
        }
      },
      "Resource": [
        "arn:aws:s3::t1:bucket1",
        "arn:aws:s3::t1:bucket1/obj*"
      ]
    }
  ]
}
t1$user2 can list bucket1\obj1 but can't get bucket1\obj1; the Get returns a 403 error.
If the "Condition" section is removed from the policy, t1$user2 can get obj1 but can't list.
If I change the condition to "StringLikeIfExists", everything works, but then t1$user2 can also list the root of bucket1, which is undesirable.
Are there any errors in the policy or is it a bug?
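For what it's worth, in AWS-style policy evaluation the s3:prefix and s3:delimiter condition keys are only present on List requests, so a plain StringLike condition evaluated against a GetObject request (where the key is absent) fails, which would explain the 403. A hedged sketch of a split policy that scopes the prefix condition to ListBucket only, reusing the same principal and ARNs as above (whether RGW handles this correctly for tenanted buckets is exactly the open question):
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "ListOnlyObjPrefix",
      "Effect": "Allow",
      "Principal": {"AWS": ["arn:aws:iam::t1:user/user2"]},
      "Action": ["s3:ListBucket"],
      "Condition": {"StringLike": {"s3:prefix": ["obj*"]}},
      "Resource": ["arn:aws:s3::t1:bucket1"]
    },
    {
      "Sid": "GetObjPrefix",
      "Effect": "Allow",
      "Principal": {"AWS": ["arn:aws:iam::t1:user/user2"]},
      "Action": ["s3:GetObject"],
      "Resource": ["arn:aws:s3::t1:bucket1/obj*"]
    }
  ]
}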
Hello,
Setting up my first Ceph cluster in the lab.
Rocky 8.6
Ceph quincy
Using curl install method
Following cephadm deployment steps
Everything works as expected, except that
ceph orch device ls --refresh
only displays the NVMe devices and not the SATA SSDs on the Ceph host.
Tried
sgdisk --zap-all /dev/sda
wipefs -a /dev/sda
Adding the SATA OSD manually, I get:
ceph orch daemon add osd ceph-a:data_devices=/dev/sda
Created no osd(s) on host ceph-a; already created?
An NVMe OSD gets added without issue.
I have looked at the ceph-volume log on the node and the monitor log on the admin server and have not seen anything that looks like an obvious clue. I can see commands running successfully against /dev/sda in the logs.
Ideas?
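In case it's useful, a hedged set of checks to try next (device paths are just examples): look at cephadm's inventory view of the disk, check for leftover LVM or partition state that makes the device look unavailable, and zap it through the orchestrator so cephadm itself refreshes its view:
# what ceph-volume thinks of the device (run on the host)
cephadm shell -- ceph-volume inventory /dev/sda
# look for leftover partitions or LVM state on the disk
lsblk /dev/sda
pvs; vgs; lvs
# zap the device via the orchestrator so cephadm clears and rescans it
ceph orch device zap ceph-a /dev/sda --force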
Thanks,
cb