Hello everyone!
We had a switch outage and afterwards the CephFS kernel mount stopped working.
This is the fstab entry:
10.99.10.1:/somefolder /cephfs ceph _netdev,nofail,name=cephcluster,secret=IsSecret 0 0
I reproduced it by disabling the VLAN on the switch through which Ceph is reachable, which results in ICMP "destination unreachable".
I kept it disabled for five minutes; after that, "ls /cephfs" just returns "permission denied".
In dmesg I can see this:
[ 1412.994921] libceph: mon1 10.99.10.4:6789 session lost, hunting for new mon
[ 1413.009325] libceph: mon0 10.99.10.1:6789 session established
[ 1452.998646] libceph: mon2 10.99.15.3:6789 session lost, hunting for new mon
[ 1452.998679] libceph: mon0 10.99.10.1:6789 session lost, hunting for new mon
[ 1461.989549] libceph: mon4 10.99.15.5:6789 socket closed (con state CONNECTING)
---
[ 1787.045148] libceph: mon3 10.99.15.4:6789 socket closed (con state CONNECTING)
[ 1787.062587] libceph: mon0 10.99.10.1:6789 session established
[ 1787.086103] libceph: mon4 10.99.15.5:6789 session established
[ 1814.028761] libceph: mds0 10.99.10.4:6801 socket closed (con state OPEN)
[ 1815.029811] libceph: mds0 10.99.10.4:6801 connection reset
[ 1815.029829] libceph: reset on mds0
[ 1815.029831] ceph: mds0 closed our session
[ 1815.029833] ceph: mds0 reconnect start
[ 1815.052219] ceph: mds0 reconnect denied
[ 1815.052229] ceph: dropping dirty Fw state for ffff9d9085da1340 1099512175611
[ 1815.052231] ceph: dropping dirty+flushing Fw state for ffff9d9085da1340 1099512175611
[ 1815.273008] libceph: mds0 10.99.10.4:6801 socket closed (con state NEGOTIATING)
[ 1816.033241] ceph: mds0 rejected session
[ 1829.018643] ceph: mds0 hung
[ 1880.088504] ceph: mds0 came back
[ 1880.088662] ceph: mds0 caps renewed
[ 1880.094018] ceph: get_quota_realm: ino (10000000afe.fffffffffffffffe) null i_snap_realm
[ 1881.100367] ceph: get_quota_realm: ino (10000000afe.fffffffffffffffe) null i_snap_realm
[ 2046.768969] conntrack: generic helper won't handle protocol 47. Please consider loading the specific helper module.
[ 2061.731126] ceph: get_quota_realm: ino (10000000afe.fffffffffffffffe) null i_snap_realm
Is this a bug I should report, or a misconfiguration?
Has anyone else seen this before?
To recover, a simple remount does the trick.
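For reference, the remount I mean is just:

umount /cephfs && mount /cephfs

I am also wondering whether the recover_session=clean mount option of the kernel client (available on newer kernels, as far as I understand) would avoid the manual remount after the client gets evicted, i.e. something like:

10.99.10.1:/somefolder /cephfs ceph _netdev,nofail,name=cephcluster,secret=IsSecret,recover_session=clean 0 0

but I have not tested that yet.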
Thanks in advance
Simon
Hi,
Can you suggest what a good CephFS design looks like? I've never used it, only RGW and RBD so far, but I want to give it a try. However, on the mailing list I saw a huge number of issues with CephFS, so I would like to follow some, let's say, bulletproof best practices.
Like separating the MDS from the MON and MGR?
Does it need a lot of memory?
Should it be on SSD or NVMe?
How many CPUs/disks ...
I'd very much appreciate any advice.
Istvan Szabo
Senior Infrastructure Engineer
---------------------------------------------------
Agoda Services Co., Ltd.
e: istvan.szabo(a)agoda.com
---------------------------------------------------
________________________________
Hi,
If you follow the latest install guide here
<https://docs.ceph.com/en/latest/cephadm/install/> to install the Pacific
release, the bootstrap command prints the following error message:
....
......
mgr epoch 13 is available
Generating a dashboard self-signed certificate...
Creating initial admin user...
Fetching dashboard port number...
Ceph Dashboard is now available at:
URL: https://ceph-1:8443/
User: admin
Password: xxx
Enabling client.admin keyring and conf on hosts with "admin" label
Non-zero exit code 22 from /usr/bin/docker run --rm --ipc=host
--stop-signal=SIGTERM --net=host --entrypoint /usr/bin/ceph --init -e
CONTAINER_IMAGE=docker.io/ceph/ceph:v16 -e NODE_NAME=tikal-ceph-1 -e
CEPH_USE_RANDOM_NONCE=1 -v
/var/log/ceph/d6c1ba28-cd0d-11eb-8b39-960000bd038e:/var/log/ceph:z -v
/tmp/ceph-tmp9t3y8y39:/etc/ceph/ceph.client.admin.keyring:z -v
/tmp/ceph-tmpaepq1rte:/etc/ceph/ceph.conf:z docker.io/ceph/ceph:v16 orch
client-keyring set client.admin label:_admin
/usr/bin/ceph: stderr Invalid command: client-keyring not in
start|stop|restart|redeploy|reconfig
/usr/bin/ceph: stderr orch start|stop|restart|redeploy|reconfig
<service_name> : Start, stop, restart, redeploy, or reconfig an entire
service (i.e. all daemons)
/usr/bin/ceph: stderr Error EINVAL: invalid command
Unable to set up "admin" label; assuming older version of Ceph
......
....
As a result the first node is not working correctly and you cannot add
more hosts. I tried to set the label manually afterwards with:
# ceph orch host label add ceph-1 _admin
But this seems to have no effect.
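My guess is that the cephadm script from the latest docs is newer than the docker.io/ceph/ceph:v16 image it pulls, and that this image simply does not know the 'orch client-keyring' command yet. If that is the case, two things I am considering (both untested on my side; the image tag and hostname below are only placeholders) are pinning bootstrap to an image that matches the cephadm script:

# cephadm --image docker.io/ceph/ceph:v16.2.5 bootstrap --mon-ip <mon-ip>

or simply distributing the admin keyring and conf to the other hosts by hand, which is what the failing step is supposed to automate:

# scp /etc/ceph/ceph.conf /etc/ceph/ceph.client.admin.keyring root@ceph-2:/etc/ceph/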
Is this problem known? The only solution seems to be staying with the
Octopus release :-/
===
Ralph
Hi
It seems that with a command like this
aws --profile=my-user-tenant1 --endpoint=$HOST_S3_API --region="" iam
create-role --role-name="tenant2\$TemporaryRole"
--assume-role-policy-document file://json/trust-policy-assume-role.json
I can create a role in another tenant.
The executing user has the roles:* capability, which I think is necessary
to be able to create roles, but at the same time it appears to be a global
capability that applies to all tenants.
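For reference, the capability was granted along these lines (a sketch; the uid is a placeholder, and as far as I know the radosgw-admin caps syntax uses '=' rather than ':'):

radosgw-admin caps add --uid='tenant1$my-user' --caps="roles=*"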
Similarly, a federated user who assumes a role with the iam:CreateRole
permission can create an arbitrary role, like below.
aws --endpoint=$HOST_S3_API --region="" iam create-role
--role-name="tenant2\$TemporaryRole" --assume-role-policy-document
file://json/trust-policy-assume-role.json
Example permission policy
{
"Statement":[
{"Effect":"Allow","Action":["iam:GetRole"]},
{"Effect":"Allow","Action":["iam:CreateRole"]}
]
}
The roles:* capability is not needed in this case, which I think is correct,
because only the permission policy of the assumed role is checked.
Getting information about a role from other tenants is possible with
iam:GetRole.
This is less controversial, but I would still expect it to be scoped to the
user's tenant unless an explicit tenant name is stated in the policy, like this:
{"Effect":"Allow","Action":["iam:GetRole"],"Resource":"arn:aws:iam::tenant2:*"}
Possibly I'm missing something.
Why is crossing tenants possible?
Regards
Daniel
Hi,
I'm currently reading the documentation about stretched clusters, and
I would like to know whether one is needed or not with this kind of 3-DC
setup:
         3km (0.2ms)
  DC1----------------DC2
   |                  |
   | 30km (3ms)       | 30km (2-3ms)
   |                  |
   +-------DC3--------+
DC1 and DC2 are near each other, with small latency (0.2ms).
DC3 is 30km away, with higher latency (2-3ms).
There are separate links between the DCs, with different physical paths.
1 monitor in each DC.
OSDs in DC1/DC2 only, with size=4 (the CRUSH rule I have in mind is
sketched below).
The cluster is full NVMe or SSD; the lowest possible latency is required
for OSD replication.
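A minimal sketch of that CRUSH rule, assuming 'datacenter' buckets exist in the CRUSH map (the rule name and id are only illustrative):

rule rbd_two_dc {
    id 1
    type replicated
    min_size 2
    max_size 4
    # pick the two datacenters that contain OSDs, then two hosts in each
    step take default
    step choose firstn 2 type datacenter
    step chooseleaf firstn 2 type host
    step emit
}

i.e. two replicas in each of DC1 and DC2.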
Now, I really don't know whether the higher-latency monitor at DC3 could
have an impact on OSD read/write latency if that monitor is elected leader,
versus a stretched cluster where OSDs only use their local DC's monitors.
What is the advantage of stretch mode here (given good redundant links
between the sites)?
Hi all,
The cluster here is running v14.2.20 and is used for RBD images.
We have a PG in recovery_unfound state and since this is the first
time we've had this occur, we wanted to get your advice on the best
course of action.
PG 4.1904 went into state active+recovery_unfound+degraded+repair [1]
during normal scrubbing (but note that we have `osd scrub auto repair
= true`).
2021-06-13 03:15:11.559680 osd.951 (osd.951) 138 : cluster [DBG]
4.1904 repair starts
2021-06-13 04:00:49.369256 osd.951 (osd.951) 139 : cluster [ERR]
4.1904 shard 951 soid
4:209cfddb:::rbd_data.3a4ff12d847b61.000000000001c39e:head : candidate
had a read error
The scrub detected a read error on the primary of this PG, and tried
to repair it by reading from the other 2 osds:
Jun 13 04:00:46 xxx kernel: sd 0:0:25:0: [sdp] tag#6 FAILED Result:
hostbyte=DID_OK driverbyte=DR
Jun 13 04:00:46 xxx kernel: sd 0:0:25:0: [sdp] tag#6 Sense Key :
Medium Error [current] [descript
Jun 13 04:00:46 xxx kernel: sd 0:0:25:0: [sdp] tag#6 Add. Sense:
Unrecovered read error
Jun 13 04:00:46 xxx kernel: sd 0:0:25:0: [sdp] tag#6 CDB: Read(16) 88
00 00 00 00 02 ba 8c 0b 00
Jun 13 04:00:46 xxx kernel: blk_update_request: critical medium error,
dev sdp, sector 1171967531
But it seems that the other 2 osds could not repair this failed read
on the primary because they don't have the correct version of the
object:
2021-06-13 04:28:29.412765 osd.951 (osd.951) 140 : cluster [ERR]
4.1904 repair 0 missing, 1 inconsistent objects
2021-06-13 04:28:29.413320 osd.951 (osd.951) 141 : cluster [ERR]
4.1904 repair 1 errors, 1 fixed
2021-06-13 04:28:29.445659 osd.14 (osd.14) 414 : cluster [ERR] 4.1904
push 4:209cfddb:::rbd_data.3a4ff12d847b61.000000000001c39e:head v
3592634'367863320 failed because local copy is 3593555'368312656
2021-06-13 04:28:29.472554 osd.344 (osd.344) 124 : cluster [ERR]
4.1904 push 4:209cfddb:::rbd_data.3a4ff12d847b61.000000000001c39e:head
v 3592634'367863320 failed because local copy is 3593555'368312656
2021-06-13 04:28:30.863807 mgr.yyy (mgr.692832499) 648287 : cluster
[DBG] pgmap v557097: 19456 pgs: 1
active+recovery_unfound+degraded+repair, 2 active+clean+scrubbing,
19423 active+clean, 30 active+clean+scrubbing+deep+repair; 1.3 PiB
data, 4.0 PiB used, 2.1 PiB / 6.1 PiB avail; 350 MiB/s rd, 766 MiB/s
wr, 16.93k op/s; 3/1063641423 objects degraded (0.000%); 1/354547141
objects unfound (0.000%)
I don't understand how the versions of the objects would get out of
sync -- there have been no other recent failures on these disks,
AFAICT.
So my best guess is that the IO error on 951 confused the repair
process -- the osd.951 tried to recover the non-latest version of the
object.
(This would imply that the object versions on osds 14 and 344 are in
fact the correct newest versions).
We have a few ideas how to fix this:
* osd 951 is sick, so drain it by setting `ceph osd primary-affinity
951 0` and `ceph osd out 951`
* osd 951 is really sick, so just stop it now and backfill its PGs to
other OSDs.
* Don't stop osd 951 yet: Restart all three relevant OSDs and see if
that fixes the object versions.
* Don't drain osd 951 yet: Make OSD 14 or 344 the primary for this PG
(e.g. `ceph osd primary-affinity 951 0`), then run `ceph pg repair
4.1904` so that the version from osds 14/344 can be pushed.
* Use mark_unfound_lost revert, or delete (and inform the user to fsck
their image); see the commands sketched below.
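For completeness, I believe the commands for inspecting the PG and for that last option would be along these lines:

ceph pg 4.1904 list_unfound
ceph pg 4.1904 mark_unfound_lost revert
ceph pg 4.1904 mark_unfound_lost delete

My understanding is that revert rolls the object back to a previous version where one exists, while delete forgets it entirely, so revert would be the less destructive choice here.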
Does anyone have some recent experience or advice on this issue?
Best Regards,
Dan
[1]
# ceph pg 4.1904 query
{
"state": "active+recovery_unfound+degraded+repair",
"snap_trimq": "[1c7fd~1,1c7ff~1,1c801~1,1c803~1,1c805~1]",
"snap_trimq_len": 5,
"epoch": 3593586,
"up": [
951,
344,
14
],
"acting": [
951,
344,
14
],
"acting_recovery_backfill": [
"14",
"344",
"951"
],
...
Hi List
I have an OSD (83) that fails to start. It is made up of one 4TB drive
and an 80GB DB on NVMe. There was a cluster-full situation that is now
resolved; however, I am quite sure the issue with this particular OSD is
unrelated.
When I try to start the OSD, it fails to read the label of the block.db
device, failing with the following lines:
1 bluefs add_block_device bdev 1 path
/var/lib/ceph/osd/ceph-83/block.db size 80 GiB
-1 bluestore(/var/lib/ceph/osd/ceph-83) _minimal_open_bluefs check
block device(/var/lib/ceph/osd/ceph-83/block.db) label returned: (2) No
such file or directory
1 bdev(0x563d8822a700 /var/lib/ceph/osd/ceph-83/block.db) close
1 bdev(0x563d8822a000 /var/lib/ceph/osd/ceph-83/block) close
-1 osd.83 0 OSD:init: unable to mount object store
-1 ** ERROR: osd init failed: (2) No such file or directory
(Full log:
https://gist.github.com/NightDog/7b50349da1410bb05bd7f4d54a02f055)
The last thing that happened to the OSD before it started to fail
booting was it being terminated during what I believe was a reboot:
received signal: Terminated from Kernel ( Could be generated by
pthread_kill(), raise(), abort(), alarm() ) UID: 0
Last lines++ of the last successful run:
https://gist.github.com/NightDog/fd9b4b7b3e0c0c2ba29ce5d325bb97c6
When I try to run ceph-bluestore-tool --log-level 30 show-label on the
block.db it returns:
"unable to read label for
/dev/ceph-00ed472c-f900-4dc3-9ddc-0e2f3b6547e3/osd-db-bb0eaa16-a1e0-4985-b4bd-74799e5226be:
(2) No such file or directory"
The block returns the label fine (see master gist):
https://gist.github.com/NightDog/4518bf11b364170911e5743b5ed0f614
The strange thing is however that lvs -o lv_tags returns just fine for
the block.db:
root@ceph-node201:~# lvs -o lv_tags
/dev/ceph-00ed472c-f900-4dc3-9ddc-0e2f3b6547e3/osd-db-bb0eaa16-a1e0-4985-b4bd-74799e5226be
LV Tags
ceph.block_device=/dev/ceph-ff60b68a-26fe-4294-8bec-4a9c329e858d/osd-block-73ab12e6-7758-4ebe-9319-5935309fcacd,ceph.block_uuid=nbRXYl-fRrQ-qyYP-D93c-IGct-yKg4-rujDOX,ceph.cephx_lockbox_secret=,ceph.cluster_fsid=f4495398-a8c4-4ad9-8219-80c48625abdf,ceph.cluster_name=ceph,ceph.crush_device_class=None,ceph.db_device=/dev/ceph-00ed472c-f900-4dc3-9ddc-0e2f3b6547e3/osd-db-bb0eaa16-a1e0-4985-b4bd-74799e5226be,ceph.db_uuid=K09p3L-QV06-LOLO-uVeT-2ulz-GD3O-CEyRcs,ceph.encrypted=0,ceph.osd_fsid=73ab12e6-7758-4ebe-9319-5935309fcacd,ceph.osd_id=83,ceph.osdspec_affinity=osd-spec-2xx,ceph.type=db,ceph.vdo=0
So it seems to me that, for some reason, ceph-bluestore-tool fails to
read the label of the block.db device even though it is there, and then
fails the startup of the OSD.
Trying to write keys with ceph-bluestore-tool set-label-key fails with
the same error message.
I see no reason why there should be any damage to either the .db or the
block device, and since the labels are there in LVM, I guess
ceph-bluestore-tool errors out on something else?
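One thing I plan to check (a sketch; the LV path is the one from the lvs output above) is whether the block.db symlink actually resolves to an existing block device node, and whether the first 4 KiB of that device, where I understand the bluestore label lives, contain anything at all:

readlink -f /var/lib/ceph/osd/ceph-83/block.db
test -b "$(readlink -f /var/lib/ceph/osd/ceph-83/block.db)" && echo "device node exists"
dd if=/dev/ceph-00ed472c-f900-4dc3-9ddc-0e2f3b6547e3/osd-db-bb0eaa16-a1e0-4985-b4bd-74799e5226be bs=4096 count=1 2>/dev/null | hexdump -C | head

If the device node is missing (for example because the VG/LV is not activated at boot), that would explain the "(2) No such file or directory" better than a damaged label would.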
Would it be possible to get some help with regards to getting this .db
and OSD back up again?
Thanks!
PS: Running version 15.2.8; I also tried show-label with 16.2.3, with the
same result.
--
Regards
Karl M. Kittilsen
Hi,
I have set up a Ceph cluster (Octopus) and installed the RBD CSI
plugin/provisioner in my Kubernetes cluster.
I can dynamically create FS and block volumes, which is fine. For that I
created the following StorageClass:
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: ceph
provisioner: rbd.csi.ceph.com
parameters:
  clusterID: <clusterID>
  pool: kubernetes
  imageFeatures: layering
  csi.storage.k8s.io/provisioner-secret-name: csi-rbd-secret
  csi.storage.k8s.io/provisioner-secret-namespace: ceph-system
  csi.storage.k8s.io/controller-expand-secret-name: csi-rbd-secret
  csi.storage.k8s.io/controller-expand-secret-namespace: ceph-system
  csi.storage.k8s.io/node-stage-secret-name: csi-rbd-secret
  csi.storage.k8s.io/node-stage-secret-namespace: ceph-system
reclaimPolicy: Delete
allowVolumeExpansion: true
mountOptions:
  - discard
This works fine for ephemeral, dynamically created volumes. But now I want
to use a durable volume with reclaimPolicy: Retain. I expect that I need to
create the image in my 'kubernetes' pool on the Ceph cluster first, which I
have done.
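(For reference, a minimal sketch of how such an image can be pre-created; the image name and size are just the ones I use in the PV below:

rbd create kubernetes/demo-internal-index --size 1G
)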
I defined the following new storage class with the reclaimPolicy 'Retain':
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: ceph-durable
provisioner: rbd.csi.ceph.com
parameters:
  clusterID: <clusterID>
  pool: kubernetes
  imageFeatures: layering
  csi.storage.k8s.io/provisioner-secret-name: csi-rbd-secret
  csi.storage.k8s.io/provisioner-secret-namespace: ceph-system
  csi.storage.k8s.io/controller-expand-secret-name: csi-rbd-secret
  csi.storage.k8s.io/controller-expand-secret-namespace: ceph-system
  csi.storage.k8s.io/node-stage-secret-name: csi-rbd-secret
  csi.storage.k8s.io/node-stage-secret-namespace: ceph-system
reclaimPolicy: Retain
allowVolumeExpansion: true
mountOptions:
  - discard
And finally I created the following PersistentVolume and
PersistentVolumeClaim:
---
kind: PersistentVolume
apiVersion: v1
metadata:
  name: demo-internal-index
spec:
  capacity:
    storage: 1Gi
  volumeMode: Filesystem
  accessModes:
    - ReadWriteOnce
  claimRef:
    namespace: office-demo-internal
    name: index
  csi:
    driver: driver.ceph.io
    fsType: ext4
    volumeHandle: demo-internal-index
  storageClassName: ceph-durable
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: index
  namespace: office-demo-internal
spec:
  accessModes:
    - ReadWriteOnce
  storageClassName: ceph-durable
  resources:
    requests:
      storage: 1Gi
  volumeName: "demo-internal-index"
But this does not seem to work, and I can see the following deployment warning:
attachdetach-controller AttachVolume.Attach failed for volume
"demo-internal-index" : attachdetachment timeout for volume
demo-internal-index
But the PV exists:
$ kubectl get pv
NAME                  CAPACITY   ACCESS MODES   RECLAIM POLICY   STATUS   CLAIM                        STORAGECLASS   REASON   AGE
demo-internal-index   1Gi        RWO            Retain           Bound    office-demo-internal/index   ceph-durable            2m35s
and also the PVC exists:
$ kubectl get pvc -n office-demo-internal
NAME    STATUS   VOLUME                CAPACITY   ACCESS MODES   STORAGECLASS   AGE
index   Bound    demo-internal-index   1Gi        RWO            ceph-durable   53m
I guess my PV object is nonsense? Can someone provide me with an example of
how to set up such a PV object in Kubernetes? I have only found examples
where the Ceph monitor IPs and the user/password are configured within the
PV object, but I would expect that to be covered by the storage class
already?
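For what it's worth, this is roughly what I have pieced together so far from the ceph-csi static-PVC documentation; I have not verified it yet, and the volumeAttributes field names are my assumption, so please correct me if they are wrong:

kind: PersistentVolume
apiVersion: v1
metadata:
  name: demo-internal-index
spec:
  capacity:
    storage: 1Gi
  volumeMode: Filesystem
  accessModes:
    - ReadWriteOnce
  persistentVolumeReclaimPolicy: Retain
  storageClassName: ceph-durable
  claimRef:
    namespace: office-demo-internal
    name: index
  csi:
    driver: rbd.csi.ceph.com
    fsType: ext4
    # volumeHandle should be the name of the pre-created RBD image
    volumeHandle: demo-internal-index
    volumeAttributes:
      clusterID: <clusterID>
      pool: kubernetes
      staticVolume: "true"
      imageFeatures: layering
    nodeStageSecretRef:
      name: csi-rbd-secret
      namespace: ceph-system

I have also read that for statically provisioned volumes the PVC's storageClassName may need to be an empty string, but I am not sure about that either.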
Thanks for your help
===
Ralph