Hi,
I upgraded from 13.2.5 to 14.2.6 last week and am now seeing
significantly higher latency on various MDS operations. For example,
the 2min rate of ceph_mds_server_req_create_latency_sum /
ceph_mds_server_req_create_latency_count over an 8hr window last Monday,
prior to the upgrade, averaged 2ms. Today, however, the same stat shows
869ms. Other operations, including open, readdir, rmdir, etc., are also
taking significantly longer.
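To be concrete, that number comes from a query along these lines (assuming the standard metric names exported by the mgr prometheus module, as above):
rate(ceph_mds_server_req_create_latency_sum[2m]) / rate(ceph_mds_server_req_create_latency_count[2m])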
Here's a partial example of an op from dump_ops_in_flight:
{
    "description": "client_request(client.342513090:334359409 create #...)",
    "initiated_at": "2020-04-13 15:30:15.707637",
    "age": 0.19583208099999999,
    "duration": 0.19767626299999999,
    "type_data": {
        "flag_point": "submit entry: journal_and_reply",
        "reqid": "client.342513090:334359409",
        "op_type": "client_request",
        "client_info": {
            "client": "client.342513090",
            "tid": 334359409
        },
        "events": [
            {
                "time": "2020-04-13 15:30:15.707637",
                "event": "initiated"
            },
            {
                "time": "2020-04-13 15:30:15.707637",
                "event": "header_read"
            },
            {
                "time": "2020-04-13 15:30:15.707638",
                "event": "throttled"
            },
            {
                "time": "2020-04-13 15:30:15.707640",
                "event": "all_read"
            },
            {
                "time": "2020-04-13 15:30:15.781935",
                "event": "dispatched"
            },
            {
                "time": "2020-04-13 15:30:15.785086",
                "event": "acquired locks"
            },
            {
                "time": "2020-04-13 15:30:15.785507",
                "event": "early_replied"
            },
            {
                "time": "2020-04-13 15:30:15.785508",
                "event": "submit entry: journal_and_reply"
            }
        ]
    }
}
This, along with every other 'create' op I've seen, has a 50ms+ delay
between the all_read and dispatched events - what is happening during this
time? I'm not sure what to look for in the MDS debug logs.
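If it helps, this is roughly how I was planning to capture more detail on the active MDS (the debug levels are just my guess at something verbose enough to be useful):
# raise MDS and messenger debug output on the active MDS, reproduce, then grep for the reqid
ceph daemon mds.<name> config set debug_mds 10
ceph daemon mds.<name> config set debug_ms 1
grep 'client.342513090:334359409' /var/log/ceph/ceph-mds.<name>.log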
We have a mix of clients from 12.2.x through 14.2.8; my plan was to
upgrade those pre-Nautilus clients this week. There is only a single
MDS rank, with one standby. Other functions of this cluster - RBD and RGW
- do not appear impacted, so this looks limited to the MDS. I did not
observe this behavior after upgrading a dev cluster last month.
Has anyone seen anything similar? Thanks for any assistance!
Josh
Hi Ceph folks,
I am relatively new to Ceph and I hope I can quickly receive some help here.
I would like to recover files from a CephFS data pool. Someone wrote that inode linkage and file names are stored in the omap data of objects in the metadata pool.
I can't find any information about the structure of the omap data of the objects in the metadata pool that would help me write, for example, a script to retrieve filenames and the related objects,
so I can use, for example, “rados get” to retrieve those files.
Is there any working script that traverses the whole metadata pool to find the file names corresponding to the objects in the data pool?
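As I understand it (please correct me if I'm wrong), the starting point would be something like this - the pool names and the object names are just examples:
# directory entries live in the omap of objects named <dir inode hex>.<frag> in the metadata pool;
# the keys look like "<filename>_head" and the values encode the child inode
rados -p cephfs_metadata listomapkeys 10000000000.00000000
rados -p cephfs_metadata listomapvals 10000000000.00000000
# file contents live in the data pool as <file inode hex>.<8-hex block index>,
# so the first object of a file would be fetched with something like:
rados -p cephfs_data get 100000003e8.00000000 recovered_chunk_0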
/Ed
Hi,
is there a way to synchronize a specific bucket with Ceph across the available datacenters?
I've only found the multi-site setup, but that syncs the complete cluster, which amounts to a failover solution.
For me it's just 1 bucket.
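If I understand the docs correctly, one option might be to keep the multi-site setup but control sync per bucket, leaving it enabled only on the one bucket I care about - is that the intended way? Roughly (bucket names made up):
# disable replication for buckets that should stay local
radosgw-admin bucket sync disable --bucket=local-only-bucket
# keep (or re-enable) replication for the one bucket that should be synced
radosgw-admin bucket sync enable --bucket=shared-bucket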
Thank you
Hello,
running Ceph Nautilus 14.2.4, we encountered this documented dynamic resharding issue:
http://lists.ceph.com/pipermail/ceph-users-ceph.com/2019-November/037531.ht…
We disabled dynamic resharding in the configuration, and attempted to reshard to 1 shard:
# radosgw-admin reshard add --bucket files --num-shards 1 --yes-i-really-mean-it
However, it achieved nothing, and the bucket is now stuck in resharding status. It is impossible to clear the resharding flag (I have tried the bucket check --fix operation, to no avail).
# radosgw-admin reshard cancel --bucket=files
2020-04-28 11:47:18.721 7fd213b969c0 -1 ERROR: failed to remove entry from reshard log, oid=reshard.0000000000 tenant= bucket=files
# radosgw-admin bucket reshard --bucket files --num-shards 1
ERROR: the bucket is currently undergoing resharding and cannot be added to the reshard list at this time
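Is there anything useful that can still be done from the reshard bookkeeping side, e.g. with something like:
# list pending/stuck reshard entries and the per-bucket reshard status
radosgw-admin reshard list
radosgw-admin reshard status --bucket=files
radosgw-admin bucket stats --bucket=files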
Hello everyone,
A few weeks ago I enabled the ceph balancer on my cluster as per the instructions here: https://docs.ceph.com/docs/mimic/mgr/balancer/
I am running ceph version:
ceph version 13.2.6 (7b695f835b03642f85998b2ae7b6dd093d9fbce4) mimic (stable)
The cluster has 48 osds (40 osds in hdd pools and 8 osds in ssd pool)
Currently, the balancer status is showing as Active.
# ceph balancer status
{
    "active": true,
    "plans": [],
    "mode": "upmap"
}
The health status of the cluster is:
health: HEALTH_OK
Previously, I used the old REWEIGHT to change the placement of data, as I was seeing very uneven usage (ranging from about 60% usage on some OSDs to over 90% on others). So I have a number of OSDs with a reweight of 1 and some going down to 0.75.
At the moment the OSD usage ranges from about 65% to just under 90%, so still a huge variation. After switching on the balancer, I've not actually seen any activity or data migration, so I am not sure if the balancer is working at all. Could someone tell me how I check if balancing is doing its job?
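Is evaluating it manually the right approach? I was thinking of something like this (the plan name is just an example):
# score the current distribution (lower is better), then build and inspect a plan
ceph osd df tree
ceph balancer eval
ceph balancer optimize myplan
ceph balancer eval myplan
ceph balancer show myplan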
The second question is: now that the balancer is switched on, am I supposed to set the reweight values back to their default of 1?
Many thanks
Hi all!
I exported an RBD image both with the export v2 format and without it.
The difference is that the image exported with the v2 format is much smaller than the one exported without it.
Can anybody tell me why the size difference is so huge?
Thanks very much!
root@controller:/mnt# rbd du images/35d69ca5-b4f7-499e-9719-331eee498bc4
NAME PROVISIONED USED
35d69ca5-b4f7-499e-9719-331eee498bc4@snap 40GiB 40GiB
35d69ca5-b4f7-499e-9719-331eee498bc4 40GiB 0B
<TOTAL> 40GiB 40GiB
root@controller:/mnt# rbd export --export-format 2 images/35d69ca5-b4f7-499e-9719-331eee498bc4 ./v2image
Exporting image: 100% complete...done.
root@controller:/mnt# du -sh ./v2image
9.8G ./v2image
root@controller:/mnt# rbd export images/35d69ca5-b4f7-499e-9719-331eee498bc4 ./image
Exporting image: 100% complete...done.
root@controller:/mnt# du -sh ./image
40G ./image
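Would comparing the apparent size against the allocated blocks help explain it? I was going to check whether the plain export is simply a non-sparse file, e.g.:
# apparent size vs. blocks actually allocated on disk
du -sh --apparent-size ./image ./v2image
ls -ls ./image ./v2image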
Hi
We have a 3-year-old Hadoop cluster - up for refresh - so it is time
to evaluate options. The "only" use case is running an HBase installation,
which is important for us, and migrating out of HBase would be a hassle.
Our Ceph usage has expanded and in general - we really like what we see.
Thus - can this be "sanely" consolidated somehow? I have seen this:
https://docs.ceph.com/docs/jewel/cephfs/hadoop/
But it seems really, really bogus to me.
It recommends that you set:
pool 3 'hadoop1' rep size 1 min_size 1
Which would - if I understand correctly - be disastrous. The Hadoop end would
replicate 3x across the cluster, but within Ceph the replication would be 1.
Replication of 1 in Ceph means that pulling an OSD node would "guarantee"
that PGs go inactive - which could be OK - but there is also nothing
guaranteeing that the other Hadoop replicas are not served out of the same
OSD node/PG, in which case rebooting an OSD node would make the Hadoop
cluster unavailable.
Is anyone serving HBase out of Ceph - and how does the stack and
configuration look? If I went for 3x replication in both Ceph and HDFS
then it would definitely work, but 9x copies of the dataset is a bit more
than what looks feasible at the moment.
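What I was imagining instead - please tell me if this is naive - is letting Ceph do the 3x replication on a dedicated CephFS data pool and dropping dfs.replication to 1 on the Hadoop side, roughly (pool and fs names are just examples):
# a normally replicated pool for the Hadoop/HBase data, added as an extra CephFS data pool
ceph osd pool create hadoop1 128 128 replicated
ceph osd pool set hadoop1 size 3
ceph osd pool set hadoop1 min_size 2
ceph fs add_data_pool cephfs hadoop1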
Thanks for your reflections/input.
Jesper
Hi, thought I might ask here since I was unable to find anything similar
to my issue. Perhaps someone might have an idea.
In our org we are currently running a few Nautilus clusters (14.2.2,
14.2.4, 14.2.8), but strangely enough the clusters on 14.2.8 are
reporting weird pool R/W metrics. To be more specific, from time to
time random huge spikes on the order of TB/s appear in our graphs,
whereas the real usage is GB/s (the clusters are monitored by Prometheus).
The first suspect was our exporters, so I tried the one built into the
ceph dashboard, but the values of those metrics were the same (+- a few
bytes). I went as far as writing a small program to talk directly to the
Ceph API, but after I processed the output the data was also the same. I
can provide more details if someone has a clue what the cause might be
(graphs, logs, etc.).
As for the rest of the clusters, their graphs do not contain
any spikes like the ones on the 14.2.8 clusters.
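Would cross-checking against the CLI during one of the spikes be a sensible sanity check, e.g.:
# per-pool client I/O rates and usage totals as reported by the cluster itself
ceph osd pool stats
rados df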
Thanks for any responses/help in advance.
Completed the migration of an existing Ceph cluster on Octopus to cephadm.
All OSDs/MONs/MGRs moved fine; however, upon running the command to set up some new MDS daemons for CephFS, they both failed to start.
After looking into the cephadm logs I found the following error:
Apr 13 06:26:15 sn-s01 systemd[1]: Started Ceph mds.cephfs.sn-s01.snkhfd for b1db6b36-0c4c-4bce-9cda-18834be0632d.
Apr 13 06:26:16 sn-s01 bash[3520809]: debug 2020-04-13T04:26:16.445+0000 7f380b908700 -1 monclient(hunting): handle_auth_bad_method server allowed_methods [1] but i only support [2]
Apr 13 06:26:16 sn-s01 bash[3520809]: debug 2020-04-13T04:26:16.445+0000 7f380b107700 -1 monclient(hunting): handle_auth_bad_method server allowed_methods [1] but i only support [2]
Apr 13 06:26:16 sn-s01 bash[3520809]: debug 2020-04-13T04:26:16.445+0000 7f380a906700 -1 monclient(hunting): handle_auth_bad_method server allowed_methods [1] but i only support [2]
Apr 13 06:26:16 sn-s01 bash[3520809]: failed to fetch mon config (--no-mon-config to skip
This cluster is running with cephx disabled. I imported the ceph.conf into cephadm fine, and this has worked when the other services started, but from the above error it looks like maybe the MDS is not checking whether cephx is enabled or disabled before trying to communicate with the mons?
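In case I'm missing something obvious, this is what I was going to check next - whether the auth settings actually make it into the config the MDS container sees, and into the mon config database (the fsid below is a placeholder):
# the minimal config cephadm hands to the containerized daemon
grep auth /var/lib/ceph/<fsid>/mds.cephfs.sn-s01.snkhfd/config
# what the cluster itself thinks the auth settings are
ceph config get mds auth_client_required
ceph config get global auth_cluster_required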
Hi all,
*** Short version ***
Is there a way to repair a RocksDB that reports the errors "Encountered error while
reading data from compression dictionary block Corruption: block
checksum mismatch" and "_open_db erroring opening db"?
*** Long version ***
We operate a nautilus ceph cluster (with 100 disks of 8TB in 6 servers +
4 mons/mgr + 3 mds).
We recently (Monday 20) upgraded from 14.2.7 to 14.2.8. This triggered a
rebalancing of some data.
Two days later (Wednesday 22) we had a very short power outage. Only one
of the osd servers went down (and unfortunately died).
This triggered a reconstruction of the lost OSDs. Operations went fine
until Saturday 25, when some OSDs on the 5 remaining servers started to
crash for no apparent reason.
We tried to restart them, but they crashed again. We ended up with 18 OSDs
down (+ 16 in the dead server, so 34 OSDs down out of 100).
Looking at the logs, we found the following for all the crashed OSDs:
-237> 2020-04-25 16:32:51.835 7f1f45527a80 3 rocksdb:
[table/block_based_table_reader.cc:1117] Encountered error while reading
data from compression dictionary block Corruption: block checksum
mismatch: expected 0, got 2729370997 in db/181355.sst offset
18446744073709551615 size 18446744073709551615
and
2020-04-25 16:05:47.251 7fcbd1e46a80 -1
bluestore(/var/lib/ceph/osd/ceph-3) _open_db erroring opening db:
We also noticed that the "Encountered error while reading data from
compression dictionary block Corruption: block checksum mismatch" error was
present a few days before the crash.
We also have some OSDs with this error that are still up.
We tried to repair with:
ceph-kvstore-tool bluestore-kv /var/lib/ceph/osd/ceph-3 repair
but with no success (it ends with _open_db erroring opening db).
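We are also considering the bluestore-level tools, but are unsure whether they are safe to run in this state - would something like this be the right next step?
# read-only consistency check first, then an actual repair attempt
ceph-bluestore-tool fsck --path /var/lib/ceph/osd/ceph-3
ceph-bluestore-tool repair --path /var/lib/ceph/osd/ceph-3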
Thus, does somebody have an idea how to fix this, or at least know whether it's
possible to repair and correct the "Encountered error while reading data
from compression dictionary block Corruption: block checksum mismatch"
and "_open_db erroring opening db" errors?
Thanks for your help (we are desperate because we will lose data and
are fighting to save something)!!!
F.