Hi
On one of our Ceph clusters, some OSDs have been marked as full. Since this is a staging cluster that does not have much data on it, this is strange.
Looking at the full OSDs through “ceph osd df” I figured out that the space is mostly used by metadata:
SIZE: 122 GiB
USE: 118 GiB
DATA: 2.4 GiB
META: 116 GiB
We run mimic, and for the affected OSDs we use a db device (nvme) in addition to the primary device (hdd).
In the logs we see the following errors:
2020-05-12 17:10:26.089 7f183f604700 1 bluefs _allocate failed to allocate 0x400000 on bdev 1, free 0x0; fallback to bdev 2
2020-05-12 17:10:27.113 7f183f604700 1 bluestore(/var/lib/ceph/osd/ceph-8) _balance_bluefs_freespace gifting 0x180a000000~400000 to bluefs
2020-05-12 17:10:27.153 7f183f604700 1 bluefs add_block_extent bdev 2 0x180a000000~400000
We assume it is an issue with RocksDB, as the following call quickly fixes the problem:
ceph daemon osd.8 compact
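For context, this is roughly how we watch it before and after compacting. osd.8 is just an example from our setup, and I am not certain the bluefs counters are the right thing to look at, it is just what we found (both commands are run on the host that carries the OSD):
ceph osd df                          # META vs DATA per OSD
ceph daemon osd.8 perf dump bluefs   # db_used_bytes vs slow_used_bytes, i.e. how much has spilled from the db device onto the HDD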
The question is: why is this happening? I would have thought that “compact” is something that runs automatically from time to time, but I’m not sure.
Is it on us to run this regularly?
Any pointers are welcome. I’m quite new to Ceph :)
Cheers,
Denis
Hi,
I am trying to create a topic so that I can use it to listen for object creation notifications on a bucket.
If I make my API call without supplying AWS authorization headers, the topic creation succeeds, and it can be seen by using a ListTopics call.
However, in order to attach a topic to a bucket, the topic and bucket must have the same owner. So I tried creating a topic using AWS auth.
The credential header I tried was the same as what I use for get/put items to a bucket:
Credential=<access key id>/20200512/us-east-1/s3/aws4_request
However, in this case, rather than succeeding, I get a NotImplemented error.
If I change the service in the credential scope to something other than s3, I instead get a SignatureDoesNotMatch error. What is the right way to authenticate a CreateTopic request?
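In case it clarifies what I am doing: the call itself is just the SNS-style CreateTopic action against the RGW endpoint. With the stock AWS CLI it would look roughly like the following (endpoint, region and topic name are placeholders; note that the CLI signs this with 'sns' rather than 's3' as the service in the credential scope, and I have not been able to confirm which of the two RGW expects):
aws --endpoint-url http://rgw.example.com:8000 --region us-east-1 sns create-topic --name bucket-notify-topic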
Thanks,
Alexis
Hi,
I noticed a strange situation in one of our clusters: the OSD daemons are using too much RAM.
We are running 12.2.12 and have the default osd_memory_target (4 GiB).
Heap dump shows:
osd.2969 dumping heap profile now.
------------------------------------------------
MALLOC: 6381526944 ( 6085.9 MiB) Bytes in use by application
MALLOC: + 0 ( 0.0 MiB) Bytes in page heap freelist
MALLOC: + 173373288 ( 165.3 MiB) Bytes in central cache freelist
MALLOC: + 17163520 ( 16.4 MiB) Bytes in transfer cache freelist
MALLOC: + 95339512 ( 90.9 MiB) Bytes in thread cache freelists
MALLOC: + 28995744 ( 27.7 MiB) Bytes in malloc metadata
MALLOC: ------------
MALLOC: = 6696399008 ( 6386.2 MiB) Actual memory used (physical + swap)
MALLOC: + 218267648 ( 208.2 MiB) Bytes released to OS (aka unmapped)
MALLOC: ------------
MALLOC: = 6914666656 ( 6594.3 MiB) Virtual address space used
MALLOC:
MALLOC: 408276 Spans in use
MALLOC: 75 Thread heaps in use
MALLOC: 8192 Tcmalloc page size
------------------------------------------------
Call ReleaseFreeMemory() to release freelist memory to the OS (via madvise()).
Bytes released to the OS take up virtual address space but no physical memory.
IMO "Bytes in use by application" should be less than osd_memory_target. Am I correct?
I checked the heap dump with google-pprof and got the following results:
Total: 149.4 MB
60.5 40.5% 40.5% 60.5 40.5% rocksdb::UncompressBlockContentsForCompressionType
34.2 22.9% 63.4% 34.2 22.9% ceph::buffer::create_aligned_in_mempool
11.9 7.9% 71.3% 12.1 8.1% std::_Rb_tree::_M_emplace_hint_unique
10.7 7.1% 78.5% 71.2 47.7% rocksdb::ReadBlockContents
Does this mean that most of the RAM is used by RocksDB?
How can I take a deeper look into memory usage?
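For completeness, besides the heap profiler the only other things I know to check are the OSD mempools and the memory target itself (osd.2969 as in the dump above; these go through the admin socket on the OSD's host):
ceph daemon osd.2969 dump_mempools                   # per-mempool usage: bluestore cache, pglog, osdmaps, ...
ceph daemon osd.2969 config get osd_memory_target   # confirm the target actually in effect
ceph tell osd.2969 heap stats                        # same tcmalloc summary as above
ceph tell osd.2969 heap release                      # ask tcmalloc to return freelist memory to the OS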
Regards,
Rafał Wądołowski
Hi,
I am running a small 3-node Ceph Nautilus 14.2.8 cluster on Ubuntu 18.04.
I am testing the cluster to expose a CephFS volume as a Samba v4 share for users
to access from Windows later on.
Samba version is 4.7.6-Ubuntu and mount.cifs version is 6.8.
When I test on the CephFS kernel mount, dd write speed is 600 MB/s and md5sum
read speed is 300-400 MB/s.
I exposed the same volume in Samba using "vfs_ceph" and mounted it
over CIFS on another Ubuntu 18.04 client.
There, dd write speed is still 600 MB/s, but md5sum read speed is only 65 MB/s.
I get a different result when I read the same file using
smbclient: it reads at about 101 MB/s.
Why is there this difference, and what could be the issue?
app_id must match the 'aud' field in the token introspection result
(in the example, the value of 'aud' is 'customer-portal').
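In the role's assume-role policy document that would look roughly like this (the realm URL, role name and audience below are placeholders for your setup; the value in the StringEquals condition is what has to equal 'aud'):
radosgw-admin role create --role-name=S3Access --assume-role-policy-doc='{
  "Version": "2012-10-17",
  "Statement": [{
    "Effect": "Allow",
    "Principal": {"Federated": ["arn:aws:iam:::oidc-provider/keycloak.example.com:8080/auth/realms/demo"]},
    "Action": ["sts:AssumeRoleWithWebIdentity"],
    "Condition": {"StringEquals": {"keycloak.example.com:8080/auth/realms/demo:app_id": "customer-portal"}}
  }]
}'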
Thanks,
Pritha
On Tue, May 12, 2020 at 8:16 PM Wyllys Ingersoll <
wyllys.ingersoll(a)keepertech.com> wrote:
>
> Running Nautilus 14.2.9 and trying to follow the STS example given here:
> https://docs.ceph.com/docs/master/radosgw/STS/ to setup a policy
> for AssumeRoleWithWebIdentity using KeyCloak (8.0.1) as the OIDC provider.
> I am able to see in the rgw debug logs that the token being passed from the
> client is passing the introspection check, but it always ends up failing
> the final authorization to access the requested bucket resource and is
> rejected with a 403 status "AccessDenied".
>
> I configured my policy as described in the 2nd example on the STS page
> above. I suspect the problem is with the "StringEquals" condition statement
> in the AssumeRolePolicy document (I could be wrong though).
>
> The example shows using the keycloak URI followed by ":app_id" matching
> with the name of the keycloak client application ("customer-portal" in the
> example). My keycloak setup does not have any such field in the
> introspection result and I can't seem to figure out how to make this all
> work.
>
> I cranked up the logging to 20/20 and still did not see any hints as to
> what part of the policy is causing the access to be denied.
>
> Any suggestions?
>
> -Wyllys Ingersoll
>
Hi,
I deployed a multisite setup in order to sync data from a mimic cluster zone
to a nautilus cluster zone. The data is syncing well at present. However,
when I check the cluster status I find something strange: the data in my
new cluster seems larger than in the old one. The sync is far from
complete, yet the space used is nearly the same. Is that normal?
'ceph df ' on old cluster:
GLOBAL:
SIZE AVAIL RAW USED %RAW USED
82 TiB 41 TiB 41 TiB 50.37
POOLS:
    NAME                          ID     USED        %USED     MAX AVAIL     OBJECTS
    .rgw.root                      1     6.0 KiB         0        10 TiB          19
    default.rgw.control            2         0 B         0        10 TiB           8
    default.rgw.meta               3     3.5 KiB         0        10 TiB          19
    default.rgw.log                4     8.4 KiB         0        10 TiB        1500
    default.rgw.buckets.index      5         0 B         0        10 TiB         889
    default.rgw.buckets.non-ec     6         0 B         0        10 TiB         497
    default.rgw.buckets.data       7      14 TiB     56.96        10 TiB     3968545
    testpool                       8         0 B         0        10 TiB           0
'ceph df ' on new cluster:
RAW STORAGE:
CLASS SIZE AVAIL USED RAW USED %RAW USED
hdd 137 TiB 98 TiB 38 TiB 38 TiB 28.02
TOTAL 137 TiB 98 TiB 38 TiB 38 TiB 28.02
POOLS:
    POOL                         ID     STORED      OBJECTS     USED        %USED     MAX AVAIL
    .rgw.root                     1     6.4 KiB          21     3.8 MiB         0        26 TiB
    shubei.rgw.control           13         0 B           8         0 B         0        26 TiB
    shubei.rgw.meta              14     4.1 KiB          20     3.2 MiB         0        26 TiB
    shubei.rgw.log               15     9.9 MiB       1.64k      47 MiB         0        26 TiB
    default.rgw.meta             16         0 B           0         0 B         0        26 TiB
    shubei.rgw.buckets.index     17     2.7 MiB         889     2.7 MiB         0        26 TiB
    shubei.rgw.buckets.data      18      11 TiB       2.90M      33 TiB     29.37        26 TiB
'radosgw-admin sync status' on new cluster:
realm bde4bb56-fbca-4ef8-a979-935dbf109b78 (new-oriental)
zonegroup d25ae683-cdb8-4227-be45-ebaf0aed6050 (beijing)
zone 313c8244-fe4d-4d46-bf9b-0e33e46be041 (shubei)
metadata sync syncing
full sync: 0/64 shards
incremental sync: 64/64 shards
metadata is caught up with master
data sync source: f70a5eb9-d88d-42fd-ab4e-d300e97094de (oldzone)
syncing
full sync: 106/128 shards
full sync: 350 buckets to sync
incremental sync: 22/128 shards
data is behind on 115 shards
behind shards:
[1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,23,24,25,26,27,28,29,30,32,35,37,38,39,40,41,42,43,44,45,46,47,49,50,51,52,53,54,55,56,57,58,59,60,61,62,63,64,65,66,67,68,69,70,71,73,74,75,76,77,78,79,80,81,82,83,84,85,86,87,88,89,90,91,92,93,94,96,97,98,99,100,101,102,103,104,105,107,108,109,110,111,112,113,114,116,118,119,120,121,122,123,124,125,126,127]
oldest incremental change not applied: 2020-05-11
10:46:41.0.60179s [80]
5 shards are recovering
recovering shards: [21,31,95,104,106]
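One thing I have not compared yet is the data-protection setting of the two bucket data pools. If the old pool is erasure-coded while the new one is 3x replicated, that alone would make USED roughly three times STORED on the new cluster (33 TiB vs 11 TiB above). Something like this on each cluster should show it (pool names as in the listings above):
ceph osd pool ls detail | grep buckets.data    # shows "replicated size N" or the erasure-code profile per pool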
Hello
I had an incident where 3 OSDs crashed completely at once and won't power
up. During recovery, 3 OSDs on another host have somehow become
corrupted. I am running erasure coding with an 8+2 setup, using a CRUSH map that
takes 2 OSDs per host, and after losing the other 2 OSDs I have a few PGs
down. Unfortunately these PGs seem to overlap almost all of the data on the pool,
so I believe the entire pool is mostly lost even though only about 2% of the PGs are
down.
I am running ceph 14.2.9.
OSD 92 log https://pastebin.com/5aq8SyCW
OSD 97 log https://pastebin.com/uJELZxwr
ceph-bluestore-tool repair without --deep showed "success", but the OSDs still
fail with the log above.
Log from ceph-bluestore-tool repair --deep, which is still running; I am not
sure it will actually fix anything, and the log looks pretty bad:
https://pastebin.com/gkqTZpY3
Trying "ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-97 --op
list" gave me input/output error. But everything in SMART looks OK, and i
see no indication of hardware read error in any logs. Same for both OSD.
The OSD's with corruption have absolutely no bad sectors and likely have
only a minor corruption but at important locations.
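The only idea I have left myself is to try, with the OSDs stopped, exporting the down PGs from the corrupted OSDs and importing them into a healthy OSD, roughly like below (the pgid is a placeholder, for an EC pool it needs the shard suffix, and ceph-42 stands in for any healthy OSD) -- though I am not sure the export would get past the same input/output error that --op list hits:
ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-97 --pgid 2.1fs0 --op export --file /root/pg.2.1fs0.export
ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-42 --op import --file /root/pg.2.1fs0.export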
Any ideas on how to recover from this kind of scenario? Any tips would be
highly appreciated.
Best regards,
Kári Bertilsson
There is a general documentation meeting called the "DocuBetter Meeting",
and it is held every two weeks. The next DocuBetter Meeting will be on 13
May 2020 at 0830 PST, and will run for thirty minutes. Everyone with a
documentation-related request or complaint is invited. The meeting will be
held here: https://bluejeans.com/908675367
Send documentation-related requests and complaints to me by replying to
this email and CCing me at zac.dover(a)gmail.com.
The next DocuBetter meeting is scheduled for:
13 May 2020 0830 PST
13 May 2020 1630 UTC
14 May 2020 0230 AEST
Etherpad: https://pad.ceph.com/p/Ceph_Documentation
Meeting: https://bluejeans.com/908675367
Thanks, everyone.
Zac Dover
Hello all,
I'm having an issue with a bucket that refuses to be resharded. For the record, the cluster was recently upgraded from 13.2.4 to 13.2.10.
# radosgw-admin reshard add --bucket foo --num-shards 3300
ERROR: the bucket is currently undergoing resharding and cannot be added to the reshard list at this time
# radosgw-admin reshard list
[]
# radosgw-admin reshard status --bucket=foo
[
    {
        "reshard_status": "not-resharding",
        "new_bucket_instance_id": "",
        "num_shards": -1
    },
    <snip>
# radosgw-admin reshard cancel --bucket foo
ERROR: failed to remove entry from reshard log, oid=reshard.0000000009 tenant= bucket=foo
# radosgw-admin reshard stale-instances list
[]
Is there anything else I should check to troubleshoot this? I was able to reshard another bucket since the upgrade, so I suspect there's something lingering that's blocking this.
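The only other lead I have is the reshard log object named in the error (reshard.0000000009). My assumption, which I have not verified, is that a stale omap entry for the bucket in that object is what makes RGW think it is still mid-reshard. Is it safe to inspect or clean that up directly, e.g.:
# rados -p default.rgw.log ls | grep reshard
# rados -p default.rgw.log listomapkeys reshard.0000000009
(default.rgw.log is an assumption; substitute whatever log pool the zone actually uses.)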
Hello,
I was hoping someone could clear up the difference between these metrics.
In filestore the difference between Apply and Commit Latency was pretty
clear and these metrics gave a good representation of how the cluster was
performing. High commit usually meant our journals were performing poorly
while high apply pointed to an OSD issue.
With BlueStore, Apply and Commit are now tied to the same metric, and it's not
as clear to me what that metric represents.
In addition new metrics such as Read and Write Op Latency have been added.
I'm led to believe that these are similar to what Apply Latency used to
represent but is that actually the case?
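For reference, these are the places I am pulling the numbers from (osd.0 is just an example):
ceph osd perf                     # the per-OSD commit/apply latency columns (identical values on BlueStore)
ceph daemon osd.0 perf dump osd   # the newer counters, e.g. op_latency, op_r_latency, op_w_latency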
If anyone who has a better understanding of this than I do can enlighten me
I'd appreciate it!
Thanks,
John