Hello Community,
I have problems with ceph-mons in Docker. The mon containers start, but I get a lot of "e6 handle_auth_request failed to assign global_id" messages in the log. 2 mons are up, but I can't send any ceph commands.
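I assume I can still query the mons directly over their admin sockets, something like this (a sketch; the container name and mon id are placeholders):
# placeholders: adjust the container name and mon id to your deployment
docker ps --filter name=ceph-mon
docker exec <mon-container> ceph daemon mon.<mon-id> mon_status
docker exec <mon-container> ceph daemon mon.<mon-id> quorum_status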
Regards
Mateusz
Hi,
I'm investigating an issue where 4 to 5 OSDs in a rack aren't marked as
down when the network is cut to that rack.
Situation:
- Nautilus cluster
- 3 racks
- 120 OSDs, 40 per rack
We performed a test where we shut down the Top-of-Rack network for each
rack. This worked as expected with two racks, but with the third
something weird happened.
Of the 40 OSDs that were supposed to be marked down, only 36 were.
In the end it took 15 minutes for all 40 OSDs to be marked down.
$ ceph config set mon mon_osd_reporter_subtree_level rack
That setting is there to make sure that we only accept failure reports
from other racks.
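For completeness, the active value can be double-checked against the config database and the running leader like this (a sketch; the mon name is taken from the log lines below):
ceph config get mon mon_osd_reporter_subtree_level
ceph config show mon.CEPH2-MON1-206-U39 | grep mon_osd_reporter_subtree_level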
For example, this is what we saw in the logs:
2020-10-29T03:49:44.409-0400 7fbda185e700 10
mon.CEPH2-MON1-206-U39@0(leader).osd e107102 osd.51 has 54 reporters,
239.856038 grace (20.000000 + 219.856 + 7.43801e-23), max_failed_since
2020-10-29T03:47:22.374857-0400
But osd.51 was still not marked down after 54 reporters had reported
that it is actually down.
I checked: no ping or other traffic to osd.51 was possible. The host is unreachable.
Another OSD was marked down, but that also took a couple of minutes:
2020-10-29T03:50:54.455-0400 7fbda185e700 10
mon.CEPH2-MON1-206-U39@0(leader).osd e107102 osd.37 has 48 reporters,
221.378970 grace (20.000000 + 201.379 + 6.34437e-23), max_failed_since
2020-10-29T03:47:12.761584-0400
2020-10-29T03:50:54.455-0400 7fbda185e700 1
mon.CEPH2-MON1-206-U39@0(leader).osd e107102 we have enough reporters
to mark osd.37 down
In the end osd.51 was marked down, but only because the MON's no-beacon
timeout kicked in:
2020-10-29T03:53:44.631-0400 7fbda185e700 0 log_channel(cluster) log
[INF] : osd.51 marked down after no beacon for 903.943390 seconds
2020-10-29T03:53:44.631-0400 7fbda185e700 -1
mon.CEPH2-MON1-206-U39@0(leader).osd e107104 no beacon from osd.51 since
2020-10-29T03:38:40.689062-0400, 903.943390 seconds ago. marking down
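The ~900 seconds in that message looks like the beacon-based fallback rather than the reporter-based path; assuming default values, these are the two knobs involved:
# beacon fallback that finally marked osd.51 down (default 900 s, matching the log)
ceph config get mon mon_osd_report_timeout
# base grace for the reporter path (the 20.000000 in the log lines above)
ceph config get mon osd_heartbeat_grace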
I haven't seen this happen before in any cluster. It's also strange that
it only happens in this rack; the other two racks work fine.
ID CLASS WEIGHT TYPE NAME
-1 1545.35999 root default
-206 515.12000 rack 206
-7 27.94499 host CEPH2-206-U16
...
-207 515.12000 rack 207
-17 27.94499 host CEPH2-207-U16
...
-208 515.12000 rack 208
-31 27.94499 host CEPH2-208-U16
...
That's what the CRUSH map looks like: straightforward, with 3x replication
over 3 racks.
This issue only occurs in rack *207*.
Has anybody seen this before or knows where to start?
Wido
Hi everyone, I asked the same question on Stack Overflow, but I'll repeat it here.
I configured a bucket notification using the bucket owner's credentials, and when the owner performs actions I can see new events at the configured endpoint (Kafka, actually). However, when I perform actions on the bucket with another user's credentials, I do not see events in the configured notification topic. Is this expected behavior, and does each user have to configure their own topic (is that even possible if the user is not a system user)? Or have I missed something? Thank you.
https://stackoverflow.com/questions/64384060/enable-bucket-notifications-fo…
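For context, the notification was set up roughly like this with the owner's credentials (a sketch; the endpoint, bucket name and topic ARN are placeholders):
# configure the notification on the bucket (owner credentials)
aws --endpoint-url http://rgw.example.com:8000 s3api put-bucket-notification-configuration \
    --bucket mybucket \
    --notification-configuration '{"TopicConfigurations":[{"Id":"kafka-events","TopicArn":"arn:aws:sns:default::my-kafka-topic","Events":["s3:ObjectCreated:*","s3:ObjectRemoved:*"]}]}'
# inspect what is currently configured on the bucket
aws --endpoint-url http://rgw.example.com:8000 s3api get-bucket-notification-configuration --bucket mybucket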
Hi all,
on a Mimic 13.2.8 cluster I observe a gradual increase in memory usage by the OSD daemons, in particular under heavy load. For our spinners I use osd_memory_target=2G. The daemons overrun the 2G of virtual size rather quickly and grow to something like 4G virtual. The real memory consumption stays more or less around the 2G target. There are some overshoots, but these go down again during periods with less load.
What I observe now is that the actual memory consumption slowly grows and the OSDs start using more than 2G of virtual memory. I see this as slowly growing swap usage despite more RAM being available (swappiness=10). This indicates allocated-but-unused memory, or memory not accessed for a long time, which usually points to a leak. Here are some heap stats:
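For reference, heap stats like the ones below come from the tcmalloc admin-socket interface, roughly like this (osd id as below; heap release listed only as the related knob):
# tcmalloc heap statistics via the admin socket
ceph daemon osd.101 heap stats
# tcmalloc can also be asked to return freed pages to the OS
ceph daemon osd.101 heap release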
Before restart:
osd.101 tcmalloc heap stats:------------------------------------------------
MALLOC: 3438940768 ( 3279.6 MiB) Bytes in use by application
MALLOC: + 5611520 ( 5.4 MiB) Bytes in page heap freelist
MALLOC: + 257307352 ( 245.4 MiB) Bytes in central cache freelist
MALLOC: + 357376 ( 0.3 MiB) Bytes in transfer cache freelist
MALLOC: + 6727368 ( 6.4 MiB) Bytes in thread cache freelists
MALLOC: + 25559040 ( 24.4 MiB) Bytes in malloc metadata
MALLOC: ------------
MALLOC: = 3734503424 ( 3561.5 MiB) Actual memory used (physical + swap)
MALLOC: + 575946752 ( 549.3 MiB) Bytes released to OS (aka unmapped)
MALLOC: ------------
MALLOC: = 4310450176 ( 4110.8 MiB) Virtual address space used
MALLOC:
MALLOC: 382884 Spans in use
MALLOC: 35 Thread heaps in use
MALLOC: 8192 Tcmalloc page size
------------------------------------------------
# ceph daemon osd.101 dump_mempools
{
    "mempool": {
        "by_pool": {
            "bloom_filter": {
                "items": 0,
                "bytes": 0
            },
            "bluestore_alloc": {
                "items": 4691828,
                "bytes": 37534624
            },
            "bluestore_cache_data": {
                "items": 0,
                "bytes": 0
            },
            "bluestore_cache_onode": {
                "items": 51,
                "bytes": 28968
            },
            "bluestore_cache_other": {
                "items": 5761276,
                "bytes": 46292425
            },
            "bluestore_fsck": {
                "items": 0,
                "bytes": 0
            },
            "bluestore_txc": {
                "items": 67,
                "bytes": 46096
            },
            "bluestore_writing_deferred": {
                "items": 208,
                "bytes": 26037057
            },
            "bluestore_writing": {
                "items": 52,
                "bytes": 6789398
            },
            "bluefs": {
                "items": 9478,
                "bytes": 183720
            },
            "buffer_anon": {
                "items": 291450,
                "bytes": 28093473
            },
            "buffer_meta": {
                "items": 546,
                "bytes": 34944
            },
            "osd": {
                "items": 98,
                "bytes": 1139152
            },
            "osd_mapbl": {
                "items": 78,
                "bytes": 8204276
            },
            "osd_pglog": {
                "items": 341944,
                "bytes": 120607952
            },
            "osdmap": {
                "items": 10687217,
                "bytes": 186830528
            },
            "osdmap_mapping": {
                "items": 0,
                "bytes": 0
            },
            "pgmap": {
                "items": 0,
                "bytes": 0
            },
            "mds_co": {
                "items": 0,
                "bytes": 0
            },
            "unittest_1": {
                "items": 0,
                "bytes": 0
            },
            "unittest_2": {
                "items": 0,
                "bytes": 0
            }
        },
        "total": {
            "items": 21784293,
            "bytes": 461822613
        }
    }
}
Right after restart + health_ok:
osd.101 tcmalloc heap stats:------------------------------------------------
MALLOC: 1173996280 ( 1119.6 MiB) Bytes in use by application
MALLOC: + 3727360 ( 3.6 MiB) Bytes in page heap freelist
MALLOC: + 25493688 ( 24.3 MiB) Bytes in central cache freelist
MALLOC: + 17101824 ( 16.3 MiB) Bytes in transfer cache freelist
MALLOC: + 20301904 ( 19.4 MiB) Bytes in thread cache freelists
MALLOC: + 5242880 ( 5.0 MiB) Bytes in malloc metadata
MALLOC: ------------
MALLOC: = 1245863936 ( 1188.1 MiB) Actual memory used (physical + swap)
MALLOC: + 20488192 ( 19.5 MiB) Bytes released to OS (aka unmapped)
MALLOC: ------------
MALLOC: = 1266352128 ( 1207.7 MiB) Virtual address space used
MALLOC:
MALLOC: 54160 Spans in use
MALLOC: 33 Thread heaps in use
MALLOC: 8192 Tcmalloc page size
------------------------------------------------
Am I looking at a memory leak here, or are these heap stats expected?
I don't mind the swap usage; it doesn't have an impact. I'm just wondering if I need to restart OSDs regularly. The "leakage" above occurred within only 2 months.
Best regards,
=================
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14
Hi all,
My cluster is in a bad state: the SST files in /var/lib/ceph/mon/xxx/store.db
keep growing, and Ceph warns that the mons are using a lot of disk space.
I set "mon compact on start = true" and restarted one of the monitors, but
it has been compacting for a long time and seems to never finish.
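For reference, the setting I used plus the runtime trigger mentioned in the docs (a sketch; the mon id is a placeholder):
# persisted in ceph.conf before the restart:
#   mon compact on start = true
# compaction can reportedly also be triggered at runtime:
ceph tell mon.<id> compact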
Hi all,
We are planning a new pool to store our dataset using CephFS. The data is almost read-only (but not guaranteed to be) and consists of a lot of small files. Each node in our cluster has 1 * 1T SSD and 2 * 6T HDDs, and we will deploy about 10 such nodes. We aim for the highest read throughput.
If we just use a replicated pool of size 3 on SSD, we should get the best performance; however, that leaves us only 1/3 of the SSD space as usable. And EC pools are not friendly to such a small-object read workload, I think.
Now I'm evaluating a mixed SSD and HDD replication strategy. Ideally, I want 3 data replicas, each on a different host (failure domain): 1 of them on SSD, the other 2 on HDD, with every read request normally directed to the SSD. So, as long as every SSD OSD is up, I'd expect the same read throughput as with the all-SSD deployment.
I've read the documentation and did some tests. Here is the CRUSH rule I'm testing with:
rule mixed_replicated_rule {
        id 3
        type replicated
        min_size 1
        max_size 10
        step take default class ssd
        step chooseleaf firstn 1 type host
        step emit
        step take default class hdd
        step chooseleaf firstn -1 type host
        step emit
}
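The mapping produced by this rule can be sanity-checked offline with crushtool, something like (a sketch; the file name, rule id and replica count follow the rule above):
# dump the compiled CRUSH map and test the rule offline
ceph osd getcrushmap -o crush.bin
crushtool -i crush.bin --test --rule 3 --num-rep 4 --show-mappings | head
# quick view of how many PGs would land on each OSD
crushtool -i crush.bin --test --rule 3 --num-rep 4 --show-utilization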
I've arrived at the following conclusions, but I'm not very sure about them:
* The first OSD produced by CRUSH will be the primary OSD (at least if I don't change the "primary affinity"). So, the above rule is guaranteed to map an SSD OSD as the primary in each PG, and every read request will be served from the SSD if it is up.
* It is currently not possible to enforce that the SSD and the HDD OSDs are chosen from different hosts. So, if I want to keep the data available even if 2 hosts fail, I need to choose 1 SSD and 3 HDD OSDs. That means setting the replication size to 4, instead of the ideal value of 3, on the pool using the above CRUSH rule (see the sketch below).
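A minimal sketch of what I mean (pool name and PG count are placeholders):
# create a pool that uses the rule above, then bump the size to 4
ceph osd pool create mixed_pool 256 256 replicated mixed_replicated_rule
ceph osd pool set mixed_pool size 4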
Am I correct about the above statements? How would this work in your experience? Thanks.
Dear all,
I am experimenting with Ceph as a replacement for the Andrew File System (https://en.wikipedia.org/wiki/Andrew_File_System). In my current setup, I am using AFS as a distributed filesystem for approximately 1000 users to store personal data and to let them access their home directories and other shared data from multiple locations across different buildings. Authentication is managed by Kerberos (+ an LDAP server). My goal is to replace AFS with CephFS but keep the current Kerberos database.
Right now I've managed to set up a test Ceph cluster with 6 nodes and 11 OSDs, and I can mount CephFS using the kernel driver + CephX.
However, from the Ceph docs I can't tell whether this is a suitable use case for Ceph, since the default authentication method, CephX, doesn't offer a standard username/password authentication protocol. As far as I understand, it requires creating a keyring with a randomly generated secret, which can then be used to mount the filesystem with the CephFS kernel module (https://docs.ceph.com/en/latest/cephfs/mount-using-kernel-driver/#mounting-…).
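For context, this is the kind of keyring-based mount I mean (a sketch; the filesystem name, user, monitor address and paths are placeholders):
# create a CephFS client capability for one user (a generated key, no password)
ceph fs authorize cephfs client.alice / rw
# store just the key for the kernel mount
ceph auth get-key client.alice > /etc/ceph/alice.secret
# kernel mount authenticating with the generated key
mount -t ceph mon1:6789:/ /mnt/cephfs -o name=alice,secretfile=/etc/ceph/alice.secret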
As for the Kerberos integration, I found this page in the docs, https://docs.ceph.com/en/latest/dev/ceph_krb_auth/, which is still a draft even though the last update was almost 2 years ago. From this page I can't tell whether the current version of Ceph supports full integration with GSSAPI/Kerberos/LDAP. Since the docs only refer to keytab files, I was wondering whether Kerberos can only be used as an authentication protocol between Ceph monitors/OSDs/metadata servers, and not for mounting the filesystem.
Therefore I am asking:
- whether anyone has tried Ceph for a similar use case
- what the current status of the Kerberos integration is
- whether there are alternatives to CephX for mounting CephFS with the kernel driver that use a username/password protocol
Thank you and best regards,
Alessandro Piazza
Hi,
I've got a problem on an Octopus (15.2.3, Debian packages) install: the
bucket's S3 index shows a file:
s3cmd ls s3://upvid/255/38355 --recursive
2020-07-27 17:48 50584342
s3://upvid/255/38355/juz_nie_zyjesz_sezon_2___oficjalny_zwiastun___netflix_mp4
radosgw-admin bi list also shows it
{
    "type": "plain",
    "idx": "255/38355/juz_nie_zyjesz_sezon_2___oficjalny_zwiastun___netflix_mp4",
    "entry": {
        "name": "255/38355/juz_nie_zyjesz_sezon_2___oficjalny_zwiastun___netflix_mp4",
        "instance": "",
        "ver": {
            "pool": 11,
            "epoch": 853842
        },
        "locator": "",
        "exists": "true",
        "meta": {
            "category": 1,
            "size": 50584342,
            "mtime": "2020-07-27T17:48:27.203008Z",
            "etag": "2b31cc8ce8b1fb92a5f65034f2d12581-7",
            "storage_class": "",
            "owner": "filmweb-app",
            "owner_display_name": "filmweb app user",
            "content_type": "",
            "accounted_size": 50584342,
            "user_data": "",
            "appendable": "false"
        },
        "tag": "_3ubjaztglHXfZr05wZCFCPzebQf-ZFP",
        "flags": 0,
        "pending_map": [],
        "versioned_epoch": 0
    }
},
but trying to download it via curl (I've set permissions to public) only gets me
<?xml version="1.0"
encoding="UTF-8"?><Error><Code>NoSuchKey</Code><BucketName>upvid</BucketName><RequestId>tx0000000000000000e716d-005f1f14cb-e478a-pl-war1</RequestId><HostId>e478a-pl-war1-pl</HostId></Error>
(files that actually don't exist return AccessDenied in the same context)
Same with other tools:
$ s3cmd get s3://upvid/255/38355/juz_nie_zyjesz_sezon_2___oficjalny_zwiastun___netflix_mp4 /tmp
download: 's3://upvid/255/38355/juz_nie_zyjesz_sezon_2___oficjalny_zwiastun___netflix_mp4' -> '/tmp/juz_nie_zyjesz_sezon_2___oficjalny_zwiastun___netflix_mp4' [1 of 1]
ERROR: S3 error: 404 (NoSuchKey)
Cluster health is OK.
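The next things I can think of checking are whether RGW itself can still stat the object and whether its backing RADOS objects exist (a sketch; the data-pool name below is a guess on my side):
# ask RGW for the object metadata directly
radosgw-admin object stat --bucket=upvid --object=255/38355/juz_nie_zyjesz_sezon_2___oficjalny_zwiastun___netflix_mp4
# look for the backing RADOS objects (pool name is an assumption; listing can be slow)
rados -p default.rgw.buckets.data ls | grep juz_nie_zyjesz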
Any ideas what is happening here?
--
Mariusz Gronczewski, Administrator
Efigence S. A.
ul. Wołoska 9a, 02-583 Warszawa
T: [+48] 22 380 13 13
NOC: [+48] 22 380 10 20
E: admin(a)efigence.com
Hello,
I played around with some log levels I can't remember and my logs are
now getting bigger than my DVD movie collection.
E.g.:
journalctl -b -u ceph-5436dd5d-83d4-4dc8-a93b-60ab5db145df@mon.ceph03.service > out.file
produces an out.file that is 1.1 GB in size.
I already tried:
ceph tell mon.ceph03 config set debug_mon 0/10
ceph tell mon.ceph03 config set debug_osd 0/10
ceph tell mon.ceph03 config set debug_mgr 0/10
ceph tell mon.ceph03 config set "mon_health_to_clog" false
ceph tell mon.ceph03 config set "mon_health_log_update_period" 30
ceph tell mon.ceph03 config set "debug_mgr" "0/0"
which made it better, but I really can't remember them all and would like
to get back to the default values.
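What I'm hoping for is something along these lines (guessed from the docs):
# show an option's compiled-in default (the "default" field in the output)
ceph config help debug_mon
# drop any override stored in the central config database
ceph config rm mon.ceph03 debug_mon
# values set only via "ceph tell ... config set" are runtime-only and fall
# back to the defaults after a daemon restart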
Is there a way to reset those log values?
Cheers,
Michael