Hello Community,
I have problems with ceph-mons in Docker. The mon containers start, but I get a lot of "e6 handle_auth_request failed to assign global_id" messages in the log. 2 mons are up, but I can't send any ceph commands.
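I assume I can still query the mons directly over their admin sockets, something like this (a sketch; the container name and mon id are placeholders):
# placeholders: adjust the container name and mon id to your deployment
docker ps --filter name=ceph-mon
docker exec <mon-container> ceph daemon mon.<mon-id> mon_status
docker exec <mon-container> ceph daemon mon.<mon-id> quorum_status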
Regards
Mateusz
Hi,
I'm investigating an issue where 4 to 5 OSDs in a rack aren't marked as
down when the network is cut to that rack.
Situation:
- Nautilus cluster
- 3 racks
- 120 OSDs, 40 per rack
We performed a test where we shut down the Top-of-Rack network for each
rack. This worked as expected with two racks, but with the third
something weird happened.
Of the 40 OSDs that were supposed to be marked down, only 36 were.
In the end it took 15 minutes for all 40 OSDs to be marked down.
$ ceph config set mon mon_osd_reporter_subtree_level rack
That setting is there to make sure that we only accept failure reports
from other racks.
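For completeness, the active value can be double-checked against the config database and the running leader like this (a sketch; the mon name is taken from the log lines below):
ceph config get mon mon_osd_reporter_subtree_level
ceph config show mon.CEPH2-MON1-206-U39 | grep mon_osd_reporter_subtree_level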
For example, this is what we saw in the logs:
2020-10-29T03:49:44.409-0400 7fbda185e700 10
mon.CEPH2-MON1-206-U39@0(leader).osd e107102 osd.51 has 54 reporters,
239.856038 grace (20.000000 + 219.856 + 7.43801e-23), max_failed_since
2020-10-29T03:47:22.374857-0400
But osd.51 was still not marked down after 54 reporters had reported
that it is actually down.
I checked: no ping or other traffic to osd.51 was possible. The host is unreachable.
Another OSD was marked down, but that also took a couple of minutes:
2020-10-29T03:50:54.455-0400 7fbda185e700 10
mon.CEPH2-MON1-206-U39@0(leader).osd e107102 osd.37 has 48 reporters,
221.378970 grace (20.000000 + 201.379 + 6.34437e-23), max_failed_since
2020-10-29T03:47:12.761584-0400
2020-10-29T03:50:54.455-0400 7fbda185e700 1
mon.CEPH2-MON1-206-U39@0(leader).osd e107102 we have enough reporters
to mark osd.37 down
In the end osd.51 was marked down, but only because the MON's no-beacon
timeout kicked in:
2020-10-29T03:53:44.631-0400 7fbda185e700 0 log_channel(cluster) log
[INF] : osd.51 marked down after no beacon for 903.943390 seconds
2020-10-29T03:53:44.631-0400 7fbda185e700 -1
mon.CEPH2-MON1-206-U39@0(leader).osd e107104 no beacon from osd.51 since
2020-10-29T03:38:40.689062-0400, 903.943390 seconds ago. marking down
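The ~900 seconds in that message looks like the beacon-based fallback rather than the reporter-based path; assuming default values, these are the two knobs involved:
# beacon fallback that finally marked osd.51 down (default 900 s, matching the log)
ceph config get mon mon_osd_report_timeout
# base grace for the reporter path (the 20.000000 in the log lines above)
ceph config get mon osd_heartbeat_grace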
I haven't seen this happen before in any cluster. It's also strange that
it only happens in this rack; the other two racks work fine.
ID CLASS WEIGHT TYPE NAME
-1 1545.35999 root default
-206 515.12000 rack 206
-7 27.94499 host CEPH2-206-U16
...
-207 515.12000 rack 207
-17 27.94499 host CEPH2-207-U16
...
-208 515.12000 rack 208
-31 27.94499 host CEPH2-208-U16
...
That's what the CRUSH map looks like: straightforward, with 3x replication
over 3 racks.
This issue only occurs in rack *207*.
Has anybody seen this before or knows where to start?
Wido
Hi everyone, I asked the same question on Stack Overflow, but I'll repeat it here.
I configured a bucket notification using the bucket owner's credentials, and when the owner performs actions I can see new events at the configured endpoint (Kafka, actually). However, when I perform actions on the bucket with another user's credentials, I do not see events in the configured notification topic. Is this expected behavior, and does each user have to configure their own topic (is that even possible if the user is not a system user)? Or have I missed something? Thank you.
https://stackoverflow.com/questions/64384060/enable-bucket-notifications-fo…
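For context, the notification was set up roughly like this with the owner's credentials (a sketch; the endpoint, bucket name and topic ARN are placeholders):
# configure the notification on the bucket (owner credentials)
aws --endpoint-url http://rgw.example.com:8000 s3api put-bucket-notification-configuration \
    --bucket mybucket \
    --notification-configuration '{"TopicConfigurations":[{"Id":"kafka-events","TopicArn":"arn:aws:sns:default::my-kafka-topic","Events":["s3:ObjectCreated:*","s3:ObjectRemoved:*"]}]}'
# inspect what is currently configured on the bucket
aws --endpoint-url http://rgw.example.com:8000 s3api get-bucket-notification-configuration --bucket mybucket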
Hi all,
on a Mimic 13.2.8 cluster I observe a gradual increase in memory usage by the OSD daemons, in particular under heavy load. For our spinners I use osd_memory_target=2G. The daemons overrun the 2G of virtual size rather quickly and grow to something like 4G virtual. The real memory consumption stays more or less around the 2G target. There are some overshoots, but these go down again during periods with less load.
What I observe now is that the actual memory consumption slowly grows and the OSDs start using more than 2G of virtual memory. I see this as slowly growing swap usage despite more RAM being available (swappiness=10). This indicates allocated-but-unused memory, or memory not accessed for a long time, which usually points to a leak. Here are some heap stats:
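For reference, heap stats like the ones below come from the tcmalloc admin-socket interface, roughly like this (osd id as below; heap release listed only as the related knob):
# tcmalloc heap statistics via the admin socket
ceph daemon osd.101 heap stats
# tcmalloc can also be asked to return freed pages to the OS
ceph daemon osd.101 heap release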
Before restart:
osd.101 tcmalloc heap stats:------------------------------------------------
MALLOC: 3438940768 ( 3279.6 MiB) Bytes in use by application
MALLOC: + 5611520 ( 5.4 MiB) Bytes in page heap freelist
MALLOC: + 257307352 ( 245.4 MiB) Bytes in central cache freelist
MALLOC: + 357376 ( 0.3 MiB) Bytes in transfer cache freelist
MALLOC: + 6727368 ( 6.4 MiB) Bytes in thread cache freelists
MALLOC: + 25559040 ( 24.4 MiB) Bytes in malloc metadata
MALLOC: ------------
MALLOC: = 3734503424 ( 3561.5 MiB) Actual memory used (physical + swap)
MALLOC: + 575946752 ( 549.3 MiB) Bytes released to OS (aka unmapped)
MALLOC: ------------
MALLOC: = 4310450176 ( 4110.8 MiB) Virtual address space used
MALLOC:
MALLOC: 382884 Spans in use
MALLOC: 35 Thread heaps in use
MALLOC: 8192 Tcmalloc page size
------------------------------------------------
# ceph daemon osd.101 dump_mempools
{
    "mempool": {
        "by_pool": {
            "bloom_filter": {
                "items": 0,
                "bytes": 0
            },
            "bluestore_alloc": {
                "items": 4691828,
                "bytes": 37534624
            },
            "bluestore_cache_data": {
                "items": 0,
                "bytes": 0
            },
            "bluestore_cache_onode": {
                "items": 51,
                "bytes": 28968
            },
            "bluestore_cache_other": {
                "items": 5761276,
                "bytes": 46292425
            },
            "bluestore_fsck": {
                "items": 0,
                "bytes": 0
            },
            "bluestore_txc": {
                "items": 67,
                "bytes": 46096
            },
            "bluestore_writing_deferred": {
                "items": 208,
                "bytes": 26037057
            },
            "bluestore_writing": {
                "items": 52,
                "bytes": 6789398
            },
            "bluefs": {
                "items": 9478,
                "bytes": 183720
            },
            "buffer_anon": {
                "items": 291450,
                "bytes": 28093473
            },
            "buffer_meta": {
                "items": 546,
                "bytes": 34944
            },
            "osd": {
                "items": 98,
                "bytes": 1139152
            },
            "osd_mapbl": {
                "items": 78,
                "bytes": 8204276
            },
            "osd_pglog": {
                "items": 341944,
                "bytes": 120607952
            },
            "osdmap": {
                "items": 10687217,
                "bytes": 186830528
            },
            "osdmap_mapping": {
                "items": 0,
                "bytes": 0
            },
            "pgmap": {
                "items": 0,
                "bytes": 0
            },
            "mds_co": {
                "items": 0,
                "bytes": 0
            },
            "unittest_1": {
                "items": 0,
                "bytes": 0
            },
            "unittest_2": {
                "items": 0,
                "bytes": 0
            }
        },
        "total": {
            "items": 21784293,
            "bytes": 461822613
        }
    }
}
Right after restart + health_ok:
osd.101 tcmalloc heap stats:------------------------------------------------
MALLOC: 1173996280 ( 1119.6 MiB) Bytes in use by application
MALLOC: + 3727360 ( 3.6 MiB) Bytes in page heap freelist
MALLOC: + 25493688 ( 24.3 MiB) Bytes in central cache freelist
MALLOC: + 17101824 ( 16.3 MiB) Bytes in transfer cache freelist
MALLOC: + 20301904 ( 19.4 MiB) Bytes in thread cache freelists
MALLOC: + 5242880 ( 5.0 MiB) Bytes in malloc metadata
MALLOC: ------------
MALLOC: = 1245863936 ( 1188.1 MiB) Actual memory used (physical + swap)
MALLOC: + 20488192 ( 19.5 MiB) Bytes released to OS (aka unmapped)
MALLOC: ------------
MALLOC: = 1266352128 ( 1207.7 MiB) Virtual address space used
MALLOC:
MALLOC: 54160 Spans in use
MALLOC: 33 Thread heaps in use
MALLOC: 8192 Tcmalloc page size
------------------------------------------------
Am I looking at a memory leak here, or are these heap stats expected?
I don't mind the swap usage; it doesn't have an impact. I'm just wondering if I need to restart OSDs regularly. The "leakage" above occurred within only 2 months.
Best regards,
=================
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14
Hi all,
My cluster is in a bad state: the SST files in /var/lib/ceph/mon/xxx/store.db
keep growing, and Ceph warns that the mons are using a lot of disk space.
I set "mon compact on start = true" and restarted one of the monitors, but
it has been compacting for a long time and seems to never finish.
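For reference, the setting I used plus the runtime trigger mentioned in the docs (a sketch; the mon id is a placeholder):
# persisted in ceph.conf before the restart:
#   mon compact on start = true
# compaction can reportedly also be triggered at runtime:
ceph tell mon.<id> compact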
Hi all,
We are planning a new pool to store our dataset using CephFS. The data is almost read-only (but not guaranteed to be) and consists of a lot of small files. Each node in our cluster has 1 * 1T SSD and 2 * 6T HDDs, and we will deploy about 10 such nodes. We aim for the highest read throughput.
If we just use a replicated pool of size 3 on SSD, we should get the best performance; however, that leaves us only 1/3 of the SSD space as usable. And EC pools are not friendly to such a small-object read workload, I think.
Now I'm evaluating a mixed SSD and HDD replication strategy. Ideally, I want 3 data replicas, each on a different host (failure domain): 1 of them on SSD, the other 2 on HDD, with every read request normally directed to the SSD. So, as long as every SSD OSD is up, I'd expect the same read throughput as with the all-SSD deployment.
I've read the documentation and did some tests. Here is the CRUSH rule I'm testing with:
rule mixed_replicated_rule {
        id 3
        type replicated
        min_size 1
        max_size 10
        step take default class ssd
        step chooseleaf firstn 1 type host
        step emit
        step take default class hdd
        step chooseleaf firstn -1 type host
        step emit
}
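The mapping produced by this rule can be sanity-checked offline with crushtool, something like (a sketch; the file name, rule id and replica count follow the rule above):
# dump the compiled CRUSH map and test the rule offline
ceph osd getcrushmap -o crush.bin
crushtool -i crush.bin --test --rule 3 --num-rep 4 --show-mappings | head
# quick view of how many PGs would land on each OSD
crushtool -i crush.bin --test --rule 3 --num-rep 4 --show-utilization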
I've arrived at the following conclusions, but I'm not very sure about them:
* The first OSD produced by CRUSH will be the primary OSD (at least if I don't change the "primary affinity"). So, the above rule is guaranteed to map an SSD OSD as the primary in each PG, and every read request will be served from the SSD if it is up.
* It is currently not possible to enforce that the SSD and the HDD OSDs are chosen from different hosts. So, if I want to keep the data available even if 2 hosts fail, I need to choose 1 SSD and 3 HDD OSDs. That means setting the replication size to 4, instead of the ideal value of 3, on the pool using the above CRUSH rule (see the sketch below).
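A minimal sketch of what I mean (pool name and PG count are placeholders):
# create a pool that uses the rule above, then bump the size to 4
ceph osd pool create mixed_pool 256 256 replicated mixed_replicated_rule
ceph osd pool set mixed_pool size 4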
Am I correct about the above statements? How would this work in your experience? Thanks.
Dear all,
I am experimenting with Ceph as a replacement for the Andrew File System (https://en.wikipedia.org/wiki/Andrew_File_System). In my current setup, I am using AFS as a distributed filesystem for approximately 1000 users to store personal data and to let them access their home directories and other shared data from multiple locations across different buildings. Authentication is managed by Kerberos (+ an LDAP server). My goal is to replace AFS with CephFS but keep the current Kerberos database.
Right now I've managed to set up a test Ceph cluster with 6 nodes and 11 OSDs, and I can mount CephFS using the kernel driver + CephX.
However, from the Ceph docs I can't tell whether this is a suitable use case for Ceph, since the default authentication method, CephX, doesn't offer a standard username/password authentication protocol. As far as I understand, it requires creating a keyring with a randomly generated secret, which can then be used to mount the filesystem with the CephFS kernel module (https://docs.ceph.com/en/latest/cephfs/mount-using-kernel-driver/#mounting-…).
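For context, this is the kind of keyring-based mount I mean (a sketch; the filesystem name, user, monitor address and paths are placeholders):
# create a CephFS client capability for one user (a generated key, no password)
ceph fs authorize cephfs client.alice / rw
# store just the key for the kernel mount
ceph auth get-key client.alice > /etc/ceph/alice.secret
# kernel mount authenticating with the generated key
mount -t ceph mon1:6789:/ /mnt/cephfs -o name=alice,secretfile=/etc/ceph/alice.secret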
As for the Kerberos integration, I found this page in the docs, https://docs.ceph.com/en/latest/dev/ceph_krb_auth/, which is still a draft even though the last update was almost 2 years ago. From this page I can't tell whether the current version of Ceph supports full integration with GSSAPI/Kerberos/LDAP. Since the docs only refer to keytab files, I was wondering whether Kerberos can only be used as an authentication protocol between Ceph monitors/OSDs/metadata servers, and not for mounting the filesystem.
Therefore I am asking:
- whether anyone has tried Ceph for a similar use case
- what the current status of the Kerberos integration is
- whether there are alternatives to CephX for mounting CephFS with the kernel driver that use a username/password protocol
Thank you and best regards,
Alessandro Piazza
Hi,
I've got a problem on an Octopus (15.2.3, Debian packages) install: the
bucket's S3 index shows a file:
s3cmd ls s3://upvid/255/38355 --recursive
2020-07-27 17:48 50584342
s3://upvid/255/38355/juz_nie_zyjesz_sezon_2___oficjalny_zwiastun___netflix_mp4
radosgw-admin bi list also shows it
{
    "type": "plain",
    "idx": "255/38355/juz_nie_zyjesz_sezon_2___oficjalny_zwiastun___netflix_mp4",
    "entry": {
        "name": "255/38355/juz_nie_zyjesz_sezon_2___oficjalny_zwiastun___netflix_mp4",
        "instance": "",
        "ver": {
            "pool": 11,
            "epoch": 853842
        },
        "locator": "",
        "exists": "true",
        "meta": {
            "category": 1,
            "size": 50584342,
            "mtime": "2020-07-27T17:48:27.203008Z",
            "etag": "2b31cc8ce8b1fb92a5f65034f2d12581-7",
            "storage_class": "",
            "owner": "filmweb-app",
            "owner_display_name": "filmweb app user",
            "content_type": "",
            "accounted_size": 50584342,
            "user_data": "",
            "appendable": "false"
        },
        "tag": "_3ubjaztglHXfZr05wZCFCPzebQf-ZFP",
        "flags": 0,
        "pending_map": [],
        "versioned_epoch": 0
    }
},
but trying to download it via curl (I've set permissions to public) only gets me
<?xml version="1.0"
encoding="UTF-8"?><Error><Code>NoSuchKey</Code><BucketName>upvid</BucketName><RequestId>tx0000000000000000e716d-005f1f14cb-e478a-pl-war1</RequestId><HostId>e478a-pl-war1-pl</HostId></Error>
(files that actually don't exist return AccessDenied in the same context)
Same with other tools:
$ s3cmd get s3://upvid/255/38355/juz_nie_zyjesz_sezon_2___oficjalny_zwiastun___netflix_mp4 /tmp
download: 's3://upvid/255/38355/juz_nie_zyjesz_sezon_2___oficjalny_zwiastun___netflix_mp4' -> '/tmp/juz_nie_zyjesz_sezon_2___oficjalny_zwiastun___netflix_mp4' [1 of 1]
ERROR: S3 error: 404 (NoSuchKey)
Cluster health is OK.
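The next things I can think of checking are whether RGW itself can still stat the object and whether its backing RADOS objects exist (a sketch; the data-pool name below is a guess on my side):
# ask RGW for the object metadata directly
radosgw-admin object stat --bucket=upvid --object=255/38355/juz_nie_zyjesz_sezon_2___oficjalny_zwiastun___netflix_mp4
# look for the backing RADOS objects (pool name is an assumption; listing can be slow)
rados -p default.rgw.buckets.data ls | grep juz_nie_zyjesz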
Any ideas what is happening here?
--
Mariusz Gronczewski, Administrator
Efigence S. A.
ul. Wołoska 9a, 02-583 Warszawa
T: [+48] 22 380 13 13
NOC: [+48] 22 380 10 20
E: admin(a)efigence.com
Hello,
I played around with some log levels I can't remember and my logs are
now getting bigger than my DVD movie collection.
E.g.:
journalctl -b -u ceph-5436dd5d-83d4-4dc8-a93b-60ab5db145df@mon.ceph03.service > out.file
produces an out.file that is 1.1 GB in size.
I already tried:
ceph tell mon.ceph03 config set debug_mon 0/10
ceph tell mon.ceph03 config set debug_osd 0/10
ceph tell mon.ceph03 config set debug_mgr 0/10
ceph tell mon.ceph03 config set "mon_health_to_clog" false
ceph tell mon.ceph03 config set "mon_health_log_update_period" 30
ceph tell mon.ceph03 config set "debug_mgr" "0/0"
which made it better, but I really can't remember them all and would like
to get back to the default values.
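What I'm hoping for is something along these lines (guessed from the docs):
# show an option's compiled-in default (the "default" field in the output)
ceph config help debug_mon
# drop any override stored in the central config database
ceph config rm mon.ceph03 debug_mon
# values set only via "ceph tell ... config set" are runtime-only and fall
# back to the defaults after a daemon restart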
Is there a way to reset those log values?
Cheers,
Michael