Over the weekend I had multiple OSD servers in my Octopus cluster
(15.2.4) crash and reboot at nearly the same time. The OSDs are part of
an erasure coded pool. At the time the cluster had been busy with a
long-running (~week) remapping of a large number of PGs after I
incrementally added more OSDs to the cluster. After bringing all of the
OSDs back up, I have 25 unfound objects and 75 degraded objects. There
are other problems reported, but I'm primarily concerned with these
unfound/degraded objects.
The pool with the missing objects is a cephfs pool. The files stored in
the pool are backed up on tape, so I can easily restore individual files
as needed (though I would not want to restore the entire filesystem).
I tried following the guide at
https://docs.ceph.com/docs/octopus/rados/troubleshooting/troubleshooting-pg….
I found a number of OSDs that are still 'not queried'. Restarting a
sampling of these OSDs changed the state from 'not queried' to 'already
probed', but that did not recover any of the unfound or degraded objects.
I have also tried 'ceph pg deep-scrub' on the affected PGs, but never
saw them get scrubbed. I also tried doing a 'ceph pg force-recovery' on
the affected PGs, but only one seems to have been tagged accordingly
(see ceph -s output below).
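For reference, this is roughly how I have been checking the probe state and the unfound objects (PG 7.1a below is just an example ID):

ceph health detail | grep unfound   # which PGs have unfound objects
ceph pg 7.1a query | less           # "might_have_unfound" lists per-OSD probe state ('not queried', 'already probed', ...)
ceph pg 7.1a list_unfound           # names of the unfound objects in that PG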
The guide also says "Sometimes it simply takes some time for the cluster
to query possible locations." I'm not sure how long "some time" might
take, but it hasn't changed after several hours.
My questions are:
* Is there a way to force the cluster to query the possible locations
sooner?
* Is it possible to identify the files in cephfs that are affected, so
that I could delete only the affected files and restore them from backup
tapes?
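For the second question, if the objects in the CephFS data pool follow the usual <inode-hex>.<block-index> naming, something along these lines might work for mapping an unfound object back to a file (the object name and mount point below are placeholders):

# object name taken from 'ceph pg <pgid> list_unfound', e.g. 10000000abc.00000000
ino_hex=10000000abc
find /cephfs -inum $((16#$ino_hex))   # locate the file with that inode in the mounted filesystem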
--Mike
ceph -s:

  cluster:
    id:     066f558c-6789-4a93-aaf1-5af1ba01a3ad
    health: HEALTH_ERR
            1 clients failing to respond to capability release
            1 MDSs report slow requests
            25/78520351 objects unfound (0.000%)
            2 nearfull osd(s)
            Reduced data availability: 1 pg inactive
            Possible data damage: 9 pgs recovery_unfound
            Degraded data redundancy: 75/626645098 objects degraded (0.000%), 9 pgs degraded
            1013 pgs not deep-scrubbed in time
            1013 pgs not scrubbed in time
            2 pool(s) nearfull
            1 daemons have recently crashed
            4 slow ops, oldest one blocked for 77939 sec, daemons [osd.0,osd.41] have slow ops.

  services:
    mon: 4 daemons, quorum ceph1,ceph2,ceph3,ceph4 (age 9d)
    mgr: ceph3(active, since 11d), standbys: ceph2, ceph4, ceph1
    mds: archive:1 {0=ceph4=up:active} 3 up:standby
    osd: 121 osds: 121 up (since 6m), 121 in (since 101m); 4 remapped pgs

  task status:
    scrub status:
      mds.ceph4: idle

  data:
    pools:   9 pools, 2433 pgs
    objects: 78.52M objects, 298 TiB
    usage:   412 TiB used, 545 TiB / 956 TiB avail
    pgs:     0.041% pgs unknown
             75/626645098 objects degraded (0.000%)
             135224/626645098 objects misplaced (0.022%)
             25/78520351 objects unfound (0.000%)
             2421 active+clean
             5    active+recovery_unfound+degraded
             3    active+recovery_unfound+degraded+remapped
             2    active+clean+scrubbing+deep
             1    unknown
             1    active+forced_recovery+recovery_unfound+degraded

  progress:
    PG autoscaler decreasing pool 7 PGs from 1024 to 512 (5d)
      [............................]
Hello,
Over the last week I have tried optimising the performance of our MDS
nodes for the large number of files and concurrent clients we have. It
turns out that despite various stability fixes in recent releases, the
default configuration still doesn't appear to be optimal for keeping the
cache size under control and avoiding intermittent I/O blocks.
Unfortunately, it is very hard to tweak the configuration to something
that works, because the necessary tuning parameters are largely
undocumented or only described in very technical terms in the source
code, making them quite unapproachable for administrators not familiar
with all the CephFS internals. I would therefore like to ask whether it
would be possible to document the "advanced" MDS settings more clearly:
what they do and in which direction they have to be tuned for more or
less aggressive cap recall, for instance (sometimes it is not clear
whether a threshold is a minimum or a maximum).
I am in the very (un)fortunate situation of having folders with several
hundred thousand direct sub-folders or files (and one extreme case with
almost 7 million dentries), which makes for a pretty good benchmark for
measuring cap growth while performing operations on them. For the time
being, I came up with this configuration, which seems to work for me
but is still far from optimal:
mds basic mds_cache_memory_limit 10737418240
mds advanced mds_cache_trim_threshold 131072
mds advanced mds_max_caps_per_client 500000
mds advanced mds_recall_max_caps 17408
mds advanced mds_recall_max_decay_rate 2.000000
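In case anyone wants to experiment with the same values, they should be settable at runtime via "ceph config set", e.g.:

ceph config set mds mds_cache_memory_limit    10737418240
ceph config set mds mds_cache_trim_threshold  131072
ceph config set mds mds_max_caps_per_client   500000
ceph config set mds mds_recall_max_caps       17408
ceph config set mds mds_recall_max_decay_rate 2.0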
The parameters I am least sure about---because I understand the least
how they actually work---are mds_cache_trim_threshold and
mds_recall_max_decay_rate. Despite reading the description in
src/common/options.cc, I understand only half of what they're doing and
I am also not quite sure in which direction to tune them for optimal
results.
Another point where I am struggling is the correct configuration of
mds_recall_max_caps. The default of 5K doesn't work too well for me, but
values above 20K also don't seem to be a good choice. While high values
result in fewer blocked ops and better performance without destabilising
the MDS, they also lead to slow but unbounded cache growth, which seems
counter-intuitive. 17K was the maximum I could go. Higher values work
for most use cases, but when listing very large folders with millions of
dentries, the MDS cache size slowly starts to exceed the limit after a
few hours, since the MDSs are failing to keep clients below
mds_max_caps_per_client despite not showing any "failing to respond to
cache pressure" warnings.
With the configuration above, I do not have cache size issues any more,
but it comes at the cost of performance and slow/blocked ops. A few
hints as to how I could optimise my settings for better client
performance would be much appreciated, as would additional
documentation for all the "advanced" MDS settings.
Thanks a lot
Janek
Good day, cephers!
We've recently upgraded our cluster from the 14.2.8 to the 14.2.10 release, also
performing a full system package upgrade (Ubuntu 18.04 LTS).
After that, performance dropped significantly, the main reason being that the
journal SSDs now have no merges, huge queues, and increased latency.
There are a few screenshots in the attachments. They are for an SSD journal that
hosts block.db/block.wal for 3 spinning OSDs, and it looks like this for
all our SSD block.db/wal devices across all nodes.
Any ideas what might cause this? Maybe I've missed something important in
the release notes?
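One thing I still want to rule out is that the kernel that came with the package upgrade changed the block-layer settings of those SSDs (just a guess on my part), e.g.:

cat /sys/block/sdX/queue/scheduler   # sdX = journal SSD; active scheduler shown in [brackets]
cat /sys/block/sdX/queue/nomerges    # 0 = merging enabled, 1/2 = merging partially/fully disabled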
Dear Cephers,
we are currently mounting CephFS with relatime, using the FUSE client (version 13.2.6):
ceph-fuse on /cephfs type fuse.ceph-fuse (rw,relatime,user_id=0,group_id=0,allow_other)
For the first time, I wanted to use atime to identify old unused data. My expectation with "relatime" was that the access time stamp would be updated less often, for example,
only if the last file access was >24 hours ago. However, that does not seem to be the case:
----------------------------------------------
$ stat /cephfs/grid/atlas/atlaslocalgroupdisk/rucio/group/phys-higgs/ed/cb/group.phys-higgs.17620861._000004.HSM_common.root
...
Access: 2019-04-10 15:50:04.975959159 +0200
Modify: 2019-04-10 15:50:05.651613843 +0200
Change: 2019-04-10 15:50:06.141006962 +0200
...
$ cat /cephfs/grid/atlas/atlaslocalgroupdisk/rucio/group/phys-higgs/ed/cb/group.phys-higgs.17620861._000004.HSM_common.root > /dev/null
$ sync
$ stat /cephfs/grid/atlas/atlaslocalgroupdisk/rucio/group/phys-higgs/ed/cb/group.phys-higgs.17620861._000004.HSM_common.root
...
Access: 2019-04-10 15:50:04.975959159 +0200
Modify: 2019-04-10 15:50:05.651613843 +0200
Change: 2019-04-10 15:50:06.141006962 +0200
...
----------------------------------------------
I also tried this via an nfs-ganesha mount, and via a ceph-fuse mount with admin caps,
but atime never changes.
Is atime really never updated with CephFS, or is this configurable?
Something as coarse as "update at maximum once per day only" would be perfectly fine for the use case.
Cheers,
Oliver
Hello,
We are planning to perform a small upgrade to our cluster and slowly start adding 12TB SATA HDD drives. We need to accommodate the additional SSD WAL/DB requirements as well. Currently we are considering the following:
HDD Drives - Seagate EXOS 12TB
SSD Drives for WAL/DB - Intel D3 S4510 960GB or Intel D3 S4610 960GB
Our cluster isn't hosting any IO-intensive DBs or IO-hungry VMs such as Exchange, MSSQL, etc.
From the documentation I've read, the recommended size for the DB is between 1% and 4% of the size of the OSD. Would a 2% figure be sufficient (so around 240GB of DB space for each 12TB OSD)?
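Just to spell out the arithmetic behind that figure (assuming one 960GB SSD is shared by several OSDs):

12 TB x 1% = 120 GB  ->  up to 8 x 12TB OSDs per 960GB SSD
12 TB x 2% = 240 GB  ->  up to 4 x 12TB OSDs per 960GB SSD
12 TB x 4% = 480 GB  ->  up to 2 x 12TB OSDs per 960GB SSD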
Also, from your experience, which is the better model for the SSD DB/WAL? Would the Intel S4510 be sufficient for our purpose, or would the S4610 be a much better choice? Are there any other cost-effective options to consider instead of the above models?
The same question for the HDDs. Are there any other drives we should consider instead of the Seagate EXOS series?
Thanks for your help and suggestions.
Andrei
Hi all,
on a mimic 13.2.8 cluster I observe a gradual increase of memory usage by OSD daemons, in particular, under heavy load. For our spinners I use osd_memory_target=2G. The daemons overrun the 2G in virt size rather quickly and grow to something like 4G virtual. The real memory consumption stays more or less around the 2G of the target. There are some overshoots, but these go down again during periods with less load.
What I observe now is that the actual memory consumption slowly grows and OSDs start using more than the 2G of virtual memory. I see this as slowly growing swap usage despite having more RAM available (swappiness=10). This indicates allocated but unused memory, or memory not accessed for a long time, usually a leak. Here are some heap stats:
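For reference, the stats below were collected via the admin socket, roughly like this:

ceph daemon osd.101 heap stats       # tcmalloc heap statistics
ceph daemon osd.101 dump_mempools    # Ceph-internal mempool accounting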
Before restart:
osd.101 tcmalloc heap stats:------------------------------------------------
MALLOC: 3438940768 ( 3279.6 MiB) Bytes in use by application
MALLOC: + 5611520 ( 5.4 MiB) Bytes in page heap freelist
MALLOC: + 257307352 ( 245.4 MiB) Bytes in central cache freelist
MALLOC: + 357376 ( 0.3 MiB) Bytes in transfer cache freelist
MALLOC: + 6727368 ( 6.4 MiB) Bytes in thread cache freelists
MALLOC: + 25559040 ( 24.4 MiB) Bytes in malloc metadata
MALLOC: ------------
MALLOC: = 3734503424 ( 3561.5 MiB) Actual memory used (physical + swap)
MALLOC: + 575946752 ( 549.3 MiB) Bytes released to OS (aka unmapped)
MALLOC: ------------
MALLOC: = 4310450176 ( 4110.8 MiB) Virtual address space used
MALLOC:
MALLOC: 382884 Spans in use
MALLOC: 35 Thread heaps in use
MALLOC: 8192 Tcmalloc page size
------------------------------------------------
# ceph daemon osd.101 dump_mempools
{
    "mempool": {
        "by_pool": {
            "bloom_filter": {
                "items": 0,
                "bytes": 0
            },
            "bluestore_alloc": {
                "items": 4691828,
                "bytes": 37534624
            },
            "bluestore_cache_data": {
                "items": 0,
                "bytes": 0
            },
            "bluestore_cache_onode": {
                "items": 51,
                "bytes": 28968
            },
            "bluestore_cache_other": {
                "items": 5761276,
                "bytes": 46292425
            },
            "bluestore_fsck": {
                "items": 0,
                "bytes": 0
            },
            "bluestore_txc": {
                "items": 67,
                "bytes": 46096
            },
            "bluestore_writing_deferred": {
                "items": 208,
                "bytes": 26037057
            },
            "bluestore_writing": {
                "items": 52,
                "bytes": 6789398
            },
            "bluefs": {
                "items": 9478,
                "bytes": 183720
            },
            "buffer_anon": {
                "items": 291450,
                "bytes": 28093473
            },
            "buffer_meta": {
                "items": 546,
                "bytes": 34944
            },
            "osd": {
                "items": 98,
                "bytes": 1139152
            },
            "osd_mapbl": {
                "items": 78,
                "bytes": 8204276
            },
            "osd_pglog": {
                "items": 341944,
                "bytes": 120607952
            },
            "osdmap": {
                "items": 10687217,
                "bytes": 186830528
            },
            "osdmap_mapping": {
                "items": 0,
                "bytes": 0
            },
            "pgmap": {
                "items": 0,
                "bytes": 0
            },
            "mds_co": {
                "items": 0,
                "bytes": 0
            },
            "unittest_1": {
                "items": 0,
                "bytes": 0
            },
            "unittest_2": {
                "items": 0,
                "bytes": 0
            }
        },
        "total": {
            "items": 21784293,
            "bytes": 461822613
        }
    }
}
Right after restart + health_ok:
osd.101 tcmalloc heap stats:------------------------------------------------
MALLOC: 1173996280 ( 1119.6 MiB) Bytes in use by application
MALLOC: + 3727360 ( 3.6 MiB) Bytes in page heap freelist
MALLOC: + 25493688 ( 24.3 MiB) Bytes in central cache freelist
MALLOC: + 17101824 ( 16.3 MiB) Bytes in transfer cache freelist
MALLOC: + 20301904 ( 19.4 MiB) Bytes in thread cache freelists
MALLOC: + 5242880 ( 5.0 MiB) Bytes in malloc metadata
MALLOC: ------------
MALLOC: = 1245863936 ( 1188.1 MiB) Actual memory used (physical + swap)
MALLOC: + 20488192 ( 19.5 MiB) Bytes released to OS (aka unmapped)
MALLOC: ------------
MALLOC: = 1266352128 ( 1207.7 MiB) Virtual address space used
MALLOC:
MALLOC: 54160 Spans in use
MALLOC: 33 Thread heaps in use
MALLOC: 8192 Tcmalloc page size
------------------------------------------------
Am I looking at a memory leak here or are these heap stats expected?
I don't mind the swap usage, it doesn't have an impact. I'm just wondering if I need to restart the OSDs regularly. The "leakage" above occurred within only 2 months.
Best regards,
=================
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14
Hi,
I've got a problem on an Octopus (15.2.3, Debian packages) install: the bucket's
S3 index shows a file:
s3cmd ls s3://upvid/255/38355 --recursive
2020-07-27 17:48 50584342
s3://upvid/255/38355/juz_nie_zyjesz_sezon_2___oficjalny_zwiastun___netflix_mp4
radosgw-admin bi list also shows it
{
    "type": "plain",
    "idx": "255/38355/juz_nie_zyjesz_sezon_2___oficjalny_zwiastun___netflix_mp4",
    "entry": {
        "name": "255/38355/juz_nie_zyjesz_sezon_2___oficjalny_zwiastun___netflix_mp4",
        "instance": "",
        "ver": {
            "pool": 11,
            "epoch": 853842
        },
        "locator": "",
        "exists": "true",
        "meta": {
            "category": 1,
            "size": 50584342,
            "mtime": "2020-07-27T17:48:27.203008Z",
            "etag": "2b31cc8ce8b1fb92a5f65034f2d12581-7",
            "storage_class": "",
            "owner": "filmweb-app",
            "owner_display_name": "filmweb app user",
            "content_type": "",
            "accounted_size": 50584342,
            "user_data": "",
            "appendable": "false"
        },
        "tag": "_3ubjaztglHXfZr05wZCFCPzebQf-ZFP",
        "flags": 0,
        "pending_map": [],
        "versioned_epoch": 0
    }
},
but trying to download it via curl (I've set permissions to public) only gets me
<?xml version="1.0"
encoding="UTF-8"?><Error><Code>NoSuchKey</Code><BucketName>upvid</BucketName><RequestId>tx0000000000000000e716d-005f1f14cb-e478a-pl-war1</RequestId><HostId>e478a-pl-war1-pl</HostId></Error>
(actually nonexistent files give Access Denied in the same context)
same with other tools:
$ s3cmd get s3://upvid/255/38355/juz_nie_zyjesz_sezon_2___oficjalny_zwiastun___netflix_mp4 /tmp
download: 's3://upvid/255/38355/juz_nie_zyjesz_sezon_2___oficjalny_zwiastun___netflix_mp4' -> '/tmp/juz_nie_zyjesz_sezon_2___oficjalny_zwiastun___netflix_mp4' [1 of 1]
ERROR: S3 error: 404 (NoSuchKey)
Cluster health is OK.
Any ideas what is happening here?
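In case it helps to narrow things down, the next checks I plan to run are roughly these (the data pool name below assumes the default naming):

radosgw-admin object stat --bucket=upvid --object=255/38355/juz_nie_zyjesz_sezon_2___oficjalny_zwiastun___netflix_mp4
rados -p default.rgw.buckets.data ls | grep 255/38355   # are the head/tail RADOS objects still there?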
--
Mariusz Gronczewski, Administrator
Efigence S. A.
ul. Wołoska 9a, 02-583 Warszawa
T: [+48] 22 380 13 13
NOC: [+48] 22 380 10 20
E: admin(a)efigence.com
On 9/25/2020 6:07 PM, Saber(a)PlanetHoster.info wrote:
> Hi Igor,
>
> The only thing abnormal about this osdstore is that it was created by
> Mimic 13.2.8 and I can see that the OSDs size of this osdstore are not
> the same as the others in the cluster (while they should be exactly
> the same size).
>
> Can it be https://tracker.ceph.com/issues/39151 ?
Hmm, maybe... Did you change the H/W at some point for this OSD's node, as
happened in the ticket?
And it's still unclear to me if the issue is reproducible for you.
Could you please also run fsck (at first) and then repair for this OSD
and collect log(s).
Thanks,
Igor
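(For reference, the fsck and repair runs mentioned above would look roughly like this, with the OSD path adjusted as needed:)

CEPH_ARGS="--log-file fsck.log --debug-bluestore 20" ceph-bluestore-tool fsck --path /var/lib/ceph/osd/ceph-NN
CEPH_ARGS="--log-file repair.log --debug-bluestore 20" ceph-bluestore-tool repair --path /var/lib/ceph/osd/ceph-NN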
>
> Thanks!
> Saber
> CTO @PlanetHoster
>
>> On Sep 25, 2020, at 5:46 AM, Igor Fedotov <ifedotov(a)suse.de
>> <mailto:ifedotov@suse.de>> wrote:
>>
>> Hi Saber,
>>
>> I don't think this is related. New assertion happens along the write
>> path while the original one occurred on allocator shutdown.
>>
>>
>> Unfortunately there are not much information to troubleshoot this...
>> Are you able to reproduce the case?
>>
>>
>> Thanks,
>>
>> Igor
>>
>> On 9/25/2020 4:21 AM, Saber(a)PlanetHoster.info wrote:
>>> Hi Igor,
>>>
>>> We had an osd crash a week after running Nautilus. I have attached
>>> the logs, is it related to the same bug?
>>>
>>>
>>>
>>>
>>> Thanks,
>>> Saber
>>> CTO @PlanetHoster
>>>
>>>> On Sep 14, 2020, at 10:22 AM, Igor Fedotov <ifedotov(a)suse.de
>>>> <mailto:ifedotov@suse.de>> wrote:
>>>>
>>>> Thanks!
>>>>
>>>> Now got the root cause. The fix is on its way...
>>>>
>>>> Meanwhile you might want to try to workaround the issue via setting
>>>> "bluestore_hybrid_alloc_mem_cap" to 0 or using different allocator,
>>>> e.g. avl for bluestore_allocator (and optionally for
>>>> bluefs_allocator too).
>>>>
>>>>
>>>> Hope this helps,
>>>>
>>>> Igor.
>>>>
>>>>
>>>>
>>>> On 9/14/2020 5:02 PM, Jean-Philippe Méthot wrote:
>>>>> Alright, here’s the full log file.
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> Jean-Philippe Méthot
>>>>> Senior Openstack system administrator
>>>>> Administrateur système Openstack sénior
>>>>> PlanetHoster inc.
>>>>> 4414-4416 Louis B Mayer
>>>>> Laval, QC, H7P 0G1, Canada
>>>>> TEL : +1.514.802.1644 - Poste : 2644
>>>>> FAX : +1.514.612.0678
>>>>> CA/US : 1.855.774.4678
>>>>> FR : 01 76 60 41 43
>>>>> UK : 0808 189 0423
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>> Le 14 sept. 2020 à 06:49, Igor Fedotov <ifedotov(a)suse.de
>>>>>> <mailto:ifedotov@suse.de>> a écrit :
>>>>>>
>>>>>> Well, I can see duplicate admin socket command
>>>>>> registration/de-registration (and the second de-registration
>>>>>> asserts) but don't understand how this could happen.
>>>>>>
>>>>>> Would you share the full log, please?
>>>>>>
>>>>>>
>>>>>> Thanks,
>>>>>>
>>>>>> Igor
>>>>>>
>>>>>> On 9/11/2020 7:26 PM, Jean-Philippe Méthot wrote:
>>>>>>> Here’s the out file, as requested.
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> Jean-Philippe Méthot
>>>>>>> Senior Openstack system administrator
>>>>>>> Administrateur système Openstack sénior
>>>>>>> PlanetHoster inc.
>>>>>>> 4414-4416 Louis B Mayer
>>>>>>> Laval, QC, H7P 0G1, Canada
>>>>>>> TEL : +1.514.802.1644 - Poste : 2644
>>>>>>> FAX : +1.514.612.0678
>>>>>>> CA/US : 1.855.774.4678
>>>>>>> FR : 01 76 60 41 43
>>>>>>> UK : 0808 189 0423
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>> Le 11 sept. 2020 à 10:38, Igor Fedotov <ifedotov(a)suse.de
>>>>>>>> <mailto:ifedotov@suse.de>> a écrit :
>>>>>>>>
>>>>>>>> Could you please run:
>>>>>>>>
>>>>>>>> CEPH_ARGS="--log-file log --debug-asok 5" ceph-bluestore-tool
>>>>>>>> repair --path <...> ; cat log | grep asok > out
>>>>>>>>
>>>>>>>> and share 'out' file.
>>>>>>>>
>>>>>>>>
>>>>>>>> Thanks,
>>>>>>>>
>>>>>>>> Igor
>>>>>>>>
>>>>>>>> On 9/11/2020 5:15 PM, Jean-Philippe Méthot wrote:
>>>>>>>>> Hi,
>>>>>>>>>
>>>>>>>>> We’re upgrading our cluster OSD node per OSD node to Nautilus
>>>>>>>>> from Mimic. From some release notes, it was recommended to run
>>>>>>>>> the following command to fix stats after an upgrade :
>>>>>>>>>
>>>>>>>>> ceph-bluestore-tool repair --path /var/lib/ceph/osd/ceph-0
>>>>>>>>>
>>>>>>>>> However, running that command gives us the following error
>>>>>>>>> message:
>>>>>>>>>
>>>>>>>>>> /home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/gigantic/release/14.2.11/rpm/el7/BUILD/ceph-14.2.11/src/os/bluestore/Allocator.cc: In
>>>>>>>>>> function 'virtual Allocator::SocketHook::~SocketHook()'
>>>>>>>>>> thread 7f1a6467eec0 time 2020-09-10 14:40:25.872353
>>>>>>>>>> /home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/gigantic/release/14.2.11/rpm/el7/BUILD/ceph-14.2.11/src/os/bluestore/Allocator.cc: 53: FAILED ceph_assert(r == 0)
>>>>>>>>>> ceph version 14.2.11
>>>>>>>>>> (f7fdb2f52131f54b891a2ec99d8205561242cdaf) nautilus (stable)
>>>>>>>>>> 1: (ceph::__ceph_assert_fail(char const*, char const*, int,
>>>>>>>>>> char const*)+0x14a) [0x7f1a5a823025]
>>>>>>>>>> 2: (()+0x25c1ed) [0x7f1a5a8231ed]
>>>>>>>>>> 3: (()+0x3c7a4f) [0x55b33537ca4f]
>>>>>>>>>> 4: (HybridAllocator::~HybridAllocator()+0x17) [0x55b3353ac517]
>>>>>>>>>> 5: (BlueStore::_close_alloc()+0x42) [0x55b3351f2082]
>>>>>>>>>> 6: (BlueStore::_close_db_and_around(bool)+0x2f8)
>>>>>>>>>> [0x55b335274528]
>>>>>>>>>> 7: (BlueStore::_fsck(BlueStore::FSCKDepth, bool)+0x2c1)
>>>>>>>>>> [0x55b3352749a1]
>>>>>>>>>> 8: (main()+0x10b3) [0x55b335187493]
>>>>>>>>>> 9: (__libc_start_main()+0xf5) [0x7f1a574aa555]
>>>>>>>>>> 10: (()+0x1f9b5f) [0x55b3351aeb5f]
>>>>>>>>>> 2020-09-10 14:40:25.873 7f1a6467eec0 -1
>>>>>>>>>> /home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/gigantic/release/14.2.11/rpm/el7/BUILD/ceph-14.2.11/src/os/bluestore/Allocator.cc: In function 'virtual
>>>>>>>>>> Allocator::SocketHook::~SocketHook()' thread 7f1a6467eec0
>>>>>>>>>> time 2020-09-10 14:40:25.872353
>>>>>>>>>> /home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/gigantic/release/14.2.11/rpm/el7/BUILD/ceph-14.2.11/src/os/bluestore/Allocator.cc: 53: FAILED ceph_assert(r == 0)
>>>>>>>>>>
>>>>>>>>>> ceph version 14.2.11
>>>>>>>>>> (f7fdb2f52131f54b891a2ec99d8205561242cdaf) nautilus (stable)
>>>>>>>>>> 1: (ceph::__ceph_assert_fail(char const*, char const*, int,
>>>>>>>>>> char const*)+0x14a) [0x7f1a5a823025]
>>>>>>>>>> 2: (()+0x25c1ed) [0x7f1a5a8231ed]
>>>>>>>>>> 3: (()+0x3c7a4f) [0x55b33537ca4f]
>>>>>>>>>> 4: (HybridAllocator::~HybridAllocator()+0x17) [0x55b3353ac517]
>>>>>>>>>> 5: (BlueStore::_close_alloc()+0x42) [0x55b3351f2082]
>>>>>>>>>> 6: (BlueStore::_close_db_and_around(bool)+0x2f8)
>>>>>>>>>> [0x55b335274528]
>>>>>>>>>> 7: (BlueStore::_fsck(BlueStore::FSCKDepth, bool)+0x2c1)
>>>>>>>>>> [0x55b3352749a1]
>>>>>>>>>> 8: (main()+0x10b3) [0x55b335187493]
>>>>>>>>>> 9: (__libc_start_main()+0xf5) [0x7f1a574aa555]
>>>>>>>>>> 10: (()+0x1f9b5f) [0x55b3351aeb5f]
>>>>>>>>>> *** Caught signal (Aborted) **
>>>>>>>>>> in thread 7f1a6467eec0 thread_name:ceph-bluestore-
>>>>>>>>>> ceph version 14.2.11
>>>>>>>>>> (f7fdb2f52131f54b891a2ec99d8205561242cdaf) nautilus (stable)
>>>>>>>>>> 1: (()+0xf630) [0x7f1a58cf0630]
>>>>>>>>>> 2: (gsignal()+0x37) [0x7f1a574be387]
>>>>>>>>>> 3: (abort()+0x148) [0x7f1a574bfa78]
>>>>>>>>>> 4: (ceph::__ceph_assert_fail(char const*, char const*, int,
>>>>>>>>>> char const*)+0x199) [0x7f1a5a823074]
>>>>>>>>>> 5: (()+0x25c1ed) [0x7f1a5a8231ed]
>>>>>>>>>> 6: (()+0x3c7a4f) [0x55b33537ca4f]
>>>>>>>>>> 7: (HybridAllocator::~HybridAllocator()+0x17) [0x55b3353ac517]
>>>>>>>>>> 8: (BlueStore::_close_alloc()+0x42) [0x55b3351f2082]
>>>>>>>>>> 9: (BlueStore::_close_db_and_around(bool)+0x2f8)
>>>>>>>>>> [0x55b335274528]
>>>>>>>>>> 10: (BlueStore::_fsck(BlueStore::FSCKDepth, bool)+0x2c1)
>>>>>>>>>> [0x55b3352749a1]
>>>>>>>>>> 11: (main()+0x10b3) [0x55b335187493]
>>>>>>>>>> 12: (__libc_start_main()+0xf5) [0x7f1a574aa555]
>>>>>>>>>> 13: (()+0x1f9b5f) [0x55b3351aeb5f]
>>>>>>>>>> 2020-09-10 14:40:25.874 7f1a6467eec0 -1 *** Caught signal
>>>>>>>>>> (Aborted) **
>>>>>>>>>> in thread 7f1a6467eec0 thread_name:ceph-bluestore-
>>>>>>>>>>
>>>>>>>>>> ceph version 14.2.11
>>>>>>>>>> (f7fdb2f52131f54b891a2ec99d8205561242cdaf) nautilus (stable)
>>>>>>>>>> 1: (()+0xf630) [0x7f1a58cf0630]
>>>>>>>>>> 2: (gsignal()+0x37) [0x7f1a574be387]
>>>>>>>>>> 3: (abort()+0x148) [0x7f1a574bfa78]
>>>>>>>>>> 4: (ceph::__ceph_assert_fail(char const*, char const*, int,
>>>>>>>>>> char const*)+0x199) [0x7f1a5a823074]
>>>>>>>>>> 5: (()+0x25c1ed) [0x7f1a5a8231ed]
>>>>>>>>>> 6: (()+0x3c7a4f) [0x55b33537ca4f]
>>>>>>>>>> 7: (HybridAllocator::~HybridAllocator()+0x17) [0x55b3353ac517]
>>>>>>>>>> 8: (BlueStore::_close_alloc()+0x42) [0x55b3351f2082]
>>>>>>>>>> 9: (BlueStore::_close_db_and_around(bool)+0x2f8)
>>>>>>>>>> [0x55b335274528]
>>>>>>>>>> 10: (BlueStore::_fsck(BlueStore::FSCKDepth, bool)+0x2c1)
>>>>>>>>>> [0x55b3352749a1]
>>>>>>>>>> 11: (main()+0x10b3) [0x55b335187493]
>>>>>>>>>> 12: (__libc_start_main()+0xf5) [0x7f1a574aa555]
>>>>>>>>>> 13: (()+0x1f9b5f) [0x55b3351aeb5f]
>>>>>>>>>> NOTE: a copy of the executable, or `objdump -rdS
>>>>>>>>>> <executable>` is needed to interpret this.
>>>>>>>>>
>>>>>>>>> What could be the source of this error? I haven’t found much
>>>>>>>>> of anything about it online.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> Jean-Philippe Méthot
>>>>>>>>> Senior Openstack system administrator
>>>>>>>>> Administrateur système Openstack sénior
>>>>>>>>> PlanetHoster inc.
>>>>>>>>> 4414-4416 Louis B Mayer
>>>>>>>>> Laval, QC, H7P 0G1, Canada
>>>>>>>>> TEL : +1.514.802.1644 - Poste : 2644
>>>>>>>>> FAX : +1.514.612.0678
>>>>>>>>> CA/US : 1.855.774.4678
>>>>>>>>> FR : 01 76 60 41 43
>>>>>>>>> UK : 0808 189 0423
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> _______________________________________________
>>>>>>>>> ceph-users mailing list -- ceph-users(a)ceph.io
>>>>>>>>> <mailto:ceph-users@ceph.io>
>>>>>>>>> To unsubscribe send an email to ceph-users-leave(a)ceph.io
>>>>>>>>> <mailto:ceph-users-leave@ceph.io>
>>>>>>>
>>>>>
>>>
>
Hi guys,
When I updated the pg_num of a pool, I found it did not work (no
rebalancing happened). Does anyone know the reason? Pool info:
pool 21 'openstack-volumes-rs' replicated size 3 min_size 2 crush_rule
21 object_hash rjenkins pg_num 1024 pgp_num 512 pgp_num_target 1024
autoscale_mode warn last_change 85103 lfor 82044/82044/82044 flags
hashpspool,nodelete,selfmanaged_snaps stripe_width 0 application rbd
removed_snaps [1~1e6,1e8~300,4e9~18,502~3f,542~11,554~1a,56f~1d7]
pool 22 'openstack-vms-rs' replicated size 3 min_size 2 crush_rule 22
object_hash rjenkins pg_num 512 pgp_num 512 pg_num_target 256
pgp_num_target 256 autoscale_mode warn last_change 84769 lfor 0/0/55294
flags hashpspool,nodelete,selfmanaged_snaps stripe_width 0 application rbd
The pgp_num_target is set, but pgp_num has not changed.
I had scaled out new OSDs and backfilling was still in progress before
setting the value; could that be the reason?
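My current guess is that the gradual pgp_num increase is being throttled while the backfill from the new OSDs is still in progress; I plan to check it with something like:

ceph osd pool get openstack-volumes-rs pgp_num    # the currently effective pgp_num
ceph config get mgr target_max_misplaced_ratio    # throttle for gradual pg_num/pgp_num changes
ceph status | grep misplaced                      # how much data is currently misplaced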