Hi everyone,
I'm looking for a proposal for this month's Tech Talk on the 25th at
17:00 UTC. If you have something you want to share with the Ceph
community, consider sending me your proposal:
https://ceph.io/ceph-tech-talks/
--
Mike Perez
He/Him
Ceph Community Manager
Red Hat Los Angeles <https://www.redhat.com>
thingee@redhat.com
M: 1-951-572-2633  IM: IRC Freenode/OFTC: thingee
494C 5D25 2968 D361 65FB 3829 94BC D781 ADA8 8AEA
@Thingee <https://twitter.com/thingee>
Hello,
What is the current status of using multiple cephfs filesystems?
In Octopus I get lots of warnings that this feature is still not fully tested, but the latest entry regarding multiple cephfs in the mailing list is from about 2018.
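(For context, the warning I mean is the experimental-feature notice around enabling the flag, i.e. something like:
# ceph fs flag set enable_multiple true --yes-i-really-mean-it
before creating the second filesystem with ceph fs new.)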
Is someone using multiple cephfs in production?
Thanks in advance,
Simon
Hello helpful mailing list folks! After a networking outage, I had an MDS rank failure (originally 3 MDS ranks) that has left my CephFS cluster in bad shape. I worked through most of the Disaster Recovery guide (https://docs.ceph.com/docs/nautilus/cephfs/disaster-recovery-experts/#disas…) and got my CephFS remounted and (mostly) available. I have additionally completed the lengthy extents and inodes scan. For the most part, things are working fine, but for now I have reduced my max MDS down to 1.
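For context, the guide steps I worked through were roughly the following (paraphrasing, with cephfs-katz as the filesystem; I may not have run every one of these, and the exact invocations may have differed slightly):
# cephfs-journal-tool --rank=cephfs-katz:0 event recover_dentries summary
# cephfs-journal-tool --rank=cephfs-katz:0 journal reset
# cephfs-table-tool all reset session
# cephfs-data-scan scan_extents <data pool>
# cephfs-data-scan scan_inodes <data pool>
# cephfs-data-scan scan_links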
However, it looks like I have an MDS dir_frag issue and damaged metadata on a specific directory when it is accessed. Here are the relevant commands and outputs:
# ceph version
ceph version 14.2.9 (581f22da52345dba46ee232b73b990f06029a2a0) nautilus (stable)
# ceph -s
  cluster:
    id:     <redacted>
    health: HEALTH_ERR
            1 MDSs report damaged metadata

  services:
    mon: 3 daemons, quorum katz-c1,katz-c2,katz-c3 (age 6d)
    mgr: katz-c2(active, since 10d), standbys: katz-c1, katz-c3
    mds: cephfs-katz:1 {0=katz-mds-3=up:active} 5 up:standby
    osd: 9 osds: 9 up (since 6d), 9 in (since 6d)

  data:
    pools:   7 pools, 312 pgs
    objects: 19.82M objects, 1.5 TiB
    usage:   6.6 TiB used, 11 TiB / 17 TiB avail
    pgs:     311 active+clean
             1   active+clean+scrubbing+deep
# ceph health detail
HEALTH_ERR 1 MDSs report damaged metadata
MDS_DAMAGE 1 MDSs report damaged metadata
mdskatz-mds-3(mds.0): Metadata damage detected
# ceph tell mds.0 damage ls
[
    {
        "damage_type": "dir_frag",
        "id": 575440387,
        "ino": 1099550285476,
        "frag": "*",
        "path": "/bad/TSAL/conf8N5LVl"
    }
]
# ceph tell mds.0 scrub start /bad recursive repair
{
"return_code": 0,
"scrub_tag": "887fa41d-4643-4b2d-bb7d-8f96c02c2b4d",
"mode": "asynchronous"
}
After a few seconds,
# ceph tell mds.0 scrub status
{
"status": "no active scrubs running",
"scrubs": {}
}
The scrub does not appear to do anything to fix the issue. I have isolated the directory in my file system (/bad) and do not need the contents of the directory anymore (backups, woo!); however, a typical "rm -rf" on the directory fails.
The next steps for recovery are where I am struggling. There have been other emails to this list about this topic ( https://www.spinics.net/lists/ceph-users/msg53211.html ), but the commands referenced are a bit foreign to me, and I was wondering if you all could provide some additional insight into the exact commands needed. From what I can gather from the previous thread, I need to:
1. Get the inode for the parent directory ( /bad/TSAL )
# cd /mnt/cephfs/bad
# stat TSAL
File: TSAL
Size: 3 Blocks: 0 IO Block: 65536 directory
Device: 2eh/46d Inode: 1099550201759 Links: 4
Access: (0700/drwx------) Uid: ( 0/ root) Gid: ( 0/ root)
Access: 2020-06-01 15:50:41.000000000 -0400
Modify: 2020-06-11 15:21:42.079970103 -0400
Change: 2020-06-11 15:21:42.079970103 -0400
Birth: -
(So in this case, /bad/TSAL inode: 1099550201759)
2. Check if omap key '1_head' exists in object <inode of directory in hex>.00000000. If it exists, remove it.
This is where I am clueless on how to continue. How do I check if the omap key '1_head' exists, and if so, remove it? What commands am I working with here? (Inode decimal to hex: 1099550201759 -> 100024C979F)
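From what I can tell from the rados man page, I would be working with something like the following, assuming the CephFS metadata pool here is named cephfs_metadata (that pool name is a guess on my part) - is this right?
# printf '%x\n' 1099550201759
100024c979f
# rados -p cephfs_metadata listomapkeys 100024c979f.00000000
(and, if '1_head' shows up in that list)
# rados -p cephfs_metadata rmomapkey 100024c979f.00000000 1_head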
Thank you much!
Chris Wieringa
Hi,
I would like to enable the pg_autoscaler on our Nautilus cluster.
Someone told me that I should be really, really careful NOT to have
customer impact.
Maybe someone can share some experience on this?
The cluster has 455 OSDs on 19 hosts with ~17000 PGs and ~1 PB of
raw storage, of which ~600 TB raw is used.
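My tentative plan (just a sketch of what I think is the cautious approach; please correct me if this is wrong) is to first check what the autoscaler would do and then enable it per pool in warn mode only:
# ceph mgr module enable pg_autoscaler
# ceph osd pool autoscale-status
# ceph osd pool set <pool> pg_autoscale_mode warn
Only once I am happy with the suggested PG numbers would I switch a pool to pg_autoscale_mode on. Does that sound reasonable?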
Hi all,
I have a question regarding the following rule steps (opcodes) in the Ceph CRUSH map:
enum crush_opcodes {
    /*! do nothing */
    CRUSH_RULE_NOOP = 0,
    CRUSH_RULE_TAKE = 1,                    /* arg1 = value to start with */
    CRUSH_RULE_CHOOSE_FIRSTN = 2,           /* arg1 = num items to pick */
                                            /* arg2 = type */
    CRUSH_RULE_CHOOSE_INDEP = 3,            /* same */
    CRUSH_RULE_EMIT = 4,                    /* no args */
    CRUSH_RULE_CHOOSELEAF_FIRSTN = 6,
    CRUSH_RULE_CHOOSELEAF_INDEP = 7,
    CRUSH_RULE_SET_CHOOSE_TRIES = 8,        /* override choose_total_tries */
    CRUSH_RULE_SET_CHOOSELEAF_TRIES = 9,    /* override chooseleaf_descend_once */
    CRUSH_RULE_SET_CHOOSE_LOCAL_TRIES = 10,
    CRUSH_RULE_SET_CHOOSE_LOCAL_FALLBACK_TRIES = 11,
    CRUSH_RULE_SET_CHOOSELEAF_VARY_R = 12,
    CRUSH_RULE_SET_CHOOSELEAF_STABLE = 13
};
Can we skip specific steps? Or, let's say, what is the minimum set of
steps required by a CRUSH rule?
My understanding is that it all depends on the map hierarchy: given a
particular hierarchy, can we skip certain steps?
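For example, my understanding of a typical minimal rule (just a sketch of what a default replicated rule looks like in a decompiled CRUSH map, not taken from my cluster) is that it only uses take, chooseleaf and emit:
rule replicated_rule {
    id 0
    type replicated
    min_size 1
    max_size 10
    step take default
    step chooseleaf firstn 0 type host
    step emit
}
So I am wondering whether take/choose(leaf)/emit are effectively the mandatory steps, and the set_* opcodes are just optional tunables.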
Hi Reed,
thanks for the log.
Nothing much of interest there though. Just a regular SST file that
RocksDB chose to put on the "slow" device. Presumably it belongs to a
higher level, hence the decision to put it that "far" down. Or (which is less
likely) RocksDB lacked free space when doing a compaction at some point
and spilled some data out. So I was wrong - the ceph-kvstore-tool stats
command output might be helpful after all...
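E.g. something along these lines, with the OSD process stopped and the OSD id adjusted (just a sketch, not a verified recipe):
# systemctl stop ceph-osd@36
# ceph-kvstore-tool bluestore-kv /var/lib/ceph/osd/ceph-36 stats
# systemctl start ceph-osd@36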
Thanks,
Igor
On 6/11/2020 5:14 PM, Reed Dier wrote:
> Apologies for the delay Igor,
>
> Hopefully you are still interested in taking a look.
>
> Attached is the bluestore bluefs-log-dump output.
> I gzipped it as the log was very large.
> Let me know if there is anything else I can do to help track this down.
>
> Thanks,
>
> Reed
>
>
>
>> On Jun 8, 2020, at 8:04 AM, Igor Fedotov <ifedotov@suse.de> wrote:
>>
>> Reed,
>>
>> No, "ceph-kvstore-tool stats" isn't be of any interest.
>>
>> For the sake of better understanding the issue, it might be interesting to
>> have the bluefs log dump obtained via ceph-bluestore-tool's
>> bluefs-log-dump command. This will give some insight into which RocksDB
>> files are spilled over. It's still not clear what the root cause of
>> the issue is. It's not that frequent or dangerous though, so there's no
>> active investigation into it...
>>
>> Wondering if migration has helped though?
>>
>>
>> Thanks,
>>
>> Igor
>>
>> On 6/6/2020 8:00 AM, Reed Dier wrote:
>>> The WAL/DB was part of the OSD deployment.
>>>
>>> OSD is running 14.2.9.
>>>
>>> Would grabbing the ceph-kvstore-tool bluestore-kv <path-to-osd>
>>> stats output, as in that ticket, be of any use here?
>>>
>>> Thanks,
>>>
>>> Reed
>>>
>>>> On Jun 5, 2020, at 5:27 PM, Igor Fedotov <ifedotov@suse.de> wrote:
>>>>
>>>> This might help -see comment #4 at
>>>> https://tracker.ceph.com/issues/44509
>>>>
>>>>
>>>> And just for the sake of information collection - what Ceph version
>>>> is used in this cluster?
>>>>
>>>> Did you setup DB volume along with OSD deployment or they were
>>>> added later as was done in the ticket above?
>>>>
>>>>
>>>> Thanks,
>>>>
>>>> Igor
>>>>
>>>> On 6/6/2020 1:07 AM, Reed Dier wrote:
>>>>> I'm going to piggy back on this somewhat.
>>>>>
>>>>> I've battled RocksDB spillovers over the course of the life of the
>>>>> cluster since moving to bluestore; however, I have always been able
>>>>> to compact them away well enough.
>>>>>
>>>>> But now I am stumped at getting this to compact via $ceph tell
>>>>> osd.$osd compact, which has always worked in the past.
>>>>>
>>>>> No matter how many times I compact it, I always spill over exactly
>>>>> 192KiB.
>>>>>> BLUEFS_SPILLOVER BlueFS spillover detected on 1 OSD(s)
>>>>>> osd.36 spilled over 192 KiB metadata from 'db' device (26
>>>>>> GiB used of 34 GiB) to slow device
>>>>>> osd.36 spilled over 192 KiB metadata from 'db' device (16
>>>>>> GiB used of 34 GiB) to slow device
>>>>>> osd.36 spilled over 192 KiB metadata from 'db' device (22
>>>>>> GiB used of 34 GiB) to slow device
>>>>>> osd.36 spilled over 192 KiB metadata from 'db' device (13
>>>>>> GiB used of 34 GiB) to slow device
>>>>>
>>>>> The multiple entries are from different times trying to compact it.
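>>>>> If it helps, I can also pull the db/slow usage straight from the OSD
>>>>> (assuming the bluefs perf counters are the relevant ones to watch) via
>>>>> $ceph daemon osd.36 perf dump bluefs | grep -E '"db_|"slow_'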
>>>>>
>>>>> The OSD is a 1.92TB SATA SSD, the WAL/DB is a 36GB partition on NVMe.
>>>>> I tailed and tee'd the OSD's logs during a manual compaction here:
>>>>> https://pastebin.com/bcpcRGEe
>>>>> This is with the normal logging level.
>>>>> I have no idea how to make heads or tails of that log data, but
>>>>> maybe someone can figure out why this one OSD just refuses to compact?
>>>>>
>>>>> OSD is 14.2.9.
>>>>> OS is U18.04.
>>>>> Kernel is 4.15.0-96.
>>>>>
>>>>> I haven't played with ceph-bluestore-tool or ceph-kvstore-tool but
>>>>> after seeing the above mention in this thread, I do see
>>>>> ceph-kvstore-tool <rocksdb|bluestore-kv?> compact, which sounds
>>>>> like it may be the same thing that ceph tell compact does under
>>>>> the hood?
>>>>>> compact
>>>>>> Subcommand compact is used to compact all data of kvstore. It
>>>>>> will open the database, and trigger a database's compaction.
>>>>>> After compaction, some disk space may be released.
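>>>>> If so, I'm guessing the offline equivalent for this OSD would be something
>>>>> like $ceph-kvstore-tool bluestore-kv /var/lib/ceph/osd/ceph-36 compact
>>>>> (with the OSD stopped first), though I haven't tried that yet.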
>>>>>
>>>>> Also, not sure if this is helpful:
>>>>>> osd.36 spilled over 192 KiB metadata from 'db' device (13 GiB
>>>>>> used of 34 GiB) to slow device
>>>>>> ID CLASS WEIGHT REWEIGHT SIZE RAW USE DATA OMAP META
>>>>>> AVAIL %USE VAR PGS STATUS TYPE NAME
>>>>>> 36 ssd 1.77879 1.00000 1.8 TiB 1.2 TiB 1.2 TiB 6.2 GiB
>>>>>> 7.2 GiB 603 GiB 66.88 0.94 85 up osd.36
>>>>> You can see the breakdown between OMAP data and META data.
>>>>>
>>>>> After compacting again:
>>>>>> osd.36 spilled over 192 KiB metadata from 'db' device (26 GiB
>>>>>> used of 34 GiB) to slow device
>>>>>> ID CLASS WEIGHT REWEIGHT SIZE RAW USE DATA OMAP META
>>>>>> AVAIL %USE VAR PGS STATUS TYPE NAME
>>>>>> 36 ssd 1.77879 1.00000 1.8 TiB 1.2 TiB 1.2 TiB 6.2 GiB 20
>>>>>> GiB 603 GiB 66.88 0.94 85 up osd.36
>>>>>
>>>>> So the OMAP size remained the same, while the metadata ballooned
>>>>> (while still conspicuously spilling over exactly 192 KiB).
>>>>> These OSDs have a few RBD images, cephfs metadata, and librados
>>>>> objects (not RGW) stored.
>>>>>
>>>>> The breakdown of OMAP size is pretty widely binned, but the GiB
>>>>> sizes are definitely the minority.
>>>>> Looking at the breakdown with some simple bash-fu
>>>>> KiB = 147
>>>>> MiB = 105
>>>>> GiB = 24
>>>>>
>>>>> To further divide that, all of the GiB sized OMAPs are SSD OSD's:
>>>>>
>>>>>
>>>>>           SSD    HDD    TOTAL
>>>>> KiB         0    147      147
>>>>> MiB        36     69      105
>>>>> GiB        24      0       24
>>>>>
>>>>>
>>>>> I have no idea if any of these data points are pertinent or
>>>>> helpful, but I want to give as clear a picture as possible to
>>>>> prevent chasing the wrong thread.
>>>>> Appreciate any help with this.
>>>>>
>>>>> Thanks,
>>>>> Reed
>>>>>
>>>>>> On May 26, 2020, at 9:48 AM, thoralf schulze <t.schulze@tu-berlin.de> wrote:
>>>>>>
>>>>>> hi there,
>>>>>>
>>>>>> trying to get my head around rocksdb spillovers and how to deal with
>>>>>> them … in particular, i have one osd which does not have any pools
>>>>>> associated (as per ceph pg ls-by-osd $osd), yet it does show up in
>>>>>> ceph health detail as:
>>>>>>
>>>>>> osd.$osd spilled over 2.9 MiB metadata from 'db' device (49 MiB
>>>>>> used of 37 GiB) to slow device
>>>>>>
>>>>>> compaction doesn't help. i am well aware of
>>>>>> https://tracker.ceph.com/issues/38745 , yet i find it really
>>>>>> counter-intuitive that an empty osd with a more-or-less optimally
>>>>>> sized db volume can't fit its rocksdb on the former.
>>>>>>
>>>>>> is there any way to repair this, apart from re-creating the osd?
>>>>>> fwiw,
>>>>>> dumping the database with
>>>>>>
>>>>>> ceph-kvstore-tool bluestore-kv /var/lib/ceph/osd/ceph-$osd dump >
>>>>>> bluestore_kv.dump
>>>>>>
>>>>>> yields a file of less than 100mb in size.
>>>>>>
>>>>>> and, while we're at it, a few more related questions:
>>>>>>
>>>>>> - am i right to assume that the leveldb and rocksdb arguments to
>>>>>> ceph-kvstore-tool are only relevant for osds with filestore-backend?
>>>>>> - does ceph-kvstore-tool bluestore-kv … also deal with
>>>>>> rocksdb-items for
>>>>>> osds with bluestore-backend?
>>>>>>
>>>>>> thank you very much & with kind regards,
>>>>>> thoralf.
>>>>>>
>>>>>
>>>>>
>>>
>
Hi Raymond,
I'm pinging this old thread because we hit the same issue last week.
Is it possible that when you upgraded to nautilus you ran `ceph osd
require-osd-release nautilus` but did not run `ceph mon enable-msgr2`
?
We were in that state (intentionally), and started getting the `unable
to obtain rotating service keys` after around half the osds were
restarted with require_osd_release=nautilus.
Those restarted osds bind to the v2 port, and they seemingly get
confused about how to communicate with the mons.
As soon as we did `ceph mon enable-msgr2` to enable v2 on the mons the
osds could boot without issue.
I guess this is a heads up not to skip any step of the nautilus
upgrade, even though the docs make `ceph mon enable-msgr2` look
optional.
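In other words, don't consider the upgrade done until something like the
following has been run and every mon reports a v2 address (sketching from
memory, so double-check against the upgrade notes):
  ceph osd require-osd-release nautilus
  ceph mon enable-msgr2
  ceph mon dump | grep mon.   # each mon should now list a v2: address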
Cheers, Dan
-- Dan
On Tue, Jan 28, 2020 at 8:12 PM Raymond Clotfelter <ray@ksu.edu> wrote:
>
> I have a server with 12 OSDs on it. Five of them are unable to start, and give the following error message in their logs:
>
> 2020-01-28 13:00:41.760 7f61fb490c80 0 monclient: wait_auth_rotating timed out after 30
> 2020-01-28 13:00:41.760 7f61fb490c80 -1 osd.178 411005 unable to obtain rotating service keys; retrying
>
> These OSDs were up and running until they just died on me. I tried to restart them and they failed to come up. I rebooted the node and they did not recover. All 5 died within a few hours and were all down by the time I started poking at them. I previously had this happen with 2 other OSDs, one each on 2 servers, each with 12 OSDs. I ended up just purging and recreating those OSDs. I would really like to find a solution to this problem that does not involve purging the OSDs.
>
> I have tried stopping and starting all monitors and managers, one at a time, and all at the same time. Additionally, all servers in the cluster have been restarted over the past couple of days for various other reasons.
>
> I am on Ceph 14.2.6, Debian buster, and am using the Debian packages. All of my servers are kept in time sync via ntp, and I have verified multiple times that everything remains in sync.
>
> I have googled the error message and tried all of the solutions offered from that, but nothing makes any difference.
>
> I would appreciate any constructive advice.
>
> Thanks.
>
> -- ray
>
I am having a problem on my cluster where OSDs on one host are down after a reboot. When I run ceph-disk activate-all, I get an error message stating "No cluster conf found in /etc/ceph with fsid e1d7b4ae-2dcd-40ee-bea5-d103fe1fa9c9".
When I look at the /etc/ceph/ceph.conf file, I can see that the fsid does not match the one in the error above. Is this what is keeping my OSDs from being able to come back up?
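For reference, the two values I am comparing come from:
# ceph fsid
# grep fsid /etc/ceph/ceph.conf
(the first being what the cluster reports, the second being what is in the local conf on that host).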