I was delighted to see the native Debian 12 (bookworm) packages turn up
in Reef 18.2.1.
We currently run a number of Ceph clusters on Debian 11 (bullseye) /
Quincy 17.2.7. These are not cephadm-managed.
I have attempted to upgrade a test cluster, and it is not going well.
Since Quincy only supports bullseye and Reef only supports bookworm, we are
reinstalling from bare metal. However, I don't think either of the two
problems below is related to that.
Problem 1
--------------
A simple "apt install ceph" goes most of the way, then errors with
Setting up cephadm (18.2.1-1~bpo12+1) ...
usermod: unlocking the user's password would result in a passwordless
account.
You should set a password with usermod -p to unlock this user's password.
mkdir: cannot create directory ‘/home/cephadm/.ssh’: No such file or
directory
dpkg: error processing package cephadm (--configure):
installed cephadm package post-installation script subprocess returned
error exit status 1
dpkg: dependency problems prevent configuration of ceph-mgr-cephadm:
ceph-mgr-cephadm depends on cephadm; however:
Package cephadm is not configured yet.
dpkg: error processing package ceph-mgr-cephadm (--configure):
dependency problems - leaving unconfigured
The two cephadm-related packages are then left in an error state, which
apt tries to finish configuring each time it is run.
The cephadm user has a login directory of /nonexistent, however the
cephadm postinst (--configure) script is trying to use /home/cephadm (as
it was on Quincy/bullseye).
Since we aren't using cephadm anyway, we decided to keep going, as the
other packages were actually installed, and to deal with the package
state later.
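For the record, the two ways out of the broken package state we are
weighing up (neither verified yet; removing ceph-mgr-cephadm may also
want to take the ceph metapackage with it, so check what apt proposes
first):

# option 1: we don't use cephadm, so drop the failing packages
apt remove cephadm ceph-mgr-cephadm

# option 2 (assumption, untested): give the cephadm user the home
# directory the postinst script expects, then let dpkg finish configuring
mkdir -p /home/cephadm/.ssh
chown -R cephadm:cephadm /home/cephadm
usermod -d /home/cephadm cephadm
dpkg --configure cephadm ceph-mgr-cephadm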
Problem 2
---------------
I upgraded 2/3 monitor nodes without any other problems, and (for the
moment) removed the other Quincy monitor prior to rebuild.
I then shut down the remaining Quincy manager and attempted to start the
Reef manager. Although the manager is running, "ceph mgr services" shows
it is only providing the restful and not the dashboard service. The log
file has lots of the following error:
ImportError: PyO3 modules may only be initialized once per interpreter
process
and ceph -s reports "Module 'dashboard' has failed dependency: PyO3
modules may only be initialized once per interpreter process".
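As a stopgap rather than a fix, I assume the failed module can simply be
switched off so the health warning clears while we dig into the PyO3
issue (the standard module toggle, not yet tried here):

ceph mgr module disable dashboard
# and later, once there is a fix to test:
ceph mgr module enable dashboard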
Questions
---------------
1. Have the Reef/bookworm packages ever been tested in a non-cephadm
environment?
2. I want to revert this cluster back to a fully functional state. I
cannot bring back up the remaining Quincy monitor though ("require
release 18 > 17"). Would I have to go through the procedure of starting
over, and trying to rescue the monmap from the OSDs? (OSDs and an active
MDS are still up and running Quincy). I'm aware that process exists but
have never had to delve into it.
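(For my own notes, this is my rough reading of that documented procedure,
condensed to a single OSD host and completely untested on our side:)

mkdir /tmp/mon-store
for osd in /var/lib/ceph/osd/ceph-*; do
    ceph-objectstore-tool --data-path "$osd" --no-mon-config \
        --op update-mon-db --mon-store-path /tmp/mon-store
done
# then rebuild the mon store and seed a fresh monitor's data dir with it
ceph-monstore-tool /tmp/mon-store rebuild -- \
    --keyring /etc/ceph/ceph.client.admin.keyring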
Thanks, Chris
Hi
We have a cluster which currently looks like this:
services:
mon: 5 daemons, quorum lazy,jolly,happy,dopey,sleepy (age 13d)
mgr: jolly.tpgixt(active, since 25h), standbys: dopey.lxajvk, lazy.xuhetq
mds: 1/1 daemons up, 2 standby
osd: 449 osds: 425 up (since 15m), 425 in (since 5m); 5104 remapped pgs
data:
volumes: 1/1 healthy
pools: 13 pools, 11153 pgs
objects: 304.11M objects, 988 TiB
usage: 1.6 PiB used, 1.4 PiB / 2.9 PiB avail
pgs: 6/1617270006 objects degraded (0.000%)
366696947/1617270006 objects misplaced (22.674%)
6043 active+clean
5041 active+remapped+backfill_wait
66 active+remapped+backfilling
2 active+recovery_wait+degraded+remapped
1 active+recovering+degraded
It's currently rebalancing after adding a node, but this rebalance has
been rather slow -- right now it's running 66 backfills, but it seems to
stabilize around 8 backfills eventually. We figured that perhaps adding
another node might speed things up.
Immediately upon adding the node, we get slow ops and inactive PGs.
Removing the new node gets us back in working order.
It turns out that even adding 1 OSD breaks the cluster, and immediately
sends it here:
[WRN] PG_DEGRADED: Degraded data redundancy: 6/1617265712 objects degraded (0.000%), 3 pgs degraded
pg 37.c8 is active+recovery_wait+degraded+remapped, acting [410,163,236,209,7,283,155,143,78]
pg 37.1a1 is active+recovering+degraded, acting [234,424,163,74,22,128,177,153,181]
pg 37.1da is active+recovery_wait+degraded+remapped, acting [163,408,230,190,93,284,50,78,44]
[WRN] SLOW_OPS: 22 slow ops, oldest one blocked for 54 sec, daemons [osd.11,osd.110,osd.112,osd.117,osd.120,osd.123,osd.13,osd.136,osd.144,osd.157]... have slow ops.
The OSD added was number 431, so it does not appear to be the immediate
cause of the slow ops; however, removing 431 immediately clears the
problem.
We thought we might be experiencing 'CRUSH giving up too soon' symptoms
[1], as we have seen similar behaviour on another pool, but it does not
appear to be the case here. We went through the motions described on the
page and everything looked OK.
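(For completeness, this is roughly the check we ran; the rule id and the
max-x bound are placeholders here, and --num-rep 6 corresponds to the
4+2 pool:)

ceph osd getcrushmap -o crush.map
crushtool -i crush.map --test --show-bad-mappings \
    --rule 1 --num-rep 6 --min-x 1 --max-x 1048576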
At least one pool which stops working is a 4+2 EC pool, placed on
spinning rust, some 200-ish disks distributed across 13 nodes. I'm not
sure if other pools break, but that particular 4+2 EC pool is rather
important so I'm a little wary of experimenting blindly.
Any thoughts on where to look next?
Thanks,
Ruben Vestergaard
[1] https://docs.ceph.com/en/reef/rados/troubleshooting/troubleshooting-pg/#cru…
Hello,
So we were going to replace a Ceph cluster with some hardware we had lying around using SATA HBAs, but I was told that the only right way to build Ceph in 2023 is with direct-attach NVMe.
Does anyone have any recommendation for a 1U barebones server (we just drop in RAM, disks, and CPUs) with 8-10 2.5" NVMe bays that are direct-attached to the motherboard without a bridge or HBA, for Ceph specifically?
Thanks,
-Drew
Hi Jan,
I've just filed an upstream ticket for your case, see
https://tracker.ceph.com/issues/64053 for more details.
You might want to tune (or preferably just remove) your custom
bluestore_cache_.*_ratio settings to fix the issue.
This is reproducible and fixable in my lab this way.
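In case it helps, removing them centrally would look something like the
following (adjust to whichever of the bluestore_cache_.*_ratio options
you actually set; if they live in ceph.conf rather than the config
store, drop them there instead):

ceph config rm osd bluestore_cache_meta_ratio
ceph config rm osd bluestore_cache_kv_ratio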
Hope this helps.
Thanks,
Igor
On 15/01/2024 12:54, Jan Marek wrote:
> Hi Igor,
>
> I've tried to start the ceph-osd daemon as you advised me, and I'm
> sending the log osd.1.start.log
>
> About memory: according to 'top', the podman ceph daemon doesn't reach
> 2% of the whole server memory (64 GB)...
>
> I have switched on memory autotuning...
>
> My ceph config dump - see attached dump.txt
>
> Sincerely
> Jan Marek
>
> On Thu, Jan 11, 2024 at 04:02:02 CET, Igor Fedotov wrote:
>> Hi Jan,
>>
>> unfortunately this wasn't very helpful. Moreover, the log looks a bit messy -
>> it looks like a mixture of outputs from multiple running instances or
>> something. I'm not an expert in containerized setups, though.
>>
>> Could you please simplify things by running the ceph-osd process manually, as
>> you did for ceph-objectstore-tool, and force the log output to a file? The
>> command line should look something like the following:
>>
>> ceph-osd -i 0 --log-to-file --log-file <some-file> --debug-bluestore 5/20
>> --debug-prioritycache 10
>>
>> Please don't forget to run repair prior to that.
>>
>>
>> Also you haven't answered my questions about custom [memory] settings and
>> RAM usage during OSD startup. It would be nice to hear some feedback.
>>
>>
>> Thanks,
>>
>> Igor
>>
>> On 11/01/2024 16:47, Jan Marek wrote:
>>> Hi Igor,
>>>
>>> I've tried to start osd.1 with debug_prioritycache and
>>> debug_bluestore 5/20, see attached file...
>>>
>>> Sincerely
>>> Jan
>>>
>>> On Wed, Jan 10, 2024 at 01:03:07 CET, Igor Fedotov wrote:
>>>> Hi Jan,
>>>>
>>>> indeed this looks like some memory allocation problem - maybe the OSD's RAM
>>>> usage threshold was reached or something?
>>>>
>>>> Curious if you have any custom OSD settings or maybe any memory caps for
>>>> Ceph containers?
>>>>
>>>> Could you please set debug_bluestore to 5/20 and debug_prioritycache to 10
>>>> and try to start the OSD once again? Please monitor the process's RAM usage
>>>> during startup and share the resulting log.
>>>>
>>>>
>>>> Thanks,
>>>>
>>>> Igor
>>>>
>>>> On 10/01/2024 11:20, Jan Marek wrote:
Hi,
As I've read and thought a lot about this migration, since it is a bigger project, I was wondering if anyone has done it already and might share some notes or playbooks, because in everything I read there were some parts that were missing or hard for me to understand.
I do have some different approaches in mind, so maybe you have some suggestions or hints.
a) upgrade Nautilus on CentOS 7, with the few missing features like dashboard and Prometheus; after that, migrate one node after another to Ubuntu 20.04 with Octopus and then upgrade Ceph to the recent stable version.
b) migrate one node after another to Ubuntu 18.04 with Nautilus, then upgrade to Octopus and after that to Ubuntu 20.04.
or
c) upgrade one node after another to Ubuntu 20.04 with Octopus and join it to the cluster, until all nodes are upgraded.
As a test I tried c) with a mon node, but adding it to the cluster fails in some failed state, still probing for the other mons. (I don't have the right log at hand right now.)
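(Next time I reproduce it I will also capture the new mon's own view via the admin socket, something like the following, assuming the mon id matches the short hostname:)

ceph daemon mon.$(hostname -s) mon_status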
So my questions are:
a) What would be the best (most stable) migration path, and
b) is it in general possible to add a new Octopus mon (not an upgraded one) to a Nautilus cluster where the other mons are still on Nautilus?
I hope my thoughts and questions are understandable :)
Thanks for any hints and suggestions. Best, Götz
Hi,
after osd.15 died at the wrong moment, there is:
# ceph health detail
[WRN] PG_AVAILABILITY: Reduced data availability: 1 pg stale
pg 10.17 is stuck stale for 3d, current state
stale+active+undersized+degraded, last acting [15]
[WRN] PG_DEGRADED: Degraded data redundancy: 172/57063399 objects
degraded (0.000%), 1 pg degraded, 1 pg undersized
pg 10.17 is stuck undersized for 3d, current state
stale+active+undersized+degraded, last acting [15]
which will never resolve, as there is no osd.15 anymore.
So a
ceph pg 10.17 mark_unfound_lost delete
was executed.
Ceph seems to be a bit confused about pg 10.17 now.
While this worked before, it's not working anymore:
# ceph pg 10.17 query
Error ENOENT: i don't have pgid 10.17
And while this was pointing to 15 before, the map has now changed to 5 and 6
(which is correct):
# ceph pg map 10.17
osdmap e14425 pg 10.17 (10.17) -> up [5,6] acting [5,6]
According to ceph health, ceph assumes that osd.15 is still somehow in
charge.
The pg map seems to think that 10.17 is on osd.5 and osd.6
But pg 10.17 seems not to be really existing, as a query will fail.
Any idea what's going wrong and how to fix this?
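My current guess, please correct me, is that once we accept the data in
that PG is gone, the remaining step would be to recreate it empty,
something like:

ceph osd force-create-pg 10.17 --yes-i-really-mean-it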
Thank you!
--
Mit freundlichen Gruessen / Best regards
Oliver Dzombic
Layer7 Networks
mailto:info@layer7.net
Address:
Layer7 Networks GmbH
Zum Sonnenberg 1-3
63571 Gelnhausen
HRB 96293, district court (Amtsgericht) Hanau
Managing director: Oliver Dzombic
VAT ID: DE259845632
Happy new year everybody.
I just found out that the orchestrator in one of our clusters is not doing
anything.
What I tried until now:
- disabling / enabling cephadm (no impact)
- restarting hosts (no impact)
- starting upgrade to same version (no impact)
- starting downgrade (no impact)
- forcefully removing hosts and adding them again (now I have no daemons
anymore)
- applying new configurations (no impact)
The orchestrator just does nothing.
Cluster itself is fine.
I also checked the SSH connectivity from all hosts to all hosts (
https://docs.ceph.com/en/quincy/cephadm/troubleshooting/#ssh-errors)
The logs always show a message like "took the task" but then nothing
happens.
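For reference, the checks still on my list, straight from the cephadm
troubleshooting docs (nothing exotic):

ceph orch status
ceph log last cephadm    # recent cephadm module events in the cluster log
ceph mgr fail            # fail over the active mgr to restart the module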
Cheers
Boris
Hi ceph community
I noticed the following problem after upgrading my ceph instance on Debian
12.4 from 17.2.7 to 18.2.1:
I had placed the bluestore block.db for the HDD OSDs on raid1/mirrored logical
volumes on 2 NVMe devices, so that if a single block.db NVMe device fails,
not all HDD OSDs fail.
That worked fine under 17.2.7 and had no problems during host/osd restarts.
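(For context, the mirrored DB LVs were created along these lines; sizes are
simplified, the data device is a placeholder, and the exact commands are
from memory:)

lvcreate --type raid1 -m 1 -L 44G -n ceph-db-osd1 optane /dev/sdg /dev/sdj
ceph-volume lvm create --data /dev/sdX --block.db optane/ceph-db-osd1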
During the upgrade to 18.2.1, the OSDs with the block.db on a mirrored LV
wouldn't start anymore, because the block.db symlink was updated to point to
the wrong device mapper device, and the OSD startup failed with an error
message that the block.db device is busy.
OSD1:
2024-01-05T19:56:43.592+0000 7fdde9f43640 -1
bluestore(/var/lib/ceph/osd/ceph-1) _minimal_open_bluefs add block
device(/var/lib/ceph/osd/ceph-1/block.db) returned: (16) Device or resource
busy
2024-01-05T19:56:43.592+0000 7fdde9f43640 -1
bluestore(/var/lib/ceph/osd/ceph-1) _open_db failed to prepare db
environment:
2024-01-05T19:56:43.592+0000 7fdde9f43640 1 bdev(0x55a2d5014000
/var/lib/ceph/osd/ceph-1/block) close
2024-01-05T19:56:43.892+0000 7fdde9f43640 -1 osd.1 0 OSD:init: unable to
mount object store
the symlink was updated to point to
lrwxrwxrwx 1 ceph ceph 111 Jan 5 20:57 block ->
/dev/mapper/ceph--dec5bd7c--d84f--40d9--ba14--6bd8aadf2957-osd--block--cdd02721--6876--4db8--bdb2--12ac6c70127c
lrwxrwxrwx 1 ceph ceph 48 Jan 5 20:57 block.db ->
/dev/mapper/optane-ceph--db--osd1_rimage_1_iorig
the correct symlink would have been:
lrwxrwxrwx 1 ceph ceph 111 Jan 5 20:57 block ->
/dev/mapper/ceph--dec5bd7c--d84f--40d9--ba14--6bd8aadf2957-osd--block--cdd02721--6876--4db8--bdb2--12ac6c70127c
lrwxrwxrwx 1 ceph ceph 48 Jan 5 20:57 block.db ->
/dev/mapper/optane-ceph--db--osd1
To continue with the upgrade I converted, one by one, all the block.db LVM
logical volumes back to linear volumes and fixed the symlinks manually.
Converting the LVs back to linear was necessary because, even when I fixed
the symlink manually, the symlink would be created wrongly again after an
OSD restart if the block.db pointed to a raid1 LV.
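Per OSD that looked roughly like this (osd.1 as the example; the integrity
layer may need to be removed first, and whether a plain systemctl restart
is right depends on how the OSD was deployed):

# remove the integrity layer and the second raid leg, back to a linear LV
lvconvert --raidintegrity n optane/ceph-db-osd1
lvconvert -m 0 optane/ceph-db-osd1
# point the symlink back at the top-level LV instead of the rimage sub-LV
ln -sfn /dev/mapper/optane-ceph--db--osd1 /var/lib/ceph/osd/ceph-1/block.db
chown -h ceph:ceph /var/lib/ceph/osd/ceph-1/block.db
systemctl restart ceph-osd@1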
Here's an example of how the symlink looked before an OSD was touched by the
18.2.1 upgrade:
OSD2:
lrwxrwxrwx 1 ceph ceph 93 Jan 4 03:38 block ->
/dev/ceph-17a894d6-3a64-4e5e-9fa0-8dd3b5f4bf33/osd-block-3cd7a5af-9002-47a7-b4c2-540381d53be7
lrwxrwxrwx 1 ceph ceph 24 Jan 4 03:38 block.db ->
/dev/optane/ceph-db-osd2
Here's what the output of lvs -a -o +devices looked like for the OSD1 block.db
device when it was a raid1 LV:
LV                            VG     Attr       LSize   Pool Origin                        Data% Meta% Move Log Cpy%Sync Convert Devices
ceph-db-osd1                  optane rwi-a-r---  44.00g                                                        100.00           ceph-db-osd1_rimage_0(0),ceph-db-osd1_rimage_1(0)
[ceph-db-osd1_rimage_0]       optane gwi-aor---  44.00g      [ceph-db-osd1_rimage_0_iorig]                     100.00           ceph-db-osd1_rimage_0_iorig(0)
[ceph-db-osd1_rimage_0_imeta] optane ewi-ao---- 428.00m                                                                         /dev/sdg(55482)
[ceph-db-osd1_rimage_0_imeta] optane ewi-ao---- 428.00m                                                                         /dev/sdg(84566)
[ceph-db-osd1_rimage_0_iorig] optane -wi-ao----  44.00g                                                                         /dev/sdg(9216)
[ceph-db-osd1_rimage_0_iorig] optane -wi-ao----  44.00g                                                                         /dev/sdg(82518)
[ceph-db-osd1_rimage_1]       optane gwi-aor---  44.00g      [ceph-db-osd1_rimage_1_iorig]                     100.00           ceph-db-osd1_rimage_1_iorig(0)
[ceph-db-osd1_rimage_1_imeta] optane ewi-ao---- 428.00m                                                                         /dev/sdj(55392)
[ceph-db-osd1_rimage_1_imeta] optane ewi-ao---- 428.00m                                                                         /dev/sdj(75457)
[ceph-db-osd1_rimage_1_iorig] optane -wi-ao----  44.00g                                                                         /dev/sdj(9218)
[ceph-db-osd1_rimage_1_iorig] optane -wi-ao----  44.00g                                                                         /dev/sdj(73409)
[ceph-db-osd1_rmeta_0]        optane ewi-aor---   4.00m                                                                         /dev/sdg(55388)
[ceph-db-osd1_rmeta_1]        optane ewi-aor---   4.00m                                                                         /dev/sdj(9217)
It would be good if the symlinks were recreated to point to the correct
device even when the block.db is on a raid1 LV.
I'm not sure if this problem has been reported yet.
Cheers
Reto