Hi,
I am about to resize the OSDs in a Ceph cluster to extend overall cluster capacity by adding 40 GB to each disk. I noticed that after the disk resize and an OSD restart, RAW USE grows proportionally to the new size (e.g. by 20 GB) while DATA remains the same, which makes the new space not readily available. Here is the OSD output of the cluster:
kubectl -n rook-ceph exec -it deploy/rook-ceph-tools -- ceph osd df
ID CLASS WEIGHT REWEIGHT SIZE RAW USE DATA OMAP META AVAIL %USE VAR PGS STATUS
1 hdd 0.09769 1.00000 100 GiB 83 GiB 82 GiB 164 MiB 891 MiB 17 GiB 82.79 1.00 77 up
3 hdd 0.09769 1.00000 100 GiB 83 GiB 82 GiB 355 MiB 772 MiB 17 GiB 82.74 1.00 84 up
2 hdd 0.09769 1.00000 100 GiB 84 GiB 82 GiB 337 MiB 1.3 GiB 16 GiB 83.88 1.01 82 up
4 hdd 0.09769 1.00000 140 GiB 125 GiB 84 GiB 148 MiB 919 MiB 15 GiB 89.24 1.07 80 up
6 hdd 0.09769 1.00000 140 GiB 106 GiB 104 GiB 333 MiB 1015 MiB 34 GiB 75.47 0.91 107 up
7 hdd 0.09769 1.00000 140 GiB 118 GiB 97 GiB 351 MiB 1.2 GiB 22 GiB 84.48 1.02 101 up
TOTAL 720 GiB 598 GiB 531 GiB 1.6 GiB 6.1 GiB 122 GiB 83.10
MIN/MAX VAR: 0.91/1.07 STDDEV: 4.06
The OSDs I managed to extend so far are 7, 6 and 4. Only OSD 6 picked up the new size cleanly and did not inflate RAW USE; OSDs 7 and 4 show a gap between RAW USE and DATA.
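For reference, a hedged sketch of how to check what device size BlueStore registered for each OSD and, with the OSD stopped and from inside its container/pod, ask BlueFS to grow into the new space; the data path assumes the default layout and the OSD id is just an example:
kubectl -n rook-ceph exec -it deploy/rook-ceph-tools -- ceph osd metadata 4 | grep -i bdev
ceph-bluestore-tool bluefs-bdev-expand --path /var/lib/ceph/osd/ceph-4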
kubectl -n rook-ceph exec -it deploy/rook-ceph-tools -- ceph df
--- RAW STORAGE ---
CLASS SIZE AVAIL USED RAW USED %RAW USED
hdd 720 GiB 122 GiB 598 GiB 598 GiB 83.10
TOTAL 720 GiB 122 GiB 598 GiB 598 GiB 83.10
--- POOLS ---
POOL ID PGS STORED OBJECTS USED %USED MAX AVAIL
.mgr 1 1 449 KiB 2 1.3 MiB 0 16 GiB
cephfs-metadata 2 16 832 MiB 245.62k 2.4 GiB 4.80 16 GiB
cephfs-replicated 3 128 176 GiB 545.23k 530 GiB 91.63 16 GiB
replicapool 4 32 19 B 2 12 KiB 0 16 GiB
This reports nearly 600 GB used, while it should be more like 530 GB, which is what the cephfs-replicated pool reports as its data usage.
Any ideas why this is happening? Should I continue extending all OSDs to 140 GB to see if that makes a difference?
Br,
merp.
Hello guys,
This situation is driving me crazy. I have tried to deploy a Ceph cluster
in every way possible, even with Ansible, and at some point it breaks. I'm
using Ubuntu 22.04. This is one of the errors I'm getting, some problem
with ceph-exporter. Could you please help me? I have been dealing with
this for about 5 days.
Kind regards
root@node1-ceph:~# cephadm bootstrap --mon-ip 10.0.0.52
Verifying podman|docker is present...
Verifying lvm2 is present...
Verifying time synchronization is in place...
Unit systemd-timesyncd.service is enabled and running
Repeating the final host check...
docker (/usr/bin/docker) is present
systemctl is present
lvcreate is present
Unit systemd-timesyncd.service is enabled and running
Host looks OK
Cluster fsid: 4ce3a92a-8ddd-11ee-9b23-6341187f70c1
Verifying IP 10.0.0.52 port 3300 ...
Verifying IP 10.0.0.52 port 6789 ...
Mon IP `10.0.0.52` is in CIDR network `10.0.0.0/24`
Mon IP `10.0.0.52` is in CIDR network `10.0.0.0/24`
Mon IP `10.0.0.52` is in CIDR network `10.0.0.1/32`
Mon IP `10.0.0.52` is in CIDR network `10.0.0.1/32`
Internal network (--cluster-network) has not been provided, OSD replication
will default to the public_network
Pulling container image quay.io/ceph/ceph:v17...
Ceph version: ceph version 17.2.7
(b12291d110049b2f35e32e0de30d70e9a4c060d2) quincy (stable)
Extracting ceph user uid/gid from container image...
Creating initial keys...
Creating initial monmap...
Creating mon...
Waiting for mon to start...
Waiting for mon...
mon is available
Assimilating anything we can from ceph.conf...
Generating new minimal ceph.conf...
Restarting the monitor...
Setting mon public_network to 10.0.0.1/32,10.0.0.0/24
Wrote config to /etc/ceph/ceph.conf
Wrote keyring to /etc/ceph/ceph.client.admin.keyring
Creating mgr...
Verifying port 9283 ...
Waiting for mgr to start...
Waiting for mgr...
mgr not available, waiting (1/15)...
mgr not available, waiting (2/15)...
mgr not available, waiting (3/15)...
mgr not available, waiting (4/15)...
mgr not available, waiting (5/15)...
mgr is available
Enabling cephadm module...
Waiting for the mgr to restart...
Waiting for mgr epoch 5...
mgr epoch 5 is available
Setting orchestrator backend to cephadm...
Generating ssh key...
Wrote public SSH key to /etc/ceph/ceph.pub
Adding key to root@localhost authorized_keys...
Adding host node1-ceph...
Deploying mon service with default placement...
Deploying mgr service with default placement...
Deploying crash service with default placement...
Deploying ceph-exporter service with default placement...
Non-zero exit code 22 from /usr/bin/docker run --rm --ipc=host
--stop-signal=SIGTERM --net=host --entrypoint /usr/bin/ceph --init -e
CONTAINER_IMAGE=quay.io/ceph/ceph:v17 -e NODE_NAME=node1-ceph -e
CEPH_USE_RANDOM_NONCE=1 -v
/var/log/ceph/4ce3a92a-8ddd-11ee-9b23-6341187f70c1:/var/log/ceph:z -v
/tmp/ceph-tmp6yz3vt5s:/etc/ceph/ceph.client.admin.keyring:z -v
/tmp/ceph-tmpfhd01qwu:/etc/ceph/ceph.conf:z quay.io/ceph/ceph:v17 orch
apply ceph-exporter
/usr/bin/ceph: stderr Error EINVAL: Usage:
/usr/bin/ceph: stderr ceph orch apply -i <yaml spec> [--dry-run]
/usr/bin/ceph: stderr ceph orch apply <service_type>
[--placement=<placement_string>] [--unmanaged]
/usr/bin/ceph: stderr
Traceback (most recent call last):
File "/usr/sbin/cephadm", line 9653, in <module>
main()
File "/usr/sbin/cephadm", line 9641, in main
r = ctx.func(ctx)
File "/usr/sbin/cephadm", line 2205, in _default_image
return func(ctx)
File "/usr/sbin/cephadm", line 5774, in command_bootstrap
prepare_ssh(ctx, cli, wait_for_mgr_restart)
File "/usr/sbin/cephadm", line 5275, in prepare_ssh
cli(['orch', 'apply', t])
File "/usr/sbin/cephadm", line 5708, in cli
return CephContainer(
File "/usr/sbin/cephadm", line 4144, in run
out, _, _ = call_throws(self.ctx, self.run_cmd(),
File "/usr/sbin/cephadm", line 1853, in call_throws
raise RuntimeError('Failed command: %s' % ' '.join(command))
RuntimeError: Failed command: /usr/bin/docker run --rm --ipc=host
--stop-signal=SIGTERM --net=host --entrypoint /usr/bin/ceph --init -e
CONTAINER_IMAGE=quay.io/ceph/ceph:v17 -e NODE_NAME=node1-ceph -e
CEPH_USE_RANDOM_NONCE=1 -v
/var/log/ceph/4ce3a92a-8ddd-11ee-9b23-6341187f70c1:/var/log/ceph:z -v
/tmp/ceph-tmp6yz3vt5s:/etc/ceph/ceph.client.admin.keyring:z -v
/tmp/ceph-tmpfhd01qwu:/etc/ceph/ceph.conf:z quay.io/ceph/ceph:v17 orch
apply ceph-exporter
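One plausible reading of the EINVAL above is a mismatch between the cephadm script on the host and the v17 image's mgr, which does not appear to recognise ceph-exporter as a service type. Two hedged things one might try (both flags are standard cephadm bootstrap options; whether either avoids this particular failure is not verified here):
cephadm bootstrap --mon-ip 10.0.0.52 --skip-monitoring-stack
cephadm bootstrap --mon-ip 10.0.0.52 --image quay.io/ceph/ceph:v17.2.6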
--
*Francisco Arencibia Quesada.*
*DevOps Engineer*
The system has been running for years and it would take me weeks to
reconfigure all of its services properly. I'm proceeding with just the
Debian nodes for now; worst case scenario I'll just have to rent a new
server to act as a third node. I was really hoping to use the CentOS one
since it's already there. Somebody else mentioned DRBD earlier, so I've
looked into that too, but in that case I ran into problems with the Debian
systems instead, because apparently only DRBD 9 and up supports more than
two nodes, and the only Debian kernel module packages I've found are for
DRBD 8. Ironically, CentOS does have a working kmod-drbd90 package
available.
On Wed, Nov 29, 2023 at 5:53 PM Anthony D'Atri <aad(a)dreamsnake.net> wrote:
> Update it to 8 or Rocky 9?
>
> > On Nov 29, 2023, at 14:05, Leo28C <leo28c(a)gmail.com> wrote:
> >
> >> Why are you talking about version 14 now anyhow?
> >
> > One of my nodes is running CentOS 7 and the latest version I found for it
> > is 14, unless there's a way to get 15 working on it that I don't know
> about?
> >
> > On Wed, Nov 29, 2023 at 1:48 AM Robert Sander <
> r.sander(a)heinlein-support.de>
> > wrote:
> >
> >> On 11/28/23 17:50, Leo28C wrote:
> >>> Problem is I don't have the cephadm command on every node. Do I need it
> >>> on all nodes or just one of them? I tried installing it via curl but my
> >>> ceph version is 14.2.22 which is not on the archive anymore so the curl
> >>> command returns a 404 error html file. How do I get cephadm for 14.2?
> >>
> >> There is no cephadm for Ceph 14 as the orchestrator was first introduced
> >> in version 15.
> >>
> >> Why are you talking about version 14 now anyhow?
> >> As long as your nodes fulfill the requirements for cephadm you can
> >> install the latest version of Ceph.
> >>
> >> PS: Please reply to the list.
> >>
> >> Regards
> >> --
> >> Robert Sander
> >> Heinlein Consulting GmbH
> >> Schwedter Str. 8/9b, 10119 Berlin
> >>
> >> https://www.heinlein-support.de
> >>
> >> Tel: 030 / 405051-43
> >> Fax: 030 / 405051-19
> >>
> >> Amtsgericht Berlin-Charlottenburg - HRB 220009 B
> >> Geschäftsführer: Peer Heinlein - Sitz: Berlin
> >> _______________________________________________
> >> ceph-users mailing list -- ceph-users(a)ceph.io
> >> To unsubscribe send an email to ceph-users-leave(a)ceph.io
> >>
> > _______________________________________________
> > ceph-users mailing list -- ceph-users(a)ceph.io
> > To unsubscribe send an email to ceph-users-leave(a)ceph.io
>
>
Hi,
I have set the following caps for the admin user:
radosgw-admin caps add --uid=admin --tenant=admin --caps="users=*;buckets=*"
Now I would like to upload an object as the admin user into another
user/tenant's (tester1$tester1) bucket test1.
The other user has uid tester1 and tenant tester1, and the bucket test1 already exists.
I've tried with python:
headers = {'x-amz-meta-tenancy': 'tester1'}
client.upload_file(file_path, bucket_name, object_name,
ExtraArgs={'Metadata': headers})
But I get this response:
An error occurred (NoSuchBucket) when calling the PutObject operation: None
Any ideas why I get this error even though bucket test1 for tester1$tester1
exists?
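If it helps, here is a minimal boto3 sketch of addressing another tenant's bucket with the tenant:bucket form from the RGW multitenancy docs; the endpoint and credentials are placeholders, and note that the x-amz-meta-tenancy header in the snippet above does not select a tenant:
import boto3
from botocore.client import Config
from botocore.handlers import validate_bucket_name

s3 = boto3.client(
    's3',
    endpoint_url='http://rgw.example.com:8080',      # assumed RGW endpoint
    aws_access_key_id='ADMIN_ACCESS_KEY',            # admin user's S3 keys (placeholders)
    aws_secret_access_key='ADMIN_SECRET_KEY',
    config=Config(s3={'addressing_style': 'path'}),  # keep the colon out of virtual-host addressing
)
# botocore's client-side validation may reject the colon in the bucket name;
# unregistering the validator is a common workaround
s3.meta.events.unregister('before-parameter-build.s3', validate_bucket_name)
s3.upload_file(file_path, 'tester1:test1', object_name)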
Kind regards,
Rok
On 11/28/23 17:50, Leo28C wrote:
> Problem is I don't have the cephadm command on every node. Do I need it
> on all nodes or just one of them? I tried installing it via curl but my
> ceph version is 14.2.22 which is not on the archive anymore so the curl
> command returns a 404 error html file. How do I get cephadm for 14.2?
There is no cephadm for Ceph 14 as the orchestrator was first introduced
in version 15.
Why are you talking about version 14 now anyhow?
As long as your nodes fulfill the requirements for cephadm you can
install the latest version of Ceph.
PS: Please reply to the list.
Regards
--
Robert Sander
Heinlein Consulting GmbH
Schwedter Str. 8/9b, 10119 Berlin
https://www.heinlein-support.de
Tel: 030 / 405051-43
Fax: 030 / 405051-19
Amtsgericht Berlin-Charlottenburg - HRB 220009 B
Geschäftsführer: Peer Heinlein - Sitz: Berlin
Hi,
I would like to know when a bucket or an object gets updated.
I can get this for an object via its changed ETag, but I cannot get an
ETag for a bucket, so I am looking at
https://docs.ceph.com/en/latest/radosgw/notifications/
How do I create a topic, and where do I send the request with its parameters?
Does anyone have an example (CLI, Python, ...) of topic creation where I can
get a notification to an HTTP endpoint when an object gets created in a bucket?
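For reference, a rough sketch along the lines of that doc, using boto3's SNS-compatible API against RGW; the endpoint, credentials, topic name and HTTP receiver below are all placeholders:
import boto3

endpoint = 'http://rgw.example.com:8080'            # assumed RGW endpoint
auth = dict(aws_access_key_id='ACCESS_KEY',         # placeholder credentials
            aws_secret_access_key='SECRET_KEY')

# 1. create a topic whose push-endpoint is the HTTP receiver
sns = boto3.client('sns', endpoint_url=endpoint, region_name='default', **auth)
topic_arn = sns.create_topic(
    Name='objects-created',
    Attributes={'push-endpoint': 'http://receiver.example.com:8080/events'},
)['TopicArn']

# 2. attach a notification configuration for object-created events to the bucket
s3 = boto3.client('s3', endpoint_url=endpoint, **auth)
s3.put_bucket_notification_configuration(
    Bucket='test1',
    NotificationConfiguration={'TopicConfigurations': [{
        'Id': 'notify-on-create',
        'TopicArn': topic_arn,
        'Events': ['s3:ObjectCreated:*'],
    }]},
)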
Kind regards,
Rok
Hi Eugen,
Please find the details below
FROM HYPERVISOR SYSLOG:-
Nov 29 07:07:46 kernel: [ 1171.392249] IPv6: ADDRCONF(NETDEV_CHANGE):
qvoc178e343-d8: link becomes ready
Nov 29 07:07:46 kernel: [ 1171.392460] IPv6: ADDRCONF(NETDEV_CHANGE):
qvbc178e343-d8: link becomes ready
Nov 29 07:07:46 kernel: [ 1171.397266] qbrc178e343-d8: port
1(qvbc178e343-d8) entered blocking state
Nov 29 07:07:46 kernel: [ 1171.397268] qbrc178e343-d8: port
1(qvbc178e343-d8) entered disabled state
Nov 29 07:07:46 kernel: [ 1171.397414] device qvbc178e343-d8 entered
promiscuous mode
FROM DMESG LOG:-
[Wed Nov 29 07:07:45 2023] qbrc178e343-d8: port 1(qvbc178e343-d8) entered
disabled state
[Wed Nov 29 07:07:45 2023] device qvbc178e343-d8 entered promiscuous mode
[Wed Nov 29 07:07:45 2023] qbrc178e343-d8: port 1(qvbc178e343-d8) entered
blocking state
[Wed Nov 29 07:07:45 2023] qbrc178e343-d8: port 1(qvbc178e343-d8) entered
forwarding state
[Wed Nov 29 07:07:45 2023] device qvoc178e343-d8 entered promiscuous mode
[Wed Nov 29 07:07:49 2023] qbrc178e343-d8: port 2(tapc178e343-d8) entered
blocking state
[Wed Nov 29 07:07:49 2023] qbrc178e343-d8: port 2(tapc178e343-d8) entered
disabled state
[Wed Nov 29 07:07:49 2023] device tapc178e343-d8 entered promiscuous mode
[Wed Nov 29 07:07:49 2023] qbrc178e343-d8: port 2(tapc178e343-d8) entered
blocking state
[Wed Nov 29 07:07:49 2023] qbrc178e343-d8: port 2(tapc178e343-d8) entered
forwarding state
NOVA-COMPUTE LOG:-
2023-11-29 00:38:31.027 7 INFO nova.compute.manager [-] [instance:
1dbd1562-44c1-44b7-9d1e-97ac61716db3] VM Stopped (Lifecycle Event)
2023-11-29 00:38:31.115 7 INFO nova.compute.manager
[req-cda5058f-c026-4586-a2b3-50f9727f1220 - - - - -] [instance:
1dbd1562-44c1-44b7-9d1e-97
ac61716db3] During _sync_instance_power_state the DB power_state (1) does
not match the vm_power_state from the hypervisor (4). Updating power
_state in the DB to match the hypervisor.
2023-11-29 00:38:34.045 7 INFO nova.compute.manager [-] [instance:
46b48b4e-3675-453c-8c87-f21f1b7fb86c] VM Stopped (Lifecycle Event)
2023-11-29 00:38:34.080 7 INFO nova.compute.manager [-] [instance:
b3df30a6-de61-448e-8451-938309b20ab5] VM Stopped (Lifecycle Event)
2023-11-29 00:38:34.122 7 INFO nova.compute.manager
[req-a6d0b47f-f50b-4bbc-a0d7-df49511ca4a7 - - - - -] [instance:
46b48b4e-3675-453c-8c87-f2
1f1b7fb86c] During _sync_instance_power_state the DB power_state (1) does
not match the vm_power_state from the hypervisor (4). Updating power
_state in the DB to match the hypervisor.
2023-11-29 00:38:34.164 7 INFO nova.compute.manager
[req-7aab66e0-73f6-4ae1-960d-2f24dfba3131 - - - - -] [instance:
b3df30a6-de61-448e-8451-93
8309b20ab5] During _sync_instance_power_state the DB power_state (1) does
not match the vm_power_state from the hypervisor (4). Updating power
_state in the DB to match the hypervisor.
Multiple virtual machines went down on different hypervisors.
The OS disks reside on Ceph storage. During this incident, if I directly
restart the machine I get an I/O error on the console,
so first I have to rebuild the OS-disk volume's object map on the Ceph side and then
restart the machine for the VM to function properly.
I also checked the instance console logs but found nothing suspicious.
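For reference, a cleaned-up sketch of the recovery commands referred to above (they also appear further down in the quoted thread; the pool name is a placeholder):
rbd object-map rebuild <pool>/<volume-id>
openstack server start <server-id>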
Thanks & Regards
Arihant Jain
On Mon, Nov 27, 2023 at 7:49 PM Eugen Block <eblock(a)nde.ag> wrote:
> I don't see how ceph could be the issue here. Do you have libvirt logs
> and dmesg output from the hypervisor and something from the VM like
> the relevant syslog excerpts (before it gets shut down)? Is only one
> VM affected or all from the same hypervisor or several across
> different hypervisors?
> The nova-compute.log doesn't seem to be enough, but you could also
> enable debug logs to see if it reveals more.
>
> Zitat von AJ_ sunny <jains8550(a)gmail.com>:
>
> > Hi team,
> >
> > After making the above changes I am still hitting the issue where the machine
> > keeps getting shut down
> >
> > In nova-compute logs I am getting only this footprint
> >
> > Logs:-
> > 2023-10-16 08:48:10.971 7 WARNING nova.compute.manager
> > [req-c7b731db-2b61-400e-917f-8645c9984696 f226d81a45dd46488fb2e19515 848
> > 316d215042914de190f5f9e1c8466bf0 default default] [instance:
> > 4b04d3f1-1fbd-4b63-b693-a0ef316ecff3] Received unexpected - vent
> > network-vif-plugged-f191f6c8-dff5-4c1b-94b3-8d91aa6ff5ac for instance
> with
> > vm_state active and task_state None. 2023-10-21 22:42:44.589 7 INFO
> > nova.compute.manager [-] [instance: 4b04d3f1-1fbd-4b63-b693-a0ef316ecff3]
> > VM Stopped (Lifecyc Event)
> >
> > 2023-10-21 22:42:44.683 7 INFO nova.compute.manager
> > [req-1d99b87b-7ff7-462d-ab18-fbdec6bda71d -] [instance: 4b04d3f1-
> > fbd-4b63-b693-a0ef316ecff3] During _sync_instance_power_state the DB
> > power_state (1) does not match the vm_power_state from ti e hypervisor
> (4).
> > Updating power_state in the DB to match the hypervisor.
> >
> > 2023-10-21 22:42:44.811 7 WARNING nova.compute.manager
> > [req-1d99b87b-7ff7-462d-ab18-fbdec6bda71d ----] [instance: 4b04d3f
> > 1-1fbd-4b63-b693-a0ef316ecff3] Instance shutdown by itself. Calling the
> > stop API. Current vm_state: active, current task_state : None, original
> DB
> > power_state: 1, current VM power_state: 4 2023-10-21 22:42:44.977 7 INFO
> > nova.compute.manager [req-1d99b87b-7ff7-462d-ab18-fbdec6bda71d -]
> > [instance: 4b04d3f1-1
> >
> > fbd-4b63-b693-a0ef316ecff3] Instance is already powered off in the
> > hypervisor when stop is called.
> >
> >
> > And in this architecture we are using ceph is the backend storage for
> > Nova,glance & cinder
> > When a machine goes down by itself and I try to start it, it goes into
> > error, i.e. the VM console shows I/O ERROR during boot, so first we need to
> > rebuild the volume on the ceph side and then I have to start the machine
> > Rbd object-map rebuild<volume-id>
> > Openstack server start <server-id>
> >
> > So this issue is showing two faces, one from the ceph side and another in the
> > nova-compute log.
> > Can someone please help me fix this issue asap?
> >
> > Thanks & Regards
> > Arihant Jain
> >
> > On Tue, 24 Oct, 2023, 4:56 pm , <smooney(a)redhat.com> wrote:
> >
> >> On Tue, 2023-10-24 at 10:11 +0530, AJ_ sunny wrote:
> >> > Hi team,
> >> >
> >> > The VM is not being shut off by the owner from inside; it automatically goes
> >> > to shutdown, i.e. the libvirt lifecycle stop event is triggering
> >> > In my nova.conf configuration I am using ram_allocation_ratio = 1.5
> >> > And previously I tried to set in nova.conf
> >> > Sync_power_state_interval = -1 but still facing the same problem
> >> > OOM might be causing this issue
> >> > Can you please give me some idea to fix this issue if OOM is the cause
> >> the general answer is swap.
> >>
> >> nova should always be deployed with swap even if you do not have
> >> overcommit enabled.
> >> there are a few reasons for this, the first being that python allocates
> >> memory differently if
> >> any swap is available, even 1G is enough to have it not try to commit all
> >> memory. so
> >> when swap is available the nova/neutron agents will use much less resident
> >> memory even
> >> without using any of the swap space.
> >>
> >> we have some docs about this downstream
> >>
> >>
> https://access.redhat.com/documentation/en-us/red_hat_openstack_platform/17…
> >>
> >> if you are being ultra conservative we recommend allocating (ram *
> >> allocation ratio) in swap, so in your case allocate
> >> 1.5 times your ram as swap. we would expect the actual usage of swap to be
> >> a small fraction of that however, so we
> >> also provide a formula:
> >>
> >> overcommit_ratio = NovaRAMAllocationRatio - 1
> >> Minimum swap size (MB) = (total_RAM * overcommit_ratio) + RHEL_min_swap
> >> Recommended swap size (MB) = total_RAM * (overcommit_ratio +
> >> percentage_of_RAM_to_use_for_swap)
> >>
> >> so say your host had 64G of ram with an allocation ratio of 1.5 and a min
> >> swap percentage of 25%,
> >> the conservative swap recommendation would be
> >>
> >> (64*(0.5+0.25)) + distro_min_swap
> >> (64*0.75) + 4G = 52G of recommended swap
> >>
> >> if you're wondering why we add a min swap percentage and distro min swap,
> >> it's basically to account for the
> >> Qemu and host OS memory overhead as well as the memory used by the
> >> nova/neutron agents and libvirt/ovs
> >>
> >>
> >> if you're not using memory overcommit my general recommendation is: if you
> >> have less than 64G of ram allocate 16G; if you
> >> have more than 256G of ram allocate 64G and you should be fine. when you
> >> do use memory overcommit you must
> >> have at least enough swap to account for the qemu overhead of all instances
> >> + the overcommitted memory.
> >>
> >>
> >> the other common cause of OOM errors is if you are using numa affinity and
> >> the guests don't request
> >> hw:mem_page_size=<something>. without setting a mem_page_size request we
> >> don't do numa-aware memory placement. the kernel
> >> OOM system works
> >> on a per-numa-node basis; numa affinity does not support memory
> >> overcommit either, so that is likely not your issue.
> >> i just said i would mention it to cover all bases.
> >>
> >> regards
> >> sean
> >>
> >>
> >>
> >> >
> >> >
> >> > Thanks & Regards
> >> > Arihant Jain
> >> >
> >> > On Mon, 23 Oct, 2023, 11:29 pm , <smooney(a)redhat.com> wrote:
> >> >
> >> > > On Mon, 2023-10-23 at 13:19 -0400, Jonathan Proulx wrote:
> >> > > >
> >> > > > I've seen similar log traces with overcommitted memory when the
> >> > > > hypervisor runs out of physical memory and OOM killer gets the VM
> >> > > > process.
> >> > > >
> >> > > > This is an unusual configuration (I think) but if the VM owner
> >> claims
> >> > > > they didn't power down the VM internally you might look at the
> local
> >> > > > hypervisor logs to see if the VM process crashed or was killed for
> >> some
> >> > > > other reason.
> >> > > yep, OOM events are one common cause of this.
> >> > >
> >> > > nova is basically just saying "hey, you said this vm should be active but it's
> >> > > not, I'm going to update the db to reflect
> >> > > reality." you can turn that off with
> >> > >
> >> > >
> >>
> https://docs.openstack.org/nova/latest/configuration/config.html#workaround…
> >> > > or
> >> > >
> >> > >
> >>
> https://docs.openstack.org/nova/latest/configuration/config.html#DEFAULT.sy…
> >> > > either disable the sync by setting the interval to -1
> >> > > or disable handling of the virt lifecycle events.
> >> > >
> >> > > i would recommend the sync_power_state_interval approach, but again if vms
> >> > > are stopping
> >> > > and you don't know why, you likely should discover why rather than just
> >> > > turning off the update of the nova db to reflect
> >> > > the actual state.
> >> > >
> >> > > >
> >> > > > -Jon
> >> > > >
> >> > > > On Mon, Oct 23, 2023 at 02:02:26PM +0100, smooney(a)redhat.com
> wrote:
> >> > > > :On Mon, 2023-10-23 at 17:45 +0530, AJ_ sunny wrote:
> >> > > > :> Hi team,
> >> > > > :>
> >> > > > :> I am using openstack kolla ansible on wallaby version and
> >> currently I
> >> > > am
> >> > > > :> facing issue with virtual machine, vm is shutoff by itself and
> and
> >> > > from log
> >> > > > :> it seems libvirt lifecycle stop event triggering again and
> again
> >> > > > :>
> >> > > > :> Logs:-
> >> > > > :> 2023-10-16 08:48:10.971 7 WARNING nova.compute.manager
> >> > > > :> [req-c7b731db-2b61-400e-917f-8645c9984696
> >> f226d81a45dd46488fb2e19515
> >> > > 848
> >> > > > :> 316d215042914de190f5f9e1c8466bf0 default default] [instance:
> >> > > > :> 4b04d3f1-1fbd-4b63-b693-a0ef316ecff3] Received unexpected -
> vent
> >> > > > :> network-vif-plugged-f191f6c8-dff5-4c1b-94b3-8d91aa6ff5ac for
> >> instance
> >> > > with
> >> > > > :> vm_state active and task_state None. 2023-10-21 22:42:44.589 7
> >> INFO
> >> > > > :> nova.compute.manager [-] [instance:
> >> > > 4b04d3f1-1fbd-4b63-b693-a0ef316ecff3]
> >> > > > :> VM Stopped (Lifecyc Event)
> >> > > > :>
> >> > > > :> 2023-10-21 22:42:44.683 7 INFO nova.compute.manager
> >> > > > :> [req-1d99b87b-7ff7-462d-ab18-fbdec6bda71d -] [instance:
> 4b04d3f1-
> >> > > > :> fbd-4b63-b693-a0ef316ecff3] During _sync_instance_power_state
> the
> >> DB
> >> > > > :> power_state (1) does not match the vm_power_state from ti e
> >> > > hypervisor (4).
> >> > > > :> Updating power_state in the DB to match the hypervisor.
> >> > > > :>
> >> > > > :> 2023-10-21 22:42:44.811 7 WARNING nova.compute.manager
> >> > > > :> [req-1d99b87b-7ff7-462d-ab18-fbdec6bda71d ----] [instance:
> 4b04d3f
> >> > > > :> 1-1fbd-4b63-b693-a0ef316ecff3] Instance shutdown by itself.
> >> Calling
> >> > > the
> >> > > > :> stop API. Current vm_state: active, current task_state : None,
> >> > > original DB
> >> > > > :> power_state: 1, current VM power_state: 4 2023-10-21
> 22:42:44.977
> >> 7
> >> > > INFO
> >> > > > :> nova.compute.manager [req-1d99b87b-7ff7-462d-ab18-fbdec6bda71d
> -]
> >> > > > :> [instance: 4b04d3f1-1
> >> > > > :>
> >> > > > :> fbd-4b63-b693-a0ef316ecff3] Instance is already powered off in
> the
> >> > > > :> hypervisor when stop is called.
> >> > > > :
> >> > > > :that sounds like the guest os shutdown the vm.
> >> > > > :i.e. something in the guest ran sudo poweroff
> >> > > > :then nova detected the vm was stopped by the user and updated its db
> >> > > > :to match
> >> > > > :that.
> >> > > > :
> >> > > > :that is the expected behavior when you have the power sync enabled.
> >> > > > :it is enabled by default.
> >> > > > :>
> >> > > > :>
> >> > > > :> Thanks & Regards
> >> > > > :> Arihant Jain
> >> > > > :> +91 8299719369
> >> > > > :
> >> > > >
> >> > >
> >> > >
> >>
> >>
>
>
>
>
>>
>> 1) They’re client aka desktop SSDs, not “enterprise”
>> 2) They’re a partition of a larger OSD shared with other purposes
>
> Yup. They're a mix of SATA SSDs and NVMes, but everything is
> consumer-grade. They're only 10% full on average and I'm not
> super-concerned with performance. If they did get full I'd allocate
> more space for them. Performance is more than adequate for the very
> light loads they have.
Fair enough. We sometimes see people bringing a toothpick to a gun fight and expecting a different result, so I had to ask. Just keep an eye on their endurance burn.
>
>
> It is interesting because Quincy had no issues with the autoscaler
> with the exact same cluster config. It might be a Rook issue, or it
> might just be because so many PGs are remapped. I'll take another
> look at that once it reaches more of a steady state.
>
> In any case, if the balancer is designed more for equal-sized OSDs I
> can always just play with reweights to balance things.
Look into the JJ balancer, I’ve read good things about it.
>
> --
> Rich
>> Very small and/or non-uniform clusters can be corner cases for many things, especially if they don’t have enough PGs. What is your failure domain — host or OSD?
>
> Failure domain is host,
Your host buckets do vary in weight by roughly a factor of two. They will naturally receive PGs roughly in proportion to their aggregate CRUSH weight, and so will the OSDs on each host.
> and PG number should be fairly reasonable.
Reasonable is in the eye of the beholder. I make the PG ratio for the cluster as a whole to be ~90. I would definitely add more; that should help.
>> Are your OSDs sized uniformly? Please send the output of the following commands:
>
> OSDs are definitely not uniform in size. This might be the issue with
> the automation.
>
> You asked for it, but I do apologize for the wall of text that follows...
You described a small cluster, so this is peanuts.
>> `ceph osd tree`
>
> ID CLASS WEIGHT TYPE NAME STATUS REWEIGHT PRI-AFF
> -1 131.65762 root default
> -25 16.46977 host k8s1
> 14 hdd 5.45799 osd.14 up 0.90002 1.00000
> 19 hdd 10.91409 osd.19 up 1.00000 1.00000
> 22 ssd 0.09769 osd.22 up 1.00000 1.00000
> -13 25.56458 host k8s3
> 2 hdd 10.91409 osd.2 up 0.84998 1.00000
> 3 hdd 1.81940 osd.3 up 0.75002 1.00000
> 20 hdd 12.73340 osd.20 up 1.00000 1.00000
> 10 ssd 0.09769 osd.10 up 1.00000 1.00000
> -14 12.83107 host k8s4
> 0 hdd 10.91399 osd.0 up 1.00000 1.00000
> 5 hdd 1.81940 osd.5 up 1.00000 1.00000
> 11 ssd 0.09769 osd.11 up 1.00000 1.00000
> -2 14.65048 host k8s5
> 1 hdd 1.81940 osd.1 up 0.70001 1.00000
> 17 hdd 12.73340 osd.17 up 1.00000 1.00000
> 12 ssd 0.09769 osd.12 up 1.00000 1.00000
> -6 14.65048 host k8s6
> 4 hdd 1.81940 osd.4 up 0.75000 1.00000
> 16 hdd 12.73340 osd.16 up 0.95001 1.00000
> 13 ssd 0.09769 osd.13 up 1.00000 1.00000
> -3 23.74518 host k8s7
> 6 hdd 12.73340 osd.6 up 1.00000 1.00000
> 15 hdd 10.91409 osd.15 up 0.95001 1.00000
> 8 ssd 0.09769 osd.8 up 1.00000 1.00000
> -9 23.74606 host k8s8
> 7 hdd 14.55269 osd.7 up 1.00000 1.00000
> 18 hdd 9.09569 osd.18 up 1.00000 1.00000
> 9 ssd 0.09769 osd.9 up 1.00000 1.00000
Looks like one 100GB SSD OSD per host? That is AIUI the screaming minimum size for an OSD. With WAL, DB, cluster maps, and other overhead there doesn't end up being much space left for payload data; on larger OSDs the overhead fades much more into the noise floor. Given the size of these SSD OSDs, I suspect at least one of the following is true:
1) They’re client aka desktop SSDs, not “enterprise”
2) They’re a partition of a larger OSD shared with other purposes
I suspect that this alone would be enough to frustrate the balancer, which AFAIK doesn’t take overhead into consideration. You might disable the balancer module, reset the reweights to 1.00, and try the JJ balancer but I dunno that it would be night vs day.
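A hedged one-liner for the reweight-reset step, using the OSD ids that currently carry an override in the tree above:
for id in 14 2 3 1 4 16 15; do ceph osd reweight $id 1.0; done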
> Note this cluster is in the middle of re-creating all the OSDs to
> modify the OSD allocation size
min_alloc_size? Were they created on an older Ceph release? Current defaults for [non]rotational media are both 4KB; they used to be 64KB but were changed with some churn …. around the Pacific / Octopus era IIRC. If you’re re-creating to minimize space amp, does that mean you’re running RGW with a significant fraction of small objects? With RBD — or CephFS with larger files — that isn’t so much an issue.
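For reference, the config knobs involved; these report what newly created OSDs will use, not what existing ones were built with:
ceph config get osd bluestore_min_alloc_size_hdd
ceph config get osd bluestore_min_alloc_size_ssd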
> I have scrubbing disabled since I'm
> basically rewriting just about everything in the cluster weekly right
> now but normally that would be on.
>
> cluster:
> id: ba455d73-116e-4f24-8a34-a45e3ba9f44c
> health: HEALTH_WARN
> noscrub,nodeep-scrub flag(s) set
> 546 pgs not deep-scrubbed in time
> 542 pgs not scrubbed in time
>
> services:
> mon: 3 daemons, quorum e,f,g (age 7d)
> mgr: a(active, since 7d)
> mds: 1/1 daemons up, 1 hot standby
> osd: 22 osds: 22 up (since 5h), 22 in (since 33h); 101 remapped pgs
> flags noscrub,nodeep-scrub
> rgw: 1 daemon active (1 hosts, 1 zones)
>
> data:
> volumes: 1/1 healthy
> pools: 13 pools, 617 pgs
> objects: 9.36M objects, 33 TiB
> usage: 67 TiB used, 65 TiB / 132 TiB avail
> pgs: 1778936/21708668 objects misplaced (8.195%)
> 516 active+clean
> 100 active+remapped+backfill_wait
> 1 active+remapped+backfilling
>
> io:
> client: 371 KiB/s rd, 2.8 MiB/s wr, 2 op/s rd, 7 op/s wr
> recovery: 25 MiB/s, 6 objects/s
>
> progress:
> Global Recovery Event (7d)
> [=======================.....] (remaining: 36h)
>
>> `ceph osd df`
>
> Note that these are not in a steady state right now. OSD 6 in
> particular was just re-created and is repopulating. A few of the
> reweights were set to deal with some gross issues in balance - when it
> all settles down I plan to optimize them.
>
> ID CLASS WEIGHT REWEIGHT SIZE RAW USE DATA OMAP META AVAIL %USE VAR PGS STATUS
> 14 hdd 5.45799 0.90002 5.5 TiB 3.0 TiB 3.0 TiB 2.0 MiB 11 GiB 2.4 TiB 55.51 1.09 72 up
> 19 hdd 10.91409 1.00000 11 TiB 6.2 TiB 6.2 TiB 3.1 MiB 16 GiB 4.7 TiB 57.12 1.12 144 up
Unless you were to carefully segregate larger and smaller HDDs into separate pools, right-sizing the PG count is tricky. 144 is okay, 72 is a bit low, upstream guidance notwithstanding. I would still bump some of your pg_nums a bit.
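e.g., a hedged sketch (pool name and target are placeholders; recent releases let pgp_num follow automatically):
ceph osd pool set myfs-replicated pg_num 512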
> 22 ssd 0.09769 1.00000 100 GiB 2.4 GiB 1.8 GiB 167 MiB 504 MiB 98 GiB 2.43 0.05 32 up
> 2 hdd 10.91409 0.84998 11 TiB 4.5 TiB 4.5 TiB 5.0 MiB 9.7 GiB 6.4 TiB 41.11 0.81 99 up
> 3 hdd 1.81940 0.75002 1.8 TiB 1.0 TiB 1.0 TiB 2.3 MiB 3.8 GiB 818 GiB 56.11 1.10 21 up
> 20 hdd 12.73340 1.00000 13 TiB 7.1 TiB 7.1 TiB 3.7 MiB 16 GiB 5.6 TiB 56.01 1.10 165 up
> 10 ssd 0.09769 1.00000 100 GiB 1.3 GiB 299 MiB 185 MiB 835 MiB 99 GiB 1.29 0.03 38 up
> 0 hdd 10.91399 1.00000 11 TiB 6.5 TiB 6.5 TiB 3.7 MiB 15 GiB 4.4 TiB 59.41 1.17 144 up
> 5 hdd 1.81940 1.00000 1.8 TiB 845 GiB 842 GiB 1.7 MiB 3.3 GiB 1018 GiB 45.36 0.89 23 up
> 11 ssd 0.09769 1.00000 100 GiB 3.1 GiB 1.3 GiB 157 MiB 1.6 GiB 97 GiB 3.09 0.06 33 up
> 1 hdd 1.81940 0.70001 1.8 TiB 983 GiB 979 GiB 1.3 MiB 3.4 GiB 880 GiB 52.76 1.04 26 up
> 17 hdd 12.73340 1.00000 13 TiB 7.3 TiB 7.2 TiB 3.6 MiB 15 GiB 5.5 TiB 56.95 1.12 159 up
> 12 ssd 0.09769 1.00000 100 GiB 1.5 GiB 120 MiB 55 MiB 1.3 GiB 99 GiB 1.49 0.03 21 up
> 4 hdd 1.81940 0.75000 1.8 TiB 1.0 TiB 1.0 TiB 2.5 MiB 3.0 GiB 820 GiB 55.98 1.10 24 up
> 16 hdd 12.73340 0.95001 13 TiB 7.6 TiB 7.5 TiB 7.9 MiB 16 GiB 5.2 TiB 59.32 1.17 171 up
> 13 ssd 0.09769 1.00000 100 GiB 2.4 GiB 528 MiB 196 MiB 1.7 GiB 98 GiB 2.38 0.05 33 up
> 6 hdd 12.73340 1.00000 13 TiB 1.7 TiB 1.7 TiB 1.3 MiB 4.5 GiB 11 TiB 13.66 0.27 48 up
> 15 hdd 10.91409 0.95001 11 TiB 6.5 TiB 6.5 TiB 5.2 MiB 13 GiB 4.4 TiB 59.42 1.17 155 up
> 8 ssd 0.09769 1.00000 100 GiB 1.9 GiB 1.1 GiB 116 MiB 788 MiB 98 GiB 1.95 0.04 26 up
> 7 hdd 14.55269 1.00000 15 TiB 7.8 TiB 7.7 TiB 3.9 MiB 16 GiB 6.8 TiB 53.32 1.05 172 up
> 18 hdd 9.09569 1.00000 9.1 TiB 4.9 TiB 4.9 TiB 3.9 MiB 11 GiB 4.2 TiB 53.96 1.06 109 up
> 9 ssd 0.09769 1.00000 100 GiB 2.2 GiB 391 MiB 264 MiB 1.6 GiB 98 GiB 2.25 0.04 40 up
> TOTAL 132 TiB 67 TiB 67 TiB 1.2 GiB 164 GiB 65 TiB 50.82
> MIN/MAX VAR: 0.03/1.17 STDDEV: 29.78
>
>
>> `ceph osd dump | grep pool`
>
> pool 1 '.mgr' replicated size 3 min_size 2 crush_rule 7 object_hash rjenkins pg_num 1 pgp_num 1 autoscale_mode on pg_num_max 32 pg_num_min 1 application mgr
Check the CRUSH rule for this pool. On my clusters Rook creates it without specifying a device-class, but the other pools get rules that do specify a device class. By way of the shadow CRUSH topology, this sort of looks like multiple CRUSH roots to the pg_autoscaler, which is why you have no output from the status below. I added a bit to the docs earlier this year to call this out. Perhaps the Rook folks on the list might have thoughts about preventing this situation, I don’t recall if I created a github issue for it.
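If that is what's happening, a hedged sketch of giving the .mgr pool a device-class-specific rule (the rule name is a placeholder; adjust the root and failure domain to your map):
ceph osd crush rule create-replicated replicated-mgr-ssd default host ssd
ceph osd pool set .mgr crush_rule replicated-mgr-ssd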
That said, I’m personally not a fan of the pg autoscaler and tend to disable it. ymmv. Unless you enable the “bulk” option, it may well be that you have too few PGs for effective bin packing. Think about filling a 55 gal drum with beach balls vs with golf balls.
So many pools for such a small cluster …. are you actively using CephFS, RBD, *and* RGW? If not, I’d suggest removing whatever you aren’t using and adjusting pg_num for the pools you are using.
> pool 2 'myfs-metadata' replicated size 3 min_size 2 crush_rule 25 object_hash rjenkins pg_num 16 pgp_num 16
> pool 3 'myfs-replicated' replicated size 2 min_size 1 crush_rule 26 object_hash rjenkins pg_num 256 pgp_num 256
> pool 4 'pvc-generic-pool' replicated size 3 min_size 2 crush_rule 17 object_hash rjenkins pg_num 128 pgp_num 128
> pool 13 'myfs-eck2m2' erasure profile myfs-eck2m2_ecprofile size 4 min_size 3 crush_rule 8 pg_num 128 pgp_num 128
> pool 22 'my-store.rgw.otp' replicated size 3 min_size 2 crush_rule 24 pg_num 8 pgp_num 8
> pool 23 'my-store.rgw.buckets.index' replicated size 3 min_size 2 pg_num 8 pgp_num 8
> pool 24 'my-store.rgw.log' replicated size 3 min_size 2 crush_rule 23 pg_num 8 pgp_num 8
> pool 25 'my-store.rgw.control' replicated size 3 min_size 2 crush_rule 19 object_hash rjenkins pg_num 8 pgp_num 8
> pool 26 '.rgw.root' replicated size 3 min_size 2 crush_rule 18 pg_num 8 pgp_num 8
> pool 27 'my-store.rgw.buckets.non-ec' replicated size 3 min_size 2 pg_num 8 pgp_num 8
> pool 28 'my-store.rgw.meta' replicated size 3 min_size 2 pg_num 8 pgp_num 8
> pool 29 'my-store.rgw.buckets.data' erasure profile my-store.rgw.buckets.data_ecprofile size 4 min_size 3 pg_num 32 pgp_num 32 autoscale_mode on
Is that a 2,2 or 3,1 profile?
>
>> `ceph balancer status`
>
> This does have normal output when the cluster isn't in the middle of recovery.
>
> {
> "active": true,
> "last_optimize_duration": "0:00:00.000107",
> "last_optimize_started": "Tue Nov 28 22:11:56 2023",
> "mode": "upmap",
> "no_optimization_needed": true,
> "optimize_result": "Too many objects (0.081907 > 0.050000) are
> misplaced; try again later",
> "plans": []
> }
>
>> `ceph osd pool autoscale-status`
>
> No output for this. I'm not sure why
See above, I suspected this.
> - this has given output in the
> past. Might be due to being in the middle of recovery, or it might be
> a Reef issue (I don't think I've looked at this since upgrading). In
> any case, PG counts are in the osd dump, and I have the hdd storage
> classes set to warn I think.
>
>> The balancer module can be confounded by certain complex topologies like multiple device classes and/or CRUSH roots.
>>
>> Since you’re using Rook, I wonder if you might be hitting something that I’ve seen myself; the above commands will tell the tale.
>
> Yeah, if it is designed for equally-sized OSDs then it isn't going to
> work quite right for me. I do try to keep hosts reasonably balanced,
> but not individual OSDs.
Ceph is fantastic for flexibility, but it’s not above giving us enough rope to hang ourselves with.
>
> --
> Rich
I'm fairly new to Ceph and running Rook on a fairly small cluster
(half a dozen nodes, about 15 OSDs). I notice that OSD space use can
vary quite a bit - upwards of 10-20%.
In the documentation I see multiple ways of managing this, but no
guidance on what the "correct" or best way to go about this is. As
far as I can tell there is the balancer, manual manipulation of upmaps
via the command line tools, and OSD reweight. The last two can be
optimized with tools to calculate appropriate corrections. There is
also the new read/active upmap (at least for non-EC pools), which is
manually triggered.
The balancer alone is leaving fairly wide deviations in space use, and
at times during recovery this can become more significant. I've seen
OSDs hit the 80% threshold and start impacting IO when the entire
cluster is only 50-60% full during recovery.
I've started using ceph osd reweight-by-utilization and that seems
much more effective at balancing things, but this seems redundant with
the balancer which I have turned on.
What is generally considered the best practice for OSD balancing?
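For context, a hedged example of the usual knobs behind the upmap balancer referred to above (the deviation value is just an illustrative choice):
ceph balancer mode upmap
ceph balancer on
ceph config set mgr mgr/balancer/upmap_max_deviation 1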
--
Rich