Hi, I have a Nautilus cluster, version 14.2.6, and I have noticed that
when some OSDs go down the cluster doesn't start recovering. I have
checked that the noout flag is unset.
What could be the reason for this behavior?
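A few generic commands are useful for narrowing this down (illustrative, not specific to this cluster). Note that recovery only begins once the down OSDs are actually marked out, which by default happens after mon_osd_down_out_interval (600 seconds), and that flags other than noout, such as norecover and nobackfill, also block recovery:

# Show cluster-wide flags beyond noout
ceph osd dump | grep flags
# Explain the current health state, including degraded or stuck PGs
ceph health detail
# Check whether the down OSDs are still marked "in"
ceph osd tree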
--
*******************************************************
Andrés Rojas Guerrero
Unidad Sistemas Linux
Area Arquitectura Tecnológica
Secretaría General Adjunta de Informática
Consejo Superior de Investigaciones Científicas (CSIC)
Pinar 19
28006 - Madrid
Tel: +34 915680059 -- Ext. 990059
email: a.rojas(a)csic.es
ID comunicate.csic.es: @50852720l:matrix.csic.es
*******************************************************
Hello, I have 6 hosts with 12 SSD disks each, for a total of 72 OSDs.
I am using Ceph Octopus in its latest version; the deployment was done
with cephadm and containers, following the docs. We are having some
performance problems with the cluster. I mounted it on a Proxmox
cluster, and on Windows VMs the disks go to 100% utilization from
something as simple as opening a browser; when I switch to another
storage backend, NFS for example, everything goes back to normal. I now
have the Ceph cluster mounted with only 1 VM on it, and we still have
the problem of slowness and slow ops. The network speed between the
hosts in the cluster is 25 Gb, tested with iperf, and between Ceph and
Proxmox it is 25 Gb per host. Has anyone run into this before?
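One hedged way to narrow this down is to benchmark the cluster directly, taking the Proxmox/VM layer out of the picture, and then look at the slow ops themselves; the pool name and OSD id below are placeholders:

# Raw write benchmark against a test pool, bypassing the VM stack
rados bench -p testpool 30 write
# See which OSDs are currently reporting slow ops
ceph health detail
# Inspect recent operations on a suspect OSD (run on that OSD's host)
ceph daemon osd.0 dump_historic_ops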
Many thanks
Hello,
I have a small 6-node Octopus 15.2.11 cluster installed on bare metal with cephadm, and I added a second OSD to one of my 3 OSD nodes. I then started copying data to my CephFS (kernel mount), but both OSDs on that specific node crashed.
To this topic I have the following questions:
1) How can I find out why the two OSDs crashed? Because everything runs in Podman containers, I don't know where the logs are to find out why this happened. From the OS itself everything looks OK; there was no out-of-memory error.
2) I would have assumed the two OSD containers would restart on their own, but that does not seem to be the case. How can I manually restart these 2 OSD containers on that node? I believe this should be a "ceph orch" command?
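For reference, on a cephadm/Podman deployment the daemon logs normally go to journald, crashes are recorded by the crash module, and restarts go through the orchestrator; a sketch, with osd.5 as a placeholder daemon name:

# List recorded daemon crashes; "ceph crash info <id>" shows the backtrace
ceph crash ls
# View a daemon's log via cephadm (run on the host where the daemon lives)
cephadm logs --name osd.5
# Restart the daemon through the orchestrator
ceph orch daemon restart osd.5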
The health of the cluster right now is:
CEPHADM_FAILED_DAEMON: 2 failed cephadm daemon(s)
PG_DEGRADED: Degraded data redundancy: 132518/397554 objects degraded (33.333%), 65 pgs degraded, 65 pgs undersized
Thank you for your hints.
Best regards,
Mabi
Hello Ceph community,
I'm trying to upgrade a Pacific (v16.2.0) cluster to the latest version,
but the upgrade process seems to be stuck. The mgr log (debug level)
does not show any significant message regarding the upgrade, other than
when it is started/paused/resumed/stopped.
2021-05-06T14:29:59.294725+0000 mgr.hostc.riclju (mgr.3935983) 35645 :
cephadm [INF] Upgrade: Started with target docker.io/ceph/ceph:v16.2.2
2021-05-06T14:49:55.710023+0000 mgr.hostc.riclju (mgr.3935983) 36285 :
cephadm [INF] Paused
2021-05-06T14:50:24.444742+0000 mgr.hostc.riclju (mgr.3935983) 36302 :
cephadm [INF] Resumed
2021-05-06T14:51:36.888269+0000 mgr.hostc.riclju (mgr.3935983) 36349 :
cephadm [INF] Upgrade: Paused upgrade to docker.io/ceph/ceph:v16.2.2
2021-05-06T14:51:50.411779+0000 mgr.hostc.riclju (mgr.3935983) 36357 :
cephadm [INF] Upgrade: Resumed upgrade to docker.io/ceph/ceph:v16.2.2
2021-05-06T14:52:01.660682+0000 mgr.hostc.riclju (mgr.3935983) 36365 :
cephadm [INF] Upgrade: Stopped
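For anyone hitting a similar stall, a sketch of the usual steps to get more visibility out of cephadm:

# Ask the upgrade state machine what it thinks it is doing
ceph orch upgrade status
# Raise cephadm's logging and follow its messages live
ceph config set mgr mgr/cephadm/log_to_cluster_level debug
ceph -W cephadm --watch-debug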
It may be worth mentioning that last week I had trouble trying to deploy
RGWs. It was not possible to deploy the RGWs using this command:
ceph orch apply rgw orbyta --realm=realma --zone=zonea --placement="2"
So the following commands were used instead:
ceph orch daemon add rgw zonea --placement hostb
ceph orch daemon add rgw zonea --placement hosta
Even after those commands were issued, the orchestrator would still not
deploy the RGWs until the current MGR failed over to another standby
MGR. After that, the RGWs were deployed.
Another problem I have is with the orchestrator's refresh behavior. The
last time the daemons listed in ceph orch ps were refreshed is the last
time an MGR failover happened, and issuing ceph orch ps --refresh does
not seem to update the output.
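Since both the RGW deployment and the inventory refresh only made progress on a mgr failover, one workaround (not a root-cause fix) is to force such a failover so a standby rebuilds the cephadm module state; the mgr name below is taken from the log excerpt above:

# Fail the active mgr so a standby takes over
ceph mgr fail hostc.riclju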
It looks like all those symptoms are related somehow, but I don't know
how to dig further into the internals of the orchestrator to get more
information.
I would greatly appreciate it if you could point me in the right direction.
Thank you, kind regards.
--
AltaVoz <https://www.altavoz.net/>
Fernando Cid
Operations Engineer
www.altavoz.net
Viña del Mar: 2 Poniente 355, of. 53 | +56 32 276 8060
Santiago: Antonio Bellet 292, of. 701 | +56 2 2585 4264
I manage a historical cluster of several Ceph nodes, each with 128 GB RAM and 36 OSDs of 8 TB each.
The cluster is just for archival purposes, and performance is not so important.
The cluster was running fine for a long time on Ceph Luminous.
Last week I upgraded it to Debian 10 and Ceph Nautilus.
Now I can see that the memory usage of each OSD slowly grows to 4 GB, and once the system has
no memory left it OOM-kills processes.
I have already configured osd_memory_target = 1073741824.
This helps for some hours, but then memory usage grows from 1 GB to 4 GB per OSD again.
Any ideas what I can do to further limit OSD memory usage?
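One arithmetic check worth spelling out: with 36 OSDs per node and the Nautilus default osd_memory_target of 4 GiB, the OSDs alone aim for 36 x 4 GiB = 144 GiB, which exceeds the 128 GB installed, so OOM kills are expected unless the lower target actually reaches every daemon. A sketch for setting and verifying it (osd.0 is a placeholder; osd_memory_target is a best-effort target, not a hard limit, so some overshoot is normal):

# Set the target cluster-wide in the mon config database
ceph config set osd osd_memory_target 1073741824
# Verify what a running OSD actually uses (run on the OSD's host)
ceph daemon osd.0 config show | grep osd_memory_target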
It would be good to keep this hardware running for some more time without upgrading the RAM in all the OSD machines.
Any ideas?
Thanks
Christoph
Hello Ceph,
Can you set the minimum SSL/TLS version, such as TLS 1.2?
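Assuming this is about RGW with the beast frontend (the message does not say which daemon), the frontend's ssl_options setting can disable older protocol versions; a sketch with placeholder port and certificate path:

# Allow only TLS 1.2+ by disabling everything older (port and path are placeholders)
ceph config set client.rgw rgw_frontends "beast ssl_port=443 ssl_certificate=/etc/ceph/rgw.pem ssl_options=no_sslv2:no_sslv3:no_tlsv1:no_tlsv1_1"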
Glen
This e-mail is intended solely for the benefit of the addressee(s) and any other named recipient. It is confidential and may contain legally privileged or confidential information. If you are not the recipient, any use, distribution, disclosure or copying of this e-mail is prohibited. The confidentiality and legal privilege attached to this communication is not waived or lost by reason of the mistaken transmission or delivery to you. If you have received this e-mail in error, please notify us immediately.
Hello Anthony,
It was introduced in Octopus 15.2.10.
See: https://docs.ceph.com/en/latest/releases/octopus/
Do you know how you would set it in Pacific? :)
I guess there shouldn't be much difference...
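Presumably the annex option takes the same comma-separated key=value list as bluestore_rocksdb_options, so a sketch might look like this (the RocksDB values are illustrative placeholders, not tuning advice):

# Append extra RocksDB options on top of bluestore_rocksdb_options
ceph config set osd bluestore_rocksdb_options_annex "max_background_jobs=4,compaction_readahead_size=2097152"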
Thank you
Mehmet
On April 28, 2021 at 19:21:19 CEST, Anthony D'Atri <anthony.datri(a)gmail.com> wrote:
>I think that’s new with Pacific.
>
>> On Apr 28, 2021, at 1:26 AM, ceph(a)elchaka.de wrote:
>>
>>
>>
>> Hello,
>>
>> I have an Octopus cluster and want to change some values - but I
>> cannot find any documentation on how to set multiple values with
>>
>> bluestore_rocksdb_options_annex
>>
>> Could someone give me some examples?
>> I would like to do this with ceph config set ...
>>
>> Thanks in advance
>> Mehmet
>> _______________________________________________
>> ceph-users mailing list -- ceph-users(a)ceph.io
>> To unsubscribe send an email to ceph-users-leave(a)ceph.io
Hello all,
I just wanted to let you know that DigitalOcean has open-sourced a
tool we've developed called pgremapper.
Originally inspired by CERN's upmap exception table manipulation
scripts, pgremapper is a CLI written in Go which exposes a number of
upmap-based algorithms for backfill-related use cases: canceling
backfill (like CERN's upmap-remapped.py, but with some extra tricks up
its sleeve), draining PGs off of an OSD, undoing upmaps in a
controlled and concurrent manner, and more.
If you're interested, please read the details in the repo's README:
https://github.com/digitalocean/pgremapper
Josh
https://io500.org/cfs
Stabilization Period: 05 - 14 May 2021 AoE
Submission Deadline: 11 June 2021 AoE
The IO500 is now accepting and encouraging submissions for the upcoming
8th IO500 list. Once again, we are also accepting submissions to the 10
Node Challenge to encourage the submission of small scale results. The
new ranked lists will be announced via live-stream at a virtual session.
We hope to see many new results.
What's New
Starting with ISC'21, the IO500 now follows a two-staged approach.
First, there will be a two-week stabilization period during which we
encourage the community to verify that the benchmark runs properly.
During this period the benchmark will be updated based upon feedback
from the community. The final benchmark will then be released on Monday,
May 17th. We expect that runs compliant with the rules made during the
stabilization period are valid as the final submission unless a
significant defect is found.
We are now creating a more detailed schema to describe the hardware and
software of the system under test, and we provide a first set of tools
to ease capturing this information for inclusion with the submission.
Further details will be released on the submission page.
Background
The benchmark suite is designed to be easy to run and the community has
multiple active support channels to help with any questions. Please note
that submissions of all sizes are welcome; the site has customizable
sorting, so it is possible to submit on a small system and still get a
very good per-client score, for example. Additionally, the list is about
much more than just the raw rank; all submissions help the community by
collecting and publishing a wider corpus of data. More details below.
Following the success of the Top500 in collecting and analyzing
historical trends in supercomputer technology and evolution, the IO500
was created in 2017, published its first list at SC17, and has grown
exponentially since then. The need for such an initiative has long been
known within High-Performance Computing; however, defining appropriate
benchmarks had long been challenging. Despite this challenge, the
community, after long and spirited discussion, finally reached consensus
on a suite of benchmarks and a metric for resolving the scores into a
single ranking.
The multi-fold goals of the benchmark suite are as follows:
- Maximizing simplicity in running the benchmark suite
- Encouraging optimization and documentation of tuning parameters for performance
- Allowing submitters to highlight their "hero run" performance numbers
- Forcing submitters to simultaneously report performance for challenging IO patterns
Specifically, the benchmark suite includes a hero-run of both IOR and
mdtest configured however possible to maximize performance and establish
an upper-bound for performance. It also includes an IOR and mdtest run
with highly prescribed parameters in an attempt to determine a
lower-bound. Finally, it includes a namespace search as this has been
determined to be a highly sought-after feature in HPC storage systems
that has historically not been well-measured. Submitters are encouraged
to share their tuning insights for publication.
The goals of the community are also multi-fold:
- Gather historical data for the sake of analysis and to aid predictions of storage futures
- Collect tuning information to share valuable performance optimizations across the community
- Encourage vendors and designers to optimize for workloads beyond "hero runs"
- Establish bounded expectations for users, procurers, and administrators
10 Node I/O Challenge
The 10 Node Challenge is conducted using the regular IO500 benchmark,
however, with the rule that exactly 10 client nodes must be used to run
the benchmark. You may use any shared storage with, e.g., any number of
servers. When submitting to the IO500 list, you can opt in to
"Participate in the 10 compute node challenge only", in which case we
will not include the results in the ranked list. Other 10-node submissions
will be included in the full list and in the ranked list. We will
announce the result in a separate derived list and in the full list but
not on the ranked IO500 list at io500.org.
Birds-of-a-Feather
Once again, we encourage you to submit to join our community, and to
attend our virtual BoF "The IO500 and the Virtual Institute of I/O" at
ISC 2021, (time to be announced), where we will announce the new IO500
and 10 node challenge lists. The current list includes results from
BeeGFS, CephFS, DAOS, DataWarp, GekkoFS, GFarm, IME, Lustre, MadFS,
Qumulo, Spectrum Scale, Vast, WekaIO, and YRCloudFile. We hope that the
upcoming list grows even more.
--
The IO500 Committee