Hi everyone,
I want to invite you to apply to an internship program called Outreachy!
Outreachy provides three-month internships to work in Free and Open
Source Software (FOSS). Outreachy internship projects may include
programming, user experience, documentation, illustration, graphical
design, or data science. Interns often find employment after their
internship, with Outreachy sponsors or in jobs that use the skills they
learned during the internship.
Ceph has had ten projects submitted to Outreachy since 2018. Now we can
submit more projects for our May-August 2021 round!
Project ideas can be coordinated on the etherpad:
https://pad.ceph.com/p/project-ideas
Projects need to be submitted by the mentor here for approval:
https://www.outreachy.org/communities/cfp/ceph/
Outreachy internships run twice a year. The internships run from May to
August and December to March. Interns are paid a stipend of $6,000 USD
for the three months of work.
Outreachy internships are entirely remote and are open to applicants
around the world. Interns work remotely with experienced mentors. We
expressly invite women (both cis and trans), trans men, and genderqueer
people to apply. We also expressly invite applications from residents
and nationals of the United States of any gender who are Black/African
American, Hispanic/Latin@, Native American/American Indian, Alaska
Native, Native Hawaiian, or Pacific Islander. Anyone who faces
under-representation, systematic bias, or discrimination in their
country's technology industry is invited to apply. More details and
eligibility criteria can be found here:
https://www.outreachy.org/apply/eligibility/
The next Outreachy internship round is from May 24, 2021, until Aug. 24,
2021.
Initial applications are currently open. Initial applications are due on
Feb. 22, 2021, at 4 pm UTC. Apply today:
https://www.outreachy.org/apply/
Applying to Outreachy is a little different from other internship
programs. You'll fill out an initial application. If your initial
application is approved, you'll move onto the five-week contribution
phase. During the contribution phase, you'll make contact with project
mentors and contribute to the project. Outreachy organizers have found
that the strongest applicants contact mentors early, ask many
questions, and continually submit contributions throughout the
contribution phase.
Please let Ali or me know if you have any questions about the program.
The Outreachy organizers (Karen Sandler, Sage Sharp, Marina
Zhurakhinskaya, Cindy Pallares, and Tony Sebro) can all be reached
through our contact form:
https://www.outreachy.org/contact/contact-us/.
We hope you'll help us spread the word about Outreachy internships!
--
Mike Perez
Hi all,
We've recently run into an issue where our single Ceph RBD pool is throwing warnings for nearfull OSDs. The OSDs themselves vary in PGs/%full, with a low of 64 PGs / 78% full and a high of 73 PGs / 86% full. Are there any suggestions on how to get this to balance a little more cleanly? Currently we have 360 drives in a single pool with 8192 PGs. I think we may be able to double the PG num and that would balance things a bit more cleanly, but I just wanted to see whether the community suggests anything other than that. Let me know if there's any further info I can provide to help sort this out.
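For what it's worth, the other route I was considering is the mgr balancer in upmap mode, roughly along these lines (assuming the balancer module is available on our release and all clients are Luminous or newer, which I haven't verified yet):

ceph balancer status
ceph osd set-require-min-compat-client luminous
ceph balancer mode upmap
ceph balancer on

Would that be preferable to bumping the PG count in our situation?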
Thanks,
RAW STORAGE:
    CLASS     SIZE        AVAIL       USED        RAW USED    %RAW USED
    ssd       741 TiB     135 TiB     606 TiB     607 TiB         81.85
    TOTAL     741 TiB     135 TiB     606 TiB     607 TiB         81.85

POOLS:
    POOL     ID     STORED      OBJECTS     USED        %USED     MAX AVAIL
    pool      1     162 TiB     46.81M      494 TiB     89.02        20 TiB
cluster:
    health: HEALTH_WARN
            85 nearfull osd(s)
            1 pool(s) nearfull

services:
    osd: 360 osds: 360 up (since 7d), 360 in (since 7d)

data:
    pools:   1 pools, 8192 pgs
    objects: 46.81M objects, 169 TiB
    usage:   607 TiB used, 135 TiB / 741 TiB avail
    pgs:     8192 active+clean
Hi,
Probably a basic/stupid question, but I'm asking anyway. Through lack of knowledge and experience at the time we set up our pools, the pool that holds the majority of our data was created with a PG/PGP num of 64. As the amount of data has grown, this has started causing issues with the balance of data across OSDs. I want to increase the PG count to at least 512, or maybe 1024; obviously, I want to do this incrementally. However, rather than going from 64 to 128, then 256, etc., I'm considering doing this in much smaller increments over a longer period of time, so that the majority of the data movement will hopefully happen during the quieter time of day. So I may start by going in increments of 4 until I get up to 128, then go in jumps of 8, and so on.
My question is: will I still end up with the same net result going in increments of 4 until I hit 128 as I would if I went straight to 128 in one hit? That is, once I reach 128, would I have the exact same level of data balance across PGs as if I had gone straight to 128? Are there any drawbacks to going up in small increments over a long period of time? I know that I'll have uneven PG sizes until I reach a power of two, but that should be OK as long as the end result is the desired result. I suspect I may have a greater amount of data moving around overall doing it this way, but given that my goal is to reduce the amount of intensive data movement during higher-traffic times, that's not a huge concern in the grand scheme of things.
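For clarity, what I'd be running during the quiet window is something along these lines, repeated with gradually increasing values (the pool name is just a placeholder):

ceph osd pool set rbd-data pg_num 68
ceph osd pool set rbd-data pgp_num 68
# wait for backfill to settle, then bump again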
Thanks in advance,
Mark
Hi everyone!
I'm facing a weird issue with one of my CEPH clusters:
OS: CentOS - 8.2.2004 (Core)
CEPH: Nautilus 14.2.11 - stable
RBD using erasure code profile (K=3; m=2)
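For context, the image sits on an EC data pool with a replicated metadata pool; the setup was roughly along these lines (pool/image names and PG counts here are placeholders, not the real values):

ceph osd erasure-code-profile set ec-3-2 k=3 m=2
ceph osd pool create rbd-ec-data 1024 1024 erasure ec-3-2
ceph osd pool set rbd-ec-data allow_ec_overwrites true
rbd create --size 80T --data-pool rbd-ec-data rbd-meta/big-image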
When I try to format one of my RBD images (client side), I get the
following kernel messages multiple times, with different sector IDs:
[2417011.790154] blk_update_request: I/O error, dev rbd23, sector 164743869184 op 0x3:(DISCARD) flags 0x4000 phys_seg 1 prio class 0
[2417011.791404] rbd: rbd23: discard at objno 20110336 2490368~1703936 result -1
At first I suspected a faulty disk, BUT the monitoring system is not
showing anything faulty, so I decided to run manual tests on all my OSDs
to check disk health using smartctl etc.
None of them is flagged as unhealthy, and in fact they show no counters
for faulty sectors or read/write errors, and the wear level is at 99%.
The only particularity of this image is that it is an 80 TB image, but
that shouldn't be an issue, as we already use images of that size on
another pool.
If anyone has a clue as to how I could sort this out, I'll be more than
happy ^^
Kind regards!
I'm trying to use ceph-volume to do various things.
It works fine locally, for things like
ceph-volume lvm zap
But when I want it to do OSD level things, it is unhappy.
To use a trivial example, it wants to do things like
/usr/bin/ceph --cluster ceph --keyring /var/lib/ceph/bootstrap-osd/ceph.keyring osd tree -f json
but then dies saying,
[errno 13] RADOS permission denied (error connecting to the cluster)
and if I directly run that long command myself, it indeed dies.
(which is not too surprising, since /var/lib/ceph/bootstrap-osd/ceph.keyring does not exist)
However, if I just run from the same command prompt,
/usr/bin/ceph osd tree -f json
it works fine.
How can I get ceph-volume to just use the creds that are already working somewhere?
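In case it matters, I assume the conventional workaround would be to drop the bootstrap key into place with something like:

ceph auth get client.bootstrap-osd -o /var/lib/ceph/bootstrap-osd/ceph.keyring

but I'd still prefer ceph-volume to reuse the creds that already work.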
--
Philip Brown| Sr. Linux System Administrator | Medata, Inc.
5 Peters Canyon Rd Suite 250
Irvine CA 92606
Office 714.918.1310| Fax 714.918.1325
pbrown(a)medata.com| www.medata.com
Friends,
Any help or suggestions here for the missing data?
Thanks,
-Vikas
From: Vikas Rana <vrana(a)vtiersys.com>
Sent: Tuesday, February 16, 2021 12:20 PM
To: 'ceph-users(a)ceph.io' <ceph-users(a)ceph.io>
Subject: Data Missing with RBD-Mirror
Hi Friends,
We have a very weird issue with rbd-mirror replication. As per the command
output we are in sync, but the OSD usage on the DR side doesn't match the
Prod side.
On Prod we are using close to 52TB, but on the DR side we are only using 22TB.
We took a snap on Prod, mounted the snap on the DR side, and compared the
data, and we found a lot of missing data. Please see the output below.
Please help us resolve this issue or point us in the right direction.
Thanks,
-Vikas
DR# rbd --cluster cephdr mirror pool status cifs --verbose
health: OK
images: 1 total
1 replaying
research_data:
global_id: 69656449-61b8-446e-8b1e-6cf9bd57d94a
state: up+replaying
description: replaying, master_position=[object_number=390133, tag_tid=4,
entry_tid=447832541], mirror_position=[object_number=390133, tag_tid=4,
entry_tid=447832541], entries_behind_master=0
last_update: 2021-01-29 15:10:13
DR# ceph osd pool ls detail
pool 5 'cifs' replicated size 2 min_size 1 crush_rule 0 object_hash rjenkins
pg_num 128 pgp_num 128 last_change 1294 flags hashpspool stripe_width 0
application rbd
removed_snaps [1~5]
PROD# ceph df detail
POOLS:
    NAME    ID    QUOTA OBJECTS    QUOTA BYTES    USED       %USED    MAX AVAIL    OBJECTS    DIRTY    READ      WRITE     RAW USED
    cifs    17    N/A              N/A            26.0TiB    30.10    60.4TiB      6860550    6.86M    873MiB    509MiB    52.1TiB
DR# ceph df detail
POOLS:
    NAME    ID    QUOTA OBJECTS    QUOTA BYTES    USED       %USED    MAX AVAIL    OBJECTS    DIRTY    READ       WRITE     RAW USED
    cifs    5     N/A              N/A            11.4TiB    15.78    60.9TiB      3043260    3.04M    2.65MiB    431MiB    22.8TiB
PROD#:/vol/research_data# du -sh *
11T Flab1
346G KLab
1.5T More
4.4T ReLabs
4.0T WLab
DR#:/vol/research_data# du -sh *
2.6T Flab1
14G KLab
52K More
8.0K RLabs
202M WLab
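For reference, the snapshot comparison above was done roughly like this (the snapshot name is just an example and the device node will vary):

PROD# rbd snap create cifs/research_data@verify
DR#   rbd map cifs/research_data@verify --read-only
DR#   mount -o ro /dev/rbd0 /mnt/verify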
I'm coming back to trying mixed SSD+spinning disks after maybe a year.
It was my vague recollection that if you told ceph "go auto-configure all the disks", it would automatically carve up the SSDs into the appropriate number of LVM segments and use them as WAL devices for each HDD-based OSD on the system.
Was I wrong?
Because when I tried to bring up a brand new cluster (Octopus, cephadm bootstrapped), with multiple nodes and multiple disks per node...
it seemed to bring up the SSDs as just another set of OSDs.
It clearly recognized them as SSDs; the output of "ceph orch device ls" showed them as ssd vs hdd for the others.
It just... didn't use them as I expected.
?
Maybe I was thinking of ceph ansible.
Is there not a nice way to do this with the new cephadm based "ceph orch"?
I would rather not have to go write JSON files or whatever by hand, when a computer should be perfectly capable of auto-generating this stuff itself.
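For reference, my understanding is that the by-hand route is an OSD service spec roughly like the one below, applied to cephadm. This is untested on my side, the service id and host pattern are made up, and my assumption is that db_devices puts block.db (and implicitly the WAL) on the SSDs, which is exactly the sort of thing I was hoping would be generated for me:

cat > osd_spec.yml <<'EOF'
service_type: osd
service_id: hdd_with_ssd_db
placement:
  host_pattern: '*'
data_devices:
  rotational: 1
db_devices:
  rotational: 0
EOF
ceph orch apply osd -i osd_spec.yml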
--
Philip Brown| Sr. Linux System Administrator | Medata, Inc.
5 Peters Canyon Rd Suite 250
Irvine CA 92606
Office 714.918.1310| Fax 714.918.1325
pbrown(a)medata.com| www.medata.com
At one point in the life cycle of my test ceph cluster, I used the
--all-available-devices
flag of ceph orch, which will always attempt to bring up any newly
autodetected disks.
I now see in the docs,
"If you want to avoid this behavior (disable automatic creation of OSD on available devices), use the unmanaged parameter:"
But... I believe I did run it with the --unmanaged flag afterwards.
Unfortunately, the original behavior still seems to persist, and it keeps auto-creating OSDs.
How can I get it to stop?
I also see mention that,
"When the parameter all-available-devices or a DriveGroup specification is used, a cephadm service is created"
However, using "ceph orch ps", I don't see any relevantly named service.
Where else should I be looking?
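For completeness, what I've been poking at is roughly this; my assumption (possibly wrong) is that the spec shows up under "ceph orch ls" rather than "ceph orch ps", and that unmanaged can be re-applied like so:

ceph orch ls
ceph orch apply osd --all-available-devices --unmanaged=true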
--
Philip Brown| Sr. Linux System Administrator | Medata, Inc.
5 Peters Canyon Rd Suite 250
Irvine CA 92606
Office 714.918.1310| Fax 714.918.1325
pbrown(a)medata.com| www.medata.com
What is the best way to move an RBD image to a different pool? I want to move some 'old' images (some have snapshots) to a backup pool. For some, there is also a difference in device class.
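For context, the approaches I'm aware of are roughly these (pool and image names are only examples), but I'm not sure which is preferred, particularly where snapshots are involved:

rbd deep cp old-pool/image backup-pool/image

or, on Nautilus and later, live migration:

rbd migration prepare old-pool/image backup-pool/image
rbd migration execute backup-pool/image
rbd migration commit backup-pool/image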