Dear All,
Due to a mistake in my "rolling restart" script, one of our ceph
clusters now has a number of unfound objects:
There is an 8+2 erasure-coded data pool and a 3x replicated metadata pool;
all data is stored via CephFS.
[root@ceph7 ceph-archive]# ceph health
HEALTH_ERR 24/420880027 objects unfound (0.000%); Possible data damage:
14 pgs recovery_unfound; Degraded data redundancy: 64/4204261148 objects
degraded (0.000%), 14 pgs degraded
"ceph health detail" gives me a handle on which pgs are affected.
e.g.:
pg 5.f2f has 2 unfound objects
pg 5.5c9 has 2 unfound objects
pg 5.4c1 has 1 unfound objects
and so on...
plus more entries of this type:
pg 5.6d is active+recovery_unfound+degraded, acting
[295,104,57,442,240,338,219,33,150,382], 1 unfound
pg 5.3fa is active+recovery_unfound+degraded, acting
[343,147,21,131,315,63,214,365,264,437], 2 unfound
pg 5.41d is active+recovery_unfound+degraded, acting
[20,104,190,377,52,141,418,358,240,289], 1 unfound
Digging deeper into one of the bad pgs, we see the oids of the two
unfound objects:
[root@ceph7 ceph-archive]# ceph pg 5.f2f list_unfound
{
    "num_missing": 4,
    "num_unfound": 2,
    "objects": [
        {
            "oid": {
                "oid": "1000ba25e49.00000207",
                "key": "",
                "snapid": -2,
                "hash": 854007599,
                "max": 0,
                "pool": 5,
                "namespace": ""
            },
            "need": "22541'3088478",
            "have": "0'0",
            "flags": "none",
            "locations": [
                "189(8)",
                "263(9)"
            ]
        },
        {
            "oid": {
                "oid": "1000bb25a5b.00000091",
                "key": "",
                "snapid": -2,
                "hash": 3637976879,
                "max": 0,
                "pool": 5,
                "namespace": ""
            },
            "need": "22541'3088476",
            "have": "0'0",
            "flags": "none",
            "locations": [
                "189(8)",
                "263(9)"
            ]
        }
    ],
    "more": false
}
While it would be nice to recover the data, this cluster is only used
for storing backups.
As all OSDs are up and running, presumably the data blocks are
permanently lost?
If it's hard / impossible to recover the data, presumably we should now
consider using "ceph pg 5.f2f mark_unfound_lost delete" on each
affected pg?
Finally, can we use the oid to identify the affected files?
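My guess, in case it helps: the first part of the oid looks like the CephFS
inode number in hex, so I was going to try something like the following once
the filesystem is mounted (/cephfs is just an example mount point):

  printf '%d\n' 0x1000ba25e49                        # oid prefix -> decimal inode number
  find /cephfs -inum "$(printf '%d' 0x1000ba25e49)"  # locate the file by inode

And for the cleanup, I assume it would just be a loop over the pgs listed by
"ceph health detail":

  for pg in $(ceph health detail | awk '/has .* unfound objects/ {print $2}'); do
      ceph pg "$pg" mark_unfound_lost delete
  done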
best regards,
Jake
--
Jake Grimmett
MRC Laboratory of Molecular Biology
Francis Crick Avenue,
Cambridge CB2 0QH, UK.
Hi all,
I really hope this isn't seen as spam. I am looking to find a position
where I can focus on Linux storage/Ceph. If anyone is currently
hiring, please let me know. My LinkedIn profile is frankritchie.
Thanks,
Frank
This is the seventh update to the Ceph Nautilus release series. This is
a hotfix release primarily fixing a couple of security issues. We
recommend that all users upgrade to this release.
Notable Changes
---------------
* CVE-2020-1699: Fixed a path traversal flaw in Ceph dashboard that
could allow for potential information disclosure (Ernesto Puerta)
* CVE-2020-1700: Fixed a flaw in RGW beast frontend that could lead to
denial of service from an unauthenticated client (Or Friedmann)
Blog Link: https://ceph.io/releases/v14-2-7-nautilus-released/
Getting Ceph
------------
* Git at git://github.com/ceph/ceph.git
* Tarball at http://download.ceph.com/tarballs/ceph-14.2.7.tar.gz
* For packages, see http://docs.ceph.com/docs/master/install/get-packages/
* Release git sha1: 3d58626ebeec02d8385a4cefb92c6cbc3a45bfe8
--
Abhishek Lekshmanan
SUSE Software Solutions Germany GmbH
We're happy to announce the 13th bug fix release of the Luminous v12.2.x
long term stable release series. We recommend that all users upgrade to
this release. Many thanks to all the contributors, in particular Yuri &
Nathan, in getting this release out of the door. This shall be the last
release of the Luminous series.
For detailed release notes, please check out the official blog entry
at https://ceph.io/releases/v12-2-13-luminous-released/
Notable Changes
---------------
* Ceph now packages python bindings for python3.6 instead of python3.4,
because EPEL7 recently switched from python3.4 to python3.6 as the
native python3. See the announcement[1] for more details on the
background of this change.
* We now have telemetry support via a ceph-mgr module. The telemetry module is
strictly on an opt-in basis, and is meant to collect generic cluster
information and push it to a central endpoint. By default, we're pushing it
to a project endpoint at https://telemetry.ceph.com/report, but this is
customizable by setting the 'url' config option with::
ceph telemetry config-set url '<your url>'
You will have to opt-in on sharing your information with::
ceph telemetry on
You can view exactly what information will be reported first with::
ceph telemetry show
Should you opt-in, your information will be licensed under the
Community Data License Agreement - Sharing - Version 1.0, which you can
read at https://cdla.io/sharing-1-0/
The telemetry module reports information about CephFS file systems,
including:
- how many MDS daemons (in total and per file system)
- which features are (or have been) enabled
- how many data pools
- approximate file system age (year + month of creation)
- how much metadata is being cached per file system
As well as:
- whether IPv4 or IPv6 addresses are used for the monitors
- whether RADOS cache tiering is enabled (and which mode)
- whether pools are replicated or erasure coded, and
which erasure code profile plugin and parameters are in use
- how many RGW daemons, zones, and zonegroups are present; which RGW frontends are in use
- aggregate stats about the CRUSH map, like which algorithms are used, how
big buckets are, how many rules are defined, and what tunables are in use
* A health warning is now generated if the average osd heartbeat ping
time exceeds a configurable threshold for any of the intervals
computed. The OSD computes 1 minute, 5 minute and 15 minute intervals
with average, minimum and maximum values. New configuration option
`mon_warn_on_slow_ping_ratio` specifies a percentage of
`osd_heartbeat_grace` to determine the threshold. A value of zero
disables the warning. New configuration option
`mon_warn_on_slow_ping_time`, specified in milliseconds, overrides the
computed value and causes a warning when OSD heartbeat pings take longer
than the specified amount. The new admin command `ceph daemon mgr.#
dump_osd_network [threshold]` will list all connections with a
ping time longer than the specified threshold or the value determined by
the config options, based on the average of any of the 3 intervals. The new
admin command `ceph daemon osd.# dump_osd_network [threshold]` will do
the same but only include heartbeats initiated by the specified OSD.
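For example, to list any heartbeats involving osd.0 whose average ping
time exceeds 1000 milliseconds (the OSD id and the threshold here are
purely illustrative values)::
ceph daemon osd.0 dump_osd_network 1000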
* The configuration value `osd_calc_pg_upmaps_max_stddev` used for upmap
balancing has been removed. Instead use the mgr balancer config
`upmap_max_deviation` which now is an integer number of PGs of
deviation from the target PGs per OSD. This can be set with a command
like `ceph config set mgr mgr/balancer/upmap_max_deviation 2`. The
default `upmap_max_deviation` is 1. There are situations where crush
rules would not allow a pool to ever have completely balanced PGs, for
example if crush requires 1 replica on each of 3 racks but there are
fewer OSDs in one of the racks. In those cases, the configuration value
can be increased.
Getting Ceph
------------
* Git at git://github.com/ceph/ceph.git
* Tarball at http://download.ceph.com/tarballs/ceph-12.2.13.tar.gz
* For packages, see http://docs.ceph.com/docs/master/install/get-packages/
* Release git sha1: 584a20eb0237c657dc0567da126be145106aa47e
[1]: https://lists.fedoraproject.org/archives/list/epel-announce@lists.fedorapro…
--
Abhishek Lekshmanan
SUSE Software Solutions Germany GmbH
GF: Felix Imendörffer HRB 21284 (AG Nürnberg)
We have 18 SATA disks (each 2 TB) on a physical server, each disk with an
OSD deployed.
I am not sure how much CPU and memory resources should be prepared for this
server.
Does each OSD require a physical CPU, and how do we calculate memory usage?
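For what it's worth, I was planning to sanity-check the memory side after
deployment with something like the following (osd.0 is just an example, and
I'm assuming BlueStore OSDs recent enough to have the osd_memory_target
option):

  ceph daemon osd.0 config get osd_memory_target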
Thanks.
I would like to (in this order)
- set the data pool for the root "/" of a ceph-fs to a custom value, say "P" (not the initial data pool used in fs new)
- create a sub-directory of "/", for example "/a"
- mount the sub-directory "/a" with a client key with access restricted to "/a"
The client will not be able to see the dir layout attribute set at "/", since "/" is not mounted.
Will the data of this client still go to the pool "P", that is, does "/a" inherit the dir layout transparently to the client when following the steps above?
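For reference, the concrete steps I have in mind look roughly like this (the
fs name "cephfs", the client name "client.a" and the admin mount point
/mnt/cephfs are placeholders; "P" is the custom data pool, which I assume has
to be added to the fs first):

  ceph fs add_data_pool cephfs P
  setfattr -n ceph.dir.layout.pool -v P /mnt/cephfs     # set the data pool on "/"
  mkdir /mnt/cephfs/a
  ceph fs authorize cephfs client.a /a rw               # key restricted to "/a"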
Best regards,
=================
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14
Turns out it is probably orphans.
We are running ceph luminous : 12.2.12
And the orphans find has been stuck in the "iterate_bucket_index" stage on shard "0" for 2 days now.
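For reference, the job was started with something along these lines (the job id below is just illustrative):

  radosgw-admin orphans find --pool=default.rgw.buckets.data --job-id=orphans-scan
  radosgw-admin orphans list-jobs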
Is anyone else facing this issue?
Regards,
From: ceph-users <ceph-users-bounces@lists.ceph.com>
Sent: 21 January 2020 10:10
To: ceph-users@lists.ceph.com
Subject: [ceph-users] Understand ceph df details
Hi everyone,
I'm trying to understand the difference between the output of the command:
ceph df detail
and the result I get when I run this script:
total_bytes=0
while read user; do
    echo "$user"
    bytes=$(radosgw-admin user stats --uid="${user}" | grep total_bytes_rounded | tr -dc "0-9")
    if [ ! -z "${bytes}" ]; then
        total_bytes=$((total_bytes + bytes))
        # note: dividing by 1000^4 gives decimal TB; TiB would be 1024^4
        pretty_bytes=$(echo "scale=2; $bytes / 1000^4" | bc)
        echo " ($bytes B) $pretty_bytes TiB"
    fi
    pretty_total_bytes=$(echo "scale=2; $total_bytes / 1000^4" | bc)
done <<< "$(radosgw-admin user list | jq -r .[])"
echo ""
echo "Total : ($total_bytes B) $pretty_total_bytes TiB"
When I run ceph df detail I get this:
default.rgw.buckets.data 70 N/A N/A 226TiB 89.23 27.2TiB 61676992 61.68M 2.05GiB 726MiB 677TiB
And when I use my script I don't have the same result :
Total : (207579728699392 B) 207.57 TiB
That leaves roughly 20 TiB somewhere that I can't find, and above all I would like to understand where this 20 TiB comes from.
Does anyone have an explanation ?
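One thing I was considering, to cross-check the per-user totals against actual
bucket usage (treat this as a rough sketch; I'm not sure it accounts for
multipart or shadow objects, and the result is in KB):

  radosgw-admin bucket stats | jq '[.[].usage."rgw.main".size_kb_actual // 0] | add'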
FYI:
[root@ceph_monitor01 ~]# radosgw-admin gc list -include-all | grep oid | wc -l
23
Hello.
Thanks to advice from bauen1 I now have OSDs on Debian/Nautilus and have
been able to move on to MDS and CephFS. Also, looking around in the
Dashboard I noticed the options for Crush Failure Domain and further
that it's possible to select 'OSD'.
As I mentioned earlier our cluster is fairly small at this point (3
hosts, 24 OSDs), but we want to get as much usable storage as possible
until we can get more nodes. Since the nodes are brand new we are
probably more concerned about disk failures than about node failures for
the next few months.
If I interpret Crush Failure Domain = OSD correctly, this means it's possible to
create pools that behave somewhat like RAID 6 - something like 8 +
2, except dispersed across multiple nodes. With the pool spread around
like this, losing any one disk shouldn't put the cluster into read-only
mode - and if a disk did fail, would the cluster re-balance and reconstruct
the lost data until the failed OSD was replaced?
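If that interpretation is right, I assume the pool would be created roughly
like this (the profile/pool names and PG count are just placeholders):

  ceph osd erasure-code-profile set ec82-osd k=8 m=2 crush-failure-domain=osd
  ceph osd pool create backup-ec 256 256 erasure ec82-osd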
Does this make sense? Or is it just wishful thinking?
Thanks.
-Dave
--
Dave Hall
Binghamton University