Hi,
for test purposes, I have set up two 100 GB OSDs, one
holding the data pool and the other the metadata pool for CephFS.
I am running 14.2.6-1-gffd69200ad-1 with packages from
https://mirror.croit.io/debian-nautilus
I am then running a program that creates a lot of 1 MiB files by calling
fopen()
fwrite()
fclose()
for each of them. Error codes are checked.
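For reference, the same kind of test can be done from the shell; this is just a
sketch, with the mount point /mnt/cephfs assumed:

    # Write one 1 MiB random file; conv=fsync makes dd call fsync() and return
    # a non-zero exit code if the flush to the OSDs fails, which a plain
    # buffered write does not necessarily report.
    dd if=/dev/urandom of=/mnt/cephfs/testfile bs=1M count=1 conv=fsync \
        && sha1sum /mnt/cephfs/testfile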
This works successfully for ~100 GB of data, and then, strangely, keeps
succeeding for several hundred GB more... ??
All written files show a size of 1 MiB with 'ls', and thus should contain the
data written. However, on inspection, the files written after the first
~100 GiB are full of just zeros (hexdump -C).
To test this further, I used the standard tool 'cp' to copy a few
random-content files into the full CephFS filesystem. cp reported no
complaints, and right after the copy operations the content was visible with
hexdump -C. However, after forcing the data out of the client cache by reading
other, earlier-created files, hexdump -C shows all-zero content for the files
copied with 'cp'. Data that was there is suddenly gone...?
I am new to ceph. Is there an option I have missed to avoid this behaviour?
(I could not find one in
https://docs.ceph.com/docs/master/man/8/mount.ceph/ )
Is this behaviour related to
https://docs.ceph.com/docs/mimic/cephfs/full/
?
(That page states 'sometime after a write call has already returned 0'. But if
write returns 0, then no data has been written, so the user program would not
assume any kind of success.)
Best regards,
Håkan
I would like to (in this order)
- set the data pool for the root "/" of a ceph-fs to a custom value, say "P" (not the initial data pool used in fs new)
- create a sub-directory of "/", for example "/a"
- mount the sub-directory "/a" with a client key with access restricted to "/a"
The client will not be able to see the dir layout attribute set on "/", since "/" itself is not mounted.
Will the data of this client still go to pool "P"? That is, does "/a" inherit the dir layout transparently to the client when following the steps above?
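For concreteness, the steps I have in mind would look roughly like this (fs name
"cephfs", client name "client.a", monitor name and mount points are just
placeholders):

    # admin node: make pool P available to the fs and set it as the layout of "/"
    ceph fs add_data_pool cephfs P
    setfattr -n ceph.dir.layout.pool -v P /mnt/cephfs

    # create the sub-directory and a key restricted to it
    mkdir /mnt/cephfs/a
    ceph fs authorize cephfs client.a /a rw

    # client: mount only /a with that restricted key
    mount -t ceph mon1:/a /mnt/a -o name=a,secretfile=/etc/ceph/client.a.secret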
Best regards,
=================
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14
Turns out it is probably orphans.
We are running ceph luminous : 12.2.12
The orphans find has been stuck in the "iterate_bucket_index" stage on shard "0" for 2 days now.
Is anyone else facing this issue?
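For reference, the scan in question is the stock orphans search, started with
something like the following (pool name and job id below are placeholders):

    radosgw-admin orphans find --pool=default.rgw.buckets.data --job-id=orphan-scan-1
    # known jobs can be listed with
    radosgw-admin orphans list-jobs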
Regards,
From: ceph-users <ceph-users-bounces@lists.ceph.com>
Sent: 21 January 2020 10:10
To: ceph-users@lists.ceph.com
Subject: [ceph-users] Understand ceph df details
Hi everyone,
I'm trying to understand the difference between the output of this command:
ceph df details
and the result I get when I run this script:
total_bytes=0
while read user; do
    echo $user
    bytes=$(radosgw-admin user stats --uid=${user} | grep total_bytes_rounded | tr -dc "0-9")
    if [ ! -z ${bytes} ]; then
        total_bytes=$((total_bytes + bytes))
        pretty_bytes=$(echo "scale=2; $bytes / 1000^4" | bc)
        echo "  ($bytes B) $pretty_bytes TiB"
    fi
    pretty_total_bytes=$(echo "scale=2; $total_bytes / 1000^4" | bc)
done <<< "$(radosgw-admin user list | jq -r .[])"
echo ""
echo "Total : ($total_bytes B) $pretty_total_bytes TiB"
When I run ceph df detail I get this line for the data pool:
default.rgw.buckets.data 70 N/A N/A 226TiB 89.23 27.2TiB 61676992 61.68M 2.05GiB 726MiB 677TiB
And when I use my script I don't get the same result:
Total : (207579728699392 B) 207.57 TiB
That means there are roughly 20 TiB somewhere that I can't find and, most of all, can't explain.
Does anyone have an explanation ?
FYI:
[root@ceph_monitor01 ~]# radosgw-admin gc list -include-all | grep oid | wc -l
23
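Another cross-check that might be relevant would be to sum the sizes on the
bucket side instead of the user side, e.g. (the jq path assumes the usual
Luminous bucket stats layout):

    # sum size_kb_actual over all buckets and convert to bytes
    radosgw-admin bucket stats \
        | jq '[.[].usage["rgw.main"].size_kb_actual // 0] | add * 1024'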
We have a small Ceph cluster built from components that were phased out
from compute applications. The current cluster consists of i7-860 nodes
with 6 disks (5 TB, 7200 RPM) each, 8 nodes in total, for 48 OSDs.
A compute cluster will be discontinued, which will make Ryzen 5-1600
hardware available (8 nodes with 16GB RAM each) with which to replace
the CPUs of the current setup.
How could we best distribute the OSDs (keeping the existing disks for
storage) across the Ryzen systems to get a good performance improvement?
Unfortunately the interconnect is still only 1 Gb/s, so it is expected to be
a limiting factor. Would it make sense to create fewer, bigger nodes, e.g.
6 nodes with 8 disks each, or an even more condensed layout?
We would like to move the Luminous cluster to Nautilus/BlueStore and can
get SSDs for each of the nodes, as that appears to be essential for
performance. Can we actually benefit from improvements in the OSDs if the
network is so limited? Would bonding of network interfaces be a
workaround until we can get a network upgrade, or are we overestimating
the power of the upgraded OSD nodes?
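For concreteness, the kind of bonding we are thinking about is plain LACP over
two 1 Gb/s ports, roughly as below with iproute2 (interface names and address
are placeholders, and the switch must support 802.3ad):

    ip link add bond0 type bond mode 802.3ad miimon 100
    ip link set eno1 down && ip link set eno1 master bond0
    ip link set eno2 down && ip link set eno2 master bond0
    ip link set bond0 up
    ip addr add 192.168.10.11/24 dev bond0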
What strategy would you suggest with these resources?
Any comments and suggestions would be highly welcome :)
Thanks in advance
Philipp
I have a query about https://docs.ceph.com/docs/master/cephfs/createfs/:
"The data pool used to create the file system is the "default" data pool and the location for storing all inode backtrace information, used for hard link management and disaster recovery. For this reason, all inodes created in CephFS have at least one object in the default data pool."
This does not match my experience (Nautilus servers, Nautilus FUSE client or CentOS 7 kernel client). I have a CephFS with a replicated top-level pool and a directory set to use erasure coding with setfattr, though I also did the same test using the subvolume commands with the same result. "ceph df detail" shows no objects used in the top-level pool, as shown in https://gist.github.com/pcass-epcc/af24081cf014a66809e801f33bcb535b (also displayed in-line below).
It would be useful if indeed clients didn't have to write to the top-level pool, since that would mean we could give different clients permission only to pool-associated subdirectories without giving everyone write access to a pool with data structures shared between all users of the filesystem.
[root@hdr-admon01 ec]# ceph df detail; ceph fs ls; ceph fs status
RAW STORAGE:
CLASS SIZE AVAIL USED RAW USED %RAW USED
hdd 3.3 PiB 3.3 PiB 32 TiB 32 TiB 0.95
nvme 2.9 TiB 2.9 TiB 504 MiB 2.5 GiB 0.08
TOTAL 3.3 PiB 3.3 PiB 32 TiB 32 TiB 0.95
POOLS:
POOL ID STORED OBJECTS USED %USED MAX AVAIL QUOTA OBJECTS QUOTA BYTES DIRTY USED COMPR UNDER COMPR
cephfs.fs1.metadata 5 162 MiB 63 324 MiB 0.01 1.4 TiB N/A N/A 63 0 B 0 B
cephfs.fs1-replicated.data 6 0 B 0 0 B 0 1.0 PiB N/A N/A 0 0 B 0 B
cephfs.fs1-ec.data 7 8.0 GiB 2.05k 11 GiB 0 2.4 PiB N/A N/A 2.05k 0 B 0 B
name: fs1, metadata pool: cephfs.fs1.metadata, data pools: [cephfs.fs1-replicated.data cephfs.fs1-ec.data ]
fs1 - 4 clients
===
+------+--------+------------+---------------+-------+-------+
| Rank | State | MDS | Activity | dns | inos |
+------+--------+------------+---------------+-------+-------+
| 0 | active | hdr-meta02 | Reqs: 0 /s | 29 | 16 |
+------+--------+------------+---------------+-------+-------+
+----------------------------+----------+-------+-------+
| Pool | type | used | avail |
+----------------------------+----------+-------+-------+
| cephfs.fs1.metadata | metadata | 324M | 1414G |
| cephfs.fs1-replicated.data | data | 0 | 1063T |
| cephfs.fs1-ec.data | data | 11.4G | 2505T |
+----------------------------+----------+-------+-------+
+-------------+
| Standby MDS |
+-------------+
| hdr-meta01 |
+-------------+
MDS version: ceph version 14.2.5 (ad5bd132e1492173c85fda2cc863152730b16a92) nautilus (stable)
[root@hdr-admon01 ec]# ll /test-fs/ec/
total 12582912
-rw-r--r--. 1 root root 4294967296 Jan 27 22:26 new-file
-rw-r--r--. 2 root root 4294967296 Jan 28 14:06 new-file2
-rw-r--r--. 2 root root 4294967296 Jan 28 14:06 new-file-same-inode-as-newfile2
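A direct way to double-check for backtrace objects (pool names taken from the
output above; CephFS stores one 0-byte object named <inode-in-hex>.00000000 per
file) would be something like:

    # list objects in the default (replicated) data pool -- expected to hold the backtraces
    rados -p cephfs.fs1-replicated.data ls
    # for comparison, the EC pool does hold the file data objects
    rados -p cephfs.fs1-ec.data ls | head
    # the backtrace itself, if present, lives in the "parent" xattr of such an object
    rados -p cephfs.fs1-replicated.data getxattr <object-name> parent > /tmp/parent.bin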
Regards,
Phil
_________________________________________
Philip Cass
HPC Systems Specialist - Senior Systems Administrator
EPCC
Advanced Computing Facility
Bush Estate
Penicuik
Tel: +44 (0)131 4457815
Email: p.cass@epcc.ed.ac.uk
_________________________________________
Hi,
I have a CephFS in production based on 2 pools (data + metadata).
The data pool is erasure-coded with this profile:
crush-failure-domain=host
crush-root=default
jerasure-per-chunk-alignment=false
k=3
m=2
plugin=jerasure
technique=reed_sol_van
w=8
The metadata pool is replicated with size 3.
The crush rules are as follows:
[
    {
        "rule_id": 0,
        "rule_name": "replicated_rule",
        "ruleset": 0,
        "type": 1,
        "min_size": 1,
        "max_size": 10,
        "steps": [
            {
                "op": "take",
                "item": -1,
                "item_name": "default"
            },
            {
                "op": "chooseleaf_firstn",
                "num": 0,
                "type": "host"
            },
            {
                "op": "emit"
            }
        ]
    },
    {
        "rule_id": 1,
        "rule_name": "ec_data",
        "ruleset": 1,
        "type": 3,
        "min_size": 3,
        "max_size": 5,
        "steps": [
            {
                "op": "set_chooseleaf_tries",
                "num": 5
            },
            {
                "op": "set_choose_tries",
                "num": 100
            },
            {
                "op": "take",
                "item": -1,
                "item_name": "default"
            },
            {
                "op": "chooseleaf_indep",
                "num": 0,
                "type": "host"
            },
            {
                "op": "emit"
            }
        ]
    }
]
When we installed it, everything was in the same room, but now we have
split our cluster (6 servers, soon 8) across 2 rooms. So we updated the
crush map by adding a room layer (with ceph osd crush add-bucket room1
room etc.) and moved all our servers to the correct place in the tree
(ceph osd crush move server1 room=room1 etc...).
Now we would like to change the rules to set the failure domain to room
instead of host (to be sure that in case of a disaster in one of the rooms
we will still have a copy in the other).
What is the best strategy for doing this?
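For the replicated pool I imagine something like the following (rule and pool
names are only examples); what I am less sure about is the EC pool, since with
k=3, m=2 there are five chunks to place but only two rooms:

    # create a replicated rule with room as the failure domain
    ceph osd crush rule create-replicated replicated_room default room
    # point the (replicated) metadata pool at the new rule
    ceph osd pool set cephfs_metadata crush_rule replicated_room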
F.
Hello All,
I have a HW-RAID-based 240 TB data pool with about 200 million files for
users in a scientific institution. Data sizes range from tiny parameter
files for scientific calculations and experiments to huge images of
brain scans. There are group directories, home directories, and Windows
roaming profile directories, organized in ZFS pools on Solaris operating
systems and exported via NFS and Samba to Linux, macOS, and Windows clients.
I would like to switch to CephFS because of the flexibility and
expandability but I cannot find any recommendations for which storage
backend would be suitable for all the functionality we have.
Since I like ZFS features such as immediate snapshots of very large data
pools, quotas for each file system within hierarchical data trees, and
dynamic expandability by simply adding new disks or disk images without
manual resizing, would it be a good idea to create RBD images, map them
onto the file servers and create zpools on the mapped images? I know that
ZFS works best with raw disks, but maybe an RBD image is close enough to a
raw disk?
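To make that concrete, what I have in mind is roughly the following (pool name,
image name and size are placeholders):

    # create and map an RBD image, then build a zpool on top of it
    rbd create rbd/zfs-image01 --size 10T
    rbd map rbd/zfs-image01          # maps to e.g. /dev/rbd0
    zpool create tank /dev/rbd0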
Or would CephFS be the way to go? Can there be multiple CephFS pools, for
example one for the group data folders and one for the users' home directory
folders, or do I have to have everything in one single file space?
Maybe someone can share his or her field experience?
Thank you very much.
Best regards
Willi
Hello,
in my cluster one OSD after the other dies; eventually I recognized that it
is simply an "abort" in the daemon, probably caused by:
2020-01-31 15:54:42.535930 7faf8f716700 -1 log_channel(cluster) log [ERR] : trim_object Snap 29c44 not in clones
Close to this message I get a stack trace:
ceph version 0.94.10 (b1e0532418e4631af01acbc0cedd426f1905f4af)
1: /usr/bin/ceph-osd() [0xb35f7d]
2: (()+0x11390) [0x7f0fec74b390]
3: (gsignal()+0x38) [0x7f0feab43428]
4: (abort()+0x16a) [0x7f0feab4502a]
5: (__gnu_cxx::__verbose_terminate_handler()+0x16d) [0x7f0feb48684d]
6: (()+0x8d6b6) [0x7f0feb4846b6]
7: (()+0x8d701) [0x7f0feb484701]
8: (()+0x8d919) [0x7f0feb484919]
9: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x27e) [0xc3776e]
10: (ReplicatedPG::eval_repop(ReplicatedPG::RepGather*)+0x10dd) [0x868cfd]
11: (ReplicatedPG::repop_all_committed(ReplicatedPG::RepGather*)+0x80) [0x8690e0]
12: (Context::complete(int)+0x9) [0x6c8799]
13: (void ReplicatedBackend::sub_op_modify_reply<MOSDRepOpReply, 113>(std::tr1::shared_ptr<OpRequest>)+0x21b) [0xa5ae0b]
14: (ReplicatedBackend::handle_message(std::tr1::shared_ptr<OpRequest>)+0x15b) [0xa53edb]
15: (ReplicatedPG::do_request(std::tr1::shared_ptr<OpRequest>&, ThreadPool::TPHandle&)+0x1cb) [0x84c78b]
16: (OSD::dequeue_op(boost::intrusive_ptr<PG>, std::tr1::shared_ptr<OpRequest>, ThreadPool::TPHandle&)+0x3ef) [0x6966ff]
17: (OSD::ShardedOpWQ::_process(unsigned int, ceph::heartbeat_handle_d*)+0x4e4) [0x696e14]
18: (ShardedThreadPool::shardedthreadpool_worker(unsigned int)+0x71e) [0xc264fe]
19: (ShardedThreadPool::WorkThreadSharded::entry()+0x10) [0xc29950]
20: (()+0x76ba) [0x7f0fec7416ba]
21: (clone()+0x6d) [0x7f0feac1541d]
NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
Yes, I know it's still Hammer; I want to upgrade soon, but I want to
resolve that issue first. If I lose that PG, I don't worry.
So: what is the best approach? Can I use something like
ceph-objectstore-tool ... <object> remove-clone-metadata <cloneid> ?
I assume 29c44 is my object, but what is the clone id?
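From what I have pieced together so far, the invocation would look roughly like
this (OSD path, pgid and the object spec are placeholders, and the OSD would of
course be stopped first):

    # list the objects in the affected PG to get the JSON object spec
    ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-12 \
        --journal-path /var/lib/ceph/osd/ceph-12/journal \
        --pgid <pgid> --op list
    # then drop the stale clone metadata for that snap id
    ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-12 \
        --journal-path /var/lib/ceph/osd/ceph-12/journal \
        '<object-json-from-list>' remove-clone-metadata <cloneid>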
Best regards,
derjohn
This is the seventh update to the Ceph Nautilus release series. This is
a hotfix release primarily fixing a couple of security issues. We
recommend that all users upgrade to this release.
Notable Changes
---------------
* CVE-2020-1699: Fixed a path traversal flaw in Ceph dashboard that could
  allow for potential information disclosure (Ernesto Puerta)
* CVE-2020-1700: Fixed a flaw in RGW beast frontend that could lead to
  denial of service from an unauthenticated client (Or Friedmann)
--
David Galloway
Systems Administrator, RDU
Ceph Engineering
IRC: dgalloway
Hi All,
Long story short, we're doing disaster recovery on a CephFS cluster and are at a point where we have 8 PGs stuck incomplete. Just before the disaster, I had increased pg_num on two of the pools, and they had not finished increasing pgp_num yet. I've since forced pgp_num to the current values.
So far I've tried mark_unfound_lost, but the PGs don't report any unfound objects, and I've tried force-create-pg, but that has no effect, except on one of the PGs, which went to creating+incomplete. During the disaster recovery I had to re-create several OSDs (due to unreadable superblocks), and now one of the new OSDs, as well as one of the existing OSDs, won't start. The log from the startup of osd.29 is here: https://pastebin.com/PX9AAj8m, which seems to indicate that it won't start because it's supposed to have copies of the incomplete placement groups.
ceph pg 5.38 query (one of the incomplete ones) gives: https://pastebin.com/Jf4GnZTc
I have hunted around in the OSDs listed for all the placement groups for any sign of a PG that I could mark as complete with ceph-objectstore-tool, but can't find any. I don't care about the data in the PGs, but I can't abandon the filesystem.
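For reference, the kind of thing I was hunting for looks roughly like this (OSD
data path is a placeholder, pg 5.38 as above, OSD stopped first):

    # export a copy of the pg from an OSD that still has (part of) it
    ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-29 \
        --pgid 5.38 --op export --file /tmp/pg5.38.export
    # or, on an OSD that holds a usable copy, mark the pg complete there
    ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-29 \
        --pgid 5.38 --op mark-complete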
Any help would be greatly appreciated.
-TJ Ragan