Hi Sebastian,
Thanks a lot for your reply. It was really helpful, and it is now clear that
'make check' doesn't start a Ceph cluster. After your email I figured it
out. This brings me to another question :-)
In my earlier email I should have defined what exactly I mean by 'workload'
in my case. Given my current task/scenario, the definition of 'workload'
covers only the workload of the client machine. Meaning, if there is a Ceph
cluster, I am only concerned with the workload of a single Ceph client
node, and not the workload of the other nodes (OSDs, MONs, MDS, etc.). The
question then is: what exactly on the Ceph client? On the client side, I
would like to profile the workload of CRUSH, because I am quite sure there
are many computations in CRUSH that are CPU-intensive and could be
offloaded. Maybe these compute-intensive parts can be parallelized further.
This is why I was profiling the unit test binaries (in particular the CRUSH
unit tests) with Valgrind (--tool=callgrind) to see the function calls.
Maybe this is not the right way? Please do comment on it :-).
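For reference, this is roughly how I am running it (the build path and the
test binary name are examples from my tree; yours may differ):

```shell
# Profile a CRUSH unit test binary with callgrind.
# Build path and binary name are examples; adjust to your build tree.
cd ceph/build
valgrind --tool=callgrind --callgrind-out-file=crush.callgrind \
    ./bin/unittest_crush_wrapper

# Summarize the costliest functions; kcachegrind can visualize the same file.
callgrind_annotate crush.callgrind | head -n 40
```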
Considering my task, would you still recommend using Teuthology tests at
this point? Please do comment on this as well :-). Integration tests (the
Teuthology framework) require multi-machine clusters to run, and in my
understanding that would be too complex for a single-client workload, or
let's say if I am only interested in the CRUSH workload.
Thanks in advance :-)

Bobby
> On Wed, May 6, 2020 at 5:37 PM Sebastian Wagner <sebastian.wagner(a)suse.com>
> wrote:
>
>> Hi Bobby,
>>
>> `make check` aka unit tests don't start a ceph cluster. Instead they test
>> individual functions. There is nothing similar to a "workload" involved
>> here.
>>
>> Maybe, you're interested in the vstart_runner, which makes it possible to
>> run Teuthology tests in a vstart cluster.
>>
>> Best,
>>
>> Sebastian
>> _______________________________________________
>> Dev mailing list -- dev(a)ceph.io
>> To unsubscribe send an email to dev-leave(a)ceph.io
>>
>
Hi Frank,
Reviving this old thread to ask whether the performance of these raw NL-SAS
drives has turned out to be adequate. Is this a deep archive with almost no
retrieval, and how many drives are used? In my experience with large
parallel writes, a WAL/DB on SSD with BlueStore, or journal drives on SSD
with FileStore, has always been needed to sustain a reasonably consistent
transfer rate.
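To be concrete, the kind of layout I mean is something like this (device
names are purely illustrative):

```shell
# Collocate data on an NL-SAS HDD while placing the BlueStore WAL/DB
# on a faster SSD/NVMe partition. Device names are examples only.
ceph-volume lvm create \
    --bluestore \
    --data /dev/sdb \
    --block.db /dev/nvme0n1p1
```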
Very much appreciate any reference info as to your design.
Best regards,
Alex
On Mon, Jul 8, 2019 at 4:30 AM Frank Schilder <frans(a)dtu.dk> wrote:
>> Hi David,
>>
>> I'm running a cluster with bluestore on raw devices (no lvm) and all
>> journals collocated on the same disk with the data. Disks are spinning
>> NL-SAS. Our goal was to build storage at lowest cost, therefore all data on
>> HDD only. I got a few SSDs that I'm using for FS and RBD meta data. All
>> large pools are EC on spinning disk.
>>
>> I spent at least one month to run detailed benchmarks (rbd bench)
>> depending on EC profile, object size, write size, etc. Results were varying
>> a lot. My advice would be to run benchmarks with your hardware. If there
>> was a single perfect choice, there wouldn't be so many options. For
>> example, my tests will not be valid when using separate fast disks for WAL
>> and DB.
>>
>> There are some results though that might be valid in general:
>>
>> 1) EC pools have high throughput but low IOP/s compared with replicated
>> pools
>>
>> I see single-thread write speeds of up to 1.2GB (gigabyte) per second,
>> which is probably the network limit and not the disk limit. IOP/s get
>> better with more disks, but are way lower than what replicated pools can
>> provide. On a cephfs with EC data pool, small-file IO will be comparably
>> slow and eat a lot of resources.
>>
>> 2) I observe massive network traffic amplification on small IO sizes,
>> which is due to the way EC overwrites are handled. This is one bottleneck
>> for IOP/s. We have 10G infrastructure and use 2x10G client and 4x10G OSD
>> network. OSD bandwidth at least 2x client network, better 4x or more.
>>
>> 3) k should only have small prime factors, power of 2 if possible
>>
>> I tested k=5,6,8,10,12. Best results in decreasing order: k=8, k=6. All
>> other choices were poor. The value of m seems not relevant for performance.
>> Larger k will require more failure domains (more hardware).
>>
>> 4) object size matters
>>
>> The best throughput (1M write size) I see with object sizes of 4MB or
>> 8MB, with IOP/s getting somewhat better with smaller object sizes but
>> throughput dropping fast. I use the default of 4MB in production. Works
>> well for us.
>>
>> 5) jerasure is quite good and seems most flexible
>>
>> jerasure is quite CPU efficient and can handle smaller chunk sizes than
>> other plugins, which is preferable for IOP/s. However, CPU usage can
>> become a problem and a plugin optimized for specific values of k and m
>> might help here. Under usual circumstances I see very low load on all OSD
>> hosts, even under rebalancing. However, I remember that once I needed to
>> rebuild something on all OSDs (I don't remember what it was, sorry). In
>> this situation, CPU load went up to 30-50% (meaning up to half the cores
>> were at 100%), which is really high considering that each server has only
>> 16 disks at the moment and is sized to handle up to 100. CPU power could
>> become a bottleneck for us in the future.
>>
>> These are some general observations and do not replace benchmarks for
>> specific use cases. I was hunting for a specific performance pattern, which
>> might not be what you want to optimize for. I would recommend to run
>> extensive benchmarks if you have to live with a configuration for a long
>> time - EC profiles cannot be changed.
>>
>> We settled on 8+2 and 6+2 pools with jerasure and object size 4M. We also
>> use bluestore compression. All meta data pools are on SSD, only very little
>> SSD space is required. This choice works well for the majority of our use
>> cases. We can still build small expensive pools to accommodate special
>> performance requests.
>>
>> Best regards,
>>
>> =================
>> Frank Schilder
>> AIT Risø Campus
>> Bygning 109, rum S14
>>
>> ________________________________________
>> From: ceph-users <ceph-users-bounces(a)lists.ceph.com> on behalf of David <
>> xiaomajia.st(a)gmail.com>
>> Sent: 07 July 2019 20:01:18
>> To: ceph-users(a)lists.ceph.com
>> Subject: [ceph-users] What's the best practice for Erasure Coding
>>
>> Hi Ceph-Users,
>>
>> I'm working with a Ceph cluster (about 50TB, 28 OSDs, all Bluestore on
>> lvm).
>> Recently, I'm trying to use the Erasure Code pool.
>> My question is "what's the best practice for using EC pools ?".
>> More specifically, which plugin (jerasure, isa, lrc, shec or clay)
>> should I adopt, and how to choose the combinations of (k,m) (e.g.
>> (k=3,m=2), (k=6,m=3) ).
>>
>> Does anyone share some experience?
>>
>> Thanks for any help.
>>
>> Regards,
>> David
>>
>> _______________________________________________
>> ceph-users mailing list
>> ceph-users(a)lists.ceph.com
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>
>
Hi,
We’ve recently installed a new Ceph cluster running Octopus 15.2.1, and we’re using RGW with an erasure coded backed pool.
I started to get a suspicion that deleted objects were not getting cleaned up properly, and I wanted to verify this by checking the garbage collector.
That’s when I discovered that when I run “radosgw-admin gc list”, I get the following error:
"ERROR: failed to list objs: (22) Invalid argument”
When running the command with the debug-rgw=20 flag, I see a bit more information:
2020-05-05T18:39:19.455+0000 7f3312d82080 20 add_watcher() i=0
2020-05-05T18:39:19.455+0000 7f3312d82080 20 add_watcher() i=1
2020-05-05T18:39:19.455+0000 7f3312d82080 20 add_watcher() i=2
2020-05-05T18:39:19.455+0000 7f3312d82080 20 add_watcher() i=3
2020-05-05T18:39:19.455+0000 7f3312d82080 20 add_watcher() i=4
2020-05-05T18:39:19.455+0000 7f3312d82080 20 add_watcher() i=5
2020-05-05T18:39:19.455+0000 7f3312d82080 20 add_watcher() i=6
2020-05-05T18:39:19.455+0000 7f3312d82080 20 add_watcher() i=7
2020-05-05T18:39:19.455+0000 7f3312d82080 2 all 8 watchers are set, enabling cache
2020-05-05T18:39:19.455+0000 7f3312d82080 20 check_secure_mon_conn(): auth registy supported: methods=[2,1] modes=[2,1]
2020-05-05T18:39:19.455+0000 7f3312d82080 20 check_secure_mon_conn(): method 1 is insecure
2020-05-05T18:39:19.455+0000 7f32d4fd9700 2 RGWDataChangesLog::ChangesRenewThread: start
2020-05-05T18:39:19.519+0000 7f3246ffd700 20 reqs_thread_entry: start
2020-05-05T18:39:19.519+0000 7f3312d82080 20 init_complete bucket index max shards: 11
2020-05-05T18:39:19.519+0000 7f3244ff9700 20 reqs_thread_entry: start
2020-05-05T18:39:19.519+0000 7f323affd700 20 reqs_thread_entry: start
ERROR: failed to list objs: (22) Invalid argument
2020-05-05T18:39:19.523+0000 7f32d4fd9700 2 RGWDataChangesLog::ChangesRenewThread: start
2020-05-05T18:39:19.523+0000 7f3312d82080 20 remove_watcher() i=0
2020-05-05T18:39:19.523+0000 7f3312d82080 2 removed watcher, disabling cache
2020-05-05T18:39:19.523+0000 7f3312d82080 20 remove_watcher() i=1
2020-05-05T18:39:19.523+0000 7f3312d82080 20 remove_watcher() i=2
2020-05-05T18:39:19.527+0000 7f3312d82080 20 remove_watcher() i=3
2020-05-05T18:39:19.527+0000 7f3312d82080 20 remove_watcher() i=4
2020-05-05T18:39:19.527+0000 7f3312d82080 20 remove_watcher() i=5
2020-05-05T18:39:19.527+0000 7f3312d82080 20 remove_watcher() i=6
2020-05-05T18:39:19.527+0000 7f3312d82080 20 remove_watcher() i=7
I find very little information regarding this error, so I wondered if someone here could help me troubleshoot the issue?
Thanks,
James.
Hi,
I am trying to setup the Zabbix reporting module, but it is giving an
error which looks like a Python error:
ceph zabbix config-show
Error EINVAL: TypeError: __init__() got an unexpected keyword argument 'index'
I have configured the zabbix_host and identifier already at this point.
The command: 'ceph zabbix send' also fails to run with 'Failed to send
data to Zabbix'
I am running:
- CentOS 8.1
- ceph version 15.2.1 (9fd2f65f91d9246fae2c841a6222d34d121680ee)
octopus (stable)
- Zabbix 4.4-1.el8 release
(https://repo.zabbix.com/zabbix/4.4/rhel/8/x86_64/zabbix-release-4.4-1.el8.n…)
- Python version 3.6.8
Any suggestions? I am wondering if this module requires Python 2.7 to run?
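For completeness, these are roughly the steps I used to set the module up
(the host name and identifier are example values):

```shell
# Enable and configure the mgr Zabbix module (values are examples).
ceph mgr module enable zabbix
ceph zabbix config-set zabbix_host zabbix.example.com
ceph zabbix config-set identifier ceph-cluster-01

# Verify the configuration, then trigger a manual send.
ceph zabbix config-show
ceph zabbix send
```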
--
Matt Larson, PhD
Madison, WI 53705 U.S.A.
On Mon, Mar 9, 2020 at 3:19 PM Marc Roos <M.Roos(a)f1-outsourcing.eu> wrote:
>
>
> For testing purposes I changed the kernel 3.10 for a 5.5, now I am
> getting these messages. I assume the 3.10 was just never displaying
> these. Could this be a problem with my caps of the fs id user?
>
> [Mon Mar 9 23:10:52 2020] ceph: Can't lookup inode 1 (err: -13)
> [Mon Mar 9 23:12:03 2020] ceph: Can't lookup inode 1 (err: -13)
> [Mon Mar 9 23:13:12 2020] ceph: Can't lookup inode 1 (err: -13)
> [Mon Mar 9 23:14:19 2020] ceph: Can't lookup inode 1 (err: -13)
For posterity, a tracker was opened for this bug:
https://tracker.ceph.com/issues/44546
--
Patrick Donnelly, Ph.D.
He / Him / His
Senior Software Engineer
Red Hat Sunnyvale, CA
GPG: 19F28A586F808C2402351B93C3301A3E258DD79D
Dear all,
We currently run a small Ceph cluster on 2 machines and we wonder what
are the theoretical max BW/IOPS we can achieve through RBD with our setup.
Here are the environment details:
- The Ceph release is an octopus 15.2.1 running on Centos 8, both
machines have 180GB RAM, 72 cores, and 40 * 1.8TB SSD disks each
- Regarding network we deployed two isolated 100Gb/s networks for front
and back connectivity
- Since all disks have the same performance, we created 1 OSD per SSD
using bluestore (default setup with LVM) to reach a total of 80 OSDs (40
OSD per machine)
- On top of that we have a single 2x replicated RBD pool with 2048 PGs
in order to reach a global average of 50 PGs per OSD (our experiments
with 100 PGs/OSD didn't provide a performance improvement, only extra CPU
consumption)
- We kept default settings for all RBD images we created for benchmarks
(4MB obj size, 4MB stripe width, 1 stripe)
- The crush map and replication rules used are very simple (2 hosts, 40
OSDs per host with same device class and weight)
- All tuning settings (caches sizing, op threads, bluestore, rocksdb
options, etc.) are the default options provided with the Octopus release.
Here are the best values observed so far, using both rados bench and fio
with many different setups (varying number of clients, threads, RBD
images, block sizes from 4k to 4m, random/sequential, iodepth, etc.):
- Read BW: 24GB/s (looks like we reached the maximum network capacity of
both machines here)
- Read IOPS: 600k
- Write BW: 7 GB/s
- Write IOPS: 100k
Those are simply the maximum numbers obtained regardless of latency, as we
first want to stress the infrastructure to see what maximum throughput &
IOPS we can achieve. Latency measurements will come later.
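For reference, a typical fio invocation we used looks like the following
(pool and image names are examples from our setup; adjust as needed):

```shell
# Random-write IOPS test through fio's rbd ioengine.
# Pool/image names are examples; the image must already exist.
fio --name=rbd-randwrite \
    --ioengine=rbd --pool=rbd-bench --rbdname=bench-img-01 \
    --rw=randwrite --bs=4k --iodepth=32 --numjobs=4 \
    --direct=1 --runtime=120 --time_based --group_reporting
```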
We also have the feeling that the 2x replication of the RBD pool is a
big deal with only 2 nodes in the cluster, dividing maximum speeds by
more than 2. This will probably have much less impact when scaling up
the cluster with new nodes.
We also noticed that at some points during recovery operations (e.g.
rebalancing PGs after a new OSD was added to the pool) the total
read/write throughput and IOPS climb to several GB/s and millions of
IOPS, so we wonder if we can do any better with legitimate RBD
client load.
Would you like to share numbers from your setups, or do you have any hints
for potential improvements?
Thanks.
Regards,
--
Vincent Kherbache
R&D Director
Titan Datacenter
Dear Cephalopodians,
seeing the recent moves of major HDD vendors to sell SMR disks targeted for use in consumer NAS devices (including RAID systems),
I got curious and wonder what the current status of SMR support in Bluestore is.
Of course, I'd expect disk vendors to give us host-managed SMR disks for data center use cases (and to tell us when actually they do so...),
but in that case, Bluestore surely needs some new intelligence for best performance in the shingled ages.
I had a quick look at the repository and could only make out that libzbc has been added some years ago,
but no activity after this (also no tickets in the issue tracker). Is this still something on the roadmap?
It would be wonderful for backup / archiving / mostly data ingest / cold storage clusters to be able to use cheaper and larger disks once they become available :-).
I'm still optimistic about such use cases even though my personal experience with SMR has not been so good up to now:
I bought a (not cleanly labelled...) DM-SMR, used it for BTRFS archiving (btrbk, i.e. btrfs-send and -receive),
and after it was filled once, it got excruciatingly slow (less than a few kiB/s even when changing only a few 100 MB after prolonged idle).
But then, deleting btrfs snapshots of a desktop OS is pure random read/write access, and there are no optimizations for that use case in btrfs (and the drive did not even support TRIM/DISCARD),
and I read on the btrfs list now that it can work well even in arrays mostly used for cold storage if the balancing is throttled ;-).
Cheers,
Oliver
I have been using snapshots on cephfs since luminous (1 fs and
1 active mds) and used an rsync on it for backup.
Under luminous I did not encounter any problems with this setup. I
think I was even snapshotting user dirs every 7 days, keeping thousands of
snapshots (which I later heard is not recommended; one should stick
below 400 or so?)
When upgrading to nautilus, this snapshot feature was disabled (that is
the default in the upgrade). I did not notice nor expect this. When I
re-enabled snapshotting, I had problems with the rsync backup, so I
reverted to the slower ceph-fuse mount. I also brought the number of
snapshots down to 36, but I am still stuck with "clients failing to respond
to capability release", "clients failing to respond to cache pressure"
and "MDSs report slow requests".
Which is odd, since my usage did not change since luminous.
All in all it is fine, but what I do not like is that such a thing can
happen between upgrades.
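For reference, I am creating and removing the snapshots via the usual
.snap pseudo-directories, roughly like this (mount point and names are
examples):

```shell
# CephFS snapshots are managed through the .snap pseudo-directory
# of the directory being snapshotted (paths/names are examples).
mkdir /mnt/cephfs/home/user1/.snap/weekly-2020-05-06   # create snapshot
ls /mnt/cephfs/home/user1/.snap                        # list snapshots
rmdir /mnt/cephfs/home/user1/.snap/weekly-2020-05-06   # remove snapshot
```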
-----Original Message-----
From: Stolte, Felix [mailto:f.stolte@fz-juelich.de]
Sent: 06 May 2020 09:09
To: ceph-users(a)ceph.io
Subject: [ceph-users] Cephfs snapshots in Nautilus
Hi Folks,
I really like to use snapshots on cephfs, but even on octopus release
snapshots are still marked as an experimental feature. Is anyone using
snapshots in production environments? Which issues did you encounter? Do
I risk a corrupted filesystem or just non-working snapshots?
We run a single fs with one active mds.
Best regards
Felix
------------------------------------------------------------------------
-------------
------------------------------------------------------------------------
-------------
Forschungszentrum Juelich GmbH
52425 Juelich
Sitz der Gesellschaft: Juelich
Eingetragen im Handelsregister des Amtsgerichts Dueren Nr. HR B 3498
Vorsitzender des Aufsichtsrats: MinDir Dr. Karl Eugen Huthmacher
Geschaeftsfuehrung: Prof. Dr.-Ing. Wolfgang Marquardt (Vorsitzender),
Karsten Beneke (stellv. Vorsitzender), Prof. Dr.-Ing. Harald Bolt
------------------------------------------------------------------------
-------------
------------------------------------------------------------------------
-------------
Hi Cephers,
I am trying to install Ceph Octopus using ceph-deploy on CentOS 7.
When installing ceph-mgr-dashboard, it requires these packages:
- python3-cherrypy
- python3-jwt
- python3-routes
https://pastebin.com/dSQPgGJD
But when I tried to install these packages, they were not available.
https://pastebin.com/Hguf8kJe
Has anyone installed Octopus successfully on CentOS 7? Can you help me?
Thanks in advance.
On Wed, May 6, 2020 at 3:53 PM Marc Roos <M.Roos(a)f1-outsourcing.eu> wrote:
>
>
> I have been using snapshots on cephfs since luminous, 1xfs and
> 1xactivemds and used an rsync on it for backup.
> Under luminious I did not encounter any problems with this setup. I
> think I was even snapshotting user dirs every 7 days having thousands of
> snapshots (which I later heard, is not recommend and one should stick
> below 400 or so?)
>
> When upgrading to nautilus, this snapshot feature was disabled (that is
> default in the upgrade). Did not notice nor expected this. When I
> enabled again snapshotting. I had problems with the rsync backup. So I
> reverted back to the slower ceph-fuse mount. I also brought down the
> snapshots to 36, but I am still stuck with "clients failing to respond
> to capability release", "clients failing to respond to cache pressure"
> and "MDSs report slow requests"
> Which is odd, since my use did not change since luminous.
>
please open tracker tickets for these
> All in all is fine, but what I do not like is, that such a thing can
> happen between upgrades.
>
>
>
>
> -----Original Message-----
> From: Stolte, Felix [mailto:f.stolte@fz-juelich.de]
> Sent: 06 May 2020 09:09
> To: ceph-users(a)ceph.io
> Subject: [ceph-users] Cephfs snapshots in Nautilus
>
> Hi Folks,
>
>
>
> I really like to use snapshots on cephfs, but even on octopus release
> snapshots are still marked as an experimental feature. Is anyone using
> snapshots in production environments? Which issues did you encounter? Do
> I risk a corrupted filesystem or just non-working snapshots?
>
>
>
> We run a single fs with one active mds.
>
>
>
> Best regards
>
> Felix
>
> ------------------------------------------------------------------------
> -------------
>
> ------------------------------------------------------------------------
> -------------
>
> Forschungszentrum Juelich GmbH
>
> 52425 Juelich
>
> Sitz der Gesellschaft: Juelich
>
> Eingetragen im Handelsregister des Amtsgerichts Dueren Nr. HR B 3498
>
> Vorsitzender des Aufsichtsrats: MinDir Dr. Karl Eugen Huthmacher
>
> Geschaeftsfuehrung: Prof. Dr.-Ing. Wolfgang Marquardt (Vorsitzender),
>
> Karsten Beneke (stellv. Vorsitzender), Prof. Dr.-Ing. Harald Bolt
>
> ------------------------------------------------------------------------
> -------------
>
> ------------------------------------------------------------------------
> -------------
>
>
>
> _______________________________________________
> ceph-users mailing list -- ceph-users(a)ceph.io
> To unsubscribe send an email to ceph-users-leave(a)ceph.io