Hi everyone!
I have spent the past week or so trying to get ceph-iscsi working on the Octopus release. Even getting a single gateway node working would be a major victory in this battle, but so far victory has proven elusive.
My setup: a pair of Dell Optiplex 7010 desktops, each with 16 GB of memory, one boot drive (USB 3), and three SATA drives (500 GB SSHD). No RAID controllers anywhere. Yes, I know that three nodes is the recommended minimum for a production system - this isn't production (this is just seeing if the darned thing will even work).
I am using CentOS 8.1.1911 for the OS (4.18.0 kernel) with a minimal installation (no X Window System) and a single gigabit Ethernet port per node. I have two MONs and two MGRs installed and working, and a total of six OSDs working. I created the RBD pool (named "rbd" per the published instructions) with 256 PGs initially (the autoscaler decided that 32 was a better choice - whatever). The cluster is green and all six OSDs are green (up and in). All deployment is via cephadm and all containers run via podman.
Here is where things start to fall apart.
I was able to find RPM packages for targetcli and python-rtslib (called python3-rtslib), but not for tcmu-runner or ceph-iscsi. OK, no big deal - time to head over to the manual install guide.
I was able to build and install tcmu-runner, and it appears to be running (systemctl reports it as active), so that part seems OK.
The problem is getting rbd-target-gw and rbd-target-api to work. They build fine and I can register them with systemd, but both fail to start (systemctl start rbd-target-gw or systemctl start rbd-target-api). journalctl -xe gives no hint as to why they failed (only that they did), and both /var/log/rbd-target-api/ and /var/log/rbd-target-gw/ are empty (no files at all).
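Since systemd and the log directories are silent, my next step is to run the daemons in the foreground to try to get a traceback (a sketch; the paths are my assumption about where the manual install puts things):

# run the daemons directly instead of via systemd
/usr/bin/rbd-target-api
/usr/bin/rbd-target-gw

Both are Python programs, so I would expect an import error or a missing /etc/ceph/iscsi-gateway.cfg to show up as a traceback on stderr this way.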
HELP!!
Now, some possibly germane questions:
1) are any other Ceph services required for ceph-iscsi to work like RADOSgw?
2) since there are apparently no packages available for ceph-iscsi, can anything be inferred about the production-readiness of the subsystem?
3) are there any known errata or missing steps in the instructions for getting ceph-iscsi to work?
Thanks!
Ron Gage
Hello List,
I did:
root@ceph01:~# ceph cephadm set-ssh-config -i /tmp/ssh_conf
root@ceph01:~# cat /tmp/ssh_conf
Host *
User root
StrictHostKeyChecking no
UserKnownHostsFile /dev/null
root@ceph01:~# ceph config-key set mgr/cephadm/ssh_identity_key -i /root/.ssh/id_rsa
set mgr/cephadm/ssh_identity_key
root@ceph01:~# ceph config-key set mgr/cephadm/ssh_identity_pub -i /root/.ssh/id_rsa.pub
set mgr/cephadm/ssh_identity_pub
But I get:
root@ceph01:~# ceph orch host add ceph01 10.10.1.1
Error ENOENT: Failed to connect to ceph01 (10.10.1.1). Check that the host is reachable and accepts connections using the cephadm SSH key
root@ceph01:~# ceph config-key get mgr/cephadm/ssh_identity_key
(this shows my private key)
How can I debug this?
root@ceph01:~# ssh 10.10.1.1
or
root@ceph01:~# ssh ceph01
both work without a password prompt or key error.
I am using 15.2.0.
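In case it helps, here is what I plan to try next, based on the cephadm troubleshooting docs (a sketch; I am assuming these commands behave the same on 15.2.0):

# verify what the mgr actually has stored
root@ceph01:~# ceph cephadm get-ssh-config > /tmp/cephadm_ssh_config
root@ceph01:~# ceph config-key get mgr/cephadm/ssh_identity_key > /tmp/cephadm_key
root@ceph01:~# chmod 0600 /tmp/cephadm_key
# connect exactly the way the mgr would
root@ceph01:~# ssh -F /tmp/cephadm_ssh_config -i /tmp/cephadm_key root@10.10.1.1
# raise cephadm logging and watch the cluster log
root@ceph01:~# ceph config set mgr mgr/cephadm/log_to_cluster_level debug
root@ceph01:~# ceph -W cephadm --watch-debug

One thing I am unsure about: does the mgr re-read the identity key after `ceph config-key set`, or does it need a `ceph mgr fail` to pick it up?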
Thanks,
Michael
Hi all,
a few weeks ago, a number of virtual Ceph Developer Summit meetings took
place as a replacement for the in-person summit that was planned as part
of Cephalocon in Seoul: https://pad.ceph.com/p/cds-pacific
The Ceph Dashboard team also participated in these and held three video
conference meetings to lay out our plans for the Pacific release.
For details, please take a look at our notes at this Etherpad:
https://pad.ceph.com/p/ceph-dashboard-pacific-priorities
We tried to identify a few "themes", outlining individual tasks which we
keep track of in the tracker.ceph.com bug tracker. The tracker issues
should be used for discussing and defining the tasks at hand.
A key theme for the upcoming Ceph Pacific release is to further deepen and enhance the integration with cephadm and the orchestrator.
For Ceph Octopus, we focused on the most common day-2 operation, OSD management; going forward, we would also like to support the deployment and management of all other Ceph-related services that can be rolled out via cephadm and the orchestrator.
In a hopefully not-so-distant future, we would like to be able to use the dashboard as a kind of "graphical installer" that guides the user through the entire deployment of a Ceph cluster from scratch (well, almost: starting from an initial MON+MGR deployment).
Another key theme is closing feature gaps: the various services of a Ceph cluster, such as RBD and RGW, are constantly evolving and gaining new features, so we are always trying to catch up with the latest developments there.
We're also looking into enhancing our monitoring/alerting support and
integration with Grafana and Prometheus.
Last but not least, we always try to enhance and improve existing
functionality and work on better usability and user experience. This
also includes bigger refactoring work or updating key components that
the dashboard depends on.
As always, we would like the dashboard to be an application that Ceph
administrators like and actually *want* to use to perform their jobs, so
we are very keen on getting your feedback here!
If there is anything you are missing or if you find any part of the
dashboard to be confusing or not helpful, we'd like to know about it!
Please get in touch with us to share your impressions and ideas. The
best way to do this is to join the #ceph-dashboard IRC channel on OFTC
or by filing a bug report via the tracker:
https://tracker.ceph.com/projects/mgr/issues/new
Thank you,
Lenz
--
SUSE Software Solutions Germany GmbH - Maxfeldstr. 5 - 90409 Nuernberg
GF: Felix Imendörffer, HRB 36809 (AG Nürnberg)
Hello everybody
In Octopus there are some interesting-looking features, so I tried upgrading my CentOS 7 test nodes according to:
https://docs.ceph.com/docs/master/releases/octopus/
Everything went fine and the cluster is healthy.
To test out the new dashboard functions, I tried to install it, but there are missing dependencies:
yum install ceph-mgr-dashboard.noarch
.....
--> Finished Dependency Resolution
Error: Package: 2:ceph-mgr-dashboard-15.2.1-0.el7.noarch (Ceph-noarch)
Requires: python3-routes
Error: Package: 2:ceph-mgr-dashboard-15.2.1-0.el7.noarch (Ceph-noarch)
Requires: python3-jwt
Error: Package: 2:ceph-mgr-dashboard-15.2.1-0.el7.noarch (Ceph-noarch)
Requires: python3-cherrypy
Installing them with pip3 of course makes no difference, because those are yum dependencies.
Does anyone know a workaround?
Do I have to upgrade to Centos 8 for this to work?
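For reference, this is the check I would run to see whether the missing dependencies exist anywhere for el7 (a sketch; it assumes EPEL is configured on the node):

yum --enablerepo=epel list python3-routes python3-jwt python3-cherrypy

If none of the three is packaged for el7 at all, that would suggest the dashboard now depends on a Python 3 stack that only CentOS 8 ships.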
Thanks in advance,
Simon
Hi
@Eric Ivancich my cluster has some history and trash gathered over the years. Most of it (terabytes) is from https://tracker.ceph.com/issues/43756.
I was able to reproduce the problem in my lab, and it is definitely connected with https://tracker.ceph.com/issues/43756. On a version older than 14.2.8, you would apply a lifecycle policy that tries to abort interrupted multiparts older than X days, and when the bucket index is sharded, that is how the broken/un-cancellable MPs are born.
To test it I use s3cmd. My lab cluster was upgraded to 14.2.8 to make sure the new version does not do the cleanup automagically.
Here is my procedure (I truncated my personal data):
s3cmd --access_key= --secret_key= --host= --host-bucket= multipart s3://kate-mp-issue
s3://kate-mp-issue/
Initiated                 Path                                            Id
2020-04-06T07:48:55.323Z  s3://kate-mp-issue/bottest_20200406T074855.img  2~-9SKkHzGKXYX_zNdHNs_S8RY9hWjISS

s3cmd --access_key= --secret_key= --host= --host-bucket= abortmp s3://kate-mp-issue/bottest_20200406T074855.img 2~-9SKkHzGKXYX_zNdHNs_S8RY9hWjISS
ERROR: S3 error: 404 (NoSuchUpload)
RGW logs:
2020-04-29 07:24:23.126 7fb21b819700 1 ====== starting new request req=0x7fb21b8128d0 =====
2020-04-29 07:24:23.126 7fb21b819700 1 ====== req done req=0x7fb21b8128d0 op status=0 http_status=200 latency=0s ======
2020-04-29 07:24:23.126 7fb21b819700 1 civetweb: 0x381a000: IP - - [29/Apr/2020:07:24:22 +0000] "GET /kate-mp-issue/?location HTTP/1.1" 200 275 - -
2020-04-29 07:24:23.202 7fb21b819700 1 ====== starting new request req=0x7fb21b8128d0 =====
2020-04-29 07:24:23.202 7fb21b819700 1 ====== req done req=0x7fb21b8128d0 op status=-2009 http_status=404 latency=0s ======
2020-04-29 07:24:23.202 7fb21b819700 1 civetweb: 0x381a000: IP - - [29/Apr/2020:07:24:22 +0000] "DELETE /kate-mp-issue/bottest_20200406T074855.img?uploadId=2~-9SKkHzGKXYX_zNdHNs_S8RY9hWjISS HTTP/1.1" 404 439 - -
So basically I need to remove those ghost entries from the list of interrupted multiparts and clean up the objects that are left over.
As far as I understand it, I would need to (1) list every object in the pool with `rados ls`, (2) compare that output with `radosgw-admin bi list` (run for every bucket) and (3) with the new `radosgw-admin radoslist` command, then remove the objects that appear in (1) but not in (2) or (3), plus clean up the interrupted-multipart list. Is that correct?
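For the comparison step, this is roughly what I have in mind (a sketch; the data pool name and the exact radoslist subcommand spelling are assumptions on my side):

rados -p default.rgw.buckets.data ls | sort -u > rados-objects.txt
radosgw-admin bucket radoslist --bucket=kate-mp-issue | sort -u > expected-objects.txt
# objects present in the pool but not referenced by any bucket index
comm -23 rados-objects.txt expected-objects.txt > orphan-candidates.txt

I would of course review orphan-candidates.txt very carefully before feeding anything to `rados rm`.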
@EDH - Manuel Rios <mriosfer(a)easydatahost.com> Is that your method also?
I really need to clean up the terabytes of leftovers, because my prod cluster is getting full, and buying anything right now is not an option (harsh times due to the pandemic).
Kind regards / Pozdrawiam,
Katarzyna Myrek
On Tue, 28 Apr 2020 at 19:45, EDH - Manuel Rios <mriosfer(a)easydatahost.com> wrote:
> I'm pretty sure you hit the same issue we already reported:
>
> https://tracker.ceph.com/issues/43756
>
> Garbage upon garbage stored in our OSDs, without us being able to clean it
> up, wasting a lot of space.
>
> As you can see it's solved in the new versions, but the latest version
> doesn't have any "scrub" or similar mechanism to fix the garbage generated
> by past versions.
>
> As a result, even big companies have RGW platforms with tons of TB wasted.
>
> Eric, is there a way to ask the RGW team to develop a mechanism to clean
> our RGW clusters, something like an "rgw bucket scrub"?
>
> I talked with cbodley and he explained how to do it manually, but the
> process is extremely complex.
>
> We already calculated that at least 25% of our RGW cluster is garbage
> (100 TB), and our options right now are:
>
> - Deploy a new cluster and move RGW users one by one, buckets and all, via
> an external copy, hoping that on the latest Nautilus this does not happen
> again (not a useful option, and not transparent)
> - Buy disk after disk while waiting for a solution in the form of an
> external tool (not sure we want to continue this way)
> - Hire external developers with Ceph knowledge to build a private tool for
> this (developers with Ceph core/RGW knowledge will not be easy to find)
>
> For reference, we are on ceph version 14.2.8.
>
>
>
> -----Original Message-----
> From: Eric Ivancich <ivancich(a)redhat.com>
> Sent: Tuesday, 28 April 2020 18:39
> To: Katarzyna Myrek <katarzyna(a)myrek.pl>
> Cc: ceph-users(a)ceph.io
> Subject: [ceph-users] Re: RGW and the orphans
>
> Hi Katarzyna,
>
> Incomplete multipart uploads are not considered orphans.
>
> With respect to the 404s: which version of Ceph are you running? What
> tooling are you using to list and cancel? Can you provide a console
> transcript of the listing and cancelling?
>
> Thanks,
>
> Eric
>
> --
> J. Eric Ivancich
> he / him / his
> Red Hat Storage
> Ann Arbor, Michigan, USA
>
> > On Apr 28, 2020, at 2:57 AM, Katarzyna Myrek <katarzyna(a)myrek.pl> wrote:
> >
> > Hi all
> >
> > I am afraid that there is even more trash out there - running
> > rgw-orphan-list does not find everything. For example, I still have
> > broken multiparts: when I run `s3cmd multipart` I get a list of
> > pending/interrupted multiparts, and when I try to cancel such a
> > multipart I get a 404.
> >
> > Does anyone have a method for cleaning up such things? Or even a list
> > of tasks which should be run regularly on clusters with RGW?
> >
> >
> > Kind regards / Pozdrawiam,
> > Katarzyna Myrek
> >
> >
> > On Tue, 21 Apr 2020 at 09:57, Janne Johansson <icepic.dz(a)gmail.com> wrote:
> >>
> >> On Tue, 21 Apr 2020 at 07:29, Eric Ivancich <ivancich(a)redhat.com> wrote:
> >>>
> >>> Please be certain to read the associated docs in both:
> >>>
> >>> doc/radosgw/orphans.rst
> >>> doc/man/8/rgw-orphan-list.rst
> >>>
> >>> so you understand the limitations and potential pitfalls. Generally
> >>> this tool will be a precursor to a large delete job, so understanding
> >>> what's going on is important.
> >>> I look forward to your report! And please feel free to post additional
> >>> questions in this forum.
> >>>
> >>
> >> Where are those?
> >> https://github.com/ceph/ceph/tree/master/doc/man/8
> >> https://github.com/ceph/ceph/tree/master/doc/radosgw
> >> don't seem to contain them in master, nor in the nautilus or octopus
> >> branches.
> >>
> >> This whole issue feels weird: rgw (or its users) produces dead
> >> fragments of multiparts, orphans and whatnot that need cleaning up
> >> sooner or later, and the info we get is that the old cleaner isn't
> >> meant to be used, it hasn't worked for a long while, there is no fixed
> >> version, and perhaps there is a script somewhere, with caveats. This
> >> (slightly frustrated) issue is of course on top of
> >> "bi trim"
> >> "bilog trim"
> >> "mdlog trim"
> >> "usage trim"
> >> "datalog trim"
> >> "sync error trim"
> >> "gc process"
> >> "reshard stale-instances rm"
> >> which we rgw admins are supposed to know when to run, how often, and
> >> what their quirks are.
> >>
> >> "Docs" for rgw means that "datalog trim" --help says "trims the
> >> datalog", and the long version on the web would be "this operation
> >> trims the datalog" or something else that adds nothing more.
> >>
> >>
> >>
> >>
> >> --
> >>
> >> "Grumpy cat was an optimist"
> >>
> >
>
Hello,
I have a problem with the radosgw service: the actual disk usage (ceph df shows 28 TB used) is far higher than what radosgw-admin bucket stats reports (9 TB). I have tried to get to the bottom of the problem, but no one seems to be able to help. As a last resort I will attempt to copy the buckets, rename them, and remove the old buckets.
What is the best way of doing this (probably at a high level) so that the copy process doesn't carry the wasted space over to the new buckets?
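In case it helps to be concrete, this is roughly the copy I had in mind (a sketch; the bucket names are placeholders, and I am assuming s3cmd's bucket-to-bucket copy behaves on RGW the way it does on S3):

s3cmd mb s3://mybucket-new
s3cmd cp --recursive s3://mybucket s3://mybucket-new
# after verifying the contents of the new bucket:
s3cmd rb --recursive --force s3://mybucket

My hope is that only objects visible in the bucket index get copied, so any orphaned RADOS objects stay behind with the old bucket.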
Cheers
Andrei
Hi,
I upgraded from 13.2.5 to 14.2.6 last week and am now seeing significantly higher latency on various MDS operations. For example, the 2-minute rate of ceph_mds_server_req_create_latency_sum / ceph_mds_server_req_create_latency_count over an 8-hour window last Monday, prior to the upgrade, averaged 2 ms. Today the same stat shows 869 ms. Other operations, including open, readdir and rmdir, are also taking significantly longer.
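(Concretely, the expression I am graphing, assuming these counters are scraped into Prometheus, is:

rate(ceph_mds_server_req_create_latency_sum[2m]) / rate(ceph_mds_server_req_create_latency_count[2m])

i.e. the average seconds per create op over a 2-minute window.)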
Here's a partial example of an op from dump_ops_in_flight:
{
    "description": "client_request(client.342513090:334359409 create #...)",
    "initiated_at": "2020-04-13 15:30:15.707637",
    "age": 0.19583208099999999,
    "duration": 0.19767626299999999,
    "type_data": {
        "flag_point": "submit entry: journal_and_reply",
        "reqid": "client.342513090:334359409",
        "op_type": "client_request",
        "client_info": {
            "client": "client.342513090",
            "tid": 334359409
        },
        "events": [
            {
                "time": "2020-04-13 15:30:15.707637",
                "event": "initiated"
            },
            {
                "time": "2020-04-13 15:30:15.707637",
                "event": "header_read"
            },
            {
                "time": "2020-04-13 15:30:15.707638",
                "event": "throttled"
            },
            {
                "time": "2020-04-13 15:30:15.707640",
                "event": "all_read"
            },
            {
                "time": "2020-04-13 15:30:15.781935",
                "event": "dispatched"
            },
            {
                "time": "2020-04-13 15:30:15.785086",
                "event": "acquired locks"
            },
            {
                "time": "2020-04-13 15:30:15.785507",
                "event": "early_replied"
            },
            {
                "time": "2020-04-13 15:30:15.785508",
                "event": "submit entry: journal_and_reply"
            }
        ]
    }
}
This, along with every other 'create' op I've seen, has a 50 ms+ delay between the all_read and dispatched events - what is happening during this time? I'm not sure what I'm looking for in the MDS debug logs.
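For reference, this is how I have been poking at it so far (a sketch; "mds.a" is a placeholder for our active daemon's name):

# on the active MDS host, raise MDS debugging temporarily
ceph daemon mds.a config set debug_mds 10
# capture slow ops as they happen
ceph daemon mds.a dump_ops_in_flight
ceph daemon mds.a dump_historic_ops
# drop debugging back to the default afterwards
ceph daemon mds.a config set debug_mds 1/5

but I don't know which log lines would explain the gap between all_read and dispatched.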
We have a mix of clients from 12.2.x through 14.2.8; my plan was to upgrade the pre-Nautilus clients this week. There is only a single MDS rank, with one standby. Other functions of this cluster - RBD and RGW - do not appear to be impacted, so this looks limited to the MDS. I did not observe this behavior after upgrading a dev cluster last month.
Has anyone seen anything similar? Thanks for any assistance!
Josh
Hi Ceph folks,
I am relatively new to Ceph clusters and I hope I can quickly get some help here.
I would like to recover files from a CephFS data pool. Someone wrote that inode linkage and file names are stored in the omap data of objects in the metadata pool.
I can't find any information about the structure of that omap data that would help me write, for example, a script to retrieve filenames and the related objects, so that I can then use "rados get" to retrieve those files.
Is there any working script that traverses the whole metadata pool and maps file names to the objects in the data pool?
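For what it's worth, here is my rough understanding so far (assumptions on my side: pools named cephfs_metadata and cephfs_data, and that I have the on-disk layout right - corrections welcome). Each directory is an object in the metadata pool named <inode-in-hex>.<fragment>, and its omap keys are the directory entries, one "<filename>_head" key per file:

# the filesystem root directory is inode 0x1
rados -p cephfs_metadata listomapkeys 1.00000000
# the omap value of each entry encodes the dentry, including the file's
# inode number; the file's data then lives in data-pool objects named
# <inode-in-hex>.<block-number>, e.g.:
rados -p cephfs_data get 10000000000.00000000 ./part0

Decoding the omap values to extract the inode numbers is the part I have no script for.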
/Ed
Hi,
is there a way to have Ceph synchronize one specific bucket across the available datacenters?
I have only found the multisite setup, but that syncs the complete cluster, which amounts to a failover solution. I need it for just one bucket.
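From what I have read so far, one possible approach (an untested sketch on my side) would be a normal multisite setup where sync is then switched off for every bucket except the one that matters:

radosgw-admin bucket sync disable --bucket=some-other-bucket
radosgw-admin bucket sync enable --bucket=the-one-bucket

but I don't know whether that is the intended way, or whether per-bucket sync policies are the better path.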
Thank you
Hello,
running Ceph Nautilus 14.2.4, we encountered this documented dynamic resharding issue:
http://lists.ceph.com/pipermail/ceph-users-ceph.com/2019-November/037531.ht…
We disabled dynamic resharding in the configuration, and attempted to reshard to 1 shard:
# radosgw-admin reshard add --bucket files --num-shards 1 --yes-i-really-mean-it
However, it achieved nothing, and the bucket is now stuck in resharding status. It is impossible to clear the resharding flag (I have tried the bucket check --fix operation, to no avail).
# radosgw-admin reshard cancel --bucket=files
2020-04-28 11:47:18.721 7fd213b969c0 -1 ERROR: failed to remove entry from reshard log, oid=reshard.0000000000 tenant= bucket=files
# radosgw-admin bucket reshard --bucket files --num-shards 1
ERROR: the bucket is currently undergoing resharding and cannot be added to the reshard list at this time
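For completeness, these are the commands I know of for inspecting the stuck state (a sketch; I am assuming they behave the same on 14.2.4):

# per-shard resharding status of the bucket
radosgw-admin reshard status --bucket=files
# pending reshard operations
radosgw-admin reshard list

Is there a supported way to clear the flag, short of hand-editing the bucket instance metadata with radosgw-admin metadata get/put?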