Hey all, we have been testing our product against Ceph through its S3
interface, and we found some behavior we want to run by you.
Attached is a Python script using boto3 which, when run against two
different Ceph deployments (an Octopus ceph nano instance and our
production Nautilus 14.2.11 deployment), appears to reproduce a strange
issue.
After running for a while, a recently uploaded file permanently disappears
from list_objects results. The file is still returned by get_object if you
know its exact name, but it never shows up in list_objects again.
There are more details about the experiment in the attached Python script.
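
To make the symptom concrete, the check amounts to something like the
sketch below. This is not the attached script, only a minimal illustration:
the helper names list_all_keys and find_unlisted_keys are made up for this
example, and client is assumed to be a boto3 S3 client (or anything with
the same get_paginator interface) pointed at the Ceph S3 endpoint.

```python
def list_all_keys(client, bucket):
    """Collect every key returned by paginated list_objects calls."""
    keys = set()
    # Assumption: `client` behaves like a boto3 S3 client.
    paginator = client.get_paginator("list_objects")
    for page in paginator.paginate(Bucket=bucket):
        # "Contents" is absent from the response when a page has no objects.
        for entry in page.get("Contents", []):
            keys.add(entry["Key"])
    return keys


def find_unlisted_keys(client, bucket, uploaded_keys):
    """Return the uploaded keys that the bucket listing does not report."""
    return sorted(set(uploaded_keys) - list_all_keys(client, bucket))
```

Any key this returns that can still be fetched by name with get_object
matches the behavior described above.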
We ran the experiment with debug logging enabled and saw the trace message
    RGWRados::cls_bucket_list_ordered: skipping <filename>
in the same millisecond that the file was PUT.
Reading the code, this message is logged when a call to check_disk_state
returns -ENOENT, which happens in this branch:
    if (!list_state.is_delete_marker() && !astate->exists) {
      /* object doesn't exist right now -- hopefully because it's
       * marked as !exists and got deleted */
      if (list_state.exists) {
        /* FIXME: what should happen now? Work out if there are any
         * non-bad ways this could happen (there probably are, but annoying
         * to handle!) */
      }
      // encode a suggested removal of that key
      list_state.ver.epoch = io_ctx.get_last_version();
      list_state.ver.pool = io_ctx.get_id();
      cls_rgw_encode_suggestion(CEPH_RGW_REMOVE, list_state,
                                suggested_updates);
      return -ENOENT;
    }
It seems like this might be a race between PUT and list_objects in which
some of the object's metadata is apparently deleted... the FIXME is at
least a little suspicious :).
I would love to know what's going on here, and whether there is a fix or
workaround we can apply to prevent this behavior. Let me know if there is
any other information we can provide.
Thank you so much!
Best,
-Joseph Victor