Hello,

We've been trying to get Ceph to work on IBM Z, a big-endian
platform, and have run into several serious issues related to the
endian-conversion code.

The main issue we've been seeing is that while the old-style
encode/decode machinery in include/encoding.h automatically
byte-swaps all integers on big-endian systems (to ensure the
serialized format is always little-endian), the *new-style*
machinery in include/denc.h does not.
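
To make the difference concrete, here is a self-contained
illustration (not Ceph code, just the two serialization behaviors
side by side; the function names are made up):

#include <cstdint>
#include <cstdio>
#include <cstring>

// old-style behavior: explicitly emit little-endian bytes,
// independent of the host byte order
static void encode_le32(uint32_t v, unsigned char out[4]) {
  out[0] = v & 0xff;
  out[1] = (v >> 8) & 0xff;
  out[2] = (v >> 16) & 0xff;
  out[3] = (v >> 24) & 0xff;
}

// new-style behavior as we observe it: the integer is copied out
// in host byte order, which is big-endian on IBM Z
static void encode_raw(uint32_t v, unsigned char out[4]) {
  memcpy(out, &v, sizeof v);
}

int main() {
  unsigned char le[4], raw[4];
  encode_le32(0x01020304, le);  // 04 03 02 01 on every host
  encode_raw(0x01020304, raw);  // 04 03 02 01 on x86, 01 02 03 04 on s390x
  printf("le:  %02x %02x %02x %02x\n", le[0], le[1], le[2], le[3]);
  printf("raw: %02x %02x %02x %02x\n", raw[0], raw[1], raw[2], raw[3]);
  return 0;
}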

This seemed confusing at first, since denc.h contains quite a bit
of code that appears intended to perform exactly that conversion,
e.g.

template<typename T>
struct ExtType<T, std::enable_if_t<std::is_same_v<T, int32_t> ||
                                   std::is_same_v<T, uint32_t>>> {
  using type = __le32;
};

However, it turns out that at this point __le32 is just an alias
for __u32, so this machinery doesn't actually perform any byte-swap
at all.
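
For reference, in a user-space build __le32 ultimately comes from
linux/types.h, where (stripped of the sparse-only __bitwise
annotation, which expands to nothing in a normal build) the
definitions boil down to:

/* effective shape in a normal (non-sparse) user-space build:
   __le32 is a plain 32-bit integer with no swap semantics */
typedef unsigned int  __u32;
typedef __u32         __le32;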

Looking at the old code in encoding.h, I notice that it works
similarly, but uses ceph_le32 instead of __le32. The former is a
C++ class that actually performs the byte-swap on access.
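
Roughly, the idea behind ceph_le32 is the following (a simplified
sketch, not the exact Ceph class; swab32_if_be is a made-up stand-in
for whatever swap primitive is actually used):

#include <cstdint>

// hypothetical helper: byte-swap on big-endian hosts, no-op otherwise
static inline uint32_t swab32_if_be(uint32_t v) {
#if __BYTE_ORDER__ == __ORDER_BIG_ENDIAN__
  return __builtin_bswap32(v);
#else
  return v;
#endif
}

// the stored representation is always little-endian; every
// assignment and read transparently swaps on big-endian hosts
struct ceph_le32_sketch {
  uint32_t v;  // always little-endian bytes in memory
  ceph_le32_sketch& operator=(uint32_t nv) {
    v = swab32_if_be(nv);
    return *this;
  }
  operator uint32_t() const { return swab32_if_be(v); }
} __attribute__((packed));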

Even more confusing, there is this code in include/types.h:

// temporarily remap __le* to ceph_le* for benefit of shared kernel/userland headers
#define __le16 ceph_le16
#define __le32 ceph_le32
#define __le64 ceph_le64
#include "ceph_fs.h"
#include "ceph_frag.h"
#include "rbd_types.h"
#undef __le16
#undef __le32
#undef __le64

which remaps __le32 to ceph_le32 -- but only while those three
shared headers are being included, and the remap is no longer in
effect at the point where denc.h is processed.

So it would appear that the usage of __le32 in denc.h is incorrect,
and this code should be using ceph_le32 instead. Is this right?
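
If so, the naive fix would presumably look something like this
(sketch only, assuming ceph_le32 is visible to denc.h at that
point):

template<typename T>
struct ExtType<T, std::enable_if_t<std::is_same_v<T, int32_t> ||
                                   std::is_same_v<T, uint32_t>>> {
  // byte-swapping wrapper instead of the plain __u32 that
  // __le32 currently aliases
  using type = ceph_le32;
};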

But even so, grepping for __le32 throughout the code base turns up
quite a few additional places where it is used, most of which
likewise appear to assume that byte-swaps happen automatically. In
addition, there appear to be some places where e.g. ceph_fs.h is
included directly, without going via types.h -- and in those
places, we suddenly no longer get the byte-swaps at all.
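
To make that hazard concrete (struct and field names here are made
up, standing in for the real ones in ceph_fs.h):

/* hypothetical stand-in for a wire struct from ceph_fs.h */
struct example_wire_header {
  __le32 op;  /* included via types.h: really ceph_le32, swaps on access
                 included directly:    plain __u32, stays in host order */
};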


Now I was wondering whether the best way forward might be to simply
define __le32 as ceph_le32 whenever user-space code is compiled. But
then I noticed that it used to work that way, and that this was
deliberately changed by the following commit back in 2010:

commit 737b5043576153817a6b4195b292672585df10d3
Author: Sage Weil <sage@newdream.net>
Date: Fri May 7 13:45:00 2010 -0700

endian: simplify __le* type hackery

Instead of preventing linux/types.h from being included, instead name
our types ceph_le*, and remap using #define _only_ when including the
shared kernel/userspace headers.


So I'm a bit at a loss as to how all this is supposed to work. Any
suggestions would be welcome -- we'd be willing to implement whatever
is needed, but would appreciate some guidance as to what the solution
should look like.

Bye,
Ulrich