On NUMA implementations of Cavium ThunderX, node1 memory addresses start with
bit 40 set to 1, and therefore requires >= 41 bits of VA. A side effect of
this is an increase from 3 to 4 page table levels.
Note: The alternative way to increase VA bits w/o moving to 4 level page
tables would be to adopt a larger page size. That would have a far more
significant impact to memory usage, and is known to have issue with current
Ubuntu userspace:
UBUNTU: SAUCE: (namespace) ext4: Add module parameter to enable user namespace mounts
This is still an experimental feature, so disable it by default
and allow it only when the system administrator supplies the
userns_mounts=true module parameter.
UBUNTU: SAUCE: (namespace) ext4: Add support for unprivileged mounts from user namespaces
Support unprivileged mounting of ext4 volumes from user
namespaces. This requires the following changes:
- Perform all uid, gid, and projid conversions to/from disk
relative to s_user_ns. In many cases this will already be
handled by the vfs helper functions. This also requires
updates to handle cases where ids may not map into s_user_ns.
A new helper, projid_valid_eq(), is added to help with this.
- Update most capability checks to check for capabilities in
s_user_ns rather than init_user_ns. These mostly reflect
changes to the filesystem that a user in s_user_ns could
already make externally by virtue of having write access to
the backing device.
- Restrict unsafe options in either the mount options or the
ext4 superblock. Currently the only concerning option is
errors=panic, and this is made to require CAP_SYS_ADMIN in
init_user_ns.
- Verify that unprivileged users have the required access to the
journal device at the path passed via the journal_path mount
option.
Note that for the journal_path and the journal_dev mount
options, and for external journal devices specified in the
ext4 superblock, devcgroup restrictions will be enforced by
__blkdev_get(), (via blkdev_get_by_dev()), ensuring that the
user has been granted appropriate access to the block device.
- Set the FS_USERNS_MOUNT flag on the filesystem types supported
by ext4.
sysfs attributes for ext4 mounts remain writable only by real
root.
UBUNTU: SAUCE: (namespace) fuse: Restrict allow_other to the superblock's namespace or a descendant
Unprivileged users are normally restricted from mounting with the
allow_other option by system policy, but this could be bypassed
for a mount done with user namespace root permissions. In such
cases allow_other should not allow users outside the userns
to access the mount as doing so would give the unprivileged user
the ability to manipulate processes it would otherwise be unable
to manipulate. Restrict allow_other to apply to users in the same
userns used at mount or a descendant of that namespace. Also
export current_in_userns() for use by fuse when built as a
module.
UBUNTU: SAUCE: (namespace) fuse: Support fuse filesystems outside of init_user_ns
In order to support mounts from namespaces other than
init_user_ns, fuse must translate uids and gids to/from the
userns of the process servicing requests on /dev/fuse. This
patch does that, with a couple of restrictions on the namespace:
- The userns for the fuse connection is fixed to the namespace
from which /dev/fuse is opened.
- The namespace must be the same as s_user_ns.
These restrictions simplify the implementation by avoiding the
need to pass around userns references and by allowing fuse to
rely on the checks in inode_change_ok for ownership changes.
Either restriction could be relaxed in the future if needed.
For cuse the namespace used for the connection is also simply
current_user_ns() at the time /dev/cuse is opened.
UBUNTU: SAUCE: (namespace) fuse: Add support for pid namespaces
When the userspace process servicing fuse requests is running in
a pid namespace then pids passed via the fuse fd are not being
translated into that process' namespace. Translation is necessary
for the pid to be useful to that process.
Since no use case currently exists for changing namespaces all
translations can be done relative to the pid namespace in use
when fuse_conn_init() is called. For fuse this translates to
mount time, and for cuse this is when /dev/cuse is opened. IO for
this connection from another namespace will return errors.
Requests from processes whose pid cannot be translated into the
target namespace will have a value of 0 for in.h.pid.
File locking changes based on previous work done by Eric
Biederman.
UBUNTU: SAUCE: (namespace) fs: Allow CAP_SYS_ADMIN in s_user_ns to freeze and thaw filesystems
The user in control of a super block should be allowed to freeze
and thaw it. Relax the restrictions on the FIFREEZE and FITHAW
ioctls to require CAP_SYS_ADMIN in s_user_ns.
Signed-off-by: Seth Forshee <email address hidden>
Signed-off-by: Eric W. Biederman <email address hidden>
Signed-off-by: Tim Gardner <email address hidden>
UBUNTU: SAUCE: (namespace) capabilities: Allow privileged user in s_user_ns to set security.* xattrs
A privileged user in s_user_ns will generally have the ability to
manipulate the backing store and insert security.* xattrs into
the filesystem directly. Therefore the kernel must be prepared to
handle these xattrs from unprivileged mounts, and it makes little
sense for commoncap to prevent writing these xattrs to the
filesystem. The capability and LSM code have already been updated
to appropriately handle xattrs from unprivileged mounts, so it
is safe to loosen this restriction on setting xattrs.
The exception to this logic is that writing xattrs to a mounted
filesystem may also cause the LSM inode_post_setxattr or
inode_setsecurity callbacks to be invoked. SELinux will deny the
xattr update by virtue of applying mountpoint labeling to
unprivileged userns mounts, and Smack will deny the writes for
any user without global CAP_MAC_ADMIN, so loosening the
capability check in commoncap is safe in this respect as well.
Signed-off-by: Seth Forshee <email address hidden>
Acked-by: Serge Hallyn <email address hidden>
Signed-off-by: Eric W. Biederman <email address hidden>
Signed-off-by: Tim Gardner <email address hidden>