Reference: ContainerInterface
Building an image with systemd
FROM debian:11
ARG DEBIAN_MIRROR=mirrors.tuna.tsinghua.edu.cn
# for systemd
VOLUME ["/sys/fs/cgroup","/run","/run/lock","/tmp"]
ENV container=docker
# for package installation
ENV DEBIAN_FRONTEND=noninteractive TZ=Asia/Shanghai
# install systemd and dbus
RUN set -xe; \
    sed -i -e "s/deb\.debian\.org/${DEBIAN_MIRROR}/" /etc/apt/sources.list; \
    apt-get update; \
    apt-get install -y --no-install-recommends libsystemd0 systemd dbus; \
    apt-get clean; \
    rm -rf /var/lib/apt/lists/* /tmp/* /var/tmp/*; \
    rm -f /var/run/nologin
# remove auto-started services and targets (e.g. getty, systemd-logind)
#RUN rm -vf /lib/systemd/system/multi-user.target.wants/* \
#    /etc/systemd/system/*.wants/* \
#    /lib/systemd/system/local-fs.target.wants/* \
#    /lib/systemd/system/sockets.target.wants/*udev* \
#    /lib/systemd/system/sockets.target.wants/*initctl* \
#    /lib/systemd/system/sysinit.target.wants/systemd-tmpfiles-setup* \
#    /lib/systemd/system/systemd-update-utmp*
RUN set -xe; \
    ln -sf /bin/systemctl /sbin/halt; \
    ln -sf /bin/systemctl /sbin/poweroff; \
    ln -sf /bin/systemctl /sbin/reboot; \
    ln -sf /bin/systemctl /sbin/runlevel; \
    ln -sf /bin/systemctl /sbin/shutdown; \
    ln -sf /bin/systemctl /sbin/telinit; \
    ln -sf /lib/systemd/systemd /sbin/init; \
    true
# stop signal for systemd
STOPSIGNAL SIGRTMIN+3
WORKDIR /root
USER root
# install user utilities
RUN set -xe; \
    apt-get update; \
    apt-get install -y iproute2 procps file vim htop tmux iputils-ping dnsutils curl wget; \
    true
# install sshd
#RUN set -xe; \
#    apt-get install -y --no-install-recommends openssh-server
#EXPOSE 22
CMD ["/sbin/init"]
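Running the systemd image
In general, systemd needs CAP_SYS_ADMIN to run with all of its features; without it, some functionality is unavailable (see the ContainerInterface page referenced at the top):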
People have been asking to be able to run systemd without CAP_SYS_ADMIN and CAP_SYS_MKNOD in the container. This is now supported to some level in systemd, but we recommend against it (see above). If CAP_SYS_ADMIN and CAP_SYS_MKNOD are missing from the containers systemd will now gracefully turn off PrivateTmp=, PrivateNetwork=, ProtectHome=, ProtectSystem= and others, because those caps are requires to implement these options. The services using these settings (which include many of systemd’s own) will hence run in a different, less secure environment when the caps are missing than with them around.
That said, systemd still boots normally without CAP_SYS_ADMIN, and the examples below do not grant that capability.
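To check from inside a running container whether a capability such as CAP_SYS_ADMIN is actually present, the effective capability mask in /proc/self/status can be tested bit by bit. This is a minimal sketch and not part of the original setup: the helper names has_cap and current_cap_eff are made up for illustration, and the bit numbers (CAP_SYS_CHROOT = 18, CAP_SYS_ADMIN = 21, CAP_SYS_BOOT = 22) are taken from linux/capability.h.

```shell
#!/bin/sh
# has_cap MASK BIT: succeed if bit number BIT is set in the hexadecimal
# capability mask MASK (the format used by the Cap* lines in /proc/<pid>/status).
has_cap() {
    [ -n "$1" ] || return 1
    [ $(( (0x$1 >> $2) & 1 )) -eq 1 ]
}

# Print the effective capability mask of the calling process tree.
current_cap_eff() {
    awk '/^CapEff:/ { print $2 }' /proc/self/status
}

# Demo: report whether CAP_SYS_ADMIN (bit 21) is available here.
if [ -r /proc/self/status ]; then
    if has_cap "$(current_cap_eff)" 21; then
        echo "CAP_SYS_ADMIN: present"
    else
        echo "CAP_SYS_ADMIN: missing"
    fi
fi
```

For instance, the mask 00000000a80425fb, which is commonly seen for Docker's default capability set, has bit 18 set but not bit 21.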
The container must provide systemd with a cgroup mount and tmpfs mounts. Older versions of systemd may not be able to work with a read-only cgroup directory.
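Whether the cgroup directory really ended up read-only inside the container can be read from the mount table. The sketch below is only an illustration: mount_has_option is a hypothetical helper, it expects input in /proc/mounts format on stdin, and it reports success if any entry for the directory carries the option (stacked mounts are not distinguished).

```shell
#!/bin/sh
# mount_has_option DIR OPT: read a mount table in /proc/mounts format from
# stdin and succeed if DIR is listed with mount option OPT.
mount_has_option() {
    awk -v dir="$1" -v opt="$2" '
        $2 == dir {
            n = split($4, opts, ",")
            for (i = 1; i <= n; i++)
                if (opts[i] == opt) found = 1
        }
        END { exit !found }
    '
}

# Demo: inspect the cgroup mount of the current environment.
if [ -r /proc/mounts ]; then
    if mount_has_option /sys/fs/cgroup ro < /proc/mounts; then
        echo "/sys/fs/cgroup is mounted read-only"
    else
        echo "/sys/fs/cgroup is not mounted read-only (or not mounted)"
    fi
fi
```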
Docker
docker run \
    -it -d --rm \
    -v /sys/fs/cgroup:/sys/fs/cgroup:ro \
    --tmpfs /run \
    --tmpfs /run/lock \
    --tmpfs /tmp \
    debsysd:11
Kubernetes
For convenience a bare Pod is created here; normally this would be written as a Deployment.
apiVersion: v1
kind: Pod
metadata:
  name: debsys-test
  # kubectl reads the default container from an annotation, not from a label
  annotations:
    kubectl.kubernetes.io/default-container: main
spec:
  # do not mount /run/secrets/kubernetes.io/serviceaccount
  automountServiceAccountToken: false
  # do not restart after the container exits, so that systemd's exit status can be observed
  restartPolicy: Never
  containers:
  - name: main
    image: debsysd:11
    # the image in the test environment was preloaded with `ctr i import`, so it must not be pulled
    imagePullPolicy: Never
    # enabled here, but my attempts to attach to the pod's getty have not succeeded so far
    stdin: true
    tty: true
    # ports:
    # - containerPort: 22
    #   hostIP:
    #   hostPort: 9999
    #   protocol: TCP
    #   name: ssh
    volumeMounts:
    - mountPath: /sys/fs/cgroup
      name: cgroup
      readOnly: true
    - mountPath: /tmp
      name: tmpfs
      subPath: tmp
    - mountPath: /run
      name: tmpfs
      subPath: run
    - mountPath: /run/lock
      name: tmpfs
      subPath: run-lock
  volumes:
  - name: tmpfs
    emptyDir:
      medium: Memory
      sizeLimit: 64Mi
  - name: cgroup
    hostPath:
      path: /sys/fs/cgroup
      type: Directory
How to catch systemd's reboot/poweroff signals
man 2 reboot

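One option is to run a separate init process and start systemd inside a nested PID namespace. When systemd issues a reboot or poweroff, its parent process can observe this through wait(2): as described in reboot(2), the kernel terminates the namespace's init with SIGHUP for a restart and with SIGINT for a halt or power-off, and the parent can then run whatever handling it needs. (Note that calling the reboot syscall requires CAP_SYS_BOOT.)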
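As a rough illustration of what the parent sees, the sketch below interprets the status a shell reports for the nested init. It assumes a hypothetical supervisor written as a shell script that has already waited for the child; how the child was placed in its own PID namespace is outside this sketch. Shells such as bash and dash report 128 + N for a child killed by signal N, so SIGHUP (1) shows up as 129 and SIGINT (2) as 130. The function name classify_init_status is made up for this example.

```shell
#!/bin/sh
# classify_init_status STATUS: map the exit status of the nested init, as
# reported by the shell in $?, to what the namespace asked for.
classify_init_status() {
    case "$1" in
        129) echo reboot ;;      # 128 + SIGHUP: restart requested via reboot(2)
        130) echo poweroff ;;    # 128 + SIGINT: halt or power-off requested
        *)
            if [ "$1" -gt 128 ]; then
                echo "killed by signal $(( $1 - 128 ))"
            else
                echo "exited with code $1"
            fi
            ;;
    esac
}

# Demo with a few sample statuses.
for status in 0 1 129 130 137; do
    echo "$status -> $(classify_init_status "$status")"
done
```

A plain shell status cannot tell a child that was killed by SIGHUP apart from one that called exit(129) itself; a supervisor that needs this distinction has to inspect the raw wait status, as the C++ daemon later in this post does with WIFSIGNALED and WTERMSIG.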
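Using a volume as the rootfs inside a container
Admittedly, this is a rather odd and twisted requirement.
Because the container runtime needs to write to locations such as /proc and /etc inside the container, the volume cannot simply be mounted as the rootfs, and tools such as switch_root or pivot_root cannot be used to switch the rootfs either.
The possible designs fall roughly into two groups, depending on whether mount namespaces are used. The first uses tools such as unshare and mount inside the container to build a nested container, chroots into the volume to use it as the rootfs, and then starts the real init. The second replicates the mount points with mount -R and uses chroot to enter init. Both approaches are shown below.
Note that mount requires CAP_SYS_ADMIN. If systemd itself should run without that capability, it has to be dropped when systemd is started, for example with capsh.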
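Creating an empty image
Implementation without namespaces
This approach is the simpler and more direct one, so it comes first.
The "empty" image is based on alpine in order to use the basic busybox environment it provides. Once started, /init sets up a nested environment on the volume (recursive bind mounts plus chroot, without creating new namespaces) and then launches the volume's init command.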
FROM alpine:3.16.2
VOLUME ["/sys/fs/cgroup","/run","/run/lock","/tmp","/data00"]
ENV container=docker
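# the init script below is written for bash, which the alpine base image does not ship;
# capsh (only used when INIT_DROP_CAP is set) has to be installed separately as well
RUN apk add --no-cache bash
COPY init /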
STOPSIGNAL SIGRTMIN+3
ENTRYPOINT ["/init"]
CMD ["/lib/systemd/systemd"]
/init
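The script assumes that the rootfs volume is mounted at /data00.
As mentioned earlier, the container runtime needs /proc for operations such as exec. For that reason, the mounts are not moved into /data00 with -M; they are replicated with -R instead.
The script's arguments are used as the init command, which is started as PID 1. In addition, a special first argument turns the script into a way to get a shell: for example, /init exec /bin/bash starts bash inside the nested container.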
init (0755)
#!/bin/bash
# first argument selects the mode: "getshell"/"exec" only enters the chroot
MODE="$1"
case "$MODE" in
    getshell|exec)
        MODE=exec
        shift 1
        ;;
    *)
        MODE=init
        ;;
esac
ROOT=/data00
INIT="$1"
[[ -z "$INIT" ]] && INIT=/bin/bash
shift 1
DEBUG= #echorun
# dry run: when not running as PID 1, only print the commands
if [[ ! $$ = 1 ]]; then
    DEBUG=echo
fi
echorun() {
    echo "+ $@"
    "$@"
}
# build the command that enters the volume, optionally dropping capabilities
declare -a initcmd=()
if [[ ! -z "$INIT_DROP_CAP" ]]; then
    initcmd=(capsh --drop="$INIT_DROP_CAP" --chroot="$ROOT" --shell="$INIT" -- "$@" )
else
    initcmd=(chroot "$ROOT" "$INIT" "$@")
fi
export -n INIT_DROP_CAP
# exec
if [[ "$MODE" = "exec" ]]; then
    exec "${initcmd[@]}"
fi
# collect the outermost mount points, skipping / and everything under $ROOT
declare -A mounts
while read -a aa; do
    [[ "${aa[1]}" = "/" ]] && continue
    [[ "${aa[1]}" = "$ROOT"* ]] && continue
    ok=0
    for m in "${!mounts[@]}"; do
        [[ "${aa[1]}" == "$m"* ]] && ok=1 && break
        [[ "$m" == "${aa[1]}"* ]] && unset mounts["$m"]
    done
    [[ $ok = 0 ]] && mounts["${aa[1]}"]=1
done < /proc/mounts
# /proc is handled separately and bound last
unset mounts[/proc]
for m in "${!mounts[@]}"; do
    [[ -f "$m" ]] && [[ ! -f "$ROOT$m" ]] && $DEBUG touch "$ROOT$m"
    [[ -d "$m" ]] && [[ ! -d "$ROOT$m" ]] && $DEBUG mkdir -p "$ROOT$m"
    $DEBUG mount -n -R "$m" "$ROOT$m"
done
$DEBUG mount -n -R /proc "$ROOT/proc"
if [[ ! $$ = 1 ]]; then
    $DEBUG "${initcmd[@]}"
else
    # set -xe
    exec "${initcmd[@]}"
fi
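The mount-point selection in the script above can be tried in isolation, without touching any mounts. The following is a minimal sketch rather than a drop-in replacement: top_level_mounts is a hypothetical helper that reads a table in /proc/mounts format from stdin and prints the outermost mount points that would have to be replicated into the volume. Unlike the plain prefix test in the script, it compares on path boundaries, so /run-foo is not treated as lying below /run, and it leaves /proc out entirely because the script binds /proc separately.

```shell
#!/bin/sh
# top_level_mounts ROOT: read a mount table in /proc/mounts format from stdin
# and print the outermost mount points, excluding /, ROOT and /proc as well as
# anything mounted below ROOT, /proc, or another listed mount point.
top_level_mounts() {
    awk -v root="$1" '
        # true if path equals base or lies below it
        function under(path, base) {
            return path == base || index(path, base "/") == 1
        }
        {
            dir = $2
            if (dir != "/" && !under(dir, root) && !under(dir, "/proc"))
                all[dir] = 1
        }
        END {
            for (d in all) {
                top = 1
                for (p in all)
                    if (p != d && under(d, p)) { top = 0; break }
                if (top) print d
            }
        }
    ' | LC_ALL=C sort
}

# Demo: list what would be replicated for the current environment.
if [ -r /proc/mounts ]; then
    top_level_mounts /data00 < /proc/mounts
fi
```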
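Implementation based on namespaces
A separate daemon has to run as PID 1 of the container so that the container's own mount namespace is kept alive (otherwise this would be no different from switch_root), and a new PID namespace has to be created when the init process is launched. Neither Go (whose runtime spawns goroutines) nor bash (where signal handling is awkward) is a good fit for this; C/C++ is more suitable.
Because the binary is compiled with -static, the image can use FROM scratch directly.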
FROM scratch
VOLUME ["/sys/fs/cgroup","/run","/run/lock","/tmp","/data00"]
ENV container=docker
COPY init /
STOPSIGNAL SIGINT
ENTRYPOINT ["/init"]
CMD ["/lib/systemd/systemd"]
/init
makefile
init: init.cpp
	g++ -O2 --std=c++20 -Wall -Wextra -Werror -static -fPIE -pie -o $@ $<
	strip -s $@
init.cpp
#include <cstdio>
#include <cstdlib>
#include <cstdint>
#include <cstring>
#include <vector>
#include <string>
#include <unistd.h>
#include <errno.h>
#include <sched.h>
#include <sys/types.h>
#include <sys/stat.h>
#include <sys/wait.h>
#include <sys/mman.h>
#include <sys/mount.h>
#include <sys/reboot.h>
#include <linux/reboot.h>
#include <fcntl.h>
#include <mntent.h>
struct cloned_func {
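// a single 4 KiB page is easily exhausted by the libc and std::string calls made in the child
size_t stack_size = 1024 * 1024;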
void* new_stack() const {
char* stack = (char*)::mmap(nullptr, stack_size, PROT_READ | PROT_WRITE, MAP_PRIVATE | MAP_ANONYMOUS | MAP_STACK, -1, 0);
if (stack == MAP_FAILED) {
return nullptr;
}
char* stackTop = stack + stack_size;
return (void*)stackTop;
}
int clone(void* stack = nullptr) {
if (!stack) {
stack = new_stack();
}
int flags = get_clone_flags();
return ::clone(&cloned_entrypoint, stack, flags, this);
}
virtual int routine() const = 0;
// no copy
cloned_func(const cloned_func&) = delete;
cloned_func& operator=(const cloned_func&) = delete;
protected:
// abstract class
cloned_func() {}
virtual int get_clone_flags() const = 0;
private:
static int cloned_entrypoint(void* p) {
cloned_func* that = (cloned_func*)p;
return that->routine();
}
};
struct preinit_proc : cloned_func {
// const, so that a string literal can be assigned without a -Wwrite-strings error
const char *chroot_path = nullptr;
char *argv0 = nullptr;
char **argv = nullptr;
char **env = nullptr;
preinit_proc(char** argv, char** env) : argv(argv), env(env) {}
virtual int get_clone_flags() const {
return SIGCHLD | CLONE_NEWNS | CLONE_NEWPID;
}
virtual int routine() const {
if (chroot_path) {
int stat = do_chroot();
if (stat != EXIT_SUCCESS) {
return stat;
}
}
if (mount("proc", "/proc", "proc", 0, NULL) < 0) {
perror("mount(proc)");
return EXIT_FAILURE;
}
if (argv) {
if (argv0) {
execve(argv0, argv, env);
} else {
execvpe(argv[0], argv, env);
}
perror("execve");
return EXIT_FAILURE;
}
for (;;) {
pause();
}
}
private:
int do_chroot() const {
std::vector<std::string> mountpoints = {};
struct mntent *ent;
FILE* aFile;
aFile = setmntent("/proc/mounts", "r");
if (!aFile) {
perror("setmntent");
return EXIT_FAILURE;
}
while ((ent = getmntent(aFile))) {
auto dir = std::string(ent->mnt_dir);
if (dir != "/" && !dir.starts_with(chroot_path) && !dir.starts_with("/proc")) {
bool unknown_path = true;
for(auto& e : mountpoints) {
if (e.starts_with(dir)) {
e = dir;
unknown_path = false;
break;
} else if (dir.starts_with(e)) {
unknown_path = false;
break;
}
}
if (unknown_path) {
mountpoints.push_back(dir);
}
}
}
endmntent(aFile);
for(auto& e : mountpoints) {
if (mount(e.c_str(), (chroot_path + e).c_str(), NULL, MS_MOVE, NULL) < 0) {
perror("mount");
return EXIT_FAILURE;
}
}
if (chroot(chroot_path) < 0) {
perror("chroot");
return EXIT_FAILURE;
}
if (chdir("/") < 0) {
perror("chdir");
return EXIT_FAILURE;
}
return EXIT_SUCCESS;
}
};
#define PRINTERR(fmt, ...) fprintf(stderr, "########## " fmt " ##########\n", ## __VA_ARGS__)
static pid_t init_pid = 0;
static bool is_killing = false;
void handle_unexpected_status(int exit_code, int signal = 0, bool coredump = false) {
if (signal == 0) {
if (exit_code == 0) {
printf("init terminated normally.\n");
} else {
printf("init terminated with exit code %d.\n", exit_code);
}
// exit(exit_code);
} else {
if (is_killing) {
printf("init stopped on plan.\n");
}
printf("init killed by signal %d.\n", signal);
if (coredump) {
printf("Note: core dump is generated.\n");
}
// exit(128 + signal); // consistent with container exit code
}
exit(EXIT_SUCCESS);
}
void handle_namespace_reboot(bool is_poweroff) {
if (is_poweroff) {
printf("Request daemon to poweroff!\n");
// exit(128 + SIGINT);
} else {
printf("Request daemon to reboot!\n");
// exit(128 + SIGHUP);
}
exit(EXIT_SUCCESS);
}
void handle_pause_continue(bool is_resuming, int stop_signal = 0) {
if (is_resuming) {
PRINTERR("INFO: init process resumed");
} else {
PRINTERR("INFO: init process paused by signal %d", stop_signal);
}
}
static void sigdown_handler(int signum) {
if (init_pid > 0) {
if (signum == SIGTERM) {
PRINTERR("INFO got signal %d, kill pid %d with SIGKILL", signum, init_pid);
kill(init_pid, SIGKILL);
} else {
PRINTERR("INFO got signal %d, kill pid %d with SIGRTMIN+3", signum, init_pid);
kill(init_pid, SIGRTMIN+3);
}
} else {
PRINTERR("INFO: got signal %d, exiting.", signum);
exit(EXIT_SUCCESS);
}
}
static void sigchld_handler(int signum) {
(void)signum;
for (;;) {
int wstatus = 0;
int pid = waitpid(-1, &wstatus, WUNTRACED | WCONTINUED | WNOHANG);
if (pid < 0) {
PRINTERR("waitpid error: %s", strerror(errno));
break;
}
if (pid == 0) {
break;
}
// PRINTERR("INFO: SIGCHLD from pid %d", pid);
if (WIFEXITED(wstatus)) {
handle_unexpected_status(WEXITSTATUS(wstatus));
} else if (WIFSIGNALED(wstatus)) {
int sig = WTERMSIG(wstatus);
if (sig == SIGHUP) {
handle_namespace_reboot(false);
} else if (sig == SIGINT) {
handle_namespace_reboot(true);
} else {
#ifdef WCOREDUMP
handle_unexpected_status(0, sig, WCOREDUMP(wstatus));
#else
handle_unexpected_status(0, sig, false);
#endif
}
} else if (WIFSTOPPED(wstatus)) {
handle_pause_continue(false, WSTOPSIG(wstatus));
} else if (WIFCONTINUED(wstatus)) {
handle_pause_continue(true, WSTOPSIG(wstatus));
} else {
PRINTERR("INFO: Uninterpretable wstatus %d", wstatus);
}
}
}
static void cleanup_initpid() {
unlink("/init.status");
}
int main(int argc, char** argv, char** env) {
if (getpid() != 1 && argv[1] != NULL && argv[1][0] != '/') {
FILE* f = fopen("/init.status", "r");
if(!f) {
perror("open(/init.status)");
return EXIT_FAILURE;
}
char chroot_path[4096];
pid_t pid = 0;
if (fscanf(f, "%d\n%s\n", &pid, chroot_path) != 2) {
fprintf(stderr, "Failed to read PID from pid file.\n");
return EXIT_FAILURE;
}
fclose(f);
printf("setns to pid %d\n", pid);
char buf[32];
snprintf(buf, 32, "/proc/%d/ns/pid", pid);
int pidfd = open(buf, O_RDONLY);
snprintf(buf, 32, "/proc/%d/ns/mnt", pid);
int mntfd = open(buf, O_RDONLY);
if (pidfd < 0) {
perror("open(pid)");
return EXIT_FAILURE;
}
if (mntfd < 0) {
perror("open(mnt)");
return EXIT_FAILURE;
}
if (setns(pidfd, CLONE_NEWPID) < 0) {
perror("setns(pid)");
return EXIT_FAILURE;
}
close(pidfd);
if (setns(mntfd, CLONE_NEWNS) < 0) {
perror("setns(mnt)");
return EXIT_FAILURE;
}
close(mntfd);
if (std::string("/") != chroot_path) {
if (chroot(chroot_path) < 0) {
perror("chroot");
return EXIT_FAILURE;
}
if (chdir("/") < 0) {
perror("chdir");
return EXIT_FAILURE;
}
}
pid_t cloned_pid = vfork();
if (cloned_pid > 0) {
wait(NULL);
} else if (cloned_pid == 0) {
execvpe(argv[1], &argv[1], env);
perror("execve");
// a vfork child must not return from the calling function
_exit(EXIT_FAILURE);
} else {
perror("vfork");
}
return EXIT_SUCCESS;
}
struct sigaction sigdown_action = {};
sigdown_action.sa_handler = sigdown_handler;
for (int& signum : std::vector({ SIGINT, SIGTERM })) {
if (sigaction(signum, &sigdown_action, NULL) < 0) {
char buf[20];
snprintf(buf, 20, "sigaction(%d)", signum);
perror(buf);
return EXIT_FAILURE;
}
}
struct sigaction sigchld_action = {};
sigchld_action.sa_handler = sigchld_handler;
// sigchld_action.sa_flags = SA_NOCLDSTOP;
if (sigaction(SIGCHLD, &sigchld_action, NULL) < 0) {
perror("sigaction(SIGCHLD)");
return EXIT_FAILURE;
}
preinit_proc initproc = { &argv[1], env };
if (argc == 1) {
initproc.argv = nullptr;
}
initproc.chroot_path = "/data00";
pid_t pid = initproc.clone();
if (pid < 0) {
perror("clone");
return EXIT_FAILURE;
}
init_pid = pid;
atexit(cleanup_initpid);
FILE* f = fopen("/init.status", "w");
if (!f) {
perror("fopen(/init.status)");
return EXIT_FAILURE;
}
const char* chroot_path = initproc.chroot_path;
if (!chroot_path) {
chroot_path = "/";
}
fprintf(f, "%d\n%s\n", pid, chroot_path);
fclose(f);
for (;;) {
pause();
}
}
Last modified on 2022-10-03