Reference: ContainerInterface

Creating an image with systemd

FROM debian:11

ARG DEBIAN_MIRROR=mirrors.tuna.tsinghua.edu.cn

# for systemd
VOLUME ["/sys/fs/cgroup","/run","/run/lock","/tmp"]
ENV container=docker

# for package installation
ENV DEBIAN_FRONTEND=noninteractive TZ=Asia/Shanghai

# install systemd and dbus
RUN set -xe; \
        sed -i -e "s/deb\.debian\.org/${DEBIAN_MIRROR}/" /etc/apt/sources.list; \
        apt-get update; \
        apt-get install -y --no-install-recommends libsystemd0 systemd dbus; \
        apt-get clean; \
        rm -rf /var/lib/apt/lists/* /tmp/* /var/tmp/*; \
        rm -f /var/run/nologin

# remove auto-started services and targets (e.g. getty, systemd-logind)
#RUN rm -vf /lib/systemd/system/multi-user.target.wants/* \
#        /etc/systemd/system/*.wants/* \
#        /lib/systemd/system/local-fs.target.wants/* \
#        /lib/systemd/system/sockets.target.wants/*udev* \
#        /lib/systemd/system/sockets.target.wants/*initctl* \
#        /lib/systemd/system/sysinit.target.wants/systemd-tmpfiles-setup* \
#        /lib/systemd/system/systemd-update-utmp*

RUN set -xe; \
        ln -sf /bin/systemctl /sbin/halt; \
        ln -sf /bin/systemctl /sbin/poweroff; \
        ln -sf /bin/systemctl /sbin/reboot; \
        ln -sf /bin/systemctl /sbin/runlevel; \
        ln -sf /bin/systemctl /sbin/shutdown; \
        ln -sf /bin/systemctl /sbin/telinit; \
        ln -sf /lib/systemd/systemd /sbin/init; \
        true

# stop signal for systemd
STOPSIGNAL SIGRTMIN+3

WORKDIR /root
USER root

# install user utilities
RUN set -xe; \
        apt-get update; \
        apt-get install -y iproute2 procps file vim htop tmux iputils-ping dnsutils curl wget; \
        true

# install sshd
#RUN set -xe; \
#        apt-get install -y --no-install-recommends openssh-server
#EXPOSE 22

CMD ["/sbin/init"]

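Running the image with systemd

In general, systemd needs CAP_SYS_ADMIN to run with full functionality; without it, some features are unavailable. (See the ContainerInterface page referenced at the top.)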

People have been asking to be able to run systemd without CAP_SYS_ADMIN and CAP_SYS_MKNOD in the container. This is now supported to some level in systemd, but we recommend against it (see above). If CAP_SYS_ADMIN and CAP_SYS_MKNOD are missing from the container, systemd will now gracefully turn off PrivateTmp=, PrivateNetwork=, ProtectHome=, ProtectSystem= and others, because those caps are required to implement these options. The services using these settings (which include many of systemd’s own) will hence run in a different, less secure environment when the caps are missing than with them around.

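However, systemd can still start normally without CAP_SYS_ADMIN. The examples below do not grant this capability.

systemd needs cgroup and tmpfs mounts to be provided. Older versions of systemd may be unable to work with a read-only cgroup directory.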

Docker

docker run \
  -it -d --rm \
  -v /sys/fs/cgroup:/sys/fs/cgroup:ro \
  --tmpfs /run \
  --tmpfs /run/lock \
  --tmpfs /tmp \
  debsysd:11

Kubernetes

For convenience, a Pod is created directly here. Normally this would be written as a Deployment.

apiVersion: v1
kind: Pod
metadata:
  name: debsys-test
  annotations:
    # kubectl reads default-container from the pod's annotations, not its labels
    kubectl.kubernetes.io/default-container: main
spec:
  # do not mount /run/secrets/kubernetes.io/serviceaccount
  automountServiceAccountToken: false
  # never restart after the container exits, so that systemd's exit status can be observed
  restartPolicy: Never
  containers:
  - name: main
    image: debsysd:11
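    # the image in the test environment was loaded in advance with `ctr i import`, so it must not be pulled
    imagePullPolicy: Never
    # enabled here, although I have not yet managed to attach to the pod's getty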
    stdin: true
    tty: true
#    ports:
#    - containerPort: 22
#      hostIP:
#      hostPort: 9999
#      protocol: TCP
#      name: ssh
    volumeMounts:
    - mountPath: /sys/fs/cgroup
      name: cgroup
      readOnly: true
    - mountPath: /tmp
      name: tmpfs
      subPath: tmp
    - mountPath: /run
      name: tmpfs
      subPath: run
    - mountPath: /run/lock
      name: tmpfs
      subPath: run-lock
  volumes:
  - name: tmpfs
    emptyDir:
      medium: Memory
      sizeLimit: 64Mi
  - name: cgroup
    hostPath:
      path: /sys/fs/cgroup
      type: Directory

How to capture systemd's reboot/poweroff signal

man 2 reboot

Behavior inside PID namespace

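You can start a separate init process and run systemd inside a nested PID namespace. When systemd initiates a reboot or poweroff, the kernel does not restart the host; per reboot(2), the namespace's init (here systemd) is killed with SIGHUP for a restart request and with SIGINT for a halt or poweroff request. systemd's parent process can observe that signal via wait(2) and run its own handling logic. (Note that calling the reboot syscall requires CAP_SYS_BOOT.)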
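The parent's side of this can be sketched as below. This is a minimal illustration, not the implementation used later on this page: the names InitExit, classify_init_exit and simulate_init_killed_by are hypothetical, and because creating a real PID namespace needs privileges, the second helper only stands in for a namespaced init by forking a child that terminates itself with the given signal.

```cpp
#include <signal.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

// Hypothetical names, for illustration only.
enum class InitExit { Reboot, Poweroff, Exited, Killed, Other };

// Interpret a wait status of the namespace's init the way reboot(2) describes:
// SIGHUP means a restart was requested, SIGINT means halt or poweroff.
InitExit classify_init_exit(int wstatus) {
    if (WIFEXITED(wstatus)) {
        return InitExit::Exited;
    }
    if (WIFSIGNALED(wstatus)) {
        switch (WTERMSIG(wstatus)) {
        case SIGHUP: return InitExit::Reboot;
        case SIGINT: return InitExit::Poweroff;
        default:     return InitExit::Killed;
        }
    }
    return InitExit::Other;
}

// Stand-in for a namespaced init (assumption: no real PID namespace is created).
// Forks a child that kills itself with `sig` and returns its wait status, or -1 on error.
int simulate_init_killed_by(int sig) {
    pid_t pid = fork();
    if (pid < 0) {
        return -1;
    }
    if (pid == 0) {
        // make sure the signal is unblocked and has its default (fatal) disposition
        sigset_t set;
        sigemptyset(&set);
        sigaddset(&set, sig);
        sigprocmask(SIG_UNBLOCK, &set, nullptr);
        signal(sig, SIG_DFL);
        raise(sig);
        _exit(0);  // not reached for fatal signals
    }
    int wstatus = 0;
    if (waitpid(pid, &wstatus, 0) < 0) {
        return -1;
    }
    return wstatus;
}
```

A supervisor would call classify_init_exit on the status returned by waitpid for the init it cloned, and decide whether to exit, restart the init, or report a failure.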

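Using a volume as the rootfs in a container

Admittedly, this is a rather strange and twisted requirement.

Because the container runtime needs to write to locations such as /proc and /etc inside the container, the volume cannot simply be mounted as the rootfs, and tools such as switch_root or pivot_root cannot be used to switch the rootfs either.

Depending on whether mount namespaces are used, the possible approaches fall roughly into two kinds. The first uses tools such as unshare and mount inside the container to build a nested container, chroots into the volume as the rootfs, and then starts the real init. The second copies the mount points with mount -R and enters init through chroot. Both are shown below.

Note that mount requires CAP_SYS_ADMIN. If systemd should run without this capability, it has to be dropped when systemd is started, for example with capsh.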
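For the namespace-based variant, which is written in C++, the same effect could be approximated directly with prctl(2) instead of capsh. The sketch below is an assumption about how one might do that, not code from this page; the helper names are hypothetical. It removes a capability from the bounding set, which is what capsh --drop is documented to do.

```cpp
#include <linux/capability.h>  // CAP_SYS_ADMIN
#include <sys/prctl.h>         // prctl, PR_CAPBSET_READ, PR_CAPBSET_DROP

// Hypothetical helper names, for illustration only.

// Returns 1 if `cap` is in the calling thread's bounding set, 0 if not, -1 on error.
int cap_in_bounding_set(int cap) {
    return prctl(PR_CAPBSET_READ, static_cast<unsigned long>(cap), 0UL, 0UL, 0UL);
}

// Removes `cap` from the bounding set. This needs CAP_SETPCAP;
// returns false if the kernel refuses.
bool drop_from_bounding_set(int cap) {
    return prctl(PR_CAPBSET_DROP, static_cast<unsigned long>(cap), 0UL, 0UL, 0UL) == 0;
}
```

A preinit process would call drop_from_bounding_set(CAP_SYS_ADMIN) after it has finished mounting and just before it executes the real init. Whether that alone is sufficient depends on the process's other capability sets (in particular the inheritable set), so treat it as a starting point.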

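Creating an empty image

Implementation without namespaces

This approach is simpler and more direct, so it comes first.

This "empty" image is based on alpine in order to use the busybox base environment it provides. After startup, /init prepares the nested environment on the volume (in this variant by copying mount points and using chroot, not by creating new namespaces) and then runs the volume's init command there.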

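FROM alpine:3.16.2
# /init below is a bash script, and alpine ships only busybox sh by default.
# capsh (used only when INIT_DROP_CAP is set) is not installed here and must be added separately if needed.
RUN apk add --no-cache bash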
VOLUME ["/sys/fs/cgroup","/run","/run/lock","/tmp","/data00"]
ENV container=docker
 
COPY init /
 
STOPSIGNAL SIGRTMIN+3
 
ENTRYPOINT ["/init"]
CMD ["/lib/systemd/systemd"]

/init

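The script assumes that the rootfs volume is mounted at /data00.

As mentioned earlier, the container runtime needs /proc for operations such as exec. For that reason, when the mounts are transferred into /data00, the script does not move the mount points with -M but copies them with -R.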
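Since -R copies a mount point together with everything mounted beneath it, only the outermost mount points need to be handled; the scripts on this page filter /proc/mounts accordingly. The following is a small sketch of that filtering step in isolation, under the assumption that paths are compared on component boundaries (so /running is not treated as a child of /run). The function names are hypothetical.

```cpp
#include <algorithm>
#include <string>
#include <vector>

// Hypothetical names, for illustration only.

// True if `path` equals `base` or lies beneath it, matched on a path-component boundary.
bool is_same_or_under(const std::string& path, const std::string& base) {
    if (base == "/") {
        return true;
    }
    if (path.compare(0, base.size(), base) != 0) {
        return false;
    }
    return path.size() == base.size() || path[base.size()] == '/';
}

// Reduce a list of mount points to the outermost ones that must be copied into new_root.
// Skips "/", new_root and its children, and /proc (which is handled separately).
std::vector<std::string> top_level_mounts(std::vector<std::string> dirs,
                                          const std::string& new_root) {
    // after sorting, every parent directory precedes its children
    std::sort(dirs.begin(), dirs.end());
    std::vector<std::string> out;
    for (const auto& dir : dirs) {
        if (dir == "/" || is_same_or_under(dir, new_root) || is_same_or_under(dir, "/proc")) {
            continue;
        }
        bool covered = false;
        for (const auto& kept : out) {
            if (is_same_or_under(dir, kept)) {
                covered = true;  // a recursive copy of `kept` already includes `dir`
                break;
            }
        }
        if (!covered) {
            out.push_back(dir);
        }
    }
    return out;
}
```

Each returned path would then be bind-mounted recursively onto the same path below the new root.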

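The script's arguments are used as the init command, which is started as PID 1. In addition, with a special first argument the script can open a shell instead: for example, /init exec /bin/bash starts bash inside the nested container.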

init (0755)

#!/bin/bash
 
MODE="$1"
 
case "$MODE" in
getshell|exec)
        MODE=exec
        shift 1
        ;;
*)
        MODE=init
        ;;
esac
 
ROOT=/data00
INIT="$1"
[[ -z "$INIT" ]] && INIT=/bin/bash
shift 1
 
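# When not running as PID 1 (e.g. when testing by hand), only print the commands instead of running them
DEBUG= #echorun
if [[ ! $$ = 1 ]]; then
        DEBUG=echo
fi
 
# print a command, then run it
echorun() {
        echo "+ $*"
        "$@"
}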
 
declare -a initcmd=()
if [[ ! -z "$INIT_DROP_CAP" ]]; then
        initcmd=(capsh --drop="$INIT_DROP_CAP" --chroot="$ROOT" --shell="$INIT" -- "$@" )
else
        initcmd=(chroot "$ROOT" "$INIT" "$@")
fi
export -n INIT_DROP_CAP
 
 
# exec
if [[ "$MODE" = "exec" ]]; then
        exec "${initcmd[@]}"
fi
 
 
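# Collect the outermost mount points from /proc/mounts;
# anything mounted below them is covered by the recursive bind mount (-R).
declare -A mounts
while read -r -a aa; do
        mp="${aa[1]}"
        [[ "$mp" = "/" ]] && continue
        # skip the new root itself and anything below it
        [[ "$mp" = "$ROOT" || "$mp" = "$ROOT"/* ]] && continue
        ok=0
        for m in "${!mounts[@]}"; do
                # already covered by a recorded parent (matched on a path boundary)
                [[ "$mp" = "$m" || "$mp" = "$m"/* ]] && ok=1 && break
                # a recorded entry lies below this one: drop it in favour of the parent
                [[ "$m" = "$mp"/* ]] && unset 'mounts[$m]'
        done
        [[ $ok = 0 ]] && mounts["$mp"]=1
done < /proc/mounts
 
# /proc is bind-mounted separately, after all other mount points
unset 'mounts[/proc]'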
 
 
for m in "${!mounts[@]}"; do
        [[ -f "$m" ]] && [[ ! -f "$ROOT$m" ]] && $DEBUG touch "$ROOT$m"
        [[ -d "$m" ]] && [[ ! -d "$ROOT$m" ]] && $DEBUG mkdir -p "$ROOT$m"
        $DEBUG mount -n -R "$m" "$ROOT$m"
done
$DEBUG mount -n -R /proc "$ROOT/proc"
 
if [[ ! $$ = 1 ]]; then
        $DEBUG "${initcmd[@]}"
else
#        set -xe
        exec "${initcmd[@]}"
fi

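Implementation based on namespaces

A separate daemon has to run as the container's PID 1 so that the container's own mount namespace is preserved (otherwise the result would be the same as switch_root). When the init process is then started, it also has to be placed in a new PID namespace. Neither Go (its runtime creates goroutines) nor bash (signal handling is difficult) is a good fit for this; C/C++ is more suitable.

Because the binary is compiled with -static, the image can use FROM scratch directly.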

FROM scratch
VOLUME ["/sys/fs/cgroup","/run","/run/lock","/tmp","/data00"]
ENV container=docker

COPY init /

STOPSIGNAL SIGINT

ENTRYPOINT ["/init"]
CMD ["/lib/systemd/systemd"]

/init

makefile

init: init.cpp
	g++ -O2 --std=c++20 -Wall -Wextra -Werror -static -fPIE -pie -o $@ $<
	strip -s $@

init.cpp

#include <cstdio>
#include <cstdlib>
#include <cstdint>
#include <cstring>

#include <vector>
#include <string>

#include <unistd.h>
#include <errno.h>
#include <sched.h>
#include <sys/types.h>
#include <sys/stat.h>
#include <sys/wait.h>
#include <sys/mman.h>
#include <sys/mount.h>
#include <sys/reboot.h>
#include <linux/reboot.h>
#include <fcntl.h>
#include <mntent.h>

struct cloned_func {
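	// 1 MiB: routine() runs getmntent/perror and std::string code on this stack,
	// which a single 4 KiB page cannot safely hold
	size_t stack_size = 1024 * 1024;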
	void* new_stack() const {
		char* stack = (char*)::mmap(nullptr, stack_size, PROT_READ | PROT_WRITE, MAP_PRIVATE | MAP_ANONYMOUS | MAP_STACK, -1, 0);
		if (stack == MAP_FAILED) {
			return nullptr;
		}
		char* stackTop = stack + stack_size;
		return (void*)stackTop;
	}
	int clone(void* stack = nullptr) {
		if (!stack) {
			stack = new_stack();
		}
		int flags = get_clone_flags();
		return ::clone(&cloned_entrypoint, stack, flags, this);
	}
	virtual int routine() const = 0;

	// no copy
	cloned_func(const cloned_func&) = delete;
	cloned_func& operator=(const cloned_func&) = delete;
protected:
	// abstract class
	cloned_func() {}
	virtual int get_clone_flags() const = 0;

private:
	static int cloned_entrypoint(void* p) {
		cloned_func* that = (cloned_func*)p;
		return that->routine();
	}
};


struct preinit_proc : cloned_func {
	// const: main() assigns a string literal, which cannot bind to a plain char* under -Werror
	const char *chroot_path = nullptr;
	char *argv0 = nullptr;
	char **argv = nullptr;
	char **env = nullptr;
	preinit_proc(char** argv, char** env) : argv(argv), env(env) {}
	virtual int get_clone_flags() const {
		return SIGCHLD | CLONE_NEWNS | CLONE_NEWPID;
	}
	virtual int routine() const {
		if (chroot_path) {
			int stat = do_chroot();
			if (stat != EXIT_SUCCESS) {
				return stat;
			}
		}

		if (mount("proc", "/proc", "proc", 0, NULL) < 0) {
			perror("mount(proc)");
			return EXIT_FAILURE;
		}

		if (argv) {
			if (argv0) {
				execve(argv0, argv, env);
			} else {
				execvpe(argv[0], argv, env);
			}
			perror("execve");
			return EXIT_FAILURE;
		}
		for (;;) {
			pause();
		}

	}
private:
	int do_chroot() const {
		std::vector<std::string> mountpoints = {};
		struct mntent *ent;
		FILE* aFile;
		aFile = setmntent("/proc/mounts", "r");
		if (!aFile) {
			perror("setmntent");
			return EXIT_FAILURE;
		}
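		// true if path equals base or lies beneath it; matching on a path boundary
		// keeps e.g. /running distinct from /run
		auto under = [](const std::string& path, const std::string& base) {
			return path == base || path.starts_with(base + "/");
		};
		// keep only the outermost mount points; their children move along with them
		while ((ent = getmntent(aFile))) {
			auto dir = std::string(ent->mnt_dir);
			if (dir != "/" && !under(dir, chroot_path) && !under(dir, "/proc")) {
				bool unknown_path = true;
				for(auto& e : mountpoints) {
					if (under(e, dir)) {
						// dir is a parent of a recorded entry: record the parent instead
						e = dir;
						unknown_path = false;
						break;
					} else if (under(dir, e)) {
						// dir is already covered by a recorded parent
						unknown_path = false;
						break;
					}
				}
				if (unknown_path) {
					mountpoints.push_back(dir);
				}
			}
		}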
		endmntent(aFile);

		for(auto& e : mountpoints) {
			if (mount(e.c_str(), (chroot_path + e).c_str(), NULL, MS_MOVE, NULL) < 0) {
				perror("mount");
				return EXIT_FAILURE;
			}
		}
		if (chroot(chroot_path) < 0) {
			perror("chroot");
			return EXIT_FAILURE;
		}
		if (chdir("/") < 0) {
			perror("chdir");
			return EXIT_FAILURE;
		}
		return EXIT_SUCCESS;
	}
};

#define PRINTERR(fmt, ...) fprintf(stderr, "########## " fmt " ##########\n", ## __VA_ARGS__)

static pid_t init_pid = 0;
static bool is_killing = false;

void handle_unexpected_status(int exit_code, int signal = 0, bool coredump = false) {
	if (signal == 0) {
		if (exit_code == 0) {
			printf("init terminated normally.\n");
		} else {
			printf("init terminated with exit code %d.\n", exit_code);
		}
//		exit(exit_code);
	} else {
		if (is_killing) {
			printf("init stopped on plan.\n");
		}
		printf("init killed by signal %d.\n", signal);
		if (coredump) {
			printf("Note: core dump is generated.\n");
		}
//		exit(128 + signal);	// consistent with container exit code
	}
	exit(EXIT_SUCCESS);
}

void handle_namespace_reboot(bool is_poweroff) {
	if (is_poweroff) {
		printf("Request daemon to poweroff!\n");
//		exit(128 + SIGINT);
	} else {
		printf("Request daemon to reboot!\n");
//		exit(128 + SIGHUP);
	}
	exit(EXIT_SUCCESS);
}

void handle_pause_continue(bool is_resuming, int stop_signal = 0) {
	if (is_resuming) {
		PRINTERR("INFO: init process resumed");
	} else {
		PRINTERR("INFO: init process paused by signal %d", stop_signal);
	}
}


static void sigdown_handler(int signum) {
	if (init_pid > 0) {
		if (signum == SIGTERM) {
			PRINTERR("INFO: got signal %d, kill pid %d with SIGKILL", signum, init_pid);
			kill(init_pid, SIGKILL);
		} else {
			PRINTERR("INFO: got signal %d, kill pid %d with SIGRTMIN+3", signum, init_pid);
			kill(init_pid, SIGRTMIN+3);
		}
	} else {
		PRINTERR("INFO: got signal %d, exiting.", signum);
		exit(EXIT_SUCCESS);
	}
}

static void sigchld_handler(int signum) {
	(void)signum;

	for (;;) {
		int wstatus = 0;
		int pid = waitpid(-1, &wstatus, WUNTRACED | WCONTINUED | WNOHANG);
		if (pid < 0) {
			PRINTERR("waitpid error: %s", strerror(errno));
			break;
		}
		if (pid == 0) {
			break;
		}
//		PRINTERR("INFO: SIGCHLD from pid %d", pid);
		if (WIFEXITED(wstatus)) {
			handle_unexpected_status(WEXITSTATUS(wstatus));
		} else if (WIFSIGNALED(wstatus)) {
			int sig = WTERMSIG(wstatus);
			if (sig == SIGHUP) {
				handle_namespace_reboot(false);
			} else if (sig == SIGINT) {
				handle_namespace_reboot(true);
			} else {
#ifdef WCOREDUMP
				handle_unexpected_status(0, sig, WCOREDUMP(wstatus));
#else
				handle_unexpected_status(0, sig, false);
#endif
			}
		} else if (WIFSTOPPED(wstatus)) {
			handle_pause_continue(false, WSTOPSIG(wstatus));
		} else if (WIFCONTINUED(wstatus)) {
			handle_pause_continue(true, WSTOPSIG(wstatus));
		} else {
			PRINTERR("INFO: Uninterpretable wstatus %d", wstatus);
		}
	}
}

static void cleanup_initpid() {
	unlink("/init.status");
}

int main(int argc, char** argv, char** env) {
	if (getpid() != 1 && argv[1] != NULL && argv[1][0] != '/') {
		FILE* f = fopen("/init.status", "r");
		if(!f) {
			perror("open(/init.status)");
			return EXIT_FAILURE;
		}
		char chroot_path[4096];
		pid_t pid = 0;
		// %4095s bounds the read to the size of chroot_path (4096 bytes including the terminator)
		if (fscanf(f, "%d\n%4095s\n", &pid, chroot_path) != 2) {
			fprintf(stderr, "Failed to read PID from pid file.\n");
			return EXIT_FAILURE;
		}
		fclose(f);

		printf("setns to pid %d\n", pid);

		char buf[32];
		snprintf(buf, 32, "/proc/%d/ns/pid", pid);
		int pidfd = open(buf, O_RDONLY);
		snprintf(buf, 32, "/proc/%d/ns/mnt", pid);
		int mntfd = open(buf, O_RDONLY);
		if (pidfd < 0) {
			perror("open(pid)");
			return EXIT_FAILURE;
		}
		if (mntfd < 0) {
			perror("open(mnt)");
			return EXIT_FAILURE;
		}
		if (setns(pidfd, CLONE_NEWPID) < 0) {
			perror("setns(pid)");
			return EXIT_FAILURE;
		}
		close(pidfd);
		if (setns(mntfd, CLONE_NEWNS) < 0) {
			perror("setns(mnt)");
			return EXIT_FAILURE;
		}
		close(mntfd);

		if (std::string("/") != chroot_path) {
			if (chroot(chroot_path) < 0) {
				perror("chroot");
				return EXIT_FAILURE;
			}
			if (chdir("/") < 0) {
				perror("chdir");
				return EXIT_FAILURE;
			}
		}

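		pid_t cloned_pid = vfork();
		if (cloned_pid > 0) {
			wait(NULL);
		} else if (cloned_pid == 0) {
			execvpe(argv[1], &argv[1], env);
			perror("execve");
			// a vfork child must never return from the calling function; leave with _exit
			_exit(EXIT_FAILURE);
		} else {
			perror("vfork");
			return EXIT_FAILURE;
		}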

		return EXIT_SUCCESS;
	}

	struct sigaction sigdown_action = {};
	sigdown_action.sa_handler = sigdown_handler;

	for (int& signum : std::vector({ SIGINT, SIGTERM })) {
		if (sigaction(signum, &sigdown_action, NULL) < 0) {
			char buf[20];
			snprintf(buf, 20, "sigaction(%d)", signum);
			perror(buf);
			return EXIT_FAILURE;
		}
	}

	struct sigaction sigchld_action = {};
	sigchld_action.sa_handler = sigchld_handler;
//	sigchld_action.sa_flags = SA_NOCLDSTOP;
	if (sigaction(SIGCHLD, &sigchld_action, NULL) < 0) {
		perror("sigaction(SIGCHLD)");
		return EXIT_FAILURE;
	}

	preinit_proc initproc = { &argv[1], env };
	if (argc == 1) {
		initproc.argv = nullptr;
	}
	initproc.chroot_path = "/data00";

	pid_t pid = initproc.clone();
	if (pid < 0) {
		perror("clone");
		return EXIT_FAILURE;
	}
	init_pid = pid;
	atexit(cleanup_initpid);

	FILE* f = fopen("/init.status", "w");
	if (!f) {
		perror("open(/init.status)");
		return EXIT_FAILURE;
	}
	const char* chroot_path = initproc.chroot_path;
	if (!chroot_path) {
		chroot_path = "/";
	}
	fprintf(f, "%d\n%s\n", pid, chroot_path);
	fclose(f);

	for (;;) {
		pause();
	}
}

Last modified on 2022-10-03