Building a Container from Scratch Using Shell Commands
A “container” is a lightweight, portable, and self-sufficient unit that runs software in a consistent environment. It uses Linux kernel features like namespaces and cgroups to isolate processes and manage resources, giving you a reproducible way to package an application together with its dependencies and move it across different environments.
Key Technologies Used in Containers
The container runtime is responsible for managing the lifecycle of containers, including starting, stopping, and monitoring them. Behind the scenes, it uses the following Linux kernel features to create and execute containers.
- Cgroups: Control groups limit and prioritize resource usage (CPU, memory, disk I/O) for processes, ensuring that containers do not consume more than their allocated share. We are going to use cgroupfs to manage cgroups. Cgroups allow you to group processes and apply resource limits to them, so no single process can monopolize system resources.
- Namespaces: These provide isolation for processes, ensuring that they cannot see or interact with processes in other containers or the host system. Namespaces create separate views of system resources, such as process IDs, network interfaces, and file systems, for each container. We are going to use the unshare command to create new namespaces for the container.
- Networking: Containers can have their own network stack, allowing them to communicate with each other and the outside world without interfering with the host system’s network configuration. We are going to use veth (virtual Ethernet) pairs to create isolated network interfaces for each container.
- OverlayFS: To keep containers lightweight, a layered filesystem lets multiple containers share the same base image while keeping their own unique changes. This is typically implemented using technologies like OverlayFS or AUFS (see the sketch after this list).
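We will not wire OverlayFS into the final script, but the idea can be sketched in a few commands. The directory names below are purely illustrative and the commands must be run as root:
#!/usr/bin/env bash
set -e
# Hypothetical layout (names chosen for this example only):
#   /tmp/overlay/lower  - read-only base image layer
#   /tmp/overlay/upper  - this container's writable layer
#   /tmp/overlay/work   - OverlayFS scratch space
#   /tmp/overlay/merged - the combined view a container could chroot into
mkdir -p /tmp/overlay/{lower,upper,work,merged}
mount -t overlay overlay \
    -o lowerdir=/tmp/overlay/lower,upperdir=/tmp/overlay/upper,workdir=/tmp/overlay/work \
    /tmp/overlay/merged
# Files from "lower" show through in "merged"; any writes land only in "upper".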
Cgroups
Control groups (cgroups) are a Linux kernel feature that allows you to allocate and limit resources (CPU, memory, disk I/O, etc.) for a group of processes. Cgroups provide a way to manage and monitor resource usage, ensuring that no single process or container can consume all available resources on the system.
For simplicity, we are going to use the control group filesystem (cgroupfs) to manage cgroups. Cgroupfs is a virtual filesystem that lets you interact with cgroups through ordinary file operations.
The following is a simple script that creates a cgroup and limits the CPU and memory usage of a process (it assumes the cgroup v1 layout under /sys/fs/cgroup):
#!/usr/bin/env bash
set -e

CGROUP_NAME="my_cgroup"
CPU_SHARES="256"   # 256 out of 1024 shares (~25% CPU under contention)
MEM_LIMIT="100M"   # 100 MB memory limit

setup_cgroup() {
    echo "[+] Setting up cgroup..."
    for ctrl in cpu memory; do
        mkdir -p /sys/fs/cgroup/$ctrl/$CGROUP_NAME
    done
    echo "$CPU_SHARES" > /sys/fs/cgroup/cpu/$CGROUP_NAME/cpu.shares
    echo "$MEM_LIMIT" > /sys/fs/cgroup/memory/$CGROUP_NAME/memory.limit_in_bytes
}

cleanup_cgroup() {
    echo "[+] Cleaning up cgroup..."
    for ctrl in cpu memory; do
        # Move ourselves back to the root cgroup so the rmdir can succeed
        echo $$ > /sys/fs/cgroup/$ctrl/tasks || true
        rmdir /sys/fs/cgroup/$ctrl/$CGROUP_NAME || true
    done
}

run_in_cgroup() {
    local cmd=("$@")
    echo "[+] Running command in cgroup..."
    # Add the current shell to the cgroup; the command below inherits membership
    echo $$ > /sys/fs/cgroup/cpu/$CGROUP_NAME/tasks
    echo $$ > /sys/fs/cgroup/memory/$CGROUP_NAME/tasks
    "${cmd[@]}"
}

trap cleanup_cgroup EXIT  # Ensure cleanup on exit

# === main ===
setup_cgroup
run_in_cgroup "$@"
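A quick way to try it, assuming the script is saved as cgroup-demo.sh (a name chosen just for this example), is to run a CPU-heavy command through it and inspect the limits from another terminal:
# Run a busy loop inside the cgroup (requires root)
sudo ./cgroup-demo.sh sh -c 'while :; do :; done' &

# From another shell, confirm the limits and the cgroup membership
cat /sys/fs/cgroup/cpu/my_cgroup/cpu.shares
cat /sys/fs/cgroup/memory/my_cgroup/memory.limit_in_bytes
cat /sys/fs/cgroup/cpu/my_cgroup/tasks
Kill the background job when you are done so the cleanup trap can remove the cgroup.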
Network isolation
Network isolation is a crucial aspect of containerization, allowing each container to have its own network stack, including IP addresses, routing tables, and network interfaces. This ensures that containers can communicate with each other and the outside world without interfering with the host system’s network configuration. In this example, we will use veth (virtual Ethernet) pairs to create isolated network interfaces for each container.
We also need to set up a network bridge to connect the container’s network interface to the host system’s network. A bridge acts as a virtual switch, allowing containers to communicate with each other and the host system.
#!/usr/bin/env bash
set -e

BRIDGE="br0"
VETH_HOST="veth-host0"
VETH_CONT="veth-cont0"
IP_HOST="192.168.50.1"
IP_CONT="192.168.50.2"
SUBNET="192.168.50.0/24"

setup_bridge() {
    echo "[+] Setting up network bridge..."
    ip link add "$BRIDGE" type bridge || true
    ip addr add "$IP_HOST/24" dev "$BRIDGE" || true
    ip link set "$BRIDGE" up
}

setup_veth() {
    echo "[+] Creating veth pair..."
    ip link add "$VETH_HOST" type veth peer name "$VETH_CONT"
    ip link set "$VETH_HOST" up
    ip link set "$VETH_HOST" master "$BRIDGE"
}

cleanup_network() {
    echo "[+] Cleaning up network..."
    ip link del "$VETH_HOST" || true
    ip link del "$BRIDGE" || true
}
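To sanity-check the setup, assuming the functions above are saved as network.sh (a placeholder name), you can source them in a root shell and inspect the result:
# Define and run the bridge/veth functions
sudo bash -c 'source ./network.sh && setup_bridge && setup_veth'

# The bridge should be up with 192.168.50.1, and veth-host0 attached to it
ip addr show br0
ip link show master br0
ip link show veth-cont0   # still in the host namespace, down, until the container claims it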
For more complex access control, routing, and firewall rules, you can use tools like iptables or nftables to manage network traffic between containers and the host system. Below is an example that uses nftables to set up basic firewall rules and NAT (Network Address Translation), allowing containers to reach the internet while keeping unsolicited traffic out.
#!/usr/bin/env bash
set -e

BRIDGE="br0"
SUBNET="192.168.50.0/24"

setup_nat() {
    # Enable IP forwarding so the host can route container traffic
    echo "[+] Enabling IP forwarding..."
    sysctl -w net.ipv4.ip_forward=1
    echo "[+] Setting up NAT..."
    nft add table ip nat || true
    nft add chain ip nat POSTROUTING '{ type nat hook postrouting priority 100 ; }'
    # Masquerade container traffic leaving through any non-bridge (external) interface
    nft add rule ip nat POSTROUTING ip saddr "$SUBNET" oifname != "$BRIDGE" masquerade
}

setup_firewall() {
    echo "[+] Setting up firewall rules..."
    nft add table ip filter || true
    nft add chain ip filter input '{ type filter hook input priority 0 ; }'
    nft add chain ip filter forward '{ type filter hook forward priority 0 ; }'
    nft add chain ip filter output '{ type filter hook output priority 0 ; }'
    # Allow established and related connections
    nft add rule ip filter input ct state established,related accept
    nft add rule ip filter forward ct state established,related accept
    # Allow loopback and traffic arriving from the bridge interface
    nft add rule ip filter input iifname "lo" accept
    nft add rule ip filter input iifname "$BRIDGE" accept
    nft add rule ip filter forward iifname "$BRIDGE" accept
    # Drop all other incoming connections
    nft add rule ip filter input drop
}

cleanup() {
    echo "[+] Cleaning up firewall rules..."
    nft flush table ip nat || true
    nft flush table ip filter || true
    sysctl -w net.ipv4.ip_forward=0
}
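After running setup_nat and setup_firewall, you can verify the installed rules; once the container is up, outbound traffic from it should be NATed through the host (the ping target here is arbitrary):
# Inspect the installed tables, chains, and rules
sudo nft list ruleset

# Later, from inside the container:
#   ping -c 1 1.1.1.1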
Namespaces
Namespaces are a Linux kernel feature for creating partitions within the operating system, giving processes their own isolated view of system resources. This is similar to a jail environment: each containerized application runs in its own confined space, unaware of other containers or the host system.
There are different kinds of namespaces, each isolating a different set of system resources:
- PID Namespace: Isolates process IDs, allowing containers to have their own process trees.
- Network Namespace: Provides isolated network interfaces, IP addresses, and routing tables for each container.
- Mount Namespace: Allows containers to have their own file system views, enabling them to have different root directories and mount points.
- UTS Namespace: Isolates hostname and domain name, allowing containers to have their own unique identifiers.
- IPC Namespace: Isolates inter-process communication resources, such as message queues and semaphores, for each container.
- User Namespace: Provides isolation for user and group IDs, allowing containers to run with different privileges than the host system.
- Cgroup Namespace: Isolates control groups, allowing containers to have their own resource limits and priorities.
We are going to use the unshare command to create new namespaces. The unshare command runs a program with some namespaces unshared from the parent process, so the new process gets its own isolated view of the resources covered by those namespaces.
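Before wiring everything together, a quick standalone experiment shows what PID and UTS isolation look like (this snippet is independent of the rest of the scripts):
# Run a shell with its own PID and UTS namespaces (requires root)
sudo unshare --pid --fork --mount-proc --uts bash -c '
    hostname demo-ns   # only changes the hostname inside this namespace
    hostname
    ps aux             # shows just this shell and ps, with PID 1 at the top
'
With that working, the launcher below pulls the namespaces together with the veth pair, cgroup, and rootfs from the previous sections: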
#!/usr/bin/env bash
set -euo pipefail

CMD=${1:-"/bin/sh"}
CONTAINER_UID=1000             # User ID inside the container
CONTAINER_GID=1000             # Group ID inside the container
HOSTNAME="my-container"
ROOTFS="/tmp/container-rootfs" # Must be a valid minimal filesystem with /bin/sh
# VETH_CONT, IP_CONT, IP_HOST, and CGROUP_NAME come from the network and
# cgroup sections above (all the pieces end up in one script).

launch_container() {
    echo "[+] Launching container..."
    # Note: we stay in the host's user namespace; --user --map-root-user would
    # leave UID $CONTAINER_UID unmapped and break chroot --userspec below.
    unshare \
        --mount --uts --ipc --pid --net --fork \
        --mount-proc \
        bash -c "
            ip link set lo up
            # Wait until the host moves our end of the veth pair into this namespace
            while ! ip link show '$VETH_CONT' >/dev/null 2>&1; do sleep 0.1; done
            # Rename while the link is still down, then configure and bring it up
            ip link set '$VETH_CONT' name eth0
            ip addr add '$IP_CONT/24' dev eth0
            ip link set eth0 up
            ip route add default via '$IP_HOST'
            hostname '$HOSTNAME'
            # Mount proc inside the new rootfs so process tools work after the chroot
            mount -t proc proc '$ROOTFS/proc'
            echo \"[*] Inside container as UID \$(id -u)...\"
            exec chroot --userspec=$CONTAINER_UID:$CONTAINER_GID '$ROOTFS' $CMD
        " &
    CONTAINER_PID=$!
    echo "[+] Container PID: $CONTAINER_PID"

    # Crude but simple: give unshare a moment to create the namespaces,
    # then hand the container end of the veth pair over to its netns
    sleep 0.5
    echo "[+] Moving $VETH_CONT into the container's network namespace..."
    ip link set "$VETH_CONT" netns "$CONTAINER_PID"

    echo "[+] Attaching container to cgroup..."
    echo $CONTAINER_PID > /sys/fs/cgroup/cpu/$CGROUP_NAME/tasks
    echo $CONTAINER_PID > /sys/fs/cgroup/memory/$CGROUP_NAME/tasks
    echo "[+] Container launched successfully."

    wait "$CONTAINER_PID"
    echo "[+] Container exited."
}
Setting up the rootfs using busybox
To create a minimal filesystem for the container, we can use busybox, which provides a lightweight set of Unix utilities. This will allow us to have a basic environment with essential commands like sh, ls, cp, etc.
#!/usr/bin/env bash
set -euo pipefail

ROOTFS="/tmp/container-rootfs"

setup_rootfs() {
    echo "[+] Setting up root filesystem..."
    mkdir -p "$ROOTFS/bin"
    cd "$ROOTFS/bin"
    # Download a statically linked busybox binary
    wget -q https://busybox.net/downloads/binaries/busybox-x86_64 -O busybox
    chmod +x busybox
    # Create symlinks for common commands so /bin/sh, /bin/ls, etc. exist
    for cmd in sh ls cp mv rm mkdir; do
        ln -sf busybox "$cmd"
    done
    # Create the rest of the directory skeleton
    cd "$ROOTFS"
    mkdir -p proc sys dev etc tmp var run
    echo "[+] Root filesystem setup complete."
}
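All of these pieces are meant to live in a single mini-container.sh. A plausible main section, assuming the functions from the sections above are defined in that one file, could look like this:
# === main (sketch) ===
# Assumes setup_rootfs, setup_cgroup, setup_bridge, setup_veth, setup_nat,
# setup_firewall, launch_container, and the cleanup functions are all defined above.
trap 'cleanup_network; cleanup_cgroup; cleanup' EXIT

setup_rootfs
setup_cgroup
setup_bridge
setup_veth
setup_nat
setup_firewall
launch_container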
Running the Container
To run the container, use the assembled mini-container.sh script. It sets up the cgroup, network, and namespaces, and then launches a shell inside the container. Root privileges are required:
sudo ./mini-container.sh /bin/sh
You will get:
- A new shell prompt inside the container.
- The container will have its own isolated network interface, process tree, and filesystem.
- You can run commands like ls, cp, etc., inside the container, and they will operate within the isolated environment.
- You can exit the container by typing exit, which will terminate the container process and clean up the resources.
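A few quick checks from inside the container shell illustrate the isolation, assuming the prebuilt busybox binary includes the hostname, ps, and ip applets (only a handful of symlinks were created, so the applets are invoked explicitly):
busybox hostname       # UTS namespace: reports my-container, not the host
busybox ps             # PID namespace: only the container's own processes
busybox ip addr        # network namespace: lo plus eth0 with 192.168.50.2
ls /                   # the busybox rootfs: bin, proc, sys, dev, etc, ...
exit                   # leave the container; the cleanup traps run on the host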
Recap
In this article, we have built a simple container from scratch using basic shell commands and Linux kernel features. We have covered the following key concepts:
- unshare: to create namespaces
- cgroups: to limit CPU and memory
- veth + bridge: to provide networking with NAT
- chroot: to isolate the filesystem
- chroot --userspec: to simulate an unprivileged container
More complete container runtimes add features like image layering, storage drivers, and seccomp or AppArmor profiles for security, along with orchestration tools like Kubernetes for managing containerized applications at scale.