Learn how to build a container from scratch, understanding the core concepts and components involved in containerization.

Building a Container from Scratch Using Shell Commands


“Container” is a colloquial term for a lightweight, portable, and self-sufficient unit that runs software in a consistent environment. It relies on Linux kernel features such as namespaces and cgroups to isolate processes and manage resources, and it packages an application together with its dependencies in a declarative, reproducible way so that it is portable across different environments.

Key Technologies Used in Containers

The container runtime manages the lifecycle of containers, including starting, stopping, and monitoring them. Behind the scenes, it uses the following Linux kernel features to create and execute containers.

  • Cgroups: Control groups limit and prioritize resource usage (CPU, memory, disk I/O) for groups of processes, ensuring that no container consumes more than its allocated share or monopolizes system resources. We are going to use cgroupfs to manage cgroups.

  • Namespaces: These provide isolation for processes, ensuring that they cannot see or interact with processes in other containers or the host system. Namespaces create separate views of system resources, such as process IDs, network interfaces, and file systems, for each container. We are going to use the unshare command to create new namespaces for the container.

  • Networking: Containers can have their own network stack, allowing them to communicate with each other and the outside world without interfering with the host system’s network configuration. We are going to use veth (virtual Ethernet) pairs to create isolated network interfaces for each container.

  • OverlayFS: To keep containers lightweight, runtimes use a layered filesystem, which allows multiple containers to share the same base image while keeping their own changes separate. This is typically implemented with OverlayFS or AUFS.
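
The scripts in this article do not set up a layered filesystem, so here is a minimal sketch of how an OverlayFS mount is put together. The function name overlay_opts and the directories under /tmp/ovl are illustrative assumptions, and the mount itself needs root.

```shell
# Build the option string for an OverlayFS mount.
#   $1 lower: read-only base layer (the shared "image")
#   $2 upper: writable layer that receives the container's changes
#   $3 work:  scratch directory, on the same filesystem as upper
overlay_opts() {
  printf 'lowerdir=%s,upperdir=%s,workdir=%s\n' "$1" "$2" "$3"
}

# Usage (as root), with hypothetical directories:
#   mkdir -p /tmp/ovl/lower /tmp/ovl/upper /tmp/ovl/work /tmp/ovl/merged
#   mount -t overlay overlay \
#     -o "$(overlay_opts /tmp/ovl/lower /tmp/ovl/upper /tmp/ovl/work)" \
#     /tmp/ovl/merged
```

Files written under the merged directory land in the upper layer, so several containers could share one lower directory while each keeps its own upper and work directories.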

Cgroups

Control groups (cgroups) are a Linux kernel feature that allows you to allocate and limit resources (CPU, memory, disk I/O, etc.) for a group of processes. Cgroups provide a way to manage and monitor resource usage, ensuring that no single process or container can consume all available resources on the system.

For simplicity, we are going to use the control group filesystem (cgroupfs) to manage cgroups. Cgroupfs is a virtual filesystem that exposes cgroups as directories and files, so they can be created and configured with ordinary shell commands.
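
Cgroupfs comes in two layouts, and the paths differ between them, so it helps to know which one a host uses. The sketch below is one way to check; the function name cgroup_version is an assumption made for illustration.

```shell
# Report which cgroup layout is mounted at the given root.
# The unified (v2) hierarchy has a cgroup.controllers file at its root;
# the legacy (v1) layout has one directory per controller instead.
cgroup_version() {
  root="${1:-/sys/fs/cgroup}"
  if [ -f "$root/cgroup.controllers" ]; then
    echo "v2"
  else
    echo "v1"
  fi
}

cgroup_version   # prints v1 or v2 for the current host
```

The script in this article writes to v1 files (cpu.shares, memory.limit_in_bytes, tasks). On a v2 host the corresponding files are cpu.weight, memory.max, and cgroup.procs, so the paths would need adjusting.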

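The following is a simple script that creates a cgroup and limits the CPU and memory usage of a process. It assumes the cgroup v1 layout (one directory per controller under /sys/fs/cgroup) and must run as root: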

#!/usr/bin/env bash
# Assumes cgroup v1 (per-controller directories); must run as root.
set -e

CGROUP_NAME="my_cgroup"
CPU_SHARES="256" # Relative CPU weight (the default is 1024)
MEM_LIMIT="100M" # 100 MB memory limit

setup_cgroup() {
  echo "[+] Setting up cgroup..."
  for ctrl in cpu memory; do
    mkdir -p "/sys/fs/cgroup/$ctrl/$CGROUP_NAME"
    echo $$ > "/sys/fs/cgroup/$ctrl/$CGROUP_NAME/tasks"
  done
  echo "$CPU_SHARES" > "/sys/fs/cgroup/cpu/$CGROUP_NAME/cpu.shares"
  echo "$MEM_LIMIT" > "/sys/fs/cgroup/memory/$CGROUP_NAME/memory.limit_in_bytes"
}

cleanup_cgroup() {
  echo "[+] Cleaning up cgroup..."
  for ctrl in cpu memory; do
    # A cgroup can only be removed once it is empty, so move this shell
    # back to the root cgroup first.
    echo $$ > "/sys/fs/cgroup/$ctrl/tasks" || true
    rmdir "/sys/fs/cgroup/$ctrl/$CGROUP_NAME" || true
  done
}

run_in_cgroup() {
  echo "[+] Running command in cgroup..."
  # Move this shell into the cgroup; child processes inherit membership.
  echo $$ > "/sys/fs/cgroup/cpu/$CGROUP_NAME/tasks"
  echo $$ > "/sys/fs/cgroup/memory/$CGROUP_NAME/tasks"
  "$@" # Run the command in the cgroup
}

trap cleanup_cgroup EXIT # Ensure cleanup on exit

# == main ===
setup_cgroup
run_in_cgroup "$@"
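
The 256 shares configured in the script above are a relative weight, not a hard cap. They only matter when several cgroups compete for the CPU, and the resulting fraction depends on what the siblings are set to. A small worked example, with the function name cpu_share_percent chosen for illustration:

```shell
# Percentage of CPU a cgroup receives when every sibling is busy:
#   own_shares / (own_shares + sibling_shares)
# Integer arithmetic, rounded down.
cpu_share_percent() {
  own="$1"
  siblings="$2"   # sum of cpu.shares over all competing sibling cgroups
  echo $(( own * 100 / (own + siblings) ))
}

cpu_share_percent 256 1024   # one sibling at the default 1024: prints 20
cpu_share_percent 256 768    # siblings totalling 768: prints 25
```

When the system is otherwise idle, the cgroup can still use all available CPU time regardless of its share value.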

Network isolation

Network isolation is a crucial aspect of containerization: each container gets its own network stack, including IP addresses, routing tables, and network interfaces, so containers can communicate with each other and the outside world without interfering with the host system’s network configuration. In this example, we will use veth (virtual Ethernet) pairs to create an isolated network interface for the container. A veth pair acts like a virtual cable: one end stays on the host, and the other end is handed to the container.

We also need to set up a network bridge to connect the container’s network interface to the host system’s network. A bridge acts as a virtual switch, allowing containers to communicate with each other and the host system.

#!/usr/bin/env bash
set -e

BRIDGE="br0"
VETH_HOST="veth-host0"
VETH_CONT="veth-cont0"
IP_HOST="192.168.50.1"
IP_CONT="192.168.50.2"
SUBNET="192.168.50.0/24"

setup_bridge() {
  echo "[+] Setting up network bridge..."
  ip link add "$BRIDGE" type bridge || true
  ip addr add "$IP_HOST/24" dev "$BRIDGE" || true
  ip link set "$BRIDGE" up
}

setup_veth() {
  echo "[+] Creating veth pair..."
  ip link add "$VETH_HOST" type veth peer name "$VETH_CONT"
  ip link set "$VETH_HOST" up
  ip link set "$VETH_HOST" master "$BRIDGE"
}

cleanup_network() {
  echo "[+] Cleaning up network..."
  ip link del "$VETH_HOST" || true
  ip link del "$BRIDGE" || true
}
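
The script defines SUBNET but never checks that IP_HOST and IP_CONT actually fall inside it. As a sanity check, the following sketch tests whether an IPv4 address belongs to a CIDR block using shell arithmetic. The helper names ip_to_int and in_subnet are assumptions, not part of the original script.

```shell
# Convert a dotted-quad IPv4 address to a 32-bit integer.
ip_to_int() {
  old_ifs=$IFS
  IFS=.
  set -- $1          # split the address into its four octets
  IFS=$old_ifs
  echo $(( ($1 << 24) + ($2 << 16) + ($3 << 8) + $4 ))
}

# Return 0 if address $1 lies inside the CIDR block $2 (e.g. 192.168.50.0/24).
in_subnet() {
  addr=$(ip_to_int "$1")
  net=$(ip_to_int "${2%/*}")
  bits="${2#*/}"
  mask=$(( (0xFFFFFFFF << (32 - bits)) & 0xFFFFFFFF ))
  [ $(( addr & mask )) -eq $(( net & mask )) ]
}

for ip in 192.168.50.1 192.168.50.2; do
  in_subnet "$ip" 192.168.50.0/24 && echo "$ip is inside the subnet"
done
```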

For complex access control, routing, and firewall rules, you can use tools like iptables or nftables to manage network traffic between containers and the host system. Below is an example, using nftables, of a basic firewall and NAT (Network Address Translation) setup that lets containers reach the internet through the host while dropping other unsolicited inbound traffic.

#!/usr/bin/env bash
set -e

# BRIDGE and SUBNET are the values defined in the network script above.

setup_nat() {
  # Enable IP forwarding
  echo "[+] Enabling IP forwarding..."
  sysctl -w net.ipv4.ip_forward=1
  echo "[+] Setting up NAT..."
  nft add table ip nat || true
  # Quote the chain definition so the shell does not interpret the braces and semicolon
  nft add chain ip nat POSTROUTING '{ type nat hook postrouting priority 100 ; }'
  # Masquerade container traffic that leaves through any interface other than the bridge
  nft add rule ip nat POSTROUTING ip saddr "$SUBNET" oifname != "$BRIDGE" masquerade
}

setup_firewall() {
  echo "[+] Setting up firewall rules..."
  nft add table ip filter || true
  nft add chain ip filter input '{ type filter hook input priority 0 ; }'
  nft add chain ip filter forward '{ type filter hook forward priority 0 ; }'
  nft add chain ip filter output '{ type filter hook output priority 0 ; }'

  # Allow loopback traffic and established/related connections
  nft add rule ip filter input iifname lo accept
  nft add rule ip filter input ct state established,related accept
  nft add rule ip filter forward ct state established,related accept

  # Allow traffic from the bridge interface
  nft add rule ip filter input iifname "$BRIDGE" accept
  nft add rule ip filter forward iifname "$BRIDGE" accept

  # Drop all other incoming connections
  # (this also blocks new inbound connections to the host itself, such as SSH)
  nft add rule ip filter input drop
}

cleanup() {
  echo "[+] Cleaning up firewall rules..."
  nft flush table ip nat || true
  nft flush table ip filter || true
  # Assumes forwarding was disabled before the script ran
  sysctl -w net.ipv4.ip_forward=0
}
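
An alternative to issuing one nft command per rule is to generate the ruleset as text and load it in a single transaction, which also sidesteps shell quoting of braces and semicolons. The sketch below prints a NAT table for a given bridge and subnet; the function name nat_ruleset and the chain name postrouting are illustrative assumptions.

```shell
# Print an nftables ruleset that masquerades traffic from the container
# subnet when it leaves through any interface other than the bridge.
#   $1: bridge name, $2: container subnet in CIDR form
nat_ruleset() {
  cat <<EOF
table ip nat {
  chain postrouting {
    type nat hook postrouting priority 100; policy accept;
    ip saddr $2 oifname != "$1" masquerade
  }
}
EOF
}

# Usage (as root): load the whole ruleset at once.
#   nat_ruleset br0 192.168.50.0/24 | nft -f -
```

Printing the ruleset first also makes it easy to review what will be applied before touching the host's firewall.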

Namespace

A namespace is a Linux kernel feature that partitions system resources so that a set of processes sees its own isolated view of them. This is similar to a jail environment: each containerized application runs in its own confined space, unaware of other containers or the host system.

There are different kinds of namespaces, each providing isolation for a different system resource:

  • PID Namespace: Isolates process IDs, allowing containers to have their own process trees.
  • Network Namespace: Provides isolated network interfaces, IP addresses, and routing tables for each container.
  • Mount Namespace: Allows containers to have their own file system views, enabling them to have different root directories and mount points.
  • UTS Namespace: Isolates hostname and domain name, allowing containers to have their own unique identifiers.
  • IPC Namespace: Isolates inter-process communication resources, such as message queues and semaphores, for each container.
  • User Namespace: Provides isolation for user and group IDs, allowing containers to run with different privileges than the host system.
  • Cgroup Namespace: Virtualizes the view of the cgroup hierarchy, so a container sees its own cgroup as the root. The resource limits themselves are still set through cgroups.
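
To see which namespaces a process currently belongs to, you can read the symlinks under /proc/<pid>/ns. The following is a minimal sketch; the function name list_namespaces is an assumption.

```shell
# Print "<type> [<id>]" for every namespace the given process belongs to.
# Each entry in /proc/<pid>/ns is a symlink such as "net:[4026531840]";
# two processes share a namespace exactly when the ids match.
list_namespaces() {
  pid="${1:-$$}"
  for link in /proc/"$pid"/ns/*; do
    target=$(readlink "$link") || continue
    echo "${link##*/} ${target#*:}"
  done
}

list_namespaces "$$"
```

Running it once on the host and once inside the container should show different ids for every namespace type that was unshared.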

We are going to use the unshare command to create new namespaces. unshare runs a program with some namespaces unshared from the parent process, so the new process has its own isolated view of the system resources those namespaces cover.

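#!/usr/bin/env bash
set -euo pipefail

# VETH_CONT, IP_CONT, IP_HOST and CGROUP_NAME come from the cgroup and
# network scripts above; this function is meant to live in the same file.

CMD=${1:-"/bin/sh"}
CONTAINER_UID=1000 # User ID inside the container
CONTAINER_GID=1000 # Group ID inside the container
HOSTNAME="my-container"
ROOTFS="/tmp/container-rootfs" # Must be a valid minimal filesystem with /bin/sh

launch_container() {
  echo "[+] Launching container..."
  # No --user here: the script already runs as root, and with only root
  # mapped (--map-root-user) the switch to UID 1000 below would fail.
  # "<&0" keeps the terminal as stdin; background jobs otherwise read /dev/null.
  unshare \
    --mount --uts --ipc --pid --net --fork \
    --mount-proc \
    bash -c "
      # Wait until the host has moved the veth end into this namespace
      until ip link show '$VETH_CONT' >/dev/null 2>&1; do sleep 0.1; done

      # Setup veth inside container namespace (rename before bringing it up)
      ip link set lo up
      ip link set '$VETH_CONT' name eth0
      ip addr add '$IP_CONT/24' dev eth0
      ip link set eth0 up
      ip route add default via '$IP_HOST'
      hostname '$HOSTNAME'

      # Mount proc in chroot
      mount -t proc proc '$ROOTFS/proc'

      # Drop to user in chroot
      echo '[*] Starting $CMD as UID $CONTAINER_UID...'
      exec chroot --userspec=$CONTAINER_UID:$CONTAINER_GID '$ROOTFS' $CMD
    " <&0 &
  CONTAINER_PID=$!
  echo "[+] Container PID: $CONTAINER_PID"

  echo "[+] Moving $VETH_CONT into the container's network namespace..."
  # Wait for unshare to enter its new network namespace, then move the veth end
  until [ "$(readlink "/proc/$CONTAINER_PID/ns/net")" != "$(readlink /proc/$$/ns/net)" ]; do
    sleep 0.1
  done
  ip link set "$VETH_CONT" netns "$CONTAINER_PID"

  echo "[+] Attaching container to cgroup..."
  echo $CONTAINER_PID > /sys/fs/cgroup/cpu/$CGROUP_NAME/tasks
  echo $CONTAINER_PID > /sys/fs/cgroup/memory/$CGROUP_NAME/tasks

  echo "[+] Container launched successfully."
  wait "$CONTAINER_PID"
  echo "[+] Container exited."
}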

Setting up the rootfs using busybox

To create a minimal filesystem for the container, we can use busybox, which provides a lightweight set of Unix utilities. This will allow us to have a basic environment with essential commands like sh, ls, cp, etc.

#!/usr/bin/env bash
set -euo pipefail

ROOTFS="/tmp/container-rootfs"
setup_rootfs() {
  echo "[+] Setting up root filesystem..."
  # Create necessary directories (brace expansion only works outside quotes)
  mkdir -p "$ROOTFS"/{bin,proc,sys,dev,etc,tmp,var,run}
  cd "$ROOTFS/bin"

  # Download busybox into /bin; it must be a statically linked build
  # (adjust the URL if the binary has moved)
  wget -q https://busybox.net/downloads/binaries/busybox-x86_64 -O busybox
  chmod +x busybox

  # Create symlinks for common commands, so /bin/sh exists inside the chroot
  for cmd in sh ls cp mv rm mkdir; do
    ln -sf busybox "$cmd"
  done

  echo "[+] Root filesystem setup complete."
}
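
Because chroot fails with a fairly opaque error when the shell is missing, it can help to validate the root filesystem before launching. A minimal sketch, with the function name check_rootfs chosen for illustration:

```shell
# Return 0 if the directory looks usable as a container root:
# it needs an executable bin/sh and a proc directory to mount /proc on.
check_rootfs() {
  rootfs="$1"
  if [ ! -x "$rootfs/bin/sh" ]; then
    echo "missing executable $rootfs/bin/sh" >&2
    return 1
  fi
  if [ ! -d "$rootfs/proc" ]; then
    echo "missing directory $rootfs/proc" >&2
    return 1
  fi
}

# Usage:
#   check_rootfs /tmp/container-rootfs || exit 1
```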

Running the Container

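To run the container, combine the functions from the previous sections into a single script, mini-container.sh, that sets up the cgroup, network, and root filesystem and then launches a shell inside the new namespaces. The script must run as root, because it creates cgroups and network devices.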

./mini-container.sh /bin/sh

Once it starts:

  • You get a new shell prompt inside the container.
  • The container has its own isolated network interface, process tree, and filesystem.
  • Commands such as ls and cp operate within the isolated environment.
  • Typing exit terminates the container process and cleans up the resources.
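
The sections above define three separate cleanup functions (cleanup_cgroup, cleanup_network, cleanup), but a shell keeps only one handler for the EXIT trap, and a later trap call replaces the earlier one. One way to combine them in a single script is a small cleanup stack; the names defer and run_cleanup below are assumptions made for this sketch.

```shell
# Names of cleanup functions, most recently registered first.
CLEANUP_STEPS=""

# Register a cleanup function; later registrations run earlier.
defer() {
  CLEANUP_STEPS="$1 $CLEANUP_STEPS"
}

# Run every registered step once, ignoring individual failures.
run_cleanup() {
  steps=$CLEANUP_STEPS
  CLEANUP_STEPS=""
  for step in $steps; do
    "$step" || true
  done
}

trap run_cleanup EXIT

# Usage inside the combined script (function names from the sections above):
#   setup_cgroup;  defer cleanup_cgroup
#   setup_bridge;  defer cleanup_network
#   setup_nat;     defer cleanup
```

Running the steps in reverse order means resources are torn down in the opposite order to their creation, so the firewall rules go first and the cgroup last.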

Recap

In this article, we have built a simple container from scratch using basic shell commands and Linux kernel features. We have covered the following key concepts:

  • unshare: to create namespaces
  • cgroups: to limit CPU and memory
  • veth + bridge: to connect the container to the network, with NAT for outbound traffic
  • chroot: to isolate the filesystem
  • chroot --userspec: to run the command as an unprivileged user

More advanced container runtimes add features such as image layering, storage drivers, and seccomp or AppArmor profiles for security, while orchestration tools like Kubernetes manage containerized applications at scale.