Docker Tutorial Part 3: Dockerfile

Han included in Docker DevOps Programming

2025-09-13

This is part of my Docker Basics series — introductory guides to help you get started with Docker, learn key concepts, and build your skills step by step.

Part 1: Understanding Container
Part 2: Basic Commands
Part 3: Dockerfile
Part 4: Networks

Basic Commands

A Dockerfile is essentially a text file with a predetermined structure that contains a set of instructions for building a Docker image. The instructions in the Dockerfile specify what base image to start with (for example, Ubuntu 20.04), what software to install, and how to configure the image. The purpose of a Dockerfile is to automate the process of building a Docker image so that the image can be easily reproduced and distributed.

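A Dockerfile is a list of instructions, normally one per line, that Docker’s builder (BuildKit in current releases) executes in order to build an image. Instructions that change the filesystem (RUN, COPY, and ADD) each add a new layer, while the other instructions only record metadata. The layers are stacked with a union filesystem, and the resulting image is the union of all of them. Because a file added in one layer still takes up space even if a later layer deletes it, combining related steps into a single instruction usually keeps the image smaller.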
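As a rough illustration of how layers combine, here is a toy Python model. It is a sketch under simplifying assumptions, not how Docker is implemented: each layer is a dictionary of made-up paths and sizes, and a value of None stands for a deletion.

```python
# Toy model: each layer maps a path to a file size in MB; None marks a deletion.
# The paths and sizes are invented numbers for illustration only.
layers = [
    {"/bin/sh": 1, "/usr/lib/libc.so": 2},      # base image layer
    {"/tmp/pkg.tar": 50, "/usr/bin/tool": 5},   # RUN: download and install
    {"/tmp/pkg.tar": None},                     # RUN: delete the download
]


def union(layers):
    """Overlay layers in order: later layers override or delete earlier files."""
    view = {}
    for layer in layers:
        for path, size in layer.items():
            if size is None:
                view.pop(path, None)
            else:
                view[path] = size
    return view


def image_size(layers):
    """Every layer is stored, so the image size is the sum of all layer contents."""
    return sum(size for layer in layers for size in layer.values() if size is not None)


visible = union(layers)
print(sorted(visible))        # files the container actually sees
print(sum(visible.values()))  # 8: size of the visible files
print(image_size(layers))     # 58: size stored in the image
```

In this model the deleted archive no longer appears in the container, yet it still counts toward the stored size. Doing the download, install, and cleanup in one RUN instruction would keep the archive out of every layer.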

The most frequently used commands in a Dockerfile are the following:

FROM
COPY
ADD
EXPOSE
CMD
ENTRYPOINT
RUN
LABEL
ENV
ARG
VOLUME
USER
WORKDIR

FROM

A Dockerfile starts with a FROM command, which specifies the base image to start with:

FROM ubuntu:20.04

You can also name this build stage using the AS keyword followed by a custom name:

FROM ubuntu:20.04 AS builder1

By default, docker build pulls base images from the public Docker Hub registry. To use a different or private registry, prefix the image name with the registry’s hostname.

COPY and ADD

The COPY command copies files or directories from the build context (usually your project directory on the host machine) into the image’s file system. Take the following example:

COPY . /var/www/html

You can also use the ADD command to add files or directories to your Docker image. ADD has additional functionality beyond COPY: it automatically extracts a local TAR archive into the destination, and if the source is a URL, it downloads the file from that URL (downloaded archives are not extracted). Both ADD and COPY accept a --chown option to set the ownership of the files in the destination.

In general, it is recommended to use COPY in most cases, and only use ADD when the additional functionality it provides is needed.

EXPOSE

The EXPOSE command in a Dockerfile informs Docker that the container listens on the specified network ports at runtime. It does not actually publish the ports. It is used to provide information to the user about which ports are intended to be published by the container.

For example, if a container runs a web server on port 80, you would include the following line in your Dockerfile:

EXPOSE 80

You can specify whether the port listens on TCP or UDP – after specifying the port number, add a slash and a TCP or UDP keyword (for example, EXPOSE 80/udp). The default is TCP if you specify only a port number.

To make ports reachable from the host, publish them with the -p or --publish option of docker run:

docker run -p 8080:80 thedockerimagename:tag

This will map port 8080 on the host machine to port 80 in the container so that any incoming traffic on port 8080 will be forwarded to the web server running in the container on port 80.

You can publish any container port with -p regardless of whether EXPOSE lists it, and you can leave an exposed port unpublished. EXPOSE only documents which ports the image is meant to serve.

ENTRYPOINT and CMD

The ENTRYPOINT instruction defines the command that is always executed when the container starts. It essentially turns the container into an executable that behaves like a binary or script. Unlike CMD, which is fully replaced by any arguments given at runtime, ENTRYPOINT stays in place (unless you explicitly override it with the --entrypoint flag of docker run), and arguments you pass with docker run are simply appended to it.

You usually use ENTRYPOINT when you want your container to act like a single-purpose tool.

ENTRYPOINT ["curl"]

Running docker run mycurl https://example.com executes curl https://example.com:
- docker run: starts a new container.
- mycurl: the image name.
- https://example.com: arguments passed to the container.

Docker does the following internally:

Looks at the image mycurl.
Sees that the ENTRYPOINT is set to ["curl"].
Appends your command-line arguments (https://example.com) to the ENTRYPOINT.
Runs the resulting command: curl https://example.com

If you want to provide default arguments that can be overridden, you combine it with CMD.

ENTRYPOINT ["python"]
CMD ["app.py", "--debug"]

Running docker run myapp: runs python app.py --debug
Running docker run myapp server.py: runs python server.py (the argument replaces CMD)

Meanwhile, if you only use CMD, it’s fully replaceable at runtime.

CMD ["nginx", "-g", "daemon off;"]

Running docker run webserver: runs nginx
Running docker run webserver bash: runs bash instead (overrides CMD)

Rule of thumb:

Use ENTRYPOINT if the container should always execute a specific binary.
Use CMD for providing defaults that the user might want to override.
Combine both if you want an executable with flexible arguments.
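The rules above can be sketched as a small Python function. This is a simplified model that covers only the exec (JSON array) form of ENTRYPOINT and CMD. It ignores the shell form and the --entrypoint flag, and the function name resolve_command is made up for this illustration.

```python
def resolve_command(entrypoint, cmd, run_args):
    """Return the argv a container starts with (exec-form ENTRYPOINT and CMD only).

    entrypoint and cmd are the lists from the Dockerfile (or None when unset);
    run_args are the words typed after the image name in `docker run`.
    """
    # Arguments given to `docker run` replace CMD entirely.
    tail = list(run_args) if run_args else list(cmd or [])
    # ENTRYPOINT, when set, always comes first.
    return list(entrypoint or []) + tail


# ENTRYPOINT only: arguments are appended.
print(resolve_command(["curl"], None, ["https://example.com"]))
# ENTRYPOINT + CMD: CMD supplies the default arguments...
print(resolve_command(["python"], ["app.py", "--debug"], []))
# ...and runtime arguments replace them.
print(resolve_command(["python"], ["app.py", "--debug"], ["server.py"]))
# CMD only: runtime arguments replace the whole command.
print(resolve_command(None, ["nginx", "-g", "daemon off;"], ["bash"]))
```

The four calls mirror the mycurl, myapp, and webserver examples from this section.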

RUN

The RUN instruction is executed at build time and creates a new layer in the image. It’s used to install software, configure the environment, or set up files. Once executed, its results are part of the final image.

For example, you can use the RUN command to install system dependencies and clean up:

# Install required tools, then remove the apt package lists to reduce image size
RUN apt-get update && \
    apt-get install -y git curl && \
    rm -rf /var/lib/apt/lists/*

You can use the RUN command to create a directory:

RUN mkdir -p /data/logs

or prepare a non-root user:

RUN useradd -ms /bin/bash appuser

The order of RUN instructions matters: they execute from top to bottom, and each one builds on the layers created before it. A package must therefore be installed in an earlier instruction than the one that uses it.

LABEL

The LABEL instruction attaches metadata to an image in the form of key-value pairs. This can include maintainer info, version, licensing, or anything meaningful to your workflow.

LABEL maintainer="Alice Lee <alice@example.com>" \
      version="1.4" \
      description="Lightweight API server image"

This metadata can later be queried:

docker inspect myimage | grep version

Labeling images makes them easier to manage, track, and audit in CI/CD pipelines.
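For scripted checks, parsing the JSON that docker inspect prints is more robust than grep. The sketch below assumes the usual shape of that output for an image (a JSON array whose first object has Config.Labels). The sample string is a trimmed, hand-written stand-in rather than real output, and read_labels is a hypothetical helper name.

```python
import json

# Trimmed, hand-written sample of what `docker inspect myimage` prints;
# real output contains many more fields.
sample_output = """
[
  {
    "Config": {
      "Labels": {
        "maintainer": "Alice Lee <alice@example.com>",
        "version": "1.4",
        "description": "Lightweight API server image"
      }
    }
  }
]
"""


def read_labels(inspect_json):
    """Return the label dictionary of the first object in `docker inspect` output."""
    first = json.loads(inspect_json)[0]
    # Labels is null in the JSON when an image has no labels, so fall back to {}.
    return first.get("Config", {}).get("Labels") or {}


labels = read_labels(sample_output)
print(labels["version"])  # 1.4
```

For a quick one-off lookup, docker inspect can also extract a single label itself, for example with --format '{{ index .Config.Labels "version" }}'.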

ENV and ARG

Both define variables, but their lifetimes differ:

ARG: available only at build time. Used to pass values when running docker build.
ENV: available to later instructions during the build and at runtime, as an environment variable accessible to all processes running inside the container.

ARG APP_VERSION=latest
RUN echo "Building version $APP_VERSION"

You can override it at build time:

docker build --build-arg APP_VERSION=2.0 .

The ENV command is used to set environment variables:

ENV PORT=8080
EXPOSE $PORT

Now any process inside the container can access $PORT.
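From the application’s side, reading that variable might look like the following minimal sketch. The helper name get_port and the fallback value are assumptions for illustration; they are not part of Docker or of any particular framework.

```python
import os


def get_port(environ=os.environ, default=8080):
    """Read the listening port from the PORT environment variable."""
    raw = environ.get("PORT")
    if raw is None:
        return default      # PORT not set: fall back to the default
    port = int(raw)         # raises ValueError if PORT is not a number
    if not 0 < port < 65536:
        raise ValueError(f"PORT out of range: {port}")
    return port


print(get_port({"PORT": "8080"}))  # value as set by ENV PORT=8080
print(get_port({}))                # no PORT set: uses the default
```

Because the value comes from the environment, it can be changed per container without rebuilding the image, for example with docker run -e PORT=9090.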

Rule of thumb:

Use ARG for build-time options (e.g., base image tag, dependency version).
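Use ENV for runtime configuration (e.g., service port, log level). Avoid storing secrets such as API keys in ENV or ARG, because the values are saved in the image metadata or build history.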

VOLUME

The VOLUME instruction in a Dockerfile defines a mount point where data can be stored outside the container’s writable layer.

This means that the data will persist even if the container is removed or rebuilt.
Volumes are managed by Docker, or you can bind them to host directories.
They’re especially important for databases, logs, or user-generated content.

When you run a container without volumes, everything is stored in its writable layer.

That writable layer is destroyed when you remove the container (docker rm).
That’s why data disappears.

When you declare a VOLUME in the Dockerfile (or with -v at docker run), Docker does something special:

It says: “Don’t keep this folder (/var/lib/mysql) inside the container’s temporary writable layer.”
Instead, it mounts an external storage location (on the host) to that folder.
The container sees /var/lib/mysql as normal, but under the hood, all files actually live in:

/var/lib/docker/volumes/<volume_id>/_data   (if Docker-managed volume)

/my/local/db   (if you used -v /my/local/db:/var/lib/mysql)

Imagine you’re building a Docker image for MySQL. You don’t want your database data to disappear every time the container is removed and recreated.

FROM mysql:8.0

# Define where MySQL stores its data
VOLUME ["/var/lib/mysql"]

EXPOSE 3306

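VOLUME ["/var/lib/mysql"] tells Docker that /var/lib/mysql should be stored outside the container layer system. If you start a container without mapping that path with -v, Docker creates an anonymous volume for it. You can list and inspect volumes with: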

docker volume ls
docker volume inspect <volume_name>
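To see what persistence means from the application’s point of view, consider this small hypothetical script. It keeps a start counter in a data directory; the file name starts.txt and the function record_start are made up for this sketch, and a temporary directory stands in for the mounted volume.

```python
import os
import tempfile


def record_start(data_dir):
    """Increment and return a start counter stored in data_dir/starts.txt."""
    path = os.path.join(data_dir, "starts.txt")
    count = 0
    if os.path.exists(path):
        with open(path) as f:
            count = int(f.read().strip() or 0)
    count += 1
    with open(path, "w") as f:
        f.write(str(count))
    return count


# Stand-in for a mounted volume: a directory that outlives each "run".
with tempfile.TemporaryDirectory() as volume:
    print(record_start(volume))  # 1: first container start
    print(record_start(volume))  # 2: a later start sees the earlier data
```

If data_dir lived in the container’s writable layer instead, a freshly created container would start again from 1. Pointing it at a path backed by a volume is what lets the count survive removal of the container.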

USER

By default, containers run as root, which is risky. The USER instruction sets a non-root user for better security.

If there’s a Docker or kernel vulnerability, a process running as root inside the container could potentially escape and gain root access on the host.
This is the main worry in multi-tenant environments (like shared servers or Kubernetes clusters).

RUN useradd -ms /bin/bash appuser
USER appuser

Now every process inside the container runs as appuser. The useradd flags used above mean:
-m: create the user’s home directory (e.g., /home/appuser).
-s /bin/bash: set the user’s login shell to /bin/bash.

You can still override the user at runtime:

docker run --user root myimage

WORKDIR

The WORKDIR instruction sets the current working directory inside the container. Every subsequent RUN, CMD, ENTRYPOINT, COPY, or ADD will be executed relative to this directory.

You can use the WORKDIR command to set the working directory to /usr/local/app:

WORKDIR /usr/local/app

Relative paths in the instructions that follow are resolved against the working directory:

WORKDIR /usr/src/app
COPY . .
RUN pip install -r requirements.txt

COPY . .: copies files into /usr/src/app
RUN pip install -r requirements.txt: runs inside /usr/src/app

The WORKDIR can be changed multiple times during an image build.

FROM ubuntu:22.04

# First working directory
WORKDIR /app
RUN echo "Hello" > hello.txt

# Change to another directory
WORKDIR /app/subdir
RUN echo "World" > world.txt

# Copy main.py from the build context into the new WORKDIR (/app/subdir)
COPY main.py .

Writing Efficient Dockerfiles

Small base images

The full python image ships lots of build tools you usually don’t need in production. Using a smaller base cuts image size.

Recommended for most apps (good balance of size & compatibility):
```
# Debian-based image (comments must be on their own line in a Dockerfile)
FROM python:3.11-slim
```
Smallest footprint, but riskier (musl libc; manylinux wheels may not work and native builds can be painful):
```
FROM python:3.11-alpine
```

Tip: Alpine is great for pure-Python deps. If you need C extensions (NumPy, psycopg2, etc.), stick with a Debian-based slim image for painless wheel installs.

Run as a non-root user (security best practice)

Don’t run your app as root. Running your application as a non-root user reduces the potential impact if the container is ever compromised. Create a dedicated user and run the app under it.

Debian/Ubuntu (slim) variant:

FROM python:3.11-slim

# Create non-root user
RUN useradd -m appuser

WORKDIR /app

# 1) Copy only dependency file first (cache-friendly)
COPY requirements.txt .

# 2) Install deps (small image; faster rebuilds)
RUN pip install --no-cache-dir -r requirements.txt

# 3) Copy the rest of the app with correct ownership
COPY --chown=appuser:appuser . .

# Run as non-root
USER appuser

CMD ["python3", "app.py"]

useradd -m creates a regular user named appuser together with a home directory (/home/appuser).
COPY --chown=... sets ownership as files are copied, avoiding an extra chown layer.
USER appuser ensures your process runs without root privileges.

Reuse cached layers

Docker builds image layers from each instruction in your Dockerfile. If the input to a layer hasn’t changed, Docker reuses the cached layer instead of re-running it. Your application code changes often, but your dependencies (requirements) change rarely. So if you install deps in an earlier layer and copy your code later, most builds can reuse the heavy “install deps” layer and only re-run the quick “copy code” step.
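The caching rule described above can be modelled in a few lines of Python. This is a toy sketch, not Docker’s actual algorithm: it assumes a layer’s cache key is a hash of its parent’s key, the instruction text, and the contents of any files the instruction copies. The file contents and the build helper are invented for the example.

```python
import hashlib


def build(steps, files, cache):
    """Toy cache model: return the instructions that had to be re-run.

    steps: list of (instruction, inputs), where inputs are the file names the
           instruction reads from the build context (empty for RUN).
    files: dict mapping file name to its current content.
    cache: set of layer keys remembered from earlier builds (updated in place).
    """
    rerun = []
    parent = ""
    for instruction, inputs in steps:
        h = hashlib.sha256()
        h.update(parent.encode())            # a layer depends on everything above it
        h.update(instruction.encode())
        for name in inputs:
            h.update(files[name].encode())   # COPY depends on file contents
        key = h.hexdigest()
        if key not in cache:
            rerun.append(instruction)
            cache.add(key)
        parent = key
    return rerun


bad = [("COPY . .", ["requirements.txt", "app.py"]),
       ("RUN pip install -r requirements.txt", [])]
good = [("COPY requirements.txt .", ["requirements.txt"]),
        ("RUN pip install -r requirements.txt", []),
        ("COPY . .", ["requirements.txt", "app.py"])]

files_v1 = {"requirements.txt": "flask", "app.py": "print('v1')"}
files_v2 = {"requirements.txt": "flask", "app.py": "print('v2')"}  # only code changed

for name, steps in [("bad", bad), ("good", good)]:
    cache = set()
    build(steps, files_v1, cache)               # first build fills the cache
    print(name, build(steps, files_v2, cache))  # second build after a code change
```

In this model, the "bad" ordering re-runs both the copy and the pip install after a code change, while the "good" ordering re-runs only the final COPY.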

Bad Example:

FROM python:3.11-slim
WORKDIR /app

# ❌ Copies everything (your changing code!) first
COPY . .

# ❌ Now this runs every time your code changes
RUN pip install --no-cache-dir -r requirements.txt

CMD ["python3", "app.py"]

Any code change invalidates the cache at the COPY . . step, which comes before pip install, so dependencies are reinstalled on every build.

Good Example:

FROM python:3.11-slim
WORKDIR /app

# ✅ Copy only the dependency file first (rarely changes)
COPY requirements.txt .

# ✅ Install deps now; this layer is cached until requirements.txt changes
RUN pip install --no-cache-dir -r requirements.txt

# ✅ Copy your frequently changing app code last
COPY . .

CMD ["python3", "app.py"]

Dependency install is isolated in its own layer and only re-runs when requirements change.

--no-cache-dir is a pip option that tells pip not to keep a local download cache when installing packages. Normally, when you run pip install -r requirements.txt, pip:
- downloads wheels/source archives to a local cache (usually ~/.cache/pip),
- installs the packages from that cache.

That cache can speed up the next install because files are already downloaded, but inside an image layer it is rarely reused and only adds to the image size, so it is disabled here.

Multi-Stage Builds

Some Python packages need extra tools to compile, but your app won’t use those tools after it’s built. With multi-stage builds, you compile everything in a builder image, then move just the finished packages into a light “runtime” image. That keeps the final image small, quick to start, and easier to secure.

# Build stage
FROM python:3.11 AS builder

WORKDIR /build
COPY requirements.txt .

# Install build dependencies
RUN apt-get update && \
    apt-get install -y --no-install-recommends gcc libpq-dev && \
    pip wheel --no-cache-dir --wheel-dir /wheels -r requirements.txt

# Final stage
FROM python:3.11-slim

WORKDIR /app
# Copy only wheels from builder
COPY --from=builder /wheels /wheels
RUN pip install --no-cache-dir --no-index --find-links=/wheels /wheels/*

COPY . .

CMD ["python3", "app.py"]

Use a `.dockerignore` file

Before building, Docker sends the build context (the set of files and folders in your project directory) to the Docker engine at the start of docker build. Use .dockerignore to exclude junk (git files, venvs, caches) so builds are faster and unrelated file changes don’t invalidate cached COPY layers.

# Version control
.git/
.gitignore

# Python artifacts
__pycache__/
*.py[cod]
*$py.class
.pytest_cache/
.coverage

# Environments & secrets
.env
.venv

# Build outputs
build/
dist/
*.egg-info/

# Optional: tests and local tooling not needed in the image
tests/
.idea/
.vscode/
