Selecting the right version of Ubuntu is important. Deep learning software such as CUDA and NCCL provides prebuilt packages only for specific Ubuntu releases; not all Ubuntu versions are supported. In 2025, Ubuntu 22.04 and Ubuntu 20.04 are the safe choices.
After installing the OS, fully upgrade the kernel and packages. Otherwise, newly installed packages could conflict with outdated ones.
sudo apt update
sudo apt full-upgrade --yes
sudo apt autoremove --yes
sudo apt autoclean --yes
reboot
How to Install and Set Up the Fish Shell | by Saad Jamil | Medium
sudo apt-add-repository ppa:fish-shell/release-3
sudo apt-get update
sudo apt-get install fish
Launch fish automatically in Bash shell
Add this line to the end of ~/.bashrc:
# auto launch fish shell
fish
P.S. ~/.bashrc is executed whenever a new bash shell is launched. Note that all commands after this line will only be executed after you exit the fish shell, so make sure this line is at the end of ~/.bashrc.
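A slightly more defensive variant of the same idea (the guard variable name here is arbitrary, my own choice): it only launches fish for interactive shells and avoids relaunching it recursively if you open bash from within fish.

```shell
# auto launch fish shell, but only for interactive sessions and only once
if command -v fish >/dev/null 2>&1 && [[ $- == *i* ]] && [[ -z "$FISH_LAUNCHED" ]]; then
    export FISH_LAUNCHED=1
    fish
fi
```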
I really love Starship's Pastel Powerline preset (Pastel Powerline Preset | Starship). Install Starship:
curl -sS https://starship.rs/install.sh | sh
Add the following to the end of ~/.config/fish/config.fish:
# init starship
starship init fish | source
Apply the preset:
starship preset pastel-powerline -o ~/.config/starship.toml
Then open a new terminal.
P.S. If you want to add conda & Python information to the command prompt, download my starship.toml and use it to replace ~/.config/starship.toml.
Note that a Nerd Font is required. I personally prefer CaskaydiaCove Nerd Font (download). After installing it, set it as the font used in your terminal: CaskaydiaCove Nerd Font.
Download VS Code's Linux (.deb) version and then:
sudo apt install ./<file>.deb
sudo apt-get install git
Let's configure the default user name and email. Note that when you push commits to GitHub, the email is used to identify your GitHub account:
git config --global user.name <name>
git config --global user.email <email>
git config --global init.defaultBranch main
To authorize your operations on GitHub, you will also need to generate an SSH key:
ssh-keygen -t rsa -C "<email>"
Then you need to add it to your account: Settings > SSH and GPG keys > New SSH key. Fill in the title as you like and paste the content of the generated id_rsa.pub (NOT id_rsa!) as the key. The content of id_rsa.pub can be printed from the command line:
cat ~/.ssh/id_rsa.pub
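As an aside, ssh-keygen can also run fully non-interactively. This throwaway example (temporary paths and a demo comment string, not your real key) shows the public/private pair it produces without touching ~/.ssh:

```shell
# Generate a throwaway ed25519 key non-interactively (demo only; files go to a temp dir)
tmpdir=$(mktemp -d)
ssh-keygen -t ed25519 -C "demo@example.com" -f "$tmpdir/id_demo" -N "" -q
# The .pub half is what you paste into GitHub; the other file stays private
cat "$tmpdir/id_demo.pub"
```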
curl -sL "https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh" > "Miniconda3.sh"
bash Miniconda3.sh
P.S. If you enter the fish shell from bash, the previous commands only initialize conda for bash. To initialize conda for fish as well, run the following command in bash:
conda init fish
Install the newest Nvidia driver compatible with your GPU. You don't need to worry about its compatibility with CUDA, since the driver is designed to be backward-compatible:
Download The Official NVIDIA Drivers | NVIDIA
NVIDIA drivers installation - Ubuntu Server documentation
sudo apt install nvidia-driver-550
reboot
Verify:
lsmod | grep nvidia
nvidia-smi
P.S. If you have multiple GPUs installed, you can inspect their interconnect topology via:
nvidia-smi topo -m
conda create -n pytorch python=3.10
conda activate pytorch
pip3 install torch torchvision torchaudio
It will also install a bundled CUDA runtime, so you don't have to install CUDA yourself. However, commands like nvcc will not be available. To verify the installation:
python
>>> import torch
>>> device = 'cuda' if torch.cuda.is_available() else 'cpu'
>>> torch.rand(5, 3).to(device)
P.S. If you just want to use PyTorch with CUDA, as noted above, you don't need to install CUDA yourself. However, if you want to compile PyTorch yourself or write customized CUDA code to boost performance, you will need to install the CUDA Toolkit.
P.S. Run nvidia-smi to see the highest CUDA version your current driver supports.
wget https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/x86_64/cuda-ubuntu2204.pin
sudo mv cuda-ubuntu2204.pin /etc/apt/preferences.d/cuda-repository-pin-600
wget https://developer.download.nvidia.com/compute/cuda/12.4.0/local_installers/cuda-repo-ubuntu2204-12-4-local_12.4.0-550.54.14-1_amd64.deb
sudo dpkg -i cuda-repo-ubuntu2204-12-4-local_12.4.0-550.54.14-1_amd64.deb
sudo cp /var/cuda-repo-ubuntu2204-12-4-local/cuda-*-keyring.gpg /usr/share/keyrings/
sudo apt-get update
sudo apt-get -y install cuda-toolkit-12-4
Launch a new terminal and verify:
nvcc --version
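If nvcc is not found in a new terminal, NVIDIA's post-installation actions add the toolkit directories to your environment. A ~/.bashrc fragment for the default 12.4 install path (adjust the version to match yours):

```shell
# CUDA post-install environment (append to ~/.bashrc)
export PATH=/usr/local/cuda-12.4/bin${PATH:+:${PATH}}
export LD_LIBRARY_PATH=/usr/local/cuda-12.4/lib64${LD_LIBRARY_PATH:+:${LD_LIBRARY_PATH}}
```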
NCCL is for multi-node/multi-GPU communication. Select the appropriate version according to your CUDA version.
NVIDIA Collective Communications Library (NCCL) | NVIDIA Developer
Installation Guide | NVIDIA Deep Learning NCCL Documentation
wget https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/x86_64/cuda-keyring_1.1-1_all.deb
sudo dpkg -i cuda-keyring_1.1-1_all.deb
sudo apt-get update
sudo apt install libnccl2=2.26.5-1+cuda12.4 libnccl-dev=2.26.5-1+cuda12.4
Tests: NCCL itself ships only the libraries; NVIDIA's separate nccl-tests repository is the standard way to benchmark it.
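A sketch of building and running NVIDIA's nccl-tests suite (the repo URL and flags below follow its README; requires the CUDA toolkit and at least one GPU):

```shell
# Build NVIDIA's official NCCL benchmark suite
git clone https://github.com/NVIDIA/nccl-tests.git
cd nccl-tests
make
# All-reduce benchmark from 8 bytes to 128 MB; adjust -g to your GPU count
./build/all_reduce_perf -b 8 -e 128M -f 2 -g 2
```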
Uninstall all potentially conflicting packages:
for pkg in docker.io docker-doc docker-compose docker-compose-v2 podman-docker containerd runc; do sudo apt-get remove $pkg; done
P.S. Run this in bash shell instead of fish shell.
Set up Docker's apt repository:
# Add Docker's official GPG key:
sudo apt-get update
sudo apt-get install ca-certificates curl
sudo install -m 0755 -d /etc/apt/keyrings
sudo curl -fsSL https://download.docker.com/linux/ubuntu/gpg -o /etc/apt/keyrings/docker.asc
sudo chmod a+r /etc/apt/keyrings/docker.asc
# Add the repository to Apt sources:
# P.S. Bash shell needed.
echo \
"deb [arch=$(dpkg --print-architecture) signed-by=/etc/apt/keyrings/docker.asc] https://download.docker.com/linux/ubuntu \
$(. /etc/os-release && echo "${UBUNTU_CODENAME:-$VERSION_CODENAME}") stable" | \
sudo tee /etc/apt/sources.list.d/docker.list > /dev/null
sudo apt-get update
Install the Docker packages:
sudo apt-get install docker-ce docker-ce-cli containerd.io docker-buildx-plugin docker-compose-plugin
sudo docker run hello-world
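Optionally, per Docker's post-installation steps, you can drop the sudo prefix by adding yourself to the docker group (log out and back in for the group change to take effect):

```shell
# Allow running docker without sudo (re-login required afterwards)
sudo usermod -aG docker $USER
```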
Install the prerequisites:
sudo apt-get update && sudo apt-get install -y --no-install-recommends \
curl \
gnupg2
Configure the production repository:
curl -fsSL https://nvidia.github.io/libnvidia-container/gpgkey | sudo gpg --dearmor -o /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg \
&& curl -s -L https://nvidia.github.io/libnvidia-container/stable/deb/nvidia-container-toolkit.list | \
sed 's#deb https://#deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://#g' | \
sudo tee /etc/apt/sources.list.d/nvidia-container-toolkit.list
Run with bash to install Nvidia container toolkit:
sudo apt-get update
export NVIDIA_CONTAINER_TOOLKIT_VERSION=1.18.0-1
sudo apt-get install -y \
nvidia-container-toolkit=${NVIDIA_CONTAINER_TOOLKIT_VERSION} \
nvidia-container-toolkit-base=${NVIDIA_CONTAINER_TOOLKIT_VERSION} \
libnvidia-container-tools=${NVIDIA_CONTAINER_TOOLKIT_VERSION} \
libnvidia-container1=${NVIDIA_CONTAINER_TOOLKIT_VERSION}
Configure docker and restart its daemon:
sudo nvidia-ctk runtime configure --runtime=docker
sudo systemctl restart docker
Verification:
sudo docker run --rm -it --gpus all pytorch/pytorch:2.0.0-cuda11.7-cudnn8-devel bash
P.S. Nvidia provides a detailed tutorial on using Docker with CUDA: Containers For Deep Learning Frameworks User Guide | Nvidia
Docker Compose is for running multi-container apps.
sudo apt-get update
sudo apt-get install docker-compose-plugin
docker compose version
sudo apt-get install openssh-server
code /etc/ssh/sshd_config
You could change the port number, say 2222:
Port 2222
You may also want to enable the password login:
PasswordAuthentication yes
After altering the configuration, restart ssh server:
sudo systemctl restart ssh
sudo apt install net-tools
ifconfig
Find the IP address. Then, from another device on the same network, you can SSH into the Ubuntu machine:
ssh <user>@<address>
P.S. If you've changed the port number:
ssh <user>@<address> -p <port>
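To avoid retyping the address and port every time, you can give the machine an alias in ~/.ssh/config (the host name "mybox" below is illustrative):

```shell
# ~/.ssh/config — afterwards, simply `ssh mybox`
Host mybox
    HostName <address>
    Port 2222
    User <user>
```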
How To Configure SSH Key-Based Authentication on a Linux Server | DigitalOcean
Having to enter the password for each SSH login is not very convenient. We can make our lives a little easier by setting up key authentication.
On the client machine
First generate a key:
ssh-keygen
It will prompt you for the <key path> and a passphrase. Then add your key to the local SSH agent:
ssh-add <key path>
Note that you will need to add it again whenever you reboot your local machine. You may add this command to ~/.bashrc (or ~/.zshrc if you use macOS) for convenience.
Then send the key to the remote server:
ssh-copy-id -i <key path> -p <port> <address>
Note that the -p argument is the port number of the remote server you set before, and the address is the IP address of the remote server, which can be checked with ifconfig.
On the host machine
Before you can log in to the remote server without entering the password, you will need to enable key authentication first:
code /etc/ssh/sshd_config
Then uncomment the line PubkeyAuthentication yes. For security reasons, it's preferable to disable password authentication once key authentication works: change PasswordAuthentication yes to PasswordAuthentication no.
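For reference, the relevant lines of /etc/ssh/sshd_config after this change would look like:

```shell
# /etc/ssh/sshd_config — key authentication on, password authentication off
PubkeyAuthentication yes
PasswordAuthentication no
```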
And then:
sudo systemctl restart ssh
If everything is setup well, you will no longer need to enter the password the next time you ssh into the remote machine.
To access the SSH host from the Internet, we need to expose it, i.e. NAT traversal. You can use Cloudflare to set up a tunnel, given that you own a domain. Even if you don't have one, buying a domain from Cloudflare is still cheaper than some NAT tools' paid subscriptions.
Follow this guide to set it up:
Connect to SSH with client-side cloudflared (legacy) · Cloudflare Zero Trust docs
Key steps:
- <subdomain>.<domain>: the public domain used to access the service, e.g. ssh.example.com
- <type>://<url>: the URL of the service on the remote machine, e.g. ssh://localhost:22
Install cloudflared on your client machine, then add this to ~/.ssh/config:
Host ssh.example.com
ProxyCommand /usr/local/bin/cloudflared access ssh --hostname %h
Then you can:
ssh user@ssh.example.com
As you can see, ProxyCommand defines what happens when you launch SSH to that host. In this case, it runs cloudflared access ssh --hostname %h to set up the tunnel, i.e. one cloudflared process per connection. However, if your network is unstable (for whatever reason, e.g. the GFW), this makes the connection fragile. A better solution is a single long-lived tunnel:
cloudflared access ssh --hostname ssh.example.com --url localhost:<local-port>
Then you can:
ssh user@localhost -p <local-port>
Access to some online resources is restricted outside the campus. For example, the eStudent website is painstakingly slow--almost unusable. With SSH, you can easily turn your remote machine into a SOCKS5 proxy server; whatever is accessible to the remote machine will then be accessible to you:
ssh user@<address> -p <port> -D <another-port>
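You can sanity-check the tunnel from the command line before touching any system settings (the port is whatever you passed to -D; example.com is just a placeholder target):

```shell
# Fetch response headers through the SOCKS5 proxy; DNS is also resolved remotely
curl -I --socks5-hostname localhost:<another-port> https://example.com
```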
Then configure a SOCKS5 proxy in your OS or browser's proxy settings.
We might launch some web services on the remote machine, such as TensorBoard for monitoring the training progress:
tensorboard --logdir experiment/
This will launch a web page on http://localhost:6006 by default. Using tmux (see the next section) or nohup, you can make it run in the background like a persistent web service. If you want to access it from your local machine, one approach is SSH port forwarding:
ssh user@<address> -p <port> -L 6006:localhost:6006
However, this is a little bit inconvenient. Instead, you can use Cloudflare to expose it to a domain name:
- <subdomain>.<domain>: the public domain used to access the service, e.g. tb.example.com
- <type>://<url>: the URL of the service on the remote machine, e.g. http://localhost:6006
Then the webpage will be accessible via the Internet. However, this could raise privacy or security concerns. You are strongly recommended to restrict who can access it. For example, to allow only yourself:
- Emails: your-github-email, i.e. only you can access this service.
Then when you open http://<subdomain>.<domain>, you will need to authenticate via GitHub. It checks whether your GitHub account's email matches the one you filled in the policy. If so, you will be directed to the webpage; otherwise, your access will be denied.
P.S. The default authentication method is a one-time password via email. However, for unknown reasons, I couldn't receive any authentication emails. You may explore other authentication methods as well.
tmux
When using SSH for remote development, all running processes are terminated once you disconnect from the host. This is frustrating when you have something that takes hours or days to complete (e.g. training a neural network), or when your network is unstable. In that case, you need tmux.
sudo apt update
sudo apt install tmux
Enable mouse in tmux:
touch ~/.tmux.conf
echo "set -g mouse on" >> ~/.tmux.conf
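A slightly fuller ~/.tmux.conf, adding a larger scrollback buffer on top of mouse support (the 50000-line limit is an arbitrary choice of mine):

```shell
# ~/.tmux.conf
set -g mouse on
set -g history-limit 50000
```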
First start a tmux session:
tmux
Or start with a custom name:
tmux new -s mysession
P.S. It can be renamed outside the session:
tmux rename-session -t <old_name> <new_name>
It will launch a terminal like an ordinary one. When you disconnect from the host, processes running in that terminal keep running. To view the status of all running sessions:
tmux ls
When you reconnect to the host, you can enter that tmux session by:
tmux attach
tmux attach -t <name>
Kill a specific session:
tmux kill-session -t <name>
Kill all sessions (except the session you are in):
tmux kill-session -a
I use the PyVista package quite a lot for 3D mesh rendering. When it's on a remote SSH host, things can get a little difficult. Here is a recipe that works:
Installation — PyVista 0.45.2 documentation
python - PyVista plotting issue in Visual Studio Code using WSL 2 with Ubuntu 22.04.4 - Stack Overflow
conda create -n vtk python=3.10
conda activate vtk
pip install "pyvista[jupyter]" ipykernel
Then in the .ipynb notebook on the remote machine:
import pyvista as pv
pv.set_jupyter_backend('html')
pl = pv.Plotter()
pl.add_mesh(
mesh=pv.read('output/x.obj'),
texture=pv.read_texture('output/texture.png'),
)
pl.show()
Alternatively, to render off-screen and save a screenshot:
import pyvista as pv
pl = pv.Plotter(off_screen=True)
pl.add_mesh(
mesh=pv.read('output/x.obj'),
texture=pv.read_texture('output/texture.png'),
)
pl.screenshot('output.png', window_size=[1024, 1024], return_img=False)
pl.close()
wget https://glados.one/tools/clash-verge_1.3.8_amd64.deb
sudo apt install ./clash-verge_1.3.8_amd64.deb
System Proxy and Auto Launch
Add these lines to ~/.bashrc:
# proxy setting
export https_proxy=http://127.0.0.1:7890 http_proxy=http://127.0.0.1:7890 all_proxy=socks5://127.0.0.1:7890
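Since this guide auto-launches fish, the same variables should also be exported in fish syntax; add these to ~/.config/fish/config.fish:

```shell
# proxy setting for fish
set -gx https_proxy http://127.0.0.1:7890
set -gx http_proxy http://127.0.0.1:7890
set -gx all_proxy socks5://127.0.0.1:7890
```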
apt with proxy
Add this line to /etc/apt/apt.conf:
Acquire::http::Proxy "http://<address>:<port>";
git with proxy
git config --global http.proxy http://<address>:<port>
Download and run:
./cpuburn
P.S. Monitor CPU frequency:
watch "cat /proc/cpuinfo | grep 'MHz'"
For more visual monitoring of system resources, including CPU, RAM, disk, & network usage:
sudo apt install btop
git clone git@github.com:wilicc/gpu-burn.git
cd gpu-burn
make
General test:
./gpu_burn 3600
Tensor core test:
./gpu_burn -tc 3600
P.S. Monitor GPU status:
watch -n 1 nvidia-smi
Or for a more visual monitoring:
sudo add-apt-repository ppa:flexiondotorg/nvtop
sudo apt install nvtop
GitHub - Syllo/nvtop: GPU & Accelerator process monitoring for AMD, Apple, Huawei, Intel, NVIDIA and Qualcomm
A quiet, cost-effective dual-GPU build [10-Billion-Model Plan]
P.S. CUDA memory test:
GitHub - ComputationalRadiationPhysics/cuda_memtest: Fork of CUDA GPU memtest
git clone git@github.com:ComputationalRadiationPhysics/cuda_memtest.git
# build
mkdir build
cd build
# RTX 3090 is compute capability 8.6 -> 86
# check here for other models: https://developer.nvidia.com/cuda-gpus
cmake -DCMAKE_CUDA_ARCHITECTURES=86 ..
make
cd ..
mv build/cuda_memtest .
# testing
./sanity_check.sh
./cuda_memtest