6 Version Control System (VCS)

How do you keep track of changes in your code? How do you collaborate with others on a project? How do you manage different versions of your code? A version control system (VCS) is a tool that helps you answer these questions. It allows you to track changes in your code, collaborate with others, and manage different versions of your code.

6.1 Why use a VCS?

Here are some scenarios where a VCS can be useful:

6.1.1 Tracking Changes

Imagine you are working on a project with your group members. You want to add some new features to the code, but you are not sure if the changes will work. Then, you create a copy of the code and make the changes (e.g., code_v2.py, code_final.py, code_final_final.py, etc.). Your group members also make changes to the code, and now you have multiple copies of the code with different changes. How do you keep track of all these changes?

With a VCS, you can track changes in your code, see who made the changes, when the changes were made, and why the changes were made. If you don’t use a VCS, you may end up with multiple copies of the code with different changes, which can quickly become messy and error-prone.

6.1.2 Collaborating with Others

Imagine you are working on a project with your group members. You want to share your progress with your group members and get feedback on your code. Then you:

- Send an email with the code attached, and your group members make changes to the code and send it back to you. Now you have multiple copies of the code with different changes. How do you merge these changes?
- Use a cloud storage service like Google Drive or Dropbox to share the code with your group members. You all work on the same copy of the code, and suddenly the file gets corrupted. How do you recover the previous version of the code?
- Send the code via a messaging app like WhatsApp or Slack. Your group members make changes to the code, and now you have multiple copies of the code with different changes. How do you keep track of all these changes?

With a VCS, you can collaborate with others more effectively. Everyone works against the same repository, and you can see who made each change, when it was made, and why. You can also revert to previous versions of the code, merge changes made by different group members, and resolve the conflicts that arise when two people edit the same lines.

6.2 How Does a VCS Work?

A VCS works by keeping track of changes in your code over time. It stores different versions of your code in a repository, which acts as a database for your code. When you make changes to your code, the VCS records these changes and stores them in the repository. You can view the history of changes, revert to previous versions of the code, and merge changes made by different group members.
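The idea of a repository that records versions over time can be sketched in a few lines of Python. This is only a toy model under stated assumptions: the `ToyRepository` class, its method names, and the file contents are hypothetical, everything lives in memory, and it is not how Git actually stores data on disk. It does show the three operations described above: recording a change, viewing the history, and going back to a previous version.

```python
import hashlib
from dataclasses import dataclass, field


@dataclass
class ToyRepository:
    """A toy, in-memory stand-in for a VCS repository (hypothetical, not Git's format)."""

    snapshots: dict = field(default_factory=dict)  # commit id -> (message, files)
    history: list = field(default_factory=list)    # commit ids, oldest first

    def commit(self, files: dict, message: str) -> str:
        """Record a snapshot of `files` (name -> text) and return its id."""
        parent = self.history[-1] if self.history else ""
        # The id depends on the parent, the message, and the file contents.
        payload = parent + message + repr(sorted(files.items()))
        commit_id = hashlib.sha1(payload.encode("utf-8")).hexdigest()[:7]
        self.snapshots[commit_id] = (message, dict(files))
        self.history.append(commit_id)
        return commit_id

    def log(self) -> list:
        """Return (id, message) pairs, newest first."""
        return [(cid, self.snapshots[cid][0]) for cid in reversed(self.history)]

    def checkout(self, commit_id: str) -> dict:
        """Return a copy of the files exactly as they were at `commit_id`."""
        return dict(self.snapshots[commit_id][1])


repo = ToyRepository()
first = repo.commit({"code.py": "print('hello')\n"}, "Add greeting")
second = repo.commit({"code.py": "print('hello, world')\n"}, "Extend greeting")

print(repo.log())            # newest commit first
print(repo.checkout(first))  # the files as they were at the first commit
```

Storing a full copy of every file per commit keeps the sketch short; a real VCS is far more economical with storage.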

You can think of a VCS as a time machine for your code: it lets you travel back and see how your code looked at different points in time. You can also create branches in the repository, which are like parallel universes where you can work on different versions of the code without affecting the main version. Another analogy is the memory card in a video game: you save your progress at different points in the game and reload an earlier save if you make a mistake.

6.3 Getting Started with Git and GitHub

There are many VCSs available, such as Git, Mercurial, and Subversion. In this course, we will use Git, which is one of the most popular VCSs. Git is widely used in the software development industry and is supported by platforms like GitHub, GitLab, and Bitbucket.

To get started with Git, you need to install Git on your computer and create a GitHub account. GitHub is a platform for hosting and sharing code, and it allows you to collaborate with others on projects. You can create repositories on GitHub to store your code, track changes in your code, and collaborate with others.

Here are some resources to help you get started with Git and GitHub:

- Create a GitHub account. As a student, you can get a free GitHub Pro account with additional features by joining GitHub Education.
- Install GitHub Desktop. GitHub Desktop is a graphical user interface for Git that makes it easier to work with Git repositories.

6.4 The Basics of Git

Git is a command-line tool that you can use to track changes in your code, collaborate with others, and manage different versions of your code. Here are some basic Git commands:

git init: Initialize a new Git repository in the current directory.
git clone <repository-url>: Clone a remote Git repository to your local machine.
git add <file>: Add a file to the staging area.
git commit -m "Commit message": Commit changes in the staging area to the repository.
git push: Push changes from your local repository to a remote repository.
git pull: Pull changes from a remote repository to your local repository.
git status: Show the status of the working directory and the staging area.
git log: Show the commit history of the repository.
git branch: List, create, or delete branches in the repository. For example, git branch <branch-name> creates a new branch, and git branch -d <branch-name> deletes a branch.
git checkout <branch-name>: Switch to a different branch in the repository. For example, git checkout main switches to the main branch.
git merge <branch-name>: Merge changes from a different branch into the current branch.
git stash: Temporarily shelve uncommitted changes in the working directory and the staging area so that you can restore them later (for example, with git stash pop).
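To see several of these commands working together, here is a minimal sketch that drives Git from Python in a throwaway directory. It assumes Git is installed and on your PATH. The `git` helper, the `demo_workflow` function, the file name, the branch name `feature`, and the demo identity are all hypothetical choices made for illustration. In everyday use you would type the same commands directly in a terminal.

```python
import shutil
import subprocess
import tempfile
from pathlib import Path


def git(*args, cwd):
    """Run `git <args>` inside `cwd` and return its standard output."""
    # The -c options supply a throwaway identity so that commits work even
    # on a machine where Git has not been configured yet.
    cmd = ["git", "-c", "user.name=Demo User",
           "-c", "user.email=demo@example.com", *args]
    result = subprocess.run(cmd, cwd=cwd, check=True,
                            capture_output=True, text=True)
    return result.stdout


def demo_workflow(project):
    """Commit on the default branch, commit on a new branch, then merge it back."""
    git("init", cwd=project)
    (project / "code.py").write_text("print('hello')\n")
    git("add", "code.py", cwd=project)
    git("commit", "-m", "Add greeting", cwd=project)

    # The default branch may be called main or master, so ask Git for its name.
    default_branch = git("rev-parse", "--abbrev-ref", "HEAD", cwd=project).strip()

    git("branch", "feature", cwd=project)    # create a branch
    git("checkout", "feature", cwd=project)  # switch to it
    (project / "code.py").write_text("print('hello, world')\n")
    git("add", "code.py", cwd=project)
    git("commit", "-m", "Extend greeting", cwd=project)

    git("checkout", default_branch, cwd=project)
    git("merge", "feature", cwd=project)     # bring the branch's work back

    # One commit subject per line, newest first.
    return git("log", "--pretty=format:%s", cwd=project).splitlines()


if shutil.which("git") is None:
    print("Git is not installed; skipping the demo.")
else:
    with tempfile.TemporaryDirectory() as tmp:
        print(demo_workflow(Path(tmp)))
```

The sketch leaves out git clone, git push, and git pull because they need a remote repository, such as one hosted on GitHub.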

Using GitHub Desktop, you can perform many of these Git commands using a graphical user interface, which makes it easier to work with Git repositories.

As your project grows, you may need to use more advanced Git commands and techniques to manage your code effectively. However, these basic Git commands are a good starting point for working with Git repositories.

6.5 The .gitignore File

When working with Git, you may want to exclude certain files and directories from being tracked by Git. For example, you may want to exclude log files, temporary files, huge data files, output files, and other files that are not essential to the project. You can do this by creating a .gitignore file in the root directory of the Git repository.

The .gitignore file contains a list of files and directories that Git should ignore. You can use wildcards and patterns to specify which files and directories to ignore. For example, you can use *.log to ignore all log files, *.csv to ignore all CSV files, and data/ to ignore the data directory.
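The way wildcards select file names can be explored with Python's standard `fnmatch` module. This is only an approximation of Git's matching: it handles simple patterns without a slash and knows nothing about directory patterns such as data/ or about the ! prefix. The patterns, paths, and the `first_match` helper are hypothetical examples. To ask Git itself which rule applies to a file, run git check-ignore -v <path> inside a repository.

```python
from fnmatch import fnmatchcase
from pathlib import PurePosixPath

# Hypothetical patterns and paths, chosen only for illustration.
patterns = ["*.log", "*.csv", "report_?.txt"]
paths = [
    "debug.log",
    "results/run1.csv",
    "report_1.txt",
    "report_10.txt",
    "code.py",
]


def first_match(path, patterns):
    """Return the first pattern that matches the file name of `path`, else None."""
    name = PurePosixPath(path).name  # compare against the file name only
    for pattern in patterns:
        if fnmatchcase(name, pattern):
            return pattern
    return None


for path in paths:
    pattern = first_match(path, patterns)
    status = f"ignored by {pattern}" if pattern else "tracked"
    print(f"{path}: {status}")
```

Note that report_10.txt is not matched by report_?.txt, because ? stands for exactly one character.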

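In Figure 6.1, we provide an example of a .gitignore file that excludes log files, temporary files, output files, and other files from being tracked by Git. Sometimes, you may want to whitelist certain files or directories that would otherwise be excluded. You can do this by placing the ! character before the pattern, on a line after the rule that excludes it. Note that Git cannot re-include a file whose parent directory is itself excluded, so to whitelist something inside a directory, ignore the directory's contents (for example, data/*) rather than the directory itself (data/).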

# Ignore log files
*.log

# Ignore temporary files
*.tmp

# Ignore output files
*.out

# Ignore everything inside the data directory
# (data/* rather than data/, so that entries inside it can be whitelisted below)
data/*

# Ignore the .vscode directory
.vscode/

# Ignore the __pycache__ directory
__pycache__/

# Ignore the .pytest_cache directory
.pytest_cache/

# Ignore all directories whose names end in _test
*_test/

# Whitelist the data/processed directory
!data/processed/

# Ignore all CSV files in folder1/<any directory>/folder2/
# (the path is relative to this .gitignore file; * matches exactly one directory level)
folder1/*/folder2/*.csv

# Ignore files inside the virtual environment directory
venv/

Figure 6.1: Example of a .gitignore file that excludes log files, temporary files, output files, and other files from being tracked by Git. The file uses wildcards¹ and patterns to specify which files and directories to ignore.

6.5.1 Refreshing the .gitignore File

Sometimes you forget to add certain files or directories to the .gitignore file, and Git is already tracking them. Adding them to .gitignore afterwards is not enough, because .gitignore only affects files that are not yet tracked. To stop tracking these files or directories:

First, remove the file from the staging area while keeping it in the working directory. Once you commit, the file is no longer part of the latest version in the repository, although earlier commits still contain it:
- git rm --cached <file>: Stop tracking a file, but keep it in the working directory.
- git rm -r --cached <directory>: Stop tracking a directory and its contents, but keep them in the working directory.
  - rm stands for remove.
  - -r stands for recursive, which means the command applies to the directory and everything inside it.
  - --cached means the removal applies only to the staging area, so the files in the working directory are left untouched.
  - <directory> is the name of the directory you want to stop tracking.
Alternatively, use git rm -r --cached . to stop tracking every file in the current directory while keeping them all in the working directory. The . represents the current directory.
Next, add the file or directory to the .gitignore file, and then stage your files again. Git now skips everything that matches a pattern in .gitignore:
- git add <file>: Add a specific file to the staging area.
- git add .: Add all files in the current directory, except ignored ones, to the staging area.
Finally, commit the changes to the repository using the git commit -m "Commit message" command.
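
The steps above can be sketched end to end in a throwaway repository. The sketch assumes Git is installed and on your PATH. The `git` helper, the `stop_tracking_demo` function, the file names, and the demo identity are hypothetical and exist only for illustration.

```python
import shutil
import subprocess
import tempfile
from pathlib import Path


def git(*args, cwd):
    """Run `git <args>` inside `cwd` and return its standard output."""
    # The -c options supply a throwaway identity so that commits work even
    # on a machine where Git has not been configured yet.
    cmd = ["git", "-c", "user.name=Demo User",
           "-c", "user.email=demo@example.com", *args]
    result = subprocess.run(cmd, cwd=cwd, check=True,
                            capture_output=True, text=True)
    return result.stdout


def stop_tracking_demo(project):
    """Track a log file by mistake, then untrack it without deleting it."""
    git("init", cwd=project)
    (project / "code.py").write_text("print('hello')\n")
    (project / "debug.log").write_text("started\n")
    git("add", ".", cwd=project)
    git("commit", "-m", "Initial commit (log file tracked by mistake)", cwd=project)

    # Remove everything from the staging area; files on disk are untouched.
    git("rm", "-r", "--cached", ".", cwd=project)
    # Add the missing rule to .gitignore.
    (project / ".gitignore").write_text("*.log\n")
    # Stage again; Git now skips files that match .gitignore.
    git("add", ".", cwd=project)
    git("commit", "-m", "Stop tracking log files", cwd=project)

    tracked = git("ls-files", cwd=project).splitlines()
    still_on_disk = (project / "debug.log").exists()
    return tracked, still_on_disk


if shutil.which("git") is None:
    print("Git is not installed; skipping the demo.")
else:
    with tempfile.TemporaryDirectory() as tmp:
        tracked, still_on_disk = stop_tracking_demo(Path(tmp))
        print("Tracked files:", tracked)
        print("debug.log still in working directory:", still_on_disk)
```

Here git ls-files lists the files Git is tracking, which is a quick way to confirm that the log file has left the repository while remaining on disk.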

¹ A wildcard is a character that can be used to represent one or more characters in a file name or directory name. The most common wildcards are * and ?. The * wildcard represents zero or more characters, while the ? wildcard represents a single character. In a .gitignore file, neither wildcard matches the / that separates directories.