Lectures and Class Material

Link to the GitHub Repository containing the lecture materials

1. Reproducibility in Life Sciences

Instructor: JB Poline

Outline

With this lecture, you will get a general introduction to reproducible - or irreproducible - life sciences. Specifically, you will

learn what is meant by reproducibility of research results in the life sciences
understand the main causes for irreproducible results
learn the possible collective and individual actions for curbing irreproducibility

Material: GitHub Link

Pre-recorded lecture video: YouTube Link

Slides: Slides

Lecture Resources

Canonical paper: Ten common statistical mistakes to watch out for when writing or reviewing a manuscript

Questions you will be able to answer after taking this module:

Is the term “replicability” generally applied to obtaining the same results with another (new) dataset?
Is the root cause of irreproducibility the publication incentive?
What do you call obtaining a similar result with the same methodology or pipeline but different data?

2. Introduction to the Terminal and Bash

Instructors: Brent McPherson, Alyssa Dai

Outline

To follow most of the other modules, you will need a basic understanding of the command line. In this module we’ll take a look at the Bourne Again SHell (Bash), the default command-line shell on most Linux systems. You will learn how to:

move around on your computer with the command line, create and open directories and files
find things with the command line (files and programs, PATH variables)
run useful command line programs and find help (find, grep, ls, and man / documentation)
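
This module is taught in Bash, but since the later modules use Python, the idea of finding a program through the PATH variable can also be sketched there. The following is a minimal illustration, not part of the lecture material: the helper `find_program` and the variable name `MY_COURSE_VAR` are hypothetical, while `shutil.which` and `os.environ` come from the Python standard library.

```python
import os
import shutil


def find_program(name):
    """Return the full path of `name` if it is found on PATH, else None."""
    return shutil.which(name)


# PATH is an environment variable: a list of directories joined by os.pathsep.
path_dirs = os.environ.get("PATH", "").split(os.pathsep)
print(f"PATH contains {len(path_dirs)} entries")

# Look up a program by searching the PATH directories, much as the shell does.
print(find_program("ls"))

# Environment variables can be changed for the current process and its children.
os.environ["MY_COURSE_VAR"] = "hello"  # hypothetical variable name
print(os.environ["MY_COURSE_VAR"])
```

In Bash itself, the corresponding tools are `which`, `echo $PATH`, and `export`.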

Materials:

GitHub Link

Pre-recorded lecture video (by Sebastian Urchs): YouTube Link

Slides: Slides

Questions you will be able to answer after taking this module:

What is a command line shell?
How would you copy thousands of files with file names starting with "my_good_file..." to a different directory on your computer?
Among thousands of files and directories you know there is one where you wrote down "location of my thesis backup". How do you find this file?
What is an environment variable, and how can you change it?

3. Introduction to Python

Instructors: Michelle Wang, Jacob Sanz-Robinson

Outline

This lecture is designed to get students up and running with Python. It is expected that Python 3 (preferably 3.7 or later) is installed, and that the students have some basic previous experience in a scripting language.
It will guide students through the fundamental syntax, concepts, and data structures required to code in Python 3.
Topics include: Running your code, commenting, variables, arithmetic, logic, strings, lists, tuples, dictionaries, functions, libraries, if statements, loops, exceptions, and classes.
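
As a rough preview of how several of these topics fit together, here is a small sketch combining a function, a dictionary, lists, a loop, and exception handling. The function name and the data are made up for illustration and are not taken from the lecture.

```python
def mean_reaction_time(times):
    """Return the mean of a list of reaction times (in seconds)."""
    if not times:
        raise ValueError("times must not be empty")
    return sum(times) / len(times)


# A dictionary mapping (made-up) participant IDs to lists of measurements.
measurements = {"sub-01": [0.5, 0.75, 1.0], "sub-02": []}

summary = {}
for participant, times in measurements.items():
    try:
        summary[participant] = mean_reaction_time(times)
    except ValueError:
        # Participants without any data are recorded as None.
        summary[participant] = None

print(summary)
```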

Material: GitHub Link

Pre-recorded lecture video (by Jacob Sanz-Robinson): YouTube Link

Lecture Resources

Think Python 2e textbook

Questions you will be able to answer after taking this module:

(1) How does the use of a break statement alter the flow of a loop in Python?

(2) What happens if you attempt to append new elements to a Tuple?

(3) Without running the code on your machine, what is the printed output when the following code is run?

my_dictionary = {"a" : 1, "b" : {"c" : {"d" : [4,5,6,4]}}, "c" : [1,2,3]}
x = my_dictionary["b"]["c"]["d"].append(my_dictionary["c"][-3])
print(list(my_dictionary.values()))

a) [1, {'c': {'d': [4, 5, 6, 4]}}, [1, 2, 3]]
b) [1, {'c': {'d': [4, 5, 6, 4, 1]}}, [1, 2, 3]]
c) [1, [4, 5, 6, 4, 1], [1, 2, 3]]
d) [1, [4, 5, 6, 4], [1, 2, 3]]

(4) Without running the code on your machine, which string is returned by my_function when called with the specified parameters?

def my_function(x, y, z):
    result = ""
    if len(z) <= 6 and len(z) > 2:
        result = z[-2] + y
    else:
        result = x + y
    return x + x + result

my_function("111", "abc", "0100")

a) '1111110abc'
b) '0abc111111'
c) '111111bca0'
d) '1111111110'

4. Scientific Python: NumPy and SciPy

Instructor: Jérôme Dockès

Outline

This lecture will introduce NumPy and its ndarray data structure, which are at the core of most scientific Python packages. At the end of the lecture, participants will be able to:

Understand what NumPy arrays are and why NumPy enables efficient computation.
Manipulate arrays of numbers with NumPy
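
To give a flavour of what this looks like in practice, the sketch below contrasts an explicit Python loop with a vectorized NumPy operation on the same (made-up) data. It is only an illustration of the idea and assumes NumPy is installed.

```python
import numpy as np

# Made-up data: ten values stored as a Python list and as an ndarray.
values = list(range(10))
array = np.arange(10)

# Pure-Python version: an explicit loop over the elements.
squares_list = [v ** 2 for v in values]

# NumPy version: one vectorized operation applied to the whole array,
# executed in compiled code on a contiguous block of memory.
squares_array = array ** 2

print(squares_array.dtype, squares_array.shape)

# Reshape the ten values into 2 rows and 5 columns, then average each column.
matrix = array.reshape(2, 5)
column_means = matrix.mean(axis=0)
print(column_means)
```

Both versions compute the same squares; the array version avoids the per-element overhead of the Python interpreter, which matters as arrays grow.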

Materials: GitHub Link

Lecture Resources

NumPy User Guide

5. Introduction to Git and GitHub

Instructor: Kendra Oudyk

Outline

Git and GitHub are key tools for doing version control in both academia and industry. These tools can help students do more efficient, open, and reproducible research. Further, knowing these tools can help prepare students for careers in academia and industry. In this lecture, students will learn

What is version control and why has it become so important in science and industry;
How to track and share their own work using Git and GitHub; and
How to collaborate and contribute to open projects using Git and GitHub.

Materials: GitHub Link

Pre-recorded lecture video: YouTube Link

Slides: Slides

Lecture Resources

Questions you will be able to answer after taking this module:

In a ____ version control system, individuals have the entire repository and its history in their local repository.

a) Centralized
b) Distributed

What is the basic workflow for tracking a change and sharing it on GitHub?

a) git commit, git add, git push
b) git pull, git add, git push
c) git add, git commit, git push

How do you start a parallel line of development, in order to do nonlinear version control?

a) make a tag
b) start a new branch
c) create a remote repository

How do you make a copy of another GitHub repo on your GitHub account?

a) git clone <repo address>
b) go to the repo’s GitHub page and click “fork”
c) go to the repo’s GitHub page and open an issue to ask for a copy
d) go to the repo’s GitHub page and do a pull request

6. Data Wrangling with Pandas

Instructor: Jacob Sanz-Robinson

This module is designed to introduce students to the Pandas Python library for manipulating data in tables and time series (not to be confused with the bear of the same name). It aims to build a basic understanding of what happens under the hood in Pandas, and to arm you with the essential practical knowledge to fearlessly tackle the next CSV file you encounter in the wild.

Outline

Introduction

a) What is Pandas?
b) (Very) Brief History
c) Why should I care about Pandas?
d) Features & Docs

Pandas Objects

a) The Series Object
b) The DataFrame Object
c) The Index Object

Pandas Wrangler Essentials

a) Data I/O
b) Selection and Indexing
c) Filtering
d) Combining DataFrames
e) Inbuilt Aggregations
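
As a small taste of the "wrangler essentials", the sketch below shows selection, filtering, and a built-in aggregation on a made-up table. The column names and values are invented for illustration, and the example assumes Pandas is installed. Data I/O and combining DataFrames are left to the lecture.

```python
import pandas as pd

# Made-up table: one row per participant.
df = pd.DataFrame(
    {
        "participant": ["sub-01", "sub-02", "sub-03", "sub-04"],
        "group": ["control", "patient", "control", "patient"],
        "age": [25, 31, 40, 28],
    }
)

# Selection: a single column of a DataFrame is a Series.
ages = df["age"]

# Label-based selection with .loc (row label 0, column "group").
first_group = df.loc[0, "group"]

# Filtering: keep only the rows where a boolean condition holds.
over_30 = df[df["age"] > 30]

# Built-in aggregation: mean age per group.
mean_age = df.groupby("group")["age"].mean()
print(mean_age)
```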

Materials: GitHub Link

Pre-recorded lecture video: YouTube link

Lecture resources

Questions you will be able to answer after taking this module:

Which of the following is an immutable Pandas Object?

a) Index
b) DataFrame
c) Series
d) Array

What function would you use to combine two Pandas DataFrames if you wanted to align their rows based on common column values?

a) Append
b) Concat
c) Merge
d) Map

7. Classical statistics pitfalls and remedies

Instructor: JB Poline

Outline

Most published results still rely on some form of statistical inference. With this lecture, you will

get a reminder of the classical statistical framework and learn about the issues brought by the use of statistical inference
learn (or be reminded of) the notions of effect size, power, and positive predictive value, and the consequences of low-powered studies
understand the file drawer effect, p-hacking, and know about some solutions.
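
The link between power and positive predictive value (PPV) can be made concrete with a short calculation. The sketch below uses one commonly cited formulation, PPV = power × prior / (power × prior + alpha × (1 − prior)), which ignores bias such as p-hacking. The function name and the numbers are illustrative assumptions, not taken from the lecture.

```python
def positive_predictive_value(power, alpha, prior):
    """Probability that a significant result reflects a true effect.

    power: probability of detecting a true effect (1 - type II error rate)
    alpha: significance threshold (type I error rate)
    prior: proportion of tested hypotheses that are actually true
    """
    true_positives = power * prior
    false_positives = alpha * (1 - prior)
    return true_positives / (true_positives + false_positives)


# Illustrative numbers (assumptions): 10% of tested hypotheses are true.
well_powered = positive_predictive_value(power=0.8, alpha=0.05, prior=0.1)
low_powered = positive_predictive_value(power=0.2, alpha=0.05, prior=0.1)
print(round(well_powered, 2), round(low_powered, 2))
```

Under these assumptions, lowering power from 0.8 to 0.2 lowers the PPV, so a larger share of the significant findings are false positives.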

Materials: GitHub Link

Pre-recorded lecture video: YouTube Link

Slides: Slides

Lecture Resources

Think Stats

8. Introduction to Machine Learning part 1: supervised learning

Instructor: Nikhil Bhagwat

Outline

Define machine-learning nomenclature
Describe basics of the “learning” process
Explain model design choices and performance trade-offs
Introduce model selection and validation frameworks
Explain model performance metrics
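
To make the last point more concrete, here is a minimal sketch of three common classification metrics computed from a confusion matrix. The labels and predictions are made up, and the function names are hypothetical. In practice a library such as scikit-learn provides equivalent functions.

```python
def confusion_counts(y_true, y_pred):
    """Count true/false positives and negatives for binary labels (0 or 1)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false alarms
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # misses
    return tp, tn, fp, fn


def classification_metrics(y_true, y_pred):
    """Return accuracy, precision, and recall as a dictionary."""
    tp, tn, fp, fn = confusion_counts(y_true, y_pred)
    return {
        "accuracy": (tp + tn) / (tp + tn + fp + fn),
        "precision": tp / (tp + fp),
        "recall": tp / (tp + fn),
    }


# Made-up ground truth and predictions for ten samples.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 0, 1, 0, 0, 0, 0, 0]

metrics = classification_metrics(y_true, y_pred)
print(metrics)
```

Note that this sketch does not guard against division by zero, which occurs when the model predicts no positives or the data contain none.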

Materials: GitHub Link

Pre-recorded lecture video: YouTube Link

Slides: Slides

Lecture Resources

IMPORTANT! To fully understand the material taught in this module, you should make sure that you are already familiar with the following concepts (please take some time to review them if needed):

Basics of linear algebra (check out these videos if you need a refresher)
- Do you know how to use vectors?
- Do you know how to multiply two matrices?
Basics of linear regression
- Do you know what a mean-square error is?
- Do you know how to fit a linear regression or a GLM?

Questions you will be able to answer after taking this module:

When is ML a useful approach?
Supervised learning
- Model training - what is under/over-fitting?
- Model selection - what is (nested) cross-validation?
- Model evaluation - what are type-1 and type-2 errors?
What NOT to do when using ML models in your research

Things you will NOT learn in this module (if you are an advanced ML student)

In-depth review of unsupervised learning approaches (e.g. clustering)
How to train deep-learning models
How to use and/or defeat ChatGPT

9. Introduction to Machine Learning part 2: Model selection & validation; dimensionality reduction

Instructor: Jérôme Dockès

Outline

In this module, you will:

Learn how to properly select a machine-learning model, set hyperparameters, and evaluate prediction performance.
Understand the challenges of learning from high-dimensional data and learn about tools to mitigate the issue.
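
One of the tools for high-dimensional data is regularization. The sketch below illustrates the idea with a closed-form ridge regression on random data that has fewer samples than features. The data, the function name `ridge_coefficients`, and the chosen `alpha` values are assumptions for illustration. It assumes NumPy is installed, and it does not cover how to select the hyperparameter.

```python
import numpy as np

rng = np.random.default_rng(0)

# Made-up high-dimensional data: fewer samples than features.
n_samples, n_features = 20, 50
X = rng.normal(size=(n_samples, n_features))
y = rng.normal(size=n_samples)


def ridge_coefficients(X, y, alpha):
    """Solve (X^T X + alpha * I) w = X^T y for w (no intercept)."""
    gram = X.T @ X + alpha * np.eye(X.shape[1])
    return np.linalg.solve(gram, X.T @ y)


# With n_samples < n_features, X^T X is rank-deficient, so ordinary least
# squares has no unique solution; adding alpha * I makes the system solvable.
rank = np.linalg.matrix_rank(X.T @ X)
print("rank of X^T X:", rank, "out of", n_features)

# Larger alpha shrinks the coefficients more strongly towards zero.
for alpha in (0.1, 10.0, 1000.0):
    w = ridge_coefficients(X, y, alpha)
    print(alpha, np.linalg.norm(w))
```

In practice one would typically use a tested implementation such as scikit-learn's `Ridge` and choose `alpha` by cross-validation, as discussed in this module.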

Materials: GitHub Link

Pre-recorded lecture video: YouTube Link

Slides: Link

Lecture Resources

Questions you will be able to answer after taking this module:

I am predicting continuous cognitive scores of 1,000 participants using 20,000 brain imaging features. I use least-squares regression. What is regularization and why do I need it?
I decide to use ridge regression (l2 regularization). How can I set the regularization hyperparameter?
I also add a dimensionality reduction step to my model: PCA. I do 5-fold cross-validation, and I perform a full grid-search, using 3 folds for the inner validation loop. I use a grid of 3 options for the number of PCA components and 6 options for the ridge hyperparameter. How many times (at least) will I need to fit a PCA?

10. Introduction to Data Visualization in Python

Instructor: Kendra Oudyk

Outline

Data visualization is an essential skill for scientists.
At the grad student level, you’re probably already familiar with basic plots (e.g., bar plot vs pie chart), as well as types of data (e.g., ordered vs categorical).
With that in mind, I hope to take you a bit deeper into the technicalities of planning and executing an effective figure.

Materials: GitHub Link

Pre-recorded lecture video, Part 1 Decoding: YouTube link
Pre-recorded lecture video, Part 2 Encoding: YouTube link

Slides for Part 1 Decoding: Slides
Slides for Part 2 Encoding: Slides

11. Virtualization of computing environments

Instructor: Sebastian Urchs

Outline

Learn why containerization and virtualization are important for research projects.
Have an overview of different solutions to create isolated environments.
Get some basic hands-on experience with Docker.

Materials: GitHub Link

Pre-recorded lecture video: YouTube Link

Slides: Slides

Questions you will be able to answer after taking this module:

When working with the file system inside a Docker container, which statements are true?
- I cannot see files on the host system from inside the container
- files written into the container file system are lost with the container
- I can mount paths on the host system into the container to expose their contents to it
What is an advantage of Docker over a Virtual Machine?
- a Docker container can run any operating system, independently of the host operating system
- Docker is a good choice for shared systems because of its high level of security
- Docker containers are easier to specify, build, and manage and have better sharing infrastructure
What is the difference between a Docker container and a Docker image?
- A Docker container is a registry service to store and share Docker images
- A Docker image is a read-only snapshot and a Docker container is a running instance of it
- A Docker container is a read-only snapshot that can be easily shared (e.g. on Dockerhub) and from it, many live Docker images can be spawned
What is an advantage conda has over pip for Python environments?
- conda is usually prepackaged with Python, so you don’t have to install anything
- conda has more Python packages than pip because of the Anaconda distribution
- conda can resolve non-Python dependencies and can also create virtual environments

12. High Performance Computing (HPC)

Instructor: Brent McPherson

Outline

Learn the key facts about High Performance Computing (HPC) and Cloud computing
Understand the advantages and the constraints of HPC
Learn the key concepts and practical bash commands to get started on the Compute Canada HPC

Materials: GitHub Link

Pre-recorded lecture video (by Darcy Quesnel): YouTube Link

Slides: Slides

Lecture Resources

Compute Canada wiki

Questions you will be able to answer after taking this module:

Choose the area that Advanced Research Computing traditionally does not include

a) HPC/Clusters
b) Research Data Management
c) Cloud Computing
d) Video Games

Choose all components that are part of an HPC Compute Node

a) Processor/Core
b) Display/Monitor
c) Memory
d) Mouse
e) Local Disk

Choose all ways to access an HPC Cluster

a) Secure shell to a Login Node
b) Secure shell to a Compute Node
c) Secure transfer to a Data Transfer Node

2023 Lectures and Class Material

Links to the pre-recorded lectures and material (2023)
