Getting started with Python + Jupyter Notebook

Author

Simon Schulte, Johan Velez

Published

June 13, 2023

1 Installing Python Packages from your Jupyter Notebook

1.1 Using Conda

If you installed Python through the Anaconda distribution, you can install a Python package with its package manager conda by running the following command in a Jupyter notebook:

conda install <package-name>

For example, to install the numpy package, you would run the following command from a Jupyter Notebook code cell:

conda install numpy

In some older versions of Jupyter Notebook, running this command raises the following error: SyntaxError: invalid syntax.

In that case, try adding a %-sign before conda:

%conda install <package-name>

1.2 Using Pip

If you have just installed Python (without Anaconda), you can use the following command in a Jupyter notebook to install a Python package:

pip install <package-name>

For example, to install the matplotlib package, you would run:

pip install matplotlib

In some older versions of Jupyter Notebook, running this command raises the following error: SyntaxError: invalid syntax.

In that case, try adding a %-sign before pip:

%pip install <package-name>
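To check whether an installation worked, you can ask Python whether it can find the package. The following is a minimal sketch using the standard library; the helper name is_installed and the package name no_such_pkg_x are made up for illustration.

```python
import importlib.util

def is_installed(package_name):
    """Return True if the current Python environment can find the package."""
    return importlib.util.find_spec(package_name) is not None

# 'json' is part of the standard library, so it should always be found
print(is_installed("json"))

# 'no_such_pkg_x' is a hypothetical name that should not exist
print(is_installed("no_such_pkg_x"))
```

If the check returns False for a package you just installed, the package was probably installed into a different Python environment than the one your notebook kernel uses.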

2 Jupyter Notebook Keyboard Shortcuts

Here are some of the most important keyboard shortcuts for working with Jupyter notebooks:

  • Enter: Enter edit mode
  • Shift + Enter: Run cell and move to the next one
  • Ctrl + Enter: Run cell and stay on the current one
  • Esc: Exit edit mode
  • A: Insert cell above
  • B: Insert cell below
  • D, D: Delete selected cell
  • Z: Undo last cell deletion
  • M: Change cell to markdown
  • Y: Change cell to code
  • Ctrl + S: Save notebook
  • Ctrl + Shift + P: Show command palette
  • Shift + Tab: Show docstring (in edit or command mode)

For a full list of keyboard shortcuts, you can click on the Keyboard Shortcuts option in the Help menu.

3 Python packages we use a lot

There are two Python packages that we use a lot in our course: Pandas and NumPy. While some of their features overlap (e.g. matrix multiplication), others are available in only one of the packages (e.g. matrix inversion, which NumPy provides).

We recommend using Pandas whenever possible, because you keep the column names and indices (= row names) of your matrix/vector when you manipulate matrices in Pandas DataFrame format instead of NumPy’s array or matrix format.
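As a rough illustration of this point, the sketch below multiplies a small labelled matrix in Pandas and inverts it with NumPy. The matrix values and the labels 'a' and 'b' are invented for the example.

```python
import numpy as np
import pandas as pd

labels = ['a', 'b']
A = pd.DataFrame([[2.0, 0.0], [0.0, 4.0]], index=labels, columns=labels)

# Matrix multiplication works directly on DataFrames and keeps the labels
product = A @ A
print(product)

# Matrix inversion is a NumPy function; it works on a plain array without labels
A_inv_values = np.linalg.inv(A.to_numpy())

# Wrap the result in a DataFrame again to restore the labels
A_inv = pd.DataFrame(A_inv_values, index=A.columns, columns=A.index)
print(A_inv)
```

Note that the labels of the inverse have to be restored by hand, which is the extra step you avoid whenever an operation is available in Pandas itself.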

3.1 Pandas

Pandas is a powerful library for data analysis in Python. It provides data structures for efficiently storing and manipulating large, heterogeneous datasets, as well as tools for data cleaning, merging, reshaping, and analysis.

Here you can find a cheat sheet explaining the main features of Pandas: https://pandas.pydata.org/Pandas_Cheat_Sheet.pdf

To use Pandas in a Jupyter notebook, you can install it using either Conda or Pip (see the “Installing Python Packages” section above), and then import it using the following command:

import pandas as pd

3.2 NumPy

NumPy is a fundamental package for scientific computing in Python. It provides powerful tools for working with arrays and matrices, as well as a large library of mathematical functions for linear algebra, Fourier analysis, and more.

Here you can find a cheat sheet explaining the main features of NumPy: https://s3.amazonaws.com/assets.datacamp.com/blog_assets/Numpy_Python_Cheat_Sheet.pdf

To use NumPy in a Jupyter notebook, you can install it using either Conda or Pip (see the “Installing Python Packages” section above), and then import it using the following command:

import numpy as np

4 Converting between NumPy Arrays/Matrices and Pandas DataFrames

4.1 From NumPy to Pandas

import numpy as np

# First we create a NumPy array
arr = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
arr
array([[1, 2, 3],
       [4, 5, 6],
       [7, 8, 9]])

Note that a NumPy array has no column or row names (= index)!

Converting to Pandas DataFrame:

import pandas as pd

# create a Pandas DataFrame from a NumPy array
df = pd.DataFrame(arr, index=['row1', 'row2', 'row3'], columns=['col1', 'col2', 'col3'])
df
col1 col2 col3
row1 1 2 3
row2 4 5 6
row3 7 8 9

In contrast, a Pandas DataFrame has column and row names!

4.2 From Pandas to NumPy

import pandas as pd

# create a Pandas DataFrame
df = pd.DataFrame({'col1': [1, 2, 3], 'col2': [4, 5, 6], 'col3': [7, 8, 9]}, index=['row1', 'row2', 'row3'])

Convert to NumPy Array:

# convert a Pandas DataFrame to a NumPy array
arr = df.to_numpy()
arr
array([[1, 4, 7],
       [2, 5, 8],
       [3, 6, 9]])

5 Indexing Data in Python

5.1 Pandas DataFrame

5.1.1 Using .iloc and .loc

  • .iloc is used for integer position-based selection of rows and columns
  • .loc is used for label-based selection of rows and columns
import pandas as pd

# create a sample dataframe
df = pd.DataFrame({'col1': [1, 2, 3, 4], 'col2': [5, 6, 7, 8]}, index=['row1', 'row2', 'row3', 'row4'])
print(df)
      col1  col2
row1     1     5
row2     2     6
row3     3     7
row4     4     8
# select a single row using .iloc
row = df.iloc[0]
print(row)
col1    1
col2    5
Name: row1, dtype: int64
# select a single column using .loc
col = df.loc[:, 'col1']
print(col)
row1    1
row2    2
row3    3
row4    4
Name: col1, dtype: int64
# select a subset of rows and columns using .iloc
subset = df.iloc[0:2, 0:2]
print(subset)
      col1  col2
row1     1     5
row2     2     6
# select a subset of rows and columns using .loc
subset = df.loc[['row1', 'row2'], ['col1', 'col2']]
print(subset)
      col1  col2
row1     1     5
row2     2     6
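One difference between the two indexers is easy to miss: a slice with .iloc excludes its end position, while a slice with .loc includes its end label. The sketch below reuses the sample DataFrame from above and also shows .loc with a boolean condition.

```python
import pandas as pd

df = pd.DataFrame({'col1': [1, 2, 3, 4], 'col2': [5, 6, 7, 8]},
                  index=['row1', 'row2', 'row3', 'row4'])

# positions 0 and 1 are selected; position 2 is excluded
by_position = df.iloc[0:2]

# 'row1' through 'row3' are selected; the end label is included
by_label = df.loc['row1':'row3']

# .loc also accepts a boolean condition to filter rows
filtered = df.loc[df['col1'] > 2]

print(len(by_position))
print(len(by_label))
print(filtered)
```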

5.2 NumPy Array

  • Indexing in NumPy arrays can be done using integer arrays, boolean arrays, or slices
  • Slicing is used to select a range of elements in the array
import numpy as np

# create a sample numpy array
arr = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
print(arr)
[[1 2 3]
 [4 5 6]
 [7 8 9]]
# select the first row using integer indexing
row = arr[0]
print(row)
[1 2 3]
# select the first column using integer indexing
col = arr[:, 0]
print(col)
[1 4 7]
# select a subset of the array using slices
subset = arr[0:2, 0:2]
print(subset)
[[1 2]
 [4 5]]
# select elements based on a boolean condition
bool_arr = arr > 5
subset = arr[bool_arr]
print(subset)
[6 7 8 9]
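Indexing with integer arrays, mentioned in the list above, is not shown in the examples so far. A minimal sketch using the same sample array:

```python
import numpy as np

arr = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])

# a list of row positions selects whole rows (here: first and last row)
rows = arr[[0, 2]]
print(rows)

# two lists are paired element-wise: this picks arr[0, 1] and arr[2, 2]
elements = arr[[0, 2], [1, 2]]
print(elements)
```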

6 MultiIndex in Pandas

In pandas, a MultiIndex allows you to have multiple levels of indexing in your DataFrame. This is useful when you want to work with higher-dimensional data or when you need to group and analyze data across multiple dimensions. This feature is especially helpful when working with Multi-Regional Input-Output (MRIO) tables, which have two dimensions (country and industry).

Here’s an example of creating and working with a DataFrame that has a MultiIndex representing a simple MRIO with two countries and two industries.

6.1 Example: Creating a DataFrame with MultiIndex

Let’s create a 4x4 DataFrame with random integers, where both the columns and rows have a 2-dimensional index representing country and industry.

import pandas as pd
import numpy as np

# Define the index levels
countries = ['country1', 'country2']
industries = ['industry1', 'industry2']

# Create the MultiIndex
multi_index = pd.MultiIndex.from_product([countries, industries], names=['Country', 'Industry'])

# Create a DataFrame with random integers
data = np.random.randint(0, 10, size=(4, 4))
df = pd.DataFrame(data, index=multi_index, columns=multi_index)

# Print the DataFrame
print(df)
Country             country1            country2          
Industry           industry1 industry2 industry1 industry2
Country  Industry                                         
country1 industry1         7         6         1         3
         industry2         0         3         1         1
country2 industry1         8         8         6         0
         industry2         6         8         3         7

In this example, we first define the index levels, countries and industries. Then, we create the multi_index by using the pd.MultiIndex.from_product() function and passing the index level arrays along with the names parameter to assign names to the levels.

Next, we create a random integer array using np.random.randint(), and finally, we create the DataFrame df with the specified index and columns.

6.2 Working with MultiIndex

Once you have a DataFrame with a MultiIndex, you can access and manipulate the data using various indexing and slicing techniques. Here are a few examples:

# Accessing specific rows using the MultiIndex
print(df.loc[('country1', 'industry1'),:])
Country   Industry 
country1  industry1    7
          industry2    6
country2  industry1    1
          industry2    3
Name: (country1, industry1), dtype: int64
# Accessing specific columns using the MultiIndex
print(df.loc[:, ('country1', 'industry1')])
Country   Industry 
country1  industry1    7
          industry2    0
country2  industry1    8
          industry2    6
Name: (country1, industry1), dtype: int64
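To select by a single level rather than by a full (country, industry) pair, the .xs() method of a DataFrame can be used. The sketch below is one possible way to do this; it uses fixed numbers instead of random ones so that the results are reproducible.

```python
import numpy as np
import pandas as pd

countries = ['country1', 'country2']
industries = ['industry1', 'industry2']
multi_index = pd.MultiIndex.from_product([countries, industries],
                                         names=['Country', 'Industry'])

# fixed values 0..15 instead of random integers
df = pd.DataFrame(np.arange(16).reshape(4, 4), index=multi_index, columns=multi_index)

# all rows that belong to industry1, regardless of country
industry1_rows = df.xs('industry1', level='Industry', axis=0)
print(industry1_rows)

# the block with rows from country1 and columns from country2
block = df.xs('country1', level='Country', axis=0).xs('country2', level='Country', axis=1)
print(block)
```

In an MRIO table, a block like the second one corresponds to the flows from the industries of one country to the industries of another.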

You can also aggregate data using the MultiIndex. To do so, first group the data using groupby(), and then append the aggregation, e.g. .sum(). The parameter axis determines whether the DataFrame is grouped along rows (0) or columns (1). The parameter level determines which index level the data is grouped by.

# axis=0: aggregate rows
print(df.groupby(axis=0,level=('Country')).sum())
Country   country1            country2          
Industry industry1 industry2 industry1 industry2
Country                                         
country1         7         9         2         4
country2        14        16         9         7
# axis=1: aggregate columns
print(df.groupby(axis=1,level=('Country')).sum()) 
Country             country1  country2
Country  Industry                     
country1 industry1        13         4
         industry2         3         2
country2 industry1        16         6
         industry2        14        10
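Recent pandas releases deprecate the axis parameter of groupby(), so the calls above may produce a warning or fail depending on your installed version. One possible workaround, sketched below with fixed example values, is to omit axis for rows and to transpose the DataFrame before and after grouping for columns.

```python
import numpy as np
import pandas as pd

countries = ['country1', 'country2']
industries = ['industry1', 'industry2']
multi_index = pd.MultiIndex.from_product([countries, industries],
                                         names=['Country', 'Industry'])

# fixed values 0..15 instead of random integers
df = pd.DataFrame(np.arange(16).reshape(4, 4), index=multi_index, columns=multi_index)

# aggregate rows: grouping along the row index is the default
row_sums = df.groupby(level='Country').sum()
print(row_sums)

# aggregate columns: transpose, group the former columns, transpose back
col_sums = df.T.groupby(level='Country').sum().T
print(col_sums)
```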

These are just a few examples of working with MultiIndex in pandas. You can explore more advanced techniques and functions available in the pandas documentation.

7 Specifying Paths in Python with the os Library

In Python, the os module provides functions for working with operating system functionality, including file and directory operations. One common use case is working with file paths.

7.1 Joining Path Components

To join multiple path components into a single path, you can use the os.path.join() function. This is useful when you need to construct a path dynamically, especially when dealing with different operating systems:

import os

# Joining path components
path = os.path.join('folder', 'subfolder', 'file.txt')
print(path)
folder/subfolder/file.txt

7.2 Getting the Current Working Directory

To get the current working directory, you can use the os.getcwd() function:

import os

# Getting the current working directory
current_dir = os.getcwd()
print(current_dir)
/home/simon/Documents/code/R/python_basics

7.3 Checking if a Path Exists

To check if a path exists, you can use the os.path.exists() function:

import os

# Checking if a path exists
path = '/path/to/file.txt'
exists = os.path.exists(path)
print(exists)
False
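The three functions above are often combined. The sketch below builds a path relative to the current working directory and checks whether it exists; the folder name data and the file name results.csv are hypothetical.

```python
import os

# 'data' and 'results.csv' are hypothetical names used for illustration
data_dir = os.path.join(os.getcwd(), 'data')
file_path = os.path.join(data_dir, 'results.csv')

if os.path.exists(file_path):
    print('Found:', file_path)
else:
    print('Not found:', file_path)

# split the path again into its file name and its folder
print(os.path.basename(file_path))
print(os.path.dirname(file_path) == data_dir)
```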

7.4 Specifying an Absolute Path in Windows with the os Library

In Windows, absolute paths are specified differently compared to other operating systems due to the use of drive letters (e.g., C:). The os module in Python provides functions to work with these paths.

7.4.1 Absolute Path in Windows

To specify an absolute path in Windows, you need to include the drive letter followed by a colon (:), and use double backslashes (\\) as the path separator. Here’s an example:

import os

# Specify an absolute path in Windows
abs_path = "C:\\path\\to\\file.txt"
print(abs_path)
C:\path\to\file.txt

Make sure to escape the backslashes by using a double backslash (\\).

7.4.2 Using Raw String

Alternatively, you can use a raw string by adding an r prefix before the path string. This allows you to specify the path using single backslashes (\) as the path separator. Here’s an example:

import os

# Specify an absolute path in Windows using a raw string
abs_path = r"C:\path\to\file.txt"
print(abs_path)
C:\path\to\file.txt

Using a raw string is convenient as you don’t need to escape the backslashes.

7.4.3 Converting to Windows Path

If you’re working with a path string that uses the forward slash (/) as path separator, as is the convention on other operating systems (e.g., Linux, macOS), you can normalize it with the os.path.normpath() function, which on Windows replaces forward slashes with backslashes:

import os

# Convert a path to Windows format
path = "/path/to/file.txt"
windows_path = os.path.normpath(path)
print(windows_path)
/path/to/file.txt

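On Windows, the os.path.normpath() function converts forward slashes to backslashes (\). On Linux and macOS the separators are left unchanged, which is why the output above, produced on Linux, still contains forward slashes.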
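Because the result of os.path.normpath() depends on the operating system, the sketch below calls the two platform-specific implementations from the standard library directly: ntpath (used as os.path on Windows) and posixpath (used on Linux and macOS). This makes it possible to compare both behaviors on any machine.

```python
import ntpath      # the Windows implementation of os.path
import posixpath   # the Linux/macOS implementation of os.path

path = "/path/to/file.txt"

# Windows rules: forward slashes become backslashes
windows_style = ntpath.normpath(path)
print(windows_style)

# POSIX rules: the separators stay as they are
posix_style = posixpath.normpath(path)
print(posix_style)

# normpath also removes redundant separators and resolves '..'
cleaned = posixpath.normpath("folder//subfolder/../file.txt")
print(cleaned)
```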

Keep in mind that when working with paths in Python, it’s recommended to use the os.path functions to ensure cross-platform compatibility.

8 Plotting in Python

Python provides several popular libraries for data visualization and plotting. Let’s explore some of these libraries and create a bar plot using the example data.

8.1 Example Data

Let’s consider an example DataFrame with two columns: Country and Emissions. This data represents the emissions of each country.

import pandas as pd

data = {
    'Country': ['USA', 'China', 'Russia', 'Germany', 'India'],
    'Emissions': [5500, 8500, 4300, 3700, 3000]
}

df = pd.DataFrame(data)

8.2 Matplotlib

Matplotlib is a widely used plotting library in Python. It provides a flexible and comprehensive set of functions for creating various types of plots.

import matplotlib.pyplot as plt

plt.bar(df['Country'], df['Emissions'])
plt.xlabel('Country')
plt.ylabel('Emissions')
plt.title('Emissions by Country (Matplotlib)')
plt.show()

8.3 Plotly

Plotly is an interactive plotting library that allows you to create interactive, web-based visualizations. It provides a range of graph types and customization options.

import plotly.express as px

fig = px.bar(df, x='Country', y='Emissions', title='Emissions by Country (Plotly)')
fig.show()

8.4 Plotnine (ggplot)

Plotnine is a Python implementation of the popular R library ggplot2. It follows the grammar of graphics approach and provides a powerful and flexible system for creating visually appealing plots.

from plotnine import ggplot, aes, geom_bar, labs

(
  ggplot(df, aes(x='Country', y='Emissions')) + 
      geom_bar(stat='identity') + 
      labs(title='Emissions by Country (Plotnine)')
)

<Figure Size: (640 x 480)>

In the examples above, we use the example data to create a bar plot of emissions with each library. Matplotlib, Plotly, and Plotnine (ggplot) all generate the same bar plot, but they offer different levels of customization and interactivity.

These are just a few examples of the many plotting libraries available in Python. Depending on your specific needs and preferences, you can explore other libraries such as Seaborn, Bokeh, or Altair to create various types of visualizations.

Remember to install the necessary libraries using pip or conda before using them in your code.

9 For-Loop in Python

In Python, a for-loop is used to iterate over a sequence of elements or to perform a set of instructions repeatedly for a specific number of times. The basic syntax of a for-loop in Python is as follows:

for variable in sequence:
    # Code to be executed for each element

Let’s break down the components of a for-loop:

  • variable: This is a variable that takes on the value of each element in the sequence, one at a time, during each iteration of the loop.

  • sequence: This can be any iterable object such as a list, tuple, string, or range. The loop iterates over each element in the sequence.

  • Code to be executed for each element: This is the block of code that is executed for each element in the sequence. It can include one or more statements.

Here’s an example that demonstrates the usage of a for-loop:

fruits = ['apple', 'banana', 'cherry']

for fruit in fruits:
    print(fruit)
apple
banana
cherry

In this example, we have a list of fruits. The for-loop iterates over each fruit in the list and prints it.

You can also use the range() function to generate a sequence of numbers and iterate over them using a for-loop. Here’s an example:

for num in range(1, 6):
    print(num)
1
2
3
4
5

In this case, the for-loop iterates over the numbers from 1 to 5 (inclusive) and prints each number.

For-loops are powerful constructs in Python that allow you to perform repetitive tasks efficiently. They are commonly used when you have a known number of iterations or when you need to iterate over a sequence of elements.

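Remember to properly indent the code block within the for-loop to indicate the scope of the loop. The indentation is typically four spaces or a tab; whichever you choose, use it consistently, because Python rejects inconsistent mixing of tabs and spaces.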

You can also use the break and continue statements within a for-loop to control the flow of execution.
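A minimal sketch of both statements; the list of numbers and the thresholds are arbitrary choices for illustration:

```python
numbers = [1, 2, 3, 4, 5, 6]
collected = []

for num in numbers:
    if num % 2 == 0:
        continue   # skip even numbers and go on with the next element
    if num > 4:
        break      # leave the loop entirely once an odd number above 4 appears
    collected.append(num)

print(collected)
```

Here continue skips the rest of the current iteration only, whereas break ends the whole loop, so the elements after 5 are never visited.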

Experiment with for-loops to iterate over different sequences and perform various operations within the loop. It’s a fundamental construct in Python programming that you’ll frequently encounter and utilize.

10 Functions in Python

Functions are reusable blocks of code that perform a specific task. They help organize code, promote code reuse, and make code easier to understand and maintain. Let’s explore how to define and use functions in Python.

10.1 Defining Functions

To define a function in Python, you use the def keyword followed by the function name and parentheses. You can also specify parameters inside the parentheses if the function requires input values. The function body is indented below the function definition line.

def greet():
    print("Hello, world!")

def add_numbers(a, b):
    return a + b

In this example, we define two functions. The greet() function doesn’t take any parameters and simply prints a greeting. The add_numbers() function takes two parameters a and b and returns their sum.

10.2 Using Functions

Once a function is defined, you can use it by calling its name followed by parentheses. If the function expects parameters, you can pass the necessary values inside the parentheses.

greet()  # Output: Hello, world!

result = add_numbers(5, 3)
print(result)  # Output: 8
Hello, world!
8

In this example, we call the greet() function, which prints the greeting message. Then, we call the add_numbers() function with arguments 5 and 3, and the returned result is stored in the result variable. Finally, we print the value of result, which is 8.
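Parameters can also be given default values, so that callers only need to pass them when they want something different. The function scale below is a hypothetical example and not part of any library.

```python
def scale(values, factor=2):
    """Multiply every element of a list by factor (default: 2)."""
    return [value * factor for value in values]

# uses the default factor of 2
doubled = scale([1, 2, 3])
print(doubled)

# overrides the default by passing the parameter by name
times_ten = scale([1, 2, 3], factor=10)
print(times_ten)
```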

10.3 Lambda Functions

In Python, a lambda function is a small anonymous function, i.e. a function defined without a name. It can take any number of arguments but can only consist of a single expression. Lambda functions are typically used when you need a simple, one-line function.

add = lambda x, y: x + y
result = add(2, 3)
print(result)  # Output: 5
5

In this example, we define a lambda function add that takes two arguments x and y and returns their sum. We then call the lambda function and store the result in the result variable. Finally, we print the value of result, which is 5.

Lambda functions are useful when you need a function for a short, simple operation and don’t want to define a separate named function.
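A typical place for such a short function is the key argument of the built-in sorted() function. The sketch below sorts a few (country, emissions) pairs, taken from the plotting example data, by their emissions value.

```python
emissions = [('USA', 5500), ('China', 8500), ('Russia', 4300)]

# the lambda picks the second element of each pair as the sort key
ranked = sorted(emissions, key=lambda pair: pair[1], reverse=True)
print(ranked)
```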