Autocnet uses numpy, pandas, and matplotlib to manage data and display it in a user-friendly way. Numpy is a powerful numerical processing library, well suited to large datasets and complex calculations. Pandas organizes data into a dataframe so it can be viewed and manipulated as needed. Matplotlib displays data using various plots. This tutorial covers basic operations in a Jupyter notebook using numpy, pandas, and matplotlib. You will use these packages extensively when working with Autocnet.
%% Cell type:markdown id: tags:
### Cells
%% Cell type:markdown id: tags:
Cells are blocks of code that can be run individually. Although the code is sequestered within a cell, the variables created in one cell can be accessed elsewhere in the notebook. Manipulating cells in a Jupyter notebook is easy. This text is in a cell, and cells are where you will write your code. There are two types of cells in this document: markdown cells and code cells. The type of a cell can be changed with the dropdown menu at the top of the notebook that shows either Markdown or Code.
#### Exercise: Change the type of this cell from markdown to code.
Code cells will be used for all exercises in this tutorial. All cell manipulation tools are found at the top of the notebook. To add a cell, click the plus symbol. To move a cell, use the arrow buttons. To delete a cell, click to the left of the cell and press d twice (dd). To run a cell's contents, press Shift and Enter together.
#### Exercise: Move this cell up one and then down one. Add an extra cell and then delete it.
%% Cell type:markdown id: tags:
### Killing Notebooks
%% Cell type:markdown id: tags:
Notebooks are hosted on the nebula cluster. Unless specifically shut down, the job running the Jupyter notebook will continue running even if Jupyter hub fails. It is important to cancel your Jupyter jobs after you are finished. Select the box next to your notebook in the main Jupyter hub tab and click the yellow shutdown button at the top of the page.
%% Cell type:markdown id: tags:
### Kernels
%% Cell type:markdown id: tags:
Jupyter kernels are conda environments that can be accessed within a Jupyter notebook.
- Click on the 'Kernel' tab in the top menu
- Go to 'Change Kernel' and look through the options
%% Cell type:markdown id: tags:
### Anaconda Environments
%% Cell type:markdown id: tags:
Anaconda environments are collections of python packages installed into an isolated environment and selectively accessed by activating that environment. They are particularly helpful because the versions and combinations of programs in one environment do not affect those in another.
For example, ASC internally creates an anaconda environment for each new release and release candidate of the ISIS software. To access a particular version of ISIS, a user would type `conda activate isisx.y.z`. To switch to a different version later, they could run `conda deactivate && conda activate isisu.v.w` without worrying about cross-contamination of environment variables.
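It can be handy to check from inside a notebook which environment the running kernel belongs to. The following is a minimal sketch using only the Python standard library. It assumes the `CONDA_DEFAULT_ENV` variable that `conda activate` sets, which may be missing if the kernel was launched without activation, so `sys.prefix` is printed as well.

```python
import os
import sys

# CONDA_DEFAULT_ENV is set by `conda activate`; it may be absent otherwise.
active_env = os.environ.get("CONDA_DEFAULT_ENV", "(no conda environment reported)")
print("Active conda environment:", active_env)

# sys.prefix is the root directory of the running Python installation.
# When an environment is in use, this path points inside that environment.
print("Python prefix:", sys.prefix)
```

If the two disagree, `sys.prefix` is usually the better indicator of which packages the kernel actually sees.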
%% Cell type:markdown id: tags:
## Numpy
%% Cell type:markdown id: tags:
What is Numpy? Numpy is the core library for numerical computing in Python. It provides a multidimensional array object and tools to work with these arrays. A numpy array is a grid of values, all of the same type, indexed by a tuple of nonnegative integers. The number of dimensions is the rank of the array, and the shape of an array is a tuple of integers giving the size of the array along each dimension. The real power of numpy is vectorization: operations run over whole arrays in compiled code rather than one element at a time in the Python interpreter. The difference in speed between plain Python and numpy can be very large.
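To get a feel for the speed difference, the sketch below times the same calculation twice: once with a Python loop and once with a vectorized numpy expression. The array size and the variable names are arbitrary choices for illustration, and the timings will vary from machine to machine.

```python
import time
import numpy as np

values = np.arange(1_000_000, dtype=np.float64)

# Pure-Python loop: the interpreter handles one element at a time.
start = time.perf_counter()
loop_total = 0.0
for v in values:
    loop_total += v * 2.0
loop_seconds = time.perf_counter() - start

# Vectorized: the multiplication and the sum both run in compiled code.
start = time.perf_counter()
vector_total = float((values * 2.0).sum())
vector_seconds = time.perf_counter() - start

print(f"loop:       {loop_seconds:.4f} s")
print(f"vectorized: {vector_seconds:.4f} s")
print("same result:", np.isclose(loop_total, vector_total))
```

Both versions compute the same total; only the time taken differs.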
%% Cell type:code id: tags:
``` python
import numpy as np  # First, add numpy to the notebook you are using. To access python modules (or functions from a python module), they must be explicitly imported into the notebook.
a = np.array([1, 2, 3])  # Create a rank 1 array (a vector)
print(type(a))  # print() writes a value to the output. This prints the type of the array.
print(a.shape)  # Prints the shape of the array
print(a[0], a[1], a[2])  # Prints the values of the array
print(a.dtype)  # Prints the data type (typically a 64-bit integer)
```
%% Cell type:markdown id: tags:
Use the ? symbol to query Python about an object, for example ?np.array. A window should pop up that describes what the object does.
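The `?` syntax only works inside Jupyter and IPython. As a rough plain-Python equivalent, the docstring of an object can be read with the standard `inspect` module. This is a small sketch; the choice of `np.dot` and the three-line preview are arbitrary.

```python
import inspect
import numpy as np

# inspect.getdoc returns the object's docstring with indentation cleaned up.
# The docstring is part of what `?` displays in a notebook.
doc = inspect.getdoc(np.dot)

# Show only the first few lines to keep the output short.
for line in doc.splitlines()[:3]:
    print(line)
```

The built-in `help(np.dot)` prints the full documentation in the same way.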
%% Cell type:code id: tags:
``` python
#### Exercise: Use the ? symbol to find more information about the following objects: print, type, np.dot, np.dtype.
```
%% Cell type:code id: tags:
``` python
```
%% Cell type:markdown id: tags:
Numpy arrays can be manipulated in many ways. The following section shows a few ways to perform different math functions on a numpy array.
%% Cell type:code id: tags:
``` python
a[0] = 5 # Change an element of the array. Python indexing is zero based: it starts at zero instead of one, so the first element is at index 0.
print(a)
```
%% Cell type:markdown id: tags:
#### Exercise: In the next cell, change the second element in the "a" array to 10. Print your results to check.
%% Cell type:code id: tags:
``` python
a[1] = 10
print(a)
```
%% Cell type:code id: tags:
``` python
a[1]+15 # Add 15 to one element of the array. This displays the result without changing the array.
```
%% Cell type:markdown id: tags:
#### Exercise: Add 5 to the third element in the "a" array.
%% Cell type:code id: tags:
``` python
a[2]+5
```
%% Cell type:code id: tags:
``` python
a+1 # Add one to the entire array. This is called broadcasting: the value is broadcast to each element of the array.
```
%% Cell type:code id: tags:
``` python
a*2 # Multiply every element of the array by 2
```
%% Cell type:code id: tags:
``` python
np.sin(a) # Sine of the array. This is a vectorized function: it applies the function to each element of the array.
```
%% Cell type:code id: tags:
``` python
np.std(a) # Standard deviation of the array.
```
%% Cell type:markdown id: tags:
#### Exercise: Take the standard deviation of the cosine of the "a" array.
%% Cell type:code id: tags:
``` python
np.std(np.cos(a))
```
%% Cell type:code id: tags:
``` python
print(np.append([a], [15, 1, 6])) # Appends elements to an array. With no axis given, the result is flattened.
print(np.append([a], [[15, 1, 6]], axis=0)) # With axis=0, the new values are added as a second row.
```
%% Cell type:code id: tags:
``` python
b = np.array([[1,2,3],[4,5,6]]) # Create a rank 2 array, matrix array
print(b.shape) # Prints "(2, 3)"
print(b[0, 0], b[0, 1], b[1, 0]) # Prints "1 2 4"
```
%% Cell type:markdown id: tags:
#### Exercise: Print the second element of the first row of the "b" array.
%% Cell type:code id: tags:
``` python
print(b[0,1])
```
%% Cell type:markdown id: tags:
Numpy arrays can be combined with other arrays. This section goes over working with multiple arrays.
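When two arrays have different shapes, numpy tries to broadcast them to a common shape before combining them. The sketch below uses illustrative arrays (the names `row` and `grid` are not part of this tutorial) to show a case that works and a case that does not.

```python
import numpy as np

row = np.array([5, 10, 3])            # shape (3,)
grid = np.array([[1, 2, 3],
                 [4, 5, 6]])          # shape (2, 3)

# The 1-D array is stretched across both rows of the 2-D array.
total = row + grid
print(total.shape)
print(total)

# Shapes that cannot be aligned raise a ValueError.
try:
    np.array([1, 2]) + grid
except ValueError as err:
    print("cannot broadcast:", err)
```

The rule of thumb is that shapes are compared from the last dimension backwards, and each pair of sizes must either match or contain a 1.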
%% Cell type:code id: tags:
``` python
np.add(a,b) # Adds two arrays together
```
%% Cell type:code id: tags:
``` python
# Print a and b before displaying the sum
print(a)
print(b)
a + b
```
%% Cell type:code id: tags:
``` python
c = np.zeros((2,2)) # Create an array of all zeros
print(c)
```
%% Cell type:code id: tags:
``` python
d = np.ones(3) # Create an array of all ones
print(d)
```
%% Cell type:code id: tags:
``` python
e = np.full((2,2), 7) # Create a constant array
print(e)
```
%% Cell type:code id: tags:
``` python
f = np.eye(2) # Create a 2x2 identity matrix
print(f)
```
%% Cell type:code id: tags:
``` python
e = np.random.random((2,2)) # Create an array filled with random values, uses: adding random error to things, pick random elements, pull a random sample from data
print(e)
```
%% Cell type:code id: tags:
``` python
s = a[0] # Contents of an array can be accessed by indexing. Indexing a single element returns a copy of its value.
print(s) # This prints the first element of the 'a' array. The 'a' array itself is unchanged.
print(a)
```
%% Cell type:code id: tags:
``` python
print(b[0:1]) # You can also slice a range of items: 0:1 selects from index 0 up to, but not including, index 1.
```
%% Cell type:markdown id: tags:
#### Exercise: Multiply the "a" and "d" arrays together. Hint: ?np.dot.
%% Cell type:code id: tags:
``` python
np.dot(a,d)
```
%% Cell type:code id: tags:
``` python
a * d
```
%% Cell type:markdown id: tags:
#### What is the difference between np.dot and a * d?
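One way to explore the question is to run both operations on small arrays and compare the shapes of the results. The sketch below uses its own illustrative arrays, `u` and `w`, rather than the notebook's `a` and `d`.

```python
import numpy as np

u = np.array([5, 10, 3])
w = np.ones(3)

# Element-wise product: matching elements are multiplied, shape stays (3,).
elementwise = u * w

# Dot product: matching elements are multiplied and then summed to one number.
dot_product = np.dot(u, w)

print(elementwise, elementwise.shape)
print(dot_product)
print("dot equals the sum of the element-wise product:",
      np.isclose(dot_product, elementwise.sum()))
```

For 1-D arrays, `np.dot` collapses the result to a single value, while `*` keeps one value per element.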
%% Cell type:markdown id: tags:
## Pandas
%% Cell type:markdown id: tags:
What is Pandas? Pandas is an open source data analysis and manipulation tool built on top of the Python programming language. Under the hood, pandas uses numpy to perform most of its functions. The most common pandas data structure is the dataframe, a 2-dimensional table with rows and columns, very similar to an Excel spreadsheet or SQL table.
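As a small illustration of the relationship between the two libraries, the sketch below builds a dataframe from numpy arrays and converts it back. The column names, labels, and numbers are made up for this example.

```python
import numpy as np
import pandas as pd

# Hypothetical values, for illustration only.
table = pd.DataFrame(
    {"cases": np.array([120, 45, 300]), "deaths": np.array([3, 1, 12])},
    index=["A", "B", "C"],
)

print(table)
print(table.shape)               # (rows, columns)
print(type(table.to_numpy()))    # the values come back as a numpy array
print(table["cases"].mean())     # column operations are vectorized
```

Each column behaves much like a labeled numpy array, which is why the numpy operations from the previous section carry over.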
%% Cell type:code id: tags:
``` python
# More explanation? Go through it a bit slower so people can have some time to look at the data and get familiar with the dataframe.
```
%% Cell type:code id: tags:
``` python
import pandas as pd # Pandas package import
import requests # HTTP library for Python. Sends requests to the web server that stores the data.
```
%% Cell type:code id: tags:
``` python
url = 'https://api.covid19api.com/summary' # Practice data location.
```
%% Cell type:code id: tags:
``` python
r = requests.get(url, verify=False) # Downloads the data to look at. A warning will pop up because certificate verification is disabled (verify=False) to get through the DOI firewall.
```
%% Cell type:markdown id: tags:
We use JSON to extract and structure the data initially, but this is just to get it ready for pandas.
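The same JSON-to-dataframe path can be tried offline. The sketch below parses a small, made-up JSON string whose key names mirror the ones used in this section. The module is imported as `jsonlib` (an arbitrary alias) so that it does not clash with the `json` variable used in the following cells.

```python
import json as jsonlib
import pandas as pd

# Hypothetical response text with made-up numbers, for illustration only.
payload = """
{
  "Global": {"TotalConfirmed": 465, "TotalDeaths": 16},
  "Countries": [
    {"Country": "A", "TotalConfirmed": 120, "TotalDeaths": 3},
    {"Country": "B", "TotalConfirmed": 45, "TotalDeaths": 1},
    {"Country": "C", "TotalConfirmed": 300, "TotalDeaths": 12}
  ]
}
"""

parsed = jsonlib.loads(payload)     # JSON text becomes a Python dictionary
print(type(parsed))
print(list(parsed.keys()))

# A list of records converts directly into a dataframe, one row per record.
records = pd.DataFrame(parsed["Countries"]).set_index("Country")
print(records)
```

The list of per-country dictionaries is what makes a good dataframe: every record has the same keys, so each key becomes a column.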
%% Cell type:code id: tags:
``` python
json = r.json() # Extracts json structured data from the request.
```
%% Cell type:code id: tags:
``` python
json # Variable that will show the data. It is similar to pvl.
```
%% Cell type:code id: tags:
``` python
json.keys() # Returns the keys of the dictionary. We can use these keys to explore the JSON, similar to selecting a column.
```
%% Cell type:code id: tags:
``` python
json['Countries'] # Selects the per-country records stored under the 'Countries' key.
```
%% Cell type:code id: tags:
``` python
# Looking at the type of the value under each key can tell us which keys hold interesting data.
type(json['Global'])
```
%% Cell type:markdown id: tags:
#### Exercise: Look at the types of the values stored under the Countries, Message, Date, and ID keys.
%% Cell type:code id: tags:
``` python
type(json['Countries'])
```
%% Cell type:code id: tags:
``` python
# JSON to a dataframe, indexed by country. The Slug column can be removed by specifying which columns to keep.
df = pd.DataFrame(json['Countries'])
df.set_index('Country', inplace=True)
df
```
%% Cell type:code id: tags:
``` python
df.shape # Returns the shape of the dataframe, 190 rows and 11 columns.
```
%% Cell type:code id: tags:
``` python
df.columns # Returns the names of the columns in the dataframe
```
%% Cell type:code id: tags:
``` python
# Preview of the Data
df.head()
```
%% Cell type:markdown id: tags:
#### Exercise: View the tail of the dataframe.
%% Cell type:code id: tags:
``` python
df.tail()
```
%% Cell type:code id: tags:
``` python
pd.set_option('display.max_rows',None) # By default, pandas does not show the entire dataframe. Setting these two options to None changes that.
pd.set_option('display.max_columns',None)
df
```
%% Cell type:code id: tags:
``` python
df.loc['Colombia'] # .loc looks up rows and columns by label.
```
%% Cell type:markdown id: tags:
#### Exercise: Look up the United States, China, Brazil, and France.
%% Cell type:code id: tags:
``` python
df.loc['China'] #Looks things up by row.
```
%% Cell type:code id: tags:
``` python
df.loc[df['TotalDeaths'] > 100000] # You can combine .loc with a boolean condition to filter the dataframe.
```
%% Cell type:code id: tags:
``` python
#combine this with quantile. can do equal values, and inequalities. Pull a row and Total deaths
```
%% Cell type:markdown id: tags:
#### Exercise: Look up the rows where TotalConfirmed > 10000000.
%% Cell type:code id: tags:
``` python
df.loc[df['TotalConfirmed'] > 10000000]
```
%% Cell type:code id: tags:
``` python
df['NewRecovered'] # This will show all the data for the NewRecovered column.
```
%% Cell type:code id: tags:
``` python
df[["TotalConfirmed","TotalDeaths"]].describe() # .describe() computes descriptive statistics for the selected columns.
```
%% Cell type:markdown id: tags:
#### Exercise: Look up descriptive statistics for TotalConfirmed and NewConfirmed.
%% Cell type:code id: tags:
``` python
```
%% Cell type:code id: tags:
``` python
df['TotalConfirmed'].quantile(.95) # percentile function. Allows you to compute percentiles on a series.
```
%% Cell type:markdown id: tags:
#### Exercise: Look up the 25th, 50th, and 75th percentiles for TotalDeaths using one function.
%% Cell type:code id: tags:
``` python
df['TotalDeaths'].quantile([.25, .50, .75])
```
%% Cell type:code id: tags:
``` python
df['NewDeaths']+df['TotalDeaths'] # You can perform simple math functions on the different columns.
```
%% Cell type:markdown id: tags:
#### Exercise: Calculate the death rate by country.
%% Cell type:code id: tags:
``` python
df['TotalDeaths']/df['TotalConfirmed']
```
%% Cell type:code id: tags:
``` python
df.sort_values(by='TotalDeaths', ascending=False) # .sort_values sorts the dataframe by the values in one or more columns. ascending=False puts the largest values first.
```
%% Cell type:markdown id: tags:
#### Exercise: Sort the dataframe with fewest TotalConfirmed on top.
%% Cell type:markdown id: tags:
## Matplotlib
%% Cell type:markdown id: tags:
What is Matplotlib? Matplotlib is a comprehensive library for creating static, animated, and interactive visualizations in Python. In this section, we are going to take the dataframe and plot the data using boxplots, histograms, scatterplots, and lineplots.
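As a preview of the pattern used throughout this section, the sketch below plots two columns of a small dataframe and labels the result. The dataframe and its values are made up for illustration and stand in for the tutorial's `df`.

```python
import pandas as pd
import matplotlib.pyplot as plt

# Hypothetical values, for illustration only.
sample = pd.DataFrame(
    {"confirmed": [120, 4500, 30000, 800000], "deaths": [3, 60, 900, 15000]}
)

# Create a figure and one set of axes, then draw on the axes.
fig, ax = plt.subplots(figsize=(6, 4))
ax.scatter(sample["confirmed"], sample["deaths"], color="tab:blue", alpha=0.7)

# Axis scale, labels, and title are all set on the axes object.
ax.set_xscale("log")
ax.set_xlabel("confirmed")
ax.set_ylabel("deaths")
ax.set_title("Deaths vs. confirmed (made-up data)")
ax.grid(True)
```

In a notebook the figure is drawn when the cell finishes running. In a standalone script, add `plt.show()` at the end.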
%% Cell type:code id: tags:
``` python
import matplotlib.pyplot as plt
```
%% Cell type:markdown id: tags:
### Titles and axes properties

%% Cell type:code id: tags:
``` python
# Box plots
# Creating Data
np.random.seed(10)
data = np.random.normal(100, 20, 200)
fig = plt.figure(figsize =(15, 9))
# Plot
plt.boxplot(data)
plt.show()
```
%% Cell type:code id: tags:
``` python
df.boxplot(column=['TotalRecovered','TotalDeaths']) # You can plot multiple columns on one plot.
plt.yscale("log") # You can change the x and y axis to log scale.
```
%% Cell type:markdown id: tags:
#### Exercise: Make a boxplot of TotalConfirmed and TotalDeaths cases. Plot using a log scale on the y axis.
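%% Cell type:code id: tags:
``` python
# Scatter plots
# Data (assumed here for illustration: x, y, and area are not defined elsewhere in this notebook)
x = np.random.rand(50)
y = np.random.rand(50)
area = 100 * np.random.rand(50) # Marker sizes
plt.scatter(x, y, s=area, facecolor='blue', alpha=0.5)
plt.title('Scatter plot')
plt.xlabel('X')
plt.ylabel('Y')
plt.show()
```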
%% Cell type:code id: tags:
``` python
# Choosing what color to use is important when making a plot. https://matplotlib.org/3.1.0/tutorials/colors/colormaps.html has a wide array of colormaps to choose from.
```
%% Cell type:markdown id: tags:
#### Exercise: Create a scatterplot of TotalDeaths vs. TotalConfirmed with a log scale on the TotalConfirmed axis, a title, and a colorscale. Is there a trend in the data?
%% Cell type:code id: tags:
``` python
df.plot.scatter(x='TotalConfirmed', y='TotalDeaths', c='NewDeaths', colormap='winter', title='TotalDeaths vs. TotalConfirmed')
plt.xscale("log")
```
%% Cell type:code id: tags:
``` python
# line plots
# Data
t = np.arange(0.0, 4.0, 0.001)
s = 1 + np.sin(2 * np.pi * t)
# Plot
fig, ax = plt.subplots()
ax.plot(t, s)
ax.set(xlabel='Wavelength', ylabel='Waveheight',
title='Line Plot')
ax.grid()
fig.savefig("test.png")
plt.show()
```
%% Cell type:markdown id: tags:
#### Exercise: Create a line plot of Total Confirmed vs. Total Recovered. Use what you have learned to improve the plot. Ex. Titles, log, color, etc.
%% Cell type:code id: tags:
``` python
```
%% Cell type:markdown id: tags:
## Additional resources
%% Cell type:code id: tags:
``` python
# A few other things you can modify in Matplotlib
x = np.linspace(0, 2, 100)
fig, ax = plt.subplots() # Create a figure and an axes.
ax.plot(x, x, label='linear') # Plot some data on the axes.
ax.plot(x, x**2, label='quadratic') # Plot more data on the axes.
ax.plot(x, x**3, label='cubic')
ax.set_xlabel('x label') # Add an x-label to the axes.
ax.set_ylabel('y label') # Add a y-label to the axes.
ax.set_title("Simple Plot") # Add a title to the axes.