# Alex McFarlane

## Contents

This is a beginners introduction. Each topic, subtopic and even sentence most likely has alternatives and improvements in modern literature. If a topic interests you hit you favourite search engine and you will find a lot more information.

# Expected background

• Programming The level of programming expected is beginner. You should be comfortable with lists, data structures, iteration and be familiar with numpy.
• Maths The level of maths is GCSE/AS level (upper-high school). You should be able to taken derivatives of exponential functions and be familiar with the chain-rule.

# Theory

## The XOR function

The XOR function is the simplest (afaik) non-linear function. Is is impossible to separate True results from the False results using a linear function.

def xor(x1, x2):
"""returns XOR"""
return bool(x1) != bool(x2)

x = np.array([[0,0],[0,1],[1,0],[1,1]])
y = np.array([xor(*x) for x in inputs])


This is clear on a plot

import pandas as pd
import seaborn as sns
sns.set()

data = pd.DataFrame(x, columns=['x1', 'x2'])
data['xor'] = y
sns.scatterplot(data=data, x='x1', y='x2', style='xor', hue='xor', s=100)


A neural network is essentially a series of hyperplanes (a plane in N dimensions) that group / separate regions in the target hyperplane.

Two lines is all it would take to separate the True values from the False values in the XOR gate.

# The Perceptron

There’s lots of good articles about perceptrons. To quickly summarise, a perceptron is essentially a method of separating a manifold with a hyperplane. This is just drawing a straight line to separate an n-dimensional space into two regions: True or False. I will interchangeably refer to these as neurons or perceptrons. Arguablly different, they are basically the same thing.

### Activation functions

The way our brains work is like a sort of step function. Neurons fires a 1 if there is enough build up of voltage else it doesn’t fire (i.e a zero). We aim, via the percepton, to recreate this behaviour.

The problem with a step function is that they are discontinuous. This creates problems with the practicality of the mathematics (talk to any derivatives trader about the problems in hedging barrier options at the money). Thus we tend to use a smooth functions, the sigmoid, which is infinitely differentiable, allowing us to easily do calculus with our model.

def sigmoid(x):
return 1/(1 + np.exp(-x))


### Hyperplanes

A single perceptron, therefore, cannot separate our XOR gate because it can only draw one straight line.

The trick is to realise that we can just logically stack two perceptrons. Two perceptrons that will draw straight lines, and another perceptron that serves to combine these two separate signals into a single signal that just has to differntiate between a single True / False boundary.

## Learning parameters

The “knowledge” of a neural network is all contained in the learned parameters which are the weights and bias. The weights are multiplied to each signal sent by their respective perceptrons and the bias are added as $y(x) = wx + b$ where $w$ is the weight and $b$ is the bias.

The backpropagation algorithm (backprop.) is the key method by which we seqeuntially adjust the weights by backpropagating the errors from the final output neuron.

We define define the error as anything that will decrease as we approach the target distribution. Let $E$ be the error function given by

where $y_o$ is the result of the output layer (the prediction) and $y$ is the true value given in the training data.

Later we will require the derivative of this function so we can add in a factor of 0.5 which simplifies the derivative.

def error(target, prediction):
return .5 * (target - prediction)**2

 *Note: Explicitly we should define as the norm like, $E = \frac{1}{2} y - y_{o} ^2$ since $y$ and $y_{o}$ are vectors but practically it makes no difference and so I prefer to keep it simple for this tutorial.*

### Algorithm

The learning algorithm consists of the following steps:

1. Randomly initialise bias and weights
2. Iterate the training data
• Forward propagate: Calculate the neural net the output
• Compute a “loss function”
• Backwards propagate: Calculate the gradients with respect to the weights and bias
3. Exit when error is minimised to some criteria

Note that here we are trying to replicate the exact functional form of the input data. This is not probabilistic data so we do not need a train / validation / test split as overtraining here is actually the aim.

## Back propagation

We want to find the minimum loss given a set of parameters (the weights and biases). Recalling some AS level maths, we can find the minima of a function by minimising the gradient (each minima has zero gradient). Also luckily for us, this problem has no local minima so we don’t need to do any funny business to guarantee convergence.

Real world problems require stochastic gradient descents which “jump about” as they descend giving them the ability to find the global minima given a long enough time.

We therefore have several quantitites that require calculation

$\frac{\partial E}{\partial w_{o}}$,

$\frac{\partial E}{\partial w_{h}}$,

$\frac{\partial E}{\partial b_{o}}$

and $\frac{\partial E}{\partial w_{h}}$ where $h$ and $o$ denote hidden and output layers and $E$ is the total error given by

Starting at the output layer, and by chain rule,

This is true because the output layer $y_o = \sigma(a_o)$ varies with respect to the activation $a_o = w_o \cdot y_h + b_o$ in the output layer. The activation in the output layer varies with respect to the weights of the output layer. Recall that this is a partial derivative so we hold the bias of the output layer and the hidden layer output as constants. We can calculate all these terms.

The derivative of the Error with respect to the output layer is just

due to the sneaky extra half

def error_derivative(target, prediction):
return - target + prediction


The derivative of the output layer with respect to the sigmoid is

def sigmoid_derivative(sigmoid_result):
return sigmoid_result * (1 - sigmoid_result)


The derivative of the activation function with respect to the weights is

which is just the result from the hidden layer.

The same is true for the bias (changing $w_o$ for $b_o$) except

For the hidden layer we just keep expanding. Instead of taking $y_h$ as a variable we now treat it as a function that varies with the output of the hidden layer. Thus instead of stopping at $\frac{\partial a_o}{\partial w_o}$ we must continue since $a_o = w_o \cdot y_h + b_o$ is a fuction of $y_h$ which is no longer a constant.

Recall, if we vary $w_h$, $y_h = \sigma(a_h)$ and further, $a_h = w_h \cdot x + b$ where $x$ is the input training data.

Starting at the output layer, and by chain rule,

the three extra derivatives we need to calculate are

then recall the derivative of the sigmoid from before

and finally

and again the bias are the same except $\frac{\partial a_h}{\partial b_h} = 1$

Let all the parameters be defined in the vector $\theta = (w_o, w_h, b_o, b_h)$. Then they are all updated like

where $\alpha$ is the learning rate that is fixed at some constant (mysteriously, usually about 0.02) and this defines how fast we descend the gradient. Too fast and we overshoot and too small and it takes too long.

Explicitly for $w_o, b_o$ this would be,

where the $y_h^T$ denotes the transpose of the vector $y_h$. I was a bit underhand in the maths to simplify, but as we will see there is some care required with transposing of dot products when multiplying the weights and layers.

# Implementation

## Initialisation

Initialise the weights and bias like

import numpy as np

def xor(x1, x2):
return bool(x1) != bool(x2)

x = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
y = np.array([xor(*i) for i in x])

# I explicitly name these for clarity
n_neurons_input, n_neurons_hidden, n_neurons_output = 2, 2, 1

# if this confuses you draw out all the possible wasy to connect the
# first two neurons remembering that we have only connections between layers
# and not between neurons on the same layer
w_hidden = np.random.random(size=(n_neurons_input, n_neurons_hidden))
b_hidden = np.random.random(size=(1, n_neurons_hidden))

w_output = np.random.random(size=(n_neurons_hidden, n_neurons_output))
b_output = np.random.random(size=(1, n_neurons_output))


## Forward Propagation

We now have a neural network (albeit a lousey one!) that can be used to make a prediction. To make a prediction we must cross multiply all the weights with the inputs of each respective layer, summing the result and adding bias to the sum.

The action of cross multiplying all combinations of two vectors and summing the result is just the dot product, so that a single forward propagation is given as

def sigmoid(x):
return 1/(1 + np.exp(-x))

y_hidden = sigmoid(np.dot(x, w_hidden) + b_hidden)
y_output = sigmoid(np.dot(y_hidden, w_output) + b_output)


where y_output is now our estimation of the function from the neural network.

## Back propagation

The back propagation is then done

def sigmoid_derivative(sigmoid_result):
return sigmoid_result * (1 - sigmoid_result)

grad_output = (- y + y_output) * sigmoid_derivative(y_output)


The updates require a little explanation of the cheats I did before in the theory section. The function, $a_o = w_o \cdot y_h + b_o$ is actually a vector equation so explicitly, $a_o = y_h^T w_o + b_o$ and similarly, $a_h = x^T w_h + b_h$ which gives some intuition why we need to do the transposes below

w_output -= alpha * y_hidden.T.dot(grad_output)


The biases we recall had a derivative of 1. In reality this is the identity, or a vector of 1s so it reduces to a simple sum of all elements in the vector

b_output -= alpha * np.sum(grad_output, axis=0, keepdims=True)
b_hidden -= alpha * np.sum(grad_hidden, axis=0, keepdims=True)


## Iteration / Early Stopping

We want to stop after some number of iterations. Therefore in our loop we can do something like below to exit early. We should also store some of the variables and gradients to ensure convergence.

import itertools

errors = []
params = []
while True:
# forward prop

# calculate mean error of all the errors for this epoch
e = error(y, y_output).mean()
if e < 1e-4:
break

# back prop

# update parameters

# record intermediate results the concatenate just flattens all the
# matrixes and arrays out into a single "row"
errors.append(e)
params.append(np.concatenate((w_output.ravel(), b_output.ravel(),
w_hidden.ravel(), b_hidden.ravel())))


# Experiment

## Full script

Putting it all together we have

import itertools
import numpy as np

np.random.seed(42) # this makes sure you get the same results as me

def xor(x1, x2):
return bool(x1) != bool(x2)

def sigmoid(x):
return 1 / (1 + np.exp(-x))

def sigmoid_derivative(sigmoid_result):
return sigmoid_result * (1 - sigmoid_result)

def error(target, prediction):
return .5 * (target - prediction)**2

def error_derivative(target, prediction):
return - target + prediction

x = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
y = np.array([[xor(*i)] for i in x], dtype=int)

alpha = 0.02
n_neurons_input, n_neurons_hidden, n_neurons_output = 2, 2, 1

w_hidden = np.random.random(size=(n_neurons_input, n_neurons_hidden))
b_hidden = np.random.random(size=(1, n_neurons_hidden))

w_output = np.random.random(size=(n_neurons_hidden, n_neurons_output))
b_output = np.random.random(size=(1, n_neurons_output))

errors = []
params = []
while True:
# forward prop
y_hidden = sigmoid(np.dot(x, w_hidden) + b_hidden)
y_output = sigmoid(np.dot(y_hidden, w_output) + b_output)

# calculate mean error of all the errors for this epoch
e = error(y, y_output).mean()
if e < 1e-4:
break

# back prop
grad_output = error_derivative(y, y_output) * sigmoid_derivative(y_output)

# update parameters

# record intermediate results
errors.append(e)
params.append(np.concatenate((w_output.ravel(), b_output.ravel(),
w_hidden.ravel(), b_hidden.ravel())))


This takes about a minute to run on my macbook. We can now use the model to approximate XOR

>>> def predict(x):
>>>     y_hidden = sigmoid(np.dot(x, w_hidden) + b_hidden)
>>>     return sigmoid(np.dot(y_hidden, w_output) + b_output)
>>> predict(x)
array([[0.01505756],
[0.98652214],
[0.98652109],
[0.01448912]])


As we can see we’re in the order of $10^{-2}$ which makes sense as we minimise the square error to the order of $10^{-4}$.

If this was a real problem, we would save the weights and bias as these define the model.

## Convergence checks

We should check the convergence for any neural network across the paramters.

The simplest check is the error convergence. We should make sure that the error is decreasing. in larger networks the error can jump around quite erractically so often smoothing (e.g. EWMA) is used to see the decline.

It is also sensible to make sure that the parameters and gradients are cnoverging to sensible values. Furthermore, we would expect the gradients to all approach zero.

It is very important in large networks to address exploding parameters as they are a sign of a bug and can easily be missed to give spurious results.

I use Pandas combined with matplotlib to plot more easily

import matplotlib.pyplot as plt
import pandas as pd

fig, axes = plt.subplots(3, 1, figsize=(6, 8), sharex=True)

pd.DataFrame(errors, columns=['error']).plot(ax=axes[0], logy=True)
df_params = pd.DataFrame(params)

for i in range(4):
for i in range(4, 12):

for i in range(1, 2):
axes[2].plot(df_params.iloc[:, i].abs(), c='r', label='abs(output weights)' if i==1 else '__nolabel', lw=1)

axes[2].plot(df_params.iloc[:, 2].abs(), c='b', label='abs(output bias)', lw=1)

for i in range(3, 7):
axes[2].plot(df_params.iloc[:, i].abs(), c='g', label='abs(hidden weights)' if i==3 else '__nolabel', lw=1)

for i in range(7, 9):
axes[2].plot(df_params.iloc[:, i].abs(), c='y', label='abs(hidden bias)' if i==7 else '__nolabel', lw=1)

axes[1].legend()
axes[1].set_yscale('log')

axes[2].legend()
axes[2].set_yscale('log')

fig.tight_layout()


It is noticable that the gradient of the hidden layer is considerably smaller than the output layer. This is normal.

In fact it presents a problem. One of the main problems historically with neural networks were that the gradients became too small too quickly as the network grew. In fact so small so quickly that the change in a deep parameter value causes such a small change in the output that it either gets lost in machine noise. Or, in the case of probabilistic models, lost in dataset noise.

This meant that neural networks couldn’t be used for a lot of the problems that required complex network architecture.

There are several workarounds for this problem which largely fall into architecture (e.g. ReLu) or algorithmic adjustments (e.g. greedy layer training).

The sigmoid, is a key offender in the mix. It maps input numbers onto a “small” range of [0, 1]. There are large regions of the input space which are mapped to an extremely small range. In these regions of the input space, even a large change will produce a small change in the output.

This notebook is an excellent example of choosing Relu instead of sigmoid to avoid the vanishing gradient problem.

## Decision regions

We can plot the hyperplane separation of the decision boundaries. The sigmoid is a smooth function so there is no discontinuous boundary, rather we plot the transition from True into False.

This plot code is a bit more complex than the previous code samples but gives an extremely helpful insight into the workings of the neural network decision process for XOR.

import numpy as np
from matplotlib.colors import ListedColormap
import matplotlib.pyplot as plt

def predict(x):
y_hidden = sigmoid(np.dot(x, w_hidden) + b_hidden)
return sigmoid(np.dot(y_hidden, w_output) + b_output)

markers = ('s', 'X')
colors = ('red', 'blue')

x1_min, x1_max = x[:, 0].min() - 1, x[:, 0].max() + 1
x2_min, x2_max = x[:, 1].min() - 1, x[:, 1].max() + 1
xx1, xx2 = np.meshgrid(np.arange(x1_min, x1_max, resolution),
np.arange(x2_min, x2_max, resolution))

z = predict(np.array([xx1.ravel(), xx2.ravel()]).T)
z = z.reshape(xx1.shape)

fig, ax = plt.subplots()
ax.contourf(xx1, xx2, z, alpha=0.4, cmap=ListedColormap(colors))
ax.set_xlim(xx1.min(), xx1.max())
ax.set_ylim(xx2.min(), xx2.max())

for idx, cl in enumerate(np.unique(y)):
ax.scatter(x=x[(y == cl).ravel(), 0],
y=x[(y == cl).ravel(), 1],
alpha=0.8, c=colors[idx],
marker=markers[idx], label=cl)
ax.legend()
ax.set_xlabel('$x_1$')
ax.set_ylabel('$x_2$')
fig.tight_layout()


A converged result should have hyperplanes that separate the True and False values.

# References

## Contents

Lets assume you happen to stumble across the NHS data for Covid19 found at https://www.england.nhs.uk/statistics/statistical-work-areas/covid-19-daily-deaths/

# Parse an Excel with Pandas

We can extract th dat for a funky plot like

import datetime
import pandas as pd

dt = datetime.date.today()
url_ts = url_root + f'COVID-19-total-announced-deaths-{dt.strftime("%-d-%B-%Y")}.xlsx'

_parser = lambda x: x in ['NHS England Region', 'Name'] or isinstance(x, datetime.datetime)
raw = pd.read_excel(url_ts, sheet_name=0, skiprows=15, usecols=_parser)
raw = raw.iloc[2:]
raw.index = pd.to_datetime(raw.index)


I opened the Excel file initially in Excel. There were 15 useless rows at the top. There was also an annoying column format where I only wanted dates, Name of hospital and NHS England Region. This is achieved by passing a function that will return True for any of these fields. There are also two junk rows athe the start of the dataframe and always convert date fields into datetime objects.

# Cleaning Data

If you read the note in the excel file we see that numbers are rastically changed as and when the data comes in. This means that the last 5 days are highly variable. We want to separate this out and make it clear on any plot. Also the data is noisey. We can smooth with EWMA. I thought EWMA is a bit better than moving average because you get a continuous smoothing rather than a plot that jumps around as data moves in and out of the moving average window.

span = 10
unstable_region = 5
raw_diffs = raw.sum(axis=1).diff() # useful later
smoothed = raw_diffs.diff().ewm(span=span).mean()

stable = smoothed.iloc[:-unstable_region+1]
unstable = smoothed.iloc[-unstable_region:]


# Plotting

This is pretty easy

import matplotlib.pyplot as plt
plt.ion()

fig, ax = plt.subplots()
stable.plot(title='Acceleration of Cumulative UK Deaths in NHS Hospitals',
label='stable data',
ax=ax)
unstable.plot(ax=ax, color='red', ls='--', label='unstable data')


We can now plot the unstable dataset like

fig, ax = plt.subplots()
stable.plot(title='Acceleration of Cumulative UK Deaths in NHS Hospitals',
label='stable data',
ax=ax)
unstable.plot(ax=ax, color='red', ls='--', label='unstable data')

ax.set_ylabel('change in daily fatalities')


Now wehave the problem where the last 5 days of data are potentially garbage. WEe need to estimate somehow what the update might look like. That last few days of updates are available on the same landing page.

import requests
import bs4

# adjust the unstable data to take into account updated data
url_page = r'https://www.england.nhs.uk/statistics/statistical-work-areas/covid-19-daily-deaths/'
html = requests.get(url_page).content
soup = bs4.BeautifulSoup(
html,
"html.parser",
parse_only=bs4.SoupStrainer('a')
)
l['href']
for l in soup if 'xls' in l.get('href', '')
and 'daily announced' in l.text.lower()
]


IN the above I parse the page for all a tags which are hyperlinks in HTML. Then I look for all hyperlinks that end in ".xls" and that also have the text "daily announced" in the link text.

Now I want to parse the date the update was given by. THis is a bit tricky because it’s clearly a human process as the files have slightly different names. We are at risk of the process breaking if someone changes the naming convention. I take the assumption "deaths" will always be the end of the file text name and be next to the date.

def get_offset_from_url(u):
d = datetime.datetime.strptime(u[u.index('deaths-') + 7: u.index('-2020') + 5], '%d-%B-%Y')
return (dt - d.date()).days

kws = dict(sheet_name='COVID19 daily deaths by age',
skiprows=14, usecols=_parser, nrows=1)
{
}
)


I pass the data into a dataframe and read from the sheet name skipping useless rows and using the same column parser as before because we only want the dates and it doesn’t need to be recreated.

I’ll let you figure this next bit out, basically I’m recreating the data as it previously was.

updates_raw.columns.name = 'Adjustments n days ago'

# this is because of some errors in the data
raw_as_of_date.iloc[-i-1:] = 0
all_cols.remove(i)

name='The date that was updated (n days ago)')
name='The date that was updated (n days ago)')


There is a mistake in the dataset as well which I remove by raw_as_of_date.iloc[-i-1:] = 0. I add in the name parameters because I’m a bit OCD.

I then plot the data to make sure there isn’t any drastic change in the updates each day. What is possible is that in a stressed system there might be an increasing backlog.

fig2, (ax2, ax2b) = plt.subplots(2, sharex=True)
l = 0
colors = itertools.cycle(plt.rcParams['axes.prop_cycle'].by_key()['color'])
c = next(colors)
ax=ax2, lw=1, ls='--', color=c,
label=f'update for previous days made {col} days ago'
)

c = next(colors)

fig2.suptitle('Percentage change in the daily figures')
ax2.set_title("percentage daily adjustments aren't drastically increasing day by day",
fontsize=8)
ax2.set_ylabel('Percentage change')
ax2.legend(fontsize=8)
ax2b.set_ylabel('Absolute change')


AS it turns out this isn’t the case and the updates are roughly the same each day and there is only about a 10% increase after the 5th day so the NHS note is correct in saying the last 5 days are the unstable region.

Estimating the expected updates is just done by integrating under the curve to get the expected updates as a percentage.

raw_diffs_adj = raw_diffs.copy()
for i in range(unstable_region):


# The updated plot

We use the axis handle from the plot above and update for the new estimates of what we expect the data to show.

smoothed_adj = raw_diffs_adj.diff().ewm(span=span).mean()

fig3, ax3 = plt.subplots()
raw_diffs.plot(title='Daily fatalities in NHS Hospitals',
label='original data', ax=ax3)
ax3.set_ylabel('Fatalities attributed to each date')
ax3.legend(loc='upper left')



# Estimating Impact of Government Measures

We now want to annotate where the UK government introduced quarantine phases. I found the parameters for the offset of 13 days byreading a load of medical papers which showed it took 13 days from infection to fatality as the low end of the interquartile range. I assume this would be where we start to see effects really feeding back into the data.

We can add labels for where we expect the data to be affected by isolation like

d0 = datetime.datetime(2020, 3, 16)
d1 = datetime.datetime(2020, 3, 23)
d2 = d0 + datetime.timedelta(days=13)
d3 = d1 + datetime.timedelta(days=13)

label='national isolation + 13 days')
ax.axvspan(d2, d3, color='darkorange', alpha=0.3, lw=0,
label='blanket symptomatic isolation + 13 days')
ax.legend(loc='lower left', fontsize=8)


Layout is better with

fig.tight_layout()


# Full Example Script

import datetime
import pandas as pd
import itertools
import bs4
import requests
from urllib.request import HTTPError

dt = datetime.date.today()
url_ts = url_root + f'COVID-19-total-announced-deaths-{dt.strftime("%-d-%B-%Y")}.xlsx'

_parser = lambda x: x in ['NHS England Region', 'Name'] or isinstance(x, datetime.datetime)
raw = pd.read_excel(url_ts, sheet_name=0, skiprows=15, usecols=_parser)
raw = raw.iloc[2:]

raw = raw.set_index(['NHS England Region', 'Name']).T.cumsum()
raw.index = pd.to_datetime(raw.index)

span = 10
unstable_region = 5
raw_diffs = raw.sum(axis=1).diff() # useful later
smoothed = raw_diffs.diff().ewm(span=span).mean()

stable = smoothed.iloc[:-unstable_region+1]
unstable = smoothed.iloc[-unstable_region:]

fig, ax = plt.subplots()
stable.plot(title='Acceleration of Cumulative UK Deaths in NHS Hospitals',
label='stable data',
ax=ax)
unstable.plot(ax=ax, color='red', ls='--', label='unstable data')

ax.set_ylabel('change in daily fatalities')

# adjust the unstable data to take into account updated data
url_page = r'https://www.england.nhs.uk/statistics/statistical-work-areas/covid-19-daily-deaths/'
html = requests.get(url_page).content
soup = bs4.BeautifulSoup(
html,
"html.parser",
parse_only=bs4.SoupStrainer('a')
)
l['href']
for l in soup if 'xls' in l.get('href', '')
and 'daily announced' in l.text.lower()
]

def get_offset_from_url(u):
d = datetime.datetime.strptime(u[u.index('deaths-') + 7: u.index('-2020') + 5], '%d-%B-%Y')
return (dt - d.date()).days

kws = dict(sheet_name='COVID19 daily deaths by age',
skiprows=14, usecols=_parser, nrows=1)
{
}
)

# this is because of some errors in the data
raw_as_of_date.iloc[-i-1:] = 0
all_cols.remove(i)

name='The date that was updated (n days ago)')
name='The date that was updated (n days ago)')

# check that daily adjustment aren't drastically increasing
# ie. that we can use an average of the last few days without
# implicitly concealing some exponential growth

fig2, (ax2, ax2b) = plt.subplots(2, sharex=True)
l = 0
colors = itertools.cycle(plt.rcParams['axes.prop_cycle'].by_key()['color'])
c = next(colors)
ax=ax2, lw=1, ls='--', color=c,
label=f'update for previous days made {col} days ago'
)

c = next(colors)

fig2.suptitle('Percentage change in the daily figures')
ax2.set_title("percentage daily adjustments aren't drastically increasing day by day",
fontsize=8)
ax2.set_ylabel('Percentage change')
ax2.legend(fontsize=8)
ax2b.set_ylabel('Absolute change')

# we only see about 10% increase to data that was more than 5 days ago
# therefore we can adjust accordingly
for i in range(unstable_region):

fig3, ax3 = plt.subplots()
raw_diffs.plot(title='Daily fatalities in NHS Hospitals',
label='original data', ax=ax3)
label='taking into account expected future adjustments')
ax3.set_ylabel('Fatalities attributed to each date')
ax3.legend(loc='upper left')

d0 = datetime.datetime(2020, 3, 16)
d1 = datetime.datetime(2020, 3, 23)
d2 = d0 + datetime.timedelta(days=13)
d3 = d1 + datetime.timedelta(days=13)

label='national isolation + 13 days')
ax.axvspan(d2, d3, color='darkorange', alpha=0.3, lw=0,
label='blanket symptomatic isolation + 13 days')
ax.legend(loc='lower left', fontsize=8)
fig.tight_layout()


# Learning Outcomes

• Be able to create a basic class
• Interact with a class object
• Understand the utility of classes
• An idea of good and bad class usage
• Be able to create modular code
• Import from modular code

# Classes

Without necessarily knowing it you have been itnertacting with classes during this entire tutorial. That be said what is a class?


class Dog:
"""
Creates a Dog. Note that you will normally not document __init__ but instead
put then documentation under the class definition

Args:
name: The dogs name
color: give a color
speed: Give the top speed in mph. Must be >= 0
weight: Give the weight in kg. Must be >= 0

Example:
>>> Dog('rover', 'brown', 10, 50)
"""
def __init__(self, name, color, max_speed, weight):
self.name = name
self.color = color
self.max_speed = max_speed
self.weight = weight

if max_speed <= 0:
raise ValueError('Cannot have max_speed <= 0')
if weight <= 0:
raise ValueError('Cannot have weight <= 0')

def noise(self):
"""
Makes the noise of a dog

Example:
>>> Dog('ronald', 'green', 10, 50).noise()
'woof'
"""
return 'woof'


A class is an abstraction of the logic that defines something. In this instance ‘something’ is a Dog - or rather a pretty basic dog. Currently all our dog can do is Woof!

The __init__ method is used to define what arguments a class can take at it’s creation. We can similarly define a class like

class DogBase:

def noise(self):
return 'woof'


Style Tip Always define classes in ThisFormatOfText

but the we cannot specify any defining characteristics about DogBase so it kind of loses it’s individual charm - there are cases where DogBase would be increadibly useful but we won’t touch on those at this level.

## What is self in a class

Functions within classes are known as class methods. This new namenclature kind of gives us a hint at a few things…

I can define a function outside of the class which doesn’t overwrite the function inside the class

>>> def noise()
...     return 'roar'
>>> Dog('ronald', 'purple', 10, 50).noise()
'woof'


### Why is it called self?

Secondly, notice that all my class methods contain the variable name self as the first argument whereas my function did not have this requirement.

self is actually not required to be called self. In fact I could call it any name. It’s the same as using m for mass in Newton’s law F = m * a. I could easily replace Newton’s law with g = p * j and state that g is force, p is mass and j is acceleration.

The reason we use self is because it’s refers to the class object itself

### What does self do?

Lets create an instance of Dog

>>> rover = Dog('rover', 'brown', 15, 50)


When we call

>>> rover.noise()


what happens internally is actually

>>> Dog.noise(rover)


thus self becomes rover!

## Interacting with class attributes

Anything that is assigned to self becomes an attribute that we can interact with. This is another very useful property of a class: For storing intermediate results.

class Demo:
"""
A demo taking in 3 integers, x, y, z

Example:
>>> d = Demo(1, 2, 3)
>>> (d.x == 1) & (d.y == 2) & (d.z == 3)
True
"""
def __init__(self, x, y, z):
self.x = x
self.y = y
self.z = z

def method(self, a, b):
"""
Calculates (x^y + z) * a * b

Example:
>>> x, y, z, a, b = 1, 5, 76, 89, 2
>>> d = Demo(x, y, z)
>>> d.method(a, b) - (x**y + z) * a * b  < 1e-6
True

We can also extract the temporary reuslt afterwards
>>> d.tmp - (x**y + z) < 1e-6
True

and we also stored the result
>>> d.res - (x**y + z) * a * b  < 1e-6
True
"""
self.tmp = self.x**self.y + self.z
self.res = self.tmp * a * b
return self.res


Testing Tip As a side note I actually got the variable names on this really simple function wrong several times simply because I decided to rename them all! THe doctests saved me actually putting up incorrect code.

## When to use Classes

This is all great but you don’t work in a Game Studio. This is geared towards banking so why do we care about Dogs?! We don’t, so lets start looking at trading strategies.

Just as a heasd up: This has no relation to how Nomura QIS calculate momentum. This is just a simple demonstration to how we would want to utilise classes for signals

The best way to start is normally with functions or single lines of code. As you become comfortable with components move then into functions.

Consider a simple momentum trading strategy: Simply define the momentum signal as the average of the sign of the return the n previous days

import numpy as np

def momentum_signal(s):
"""
A simple momentum signal

Args:
s: A numpy array of some underlying instrument price

Example:
We expect the returns to be calculated like

>>> import numpy as np
>>> s = np.array([1. , 1.5, 1. , 1.5])
>>> rets_expected = np.array([1.5/1-1, 1/1.5-1, 1.5/1-1])

Then we take the sign of these returns which gives

>>> np.sign(rets_expected)
array([ 1., -1.,  1.])

Finally we average the sign like
>>> expected = np.sign(rets_expected).sum() / 3.

and we can compare against our function using np.allclose
because sometimes the values may vary ever so slightly due to machine
precision

>>> result = momentum_signal(s)
>>> np.allclose(result, expected)
True
"""
if not isinstance(s, np.ndarray):
s = np.array(s)         # we need it to be a numpy array for efficiency
rets = s[1:] / s[:-1] - 1   # See exercise 11.2 to understand this
return np.sign(rets).mean()


This function is pretty generic bcause it will just give the signal on whatever prices you send into it. However, you need logic to take the last n prices from a larger timeseries and may want to do this in a rolling process so as to create a backtest.

These two conditions generate the need for a class

### When to avoid classes

As a beginner classes will almost always make debugging more complicated if you dive into them head first.

Remember that abstraction is a tool: You should only use tools if they make your work more productive.

An analogy is laying a drainage ditch in your back garden. An automated drainage ditcher is clearly far more adept at this task. By the time you source one and finally figure out how it works you could have already been at the pub: Garden swap free. Also in using this new tool, it’s likely that you may misconfigure it and have a lot of issues as a result! Pick the right tool for the right job!

I regularly see code from juniors that will define a single class in a module that will never change in its parameterisation and will only ever be called once with functions that do indepedent tasks.

However, just because they functions were all for the specific task they may have wrote a class like

from yourfirms_quantlib import core

"""
The report I was asked to automate yesterday for that client

Args:

Example:
>>> report = TradeThingyForReportX([432346, 3546457, 345324])
[432346, 3546457, 345324]
"""

Example:
>>> report = TradeThingyForReportX([432346, 3546457, 345324])
2389456.34
"""

def rebuild_forward_curves_with_new_data(self, new_data):

Args:
new_data: A list of some kind of new params for the trades

Example:
>>> report = TradeThingyForReportX([432346, 3546457, 345324])
>>> report.rebuild_forward_curves_with_new_data([{'start_date': 1}])

Lets assume we can check in the core quantlib if they are mutated
[1, 1, 1]
"""
# ...
# some manipulation to reformat new_data
# ...

# assume this changes the IDs in memory during this python session
# althought this would probabaly be a terrible idea in reality!

def get_present_value_with_new_data(new_data):
"""
Gets the PV for a report X with some scenario modifications

Args:
new_data: A list of some kind of new params for the trades

Example:
Lets assume our new_data allows us to bump parameters then

>>> report = TradeThingyForReportX([432346, 3546457, 345324])
>>> report.get_present_value_with_new_data([{'start_date': 1}])
92309.6
"""
self.rebuild_forward_curves_with_new_data(new_data)


Given that this report will never be used outside of this junior’s function: I would say that this is a bad example of writing a class.

Instead I would write the above example as below which results in far less complexity and less code

from yourfirms_quantlib import core

Args:
new_data: A list of some kind of new params for the trades

Example:
>>> rebuild_forward_curves_with_new_data([7878676], [{'end_date': -1}])

Lets assume we can check in the core quantlib if they are mutated
[-1]
"""
# ...
# some manipulation to reformat new_data
# ...

# assume this changes the IDs in memory during this python session
# althought this would probabaly be a terrible idea in reality!

"""
Gets the PV for a report X with some scenario modifications

Args:
new_data: A list of some kind of new params for the trades

Example:
Lets assume our new_data allows us to bump parameters then

>>> pv_for_report_x([432346, 3546457, 345324], [{'start_date': 1}])
92309.6
"""


### When to move into a class

In my opinion there are two good reasonas to use classes in python

1. You want to create many parameterisations of the same underlying logic (e.g. The Dog example)
2. You have many functions that are constantly passing around the same objects

### A class for our backtesting

Lets create an engine that can give us the signal at any given historic timepoint for our momentum strategy

#### Use pandas for timeseries analysis!

Doing timeseries analysis without using pandas is asking for trouble unless you’re an advanced programmer.

This is because pandas contains an index which travels with the arrays that prevents you tying youeslf in knots. A single bad index can and will totally breakdown any systemmatic signal.

Therefore let us start the creation of our strategy using a pandas Series (a pd.Series is just a single column of a pd.DataFrame) check this by creating df = pd.DataFrame({'a': [5, 6, 10]}) and selecting the column 'a' like df['a']

#### Writing the base logic

The following incorporates the basic framework in which we want to operate

class SimpleMom:
"""
A simple momentum strategy

Args:
ts: The prices as a pandas DataFrame
lookback: How many previous days we wish to create a signal from
"""
def __init__(self, prices, lookback):
if not isinstance(prices, pd.Series):
raise ValueError('prices must be a pandas Series because we only want one column!')

self.prices = pd.DataFrame(prices)
self.lookback = lookback


Now lets add the signals in! This time we want to actually get the signal for a specific historical time point so we need some kind of reference to the time in the dataframe index

class SimpleMom:
"""
A simple momentum strategy

Args:
ts: The prices as a pandas DataFrame
lookback: How many previous days we wish to create a signal from

Example:
>>> import pandas as pd
>>> prices = pd.Series([1. , 1.5, 1. , 1.5, 3.0])
>>> sm = SimpleMom(prices, 3)
"""
def __init__(self, prices, lookback):
if not isinstance(prices, pd.Series):
raise ValueError('prices must be a pandas Series because we only want one column!')

self.prices = prices
self.lookback = lookback

def signal(self, t):
"""
Gives the signal that we can observe at time t

Example:
>>> sig = SimpleMom(pd.Series([1. , 1.5, 1. , 1.5, 4.0]), 3).signal(3)
>>> np.allclose(1/3., sig)
True
"""
# this gives us all the prices up to but not including t
# which is what we want because we can't observe the price at t
# Think about EOD prices as an example... We can't see EOD for t
# when are in day t
prices_lookback = self.prices.loc[t - self.lookback: t]
return momentum_signal(prices_lookback)


# Modular Python

At this stage I would strongly recommend turning on autoreload in ipython which can be done like

In [0]: %load_ext autoreload



This will allow your code to autoreload the changes as you edit them!

## Example: Create a basic module

Create python files with the terminal in new directory called nnet the same directory like

./nnet/
__init__.py
core.py


This module will contain the generic mathematical building blocks required for a neural network. Our main module model.py will bring all of these components together to make a model.

Inside core.py put the following contents.

import numpy as np

def sigmoid(x):
"""
The sigmoid function

Example:

This example shows how you can skip doctests
on certain examples e.g. plots that are purely for
demonstration and won't result in testable results

>>> from nnet import core
>>> import matplotlib.pyplot as plt  # doctest: skip
>>> x = np.linspace(-10, 10, 100)    # doctest: skip
>>> plt.plot(x, core.sigmoid(x))     # doctest: skip
>>> plt.show()                       # doctest: skip
"""
return 1 / (1 + np.exp(-x))

def forward_prop(x, w, b):
"""
Forwards propagation

Example:
>>> import numpy as np
>>> from nnet import core
>>> x = np.array([[0],   # demonstrate that x is a column array
...               [1]])  # so has shape of (2, 1) - (rows, columns)
>>> w = np.array([[.5, .5],
...               [.5, .5]])
>>> b = np.array([[1], [1]])  # b is also a column array but on one line
>>> core.forward_prop(x, w, b)
array([[1.5],
[1.5]])
"""
# note that you can chain logic like below in python!
if not (x.shape[0] == b.shape[0] == w.shape[-1]):
raise ValueError('You have w, x, b wrong way round! '
f'Shapes: x: {x.shape}, w: {x.shape}, b: {b.shape}')
return sigmoid(np.dot(w, x) + b)


__init__.py should be a blank file with no contents.

## Example: Importing a Module

Now cd in ipython or the terminal so that nnet is a subdirectory. You will now be able to import the nnet module and use core.sigmoid as shown in the documentation example!

If you want to be able to import like

>>> import nnet
>>> nnet.core.sigmoid(1)
0.5


you will also need to alter the __init__.py to the following

from . import core


## Example: Referencing other local modules

Create a new file model.py so that your directory looks like

./nnet/
__init__.py
core.py
model.py


Inside model.py put the following

import numpy as np

from . import core

def predict(x, w0, w1, b0, b1):
"""
Returns the prediction of a neural network
using specified weights and bias

Example:
>>> import numpy as np
>>> from nnet import model
>>> x0, x1 = 1, 0
>>> x = np.array([[x0], [x1]])
>>> w0 = np.array([[.5, .5], [.5, .5]])
>>> w1 = np.array([[.5, .5], [.5, .5]])
>>> b0 = np.array([[.25], [.25]])
>>> b1 = np.array([[1], [1]])
>>> model.predict(x, w0, w1, b0, b1).round(3)  # rounding avoids floating point issues
array([[0.843],
[0.843]])
"""
assert all((isinstance(o, np.ndarray) for o in (x, w0, w1, b0, b1)))
layer1 = core.forward_prop(x, w0, b0)
return core.forward_prop(layer1, w1, b1)


Style Tip Leave to blank lines after imports. Leave one blank line between libraries installed by default (import numpy as np at the top) and your own created modules (from . import core below)

You will now similarly be able to import and call predict as in the example if your present working directory has nnet as a subdirectory.

You can check your present working directory by typing pwd into ipython and check to see if nnet is a subdirectory by doing ls nnet either in ipython or the terminal

As an aside it should be slightly clearly why my bias for ipython is so strong

## Aside: How Python installs libraries

You have effectively created a mini-library here on-the-fly. This is infact not too far off how libs are installed when you install a library in the terminal / powershell prompt like

python -m pip install SomeLibrary  What actually happens under-the-hood is that a system environment variable namd PYTHONPATH (similar to PATH) contains a list of all directories that python expects a python module (like you have created) to sit within as a subdirectory. These directories then behave as if your ipython terminal had them as the present working directory all at the same time. As a hack you can see this working by doing >>> import sys >>> sys.path.append('/directory/that/containsnnet')  Then even if you are not in the correct directory, you will still be able to import nnet. This can be quite useful during development. Just as a heads up - never put this in finished python modules as it is awefully hacky and a terrible idea in general! # Exercises ## Exercise 11.1: Understand your self Modify the noise so that the dog now speaks english. You should include these two doctests in the Examples and make sure you can pass them I have started this off for you as below. Note that you will need to rewrite the class method in Dog with the below function so it will need to be indented and underneath the definition of Dog def noise(): """ Returns the noise of a fairly smart dog Example: >>> Dog('Rover', 'brown', 15, 50).noise() 'I am Rover the Dog. I am brown. Woof!' >>> Dog('Sally', 'green', 15, 50).noise() 'I am Sally the Dog. I am green. Woof!' """ return # Solve Me!  ## Exercises 11.2: Fixing indexing errors In the example where we create SimpleMom there is actually a major bug. See what happens if you make the t smaller than the lookback: In otherwords see what happens if you try and get a signal where there aren’t enough historic prices available for the required lookback To further elaborate: If you have 4 prices in total as your history, you want to calculate the signal at day 2 but with a lookback window of 4 days… what happens!? Propose a fix to this! # Solve Me!  # Next Topic Coming soon! # Learning Outcomes • How to use dictionary / set objects and differences to list / tuple • How to access dictionary objects in iteration • Utility of dictionaries as hashmaps • Utility of sets in dimensionality reduction ## Contents # Sets Sets only contain unique items >>> trade_ids = [12342, 324562, 12342, 36452, 54767] >>> set(trade_ids) {12342, 36452, 54767, 324562}  We can also iterate a set like: >>> for i in trade_ids: ... print(i, end=',') 12342,324562,12342,36452,54767,  ## Set differences They are also denoted by braces {}. Sets are a mathematical construct and python also supports some set logic such as set differences >>> trade_ids_expected = {12342, 36452, 54767, 324569} # shorter way of defining sets >>> unexpected_trade_ids = set(trade_ids) - trade_ids_expected >>> unexpected_trade_ids {324562}  we can also do it the other way round to look for missing trades >>> missing_trade_ids = trade_ids_expected - set(trade_ids) >>> missing_trade_ids {324569}  These two operations can be particularly useful when validating the inputs to functions. ## Sets items must be immutable We can also iterate sets in the same way that we iterate lists and tuples. Objects can also be part of sets as long as they are immutable - i.e. unchanging. Recall that lists are mutable and tuples are immutable. This means that we can have a set of tuples >>> set((1, 2,), (3, 4,)) {(1, 2,), (3, 4,)}  but not a set of lists >>> {[1, 2], [3, 4]} --------------------------------------------------------------------------- TypeError Traceback (most recent call last) <ipython-input-2-a0ff115cb325> in <module>() ----> 1 {[1, 2], [3, 4]} TypeError: unhashable type: 'list'  # Dictionaries Dictionaries are python’s version of that is known as a hash map or hash table in other languages. If anyone in an interview asks you for a hash map in python you’ll know they just mean a dict (also they probably don’t really know python that well!) All this jargon means is a key-value lookup where the key is unambigouously unique. Think VLOOKUP in Excel but if there couldn’t be any keys that are identical. We set up a dictionary with a key value pair like follows >>> d = { ... 'akey': 'avalue', ... 'anotherkey': 'avalue' ... }  values can be anything. I actually wrote a load of stuff about this but I deleted it because I think you shoudl get used looking at python documentation now you are more familiar with the language. See the offical python guide on dict - don’t bother with dict comprehensions yet as we will come onto those but have a read of the dictionaries and looping techniques sections. ## Example: Trades by Asset Class Lets assume we have the following data which we read in from a csv into a pandas DataFrame. Lets also assume that your credit and commodity desks for some reason give you the trade_id as a str - this is very annoying for you but a typical problem. Finally, for any more advanced readers this example is focused on dict and not pandas so we shall avoid using pandas for now >>> import pandas as pd >>> data = [['rates', 346455, 568789.345], ... ['rates', 3467457, 4568679.345], ... ['rates', 56858, -6578965789.45], ... ['fx', 93875, 67896789.34], ... ['fx', 34896, -3464754.456], ... ['fx', 30986, 0.3456457], ... ['credit', '234537', 45765.456], ... ['credit', '457568', -3455436.213], ... ['credit', '3467457', 456546.034], ... ['commodities', '93875', -34563456.23235], ... ['commodities', '34457', 4560456.4567], ... ['commodities', '457478', 4575678.345346], ... ['equities', 3466, -457567.345], ... ['equities', 564756, -12.93045], ... ['equities', 457568, 546636.438996]] >>> df = pd.DataFrame(data, columns=['risk', 'trade_id', 'dv01'])  ### How many trades are there per asset class with delta risk? >>> trade_by_asset_class = dict() >>> for asset_class, trade_id in df.values: ... trade_by_asset_class[asset_class] = trade_id  Lets now figure out what went wrong here… Remembering the stack method we see that there are too many values to unpack and that the arrow is on the for line (if you are useing pyCharm - you know who you are - then you may have no arrow!) With iteration errors it is often easiest to index the first element to see why we couldn’t unpack it: >>> df.values[0] array(['rates', 346455, 568789.345], dtype=object)  Here we can see there are three items and we are trying to unpack to two elements asset_class and trade_id therefore we need a third element even if we don’t currently care about the delta! A standard way of creating throwaway elements is to use _ like >>> trade_by_asset_class = dict() >>> for asset_class, trade_id, _ in df.values: ... trade_by_asset_class[asset_class] = trade_id  but this doesn’t really help because each iteration we have overwritten the value! >>> trade_by_asset_class {'rates': 56858, 'fx': 30986, 'credit': '3467457', 'commodities': '457478', 'equities': 457568}  we therefore need to create a list as a value item and then append to the list - this is one of the most common dictionary structures. >>> trade_by_asset_class = dict() >>> for asset_class, trade_id, _ in df.values: ... if asset_class not in trade_by_asset_class: ... trade_by_asset_class[asset_class] = [] ... trade_by_asset_class[asset_class].append(trade_id)  Think about these operations if you have a large number of rows: The following should bve quicker have a think about why this might be the case… >>> trade_by_asset_class = dict() >>> for ac in set(df['risk']): ... trade_by_asset_class[ac] = [] >>> for asset_class, trade_id, _ in df.values: ... trade_by_asset_class[asset_class].append(trade_id)  which gives >>> trade_by_asset_class {'rates': [346455, 3467457, 56858], 'fx': [93875, 34896, 30986], 'commodities': ['93875', '34457', '457478'], 'equities': [3466, 564756, 457568], 'credit': ['234537', '457568', '3467457']}  we now have a structure for answering the question: >>> for a, t in trade_by_asset_class.items(): ... print('risk: {:12s} trades: {:2d}'.format(a, len(t))) risk: rates trades: 3 risk: fx trades: 3 risk: commodities trades: 3 risk: equities trades: 3 risk: credit trades: 3  For more information string padding see: https://pyformat.info/#string_pad_align ### Simplifying iterations with dictionaries Lets imagine that the credit trading PnL system for some reason prepends '0s' on all the database ids under a length of 7 because some lunatic decided it looked nice in the 90s. To link your PnL you will have to also prepend zeros to every trade_id Whilst cussing out Diana Bloggs who retired last year after a distinguished trading career; but yet who also royally screwed you with one decision she made as a grad on a drizzly friday morning in 1999 We can zero pad integers to a length of 7 like >>> str(346).zfill(8) '00000346'  A naiive way of doing this would be just to call one of the following >>> df['trade_id_pad'] = df['trade_id'].astype(str).str.zfill(8)  This example shows us two new operations: Firstly that we can call .astype on a pandas.Series (a series is a single column of a DataFrame). Secondly, if a pandas series is a str type then we can call .str to access operations that are normally found within str object types. Imagine now that this dataframe is10^6$x larger WARNING if your laptop is aweful you may not want to run this next section >>> df = pd.DataFrame(data*1e6, columns=['risk', 'trade_id', 'dv01'])  Timing this for me took about 10 seconds! In [100]: %timeit df['trade_id'].astype(str).str.zfill(8) 10.5 s ± 464 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)  Lets now simplify the process by using the hashmap In [101]: %%timeit ...: trade_ids = df['trade_id'].unique() # pandas way to get unique items is fast ...: lookup = {} ...: for trade_id in trade_ids: ...: lookup = {trade_id: str(trade_id).zfill(8)} ...: df['trade_id_pad'] = df['trade_id'].apply(lookup.get) ...: 2.38 s ± 28.1 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)  Here we see the .apply method in action. This is pandas version of a map. A map iterates a single function across an array of items. map actually exists in python as a default function and we can call it like: >>> list(map(str, [3456, 4576, 7343])) ['3456', '4576', '7343']  and >>> list(map(len, ['36', '45sd76', '7343'])) [2, 6, 4]  pandas.Series.apply works in the same way and in this example iterates the .get method of lookup across every item in the dataframe. Here the lookup method is exceedingly fast and creating it only requires us to use the far slower line str(trade_id).zfill(8) 15 times instead of 15 million times! # Exercises ## Exercise 10.1: Dimensionality reduction This example aims to build on previous examples to reinforce the idea of hash maps for reducing complexity. You are working on an end-of-day regulatory risk model that requires the revaluation of all trades (e.g. Basel III: FRTB Sensitivities Based Approach). You have been instructed to calculate the present value (PV) as a new column in an Excel sheet. Someone else has done this and complained it was impossibly long and took over 40 hours to calculate. They have requested access to a compute grid to speed up their Excel sheet worth$10k per year

Assume your pricing function to get the PV of the trade is this:

>>> import time
...     """Gets the given a trade_id and returns a random pv"""
...     time.sleep(.1)
...     return 2e9 * np.random.random() - 1e9


and it is called in Excel something like =DODGY_PRICER($B3) Assume we have already read the Excel sheet with python (use our previous example data and make 1.5m rows as the example sheet data) >>> df = pd.DataFrame(data * 10000, columns=['risk', 'trade_id', 'dv01'])  Given that >>> .1 * 15e5 / 60 / 60 41.666666666666664  Use your knowledge of dictionaries to reduce the problem set and claim a portion of the cost savings for your bonus. # Solve me  ## Exercise 10.2: Extracting corporate financials via an API This example aims to help you understand how to navigate a simple dictionary object and also that there is a library for almost every problem. You should note that in a real Investment Banking use case there will be an API for internal / paid datasets and I would recommend you talk to your IT / Data desk / Strat emphasising that every dataset available in Excel will be available in python I googled "python get balance sheet" and found the following decent library: https://github.com/JECSand/yahoofinancials We can tell this is a decent library because: • It has good documentation • There are issues being worked upon recently • The last commit was recent These are all extremely important particularly when the core premise of the library is webscraping a site that can (and does regularly!) change. We can follow the example on the github homepage and adapt for our use case. I found for various reasons I needed to do a few improvements to this API to get exactly what we needed. We can pip install libraries directly from github. Please use the improved-fork I created of this API until it is accepted by the library author as follows: $ pip install git+https://github.com/flipdazed/yahoofinancials


Your exercise is to use this API to extract the Equity for Apple. Lets assume that Apple is a 100% public company (I have no idea if it is or not) so then the equity is just the Market Cap.

Hint The ticker for Apple is "AAPL"… How did I know market capitalisation was definitely there? Check the API documentation!

# Solve me


## Exercise 10.3: Comparable (Comp) analysis : Calculating Levered Beta

This is example aims to demonstrate some use cases of IBD-style analysis that can be done in python NB: I work in trading and not IBD so apologies for any mistakes!

The $\beta$ is the a measure used in CAPM modelling to measure the systemmatic risk of an individual stock vs. the unsystemmatic risk of the entire market (c.f. https://www.investopedia.com/terms/b/beta.asp)

In order to calculate the $\beta$ from comps we must first “deleverage” the $\beta$ for each comp which can be done via

where $D$ is debt, $E$ is equity, $r_T$ is effective tax rate and $E_p$ is preferred stock.

As a toy example lets calculate the $\beta$ of Microsoft (MSFT) from some comparables and compare to the Yahoo value.

Lets take the main competitors and use them as comps in the examples that follow

>>> comps = ['AAPL', 'SAP', 'IBM', 'ORCL', 'GOOG']


## Exercise 10.3.1: Obtain the Debt/Equity Ratios for comps

This example shows you a real result of an API which typically contain nested dictionary objects and how they can be parsed into a pandas.DataFrame object for ease of use.

You can create a pandas.DataFrame object like follows:

>>> import pandas as pd
>>> data = {'AAPL': {'grossMargins': 0.346, 'profitMargins': 45.23},
...         'SAP': {'grossMargins': 346, 'profitMargins': 23}}
>>> df_comps = pd.DataFrame(data)
>>> df_comps

AAPL SAP
grossMargins 0.346 346
profitMargins 45.230 23

I can assure you that it will be a bit simpler if we transpose this dataframe like follows and use the tickers as the row indices as well like follows. This will let us use columns for our calculations. As beginners in pandas objects I think this easier

>>> df_comps = pd.DataFrame(data).T
>>> df_comps

grossMargins profitMargins
AAPL 0.346 45.23
SAP 346.000 23.00

To start you off, this will get the data - you can find this all on the README.md of the github repository

>>> from yahoofinancials import YahooFinancials
>>> comps = ['AAPL', 'SAP', 'IBM', 'ORCL', 'GOOG']
>>> parser = YahooFinancials(comps)
>>> bals = parser.get_financial_data()  # This is why we use my fork: We can't get D/E from the original


Convert this to a pandas DataFrame as shown above where the tickers are the row indices. Then filter the columns to only leave only the column ['debtToEquity'] this can be done by

>>> df_comps = df_comps[['debtToEquity']]
>>> df_comps['debtToEquity'] /= 100.  # we need to convert the values from percentages to decimal

# Solve me


## Exercise 10.3.2: Calculate the Effective Tax Rate for Comps

This example will contain a non-trivial data structure containing lists and dictionaries nested within each other. It will also challenge you interact with the pandas.DataFrame object you have just created using .loc to create new entries

Calculate the effective tax rate for all these tech companies: This is given in the API as incomeTaxExpense / incomeBeforeTax as in https://github.com/JECSand/yahoofinancials#examples-of-returned-json-data

I will start you off with the following

>>> inc = parser.get_financial_stmts('annual', 'income')  # annual is more stable estimate to avoid rebates


Hint that you can create a new column/row combination in a dataframe by doing:

>>> df_comps.loc['AAPL', 'incomeBeforeTax'] = 45


You will want to iterate through each ticker and column combination to fill the values

# Solve me


## Exercise 10.3.3: Calculate the beta for each comp

This example requires you to read the technical documentation for a repository and utilise it for a solution.

Create a new column for the $\beta$ for each comp.

Explore the documentation at https://github.com/JECSand/yahoofinancials to find a sensible method (There is more than one way to do this!)

# Solve me


## Exercise 10.3.4: Calculate unlevered beta as a new column

This example uses some basic vectorised mathematics with pandas DataFrames

If you have forgotten the formula to calculate $\beta_{unlev}$ is

Calculate the $\beta_{unlev}$ for each comp in a new column called 'beta_unlev'. You should google how to do basic maths in pandas if you are unsure.

# Solve me


## Exercise 10.3.5: Calculate the beta for Microsoft

We now need to relever the $\beta$ with the debt-equity ratio and effective tax rate of microsoft.

### Exercise 10.3.5: Obtain $D/E$ and Effective Tax rate for Microsoft

This exercise revises the previous dictionary and list circus skills

Repeat just the vital steps above to obtain 'debtToEquity' and effectiveTax for "MSFT"

# Solve me


### Exercise 10.3.5: Calculate the estimate of unlevered beta as the median of the comps

The exercise demonstrates how we can use numpy functionality through pandas

This is a maths problem and so naturally we use numpy which has the function np.median. We can use this on the column of the pd.DataFrame containing the 'beta_unlev' to obtain the median, remembering that Pandas DataFrames are really just wrappers around numpy arrays

Note Most simple math functions can be found as a .method() on the end of the pandas DataFrame. For example try typing

>>> df_comps['beta_unlev'].


Then hit TAB after the . and see all the methods available. You should find one to calculate the median too.

Assign the median beta to a variable name beta_unlev_msft

# Solve me


### Exercise 10.3.5: Calculate the estimate of beta

The calculate $\beta$ for Microsoft as $\beta_{unlev}\left(1 + (1-r_T)\frac{D}{E}\right) = \beta$ where $r_T$ is the effective tax rate for Microsoft and D/E is the Debt-to-Equity ratio for Microsoft and we already calcualted $\beta_{unlev}$ above

# Solve me


# Appendix: Calculate WACC for Microsoft from comparables

This exercise is really aimed at those in Investment Banking roles because it’s nice to complete the picture

## Calculate beta based upon comparables D/E

As a guess of an “optimal” capital structure from comparables, take the median debt and median equity from the comparables. Calculate the D/E ratio from these medians and use it to create a $\beta$, using the same effectixeTax. Add a new row containing ['debtToEquity', 'effectiveTax', 'beta'] for this new $\beta$ and D/E using the index name: 'MSFT_beta_optcap'

This won’t affect Bankers working in the insudstry as you’ll have your own systems / APIs firms… However, sometimes the 'totalDebt' can be a bit rogue in Yahoo Finance. I would recommend pulling (check API documentation!) the 'totalShareholderEquity' as the total Equity and calculating 'totalDebt' from the trivial relation

total_debt = debt_to_equity * total_shareholder_equity


So we don’t give away the solutions to previous sections, I will hardcode the expected results required (as of November 13th!) below and use that in this demonstration

>>> df_msft = pd.DataFrame(
...     [[0.8068, 0.101813, 1.229776, 0.7130553],
...      [0.8068, 0.101813, 1.073342, 0.6223507]],
...     columns=['debtToEquity', 'effectiveTax', 'beta', 'beta_unlev'],
...     index=['MSFT_beta_hist', 'MSFT_beta_comp']
... )

debtToEquity effectiveTax beta beta_unlev
MSFT_beta_hist 0.8068 0.101813 1.229776 0.713055
MSFT_beta_comp 0.8068 0.101813 1.073342 0.622351

and the comparables

>>> df_comps = pd.DataFrame(
...     [[1.194, 65.737e9, 10.481e9, 0.1594, 1.247, 0.6224, 90.488e9, 108.047e9, 0.4558, 0.5442],
...      [0.5439, 5.6e9, 1.511e9, 0.2698, 1.0666, 0.7634, 30.159e9, 16.403e9, 0.6477, 0.3523],
...      [3.9512, 11.342e9, 2.619e9, 0.2309, 1.3442, 0.3328, 17.956e9, 70.947e9, 0.202, 0.798],
...      [2.9852, 12.268e9, 1.185e9, 0.0966, 1.1506, 0.3112, 18.433e9, 55.027e9, 0.2509, 0.7491],
...      [0.0763, 34.913e9, 4.177e9, 0.1196, 1.0146, 0.9508, 194.969e9, 14.872e9, 0.9291, 0.0709]],
...      index=[ 'AAPL', 'SAP', 'IBM', 'ORCL', 'GOOG'],
...      columns=['debtToEquity', 'incomeBeforeTax', 'incomeTaxExpense',
...               'effectiveTax', 'beta', 'beta_unlev', 'totalStockholderEquity',
...               'totalDebt', 'ratioEquity', 'ratioDebt']
... )

debtToEquity incomeBeforeTax incomeTaxExpense effectiveTax beta beta_unlev totalStockholderEquity totalDebt ratioEquity ratioDebt
AAPL 1.1940 6.573700e+10 1.048100e+10 0.1594 1.2470 0.6224 9.048800e+10 1.080470e+11 0.4558 0.5442
SAP 0.5439 5.600000e+09 1.511000e+09 0.2698 1.0666 0.7634 3.015900e+10 1.640300e+10 0.6477 0.3523
IBM 3.9512 1.134200e+10 2.619000e+09 0.2309 1.3442 0.3328 1.795600e+10 7.094700e+10 0.2020 0.7980
ORCL 2.9852 1.226800e+10 1.185000e+09 0.0966 1.1506 0.3112 1.843300e+10 5.502700e+10 0.2509 0.7491
GOOG 0.0763 3.491300e+10 4.177000e+09 0.1196 1.0146 0.9508 1.949690e+11 1.487200e+10 0.9291 0.0709

### Obtain Total Debt and Total Equity for comps

I use the last 'quarterly’ data for unlevering $\beta$

>>> from yahoofinancials import YahooFinancials
>>> comps = ['AAPL', 'SAP', 'IBM', 'ORCL', 'GOOG']
>>> parser = YahooFinancials(comps)
>>> bal_comps = parser.get_financial_stmts('quarterly', 'balance')


Then extract the relevant elements from the balance sheet API return values and calculate the 'totalDebt' and totalEquity ratios

>>> for p in comps:
...     bal_last = list(bal_comps['balanceSheetHistoryQuarterly'][p][0].values())[0]
...     df_comps.loc[p, 'totalStockholderEquity'] = bal_last['totalStockholderEquity']
>>> df_comps['totalDebt'] = df_comps['debtToEquity'] * df_comps['totalStockholderEquity']
>>> df_comps['ratioEquity'] = df_comps['totalStockholderEquity'] / (df_comps['totalDebt'] + df_comps['totalStockholderEquity'])
>>> df_comps['ratioDebt'] = 1 - df_comps['ratioEquity'] # A lot of assumptions here!

debtToEquity incomeBeforeTax incomeTaxExpense effectiveTax beta beta_unlev totalStockholderEquity totalDebt ratioEquity ratioDebt
AAPL 1.1940 6.573700e+10 1.048100e+10 0.1594 1.2470 0.6224 9.048800e+10 1.080427e+11 0.455789 0.544211
SAP 0.5439 5.600000e+09 1.511000e+09 0.2698 1.0666 0.7634 3.015900e+10 1.640348e+10 0.647710 0.352290
IBM 3.9512 1.134200e+10 2.619000e+09 0.2309 1.3442 0.3328 1.795600e+10 7.094775e+10 0.201971 0.798029
ORCL 2.9852 1.226800e+10 1.185000e+09 0.0966 1.1506 0.3112 1.843300e+10 5.502619e+10 0.250928 0.749072
GOOG 0.0763 3.491300e+10 4.177000e+09 0.1196 1.0146 0.9508 1.949690e+11 1.487613e+10 0.929109 0.070891

### Obtain Total Debt and Total Equity for Microsoft

We can also extract the 'totalStockholderEquity' for Microsoft as well

>>> ticker = "MSFT"
>>> parser = YahooFinancials(ticker)
>>> bal_msft = parser.get_financial_stmts('quarterly', 'balance')
>>> bal_msft = list(bal_msft['balanceSheetHistoryQuarterly'][ticker][0].values())[0]
>>> df_msft.loc[:, 'totalStockholderEquity'] = bal_msft['totalStockholderEquity']
>>> df_msft['totalDebt'] = df_msft['debtToEquity'] * df_msft['totalStockholderEquity']
>>> df_msft['ratioEquity'] = df_msft['totalStockholderEquity'] / (df_msft['totalDebt'] + df_msft['totalStockholderEquity'])
>>> df_msft['ratioDebt'] = 1 - df_msft['ratioEquity']

debtToEquity effectiveTax beta beta_unlev totalStockholderEquity totalDebt ratioEquity ratioDebt
MSFT_beta_hist 0.8068 0.101813 1.229776 0.713055 106061000000 8.557001e+10 0.553465 0.446535
MSFT_beta_comp 0.8068 0.101813 1.073342 0.622351 106061000000 8.557001e+10 0.553465 0.446535

### Include an ‘Optimal’ capital structured based on comps

Create new ‘optimal’ capital structure by taking the medians of the comps Debt and Equity ratios

>>> i = f'{ticker}_beta_optcap'
>>> df_msft.loc[i, :] = df_msft.loc[f'{ticker}_beta_hist', :].copy()
>>> df_msft.loc[i, 'ratioDebt'] = df_comps['ratioDebt'].median()
>>> df_msft.loc[i, 'ratioEquity'] = df_comps['ratioEquity'].median()

debtToEquity effectiveTax beta beta_unlev totalStockholderEquity totalDebt ratioEquity ratioDebt
MSFT_beta_hist 0.8068 0.101813 1.229776 0.713055 1.060610e+11 8.557001e+10 0.553465 0.446535
MSFT_beta_comp 0.8068 0.101813 1.073342 0.622351 1.060610e+11 8.557001e+10 0.553465 0.446535
MSFT_beta_optcap 0.8068 0.101813 1.229776 0.713055 1.060610e+11 8.557001e+10 0.455789 0.544211

### Create beta values for the various capital structures

>>> df_msft.loc[i, 'totalDebt'] = df_msft.loc[i, 'ratioDebt'] * (df_msft.loc[f'{ticker}_beta_hist', 'totalDebt'] + df_msft.loc[f'{ticker}_beta_hist', 'totalStockholderEquity'])
>>> df_msft.loc[i, 'totalStockholderEquity'] = df_msft.loc[i, 'ratioEquity'] * (df_msft.loc[f'{ticker}_beta_hist', 'totalDebt'] + df_msft.loc[f'{ticker}_beta_hist', 'totalStockholderEquity'])

>>> df_msft.loc[i, 'debtToEquity'] = df_msft.loc[i, 'totalDebt'] / df_msft.loc[i, 'totalStockholderEquity']
>>> df_msft.loc[i, 'beta'] = df_msft.loc[i, 'beta_unlev'] * (1 + (1 - df_msft.loc[i, 'effectiveTax']) *  df_msft.loc[i, 'debtToEquity'])

debtToEquity effectiveTax beta beta_unlev totalStockholderEquity totalDebt ratioEquity ratioDebt
MSFT_beta_hist 0.8068 0.101813 1.229776 0.713055 1.060610e+11 8.557001e+10 0.553465 0.446535
MSFT_beta_comp 0.8068 0.101813 1.073342 0.622351 1.060610e+11 8.557001e+10 0.553465 0.446535
MSFT_beta_optcap 1.1940 0.101813 1.477761 0.713055 8.734322e+10 1.042878e+11 0.455789 0.544211

## Calculate Cost of Equity, ${r}_{E}$

Lets assume the following below

• risk free rate taken from 10 year US Treasury bonds when writing on 13th Nov
• Current cost of debt, ${r}_{D}$, for Microsoft
>>> erp = 0.0509
>>> rf = 0.0189
>>> debt_cost_msft = 0.033539


Then cost of equity can be calculated as

>>> df_msft['costDebt'] = debt_cost_msft
>>> df_msft['costEquity'] = rf + erp * df_msft['beta']  # cost_equity = rf + erp * beta_levered

debtToEquity effectiveTax beta beta_unlev totalStockholderEquity totalDebt ratioEquity ratioDebt costDebt costEquity
MSFT_beta_hist 0.8068 0.101813 1.229776 0.713055 1.060610e+11 8.557001e+10 0.553465 0.446535 0.033539 0.081496
MSFT_beta_comp 0.8068 0.101813 1.073342 0.622351 1.060610e+11 8.557001e+10 0.553465 0.446535 0.033539 0.073533
MSFT_beta_optcap 1.1940 0.101813 1.477761 0.713055 8.734322e+10 1.042878e+11 0.455789 0.544211 0.033539 0.094118

## Calculate the Weighted Average Cost of Capital (WACC)

The WACC formula is (ignoring tweaks and preferred stock)

so we can calculate in python as

>>> df_msft['WACC'] = df_msft['costEquity'] * df_msft['ratioEquity'] + df_msft['costDebt'] * df_msft['ratioDebt'] * (1 - df_msft['effectiveTax'])

debtToEquity effectiveTax beta beta_unlev totalStockholderEquity totalDebt ratioEquity ratioDebt costDebt costEquity WACC
MSFT_beta_hist 0.8068 0.101813 1.229776 0.713055 1.060610e+11 8.557001e+10 0.553465 0.446535 0.033539 0.081496 0.058556
MSFT_beta_comp 0.8068 0.101813 1.073342 0.622351 1.060610e+11 8.557001e+10 0.553465 0.446535 0.033539 0.073533 0.054150
MSFT_beta_optcap 1.1940 0.101813 1.477761 0.713055 8.734322e+10 1.042878e+11 0.455789 0.544211 0.033539 0.094118 0.059292

It turns out that this is slightly off the mark with what I found on the internet but I suspect that is due to the $\beta$ on Yahoo Finance being paticularly high compared to other sources

# Next Topic

11: Classes and Modules

# Learning Outcomes

• Declare and use both list and tuple
• Differences between list and tuple
• Some list operations
• Mutable vs. immutable types
• numpy arrays
• vector operations on numpy arrays

# tuple types

A tuple consists of a number of values separated by commas, for instance:

>>> t = 12345, 54321, 'hello!'
>>> t[0]
12345

>>> t
(12345, 54321, 'hello!')


They may be nested

>>> u = t, (1, 2, 3, 4, 5)
>>> u
((12345, 54321, 'hello!'), (1, 2, 3, 4, 5))


As you see, on output tuples are always enclosed in parentheses, so that nested tuples are interpreted correctly; they may be input with or without surrounding parentheses

Note When you create a tuple of one item you must always include a trailing comma like

>>> (1,)
(1,)


else it doesn’t create a tuple

>>> (1)
1


empty tuples can be created without a comma

>>> ()
()


this is also the same with an empty list

# Indexing

Access items by referring to the index number (starting from 0)

>>> t = 12345, 54321, 'hello!', 'goodbye!'
>>> t[1]
54321


Negative indices start from the back and in the reverse

>>> t[-1], t[-2], t[-3], t[-len(t)]
('goodbye!', 'hello!', 54321, 12345)


Note t[-0] == t[0] so the reverse indexing actually starts at 1

Ranges of indices can be specified by the function slice which specifies the slice from and including the start index to but not including the end index

>>> a_tuple = 0, 1, 2, 3, 4, 5, 6, 7,
>>> a_tuple[slice(0, 2)]
(0, 1)


An optional thirs argument (the step) can also be specified like

>>> a_tuple[slice(0, None, 2)]
(0, 2, 4, 6)


Here we gave a step of 2 so it starts at the 0th item and returns every 2nd index until the end.

In the slice object providing None gives the default behaviour which is to start at 0 and slice to the end with a step of 1… This is given by

>>> a_tuple[slice(None, None, None)]
(0, 1, 2, 3, 4, 5, 6, 7)


In reality no-one really uses slice as there is a shorthand for it: The colon :!

When used between square brackets as an indexing operation is effectively separates what would otherwise be a call to the slice1 operator for example

>>> a_tuple[3:]
(3, 4, 5, 6, 7)


remember slices are up until (NOT INCLUSIVE) of the end index

>>> a_tuple[-4:-1]
(4, 5, 6)


this is the same as slice(0, 5, 2)

>>> a_tuple[0:5:2]
(0, 2, 4)


this is the same as slice(1, None, 2)

>>> a_tuple[1::2]
(1, 3, 5, 7)

>>> a_tuple[::-2]  # same as slice(None, None, -2)
(7, 5, 3, 1)


### A side note on range

We have used range a few times. It is worth pointing out that its arguments operate in exactly the same way as slice. Except, instead of returning an index slice, it returns actual index values as a special range object.

>>> a = range(5)
>>> for i in a:
...     print(i, end=', ')
0, 1, 2, 3, 4,

>>> a
range(0, 5)


This can be converted to a list or tuple if we wish to use it as such

>>> list(range(5)), tuple(range(6, 2, -1))
([0, 1, 2, 3, 4], (6, 5, 4, 3))


# list types

A list is a collection which is ordered and changeable. In Python lists are written with square brackets.

>>> this_list = ["apple", "banana", "cherry"]
>>> this_list
['apple', 'banana', 'cherry']


Lists are indexed in eactly the same ways as tuples

>>> this_list[1::-1]
['banana', 'apple']


We have already encountered .append. There are also other additional things with lists as we have extra methods available, some are shown below

>>> fruits = ['pear', 'banana', 'kiwi', 'apple', 'banana', 'grape']
>>> fruits.index('banana', 4)  # Find next banana starting a position 4
4


if no argument is provided it will pop -1

>>> fruits.pop(0)
'pear'


pop will also remove that item from the list

>>> fruits
['banana', 'kiwi', 'apple', 'banana', 'grape']


Both list and tuple types can be multiplied

>>> fruits[:2] * 2
['banana', 'kiwi', 'banana', 'kiwi']

>>> [tuple(fruits[:2])] * 2
[('banana', 'kiwi'), ('banana', 'kiwi')]


For a more complete list google something like “list methods python”

### Example: Fibonacci Series

Here we use use the slice notation and python’s builtin sum function to simplify

>>> result = [0, 1]
>>> for i in range(5):
...     result.append(sum(result[-2:]))
>>> result
[0, 1, 1, 2, 3, 5, 8]


### Example: A common misunderstood feature with the python list

>>> a_list = list(range(3))
>>> b_list = [a_list] * 3
>>> a_list.append('test')
>>> b_list
[[0, 1, 2, 'test'], [0, 1, 2, 'test'], [0, 1, 2, 'test']]


Whereas with tuples…

>>> a_tuple = tuple(range(3))
>>> b_tuple = (a_tuple,) * 3
>>> a_tuple += ('test',)
>>> b_tuple
((0, 1, 2), (0, 1, 2), (0, 1, 2))


# Mutable vs. Immutable object types

Though tuples may seem similar to lists, they are often used in different situations and for different purposes

1. Immutable objects can’t be changed.
2. Mutable objects can be changed.

Immuatable object types like tuples cannot be mutated by assignment

>>> example_tuple = 1, 2, 3, 4
>>> example_tuple[2] = 5

---------------------------------------------------------------------------

TypeError                                 Traceback (most recent call last)

<ipython-input-36-3faa60cab3a6> in <module>()
1 example_tuple = 1, 2, 3, 4
----> 2 example_tuple[2] = 5

TypeError: 'tuple' object does not support item assignment


Mutable object types like lists can be mutated by assignment

>>> example_list = [1, 2, 3, 4]
>>> example_list[2] = 5
>>> example_list
[1, 2, 5, 4]


## A brief foray into c

Python was concieved as a scriptiong language built on the more low level language c. In c, programmers must be more-so aware that variables are simply link to memory address locations (Ever heard of RAM?). The random access memory (RAM) is where all these variables are stored. Conceptually imagine it as a big list with numbers assigned to it

We can query what the address location of each python object is by doing

>>> id(0)    # always put comments at least 1 space away from the last code piece
>>> id([0])  # unless visually making a block like this
4532216456


From this we can see that the object 0 has an address location and the object [0] has another different address location

What a mutable type really means is that we can change the address location of the type’s contents

>>> example_list = list(range(5))  # remember that a range(n) give an iterable from 0 to n-1
>>> id(example_list), example_list
(4534667592, [0, 1, 2, 3, 4])


whilst keeping the memory address of the container the same

>>> example_list += [5]             # note that you can put any math operator infront of =
>>> id(example_list), example_list  # to do an inplace assignment! e.g. -=, /=, %= etc
(4534667592, [0, 1, 2, 3, 4, 5])


Whereas with the tuple

>>> example_tuple = tuple(range(5))
>>> id(example_tuple), id(example_tuple)
(4532001432, 4532001432)


the address changes when it is mutated

>>> example_tuple += (5,)
>>> id(example_tuple), example_tuple
(4531868968, (0, 1, 2, 3, 4, 5))


If the variable is assigned to another variable, the variable points to the same address location which can be seen by

>>> example_tuple = tuple(range(5))
>>> another_variable = example_tuple
>>> id(example_tuple), id(another_variable)
(4532001784, 4532001784)


However, if we then change the address location of the original variable we break the link!

>>> example_tuple += (5,)
>>> id(example_tuple), id(another_variable)
(4531868872, 4532001784)

>>> example_tuple, another_variable
((0, 1, 2, 3, 4, 5), (0, 1, 2, 3, 4))


Whereas with mutable types like lists this reference persists!

>>> example_list = list(range(5))
>>> another_variable = example_list
>>> example_list += (5,)
>>> id(example_list) == id(another_variable)
True


A better way of checking the variables are exactly the same object value is by using is

>>> example_list is another_variable
True


A slice, will create a copy of a list object to a new address location. A blank slice will also do this

>>> example_list[:] is another_variable
False


For a more detailed overview of how exactly this works check out https://realpython.com/pointers-in-python/#immutable-vs-mutable-objects

# Numpy Arrays

Numpy arrays are exceeedingly important to your future career as a python expert! They offer a way of doing optimised mathematics on lists.

## The need for numpy arrays

Take the following as an example:

>>> arr = list(range(3))


Note that we can’t simply multiply a normal list and expect its elements to be doubled! The example below is expected behavour.

>>> arr * 2
[[1, 2, 3], [1, 2, 3]]


Think of a list like an object e.g. like an apple. If you multiply 10xApples you don’t suddenly expect 10x the number of seeds and one apple. As with the 10 apples you should expect 10 lists.

Thus we are required to do something like

>>> arr2 = arr.copy()  # As a teaser, try without .copy() and see what happens to arr!
>>> for i in range(arr.shape[0]):
...     arr2[i] = arr[i] * 2


## The numpy solution

Compare to the numpy solution

>>> import numpy as np
>>> arr = np.array([1, 2, 3])
>>> np.array(arr) * 2
array([0, 2, 4])


In fact we can use np.arange to further simplify

>>> np.arange(3) * 2


Not only is this more concise, but numpy uses optimised C++ libraries that are incredibly difficult for humans to beat in terms of speed. Thus all operations we can push to the C++ part of numpy are as fast if not faster than some seriously intense C++ code (As a reminder C++ is a very fast language and what most core quant pricers are written in as a result)

As an example you may have a look at this numpy module which does matrix multiplication (note matmul is actually arr * arr.T not arr * 2) this will call a library called BLAS if possible which is a highly optimised linear algebra module written in Fortran and C++. In a nutshell: Goodluck at writing faster code than the authors of BLAS routines.

## Iteration and indexing in numpy

Iteration and indexing in numpy are somewhat funky and this can make newwer users veryu confused between list and numpy.array iteration… I will try to not confuse you by making explicit what is specifically only allowed in numpy and what is allowed in list objects as well

All normal list indexing tricks can be used on numpy arrays

>>> import numpy as np
>>> arr = np.arange(10)
>>> arr[::-1]
array([9, 8, 7, 6, 5, 4, 3, 2, 1, 0])
>>> arr[:9]
array([0, 1, 2, 3, 4, 5, 6, 7, 8])


However only in numpy arrays we can use an array itself to mask or select items

>>> arr[arr > 5]
array([6, 7, 8, 9])


This will not work in a normal python list!

>>> l = list(range(10))
>>> l[l > 5]
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-24-33f477ef3cbe> in <module>()
1 l = list(range(10))
----> 2 l[l > 5]

TypeError: '>' not supported between instances of 'list' and 'int'


This works because solely because

>>> arr > 5
array([False, False, False, False, False, False,  True,  True, True, True])


as a comparison see what happens when we do

>>> arr[[True] * 3 + [False] * 7]
array([0, 1, 2])


similarly we could do (remember the bitwise logic?)

>>> arr[(arr == 2) | (arr == 5)]
array([2, 5])


### Fancy indexing

Another thing we can do in numpy is index explicitly with another array

>>> arr = np.arange(10)**2
>>> idx = np.arange(10) * 3
>>> arr[idx[idx < 10]]  # take this statement apart to understand it!
array([ 0,  9, 36, 81])


Whereas a similar statement won’t work with lists

>>> l = list(range(10))
>>> l[[0, 2, 4]]
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-38-5efbb062c778> in <module>()
1 l = list(range(10))
----> 2 l[[0, 2, 4]]

TypeError: list indices must be integers or slices, not list


Confusingly, we can actually use a raw list to index a numpy array in fact we can use both list and tuple

>>> arr = np.arange(10)**2
>>> l = [3, 7, 9]
>>> arr[l]
array([ 9, 49, 81])


## Optimising code for numpy

The key takeaway is to push all our compute into numpy’s C++ core

How do we achieve this in practice?

It can be tricky but in short use numpy arrays where possible for any math that needs to be done on every element this includes ** + / - * % // ^ | & as examples.

>>> np.arange(100)


# Exercises

## Exercise 9.1: Nested loops

This exercise is to help you understand iteration across more than one dimension.

Find the trace of the matrix a. The trace is defined as the sum of all the diagonal elements.

a = np.array([[1, 3, 5],[1, 4, 6],[7, 6, 9]])

>>> a
array([[1, 3, 5],
[1, 4, 6],
[7, 6, 9]])


Hint You may also want to play with

>>> for i, ai in enumerate(a):
...     print(i, ai)


recalling how multiple variable assignment works. Google may be your friend here!

Hint For the less maths-inclined the diagonal is [1, 4, 9]. We don’t want the anti-diagonal of [5, 4, 7].

# Solve me!


## Exercise 9.2: Monte Carlo modelling using Geometric Brownian Motion

This exercise is to expand on the previous example with a bit of quantitative finance (Don’t worry though the maths will all be solved out for you!).

Here you will be guided through a monte carlo stock price model using Geometric Brownian Motion. You will also show that the returns from a GBM model follow a normal distribution.

This exercise is intended to be partcularly challenging and may require some googling but don’t give up - 99% of coding is being frustrated and feeling stupid… I sort of didn’t tell you that when you signed up haha!

### Exercise 9.2.1: Storing iteration results in a list

Modify your previous example to the brownian motion exercise so that you store the results in a list named path while the stock price is greater than 0. Set path_len = 504 (i.e. 2 years) and only run for this many time steps.

Use the same values for dt = 1/252, sigma, mean etc that you used before, some (not all) are shown below

path_len = 504
dt = 1/252
r = 1.02
sigma = 1000
# solve me


### Exercise 9.2.2: Plotting results

Plot your stock path using matplotlib - you will certainly want to google this

Hint Just use plt.plot nothing fancy!

# Solve me!


### Exercise 9.2.3: Monte Carlo

In Exercise 9.2.1 you simulated a single path of a stock using GBM. Now create paths_num = 1000 paths and store each of these paths in another list named paths

Hint paths will be a list of lists & you will require two for loops. I have declared the empty list for you to fill below

paths = []
# Solve me!


### Exercise 9.2.4: Plotting multiple lines on one plot

Plot all 1000 paths with pandas using dt as the index. This is a little more involved so read carefully!

When we created paths we created an object like

[
[ s_00, s_01, ... , s_0n],  # indexed by paths[0]
[ s_10, s_11, ... , s_1n],  # indexed by paths[1]
[ ... , ... , ... , ... ],
[ s_m0, s_m1, ... , s_mn]   # indexed by paths[m]
] #   ^paths[m][0]      ^paths[m][n]


This makes the zeroth axis the path index and the first axis the time index. However, in pandas and plotting in matplotlib we generally store the time index as the zeroth index and the path index as the first index.

To store in pandas we need to transpose (swap round the referencing) so that in the above example we would change paths[m][n] -> paths[n][m]. This is the same as transposing in Excel.

You can do this with a few different options

1. paths_df = pd.DataFrame(paths).T
2. paths_df = pd.DataFrame(list(zip(*paths)))
3. paths_df = pd.DataFrame(np.array(paths).T)

For sanity would should also probably make the index match the timesteps we are using. Lets create a proper index.

import pandas as pd
import numpy as np
index = pd.Index(np.arange(path_len) * dt, name='years')


Now create the DataFrame with this index. Google if you get stuck.

paths_df = pd.DataFrame(paths, columns=index).T


Make sure that you can pass this test! If you can’t think about the order in which you created the index and then transposed the DataFrame

assert all(paths_df.index.values == index.values)


Now plot with the default pandas plot function. Make sure to pass the argument legend=None!

# Solve me!


## Exercise 9.2.5: Prove empirically GBM returns are Normally distributed

Prove that the 1-step stock returns are normally distributed under the GBM model by fitting a Gaussian distribution to the probability distribution function of the 1-day returns - Don’t worry it’s not as hard as it sounds :)

### Exercise 9.2.5.1: Calculate 1-day returns using a pandas object

You will need to calculate 1-day returns. You have two options for this:

1. Absolute returns $\frac{s_{t} - s_{t-1}}{s_{t-1}}$
2. Log returns $\ln s_{t} - \ln s_{t-1}$

Without proof 2 is equal to 1 if $\Delta t \ll T$

e.g. if using years you want to be using a unit of $\Delta t$ that is roughly 100th of a year or smaller

Hint there are four things that might help in googling: "pandas diff", numpy natural log, "pandas shift", "pandas dropna" also remember Google !!

# Solve me!


### Exercise 9.2.5.3: Check your results

This isn’t so much of an exercise but simply a check for the previous exercise to ensure you have it correct! The maths isn’t too important. If you like, you can ignore it all and just run the code below. You should expect differences less than 1% if you have it correct

Without proof (see here for one) we state that the n-day returns, $R_n$ follow a Normal distribution

where

$\mu_n = \left(r_T - \frac{1}{2}\sigma_T^2\right)n\Delta t$ $\sigma_n = \sigma_T \sqrt{n\Delta t}$

such that for T=252 (i.e. a year in our model) $r_T$ is the annualised rate of return (drift) and similarly $\sigma_T$ is the annualised volatility.

Lets take as a fact that we can model a probability distribution function (how likely an event) by using scipy.stats.norm.fit. Then the following snippet of code will confirm your results with theory as a check

# rets_1d should be your result from previous ... you should be able to run it to get the below
import scipy.stats
mean_th = (r - .5*sigma**2)*dt
stdv_th = sigma * np.sqrt(dt)

mean_ac, stdv_ac = scipy.stats.norm.fit(rets_1d)
print(f'Difference in mean samples vs theory: {100*(mean_th - mean_ac)/mean_ac:5.2f}%')
print(f'Difference in stdv samples vs theory: {100*(stdv_th - stdv_ac)/stdv_ac:5.2f}%')


Note Google f-string formatting!

# Solve me


### Exercise 9.2.5.3: Plot theory vs. empirical results (reusing matplotlib axes)

This exercise isn’t so much an exercise rather than an example of how to resuse a matplotlib axis object to draw another line.

Compare this with the theory visually by plotting the theoretical distribution and showing its equivalence with the following code

import scipy.stats

# you can access the numpy array beneath the pandas object by .values - then everything is numpy again
rets_1d_all_paths = rets_1d.values.ravel() # get the returns from every path for more samples (google ravel)
x = np.linspace(-.5, .5, 200)              # an array of 200 numbers equally spaced between -.5 and .5

# note that you can chain "." methods in ()
# because recall that unless you have (item,) it doesn't create a tuple!
ax = (
pd.Series(rets_1d_all_paths) # convert back to pd.Series to use .hist()
.hist(bins=100, density=True, label='1-day Monte Carlo')
)

# use the same axis object that was returned from .hist()
theory = scipy.stats.norm.pdf(x, mean_th, stdv_th)
ax.plot(x, theory, alpha=0.5, label='1-day Analytic')
ax.legend()

# Solve me!


### [Optional] Exercise 9.2.5.4: Show this holds for 20-day returns

Show the same is true for 20day returns by plotting the relationship with 20-day returns

Note that this exercise is really aimed at structurers, derivatives traders and quants - if you haven’t studied stochastic calculus at university then just ignore this question or ask a quant if you find it interesting

# Solve me!


# Next Topic

10: Dictionaries and Sets

# Learning Outcomes

• Declare a function with good style
• Variable scope
• Include examples and use as tests with doctest
• Writing code in a text editor / IDE

# Defining functions

• The keyword def introduces a function definition
• follow with a single space
• name of function (style tip always name functions like lower_case_with_underscores)
• parenthethis containing comma separated arguments

The statements that form the body of the function start at the next line, and must be indented.

def raise_to_power(a, b):
"""prints the results of a**b"""
result = a**b
print(result)

result = raise_to_power(4, 3)


If the last line of a function finishes without return then the returned value is the python object: None - we can check this by using the built-in function type

>>> type(result)
NoneType


We can instead return the value as below: Note that we no longer get a printed value, intead we get an Out value

def raise_to_power(a, b):
"""returns the results of a**b"""
return a**b

raise_to_power(6, 2)


functions themselves are an object

>>> raise_to_power
<function __main__.raise_to_power(a, b)>


We can therefore assign and override functions that have been declared in the same way as we can do for variables, for example

raise_to_power_2 = raise_to_power
raise_to_power_2(6, 2)


and then we can overwrite it with. This isn’t normally sensible but shows how easy it is to checnge values in python

raise_to_power = 2
raise_to_power


## Docstrings

For smaller shorter functions it’s acceptable to write a simple one line description. For longer less obvious functions you should write a more descriptive docstring

def example_function(arg1, arg2, keywordarg1='keywordvalue1'):
"""
This is a doc string describing the function.
Generally use 3 double quotes for docstrings

Args:
arg1: the first argument
arg2: the second argument
keywordarg1: a keyword argument

Example:
>>> example_function(1, 2, 3)
(1, 2, 3)
"""
return arg1, arg2, keywordarg1


Google-style python docstrings are a common way to write documentation in python. Furthermore, advanced users can automatically generate static webpage documentation from docstrings. Two good examples of this are

The following is a really good reference for the syntax but the above is a good minimal example

### Doctests

The most important part of this docstring is the example. Note the syntax: It is a deliberate replication of what we would see in the python console.

It can be used as a test case to ensure the function runs as expected as follows

import doctest
doctest.testmod()


Note The result (1, 2, 3) must be exactly how the result is in the return value. For example

>>> example_function(1, 2, 3)
(1, 2, 3,)


will fail the test result as it will simply compare text results not python. It is easy to create these examples by just running small snippets in python/ipython and pasting the result in the Example.

Lets not get too hung up on this now but if we do this correctly we can all the following for a small time investment:

• A unit test like below
• An use example
• static documentation

In my personal opinion the first two are invaluable particularly in Investment Banking and Trading where someone may need to understand / run / change your models.

A full doctest reference can be found here: https://docs.python.org/2/library/doctest.html

## Variable Scope

The execution of a function introduces a new symbol table used for the local variables of the function. This can be seen in the following example

a = 100
d = "d is a Global Variable"

def demo_locals(a, b):
"""Prints a, b, c, d where c is external to function"""
d = 'An Example of a Local Variable taking preference'
print(a, b, c, d)

# Using global variables like this in functions
# is a sure fire way to screw yourself over as a beginner
c = 200

demo_locals(1, 2)
print(d)

1 2 200 An Example of a Local Variable taking preference
d is a Global Variable


Under the local (function) scope, the global variable d is overwritten with a new local value. The following print statement shows that this override only persists in the local scope within the function declaration. Finally, we can use global variables from within the function (e.g. c) providing they are in the global variable scope before the function is executed.

### Scope Summary

• Functions introduce a new local symbol table
• Functions revert to look up varible names in the global table as second precedence
• Local variable declarations / overrides do not persist globally
• If a function calls a second function (e.g. print) the new local scope is also declared for that function
• Defining a function is like defining a variable. It becomes present in the current scope.

Note the emphasis on variable. We shall see examples later where we can modify global states from within functions. There are specific ways to set global variables from functions but in general it’s a terrible idea so we won’t even cover it.

# Using a Text Editor / IDE

Now we will write longer more complex code we shoudl use a text editor / integrated development environment (IDE). These both present a sane way of editing longer code.

## Text Editor or IDE? What’s the difference

A text decent editor is simply a tool that will highlight the lines of code you write and possibly lint it for you. Linting is the act of checking for basic errors. Some may also offer autocompletion options. Often they will allow multiple plugins that are each individually aimed at increasing your productivity in different areas.

An IDE however, is typically a fixed set of addins with a specific look-and-feel. In fact really an IDE is just a text editor on steriods and once you start adding lots of additional functionality to your text editor it basically turns into an IDE.

I’ll only bother introducing 2 options to keep it simple (and for my own sanity)

• PyCharm as an IDE
• SublimeText3 as a Text Editor
• Jupyter as an IDE

Choosing your text editor / IDE is a fairly lifelong commitment so like marriage you will be stuck in it for better or for worse; for richer or for poorer. You may get into arguments with other people about their editors. Occasionally people cheat on their first editor but this is an unusual move.

I would recommend trying a few editors before making this commitment. Like most people though, you will probably settle for the first one in front of you.

If you have already chosen your editor then you can skip the rest of this section and move to the exercises as the rest is me opining on why I use SublimeText

## My Setup / Why I hate IDEs and use a Text Editor

Ok now this is entirely my own opinion but I don’t really like IDEs: Especially for beginners…

I personally use SublimeText3 and ipython to write code. I test out snippets in ipython and adjust them in SublimeText as required. I run sections of code in my text editor by copying to my clipboard and running %paste in ipython

More often than not, IDEs require a certain level of set up because of their additional functionality. Taking pyCharm as an example, if you set up your project settings incorrectly it can be extremely hard for a beginner to ascertain if your error is a PyCharm config error or an actual issue with your code. A more explicit example: If you configure you IDE to use a different python installation by mistake, none the libraries you thought should be installed will be installed. This can really throw you off as a newbie.

### Obsfucation

IDEs intentionally obsfucate or abstract away some areas which we have already covered like Terminal skills and knowledge of PATH evnironment variables. This can often mean that new users think of python as being something that is tied to their IDE. For example, a new user can often think of programs in a Windows sense (e.g. Microsoft Office): Microsoft Word does not exist outside the program that opens that big white blank sheet on your machine. However python can run fine without your IDE.

I personally thought for a month that Fortran code (my first language) could only be written in a special program. It took a while for me to realise that all the text editor does is produce a file with words in it.

You’re already learning a new progeamming language. Learning a new editor just adds insult to injury when you have a bug in your code and also get confuse with your IDE.

Text editors allow you to build your own custom IDE as your skill level increases, slowly adding highly customised optionality for your own style.

# Exercises

## Exercise 8.1: Abstracting with functions

This example will help you to identify possible variable values in small snippets of code to generalise them.

Take Geometric Brownian Motion as defined below $S_{t+1} = S_t e^{(r - \frac{1}{2}\sigma^2) \Delta t + \sigma\epsilon\sqrt{\Delta t}}$ and write it as function generating a random variable with numpy using a random seed of 42.

I have started it off for you below

def gbm(s, r=0.01, sigma=0.1, dt=1/252):
pass


Note that pass is used as above for placeholder functions that contain nothing. You will of course need to actually return a value. Finally, we also introduce default keyword arguments - writing them like above we can assign default values.

Style Tip Don’t put any spaces in default keyword argument definitions

# Solve Me!

assert gbm(100, .02, .1, 1/252) - 100.31936176261115 < 1e-10


## Exercise 8.2: Doctests

The following minimal example is a way of calling the function above and guaranting the same result

>>> np.random.seed(42)
>>> s = gbm(100, .02, .1, 1/252)
>>> s - 100.31936176261115 < 1e-10
True


Use this example to write a proper docstring for the function gbm and run the test case with the doctest module as shown further up the page in an earlier example.

## Exercise 8.3: The XOR function

Given two variables a and b the OR function will return True for every instance except when

>>> a = False
>>> b = False
>>> a or b
False


The Exclusive OR function (XOR), however, will also return False when

>>> a = True
>>> b = True
False


write a function for this in python taking in two varibles a and b as arguments. Write the four test cases in the doctstring in doctest format and test that the function works as expected by running doctest

# Solve Me!


## Exercise 8.4: Functions called from functions

Let us introduce a new method of doing loops for this example:

for i in range(10):
print(i)


Recall the stock motion example in Exercise 5.2: Write a function to calculate 10 steps of Geometric Brownian Motion. Call the function gbm from within this function.

Use the same default values as before in Exercise 5.2 and don’t forget to set the random seed as below so you can compare your solution!

import numpy as np
np.random.seed(42)

# Solve Me!


## Exercise 8.5: Recursion

You can also call a function itself from within itself! As an example the Fibonacci Series can be written

def fibonacci(n):
"""Returns the nth number of the fibonacci series

Args:
n: An positive integer

Example:
We can compare to a theoretical value

>>> z = (5**0.5 + 1) / 2
>>> theoretical = int((z**10 - (-z)**-10) * 5**-0.5)
>>> fibonacci(10) == theoretical
True
"""
if n < 0 or not isinstance(n, int):
raise ValueError('n must be a positive integer')
elif n == 0:
return 0
elif n == 1:
return 1
else:
return fibonacci(n - 1) + fibonacci(n - 2)


The factorial function n! is defined as the product of all numbers 1 to n (including n) i.e.

where the null product is defined as equal to 1.

Explicitly

• $0! = 1$
• $1! = 1$
• $5! = 1\times2\times3\times4\times5 = 120$.

# Next Topic

09: Lists and Tuples

# Learning Outcomes

• Understand how to use if, else and elif
• Construct for statements across an iterator
• Understand basics of for loop iteration in python
• Refresh Terminal skills

NB As a reminder you should be using ipython now!

# if statements

Perhaps the most well-known statement type is the if statement. For example:

number = int(input('Input a number:'))
if number < 0:
print("number is negative")
elif number % 2:
print("number is positive and odd")        # Style Tip: Make sure you indent
else: print("number is even and non-negative") # this is legit python but it's UGLY - don't do this


There can be zero or more elif parts, and the else part is optional. The keyword elif is short for else if, and is useful to avoid excessive indentation. An ifelifelif … sequence is a substitute for the switch or case statements found in other languages.

# for Statements

The for statement in Python supports repeated execution of a statement or block of statements that is controlled by an iterable expression. Here’s the syntax for the for statement:

# Measure some strings:
words = ['cat', 'window', 'floccinaucinihilipilification']
for w in words:
print(w, len(w))


### Example: Loop until a condition Prime identification

Loop statements may have an else clause; it is executed when the loop terminates through exhaustion of the iterable (with for) or when the condition becomes False (with while), but not when the loop is terminated by a break statement.

max_number = int(input('Enter a number to check primes up to:'))
for n in range(2, max_number):
for x in range(2, n):
if n % x == 0:
print(n, 'equals', x, '*', n//x)
break
else: # only triggered if iterable is exhausted without breaking
print(n, 'is a prime number')


### Example: Fibonacci Series

Here we use lists instead of the previous variable assignment method and run the iteration for precisely 5 loops rather than ending with a while condition

result = [0, 1]
for i in range(5):
result.append(result[-1] + result[-2])


which will then contain

>>> result
[0, 1, 1, 2, 3, 5, 8]


# Exercises

## Exercise 7.1: Find all files matching pattern

The function os.walk(source_directory) will return three items (root, directories, files) at each iteration. There is one iteration for every end-point sub-directory in the directory tree

For example os.walk('./root') called from file test.py

test.py
root/
dir1/
f1.py
dir2/
f2.csv
data.xlsx
dir21/
g.py


will iterate 4 times

• firstly, once at the root level returning ('./root', ['dir1', 'dir2'], ['g.py'])
• once for dir1 returning ('./root/dir1', [], ['f1.py'])
• once for dir2 returning ('./root/dir2', ['dir21'], ['f2.csv', 'data.xlsx'],)
• finally, once for dir21 returning ('./root/dir2', [], [],)

Note that if you type os.walk('./root') it will return a generator object. To see the output of generator objects you can just generate them by calling list(the_object) for for example here we would do list(os.walk('./root'))

• Hint You will need touch / mkdir. ls and mv may also be helpful (On Windows just do echo null >> filename instead of touch) ### Exercise 7.1.2: Print the path of all files that have an extension of .py • Hint A modified form of the expression if '.py' in 'files.py': will be key • Hint You can combine the root and the filename with os.path.join • Hint Remember how to iterate an object that contains multiple items at each iteration. Multiple variable assignment! The following snippets might help you understand this  >>> for i, j, k in [['a0', ['b0', 'c0'], ['d0', 'e0']], ['a1', ['b1', 'c1'], ['d1', 'e1']]]: >>> for l in k: >>> print('second list elements', k)  # Solve me!  You should get ./exercises/root/g.py ./exercises/root/dir1/f1.py  ## Exercise 7.2: Searching and modifying file contents In this exercise we look at a very common use of python which is a multi-file search and replace. The following code snippet will edit some of the files you have created in 7.1. Look for common elements / repititions in this script and replace them with a loop import os # os.path.join allows cross platform compatibility # as windows uses \ and linux uses / to separate paths # IT also allows you to combined and modify filepaths on the fly base_dir = os.path.join('.', 'exercises', 'root','dir2') filepath1 = os.path.join(base_dir, 'f2.csv') filepath2 = os.path.join(base_dir, 'data.xlsx') with open(filepath1, 'w') as fobj: # WARNING! This line ERASES all contents in filepath1! fobj.write('test') # if you want to read data ALWAYS use 'r' with open(filepath2, 'w') as fobj: # and do data = fobj.read() to explore the contents! fobj.write('test')  Hint File extensions e.g. .xlsx means nothing when programming. They are just hints to the operating system (Windows / OS X) of how to open the files when double clicking. I can even create a file like myfile.alex.xlsx.rubbish. Generally you will use common file extensions like .txt or .dat when dumping data. ### Exercise 7.2.1: Utility of context managers (the with statement) This exercise is to show you why we tend to use the with statement (a context manager) when opening files. Do some research on the line with open(filepath1, 'w') as fobj:  Try to understand what the difference between this and fobj = open(filepath1, 'w')  As an exercise, open the .xlsx file for reading in python using f = open(filepath, 'r') without closing the file object and now try opening it in Excel by double clicking in the explorer. Try again the other way round. ### Exercise 7.2.2: Finding files containing strings Modify the solution to 7.1 to find all files containing the string 'test' Hint Don’t open the files with 'w' else you will overwrite with a blank file! The following may also be useful >>>'test' in 'test case this is a testcase' True  ### Exercise 7.2.3: Terminal recap: Copy directory contents as a backup This exercise is a recap of the terminal skills in session 1. It is also extremely sensible when playing around with file contents as there is no Cntrl+Z once code is executed! In the Terminal copy (not move!) the files from solution 7.2.2 into a new temporary directory ./exercises/backup_root/ as a backup ### Exercise 7.2.4: Find and replace! Find all files containing the string 'test' and modify it to be 'test2'. Check your solution. Hint If you simply use fobj.write('test2') you will overwrite the entire file contents! This may not be ideal in most scenarios. • Use the open() function this time like open(filepath, 'r') instead of 'w' • Then read the file contents to a temporary object • Then open a new file protocol with 'w' to overwrite the entire contents back to the file # Next Topic 08: Functions and Editors # Learning Outcomes • Be able to use pytthon as a caculator • Understand logical and boolean operations available • lazy boolean evaluation • Read a traceback to debug code • Basic loop, control & indentation • Simple printing / formatting • Multiple assignment ## Contents # Python as Calculator The interpreter acts as a simple calculator: you can type an expression at it and it will write the value. Expression syntax is straightforward: the operators +, -, *, ** and / work just like in most other languages (for example, Pascal or C); parentheses () can be used for grouping. For example: spam = 2 eggs = 7 [ spam * eggs, -eggs**spam, # note that - has a lower priority to ** eggs - spam**spam, eggs * (1 - spam), eggs % spam, # remainder eggs // spam # floor division (integer division) ]  [14, -49, 3, -7, 1, 3]  This return structure is a unique python object called a list. We have also introduced the idea of assignment i.e. assigning a value to a variable name (Later we will see that this is not limited to numbers) Style Tip Always add a single space around mathematical operators unless you want to emphasise which get enacted first e.g. eggs - spam**spam helps the reader see that spam**spam is evaluated first. Please briefly read: https://www.python.org/dev/peps/pep-0008/#other-recommendations for a guide on how to write expressions Similarly there are a number of logical operations spam = 2 eggs = 7 [ eggs > spam, eggs < spam, eggs != spam, eggs == spam, eggs >= spam, eggs <= spam, ]  [True, False, True, False, True, False]  and boolean operations spam = 0 eggs = 7 [ eggs and spam, eggs or spam, eggs & spam, # bitwise and eggs | spam, # bitwise or not eggs, (not eggs) and spam, not (eggs and spam), # for a real bit of banter ~ is the operator for: (-x)-1 ~eggs, (~~eggs) == eggs, # which is symmetric (not(not eggs)) == eggs # unlike not ]  [0, 7, 0, 7, False, False, True, -8, True, False]  Style Tip Note how I indent the contents of large lists by four spaces. This makes it easier to read and also easier to comment on things within the list ## Lazy Boolean Evaluation A huge source of errors in python… the following will evaluate!! eggs > spam or an_undefined_variable  True  but in contrast eggs > spam | an_undefined_variable  --------------------------------------------------------------------------- NameError Traceback (most recent call last) <ipython-input-116-3eba06dec764> in <module>() ----> 1 eggs > spam | an_undefined_variable NameError: name 'an_undefined_variable' is not defined  This is because and and or are evaluated lazily. If there is no need to evaluate the second expression then it won’t ever hit the compiler. Their bitwise counterparts & and repsectively | will always evaluate the entire line. This property is very useful, however. Imagine if the second condition took 1hr to evaluate! # The traceback You will notice the traceback in the example above. The following is a guide on how to read the traceback In Python, it’s best to read the traceback from the bottom up: 1. Blue box: The last line of the traceback is the error message line. It contains the exception name that was raised. 2. Green box: After the exception name is the error message. This message usually contains helpful information for understanding the reason for the exception being raised. 3. Yellow box: Further up the traceback are the various function calls moving from bottom to top, most recent to least recent. These calls are represented by two-line entries for each call. The first line of each call contains information like the file name, line number, and module name, all specifying where the code can be found. 4. Red underline: The second line for these calls contains the actual code that was executed. The way I personally read tracebacks is as follows: 1. Blue box - what error is this? 2. Green box - is this error description obvious? 3. read next red underline starting from bottom - what caused it? 4. Is the line of code a line of code you wrote or something reall obscure? 5. If not obscure… i. orange box - If the error is here, go to the file and on this line correct the error ii. If you don’t recognise the line of code google it 6. If obscure and you can go up one line, goto 3. and loop 7. If you run out of lines then google the Blue and green box ommiting any specific variables (e.g. I wouldn’t include 'someon' in my google search but I would include the rest) 8. If that doesn’t yeild results start at the top orange box and replicate each function call working to the bottom ### Golden rules of debugging 1. The machine is always right It will tell you precisely what you asked 2. The machine doesn’t read your mind What you wanted is not usually aligned to what you asked 3. Accept your stupidity self-loathing is a common side effect to debugging. Just try not to swear in public places 4. Don’t be arrogant 50% of the time your code is wrong every time ### Example: Fibonacci Series The sum of two elements defines the next. a, b, = 0, 1 while a < 10: print(a) a, b = b, a + b  0 1 1 2 3 5 8  This example introduces several new features. ## Multiple assignment The first line contains a multiple assignment: the variables a and b simultaneously get the new values 0 and 1. Expressions on the right-hand side are all evaluated first before any of the assignments take place. The right-hand side expressions are evaluated from the left to the right. ## Simple Loops The while loop executes as long as the condition a < 10 remains True. Any sequence or number with a non-zero length is true, empty sequences are false. The standard comparison operators are written the same as in C: < (less than), > (greater than), == (equal to), <= (less than or equal to), >= (greater than or equal to) and != (not equal to). ## Printing The print() function writes the value of the argument(s) it is given. You can format strings like print('this is a decimal third: {0:10.4f}'.format(-1/3))  this is a decimal third: -0.3333  • Where the f10.4 indicates that we want to output a float (a decimal) with length no less than 10 (including the - and .) with a 4 decimals. • To the left of the colon this is the argument number (zero starting) in .format(arg0, arg1) that you wish to be inserted here. • Blank on the left of : will just take arguments in order. • Blank on the right will just format as if you did str(-1/3) ## Indentation / the colon : The body of the loop is indented: indentation is Python’s way of grouping statements. You must indent with 4 spaces for each indented line. Each line within a basic block must be indented by the same amount. In addition to being used in formatting… The : symbol is used to start an indent suite of statements in case of: if, while, for, def and class statements. Everything following this : should be indented by 4 spaces. eggs = 4 bacon = 2 if eggs and bacon: print('eggs and bacon') while bacon < eggs: print('{} {}'.format(bacon, eggs)) bacon +=1  eggs and bacon 2 4 3 4  As a sneak peak at indentations we will use later… iterable = 'string' # enumerate will return [(0, item_0), (1, item_1), ..., (n, item_n)] # from an iterable that is [item_0, item_1, ..., item_n] for i, s in enumerate(iterable): # remember multiple assignment? assign i, s = n, item_n # d means format as integer. 02d means ensure length of 2, padded with 0 print('Num: {1:02d} char: {0:s}'.format(s, i)) def my_func(x): return x**2 class MyClass: def another_func(self, y): return my_func(self)  Num: 00 char: s Num: 01 char: t Num: 02 char: r Num: 03 char: i Num: 04 char: n Num: 05 char: g  # Exercises It is assumed that you will get multiple tracebacks in the questions. Some will be very long and really obscure - stick to the recipie above and you should surface! For these exercises, we will use ipython. You may need to pip install it, google how to do this and once installed you can simply type  ipython


in the terminal

## Exercise 5.1: Debugging confusing code

This example shows some code that you may not understand. This is common when using libraries such as pandas and numpy which may throw some really obscure errors. This exercise is to help you realise that all problems are debuggable if you read the traceback carefully and don’t get bogged down in that what you may not understand.

Hint Don’t skip any steps when reading the stack! You can insert print and type functions by inserting them (with 4 space indentation) to help debug errors like

def my_func(a, b):
print('a is:', a)
print('a has the type of:', type(a))
return second_func(x, y)


Here is the question… Understand why the below function does’t return 25 and fix the issue below. You will need to run the functions below in python

Trick In ipython you can copy blocks of python code to your clipboard and paste them directly in by typing %paste in the ipython terminal

def first_func(x, y):
return second_func(x, y)

def second_func(a, b):
return third_func('a', 'b')

def third_func(a, b):
return a**b


and the executing the following line

>>> first_func(5, 2)


will give the error you need to fix as below

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-6-710f1e0a547e> in <module>()
----> 1 first_func(5, 2)

<ipython-input-5-a33026a6e4d1> in first_func(x, y)
1 def first_func(x, y):
----> 2     return second_func(x, y)
3
4 def second_func(a, b):
5     return third_func('a', 'b')

<ipython-input-5-a33026a6e4d1> in second_func(a, b)
3
4 def second_func(a, b):
----> 5     return third_func('a', 'b')
6
7 def third_func(a, b):

<ipython-input-5-a33026a6e4d1> in third_func(a, b)
6
7 def third_func(a, b):
----> 8     return a**b

TypeError: unsupported operand type(s) for ** or pow(): 'str' and 'str'

# Solve me!


## Exercise 5.2: Geometric Brownian Motion (GBM)

This exercise builds on your ability to translate mathematical operations into code

$S_{t+1} = S_t e^{(r - \frac{1}{2}\sigma^2) \Delta t + \sigma\epsilon\sqrt{\Delta t}}$

We introduce the numpy library and note that you can rename library names on import with as to make it more convinient. Traditionally, most users will rename numpy as np

import numpy as np  # style tip: Put imports at the top and separate from statements

np.random.seed(42)  # Popularity tip: Always use 42 as a random seed.
sigma = 0.1                             # annualised volatility of 10%
epsilon = np.random.standard_normal()
s_t = 100
dt = 1/252                              # using annualised vol so need days as frac of yr
r = 0.02                                # annualised expected return


### Exercise 5.2.2: Calculate the value of $S_{t+1}$ for a single time step

Use the function np.exp() as the exponential function e and np.sqrt for the squareroot function

# Solve me!


### Exercise 5.2.3: GBM for an Equity

This exercise will reinforce the learning in 5.2.2 and also require knowledge of the boolean operations covered earlier in this section.

Equities cannot have a value less than zero, $S_{t} \ge0\ \forall t$

Lets increase the volatility $\sigma$ significantly to 1000% and now model the stock price until it goes to zero.

Modify the earlier Fibonacci Series example to loop the calculation of $S_{t+1}$:

1. Set sigma to have a value of 1000%
2. print the current price s_t at each time step
3. Stop the similution once the stock price reaches zero

Hint Remember that you can reassign the variable s_t and you will want a new epsilon = np.random.standard_normal() to be calculated at each iteration. Also using 0 exactly might take a while to converge, for example $10^{-250}$ is still not zero but in finance anything less than $0.01 is essentially zero. Another way you can represent 0.01 is in exponential form as 1e-2 which stands for$10^{-2}$# Solve me!  # Next Topic 06: Libraries ## Learning Outcomes • Be able to install libraries from pypi using pip • Be able to install libraries requiring C++ compilers using conda • Understand that machines require precision in commands • Be able to locate a proxy in a corporate setting ## Contents Run the following $ python -c "import requests"


we will see how to fix this issue

# pip

pip is the default python package manager. It is to python what homebrew is to Mac OS X.

Install a new PACKAGE with

$python -m pip install PACKAGE  for example $ python -m pip install requests --proxy https://path.to.my.proxy:1234 -IU


Here we introduce three very useful flags when in a corporate environment. Almost all corporations will have proxies. You need to specify the proxy to hit the internet programmatically.

To see other flags type

$python -m pip install -h  # conda Recall anaconda is what we installed python with. This is a more sophisticated python package manager. Some libraries like scipy require advanced C++ libraries (LAPACK / BLAS) in the case of scipy. These can be extremely difficult to install on Windows machines in particular (last time installing LAPACK/BLAS from scratch took me 2 days reading documentation plus 8 hours of compilation!) conda comes with prebuilt versions and compilers which mean you get scipy in under a minute $ conda install scipy


Downsides are that not all libraries are supported by conda and there is less help on Stackoverflow for it.

# Exercises

## Exercise 4.1: Supplying arguments to a program

This exercise is to help you get used to the idea of self help in coding.

Try and figure out what the -IU means and notice that you can stack single letter flags on a single -. The combination of -IU is usually only used when an installation is broken or doing something weird and won’t work.

## Exercise 4.2: Debugging bad command

I can’t really tell you the aim of this exercise without giving away the answer :)

Why does this not work?

$python - m pip -h  Hint It shouldn’t just take you to the python console. It should show you the Help for pip! ## Exercise 4.3: Understanding importance of $PATH

This exercise aims to educate you at how your machine works under the hood. How does your machine know how to execute programs in Program Files or other obscure locations? This should make that clear.

Recall that the first command must be an executable program. Why does $pip -h work? Hint On Windows powershell type $ $env:PATH or on Mac OS X type $ echo $PATH. You can use ls to display files that match patterns, for example $ ls "/a/directory/*pip*"


use this on the /bin directory for anaconda.

## Exercise 4.4: Corporate Proxies

This exercise is really more of an example to make a point that most people in corporate environments think IT have blocked them from using python - genuinely some of the smartest quants I know have also thought this. When in fact it’s nothing of the sort. Instead it’s just a symptom of being behind a firewall.

If you are on a corporate network find your proxy. In windows follow this guide. On Mac OS X follow this guide

This proxy is hardcoded in your browser e.g. Google Chrome / IE and is how it hits the internet. The example above shows how to pip install using a proxy.

# Next Topic

05: Basic Mathematical Operations & Debugging

# Learning Outcomes

• Awareness of numpy, pandas and matplotlib
• Basic use of library functions
• Self help for library function use
• Familiarity with ipython

# numpy, pandas & matplotlib

numpy is the main math library in python and pandas (paid for by 2$\sigma$) is probabaly the most important data manipulation library. matplotlib is the most popular core plotting engine used by python developers.

Pandas is nothing more than a convinience tool based in numpy. This is crucially important to understand how they work together

First lets look at simply importing them and calling basic functions

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt


Note the different ways in which the libraries are called. All good python libraries provide a quick-start guide.

Remember that we are just tying these libraries to object names which can be changed, so the following is a terrible idea!

>>> np = 10
>>> print(np)
10


Now we have overrode our numpy library with the number 10! We can easily revert though by re-importing as below

import numpy as np


Common Bug Error when scripting or hacking code together is to forget that you have previously defined variable names. This is particularly common with variables names like df which will often be used for pandas.DataFrame objects (as you will see). However, coders will regularly reuse df and overwrite previous values which can lead to bugs.

Another common error is overriding default python statements. Common mistakes are assigning new values to the built-in types below. As we can see when we simply run them in python they contain object information (recall that if you enter an undefined name you will get a NameError - check this)

>>> object, type, dir, locals
(object, type, <function dir>, <function locals()>)


Just don’t use the names above: You can generally avoid these issues by giving descriptive names like

my_object = 1
curve_type = False
base_dir = r'\\my\path'
local_string = 'this'


## Using library functions

You will generally be solving a problem so for example to find the exponential function in python just google “exponential function in python”. If as is common, you know a function exists in a library but are unsure of spelling (or are lazy) you can start typing the method name and hit TAB like pd.read_ TAB will give a series of autocompletion options to select from

pd.read_csv('06_test.csv')

Programming language Designed by Appeared Extension
0 Python Guido van Rossum 1991 .py
1 Java James Gosling 1995 .java
2 C++ Bjarne Stroustrup 1983 .cpp

you can find help for a function by doing pd.read_csv? in ipython. A more genereric way of finding help for a function is by

help(pd.read_csv)


## PATH and ipython

What follows is a little divergence into the internals of your computer so that you understand how ipython magically starts …

### A brief foray into PATH and system env variables

Python scripts can be installed as executable files (ones you can double click) by pip - the tool we encountered before. In order to run these scripts from the commandline we need to make sure that they are on our system PATH.

The system PATH is common in both windows and linux. It contains a list of directories that (may) contain executable files. Any executable files in these directories can directly be ran in the Terminal (or by any other program running). Examples of programs (hopefully!) on your path are:

• ls, the simple program we used earlier for finding files in a directory
• python, the python interpretor

The PATH environment variable is simple one of many environment variables that will be o your machine. You can actually create environment variables named whatever you fancy. You may have heard of PYTHONPATH. This is another environment variable only used by python and contains a list of all directories that might have .py files that can be ran from python, hence the nomenclature.

#### Modifying PATH

We can add directories to our path in various ways

##### Temporarily
• Windows set PATH = $env:PATH";\\path\to\dir" • Linux export PATH="$PATH:/path/to/dir"

### ipython

There are two ways to run ipython depending on your python setup

• python -m ipython
• ipython

The second is my preference because if it doesn’t work then some folders are incorrectly set up. The first is the default and should always work. If it doesn’t then you should pip install ipython again. Possibly also with the -IU mentioned in a previous exercise!

ipython has some great features like

• TAB-autocompletion (I TAB as soon as I write more than 3 letters of a variable name that already exists!)
• syntax highlighting
• auto indentation
• magic commands e.g. %paste, %timeit, %hist

We will use ipython for the next few sections before introducing other options

# Exercises

In this example we are going to use a combination of these libraries to webscrape the Bank of England mortgage rates. This is a jump from what we have done before to get you familiar with more complex python scripts

We can find a relevant dataset of interest by exploring http://www.bankofengland.co.uk/boeapps/iadb/AtoZ.asp?Travel=NIx and after a bit of faff I found the help area for submitting requests in the top left of the home page http://www.bankofengland.co.uk/boeapps/iadb/help.asp?Back=Y&Highlight=CSV#CSV. I decided to use the 2, 3 & 5 year fixed mortgage rates for 75% LTV. (Explicitly this is Monthly interest rate of UK monetary financial institutions (excl. Central Bank) sterling 2, 3 and 5 year (75% LTV) fixed rate mortgages to households (in percent) not seasonally adjusted)

Don’t worry about fully understanding the python structures below yet. This is just to formulate the example

You can copy the text below and then type %paste in ipython to run all the code on your clipboard

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
# googled "query website with python using parameters" - you will notice they all use 'requests'
import requests

# This format comes from googling 'python query website with requests' and combining it with the parameters required
# by the help link on the BoE site (second link above)
'Datefrom'   : '01/Jan/1970',
'Dateto'     : '01/Sep/2019',
'SeriesCodes': 'IUMBV34,IUMBV37,IUMBV42',
'CSVF'       : 'TN',
'UsingCodes' : 'Y',
'VPD'        : 'Y',
'VFD'        : 'N'
}


After googling "pandas read_csv from string" I came up with the following

from io import BytesIO


Now peek at result to check it looks sensible

df.head()

DATE IUMBV34 IUMBV37 IUMBV42
0 31 Jan 1995 8.13 8.84 9.44
1 28 Feb 1995 8.38 8.96 9.38
2 31 Mar 1995 8.08 8.64 9.33
3 30 Apr 1995 8.29 8.54 9.4
4 31 May 1995 8.22 8.18 9.4

## Exercise 6.1: Using pandas built-in functions

pandas will allow you to plot data directly with little effort if you have it in a particular format. The df.index is usually the x-coord and the columns are each treated as y-coord for plotting simple scatter / line plots.

You will notice the numeric index on the left of the dataframe contained df. We actually want the 'DATE' to be on the x-axis so you will need to make that the index. Secondly, The df.index is in pandas default datetime. This tends to not plot particularly well. We want to convert it to datetime.datetime format.

1. Google how to convert a column of a dataframe into datetime format and change the 'DATE' column to datetime.
2. Google how to set an index in a pandas dataframe and set the index so that df.index will show the 'DATE' field.
3. Use the .plot method to plot the dataframe.

Note that in ipython you will also need to show the plot. Google "How to show matplotlib plots" if it doesn’t appear.

# Solve me!


# Next Topic

07: Flow Control

## Learning Outcomes

• Able to launch / exit python in the terminal and run a simple command
• Able to import libraries
• Check python version
• Understand weakness of jit compilation

# Hello World

Open the Terminal and type

$python  This should then load the basic python interpreter. When we are in a python interpreter, I will use the following syntax: >>> for new logical blocks, ... for line continuations or logical block continuations and results will be without padding. Try entering the following to print thing to the console window >>> 'Hello World' Hello World  This gives the string representation of the string 'Hello World' which is by definition Hello World # Statements A statement in python is something that is expressed in a single logical line such as >>> import sys  or another example is the assert statement which throws an error unless the following expression evaluates to True >>> assert True  You may be familiar with some other statements like - break - continue - return - yield - raise  amongst others which can be found on the docs page # Using python libraries To use python libraries we use the statement import Worth noting is the following to check what python program is currently running your script (can be very important for debugging strange errors!) >>> import sys >>> print(sys.executable) /usr/local/anaconda3/bin/python  and for checking what version of python you have running. We should all be using at least python 3.6 by now (higher versions are fine) >>> print(sys.version) 3.6.5 |Anaconda, Inc.| (default, Apr 26 2018, 08:42:37) [GCC 4.2.1 Compatible Clang 4.0.1 (tags/RELEASE_401/final)]  if you get ModuleNotFoundError this library is a non-standard python library and we can address that in another section. # Exiting python You can exit python by entering >>> exit()  If you are in the middle of executing code, you can force code to stop by entering Control+C. Sometimes you may have to hold Control+C or press it quite a few times to actually register particularly if you are doing something where the interpreter hasn’t got time to register your command! # Exercises ## Exercise 3.1: The issue with just-in-time (jit) complilation This exercise is designed to demonstrate a common issue that can be caused by python’s JIT compilation (i.e. the fact that it runs one line at a time). In comparision Fortran, java & c++ will check syntax across all the code (as well as numerous other checks) before you can run a single line. In the python console type (this is meant to give a bug!) >>> print(HelloWorld)  Notice that the compiler tells you three things • what file the issue was in • what line the issue was on • an arrow signalling which bit of code was bad • An error type and description of the error Lets write your first function (Note we will cover functions properly in section 8) as follows >>> def hello_world(): ... print(HelloWorld)  This function will execute everything indented until it hits a non-indented block (or a return / yeild statement). If this was c++ we would instantly get an error as before. However in python we won’t get an error until the line itself is actually ran - similar to VBA >>> hello_world()  This is one of the biggest weaknesses of python and why most core quant libraries are still built in java or C++. Most of the bugs you create will be due to this. NB You’re not expected to fix this code yet! Just move onto the next exercise :) # Next Topic 04: Installing Python Packages ## Learning Outcomes • Know how to install python on your own machine • Know how to find where python is currently installed ## Contents It’s important not to skip steps. That is all. We will use anaconda for simplicity. You can do what you like but I’ve installed python in lots of different ways on Linux, Mac and Windows and anaconda is by far the most straightforward for beginners. # Windows Make sure you tick the bit that asks you if you want to add anaconda to your PATH in the advanced settings. Otherwise you won’t be able to just run python nicely from the Terminal. Google something like “How to install anaconda Windows” and follow the guide. # Mac OS X Macs are a bit annoying in that you can’t simply uninstall things nicely. To get around this we use homewbrew which is a package manager for Mac OS X. Google something like “installing hombrew mac os x” and at the time of writing it should take you to https://brew.sh/ with a line of code that you copy and paste into Terminal. Normally sholdn’t ever run code you don’t understand but this code is pretty legit. You must install anaconda via the Terminal and so we use brew to install anaconda. Therefore, google something like “install anaconda brew mac os x” (Note: You don’t want the graphical installer!!) I found this link decent at the time of writing. ## Adding python to your PATH for Mac OS X In Mac OS X you’ll need to add the location of your python installation into a special file that is ran when the Terminal starts. To do this we use a command to create a file if it doesn’t exist $ touch ~/.bash_profile


The ~/ is a shortcut for your root user file or home directory. On Windows this is C:\users\yourname. The . at the beginning just denotes a hidden file.

We can then add the PATH to the python executable to this file by

$echo 'export PATH=/path/to/your/conda/bin:"$PATH"' >>  ~/.bash_profile


Note that at the time of writing the PATH was /usr/local/anaconda3/bin. This command adds the line export PATH=/path/to/your/conda/bin:"$PATH" to the file ~/.bash_profile by writing the output of the command echo to the file specified after the operator >> whcih signifies that you want to redirect the output and append to a file. For more information see here If you ever have weird things happen in the future when using brew check out https://hashrocket.com/blog/posts/keep-anaconda-from-constricting-your-homebrew-installs # Exercises ## Exercise 2.1: Verifying your python executable This exercise shows you how to debug a fairly common issue where users are running a different python executable to the one that they just installed. Check what python is your system default On Windows $ Get-Command python


On Mac OS X

$which python  ## Exercise 2.2: Verifying your python version This exercise helps you check your python version. Another source of confusion for some beginners. Check you have python 3.6.5 or greater $ python --version


If you don’t get "Python 3.x.x :: Anaconda, Inc." for python 3.6.5 or greater than seek help / reinstall.

## Exercise 2.3: Running a .py file in python

This exercise aims to show you how python files can be executed from the command line.

Run the file ~/pylrn/test.py with python

# Next Topic

03: The Python Interpreter

## Learning Outcomes

• Be able to find / open the terminal
• Be able to navigate filesystem within the Terminal
• Basic usage of linux commands

# Opening the Terminal

Why do I care? Basically because all the self help on the internet will rely on you being able to reproduce what you messed up in a minimal example. A lot of errors at the early stages are often to do with set up and nothing to do with python. Being able to run code in a minimal way in the terminal will give you confidence in programming in python and be able to isolate python issues from mor generic setup or configuration issues which can be very difficult to otherwise debug.

## Opening a console on Mac OS X

OS X’s standard console is a program called Terminal. Open Terminal by navigating to Applications, then Utilities, then double-click the Terminal program. You can also easily search for it in the system search tool in the top right.

The command line Terminal is a tool for interacting with your computer. A window will open with a command line prompt message, something like this:

mycomputer:~ myusername$ ## Opening a console on Windows Window’s console is called the Command Prompt, named cmd. However it sucks. I would always recommend using powershell. An easy way to powershell to it is by using the key combination Windows+R (Windows meaning the windows logo button), which should open a Run dialog. Then type powershell and hit Enter or click Ok. You can also search for it from the start menu. It should look like: C:\Users\myusername>  Window’s powershell is not quite as powerful as its counterparts on Linux and OS X but they’ve copied all the relevant parts that we will want to use # Navigating around the Terminal From this point onwards anything denoted like $ program


will represent a Terminal interface (this is not python!). The above represents running a command named program in the Terminal.

## Moving around directories

These commands will soon become second nature & after some practice they are quite a lot faster than navigating the user interface file explorer!

Check what directory you are in

$pwd  show what’s in the current directory $ ls


make a new directory

$mkdir newdirectory  enter this directory $ cd directory


go up one directory

$cd ..  show whats inside the directory $ ls directory/


remove a file

$rm file.ext  Change the path (rename) a file (and / or extension) $ mv directory/file.dat anotherdirectory/file2.csv


remove a directory (I probabaly wouldn’t do this hastily!)

$rm -r directory  The -r is a known as a flag. This modifies the behaviour of the program rm. IN this instance it allows rm to delete directories which is otherwise a risky default ability. These are pretty much all the commands you need. Try to familiarise yourself with them. All of these commands e.g. ls, rm, mkdir are all programs themselves that take in arguments from the command line depending on their operation. # Exercises Your local machine home directory is always denoted by ~/ (note the slashes are backwards in Windows so this is ~\for Windows machines) These exercises aim to get you more comfortable with using the command line rather than being an actual problem solving task ## Exercise 1.1: Creating directories • cd into the directory ~ • Create a new directory named temp in this directory ## Exercise 1.2: Renaming directories Change the name of this directory to pylrn You can test this worked by doing $ ls ~/pyl*


the * will wildcard all text following it

## Exercise 1.3: Create a file with contents

Run the following command to create a new file

Max OS X users do this

$touch ~/pylrn/test.py$ echo print('Hello World') >> ~/pylrn/test.py


Windows users do this

$"print('Hello World')" | Out-File ~/pylrn/test.py -encoding ascii  verify it is there with $ ls ~/pylrn/*.py


# Next Topic

02: Installing Python for Mac OS X / Windows

# Quick Start (Coinbase)

There are two simple steps

Why GDAX? Literally the only reason I chose GDAX is that it has an API and the fees don’t seem to be too bad. Coinbase and GDAX are the same company. GDAX is just the exchange side of the Coinbase wallet. They are kept separate I guess to simplify the currency conversion process for the everyday enthusiast.

A heads up - Coinbase is for bitcoin newbs this means in a nutshell that they behave like any other bank (for better or for worse, which can be subjective). Your account will probably get closed / sales refunded if you buy things (not limited to): Hookers, Weapons, Drugs, Gambling etc.

I find the gambling moderation ironic since the whole purpose is to allow you to speculate on crypto-currencies.

# Depositing Funds

## Depositing via Bank Account (SEPA)

This is the cheapest option to deposit funds as you’ll have to do it anyway if you want to get money back into pounds.

You need to deposit €6 into the EUR account to complete the validation process that will allow you to withdraw funds if needed. I found this process was useless and a major reason Coinbase gets such bad press! The waiting periods for validation seem to take forever.

## Depositing via Debit Card

I thought this was a better option for amounts <£250 as the card charges at time of writing are 4% and bank fees are a flat £10.

However, currently it is not possible to withdraw funds onto a debit card. In short just validate a bank account

# Validate a Withdrawal Account

This is required if you want to extract any funds from your wallet back into your bank account. With my NatWest account on Wednesday 30th August, it took 4 days to get validated at midday on Sunday 3rd September.

Buying currency through coinbase is expensive. There is a 99¢ charge on every transaction. This is where GDAX is used.

You can log into the GDAX account and deposit from the coinbase wallet into the exchange. Once this is done you face only the exchange fees for buying and selling cryptocurrency.

Be aware that there are minimum/maximum buy/sell orders of 0.01 in each cryptocurrency

You can transfer funds back into your coinbase account and deposit these into a normal bank account if you so wish.

Quick start guide going over how I set up MySQL on macOS Sierra

# Install with homebrew

Update brew and fix any pressing issues with doctor

brew update
brew doctor


then install with

brew install mysql


and you should receive an output similar to

We've installed your MySQL database without a root password. To secure it run:
mysql_secure_installation

To connect run:
mysql -uroot

To have launchd start mysql now and restart at login:
brew services start mysql
Or, if you don't want/need a background service you can just run:
mysql.server start
==> Summary
🍺  /usr/local/Cellar/mysql/5.7.16: 13,511 files, 439M


# Secure Installation

1. Start mysql by running

 brew services start mysql

2. Run the installation script

 mysql_secure_installation

3. You will be asked to set up a setup VALIDATE PASSWORD plugin. Enter y to do this.
4. Select the required password validation (Doesn’t really matter if it is just you using the database)
5. Now select y for all the remaining options: Remove anon. users; disallow remote root logins; remove test database; reload privileges tables.
6. Now you should receive a message of

 All done!


If you receive the error of

Error: Can't connect to local MySQL server through socket '/tmp/mysql.sock' (2)


then it is likely you don’t have mysql running. Make sure to do the first step here

Log in to MySQL as root with

mysql -u root -p


See the docs for creating users, databases etc. The version can be obtained by

mysql --version


where confusingly Distrib refers to the documentation number

# Managing the Background Service

The command

brew services start mysql


will start mysql and this service will be started at login in perpetuity. To stop the service in perpetuity run the command

brew services stop mysql


further examples can be found by reading the simple docs

# Python Interface

Install pymysql with

pip install pymysql


and read here for the advantages over MySQLdb. A simple example interfacing using the root that was set up earlier can be seen by the following example. Log in via

mysql -u root -p


and enter the following SQL code to create a test DATABASE and TABLE with

CREATE DATABASE example_database;
USE example_database;
CREATE TABLE users (
id int(11) NOT NULL AUTO_INCREMENT,
email varchar(255) COLLATE utf8_bin NOT NULL,
password varchar(255) COLLATE utf8_bin NOT NULL,
PRIMARY KEY (id)
) ENGINE=InnoDB DEFAULT CHARSET=utf8 COLLATE=utf8_bin
AUTO_INCREMENT=1 ;

now the following script saved as test_pymysql.py

import pymysql
import argparse
import getpass

def __call__(self, parser, namespace, values, option_string):
if values is None:
values = getpass.getpass()
setattr(namespace, self.dest, values)

parser = argparse.ArgumentParser()
args = parser.parse_args()

connection = pymysql.connect(host='localhost',
user=args.user,
db='example_database',
charset='utf8mb4',
cursorclass=pymysql.cursors.DictCursor)

try:
with connection.cursor() as cursor:
# Create a new record
sql = "INSERT INTO users (email, password) VALUES (%s, %s)"
cursor.execute(sql, ('webmaster@python.org', 'very-secret'))

# connection is not autocommit by default. So you must commit to save
connection.commit()

with connection.cursor() as cursor:
sql = "SELECT id, password FROM users WHERE email=%s"
cursor.execute(sql, ('webmaster@python.org',))
result = cursor.fetchone()
print(result)
finally:
connection.close()

and run as

python test_pymysql.py -u root -p


to give the output of

{u'password': u'very-secret', u'id': 1}


to remove the example_database, log in to mysql and invoke

DROP DATABASE example_database;


# References

This was solved by turning off all extensions and turning them back on one at a time.

The offender was TeX All the Things and comments were available again with it disabled. This Reddit Post also indicates that Chrome extensions can often cause issues for the comments box.

# References

Start the console

open /Applications/Utilities/Console.app

Look at system.log reports. This can also be viewed by

syslog -w

My mac has a launch daemon running repeatedly in the background every 10s

Nov  7 15:32:15 iBanterMacBook com.apple.xpc.launchd[1] (net.juniper.AccessService[3276]): Service could not initialize: Unable to set current working directory. error = 2: No such file or directory, path = /Applications/Junos Pulse.app/Contents/Plugins/JUNS: 16B2555: xpcproxy + 11207 [1356][4A173B59-A786-3CBF-9740-CFC693060FEF]: 0x2
Nov  7 15:32:15 iBanterMacBook com.apple.xpc.launchd[1] (net.juniper.AccessService): Service only ran for 0 seconds. Pushing respawn out by 10 seconds.
Nov  7 15:32:25 iBanterMacBook com.apple.xpc.launchd[1] (net.juniper.AccessService[3277]): Service could not initialize: Unable to set current working directory. error = 2: No such file or directory, path = /Applications/Junos Pulse.app/Contents/Plugins/JUNS: 16B2555: xpcproxy + 11207 [1356][4A173B59-A786-3CBF-9740-CFC693060FEF]: 0x2
Nov  7 15:32:25 iBanterMacBook com.apple.xpc.launchd[1] (net.juniper.AccessService): Service only ran for 0 seconds. Pushing respawn out by 10 seconds.


I has uninstalled this program months previously. I finding the relevant launch daemon by

grep -lR juniper /System/Library/Launch* /Library/Launch* ~/Library/LaunchAgents/

where grep -lR juniper will search for all files in the following directories containing the pattern juniper

In this case the offenders were located at

/Library/LaunchDaemons/net.juniper.AccessService.plist
/Library/LaunchDaemons/net.juniper.UninstallPulse.plist
/Library/LaunchAgents/net.juniper.pulsetray.plist


I can then remove then by

GREP_NAME=juniper
for i in $(grep -lR$GREP_NAME /System/Library/Launch* /Library/Launch* ~/Library/LaunchAgents/);
do sudo launchctl unload $i echo 'unloaded: '$i
sudo rm $i echo 'removed: '$i
done

you can check the launch daemons by invoking

sudo launchctl list | grep $GREP_NAME If it still remains restart the system. Mine remained as I did not first unload the processes while writing. Restarting restarted launchd properly. Download the source code for this page here # Contents # Introduction This is more of a personal bookmark so I don’t forget this super simple solution # Method Include the following in the _config.yml file markdown: kramdown  then add this line where you want the TOC to appear in the markdown blog entry * This will become a table of contents (this text will be scraped). {:toc}  This will also include Contents inside TOC which looks dumb. Correct this by {:.no_toc} following a title you don’t want in the TOC. # References This guide builds up to launching an arbitrary number of GPU cloud instances with pre-loaded parameters. This allows for the distribution of lengthy tasks onto CUDA cores. This is extremely useful for running a series of Deep Neural Networks with different parameterisations ! You will need an AWS account set up: See previous blog # Contents # Set Up IAM Access Key To use the CLI you need an IAM access key and code pair • Navigate the AWS IAM portal • Select Users in the left navigation pane • Create New User • Enter a user name & click Create • Download Credentials and store securely # Install with pip This is the simplest method if you have pip pip install awscli add the flag --user if you don’t have root access. See the docs for other installation options. To check the installation aws --version aws-cli/1.11.10 Python/2.7.12 Darwin/16.1.0 botocore/1.4.67  # Configure CLI Configure the default user by the command aws configure as follows aws configure AWS Access Key ID [None]: AKIAIOSFODNN7EXAMPLE AWS Secret Access Key [None]: wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY Default region name [None]: us-west-2 Default output format [None]: json  A list of region codes can be found here this creates two files at ~/.aws. This bash command will show the simple contents for f in$(ls ~/.aws/)
do echo -e "# start ~/.aws/$f" cat ~/.aws/$f;echo -e "# end file\n"
done

which could be edited by either running aws configure or editing the contents directly

# start ~/.aws/config
[default]
region = us-west-2
# end file

# start ~/.aws/credentials
[default]
aws_access_key_id = AKIAIOSFODNN7EXAMPLE
aws_secret_access_key = wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY
# end file


See the docs for specifying additional users with different regions and keys.

Enable command completion by adding the following lines to ~/.bash_profile

# This is for the amazon web services CLI to enable autocompletion
complete -C '/usr/local/bin/aws_completer' aws
export PATH=/usr/local/bin/aws:$PATH now refresh the ~/.bash_profile by source ~/.bash_profile see the docs if you installed without pip. Check the path to aws is correct by which aws. You can test by entering aws s and hitting TAB. I found the autocompletion was bit rubbish. I had to TAB twice for it to display the first time # Create Security Group, Key-Pair and Role The following command creates the security group example-name with an example description aws ec2 create-security-group \ --group-name example-name \ --description "an example description" then you can add a rule to allow ssh traffic over port 22 for a specific ip range specified by a CIDR (see below) aws ec2 authorize-security-group-ingress\ --group-name example-name \ --protocol tcp \ --port 22 \ --cidr 0.0.0.0/0 If you’re lazy, leave it as is.To be more secure, determine your CIDR code from a range of IP addresses from your router using the python module netaddr import netaddr start_ip = "63.223.64.0" # an example start ip end_ip = "63.223.127.255" # an example end ip cidrs = netaddr.iprange_to_cidrs(start_ip, end_ip) for cidr in cidrs: print cidr You can now test this worked by running aws ec2 describe-security-groups which will probably fail with the following result An error occurred (UnauthorizedOperation) when calling the DescribeSecurityGroups operation: You are not authorized to perform this operation.  You can fix this by navigating again to the Amazon IAM portal • Select Policies (and Get Started if you haven’t visited this page before) • Filter Policy Type by admin in the top search bar • Select AdministratorAccess • Click Policy Actions and Attach • Select your username if it asks you to • Attach Policy • Wait 10 secs or so and retry the command above which should return JSON If any errors are encountered, visit the StackOverflow thread as the required fix may change over time. # Create an Instance from the CLI It is possible to create an ssh key directly through the CLI aws ec2 create-key-pair \ --key-name example-key-name \ --query 'KeyMaterial' \ --output text \ > ~/Downloads/example-key-name.pem chmod 400 ~/Downloads/example-key-name.pem This means you will have a new key for each session which is a bit more secure than the global ssh key method described in the previous guide. A new instance can be created with the following command, which returns the InstanceId when successful. Note that this is a Free Tier request so it won’t bill and is ideal for playing with the CLI aws ec2 run-instances \ --image-id ami-ed82e39e \ --security-group-ids example-name \ --count 1 \ --instance-type t2.micro \ --key-name example-key-name \ --query 'Instances[0].InstanceId' The full JSON response is filtered by --query 'Instances[0].InstanceId' where 'Instances[0].InstanceId' means selecting the first JSON value for the key Instances and returning the value for the key InstanceId. More details on --query can be found in the docs. When specifying an --image-id make sure it corresponds to the correct region. In the case of an error code of (InvalidAMIID.NotFound) try specifying --region eu-west-1 or the relevant region code. The public IP for the instance can be returned by subsequently running aws ec2 describe-instances \ --instance-ids yourInstanceId \ --query 'Reservations[0].Instances[0].PublicIpAddress' alternatively, the DNS can be found by using --query 'Reservations[0].Instances[0].PublicDnsName' This can simply be adjusted to return multiple instances with the following syntax aws ec2 describe-instances --query 'Reservations[0].Instances[*].PublicDnsName' # Connect Connecting is straightforward and simply requires ssh -i ~/Downloads/example-key-name.pem ubuntu@[PublicIp_or_DNS] and you can connect either the IP or the DNS specified. If you used a default key stored in ~/.ssh then -i /path/to/key.pem isn’t required. If you can’t connect and get Operation timed out then try the following aws ec2 describe-instances \ --instance-id yourInstanceId \ --query 'Reservations[0].Instances[0].NetworkInterfaces[0].Groups[0].GroupName' and if it returns "default" then you messed up and blocked all incoming traffic which nicely ties into the last section # Terminating an Instance An instance may be terminated by aws ec2 terminate-instances --instance-ids instance_id where multiple instances may be terminated through aws ec2 terminate-instances \ --instance-ids instance_id0 instance_id1 instance_id2 which is equivalent to specifying the full JSON --instance-ids ["instance_id0", "instance_id1", "instance_id2"] for more shorthand notation see the docs # Tagging Instances Instances may only be tagged after they are created through the command aws ec2 create-tags --resources=yourInstanceId --tags Key=TestName,Value=TestName however, it is possible to do this in one line through xargs and shell script as aws ec2 run-instances \ --image-id ami-ed82e39e \ --security-group-ids example-name \ --count 1 \ --instance-type t2.micro \ --key-name example-key-name \ --query 'Instances[0].InstanceId' | \ xargs -I {} sh -c " echo 'InstanceId:'{} aws ec2 create-tags \ --resources={} \ --tags Key=Name,Value=TestName" annoyingly there is no confirmation that the request was successful so I prefer to use this sh code (you cannot do this with xargs) SEC_GRP=example-name KEY=example-key-name ID=$(aws ec2 run-instances \
--image-id ami-ed82e39e \
--security-group-ids $SEC_GRP \ --count 1 \ --instance-type t2.micro \ --key-name$KEY \
--query 'Instances[0].InstanceId' \
--output text)
echo 'Started InstanceId: '$ID echo 'Creating Tags ...' aws ec2 create-tags \ --resources=$ID \
--tags Key=Name,Value=TestName
echo 'Tags created: Key, Value'
aws ec2 describe-instances \
--instance-id $ID \ --query 'Reservations[0].Instances[0].Tags' \ --output text # Spot Requests with CLI Spot requests can be requested with the following command aws ec2 request-spot-instances \ --spot-price "0.050" \ --instance-count 2 \ --block-duration-minutes 120 \ --type "one-time" \ --launch-specification file://~/Desktop/test.json \ --query 'SpotInstanceRequests[*].SpotInstanceRequestId' \ --output text This will output the SpotInstanceRequestIds that manages requests • --spot-price is specifying a max bid price of USD$ 0.05
• --instance-count 2 looks to launch 2 instances with the same parameters
• --block-duration-minutes 120 means the request will stop at 120 if not already terminated or interrupted by a spot price rise
• --type "one-time" means that when interrupted / terminated no further instances will be launched

for more details and customisations see the docs. I would also recommend checking the docs to make sure --spot-price isn’t changed to specify as units of 1000*USD$ ! The file ~/Desktop/test.json should contain something similar to { "ImageId": "ami-0d77397e", "KeyName": "example-key-name", "SecurityGroupIds": [ "sg-youSecurityGroupID" ], "InstanceType": "m4.large", "Placement": { "AvailabilityvZone": "eu-west-1a" } } which specifies that we want m4.large instances in the eu-west-1a (Ireland) region with the AMI ami-0d77397e (64bit Ubuntu) more detailed examples of --launch-specification files can be found here You can view the IDs of these requests by aws ec2 describe-spot-instance-requests \ --query SpotInstanceRequests[*].{ID:InstanceId} If one or more are NULL then view the status by aws ec2 describe-spot-instance-requests \ --query SpotInstanceRequests[*].Status.Message which will most likely show something like [ "Your Spot request price of 0.05 is lower than the minimum required Spot request fulfillment price of 0.081.", "Your Spot request price of 0.05 is lower than the minimum required Spot request fulfillment price of 0.081." ]  # Getting the Spot Prices The current spot price can be obtained from an API at this endpoint which can be handled in python import json import operator import requests machine_type = 'p2.xlarge' api_url = "http://spot-price.s3.amazonaws.com/spot.js" print "Loading spots for Machine Type: {} ...".format(machine_type) res = requests.get(api_url) cleaned = res.content[len('callback('):-len(');')] result = json.loads(cleaned) # get all spots by region reg_machine_spots = { region['region']:{ size['size']: [ os['prices']['USD'] for os in size['valueColumns'] if os['name']=='linux' ][0] for it in region['instanceTypes'] for size in it['sizes'] } for region in result['config']['regions'] } # get all regional spots spots = { region: prices[machine_type] for region,prices in reg_machine_spots.iteritems() } # print the prices sorted lowest first ami_spots = sorted(spots.items(), key=operator.itemgetter(1)) for reg,spot in ami_spots: print reg.ljust(15) + spot My command line version is available here and can be run as ./cheapest_spot.py -t p2.xlarge  check the StackOverflow post if the link is outdated but this should return something like the following and is very helpful for determining instant prices Loading spots for Machine Type: p2.xlarge ... us-west-2 0.1315 eu-ireland 0.1643 us-east 0.1887 apac-sin N/A* us-west N/A* ap-northeast-2 N/A* us-east-2 N/A* apac-tokyo N/A* apac-syd N/A* ap-south-1 N/A* eu-central-1 N/A* sa-east-1 N/A*  It is also possible to obtain the historic spot prices using the CLI as follows aws ec2 describe-spot-price-history \ --instance-types m1.xlarge \ --product-description "Linux/UNIX (Amazon VPC)" \ --start-time 2016-10-31T03:00:00 \ --end-time 2016-10-31T03:16:00 \ --query 'SpotPriceHistory[*].[Timestamp,SpotPrice]' The historic time is limited and check the [docs][15] for latest details # Terminate Spot Instances To cancel spot instance requests aws ec2 cancel-spot-instance-requests \ --spot-instance-request-ids sir-08b93456 sir-08b93458 where sir-08b93456 sir-08b93458 are not the instanceIds The instances themselves should also then be terminated aws ec2 terminate-instances \ --instance-ids i-1234567890abcdef0 i-0598c7d356eba48d7 here the instance IDs are made explicit to differentiate them from the Spot Instance ID. Make sure to do both as you may end up having instances running without being aware ! # References This is a most basic starter with boost-python using homebrew with the following structure All the source code can be found in this repository. # Assumptions • Familiarity python as a minimum requirement • Have homebrew installed • python 2.7 • Vague C/C++ knowledge i.e. be able to write a function! • Mac OS X - I used El Capitan 10.11.6 at the time of writing # Get boost-python You can obtain the correct boost-python through homebrew via the command: brew install boost --with-python --build-from-source I think brew install boost should work but it’s a big install and life is short to do it twice # A simple example: "Hello World!" The following code wraps the C++ function greet() as a python extension named hello_ext. Save this as a .cpp file. To avoid ambiguity with internal functions name it something random e.g. hippopotaplusplus.cpp #include <boost/python.hpp> char const* greet() { return "Greetings!"; } BOOST_PYTHON_MODULE(hello_ext) { using namespace boost::python; def("greet", greet); } • This code cannot be called until it is compiled - see below • greet() is exclusively a C++ function • BOOST_PYTHON_MODULE(hello_ext) defines a module that can be called from python as import hello_ext • using namespace boost::python allowsdef instead of boost::python::def each time • The method boost::python::def defines the second argument (the C++ function) greet as a python function (named by the string) "greet". We call this function in python as import hello_ext hello_ext.greet() if we had used def("example", greet); we would call in python as hello_ext.example() ## Compiling with Python The following python module can be used to compile and link the code in hippopotaplusplus.cpp to a python-friendly module from distutils.core import setup from distutils.extension import Extension ext_instance = Extension( 'hello_ext', sources=['hippopotaplusplus.cpp'], libraries=['boost_python-mt'], ) setup( name='hello-world', version='0.1', ext_modules=[ext_instance]) • The first string argument of Extension must match the variable declared in the .cpp file by the line BOOST_PYTHON_MODULE(variable) in this case it is hello_ext. If this is incorrect, the code will compile but an attempt to import in python will produce the error ImportError: dynamic module does not define init function • The augment sources defines the cpp file that is to be compiled • The argument of libraries=['boost_python-mt'] is the required library installed by homebrew • The argument to ext_modules=[ext_instance]) points to the instance of Extension • The name and version are really for building packages and not necessary here • For more details, the [official docs][5] for distutils are well written ## Running The code is compiled with python through the following command python setup.py build_ext --inplace which will create the following build/ directory and file build/ hello_ext.so  This can be directly now called by python with import hello_ext hello_ext.greet() # Wrapping multiple functions in a module We can define several functions in external files by moving greet to its own file greet.cpp and creating another file with the function meet saved as meet.cpp char const* meet() { return "Nice to meet you from Boost!"; } We can link this in the following manner with hippopotaplusplus.cpp as #include <greet.hpp> #include <meet.hpp> #include <boost/python.hpp> BOOST_PYTHON_MODULE(hello_ext) { using namespace boost::python; def("greet", greet); def("meet", meet); } ## Compiling The setup.py file to compile the code is structured exactly the same as above ## Running The running of the code is also exactly the same as above except now we can call both functions import hello_ext hello_ext.greet() hello_ext.meet() # References This is the second part of a multi-part guide on GPU cloud computing for Deep Learning This entry demonstrates how you can offload computational tasks to an Amazon Elastic Compute Cloud (EC2) instance through Amazon Web Services (AWS). The guide focuses on CUDA support for Theano. # Requirements • Can set up an EC2 Instance - see part one • Familiarity with Linux and Bash e.g. sudo, wget, export • Familiarity with Ubuntu for apt-get # Contents # Connect Connect to the Instance through SSH. Assuming you followed part 1 this is just ssh ubuntu@[DNS] # Load Software See the references for the sources of these instructions. This code is almost identical with a few tweaks. Note you will have to do this each time you start a new Instance You can download this code as install.sh # update software sudo apt-get update sudo apt-get -y dist-upgrade # install dependencies sudo apt-get install -y gcc g++ gfortran build-essential \ git wget linux-image-generic libopenblas-dev \ python-dev python-pip ipython python-nose\ python-numpy python-scipy\ linux-image-extra-virtual\ gnuplot-qt # a lot quicker than matplotlib for runtime plots # install bleeding edge theano sudo pip install --upgrade --no-deps git+git://github.com/Theano/Theano.git # get CUDA sudo wget http://developer.download.nvidia.com/compute/cuda/repos/ubuntu1404/x86_64/cuda-repo-ubuntu1404_7.0-28_amd64.deb # depackage and install CUDA sudo dpkg -i cuda-repo-ubuntu1404_7.0-28_amd64.deb sudo apt-get update sudo apt-get install -y cuda # update PATH variables { echo -e "\nexport PATH=/usr/local/cuda/bin:\$PATH";
echo -e "export LD_LIBRARY_PATH=/usr/local/cuda/lib64";
} >> ~/.bashrc

# reboot for CUDA
sudo reboot

After waiting about a minute for the reboot, ssh back into the Instance

You can download this code as cuda_check.sh

# install included samples and test cuda
ver=8.0 # version number -- will get a more robust method in a later edit
echo "CUDA version: ${ver}" /usr/local/cuda/bin/cuda-install-samples-${ver}.sh ~/
cd NVIDIA\_CUDA-${ver}\_Samples/1\_Utilities/deviceQuery make ./deviceQuery  Make sure the test shows that a GPU exists - common errors are listed here. If you don’t have a GPU then skip the next step or use a GPU EC2 Instance # set up the theano config file to use gpu by default { echo -e "\n[global]\nfloatX=float32\ndevice=gpu"; echo -e "[mode]=FAST_RUN"; echo -e "\n[nvcc]"; echo -e "fastmath=True"; echo -e "\n[cuda]"; echo -e "root=/usr/local/cuda"; }>> ~/.theanorc Install any other dependencies you may require. ## Get CuDNN To obtain CuDNN you must register with the NVIDIA developer programme here. The download page for CuDNN is here and it’s simplest to download the latest Library for Linux to your local machine and scp it over to EC2 as follows scp -r ~/Downloads/cudnn-8.0-linux-x64-v5.1.tar ubuntu@[DNS]:/home/ubuntu/ where the [DNS] needs to be entered and the filename will differ as the software is updated. Once the scp has transferred the file move to the active ssh terminal instance in EC2 and do the following to install CuDNN tar -xvf cudnn-8.0-linux-x64-v5.1-tgz # use tar -xzf cudnn-8.0-linux-x64-v5.1-tgz if the extension is .tgz cd cuda sudo cp lib64/* /usr/local/cuda/lib64/ sudo cp include/* /usr/local/cuda/include/ Now add the following to enable CNMeM echo -e "\n[lib]\ncnmem=0.5" >> ~/.theanorc A value between 0-1 allocates this fraction of GPU memory to theano so here we allocate 50% to not be stingy. Now check that theano is configured properly by opening ipython and running import theano.sandbox.cuda theano.sandbox.cuda.use("gpu0") which gave me the output Using gpu device 0: Tesla K80 (CNMeM is enabled with initial size: 50.0% of memory, cuDNN 5105) # Run Code Transfer the relevant code across the the Cloud e.g. • Pull from an existing git repository • scp files across If you are running code in a Spot Instance, I would recommend saving results at runtime and passing them back to your local machine. It is sensible to pickle the state of the neural net at runtime so that you can easily continue the training process from a saved state rather than having to run again from scratch. # Close Don’t forget to Stop or Close the instance once it has completed the task! Make sure that you check the instance has been closed in addition to the Spot request in the dashboard! I received a 31 hour bill for an unclosed GPU Compute instance that I had thought I closed which was rather annoying. In theory this can be automated by running the following as root after code has been executed shutdown -h now but now I don’t particularly trust the methodology in practice. # Common Errors ## CUDA Failures A few common errors encountered with installing CUDA ### No GPU If no GPU exists you will receive the following output ./deviceQuery Starting... CUDA Device Query (Runtime API) version (CUDART static linking) NVIDIA: no NVIDIA devices found cudaGetDeviceCount returned 30 -> unknown error Result = FAIL ubuntu@ip-172-31-36-215:~/NVIDIA_CUDA-8.0_Samples/1_Utilities/deviceQuery$ ./deviceQuery
deviceQuery        deviceQuery.cpp    deviceQuery.o      Makefile           NsightEclipse.xml  readme.txt
ubuntu@ip-172-31-36-215:~/NVIDIA_CUDA-8.0_Samples/1_Utilities/deviceQuery$./deviceQuery ./deviceQuery Starting... CUDA Device Query (Runtime API) version (CUDART static linking) NVIDIA: no NVIDIA devices found cudaGetDeviceCount returned 30 -> unknown error Result = FAIL The resolution is to cancel the instance and get a GPU instance if you require CUDA support. ### Unknown symbol in module This is a slightly more complicated issue that arose since CUDA 7.5 ./deviceQuery Starting... CUDA Device Query (Runtime API) version (CUDART static linking) modprobe: ERROR: could not insert 'nvidia_361_uvm': Unknown symbol in module, or unknown parameter (see dmesg) cudaGetDeviceCount returned 30 -> unknown error Result = FAIL The resolution for this is fairly simple and means that you didn’t install linux-image-extra-virtual as above. This is probably because you followed one of the guides in the references which are now out of date. Simply run this line # install the required library sudo apt-get install linux-image-extra-virtual # restart instance sudo reboot then wait a minute or so for a restart and ssh back in and run the CUDA check again which should give the following at the end of the output deviceQuery, CUDA Driver = CUDART, CUDA Driver Version = 8.0, CUDA Runtime Version = 8.0, NumDevs = 1, Device0 = Tesla K80 Result = PASS # References This is the first part of a multi-part guide on GPU cloud computing for Deep Learning This guide more generally demonstrates how to register for Amazon Web Services and set up the Amazon Elastic Compute Cloud (EC2). # Contents # Requirements • A Linux OS. I have a Macbook • A credit / debit card • A phone number you are willing to use # Registering for Amazon Web Services • Visit AWS and Create a Free Account The process is straightforward and the usual series of personal details (you need to use a real phone number), email verifications and captcha images to fill out. I registered a Personal Account. You will also be asked to submit card details as this is a paid service. You will then be asked to receive a call from AWS to verify a PIN. This call occurs immediately after requesting. I then chose the Basic support plan on the premise that it would be easier to upgrade than downgrade. You will then receive an email within a few seconds confirming the account is ready. You can then Sign In again. ### Free Tier Half way through writing this guide I received a nice email stating Thank you for creating an Amazon Web Services (AWS) account. For the next 12 months, you will have free access to compute, storage, database, and application services. Learn more by visiting our Free Tier page. The GPU computing instances are not free unfortunately. However, a Free Tier instance is a good way of familiarising with launch and connection to EC2. # Create an Amazon Machine Instance • These images show how to set up an Instance • The Free Tier will be used for this example but GPU Compute should be used in production ## Set Region ## Create Key-Value Pair This will save some effort later. Create a key-value pair and make sure to save the .pem file securely after doing so. ## Storing the Key Properly It is advisable to store the key with the following set of commands #set filename filename=pem_filename.pem # create a root directory for SSH mkdir ~/.ssh # sometimes the files come from amazon as ".pem.txt" mv ~/Downloads/${filename}* ~/.ssh/${filename} # change permissions so only root can read the file chmod 400 ~/.ssh/${filename}
{
echo "Host *amazonaws.com"

# Usage

The simplest way to use this template is to ensure that res.cls and cv.tex are in the same directory and then just remember to swap out all my details for yours.

1. ShareLaTeX is a good example of such a site.