Neural Sports


NumPy CheatSheet

May 13, 2020

NumPy stands for Numerical Python and is popular in the scientific world, especially with Data Scientists, Quant Analysts, and Robotics Engineers. NumPy arrays are similar to the Python lists shown in the Python Cheatsheet; however, NumPy arrays offer ‘more efficient storage and data operations as the arrays grow larger in size’. The official NumPy website also has a great article on how NumPy is aiding cricket analysis.

Python lists are more flexible than NumPy arrays. By this I mean a single Python list can store different types, such as ['Arsenal', 'score', False, 69]. That flexibility has a cost: the list contains a pointer to a block of pointers, each of which in turn points to a full Python object in memory, whereas a NumPy array contains a single pointer to one contiguous block of values that all share the same type, which is why it is more efficient.
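To make the storage difference concrete, here is a minimal sketch. The variable names are illustrative, and the dtype is set to int64 explicitly (an assumption made here so the byte counts do not depend on the platform).

```python
import numpy as np

# Each slot of a Python list is a reference to a full Python object,
# so the elements are free to have different types.
mixed = ['Arsenal', 'score', False, 69]
element_types = [type(item).__name__ for item in mixed]
print(element_types)    # ['str', 'str', 'bool', 'int']

# A NumPy array stores raw values of one type in a single block of memory.
scores = np.array([1, 2, 3], dtype=np.int64)
print(scores.dtype)     # int64
print(scores.itemsize)  # 8 (bytes per element)
print(scores.nbytes)    # 24 (bytes for the whole data buffer)
```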

Creating an array


Arrays are constrained to a single type, e.g. either all integers or all floats.
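As a rough illustration of the single-type rule, the sketch below shows what happens when integers and floats are mixed: NumPy upcasts everything to one common type. The variable names are made up for this example.

```python
import numpy as np

ints = np.array([1, 2, 3])                 # all integers, so an integer array
floats = np.array([1, 2.5, 3])             # one float upcasts every element to float
forced = np.array([1, 2, 3], dtype=float)  # or ask for floats explicitly

print(ints.dtype.kind)  # i (some integer type, the exact size depends on the platform)
print(floats.dtype)     # float64
print(forced.tolist())  # [1.0, 2.0, 3.0]
```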

From a Python list:

import numpy as np

np.array([1, 2, 3]) # array([1, 2, 3])

Filled with a certain number of zeros:

np.zeros(5) # array([0., 0., 0., 0., 0.])

Filled with ones in a (2, 3) shape:

np.ones((2, 3)) # array([[1., 1., 1.], [1., 1., 1.]])

Filled with a specific number:

np.full((2, 3), 9) # array([[9, 9, 9], [9, 9, 9]])

Filled with a particular sequence, similar to range function in Python:

np.arange(0, 10, 2) # array([0, 2, 4, 6, 8])

Filled with random floats between 0 and 1:

np.random.random(3) # array([0.48357814, 0.10731914, 0.09018561])

If you want random integers within a range (the upper bound is excluded) in a matrix of a certain size:

np.random.randint(0, 10, (2, 3)) # array([[0, 3, 7], [0, 0, 9]])

Identity matrix:

np.eye(2) # array([[1., 0.], [0., 1.]])

If you want 50 evenly spaced numbers across an interval (50 is the default count):

np.linspace(0, 1)
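A quick sketch of how np.linspace behaves: the values are evenly spaced rather than random, both endpoints are included, and the third argument sets how many points you get. The variable names are illustrative.

```python
import numpy as np

default_points = np.linspace(0, 1)  # the number of points defaults to 50
five_points = np.linspace(0, 1, 5)  # ask for 5 points instead

print(len(default_points))   # 50
print(five_points.tolist())  # [0.0, 0.25, 0.5, 0.75, 1.0]
```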

Array indexing


x = np.array([1, 2, 3, 4, 5])

x[1] # 2

x[-1] # 5

Multidimensional arrays:

x = np.array([[1, 2, 3], [4, 5, 6]])

x[0, 1] # 2

x[1, 2] # 6

We can also manipulate items as normal:

x = np.zeros(5)

x[2] = 20 # array([ 0.,  0., 20.,  0.,  0.])

If you try to place a float into an integer-based array, the value will be truncated. For example, 3.6 becomes 3.
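The truncation is silent, so it is easy to miss. A small sketch of the behaviour, along with one possible way around it (converting to floats first with astype):

```python
import numpy as np

x = np.array([1, 2, 3])  # integer array
x[0] = 3.6               # the decimal part is dropped, not rounded
print(x.tolist())        # [3, 2, 3]

y = x.astype(float)      # convert to floats first if the decimals matter
y[0] = 3.6
print(y.tolist())        # [3.6, 2.0, 3.0]
```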

Array slicing


The slicing syntax follows the pattern x[start:stop:step]. Remember that indexing starts at 0 and that a slice goes up to, but does not include, the stop index.

x = np.array([1, 2, 3, 4, 5])

x[:3] # [1, 2, 3]

x[3:] # [4, 5]

x[1:4] # [2, 3, 4]

x[::2] # [1, 3, 5] every second element

x[1::2] # [2, 4] every second element, starting at index 1

x[::-1] # [5, 4, 3, 2, 1] reverses an array

Multidimensional array slicing


x = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])

x[:1, :3] # one row, three columns [[1, 2, 3]]

x[:3, :2] # all the rows, two columns [[1, 2], [4, 5], [7, 8]]

x[:,0] # access first column [1, 4, 7]

x[0,:] # access first row [1, 2, 3]

How slicing in NumPy differs from slicing in Python lists is that it creates a view of the data instead of a copy. So if you manipulate a value like below:

x = np.array([[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12]])

x_slice = x[:2, :3] # [[1, 2, 3], [5 ,6, 7]]

x_slice[0] = [99, 99, 99]

What this will actually do is change the first three values of the first row in x, because x_slice is a view of x. So x will look like:

array([[99, 99, 99,  4],
       [ 5,  6,  7,  8],
       [ 9, 10, 11, 12]])
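If you do not want changes to a slice to leak back into the original array, one option is to take an explicit copy. The sketch below uses np.shares_memory to check which of the two slices is still tied to x; the variable names are illustrative.

```python
import numpy as np

x = np.array([[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12]])

x_view = x[:2, :3]         # a view: shares memory with x
x_copy = x[:2, :3].copy()  # an independent copy of the same values

x_copy[0] = [99, 99, 99]   # only the copy changes
print(x[0].tolist())       # [1, 2, 3, 4]

print(np.shares_memory(x, x_view))  # True
print(np.shares_memory(x, x_copy))  # False
```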

Computations on arrays


Computations on NumPy arrays can be very fast, or very slow if not done properly. To make them fast we must use vectorized operations, which are made possible through NumPy's ufuncs (universal functions).

Loops in Python can be slow due to the type checking that has to be done on every item. This is put down to the dynamic nature of Python. NumPy instead performs one operation on the array, which is then applied to every element.

For example, to compute reciprocals in plain Python we would loop through the array and divide one by each element, whereas in NumPy we just write one divided by the array and the operation is applied to every value.

x = np.arange(5) # array([0, 1, 2, 3, 4])
print(x + 1) # [1 2 3 4 5]
print(x * 2) # [0 2 4 6 8]
print(x / 2) # [0.  0.5 1.  1.5 2. ]
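The reciprocal example described above can be sketched as follows. The function name reciprocal_loop is made up for this illustration, and the array starts at 1 to avoid dividing by zero. Both versions give the same numbers; the vectorized one simply hands the loop over to NumPy.

```python
import numpy as np

def reciprocal_loop(values):
    """Compute 1 / value one element at a time, the slow pure-Python way."""
    result = np.empty(len(values))
    for i in range(len(values)):
        result[i] = 1.0 / values[i]
    return result

values = np.arange(1, 6)  # [1 2 3 4 5]
looped = reciprocal_loop(values)
vectorized = 1.0 / values  # one expression, applied to every element

print(np.allclose(looped, vectorized))  # True
```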

These operators are wrappers around particular ufuncs (for example, + calls np.add). This image is taken from the Python Data Science Handbook.

[Image: Operators]
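As a quick check of the operator-to-ufunc relationship, the sketch below pairs a handful of operators with the NumPy functions they wrap and confirms that each pair gives the same result. The pairs dictionary is just a convenience for this example.

```python
import numpy as np

x = np.arange(5)  # [0 1 2 3 4]

# Each operator on the left is shorthand for the ufunc call on the right.
pairs = {
    '+': (x + 1, np.add(x, 1)),
    '-': (x - 1, np.subtract(x, 1)),
    '*': (x * 2, np.multiply(x, 2)),
    '/': (x / 2, np.divide(x, 2)),
    '//': (x // 2, np.floor_divide(x, 2)),
    '**': (x ** 2, np.power(x, 2)),
    '%': (x % 2, np.mod(x, 2)),
}

for symbol, (from_operator, from_ufunc) in pairs.items():
    print(symbol, np.array_equal(from_operator, from_ufunc))  # True for every pair
```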

Aggregations


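Aggregations can be super useful if you wish to find, for example, the sum of an array. The reduce method repeatedly applies an operation to the elements of an array until only a single result is left.

It should be noted that functions such as np.sum, np.min and np.max run across the whole array by default, regardless of how many rows or columns there are, whereas reduce works along the first axis unless you pass a different axis.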

x = np.array([1, 2, 3, 4])
np.add.reduce(x) # 10
np.multiply.reduce(x) # 24
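Two related behaviours are worth a short sketch: accumulate keeps every intermediate result instead of only the final one, and on a 2D array reduce collapses the first axis by default unless you pass axis=None. The grid array is a made-up example.

```python
import numpy as np

x = np.array([1, 2, 3, 4])
running_total = np.add.accumulate(x)  # keeps each intermediate sum
print(running_total.tolist())         # [1, 3, 6, 10]

grid = np.array([[1, 2, 3], [4, 5, 6]])
by_column = np.add.reduce(grid)              # default axis=0
everything = np.add.reduce(grid, axis=None)  # reduce over every axis

print(by_column.tolist())  # [5, 7, 9]
print(int(everything))     # 21
```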

Alternatively, instead of using the reduce method you can use functions like the sum function as well as the min and max.

x = np.array([5, 6, 7])
np.sum(x) # 18
x = np.array([9, 12, 18])
np.min(x) # 9
np.max(x) # 18

If you wish to find the sum, min or max over the rows or columns we can pass the axis to these functions.

x = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
x.min(axis=0) # minimum of each column: array([1, 2, 3])
x.max(axis=1) # maximum of each row: array([3, 6, 9])
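The axis argument trips a lot of people up, so here is a small sketch using sum on the same 3x3 array. One way to think about it: the axis you pass is the one that gets collapsed.

```python
import numpy as np

x = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])

total = x.sum()                 # no axis: the whole array
column_totals = x.sum(axis=0)   # collapse the rows, one total per column
row_totals = x.sum(axis=1)      # collapse the columns, one total per row

print(int(total))              # 45
print(column_totals.tolist())  # [12, 15, 18]
print(row_totals.tolist())     # [6, 15, 24]
```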

Further aggregations, also taken from the Python Data Science Handbook.

[Image: Aggregations]

Broadcasting (advanced)


Broadcasting is an advanced topic within NumPy but once understood, you will notice how handy it can be.

Broadcasting allows operations to be performed on arrays of different shapes, and it is one of NumPy's most distinctive features. Think of broadcasting as stretching an individual integer or float across an array so it can be applied to each element of another array based on the operator.

x = np.array([1, 1, 1])
x + 5 # array([6, 6, 6])

The 5 is being stretched to theoretically become [5, 5, 5]. It is important to think of it being stretched and not duplicated into an actual array of [5, 5, 5].

x = np.array([1, 1, 1])
y = np.zeros((3, 3))

x + y # array([[1., 1., 1.],
      #        [1., 1., 1.],
      #        [1., 1., 1.]])

[1, 1, 1] is stretched across the 3x3 array of 0’s.
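Both arrays can be stretched at once. In the sketch below a row of shape (3,) and a column of shape (3, 1) are broadcast into a 3x3 result. The last part uses np.broadcast_to to show that the stretch is not a real copy: a stride of 0 means the same value in memory is reused for every position. The variable names are illustrative.

```python
import numpy as np

row = np.arange(3)           # shape (3,)
column = row[:, np.newaxis]  # shape (3, 1)
table = row + column         # both are stretched to shape (3, 3)
print(table.tolist())        # [[0, 1, 2], [1, 2, 3], [2, 3, 4]]

stretched = np.broadcast_to(5, (3,))  # 5 viewed as if it were [5, 5, 5]
print(stretched.tolist())             # [5, 5, 5]
print(stretched.strides)              # (0,)
```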

For more advanced examples and a thorough in depth explanation of broadcasting please refer to the bible of Data Science that is Python Data Science Handbook by Jake VanderPlas.

Masking


If you’re familiar with Pandas, masking will make complete sense to you. If not, do not worry, I have a Pandas cheatsheet coming your way too. Masking allows you to extract values from a NumPy array based on boolean logic. For example, find all values greater than 2, or all values that are a multiple of 4.

You could just loop over the list and compare every value to some boolean logic and then store those values within another list but that is inefficient when you have NumPy at your disposal.

x = np.array([2, 4, 6, 8, 10])

np.sum(x > 3) # 4 - as there are 4 values greater than 3
np.sum(x % 2 == 1) # 0 - as every int is divisible by 2

You can also apply these masks directly onto the array.

x = np.array([5, 10, 15, 20])
x[x < 12] # array([5, 10])

If you wish to combine conditions with and or or, use & and | instead, and wrap each condition in parentheses. This is because and and or evaluate the truth of an entire object, whereas & and | operate on individual elements.
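A short sketch of combining masks. The parentheses around each comparison matter because & and | bind more tightly than < and >. The ~ operator, which inverts a mask, is included as an extra.

```python
import numpy as np

x = np.array([5, 10, 15, 20])

between = x[(x > 5) & (x < 20)]   # both conditions must hold
outside = x[(x < 10) | (x > 15)]  # either condition may hold
not_big = x[~(x > 10)]            # ~ flips True and False

print(between.tolist())  # [10, 15]
print(outside.tolist())  # [5, 20]
print(not_big.tolist())  # [5, 10]
```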

Sorting


Python has built-in sorting algorithms, and NumPy provides its own through np.sort. Knowing the sorting algorithms is seen as a fundamental of Computer Science. Of course they are a pain in the ass, but stick with them as they’ll be useful down the line. Here is a great resource from Toptal explaining the algorithms along with some lovely animations.

x = np.array([3, 8, 5, 9, 2])
np.sort(x) # array([2, 3, 5, 8, 9])

If you wish to return the indices that would sort the array:

x = np.array([3, 8, 5, 9, 2])
np.argsort(x) # array([4, 0, 2, 1, 3])
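Where argsort becomes handy is reordering a second array by the values of the first. In the sketch below the player names are hypothetical and exist only to show the idea: the indices from argsort are used to rank players by goals, highest first.

```python
import numpy as np

goals = np.array([3, 8, 5, 9, 2])
players = np.array(['Ann', 'Ben', 'Cat', 'Dan', 'Eve'])  # hypothetical names

order = np.argsort(goals)      # indices that sort goals in ascending order
print(goals[order].tolist())   # [2, 3, 5, 8, 9], same as np.sort(goals)

ranking = players[order[::-1]]  # reverse the order for highest first
print(ranking.tolist())         # ['Dan', 'Ben', 'Cat', 'Ann', 'Eve']
```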

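You can also sort along columns or rows of a multidimensional array independently using the axis argument shown earlier.

x = np.array([[3, 1], [2, 4]])
np.sort(x, axis=0) # sorts each column: array([[2, 1], [3, 4]])
np.sort(x, axis=1) # sorts each row: array([[1, 3], [2, 4]])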

Summary


And that’s it! NumPy is a very useful package if used correctly. I strongly recommend getting to grips with the basics of NumPy first before delving into Pandas as some operations that can be performed within Pandas make much more sense once the basics of NumPy are understood. The best thing about NumPy is how it will speed up certain operations that you’ve been using Python lists for. So next time you come across a numerical list of some sort, consider NumPy.

References


Harvard Labs

Stanford Labs

Python Data Science Handbook by Jake VanderPlas


Developed by Sean O'Connor, a sports and AI enthusiast.