Model Building - I¶
Model building is the process of creating a machine learning model that can predict outcomes based on input data.
Once you have preprocessed your data, the next step is to choose an appropriate machine learning algorithm.
What is an ML model?
In a nutshell,
- It's a program that can find patterns or make decisions from a previously unseen dataset.
- You train a model over a set of data, providing it an algorithm that it can use to reason over and learn from the data.
- Once you have trained the model, you can use it to reason over data that it hasn't seen before and make predictions about it (see the short sketch after this list).
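As a minimal sketch of this train-then-predict workflow (the data and the choice of model here are invented purely for illustration):
from sklearn.tree import DecisionTreeClassifier
# Hypothetical labelled data: [hours studied, hours slept] -> passed the exam (1) or not (0)
X_train = [[8, 7], [1, 4], [6, 8], [2, 5], [9, 6], [0, 9]]
y_train = [1, 0, 1, 0, 1, 0]
model = DecisionTreeClassifier()  # choose an algorithm
model.fit(X_train, y_train)       # train it on labelled examples
print(model.predict([[7, 6], [1, 8]]))  # predict on data the model has never seen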
Types of Machine Learning¶
Why do we even have types?
At its core, a machine learning model is just math. Most of the time, unfortunately, complex math.
Machine Learning sits at the intersection of statistics and computer science, yet it can still wear many different masks.
When we say Types of machine learning, we actually refer to the type of algorithm used for machine learning.
So, for various applications, we have different machine learning algorithms.
Some algorithms give more accurate results than others.
There are four types of machine learning:
- Supervised Learning
- Unsupervised Learning
- Semi-supervised Learning (a hybrid of supervised and unsupervised learning)
- Reinforcement Learning
Here is a brief description:
Supervised Learning¶
Involves machine learning algorithms that learn in the presence of a supervisor.
- Relies on labelled input and output training data.
- The example-label pairs are fed one by one, allowing the algorithm to predict the label for each example, giving it feedback as to whether it predicted the right answer or not.
- Over time, the algorithm will learn to approximate the nature of the relationship between examples and their labels.
- Training process continues until the model achieves a desired level of accuracy on the training data.
Here is a simple analogy:
Consider a programming instructor teaching students how to write code in a new programming language.
The instructor starts by teaching the basic syntax and programming concepts, and then has the students practice writing code to solve simple problems.
As we practice, the instructor corrects our mistakes and provides feedback on our code structure, design, variable names, program flow, etc.
Over time, we become more proficient in the language and can start to write more complex programs and solve more challenging problems. The instructor continues to provide feedback and guidance, helping us improve our programming skills.
ML Algorithms:
- Regression
- Decision Tree
- Random Forest
- KNN
- Logistic Regression
Advantages:
- The model learns from past experiences, i.e., the introduced data.
- Availability of a significantly larger pool of algorithms compared to others.
Disadvantages:
- Challenging and time-consuming to label massive amounts of data
- Difficult to predict accurately if the distribution of the test data differs significantly from that of the training dataset.
Applications:
- Speech recognition
- Image classification
- Spam detection
- Weather Forecast
- Face Recognition
Throughout this session, we shall focus on ML models based on supervised learning.
Unsupervised Learning¶
Unsupervised learning is very much the opposite of supervised learning. It features no labels. Instead, our algorithm would be fed a lot of data and given the tools to understand the properties of the data. From there, it can learn to group, cluster, and/or organize the data in a way such that a human (or other intelligent algorithm) can come in and make sense of the newly organized data.
Unsupervised learning is particularly useful in finding unknown patterns in a dataset. It aids in finding features needed for categorization. Your images, videos, or any data provided doesn’t have to be annotated or labeled.
Involves learning from unlabeled data without any supervision.
- No predefined labels or output for the data, and the algorithm must identify patterns or relationships on its own.
- Algorithm is trained on the input data and tasked with finding hidden structures or relationships within it.
- The algorithm continues to learn and refine its understanding of the data until it can identify meaningful patterns and structures (see the short sketch after this list).
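As a tiny illustration (the points below are made up), a clustering algorithm can group data without ever being shown labels; we will look at K-Means in more detail later in this session:
from sklearn.cluster import KMeans
# Made-up 2D points that form two visible groups; no labels are supplied
points = [[1.0, 1.1], [0.9, 1.0], [1.2, 0.8],
          [8.0, 8.1], [7.9, 8.3], [8.2, 7.9]]
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(points)
print(kmeans.labels_)  # e.g. [0 0 0 1 1 1] - the algorithm discovers the groups on its own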
Here is a simple analogy:
Think about when you first arrived at college, and wanted to make friends with other students who share similar interests, but you didn't know anyone. To find other like-minded students, you attend different events and gatherings on campus, such as club fairs, sports games, and academic talks.
As you interact with other students and attend different events, you start to notice that certain groups of students tend to congregate together and share common interests.
For example, you might notice that there is a group of students who are passionate about technology and often participate in hackathons and coding competitions.
As you continue to attend different events and interact with more students, you start to identify clusters of students based on their interests and activities.
ML Algorithms:
- Clustering
- Principal Component Analysis (PCA)
- Singular Value Decomposition (SVD)
- Independent Component Analysis (ICA)
- K-Means Clustering
Advantages:
- Unsupervised learning does not require labeled data, which can be difficult and time-consuming to obtain.
- This type of learning can uncover new insights and patterns that may not have been noticed by humans.
Disadvantages:
- Evaluating the accuracy of an unsupervised learning algorithm is often more difficult than with supervised learning, as there is no clear way to measure correctness.
- Unsupervised learning can be computationally expensive and may require more complex algorithms and hardware to achieve good results.
Applications:
- Anomaly detection
- Data compression
- Image segmentation
- Clustering customers based on behavior or preferences
- Recommender systems
Semi-Supervised Learning¶
Involves machine learning algorithms that learn from a combination of labelled and unlabelled data.
- Typically, a small set of labelled data is provided, along with a much larger set of unlabelled data.
- Algorithm uses the labelled data to learn patterns in the data and then applies this learning to the unlabelled data.
- The goal is to use the unlabelled data to improve the accuracy of the model beyond what could be achieved with just the labelled data (see the sketch after this list).
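As a rough sketch of this idea, scikit-learn's SelfTrainingClassifier wraps a supervised model and uses -1 to mark unlabelled samples; the data and numbers below are synthetic, chosen only to illustrate the API:
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.semi_supervised import SelfTrainingClassifier
rng = np.random.default_rng(0)
# Two synthetic clusters of points; only four of them keep their labels,
# the rest are marked as unlabelled with -1 (scikit-learn's convention)
X = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(4, 1, (50, 2))])
y = np.array([0] * 50 + [1] * 50)
y_partial = np.full(100, -1)
y_partial[[0, 1, 50, 51]] = y[[0, 1, 50, 51]]
model = SelfTrainingClassifier(LogisticRegression())
model.fit(X, y_partial)
print(model.predict([[0, 0], [4, 4]]))  # expected: [0 1]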
Here is a simple analogy:
Think of it like a chef creating a new recipe with a mix of familiar and unfamiliar ingredients. By experimenting with the known ingredients, the chef identifies patterns and builds a foundation for the recipe. Then, by introducing the unfamiliar ingredients and using the patterns identified with the known ingredients as a guide, the chef is able to create a more complex and interesting recipe than they could have with just the familiar ingredients alone.
ML Algorithms:
- Self-Training
- Co-Training
- Multi-View Learning
- Semi-Supervised SVM
- Graph-Based Learning
Advantages:
- Less time-consuming and costly than fully-supervised learning, as it requires less labelled data.
- Can improve the accuracy of a model by incorporating additional unlabelled data.
Disadvantages:
- May not be as accurate as fully-supervised learning, since it relies on unlabelled data to provide additional information.
- Difficult to balance the amount of labelled and unlabelled data for optimal performance.
Applications:
- Text classification
- Image classification
- Anomaly detection
- Protein sequence analysis
- Speech recognition
Reinforcement Learning¶
Involves machine learning algorithms that learn through interactions with an environment.
- The algorithm learns by taking actions in an environment and receiving feedback in the form of rewards or penalties.
- The goal is to learn a policy, or a set of rules for selecting actions, that maximizes the cumulative reward over time (see the sketch after this list).
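As a rough sketch, here is tabular Q-learning on a toy "corridor" environment; the environment, rewards, and hyperparameters are all invented for illustration:
import numpy as np
# Toy environment: 5 states in a row, start at state 0, reward +1 for reaching state 4
# Actions: 0 = move left, 1 = move right
n_states, n_actions = 5, 2
Q = np.zeros((n_states, n_actions))    # Q-table: expected return for each (state, action)
alpha, gamma, epsilon = 0.1, 0.9, 0.2  # learning rate, discount factor, exploration rate
rng = np.random.default_rng(0)
for episode in range(500):
    state = 0
    while state != 4:
        # epsilon-greedy: usually exploit the best known action, sometimes explore
        action = rng.integers(n_actions) if rng.random() < epsilon else int(Q[state].argmax())
        next_state = max(0, state - 1) if action == 0 else min(4, state + 1)
        reward = 1.0 if next_state == 4 else 0.0
        # Q-learning update rule
        Q[state, action] += alpha * (reward + gamma * Q[next_state].max() - Q[state, action])
        state = next_state
print(Q.argmax(axis=1))  # learned policy: states 0-3 should prefer action 1 (move right)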
Here is another analogy:
Think about the time you were deciding which college extracurricular activities to participate in. You don't know which activities will be the most enjoyable or rewarding, so you try out different options and observe how they affect your mood and overall experience.
For example, you might try out for the sports team, music club, or have joined the debate club. After participating in each activity for a while, you evaluate how much you enjoyed it and how much it contributed to your personal growth.
Over time, you start to learn which activities are the most enjoyable and beneficial for you. You begin to focus on these activities more and prioritize them over less enjoyable or less rewarding activities.
ML Algorithms:
- Q-Learning
- SARSA
- Deep Q-Networks (DQN)
- Policy Gradients
- Actor-Critic
Advantages:
- Can learn optimal policies in complex environments where the optimal solution is not known or is too difficult to calculate.
- Can learn from experience without the need for labelled data.
Disadvantages:
- Can be time-consuming and computationally expensive to train, especially for complex environments.
- Can be difficult to design a reward function that accurately reflects the desired behavior.
Applications:
- Game playing
- Robotics
- Autonomous driving
- Recommendation systems
- Control systems for power grids, water treatment plants, etc.
An explanation in a nutshell:¶
As noted above, throughout this session we shall focus on ML models based on supervised learning.
Linear Regression¶
Does this look familiar?
You have all already done this for your physics lab!
The trendline option in Excel generates a regression based on the data provided.
We will be doing a similar thing in Python, but at a more advanced level.
A short explanation:
Linear regression is a statistical method used to model the relationship between a dependent variable and one or more independent variables. The goal of linear regression is to find the linear equation that best describes the relationship between the variables.
In a simple linear regression, there is one independent variable and one dependent variable.
The linear equation takes the form of:
$y = a + bx$
where y is the dependent variable, x is the independent variable, a is the intercept, and b is the slope. The intercept represents the value of y when x is equal to zero, and the slope represents the change in y for a one-unit increase in x.
All the math involved in building the model is abstracted away into functions, so that machine learning engineers need not implement the algorithm from scratch!
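For curiosity, here is what that abstracted math looks like for simple linear regression: the least-squares slope and intercept have a closed form, shown below on some made-up numbers and checked against NumPy's polyfit.
import numpy as np
# Made-up data points for illustration
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 4.3, 5.9, 8.2, 9.8])
# Closed-form least squares: b = cov(x, y) / var(x), a = mean(y) - b * mean(x)
b = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
a = y.mean() - b * x.mean()
print(a, b)                 # intercept and slope fitted by hand
print(np.polyfit(x, y, 1))  # [slope, intercept] - should match the values above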
Import necessary libraries
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
Loading the data set¶
The seaborn library comes with a few data sets built-in, let's check it out!
sns.get_dataset_names()
['anagrams', 'anscombe', 'attention', 'brain_networks', 'car_crashes', 'diamonds', 'dots', 'dowjones', 'exercise', 'flights', 'fmri', 'geyser', 'glue', 'healthexp', 'iris', 'mpg', 'penguins', 'planets', 'seaice', 'taxis', 'tips', 'titanic']
Let's load the tips dataset
tips_data = sns.load_dataset('tips')
tips_data
total_bill | tip | sex | smoker | day | time | size | |
---|---|---|---|---|---|---|---|
0 | 16.99 | 1.01 | Female | No | Sun | Dinner | 2 |
1 | 10.34 | 1.66 | Male | No | Sun | Dinner | 3 |
2 | 21.01 | 3.50 | Male | No | Sun | Dinner | 3 |
3 | 23.68 | 3.31 | Male | No | Sun | Dinner | 2 |
4 | 24.59 | 3.61 | Female | No | Sun | Dinner | 4 |
... | ... | ... | ... | ... | ... | ... | ... |
239 | 29.03 | 5.92 | Male | No | Sat | Dinner | 3 |
240 | 27.18 | 2.00 | Female | Yes | Sat | Dinner | 2 |
241 | 22.67 | 2.00 | Male | Yes | Sat | Dinner | 2 |
242 | 17.82 | 1.75 | Male | No | Sat | Dinner | 2 |
243 | 18.78 | 3.00 | Female | No | Thur | Dinner | 2 |
244 rows × 7 columns
Visualizing the data:¶
The Tips dataset is a data frame with 244 rows and 7 variables, representing tipping data recorded by one waiter for each tip he received over a period of a few months working in one restaurant.
The waiter collected several variables:
- Tip in dollars
- Bill in dollars
- Sex of the bill payer
- Whether there were smokers in the party
- Day of the week
- Time of day
- Size of the party.
Exploring the data:
sns.pairplot(tips_data)
<seaborn.axisgrid.PairGrid at 0x24f00869e80>
From the pairplot we can see that there is an almost linear correspondence between total_bill and tip.
Let's try to make a linear regression model between total_bill and tip, where the model predicts the tip given to a waiter based on the total bill the customer pays.
sns.scatterplot(x='total_bill', y='tip', data=tips_data)
<AxesSubplot:xlabel='total_bill', ylabel='tip'>
This will create a scatterplot of the total_bill vs tip variables.
Checking for a linear model:
sns.lmplot(x='total_bill', y='tip', data=tips_data)
<seaborn.axisgrid.FacetGrid at 0x24f0267b370>
This will add a regression line to the scatterplot, showing the relationship between 'total_bill' and 'tip'.
Training and Testing Data¶
Now that we've explored the data a bit, let's go ahead and split the data into training and testing sets.
First we will set a variable x equal to the total_bill column of the tips data and a variable y equal to the tip column.
y = tips_data['tip']
x = tips_data['total_bill']
Use model_selection.train_test_split from sklearn to split the data into training and testing sets.
TEST_SIZE = 0.2
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=TEST_SIZE, random_state=1)
Training the Model¶
Now it's time to train our model on our training data!
from sklearn.linear_model import LinearRegression
Creating an instance of a linear regression model:
lm = LinearRegression(fit_intercept=False)
Setting fit_intercept=False forces the intercept to zero, so the fitted line passes through the origin.
lm.fit(x_train, y_train)
--------------------------------------------------------------------------- ValueError Traceback (most recent call last) ...
ValueError: Expected 2D array, got 1D array instead: array=[16.99 19.77 31.71 ... 17.47 10.07 16.93].
Reshape your data either using array.reshape(-1, 1) if your data has a single feature or array.reshape(1, -1) if it contains a single sample.
Hmm, why is the above statement giving an error?
Because in most cases linear regression is multivariate, i.e., y depends on more than one variable.
Our x is a one-dimensional array, but the model expects a two-dimensional array of shape (number of samples, number of features).
In our case, we could consider the size column of the data set to take the dependence on party size into account as well.
random_state sets the seed for the random number generator so that the results we get can be reproduced.
test_size sets aside a fraction of the data for testing purposes.
TEST_SIZE = 0.1
RANDOM_STATE = 2
x = tips_data[['total_bill', 'size']]
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=TEST_SIZE, random_state=RANDOM_STATE)
lm.fit(x_train, y_train)
LinearRegression(fit_intercept=False)
print('Coefficients: \n', lm.coef_)
Coefficients: [0.09632358 0.37962516]
But if you look at the pair plot from the initial steps, there is no correlation between size and the waiter's tip.
So, you could reshape the original total_bill array, as suggested by the error message.
Reshape your data either using array.reshape(-1, 1) if your data has a single feature or array.reshape(1, -1) if it contains a single sample.
x = tips_data['total_bill'].values.reshape(-1, 1)
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=TEST_SIZE, random_state=RANDOM_STATE)
lm.fit(x_train, y_train)
LinearRegression(fit_intercept=False)
print('Coefficients: \n', lm.coef_)
Coefficients: [0.14113406]
Predicting Test Data¶
Now that we have fit our model, let's evaluate its performance by predicting off the test values!
tip_predictions = lm.predict(x_test)
tip_predictions
array([4.91569945, 3.60738668, 1.20246223, 2.30330792, 3.38016083, 4.8408984 , 1.4028726 , 2.83397201, 3.5283516 , 6.82100931, 1.44944684, 2.48254819, 2.98639679, 2.40915847, 1.34783031, 3.34205464, 1.94906142, 2.25532234, 1.6357438 , 2.79586581, 2.29625122, 2.1706419 , 3.39568558, 2.79727715, 2.43597394])
And these are the actual values:
y_test
85 5.17 54 4.34 126 1.48 93 4.30 113 2.55 141 6.70 53 1.56 65 3.15 157 3.75 212 9.00 10 1.71 64 2.64 89 3.00 71 3.00 30 1.45 3 3.31 163 2.00 84 2.03 217 1.50 191 4.19 225 2.50 101 3.00 35 3.60 24 3.18 152 2.74 Name: tip, dtype: float64
Let's now create a scatterplot of the real test values versus the predicted values.
plt.scatter(y_test, tip_predictions)
plt.xlabel('Y Test')
plt.ylabel('Predicted Y')
Text(0, 0.5, 'Predicted Y')
Evaluating the model:¶
You can use metrics such as the mean squared error (MSE) or the coefficient of determination (R-squared) to evaluate the accuracy of the linear regression model.
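For reference, with $y_i$ the actual values, $\hat{y}_i$ the predicted values, and $\bar{y}$ the mean of the actual values, the two metrics are defined as:
$MSE = \frac{1}{n}\sum_{i=1}^{n}(y_i - \hat{y}_i)^2$
$R^2 = 1 - \frac{\sum_{i}(y_i - \hat{y}_i)^2}{\sum_{i}(y_i - \bar{y})^2}$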
from sklearn.metrics import mean_squared_error, r2_score
mse = mean_squared_error(y_test, tip_predictions)
r2 = r2_score(y_test, tip_predictions)
print('Mean squared error:', mse)
print('R-squared:', r2)
Mean squared error: 0.688445648107554 R-squared: 0.7602517788014659
The lower the MSE, the more accurate the model; an MSE of zero would indicate a perfect fit.
R-squared gives an accuracy measure that can be expressed as a percentage.
accuracy = round(r2*100, 2)
print(f'Model accuracy ≈ {accuracy} %')
Model accuracy ≈ 76.03 %
Let's compare the predicted and actual values:
tip_predictions = [round(prediction, 2) for prediction in tip_predictions]  # round the predictions to 2 decimal places for readability
y_test = list(y_test)
comparison_df = pd.DataFrame({'Predicted': tip_predictions, 'Actual': y_test})
comparison_df
Predicted | Actual | |
---|---|---|
0 | 4.92 | 5.17 |
1 | 3.61 | 4.34 |
2 | 1.20 | 1.48 |
3 | 2.30 | 4.30 |
4 | 3.38 | 2.55 |
5 | 4.84 | 6.70 |
6 | 1.40 | 1.56 |
7 | 2.83 | 3.15 |
8 | 3.53 | 3.75 |
9 | 6.82 | 9.00 |
10 | 1.45 | 1.71 |
11 | 2.48 | 2.64 |
12 | 2.99 | 3.00 |
13 | 2.41 | 3.00 |
14 | 1.35 | 1.45 |
15 | 3.34 | 3.31 |
16 | 1.95 | 2.00 |
17 | 2.26 | 2.03 |
18 | 1.64 | 1.50 |
19 | 2.80 | 4.19 |
20 | 2.30 | 2.50 |
21 | 2.17 | 3.00 |
22 | 3.40 | 3.60 |
23 | 2.80 | 3.18 |
24 | 2.44 | 2.74 |
Linear regression is a powerful method for modeling the relationship between input features and target variables, and Seaborn makes it easy to perform and visualize linear regression in Python.
Logistic Regression¶
It's similar to linear regression, with a key difference: linear regression is used to handle regression problems, whereas logistic regression is used to handle classification problems. Linear regression provides a continuous output, but logistic regression provides a discrete output.
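Concretely, logistic regression passes the same linear combination through the sigmoid (logistic) function, which squashes it into a probability between 0 and 1; the predicted class is then obtained by thresholding that probability (typically at 0.5):
$p(y = 1 \mid x) = \frac{1}{1 + e^{-(a + bx)}}$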
%matplotlib inline
import io
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
import scipy.stats
import scipy.special
import seaborn as sns
sns.set_style('white')
sns.set_context('notebook')
Spider data from Suzuki et al. (2006). In the following cells, the analysis we run is the same as the analysis made in the paper, though we won't pursue that direction any further here.
data = """Grain size (mm) Spiders
0.245 absent
0.247 absent
0.285 present
0.299 present
0.327 present
0.347 present
0.356 absent
0.36 present
0.363 absent
0.364 present
0.398 absent
0.4 present
0.409 absent
0.421 present
0.432 absent
0.473 present
0.509 present
0.529 present
0.561 absent
0.569 absent
0.594 present
0.638 present
0.656 present
0.816 present
0.853 present
0.938 present
1.036 present
1.045 present
"""
df = pd.read_table(io.StringIO(data))
df.Spiders = df.Spiders == 'present'
df.head()
Grain size (mm) | Spiders | |
---|---|---|
0 | 0.245 | False |
1 | 0.247 | False |
2 | 0.285 | True |
3 | 0.299 | True |
4 | 0.327 | True |
df
Grain size (mm) | Spiders | |
---|---|---|
0 | 0.245 | False |
1 | 0.247 | False |
2 | 0.285 | True |
3 | 0.299 | True |
4 | 0.327 | True |
5 | 0.347 | True |
6 | 0.356 | False |
7 | 0.360 | True |
8 | 0.363 | False |
9 | 0.364 | True |
10 | 0.398 | False |
11 | 0.400 | True |
12 | 0.409 | False |
13 | 0.421 | True |
14 | 0.432 | False |
15 | 0.473 | True |
16 | 0.509 | True |
17 | 0.529 | True |
18 | 0.561 | False |
19 | 0.569 | False |
20 | 0.594 | True |
21 | 0.638 | True |
22 | 0.656 | True |
23 | 0.816 | True |
24 | 0.853 | True |
25 | 0.938 | True |
26 | 1.036 | True |
27 | 1.045 | True |
df["Spiders"]
0 False 1 False 2 True 3 True 4 True 5 True 6 False 7 True 8 False 9 True 10 False 11 True 12 False 13 True 14 False 15 True 16 True 17 True 18 False 19 False 20 True 21 True 22 True 23 True 24 True 25 True 26 True 27 True Name: Spiders, dtype: bool
plt.scatter(df["Grain size (mm)"], df["Spiders"])
plt.ylabel('Spiders present?')
sns.despine()
import sklearn.linear_model
scikit-learn has a logistic regression classifier which uses regularization by default. To effectively eliminate regularization, we set the regularization parameter C to $10^{12}$.
# C=1e12 is effectively no regularization - see https://github.com/scikit-learn/scikit-learn/issues/6738
clf = sklearn.linear_model.LogisticRegression(C=1e12, random_state=0)
clf.fit(df['Grain size (mm)'].values.reshape(-1, 1), df['Spiders'])
print(clf.intercept_, clf.coef_)
[-1.64761964] [[5.12153717]]
def plot_log_reg(x, y, data, clf, xmin=None, xmax=None, alpha=1, ax=None):
    if ax is None:
        fig, ax = plt.subplots()
    else:
        fig = ax.figure
    ax.scatter(data[x], data[y], color='black', zorder=20, alpha=alpha)
    if xmin is None:
        xmin = data[x].min()
    if xmax is None:
        xmax = data[x].max()
    X_test = np.linspace(xmin, xmax, 300)
    # logistic curve: expit(coef * x + intercept) gives the predicted probability of presence
    loss = scipy.special.expit(X_test * clf.coef_ + clf.intercept_).ravel()
    ax.plot(X_test, loss, linewidth=3)
    ax.set_xlabel(x)
    ax.set_ylabel(y)
    fig.tight_layout()
    sns.despine()
    return fig, ax
plot_log_reg(x='Grain size (mm)', y='Spiders', data=df, clf=clf, xmin=0, xmax=1.5);
KNN (K-Nearest-Neighbors)¶
KNN is a simple concept: define some distance metric between the items in your dataset, and find the K closest items. You can then use those items to predict some property of a test item, by having them "vote" on it in some way.
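For reference, scikit-learn ships a ready-made KNN implementation; here is a minimal sketch with invented numbers using KNeighborsRegressor. Below, however, we will build the neighbour search by hand to see how it works.
from sklearn.neighbors import KNeighborsRegressor
# Invented features (say, [normalized popularity, a genre-similarity score]) and a numeric target
X = [[0.10, 0.0], [0.20, 0.1], [0.80, 0.9], [0.90, 1.0], [0.85, 0.95]]
y = [2.5, 3.0, 4.5, 4.8, 4.6]
knn = KNeighborsRegressor(n_neighbors=3)  # average the targets of the 3 closest items
knn.fit(X, y)
print(knn.predict([[0.95, 0.9]]))         # prediction based on the 3 nearest neighbours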
As an example, let's look at the MovieLens data. We'll try to guess the rating of a movie by looking at the 10 movies that are closest to it in terms of genres and popularity.
To start, we'll load up every rating in the data set into a Pandas DataFrame:
import pandas as pd
r_cols = ['user_id', 'movie_id', 'rating']
ratings = pd.read_csv('u.data', sep='\t', names=r_cols, usecols=range(3))
ratings.head()
user_id | movie_id | rating | |
---|---|---|---|
0 | 0 | 50 | 5 |
1 | 0 | 172 | 5 |
2 | 0 | 133 | 1 |
3 | 196 | 242 | 3 |
4 | 186 | 302 | 3 |
Now, we'll group everything by movie ID, and compute the total number of ratings (each movie's popularity) and the average rating for every movie:
import numpy as np
movieProperties = ratings.groupby('movie_id').agg({'rating': [np.size, np.mean]})
movieProperties.head()
rating | ||
---|---|---|
size | mean | |
movie_id | ||
1 | 452 | 3.878319 |
2 | 131 | 3.206107 |
3 | 90 | 3.033333 |
4 | 209 | 3.550239 |
5 | 86 | 3.302326 |
The raw number of ratings isn't very useful for computing distances between movies, so we'll create a new DataFrame that contains the normalized number of ratings. So, a value of 0 means nobody rated it, and a value of 1 will mean it's the most popular movie there is.
movieNumRatings = pd.DataFrame(movieProperties['rating']['size'])
movieNormalizedNumRatings = movieNumRatings.apply(lambda x: (x - np.min(x)) / (np.max(x) - np.min(x)))
movieNormalizedNumRatings.head()
size | |
---|---|
movie_id | |
1 | 0.773585 |
2 | 0.222985 |
3 | 0.152659 |
4 | 0.356775 |
5 | 0.145798 |
Now, let's get the genre information from the u.item file. The way this works is there are 19 fields, each corresponding to a specific genre - a value of '0' means it is not in that genre, and '1' means it is in that genre. A movie may have more than one genre associated with it.
While we're at it, we'll put together everything into one big Python dictionary called movieDict. Each entry will contain the movie name, list of genre values, the normalized popularity score, and the average rating for each movie:
movieDict = {}
with open(r'u.item', encoding="ISO-8859-1") as f:
    for line in f:
        fields = line.rstrip('\n').split('|')
        movieID = int(fields[0])
        name = fields[1]
        genres = fields[5:25]
        genres = map(int, genres)
        movieDict[movieID] = (name, np.array(list(genres)), movieNormalizedNumRatings.loc[movieID].get('size'), movieProperties.loc[movieID].rating.get('mean'))
For example, here's the record we end up with for movie ID 1, "Toy Story":
print(movieDict[1])
('Toy Story (1995)', array([0, 0, 0, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]), 0.7735849056603774, 3.8783185840707963)
Now let's define a function that computes the "distance" between two movies based on how similar their genres are, and how similar their popularity is. Just to make sure it works, we'll compute the distance between movie ID's 2 and 4:
from scipy import spatial
def ComputeDistance(a, b):
    genresA = a[1]
    genresB = b[1]
    genreDistance = spatial.distance.cosine(genresA, genresB)
    popularityA = a[2]
    popularityB = b[2]
    popularityDistance = abs(popularityA - popularityB)
    return genreDistance + popularityDistance
ComputeDistance(movieDict[2], movieDict[4])
0.8004574042309892
Remember the higher the distance, the less similar the movies are. Let's check what movies 2 and 4 actually are - and confirm they're not really all that similar:
print(movieDict[2])
print(movieDict[4])
('GoldenEye (1995)', array([0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0]), 0.22298456260720412, 3.2061068702290076) ('Get Shorty (1995)', array([0, 1, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]), 0.3567753001715266, 3.550239234449761)
Now, we just need a little code to compute the distance between some given test movie (Toy Story, in this example) and all of the movies in our data set. We then sort those by distance and print out the K nearest neighbors:
import operator
def getNeighbors(movieID, K):
    distances = []
    for movie in movieDict:
        if (movie != movieID):
            dist = ComputeDistance(movieDict[movieID], movieDict[movie])
            distances.append((movie, dist))
    distances.sort(key=operator.itemgetter(1))
    neighbors = []
    for x in range(K):
        neighbors.append(distances[x][0])
    return neighbors
K = 10
avgRating = 0
neighbors = getNeighbors(1, K)
for neighbor in neighbors:
    avgRating += movieDict[neighbor][3]
    print(movieDict[neighbor][0] + " " + str(movieDict[neighbor][3]))
avgRating /= K
Liar Liar (1997) 3.156701030927835 Aladdin (1992) 3.8127853881278537 Willy Wonka and the Chocolate Factory (1971) 3.6319018404907975 Monty Python and the Holy Grail (1974) 4.0664556962025316 Full Monty, The (1997) 3.926984126984127 George of the Jungle (1997) 2.685185185185185 Beavis and Butt-head Do America (1996) 2.7884615384615383 Birdcage, The (1996) 3.4436860068259385 Home Alone (1990) 3.0875912408759123 Aladdin and the King of Thieves (1996) 2.8461538461538463
While we were at it, we computed the average rating of the 10 nearest neighbors to Toy Story:
avgRating
3.3445905900235564
movieDict[1]
('Toy Story (1995)', array([0, 0, 0, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]), 0.7735849056603774, 3.8783185840707963)
K Means Clustering and Elbow Method¶
from numpy import random, array

# Create fake income/age clusters for N people in k clusters
def createClusteredData(N, k):
    random.seed(10)
    pointsPerCluster = float(N) / k
    X = []
    for i in range(k):
        incomeCentroid = random.uniform(20000.0, 200000.0)
        ageCentroid = random.uniform(20.0, 70.0)
        for j in range(int(pointsPerCluster)):
            X.append([random.normal(incomeCentroid, 10000.0), random.normal(ageCentroid, 2.0)])
    X = array(X)
    return X
%matplotlib inline
from sklearn.cluster import KMeans
import matplotlib.pyplot as plt
from sklearn.preprocessing import scale
from numpy import random
data = createClusteredData(100, 5)
model = KMeans(n_clusters=5)
# Note I'm scaling the data to normalize it! Important for good results.
model = model.fit(scale(data))
# We can look at the clusters each data point was assigned to
print(model.labels_)
# And we'll visualize it:
plt.figure(figsize=(8, 6))
plt.scatter(data[:,0], data[:,1], c=model.labels_.astype(float))
plt.show()
[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 2 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2]
len(data)
100
Within-cluster sum of squares (WCSS):
- If it is zero, every data point is its own cluster, which is not helpful.
- If it is at its maximum, all data points are in a single cluster.
Hence we want a middle ground: a reasonably low WCSS value with a small number of clusters.
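Formally, with $C_k$ the set of points assigned to cluster $k$ and $\mu_k$ that cluster's centroid, WCSS (reported by scikit-learn as inertia_) is:
$WCSS = \sum_{k=1}^{K} \sum_{x_i \in C_k} \lVert x_i - \mu_k \rVert^2$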
model.inertia_
5.300772956616055
wcss = []
for i in range(1, 101):
    model = KMeans(i)
    model.fit(data)
    wcss_iter = model.inertia_
    wcss.append(wcss_iter)
number_clusters = range(1,101)
plt.plot(number_clusters, wcss)
plt.title("Elbow method")
plt.xlabel("Number of clusters")
plt.ylabel("Within cluster sum of square")
Text(0, 0.5, 'Within cluster sum of square')
number_clusters = range(1,11)
plt.plot(number_clusters, wcss[0:10])
plt.title("Elbow method")
plt.xlabel("Number of clusters")
plt.ylabel("Within cluster sum of square")
Text(0, 0.5, 'Within cluster sum of square')
From the graph we can conclude that 4 should be our choice for the number of clusters.