Model Building - I¶
Model building is the process of creating a machine learning model that can predict outcomes based on input data.
Once you have preprocessed your data, the next step is to choose an appropriate machine learning algorithm.
What is an ML model?
In a nutshell,
- It's a program that can find patterns or make decisions from a previously unseen dataset.
- You train a model over a set of data, providing it an algorithm that it can use to reason over and learn from the data.
- Once you have trained the model, you can use it to reason over data that it hasn't seen before and make predictions about it (see the short sketch after this list).
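As a minimal sketch of this train-then-predict workflow (the data and the choice of model here are invented purely for illustration):
from sklearn.tree import DecisionTreeClassifier
# Hypothetical labelled data: [hours studied, hours slept] -> passed the exam (1) or not (0)
X_train = [[8, 7], [1, 4], [6, 8], [2, 5], [9, 6], [0, 9]]
y_train = [1, 0, 1, 0, 1, 0]
model = DecisionTreeClassifier()  # choose an algorithm
model.fit(X_train, y_train)       # train it on labelled examples
print(model.predict([[7, 6], [1, 8]]))  # predict on data the model has never seen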
Types of Machine Learning¶
Why do we even have types?
At its core, a machine learning model is just math. Most of the time, unfortunately, complex math.
Machine Learning sits at the intersection of statistics and computer science, yet it can still wear many different masks.
When we say Types of machine learning, we actually refer to the type of algorithm used for machine learning.
So, for various applications, we have different machine learning algorithms.
Some algorithms give more accurate results than others.
There are four types of machine learning:
- Supervised Learning
- Unsupervised Learning
- Semi-supervised Learning (a hybrid of supervised and unsupervised learning)
- Reinforcement Learning
Here is a brief description:
Supervised Learning¶
Involves machine learning algorithms that learn in the presence of a supervisor.
- Relies on labelled input and output training data.
- The example-label pairs are fed one by one, allowing the algorithm to predict the label for each example, giving it feedback as to whether it predicted the right answer or not.
- Over time, the algorithm will learn to approximate the nature of the relationship between examples and their labels.
- Training process continues until the model achieves a desired level of accuracy on the training data.
Here is a simple analogy:
Consider a programming instructor teaching students how to write code in a new programming language.
The instructor starts by teaching the basic syntax and programming concepts, and then has the students practice writing code to solve simple problems.
As we practice, the instructor corrects our mistakes and provides feedback on our code structure, design, variable names, program flow, etc.
Over time, we become more proficient in the language and can start to write more complex programs and solve more challenging problems. The instructor continues to provide feedback and guidance, helping us improve our programming skills.
ML Algorithms:
- Regression
- Decision Tree
- Random Forest
- KNN
- Logistic Regression
Advantages:
- The model learns from past experiences, i.e., the introduced data.
- Availability of a significantly larger pool of algorithms compared to others.
Disadvantages:
- Challenging and time-consuming to label massive amounts of data
- Difficult to predict accurately if the distribution of the test data differs significantly from that of the training dataset.
Applications:
- Speech recognition
- Image classification
- Spam detection
- Weather Forecast
- Face Recognition
Throughout this session, we shall focus on ML models based on supervised learning.
Unsupervised Learning¶
Unsupervised learning is very much the opposite of supervised learning. It features no labels. Instead, our algorithm would be fed a lot of data and given the tools to understand the properties of the data. From there, it can learn to group, cluster, and/or organize the data in a way such that a human (or other intelligent algorithm) can come in and make sense of the newly organized data.
Unsupervised learning is particularly useful in finding unknown patterns in a dataset. It aids in finding features needed for categorization. Your images, videos, or any data provided doesn’t have to be annotated or labeled.
Involves learning from unlabeled data without any supervision.
- No predefined labels or output for the data, and the algorithm must identify patterns or relationships on its own.
- Algorithm is trained on the input data and tasked with finding hidden structures or relationships within it.
- The algorithm continues to learn and refine its understanding of the data until it can identify meaningful patterns and structures (see the short sketch after this list).
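As a tiny illustration (the points below are made up), a clustering algorithm can group data without ever being shown labels; we will look at K-Means in more detail later in this session:
from sklearn.cluster import KMeans
# Made-up 2D points that form two visible groups; no labels are supplied
points = [[1.0, 1.1], [0.9, 1.0], [1.2, 0.8],
          [8.0, 8.1], [7.9, 8.3], [8.2, 7.9]]
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(points)
print(kmeans.labels_)  # e.g. [0 0 0 1 1 1] - the algorithm discovers the groups on its own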
Here is a simple analogy:
Think about when you first arrived at college, and wanted to make friends with other students who share similar interests, but you didn't know anyone. To find other like-minded students, you attend different events and gatherings on campus, such as club fairs, sports games, and academic talks.
As you interact with other students and attend different events, you start to notice that certain groups of students tend to congregate together and share common interests.
For example, you might notice that there is a group of students who are passionate about technology and often participate in hackathons and coding competitions.
As you continue to attend different events and interact with more students, you start to identify clusters of students based on their interests and activities.
ML Algorithms:
- Clustering
- Principal Component Analysis (PCA)
- Singular Value Decomposition (SVD)
- Independent Component Analysis (ICA)
- K-Means Clustering
Advantages:
- Unsupervised learning does not require labeled data, which can be difficult and time-consuming to obtain.
- This type of learning can uncover new insights and patterns that may not have been noticed by humans.
Disadvantages:
- Evaluating the accuracy of an unsupervised learning algorithm is often more difficult than with supervised learning, as there is no clear way to measure correctness.
- Unsupervised learning can be computationally expensive and may require more complex algorithms and hardware to achieve good results.
Applications:
- Anomaly detection
- Data compression
- Image segmentation
- Clustering customers based on behavior or preferences
- Recommender systems
Semi-Supervised Learning¶
Involves machine learning algorithms that learn from a combination of labelled and unlabelled data.
- Typically, a small set of labelled data is provided, along with a much larger set of unlabelled data.
- Algorithm uses the labelled data to learn patterns in the data and then applies this learning to the unlabelled data.
- The goal is to use the unlabelled data to improve the accuracy of the model beyond what could be achieved with just the labelled data (see the sketch after this list).
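As a rough sketch of this idea, scikit-learn's SelfTrainingClassifier wraps a supervised model and uses -1 to mark unlabelled samples; the data and numbers below are synthetic, chosen only to illustrate the API:
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.semi_supervised import SelfTrainingClassifier
rng = np.random.default_rng(0)
# Two synthetic clusters of points; only four of them keep their labels,
# the rest are marked as unlabelled with -1 (scikit-learn's convention)
X = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(4, 1, (50, 2))])
y = np.array([0] * 50 + [1] * 50)
y_partial = np.full(100, -1)
y_partial[[0, 1, 50, 51]] = y[[0, 1, 50, 51]]
model = SelfTrainingClassifier(LogisticRegression())
model.fit(X, y_partial)
print(model.predict([[0, 0], [4, 4]]))  # expected: [0 1]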
Here is a simple analogy:
Think of it like a chef creating a new recipe with a mix of familiar and unfamiliar ingredients. By experimenting with the known ingredients, the chef identifies patterns and builds a foundation for the recipe. Then, by introducing the unfamiliar ingredients and using the patterns identified with the known ingredients as a guide, the chef is able to create a more complex and interesting recipe than they could have with just the familiar ingredients alone.
ML Algorithms:
- Self-Training
- Co-Training
- Multi-View Learning
- Semi-Supervised SVM
- Graph-Based Learning
Advantages:
- Less time-consuming and costly than fully-supervised learning, as it requires less labelled data.
- Can improve the accuracy of a model by incorporating additional unlabelled data.
Disadvantages:
- May not be as accurate as fully-supervised learning, since it relies on unlabelled data to provide additional information.
- Difficult to balance the amount of labelled and unlabelled data for optimal performance.
Applications:
- Text classification
- Image classification
- Anomaly detection
- Protein sequence analysis
- Speech recognition
Reinforcement Learning¶
Involves machine learning algorithms that learn through interactions with an environment.
- The algorithm learns by taking actions in an environment and receiving feedback in the form of rewards or penalties.
- The goal is to learn a policy, or a set of rules for selecting actions, that maximizes the cumulative reward over time (see the sketch after this list).
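As a rough sketch, here is tabular Q-learning on a toy "corridor" environment; the environment, rewards, and hyperparameters are all invented for illustration:
import numpy as np
# Toy environment: 5 states in a row, start at state 0, reward +1 for reaching state 4
# Actions: 0 = move left, 1 = move right
n_states, n_actions = 5, 2
Q = np.zeros((n_states, n_actions))    # Q-table: expected return for each (state, action)
alpha, gamma, epsilon = 0.1, 0.9, 0.2  # learning rate, discount factor, exploration rate
rng = np.random.default_rng(0)
for episode in range(500):
    state = 0
    while state != 4:
        # epsilon-greedy: usually exploit the best known action, sometimes explore
        action = rng.integers(n_actions) if rng.random() < epsilon else int(Q[state].argmax())
        next_state = max(0, state - 1) if action == 0 else min(4, state + 1)
        reward = 1.0 if next_state == 4 else 0.0
        # Q-learning update rule
        Q[state, action] += alpha * (reward + gamma * Q[next_state].max() - Q[state, action])
        state = next_state
print(Q.argmax(axis=1))  # learned policy: states 0-3 should prefer action 1 (move right)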
Here is another analogy:
Think about the time you were deciding which college extracurricular activities to participate in. You don't know which activities will be the most enjoyable or rewarding, so you try out different options and observe how they affect your mood and overall experience.
For example, you might try out for the sports team, music club, or have joined the debate club. After participating in each activity for a while, you evaluate how much you enjoyed it and how much it contributed to your personal growth.
Over time, you start to learn which activities are the most enjoyable and beneficial for you. You begin to focus on these activities more and prioritize them over less enjoyable or less rewarding activities.
ML Algorithms:
- Q-Learning
- SARSA
- Deep Q-Networks (DQN)
- Policy Gradients
- Actor-Critic
Advantages:
- Can learn optimal policies in complex environments where the optimal solution is not known or is too difficult to calculate.
- Can learn from experience without the need for labelled data.
Disadvantages:
- Can be time-consuming and computationally expensive to train, especially for complex environments.
- Can be difficult to design a reward function that accurately reflects the desired behavior.
Applications:
- Game playing
- Robotics
- Autonomous driving
- Recommendation systems
- Control systems for power grids, water treatment plants, etc.
An explanation in a nutshell:¶
As noted above, throughout this session we shall focus on ML models based on supervised learning.
Linear Regression¶
Does this look familiar?
You have all already done this for your physics lab!
The trendline option in Excel generates a regression based on the data provided.
We will be doing a similar thing in Python, but at a more advanced level.
A short explanation:
Linear regression is a statistical method used to model the relationship between a dependent variable and one or more independent variables. The goal of linear regression is to find the linear equation that best describes the relationship between the variables.
In a simple linear regression, there is one independent variable and one dependent variable.
The linear equation takes the form of:
$y = a + bx$
where y is the dependent variable, x is the independent variable, a is the intercept, and b is the slope. The intercept represents the value of y when x is equal to zero, and the slope represents the change in y for a one-unit increase in x.
All the math involved in building the model is abstracted away into functions, so that machine learning engineers need not implement the algorithm from scratch!
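For curiosity, here is what that abstracted math looks like for simple linear regression: the least-squares slope and intercept have a closed form, shown below on some made-up numbers and checked against NumPy's polyfit.
import numpy as np
# Made-up data points for illustration
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 4.3, 5.9, 8.2, 9.8])
# Closed-form least squares: b = cov(x, y) / var(x), a = mean(y) - b * mean(x)
b = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
a = y.mean() - b * x.mean()
print(a, b)                 # intercept and slope fitted by hand
print(np.polyfit(x, y, 1))  # [slope, intercept] - should match the values above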
Import necessary libraries
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
Loading the data set¶
The seaborn library comes with a few data sets built-in, let's check it out!
sns.get_dataset_names()
['anagrams', 'anscombe', 'attention', 'brain_networks', 'car_crashes', 'diamonds', 'dots', 'dowjones', 'exercise', 'flights', 'fmri', 'geyser', 'glue', 'healthexp', 'iris', 'mpg', 'penguins', 'planets', 'seaice', 'taxis', 'tips', 'titanic']
Let's load the tips dataset
tips_data = sns.load_dataset('tips')
tips_data
total_bill | tip | sex | smoker | day | time | size | |
---|---|---|---|---|---|---|---|
0 | 16.99 | 1.01 | Female | No | Sun | Dinner | 2 |
1 | 10.34 | 1.66 | Male | No | Sun | Dinner | 3 |
2 | 21.01 | 3.50 | Male | No | Sun | Dinner | 3 |
3 | 23.68 | 3.31 | Male | No | Sun | Dinner | 2 |
4 | 24.59 | 3.61 | Female | No | Sun | Dinner | 4 |
... | ... | ... | ... | ... | ... | ... | ... |
239 | 29.03 | 5.92 | Male | No | Sat | Dinner | 3 |
240 | 27.18 | 2.00 | Female | Yes | Sat | Dinner | 2 |
241 | 22.67 | 2.00 | Male | Yes | Sat | Dinner | 2 |
242 | 17.82 | 1.75 | Male | No | Sat | Dinner | 2 |
243 | 18.78 | 3.00 | Female | No | Thur | Dinner | 2 |
244 rows × 7 columns
Visualizing the data:¶
The Tips dataset is a data frame with 244 rows and 7 variables, representing tipping data recorded by one waiter for each tip he received over a period of a few months working in one restaurant.
The waiter collected several variables:
- Tip in dollars
- Bill in dollars
- Sex of the bill payer
- Whether there were smokers in the party
- Day of the week
- Time of day
- Size of the party.
Exploring the data:
sns.pairplot(tips_data)
<seaborn.axisgrid.PairGrid at 0x24f00869e80>
From the pairplot we can see that there is an almost linear correspondence between total_bill and tip.
Let's try to make a linear regression model between total_bill and tip, where the model predicts the tip given to a waiter based on the total bill the customer pays.
sns.scatterplot(x='total_bill', y='tip', data=tips_data)
<AxesSubplot:xlabel='total_bill', ylabel='tip'>
This will create a scatterplot of the total_bill vs tip variables.
Checking for a linear model:
sns.lmplot(x='total_bill', y='tip', data=tips_data)
<seaborn.axisgrid.FacetGrid at 0x24f0267b370>
This will add a regression line to the scatterplot, showing the relationship between 'total_bill' and 'tip'.
Training and Testing Data¶
Now that we've explored the data a bit, let's go ahead and split the data into training and testing sets.
First we will set a variable x equal to the total_bill column of the tips data and a variable y equal to the tip column.
y = tips_data['tip']
x = tips_data['total_bill']
Use model_selection.train_test_split from sklearn to split the data into training and testing sets.
TEST_SIZE = 0.2
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=TEST_SIZE, random_state=1)
Training the Model¶
Now it's time to train our model on our training data!
from sklearn.linear_model import LinearRegression
Creating an instance of a linear regression model:
lm = LinearRegression(fit_intercept=False)
Setting fit_intercept=False forces the intercept to zero, so the fitted line passes through the origin.
lm.fit(x_train, y_train)
--------------------------------------------------------------------------- ValueError Traceback (most recent call last) ...
ValueError: Expected 2D array, got 1D array instead: array=[16.99 19.77 31.71 ... 17.47 10.07 16.93].
Reshape your data either using array.reshape(-1, 1) if your data has a single feature or array.reshape(1, -1) if it contains a single sample.
Hmm, why is the above statement giving an error?
Because in most cases linear regression is multivariate, i.e., y depends on more than one variable.
Our x is a one-dimensional array, but the model expects a two-dimensional array of shape (number of samples, number of features).
In our case, we could consider the size column of the data set to take the dependence on party size into account as well.
random_state sets the seed for the random number generator so that the results we get can be reproduced.
test_size sets aside a fraction of the data for testing purposes.
TEST_SIZE = 0.1
RANDOM_STATE = 2
x = tips_data[['total_bill', 'size']]
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=TEST_SIZE, random_state=RANDOM_STATE)
lm.fit(x_train, y_train)
LinearRegression(fit_intercept=False)
print('Coefficients: \n', lm.coef_)
Coefficients: [0.09632358 0.37962516]
But if you look at the pair plot from the initial steps, there is no correlation between size and the waiter's tip.
So, you could reshape the original total_bill array, as suggested by the error message.
Reshape your data either using array.reshape(-1, 1) if your data has a single feature or array.reshape(1, -1) if it contains a single sample.
x = tips_data['total_bill'].values.reshape(-1, 1)
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=TEST_SIZE, random_state=RANDOM_STATE)
lm.fit(x_train, y_train)
LinearRegression(fit_intercept=False)
print('Coefficients: \n', lm.coef_)
Coefficients: [0.14113406]
Predicting Test Data¶
Now that we have fit our model, let's evaluate its performance by predicting off the test values!
tip_predictions = lm.predict(x_test)
tip_predictions
array([4.91569945, 3.60738668, 1.20246223, 2.30330792, 3.38016083, 4.8408984 , 1.4028726 , 2.83397201, 3.5283516 , 6.82100931, 1.44944684, 2.48254819, 2.98639679, 2.40915847, 1.34783031, 3.34205464, 1.94906142, 2.25532234, 1.6357438 , 2.79586581, 2.29625122, 2.1706419 , 3.39568558, 2.79727715, 2.43597394])
And these are the actual values:
y_test
85 5.17 54 4.34 126 1.48 93 4.30 113 2.55 141 6.70 53 1.56 65 3.15 157 3.75 212 9.00 10 1.71 64 2.64 89 3.00 71 3.00 30 1.45 3 3.31 163 2.00 84 2.03 217 1.50 191 4.19 225 2.50 101 3.00 35 3.60 24 3.18 152 2.74 Name: tip, dtype: float64
Let's now create a scatterplot of the real test values versus the predicted values.
plt.scatter(y_test, tip_predictions)
plt.xlabel('Y Test')
plt.ylabel('Predicted Y')
Text(0, 0.5, 'Predicted Y')
Evaluating the model:¶
You can use metrics such as the mean squared error (MSE) or the coefficient of determination (R-squared) to evaluate the accuracy of the linear regression model.
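For reference, with $y_i$ the actual values, $\hat{y}_i$ the predicted values, and $\bar{y}$ the mean of the actual values, the two metrics are defined as:
$MSE = \frac{1}{n}\sum_{i=1}^{n}(y_i - \hat{y}_i)^2$
$R^2 = 1 - \frac{\sum_{i}(y_i - \hat{y}_i)^2}{\sum_{i}(y_i - \bar{y})^2}$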
from sklearn.metrics import mean_squared_error, r2_score
mse = mean_squared_error(y_test, tip_predictions)
r2 = r2_score(y_test, tip_predictions)
print('Mean squared error:', mse)
print('R-squared:', r2)
Mean squared error: 0.688445648107554 R-squared: 0.7602517788014659
The lower the MSE, the more accurate the model; an MSE of zero would indicate a perfect fit.
R-squared gives an accuracy measure that can be expressed as a percentage.
accuracy = round(r2*100, 2)
print(f'Model accuracy ≈ {accuracy} %')
Model accuracy ≈ 76.03 %
Let's compare the predicted and actual values:
tip_predictions = [round(prediction, 2) for prediction in tip_predictions]  # round the predictions to 2 decimal places for readability
y_test = list(y_test)
comparison_df = pd.DataFrame({'Predicted': tip_predictions, 'Actual': y_test})
comparison_df
Predicted | Actual | |
---|---|---|
0 | 4.92 | 5.17 |
1 | 3.61 | 4.34 |
2 | 1.20 | 1.48 |
3 | 2.30 | 4.30 |
4 | 3.38 | 2.55 |
5 | 4.84 | 6.70 |
6 | 1.40 | 1.56 |
7 | 2.83 | 3.15 |
8 | 3.53 | 3.75 |
9 | 6.82 | 9.00 |
10 | 1.45 | 1.71 |
11 | 2.48 | 2.64 |
12 | 2.99 | 3.00 |
13 | 2.41 | 3.00 |
14 | 1.35 | 1.45 |
15 | 3.34 | 3.31 |
16 | 1.95 | 2.00 |
17 | 2.26 | 2.03 |
18 | 1.64 | 1.50 |
19 | 2.80 | 4.19 |
20 | 2.30 | 2.50 |
21 | 2.17 | 3.00 |
22 | 3.40 | 3.60 |
23 | 2.80 | 3.18 |
24 | 2.44 | 2.74 |
Linear regression is a powerful method for modeling the relationship between input features and target variables, and Seaborn makes it easy to perform and visualize linear regression in Python.
Logistic Regression¶
It's similar to linear regression, with a key difference: linear regression is used to handle regression problems, whereas logistic regression is used to handle classification problems. Linear regression provides a continuous output, but logistic regression provides a discrete output.
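Concretely, logistic regression passes the same linear combination through the sigmoid (logistic) function, which squashes it into a probability between 0 and 1; the predicted class is then obtained by thresholding that probability (typically at 0.5):
$p(y = 1 \mid x) = \frac{1}{1 + e^{-(a + bx)}}$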
%matplotlib inline
import io
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
import scipy.stats
import scipy.special
import seaborn as sns
sns.set_style('white')
sns.set_context('notebook')
Spider data from Suzuki et al. (2006). In the following cells, the analysis we run is the same as the analysis made in the paper, though we won't pursue that direction any further here.
data = """Grain size (mm) Spiders
0.245 absent
0.247 absent
0.285 present
0.299 present
0.327 present
0.347 present
0.356 absent
0.36 present
0.363 absent
0.364 present
0.398 absent
0.4 present
0.409 absent
0.421 present
0.432 absent
0.473 present
0.509 present
0.529 present
0.561 absent
0.569 absent
0.594 present
0.638 present
0.656 present
0.816 present
0.853 present
0.938 present
1.036 present
1.045 present
"""
df = pd.read_table(io.StringIO(data))
df.Spiders = df.Spiders == 'present'
df.head()
Grain size (mm) | Spiders | |
---|---|---|
0 | 0.245 | False |
1 | 0.247 | False |
2 | 0.285 | True |
3 | 0.299 | True |
4 | 0.327 | True |
df
Grain size (mm) | Spiders | |
---|---|---|
0 | 0.245 | False |
1 | 0.247 | False |
2 | 0.285 | True |
3 | 0.299 | True |
4 | 0.327 | True |
5 | 0.347 | True |
6 | 0.356 | False |
7 | 0.360 | True |
8 | 0.363 | False |
9 | 0.364 | True |
10 | 0.398 | False |
11 | 0.400 | True |
12 | 0.409 | False |
13 | 0.421 | True |
14 | 0.432 | False |
15 | 0.473 | True |
16 | 0.509 | True |
17 | 0.529 | True |
18 | 0.561 | False |
19 | 0.569 | False |
20 | 0.594 | True |
21 | 0.638 | True |
22 | 0.656 | True |
23 | 0.816 | True |
24 | 0.853 | True |
25 | 0.938 | True |
26 | 1.036 | True |
27 | 1.045 | True |
df["Spiders"]
0 False 1 False 2 True 3 True 4 True 5 True 6 False 7 True 8 False 9 True 10 False 11 True 12 False 13 True 14 False 15 True 16 True 17 True 18 False 19 False 20 True 21 True 22 True 23 True 24 True 25 True 26 True 27 True Name: Spiders, dtype: bool
plt.scatter(df["Grain size (mm)"], df["Spiders"])
plt.ylabel('Spiders present?')
sns.despine()
import sklearn.linear_model
scikit-learn has a logistic regression classifier which uses regularization by default. To effectively eliminate regularization, we set the regularization parameter C to $10^{12}$.
# C=1e12 is effectively no regularization - see https://github.com/scikit-learn/scikit-learn/issues/6738
clf = sklearn.linear_model.LogisticRegression(C=1e12, random_state=0)
clf.fit(df['Grain size (mm)'].values.reshape(-1, 1), df['Spiders'])
print(clf.intercept_, clf.coef_)
[-1.64761964] [[5.12153717]]
def plot_log_reg(x, y, data, clf, xmin=None, xmax=None, alpha=1, ax=None):
    if ax is None:
        fig, ax = plt.subplots()
    else:
        fig = ax.figure
    ax.scatter(data[x], data[y], color='black', zorder=20, alpha=alpha)
    if xmin is None:
        xmin = data[x].min()
    if xmax is None:
        xmax = data[x].max()
    X_test = np.linspace(xmin, xmax, 300)
    # logistic curve: expit(coef * x + intercept) gives the predicted probability of presence
    loss = scipy.special.expit(X_test * clf.coef_ + clf.intercept_).ravel()
    ax.plot(X_test, loss, linewidth=3)
    ax.set_xlabel(x)
    ax.set_ylabel(y)
    fig.tight_layout()
    sns.despine()
    return fig, ax
plot_log_reg(x='Grain size (mm)', y='Spiders', data=df, clf=clf, xmin=0, xmax=1.5);
KNN (K-Nearest-Neighbors)¶
KNN is a simple concept: define some distance metric between the items in your dataset, and find the K closest items. You can then use those items to predict some property of a test item, by having them "vote" on it in some way.
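For reference, scikit-learn ships a ready-made KNN implementation; here is a minimal sketch with invented numbers using KNeighborsRegressor. Below, however, we will build the neighbour search by hand to see how it works.
from sklearn.neighbors import KNeighborsRegressor
# Invented features (say, [normalized popularity, a genre-similarity score]) and a numeric target
X = [[0.10, 0.0], [0.20, 0.1], [0.80, 0.9], [0.90, 1.0], [0.85, 0.95]]
y = [2.5, 3.0, 4.5, 4.8, 4.6]
knn = KNeighborsRegressor(n_neighbors=3)  # average the targets of the 3 closest items
knn.fit(X, y)
print(knn.predict([[0.95, 0.9]]))         # prediction based on the 3 nearest neighbours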
As an example, let's look at the MovieLens data. We'll try to guess the rating of a movie by looking at the 10 movies that are closest to it in terms of genres and popularity.
To start, we'll load up every rating in the data set into a Pandas DataFrame:
import pandas as pd
r_cols = ['user_id', 'movie_id', 'rating']
ratings = pd.read_csv('u.data', sep='\t', names=r_cols, usecols=range(3))
ratings.head()
user_id | movie_id | rating | |
---|---|---|---|
0 | 0 | 50 | 5 |
1 | 0 | 172 | 5 |
2 | 0 | 133 | 1 |
3 | 196 | 242 | 3 |
4 | 186 | 302 | 3 |
Now, we'll group everything by movie ID, and compute the total number of ratings (each movie's popularity) and the average rating for every movie:
import numpy as np
movieProperties = ratings.groupby('movie_id').agg({'rating': [np.size, np.mean]})
movieProperties.head()
rating | ||
---|---|---|
size | mean | |
movie_id | ||
1 | 452 | 3.878319 |
2 | 131 | 3.206107 |
3 | 90 | 3.033333 |
4 | 209 | 3.550239 |
5 | 86 | 3.302326 |
The raw number of ratings isn't very useful for computing distances between movies, so we'll create a new DataFrame that contains the normalized number of ratings. So, a value of 0 means nobody rated it, and a value of 1 will mean it's the most popular movie there is.
movieNumRatings = pd.DataFrame(movieProperties['rating']['size'])
movieNormalizedNumRatings = movieNumRatings.apply(lambda x: (x - np.min(x)) / (np.max(x) - np.min(x)))
movieNormalizedNumRatings.head()
size | |
---|---|
movie_id | |
1 | 0.773585 |
2 | 0.222985 |
3 | 0.152659 |
4 | 0.356775 |
5 | 0.145798 |
Now, let's get the genre information from the u.item file. The way this works is there are 19 fields, each corresponding to a specific genre - a value of '0' means it is not in that genre, and '1' means it is in that genre. A movie may have more than one genre associated with it.
While we're at it, we'll put together everything into one big Python dictionary called movieDict. Each entry will contain the movie name, list of genre values, the normalized popularity score, and the average rating for each movie:
movieDict = {}
with open(r'u.item', encoding="ISO-8859-1") as f:
    for line in f:
        fields = line.rstrip('\n').split('|')
        movieID = int(fields[0])
        name = fields[1]
        genres = fields[5:25]
        genres = map(int, genres)
        movieDict[movieID] = (name, np.array(list(genres)), movieNormalizedNumRatings.loc[movieID].get('size'), movieProperties.loc[movieID].rating.get('mean'))
For example, here's the record we end up with for movie ID 1, "Toy Story":
print(movieDict[1])
('Toy Story (1995)', array([0, 0, 0, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]), 0.7735849056603774, 3.8783185840707963)
Now let's define a function that computes the "distance" between two movies based on how similar their genres are, and how similar their popularity is. Just to make sure it works, we'll compute the distance between movie ID's 2 and 4:
from scipy import spatial
def ComputeDistance(a, b):
    genresA = a[1]
    genresB = b[1]
    genreDistance = spatial.distance.cosine(genresA, genresB)
    popularityA = a[2]
    popularityB = b[2]
    popularityDistance = abs(popularityA - popularityB)
    return genreDistance + popularityDistance
ComputeDistance(movieDict[2], movieDict[4])
0.8004574042309892
Remember the higher the distance, the less similar the movies are. Let's check what movies 2 and 4 actually are - and confirm they're not really all that similar:
print(movieDict[2])
print(movieDict[4])
('GoldenEye (1995)', array([0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0]), 0.22298456260720412, 3.2061068702290076) ('Get Shorty (1995)', array([0, 1, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]), 0.3567753001715266, 3.550239234449761)
Now, we just need a little code to compute the distance between some given test movie (Toy Story, in this example) and all of the movies in our data set. We then sort those by distance and print out the K nearest neighbors:
import operator
def getNeighbors(movieID, K):
    distances = []
    for movie in movieDict:
        if (movie != movieID):
            dist = ComputeDistance(movieDict[movieID], movieDict[movie])
            distances.append((movie, dist))
    distances.sort(key=operator.itemgetter(1))
    neighbors = []
    for x in range(K):
        neighbors.append(distances[x][0])
    return neighbors
K = 10
avgRating = 0
neighbors = getNeighbors(1, K)
for neighbor in neighbors:
    avgRating += movieDict[neighbor][3]
    print(movieDict[neighbor][0] + " " + str(movieDict[neighbor][3]))
avgRating /= K
Liar Liar (1997) 3.156701030927835 Aladdin (1992) 3.8127853881278537 Willy Wonka and the Chocolate Factory (1971) 3.6319018404907975 Monty Python and the Holy Grail (1974) 4.0664556962025316 Full Monty, The (1997) 3.926984126984127 George of the Jungle (1997) 2.685185185185185 Beavis and Butt-head Do America (1996) 2.7884615384615383 Birdcage, The (1996) 3.4436860068259385 Home Alone (1990) 3.0875912408759123 Aladdin and the King of Thieves (1996) 2.8461538461538463
While we were at it, we computed the average rating of the 10 nearest neighbors to Toy Story:
avgRating
3.3445905900235564
movieDict[1]
('Toy Story (1995)', array([0, 0, 0, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]), 0.7735849056603774, 3.8783185840707963)
K Means Clustering and Elbow Method¶
from numpy import random, array

# Create fake income/age clusters for N people in k clusters
def createClusteredData(N, k):
    random.seed(10)
    pointsPerCluster = float(N) / k
    X = []
    for i in range(k):
        incomeCentroid = random.uniform(20000.0, 200000.0)
        ageCentroid = random.uniform(20.0, 70.0)
        for j in range(int(pointsPerCluster)):
            X.append([random.normal(incomeCentroid, 10000.0), random.normal(ageCentroid, 2.0)])
    X = array(X)
    return X
%matplotlib inline
from sklearn.cluster import KMeans
import matplotlib.pyplot as plt
from sklearn.preprocessing import scale
from numpy import random
data = createClusteredData(100, 5)
model = KMeans(n_clusters=5)
# Note I'm scaling the data to normalize it! Important for good results.
model = model.fit(scale(data))
# We can look at the clusters each data point was assigned to
print(model.labels_)
# And we'll visualize it:
plt.figure(figsize=(8, 6))
plt.scatter(data[:,0], data[:,1], c=model.labels_.astype(float))
plt.show()
[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 2 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2]
len(data)
100
Within-cluster sum of squares (WCSS):
- If it is zero, every data point is its own cluster, which is not helpful.
- If it is at its maximum, all data points are in a single cluster.
Hence we want a middle ground: a reasonably low WCSS value with a small number of clusters.
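Formally, with $C_k$ the set of points assigned to cluster $k$ and $\mu_k$ that cluster's centroid, WCSS (reported by scikit-learn as inertia_) is:
$WCSS = \sum_{k=1}^{K} \sum_{x_i \in C_k} \lVert x_i - \mu_k \rVert^2$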
model.inertia_
5.300772956616055
wcss = []
for i in range(1, 101):
    model = KMeans(i)
    model.fit(data)
    wcss_iter = model.inertia_
    wcss.append(wcss_iter)
number_clusters = range(1,101)
plt.plot(number_clusters, wcss)
plt.title("Elbow method")
plt.xlabel("Number of clusters")
plt.ylabel("Within cluster sum of square")
Text(0, 0.5, 'Within cluster sum of square')
number_clusters = range(1,11)
plt.plot(number_clusters, wcss[0:10])
plt.title("Elbow method")
plt.xlabel("Number of clusters")
plt.ylabel("Within cluster sum of square")
Text(0, 0.5, 'Within cluster sum of square')
From the graph we can conclude that 4 should be our choice for the number of clusters.