Logistic Regression Implementation using Python

6 minute read

Logistic Regression

Basic Assumptions:
1) The target variable y is binary and follows a Bernoulli distribution
2) The independent variables should have little or no multicollinearity
3) The independent variables are linearly related to the log odds of the target
4) There should be a sufficiently large amount of data
5) Model parameters are estimated by gradient descent or maximum (conditional) likelihood estimation; a minimal sketch follows below
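To make assumption 5 concrete, here is a minimal sketch (not the solver sklearn uses later in this post) of estimating the weights by gradient descent on the negative log-likelihood; the learning rate and iteration count below are arbitrary illustrative choices.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_logistic_gd(X, y, lr=0.1, n_iter=1000):
    # X: (n_samples, n_features) array, y: (n_samples,) array of 0/1 labels
    X = np.column_stack([np.ones(len(X)), X])   # prepend an intercept column
    w = np.zeros(X.shape[1])
    for _ in range(n_iter):
        p = sigmoid(X @ w)                      # predicted probabilities
        grad = X.T @ (p - y) / len(y)           # gradient of the negative log-likelihood
        w -= lr * grad
    return w                                    # w[0] is the intercept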

Problem Statement

The objective is to predict whether an individual will default on his credit card payment, on the basis of his bank balance, income, and whether or not he is a student.

Dataset can be downloaded from here.

A simulated data set containing information on ten thousand customers.

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
!pip install xlrd
org_data = pd.read_excel("LogisticRegressionDemo.xlsx")
Requirement already satisfied: xlrd in c:\users\i330087\appdata\local\continuum\anaconda3\lib\site-packages (1.2.0)
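Note: if your environment has xlrd 2.0 or newer, it can read only legacy .xls files; for an .xlsx file pandas falls back to the openpyxl engine. A hedged alternative, assuming the same file name as above:

# !pip install openpyxl
org_data = pd.read_excel("LogisticRegressionDemo.xlsx", engine="openpyxl")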
#org_data = org_data.drop(['Unnamed: 0'], axis=1)
org_data.head()
Unnamed: 0 default student balance income
0 1 No No 729.526495 44361.625074
1 2 No Yes 817.180407 12106.134700
2 3 No No 1073.549164 31767.138947
3 4 No No 529.250605 35704.493935
4 5 No No 785.655883 38463.495879
# Check the dataset dimensions and obtain basic information.
# The shape attribute gives the data size, i.e. number of observations as rows and number of features as columns.
# The describe() function provides a statistical summary.
print(org_data.shape)
org_data = org_data.drop(['Unnamed: 0'], axis=1)
org_data.info()
# Try org_data.describe() for printing the statistical summary
(10000, 5)
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10000 entries, 0 to 9999
Data columns (total 4 columns):
default    10000 non-null object
student    10000 non-null object
balance    10000 non-null float64
income     10000 non-null float64
dtypes: float64(2), object(2)
memory usage: 312.6+ KB

Notice that there are no missing values in the dataset: all columns have 10000 non-null values.
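To verify this explicitly rather than reading it off info(), a quick per-column count of missing values:

org_data.isnull().sum()   # expect 0 for every column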

Print the number of positive and negative instances

print(org_data.default.value_counts())
No     9667
Yes     333
Name: default, dtype: int64

A heavily class-imbalanced problem, with only about 3.3% defaulters.
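The imbalance is easier to see as proportions; value_counts(normalize=True) returns class fractions directly:

print(org_data.default.value_counts(normalize=True))   # roughly 0.967 No vs 0.033 Yes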

data = org_data[['default',"student",'balance','income']]
data = data[1:1000]   # work on a smaller subset first: rows 1 to 999 (999 observations)
print(data.shape)
data.default.value_counts()
#data.head()
(999, 4)





No     961
Yes     38
Name: default, dtype: int64

Since the response variable “default” and the predictor “student” are categorical, with values “Yes” and “No”, it is important to encode them in 0/1 form before feeding them to the model.
The factorize() method is used to perform this conversion.

data['default'],default_unique = data['default'].factorize()
data['student'], st_unique = data['student'].factorize()
print("unique student labels",st_unique)
print("unique default labels",default_unique)
data.head()
unique student labels Index(['Yes', 'No'], dtype='object')
unique default labels Index(['No', 'Yes'], dtype='object')
default student balance income
1 0 0 817.180407 12106.134700
2 0 1 1073.549164 31767.138947
3 0 1 529.250605 35704.493935
4 0 1 785.655883 38463.495879
5 0 0 919.588530 7491.558572

This implies that student=’Yes’ is encoded as 0 and student=’No’ is encoded as 1. Similarly, default=’No’ is encoded as 0 and default=’Yes’ as 1.
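Because factorize() assigns codes in order of first appearance, the mapping can change from one subset to another. If you prefer a fixed, conventional encoding (No=0, Yes=1), an explicit map is a hedged alternative to the factorize() calls above, applied to the original string columns:

# order-independent encoding on the original string columns: No -> 0, Yes -> 1
data_alt = org_data[['default', 'student', 'balance', 'income']].copy()
data_alt['default'] = data_alt['default'].map({'No': 0, 'Yes': 1})
data_alt['student'] = data_alt['student'].map({'No': 0, 'Yes': 1})

The rest of this post keeps the factorize() encoding, so remember its mapping when interpreting signs below.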

data.corr()
default student balance income
default 1.000000 -0.018135 0.360888 0.014253
student -0.018135 1.000000 -0.227472 0.755412
balance 0.360888 -0.227472 1.000000 -0.137466
income 0.014253 0.755412 -0.137466 1.000000

Of the predictors, balance has the highest correlation with default (0.36), so balance has a strong influence on whether a customer will default on his credit payment. The correlations of default with student and income are much smaller.

The student column has only a small correlation with default (-0.018), so the relationship is weak in this subset; remember that student is encoded here as Yes = 0 and No = 1, so the sign must be read against that mapping. Also notice the strong positive correlation between student and income, and the negative correlation between student and balance.

It would be good to explore this for different subsets and then for the entire dataset, to see how these correlation coefficients change and get closer to the true relationship.
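For example, a sketch of that check on the full 10,000-row dataset, using the explicit No=0/Yes=1 map from above so the sign convention stays fixed:

full = org_data[['default', 'student', 'balance', 'income']].copy()
full['default'] = full['default'].map({'No': 0, 'Yes': 1})
full['student'] = full['student'].map({'No': 0, 'Yes': 1})
print(full.corr())   # compare against the subset's correlation matrix above

Note that with student coded as Yes=1 here, the sign of the student correlations is flipped relative to the factorize() table above, where Yes was coded 0.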

Visualize the relationship between the response and the predictors

The response variable in a binary classification problem takes only two values (0/1), so a plain scatter plot of the response against a predictor is not very informative. Instead, box plots and contingency tables give a good understanding of the data.

pd.crosstab(data.student, data.default).plot(kind='bar')
[Bar chart: counts of default (No/Yes) for each value of the student column]

Is student a good indicator of default? For default = No (non-defaulters), the difference in counts between students (0) and non-students (1) is very prominent, but this largely reflects how many students versus non-students are in the data. For defaulters the counts are small for both groups and the distinction is not significant, so on its own student does not look like a strong indicator.

# boxplot
data.boxplot(column='balance',by='default')

[Box plot: balance grouped by default]

The purpose of visualization is to ask questions such as: what does this bar chart or box plot convey?
Here there is a huge difference in balance between defaulters and non-defaulters. Check income vs default as well.
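A one-liner for that check, in the same style as the balance plot:

data.boxplot(column='income', by='default')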

Another common visualization is a coloured scatter plot of two predictors, with the points coloured by the response, as follows:

plt.scatter(data.balance, data.income, c=data.default)
[Scatter plot: income vs. balance, coloured by default]

from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
model1 = LogisticRegression()
# X = data.balance.values.reshape(-1,1)
#X = data[['balance','student']]
X = data[['balance','student','income']]
# print(X.shape)
y = data.default.values.reshape(-1,1)   # the (-1,1) reshape triggers the DataConversionWarning below; .ravel() would avoid it

x_train, x_test, y_train, y_test = train_test_split(X,data.default,test_size=0.20,random_state=105)
# Note: fitting on the full X rather than x_train leaks the test rows into training;
# fitting on (x_train, y_train) would give an honest hold-out evaluation below.
model1.fit(X, y)
C:\Users\I330087\AppData\Local\Continuum\anaconda3\lib\site-packages\sklearn\linear_model\logistic.py:432: FutureWarning: Default solver will be changed to 'lbfgs' in 0.22. Specify a solver to silence this warning.
  FutureWarning)
C:\Users\I330087\AppData\Local\Continuum\anaconda3\lib\site-packages\sklearn\utils\validation.py:724: DataConversionWarning: A column-vector y was passed when a 1d array was expected. Please change the shape of y to (n_samples, ), for example using ravel().
  y = column_or_1d(y, warn=True)





LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
                   intercept_scaling=1, l1_ratio=None, max_iter=100,
                   multi_class='warn', n_jobs=None, penalty='l2',
                   random_state=None, solver='warn', tol=0.0001, verbose=0,
                   warm_start=False)
print('classes: ',model1.classes_)
print('coefficients: ',model1.coef_)
print('intercept :', model1.intercept_)
classes:  [0 1]
coefficients:  [[ 2.30714566e-04  3.36201089e-07 -1.09177044e-04]]
intercept : [-1.43338127e-06]

Model Equation

\begin{equation} z = 2.307 \times 10^{-4} \cdot \text{balance} + 3.362 \times 10^{-7} \cdot \text{student} - 1.092 \times 10^{-4} \cdot \text{income} - 1.433 \times 10^{-6} \end{equation}

\begin{equation} \sigma(z) = \frac{1}{1+e^{-z}} \end{equation}
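To tie the fitted coefficients back to these equations, a small sketch that computes z and sigma(z) by hand for the test rows and compares them with predict_proba (whose two columns are P(class 0) and P(class 1)):

z = x_test.values @ model1.coef_.ravel() + model1.intercept_   # linear score for each test row
p_default = 1.0 / (1.0 + np.exp(-z))                           # sigma(z) = P(default = 1)
print(p_default[:5])
print(model1.predict_proba(x_test)[:5, 1])                     # should match the manual values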

# test1={'balance':40, 'student':1,'income':1500}
# x_test = pd.DataFrame(test1, index=[0])
predictions = model1.predict(x_test)
#print(predictions)
score = accuracy_score(predictions, y_test)
print(score)
0.945
from sklearn import metrics
print("prediction shape",predictions.shape)
print("y_test.shape", y_test.shape)
# Note: sklearn's convention is confusion_matrix(y_true, y_pred); with the arguments swapped
# as here, rows are the predicted labels and columns are the true labels.
cm = metrics.confusion_matrix(predictions, y_test)
print(cm)

prediction shape (200,)
y_test.shape (200,)
[[189  11]
 [  0   0]]
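Accuracy alone is misleading here: the model predicted class 0 for every test row (189 + 11 = 200), so it never catches a defaulter. Per-class precision and recall make that visible; a quick sketch using sklearn's classification_report, passing the arguments in the usual (y_true, y_pred) order:

from sklearn.metrics import classification_report
print(classification_report(y_test, predictions))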
# Predicting for an unknown test case
test1={'balance':3000, 'student':1,'income':1500}
x_test1 = pd.DataFrame(test1, index=[0])
predictions = model1.predict(x_test1)
print(predictions)

prob = model1.predict_proba(x_test1)
print(prob)
[1]
[[0.3708955 0.6291045]]

How do you interpret the result?

The sklearn model does not provide a statistical summary to help us understand which features are irrelevant and could be removed.
statsmodels is a good alternative here: it produces a detailed statistical summary that also helps with feature selection.

model2 = LogisticRegression()
X2 = data[['student']]
y2 = data[['default']]
model2.fit(X2, y2)
C:\Users\I330087\AppData\Local\Continuum\anaconda3\lib\site-packages\sklearn\linear_model\logistic.py:432: FutureWarning: Default solver will be changed to 'lbfgs' in 0.22. Specify a solver to silence this warning.
  FutureWarning)
C:\Users\I330087\AppData\Local\Continuum\anaconda3\lib\site-packages\sklearn\utils\validation.py:724: DataConversionWarning: A column-vector y was passed when a 1d array was expected. Please change the shape of y to (n_samples, ), for example using ravel().
  y = column_or_1d(y, warn=True)





LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
                   intercept_scaling=1, l1_ratio=None, max_iter=100,
                   multi_class='warn', n_jobs=None, penalty='l2',
                   random_state=None, solver='warn', tol=0.0001, verbose=0,
                   warm_start=False)
print(model2.coef_)
print(model2.intercept_)
test2 = {'student':1}
x_test2 = pd.DataFrame(test2, index=[0])
model2.predict(x_test2)
prob = model2.predict_proba(x_test2)
print(prob)
[[-0.37162045]]
[-2.90741904]
[[0.9637027 0.0362973]]

Contrasting results: model 1 predicts that this test case is more likely to default than not (probability ≈ 0.63), whereas model 2, which uses only the student feature, predicts that the chance of default is much lower (≈ 0.04).

from statsmodels.api import OLS
# OLS on a 0/1 response is a linear probability model; it is used here only for its coefficient
# p-values. Note that X contains no intercept column, hence the uncentered R-squared below.
OLS(data.default,X).fit().summary()
OLS Regression Results
Dep. Variable: default R-squared (uncentered): 0.138
Model: OLS Adj. R-squared (uncentered): 0.135
Method: Least Squares F-statistic: 52.98
Date: Sat, 14 Dec 2019 Prob (F-statistic): 8.92e-32
Time: 14:32:04 Log-Likelihood: 289.39
No. Observations: 999 AIC: -572.8
Df Residuals: 996 BIC: -558.1
Df Model: 3
Covariance Type: nonrobust
coef std err t P>|t| [0.025 0.975]
balance 0.0001 1e-05 10.750 0.000 8.82e-05 0.000
student 0.0215 0.019 1.103 0.270 -0.017 0.060
income -1.666e-06 5.38e-07 -3.095 0.002 -2.72e-06 -6.1e-07
Omnibus: 878.203 Durbin-Watson: 1.838
Prob(Omnibus): 0.000 Jarque-Bera (JB): 16270.196
Skew: 4.239 Prob(JB): 0.00
Kurtosis: 20.861 Cond. No. 1.22e+05



Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 1.22e+05. This might indicate that there are
strong multicollinearity or other numerical problems.

NOTE: Remember that we observed a high correlation between student and income. It is important to look at the p-value for each coefficient: the p-value for student (P>|t| = 0.270) is high, so we fail to reject the hypothesis that student has no relation with default, and this feature can be dropped from consideration.
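The same p-value-driven check can be done with an actual logistic fit instead of OLS; a hedged sketch using statsmodels' Logit (add_constant supplies the intercept column that the OLS call above omitted):

import statsmodels.api as sm

X_const = sm.add_constant(X)                       # balance, student, income plus an intercept
logit_res = sm.Logit(data.default, X_const).fit()
print(logit_res.summary())                         # coefficients with z-statistics and p-values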

from sklearn.naive_bayes import GaussianNB
gnb_model = GaussianNB()
gnb_model.fit(x_train, y_train)

GaussianNB(priors=None, var_smoothing=1e-09)
gnb_predictions = gnb_model.predict(x_test)
#print(gnb_predictions.shape)
#print(y_test.shape)
score = accuracy_score(gnb_predictions, y_test)
print("Gaussian NB accuracy : ", score)
Gaussian NB accuracy :  0.96
gnb_model.class_prior_
array([0.96620776, 0.03379224])

class_prior_ reports the class priors estimated from the training data: non-defaulters (class 0) make up about 96.6% of the training rows and defaulters (class 1) about 3.4%.

NOTE: On this data subset, the performance of Gaussian NB is better than that of logistic regression, likely because the training set is small. Try both models on the whole dataset and then compare their performance; a sketch follows below.
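A hedged sketch of that comparison on all 10,000 rows (same factorize() encoding and split procedure as before; exact numbers will depend on the split and solver):

from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

full = org_data[['default', 'student', 'balance', 'income']].copy()
full['default'], _ = full['default'].factorize()
full['student'], _ = full['student'].factorize()

Xf = full[['balance', 'student', 'income']]
yf = full['default']
xf_train, xf_test, yf_train, yf_test = train_test_split(Xf, yf, test_size=0.20, random_state=105)

lr_full = LogisticRegression(solver='lbfgs', max_iter=1000).fit(xf_train, yf_train)
nb_full = GaussianNB().fit(xf_train, yf_train)
print("Logistic Regression accuracy:", accuracy_score(yf_test, lr_full.predict(xf_test)))
print("Gaussian NB accuracy:", accuracy_score(yf_test, nb_full.predict(xf_test)))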