CLASSIFICATION ALGORITHM
While some probability-based machine learning models (like Naive Bayes) make bold assumptions about feature independence, logistic regression takes a more measured approach. Think of it as drawing a line (or a plane) that separates two outcomes, allowing us to predict probabilities with a bit more flexibility.
Logistic regression is a statistical method used for predicting binary outcomes. Despite its name, it is used for classification rather than regression. It estimates the probability that an instance belongs to a particular class. If the estimated probability is greater than 50%, the model predicts that the instance belongs to that class; otherwise, it predicts the other class.
Throughout this article, we'll use this artificial golf dataset (inspired by [1]) as an example. This dataset predicts whether a person will play golf based on weather conditions.
Just like in KNN, logistic regression requires the data to be prepared first. Convert categorical columns into 0s and 1s and also scale the numerical features so that no single feature dominates the others during training.
# Import required libraries
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from sklearn.preprocessing import StandardScaler
import pandas as pd
import numpy as np

# Create dataset from dictionary
dataset_dict = {
'Outlook': ['sunny', 'sunny', 'overcast', 'rainy', 'rainy', 'rainy', 'overcast', 'sunny', 'sunny', 'rainy', 'sunny', 'overcast', 'overcast', 'rainy', 'sunny', 'overcast', 'rainy', 'sunny', 'sunny', 'rainy', 'overcast', 'rainy', 'sunny', 'overcast', 'sunny', 'overcast', 'rainy', 'overcast'],
'Temperature': [85.0, 80.0, 83.0, 70.0, 68.0, 65.0, 64.0, 72.0, 69.0, 75.0, 75.0, 72.0, 81.0, 71.0, 81.0, 74.0, 76.0, 78.0, 82.0, 67.0, 85.0, 73.0, 88.0, 77.0, 79.0, 80.0, 66.0, 84.0],
'Humidity': [85.0, 90.0, 78.0, 96.0, 80.0, 70.0, 65.0, 95.0, 70.0, 80.0, 70.0, 90.0, 75.0, 80.0, 88.0, 92.0, 85.0, 75.0, 92.0, 90.0, 85.0, 88.0, 65.0, 70.0, 60.0, 95.0, 70.0, 78.0],
'Wind': [False, True, False, False, False, True, True, False, False, False, True, True, False, True, True, False, False, True, False, True, True, False, True, False, False, True, False, False],
'Play': ['No', 'No', 'Yes', 'Yes', 'Yes', 'No', 'Yes', 'No', 'Yes', 'Yes', 'Yes', 'Yes', 'Yes', 'No', 'No', 'Yes', 'Yes', 'No', 'No', 'No', 'Yes', 'Yes', 'Yes', 'Yes', 'Yes', 'Yes', 'No', 'Yes']
}
df = pd.DataFrame(dataset_dict)
# Prepare data: encode categorical variables
df = pd.get_dummies(df, columns=['Outlook'], prefix='', prefix_sep='', dtype=int)
df['Wind'] = df['Wind'].astype(int)
df['Play'] = (df['Play'] == 'Yes').astype(int)
# Rearrange columns
column_order = ['sunny', 'overcast', 'rainy', 'Temperature', 'Humidity', 'Wind', 'Play']
df = df[column_order]
# Split data into features and target
X, y = df.drop(columns='Play'), df['Play']
# Split data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.5, shuffle=False)
# Scale numerical features
scaler = StandardScaler()
X_train[['Temperature', 'Humidity']] = scaler.fit_transform(X_train[['Temperature', 'Humidity']])
X_test[['Temperature', 'Humidity']] = scaler.transform(X_test[['Temperature', 'Humidity']])
# Print results
print("Training set:")
print(pd.concat([X_train, y_train], axis=1), '\n')
print("Test set:")
print(pd.concat([X_test, y_test], axis=1))
Logistic regression works by applying the logistic function to a linear combination of the input features. Here's how it operates:
- Calculate a weighted sum of the input features (similar to linear regression).
- Apply the logistic function (also called the sigmoid function) to this sum, which maps any real number to a value between 0 and 1.
- Interpret this value as the probability of belonging to the positive class.
- Use a threshold (typically 0.5) to make the final classification decision.
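To make those four steps concrete, here is a minimal sketch for a single instance; the weights and feature values below are made up purely for illustration and are not taken from the trained model.
import numpy as np

# Hypothetical weights for [sunny, overcast, rainy, Temperature, Humidity, Wind] plus a bias term
weights = np.array([-0.5, 1.2, -0.3, 0.4, -0.8, -0.6])
bias = 0.1

# One made-up, already-scaled instance
x = np.array([1, 0, 0, 0.5, -1.2, 0])

# Step 1: weighted sum of the inputs
z = np.dot(weights, x) + bias

# Step 2: apply the sigmoid (logistic) function
p = 1 / (1 + np.exp(-z))

# Steps 3 & 4: interpret p as P(Play = Yes) and threshold at 0.5
prediction = int(p >= 0.5)
print(f"z = {z:.3f}, probability = {p:.3f}, predicted class = {prediction}")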
The training process for logistic regression involves finding the best weights for the input features. Here is the general outline:
1. Initialize the weights (typically to small values).
# Initialize weights (including the bias) to 0.1
initial_weights = np.full(X_train_np.shape[1], 0.1)

print(f"Initial Weights: {initial_weights}")
2. For each training example:
a. Calculate the predicted probability using the current weights.
def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def calculate_probabilities(X, weights):
    z = np.dot(X, weights)
    return sigmoid(z)

def calculate_log_loss(probabilities, y):
    return -y * np.log(probabilities) - (1 - y) * np.log(1 - probabilities)

def create_output_dataframe(X, y, weights):
    probabilities = calculate_probabilities(X, weights)
    log_losses = calculate_log_loss(probabilities, y)

    df = pd.DataFrame({
        'Probability': probabilities,
        'Label': y,
        'Log Loss': log_losses
    })

    return df

def calculate_average_log_loss(X, y, weights):
    probabilities = calculate_probabilities(X, weights)
    log_losses = calculate_log_loss(probabilities, y)
    return np.mean(log_losses)
# Convert X_train and y_train to numpy arrays for easier computation
X_train_np = X_train.to_numpy()
y_train_np = y_train.to_numpy()

# Add a column of 1s to X_train_np for the bias term
X_train_np = np.column_stack((np.ones(X_train_np.shape[0]), X_train_np))

# Create and display DataFrame for the initial weights
initial_df = create_output_dataframe(X_train_np, y_train_np, initial_weights)
print(initial_df.to_string(index=False, float_format=lambda x: f"{x:.6f}"))
print(f"\nAverage Log Loss: {calculate_average_log_loss(X_train_np, y_train_np, initial_weights):.6f}")
b. Compare this probability to the actual class label by calculating its log loss.
3. Update the weights to minimize the loss (usually with an optimization algorithm such as gradient descent; this involves repeating Step 2 until the log loss can no longer get smaller).
def gradient_descent_step(X, y, weights, learning_rate):
    m = len(y)
    probabilities = calculate_probabilities(X, weights)
    gradient = np.dot(X.T, (probabilities - y)) / m
    new_weights = weights - learning_rate * gradient  # Create a new array for the updated weights
    return new_weights

# Perform one step of gradient descent (one of the simplest optimization algorithms)
learning_rate = 0.1

updated_weights = gradient_descent_step(X_train_np, y_train_np, initial_weights, learning_rate)

# Print initial and updated weights
print("\nInitial weights:")
for feature, weight in zip(['Bias'] + list(X_train.columns), initial_weights):
    print(f"{feature:11}: {weight:.2f}")

print("\nUpdated weights after one iteration:")
for feature, weight in zip(['Bias'] + list(X_train.columns), updated_weights):
    print(f"{feature:11}: {weight:.2f}")
# With sklearn, you can get the final weights (coefficients)
# and final bias (intercept) easily.
# The result is almost the same as doing it manually above.
from sklearn.linear_model import LogisticRegression

lr_clf = LogisticRegression(penalty=None, solver='saga')
lr_clf.fit(X_train, y_train)

coefficients = lr_clf.coef_
intercept = lr_clf.intercept_

y_train_prob = lr_clf.predict_proba(X_train)[:, 1]
loss = -np.mean(y_train * np.log(y_train_prob) + (1 - y_train) * np.log(1 - y_train_prob))

print(f"Final Weights & Bias: {coefficients[0].round(2)}, {round(intercept[0], 2)}")
print("Final Loss:", loss.round(3))
Once the model is trained:
1. For a new instance, calculate the probability with the final weights (also called coefficients), just like during the training step.
2. Interpret the output by looking at the probability: if p ≥ 0.5, predict class 1; otherwise, predict class 0.
# Calculate prediction probabilities
predicted_probs = lr_clf.predict_proba(X_test)[:, 1]
z_values = np.log(predicted_probs / (1 - predicted_probs))

result_df = pd.DataFrame({
    'ID': X_test.index,
    'Z-Values': z_values.round(3),
    'Probabilities': predicted_probs.round(3)
}).set_index('ID')

print(result_df)
# Make predictions
y_pred = lr_clf.predict(X_test)
print(y_pred)
Evaluation Step
result_df = pd.DataFrame({
    'ID': X_test.index,
    'Label': y_test,
    'Probabilities': predicted_probs.round(2),
    'Prediction': y_pred,
}).set_index('ID')

print(result_df)
Logistic regression has several important parameters that control its behavior:
1. Penalty: The type of regularization to use ('l1', 'l2', 'elasticnet', or 'none'). Regularization in logistic regression prevents overfitting by adding a penalty term to the model's loss function, which encourages simpler models.
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

regs = [None, 'l1', 'l2']
coeff_dict = {}

for reg in regs:
    lr_clf = LogisticRegression(penalty=reg, solver='saga')
    lr_clf.fit(X_train, y_train)
    coefficients = lr_clf.coef_
    intercept = lr_clf.intercept_
    predicted_probs = lr_clf.predict_proba(X_train)[:, 1]
    loss = -np.mean(y_train * np.log(predicted_probs) + (1 - y_train) * np.log(1 - predicted_probs))
    predictions = lr_clf.predict(X_test)
    accuracy = accuracy_score(y_test, predictions)

    coeff_dict[reg] = {
        'Coefficients': coefficients,
        'Intercept': intercept,
        'Loss': loss,
        'Accuracy': accuracy
    }

for reg, vals in coeff_dict.items():
    print(f"{reg}: Coeff: {vals['Coefficients'][0].round(2)}, Intercept: {vals['Intercept'].round(2)}, Loss: {vals['Loss'].round(3)}, Accuracy: {vals['Accuracy'].round(3)}")
2. Regularization Strength (C): Controls the trade-off between fitting the training data and keeping the model simple. A smaller C means stronger regularization.
# List of regularization strengths to try for L1
strengths = [0.001, 0.01, 0.1, 1, 10, 100]

coeff_dict = {}

for strength in strengths:
    lr_clf = LogisticRegression(penalty='l1', C=strength, solver='saga')
    lr_clf.fit(X_train, y_train)
    coefficients = lr_clf.coef_
    intercept = lr_clf.intercept_

    predicted_probs = lr_clf.predict_proba(X_train)[:, 1]
    loss = -np.mean(y_train * np.log(predicted_probs) + (1 - y_train) * np.log(1 - predicted_probs))
    predictions = lr_clf.predict(X_test)
    accuracy = accuracy_score(y_test, predictions)

    coeff_dict[f'L1_{strength}'] = {
        'Coefficients': coefficients[0].round(2),
        'Intercept': round(intercept[0], 2),
        'Loss': round(loss, 3),
        'Accuracy': round(accuracy * 100, 2)
    }

print(pd.DataFrame(coeff_dict).T)
# List of regularization strengths to try for L2
strengths = [0.001, 0.01, 0.1, 1, 10, 100]

coeff_dict = {}

for strength in strengths:
    lr_clf = LogisticRegression(penalty='l2', C=strength, solver='saga')
    lr_clf.fit(X_train, y_train)
    coefficients = lr_clf.coef_
    intercept = lr_clf.intercept_

    predicted_probs = lr_clf.predict_proba(X_train)[:, 1]
    loss = -np.mean(y_train * np.log(predicted_probs) + (1 - y_train) * np.log(1 - predicted_probs))
    predictions = lr_clf.predict(X_test)
    accuracy = accuracy_score(y_test, predictions)

    coeff_dict[f'L2_{strength}'] = {
        'Coefficients': coefficients[0].round(2),
        'Intercept': round(intercept[0], 2),
        'Loss': round(loss, 3),
        'Accuracy': round(accuracy * 100, 2)
    }

print(pd.DataFrame(coeff_dict).T)
3. Solver: The algorithm to use for optimization ('liblinear', 'newton-cg', 'lbfgs', 'sag', 'saga'). Some regularization types may require a particular solver.
4. Max Iterations: The maximum number of iterations for the solver to converge.
For our golf dataset, we might start with the 'l2' penalty, the 'liblinear' solver, and C=1.0 as a baseline, as sketched below.
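A minimal sketch of that baseline, assuming the same X_train/X_test split prepared above (the raised max_iter is just a precaution against convergence warnings, not a tuned value):
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Baseline suggested above: L2 penalty, liblinear solver, C=1.0
baseline_clf = LogisticRegression(penalty='l2', C=1.0, solver='liblinear', max_iter=1000)
baseline_clf.fit(X_train, y_train)

print(f"Baseline test accuracy: {accuracy_score(y_test, baseline_clf.predict(X_test)):.3f}")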
Like any algorithm in machine learning, logistic regression has its strengths and limitations.
Pros:
- Simplicity: Easy to implement and understand.
- Interpretability: The weights directly show the importance of each feature.
- Efficiency: Doesn't require too much computational power.
- Probabilistic Output: Provides probabilities rather than just classifications.
Cons:
- Linearity Assumption: Assumes a linear relationship between the features and the log-odds of the outcome (see the sketch after this list).
- Feature Independence: Assumes features are not highly correlated.
- Limited Complexity: May underfit in cases where the decision boundary is highly non-linear.
- Requires More Data: Needs a relatively large sample size for stable results.
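To see the linearity assumption concretely: the model's log-odds are simply a linear function of the features. A small sketch, reusing the most recently fitted lr_clf from the snippets above:
# The log-odds (logit) predicted by the model are linear in the features:
# log(p / (1 - p)) = intercept + X @ coefficients
manual_log_odds = lr_clf.intercept_ + X_test.to_numpy() @ lr_clf.coef_[0]
manual_probs = 1 / (1 + np.exp(-manual_log_odds))

# This should match sklearn's predict_proba output
print(np.allclose(manual_probs, lr_clf.predict_proba(X_test)[:, 1]))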
In our golf example, logistic regression might provide a clear, interpretable model of how each weather factor influences the decision to play golf. However, it might struggle if the decision involves complex interactions between weather conditions that can't be captured by a linear model.
Logistic regression shines as a powerful yet simple classification tool. It stands out for its ability to handle complex data while remaining easy to interpret. Unlike some other basic models, it provides smooth probability estimates and works well with many features. In the real world, from predicting customer behavior to medical diagnoses, logistic regression often performs surprisingly well. It's not just a stepping stone; it's a reliable model that can match more complex models in many situations.
# Import required libraries
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import accuracy_score

# Load the dataset
dataset_dict = {
'Outlook': ['sunny', 'sunny', 'overcast', 'rainy', 'rainy', 'rainy', 'overcast', 'sunny', 'sunny', 'rainy', 'sunny', 'overcast', 'overcast', 'rainy', 'sunny', 'overcast', 'rainy', 'sunny', 'sunny', 'rainy', 'overcast', 'rainy', 'sunny', 'overcast', 'sunny', 'overcast', 'rainy', 'overcast'],
'Temperature': [85.0, 80.0, 83.0, 70.0, 68.0, 65.0, 64.0, 72.0, 69.0, 75.0, 75.0, 72.0, 81.0, 71.0, 81.0, 74.0, 76.0, 78.0, 82.0, 67.0, 85.0, 73.0, 88.0, 77.0, 79.0, 80.0, 66.0, 84.0],
'Humidity': [85.0, 90.0, 78.0, 96.0, 80.0, 70.0, 65.0, 95.0, 70.0, 80.0, 70.0, 90.0, 75.0, 80.0, 88.0, 92.0, 85.0, 75.0, 92.0, 90.0, 85.0, 88.0, 65.0, 70.0, 60.0, 95.0, 70.0, 78.0],
'Wind': [False, True, False, False, False, True, True, False, False, False, True, True, False, True, True, False, False, True, False, True, True, False, True, False, False, True, False, False],
'Play': ['No', 'No', 'Yes', 'Yes', 'Yes', 'No', 'Yes', 'No', 'Yes', 'Yes', 'Yes', 'Yes', 'Yes', 'No', 'No', 'Yes', 'Yes', 'No', 'No', 'No', 'Yes', 'Yes', 'Yes', 'Yes', 'Yes', 'Yes', 'No', 'Yes']
}
df = pd.DataFrame(dataset_dict)
# Prepare data: encode categorical variables
df = pd.get_dummies(df, columns=['Outlook'], prefix='', prefix_sep='', dtype=int)
df['Wind'] = df['Wind'].astype(int)
df['Play'] = (df['Play'] == 'Yes').astype(int)
# Split data into training and testing sets
X, y = df.drop(columns='Play'), df['Play']
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.5, shuffle=False)
# Scale numerical features
scaler = StandardScaler()
float_cols = X_train.select_dtypes(include=['float64']).columns
X_train[float_cols] = scaler.fit_transform(X_train[float_cols])
X_test[float_cols] = scaler.transform(X_test[float_cols])
# Train the model
lr_clf = LogisticRegression(penalty='l2', C=1, solver='saga')
lr_clf.fit(X_train, y_train)
# Make predictions
y_pred = lr_clf.predict(X_test)
# Evaluate the model
print(f"Accuracy: {accuracy_score(y_test, y_pred)}")