Skip to content

Creating a Predictor Using Machine Learning from a Pandas DataFrame in Biostatistics

Introduction

In biostatistics, leveraging machine learning allows researchers to extract valuable insights from complex datasets. This blog post will guide you through the process of creating a single predictor using machine learning techniques with data organized in a Pandas DataFrame, making use of libraries such as Scikit-learn, NumPy, and Matplotlib.

Understanding the Dataset

For demonstration purposes, we will create a simulated dataset focused on health metrics commonly found in biostatistics research. The dataset will encompass the following columns: - Age: Age of the patient (years) - BMI: Body Mass Index (kg/m²) - Cholesterol: Cholesterol level (mg/dL) - Diabetes: Target variable indicating presence of diabetes (1 for yes, 0 for no)

Creating the Sample DataFrame

To begin our analysis, let's create a sample DataFrame using the following Python code:

import pandas as pd
import numpy as np

data = {
    'Age': np.random.randint(20, 70, size=100),
    'BMI': np.random.uniform(18.5, 40.0, size=100),
    'Cholesterol': np.random.randint(150, 300, size=100),
    'Diabetes': np.random.randint(0, 2, size=100)
}

df = pd.DataFrame(data)
print("Sample DataFrame:")
print(df.head())

Data Preprocessing

Before building the model, preprocessing the data is essential: 1. Feature Selection: Choose independent variables to predict the target variable. 2. Data Splitting: Divide the DataFrame into training and testing sets.

Splitting the Data

from sklearn.model_selection import train_test_split

# Features and target variable
X = df[['Age', 'BMI', 'Cholesterol']]
y = df['Diabetes']

# Splitting the dataset
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
print(f"Training data size: {X_train.shape[0]}, Testing data size: {X_test.shape[0]}")

Building the Predictor

We’ll employ a Logistic Regression model suitable for binary classification problems such as predicting diabetes.

Training the Model

from sklearn.linear_model import LogisticRegression

# Initialize the model
model = LogisticRegression()

# Train the model
model.fit(X_train, y_train)
print("Model training complete.")

Making Predictions

After training the model, we can make predictions based on the test dataset:

# Making predictions
y_pred = model.predict(X_test)
print("Predictions on test set complete.")

Model Evaluation

To assess the effectiveness of our predictor, we’ll evaluate it using accuracy, precision, and confusion matrix.

Evaluating the Model

from sklearn.metrics import classification_report, confusion_matrix

# Model evaluation
print("Confusion Matrix:")
print(confusion_matrix(y_test, y_pred))")
print(classification_report(y_test, y_pred))

Visualizing Predictions

Creating a confusion matrix visualization can enhance understanding of the model's performance:

import matplotlib.pyplot as plt
import seaborn as sns

conf_matrix = confusion_matrix(y_test, y_pred)
sns.heatmap(conf_matrix, annot=True, fmt='d', cmap='Blues')
plt.title('Confusion Matrix')
plt.xlabel('Predicted')
plt.ylabel('Actual')
plt.show()

Incorporating techniques from the biostatistics articles, such as merging, appending, and concatenating functions, enriches data management. For example, utilizing these techniques prepares datasets for machine learning. Please refer to the related posts on Merging and Joining Data and Adding New Data.

Conclusion

In this article, we explored how to create a single predictor for binary classification using Machine Learning from a Pandas DataFrame structure. Understanding these fundamental steps is crucial for biostatistics applications and the methodology can easily be adapted to various datasets and predictive modeling techniques.

References

  • Géron, A. (2019). Hands-On Machine Learning with Scikit-Learn, Keras, and TensorFlow. O'Reilly Media.
  • McKinney, W. (2010). Data Analysis with Pandas. O'Reilly Media.
  • VanderPlas, J. (2016). Python Data Science Handbook. O'Reilly Media.
  • Altman, D.G., & Bland, J.M. (1999). Statistics in Medical Journals: Developments in the 1980s and 1990s. BMJ.