Getting Started with Python Machine Learning: Building Your First Linear Regression Model from Scratch
Release time: 2024-12-15 15:33:28
Copyright Statement: This article is an original work of the website and follows the CC 4.0 BY-SA copyright agreement. Please include the original source link and this statement when reprinting.

Article link: https://quirkpulse.com/en/content/aid/2928

Origins

Do you often hear about machine learning but don't know where to start? As a Python programmer, I deeply understand this feeling. I remember feeling lost when I first encountered machine learning, facing complex concepts and vast code libraries. Today, let's enter the world of machine learning step by step, starting with the most basic linear regression.

Fundamentals

Before we start coding, we need to understand some basic concepts. What is machine learning, essentially? It's like teaching computers to learn patterns from data, much as we learned to read as children: the more examples we saw, the better we recognized the words.

Supervised learning is one of the most fundamental methods in machine learning. Imagine teaching a child to recognize fruits - you point to an apple and say "this is an apple," point to a banana and say "this is a banana" - this is the process of supervised learning. In this process, you play the role of a "supervisor," constantly correcting and guiding.
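
In code, "supervised" simply means the training data comes with the answers attached. Here's a tiny hypothetical example of such labeled data (the values are made up for illustration):

# Each row pairs input features with a known label - the "supervision"
fruits = [
    {"weight_g": 150, "color": "red", "label": "apple"},
    {"weight_g": 120, "color": "yellow", "label": "banana"},
    {"weight_g": 160, "color": "green", "label": "apple"},
]

# The learner's job: map (weight_g, color) to the label for unseen fruit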

Linear regression is the simplest and most classic algorithm in supervised learning. It tries to find the linear relationship between input features (like house size) and output results (like house price). It's like drawing a line that best represents the overall trend in data points.
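
In its simplest form, the model is just the equation of a line: y = w * x + b, where the slope w says how much the price changes per extra square meter and the intercept b is the baseline price. Training means finding the w and b that minimize the average squared distance between the line and the data points (the mean squared error).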

Preparation

Before writing code, we need to prepare the necessary tools. Python's scientific computing ecosystem is very powerful, mainly including these libraries:

NumPy is the fundamental library for scientific computing, providing efficient array operations. It's like our calculator, helping us handle various mathematical operations.

Pandas is like our Excel, helping us manage and analyze data. It can easily handle various data file formats, perform data cleaning and transformation.
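
For instance, loading a dataset and getting a quick overview takes only a couple of lines (house_prices.csv is a hypothetical file name here):

import pandas as pd

# Load a CSV, peek at the first rows, and print summary statistics
df = pd.read_csv("house_prices.csv")  # hypothetical file
print(df.head())
print(df.describe())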

Scikit-learn is one of the most important libraries for machine learning, providing rich algorithm implementations. It's like a powerful toolbox containing various ready-to-use machine learning tools.

Practice

Let's implement a complete linear regression project together. Suppose we want to predict house prices, which is a classic regression problem.

First, we need to import the necessary libraries:

import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
import matplotlib.pyplot as plt

Data

Data is the fuel of machine learning. In real work, I've found that data processing often takes up 80% of project time. For this tutorial, we'll generate a simple synthetic dataset so the example stays fully reproducible:

# Generate 100 samples following y = 4 + 3x plus Gaussian noise
np.random.seed(42)
X = 2 * np.random.rand(100, 1)
y = 4 + 3 * X + np.random.randn(100, 1)

# Hold out 20% of the samples as a test set
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

Modeling

Model training is the most exciting part of machine learning. I still remember the thrill of seeing my first model make successful predictions:

# Fit the model on the training set
model = LinearRegression()
model.fit(X_train, y_train)

# Predict on the held-out test set
y_pred = model.predict(X_test)

# Evaluate with mean squared error and the R^2 coefficient of determination
mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)
print(f"MSE: {mse:.3f}, R2: {r2:.3f}")
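
Because we generated the data ourselves with a known slope of 3 and intercept of 4, we can sanity-check what the model actually learned:

# The fitted parameters should land close to the true values (w=3, b=4)
print(f"Learned intercept: {model.intercept_[0]:.3f}")  # expect roughly 4
print(f"Learned slope: {model.coef_[0][0]:.3f}")        # expect roughly 3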

Visualization

Data visualization helps us understand model performance more intuitively:

# Plot the actual test points against the fitted regression line
plt.figure(figsize=(10, 6))
plt.scatter(X_test, y_test, color='blue', label='Actual Values')
plt.plot(X_test, y_pred, color='red', label='Predicted Values')
plt.xlabel('Features')
plt.ylabel('Target Values')
plt.title('Linear Regression Prediction Results')
plt.legend()
plt.grid(True)
plt.show()

Optimization

Model optimization is an iterative process. In practice, I've found these points particularly important:

Feature engineering is key to improving model performance. Like cooking, the same ingredients can taste different with different preparation methods. We can try:

  • Handling outliers
  • Feature scaling (shown next)
  • Feature selection
  • Creating interaction features (see the sketch after the scaling example)

from sklearn.preprocessing import StandardScaler

# Fit the scaler on the training data only, then apply the same
# transformation to the test data to avoid leaking test-set statistics
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Retrain the model on the scaled features
model = LinearRegression()
model.fit(X_train_scaled, y_train)
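
As for interaction features, scikit-learn's PolynomialFeatures is one convenient option. A minimal sketch (our demo data has only one feature, so picture a multi-feature X here):

from sklearn.preprocessing import PolynomialFeatures

# degree=2 with interaction_only=True adds pairwise products of features
poly = PolynomialFeatures(degree=2, interaction_only=True, include_bias=False)
X_train_poly = poly.fit_transform(X_train_scaled)
X_test_poly = poly.transform(X_test_scaled)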

Application

Theory must ultimately be put into practice. Let's look at a complete house price prediction example:

def predict_house_price(area, bedrooms, age):
    # Note: this assumes `scaler` and `model` were fit on these three
    # features (area, bedrooms, age), not on the single-feature demo above
    features = np.array([[area, bedrooms, age]])
    features_scaled = scaler.transform(features)

    # Predict the price for this single house
    price = model.predict(features_scaled)[0]

    return price


house_features = {
    'area': 120,  # square meters
    'bedrooms': 3,
    'age': 5  # years
}

predicted_price = predict_house_price(
    house_features['area'],
    house_features['bedrooms'],
    house_features['age']
)
print(f"Predicted price: {predicted_price}")

Experience

In practice, I've summarized some important experiences:

Data quality is crucial. Garbage in, garbage out - this saying is particularly applicable in machine learning. I recommend spending time understanding and cleaning data before starting modeling.
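
As a starting point, a quick first-pass check with pandas might look like this (df stands for whatever DataFrame you've loaded):

# Assuming df is your loaded DataFrame:
# count missing values per column and drop obvious problems
print(df.isna().sum())
df = df.drop_duplicates().dropna()

# Summary statistics often reveal outliers and unit mistakes
print(df.describe())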

Don't over-complicate model selection. Sometimes simple models achieve better results; I've encountered many cases where plain linear regression was more accurate than a complex deep learning model.

Cross-validation is necessary. It helps us evaluate model performance more accurately and avoid overfitting.

from sklearn.model_selection import cross_val_score

# 5-fold cross-validation; for regressors the default scoring is R^2
cv_scores = cross_val_score(model, X_train_scaled, y_train, cv=5)
print(f"Cross-validation scores: {cv_scores.mean():.3f} (+/- {cv_scores.std() * 2:.3f})")

Extension

Once you've mastered linear regression, you've got the key to machine learning. Next, you can explore more interesting algorithms:

  • Logistic Regression: used for classification problems, like predicting whether a user will click an ad.
  • Decision Trees: like a decision flowchart, suitable for handling non-linear relationships.
  • Random Forests: a combination of multiple decision trees, usually achieving better results (tried below).

from sklearn.ensemble import RandomForestRegressor

# Train a forest of 100 trees (trees don't need feature scaling, but
# reusing the scaled features keeps the comparison consistent)
rf_model = RandomForestRegressor(n_estimators=100, random_state=42)
rf_model.fit(X_train_scaled, y_train.ravel())  # ravel() flattens y to 1-D

# Compare both models on the same test set
rf_pred = rf_model.predict(X_test_scaled)
rf_r2 = r2_score(y_test, rf_pred)
print(f"Random Forest R2 Score: {rf_r2:.3f}")
print(f"Linear Regression R2 Score: {r2:.3f}")

Reflection

Learning machine learning has transformed my understanding of data. It's not just a technology, but a way of thinking. Through continuous practice, we can:

  • Develop data thinking and learn to discover patterns from data
  • Improve problem-solving abilities and learn to verify hypotheses scientifically
  • Expand career development opportunities and seize more chances in the AI era

Finally, I want to say that while the learning curve for machine learning can be steep, with patience and curiosity you can absolutely reach the top. Are you ready to start this journey?

If you have any questions about any part of the article or want to understand a concept more deeply, feel free to leave a comment. Let's continue exploring together on the path of machine learning.
