Summary: This article gives Python developers a cheat sheet of core machine learning and deep learning code, covering the full workflow of data preprocessing, model building, training, and evaluation, with practical examples for scikit-learn, TensorFlow/Keras, and PyTorch to help you implement AI projects quickly.
In today's era of rapid AI development, Python has become the language of choice for machine learning and deep learning thanks to its rich ecosystem of libraries (scikit-learn, TensorFlow, PyTorch). In real projects, however, developers often struggle to remember boilerplate code and to choose between frameworks. This article assembles a cheat sheet covering the full workflow from data preprocessing through model building, training, and evaluation, together with comparisons of the three mainstream frameworks and practical tips, to help developers deliver AI projects efficiently.
Data quality directly determines model performance; preprocessing includes data cleaning, feature engineering, and standardization.
Use Pandas to load the data and run a first-pass analysis:
```python
import pandas as pd

data = pd.read_csv('dataset.csv')
print(data.head())          # first 5 rows
print(data.describe())      # summary statistics
print(data.isnull().sum())  # count missing values per column
```
Key point: describe() quickly reveals the distribution of numeric columns, while isnull() locates missing data.
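Missing values located this way can then be imputed or dropped. A minimal sketch with pandas (the tiny frame and the column names `age` and `city` are hypothetical stand-ins for the loaded dataset):

```python
import pandas as pd

# hypothetical example frame; in practice this is the loaded dataset
data = pd.DataFrame({'age': [25, None, 31], 'city': ['NY', 'LA', None]})

data['age'] = data['age'].fillna(data['age'].median())  # impute numeric column with its median
data = data.dropna(subset=['city'])                     # drop rows missing a categorical value
print(data.isnull().sum().sum())  # 0 missing values remain
```

Median imputation is robust to outliers; dropping rows is acceptable only when few rows are affected.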
```python
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
X_scaled = scaler.fit_transform(X_train)   # fit on the training set, then transform
X_test_scaled = scaler.transform(X_test)   # transform only; never fit on the test set
```
For categorical features, apply one-hot encoding:
```python
from sklearn.preprocessing import OneHotEncoder

encoder = OneHotEncoder(sparse_output=False)
X_encoded = encoder.fit_transform(X_cat)
```
StandardScaler suits roughly normally distributed data, while MinMaxScaler works better for non-Gaussian distributions; OneHotEncoder prevents categorical features from introducing a spurious ordering.
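For comparison, MinMaxScaler rescales each feature to a fixed range (default [0, 1]); a minimal sketch with a hypothetical toy matrix:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

X = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]])  # hypothetical data
scaler = MinMaxScaler()               # default feature_range=(0, 1)
X_scaled = scaler.fit_transform(X)    # column-wise (x - min) / (max - min)
print(X_scaled[:, 0])  # [0.  0.5 1. ]
```

As with StandardScaler, fit only on the training split and reuse `scaler.transform` on the test split.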
```python
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
```
Use case: in image classification, rotations and flips expand the dataset and improve the model's generalization.
```python
from tensorflow.keras.preprocessing.image import ImageDataGenerator

datagen = ImageDataGenerator(rotation_range=20, horizontal_flip=True)
train_generator = datagen.flow(X_train, y_train, batch_size=32)
```
Choose a model that matches the task type (classification or regression), and compare how the frameworks implement it.
```python
from sklearn.linear_model import LinearRegression

model = LinearRegression()
model.fit(X_train, y_train)
print(model.coef_)  # learned weights
```
```python
from sklearn.ensemble import RandomForestClassifier

rf = RandomForestClassifier(n_estimators=100, max_depth=5)
rf.fit(X_train, y_train)
print(rf.feature_importances_)  # feature importances
```

Tuning tip: search for the best parameter combination with GridSearchCV.
```python
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense

model = Sequential([
    Dense(64, activation='relu', input_shape=(n_features,)),
    Dense(32, activation='relu'),
    Dense(1, activation='sigmoid')  # binary classification output
])
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])
model.fit(X_train, y_train, epochs=50, batch_size=32, validation_split=0.2)
```
```python
import torch
import torch.nn as nn

class CNN(nn.Module):
    def __init__(self):
        super().__init__()
        self.conv1 = nn.Conv2d(1, 32, 3)        # 1 input channel, 32 filters, 3x3 kernel
        self.pool = nn.MaxPool2d(2, 2)
        self.fc1 = nn.Linear(32 * 13 * 13, 10)  # for 28x28 input: (28 - 2) / 2 = 13

    def forward(self, x):
        x = self.pool(torch.relu(self.conv1(x)))
        x = x.view(-1, 32 * 13 * 13)
        return torch.softmax(self.fc1(x), dim=1)
```
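Unlike Keras's model.fit, PyTorch training is written as an explicit loop. A minimal sketch of the standard four steps, using a hypothetical toy model and random data rather than the CNN above:

```python
import torch
import torch.nn as nn

# hypothetical toy model and data to illustrate the loop structure
model = nn.Linear(4, 2)
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

X = torch.randn(8, 4)
y = torch.randint(0, 2, (8,))

for epoch in range(5):
    optimizer.zero_grad()          # reset accumulated gradients
    loss = criterion(model(X), y)  # forward pass + loss
    loss.backward()                # backpropagation
    optimizer.step()               # parameter update
```

Note that nn.CrossEntropyLoss expects raw logits, which is why production PyTorch models usually omit the softmax in forward() and apply it only at inference time.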
**Framework choice**: Keras is well suited to rapid prototyping; PyTorch offers a more flexible dynamic computation graph.

### 3. Model Saving and Loading

- **Scikit-learn**:

```python
import joblib

joblib.dump(rf, 'random_forest.pkl')          # save
loaded_rf = joblib.load('random_forest.pkl')  # load
```
- **Keras**:

```python
model.save('my_model.h5')  # save the full model (architecture + weights)
loaded_model = tf.keras.models.load_model('my_model.h5')
```
- **PyTorch**:

```python
torch.save(model.state_dict(), 'model_weights.pth')  # save weights only
model.load_state_dict(torch.load('model_weights.pth'))
```
Diagnose model problems through metric computation and visualization tools.
```python
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

y_pred = model.predict(X_test)
print(accuracy_score(y_test, y_pred))
print(classification_report(y_test, y_pred))
print(confusion_matrix(y_test, y_pred))
```
```python
from sklearn.metrics import mean_squared_error, r2_score

mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)
```
```python
import matplotlib.pyplot as plt

plt.plot(history.history['loss'], label='train_loss')
plt.plot(history.history['val_loss'], label='val_loss')
plt.legend()
plt.show()
```
```python
import seaborn as sns

cm = confusion_matrix(y_test, y_pred)
sns.heatmap(cm, annot=True, fmt='d')
```
```python
from sklearn.model_selection import GridSearchCV

param_grid = {'n_estimators': [50, 100, 200], 'max_depth': [3, 5, 7]}
grid_search = GridSearchCV(RandomForestClassifier(), param_grid, cv=5)
grid_search.fit(X_train, y_train)
print(grid_search.best_params_)
```
Purpose: L2 regularization curbs overfitting, while Dropout layers randomly deactivate neurons to improve generalization.
```python
from tensorflow.keras import regularizers
from tensorflow.keras.layers import Dense, Dropout

model.add(Dense(64, activation='relu', kernel_regularizer=regularizers.l2(0.01)))
model.add(Dropout(0.5))  # randomly deactivate 50% of the units during training
```
```python
import tensorflow as tf

print(tf.config.list_physical_devices('GPU'))  # list available GPUs
```
```python
import torch

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
model.to(device)
```
```python
strategy = tf.distribute.MultiWorkerMirroredStrategy()
with strategy.scope():
    model = create_model()  # define the model inside the strategy scope
```
Export TensorFlow models in the SavedModel format and deploy them via gRPC or a REST API.
```python
from flask import Flask, request, jsonify

app = Flask(__name__)

@app.route('/predict', methods=['POST'])
def predict():
    data = request.json['data']
    prediction = loaded_rf.predict(data)  # model loaded earlier with joblib
    return jsonify({'prediction': prediction.tolist()})
```
The cheat sheet assembled in this article covers the full workflow from data preprocessing to model deployment; developers can choose the framework that best fits the task at hand.
Practical tips:
By mastering these core snippets and techniques, developers can significantly improve both the development efficiency and the model performance of their AI projects.