Case Study: Predicting Flight Delays with Blackhole

          BML: Full-Featured AI Development Platform

          Using Blackhole to predict delays of US domestic flights


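          The data are US domestic airline flight records from 1987 to 2008. The task is to predict whether a flight's departure is delayed by 15 minutes or more. This is a binary classification task, and the LogisticRegression algorithm is used.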

          Case source: https://github.com/szilard/benchm-ml
          Dataset: https://dataverse.harvard.edu/dataset.xhtml?persistentId=doi:10.7910/DVN/HG7NV7

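          Preparing the Blackhole environment

          Blackhole is not installed on the CodeLab platform by default. Install blackhole first from the "Package Management" page in the left navigation bar.
          For more on Blackhole usage and examples, see "Blackhole Introduction and Basic Usage".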

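          Preparing the dataset

          This case uses three years of data, 2005 through 2007, for analysis and training. To suit different usage scenarios, two datasets are provided:
          Small dataset: about 50,000 rows sampled from each year (149,997 rows in total, per the df.info() output in section 3.1), about 13 MB (the zip download is about 3.5 MB), so that users can try Blackhole quickly in a local environment.
          Large dataset: the full three years of data, about 22 million rows and roughly 2 GB, so that users can experience Blackhole processing large data volumes through device-cloud synchronization.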

          ## Small dataset: https://codelab-dataset.cdn.bcebos.com/small/competition/airline.zip
          ## Large dataset: https://codelab-dataset.cdn.bcebos.com/full/competition/airline.zip
          ! wget https://codelab-dataset.cdn.bcebos.com/small/competition/airline.zip && unzip -o airline.zip
          --2021-04-22 05:11:10--  https://codelab-dataset.cdn.bcebos.com/small/competition/airline.zip
          Resolving codelab-dataset.cdn.bcebos.com (codelab-dataset.cdn.bcebos.com)... 123.125.132.35
          Connecting to codelab-dataset.cdn.bcebos.com (codelab-dataset.cdn.bcebos.com)|123.125.132.35|:443... connected.
          HTTP request sent, awaiting response... 200 OK
          Length: 3644017 (3.5M) [application/zip]
          Saving to: 'airline.zip'
          
          airline.zip         100%[===================>]   3.47M  5.52MB/s    in 0.6s    
          
          2021-04-22 05:11:11 (5.52 MB/s) - 'airline.zip' saved [3644017/3644017]
          
          Archive:  airline.zip
             creating: airline/
            inflating: airline/2006.csv        
            inflating: airline/2007.csv        
            inflating: airline/2005.csv        
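
Outside a notebook, where the `!` shell escape is not available, the same download-and-extract step can be sketched with the Python standard library. The helper names `download` and `extract_archive` are illustrative and are not part of Blackhole.

```python
import os
import urllib.request
import zipfile

SMALL_DATASET_URL = "https://codelab-dataset.cdn.bcebos.com/small/competition/airline.zip"


def download(url, zip_path):
    """Fetch url to zip_path (the counterpart of the wget call)."""
    urllib.request.urlretrieve(url, zip_path)
    return zip_path


def extract_archive(zip_path, target_dir="."):
    """Unpack zip_path into target_dir, overwriting existing files like `unzip -o`."""
    with zipfile.ZipFile(zip_path) as archive:
        archive.extractall(target_dir)
        return sorted(archive.namelist())


# Typical use (requires network access):
# extract_archive(download(SMALL_DATASET_URL, "airline.zip"))
```

`extract_archive` returns the sorted member names, which gives a quick check that all three yearly CSV files arrived.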

          Step 1: Import Blackhole dependencies

          import os
          import numpy as np
          
          import blackhole.dataframe as bhdf
          from blackhole.ml.linear_model import LogisticRegression
          from blackhole.ml.model_selection import train_test_split
          from blackhole.ml.metrics import roc_auc_score
          
          import warnings
          warnings.filterwarnings('ignore')

          Step 2: Import the data

          Read the three yearly files and merge them into a single DataFrame.

          data_dir = "./airline/"
          df = bhdf.DataFrame()
          # Append every CSV file in the directory to one DataFrame
          for file in os.listdir(data_dir):
              if file.endswith(".csv"):
                  df = df.append(bhdf.read_csv(os.path.join(data_dir, file)))
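
For comparison, this is how the same merge is commonly written in plain pandas, reading all files first and concatenating once. It is a pandas sketch, not Blackhole code: this document does not show whether `blackhole.dataframe` offers a `concat` function, and the helper name `load_airline_csvs` is made up for illustration.

```python
import os

import pandas as pd


def load_airline_csvs(data_dir):
    """Read every CSV in data_dir and stack them into one DataFrame."""
    paths = [
        os.path.join(data_dir, name)
        for name in sorted(os.listdir(data_dir))
        if name.endswith(".csv")
    ]
    frames = [pd.read_csv(path) for path in paths]
    # ignore_index=True builds a fresh 0..n-1 index instead of repeating
    # each file's own row numbers.
    return pd.concat(frames, ignore_index=True)
```

Without `ignore_index=True`, the combined frame keeps each file's row labels, which is the pattern visible in the `df.info()` output of section 3.1 (149,997 entries with labels running only from 0 to 49998).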

          Step 3: Data analysis and feature engineering

          3.1 Inspecting the data

          df.head()
          Year Month DayofMonth DayOfWeek DepTime CRSDepTime ArrTime CRSArrTime UniqueCarrier FlightNum TailNum ActualElapsedTime CRSElapsedTime AirTime ArrDelay DepDelay Origin Dest Distance TaxiIn TaxiOut Cancelled CancellationCode Diverted CarrierDelay WeatherDelay NASDelay SecurityDelay LateAircraftDelay
          0 2007 True 1 1 1232 1225 1341 1340 WN 2891 N351 69 75 54 1 7 SMF ONT 389 4 11 False NaN False 0 0 0 0 0
          1 2007 True 1 1 1918 1905 2043 2035 WN 462 N370 85 90 74 8 13 SMF PDX 479 5 6 False NaN False 0 0 0 0 0
          2 2007 True 1 1 2206 2130 2334 2300 WN 1229 N685 88 90 73 34 36 SMF PDX 479 6 9 False NaN False 3 0 0 0 31
          3 2007 True 1 1 1230 1200 1356 1330 WN 1355 N364 86 90 75 26 30 SMF PDX 479 3 8 False NaN False 23 0 0 0 3
          4 2007 True 1 1 831 830 957 1000 WN 2278 N480 86 90 74 -3 1 SMF PDX 479 3 9 False NaN False 0 0 0 0 0
          %%time
          df.info()
          <class 'blackhole.dataframe.frame.DataFrame'>
          Index: 149997 entries, 0 to 49998
          Data columns (total 29 columns):
           #   Column             Non-Null Count   Dtype 
          ---  ------             --------------   ----- 
           0   Year               149997 non-null  int32 
           1   Month              149997 non-null  bool  
           2   DayofMonth         149997 non-null  int32 
           3   DayOfWeek          149997 non-null  int32 
           4   DepTime            146393 non-null  int32 
           5   CRSDepTime         149997 non-null  int32 
           6   ArrTime            146001 non-null  int32 
           7   CRSArrTime         149997 non-null  int32 
           8   UniqueCarrier      149997 non-null  object
           9   FlightNum          149997 non-null  int32 
           10  TailNum            149997 non-null  object
           11  ActualElapsedTime  146001 non-null  int32 
           12  CRSElapsedTime     149997 non-null  int32 
           13  AirTime            146001 non-null  int32 
           14  ArrDelay           146001 non-null  int32 
           15  DepDelay           146393 non-null  int32 
           16  Origin             149997 non-null  object
           17  Dest               149997 non-null  object
           18  Distance           149997 non-null  int32 
           19  TaxiIn             149997 non-null  int32 
           20  TaxiOut            149997 non-null  int32 
           21  Cancelled          149997 non-null  bool  
           22  CancellationCode   3604 non-null    object
           23  Diverted           149997 non-null  bool  
           24  CarrierDelay       149997 non-null  int32 
           25  WeatherDelay       149997 non-null  int32 
           26  NASDelay           149997 non-null  int32 
           27  SecurityDelay      149997 non-null  int32 
           28  LateAircraftDelay  149997 non-null  int32 
          dtypes: bool(3), int32(21), object(5)
          CPU times: user 17.5 s, sys: 40.8 ms, total: 17.5 s
          Wall time: 651 ms

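          3.2 Converting types

          The Month column was inferred as bool above (every row in this sample appears to have Month equal to 1, as the single Month_1 column in section 3.6 suggests), so cast it to an integer type.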

          df["Month"] = df["Month"].astype('int8')
          df.info()
          <class 'blackhole.dataframe.frame.DataFrame'>
          Index: 149997 entries, 0 to 49998
          Data columns (total 29 columns):
           #   Column             Non-Null Count   Dtype 
          ---  ------             --------------   ----- 
           0   Year               149997 non-null  int32 
           1   Month              149997 non-null  int8  
           2   DayofMonth         149997 non-null  int32 
           3   DayOfWeek          149997 non-null  int32 
           4   DepTime            146393 non-null  int32 
           5   CRSDepTime         149997 non-null  int32 
           6   ArrTime            146001 non-null  int32 
           7   CRSArrTime         149997 non-null  int32 
           8   UniqueCarrier      149997 non-null  object
           9   FlightNum          149997 non-null  int32 
           10  TailNum            149997 non-null  object
           11  ActualElapsedTime  146001 non-null  int32 
           12  CRSElapsedTime     149997 non-null  int32 
           13  AirTime            146001 non-null  int32 
           14  ArrDelay           146001 non-null  int32 
           15  DepDelay           146393 non-null  int32 
           16  Origin             149997 non-null  object
           17  Dest               149997 non-null  object
           18  Distance           149997 non-null  int32 
           19  TaxiIn             149997 non-null  int32 
           20  TaxiOut            149997 non-null  int32 
           21  Cancelled          149997 non-null  bool  
           22  CancellationCode   3604 non-null    object
           23  Diverted           149997 non-null  bool  
           24  CarrierDelay       149997 non-null  int32 
           25  WeatherDelay       149997 non-null  int32 
           26  NASDelay           149997 non-null  int32 
           27  SecurityDelay      149997 non-null  int32 
           28  LateAircraftDelay  149997 non-null  int32 
          dtypes: bool(2), int32(21), int8(1), object(5)

          3.3 Dropping rows with missing values

          Remove rows whose departure delay (DepDelay) is missing.

          %%time
          df.dropna(subset=['DepDelay'], inplace=True)
          CPU times: user 895 ms, sys: 61 µs, total: 896 ms
          Wall time: 36.5 ms
          df.shape
          (146393, 29)
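
The row count falls from 149,997 to 146,393, so 3,604 rows are dropped. That number equals the non-null count of `CancellationCode` in the `df.info()` output, which suggests the missing delays belong to cancelled flights. The pandas sketch below reproduces the check on a made-up five-row frame, assuming Blackhole's `dropna` behaves like the pandas method of the same name.

```python
import numpy as np
import pandas as pd

# Tiny stand-in for the airline data: a cancelled flight has no DepDelay.
flights = pd.DataFrame({
    "DepDelay": [7.0, np.nan, 36.0, np.nan, 1.0],
    "Cancelled": [False, True, False, True, False],
})

# Count the gaps first so the effect of dropna can be verified afterwards.
missing_before = int(flights["DepDelay"].isna().sum())
cleaned = flights.dropna(subset=["DepDelay"])

print(missing_before)  # 2
print(cleaned.shape)   # (3, 2)
```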

          3.4 Generating the delay label

          def is_dep_delayed_15min(x):
              if x >= 15:
                  return 1
              else:
                  return 0
          df["dep_delayed_15min"] = df["DepDelay"].apply(is_dep_delayed_15min)
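
The same label can usually be computed without a per-row Python function by comparing the whole column at once. The sketch below uses pandas. `astype` is used on a Blackhole column in section 3.2, but that Blackhole columns also support comparison operators is an assumption, so treat this as a pandas illustration.

```python
import pandas as pd

dep_delay = pd.Series([7, 13, 36, 30, 1, 15, 14])

# The comparison yields booleans; astype turns them into 0/1 labels.
dep_delayed_15min = (dep_delay >= 15).astype("int8")

print(dep_delayed_15min.tolist())  # [0, 0, 1, 1, 0, 1, 0]
```

A delay of exactly 15 minutes is labelled 1, matching the `>=` in `is_dep_delayed_15min`.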

          3.5 Dropping unused columns

          use_columns = ["Month", "DayofMonth", "DayOfWeek", "DepTime", "UniqueCarrier", "Origin", "Dest", "Distance","dep_delayed_15min"]
          df = df[use_columns]
          df.head()
          Month DayofMonth DayOfWeek DepTime UniqueCarrier Origin Dest Distance dep_delayed_15min
          0 1 1 1 1232 WN SMF ONT 389 0
          1 1 1 1 1918 WN SMF PDX 479 0
          2 1 1 1 2206 WN SMF PDX 479 1
          3 1 1 1 1230 WN SMF PDX 479 1
          4 1 1 1 831 WN SMF PDX 479 0

          3.6 One-hot encoding

          vars_categ = ["Month","DayofMonth","DayOfWeek","UniqueCarrier", "Origin", "Dest"]
          df = bhdf.get_dummies(df, columns=vars_categ)
          df.head()
          DepTime Distance dep_delayed_15min Month_1 DayofMonth_1 DayofMonth_2 DayofMonth_3 DayofMonth_4 DayofMonth_5 DayofMonth_6 DayofMonth_7 DayofMonth_8 DayofMonth_9 DayofMonth_10 DayofMonth_11 DayofMonth_12 DayofMonth_13 DayofMonth_14 DayofMonth_15 DayofMonth_16 DayofMonth_17 DayofMonth_18 DayofMonth_19 DayofMonth_20 DayofMonth_21 DayofMonth_22 DayofMonth_23 DayofMonth_24 DayofMonth_25 DayofMonth_26 DayofMonth_27 DayofMonth_28 DayofMonth_29 DayofMonth_30 DayofMonth_31 DayOfWeek_1 DayOfWeek_2 DayOfWeek_3 DayOfWeek_4 DayOfWeek_5 DayOfWeek_6 DayOfWeek_7 UniqueCarrier_UA UniqueCarrier_US UniqueCarrier_WN Origin_ABE Origin_ABQ Origin_ALB Origin_AMA Origin_ANC Origin_ATL Origin_AUS Origin_BDL Origin_BHM Origin_BIL Origin_BNA Origin_BOI Origin_BOS Origin_BTV Origin_BUF Origin_BUR Origin_BWI Origin_CHS Origin_CLE Origin_CLT Origin_CMH Origin_COS Origin_CRP Origin_CVG Origin_DAL Origin_DAY Origin_DCA Origin_DEN Origin_DFW Origin_DSM Origin_DTW Origin_EGE Origin_ELP Origin_EWR Origin_EYW Origin_FLL Origin_GEG Origin_GRR Origin_GSO Origin_GUC Origin_HNL Origin_HOU Origin_HRL Origin_IAD Origin_IAH Origin_ICT Origin_ILM Origin_IND Origin_ISP Origin_JAN Origin_JAX Origin_JFK Origin_KOA Origin_LAS Origin_LAX Origin_LBB Origin_LGA Origin_LIH Origin_LIT Origin_MAF Origin_MCI Origin_MCO Origin_MDT Origin_MDW Origin_MHT Origin_MIA Origin_MKE Origin_MSP Origin_MSY Origin_MTJ Origin_MYR Origin_OAK Origin_OGG Origin_OKC Origin_OMA Origin_ONT Origin_ORD Origin_ORF Origin_PBI Origin_PDX Origin_PHL Origin_PHX Origin_PIT Origin_PSP Origin_PVD Origin_PWM Origin_RDU Origin_RIC Origin_RNO Origin_ROC Origin_RSW Origin_SAN Origin_SAT Origin_SAV Origin_SDF Origin_SEA Origin_SFO Origin_SJC Origin_SJU Origin_SLC Origin_SMF Origin_SNA Origin_SRQ Origin_STL Origin_STT Origin_STX Origin_SYR Origin_TPA Origin_TUL Origin_TUS Dest_ABE Dest_ABQ Dest_ALB Dest_AMA Dest_ANC Dest_ATL Dest_AUS Dest_BDL Dest_BHM Dest_BIL Dest_BNA Dest_BOI Dest_BOS Dest_BTV Dest_BUF Dest_BUR Dest_BWI Dest_CHS Dest_CLE 
Dest_CLT Dest_CMH Dest_COS Dest_CRP Dest_CVG Dest_DAL Dest_DAY Dest_DCA Dest_DEN Dest_DFW Dest_DSM Dest_DTW Dest_EGE Dest_ELP Dest_EWR Dest_EYW Dest_FLL Dest_GEG Dest_GRR Dest_GSO Dest_GUC Dest_HNL Dest_HOU Dest_HRL Dest_IAD Dest_IAH Dest_ICT Dest_ILM Dest_IND Dest_ISP Dest_JAC Dest_JAN Dest_JAX Dest_JFK Dest_KOA Dest_LAS Dest_LAX Dest_LBB Dest_LGA Dest_LIT Dest_MAF Dest_MCI Dest_MCO Dest_MDT Dest_MDW Dest_MHT Dest_MIA Dest_MKE Dest_MSP Dest_MSY Dest_MTJ Dest_MYR Dest_OAK Dest_OGG Dest_OKC Dest_OMA Dest_ONT Dest_ORD Dest_ORF Dest_PBI Dest_PDX Dest_PHL Dest_PHX Dest_PIT Dest_PSP Dest_PVD Dest_PWM Dest_RDU Dest_RIC Dest_RNO Dest_ROC Dest_RSW Dest_SAN Dest_SAT Dest_SAV Dest_SDF Dest_SEA Dest_SFO Dest_SJC Dest_SJU Dest_SLC Dest_SMF Dest_SNA Dest_SRQ Dest_STL Dest_STT Dest_STX Dest_SYR Dest_TPA Dest_TUL Dest_TUS
          0 1232 389 0 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
          1 1918 479 0 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
          2 2206 479 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
          3 1230 479 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
          4 831 479 0 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
          df.shape
          (146393, 265)
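
One-hot encoding replaces each categorical column with one 0/1 indicator column per distinct value, which is why the frame grows from 9 to 265 columns (for example, the three carriers UA, US and WN become three columns). A minimal pure-Python sketch of the idea, with the function name `one_hot` chosen for illustration:

```python
def one_hot(rows, column):
    """Replace `column` in each row dict with 0/1 indicator columns."""
    categories = sorted({row[column] for row in rows})
    encoded = []
    for row in rows:
        # Copy every other column unchanged.
        new_row = {k: v for k, v in row.items() if k != column}
        # Add one indicator per category, named like get_dummies names them.
        for category in categories:
            new_row[f"{column}_{category}"] = int(row[column] == category)
        encoded.append(new_row)
    return encoded


rows = [
    {"Distance": 389, "Dest": "ONT"},
    {"Distance": 479, "Dest": "PDX"},
    {"Distance": 479, "Dest": "PDX"},
]
for row in one_hot(rows, "Dest"):
    print(row)
```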

          3.7 Creating the training and test sets

          X = df[df.columns.difference(["dep_delayed_15min"])]
          Y = df["dep_delayed_15min"]
          X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=0.33)
          len(X_train),len(X_test)
          (98083, 48310)
          X_train.shape, X_test.shape, y_train.shape, y_test.shape
          ((98083, 264), (48310, 264), (98083,), (48310,))
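
The sizes above follow from rounding the test share up: 146393 × 0.33 = 48309.69, giving 48,310 test rows and 98,083 training rows. Rounding up is the convention scikit-learn uses. That Blackhole follows the same rule is an assumption, though it is consistent with the numbers shown. The sketch below also fixes a seed, since the call in this case sets none and the split, and therefore the AUC, can differ between runs. `split_indices` is a hypothetical helper.

```python
import math
import random


def split_indices(n_rows, test_size, seed=0):
    """Shuffle row indices and cut them into train and test parts."""
    n_test = math.ceil(n_rows * test_size)
    indices = list(range(n_rows))
    # A dedicated Random instance keeps the shuffle reproducible.
    random.Random(seed).shuffle(indices)
    return indices[n_test:], indices[:n_test]


train_idx, test_idx = split_indices(146393, 0.33)
print(len(train_idx), len(test_idx))  # 98083 48310
```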

          Step 4: Model training and evaluation

          4.1 Training the model

          %%time
          lr = LogisticRegression()
          lr.fit(X_train, y_train)
          Parse progress: |█████████████████████████████████████████████████████████| 100%
          Parse progress: |█████████████████████████████████████████████████████████| 100%
          glm Model Build progress: |███████████████████████████████████████████████| 100%
          CPU times: user 1min 50s, sys: 2.5 s, total: 1min 52s
          Wall time: 6.48 s
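
For readers new to the algorithm, logistic regression scores a row by passing a weighted sum of its features through the sigmoid function. The sketch below shows only that scoring step. The weights and bias are made-up values for illustration, not coefficients from the model trained above.

```python
import math


def predict_proba(weights, bias, features):
    """Probability of the positive class under a logistic regression model."""
    z = bias + sum(w * x for w, x in zip(weights, features))
    return 1.0 / (1.0 + math.exp(-z))


# Hypothetical weights for two features: DepTime and Distance.
weights = [0.002, -0.0001]
bias = -4.0

late_evening = predict_proba(weights, bias, [2206, 479])
early_morning = predict_proba(weights, bias, [831, 479])
print(late_evening > early_morning)  # True
```

With a positive weight on DepTime, a later departure receives a higher delay probability. Training chooses the weights that best fit the labelled rows.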

          4.2 Making predictions

          %%time
          y_predict = lr.predict(X_test)
          Parse progress: |█████████████████████████████████████████████████████████| 100%
          glm prediction progress: |████████████████████████████████████████████████| 100%
          Export File progress: |███████████████████████████████████████████████████| 100%
          CPU times: user 16.2 s, sys: 2.67 s, total: 18.9 s
          Wall time: 1.75 s

          4.3 Evaluating the model

          %%time
          roc_auc_score(y_test, y_predict)
          CPU times: user 2.75 s, sys: 18.2 ms, total: 2.77 s
          Wall time: 106 ms
          0.6743132437903748
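
AUC measures how well scores rank positive rows above negative ones, so it is normally computed from continuous scores. This document does not say whether Blackhole's `predict` returns probabilities or hard 0/1 labels. If it returns labels, the reported AUC may understate the model's ranking quality. The pure-Python sketch below, with a hypothetical `roc_auc` helper and toy data, shows the difference.

```python
def roc_auc(labels, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    A tie between a positive and a negative score counts as half.
    """
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


y_true = [0, 0, 1, 1]
probabilities = [0.1, 0.2, 0.3, 0.8]
hard_labels = [int(p >= 0.5) for p in probabilities]

print(roc_auc(y_true, probabilities))  # 1.0
print(roc_auc(y_true, hard_labels))    # 0.75
```

The probabilities rank both positives above both negatives, but thresholding at 0.5 collapses three of the four rows to the same label and loses part of that ordering.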

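          Step 5: Exporting and saving the model

          # Save the model
          from blackhole.ml import save_model
          import shutil
          # When running in the cloud, the output path must be under the directory named by the OUTPUT_PATH environment variable
          output_path = os.getenv('OUTPUT_PATH', "./")  # output directory
          model_path = os.path.join(output_path, 'bh_airline_model')  # where the model is saved
          # If the model path already exists, delete it so the new model replaces it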
          if os.path.exists(model_path):
              shutil.rmtree(model_path)
          save_model(lr, model_path)
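
`shutil.rmtree` raises an error when the path is a regular file rather than a directory, and this document does not state which of the two `save_model` writes. A defensive variant of the cleanup step that handles both cases might look like the sketch below. `fresh_model_path` is a hypothetical helper, not a Blackhole function.

```python
import os
import shutil


def fresh_model_path(model_name, default_root="./"):
    """Return a path under OUTPUT_PATH, deleting any earlier model there."""
    root = os.getenv("OUTPUT_PATH", default_root)
    model_path = os.path.join(root, model_name)
    if os.path.isdir(model_path):
        shutil.rmtree(model_path)  # earlier model saved as a directory
    elif os.path.exists(model_path):
        os.remove(model_path)      # earlier model saved as a single file
    return model_path
```

The returned path can then be passed to `save_model(lr, ...)` in place of `model_path`.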