Kaggle for earthquake

 

 







In [0]:
## Types of models used
-lightGBM
-Genetic programming
-SVM 

Predicting earthquakes has long been thought to be near-impossible. But same has been said about many other things, I still remember how 15 years ago my IT teacher in high school was saying that speech recognition can never be done reliably by a computer, and today it's more reliable than a human. So who knows, maybe this competition will be a major step in solving another "impossible" problem. Importance of predicting earthquakes is hard to underestimate - selected years of our century claimed hundreds of thousands of causalities. Being able to predict earthquakes could allow us to better protect human life and property.

Supervised learning

How was the data generated? Why is this important?

-Show the 1st 2%, then zoom in
-audio quakes right before the earthquake
-Add some features?
-K Folds for Validation
- Types of models used (LightGBM, genetic programming, SVM)
In [4]:
!pip install kaggle
 
Requirement already satisfied: kaggle in /usr/local/lib/python3.6/dist-packages (1.5.2)
Requirement already satisfied: urllib3<1.23.0,>=1.15 in /usr/local/lib/python3.6/dist-packages (from kaggle) (1.22)
Requirement already satisfied: six>=1.10 in /usr/local/lib/python3.6/dist-packages (from kaggle) (1.11.0)
Requirement already satisfied: certifi in /usr/local/lib/python3.6/dist-packages (from kaggle) (2018.11.29)
Requirement already satisfied: python-dateutil in /usr/local/lib/python3.6/dist-packages (from kaggle) (2.5.3)
Requirement already satisfied: requests in /usr/local/lib/python3.6/dist-packages (from kaggle) (2.18.4)
Requirement already satisfied: tqdm in /usr/local/lib/python3.6/dist-packages (from kaggle) (4.28.1)
Requirement already satisfied: python-slugify in /usr/local/lib/python3.6/dist-packages (from kaggle) (2.0.1)
Requirement already satisfied: chardet<3.1.0,>=3.0.2 in /usr/local/lib/python3.6/dist-packages (from requests->kaggle) (3.0.4)
Requirement already satisfied: idna<2.7,>=2.5 in /usr/local/lib/python3.6/dist-packages (from requests->kaggle) (2.6)
Requirement already satisfied: Unidecode>=0.04.16 in /usr/local/lib/python3.6/dist-packages (from python-slugify->kaggle) (1.0.23)
In [3]:
from google.colab import files

uploaded = files.upload()

for fn in uploaded.keys():
  print('User uploaded file "{name}" with length {length} bytes'.format(
      name=fn, length=len(uploaded[fn])))
  
# Then move kaggle.json into the folder where the API expects to find it.
!mkdir -p ~/.kaggle/ && mv kaggle.json ~/.kaggle/ && chmod 600 ~/.kaggle/kaggle.json
 



Upload widget is only available when the cell has been executed in the
current browser session. Please rerun this cell to enable.

 
Saving kaggle.json to kaggle.json
User uploaded file "kaggle.json" with length 62 bytes
In [5]:
!kaggle competitions list
 
ref                                            deadline             category            reward  teamCount  userHasEntered
---------------------------------------------  -------------------  ---------------  ---------  ---------  --------------  
digit-recognizer                               2030-01-01 00:00:00  Getting Started  Knowledge       2632           False  
titanic                                        2030-01-01 00:00:00  Getting Started  Knowledge       9863           False  
house-prices-advanced-regression-techniques    2030-01-01 00:00:00  Getting Started  Knowledge       4138           False  
imagenet-object-localization-challenge         2029-12-31 07:00:00  Research         Knowledge         33           False  
competitive-data-science-predict-future-sales  2019-12-31 23:59:00  Playground           Kudos       2283           False  
two-sigma-financial-news                       2019-07-15 23:59:00  Featured          $100,000       2897           False  
LANL-Earthquake-Prediction                     2019-06-03 23:59:00  Research           $50,000        974            True  
tmdb-box-office-prediction                     2019-05-30 23:59:00  Playground       Knowledge         26           False  
dont-overfit-ii                                2019-05-07 23:59:00  Playground            Swag          9           False  
gendered-pronoun-resolution                    2019-04-22 23:59:00  Research           $25,000         79           False  
histopathologic-cancer-detection               2019-03-30 23:59:00  Playground       Knowledge        568           False  
petfinder-adoption-prediction                  2019-03-28 23:59:00  Featured           $25,000       1000           False  
vsb-power-line-fault-detection                 2019-03-21 23:59:00  Featured           $25,000        808           False  
microsoft-malware-prediction                   2019-03-13 23:59:00  Research           $25,000       1526           False  
humpback-whale-identification                  2019-02-28 23:59:00  Featured           $25,000       1747           False  
elo-merchant-category-recommendation           2019-02-26 23:59:00  Featured           $50,000       3559           False  
quora-insincere-questions-classification       2019-02-26 23:59:00  Featured           $25,000       4037           False  
ga-customer-revenue-prediction                 2019-02-15 23:59:00  Featured           $45,000       1104           False  
reducing-commercial-aviation-fatalities        2019-02-12 23:59:00  Playground            Swag        152           False  
pubg-finish-placement-prediction               2019-01-30 23:59:00  Playground            Swag       1534           False  
In [4]:
!kaggle competitions download -c LANL-Earthquake-Prediction
 
Downloading sample_submission.csv to /content
  0% 0.00/33.3k [00:00<?, ?B/s]
100% 33.3k/33.3k [00:00<00:00, 30.7MB/s]
Downloading test.zip to /content
 96% 233M/242M [00:03<00:00, 62.8MB/s]
100% 242M/242M [00:03<00:00, 81.0MB/s]
Downloading train.csv.zip to /content
100% 2.02G/2.03G [00:22<00:00, 98.4MB/s]
100% 2.03G/2.03G [00:22<00:00, 97.5MB/s]
In [5]:
!unzip train.csv.zip
 
Archive:  train.csv.zip
  inflating: train.csv               
In [4]:
!ls
 
sample_data  sample_submission.csv  test.zip  train.csv  train.csv.zip
In [30]:
import pandas as pd
import numpy as np

!pip install numpy==1.15.0

train_df = pd.read_csv('train.csv', dtype={'acoustic_data': np.int16, 'time_to_failure': np.float64})

#train = pd.read_csv('train.csv', iterator=True, chunksize=150_000, dtype={'acoustic_data': np.int16, 'time_to_failure': np.float64})
 
Requirement already satisfied: numpy==1.15.0 in /usr/local/lib/python3.6/dist-packages (1.15.0)
 
---------------------------------------------------------------------------
KeyboardInterrupt                         Traceback (most recent call last)
/usr/local/lib/python3.6/dist-packages/pandas/io/parsers.py in read(self, nrows)
   1838         try:
-> 1839             data = self._reader.read(nrows)
   1840         except StopIteration:

pandas/_libs/parsers.pyx in pandas._libs.parsers.TextReader.read()

pandas/_libs/parsers.pyx in pandas._libs.parsers.TextReader._read_low_memory()

pandas/_libs/parsers.pyx in pandas._libs.parsers.TextReader._read_rows()

pandas/_libs/parsers.pyx in pandas._libs.parsers.TextReader._convert_column_data()

pandas/_libs/parsers.pyx in pandas._libs.parsers.TextReader._convert_tokens()

pandas/_libs/parsers.pyx in pandas._libs.parsers.TextReader._convert_with_dtype()

/usr/local/lib/python3.6/dist-packages/pandas/core/dtypes/common.py in is_integer_dtype(arr_or_dtype)
    776 
--> 777 def is_integer_dtype(arr_or_dtype):
    778     """

KeyboardInterrupt: 

During handling of the above exception, another exception occurred:

KeyboardInterrupt                         Traceback (most recent call last)
<ipython-input-30-fceb7d7f50bb> in <module>()
      4 get_ipython().system('pip install numpy==1.15.0')
      5 
----> 6 train_df = pd.read_csv('train.csv', dtype={'acoustic_data': np.int16, 'time_to_failure': np.float64})
      7 
      8 #train = pd.read_csv('train.csv', iterator=True, chunksize=150_000, dtype={'acoustic_data': np.int16, 'time_to_failure': np.float64})

/usr/local/lib/python3.6/dist-packages/pandas/io/parsers.py in parser_f(filepath_or_buffer, sep, delimiter, header, names, index_col, usecols, squeeze, prefix, mangle_dupe_cols, dtype, engine, converters, true_values, false_values, skipinitialspace, skiprows, nrows, na_values, keep_default_na, na_filter, verbose, skip_blank_lines, parse_dates, infer_datetime_format, keep_date_col, date_parser, dayfirst, iterator, chunksize, compression, thousands, decimal, lineterminator, quotechar, quoting, escapechar, comment, encoding, dialect, tupleize_cols, error_bad_lines, warn_bad_lines, skipfooter, skip_footer, doublequote, delim_whitespace, as_recarray, compact_ints, use_unsigned, low_memory, buffer_lines, memory_map, float_precision)
    707                     skip_blank_lines=skip_blank_lines)
    708 
--> 709         return _read(filepath_or_buffer, kwds)
    710 
    711     parser_f.__name__ = name

/usr/local/lib/python3.6/dist-packages/pandas/io/parsers.py in _read(filepath_or_buffer, kwds)
    453 
    454     try:
--> 455         data = parser.read(nrows)
    456     finally:
    457         parser.close()

/usr/local/lib/python3.6/dist-packages/pandas/io/parsers.py in read(self, nrows)
   1067                 raise ValueError('skipfooter not supported for iteration')
   1068 
-> 1069         ret = self._engine.read(nrows)
   1070 
   1071         if self.options.get('as_recarray'):

/usr/local/lib/python3.6/dist-packages/pandas/io/parsers.py in read(self, nrows)
   1837     def read(self, nrows=None):
   1838         try:
-> 1839             data = self._reader.read(nrows)
   1840         except StopIteration:
   1841             if self._first_chunk:

KeyboardInterrupt:
In [0]:
##print("Train: rows:{} cols:{}".format(train_df.shape[0], train_df.shape[1]))
def gen_features(X):
    strain = []
    strain.append(X.mean())
    strain.append(X.std())
    strain.append(X.min())
    strain.append(X.max())
    strain.append(X.mad())
    strain.append(X.kurtosis())
    strain.append(X.skew())
    strain.append(np.quantile(X,0.01))
    strain.append(np.quantile(X,0.05))
    strain.append(np.quantile(X,0.95))
    strain.append(np.quantile(X,0.99))
    strain.append(np.abs(X).max())
    strain.append(np.abs(X).mean())
    strain.append(np.abs(X).std())
    return pd.Series(strain)
  
X_train = pd.DataFrame()
y_train = pd.Series()
for df in train:
    ch = gen_features(df['acoustic_data'])
    X_train = X_train.append(ch, ignore_index=True)
    y_train = y_train.append(pd.Series(df['time_to_failure'].values[-1]))
In [4]:
X_train.describe()
Out[4]:
  0 1 2 3 4 5 6 7 8 9 10 11 12 13
count 4195.000000 4195.000000 4195.000000 4195.000000 4195.000000 4195.000000 4195.000000 4195.000000 4195.000000 4195.000000 4195.000000 4195.000000 4195.000000 4195.000000
mean 4.519475 6.547788 -149.190942 163.522288 3.482899 68.297997 0.125830 -11.224603 -2.184779 11.231716 20.321890 170.046246 5.547367 5.750165
std 0.256049 8.503939 265.087984 272.930331 1.621138 70.532565 0.477901 14.106852 2.346558 2.358067 14.225526 296.887015 1.517038 8.339211
min 3.596313 2.802720 -5515.000000 23.000000 2.199321 0.648602 -4.091826 -336.000000 -39.000000 9.000000 11.000000 23.000000 4.147707 2.589085
25% 4.349497 4.478637 -154.000000 92.000000 2.820130 28.090227 -0.040779 -14.000000 -3.000000 10.000000 15.000000 94.000000 5.061843 3.862810
50% 4.522147 5.618798 -111.000000 123.000000 3.324506 45.816625 0.085620 -10.000000 -2.000000 11.000000 19.000000 127.000000 5.380853 4.781513
75% 4.693350 6.880904 -79.000000 170.000000 3.805475 78.664202 0.253930 -6.000000 -1.000000 12.000000 23.000000 175.000000 5.748553 5.887947
max 5.391993 153.703569 -15.000000 5444.000000 31.724966 631.158927 4.219429 -2.000000 0.000000 50.000000 337.000000 5515.000000 32.762073 150.432368
In [0]:
#!pip install catboost
from catboost import CatBoostRegressor, Pool


train_pool = Pool(X_train, y_train)
In [9]:
m = CatBoostRegressor(iterations=10000, loss_function='MAE', boosting_type='Ordered')
m.fit(X_train, y_train, silent=True)
m.best_score_
Out[9]:
{'learn': {'MAE': 1.74357544138688}}
In [0]:
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import GridSearchCV
from sklearn.svm import NuSVR, SVR


scaler = StandardScaler()
scaler.fit(X_train)
X_train_scaled = scaler.transform(X_train)
In [18]:
parameters = [{'gamma': [0.001, 0.005, 0.01, 0.02, 0.05, 0.1],
               'C': [0.1, 0.2, 0.25, 0.5, 1, 1.5, 2]}]
               #'nu': [0.75, 0.8, 0.85, 0.9, 0.95, 0.97]}]

reg1 = GridSearchCV(SVR(kernel='rbf', tol=0.01), parameters, cv=5, scoring='neg_mean_absolute_error')
reg1.fit(X_train_scaled, y_train.values.flatten())
y_pred1 = reg1.predict(X_train_scaled)

print("Best CV score: {:.4f}".format(reg1.best_score_))
print(reg1.best_params_)
 
Best CV score: -2.1651
{'C': 2, 'gamma': 0.02}
In [20]:
from sklearn.kernel_ridge import KernelRidge

parameters = [{'gamma': np.linspace(0.001, 0.1, 10),
               'alpha': [0.005, 0.01, 0.02, 0.05, 0.1]}]

reg2 = GridSearchCV(KernelRidge(kernel='rbf'), parameters, cv=5, scoring='neg_mean_absolute_error')
reg2.fit(X_train_scaled, y_train.values.flatten())
y_pred2 = reg2.predict(X_train_scaled)

print("Best CV score: {:.4f}".format(reg2.best_score_))
print(reg2.best_params_)
 
Best CV score: -2.1774
{'alpha': 0.005, 'gamma': 0.001}
In [25]:
import lightgbm as lgb
from tqdm import tqdm_notebook as tqdm
import random


fixed_params = {
    'objective': 'regression_l1',
    'boosting': 'gbdt',
    'verbosity': -1,
    'random_seed': 19
}

param_grid = {
    'learning_rate': [0.1, 0.08, 0.05, 0.01],
    'num_leaves': [32, 46, 52, 58, 68, 72, 80, 92],
    'max_depth': [3, 4, 5, 6, 8, 12, 16, -1],
    'feature_fraction': [0.8, 0.85, 0.9, 0.95, 1],
    'subsample': [0.8, 0.85, 0.9, 0.95, 1],
    'lambda_l1': [0, 0.1, 0.2, 0.4, 0.6, 0.9],
    'lambda_l2': [0, 0.1, 0.2, 0.4, 0.6, 0.9],
    'min_data_in_leaf': [10, 20, 40, 60, 100],
    'min_gain_to_split': [0, 0.001, 0.01, 0.1],
}

best_score = 999
dataset = lgb.Dataset(X_train, label=y_train)  # no need to scale features

for i in tqdm(range(200)):
    params = {k: random.choice(v) for k, v in param_grid.items()}
    params.update(fixed_params)
    result = lgb.cv(params, dataset, nfold=5, early_stopping_rounds=50,
                    num_boost_round=20000, stratified=False)
    
    if result['l1-mean'][-1] < best_score:
        best_score = result['l1-mean'][-1]
        best_params = params
        best_nrounds = len(result['l1-mean'])
 
 

 
 
In [26]:
print("Best mean score: {:.4f}, num rounds: {}".format(best_score, best_nrounds))
print(best_params)
reg3 = lgb.train(best_params, dataset, best_nrounds)
y_pred3 = reg3.predict(X_train)
 
Best mean score: 2.0900, num rounds: 97
{'learning_rate': 0.08, 'num_leaves': 68, 'max_depth': 3, 'feature_fraction': 0.95, 'subsample': 0.9, 'lambda_l1': 0.6, 'lambda_l2': 0, 'min_data_in_leaf': 10, 'min_gain_to_split': 0, 'objective': 'regression_l1', 'boosting': 'gbdt', 'verbosity': -1, 'random_seed': 19}
In [27]:
plt.tight_layout()
f = plt.figure(figsize=(12, 6))
f.add_subplot(1, 3, 1)
plt.scatter(y_train.values.flatten(), y_pred1)
plt.title('SVR', fontsize=16)
plt.xlim(0, 20)
plt.ylim(0, 20)
plt.xlabel('actual', fontsize=12)
plt.ylabel('predicted', fontsize=12)
plt.plot([(0, 0), (20, 20)], [(0, 0), (20, 20)])

f.add_subplot(1, 3, 2)
plt.scatter(y_train.values.flatten(), y_pred2)
plt.title('Kernel Ridge', fontsize=16)
plt.xlim(0, 20)
plt.ylim(0, 20)
plt.xlabel('actual', fontsize=12)
plt.plot([(0, 0), (20, 20)], [(0, 0), (20, 20)])

f.add_subplot(1, 3, 3)
plt.scatter(y_train.values.flatten(), y_pred3)
plt.title('Gradient boosting', fontsize=16)
plt.xlim(0, 20)
plt.ylim(0, 20)
plt.xlabel('actual', fontsize=12)
plt.plot([(0, 0), (20, 20)], [(0, 0), (20, 20)])
plt.show(block=True)

# second plot
plt.figure(figsize=(10, 5))
plt.plot(y_train.values.flatten(), color='blue', label='y_train')
plt.plot(y_pred1, color='orange', label='SVR')
plt.legend()
plt.title('SVR predictions vs actual')

# third plot
plt.figure(figsize=(10, 5))
plt.plot(y_train.values.flatten(), color='blue', label='y_train')
plt.plot(y_pred2, color='gray', label='KernelRidge')
plt.legend()
plt.title('Kernel Ridge predictions vs actual')

# fourth plot
plt.figure(figsize=(10, 5))
plt.plot(y_train.values.flatten(), color='blue', label='y_train')
plt.plot(y_pred3, color='green', label='Gradient boosting')
plt.legend()
plt.title('GBDT predictions vs actual')
 
<Figure size 576x396 with 0 Axes>
 
Out[27]:
Text(0.5, 1.0, 'GBDT predictions vs actual')
 
 
 
In [21]:
import matplotlib.pyplot as plt


train_ad_sample_df = X_train['acoustic_data'].values[::100]
train_ttf_sample_df = X_train['time_to_failure'].values[::100]

def plot_acc_ttf_data(train_ad_sample_df, train_ttf_sample_df, title="Acoustic data and time to failure: 1% sampled data"):
    fig, ax1 = plt.subplots(figsize=(12, 8))
    plt.title(title)
    plt.plot(train_ad_sample_df, color='r')
    ax1.set_ylabel('acoustic data', color='r')
    plt.legend(['acoustic data'], loc=(0.01, 0.95))
    ax2 = ax1.twinx()
    plt.plot(train_ttf_sample_df, color='b')
    ax2.set_ylabel('time to failure', color='b')
    plt.legend(['time to failure'], loc=(0.01, 0.9))
    plt.grid(True)

plot_acc_ttf_data(train_ad_sample_df, train_ttf_sample_df)
del train_ad_sample_df
del train_ttf_sample_df
 
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
pandas/_libs/index.pyx in pandas._libs.index.IndexEngine.get_loc()

pandas/_libs/hashtable_class_helper.pxi in pandas._libs.hashtable.Int64HashTable.get_item()

TypeError: an integer is required

During handling of the above exception, another exception occurred:

KeyError                                  Traceback (most recent call last)
/usr/local/lib/python3.6/dist-packages/pandas/core/indexes/base.py in get_loc(self, key, method, tolerance)
   2524             try:
-> 2525                 return self._engine.get_loc(key)
   2526             except KeyError:

pandas/_libs/index.pyx in pandas._libs.index.IndexEngine.get_loc()

pandas/_libs/index.pyx in pandas._libs.index.IndexEngine.get_loc()

KeyError: 'acoustic_data'

During handling of the above exception, another exception occurred:

TypeError                                 Traceback (most recent call last)
pandas/_libs/index.pyx in pandas._libs.index.IndexEngine.get_loc()

pandas/_libs/hashtable_class_helper.pxi in pandas._libs.hashtable.Int64HashTable.get_item()

TypeError: an integer is required

During handling of the above exception, another exception occurred:

KeyError                                  Traceback (most recent call last)
<ipython-input-21-782216d32bc2> in <module>()
      2 
      3 
----> 4 train_ad_sample_df = X_train['acoustic_data'].values[::100]
      5 train_ttf_sample_df = X_train['time_to_failure'].values[::100]
      6 

/usr/local/lib/python3.6/dist-packages/pandas/core/frame.py in __getitem__(self, key)
   2137             return self._getitem_multilevel(key)
   2138         else:
-> 2139             return self._getitem_column(key)
   2140 
   2141     def _getitem_column(self, key):

/usr/local/lib/python3.6/dist-packages/pandas/core/frame.py in _getitem_column(self, key)
   2144         # get column
   2145         if self.columns.is_unique:
-> 2146             return self._get_item_cache(key)
   2147 
   2148         # duplicate columns & possible reduce dimensionality

/usr/local/lib/python3.6/dist-packages/pandas/core/generic.py in _get_item_cache(self, item)
   1840         res = cache.get(item)
   1841         if res is None:
-> 1842             values = self._data.get(item)
   1843             res = self._box_item_values(item, values)
   1844             cache[item] = res

/usr/local/lib/python3.6/dist-packages/pandas/core/internals.py in get(self, item, fastpath)
   3841 
   3842             if not isna(item):
-> 3843                 loc = self.items.get_loc(item)
   3844             else:
   3845                 indexer = np.arange(len(self.items))[isna(self.items)]

/usr/local/lib/python3.6/dist-packages/pandas/core/indexes/base.py in get_loc(self, key, method, tolerance)
   2525                 return self._engine.get_loc(key)
   2526             except KeyError:
-> 2527                 return self._engine.get_loc(self._maybe_cast_indexer(key))
   2528 
   2529         indexer = self.get_indexer([key], method=method, tolerance=tolerance)

pandas/_libs/index.pyx in pandas._libs.index.IndexEngine.get_loc()

pandas/_libs/index.pyx in pandas._libs.index.IndexEngine.get_loc()

KeyError: 'acoustic_data'
In [3]:
train_ad_sample_df = train_df['acoustic_data'].values[:6291455]
train_ttf_sample_df = train_df['time_to_failure'].values[:6291455]
plot_acc_ttf_data(train_ad_sample_df, train_ttf_sample_df, title="Acoustic data and time to failure: 1% of data")
del train_ad_sample_df
del train_ttf_sample_df
 
---------------------------------------------------------------------------
NameError                                 Traceback (most recent call last)
<ipython-input-3-efc0d064277a> in <module>()
----> 1 train_ad_sample_df = train_df['acoustic_data'].values[:6291455]
      2 train_ttf_sample_df = train_df['time_to_failure'].values[:6291455]
      3 plot_acc_ttf_data(train_ad_sample_df, train_ttf_sample_df, title="Acoustic data and time to failure: 1% of data")
      4 del train_ad_sample_df
      5 del train_ttf_sample_df

NameError: name 'train_df' is not defined
In [0]:
def gen_features(X):
    strain = []
    strain.append(X.mean())
    strain.append(X.std())
    strain.append(X.min())
    strain.append(X.max())
    strain.append(X.mad())
    strain.append(X.kurtosis())
    strain.append(X.skew())
    strain.append(np.quantile(X,0.01))
    strain.append(np.quantile(X,0.05))
    strain.append(np.quantile(X,0.95))
    strain.append(np.quantile(X,0.99))
    strain.append(np.abs(X).max())
    strain.append(np.abs(X).mean())
    strain.append(np.abs(X).std())
    return pd.Series(strain)
In [2]:
= pd.DataFrame()
y_train = pd.Series()
for df in train_df:
    ch = gen_features(train_df['acoustic_data'])
    X_train = X_train.append(ch, ignore_index=True)
    y_train = y_train.append(pd.Series(df['time_to_failure'].values[-1]))
 
---------------------------------------------------------------------------
NameError                                 Traceback (most recent call last)
<ipython-input-2-216dc1df0463> in <module>()
----> 1 X_train = pd.DataFrame()
      2 y_train = pd.Series()
      3 for df in train_df:
      4     ch = gen_features(train_df['acoustic_data'])
      5     X_train = X_train.append(ch, ignore_index=True)

NameError: name 'pd' is not defined