In [0]:
## LANL Earthquake Prediction

Predicting earthquakes has long been thought to be near-impossible. But the same has been said about many other things: I still remember how, 15 years ago, my IT teacher in high school claimed that speech recognition could never be done reliably by a computer, and today it is more reliable than a human. So who knows, maybe this competition will be a major step toward solving another "impossible" problem. The importance of predicting earthquakes is hard to overstate: in some years of this century, earthquakes have claimed hundreds of thousands of casualties. Being able to predict earthquakes could allow us to better protect human life and property.

Plan for this notebook (a supervised learning problem):
- How was the data generated? Why is this important? (answered just below this list)
- Show a 1% sample of the data, then zoom in
- Acoustic "quakes" right before the earthquake
- Add some features
- K-fold cross-validation
- Types of models used (LightGBM, genetic programming, SVM)
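How was the data generated? The signal comes from a laboratory earthquake experiment: an acoustic sensor records the seismic signal emitted by a rock sample under shear stress, and `time_to_failure` is the time remaining until the next laboratory quake. This matters because the test set consists of independent 150,000-sample segments, which is why the feature-generation code below processes the training signal in chunks of exactly that size. Once the data is downloaded in the cells below, a cheap first peek at the raw columns looks like this (a small sketch; it assumes `train.csv` sits in the working directory):

In [ ]:
import pandas as pd

# Read only the first few rows; the full file is several GB.
pd.read_csv('train.csv', nrows=5)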
In [4]:
!pip install kaggle
In [3]:
from google.colab import files
uploaded = files.upload()
for fn in uploaded.keys():
    print('User uploaded file "{name}" with length {length} bytes'.format(
        name=fn, length=len(uploaded[fn])))
# Then move kaggle.json into the folder where the API expects to find it.
!mkdir -p ~/.kaggle/ && mv kaggle.json ~/.kaggle/ && chmod 600 ~/.kaggle/kaggle.json
In [5]:
!kaggle competitions list
In [4]:
!kaggle competitions download -c LANL-Earthquake-Prediction
In [5]:
!unzip train.csv.zip
In [4]:
!ls
In [30]:
# Pin numpy first (the install only takes effect after a runtime restart).
!pip install numpy==1.15.0
import pandas as pd
import numpy as np

# Full dataframe for plotting, plus a chunked iterator for per-segment features:
# each 150,000-row chunk matches the length of one test segment.
train_df = pd.read_csv('train.csv', dtype={'acoustic_data': np.int16, 'time_to_failure': np.float64})
train = pd.read_csv('train.csv', iterator=True, chunksize=150_000, dtype={'acoustic_data': np.int16, 'time_to_failure': np.float64})
In [0]:
# print("Train: rows:{} cols:{}".format(train_df.shape[0], train_df.shape[1]))

def gen_features(X):
    # Summary statistics over one 150,000-sample segment
    strain = []
    strain.append(X.mean())
    strain.append(X.std())
    strain.append(X.min())
    strain.append(X.max())
    strain.append(X.mad())
    strain.append(X.kurtosis())
    strain.append(X.skew())
    strain.append(np.quantile(X, 0.01))
    strain.append(np.quantile(X, 0.05))
    strain.append(np.quantile(X, 0.95))
    strain.append(np.quantile(X, 0.99))
    strain.append(np.abs(X).max())
    strain.append(np.abs(X).mean())
    strain.append(np.abs(X).std())
    return pd.Series(strain)

X_train = pd.DataFrame()
y_train = pd.Series()
for df in train:
    ch = gen_features(df['acoustic_data'])
    X_train = X_train.append(ch, ignore_index=True)
    # Label each segment with the time_to_failure at its last sample
    y_train = y_train.append(pd.Series(df['time_to_failure'].values[-1]))
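Optionally, give the 14 generated features readable names; the labels below are my own (the loop leaves the columns numbered 0–13), but they make the `describe()` output and feature importances easier to read.

In [ ]:
X_train.columns = ['mean', 'std', 'min', 'max', 'mad', 'kurtosis', 'skew',
                   'q01', 'q05', 'q95', 'q99', 'abs_max', 'abs_mean', 'abs_std']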
In [4]:
X_train.describe()
Out[4]:
In [0]:
#!pip install catboost  # uncomment on first run; CatBoost is not preinstalled on Colab
from catboost import CatBoostRegressor, Pool
train_pool = Pool(X_train, y_train)  # Pool wraps features and labels for CatBoost
In [9]:
m = CatBoostRegressor(iterations=10000, loss_function='MAE', boosting_type='Ordered')
m.fit(train_pool, silent=True)
m.best_score_
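For the record, `best_score_` on a CatBoost model is a nested dict keyed by dataset and metric; with no eval set it should only contain the training MAE (a sketch, assuming the usual key layout):

In [ ]:
# Training MAE of the fitted model
print(m.best_score_['learn']['MAE'])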
Out[9]:
In [0]:
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import GridSearchCV
from sklearn.svm import NuSVR, SVR
scaler = StandardScaler()
scaler.fit(X_train)
X_train_scaled = scaler.transform(X_train)
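One caveat here: the scaler is fit on all training rows before cross-validation, so a little information leaks across CV folds in the grid searches below. A leak-free variant (a sketch, not what this notebook actually runs) wraps scaling and the model in a Pipeline, so each fold is scaled using only its own training part:

In [ ]:
from sklearn.pipeline import Pipeline

# Scaling is refit inside every CV fold; hyperparameter names gain a 'svr__' prefix.
pipe = Pipeline([('scale', StandardScaler()),
                 ('svr', SVR(kernel='rbf', tol=0.01))])
pipe_params = {'svr__gamma': [0.001, 0.005, 0.01, 0.02, 0.05, 0.1],
               'svr__C': [0.1, 0.2, 0.25, 0.5, 1, 1.5, 2]}
# GridSearchCV(pipe, pipe_params, cv=5, scoring='neg_mean_absolute_error')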
In [18]:
parameters = [{'gamma': [0.001, 0.005, 0.01, 0.02, 0.05, 0.1],
               'C': [0.1, 0.2, 0.25, 0.5, 1, 1.5, 2]}]
               # 'nu': [0.75, 0.8, 0.85, 0.9, 0.95, 0.97]  # only relevant for NuSVR
reg1 = GridSearchCV(SVR(kernel='rbf', tol=0.01), parameters, cv=5, scoring='neg_mean_absolute_error')
reg1.fit(X_train_scaled, y_train.values.flatten())
y_pred1 = reg1.predict(X_train_scaled)
print("Best CV score: {:.4f}".format(reg1.best_score_))
print(reg1.best_params_)
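The CV score above is the number that matters for model selection; the in-sample fit can still be a useful sanity check (a quick sketch with sklearn's metric; expect it to be optimistic relative to the CV score):

In [ ]:
from sklearn.metrics import mean_absolute_error

print("Train MAE: {:.4f}".format(
    mean_absolute_error(y_train.values.flatten(), y_pred1)))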
In [20]:
from sklearn.kernel_ridge import KernelRidge
parameters = [{'gamma': np.linspace(0.001, 0.1, 10),
               'alpha': [0.005, 0.01, 0.02, 0.05, 0.1]}]
reg2 = GridSearchCV(KernelRidge(kernel='rbf'), parameters, cv=5, scoring='neg_mean_absolute_error')
reg2.fit(X_train_scaled, y_train.values.flatten())
y_pred2 = reg2.predict(X_train_scaled)
print("Best CV score: {:.4f}".format(reg2.best_score_))
print(reg2.best_params_)
In [25]:
import lightgbm as lgb
from tqdm import tqdm_notebook as tqdm
import random
fixed_params = {
    'objective': 'regression_l1',
    'boosting': 'gbdt',
    'verbosity': -1,
    'random_seed': 19
}
param_grid = {
    'learning_rate': [0.1, 0.08, 0.05, 0.01],
    'num_leaves': [32, 46, 52, 58, 68, 72, 80, 92],
    'max_depth': [3, 4, 5, 6, 8, 12, 16, -1],
    'feature_fraction': [0.8, 0.85, 0.9, 0.95, 1],
    'subsample': [0.8, 0.85, 0.9, 0.95, 1],
    'lambda_l1': [0, 0.1, 0.2, 0.4, 0.6, 0.9],
    'lambda_l2': [0, 0.1, 0.2, 0.4, 0.6, 0.9],
    'min_data_in_leaf': [10, 20, 40, 60, 100],
    'min_gain_to_split': [0, 0.001, 0.01, 0.1],
}
best_score = 999
dataset = lgb.Dataset(X_train, label=y_train)  # tree models don't need scaled features
for i in tqdm(range(200)):
    # Random search: sample one value per hyperparameter, then cross-validate
    params = {k: random.choice(v) for k, v in param_grid.items()}
    params.update(fixed_params)
    result = lgb.cv(params, dataset, nfold=5, early_stopping_rounds=50,
                    num_boost_round=20000, stratified=False)
    if result['l1-mean'][-1] < best_score:
        best_score = result['l1-mean'][-1]
        best_params = params
        best_nrounds = len(result['l1-mean'])
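Why random search rather than a grid? An exhaustive grid over the lists above would require several million `lgb.cv` runs (the Cartesian product of nine hyperparameter lists), while 200 random draws typically cover the influential dimensions well at a tiny fraction of the cost.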
In [26]:
print("Best mean score: {:.4f}, num rounds: {}".format(best_score, best_nrounds))
print(best_params)
reg3 = lgb.train(best_params, dataset, best_nrounds)
y_pred3 = reg3.predict(X_train)
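With only 14 aggregate features, it is worth checking which ones the boosted trees actually rely on; LightGBM ships a built-in importance plot (a quick sketch):

In [ ]:
import matplotlib.pyplot as plt

# Split-count importance of the 14 summary-statistic features
lgb.plot_importance(reg3, max_num_features=14)
plt.show()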
In [27]:
import matplotlib.pyplot as plt

f = plt.figure(figsize=(12, 6))
f.add_subplot(1, 3, 1)
plt.scatter(y_train.values.flatten(), y_pred1)
plt.title('SVR', fontsize=16)
plt.xlim(0, 20)
plt.ylim(0, 20)
plt.xlabel('actual', fontsize=12)
plt.ylabel('predicted', fontsize=12)
plt.plot([0, 20], [0, 20], color='black')  # perfect-prediction diagonal
f.add_subplot(1, 3, 2)
plt.scatter(y_train.values.flatten(), y_pred2)
plt.title('Kernel Ridge', fontsize=16)
plt.xlim(0, 20)
plt.ylim(0, 20)
plt.xlabel('actual', fontsize=12)
plt.plot([0, 20], [0, 20], color='black')
f.add_subplot(1, 3, 3)
plt.scatter(y_train.values.flatten(), y_pred3)
plt.title('Gradient boosting', fontsize=16)
plt.xlim(0, 20)
plt.ylim(0, 20)
plt.xlabel('actual', fontsize=12)
plt.plot([0, 20], [0, 20], color='black')
plt.tight_layout()  # call after the subplots exist, not before
plt.show()

# Predicted vs. actual over the whole training sequence, one model per figure
plt.figure(figsize=(10, 5))
plt.plot(y_train.values.flatten(), color='blue', label='y_train')
plt.plot(y_pred1, color='orange', label='SVR')
plt.legend()
plt.title('SVR predictions vs actual')

plt.figure(figsize=(10, 5))
plt.plot(y_train.values.flatten(), color='blue', label='y_train')
plt.plot(y_pred2, color='gray', label='KernelRidge')
plt.legend()
plt.title('Kernel Ridge predictions vs actual')

plt.figure(figsize=(10, 5))
plt.plot(y_train.values.flatten(), color='blue', label='y_train')
plt.plot(y_pred3, color='green', label='Gradient boosting')
plt.legend()
plt.title('GBDT predictions vs actual')
Out[27]:
In [21]:
import matplotlib.pyplot as plt

# 1% sample of the full signal (every 100th point)
train_ad_sample_df = train_df['acoustic_data'].values[::100]
train_ttf_sample_df = train_df['time_to_failure'].values[::100]

def plot_acc_ttf_data(train_ad_sample_df, train_ttf_sample_df, title="Acoustic data and time to failure: 1% sampled data"):
    fig, ax1 = plt.subplots(figsize=(12, 8))
    plt.title(title)
    plt.plot(train_ad_sample_df, color='r')
    ax1.set_ylabel('acoustic data', color='r')
    plt.legend(['acoustic data'], loc=(0.01, 0.95))
    ax2 = ax1.twinx()
    plt.plot(train_ttf_sample_df, color='b')
    ax2.set_ylabel('time to failure', color='b')
    plt.legend(['time to failure'], loc=(0.01, 0.9))
    plt.grid(True)

plot_acc_ttf_data(train_ad_sample_df, train_ttf_sample_df)
del train_ad_sample_df
del train_ttf_sample_df
In [3]:
# Zoom in on the first 1% of the recording (the first ~6.3M of ~629M rows)
train_ad_sample_df = train_df['acoustic_data'].values[:6291455]
train_ttf_sample_df = train_df['time_to_failure'].values[:6291455]
plot_acc_ttf_data(train_ad_sample_df, train_ttf_sample_df, title="Acoustic data and time to failure: first 1% of data")
del train_ad_sample_df
del train_ttf_sample_df