Machine learning with H2O - Hands-on guide for data scientists
H2O is one of the world's leading machine learning platforms. It is open-source software; the H2O-3 GitHub repository is available for anyone to start hacking on. This hands-on guide aims to explain the basic principles behind H2O and get you, as a data scientist, started as quickly and simply as possible. The rest is just machine learning :)
After reading this guide, you’ll be able to
- understand which basic problems H2O solves and why,
- play with H2O - explore data, create & tune models,
- see beyond the horizon and understand where H2O can take you.
As a data scientist, you're most likely to use R and/or Python. H2O integrates with both. Interestingly, H2O makes it easy to switch seamlessly between Python, R and other data science tools while still working on the same project. This allows data scientists to collaborate more easily, as well as to use the best tool for the job. But the possibilities do not stop there. H2O also offers its own web-based interface named Flow. By means of Flow, data scientists are able to import, explore and modify datasets, play with models, verify model performance and much more. Flow is a beautiful and quick way to do machine learning. Flows can be saved and handed to other data scientists, making cooperation easy.
H2O respects the habits of data scientists and does not get in their way. In Python, data scientists are familiar with pandas, scikit-learn, NumPy and others; H2O's syntax is very similar to theirs. H2O is able to work directly with pandas data structures, and it is compatible with NumPy arrays and primitive Python lists and collections. H2O follows the same pattern with R, respecting the naming and syntax R developers are used to.
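To make that interoperability concrete, here is a minimal Python sketch (assuming H2O is already installed and running, as described later in this guide) that hands a pandas DataFrame directly to H2O and converts an H2OFrame back:

import h2o
import pandas as pd

h2o.init()

# A plain pandas DataFrame built from primitive Python lists
df = pd.DataFrame({"Distance": [447, 1000, 250], "IsArrDelayed": ["YES", "NO", "YES"]})

# H2O accepts the pandas DataFrame directly
hf = h2o.H2OFrame(df)

# ...and an H2OFrame converts back to pandas when needed
df_again = hf.as_data_frame()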
Getting started
The preparations before take-off are short: H2O is extremely easy to start with, and a common laptop is all it takes. Once H2O is installed, it is very easy and convenient to import a dataset and create a model out of it. The examples below use the famous Airlines Delay dataset. There is no need to download it, as H2O will take care of downloading the dataset for you. The dataset is very intuitive to work with; however, if you're unfamiliar with it or simply want to know more, visit the Kaggle website for a brief description of each column.
Once fluent with H2O in R, working with it in Python is easy, and it works the other way around as well, as the APIs are very similar while respecting each platform's conventions.
Getting started in R
Open up the R CLI (by typing R in a terminal on most systems) or start RStudio. It takes just a few lines to install H2O in R.
Before installing H2O itself, two packages are required: RCurl and jsonlite. Install them by entering the following command into the R console.
Installation
install.packages("RCurl","jsonlite")
After RCurl and jsonlite are installed, the last step is to install H2O itself. Installation of the latest stable release is demonstrated in the following snippet. During the installation, a one-time download of the H2O backend, containing all the algorithms and computing know-how, will occur.
install.packages("h2o", type="source", repos=(c("https://h2o-release.s3.amazonaws.com/h2o/latest_stable_R")))
That's it. H2O is now installed and ready to be used. As a first step, tell R to import the H2O library with the library(h2o) command. Once the library is imported, instruct H2O to start itself by calling h2o.init(). Both commands are placed in the following code snippet for clarity.
library(h2o)
h2o.init()
The h2o.init() command is pretty smart and does a lot of things. First, it attempts to find an existing H2O instance before starting a new one. When no running instance is found (or explicitly specified via arguments), a new instance of H2O is started. As this is a fresh installation and it is highly unlikely there is an instance of H2O already running in your environment, a new instance is started right away. During startup, H2O prints some useful information: the R version it is running on, H2O's version, how to connect to H2O's Flow interface and where the error logs reside, just to name a few. An example of H2O's output during startup can be found in the snippet below.
> h2o.init()
H2O is not running yet, starting it now...
Note: In case of errors look at the following log files:
/tmp/RtmpYs7uDC/h2o_pavel_started_from_r.out
/tmp/RtmpYs7uDC/h2o_pavel_started_from_r.err
java version "1.8.0_171"
Java(TM) SE Runtime Environment (build 1.8.0_171-b11)
Java HotSpot(TM) 64-Bit Server VM (build 25.171-b11, mixed mode)
Starting H2O JVM and connecting: . Connection successful!
R is connected to the H2O cluster:
H2O cluster uptime: 3 seconds 44 milliseconds
H2O cluster timezone: Europe/Prague
H2O data parsing timezone: UTC
H2O cluster version: 3.20.0.2
H2O cluster version age: 8 days
H2O cluster name: H2O_started_from_R_pavel_yuw261
H2O cluster total nodes: 1
H2O cluster total memory: 5.21 GB
H2O cluster total cores: 8
H2O cluster allowed cores: 8
H2O cluster healthy: TRUE
H2O Connection ip: localhost
H2O Connection port: 54321
H2O Connection proxy: NA
H2O Internal Security: FALSE
H2O API Extensions: XGBoost, Algos, AutoML, Core V3, Core V4
R Version: R version 3.4.4 (2018-03-15)
Data import
Let's import a dataset and quickly train a model on it!
airlinesTrainData <- h2o.importFile("https://s3.amazonaws.com/h2o-airlines-unpacked/allyears2k.csv")
H2O will automatically download the dataset and parse it. It will also try to guess the datatype of each column automatically. H2O does a great job at datatype recognition; however, each decision can be overridden manually by the user if required. The imported dataset can also be given a name using the destination_frame argument. For example, h2o.importFile("https://s3.amazonaws.com/h2o-airlines-unpacked/allyears2k.csv", destination_frame='airlines_train') imports the very same dataset with airplane delays, which can then be addressed by the name airlines_train, even from other interfaces like Python, another R console, Flow, Java or direct API calls. If no name is provided, H2O generates an artificial one. Simply put, an imported dataset is called a Frame in H2O. The list of frames can be shown using the h2o.ls() function; an example of its output can be found in the following code snippet. The first record in the example is a named frame, the second one is a frame name generated automatically by H2O.
> h2o.ls()
key
1 airlines_train
2 allyears2k.hex_sid_ae6f_1
A preview of the imported data can be displayed by typing the variable pointing to the H2OFrame, in this case airlinesTrainData.
> airlinesTrainData
Year Month DayofMonth DayOfWeek DepTime CRSDepTime ArrTime CRSArrTime UniqueCarrier FlightNum TailNum ActualElapsedTime CRSElapsedTime AirTime ArrDelay DepDelay
1 1987 10 14 3 741 730 912 849 PS 1451 NA 91 79 NaN 23 11
2 1987 10 15 4 729 730 903 849 PS 1451 NA 94 79 NaN 14 -1
3 1987 10 17 6 741 730 918 849 PS 1451 NA 97 79 NaN 29 11
4 1987 10 18 7 729 730 847 849 PS 1451 NA 78 79 NaN -2 -1
5 1987 10 19 1 749 730 922 849 PS 1451 NA 93 79 NaN 33 19
6 1987 10 21 3 728 730 848 849 PS 1451 NA 80 79 NaN -1 -2
Origin Dest Distance TaxiIn TaxiOut Cancelled CancellationCode Diverted CarrierDelay WeatherDelay NASDelay SecurityDelay LateAircraftDelay IsArrDelayed
1 SAN SFO 447 NaN NaN 0 NA 0 NaN NaN NaN NaN NaN YES
2 SAN SFO 447 NaN NaN 0 NA 0 NaN NaN NaN NaN NaN YES
3 SAN SFO 447 NaN NaN 0 NA 0 NaN NaN NaN NaN NaN YES
4 SAN SFO 447 NaN NaN 0 NA 0 NaN NaN NaN NaN NaN NO
5 SAN SFO 447 NaN NaN 0 NA 0 NaN NaN NaN NaN NaN YES
6 SAN SFO 447 NaN NaN 0 NA 0 NaN NaN NaN NaN NaN NO
IsDepDelayed
1 YES
2 NO
3 YES
4 NO
5 YES
6 NO
Model training
On top of the imported data, a model can be built quickly. There are many algorithms available in H2O. For the purpose of this tutorial, the widely known Gradient Boosting Machine (GBM) method will be used. Let's train a model that predicts whether a plane arrives late, based on the month, the day of week and the distance the plane has to travel to reach its destination. By invoking h2o.gbm(...), H2O will run the gradient boosting algorithm on the data. There are many parameters to play with, which every data scientist can explore on their own; overriding the default hyperparameters would only make this tutorial more complicated. H2O only needs to know three things:
- predictor columns,
- response variable column,
- training frame - a dataset to train the model on.
Nothing more. H2O is even able to guess the distribution of the response variable, even though, as stated before, everything can be overridden manually by the data scientist if required. After a model is trained, basic information about it can be shown just by typing the name of the variable pointing to the trained model, in this case gbmModel.
> gbmModel <- h2o.gbm(x=c("Month", "DayOfWeek", "Distance"), y="IsArrDelayed", training_frame = airlinesTrainData)
|===========================================================================================================================================================| 100%
> gbmModel
Model Details:
==============
H2OBinomialModel: gbm
Model ID: GBM_model_R_1529834562303_470
Model Summary:
number_of_trees number_of_internal_trees model_size_in_bytes min_depth max_depth mean_depth min_leaves max_leaves mean_leaves
1 50 50 20591 5 5 5.00000 18 32 27.78000
H2OBinomialMetrics: gbm
** Reported on training data. **
MSE: 0.2349663
RMSE: 0.4847332
LogLoss: 0.6609351
Mean Per-Class Error: 0.4892883
AUC: 0.6237765
Gini: 0.2475531
Confusion Matrix (vertical: actual; across: predicted) for F1-optimal threshold:
NO YES Error Rate
NO 604 18933 0.969084 =18933/19537
YES 232 24209 0.009492 =232/24441
Totals 836 43142 0.435786 =19165/43978
Maximum Metrics: Maximum metrics at their respective thresholds
metric threshold value idx
1 max f1 0.431588 0.716423 368
2 max f2 0.355217 0.862241 395
3 max f0point5 0.513454 0.633681 278
4 max accuracy 0.511993 0.594661 279
5 max precision 0.972534 1.000000 0
6 max recall 0.347469 1.000000 397
7 max specificity 0.972534 1.000000 0
8 max absolute_mcc 0.605839 0.175948 151
9 max min_per_class_accuracy 0.539888 0.582464 234
10 max mean_per_class_accuracy 0.547177 0.584595 225
Gains/Lift Table: Extract with `h2o.gainsLift(<model>, <data>)` or `h2o.gainsLift(<model>, valid=<T/F>, xval=<T/F>)`
Overall, this model is not expected to perform very well, given the large error rate visible in the confusion matrix. By playing with different GBM hyperparameters and including different predictors in the model, much better results can be achieved; improving the model is a task every data scientist can explore from here. The h2o package offers many additional functions to work with the model and help a data scientist understand what happened during the training phase. As an example, the h2o.varimp function shows the importances of the variables (relative, scaled, percentage) taken into account by the model. As you begin exploring H2O, the reference documentation will guide you through all of H2O's functionality.
> h2o.varimp(gbmModel)
Variable Importances:
variable relative_importance scaled_importance percentage
1 Distance 1379.994019 1.000000 0.501313
2 Month 970.739441 0.703437 0.352643
3 DayOfWeek 402.024353 0.291323 0.146044
It looks like distance is much more important than month or day of week when it comes to a plane being delayed, at least according to this very basic model trained with default parameters. A pro-tip at the end: H2O supports XGBoost. It is trivial to swap GBM for XGBoost at this stage and see how the model changes with default hyperparameters:
xgBoostModel <- h2o.xgboost(x=c("Month", "DayOfWeek", "Distance"), y="IsArrDelayed", training_frame = airlinesTrainData)
Prediction
Prediction is very simple as well: call h2o.predict(model, data), where model is the variable pointing to the trained model and data is the H2OFrame with the data to predict on. To test the prediction in a very simple way, let's use gbmModel and let it predict on the original training dataset.
> h2o.predict(gbmModel, airlinesTrainData)
|===========================================================================================================================================================| 100%
predict NO YES
1 YES 0.1419937 0.8580063
2 YES 0.1015739 0.8984261
3 YES 0.2036055 0.7963945
4 YES 0.1239904 0.8760096
5 YES 0.1384360 0.8615640
6 YES 0.1419937 0.8580063
[43978 rows x 3 columns]
The predictions are not very accurate for flights that were not delayed. This is expected, as the model is very basic. Of course, the confusion matrix seen earlier in this tutorial already gave away this "bad" performance beforehand.
Getting started in Python
In order to get started in Python, only a few lines of code are required. Dependencies in Python are commonly installed with pip or Anaconda. In this tutorial, pip is preferred, as most users are familiar with it. If you'd like to use Conda, please follow the tutorial in the H2O documentation. Python 2.7 up to Python 3.x is supported; the differences are minimal and this guide should work on both versions.
Installation
Before installing H2O itself, a few dependencies are required. Please install them using pip install. On some systems, super-user privileges may be required; if so, prefixing the commands with sudo will solve the problem.
pip install requests
pip install tabulate
pip install scikit-learn
pip install colorama
pip install future
Once the required dependencies are installed, the last step is to install H2O itself. Installation of the latest stable release is demonstrated in the following snippet. During the installation, a one-time download of the H2O backend, containing all the algorithms and computing know-how, will occur.
pip install -f https://h2o-release.s3.amazonaws.com/h2o/latest_stable_Py.html h2o
That's it. H2O is now installed and ready to be used. As a first step, tell Python to import the H2O module with the import h2o command. Once the module is imported, instruct H2O to start itself by calling h2o.init(). Both commands are placed in the following code snippet for clarity. The setup process is very similar to R.
import h2o
h2o.init()
The h2o.init() command is pretty smart and does a lot of things. First, it attempts to find an existing H2O instance before starting a new one. When no running instance is found (or explicitly specified via arguments), a new instance of H2O is started. As this is a fresh installation and it is highly unlikely there is an instance of H2O already running in your environment, a new instance is started right away. During startup, H2O prints some useful information: the Python version it is running on, H2O's version, how to connect to H2O's Flow interface and where the error logs reside, just to name a few. An example of H2O's output during startup can be found in the snippet below, followed by a short sketch of the arguments h2o.init() accepts.
>>> h2o.init()
Checking whether there is an H2O instance running at https://localhost:54321..... not found.
Attempting to start a local H2O server...
Java Version: java version "1.8.0_171"; Java(TM) SE Runtime Environment (build 1.8.0_171-b11); Java HotSpot(TM) 64-Bit Server VM (build 25.171-b11, mixed mode)
Starting server from /usr/local/lib/python2.7/dist-packages/h2o/backend/bin/h2o.jar
Ice root: /tmp/tmp7R8OvB
JVM stdout: /tmp/tmp7R8OvB/h2o_pavel_started_from_python.out
JVM stderr: /tmp/tmp7R8OvB/h2o_pavel_started_from_python.err
Server is running at https://127.0.0.1:54321
Connecting to H2O server at https://127.0.0.1:54321... successful.
versionFromGradle='3.19.0',projectVersion='3.19.0.99999',branch='pavel_pubdev-5336',lastCommitHash='4e976bde05d5096d31a5889a340af56ae256c8c0',gitDescribe='jenkins-master-4235-5-g4e976bd',compiledOn='2018-03-18 16:38:51',compiledBy='pavel'
-------------------------- ----------------------------------------
H2O cluster uptime: 01 secs
H2O cluster timezone: Europe/Prague
H2O data parsing timezone: UTC
H2O cluster version: 3.19.0.99999
H2O cluster version age: 3 months and 5 days
H2O cluster name: H2O_from_python_pavel_oy9g50
H2O cluster total nodes: 1
H2O cluster free memory: 5.207 Gb
H2O cluster total cores: 8
H2O cluster allowed cores: 8
H2O cluster status: accepting new members, healthy
H2O connection url: https://127.0.0.1:54321
H2O connection proxy:
H2O internal security: False
H2O API Extensions: XGBoost, Algos, AutoML, Core V3, Core V4
Python version: 2.7.15 candidate
-------------------------- ----------------------------------------
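For reference, h2o.init() accepts arguments controlling where and how the instance starts. A minimal sketch (the values below are illustrative, not recommendations):

import h2o

# Connect to (or start) an instance at a specific address and port,
# limiting the memory and CPU cores available to the H2O JVM
h2o.init(ip="localhost", port=54321, max_mem_size="4G", nthreads=2)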
Data import
Let's import a dataset and quickly train a model on it!
airlines_train_data = h2o.import_file("https://s3.amazonaws.com/h2o-airlines-unpacked/allyears2k.csv")
H2O will automatically download the dataset and parse it. It will also try to guess the datatype of each column automatically. H2O does a great job at datatype recognition; however, each decision can be overridden manually by the user if required (a short sketch of this follows the listing below). The imported dataset can also be given a name using the destination_frame argument. For example, h2o.import_file("https://s3.amazonaws.com/h2o-airlines-unpacked/allyears2k.csv", destination_frame='airlines_train') imports the very same dataset with airplane delays, which can then be addressed by the name airlines_train, even from other interfaces like R, another Python console, Flow, Java or direct API calls. If no name is provided, H2O generates an artificial one. Simply put, an imported dataset is called a Frame in H2O. The list of frames can be shown using the h2o.ls() function; an example of its output can be found in the following code snippet. The first record in the example is a named frame, the second one is a frame name generated automatically by H2O.
>>> h2o.ls()
key
0 airlines_train
1 allyears2k.hex
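As mentioned above, both the frame name and the guessed column types can be controlled at import time, and a named frame can be fetched by any connected client. A brief sketch (the frame name airlines_typed and the type override are illustrative):

# Treat the Year column as categorical (enum) instead of numeric;
# col_types maps column names to the desired types
typed_frame = h2o.import_file(
    "https://s3.amazonaws.com/h2o-airlines-unpacked/allyears2k.csv",
    destination_frame="airlines_typed",
    col_types={"Year": "enum"})

# Any client connected to the same H2O instance can fetch a frame by its name
same_frame = h2o.get_frame("airlines_train")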
A preview of the imported data can be displayed by typing the variable pointing to the H2OFrame, in this case airlines_train_data.
>>> airlines_train_data
Year Month DayofMonth DayOfWeek DepTime CRSDepTime ArrTime CRSArrTime UniqueCarrier FlightNum TailNum ActualElapsedTime CRSElapsedTime AirTime ArrDelay DepDelay
1 1987 10 14 3 741 730 912 849 PS 1451 NA 91 79 NaN 23 11
2 1987 10 15 4 729 730 903 849 PS 1451 NA 94 79 NaN 14 -1
3 1987 10 17 6 741 730 918 849 PS 1451 NA 97 79 NaN 29 11
4 1987 10 18 7 729 730 847 849 PS 1451 NA 78 79 NaN -2 -1
5 1987 10 19 1 749 730 922 849 PS 1451 NA 93 79 NaN 33 19
6 1987 10 21 3 728 730 848 849 PS 1451 NA 80 79 NaN -1 -2
Origin Dest Distance TaxiIn TaxiOut Cancelled CancellationCode Diverted CarrierDelay WeatherDelay NASDelay SecurityDelay LateAircraftDelay IsArrDelayed
1 SAN SFO 447 NaN NaN 0 NA 0 NaN NaN NaN NaN NaN YES
2 SAN SFO 447 NaN NaN 0 NA 0 NaN NaN NaN NaN NaN YES
3 SAN SFO 447 NaN NaN 0 NA 0 NaN NaN NaN NaN NaN YES
4 SAN SFO 447 NaN NaN 0 NA 0 NaN NaN NaN NaN NaN NO
5 SAN SFO 447 NaN NaN 0 NA 0 NaN NaN NaN NaN NaN YES
6 SAN SFO 447 NaN NaN 0 NA 0 NaN NaN NaN NaN NaN NO
IsDepDelayed
1 YES
2 NO
3 YES
4 NO
5 YES
6 NO
Model training
On top of the imported data, a model can be built quickly. There are many algorithms available in H2O. For the purpose of this tutorial, the widely known Gradient Boosting Machine (GBM) method will be used. Let's train a model that predicts whether a plane arrives late, based on the month, the day of week and the distance the plane has to travel to reach its destination. GBM resides in the h2o.estimators.gbm module. The first step is to import H2OGradientBoostingEstimator to avoid typing the fully qualified name later.
from h2o.estimators.gbm import H2OGradientBoostingEstimator
First, construct a new GBM estimator instance by calling the gbm_model = H2OGradientBoostingEstimator() constructor. By invoking gbm_model.train(...), H2O will run the gradient boosting algorithm on the data. There are many parameters to play with, which every data scientist can explore on their own; overriding the default hyperparameters would only make this tutorial more complicated (a short sketch of what that looks like closes this section). H2O only needs to know three things:
- predictor columns,
- response variable column,
- training frame - a dataset to train the model on.
gbm_model = H2OGradientBoostingEstimator()
gbm_model.train(x = ["Month", "DayOfWeek", "Distance"], y = "IsArrDelayed", training_frame=airlines_train_data)
Nothing more. H2O is even able to guess the distribution of the response variable, even though, as stated before, everything can be overridden manually by the data scientist if required. After a model is trained, basic information about it can be shown just by typing the name of the variable pointing to the trained model, in this case gbm_model. The next snippet demonstrates the previous steps regarding model training, with H2O's output included.
>>> from h2o.estimators.gbm import H2OGradientBoostingEstimator
>>> gbm_model = H2OGradientBoostingEstimator();
>>> gbm_model.train(x = ["Month", "DayOfWeek", "Distance"], y = "IsArrDelayed", training_frame=airlines_train_data)
gbm Model Build progress: |███████████████████████████████████████████████████████████████████| 100%
>>> gbm_model
Model Details
=============
H2OGradientBoostingEstimator : Gradient Boosting Machine
Model Key: GBM_model_python_1529844691141_1
ModelMetricsBinomial: gbm
** Reported on train data. **
MSE: 0.234966307798
RMSE: 0.484733233643
LogLoss: 0.660935092712
Mean Per-Class Error: 0.41540494848
AUC: 0.623776525749
Gini: 0.247553051497
Confusion Matrix (Act/Pred) for max f1 @ threshold = 0.43158830279:
NO YES Error Rate
----- ---- ----- ------- -----------------
NO 604 18933 0.9691 (18933.0/19537.0)
YES 232 24209 0.0095 (232.0/24441.0)
Total 836 43142 0.4358 (19165.0/43978.0)
Maximum Metrics: Maximum metrics at their respective thresholds
metric threshold value idx
--------------------------- ----------- -------- -----
max f1 0.431588 0.716423 368
max f2 0.355217 0.862241 395
max f0point5 0.513454 0.633681 278
max accuracy 0.511993 0.594661 279
max precision 0.972534 1 0
max recall 0.347469 1 397
max specificity 0.972534 1 0
max absolute_mcc 0.605839 0.175948 151
max min_per_class_accuracy 0.539888 0.582464 234
max mean_per_class_accuracy 0.547177 0.584595 225
Gains/Lift Table: Avg response rate: 55,58 %
group cumulative_data_fraction lower_threshold lift cumulative_lift response_rate cumulative_response_rate capture_rate cumulative_capture_rate gain cumulative_gain
-- ------- -------------------------- ----------------- -------- ----------------- --------------- -------------------------- -------------- ------------------------- -------- -----------------
1 0.0100732 0.897096 1.70593 1.70593 0.948081 0.948081 0.0171842 0.0171842 70.5933 70.5933
2 0.0221247 0.864273 1.64658 1.6736 0.915094 0.930113 0.0198437 0.0370279 64.6578 67.3602
3 0.0302651 0.83021 1.54302 1.63848 0.857542 0.910594 0.0125609 0.0495888 54.3021 63.848
4 0.040111 0.753167 1.40873 1.58208 0.78291 0.879252 0.0138701 0.0634589 40.8732 58.2085
5 0.0514803 0.696079 1.34952 1.53072 0.75 0.850707 0.0153431 0.078802 34.9515 53.0722
6 0.103688 0.649372 1.30093 1.41502 0.722997 0.786404 0.0679187 0.146721 30.0926 41.5018
7 0.150348 0.618087 1.225 1.35605 0.680799 0.75363 0.0571581 0.203879 22.4998 35.6046
8 0.20101 0.600111 1.17103 1.30942 0.650808 0.727715 0.0593265 0.263205 17.1034 30.9416
9 0.3004 0.578542 1.07854 1.23303 0.599405 0.685262 0.107197 0.370402 7.85418 23.3029
10 0.400632 0.557964 1.03234 1.18282 0.57373 0.657359 0.103474 0.473876 3.23424 18.282
11 0.500568 0.540445 1.00633 1.14758 0.559272 0.637776 0.100569 0.574445 0.632788 14.7584
12 0.606894 0.525153 0.973175 1.11703 0.540847 0.620794 0.103474 0.677918 -2.68253 11.7028
13 0.701874 0.508853 0.921 1.0905 0.511851 0.606052 0.087476 0.765394 -7.89998 9.05014
14 0.800673 0.492089 0.869653 1.06325 0.483314 0.590907 0.0859212 0.851315 -13.0347 6.32497
15 0.90261 0.465819 0.80997 1.03465 0.450145 0.575009 0.0825662 0.933882 -19.003 3.46453
16 1 0.279442 0.678906 1 0.377306 0.555755 0.0661184 1 -32.1094 0
Scoring History:
timestamp duration number_of_trees training_rmse training_logloss training_auc training_lift training_classification_error
--- ------------------- ---------- ----------------- --------------- ------------------ -------------- --------------- -------------------------------
2018-06-24 15:03:17 0.087 sec 0.0 0.49688163904 0.686916957642 0.5 1.0 0.444244849698
2018-06-24 15:03:17 0.368 sec 1.0 0.495487418342 0.684101672491 0.583881451988 1.61617611229 0.444244849698
2018-06-24 15:03:18 0.487 sec 2.0 0.494360221437 0.681803341309 0.584731061951 1.65529653938 0.444244849698
2018-06-24 15:03:18 0.571 sec 3.0 0.493435134409 0.679892007692 0.584823366972 1.68638964557 0.444244849698
2018-06-24 15:03:18 0.659 sec 4.0 0.492679667171 0.678305979102 0.585117301795 1.65729931801 0.444244849698
--- --- --- --- --- --- --- --- ---
2018-06-24 15:03:19 1.667 sec 46.0 0.485057037114 0.661619456851 0.622140956624 1.70738658629 0.435422256583
2018-06-24 15:03:19 1.688 sec 47.0 0.484945381131 0.661377063627 0.622823558497 1.70738658629 0.435604165719
2018-06-24 15:03:19 1.710 sec 48.0 0.484866365041 0.661216334425 0.623275804097 1.70738658629 0.435877029424
2018-06-24 15:03:19 1.734 sec 49.0 0.484810585256 0.661098923553 0.623470282123 1.70738658629 0.435604165719
2018-06-24 15:03:19 1.760 sec 50.0 0.484733233643 0.660935092712 0.623776525749 1.70593338378 0.435786074856
See the whole table with table.as_data_frame()
Variable Importances:
variable relative_importance scaled_importance percentage
---------- --------------------- ------------------- ------------
Distance 1379.99 1 0.501313
Month 970.739 0.703437 0.352643
DayOfWeek 402.024 0.291323 0.146044
Overall, this model is not expected to perform very well, given the large error rate visible in the confusion matrix. By playing with different GBM hyperparameters and including different predictors in the model, much better results can be achieved; improving the model is a task every data scientist can explore from here. To get detailed information about the model and its scoring history, invoke the print(gbm_model) command. The output contains a table with the importances of the variables (relative, scaled, percentage) taken into account by the model. A detailed scoring history is also available, as well as basic measures like the mean squared error (MSE). As you begin exploring H2O, the reference documentation will guide you through all of H2O's functionality. A shortened example of the detailed view of a GBM model can be found in the next snippet; some of the output was omitted.
print(gbm_model)
Model Details
=============
H2OGradientBoostingEstimator : Gradient Boosting Machine
Model Key: GBM_model_python_1529844691141_1
ModelMetricsBinomial: gbm
** Reported on train data. **
Scoring History:
timestamp duration number_of_trees training_rmse training_logloss training_auc training_lift training_classification_error
--- ------------------- ---------- ----------------- --------------- ------------------ -------------- --------------- -------------------------------
2018-06-24 15:03:17 0.087 sec 0.0 0.49688163904 0.686916957642 0.5 1.0 0.444244849698
2018-06-24 15:03:17 0.368 sec 1.0 0.495487418342 0.684101672491 0.583881451988 1.61617611229 0.444244849698
2018-06-24 15:03:18 0.487 sec 2.0 0.494360221437 0.681803341309 0.584731061951 1.65529653938 0.444244849698
2018-06-24 15:03:18 0.571 sec 3.0 0.493435134409 0.679892007692 0.584823366972 1.68638964557 0.444244849698
2018-06-24 15:03:18 0.659 sec 4.0 0.492679667171 0.678305979102 0.585117301795 1.65729931801 0.444244849698
--- --- --- --- --- --- --- --- ---
2018-06-24 15:03:19 1.667 sec 46.0 0.485057037114 0.661619456851 0.622140956624 1.70738658629 0.435422256583
2018-06-24 15:03:19 1.688 sec 47.0 0.484945381131 0.661377063627 0.622823558497 1.70738658629 0.435604165719
2018-06-24 15:03:19 1.710 sec 48.0 0.484866365041 0.661216334425 0.623275804097 1.70738658629 0.435877029424
2018-06-24 15:03:19 1.734 sec 49.0 0.484810585256 0.661098923553 0.623470282123 1.70738658629 0.435604165719
2018-06-24 15:03:19 1.760 sec 50.0 0.484733233643 0.660935092712 0.623776525749 1.70593338378 0.435786074856
See the whole table with table.as_data_frame()
Variable Importances:
variable relative_importance scaled_importance percentage
---------- --------------------- ------------------- ------------
Distance 1379.99 1 0.501313
Month 970.739 0.703437 0.352643
DayOfWeek 402.024 0.291323 0.146044
It looks like distance is much more important than month or day of week when it comes to a plane being delayed, at least according to this very basic model trained with default parameters. A pro-tip at the end: H2O supports XGBoost. It is trivial to swap GBM for XGBoost at this stage and see how the model changes with default hyperparameters:
from h2o.estimators.xgboost import H2OXGBoostEstimator
xgb_model = H2OXGBoostEstimator()
xgb_model.train(x = ["Month", "DayOfWeek", "Distance"], y = "IsArrDelayed", training_frame=airlines_train_data)
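When you are ready to move beyond the defaults, hyperparameters are passed to the estimator's constructor. A minimal sketch, with arbitrary starting values rather than tuned recommendations:

from h2o.estimators.gbm import H2OGradientBoostingEstimator

# Same training call as before, with a few commonly tuned hyperparameters
gbm_tuned = H2OGradientBoostingEstimator(
    ntrees=100,       # number of trees to build (the default is 50)
    max_depth=4,      # maximum depth of each tree
    learn_rate=0.05,  # shrinkage; lower values usually need more trees
    seed=42)          # fixed seed for reproducibility
gbm_tuned.train(x=["Month", "DayOfWeek", "Distance"], y="IsArrDelayed",
                training_frame=airlines_train_data)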
Prediction
Once a model is created, predictions are made by calling the predict(data) method on the model, where the data argument is the variable pointing to an H2OFrame with the data to predict on. To test the prediction in a very simple way, let's use gbm_model and let it predict on the original training dataset by issuing the gbm_model.predict(airlines_train_data) command.
>>> gbm_model.predict(airlines_train_data)
gbm prediction progress: |████████████████████████████████████████████████████████████████████| 100%
predict NO YES
--------- -------- --------
YES 0.141994 0.858006
YES 0.101574 0.898426
YES 0.203606 0.796394
YES 0.12399 0.87601
YES 0.138436 0.861564
YES 0.141994 0.858006
YES 0.101574 0.898426
YES 0.103421 0.896579
YES 0.203606 0.796394
YES 0.12399 0.87601
[43978 rows x 3 columns]
The result of the prediction is an H2OFrame. A pointer to it can be saved into a variable as well, e.g. prediction = gbm_model.predict(airlines_train_data). The table printed is only a preview of the first few predictions made. As the example above demonstrates, the predictions are not very accurate for flights that were not delayed. This is expected, as the model is very basic. Of course, the confusion matrix seen earlier in this tutorial already gave away this "bad" performance beforehand.
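Since the result is a regular H2OFrame, it can be manipulated like any other frame. For instance, a small sketch (continuing the session above) placing predicted and actual labels side by side and pulling them into pandas:

prediction = gbm_model.predict(airlines_train_data)

# Put the predicted label next to the actual one
comparison = prediction["predict"].cbind(airlines_train_data["IsArrDelayed"])

# Convert the first rows into a pandas DataFrame for local inspection
print(comparison.head().as_data_frame())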
Getting started with Flow
Flow is H2O's web interface. It is very powerful and visually oriented. It has it all: from fast prototyping of new models, through visualizing and refining existing ones, to scoring.
Flow is active whenever H2O is started. If you've followed the previous Python or R tutorials, Flow can be reached from your browser during the active Python or R session. Whenever h2o.init() is called, Flow is started together with H2O, and the URL and port are printed in the output. When run locally, Flow is usually bound to localhost, with the default port being 54321. The address or port may be changed through h2o.init() arguments, e.g. h2o.init(ip='127.0.0.1', port=10001).
>>> h2o.init()
Checking whether there is an H2O instance running at https://localhost:54321..... not found.
Attempting to start a local H2O server...
//Some output omitted
-------------------------- ----------------------------------------
H2O cluster uptime: 01 secs
H2O cluster timezone: Europe/Prague
H2O data parsing timezone: UTC
H2O cluster version: 3.19.0.99999
H2O cluster version age: 3 months and 5 days
H2O cluster name: H2O_from_python_pavel_oy9g50
H2O cluster total nodes: 1
H2O cluster free memory: 5.207 Gb
H2O cluster total cores: 8
H2O cluster allowed cores: 8
H2O cluster status: accepting new members, healthy
H2O connection url: https://127.0.0.1:54321 <<<<--- URL to FLOW
H2O connection proxy:
H2O internal security: False
H2O API Extensions: XGBoost, Algos, AutoML, Core V3, Core V4
Python version: 2.7.15 candidate
-------------------------- ----------------------------------------
From the shortened output, the actual URL can be observed: H2O connection url: https://127.0.0.1:54321. As 127.0.0.1 resolves to localhost, copying the given URL or simply typing localhost:54321 into a web browser opens H2O Flow.
Starting Flow without Python/R
There is no need to install the Python module or R library in order to run H2O. Just download the latest stable H2O release and run it. Java is required to run H2O.
- Download package with latest stable release
- Unpack it
- Run java -jar h2o.jar
The h2o.jar file is found in the extracted directory. If you're on Windows and Java is installed, double-clicking the h2o.jar file should be sufficient to run it. At the time of writing, Java 7 is the minimum required version; Java 8 is recommended.
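The launcher also accepts a few useful flags, for example to rename the cluster or change the port (shown as an illustration; java -jar h2o.jar -help prints the full list):

java -Xmx4g -jar h2o.jar -name my_h2o_cluster -port 54322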
Importing Data with Flow
At the top of the page, click on Data. As the menu appears, choose Import files. Do not confuse it with the "Upload file" option, which is only useful for uploading files from your own machine. After clicking on the Import files option, a form appears at the bottom of the page. Fill the Search input box with https://s3.amazonaws.com/h2o-airlines-unpacked/allyears2k.csv to download the Airlines dataset described at the beginning of this chapter. By clicking on the magnifying glass next to the input box, H2O verifies it can reach the file. This way, even local files or files from remote filesystems (including the Hadoop filesystem) can be imported. If a whole folder is entered, H2O will list all the files available in it, enabling easy multi-file import. In this simple case, after clicking on the magnifying glass, a single file named exactly like the URL entered into the input box will appear a little lower. By clicking on the plus icon, the file is marked for import. By clicking the Import button, H2O will automatically download the file.
After the file is imported, another form appears. This time, H2O confirms the file has been imported and asks the user to start parsing it. By clicking on Parse these files, H2O will begin parsing.
However, H2O is not going to parse the whole file right away. The file could potentially be very big (or there may be multiple files). Instead, H2O automatically detects the format and suggests the best parsing settings based on internal heuristics. In the case of the CSV imported in this tutorial, it correctly detects the format, sets the separator to a comma and detects the column headers in the first row. There is more: column datatypes are detected as well. Yet the data scientist has the power to override every decision, e.g. change a column type. By clicking on the Parse button at the very bottom, the file is finally parsed by H2O.
The sample dataset is very small (4.4 MB) and H2O's CSV parser is very fast, so the parsing should be quick. After the parsing phase, H2O will inform you that the dataset is ready for inspection. By clicking on the View button, a table preview of the parsed data appears. Here, the user has several choices, including viewing all the data, exporting it, splitting it into several smaller chunks (e.g. for testing) or, most importantly, Build Model.
By clicking on the Build model button above the data overview, a dialogue appears asking the user to select an algorithm. In this tutorial, GBM is used as a simple example, therefore select Gradient Boosting Machine. As with Python and R, let's train a model that predicts whether a plane arrives late, based on the month, the day of week and the distance the plane has to travel to reach its destination. From top to bottom, only a few of the choices need your attention right now. Most of them are hyperparameters, which are of course important in the real world, yet for a getting-started guide the default values will suffice.
The training_frame option is already pre-set by Flow to point to the imported data. Creating a validation frame from the training frame is easy using the following option, which we'll leave intact for now. Since this model tries to predict whether the plane will be delayed on arrival at its destination, the response column should be set to IsArrDelayed. The prediction is based on month, day of week and travel distance (chosen arbitrarily). Therefore, in the ignored-columns section, first check all the columns as ignored and leave only Month, DayOfWeek, Distance and IsArrDelayed (the response variable) unchecked.
After the parameters are set, click on the Build model button at the bottom to start the model training.
Model training in this small case should be done in an instant. H2O will again display a message about the training progress, and after the training is finished, the model can be displayed by clicking on the View button. A complete model description appears: variable importances, logloss, a complete history of each training round (including durations) and much more can be observed.
Predicting with Flow
After the model is built, click the Predict button (with a lightning icon) at the top of the model summary reached in the previous steps. To test the prediction in a very simple way, let's use the current model and let it predict on the original training dataset. After clicking on the Predict button, select the training frame in the dropdown menu's Frame option and click on Predict. The resulting predictions appear, including various metrics. The model is not very accurate; especially the predictions of a plane not arriving late are very inaccurate. Creating a better model by tuning the hyperparameters or playing with various predictors is now up to each and every data scientist.
Where to go next?
In this hands-on guide, only the most basic ways to interact with H2O were shown. However, there is so much more to H2O. You can take H2O from your laptop to large clusters, creating models and predicting on extremely large datasets. Cross-validation, stacked ensembles, tuning parameters with grid search and much more of the functionality you would expect is available. After data scientists tune a model, deploying it into production is easy with H2O's POJO/MOJO functionality. The model is simply exported into a self-contained package, making it easy for engineers to plug it into any Java-based production environment. The model's performance can be observed in production in real time, providing useful reports.
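As a taste of what that looks like, here is a small Python sketch (continuing the session from the Python chapter; the hyperparameter values and export path are illustrative) that runs a grid search over GBM hyperparameters and exports the best model as a MOJO:

from h2o.grid.grid_search import H2OGridSearch
from h2o.estimators.gbm import H2OGradientBoostingEstimator

# Try a few combinations of tree depth and learning rate
hyper_params = {"max_depth": [3, 5, 7], "learn_rate": [0.05, 0.1]}
grid = H2OGridSearch(model=H2OGradientBoostingEstimator, hyper_params=hyper_params)
grid.train(x=["Month", "DayOfWeek", "Distance"], y="IsArrDelayed",
           training_frame=airlines_train_data)

# Pick the model with the best AUC and export it for production use
best_model = grid.get_grid(sort_by="auc", decreasing=True).models[0]
best_model.download_mojo(path="/tmp")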
Data scientists and developers are best served by visiting the H2O documentation. It provides step-by-step tutorials for new users, as well as guides for users interested specifically in R or Python. The documentation gives an overview of what H2O is capable of: besides a list of available algorithms, methods of cross-validation and the ability to build models on top of existing ones, it describes ways of quickly productionizing the models you create.
In the documentation, everything from data import, exploration and filtering to deployment into production is described. Running on Hadoop? No problem. Using Apache Spark? Sparkling Water is at your service. H2O also integrates well with various cloud services, and it offers countless tutorials on GitHub.
Video tutorials, talks and expert advice are available on H2O's YouTube channel. There is so much to explore.
H2O also holds conferences named H2O World. Come and visit us, we’ll be glad to talk to you.