How to build a decision tree model in IBM Db2


After developing a machine learning model, you need a place to run the model and serve predictions. If your organization is in the early stage of its AI journey or has budget constraints, you may struggle to find a deployment platform for your model. Building ML infrastructure and integrating ML models with the larger business are major bottlenecks to AI adoption [1,2,3]. IBM Db2 can help solve these problems with its built-in ML infrastructure. Anyone with knowledge of SQL and access to a Db2 instance where the in-database ML feature is enabled can easily learn to build and use a machine learning model in the database.

In this post, I'll show how to develop, deploy, and use a decision tree model in a Db2 database.

These are the main steps in this tutorial:

  1. Set up Db2 tables
  2. Explore the ML dataset
  3. Preprocess the dataset
  4. Train a decision tree model
  5. Generate predictions using the model
  6. Evaluate the model

I performed these steps in a Db2 Warehouse on-premises database. Db2 Warehouse on Cloud also supports these ML features.

The machine learning use case

I'll use a dataset of historical flights in the US. For each flight, the dataset includes information such as the flight's origin airport, departure time, flying time, and arrival time. A column in the dataset also indicates whether each flight arrived on time or late. Using examples from the dataset, we'll build a classification model with the decision tree algorithm. Once trained, the model can receive unseen flight data as input and predict whether the flight will arrive on time or late at its destination.

1. Set up Db2 tables

The dataset I use in this tutorial is available here as a CSV file.

Creating a Db2 table

I use the following SQL to create a table for storing the dataset.

db2start
db2 connect to <database_name>

db2 "CREATE TABLE FLIGHTS.FLIGHTS_DATA_V3  (
ID INTEGER NOT NULL GENERATED BY DEFAULT AS IDENTITY,
YEAR INTEGER ,
QUARTER INTEGER ,
MONTH INTEGER ,
DAYOFMONTH INTEGER ,                       
DAYOFWEEK INTEGER ,                       
UNIQUECARRIER VARCHAR(50 OCTETS) ,               
ORIGIN VARCHAR(50 OCTETS) ,                      
DEST VARCHAR(50 OCTETS) ,                       
CRSDEPTIME INTEGER ,                       
DEPTIME INTEGER ,                       
DEPDELAY REAL ,                       
DEPDEL15 REAL ,                       
TAXIOUT INTEGER ,                       
WHEELSOFF INTEGER ,                       
CRSARRTIME INTEGER ,                       
CRSELAPSEDTIME INTEGER ,                       
AIRTIME INTEGER ,                       
DISTANCEGROUP INTEGER ,                       
FLIGHTSTATUS VARCHAR(1) )
ORGANIZE BY ROW";

After creating the table, I use the following command to load the data from the CSV file into the table:

db2 "IMPORT FROM 'FLIGHTS_DATA_V3.csv' OF DEL COMMITCOUNT 50000 INSERT INTO FLIGHTS.FLIGHTS_DATA_V3"

I now have the ML dataset loaded into the FLIGHTS.FLIGHTS_DATA_V3 table in Db2. I'll copy a subset of the data from this table to a separate table for ML model development and evaluation, leaving the original copy of the data intact.

SELECT count(*) FROM FLIGHTS.FLIGHTS_DATA_V3

 — — — 
1000000

Creating a separate table with sample data

Create a table with a 10% sample of rows from the table above, using Db2's RAND function for random sampling.

CREATE TABLE FLIGHT.FLIGHTS_DATA AS (SELECT * FROM FLIGHTS.FLIGHTS_DATA_V3 WHERE RAND() < 0.1) WITH DATA

Count the number of rows in the sample table.

SELECT count(*) FROM FLIGHT.FLIGHTS_DATA

— — — 
99879

Look into the schema definition of the table.

SELECT NAME, COLTYPE, LENGTH
FROM SYSIBM.SYSCOLUMNS
WHERE TBCREATOR = 'FLIGHT' AND TBNAME = 'FLIGHTS_DATA'
ORDER BY COLNO

FLIGHTSTATUS is the response, or target, column. The others are feature columns.

Find the DISTINCT values in the target column.
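The query itself isn't shown in the post; a minimal version (my own sketch, assuming the sample table created above) would be:

```sql
SELECT DISTINCT FLIGHTSTATUS
FROM FLIGHT.FLIGHTS_DATA
```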

From these values, I can see that this is a binary classification task where each flight arrived either on time or late.

Find the frequencies of the distinct values in the FLIGHTSTATUS column.

SELECT FLIGHTSTATUS, count(*) AS FREQUENCY, CAST(count(*) AS DOUBLE) / (SELECT count(*) FROM FLIGHT.FLIGHTS_DATA) AS FRACTION -- cast avoids integer division truncating FRACTION to 0
FROM FLIGHT.FLIGHTS_DATA fdf
GROUP BY FLIGHTSTATUS

From the above, I see that the classes are imbalanced. I won't extract any further insights from the full dataset at this point, as that could leak information into the modeling phase.

Creating train/test partitions of the dataset

Before collecting deeper insights into the data, I'll divide this dataset into train and test partitions using Db2's RANDOM_SAMPLE stored procedure (SP). I apply stratified sampling to preserve the ratio between the two classes in the generated training dataset.

Create a TRAIN partition.

CALL IDAX.RANDOM_SAMPLE('intable=FLIGHT.FLIGHTS_DATA, fraction=0.8, outtable=FLIGHT.FLIGHTS_TRAIN, by=FLIGHTSTATUS')

Copy the remaining data to a TEST partition.

CREATE TABLE FLIGHT.FLIGHTS_TEST AS (SELECT * FROM FLIGHT.FLIGHTS_DATA FDF WHERE FDF.ID NOT IN(SELECT FT.ID FROM FLIGHT.FLIGHTS_TRAIN FT)) WITH DATA
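Since stratified sampling is supposed to preserve the class ratio, a quick sanity check (my own addition, not part of the original workflow) is to compare the class frequencies across the two partitions; the fraction of delayed flights should be nearly identical in both:

```sql
SELECT 'TRAIN' AS PART, FLIGHTSTATUS, COUNT(*) AS CNT
FROM FLIGHT.FLIGHTS_TRAIN
GROUP BY FLIGHTSTATUS
UNION ALL
SELECT 'TEST' AS PART, FLIGHTSTATUS, COUNT(*) AS CNT
FROM FLIGHT.FLIGHTS_TEST
GROUP BY FLIGHTSTATUS
```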

2. Explore the data

In this step, I'll look at both sample data and summary statistics of the training dataset to gain insights into the dataset.

Look at some sample data.

SELECT * FROM FLIGHT.FLIGHTS_TRAIN FETCH FIRST 10 ROWS ONLY

Some columns encode the time of day as numbers:

 — CRSDEPTIME: Computer Reservation System (scheduled) Departure Time (hhmm)

 — DEPTIME: Departure Time (hhmm)

 — CRSARRTIME: Computer Reservation System (scheduled) Arrival Time (hhmm)

Next, I collect summary statistics from FLIGHTS_TRAIN using the SUMMARY1000 SP to get a global view of the characteristics of the dataset.

CALL IDAX.SUMMARY1000('intable=FLIGHT.FLIGHTS_TRAIN, outtable=FLIGHT.FLIGHTS_TRAIN_SUM1000')

Here, intable is the name of the input table from which I want the SUMMARY1000 SP to collect statistics. outtable is the name of the table where SUMMARY1000 will store the gathered statistics for the entire dataset. Besides the outtable, the SUMMARY1000 SP creates a few additional output tables, one with statistics for each column type. Our dataset has two types of columns, numeric and nominal, so SUMMARY1000 will generate two additional tables. These additional tables follow this naming convention: the name of the outtable + the column type. In our case, the column types are NUM, for numeric, and CHAR, for nominal. So the names of these two additional tables will be as follows:

FLIGHTS_TRAIN_SUM1000_NUM

FLIGHTS_TRAIN_SUM1000_CHAR

Having the statistics available in separate tables for each data type makes it easier to view the statistics that apply to a specific data type and reduces the number of columns whose statistics are viewed together. This simplifies the analysis process.

Check the summary statistics of the numeric columns.

SELECT * FROM FLIGHT.FLIGHTS_TRAIN_SUM1000_NUM

For the numeric columns, SUMMARY1000 gathers the following statistics:

  • Missing value count
  • Non-missing value count
  • Average
  • Variance
  • Standard deviation
  • Skewness
  • Excess kurtosis
  • Minimum
  • Maximum

Each of these statistics can help uncover insights into the dataset. For instance, I can see that the DEPDEL15 and DEPDELAY columns have 49 missing values. There are large values in these columns: AIRTIME, CRSARRTIME, CRSDEPTIME, CRSELAPSEDTIME, DEPDELAY, DEPTIME, TAXIOUT, WHEELSOFF, and YEAR. Since I'll create a decision tree model, I don't have to deal with the large values or the missing values; Db2 handles both issues natively.

Next, I check the summary statistics of the nominal columns.

select * from FLIGHT.FLIGHTS_TRAIN_SUM1000_CHAR

For the nominal columns, SUMMARY1000 gathers the following statistics:

  • Number of missing values
  • Number of non-missing values
  • Number of distinct values
  • Frequency of the most frequent value

3. Preprocess the data

From the above data exploration, I can see that the missing values are limited to the DEPDEL15 and DEPDELAY columns and that four TIME columns hold large values: AIRTIME, CRSARRTIME, DEPTIME, WHEELSOFF. I'll leave the nominal values in all columns as-is, as the decision tree implementation in Db2 can handle them natively.

Extract the hour part from the TIME columns CRSARRTIME, DEPTIME, and WHEELSOFF.

From looking up the description of the dataset, I see that the values in the CRSARRTIME, DEPTIME, and WHEELSOFF columns encode the time of day as hhmm values. I extract the hour part of these values to create, hopefully, better features for the learning algorithm.

Scale the CRSARRTIME column: dividing the value by 100 gives the hour of the scheduled arrival time:

UPDATE FLIGHT.FLIGHTS_TRAIN SET CRSARRTIME = CRSARRTIME / 100

Scale the DEPTIME column: dividing the value by 100 gives the hour of the departure time:

UPDATE FLIGHT.FLIGHTS_TRAIN SET DEPTIME = DEPTIME / 100

Scale the WHEELSOFF column: dividing the value by 100 gives the hour of the wheels-off time:

UPDATE FLIGHT.FLIGHTS_TRAIN SET WHEELSOFF = WHEELSOFF / 100
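Because these columns are INTEGER, Db2's division by 100 truncates the minutes and keeps only the hour. The same idea in a short Python sketch (my own illustration, not part of the Db2 workflow):

```python
def hhmm_to_hour(hhmm: int) -> int:
    """Extract the hour from a time of day encoded as an hhmm integer."""
    return hhmm // 100  # integer division drops the minutes

# A few encoded times and their extracted hours
samples = [1435, 905, 2359, 30]
hours = [hhmm_to_hour(t) for t in samples]
print(hours)  # [14, 9, 23, 0]
```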

4. Train a decision tree model

Now the training dataset is ready for the decision tree algorithm.

I train a decision tree model using the GROW_DECTREE SP.

CALL IDAX.GROW_DECTREE('model=FLIGHT.flight_dectree, intable=FLIGHT.FLIGHTS_TRAIN, id=ID, target=FLIGHTSTATUS')

I called this SP with the following parameters:

  • model: the name I want to give to the decision tree model, FLIGHT_DECTREE
  • intable: the name of the table where the training dataset is stored
  • id: the name of the ID column
  • target: the name of the target column

After the model training completed, the GROW_DECTREE SP generated several tables with metadata from the model and the training dataset. Here are some of the key tables:

  • FLIGHT_DECTREE_MODEL: this table contains metadata about the model, such as the depth of the tree, the strategy for handling missing values, and the number of leaf nodes in the tree.
  • FLIGHT_DECTREE_NODES: this table provides information about each node in the decision tree.
  • FLIGHT_DECTREE_COLUMNS: this table provides information on each input column and its role in the trained model, including the importance of each column in generating a prediction from the model.

This link has the complete list of model tables and their details.

5. Generate predictions from the model

With the FLIGHT_DECTREE model trained and deployed in the database, I can use it to generate predictions on the test data from the FLIGHTS_TEST table.

First, I preprocess the test dataset using the same preprocessing logic that I applied to the training dataset.

Scale the CRSARRTIME column: dividing the value by 100 gives the hour of the scheduled arrival time:

UPDATE FLIGHT.FLIGHTS_TEST SET CRSARRTIME = CRSARRTIME / 100

Scale the DEPTIME column: dividing the value by 100 gives the hour of the departure time:

UPDATE FLIGHT.FLIGHTS_TEST SET DEPTIME = DEPTIME / 100

Scale the WHEELSOFF column: dividing the value by 100 gives the hour of the wheels-off time:

UPDATE FLIGHT.FLIGHTS_TEST SET WHEELSOFF = WHEELSOFF / 100

Generating predictions

I use the PREDICT_DECTREE SP to generate predictions from the FLIGHT_DECTREE model:

CALL IDAX.PREDICT_DECTREE('model=FLIGHT.flight_dectree, intable=FLIGHT.FLIGHTS_TEST, outtable=FLIGHT.FLIGHTS_TEST_PRED, prob=true, outtableprob=FLIGHT.FLIGHTS_TEST_PRED_DIST')

Here is the list of parameters I passed when calling this SP:

  • model: the name of the decision tree model, FLIGHT_DECTREE
  • intable: the name of the input table to generate predictions from
  • outtable: the name of the table where the SP will create and store the predictions
  • prob: a boolean flag indicating whether to include the probability of each prediction in the output
  • outtableprob: the name of the output table where the probability of each prediction will be stored
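Before computing formal metrics, it can help to spot-check a few predictions against the actual labels. The query below is my own addition; it assumes the output table has ID and PREDICTION columns, but the prediction column's name can differ between Db2 versions, so inspect the schema of FLIGHT.FLIGHTS_TEST_PRED first:

```sql
SELECT P.ID, T.FLIGHTSTATUS AS ACTUAL, P.PREDICTION
FROM FLIGHT.FLIGHTS_TEST_PRED P
JOIN FLIGHT.FLIGHTS_TEST T ON T.ID = P.ID
FETCH FIRST 10 ROWS ONLY
```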

6. Evaluate the model

Using the generated predictions for the test dataset, I compute a few metrics to evaluate the quality of the model's predictions.

Creating a confusion matrix

I use the CONFUSION_MATRIX SP to create a confusion matrix based on the model's predictions for the TEST dataset.

CALL IDAX.CONFUSION_MATRIX('intable=FLIGHT.FLIGHTS_TEST, resulttable=FLIGHT.FLIGHTS_TEST_PRED, id=ID, target=FLIGHTSTATUS, matrixTable=FLIGHT.FLIGHTS_TEST_CMATRIX')

Here are some of the key parameters I passed when calling this SP:

  • intable: the name of the table that contains the dataset and the actual values of the target column
  • resulttable: the name of the table that contains the column with the predicted values from the model
  • target: the name of the target column
  • matrixTable: the output table where the SP will store the confusion matrix

After the SP completes its run, we have the following output table with the statistics for the confusion matrix.

FLIGHTS_TEST_CMATRIX:

This table has three columns. The REAL column has the actual flight status, and the PREDICTION column has the predicted flight status. Since the flight status takes two values, 0 (on time) or 1 (delayed), there are four possible combinations of values in the REAL and PREDICTION columns:

  1. TRUE NEGATIVE (REAL: 0, PREDICTION: 0): the model correctly predicted the status of flights that arrived on schedule. From the CNT column, we see that 11795 rows from the TEST table fall into this combination.
  2. FALSE POSITIVE (REAL: 0, PREDICTION: 1): flights that actually arrived on time but that the model predicted to be delayed. 671 is the count of such flights.
  3. FALSE NEGATIVE (REAL: 1, PREDICTION: 0): flights that arrived late but that the model predicted to be on time. From the CNT column, we find their count to be 2528.
  4. TRUE POSITIVE (REAL: 1, PREDICTION: 1): the model correctly identified flights that were late. The count is 4981.

I use these counts to compute a few evaluation metrics for the model. To do so, I use the CMATRIX_STATS SP as follows:

CALL IDAX.CMATRIX_STATS('matrixTable=FLIGHT.FLIGHTS_TEST_CMATRIX')

The only parameter this SP needs is the name of the table that contains the statistics generated by the CONFUSION_MATRIX SP in the previous step. The CMATRIX_STATS SP generates two sets of output: the first shows overall quality metrics of the model, and the second includes the model's predictive performance for each class.

First output: the overall model metrics include correct predictions, incorrect predictions, overall accuracy, and weighted accuracy. From this output, I see that the model has an overall accuracy of 83.98% and a weighted accuracy of 80.46%.

With classification tasks, it's usually helpful to view the model's quality measures for each individual class. The second output from the CMATRIX_STATS SP includes these class-level quality metrics.

For each class, this output includes the True Positive Rate (TPR), False Positive Rate (FPR), Positive Predictive Value (PPV, or precision), and F-measure (F1 score).
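These class-level metrics, and the overall ones from the first output, follow directly from the four confusion-matrix counts above. A short Python sketch (my own cross-check, not part of the Db2 workflow) reproduces them:

```python
# Confusion-matrix counts reported by CONFUSION_MATRIX above
tn, fp, fn, tp = 11795, 671, 2528, 4981

total = tn + fp + fn + tp
accuracy = (tp + tn) / total                  # overall accuracy
tpr = tp / (tp + fn)                          # recall of the "delayed" class
tnr = tn / (tn + fp)                          # recall of the "on time" class
weighted_accuracy = (tpr + tnr) / 2           # mean of the per-class recalls
precision = tp / (tp + fp)                    # PPV of the "delayed" class
f1 = 2 * precision * tpr / (precision + tpr)  # F-measure

print(f"accuracy={accuracy:.4f}, weighted={weighted_accuracy:.4f}, f1={f1:.4f}")
# accuracy=0.8398, weighted=0.8048, f1=0.7569
```

The overall accuracy matches the reported 83.98%. The weighted accuracy comes out at 80.48% versus the reported 80.46%; the small gap suggests Db2 uses a slightly different weighting or rounding, so treat this sketch as an approximation of what CMATRIX_STATS computes, not a re-implementation.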

Conclusions and key takeaways

If you want to build and deploy an ML model in a Db2 database using Db2's built-in stored procedures, I hope you'll find this tutorial useful. Here are the main takeaways:

  • I demonstrated a complete workflow for creating and using a decision tree model in a Db2 database with the in-database ML stored procedures.
  • For each step in the workflow, I provided concrete, working SQL statements and stored procedure calls, and, where applicable, explained intuitively what each code example does along with its inputs and outputs.
  • I included references to IBM Db2's documentation for the ML stored procedures used in this tutorial.

O'Reilly's 2022 AI Adoption survey [3] identified challenges in building technical infrastructure and the skills gap as two of the top bottlenecks to AI adoption in the enterprise. Db2 addresses the first by supplying end-to-end ML infrastructure inside the database. It also narrows the second, the skills gap, by providing a simple SQL API for developing and using ML models in the database; in the enterprise, SQL is a far more common skill than ML.

Check out the following resources to learn more about the ML features in IBM Db2 and to see additional examples of ML use cases implemented with these features.

Explore the Db2 ML product documentation

Explore the Db2 ML samples on GitHub


References

  1. Paleyes, A., Urma, R.G. and Lawrence, N.D., 2022. Challenges in deploying machine learning: a survey of case studies. ACM Computing Surveys, 55(6), pp.1–29.
  2. Amershi, S., Begel, A., Bird, C., DeLine, R., Gall, H., Kamar, E., Nagappan, N., Nushi, B. and Zimmermann, T., 2019, May. Software engineering for machine learning: A case study. In 2019 IEEE/ACM 41st International Conference on Software Engineering: Software Engineering in Practice (ICSE-SEIP) (pp. 291–300). IEEE.
  3. Loukides, Mike, AI Adoption in the Enterprise 2022. https://www.oreilly.com/radar/ai-adoption-in-the-enterprise-2022/


