The Financial Information eXchange (FIX) Protocol is a standard that is used extensively within the capital markets industry to exchange information and requests between trading counterparties.
The protocol is based around FIX messages, which can be used for purposes such as placing new trades, amending existing trades, exchanging market data, and post-trade processing.
If a company wishes to electronically place a trade for a stock, their systems could send a FIX message to their broker which includes details such as the symbol of the stock they wish to trade, the quantity, the price, and the type of order. The broker would then execute the trade, perhaps involving another FIX message to a stock exchange, before sending a confirmation message back to the originator to confirm the order and the price it was filled at.
A FIX message to place a new 100-lot order for Google shares might look like this:
8=FIX.4.2|9=178|35=D|49=CLIENT1|56=BROKER1|34=1|52=20250526-15:30:00.000|11=ORD123456|21=1|55=GOOGL|54=1|38=100|40=2|44=2795.00|59=0|10=235|
Once the order is executed, the broker could respond with an execution report showing that the full 100 lot order has been filled at an average price of $2795.00.
8=FIX.4.2|9=196|35=8|49=BROKER1|56=CLIENT1|34=2|52=20250526-15:30:01.000|37=EXEC7890|11=ORD123456|17=EX12345|150=0|39=2|55=GOOGL|54=1|38=100|14=100|6=2795.00|31=2795.00|32=100|151=0|10=072|
As you can see, each FIX message is made up of a series of numeric tags, each of which has a specific meaning. For instance, tag 8 is the protocol version, 9 is the body length, 35 is the message type, 49 is the sender, 56 is the receiver, 34 is the sequence number, and 52 is the sending timestamp. By agreeing on the meaning of these tags, the FIX protocol allows information to be exchanged between different counterparties in a standardised way.
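Because each message is just a delimited list of tag=value pairs, individual fields can be pulled out with ordinary string functions. As a quick illustration in ClickHouse SQL (using a shortened version of the order message above), we could extract the symbol, quantity and price like this:
select
    '8=FIX.4.2|35=D|49=CLIENT1|56=BROKER1|55=GOOGL|54=1|38=100|40=2|44=2795.00' as msg,
    arrayMap(x -> splitByChar('=', x), splitByChar('|', msg)) as pairs,
    arrayMap(p -> p[1], pairs) as tags,
    arrayMap(p -> p[2], pairs) as vals,
    vals[indexOf(tags, '55')] as Symbol,   -- tag 55: ticker symbol
    vals[indexOf(tags, '38')] as Quantity, -- tag 38: order quantity
    vals[indexOf(tags, '44')] as Price     -- tag 44: limit price
This is essentially the technique we will use at scale later in this post.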
How FIX Data Is Used
At a minimum, businesses have to safely retain and archive their FIX messages for audit, regulatory and compliance purposes. If there is an issue with a trade, it is important to be able to see the exact FIX messages that were sent and received in order to understand what happened.
Beyond this though, the data can also be used for valuable analytics. For instance, it may be useful to understand the volume of trades, amends and cancels. There may also be compliance use cases such as identifying market abuse or wash trades, which may be apparent in the FIX logs. Finally, the data may be useful for risk management or, in some circumstances, to identify trading opportunities.
For both of these needs, it is important that the data is easy to query, search, analyse and join. If FIX data is simply written to text files or into an unsuitable database, it is very hard to extract value from, and support staff end up spending a lot of time manually searching through FIX logs.
Where ClickHouse Fits In
As ClickHouse grows in prominence within capital markets, businesses such as banks, hedge funds, exchanges, brokers and trading technology vendors are increasingly interested in using it to store and analyse their FIX-related data. Fortunately, ClickHouse is a natural fit for this use case.
FIX data inherently comes in high volumes. A single bank could be exchanging millions of messages per day as it processes huge numbers of orders with its various counterparties. Even vanilla trading use cases can generate significant data volumes, whilst in the realm of algorithmic and high frequency trading the number of messages can run into the billions per day. Over a year, this can reach petabyte scale from FIX messages alone.
FIX data also tends to be time series and event based, meaning it can be appended to a single table without the need for significant denormalisation. This is exactly the type of data ClickHouse has been optimised for, as it can be stored in one big partitioned table and queried by time.
Electronic trading is also an inherently real-time use case. Users such as traders or risk managers, and external systems such as post-trade processing or compliance systems, might want to see FIX logs in near real time. ClickHouse's ability to rapidly ingest and query fresh data makes it an excellent fit for this.
About This Example
This example shows a simple way to store and analyse FIX data using ClickHouse, and demonstrates some techniques for working with it efficiently.
In this example, we have a trading technology vendor that captures trade data exchanged between various counterparty banks. These banks are trading a mixture of 10 different stocks. Each FIX message contains a sender, a receiver and a ticker signifying the counterparties and the stock they are trading.
In some instances, a counterparty can reject a trade. This can end up costing the requester money if the market price moves. Our task in this example is to understand the financial impact of these rejections by cross-referencing against market data, also stored in ClickHouse, to quantify the financial loss associated with each rejection.
Dataset
We used the SimpleFix Python library to generate a fictitious set of FIX logs representing buy/sell transactions between 10 banks, which we saved in a ClickHouse table called fix.log.
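The DDL for fix.log is not shown here, but a minimal sketch, assuming just the three columns used in this example, might look like this:
create table fix.log
(
    UID UInt64,
    Timestamp DateTime,
    Message String
)
engine = MergeTree
-- at production volumes, partitioning by time (e.g. partition by toYYYYMM(Timestamp)) would be typical
order by (Timestamp, UID)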
The transactions refer to 10 different stocks. We generated the stock prices from a random walk and saved them in a ClickHouse table called fix.prices. For simplicity, we considered only two types of messages:
New Order - Single (MsgType = 'D'): A trade request is sent from one bank to another to buy or sell a stock at market price;
Confirmation (MsgType = 'AK'): The bank that received the request either confirms or rejects the trade.
We generated both the stock prices and the FIX messages at one-second intervals. We assumed that the trade request, the trade execution (if confirmed) and the trade confirmation all occur within a few milliseconds of each other, and that the lag between the request and the confirmation has no impact on the market price at which the trade is executed. A preview of the data is shown below:
SELECT
UID,
Timestamp,
LEFT(Message, 50)
FROM fix.log
LIMIT 10
Query id: e13cec5b-1483-47c0-a3ba-af7063e1ad69
┌─UID─┬───────────Timestamp─┬─left(Message, 50)──────────────────────────────────┐
1. │ 1 │ 2024-03-08 09:38:45 │ 8=FIX.4.4|35=D|49=BNP Paribas|56=Goldman Sachs|34= │
2. │ 2 │ 2024-03-08 09:38:45 │ 8=FIX.4.4|35=AK|49=Goldman Sachs|56=BNP Paribas|34 │
3. │ 3 │ 2024-03-08 09:38:46 │ 8=FIX.4.4|35=D|49=BNP Paribas|56=Bank of America|3 │
4. │ 4 │ 2024-03-08 09:38:46 │ 8=FIX.4.4|35=AK|49=Bank of America|56=BNP Paribas| │
5. │ 5 │ 2024-03-08 09:38:47 │ 8=FIX.4.4|35=D|49=Morgan Stanley|56=Bank of Americ │
6. │ 6 │ 2024-03-08 09:38:47 │ 8=FIX.4.4|35=AK|49=Bank of America|56=Morgan Stanl │
7. │ 7 │ 2024-03-08 09:38:48 │ 8=FIX.4.4|35=D|49=Morgan Stanley|56=Nomura|34=2|52 │
8. │ 8 │ 2024-03-08 09:38:48 │ 8=FIX.4.4|35=AK|49=Nomura|56=Morgan Stanley|34=1|5 │
9. │ 9 │ 2024-03-08 09:38:49 │ 8=FIX.4.4|35=D|49=J.P. Morgan|56=BNP Paribas|34=1| │
10. │ 10 │ 2024-03-08 09:38:49 │ 8=FIX.4.4|35=AK|49=BNP Paribas|56=J.P. Morgan|34=3 │
└─────┴─────────────────────┴────────────────────────────────────────────────────┘
10 rows in set. Elapsed: 0.003 sec. Processed 2.12 thousand rows, 484.37 KB (706.32 thousand rows/s., 161.15 MB/s.)
Peak memory usage: 907.05 KiB.
The price data has the following format:
SELECT *
FROM fix.prices
LIMIT 10
Query id: d897451f-a51c-44a5-881d-f785bf66e8b1
┌───────────Timestamp─┬─Ticker─┬─Price─┐
1. │ 2024-03-08 09:38:45 │ GTO │ 28.89 │
2. │ 2024-03-08 09:38:45 │ KHU │ 8.82 │
3. │ 2024-03-08 09:38:45 │ GZS │ 98.57 │
4. │ 2024-03-08 09:38:45 │ WKK │ 77.46 │
5. │ 2024-03-08 09:38:45 │ XUD │ 19.92 │
6. │ 2024-03-08 09:38:45 │ HXC │ 56.27 │
7. │ 2024-03-08 09:38:45 │ VUB │ 38.77 │
8. │ 2024-03-08 09:38:45 │ XLF │ 36.6 │
9. │ 2024-03-08 09:38:45 │ BUA │ 33.55 │
10. │ 2024-03-08 09:38:45 │ LZV │ 9.18 │
└─────────────────────┴────────┴───────┘
10 rows in set. Elapsed: 0.003 sec. Processed 8.20 thousand rows, 196.80 KB (2.92 million rows/s., 70.01 MB/s.)
Peak memory usage: 453.33 KiB.
Scenario
We have observed that a large fraction of trades were rejected, which will likely have an impact on the banks' PnL. To quantify the impact of the rejected trades, we calculate two PnLs for each bank:
The expected PnL, which is calculated taking into account all trades;
The realized PnL, which is calculated taking into account only the confirmed trades.
For simplicity, we calculate each PnL using, as the buy price for each sell trade, the price at which the last buy trade in the same stock was executed. For example, if a bank's last buy in a stock was 100 shares at $50.00 and it then sells 100 shares at $52.00, the PnL recorded for that sell is 100 × $52.00 - 100 × $50.00 = $200.
Data Preparation
We start by creating a ClickHouse view called fix.messages in which we extract the main fields from the raw FIX messages in the fix.log table using ClickHouse string functions and powerful array processing functions. We also apply simple transformations, such as converting numeric codes to descriptive strings, to make downstream processing easier.
create or replace view
fix.messages
as select
UID,
Timestamp,
Message,
Protocol,
MessageCode,
MessageType,
TradeID,
TradeStatus,
Sender,
Receiver,
Ticker,
BuySell,
Price,
Quantity
from (
select
UID,
Timestamp,
Message,
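-- split the raw message into parallel arrays of tags and values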
arrayMap(x -> splitByChar('=', x), splitByChar('|', Message)) as MessageArray,
arrayFlatten(arrayMap(x -> arraySlice(x, 1, 1), MessageArray)) as Tag,
arrayFlatten(arrayMap(x -> arraySlice(x, 2, 1), MessageArray)) as Value,
arraySlice(Value, indexOf(Tag, '35'), 1)[1] as MessageCode,
if(MessageCode == 'AK', arraySlice(Value, indexOf(Tag, '665'), 1)[1], NULL) as ConfirmStatus,
arraySlice(Value, indexOf(Tag, '8'), 1)[1] as Protocol,
if(MessageCode == 'D', 'New Order - Single', if(MessageCode == 'AK', 'Confirmation', 'Other')) as MessageType,
arraySlice(Value, indexOf(Tag, '11'), 1)[1] as TradeID,
if(ConfirmStatus == '4', 'Confirmed', if(ConfirmStatus == '5', 'Rejected', NULL)) as TradeStatus,
arraySlice(Value, indexOf(Tag, '49'), 1)[1] as Sender,
arraySlice(Value, indexOf(Tag, '56'), 1)[1] as Receiver,
arraySlice(Value, indexOf(Tag, '55'), 1)[1] as Ticker,
if(arraySlice(Value, indexOf(Tag, '54'), 1)[1] == '1', 'Buy', 'Sell') as BuySell,
toFloat64(if(MessageCode == 'AK', arraySlice(Value, indexOf(Tag, '6'), 1)[1], NULL)) as Price,
toFloat64(if(MessageCode == 'D', arraySlice(Value, indexOf(Tag, '38'), 1)[1], arraySlice(Value, indexOf(Tag, '80'), 1)[1])) as Quantity
from
fix.log
)
order by
UID, Timestamp
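With the fields extracted into columns, ad hoc analytics become straightforward. For example (an illustrative query, output not shown), we can count messages by type and trade status:
select
    MessageType,
    TradeStatus,
    count() as Messages
from
    fix.messages
group by
    MessageType,
    TradeStatus
order by
    Messages desc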
Each trade consists of a New Order - Single message with MessageCode = 'D' and a confirmation or rejection with MessageCode = 'AK'. We create a view that joins the two messages together for convenience.
create or replace view
fix.report
as select
Timestamp,
TradeID,
TradeStatus,
Sender,
Receiver,
Ticker,
BuySell,
Price,
RequestedQuantity,
AllocatedQuantity
from(
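-- trade requests (MessageCode 'D')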
select
TradeID,
Sender,
Receiver,
Ticker,
BuySell,
Quantity as RequestedQuantity
from
fix.messages
where
MessageCode == 'D'
) a
inner join (
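-- matching confirmations or rejections (MessageCode 'AK')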
select
Timestamp,
TradeID,
TradeStatus,
Price,
Quantity as AllocatedQuantity
from
fix.messages
where
MessageCode == 'AK'
) b
on
a.TradeID == b.TradeID
SELECT *
FROM fix.report
LIMIT 10
Query id: e954df22-a21b-487a-b598-ae5d2ba41388
┌───────────Timestamp─┬─TradeID─┬─TradeStatus─┬─Sender──────────┬─Receiver────────┬─Ticker─┬─BuySell─┬─Price─┬─RequestedQuantity─┬─AllocatedQuantity─┐
1. │ 2024-03-08 09:38:45 │ 1 │ Rejected │ BNP Paribas │ Goldman Sachs │ VUB │ Buy │ 38.77 │ 108 │ 0 │
2. │ 2024-03-08 09:38:46 │ 2 │ Confirmed │ BNP Paribas │ Bank of America │ GZS │ Buy │ 98.81 │ 100 │ 100 │
3. │ 2024-03-08 09:38:47 │ 3 │ Confirmed │ Morgan Stanley │ Bank of America │ LZV │ Buy │ 7.48 │ 140 │ 140 │
4. │ 2024-03-08 09:38:48 │ 4 │ Confirmed │ Morgan Stanley │ Nomura │ XLF │ Buy │ 35.33 │ 88 │ 88 │
5. │ 2024-03-08 09:38:49 │ 5 │ Confirmed │ J.P. Morgan │ BNP Paribas │ WKK │ Buy │ 72.76 │ 105 │ 105 │
6. │ 2024-03-08 09:38:50 │ 6 │ Confirmed │ Nomura │ J.P. Morgan │ XLF │ Buy │ 36.08 │ 65 │ 65 │
7. │ 2024-03-08 09:38:51 │ 7 │ Confirmed │ Bank of America │ Nomura │ GZS │ Buy │ 96.96 │ 118 │ 118 │
8. │ 2024-03-08 09:38:52 │ 8 │ Confirmed │ Citi │ J.P. Morgan │ BUA │ Buy │ 31.84 │ 89 │ 89 │
9. │ 2024-03-08 09:38:53 │ 9 │ Confirmed │ Nomura │ BNP Paribas │ XLF │ Sell │ 37.11 │ 65 │ 65 │
10. │ 2024-03-08 09:38:54 │ 10 │ Confirmed │ BNP Paribas │ J.P. Morgan │ XLF │ Buy │ 37.38 │ 141 │ 141 │
└─────────────────────┴─────────┴─────────────┴─────────────────┴─────────────────┴────────┴─────────┴───────┴───────────────────┴───────────────────┘
10 rows in set. Elapsed: 0.055 sec. Processed 4.25 thousand rows, 969.35 KB (77.17 thousand rows/s., 17.61 MB/s.)
Peak memory usage: 4.77 MiB.
Data Analysis
As described in the scenario above, we quantify the impact of the rejected trades by calculating two PnLs for each bank: the expected PnL, which takes all trades into account, and the realized PnL, which takes only the confirmed trades into account.
We start by calculating the trade-level PnL, which we save in a ClickHouse view called fix.profit_and_loss.
create or replace view
fix.profit_and_loss
as select
Timestamp,
Sender,
Receiver,
Ticker,
BuyTradeStatus,
SellTradeStatus,
BuyQuantity,
SellQuantity,
BuyPrice,
SellPrice,
(SellQuantity * SellPrice - BuyQuantity * BuyPrice) as PnL
from (
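-- carry the most recent buy forward so it can be matched against each later sell for the same bank and stock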
select
Timestamp,
Sender,
Receiver,
Ticker,
last_value(BuyQuantity) over (partition by Sender, Ticker order by Timestamp rows between unbounded preceding and current row) as BuyQuantity,
SellQuantity,
last_value(BuyTradeStatus) over (partition by Sender, Ticker order by Timestamp rows between unbounded preceding and current row) as BuyTradeStatus,
SellTradeStatus,
last_value(BuyPrice) over (partition by Sender, Ticker order by Timestamp rows between unbounded preceding and current row) as BuyPrice,
SellPrice
from (
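-- pivot each trade into separate buy and sell columns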
select
Timestamp,
Sender,
Receiver,
Ticker,
if(BuySell == 'Buy', TradeStatus, NULL) as BuyTradeStatus,
if(BuySell == 'Sell', TradeStatus, NULL) as SellTradeStatus,
if(BuySell == 'Buy', RequestedQuantity, NULL) as BuyQuantity,
if(BuySell == 'Sell', RequestedQuantity, NULL) as SellQuantity,
if(BuySell == 'Buy', Price, NULL) as BuyPrice,
if(BuySell == 'Sell', Price, NULL) as SellPrice
from
fix.report
order by
Sender,
Ticker,
Timestamp
)
)
where isNotNull(SellPrice)
The view has the following format:
SELECT
Timestamp,
Sender,
Receiver,
Ticker,
PnL
FROM fix.profit_and_loss
LIMIT 10
Query id: 2e473b5f-7368-4261-8950-42f258426926
┌───────────Timestamp─┬─Sender──────┬─Receiver────────┬─Ticker─┬────────────────PnL─┐
1. │ 2024-03-08 09:39:37 │ BNP Paribas │ Morgan Stanley │ BUA │ 960.75 │
2. │ 2024-03-08 09:43:31 │ BNP Paribas │ Bank of America │ BUA │ 1087.2999999999993 │
3. │ 2024-03-08 09:47:16 │ BNP Paribas │ Deutsche Bank │ BUA │ 983.6800000000003 │
4. │ 2024-03-08 10:15:30 │ BNP Paribas │ Citi │ BUA │ 1854.5500000000002 │
5. │ 2024-03-08 09:42:33 │ BNP Paribas │ Goldman Sachs │ KHU │ 197.7800000000001 │
6. │ 2024-03-08 09:54:09 │ BNP Paribas │ Citi │ KHU │ 978.02 │
7. │ 2024-03-08 09:56:45 │ BNP Paribas │ Barclays │ KHU │ 49.92000000000007 │
8. │ 2024-03-08 09:58:27 │ BNP Paribas │ J.P. Morgan │ KHU │ 423.51 │
9. │ 2024-03-08 10:00:33 │ BNP Paribas │ Citi │ KHU │ 1004.3999999999992 │
10. │ 2024-03-08 10:06:19 │ BNP Paribas │ Morgan Stanley │ KHU │ 1023.4000000000005 │
└─────────────────────┴─────────────┴─────────────────┴────────┴────────────────────┘
10 rows in set. Elapsed: 0.073 sec. Processed 4.25 thousand rows, 969.35 KB (58.39 thousand rows/s., 13.32 MB/s.)
Peak memory usage: 6.13 MiB.
After that, we calculate the cumulative expected PnL and realized PnL of each bank, which we save in a ClickHouse view called fix.cumulative_profit_and_loss.
create or replace view
fix.cumulative_profit_and_loss
as select
Timestamp,
Sender as Bank,
ExpectedPnL,
if(isNotNull(RealizedPnL), RealizedPnL, 0) as RealizedPnL
from (
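-- forward-fill the realized PnL for timestamps where there is no confirmed trade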
select
Timestamp,
Sender,
ExpectedPnL,
if(isNotNull(RealizedPnL), RealizedPnL, last_value(RealizedPnL) over (partition by Sender order by Timestamp rows between unbounded preceding and current row)) as RealizedPnL
from
(
select
Timestamp,
Sender,
sum(PnL) over (partition by Sender order by Timestamp rows between unbounded preceding and current row) as ExpectedPnL
from
fix.profit_and_loss
) a
left join (
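-- the realized PnL accumulates only over trades where both legs were confirmed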
select
Timestamp,
Sender,
sum(PnL) over (partition by Sender order by Timestamp rows between unbounded preceding and current row) as RealizedPnL
from
fix.profit_and_loss
where
BuyTradeStatus == 'Confirmed'
and
SellTradeStatus == 'Confirmed'
) b
on
a.Timestamp == b.Timestamp
and
a.Sender == b.Sender
)
order by Sender, Timestamp
SELECT *
FROM fix.cumulative_profit_and_loss
LIMIT 10
Query id: d57444ae-96ce-4675-92d8-cab6145566e5
┌───────────Timestamp─┬─Bank────────┬────────ExpectedPnL─┬────────RealizedPnL─┐
1. │ 2024-03-08 09:39:04 │ BNP Paribas │ 98.69999999999982 │ 0 │
2. │ 2024-03-08 09:39:37 │ BNP Paribas │ 1059.4499999999998 │ 960.75 │
3. │ 2024-03-08 09:40:03 │ BNP Paribas │ 2323.8499999999995 │ 2225.1499999999996 │
4. │ 2024-03-08 09:40:29 │ BNP Paribas │ 2583.0099999999993 │ 2225.1499999999996 │
5. │ 2024-03-08 09:40:38 │ BNP Paribas │ 3350.6099999999997 │ 2225.1499999999996 │
6. │ 2024-03-08 09:40:41 │ BNP Paribas │ 3874.5599999999986 │ 2225.1499999999996 │
7. │ 2024-03-08 09:41:10 │ BNP Paribas │ 5967.059999999999 │ 2225.1499999999996 │
8. │ 2024-03-08 09:41:33 │ BNP Paribas │ 6267.059999999999 │ 2525.1499999999996 │
9. │ 2024-03-08 09:41:38 │ BNP Paribas │ 7656.959999999999 │ 2525.1499999999996 │
10. │ 2024-03-08 09:42:08 │ BNP Paribas │ 7862.49 │ 2525.1499999999996 │
└─────────────────────┴─────────────┴────────────────────┴────────────────────┘
10 rows in set. Elapsed: 0.145 sec. Processed 8.50 thousand rows, 1.94 MB (58.58 thousand rows/s., 13.37 MB/s.)
Peak memory usage: 14.85 MiB.
We can now visualize the expected PnL against the realized PnL over time.
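To put a number on the cost of rejections, one option is to take, for each bank, the gap between its latest expected and realized PnL. A query along these lines (illustrative, output not shown) gives a per-bank summary:
select
    Bank,
    argMax(ExpectedPnL, Timestamp) as FinalExpectedPnL,
    argMax(RealizedPnL, Timestamp) as FinalRealizedPnL,
    FinalExpectedPnL - FinalRealizedPnL as PnLLostToRejections
from
    fix.cumulative_profit_and_loss
group by
    Bank
order by
    PnLLostToRejections desc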