Analysing FIX Data With ClickHouse

Benjamin Wootton

The Financial Information eXchange (FIX) Protocol is a standard that is used extensively within the capital markets industry to exchange information and requests between trading counterparties.

The protocol is based around FIX messages, which are used for purposes such as placing new orders, amending existing orders, exchanging market data, and post-trade processing.

If a company wishes to electronically place a trade for a stock, their systems could send a FIX message to their broker which includes details such as the symbol of the stock they wish to trade, the quantity, the price, and the type of order. The broker would then execute the trade, perhaps involving another FIX message to a stock exchange, before sending a confirmation message back to the originator to confirm the order and the price it was filled at.

A FIX message to place a new 100 lot order for Google shares might look like this:

8=FIX.4.2|9=178|35=D|49=CLIENT1|56=BROKER1|34=1|52=20250526-15:30:00.000|11=ORD123456|21=1|55=GOOGL|54=1|38=100|40=2|44=2795.00|59=0|10=235|

Once the order is executed, the broker could respond with an execution report showing that the full 100 lot order has been filled at an average price of $2795.00.

8=FIX.4.2|9=196|35=8|49=BROKER1|56=CLIENT1|34=2|52=20250526-15:30:01.000|37=EXEC7890|11=ORD123456|17=EX12345|150=0|39=2|55=GOOGL|54=1|38=100|14=100|6=2795.00|31=2795.00|32=100|151=0|10=072|

As you can see, each FIX message is made up of a series of numeric tags, each of which has a specific meaning. For instance, tag 8 is the protocol version, 9 is the body length, 35 is the message type, 49 is the sender, 56 is the receiver, 34 is the sequence number, and 52 is the timestamp. By agreeing on the meaning of these tags, the FIX protocol allows for the exchange of information between different counterparties in a standardised way.
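
Because every field is a simple tag=value pair separated by a delimiter, the messages are easy to pick apart with ClickHouse string and array functions. As a quick illustration of the technique we will use later in this article (a minimal sketch, not taken from the original), here is how the symbol and order quantity could be extracted from the example order above:

SELECT
    Value[indexOf(Tag, '55')] AS Symbol,                   -- tag 55 = Symbol   -> 'GOOGL'
    toUInt32OrZero(Value[indexOf(Tag, '38')]) AS Quantity  -- tag 38 = OrderQty -> 100
FROM
(
    SELECT
        '8=FIX.4.2|9=178|35=D|49=CLIENT1|56=BROKER1|34=1|52=20250526-15:30:00.000|11=ORD123456|21=1|55=GOOGL|54=1|38=100|40=2|44=2795.00|59=0|10=235|' AS Message,
        -- split the message into tag=value pairs, then into parallel arrays of tags and values
        arrayMap(x -> splitByChar('=', x), splitByChar('|', Message)) AS MessageArray,
        arrayFlatten(arrayMap(x -> arraySlice(x, 1, 1), MessageArray)) AS Tag,
        arrayFlatten(arrayMap(x -> arraySlice(x, 2, 1), MessageArray)) AS Value
)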

How FIX Data Is Used

At a minimum, businesses have to safely retain and archive their FIX messages for audit, regulatory and compliance purposes. If there is an issue with a trade, it is important to be able to see the exact FIX messages that were sent and received in order to understand what happened.

Beyond this though, the data can also be used for valuable analytics purposes. For instance, it may be useful to understand the volume of trades, amends and cancels. There may also be compliance use cases such as identifying market abuse or wash trades, which may be apparent in the FIX logs. Finally, the data may be useful for risk management or to identify trading opportunities in some circumstances.

For both of these, it is important that the data is easy to query, search, analyse and join. If FIX data is simply written to a text file or into an inappropriate database, it becomes very hard to extract value from, and support teams end up spending a lot of time manually searching through FIX logs.

Where ClickHouse Fits In

As ClickHouse grows in prominence within capital markets, businesses such as banks, hedge funds, exchanges, brokers and trading technology vendors are increasingly interested in using it to store and analyse their FIX related data. Fortunately, ClickHouse is a natural fit for this use case.

FIX data inherently comes in high volumes. A single bank could be exchanging millions of messages per day as it processes huge numbers of orders with its various counterparties. Even vanilla trading use cases can generate significant data volumes, whilst in the realm of algorithmic and high frequency trading, the number of messages can run into the billions per day. Over the course of a year, this can reach petabyte scale from FIX messages alone.

FIX data also tends to be time series and event based, meaning it can be appended to a single table without the need for significant denormalisation. This is the type of data which ClickHouse has really been optimised for as it can be stored in one big partitioned table and queried by time.

Electronic trading is also an inherently real time use case. Users such as traders or risk managers, and external systems such as post trade processing or compliance systems, might want to see FIX logs in near real time. ClickHouse's ability to rapidly ingest and query fresh data makes it an excellent fit for this.

About This Example

We wanted to show a simple example of how to store and analyse FIX data using ClickHouse to demonstrate some techniques for working with it efficiently.

In this example, we have a trading technology vendor who are capturing trade data between various counterparty banks. These banks are trading a mixture of 10 different stocks. Each FIX message contains a sender, a receiver and a ticker signifying the counterparties and the stocks they are trading.

In some instances, a counterparty can reject a trade. This could end up costing the requester money if the market price moves. Our task in this example is to quantify the financial impact of these rejections by cross-referencing them against market data that is also stored in ClickHouse.

Dataset

We used the SimpleFix Python library for generating a fictitious set of FIX logs representing buy/sell transactions between 10 banks, which we saved in a ClickHouse table called fix.log.
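
The article does not show the DDL for this table, but a minimal schema consistent with the preview further below might look like the following. The types, engine, partitioning and ordering key are assumptions, chosen to illustrate the "one big partitioned table queried by time" pattern described above:

-- Hypothetical definition of the raw FIX log table (the original DDL is not shown)
CREATE TABLE fix.log
(
    UID       UInt64,    -- unique message id
    Timestamp DateTime,  -- time the message was captured
    Message   String     -- the raw pipe-delimited FIX message
)
ENGINE = MergeTree
PARTITION BY toYYYYMM(Timestamp)
ORDER BY (Timestamp, UID)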

The transactions refer to 10 different stocks. We generated the stock prices from a random walk and saved them in a ClickHouse table called fix.prices. For simplicity, we considered only two types of messages:

  • New Order - Single (MsgType = 'D'): A trade request is sent from one bank to another to buy or sell a stock at market price;

  • Confirmation (MsgType = 'AK'): The bank that received the request either confirms or rejects the trade.

We generated both the stock prices and the FIX messages at one-second frequency. We assumed that the difference between the time when the trade request is sent, the time when the trade is executed (if confirmed), and the time when the trade confirmation is received is only a few milliseconds, and that the time lag between the trade request and the trade confirmation has no impact on the market price at which the trade is executed. A preview of the data is shown below:

SELECT
    UID,
    Timestamp,
    LEFT(Message, 50)
FROM fix.log
LIMIT 10

Query id: e13cec5b-1483-47c0-a3ba-af7063e1ad69

    ┌─UID─┬───────────Timestamp─┬─left(Message, 50)──────────────────────────────────┐
 1. │   1 │ 2024-03-08 09:38:45 │ 8=FIX.4.4|35=D|49=BNP Paribas|56=Goldman Sachs|34= │
 2. │   2 │ 2024-03-08 09:38:45 │ 8=FIX.4.4|35=AK|49=Goldman Sachs|56=BNP Paribas|34 │
 3. │   3 │ 2024-03-08 09:38:46 │ 8=FIX.4.4|35=D|49=BNP Paribas|56=Bank of America|3 │
 4. │   4 │ 2024-03-08 09:38:46 │ 8=FIX.4.4|35=AK|49=Bank of America|56=BNP Paribas| │
 5. │   5 │ 2024-03-08 09:38:47 │ 8=FIX.4.4|35=D|49=Morgan Stanley|56=Bank of Americ │
 6. │   6 │ 2024-03-08 09:38:47 │ 8=FIX.4.4|35=AK|49=Bank of America|56=Morgan Stanl │
 7. │   7 │ 2024-03-08 09:38:48 │ 8=FIX.4.4|35=D|49=Morgan Stanley|56=Nomura|34=2|52 │
 8. │   8 │ 2024-03-08 09:38:48 │ 8=FIX.4.4|35=AK|49=Nomura|56=Morgan Stanley|34=1|5 │
 9. │   9 │ 2024-03-08 09:38:49 │ 8=FIX.4.4|35=D|49=J.P. Morgan|56=BNP Paribas|34=1| │
10. │  10 │ 2024-03-08 09:38:49 │ 8=FIX.4.4|35=AK|49=BNP Paribas|56=J.P. Morgan|34=3 │
    └─────┴─────────────────────┴────────────────────────────────────────────────────┘

10 rows in set. Elapsed: 0.003 sec. Processed 2.12 thousand rows, 484.37 KB (706.32 thousand rows/s., 161.15 MB/s.)
Peak memory usage: 907.05 KiB.

The price data has the following format:

SELECT *
FROM fix.prices
LIMIT 10

Query id: d897451f-a51c-44a5-881d-f785bf66e8b1

    ┌───────────Timestamp─┬─Ticker─┬─Price─┐
 1. │ 2024-03-08 09:38:45 │ GTO    │ 28.89 │
 2. │ 2024-03-08 09:38:45 │ KHU    │  8.82 │
 3. │ 2024-03-08 09:38:45 │ GZS    │ 98.57 │
 4. │ 2024-03-08 09:38:45 │ WKK    │ 77.46 │
 5. │ 2024-03-08 09:38:45 │ XUD    │ 19.92 │
 6. │ 2024-03-08 09:38:45 │ HXC    │ 56.27 │
 7. │ 2024-03-08 09:38:45 │ VUB    │ 38.77 │
 8. │ 2024-03-08 09:38:45 │ XLF    │  36.6 │
 9. │ 2024-03-08 09:38:45 │ BUA    │ 33.55 │
10. │ 2024-03-08 09:38:45 │ LZV    │  9.18 │
    └─────────────────────┴────────┴───────┘

10 rows in set. Elapsed: 0.003 sec. Processed 8.20 thousand rows, 196.80 KB (2.92 million rows/s., 70.01 MB/s.)
Peak memory usage: 453.33 KiB.

Scenario

We have observed that a large fraction of trades were rejected, which will likely have an impact on the banks' PnL. To quantify the impact of the rejected trades, we calculate two PnLs for each bank:

  • The expected PnL, which is calculated taking into account all trades;

  • The realized PnL, which is calculated taking into account only the confirmed trades.

For simplicity, we calculate each PnL using as the buy price the price at which the last buy trade was executed before each sell trade. For example, if a bank's last buy for a given stock was 100 shares at $50.00 and it then sells 100 shares at $52.00, the PnL recorded against that sell trade is 100 × 52.00 - 100 × 50.00 = $200.

Data Preparation

We start by creating a ClickHouse view called fix.messages in which we extract the main fields from the raw FIX messages in the fix.log table using ClickHouse string functions and its powerful array processing functions. We also do simple transformations, such as converting numeric codes to descriptive strings, to make downstream processing easier.

create or replace view 
    fix.messages 
as select 
    UID,
    Timestamp,
    Message,
    Protocol,
    MessageCode,
    MessageType, 
    TradeID,
    TradeStatus,
    Sender,
    Receiver,
    Ticker,
    BuySell,
    Price,
    Quantity
from (
    select 
        UID,
        Timestamp,
        Message,
        -- split the raw message into tag=value pairs, then into parallel arrays of tags and values
        arrayMap(x -> splitByChar('=', x), splitByChar('|', Message)) as MessageArray,
        arrayFlatten(arrayMap(x -> arraySlice(x, 1, 1), MessageArray)) as Tag,
        arrayFlatten(arrayMap(x -> arraySlice(x, 2, 1), MessageArray)) as Value,
        -- look up individual FIX tags by position: 35 = MsgType, 665 = ConfirmStatus, 8 = BeginString
        arraySlice(Value, indexOf(Tag, '35'), 1)[1] as MessageCode,
        if(MessageCode == 'AK', arraySlice(Value, indexOf(Tag, '665'), 1)[1], NULL) as ConfirmStatus,
        arraySlice(Value, indexOf(Tag, '8'), 1)[1] as Protocol,
        -- convert numeric codes into descriptive strings for easier downstream processing
        if(MessageCode == 'D', 'New Order - Single', if(MessageCode == 'AK', 'Confirmation', 'Other')) as MessageType, 
        arraySlice(Value, indexOf(Tag, '11'), 1)[1] as TradeID,
        if(ConfirmStatus == '4', 'Confirmed', if(ConfirmStatus == '5', 'Rejected', NULL)) as TradeStatus,
        arraySlice(Value, indexOf(Tag, '49'), 1)[1] as Sender,
        arraySlice(Value, indexOf(Tag, '56'), 1)[1] as Receiver,
        arraySlice(Value, indexOf(Tag, '55'), 1)[1] as Ticker,
        if(arraySlice(Value, indexOf(Tag, '54'), 1)[1] == '1', 'Buy', 'Sell') as BuySell,
        -- the price comes from the confirmation (tag 6), the quantity from the order (tag 38) or allocation (tag 80)
        toFloat64(if(MessageCode == 'AK', arraySlice(Value, indexOf(Tag, '6'), 1)[1], NULL)) as Price,
        toFloat64(if(MessageCode == 'D', arraySlice(Value, indexOf(Tag, '38'), 1)[1], arraySlice(Value, indexOf(Tag, '80'), 1)[1])) as Quantity
    from 
        fix.log 
)
order by
    UID, Timestamp
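
With the parsing handled in a view, ad-hoc questions about the message flow become simple aggregations. As a quick illustration (these follow-up queries are not part of the original article), we could count messages by type or check how often confirmations were rejected:

SELECT MessageType, count() AS Messages
FROM fix.messages
GROUP BY MessageType

SELECT TradeStatus, count() AS Confirmations
FROM fix.messages
WHERE MessageCode = 'AK'
GROUP BY TradeStatus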

Each trade consists of a New Order - Single message with MessageCode = 'D' and a confirmation or rejection with MessageCode = 'AK'. We create a view that joins the two messages for convenience.

create or replace view
    fix.report
as select 
    Timestamp,
    TradeID,
    TradeStatus,
    Sender,
    Receiver,
    Ticker,
    BuySell,
    Price,
    RequestedQuantity,
    AllocatedQuantity
from(
    select 
        TradeID,
        Sender,
        Receiver,
        Ticker,
        BuySell,
        Quantity as RequestedQuantity
    from 
        fix.messages 
    where 
        MessageCode == 'D'
) a
inner join (
        select 
        Timestamp,
        TradeID,
        TradeStatus,
        Price,
        Quantity as AllocatedQuantity
    from 
        fix.messages 
    where 
        MessageCode == 'AK'
) b  
on 
    a.TradeID == b.TradeID

The joined view has the following format:

SELECT *
FROM fix.report
LIMIT 10

Query id: e954df22-a21b-487a-b598-ae5d2ba41388

    ┌───────────Timestamp─┬─TradeID─┬─TradeStatus─┬─Sender──────────┬─Receiver────────┬─Ticker─┬─BuySell─┬─Price─┬─RequestedQuantity─┬─AllocatedQuantity─┐
 1. │ 2024-03-08 09:38:45 │ 1       │ Rejected    │ BNP Paribas     │ Goldman Sachs   │ VUB    │ Buy     │ 38.77 │               108 │                 0 │
 2. │ 2024-03-08 09:38:46 │ 2       │ Confirmed   │ BNP Paribas     │ Bank of America │ GZS    │ Buy     │ 98.81 │               100 │               100 │
 3. │ 2024-03-08 09:38:47 │ 3       │ Confirmed   │ Morgan Stanley  │ Bank of America │ LZV    │ Buy     │  7.48 │               140 │               140 │
 4. │ 2024-03-08 09:38:48 │ 4       │ Confirmed   │ Morgan Stanley  │ Nomura          │ XLF    │ Buy     │ 35.33 │                88 │                88 │
 5. │ 2024-03-08 09:38:49 │ 5       │ Confirmed   │ J.P. Morgan     │ BNP Paribas     │ WKK    │ Buy     │ 72.76 │               105 │               105 │
 6. │ 2024-03-08 09:38:50 │ 6       │ Confirmed   │ Nomura          │ J.P. Morgan     │ XLF    │ Buy     │ 36.08 │                65 │                65 │
 7. │ 2024-03-08 09:38:51 │ 7       │ Confirmed   │ Bank of America │ Nomura          │ GZS    │ Buy     │ 96.96 │               118 │               118 │
 8. │ 2024-03-08 09:38:52 │ 8       │ Confirmed   │ Citi            │ J.P. Morgan     │ BUA    │ Buy     │ 31.84 │                89 │                89 │
 9. │ 2024-03-08 09:38:53 │ 9       │ Confirmed   │ Nomura          │ BNP Paribas     │ XLF    │ Sell    │ 37.11 │                65 │                65 │
10. │ 2024-03-08 09:38:54 │ 10      │ Confirmed   │ BNP Paribas     │ J.P. Morgan     │ XLF    │ Buy     │ 37.38 │               141 │               141 │
    └─────────────────────┴─────────┴─────────────┴─────────────────┴─────────────────┴────────┴─────────┴───────┴───────────────────┴───────────────────┘

10 rows in set. Elapsed: 0.055 sec. Processed 4.25 thousand rows, 969.35 KB (77.17 thousand rows/s., 17.61 MB/s.)
Peak memory usage: 4.77 MiB.

Data Analysis

As described in the scenario above, we quantify the impact of the rejected trades by calculating two PnLs for each bank: the expected PnL, which takes all trades into account, and the realized PnL, which takes only the confirmed trades into account.

We start by calculating the trade-level PnL, which we save in a ClickHouse view called fix.profit_and_loss.

create or replace view 
    fix.profit_and_loss
as select
    Timestamp,
    Sender, 
    Receiver,
    Ticker,
    BuyTradeStatus,
    SellTradeStatus,
    BuyQuantity,
    SellQuantity,
    BuyPrice,
    SellPrice,
    (SellQuantity * SellPrice - BuyQuantity * BuyPrice) as PnL 
from (
    select 
        Timestamp,
        Sender, 
        Receiver,
        Ticker,
        -- carry forward the most recent buy quantity, status and price for each sender and ticker
        last_value(BuyQuantity) over (partition by Sender, Ticker order by Timestamp rows between unbounded preceding and current row) as BuyQuantity,
        SellQuantity,
        last_value(BuyTradeStatus) over (partition by Sender, Ticker order by Timestamp rows between unbounded preceding and current row) as BuyTradeStatus,
        SellTradeStatus,
        last_value(BuyPrice) over (partition by Sender, Ticker order by Timestamp rows between unbounded preceding and current row) as BuyPrice,
        SellPrice
    from (
        -- split each trade into buy-side and sell-side columns so buys can be carried forward
        select
            Timestamp,
            Sender,
            Receiver,
            Ticker,
            if(BuySell == 'Buy', TradeStatus, NULL) as BuyTradeStatus,
            if(BuySell == 'Sell', TradeStatus, NULL) as SellTradeStatus,
            if(BuySell == 'Buy', RequestedQuantity, NULL) as BuyQuantity,
            if(BuySell == 'Sell', RequestedQuantity, NULL) as SellQuantity,
            if(BuySell == 'Buy', Price, NULL) as BuyPrice,
            if(BuySell == 'Sell', Price, NULL) as SellPrice
        from 
            fix.report 
        order by
            Sender,
            Ticker,
            Timestamp
    )
)
-- keep only the sell trades; each is paired with the last buy carried forward above
where isNotNull(SellPrice)

The view has the following format:

SELECT
    Timestamp,
    Sender,
    Receiver,
    Ticker,
    PnL
FROM fix.profit_and_loss
LIMIT 10

Query id: 2e473b5f-7368-4261-8950-42f258426926

    ┌───────────Timestamp─┬─Sender──────┬─Receiver────────┬─Ticker─┬────────────────PnL─┐
 1. │ 2024-03-08 09:39:37 │ BNP Paribas │ Morgan Stanley  │ BUA    │             960.75 │
 2. │ 2024-03-08 09:43:31 │ BNP Paribas │ Bank of America │ BUA    │ 1087.2999999999993 │
 3. │ 2024-03-08 09:47:16 │ BNP Paribas │ Deutsche Bank   │ BUA    │  983.6800000000003 │
 4. │ 2024-03-08 10:15:30 │ BNP Paribas │ Citi            │ BUA    │ 1854.5500000000002 │
 5. │ 2024-03-08 09:42:33 │ BNP Paribas │ Goldman Sachs   │ KHU    │  197.7800000000001 │
 6. │ 2024-03-08 09:54:09 │ BNP Paribas │ Citi            │ KHU    │             978.02 │
 7. │ 2024-03-08 09:56:45 │ BNP Paribas │ Barclays        │ KHU    │  49.92000000000007 │
 8. │ 2024-03-08 09:58:27 │ BNP Paribas │ J.P. Morgan     │ KHU    │             423.51 │
 9. │ 2024-03-08 10:00:33 │ BNP Paribas │ Citi            │ KHU    │ 1004.3999999999992 │
10. │ 2024-03-08 10:06:19 │ BNP Paribas │ Morgan Stanley  │ KHU    │ 1023.4000000000005 │
    └─────────────────────┴─────────────┴─────────────────┴────────┴────────────────────┘

10 rows in set. Elapsed: 0.073 sec. Processed 4.25 thousand rows, 969.35 KB (58.39 thousand rows/s., 13.32 MB/s.)
Peak memory usage: 6.13 MiB.

After that, we calculate the cumulative expected PnL and realized PnL of each bank, which we save in a ClickHouse view called fix.cumulative_profit_and_loss.

create or replace view 
    fix.cumulative_profit_and_loss
as select  
    Timestamp,
    Sender as Bank,
    ExpectedPnL,
    if(isNotNull(RealizedPnL), RealizedPnL, 0) as RealizedPnL
from (
    select
        Timestamp,
        Sender,
        ExpectedPnL,
        if(isNotNull(RealizedPnL), RealizedPnL, last_value(RealizedPnL) over (partition by Sender order by Timestamp rows between unbounded preceding and current row)) as RealizedPnL
    from
    (
        select
            Timestamp,
            Sender,
            sum(PnL) over (partition by Sender order by Timestamp rows between unbounded preceding and current row) as ExpectedPnL
        from
            fix.profit_and_loss
    ) a
    left join (
        select
            Timestamp,
            Sender,
            sum(PnL) over (partition by Sender order by Timestamp rows between unbounded preceding and current row) as RealizedPnL
        from
            fix.profit_and_loss
        where
            BuyTradeStatus == 'Confirmed'
        and
            SellTradeStatus == 'Confirmed'
    ) b  
    on 
        a.Timestamp == b.Timestamp
    and 
        a.Sender == b.Sender
)
order by Sender, Timestamp

The view has the following format:

SELECT *
FROM fix.cumulative_profit_and_loss
LIMIT 10

Query id: d57444ae-96ce-4675-92d8-cab6145566e5

    ┌───────────Timestamp─┬─Bank────────┬────────ExpectedPnL─┬────────RealizedPnL─┐
 1. │ 2024-03-08 09:39:04 │ BNP Paribas │  98.69999999999982 │                  0 │
 2. │ 2024-03-08 09:39:37 │ BNP Paribas │ 1059.4499999999998 │             960.75 │
 3. │ 2024-03-08 09:40:03 │ BNP Paribas │ 2323.8499999999995 │ 2225.1499999999996 │
 4. │ 2024-03-08 09:40:29 │ BNP Paribas │ 2583.0099999999993 │ 2225.1499999999996 │
 5. │ 2024-03-08 09:40:38 │ BNP Paribas │ 3350.6099999999997 │ 2225.1499999999996 │
 6. │ 2024-03-08 09:40:41 │ BNP Paribas │ 3874.5599999999986 │ 2225.1499999999996 │
 7. │ 2024-03-08 09:41:10 │ BNP Paribas │  5967.059999999999 │ 2225.1499999999996 │
 8. │ 2024-03-08 09:41:33 │ BNP Paribas │  6267.059999999999 │ 2525.1499999999996 │
 9. │ 2024-03-08 09:41:38 │ BNP Paribas │  7656.959999999999 │ 2525.1499999999996 │
10. │ 2024-03-08 09:42:08 │ BNP Paribas │            7862.49 │ 2525.1499999999996 │
    └─────────────────────┴─────────────┴────────────────────┴────────────────────┘

10 rows in set. Elapsed: 0.145 sec. Processed 8.50 thousand rows, 1.94 MB (58.58 thousand rows/s., 13.37 MB/s.)
Peak memory usage: 14.85 MiB.

We can now visualize the expected PnL against the realized PnL over time.
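
As a final illustrative step (this query is not part of the original article), the overall cost of the rejections could also be summarised per bank by taking the gap between the two measures at the latest timestamp:

-- Hypothetical summary of the PnL given up to rejected trades, per bank
SELECT
    Bank,
    argMax(ExpectedPnL, Timestamp) AS FinalExpectedPnL,
    argMax(RealizedPnL, Timestamp) AS FinalRealizedPnL,
    FinalExpectedPnL - FinalRealizedPnL AS CostOfRejections
FROM fix.cumulative_profit_and_loss
GROUP BY Bank
ORDER BY CostOfRejections DESC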
