In online applications, customer journey identification and conversion tracking are essential to growing any business. The customer journey starts with a customer reaching the landing page and browsing or engaging with the content before making a purchase or logging in. It is important to tie this anonymous behaviour to existing customer behaviour to

  1. Personalize content, for example by recommending products that are useful to the customer.
  2. Understand whether landing-page conversions to actions like a purchase are meaningful.
  3. Identify and filter out bots from customer behavioural data.

Below is a simple and quick fingerprinting system design.

Definitions:

Even though we call it session-level data, it is actually the event payload and cookie information captured for each session.

Setup

Let us start with an internal customer-session database. We will use the session information as a unique identifier for every single user session on the online application. This gives us a one-to-many mapping: each customer_id can be mapped to multiple session_ids, and each session is characterized by data from the browser, clicks, etc.
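As a rough illustration (the identifiers and fields below are hypothetical), the store can be thought of as a one-to-many mapping plus per-session payloads:

```python
# Hypothetical shape of the customer-session store: one customer_id maps to many
# session_ids, and each session carries its own browser / click payload.
customer_sessions = {
    "cust_001": ["sess_a1", "sess_a2", "sess_a3"],
    "cust_002": ["sess_b1"],
}

session_features = {
    "sess_a1": {"device": "iOS", "browser": "Safari", "clicks": 14},
    "sess_b1": {"device": "Android", "browser": "Chrome", "clicks": 3},
}
```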

Goals

We want to build a system that maps a given session_id to an existing customer_id or identifies it as a new customer. It is acceptable to classify two sessions that belong to the same customer_id as two different customers, but we do not want sessions from two different customers to be mapped to the same customer_id.

System requirements:

  • 100K user base
  • Online user fingerprinting system
  • Latency of 20 msec.

Problem Formulation

There are a couple of ways we can formulate this solution; let us go through them one by one.

Multi-class Classification: We could treat each customer_id as one class and train an algorithm on this data to predict the class a session_id belongs to. This approach has a few drawbacks:

    1. We would have a huge number of classes, on the order of 100K customers to begin with, and the number of records per customer would be very small, around 10 on average, resulting in a highly sparse dataset. We would run into the curse of dimensionality early on.
    2. With every new customer, we would have to retrain every day or every few hours to update the classes. This is not a very scalable system.

Clustering: We could use an algorithm like K-means, DBSCAN, or some other clustering algorithm. With this we would be able to map existing customers, but new customers with multiple sessions pose a challenge: we would have to retrain the algorithm with an updated training set very frequently.

Embedding-based Proximity Search: The main idea here is that sessions from the same customer lie closer together in a learned vector space than sessions from different customers. The goal is then to learn that vector representation and a distance threshold that separates sessions of the same customer from sessions of different customers. This has the following advantages:

Customer-level data sparsity: This is not an issue during inference, since we are simply mapping from a high-dimensional sparse feature space to a low-dimensional dense vector representation.

New customer identification: A new customer's session will be farther from its neighboring sessions than sessions belonging to the same existing customer are from each other. This follows from the problem formulation itself.

ML System Design

Stage 1/2: Embedding System Design:

The goal here is for the vector representations of sessions from the same customer to be close together and those of sessions from different customers to be far apart. This is essentially a solved problem, used frequently in facial recognition. We can use the triplet loss function, which pulls sessions belonging to the same customer closer together and pushes sessions belonging to different customers farther apart. The triplet loss function is defined below.

Loss = max(0, d(a, p) - d(a, n) + α)

Where,

  • d: distance metric
  • a: anchor sample
  • p: positive sample
  • n: negative sample
  • α: margin
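As a minimal sketch (assuming PyTorch; the function name and default margin are illustrative), the triplet loss with cosine distance can be written as:

```python
import torch
import torch.nn.functional as F

def triplet_cosine_loss(anchor, positive, negative, margin=0.3):
    # Cosine distance = 1 - cosine similarity, computed per row of the batch.
    d_ap = 1.0 - F.cosine_similarity(anchor, positive)
    d_an = 1.0 - F.cosine_similarity(anchor, negative)
    # Hinge: penalize only when the positive is not at least `margin` closer than the negative.
    return torch.clamp(d_ap - d_an + margin, min=0.0).mean()
```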

Stage 2/2: Classifier threshold

Now that we have an embedding for any incoming session, we just need to calculate the cosine distance between any two records to classify them as belonging to the same customer or to different customers. We can select the threshold as a heuristic that maximizes the evaluation metric; it can be tuned using a Receiver Operating Characteristic (ROC) curve, as sketched below.
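A minimal sketch of the threshold selection, assuming scikit-learn; the pair labels and similarity scores below are hypothetical:

```python
import numpy as np
from sklearn.metrics import roc_curve

# Hypothetical pair labels: 1 = both sessions belong to the same customer, 0 = different customers.
y_true = np.array([1, 1, 0, 1, 0, 0])
# Cosine similarity between the two session embeddings of each pair.
scores = np.array([0.92, 0.81, 0.40, 0.77, 0.55, 0.30])

fpr, tpr, thresholds = roc_curve(y_true, scores)
best = np.argmax(tpr - fpr)          # Youden's J, one possible heuristic for picking the operating point
threshold = thresholds[best]
same_customer = scores >= threshold  # pairs at or above the threshold are treated as the same customer
```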

Putting it all together

Evaluation:

Overall Evaluation Metric: Macro Precision

True Positive (TP): a session correctly mapped to its existing customer_id.

True Negative (TN): a single session from a new customer correctly identified as a new customer.

False Positive (FP): a session mapped to the wrong customer_id.

False Negative (FN): a session belonging to an existing customer (with multiple sessions) incorrectly mapped to a new customer.
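Given these definitions, macro precision can be computed directly with scikit-learn; the per-session labels below are hypothetical:

```python
from sklearn.metrics import precision_score

# Hypothetical per-session labels: ground-truth customer_id vs. the id assigned by the system.
# "new" marks a session classified as a brand-new customer.
y_true = ["cust_1", "cust_1", "cust_2", "cust_3"]
y_pred = ["cust_1", "new", "cust_2", "cust_2"]  # last session is an FP: cust_3 merged into cust_2

macro_precision = precision_score(y_true, y_pred, average="macro", zero_division=0)
```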

Given that we have roughly 50 categorical features (metrics) with 2-3 categories each, we can assume roughly 100-150 one-hot encoded columns. We can then train a neural network that takes these as input and returns an embedding of size between sqrt(100) and sqrt(150), so roughly 10-13 dimensions. We will treat the embedding size as a hyper-parameter to tune.
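A minimal sketch of the feature expansion, assuming scikit-learn; the column names and categories are illustrative:

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Tiny illustrative frame; the real one would have ~50 categorical columns with 2-3 levels each.
sessions = pd.DataFrame({
    "device": ["iOS", "Android", "iOS"],
    "browser": ["Safari", "Chrome", "Firefox"],
    "touch_support": ["yes", "yes", "no"],
})

encoder = OneHotEncoder(handle_unknown="ignore")  # unseen categories at inference map to all-zeros
X = encoder.fit_transform(sessions).toarray()
print(X.shape)  # (3, 7) here; roughly (n_sessions, 100-150) with the full feature set
```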

Neural Network Setup:

input nodes: 100-150

output nodes: 10-12

hidden layers: 2, 3, 4

Loss function for training neural network: Triplet loss with cosine distance

Add a normalization layer before cosine computation.
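A minimal sketch of such an encoder, assuming PyTorch; the layer sizes are illustrative but fall within the ranges above:

```python
import torch.nn as nn
import torch.nn.functional as F

class SessionEncoder(nn.Module):
    """Maps one-hot encoded session features to a dense, L2-normalized embedding."""

    def __init__(self, n_features=128, embedding_dim=12, hidden_dim=64, dropout=0.2):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(n_features, hidden_dim),
            nn.ReLU(),
            nn.Dropout(dropout),
            nn.Linear(hidden_dim, hidden_dim),
            nn.ReLU(),
            nn.Dropout(dropout),
            nn.Linear(hidden_dim, embedding_dim),
        )

    def forward(self, x):
        # L2 normalization so the downstream cosine distance reduces to a dot product.
        return F.normalize(self.net(x), p=2, dim=-1)
```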

Hyper-parameters to tune:

hidden layers: 2, 3, 4

output nodes: 10, 11, 12

dropout: 10%, 20%, 30%

Triplet loss alpha: 0.2, 0.3, 0.5, 0.7

Dataset:

Training dataset size:

  • 12K unique customers with at least 10 records each. The more sessions per customer, the better.
  • Uniform distribution across device types (iOS vs Android) and browser types (Safari vs Firefox vs Chrome).
  • Keep only records from browsers and devices that are currently supported and not being deprecated anytime soon. This ensures that the patterns learnt are consistent with what is seen today.
  • 1K customers with 10 records each held out from training as the validation dataset. We need to ensure that the distribution between training and validation is consistent.

Test set:

  • 1K customers with 10 records each held out as the test set. We need to ensure that the distribution between training and test is consistent.

Computing and looking up cosine distances across 100K - 1M records for every incoming session would be expensive and unnecessary. Instead, we can use an efficient Approximate Nearest Neighbor (ANN) search based on Hierarchical Navigable Small World (HNSW) graphs. An efficient implementation is available in the FAISS Python library (from Meta), as sketched below.
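A minimal sketch of the ANN lookup, assuming FAISS; the HNSW parameters and random embeddings below are placeholders:

```python
import numpy as np
import faiss

dim = 12                              # embedding size from the encoder
index = faiss.IndexHNSWFlat(dim, 32)  # 32 = HNSW graph degree (M)
index.hnsw.efSearch = 64              # query-time accuracy vs. latency trade-off

# With L2-normalized embeddings, L2 distance ranks neighbors the same way cosine distance does.
session_embeddings = np.random.rand(100_000, dim).astype(np.float32)  # placeholder for stored sessions
session_embeddings /= np.linalg.norm(session_embeddings, axis=1, keepdims=True)
index.add(session_embeddings)

query = session_embeddings[:1]        # a new session embedding, shape (1, dim)
distances, neighbor_ids = index.search(query, 5)  # 5 nearest stored sessions
```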