How Deep.BI collects and stores big data for real-time and historical super fast analytics and AI

A major challenge in big data processing is storing the data in a way that allows fast, flexible access and analytics, including in real time.

At Deep.BI we tackle this problem with technologies such as Druid, Kafka and Flink.

How the data is collected and stored

Our JavaScript tag collects individual interactions on websites and apps, then converts them to JSON objects.
For example:

{
  "event.type": "page-open",
  "timestamp": "<timestamp>",
  "user.id.cookie": "<cookie_id>",
  "attributeGroup1": {
    "attr1": "value",
    "attr2": "value"
  },
  "attributeGroup2": {
    "attr1": "value",
    "attr2": "value"
  }
}

Each attribute becomes a column in our database (Druid), so the event above yields the following columns:

event.type, timestamp, user.id.cookie, attributeGroup1.attr1, attributeGroup1.attr2, attributeGroup2.attr1, attributeGroup2.attr2  
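The mapping from a nested event to dotted column names can be sketched in a few lines of Python. This is only an illustration of the naming scheme described above, not the actual Deep.BI ingestion code; the `flatten` helper and the placeholder values are assumptions made for the example.

```python
def flatten(event, prefix=""):
    """Flatten nested dicts into one dict keyed by dotted column names."""
    columns = {}
    for key, value in event.items():
        name = f"{prefix}.{key}" if prefix else key
        if isinstance(value, dict):
            # Nested attribute group: recurse and prefix with the group name.
            columns.update(flatten(value, name))
        else:
            columns[name] = value
    return columns


# Sample event mirroring the structure shown in the text (placeholder values).
event = {
    "event.type": "page-open",
    "timestamp": "<timestamp>",
    "user.id.cookie": "<cookie_id>",
    "attributeGroup1": {"attr1": "value", "attr2": "value"},
    "attributeGroup2": {"attr1": "value", "attr2": "value"},
}

columns = flatten(event)
print(list(columns))
```

Top-level keys that already contain dots, such as event.type, are kept as they are; only nested groups contribute a prefix.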

Usually, there are ~200 columns per row. Each row represents a single event, and each event weighs about 4 kB on average.

These columns and rows are stored in Druid in its native binary format, segmented by time. Each segment contains events from a certain time range. Notably, the data is compressed per column. Compression reduces the data 40-100 times, so 1 TB of raw data is stored as roughly 10-25 GB. This lets us provide real-time data exploration, analytics and access on huge data sets.
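As a rough back-of-the-envelope check, the sizes can be worked out from the figures quoted in the text (about 4 kB per event and a 40-100x compression ratio). The sketch below assumes decimal units (1 TB = 10^12 bytes); the constant and function names are made up for the example.

```python
RAW_BYTES = 1_000_000_000_000  # 1 TB of raw events, decimal units (assumption)
EVENT_BYTES = 4_000            # ~4 kB per event, the average quoted in the text


def compressed_gb(raw_bytes, ratio):
    """Return the stored size in GB for a given compression ratio."""
    return raw_bytes / ratio / 1_000_000_000


# Number of events that fit in 1 TB of raw data.
events_per_tb = RAW_BYTES // EVENT_BYTES
print(events_per_tb)                  # 250000000

# Stored size at both ends of the quoted compression range.
print(compressed_gb(RAW_BYTES, 40))   # 25.0
print(compressed_gb(RAW_BYTES, 100))  # 10.0
```

So 1 TB of raw data corresponds to about 250 million events, stored in something like 10 to 25 GB at the quoted ratios.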

How you can use this data

We provide tools for visual data exploration and for creating and sharing dashboards. You also have access to our API.
People often ask about access to their "raw data" stored at Deep.BI.

When considering how to use raw data, keep in mind how you want to analyze it.

First, you can reverse the compression and extract the raw JSON events, but this is often not optimal.

Usually, people want to extract specific information about users from the data. For example, for machine learning purposes you may want to extract a data mart like this:

[userid, attr1, attr2, …, attrN, metric1, metric2, …, metricN, label]

Example:

[UUID1, deviceModel, country, …, emailDomain, numberOfVisits[N], visitedSectionX[0,1], …, timeSpent[X], purchased[0,1]]

In this ML scenario you would use the Deep.BI platform for feature engineering: extracting attributes and creating synthetic features from metrics.
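To make the idea of a per-user data mart concrete, here is a minimal sketch in plain Python that aggregates flattened events into one row per user. It does not use the Deep.BI API or reflect how the platform computes features. Apart from user.id.cookie and event.type, every field name, the "purchase" event type and the `build_data_mart` helper are hypothetical.

```python
from collections import defaultdict

# Hypothetical flattened events; field names other than user.id.cookie
# and event.type are invented for this illustration.
events = [
    {"user.id.cookie": "u1", "event.type": "page-open",
     "device.model": "phone", "page.section": "sports", "time.spent": 30},
    {"user.id.cookie": "u1", "event.type": "purchase",
     "device.model": "phone", "page.section": "shop", "time.spent": 12},
    {"user.id.cookie": "u2", "event.type": "page-open",
     "device.model": "desktop", "page.section": "news", "time.spent": 45},
]


def build_data_mart(events, section="sports"):
    """Aggregate events into rows of
    [userid, deviceModel, numberOfVisits, visitedSection, timeSpent, purchased]."""
    users = defaultdict(lambda: {
        "deviceModel": None, "numberOfVisits": 0,
        "visitedSection": 0, "timeSpent": 0, "purchased": 0,
    })
    for e in events:
        row = users[e["user.id.cookie"]]
        row["deviceModel"] = e["device.model"]   # attribute: last seen value
        if e["event.type"] == "page-open":
            row["numberOfVisits"] += 1           # metric: count of page opens
        if e["page.section"] == section:
            row["visitedSection"] = 1            # binary synthetic feature
        row["timeSpent"] += e["time.spent"]      # metric: summed over events
        if e["event.type"] == "purchase":
            row["purchased"] = 1                 # label
    return [[uid, *row.values()] for uid, row in sorted(users.items())]


data_mart = build_data_mart(events)
for row in data_mart:
    print(row)
```

Each output row follows the [userid, attributes, metrics, label] layout described above, with the attributes copied from events and the metrics derived by counting or summing.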

To do this you don’t actually need raw Druid segments or raw JSON files. You just need to create Deep.BI API queries, and you’ll get CSVs with that kind of “raw data”.