How Deep.BI collects and stores big data for real-time and historical super fast analytics and AI

The major challenge in big data processing is storing the data in a way that allows fast, flexible access and analytics, including real-time analysis.

At Deep.BI we tackled this problem with top-notch technologies such as Druid, Kafka and Flink.

How the data is collected and stored

Our JavaScript tag collects individual interactions on websites and apps, then converts each one to a JSON object.
For example:

{
  "event.type": "page-open",
  "timestamp": "<timestamp>",
  "user.id.cookie": "<cookie_id>",
  "attributeGroup1": {
    "attr1": "value",
    "attr2": "value"
  },
  "attributeGroup2": {
    "attr1": "value",
    "attr2": "value"
  }
}

Each attribute becomes a column in our database (Druid), and nested attributes are flattened into dot-separated names. As a result, we get the following columns:

event.type, timestamp, user.id.cookie, attributeGroup1.attr1, attributeGroup1.attr2, attributeGroup2.attr1, attributeGroup2.attr2  
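The mapping from a nested event to flat column names can be sketched in a few lines of Python. This is only an illustration of the idea, not the code Deep.BI runs; the function name `flatten_event` and the placeholder values are assumptions made for the example.

```python
from typing import Any, Dict


def flatten_event(event: Dict[str, Any], prefix: str = "") -> Dict[str, Any]:
    """Flatten nested dicts into one level, joining nested keys with dots."""
    columns: Dict[str, Any] = {}
    for key, value in event.items():
        name = f"{prefix}.{key}" if prefix else key
        if isinstance(value, dict):
            # Recurse into an attribute group, carrying its name as the prefix.
            columns.update(flatten_event(value, name))
        else:
            columns[name] = value
    return columns


# Sample event mirroring the JSON above (values are placeholders).
event = {
    "event.type": "page-open",
    "timestamp": "2020-01-01T00:00:00Z",
    "user.id.cookie": "cookie-123",
    "attributeGroup1": {"attr1": "value", "attr2": "value"},
    "attributeGroup2": {"attr1": "value", "attr2": "value"},
}

column_names = list(flatten_event(event))
print(column_names)
```

Top-level keys that already contain dots, such as event.type, pass through unchanged; only nested groups gain a prefix.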

Usually there are ~200 columns per row. Each row represents a single event, and each event weighs about 4 kB on average.

These columns and rows are stored in Druid in its native binary format, segmented by time: each segment holds the events from a certain time range. Data is compressed per column, and the compression ratio is typically 40-100x. At those ratios, 1 TB of raw data occupies roughly 10-25 GB. This is what lets us provide real-time data exploration, analytics and access on huge data sets.
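To make "segmented by time" and "per column" more concrete, here is a toy Python sketch that groups flattened events into hourly buckets and stores each bucket column by column. It is not Druid's actual segment format; the hourly granularity, the epoch-second timestamps and the names `SEGMENT_SECONDS` and `build_segments` are assumptions for illustration.

```python
from collections import defaultdict
from typing import Any, Dict, List

SEGMENT_SECONDS = 3600  # assumed granularity: one segment per hour


def segment_start(timestamp: int) -> int:
    """Round an epoch-second timestamp down to the start of its segment."""
    return timestamp - timestamp % SEGMENT_SECONDS


def build_segments(rows: List[Dict[str, Any]]) -> Dict[int, Dict[str, List[Any]]]:
    """Group rows by time bucket, then pivot each bucket into columns."""
    by_segment: Dict[int, List[Dict[str, Any]]] = defaultdict(list)
    for row in rows:
        by_segment[segment_start(row["timestamp"])].append(row)

    segments: Dict[int, Dict[str, List[Any]]] = {}
    for start, seg_rows in by_segment.items():
        names = sorted({name for row in seg_rows for name in row})
        # One list per column; a missing attribute is stored as None.
        segments[start] = {
            name: [row.get(name) for row in seg_rows] for name in names
        }
    return segments


rows = [
    {"timestamp": 0, "event.type": "page-open", "user.id.cookie": "a"},
    {"timestamp": 1800, "event.type": "page-open", "user.id.cookie": "b"},
    {"timestamp": 3700, "event.type": "click", "user.id.cookie": "a"},
]
segments = build_segments(rows)
print(segments[0]["event.type"])
```

Keeping all values of one column together places similar, often repeated values next to each other, which is the general reason column-wise compression tends to work well on event data.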

How you can use this data

We provide tools for visual data exploration and for creating and sharing dashboards. Additionally, you have access to our API.
People often ask about access to their "raw data" stored at Deep.BI.

Before asking for raw data, consider how you intend to analyze it.

One option is to reverse the compression and extract the raw JSON events, but this is often not optimal.

Usually people want to extract specific information about users from the data. For example, for machine learning purposes you may want one record per user in this form:

[userid, attr1, attr2, …, attrN, metric1, metric2, …, metricN, label]

Example:

[UUID1, deviceModel, country, …, emailDomain, numberOfVisits[N], visitedSectionX[0,1], …, timeSpent[X], purchased[0,1]]
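As a rough illustration of what such a record is made of, the Python sketch below rolls a handful of events up into one row per user. The event fields (`user_id`, `type`, `section`, `seconds`) and the rule that every page-open counts as a visit are hypothetical simplifications chosen for the example, not Deep.BI's schema or logic.

```python
from collections import defaultdict
from typing import Any, Dict, List


def build_feature_rows(events: List[Dict[str, Any]]) -> List[List[Any]]:
    """Aggregate events into [userid, country, visits, sectionX, time, label]."""
    users: Dict[str, Dict[str, Any]] = defaultdict(
        lambda: {
            "country": None,
            "numberOfVisits": 0,
            "visitedSectionX": 0,
            "timeSpent": 0.0,
            "purchased": 0,
        }
    )
    for e in events:
        u = users[e["user_id"]]
        u["country"] = e.get("country", u["country"])  # attribute
        if e["type"] == "page-open":
            u["numberOfVisits"] += 1                   # metric (simplified)
            if e.get("section") == "X":
                u["visitedSectionX"] = 1               # binary metric
            u["timeSpent"] += e.get("seconds", 0.0)    # metric
        elif e["type"] == "purchase":
            u["purchased"] = 1                         # label
    return [
        [uid, f["country"], f["numberOfVisits"], f["visitedSectionX"],
         f["timeSpent"], f["purchased"]]
        for uid, f in sorted(users.items())
    ]


events = [
    {"user_id": "UUID1", "type": "page-open", "country": "PL",
     "section": "X", "seconds": 30.0},
    {"user_id": "UUID1", "type": "purchase", "country": "PL"},
    {"user_id": "UUID2", "type": "page-open", "country": "US",
     "section": "Y", "seconds": 12.5},
]
feature_rows = build_feature_rows(events)
print(feature_rows)
```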

In this ML scenario you would use the Deep.BI platform for feature engineering: the extraction of attributes and creation of synthetic features from metrics.

To do this you don’t actually need raw Druid segments or raw JSON files; just create Deep.BI API queries and you’ll get CSV files with that kind of “raw data”.
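Once you have such a CSV, turning it into model inputs takes only a few lines. The sketch below uses Python's standard csv module; the column names and sample rows are made up to match the example record above, and the actual export layout depends on the query you build.

```python
import csv
import io
from typing import Any, List, Tuple

# Hypothetical export; a real file would come from your own API query.
CSV_EXPORT = """userid,deviceModel,country,numberOfVisits,visitedSectionX,timeSpent,purchased
UUID1,Pixel 4,PL,12,1,340.5,1
UUID2,iPhone 11,US,3,0,42.0,0
"""

CATEGORICAL = ["deviceModel", "country"]
NUMERIC = ["numberOfVisits", "visitedSectionX", "timeSpent"]
LABEL = "purchased"


def load_training_data(
    csv_text: str,
) -> Tuple[List[str], List[List[Any]], List[int]]:
    """Split CSV rows into user ids, feature vectors and labels."""
    user_ids: List[str] = []
    features: List[List[Any]] = []
    labels: List[int] = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        user_ids.append(row["userid"])
        # Keep categorical values as strings; convert metrics to floats.
        features.append(
            [row[c] for c in CATEGORICAL] + [float(row[c]) for c in NUMERIC]
        )
        labels.append(int(row[LABEL]))
    return user_ids, features, labels


user_ids, features, labels = load_training_data(CSV_EXPORT)
print(features[0], labels[0])
```

The categorical columns would still need encoding (for example one-hot) before most models can use them.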