Some Like it Hot: Choosing a System for Large-Scale Data Analysis

Data Science Milan
Oct 12, 2021

“The next generation of intelligence tools”

On 6 October 2021, Data Science Milan organized a web meetup hosting Peter Marshall, who talked about a data analysis system used by Airbnb, Alibaba, Swisscom, Expedia, British Telecom, and others.

“Some Like it Hot: Choosing a System for Large-Scale Data Analysis”, by Peter Marshall, Technology Evangelist at Imply

Peter started by introducing Druid.

Apache Druid is an open-source data store designed for real-time analytics on large data sets. It was born at Metamarkets, a company working with advertising exchanges, and was open-sourced in 2012.

Common applications for Druid include:

-Clickstream analytics including web and mobile analytics

-Network telemetry analytics including network performance monitoring

-Server metrics storage

-Supply chain analytics including manufacturing metrics

-Application performance metrics

-Digital marketing/advertising analytics

-Business intelligence/OLAP

One of the first adopters was Netflix in 2013, and by 2015 adopters included eBay, PayPal, Cisco, and Yahoo.

Druid is a fully scalable database that combines ideas from data warehouses, timeseries databases, and search systems to deliver real-time analytics for a wide range of use cases.

At this point Peter explained Druid’s use cases with a temperature bar running from cold to hot analytics applications. In cold use cases, it does not matter much how slowly the queries run. The activity is mostly standard reporting, and the output is fairly predictable. A cold user interface supports careful planning and consideration rather than instinctive decisions.

At the hot end of the bar, queries have to come back in under a second over continuously updated data, and many queries run at the same time. People want to use real-time data that is critical to business decisions, so they spend a lot of time investigating it interactively, an activity known as Online Analytical Processing (OLAP). This hot lane suits risk and fraud analytics, data-driven applications, clickstream and web analytics, digital advertising, and in general any application where the database has to scale with the number of users.
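To make the "hot" interactive query more concrete, here is a minimal Python sketch of how such a question could be sent to Druid through its SQL HTTP endpoint (`/druid/v2/sql`). The host and port, the `clickstream` datasource, and the `page` column are assumptions made for illustration and were not part of the talk; only the request is built here, so the snippet runs without a cluster.

```python
import json
import urllib.request

# Assumption: a Druid router reachable locally on the quickstart default port.
DRUID_SQL_URL = "http://localhost:8888/druid/v2/sql"

# Hypothetical datasource ("clickstream") and column ("page");
# __time is Druid's built-in timestamp column.
EVENTS_PER_MINUTE = """
SELECT TIME_FLOOR(__time, 'PT1M') AS "minute", page, COUNT(*) AS events
FROM clickstream
WHERE __time >= CURRENT_TIMESTAMP - INTERVAL '1' HOUR
GROUP BY 1, 2
ORDER BY events DESC
LIMIT 10
"""


def build_sql_request(sql: str, url: str = DRUID_SQL_URL) -> urllib.request.Request:
    """Wrap a SQL string in the JSON body expected by the SQL endpoint."""
    body = json.dumps({"query": sql.strip()}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def run_query(sql: str) -> list:
    """Send the query and decode the JSON rows (needs a running Druid cluster)."""
    with urllib.request.urlopen(build_sql_request(sql)) as response:
        return json.load(response)


# Build (but do not send) the request, so this sketch runs offline.
request = build_sql_request(EVENTS_PER_MINUTE)
print(request.get_method(), request.full_url)
```

With a cluster available, `run_query(EVENTS_PER_MINUTE)` would return one JSON object per result row. A dashboard in the hot lane would issue many small queries like this concurrently, which is why sub-second response time matters.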

Druid merges features of OLTP RDBMSs, ETL tools, data lakes, and query engines into its ingestion layer, storage format, querying layer, and core architecture.

In the end, it is the combination of indexing, data partitioning, query caching, data compression, and massively parallel processing that gives Druid its high performance in real-time analytics.
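Two of these techniques, time partitioning and indexing, can be illustrated with a toy Python sketch. This is not Druid's implementation: the rows, column names, and helper functions are hypothetical, and plain Python sets stand in for the compressed bitmap indexes a real column store would use. The point is only to show why a filtered count can skip most of the data.

```python
from collections import defaultdict
from datetime import datetime

# Hypothetical clickstream rows: (timestamp, country, page).
ROWS = [
    (datetime(2021, 10, 6, 9, 5), "IT", "/home"),
    (datetime(2021, 10, 6, 9, 40), "FR", "/home"),
    (datetime(2021, 10, 6, 10, 15), "IT", "/pricing"),
    (datetime(2021, 10, 6, 10, 50), "IT", "/home"),
]


def build_segments(rows):
    """Partition rows by hour, then index each partition per column value."""
    segments = defaultdict(lambda: {"rows": [], "index": defaultdict(set)})
    for ts, country, page in rows:
        segment = segments[ts.replace(minute=0, second=0, microsecond=0)]
        row_id = len(segment["rows"])
        segment["rows"].append((ts, country, page))
        # Inverted index: (column, value) -> ids of the rows holding that value.
        segment["index"][("country", country)].add(row_id)
        segment["index"][("page", page)].add(row_id)
    return segments


def count_matches(segments, start, end, **filters):
    """Count rows in hourly partitions starting in [start, end) that match all filters."""
    total = 0
    for hour, segment in segments.items():
        if not (start <= hour < end):
            continue  # partition pruning: skip segments outside the interval
        matching = set(range(len(segment["rows"])))
        for column, value in filters.items():
            # Intersect row-id sets instead of scanning every row.
            matching &= segment["index"].get((column, value), set())
        total += len(matching)
    return total


segments = build_segments(ROWS)
ten_to_eleven = (datetime(2021, 10, 6, 10), datetime(2021, 10, 6, 11))
print(count_matches(segments, *ten_to_eleven, country="IT"))  # prints 2
```

Because each hourly partition is independent, the per-partition work could also be spread across many processes or machines, which is the intuition behind the parallel processing mentioned above.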

Recording & Slides:

video

slides

References:

https://druid.apache.org

http://static.druid.io/docs/druid.pdf

Interesting related links:

https://blog.knoldus.com/introducing-druid-realtime-fast-data-analytics-database/

https://blog.knoldus.com/data-ingestion-in-druid-overview/

Written by Claudio G. Giancaterino
