Connecting Druid with AWS EMR via VPN to run Hadoop Indexing Jobs

Running hybrid infrastructure is a common situation: your own datacenter combined with some services in a public cloud. At Deep.BI we have built our private cloud on rented servers, and we also use external clouds such as AWS and Azure.

In this post we describe how to connect a Druid cluster hosted in a private datacenter with EMR (Elastic MapReduce), Amazon's hosted Hadoop service, in order to run Hadoop Indexing Jobs. These jobs solve the Kafka Indexing Service "not merging segments" problem.

First we modified the EMR security groups to accept traffic from our Druid Middlemanagers. HDFS access was not a problem: our HDFS client (snakebite) connects to the HDFS NameNode service, which listens on port 8020. Unfortunately, when the Middlemanagers tried to reach EMR through its public DNS names, we still got a Connection refused error.
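A quick way to tell a security group problem from a routing or name resolution problem is to probe the relevant ports from a Middlemanager. The snippet below is only a minimal sketch of such a check, not the tooling we actually used; the hostname in the usage comment is a hypothetical placeholder, and 8020 is the only port taken from this post.

```python
import socket


def port_open(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers connection refused, timeouts and name resolution failures.
        return False


# Usage (hypothetical hostname, replace with your EMR master node):
# print(port_open("emr-master.example.internal", 8020))
```

If the port is reachable by private IP but not by name, the problem is name resolution rather than the firewall.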

The message was clear to us: we needed direct access to the EMR cluster using its local hostnames. We configured an ec2-to-emr router and used a VPN to reach EMR.

Finally, our Middlemanagers were able to connect to the EMR cluster using its local IP addresses. Unfortunately, we still got the same Connection refused error. The reason was that the Hadoop client was picking one of our network interfaces at random (eth0 for all traffic, tun0 for the VPN) to communicate with the cluster. We tried to "convince" it to use the tun0 interface, which was a partial success: the Connection refused error was gone, but we got an error about an unknown tun0 interface instead. The Hadoop option we had set was passed on to the EMR cluster itself, whose nodes had no idea about our tun0 interface, nor were they supposed to.
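When debugging this kind of problem, it can help to check which local address the operating system itself would choose for traffic to an EMR node. The sketch below does not reproduce how the Hadoop client selects an interface; it only asks the kernel routing table, and the IP address in the usage comment is a hypothetical example.

```python
import socket


def source_ip_for(dest_ip, dest_port=8020):
    """Return the local IP the OS routing table would use to reach dest_ip."""
    # connect() on a UDP socket sends no packets; it only makes the kernel
    # pick a route and a source address for that destination.
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as s:
        s.connect((dest_ip, dest_port))
        return s.getsockname()[0]


# Usage (hypothetical EMR private IP):
# print(source_ip_for("10.0.1.15"))
```

If the returned address belongs to tun0, routing through the VPN is fine and the remaining issue lies elsewhere, for example in name resolution.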

Fortunately, adding proper entries for the EMR local hostnames to /etc/hosts on the Druid Middlemanagers solved all the networking problems.
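As an illustration of what those entries might look like, here is a small sketch that renders /etc/hosts lines from a mapping of private IPs to hostnames. The IP addresses and hostnames are invented for the example and follow the usual EC2 internal naming pattern; substitute the values of your own EMR nodes.

```python
def hosts_entries(nodes):
    """Render /etc/hosts lines from a mapping of private IP -> list of hostnames."""
    lines = []
    for ip, names in sorted(nodes.items()):
        lines.append(f"{ip}\t{' '.join(names)}")
    return "\n".join(lines) + "\n"


# Hypothetical EMR nodes (not values from our cluster).
emr_nodes = {
    "10.0.1.15": ["ip-10-0-1-15.ec2.internal", "ip-10-0-1-15"],
    "10.0.1.27": ["ip-10-0-1-27.ec2.internal", "ip-10-0-1-27"],
}

# Review the output, then append it to /etc/hosts on each Middlemanager.
print(hosts_entries(emr_nodes), end="")
```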

Jan Kogut

Lead DevOps at Deep BI, Inc.

