It is common to run hybrid infrastructure: your own datacenter alongside some services in a public cloud. At Deep.BI we have built our private cloud on rented servers, and we also use external clouds such as AWS and Azure.
In this post we describe how to connect a Druid cluster hosted in a private datacenter to Amazon's hosted Hadoop service, EMR (Elastic MapReduce), in order to run Hadoop indexing jobs that work around the Kafka Indexing Service's "not merging segments" problem.
First, we modified the security groups to accept traffic from our Druid MiddleManagers. HDFS access worked without problems, because our HDFS client (snakebite) connects to the WebHDFS service, which listens on port 8020. Unfortunately, when we tried to access EMR through its public DNS name, we still encountered the same java.net.ConnectException: Connection refused error.
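When a client reports Connection refused, it can help to check from a MiddleManager host which ports on the cluster are reachable at all. The following is a minimal Python sketch of such a probe, not the exact tooling used in this setup; the function names are ours, and the only port taken from the text is 8020.

```python
import socket
from typing import Dict, Iterable


def port_is_open(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and name resolution failures.
        return False


def reachable_ports(host: str, ports: Iterable[int]) -> Dict[int, bool]:
    """Probe each port in turn and report which ones accept connections."""
    return {port: port_is_open(host, port) for port in ports}
```

Calling `reachable_ports(host, [8020])` with the cluster's public DNS name, and later with its local address over the VPN, shows whether the failure comes from the security groups or from something further down the path.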
The message was clear to us: we needed direct access to the EMR cluster using its local hostnames. We configured an ec2-to-emr router and used a VPN to access EMR.
Finally, our MiddleManagers were able to connect to the EMR cluster using its local IP. Unfortunately, we still encountered the same java.net.ConnectException: Connection refused error. The cause was that the Hadoop client was randomly selecting one of our interfaces, [all_traffic:eth0, VPN:tun0], to communicate with the cluster. We tried to "convince" it to use the tun0 interface, which was a partial success: the Connection refused error disappeared, but we got an error about an unknown tun0 interface instead. EMR had passed our Hadoop option on to its own cluster, which had no idea about our tun0 interface, nor was it supposed to.
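With two interfaces on the host, one quick check is to ask the operating system which local address it would use to reach a given cluster address. The sketch below is a diagnostic under assumptions: it reports the kernel's routing choice only, which is not necessarily the interface the Hadoop client ends up using, and the function name is ours.

```python
import socket


def source_ip_for(destination: str, port: int = 8020) -> str:
    """Return the local IPv4 address the kernel would route from to reach destination."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        # connect() on a UDP socket sends no packets; it only selects a route.
        sock.connect((destination, port))
        return sock.getsockname()[0]
```

If the address returned for an EMR local IP belongs to tun0, the routing table already sends that traffic through the VPN, and the remaining problem is more likely to be name resolution than interface selection.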
Fortunately, adding proper entries for EMR local name resolution to /etc/hosts on the Druid MiddleManagers solved all of the networking problems.
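The /etc/hosts entries can be generated rather than typed by hand. Here is a minimal Python sketch, assuming the cluster nodes use the default EC2 private DNS naming scheme (`ip-10-0-1-15.ec2.internal` in us-east-1, `<region>.compute.internal` elsewhere). The function name and the addresses in the example are made up for illustration; check the real names on your own cluster before using anything like this.

```python
import ipaddress
from typing import Iterable, List


def emr_hosts_entries(private_ips: Iterable[str], domain: str = "ec2.internal") -> List[str]:
    """Build /etc/hosts lines mapping private IPs to EC2-style internal hostnames."""
    lines = []
    for ip in private_ips:
        # Raises ValueError early if the input is not a valid IP address.
        ipaddress.ip_address(ip)
        short_name = "ip-" + ip.replace(".", "-")
        # Format: <ip> <fully qualified name> <short alias>
        lines.append(f"{ip}\t{short_name}.{domain} {short_name}")
    return lines


if __name__ == "__main__":
    # Made-up private addresses, for illustration only.
    for line in emr_hosts_entries(["10.0.1.15", "10.0.1.16"]):
        print(line)
```

The printed lines can then be appended to /etc/hosts on each MiddleManager, so that the local hostnames returned by the cluster resolve to addresses reachable over the VPN.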