Generic JDBC Queries on EMR Zeppelin

April 3, 2019

EMR (Elastic Map Reduce) service on amazon has some nice packages that come pre-installed, and one of them is Apache Zeppelin, which is a Jupyter Notebook interface for Spark.

Zeppelin has interpreters for spark, pyspark, spark-sql and others, but if you want to run spark-sql code on a PostgreSQL database, you need first to install the JDBC interpreter and add some extra configuration to Zeppelin.

The JDBC adapter supports a wide variety of database engines, and it allows you to configure multiple database connections, which makes data exploration much easier.

The generic JDBC interpreter supports these databases:

  • PostgreSQL
  • MariaDB
  • MySQL
  • Redshift
  • Apache Hive

In the next section, I’ll explain how to install and configure the JDBC interpreter on Apache Zeppelin.

Installing the JDBC Interpreter on Zeppelin

When you’re creating your EMR cluster on AWS, on the Software Configuration step, make sure to include these applications: Zeppelin 0.8.1, Spark 2.4.0. You can also go to Advanced Options and check all the applications you need. Also, make sure to add an EC2 key pair so you can connect to it with ssh.

Then, after your cluster is running, connect to it via ssh:

$ ssh -i [path-to-your-ec2-keypair].pem hadoop@[some-address].compute.amazonaws.com

Then run this command to install the jdbc interpreter:

$ sudo /usr/lib/zeppelin/bin/install-interpreter.sh --name jdbc

> Interpreter jdbc installed under /usr/lib/zeppelin/interpreter/jdbc.

Restart Zeppelin by running these commands:

$ sudo stop zeppelin
> zeppelin stop/waiting

$ sudo start zeppelin
> zeppelin start/running, process 24434

PROTIP: this step can also be run as a bootstrap-action for convenience. EMR allows you to define a bash script that is run when creating your EMR cluster and it will be executed when each node is started, once on the master node and once on all the slave nodes. You can define bootstrap actions before creating the EMR cluster.

Now, in order to run a query, you need to configure the Zeppelin interpreter. To access Zeppelin notebook, first you need to open an ssh tunnel to your EMR cluster and configure a proxy that will safely access the cluster through the ssh tunnel.

Connecting to EMR with FoxyProxy 6.x

In order to access the web tools on your EMR cluster, you need to configure the Web Connection. Go to Amazon EMR > Clusters > click on the cluster you just created, then click on Enable Web Connection. This will explain how to set up an ssh tunnel and configure foxyproxy.

To set up the tunnel on port 8157, run:

ssh -i ~/emr-key.pem -ND 8157 hadoop@[master-public-dns-of-your-cluster]

Then you need to configure a proxy management tool. The tutorial description on amazon explains how to set up foxyproxy, but this is not up-to-date because foxyproxy doesn’t support XML configurations anymore, so you’re gonna need to use a json config. To do that, copy this configuration example here and save it as foxyproxy-config.json.

Then, follow these steps:

  1. Install the FoxyProxy add-on, then click on the FoxyProxy icon on the top right corner of firefox
  2. click on Import
  3. On the first section Import Settings from FoxyProxy 6.x (current version), click on Browse
  4. Select foxyproxy-config.json and confirm.
  5. You’re gonna see a new proxy configuration on the foxyproxy list called EMR SOCKS Proxy.
  6. Enable the proxy by clicking on the foxyproxy icon on the top right corner of your browser and select Use proxy EMR SOCKS Proxy for all URLs (ignore patterns)

Now you should be able to access any of the web application interfaces running on your EMR cluster, like Zeppelin, Ganglia, and YARN. You can see a list of them here.

To access Zeppelin, go to http://[master-public-dns-of-your-cluster]:8890/.

Configure Database Connections

  1. On Zeppelin, go to Interpreters http://[master-public-dns-of-your-cluster]:8890/#/interpreter.
  2. Click on + Create to configure a new Interpreter
  3. Give it a name (like jdbc) to be used on your notes as %jdbc
  4. Choose JDBC as the interpreter group
  5. Configure the connection properties, like default.url, default.password, default.user and default.driver (if needed) to access your database.

The interesting thing about the JDBC interpreter is that you can define multiple connections to any different type of database by using a prefix instead of the default one and defining an appropriate driver, as described here. As an example, I wanted to connect to three different PostgreSQL databases, so I configured my connections as following:

Database DB1

db1.url -> jdbc:postgresql://db1-url:5432/db-name1 (string)
db1.password -> ******* (password)
db1.user     -> db1user (string)
db1.driver   -> org.postgresql.Driver (string)

Database DB2

db2.url -> jdbc:postgresql://db2-url:5432/db-name2 (string)
db2.password -> ******* (password)
db2.user     -> db2user (string)
db2.driver   -> org.postgresql.Driver (string)

And so on. The JDBC interpreter uses the driver and the connection details to access the database in a generic way, so it’s pretty easy to connect to any kind of database and use the same querying interface. In order to use a different connection on your notebook, you define the interpreter and add a prefix, like so:

%jdbc(db1)
SHOW TABLES;

%jdbc(db2)
SHOW TABLES;

Running Queries

To use the generic JDBC interpreter, create a new zeppelin note. Then click on Settings (Interpreter Binding) and enable/bind the JDBC interpreter to your note. You just need to enable jdbc.

Then, to use the interpreter you just created, create a new paragraph, and define the interpreter and the connection prefix you want to use:

%jdbc(prefix)
SELECT * FROM tableA;

In my case:

%jdbc(db1)
SELECT * FROM client;

Now you have a simple way to connect and explore different databases in a very simple way. Hope you find it useful!