Apache Sqoop is a tool for highly efficient data transfers between relational databases and the entire Hadoop ecosystem. One of the significant benefits of Sqoop is that it’s easy to use and can work with a variety of systems both inside and outside of the Hadoop ecosystem. With one tool, Sqoop, you can import or export data from all databases supporting the JDBC interface using the same command line arguments exposed by Sqoop. Additionally, Sqoop was designed in modular fashion, allowing you to plug in specialized additions to optimise transfers for particular database systems.
While some users of various projects within the Hadoop ecosystem use the words "connector" and "driver" interchangeably, these words mean completely different things in context of Sqoop. As both connector and driver are needed for every Sqoop invocation, we see a lot of confusion in the use and understanding of these concepts. This blog post will explain the difference between them and how Sqoop uses these concepts to transfer data between Hadoop and other systems.
The word driver in Sqoop refers to a JDBC Driver. JDBC is a standard Java API for accessing relational databases and some data warehouses. The Java language only prescribes what classes and methods this interface contains and the JDK does not have any default implementation. Each database vendor is responsible for writing their own implementation that will communicate with the corresponding database with its native protocol. As drivers are created by each database vendor, they are usually offered with restrictive licenses that prohibits them to be shipped with the Sqoop distribution. As a result each user needs to download the drivers separately and install them into Sqoop prior to its use.
Structured Query Language (SQL) is a programing language designed for communication with relational database systems. Even though that there is a standard prescribing how the language should look like almost every database has its own dialect of SQL. The basics are usually working the same across all databases, but some edge conditions might be implemented differently. Also as SQL is a very general query processing language, it might not be always the optimal way for importing or exporting data out of the database server.
Sqoop connectors allow Sqoop to overcome the differences in SQL dialects supported by various databases along with providing optimized data transfer. A connector is a pluggable piece that is used to fetch metadata about transferred data (columns, associated data types, …) and to drive the data transfer itself in the most efficient manner. The most basic connector that is shipped with Sqoop is called Generic JDBC Connector and, as the name suggests, it’s using only the JDBC interface for accessing metadata and transferring data. As a result this connector will work on most of the databases out of the box, but may not be the most optimal for your use case. Sqoop also ships with specialized connectors for MySQL, PostgreSQL, Oracle, Microsoft SQL Server, DB2 and Netezza, thus there is usually no need to download extra connectors to start transferring data. However there are special connectors available on the internet that can add support for additional database systems or improve performance of the built-in connectors.
Let’s take a closer look how Sqoop is creating the connection to the database to see how both connectors and drivers are used.
Depending on specified command line arguments and all available connectors Sqoop will try to load the one offering the best performance. This process will start by trying all extra manually downloaded connectors that are not usually shipped with Sqoop to see if they can be used. Each connector itself will get the opportunity to decide whether it’s suitable for the job or not. If there are no manually installed connectors or none can be used, Sqoop will check the JDBC URL (usually starting with jdbc:) to see if we are connecting to database for which a built-in special connector is available. For example for jdbc:mysql:// URL that is used for MySQL database, Sqoop will pick up the MySQL Connector that is optimized for MySQL and can take advantage of its features. Lastly, if no other technique was able to choose the connector, Sqoop will use the Generic JDBC Connector. If the selection mechanism does not suit your environment, you can use the argument --connection-manager with the class name of any arbitrary connector. However keep in mind that connectors are usually designed for one specific database vendor and thus for example the MySQL Connector won’t work on the PostgreSQL database.
After having the connector selected, the next step is to choose the JDBC driver. As most of the connectors are specialized for a given database and each database has usually only one JDBC driver available, it’s the connector which is determining which driver should be used. For example the MySQL connector will always use the MySQL JDBC Driver called Connector/J. The only exception is the Generic JDBC Connector that is not tied with any database, so it can’t determine what JDBC Driver should be used. As a result you have to supply the driver name with the--driver parameter on the command line.
However be careful! Using the --driver parameter will always force Sqoop to use the Generic JDBC Connector no matter whether there is more specialized connector available. For example if the MySQL specialized connector would be used because the URL is starting with jdbc:mysql://, specifying the --driver option will force Sqoop to use the generic connector instead. As a result in most cases you do not need to use the --driver option at all!
Finally when both connector and driver has been determined, Sqoop can open a connection to your database and kick off the process of transferring your data. Once the connector is identified by Sqoop, and once the connector has identified the JDBC driver it would like to use, the connection between the Sqoop client and the target database can now be established as shown in the diagram below. Please note that the picture is only describing the process of opening the initial connection and position of both connector and driver. It’s not describing all connections that Sqoop can open as for example the data transfer itself is done inside a MapReduce job and is not executed on the Sqoop client side.
Now that we have described all the pluggable pieces that Sqoop uses in order to open a connection to your database server, let us see where each piece lives. The connector is a Sqoop specific plugin, so it’s not surprising that it’s part of the Sqoop installation. In fact Sqoop ships with a variety of connectors out of the box! And various additional extra connectors can be downloaded from the internet and installed into Sqoop easily. See the installation instructions for each connector for more details on how to install the connector.
Even though drivers are database specific pieces, created and distributed by the various database vendors, they are not required to be available on the database server. Instead they need to be installed on the machine where Sqoop is executed. This is because these drivers are used by connectors to establish the connection with the database server. JDBC drivers are not shipped with Sqoop as they are released with incompatible licenses, as a result you have to download and install them manually. The drivers are usually available on database vendor web pages or are distributed as part of the database installation itself. Look into your db documentation for further details how to obtain the JDBC Driver if needed.
Both connectors and drivers need to be installed only on the machine where Sqoop is being executed. You do not need to install them on all nodes in your Hadoop cluster. Sqoop itself will propagate both whenever necessary.
I hope that this blog post has helped in understanding what connector and driver is referring to in Sqoop land and why both are needed in order to transfer data between relational databases and the Hadoop ecosystem. Happy Sqooping!