What's New in Apache Sqoop 1.4.0-incubating

Apache Sqoop recently celebrates its first incubator release, version 1.4.0-incubating.  There are several new features and improvements added in this release.  This post will cover some of those interesting changes.  Sqoop is currently undergoing incubation at The Apache Software Foundation.  More information on this project can be found at http://incubator.apache.org/sqoop.

Customized Type Mapping (SQOOP-342)

Sqoop is equipped with a default mapping from most SQL types to appropriate Java or Hive counterparts during import.  Even though, this one-mapping-fits-all approach might not be ideal in all scenarios considering a wide variety of data stores available today, not to mention there are certain vendor-specific SQL types that may not be covered by the default mapping.

To allow customized type mapping, two new arguments, map-column-java and map-column-hive, are introduced for changing mapping to Java and Hive, respectively.  The list of mapping is expected in the form of
=
, such as

$ sqoop import ... --map-column-java id=Integer,name=String

For the above example, the columns id and name will be mapped to Java Integer and String, respectively.

Boundary Query Support (SQOOP-331)

Sqoop uses a canned query (select min(), max() from

) to determine boundaries for creating splits in all cases by default.  This query may not always be the most optimal one however.  Hence, to provide flexibility for using different queries based on distinct usages, a new boundary-query argument is provided to take any arbitrary query returning two numeric columns for the same purpose of creating splits.

Date/Time Incremental Append (SQOOP-321)

Incremental import in Sqoop can be used to only retrieve those rows with the value of a check column beyond a certain threshold.  The threshold needs to be the maximum value of the check column (in append mode) or the timestamp (in lastmodified mode) at the end of last import.

Previously, in append mode, the check column has to be in numeric type.  If a date/time type is desired, the user has to manually select the maximum value out of the date/time column and then specify that value as the last-value argument in lastmodified mode instead.  As part of this release, now the check column can be in date/time type as well.

Composite Key Update (SQOOP-313)

By default, Sqoop export adds new records into a table by INSERT statements.  However, if any record is in conflict with an existing one due to table constraints (such as a unique key), the underlying INSERT statement will fail and the export process will fail.  If an existing record needs to be modified, the update-key argument can be specified and UPDATE statements will be used instead underneath.

Before this release, only a single column name can be specified in the update-key argument.  This column name will be used to determine the matching record(s) for update.  However, in many real world situations, multiple columns are required to identify the matching record(s).  Thus, starting from this release, a comma separated list of column names can be given as the update-key argument.

Mixed Update/Insert Export (SQOOP-327)

As mentioned, Sqoop export can only either insert (by default) or update (with the update-key argument) records into a table.  As a result, one issue is that if data are being inserted, they may cause constraint violations when they exist already.  Another issue is that if data are being updated, they may be silently ignored when there are no matching update keys found.  It lacks the functionality to both update those data with matching update keys and insert those without.

A new update-mode argument is introduced to resolve the above issues.  Its value can be either updateonly or allowinsert.  As the name suggests, the difference is those records without matching update keys are simply dropped when the value is updateonly or are inserted when the value is allowinsert.  Note that this feature is currently provided only for built-in Oracle connector.

IBM DB2 Support (SQOOP-329)

The extensible architecture used by Sqoop allows support for a data store to be added as a so-called connector.  By default, Sqoop comes with connectors for a variety of databases such as MySQL, PostgreSQL, Oracle, and SQL Server.  In addition, there are also third-party connectors available separately from various vendors for several other data stores, such Couchbase, VoltDB, and Netezza.  As part of this release, a new connector is provided to import and export data against IBM DB2 database.

The Final Chapter

If you are interested in learning more about the changes, a complete list for Sqoop 1.4.0-incubating can be found here.  You are also encouraged to give this new release a try.  Any help and feedback is more than welcome. For more information on how to report problems and to get involved, visit the Sqoop project website at http://incubator.apache.org/sqoop/.