Apache Drill 1.20 Released! New Connectors, Storage and Backward Compatibility with Hadoop 2
The Apache Drill PMC is pleased to announce the release of Apache Drill version 1.20! In addition to numerous bug fixes and usability improvements, the biggest themes for Drill 1.20 are new data and storage formats, backward compatibility with Hadoop 2, JDBC writer capability and significant improvements to the HTTP plugin. For most organizations, the backport to Hadoop 2 is the most significant change, as it allows companies stuck on old versions of Drill to update to the latest version.
The Drill team made a lot of internal improvements, but I’d like to highlight the improvements which will impact users, so here goes!
Backward Compatibility with Hadoop 2: You can now update!
In the last few months, we’ve seen a number of Drill users dealing with issues that date back to Drill 1.16. When Drill updated to Hadoop 3, it did not include backward compatibility with Hadoop 2, which meant that if you had a data lake with Hadoop 2, you were stuck with Drill 1.16. Well… no longer! Drill 1.20 includes a backport for Hadoop 2, so companies using Hadoop 2 can now take advantage of the last two years of Drill development. Big thanks to Deutsche Bahn Cargo for their assistance in working with us to test this functionality.
Drill and Apache Phoenix, Together at Last
Drill version 1.20 features a new connector for Apache Phoenix. Apache Phoenix is an open source, massively parallel, relational database engine supporting OLTP for Hadoop using Apache HBase as its backing store. Phoenix compiles queries and other statements into native NoSQL store APIs rather than using MapReduce, which enables low-latency applications to be built on top of NoSQL stores.
The connector lets users query and join Phoenix data directly from Drill. It features extensive pushdowns, so that as much of the work as possible is executed in Phoenix itself, which keeps queries efficient. Also noteworthy is that the connector supports user impersonation, which allows queries to run in Phoenix as the current Drill user.
Writing to JDBC Data Sources
Drill already supported writing data to Parquet, JSON, and a few other formats. Drill 1.20 adds the ability to write data to JDBC-compliant relational databases such as Oracle, MySQL, PostgreSQL and others.
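To give a rough idea of what this can look like, here is a small Python sketch that builds a CREATE TABLE AS statement targeting a JDBC storage plugin and wraps it in the JSON body that Drill's REST query endpoint accepts. The plugin name `mysql`, the schema `sales`, the table and file names, and the two helper functions are all hypothetical and only there for illustration. The snippet builds the request but does not send it.

```python
import json


def build_ctas(target_table: str, select_sql: str) -> str:
    """Return a CREATE TABLE AS statement that writes a query result to target_table."""
    return f"CREATE TABLE {target_table} AS {select_sql}"


def build_rest_payload(sql: str) -> str:
    """Wrap a SQL string in the JSON body used by Drill's REST query endpoint."""
    return json.dumps({"queryType": "SQL", "query": sql})


# Hypothetical names: a JDBC storage plugin called `mysql` with a schema `sales`,
# and a Parquet file in the dfs.tmp workspace as the source.
sql = build_ctas(
    "mysql.sales.orders_2021",
    "SELECT order_id, amount FROM dfs.tmp.`orders.parquet` WHERE amount > 100",
)
payload = build_rest_payload(sql)
print(payload)
```

Posting this body to a Drillbit's /query.json endpoint should run the statement, assuming the JDBC plugin in question has been configured to allow writes.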
New Data File Formats: Apache Iceberg, SAS
Apache Iceberg is a high-performance format for huge analytic tables. Iceberg brings the reliability and simplicity of SQL tables to big data, while making it possible for query engines like Drill to safely work with the same tables, at the same time. Iceberg is in widespread use in the modern data stack, and as of version 1.20 Drill supports querying Apache Iceberg tables directly. One of Iceberg’s really powerful features is dataset versioning. Drill allows you to query these datasets and also access the different versions with some specialized SQL syntax. As with almost all other formats in Drill, you can query this data without any schema preparation. The full documentation is available here: https://drill.apache.org/docs/iceberg-format-plugin/.
In addition to Iceberg, Drill 1.20 adds support for SAS files. According to Wikipedia, SAS is a statistical software suite developed by SAS Institute for data management, advanced analytics, multivariate analysis, business intelligence, criminal investigation, and predictive analytics. Drill can now query SAS files directly using standard SQL. With this addition, Drill can directly query all the file types that are supported by the popular Python pandas library.
API Query Improvements: OAuth 2.0 and Pagination
The Drill team made a number of very significant improvements to the HTTP plugin, which ultimately make it easier to access and work with data. The two most significant are OAuth integration and automatic pagination. Drill’s HTTP connector now supports APIs which use OAuth 2.0 for authorization. For a non-web developer, dealing with OAuth 2.0 is very complicated, as it involves obtaining tokens, refreshing these tokens and putting these tokens into HTTP headers. The good news is that Drill now handles this for you, and has been tested with Salesforce, Google Analytics, ClickUp, Workday, and others.
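To show the kind of bookkeeping involved, here is a minimal Python sketch of the generic OAuth 2.0 bearer-token routine: obtain a token, reuse it while it is valid, refresh it shortly before it expires, and put it into an Authorization header. This is not Drill's implementation. The `TokenManager` class, the fake token fetcher and the fake clock are all hypothetical stand-ins so the example runs without a real authorization server.

```python
import time
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional, Tuple


@dataclass
class Token:
    access_token: str
    expires_at: float  # absolute expiry time, in seconds


class TokenManager:
    """Caches an access token and refreshes it shortly before it expires."""

    def __init__(
        self,
        fetch_token: Callable[[], Tuple[str, float]],
        clock: Callable[[], float] = time.time,
        leeway: float = 30.0,
    ) -> None:
        self._fetch_token = fetch_token  # returns (access_token, lifetime_seconds)
        self._clock = clock
        self._leeway = leeway
        self._token: Optional[Token] = None

    def _current(self) -> Token:
        now = self._clock()
        # Fetch a new token if there is none yet or the cached one is about to expire.
        if self._token is None or now >= self._token.expires_at - self._leeway:
            access_token, lifetime = self._fetch_token()
            self._token = Token(access_token, now + lifetime)
        return self._token

    def auth_header(self) -> Dict[str, str]:
        """Return the header an OAuth 2.0 protected API expects on each request."""
        return {"Authorization": f"Bearer {self._current().access_token}"}


# Hypothetical stand-ins for an authorization server and a wall clock.
issued: List[str] = []


def fake_fetch() -> Tuple[str, float]:
    issued.append(f"token-{len(issued) + 1}")
    return issued[-1], 3600.0  # each token is valid for one hour


now = [0.0]
manager = TokenManager(fake_fetch, clock=lambda: now[0])

first = manager.auth_header()   # no token yet, so one is fetched
now[0] = 100.0
second = manager.auth_header()  # still valid, so the cached token is reused
now[0] = 3590.0
third = manager.auth_header()   # inside the 30-second leeway, so it is refreshed
print(first, second, third)
```

With the HTTP plugin, none of this has to be written by the person running the query.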
Another great feature which we’ve added to the HTTP/API plugin is automatic pagination. Many APIs use pagination as a way of limiting the amount of data returned per request. As an example, the GitHub API limits you to 100 results per API call. With the Drill pagination feature, you can configure Drill to make API calls in series, so that if a user requests 200 records, Drill will execute two API calls to retrieve all the desired data. This process is completely invisible to the user, so from the user’s perspective they can just query paginated APIs and get their data.
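The arithmetic behind that example can be sketched in a few lines of Python. This is only an illustration of offset-based pagination, not Drill's code: the `fetch_records` helper and the fake API (250 records, capped at 100 per call) are hypothetical.

```python
import math
from typing import Callable, Dict, List, Tuple


def calls_needed(requested: int, page_size: int) -> int:
    """Number of API calls required to collect `requested` records."""
    return math.ceil(requested / page_size)


def fetch_records(
    fetch_page: Callable[[int, int], List[Dict[str, int]]],
    requested: int,
    page_size: int,
) -> List[Dict[str, int]]:
    """Call fetch_page(offset, limit) in series until enough records are collected."""
    records: List[Dict[str, int]] = []
    offset = 0
    while len(records) < requested:
        limit = min(page_size, requested - len(records))
        page = fetch_page(offset, limit)
        if not page:  # the API has no more data
            break
        records.extend(page)
        offset += len(page)
    return records


# Hypothetical API: 250 records in total, at most 100 returned per call.
DATA = [{"id": i} for i in range(250)]
calls: List[Tuple[int, int]] = []


def fake_fetch_page(offset: int, limit: int) -> List[Dict[str, int]]:
    calls.append((offset, limit))
    return DATA[offset:offset + min(limit, 100)]


rows = fetch_records(fake_fetch_page, requested=200, page_size=100)
print(len(rows), calls, calls_needed(200, 100))
```

Asking for 200 records with a page size of 100 results in two calls, one at offset 0 and one at offset 100, and the caller only ever sees the combined list.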
The Drill PMC wants to thank everyone who contributed to this release. We’re looking forward to continuing the work on Drill 2.0.