Apache Pig: It goes to 0.11

After months of work, we are happy to announce the 0.11 release of Apache Pig. In this blog post, we highlight some of the major new features and performance improvements that were contributed to this release. A large chunk of the new features was created by Google Summer of Code (GSoC) students with supervision from the Apache Pig PMC, while the core Pig team focused on performance improvements, usability issues, and bug fixes. We encourage CS students to consider applying for GSoC in 2013 -- it’s a great way to contribute to open source software.

Pig users may also find Daniel Dai's presentation helpful; it includes code and output samples for the new operators.

New Features

DateTime Data Type

The DateTime data type has been added to make it easier to work with timestamps. You can now do date and time arithmetic directly in a Pig script and use UDFs such as CurrentTime, AddDuration, and WeeksBetween. PigStorage expects timestamps to be represented in the ISO 8601 format. Much of this work was done by Zhijie Shen as part of his GSoC project.

RANK Operator

The new RANK operator allows one to assign an ordinal number to every tuple in a relation. Used on its own, it numbers the tuples sequentially. One can also rank by a field value, in which case the relation is sorted by this field prior to ranks being assigned, and tuples with the same sort value get the same rank. A user can specify whether she wants the default rank (the ranks following a tie skip ahead, leaving gaps) or ‘DENSE’ rank (rank values stay consecutive, with no gaps after a tie). Much of this work was done by Allan Avendaño as part of his GSoC project.

A = load 'data' AS (f1:chararray,f2:int,f3:chararray);
   
DUMP A;
(David,1,N)
(Tete,2,N)
(Ranjit,3,M)
(Ranjit,3,P)
(David,4,Q)
(David,4,Q)
(Jillian,8,Q)
(JaePak,7,Q)
(Michael,8,T)
(Jillian,8,Q)
(Jose,10,V)

B = rank A;

dump B;
(1,David,1,N)
(2,Tete,2,N)
(3,Ranjit,3,M)
(4,Ranjit,3,P)
(5,David,4,Q)
(6,David,4,Q)
(7,Jillian,8,Q)
(8,JaePak,7,Q)
(9,Michael,8,T)
(10,Jillian,8,Q)
(11,Jose,10,V)

C = rank A by f1 DESC, f2 ASC DENSE;

dump C;
(1,Tete,2,N)
(2,Ranjit,3,M)
(2,Ranjit,3,P)
(3,Michael,8,T)
(4,Jose,10,V)
(5,Jillian,8,Q)
(5,Jillian,8,Q)
(6,JaePak,7,Q)
(7,David,1,N)
(8,David,4,Q)
(8,David,4,Q)
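
To make the difference between the default and DENSE ranking concrete, here is a small Python sketch (not Pig code, and not how Pig implements RANK) that reproduces the ordering used for C above. The function name rank_by and its dense flag are made up for illustration.

```python
# Same sample rows as relation A above.
rows = [
    ("David", 1, "N"), ("Tete", 2, "N"), ("Ranjit", 3, "M"), ("Ranjit", 3, "P"),
    ("David", 4, "Q"), ("David", 4, "Q"), ("Jillian", 8, "Q"), ("JaePak", 7, "Q"),
    ("Michael", 8, "T"), ("Jillian", 8, "Q"), ("Jose", 10, "V"),
]

def rank_by(rows, dense):
    """Rank rows by f1 DESC, f2 ASC; rows that tie on (f1, f2) share a rank."""
    # Two stable sorts: secondary key first, then primary key.
    ordered = sorted(rows, key=lambda r: r[1])
    ordered = sorted(ordered, key=lambda r: r[0], reverse=True)
    ranked, prev_key, rank = [], None, 0
    for position, row in enumerate(ordered, start=1):
        key = (row[0], row[1])
        if key != prev_key:
            # Dense: next consecutive value. Default: jump to the row position.
            rank = rank + 1 if dense else position
            prev_key = key
        ranked.append((rank,) + row)
    return ranked

dense_ranked = rank_by(rows, dense=True)
default_ranked = rank_by(rows, dense=False)
```

With dense=True the rank column matches the dump of C (1, 2, 2, 3, ...); with dense=False the rank after each tie skips ahead (1, 2, 2, 4, ...).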

CUBE and ROLLUP Operators

The new CUBE and ROLLUP operators, modeled on the equivalent SQL operators, provide the ability to easily compute aggregates over multi-dimensional data. Here is an example:

events = LOAD '/logs/events' USING EventLoader() AS (lang, country, app_id, event_id, total);
eventcube = CUBE events BY
  CUBE(lang, country), ROLLUP(app_id, event_id);
result = FOREACH eventcube GENERATE
  FLATTEN(group) as (lang, country, app_id, event_id),
  COUNT_STAR(cube), SUM(cube.total);
STORE result INTO 'cuberesult';

The CUBE operator produces all combinations of the cubed dimensions. The ROLLUP operator produces all levels of a hierarchical group: ROLLUP(country, region, city) will produce aggregates by (country, region, city), by (country, region), by (country), and a grand total, but not by country and city without region. When the two are used together, as in the above example, the output groups are the cross product of all groups generated by the cube and rollup operations. If there are m dimensions in the cube operation and n dimensions in the rollup operation, the overall number of combinations is (2^m) * (n+1); the example above yields (2^2) * (2+1) = 12. Detailed documentation is available in the CUBE Jira. This work was done by Prasanth Jayachandran as part of his GSoC project. He also did further work on optimizing the cubing computation to make it extremely scalable; this optimization will likely be added to the Pig 0.12 release.
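
The grouping arithmetic can be checked with a short Python sketch. This is only an enumeration of which dimension combinations get aggregated, not Pig's implementation; the helper names are hypothetical, and None stands in for a dimension that has been aggregated away.

```python
from itertools import product

def cube_groupings(dims):
    """All 2^m combinations: each dimension is either kept or replaced by None."""
    return [
        tuple(d if keep else None for d, keep in zip(dims, mask))
        for mask in product([True, False], repeat=len(dims))
    ]

def rollup_groupings(dims):
    """The n+1 hierarchical prefixes, from full detail down to the grand total."""
    n = len(dims)
    return [tuple(dims[:k]) + (None,) * (n - k) for k in range(n, -1, -1)]

def combined_groupings(cube_dims, rollup_dims):
    """Cross product of the cube groupings and the rollup groupings."""
    return [
        c + r
        for c, r in product(cube_groupings(cube_dims), rollup_groupings(rollup_dims))
    ]

geo = rollup_groupings(["country", "region", "city"])
groupings = combined_groupings(["lang", "country"], ["app_id", "event_id"])
```

Here geo holds four groupings and never contains ("country", None, "city"), while groupings has (2^2) * (2+1) = 12 entries for the example script.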

Groovy UDFs

Pig already supports UDFs written in JRuby and Jython. This release adds support for UDFs in Groovy, providing an easy bridge for converting between Groovy and Pig data types and for specifying output schemas via annotations. This work was contributed by Mathias Herberts.

Improvements

Performance improvement of in-memory aggregation

Pig 0.10 introduced in-memory aggregation for algebraic operators -- instead of relying on Hadoop combiners, which involve writing map outputs to disk and post-processing them to apply the combine function, Pig can optionally buffer up map outputs in memory and apply combiners without paying the IO cost of writing intermediate data out to platters.

While the initial implementation significantly improved performance of a number of queries, we found some corner cases where it actually hurt performance; furthermore, reserving a large chunk of memory for aggregation buffers can have negative effects on memory-intensive tasks. In Pig 0.11, we completely rewrote the partial aggregation operator to be much more efficient, and integrated it with Pig’s Spillable Memory Manager, so it no longer requires dedicated space on the task heap. This feature is still considered experimental and is off by default; you can turn it on by setting pig.exec.mapPartAgg to true. With these changes in place, Twitter was able to turn this option on by default for all Pig scripts they run on their clusters -- thousands of Map-Reduce jobs per day (they also dropped pig.exec.mapPartAgg.minReduction to 3, to be even more aggressive with this feature).
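
As a rough mental model of map-side partial aggregation, consider the Python sketch below. It is not Pig's operator: the sample_size parameter, the exact form of the reduction check, and the default values are assumptions made for illustration. The idea it shows is that partial results are buffered per key in memory, and buffering is abandoned if an initial sample shows that records are not collapsing by at least the minimum reduction factor.

```python
from collections import defaultdict

def map_side_aggregate(records, min_reduction=10, sample_size=1000):
    """Yield (key, partial_sum) pairs, pre-aggregating in memory when it pays off.

    min_reduction plays the role of pig.exec.mapPartAgg.minReduction;
    sample_size is a hypothetical knob, not a Pig setting.
    """
    buffer = defaultdict(int)
    seen = 0
    aggregating = True
    for key, value in records:
        if not aggregating:
            # Aggregation was judged not worthwhile: pass records straight through.
            yield key, value
            continue
        buffer[key] += value
        seen += 1
        if seen == sample_size and seen / len(buffer) < min_reduction:
            # Too many distinct keys: flush what we have and stop buffering.
            aggregating = False
            yield from buffer.items()
            buffer.clear()
    # Emit whatever is still buffered at the end of the input.
    yield from buffer.items()

# Few distinct keys: 10,000 records collapse to 5 partial sums.
skewed = list(map_side_aggregate((i % 5, 1) for i in range(10000)))
# All keys distinct: aggregation is switched off after the sample.
unique = list(map_side_aggregate((i, 1) for i in range(2000)))
```

Lowering min_reduction, as in the Twitter setting of 3 mentioned above, keeps buffering enabled for inputs that collapse less dramatically.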

Performance improvement related to Spillable management

Speaking of the SpillableMemoryManager -- it also saw some significant improvements. The default collection data structure in Pig is a “Bag”. Bags are spillable, meaning that if there is not enough memory to hold all the tuples in a bag in RAM, Pig will spill part of the bag to disk. This allows a large job to make progress, albeit slowly, rather than crashing from “out of memory” errors. The way this worked before Pig 0.11 was as follows:

Every time a Bag is created from the BagFactory, it is registered with the SpillableMemoryManager.
The SpillableMemoryManager (SMM) keeps a list of WeakReferences to Spillable objects.
Upon getting a notification that GC is about to happen, the SMM iterates through its list of WeakReferences, deleting ones that are no longer valid (pointing to null), and looking for the largest Spillable it can find. It then asks this Spillable to spill, and relies on the coming GC to free up spilled data.

Some users reported seeing large amounts of time taken up by traversing the WeakReference list kept by the SMM. A large WeakReference list affected both performance, since we had to iterate over large lists when GC was imminent, and memory, since each WeakReference adds 32 bytes of overhead on a 64-bit JVM. In Pig 0.11 we modified the Bag code so that instead of registering all bags in case they grow, we have Bags register themselves if their contents exceed 100KB, the logic being that a lot of bags will never reach this size, and would not be useful to spill anyway. This drastically reduced the amount of time and memory we spend on the SpillableMemoryManager.
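
The lazy registration scheme can be sketched in Python with the standard weakref module. This is an illustration of the idea, not Pig's Java code: the class and method names mirror the text, but the fixed per-tuple size estimate and the handle_low_memory hook are assumptions.

```python
import weakref

SPILL_REGISTRATION_THRESHOLD = 100 * 1024  # bytes; the 100KB figure from the text

class SpillableMemoryManager:
    def __init__(self):
        self._refs = []  # weak references, so the manager never keeps a bag alive

    def register(self, bag):
        self._refs.append(weakref.ref(bag))

    def tracked(self):
        return len(self._refs)

    def handle_low_memory(self):
        """Prune dead references, then ask the largest live bag to spill."""
        live = [(ref, ref()) for ref in self._refs]
        live = [(ref, bag) for ref, bag in live if bag is not None]
        self._refs = [ref for ref, _ in live]
        if not live:
            return None
        largest = max((bag for _, bag in live), key=lambda b: b.memory_size)
        largest.spill()
        return largest

class Bag:
    def __init__(self, manager, tuple_size=64):
        self._manager = manager
        self._tuple_size = tuple_size  # hypothetical fixed per-tuple size estimate
        self._tuples = []
        self._registered = False
        self.spilled = 0

    @property
    def memory_size(self):
        return len(self._tuples) * self._tuple_size

    def add(self, t):
        self._tuples.append(t)
        # Register lazily, only once the bag has grown past the threshold.
        if not self._registered and self.memory_size > SPILL_REGISTRATION_THRESHOLD:
            self._manager.register(self)
            self._registered = True

    def spill(self):
        self.spilled += len(self._tuples)  # a real bag would write these to disk
        self._tuples.clear()

manager = SpillableMemoryManager()
small_bags = [Bag(manager) for _ in range(100)]
for bag in small_bags:
    bag.add((1,))
big_bag = Bag(manager)
for i in range(2000):
    big_bag.add((i,))
```

After this runs, the manager tracks a single reference (big_bag) rather than 101, which is the saving the paragraph above describes.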

Improvements to AvroStorage and HBaseStorage

HBase:

Added the ability to set HBase scan maxTimestamp, minTimestamp and timestamp in HBaseStorage
Significant performance optimization for filters over many columns
Compatibility with HBase 0.94 + secure cluster

Avro:

AvroStorage can now optionally skip corrupt Avro files
Added support for recursively defined records
Added support for Avro 1.7.1
Better support for file globbing

Faster, leaner Schema Tuples

Pig uses a generic Tuple container object to hold a “row” of data. Under the covers, it’s simply a List