Design
The old Azkaban is a single piece of software. It is simple, but insufficient for the growing needs of Hadoop users.
For example, for a very long time there were only two built-in job types that could run Hadoop jobs -- java and pig -- and they provided only the bare minimum. This created a usability hell for people new to Hadoop, as they had to go to great lengths to discover which jars are needed to run mapreduce/pig/hive/...
To add more job types, such as Voldemort, Kafka, and Hive, or to upgrade the existing job types, one had to upgrade the whole package.
Supporting all of them as built-in types also ties Azkaban closely to specific Hadoop versions, which may not be backwards compatible, especially when it comes to security.
So we separated most of the advanced job types (everything beyond the JavaProcess type) out as plugins. The plugin job types are loaded on executor server startup, and each job type gets its own classloader. Therefore multiple plugins using different hadoop/pig/hive libraries can co-exist on the same Azkaban executor instance.
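As a minimal sketch of this idea (illustrative only, not the actual executor code; the directory layout is hypothetical), each job type directory could be wrapped in its own `URLClassLoader`:

```java
import java.io.File;
import java.net.URL;
import java.net.URLClassLoader;
import java.util.ArrayList;
import java.util.List;

public class JobTypeLoaderSketch {
    // Build an isolated classloader over the jars in one job type's
    // plugin directory (e.g. plugins/jobtypes/pig-0.10/lib).
    public static ClassLoader loaderFor(File jobTypeDir) throws Exception {
        List<URL> urls = new ArrayList<>();
        File[] jars = new File(jobTypeDir, "lib")
                .listFiles((dir, name) -> name.endsWith(".jar"));
        if (jars != null) {
            for (File jar : jars) {
                urls.add(jar.toURI().toURL());
            }
        }
        // The parent is the executor's own classloader; since the executor
        // does not bundle hadoop/pig/hive jars itself, each plugin resolves
        // those classes from its own jar list, which is how two plugins can
        // carry different library versions side by side.
        return new URLClassLoader(urls.toArray(new URL[0]),
                JobTypeLoaderSketch.class.getClassLoader());
    }
}
```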
The pig job types can have a number of libraries, such as datafu, piggybank, and avro, pre-registered with all the necessary jars included. This way a normal pig user only needs to provide the pig script and her own udf package. The same goes for data scientists and analysts who only want to *use* pig/hive and shouldn't have to learn java/Hadoop.
Adding a job type is also easy: just start up a new executor server and pair it with the web server. (See the azkaban2 wiki for how to make seamless upgrades.) In many cases this requires only new config files, without any code change. For example, creating a pig-0.10 type only requires another directory under jobtypes whose config file points to the pig-0.10 jars, as sketched below.
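For illustration, such a config-only plugin might look like the following (the paths and property values here are placeholders, loosely modeled on the jobtype plugin configs):

```properties
# plugins/jobtypes/pig-0.10/plugin.properties  (illustrative)
jobtype.class=azkaban.jobtype.HadoopPigJob

# Point the type at the pig-0.10 jars plus the pre-registered
# libraries, so users get datafu/piggybank/avro for free.
pig.home=/path/to/pig-0.10
jobtype.classpath=${pig.home}/pig-0.10.0.jar,${pig.home}/lib/*
```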
When the old Azkaban started out, Hadoop clusters did not have security turned on. Later, when security was turned on, the java and pig job types were patched so that Azkaban would do a keytab login and get a proxy user "doAs" to run the user program. The problem is that both the keytab location and the proxy user information are exposed to the user process.
The new Azkaban2 handles Hadoop security similarly to Oozie. For regular Hadoop/Pig jobs, it obtains delegation tokens from the name node and job tracker, and hands over only the tokens to the user process.
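As a rough sketch of the token-handover flow (using the Hadoop 2 `FileSystem` API; this is not the actual hadoopsecuritymanager code, and the principal and paths are placeholders), the executor logs in from its own keytab, fetches tokens on behalf of the proxy user, and exposes only a token file to the user process:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.security.Credentials;
import org.apache.hadoop.security.UserGroupInformation;

public class TokenHandoverSketch {
    public static void prefetchTokens(String proxyUser, Path tokenFile) throws Exception {
        final Configuration conf = new Configuration();
        // The executor logs in with its own keytab; the keytab itself
        // never leaves the executor process.
        UserGroupInformation.loginUserFromKeytab(
                "azkaban/host@EXAMPLE.COM", "/path/to/azkaban.keytab");
        UserGroupInformation proxyUgi = UserGroupInformation.createProxyUser(
                proxyUser, UserGroupInformation.getLoginUser());

        final Credentials creds = new Credentials();
        proxyUgi.doAs((java.security.PrivilegedExceptionAction<Void>) () -> {
            // Collect HDFS delegation tokens on behalf of the proxy user
            // (a job tracker token would be fetched analogously).
            FileSystem.get(conf).addDelegationTokens("azkaban", creds);
            return null;
        });

        // Only this file is handed to the user process, typically via the
        // standard HADOOP_TOKEN_FILE_LOCATION environment variable.
        creds.writeTokenStorageFile(tokenFile, conf);
    }
}
```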
The part that handles Hadoop security is hadoopsecuritymanager, which is included in every job type that needs to talk to a secured Hadoop cluster. This also relieves Azkaban itself from compiling against a specific Hadoop version, and makes it possible to use the same Azkaban and job type plugins, with different hadoopsecuritymanager builds, on Hadoop clusters of different versions.
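One way to picture that decoupling (hypothetical names; the real HadoopSecurityManager API differs) is a thin abstract layer that Azkaban core codes against, with the version-specific implementation loaded reflectively from the plugin's classpath:

```java
// Hypothetical sketch of the indirection, not the real API.
public abstract class HadoopSecurityManagerSketch {

    // Fetch delegation tokens for the proxy user and write them to
    // tokenFile; each implementation compiles against one concrete
    // Hadoop version and ships inside the job type plugin.
    public abstract void prefetchTokens(String proxyUser, java.io.File tokenFile)
            throws Exception;

    // Azkaban core loads the implementation by name from the job type's
    // own classloader, so core never links against any Hadoop version.
    public static HadoopSecurityManagerSketch load(String implClass, ClassLoader cl)
            throws Exception {
        return (HadoopSecurityManagerSketch) Class.forName(implClass, true, cl)
                .getDeclaredConstructor().newInstance();
    }
}
```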