Sustainable Spark Development

Sean McIntyre
Software Architect

The Problem

Uncharted in a nutshell:

  • Data Science
  • Machine Learning
  • Data cleaning and enrichment
  • Custom visual analytics software

Visualization is a broad term these days, and it often involves a lot of data processing to create something useful.

From a different perspective:

  • Data Scientists
  • Computer Scientists
  • Software Engineers
  • Dev Ops
  • ...and many more

Developing big, data-driven applications involves all of these folks!

The informal workflow
i.e. "what always happens"

  1. Data collection (Dev Ops/Engineers)
  2. Exploratory data analysis (Data scientist)
  3. Application development (Engineers)
  4. Algorithm optimization (Data/computer scientists)
  5. Deployment/Scaling (Engineers/Dev Ops)

Issues:

  • Everyone ends up touching Spark
  • Spark is hard to learn, and hard to use correctly
  • Scripts written for data analysis are productionized due to time constraints
  • Scripts are not sufficiently modular to reuse. Wheel reinventing ensues.

It became clear very quickly that this process is not sustainable.

What is Sustainability?

Priorities:

  • Retain the productivity/flexibility afforded by scripting
  • Balance with the need for modularity
  • Promote code reuse and composability
    "Hey, how did you parse that twitter data?"
  • Reduce barriers to entry for new team members
    people shouldn't have to be Spark experts to write efficient code

The Solution

Sparkpipe!

              
$ ./spark-shell \
  --packages software.uncharted.sparkpipe:sparkpipe-core:0.9.7
              
// import data
val df = sqlContext.read
  .format("json")
  .load("tweets.json.gz")

// add a column containing the words from each tweet's text
import scala.reflect.runtime.universe._
import org.apache.spark.sql.Column
import org.apache.spark.sql.functions.udf
val wordsColumn = udf {
  (text: String) => text.trim.split("\\s+?")
}(typeTag[Array[String]], typeTag[String])(new Column("text"))
val df2 = df.withColumn("tweet_words", wordsColumn)

// do other things (if you're not in tears)

There are things in the Spark API that are unintuitive

              
import software.uncharted.sparkpipe.Pipe
import software.uncharted.sparkpipe.ops.core.{dataframe => dfo}

val ingest = Pipe(sqlContext)
  .to(dfo.io.read(format = "json", path = "tweets.json.gz"))
  .to(dfo.addColumn(
    "tweet_words",
    (text: String) => text.trim.split("\\s+?"),
    "text"
  ))
val df = ingest.run

Sparkpipe aims to make them simpler and more semantically clear

              
import software.uncharted.sparkpipe.Pipe
import software.uncharted.sparkpipe.ops.community.{twitter => twops}

val ingest = Pipe(sqlContext)
  .to(twops.io.read(format = "json", path = "tweets.json.gz"))
  .to(twops.tweet.hashtags("tweet_hashtags"))

val df = ingest.run

More importantly, we want to reduce boilerplate
in domain-specific situations

The operation becomes the single modular unit that developers and data scientists can agree on.

              
// operations are just simple scala functions!
import org.apache.spark.sql.DataFrame

def myOperation(arg1: String, arg2: Int)(input: DataFrame): DataFrame = {
  /* do something with input, then return the resulting DataFrame */
  input
}

They should each do one thing, and do it well
(with well-defined inputs and outputs)
©UNIX :)
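
For instance, a small, single-purpose op might look like this (a minimal sketch with a hypothetical name, not an op shipped with Sparkpipe): configuration arguments come first and the DataFrame comes last, so the op curries cleanly into .to().

import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.{col, lower}

// hypothetical op: lowercase the contents of a single column, and nothing else
def lowercaseColumn(columnName: String)(input: DataFrame): DataFrame =
  input.withColumn(columnName, lower(col(columnName)))

// usage: it slots into a pipeline like any other op
// Pipe(sqlContext).to(dfo.io.read(format = "json", path = "tweets.json.gz"))
//                 .to(lowercaseColumn("text"))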

Advantages for Uncharted

  • Pipelines are highly scriptable!
  • Isolation from changes in the underlying Spark API
    (i.e. RDD -> DataFrame -> Dataset)
  • Hide the uglier parts of Spark
  • Pipelines separate scripts into modular, reusable bits without adding too much overhead
  • Productionize scripts, optimize later (see the sketch after this list)
  • Simple DSL encourages a specific format for functions, and separates structure from implementation details
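
A sketch of that "productionize now, optimize later" flow (filterByLanguage is a hypothetical op, shown only to illustrate the swap): the pipeline structure stays the same whether a step is a quick inline or a hardened, reusable op.

import org.apache.spark.sql.DataFrame

// while exploring: a quick scala inline does the job
val exploratory = Pipe(sqlContext)
  .to(twops.io.read(format = "json", path = "tweets.json.gz"))
  .to(_.filter("lang = 'en'"))

// later: swap the inline for a named, tested, reusable op
def filterByLanguage(lang: String)(input: DataFrame): DataFrame =
  input.filter(input("lang") === lang)

val productionized = Pipe(sqlContext)
  .to(twops.io.read(format = "json", path = "tweets.json.gz"))
  .to(filterByLanguage("en"))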

Separating structure from implementation details:

							
// linear
val ingest = Pipe(red).to(blue).to(green).to(yellow)
// branch (both analyses extend the same ingest pipe)
val analysis1 = ingest.to(something).to(somethingElse)
val analysis2 = ingest.to(anotherOp)
// merge
val output = Pipe(analysis1, analysis2).to(outputOp)
// non-linear pipelines are cool!

This makes even complex scripts easier to parse and share than the corresponding raw Spark code.

The Bottom Line

We're not trying to replace the underlying Spark API, so you generally won't find ops that wrap simple things like select().

							
Pipe(sqlContext)
  .to(twops.io.read(format = "json", path = "tweets.json.gz"))
  .to(_.select("text")) // you can use scala inlines anytime you want!
  .run

In particular, we're not trying to replace the SparkML Pipeline, though we do have ops for using it with Sparkpipe.
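
For example (a rough sketch using only the standard spark.ml API, not a dedicated Sparkpipe op), a spark.ml Pipeline can already be driven from inside a Sparkpipe pipeline with a plain scala inline:

import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.feature.Tokenizer

// a trivial spark.ml pipeline: tokenize the tweet text
val mlPipeline = new Pipeline().setStages(Array(
  new Tokenizer().setInputCol("text").setOutputCol("tweet_words")
))

val tokenized = Pipe(sqlContext)
  .to(twops.io.read(format = "json", path = "tweets.json.gz"))
  .to(d => mlPipeline.fit(d).transform(d)) // scala inline wrapping spark.ml
  .run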

We've released sparkpipe-core as open source, and it's available now from Maven Central.
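
To pull it into a project, the dependency looks like this (an sbt sketch, using the same coordinates as the spark-shell example above):

// build.sbt
libraryDependencies += "software.uncharted.sparkpipe" % "sparkpipe-core" % "0.9.7"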

We're also hard at work on domain-specific libraries, which will be open source as well.

  • sparkpipe-twitter-ops
  • sparkpipe-instagram-ops
  • sparkpipe-gdelt-ops
  • sparkpipe-salt-ops (visualization)
  • etc.

But I need your help!

Creating Sparkpipe has helped improve the productivity of developers within my company.

But it would be even more powerful if a community of domain-specific libraries developed around it!

Mo data? No problems.