Sustainable Spark Development

Sean McIntyre
Software Architect

The Problem

Uncharted in a nutshell:

  • Data Science
  • Machine Learning
  • Data cleaning and enrichment
  • Custom visual analytics software

Visualization is a broad term these days, and it often involves a lot of data processing to create something useful.

From a different perspective:

  • Data Scientists
  • Computer Scientists
  • Software Engineers
  • Dev Ops
  • ...and many more

Developing big, data-driven applications involves all of these folks!

The informal workflow
i.e. "what always happens"

  1. Data collection (Dev Ops/Engineers)
  2. Exploratory data analysis (Data scientist)
  3. Application development (Engineers)
  4. Algorithm optimization (Data/computer scientists)
  5. Deployment/Scaling (Engineers/Dev Ops)

Issues:

  • Everyone ends up touching Spark
  • Spark is hard to learn, and hard to use correctly
  • Scripts written for data analysis are productionized due to time constraints
  • Scripts are not sufficiently modular to reuse. Wheel reinventing ensues.

It became clear very quickly that this process is not sustainable.

What is Sustainability?

Priorities:

  • Retain the productivity/flexibility afforded by scripting
  • Balance with the need for modularity
  • Promote code reuse and composability
    "Hey, how did you parse that twitter data?"
  • Reduce barriers to entry for new team members
    people shouldn't have to be Spark experts to write efficient code

The Solution

Sparkpipe!

              
$ ./spark-shell \
  --packages software.uncharted.sparkpipe:sparkpipe-core:0.9.7
              
// import data
val df = sqlContext.read
  .format("json")
  .load("tweets.json.gz")

// add a column containing the words from each tweet's text
import scala.reflect.runtime.universe._
import org.apache.spark.sql.Column
import org.apache.spark.sql.functions.udf
val wordsColumn = udf {
  (text: String) => text.trim.split("\\s+?")
}(typeTag[Array[String]], typeTag[String])(new Column("text"))
val df2 = df.withColumn("tweet_words", wordsColumn)

// do other things (if you're not in tears)

There are things in the Spark API that are unintuitive

              
import software.uncharted.sparkpipe.Pipe
import software.uncharted.sparkpipe.ops.core.{dataframe => dfo}

val ingest = Pipe(sqlContext)
  .to(dfo.io.read(format = "json", path = "tweets.json.gz"))
  .to(dfo.addColumn(
    "tweet_words",
    (text: String) => text.trim.split("\\s+?"),
    "text"
  ))
val df = ingest.run

Sparkpipe aims to make them simpler and more semantically clear

              
import software.uncharted.sparkpipe.Pipe
import software.uncharted.sparkpipe.ops.community.{twitter => twops}

val ingest = Pipe(sqlContext)
  .to(twops.io.read(format = "json", path = "tweets.json.gz"))
  .to(twops.tweet.hashtags("tweet_hashtags"))

val df = ingest.run

More importantly, we want to reduce boilerplate
in domain-specific situations

The operation becomes the single modular unit that developers and data scientists can agree on.

              
// operations are just simple scala functions!
import org.apache.spark.sql.DataFrame

def myOperation(arg1: String, arg2: Int)(input: DataFrame): DataFrame = {
  /* do something with input, then return the resulting DataFrame */
  input
}

They should each do one thing, and do it well
(with well-defined inputs and outputs)
©UNIX :)
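
For instance, a small, single-purpose op might look like this (a minimal sketch with a hypothetical name, not an op shipped with Sparkpipe): configuration arguments come first and the DataFrame comes last, so the op curries cleanly into .to().

import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.{col, lower}

// hypothetical op: lowercase the contents of a single column, and nothing else
def lowercaseColumn(columnName: String)(input: DataFrame): DataFrame =
  input.withColumn(columnName, lower(col(columnName)))

// usage: it slots into a pipeline like any other op
// Pipe(sqlContext).to(dfo.io.read(format = "json", path = "tweets.json.gz"))
//                 .to(lowercaseColumn("text"))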

Advantages for Uncharted

  • Pipelines are highly scriptable!
  • Isolation from changes in the underlying Spark API
    (i.e. RDD -> DataFrame -> Dataset)
  • Hide the uglier parts of Spark
  • Pipelines separate scripts into modular, reusable bits without adding too much overhead
  • Productionize scripts, optimize later (see the sketch after this list)
  • Simple DSL encourages a specific format for functions, and separates structure from implementation details
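
A sketch of that "productionize now, optimize later" flow (filterByLanguage is a hypothetical op, shown only to illustrate the swap): the pipeline structure stays the same whether a step is a quick inline or a hardened, reusable op.

import org.apache.spark.sql.DataFrame

// while exploring: a quick scala inline does the job
val exploratory = Pipe(sqlContext)
  .to(twops.io.read(format = "json", path = "tweets.json.gz"))
  .to(_.filter("lang = 'en'"))

// later: swap the inline for a named, tested, reusable op
def filterByLanguage(lang: String)(input: DataFrame): DataFrame =
  input.filter(input("lang") === lang)

val productionized = Pipe(sqlContext)
  .to(twops.io.read(format = "json", path = "tweets.json.gz"))
  .to(filterByLanguage("en"))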

Separating structure from implementation details:

							
// linear
val ingest = Pipe(red).to(blue).to(green).to(yellow)
// branch (both analyses extend the same ingest pipe)
val analysis1 = ingest.to(something).to(somethingElse)
val analysis2 = ingest.to(anotherOp)
// merge
val output = Pipe(analysis1, analysis2).to(outputOp)
// non-linear pipelines are cool!

This makes even complex scripts easier to parse and share than the corresponding raw Spark code.

The Bottom Line

We're not trying to replace the underlying Spark API, so you generally won't find ops that wrap simple things like select().

							
Pipe(sqlContext)
  .to(twops.io.read(format = "json", path = "tweets.json.gz"))
  .to(_.select("text")) // you can use scala inlines anytime you want!
  .run

In particular, we're not trying to replace the SparkML Pipeline, though we do have ops for using it with Sparkpipe.
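
For example (a rough sketch using only the standard spark.ml API, not a dedicated Sparkpipe op), a spark.ml Pipeline can already be driven from inside a Sparkpipe pipeline with a plain scala inline:

import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.feature.Tokenizer

// a trivial spark.ml pipeline: tokenize the tweet text
val mlPipeline = new Pipeline().setStages(Array(
  new Tokenizer().setInputCol("text").setOutputCol("tweet_words")
))

val tokenized = Pipe(sqlContext)
  .to(twops.io.read(format = "json", path = "tweets.json.gz"))
  .to(d => mlPipeline.fit(d).transform(d)) // scala inline wrapping spark.ml
  .run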

We've released sparkpipe-core as open source, and it's available now from Maven Central.
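
To pull it into a project, the dependency looks like this (an sbt sketch, using the same coordinates as the spark-shell example above):

// build.sbt
libraryDependencies += "software.uncharted.sparkpipe" % "sparkpipe-core" % "0.9.7"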

We're also hard at work on domain-specific libraries, which will be open source as well.

  • sparkpipe-twitter-ops
  • sparkpipe-instagram-ops
  • sparkpipe-gdelt-ops
  • sparkpipe-salt-ops (visualization)
  • etc.

But I need your help!

Creating Sparkpipe has helped improve the productivity of developers within my company.

But it would be even more powerful if a community of domain-specific libraries developed around it!

Mo data? No problems.