Sean McIntyre
Software Architect
Uncharted in a nutshell:
Visualization is a broad term these days, and it often involves a lot of data processing to create something useful.
From a different perspective:
Developing big, data-driven applications involves all of these folks!
It became clear very quickly that this process is not sustainable.
Sparkpipe!
$ ./spark-shell \
--packages software.uncharted.sparkpipe:sparkpipe-core:0.9.7
// import data
val df = sqlContext.read
  .format("json")
  .load("tweets.json.gz")

// add column containing words from tweet text
import scala.reflect.runtime.universe._
import org.apache.spark.sql.Column
import org.apache.spark.sql.functions.udf
val wordsColumn = udf {
  (text: String) => text.trim.split("\\s+?")
}(typeTag[Array[String]], typeTag[String])(new Column("text"))
val df2 = df.withColumn("tweet_words", wordsColumn)
// do other things (if you're not in tears)
There are things which are unintuitive in the Spark API
import software.uncharted.sparkpipe.Pipe
import software.uncharted.sparkpipe.ops.core.{dataframe => dfo}

val ingest = Pipe(sqlContext)
  .to(dfo.io.read(format = "json", path = "tweets.json.gz"))
  .to(dfo.addColumn(
    "tweet_words",
    (text: String) => text.trim.split("\\s+?"),
    "text"
  ))
val df = ingest.run
Sparkpipe aims to make them simpler and more semantically clear
import software.uncharted.sparkpipe.Pipe
import software.uncharted.sparkpipe.ops.community.{twitter => twops}

val ingest = Pipe(sqlContext)
  .to(twops.io.read(format = "json", path = "tweets.json.gz"))
  .to(twops.tweet.hashtags("tweet_hashtags"))
val df = ingest.run
More importantly, we want to reduce boilerplate in domain-specific situations.
The operation becomes the single modular unit that developers and data scientists can agree on.
import org.apache.spark.sql.DataFrame

// operations are just simple scala functions!
def myOperation(arg1: String, arg2: Int)(input: DataFrame): DataFrame = {
  /* do something */
}
They should each do one thing, and do it well
(with well-defined inputs and outputs)
© UNIX :)
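Since an op is just a curried Scala function (configuration first, data last), chaining ops is ordinary function composition. Here is a minimal, Spark-free sketch of that idea; the op names and the String payloads are hypothetical stand-ins for real DataFrame ops:

```scala
// Hypothetical stand-ins: real Sparkpipe ops take and return DataFrames,
// but the shape is the same — configure the op, then apply it to the data.
val toUpper: String => String = _.toUpperCase
def truncate(n: Int): String => String = s => s.take(n)

// Pipe(...).to(...) is conceptually just andThen on these functions.
val pipeline: String => String = toUpper.andThen(truncate(5))

println(pipeline("hello sparkpipe")) // HELLO
```

Because each op has well-defined inputs and outputs, it can be configured, tested, and reused independently of any particular pipeline.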
Separating structure from implementation details:
// linear
val ingest = Pipe(red).to(blue).to(green).to(yellow)

// branch
val analysis1 = pipe.to(something).to(somethingElse)
val analysis2 = pipe.to(anotherOp)

// merge
val output = Pipe(analysis1, analysis2).to(outputOp)

// non-linear pipelines are cool!
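To make the branch/merge semantics concrete, here is a plain-Scala sketch with hypothetical numeric ops (no Spark required): both branches consume the same upstream result, and the merge op receives both outputs together, much like `Pipe(analysis1, analysis2).to(outputOp)`:

```scala
// Hypothetical ops standing in for DataFrame transformations.
val upstream: Int => Int = _ * 2      // shared stage feeding both branches
val branchA: Int => Int = _ + 1
val branchB: Int => Int = _ - 1
val mergeOp: ((Int, Int)) => Int = { case (a, b) => a + b }

def runPipeline(input: Int): Int = {
  val base = upstream(input)               // computed once for both branches
  mergeOp((branchA(base), branchB(base)))  // fan-out, then fan-in
}

println(runPipeline(10)) // 40: base = 20, branches give 21 and 19
```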
This makes even complex scripts easier to parse and share than the corresponding raw Spark code.
We're not trying to replace the underlying Spark API, so you generally won't find ops that wrap simple things like select().
Pipe(sqlContext)
  .to(twops.io.read(format = "json", path = "tweets.json.gz"))
  .to(_.select("text")) // you can use scala inlines anytime you want!
  .run
In particular, we're not trying to replace the SparkML Pipeline, though we do have ops for using it with Sparkpipe.
We've released sparkpipe-core as open source, and it's available now from Maven Central.
We're also hard at work on domain-specific libraries which will also be open source.
sparkpipe-twitter-ops
sparkpipe-instagram-ops
sparkpipe-gdelt-ops
sparkpipe-salt-ops (visualization)
Creating sparkpipe has helped improve the productivity of developers within my company.
But it would be even more powerful if a community of domain-specific libraries developed around it!