Spark Streaming in Action

An introduction to Spark Streaming through demonstration

SimpleApp.scala


import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._
import org.apache.spark.SparkConf

object SimpleApp {
  def main(args: Array[String]): Unit = {
    val logFile = "YOUR_SPARK_HOME/README.md" // Should be some file on your system
    val conf = new SparkConf().setAppName("Simple Application")
    val sc = new SparkContext(conf)
    val logData = sc.textFile(logFile, 2).cache()
    val numAs = logData.filter(line => line.contains("a")).count()
    val numBs = logData.filter(line => line.contains("b")).count()
    println("Lines with a: %s, Lines with b: %s".format(numAs, numBs))
  }
}

https://spark.apache.org/docs/latest/quick-start.html

Spark Demo

"All community-contributed content on Stack Exchange is licensed under the Creative Commons BY-SA 3.0 license. As part of our commitment to that, we release a quarterly dump of all user-contributed data (after carefully sanitizing it to protect user private data, of course)." https://archive.org/details/stackexchange
StackOverflow.com data:


$ du stackoverflow.com*/*.xml --total --block-size=G
1G      stackoverflow.com-Badges/Badges.xml
8G      stackoverflow.com-Comments/Comments.xml
46G     stackoverflow.com-PostHistory/PostHistory.xml
1G      stackoverflow.com-PostLinks/PostLinks.xml
29G     stackoverflow.com-Posts/Posts.xml
1G      stackoverflow.com-Tags/Tags.xml
1G      stackoverflow.com-Users/Users.xml
7G      stackoverflow.com-Votes/Votes.xml
90G     total

- Every quarter, the Stack Exchange network releases all of its data for all of its sites under a Creative Commons license
- The dataset contains all Stack Exchange site posts, comments, votes, tags, etc.
- It's scrubbed of all personally identifying information
- The dataset is available as a direct download or a BitTorrent from the Internet Archive; it's about 21 GB compressed
- It's serialized as XML with one record per line
- The entire posts dataset comes out to 29GB, but for my demo I'll just be using a subset of data that's around 5GB (1 million posts)
- Not a huge dataset, but for my limited computational resources it does the trick!
- *Briefly show Stack Analysis app in IntelliJ.*
- *Run job from console with small subset of data that can return relatively fast (10 million rows?). Turn off verbose debugging so there's not piles of loglines going by.. talk over a few of the important lines*
- :paste some commands to the console if there's time to demonstrate the REPL

val inputFile = "data/stackexchange/stackoverflow.com-Posts/Posts1m-tail.xml"
val rows = sc.textFile(inputFile)
val sqs = StackAnalysis.scalaQuestions(rows)
sqs.count()

- ask audience to guess what the top 3 tags were?

// top 10 co-occurring tags
StackAnalysis.tagCounts(sqs).take(10).foreach(println)

// scala questions by month
StackAnalysis.scalaQuestionsByMonth(sqs).foreach(println)
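The StackAnalysis source is not shown in these notes, so the following is only a rough sketch of the kind of per-row logic that scalaQuestions and tagCounts might rely on. It assumes each line of Posts.xml is a single <row ... /> element whose Tags attribute holds XML-escaped tags such as &lt;scala&gt;&lt;sbt&gt;. The object name StackAnalysisSketch and all of its methods are hypothetical, and it uses plain Scala only so it can be tried without a cluster.

```scala
// Hypothetical helper, not the StackAnalysis app from the demo.
object StackAnalysisSketch {
  // Assumed row layout: <row Id="..." ... Tags="&lt;scala&gt;&lt;sbt&gt;" ... />
  private val TagsAttr = " Tags=\"([^\"]*)\"".r
  private val Tag = "&lt;(.+?)&gt;".r

  // Extract the tag names from one line of Posts.xml.
  // Rows without a Tags attribute (e.g. answers) yield an empty list.
  def tagsOf(row: String): List[String] =
    TagsAttr.findFirstMatchIn(row)
      .map(m => Tag.findAllMatchIn(m.group(1)).map(_.group(1)).toList)
      .getOrElse(Nil)

  // A row counts as a Scala question if it carries the "scala" tag.
  def isScalaQuestion(row: String): Boolean =
    tagsOf(row).contains("scala")

  // Tags that appear alongside "scala" on the same question.
  def coOccurringTags(row: String): List[String] =
    tagsOf(row).filterNot(_ == "scala")
}
```

With an RDD of lines, these could be plugged in as rows.filter(StackAnalysisSketch.isScalaQuestion), and the co-occurring tags counted with flatMap(StackAnalysisSketch.coOccurringTags).map(t => (t, 1)).reduceByKey(_ + _).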

Confluent's Avro Support

Kafka Avro Serialization library


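// The serializer is published to Confluent's Maven repository, not Maven Central
resolvers += "confluent" at "https://packages.confluent.io/maven/"

libraryDependencies += "io.confluent" % "kafka-avro-serializer" % "1.0.1"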

Initialization


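import java.util.Properties
import io.confluent.kafka.serializers.KafkaAvroSerializer
import org.apache.kafka.clients.producer.KafkaProducer

val props = new Properties()

// Kafka broker and Schema Registry endpoints (local defaults)
props.put("bootstrap.servers", "localhost:9092")
props.put("schema.registry.url", "http://localhost:8081")
// Serialize both keys and values as Avro
props.put("value.serializer", classOf[KafkaAvroSerializer].getName)
props.put("key.serializer", classOf[KafkaAvroSerializer].getName)

val producer = new KafkaProducer[Object, Object](props)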