SPARK-A-TON

Hacking A Ton of Spark

This one-day Spark-a-ton is an excellent opportunity to have fun while learning new things and contributing to an open source project with one of the leading Spark 2.0 experts – for free.

Duration (1 day/8 hours)

24.11.2016: 09:00 – 18:00 @ Poligon

Development Activities

Structured Streaming

  1. Developing a custom StreamSourceProvider (see the sketch after this list)
  2. Migrating TextSocketStream to SparkSession (currently uses SQLContext)
  3. Developing a Sink and a Source for Apache Kafka
  4. JDBC support (with PostgreSQL as the database)
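
A minimal sketch of item 1 to get you started, assuming the Spark 2.0 streaming-source API (DemoSourceProvider, its one-column schema, and the fabricated offsets are all illustrative; a real source would read offsets from an external system):

    package example

    import org.apache.spark.sql.{DataFrame, SQLContext}
    import org.apache.spark.sql.execution.streaming.{LongOffset, Offset, Source}
    import org.apache.spark.sql.sources.StreamSourceProvider
    import org.apache.spark.sql.types.{LongType, StructField, StructType}

    class DemoSourceProvider extends StreamSourceProvider {
      private val demoSchema = StructType(StructField("value", LongType) :: Nil)

      // Tells Spark the source's name and schema before any data flows.
      override def sourceSchema(
          sqlContext: SQLContext,
          schema: Option[StructType],
          providerName: String,
          parameters: Map[String, String]): (String, StructType) =
        ("demo", demoSchema)

      override def createSource(
          sqlContext: SQLContext,
          metadataPath: String,
          schema: Option[StructType],
          providerName: String,
          parameters: Map[String, String]): Source = new Source {
        private var highestOffset = 0L

        override def schema: StructType = demoSchema

        // Pretend exactly one new record arrives per trigger.
        override def getOffset: Option[Offset] = {
          highestOffset += 1
          Some(LongOffset(highestOffset))
        }

        // Return the rows in (start, end] as a DataFrame.
        override def getBatch(start: Option[Offset], end: Offset): DataFrame = {
          val from = start.collect { case LongOffset(n) => n + 1 }.getOrElse(0L)
          val to = end.asInstanceOf[LongOffset].offset
          sqlContext.range(from, to + 1).toDF("value")
        }

        override def stop(): Unit = ()
      }
    }

Once packaged, you can try it in spark-shell with spark.readStream.format("example.DemoSourceProvider").load().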

Spark SQL

  1. Creating a custom Encoder
  2. Custom format, i.e. spark.read.format(...) or spark.write.format(...)
  3. Multiline JSON reader / writer
  4. SQLQueryTestSuite – a brand-new Spark 2.0 facility for writing Spark SQL tests
  5. Investigate this Stack Overflow question about an exception when running GBT in Spark ML for CTR prediction: http://stackoverflow.com/questions/39073602/i-am-running-gbt-in-spark-ml-for-ctr-prediction-i-am-getting-exception-because
  6. ExecutionListenerManager (see the sketch after this list)
  7. (done) Developing a custom RuleExecutor and enabling it in Spark
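
For item 6, a minimal sketch of hooking into ExecutionListenerManager with a QueryExecutionListener; it assumes a running SparkSession named spark, and the println bodies are placeholders:

    import org.apache.spark.sql.execution.QueryExecution
    import org.apache.spark.sql.util.QueryExecutionListener

    spark.listenerManager.register(new QueryExecutionListener {
      // Called after a Dataset action completes successfully.
      override def onSuccess(funcName: String, qe: QueryExecution, durationNs: Long): Unit =
        println(s"$funcName took ${durationNs / 1e6} ms")

      // Called when a Dataset action fails.
      override def onFailure(funcName: String, qe: QueryExecution, exception: Exception): Unit =
        println(s"$funcName failed: ${exception.getMessage}")
    })

    spark.range(10).count()  // triggers onSuccess above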

Spark MLlib

  1. Creating a custom Transformer (see the sketch after this list)
    • Example: Tokenizer
    • Team: Jonatan + Kuba + the ladies (Justyna + Magda)
    • The challenge is persisting a Pipeline that contains the custom Transformer, then reading it back and using it.
  2. Spark MLlib 2.0 Activator
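
A minimal sketch of item 1, a custom Transformer built on UnaryTransformer (MyTokenizer is an illustrative name). Note that it mixes in no persistence support, which is exactly the save/read problem the task is about:

    import org.apache.spark.ml.UnaryTransformer
    import org.apache.spark.ml.util.Identifiable
    import org.apache.spark.sql.types.{ArrayType, DataType, StringType}

    // Splits a string column into lowercase words.
    class MyTokenizer(override val uid: String)
        extends UnaryTransformer[String, Seq[String], MyTokenizer] {

      def this() = this(Identifiable.randomUID("myTokenizer"))

      override protected def createTransformFunc: String => Seq[String] =
        _.toLowerCase.split("\\s+").toSeq

      override protected def validateInputType(inputType: DataType): Unit =
        require(inputType == StringType, s"Input must be StringType, got $inputType")

      override protected def outputDataType: DataType = ArrayType(StringType)
    }

    // Usage: new MyTokenizer().setInputCol("text").setOutputCol("words")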

Core

  1. Monitoring executors (metrics, e.g. memory usage) using SparkListener.onExecutorMetricsUpdate (see the sketch below).
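
A starting-point sketch, assuming the Spark 2.0 listener API (ExecutorMonitor is an illustrative name). In 2.0 the event mainly carries per-task accumulator updates, so deriving memory usage from them is part of the exercise:

    import org.apache.spark.scheduler.{SparkListener, SparkListenerExecutorMetricsUpdate}

    class ExecutorMonitor extends SparkListener {
      override def onExecutorMetricsUpdate(
          update: SparkListenerExecutorMetricsUpdate): Unit = {
        // Each entry is (taskId, stageId, stageAttemptId, accumulator updates).
        update.accumUpdates.foreach { case (taskId, stageId, _, accums) =>
          println(s"executor=${update.execId} stage=$stageId task=$taskId " +
            s"accumulators=${accums.size}")
        }
      }
    }

    // Register it in spark-shell with sc.addSparkListener(new ExecutorMonitor),
    // or via --conf spark.extraListeners=<fully-qualified class name>.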

Misc

  1. Develop a new Scala-only TCP-based Apache Kafka client
  2. Work on issues reported in TensorFrames.
  3. Review open issues in Spark’s JIRA and pick one to work on.

Trainer: Jacek Laskowski
An independent consultant who is passionate about software development and teaching people how to use Apache Spark, Scala, sbt, and Apache Kafka effectively (with a bit of Hadoop YARN, Apache Mesos, and Docker). He leads the Warsaw Scala Enthusiasts and Warsaw Spark meetups in Warsaw, Poland.

WORKSHOP II

Leapfrog your competition with Spark 2.0! - REGISTRATIONS CLOSED!

This two-day course is designed to teach developers how to implement data processing pipelines and analytics using Apache Spark. Developers will use hands-on exercises to learn the Spark Core, SQL/DataFrame, Streaming, and MLlib (machine learning) APIs. Developers will also learn about Spark internals and tips for improving application performance.

Duration (2 days/2 x 8 hours)

22.11.2016: 09:00 – 19:00 @ Hotel Slon (1st floor, room Club I)

23.11.2016: 09:00 – 19:00 @ Hotel Slon (1st floor, room Club I)

Objectives

After completing this course, you should:

  • Understand how to use the Spark Scala APIs to implement various data analytics algorithms for offline (batch-mode) and event-streaming applications
  • Understand Spark internals
  • Understand Spark performance considerations
  • Understand how to test and deploy Spark applications
  • Understand the basics of integrating Spark with Mesos, Hadoop, and Akka

Agenda

Day 1

Spark SQL – 4h

  • Dataset / SparkSession / Encoders / Schema / InternalRow (see the sketch after this list)
  • Aggregations, Window and Join Operators
  • Catalyst Query Optimizer
  • Thrift JDBC/ODBC Server — Spark Thrift Server (STS)
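
A small taste of the Dataset, Encoders, aggregation, and window topics, as you might paste into spark-shell (the Person case class and the data are illustrative; spark-shell already provides spark):

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.expressions.Window
    import org.apache.spark.sql.functions._

    case class Person(name: String, age: Int)

    val spark = SparkSession.builder.appName("demo").master("local[*]").getOrCreate()
    import spark.implicits._  // Encoders for case classes and the $"col" syntax

    // A strongly-typed Dataset backed by an implicitly derived Encoder.
    val people = Seq(Person("Ann", 30), Person("Bob", 25), Person("Eve", 30)).toDS

    people.groupBy($"age").count().show()  // aggregation
    people.withColumn("rank", rank().over(Window.orderBy($"age".desc))).show()  // window operator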

Spark MLlib – 4h

  • ML Pipeline API (see the sketch below)
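
A minimal sketch of the ML Pipeline API (the stages and column names are illustrative):

    import org.apache.spark.ml.Pipeline
    import org.apache.spark.ml.classification.LogisticRegression
    import org.apache.spark.ml.feature.{HashingTF, Tokenizer}

    // Three stages chained into a single reusable Pipeline.
    val tokenizer = new Tokenizer().setInputCol("text").setOutputCol("words")
    val hashingTF = new HashingTF().setInputCol("words").setOutputCol("features")
    val lr = new LogisticRegression().setMaxIter(10)

    val pipeline = new Pipeline().setStages(Array(tokenizer, hashingTF, lr))

    // Fitting yields a PipelineModel; training needs "text" and "label" columns.
    // val model = pipeline.fit(training)
    // model.transform(test).select("text", "prediction").show()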

Day 2

Spark MLlib – 1h

  • ML Pipeline API

Spark Streaming – 5h

  • Streaming Operators
  • Stateful Operators using mapWithState (see the sketch after this list)
  • Kafka Integration using Direct API
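
A sketch of a stateful running word count with mapWithState, assuming text lines arrive on a local socket (e.g. started with nc -lk 9999):

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, State, StateSpec, StreamingContext}

    val conf = new SparkConf().setAppName("stateful").setMaster("local[2]")
    val ssc = new StreamingContext(conf, Seconds(5))
    ssc.checkpoint("/tmp/checkpoint")  // mapWithState requires a checkpoint directory

    val words = ssc.socketTextStream("localhost", 9999)
      .flatMap(_.split("\\s+"))
      .map((_, 1))

    // Merge each batch's count into the running total kept in State.
    val spec = StateSpec.function { (word: String, one: Option[Int], state: State[Int]) =>
      val total = one.getOrElse(0) + state.getOption.getOrElse(0)
      state.update(total)
      (word, total)
    }

    words.mapWithState(spec).print()

    ssc.start()
    ssc.awaitTermination()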

Structured Streaming – 2h

  • Kafka Integration (see the sketch below)
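
A sketch of the Kafka source for Structured Streaming; it assumes the spark-sql-kafka-0-10 artifact on the classpath (available as of Spark 2.0.2) and a local broker with an events topic:

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder.appName("kafka-demo").master("local[*]").getOrCreate()
    import spark.implicits._

    // Kafka records arrive as binary key/value columns plus topic metadata.
    val records = spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "localhost:9092")
      .option("subscribe", "events")
      .load()

    val values = records.selectExpr("CAST(value AS STRING)").as[String]

    // Print each micro-batch to the console.
    val query = values.writeStream
      .format("console")
      .outputMode("append")
      .start()

    query.awaitTermination()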

Prerequisites 

  • Experience with Scala and sbt
  • Knowledge of Spark basics — RDDs, spark-shell, spark-submit
  • Experience with the entire lifecycle of a Spark application, from development (including sbt-assembly) to spark-submit (see the self-check sketch after this list)
  • Knowledge of how to run Spark Standalone
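
As a quick self-check on the last two items, a minimal sbt-assembly setup (project name and versions are illustrative):

    // project/plugins.sbt
    addSbtPlugin("com.eed3si9n" % "sbt-assembly" % "0.14.3")

    // build.sbt
    name := "my-spark-app"
    scalaVersion := "2.11.8"
    libraryDependencies += "org.apache.spark" %% "spark-sql" % "2.0.1" % "provided"

Running sbt assembly then produces a fat jar under target/scala-2.11/ that you pass to spark-submit, e.g. with --master spark://<host>:7077 against a Standalone cluster.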

Trainer: Jacek Laskowski
An independent consultant who is passionate about software development and teaching people how to use Apache Spark, Scala, sbt, and Apache Kafka effectively (with a bit of Hadoop YARN, Apache Mesos, and Docker). He leads the Warsaw Scala Enthusiasts and Warsaw Spark meetups in Warsaw, Poland.