On-demand Interactive Data Science with DC/OS

19 April 2016
By Yuri Gubin, Software Project Manager

Traffic accidents in the UK, 1979-2004.

Whether you are a journalist, a researcher or a data geek, getting started with large data sets usually means the laborious work of setting up infrastructure, configuring an environment, learning unfamiliar tools and writing complicated apps. With DC/OS you can start crunching those numbers within minutes.

Let’s start with a concrete problem: analyzing the road safety data from Great Britain, 1979-2004. While the data set might seem small, some of the analysis may require distributed processing, so we want an environment that lets our processing jobs scale horizontally. To achieve this, we’ll run a DC/OS cluster on top of a cluster of virtual machines. We’ll use AWS EC2 in this scenario, but the same solution can be ported to other public and private clouds.

DC/OS sets up a cluster and deploys the pre-configured services needed to complete the task at hand. You don’t have to understand the full complexity of the infrastructure or how to set it up; DC/OS creates the necessary abstractions for you. Once complete, you will have a running cluster with an interactive research notebook (a Jupyter Python notebook container with Apache Spark) and a distributed file system (HDFS), ready to tackle any large-scale data processing task.

Step #1 – Set up your DC/OS cluster

Refer to the scripted installer documentation at https://dcos.io/docs/latest/administration/installing/custom/scripted-installer and set up your cluster. You should have one bootstrap node, at least one master node and at least five slave nodes. Use m3.xlarge instances so that HDFS has enough memory. Alternatively, you can set up DC/OS with AWS CloudFormation as described at https://dcos.io/installing.

IMPORTANT: If you’ve chosen an OS other than CoreOS, please add the user core to every Mesos slave.
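CoreOS ships with a core user out of the box; on other distributions you can create it yourself. A minimal sketch, assuming a standard Linux with the useradd tool:

sudo useradd -m core   # create the 'core' user with a home directory

Run this on every Mesos slave before installing HDFS.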

Step #2 – Start HDFS

Install the DC/OS CLI using the instructions from the DC/OS Dashboard. You can use the bootstrap node for this.
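The Dashboard shows the exact commands for your cluster version. As a rough sketch for a Linux bootstrap node (the download URL below is an assumption; prefer the Dashboard snippet if it differs):

curl -o dcos https://downloads.dcos.io/binaries/cli/linux/x86-64/latest/dcos
chmod +x dcos && sudo mv dcos /usr/local/bin/
dcos config set core.dcos_url http://<master-ip>   # point the CLI at your master node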

Install the HDFS package with dcos package install hdfs.

HDFS starts its services one by one; eventually all nine of them will be ready.

[Screenshot: the nine HDFS services running in the DC/OS Dashboard]
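You can also watch the rollout from the command line; a quick check, assuming the CLI from Step #2 is configured:

dcos service   # lists registered frameworks; hdfs should report its running tasks

Once hdfs reports all of its tasks, you are ready to move on.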

Step #3 – Upload a file to HDFS

The upload is handled by a one-time task executed by Marathon.

Download the Marathon app definition with wget http://tc.pintostack.com/dcos-files/uk-data-to-hdfs.json.

Then run it in Marathon with dcos marathon app add uk-data-to-hdfs.json.

This task downloads the archive http://data.dft.gov.uk/road-accidents-safety-data/Stats19-Data1979-2004.zip and saves its contents to HDFS.
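For reference, a one-time Marathon task is just an app definition whose command exits once the work is done. The downloaded file is authoritative; a rough sketch of what such a definition might contain (the exact command and resource values here are assumptions, and the hadoop client is assumed to be available in the task's environment):

{
  "id": "uk-data-to-hdfs",
  "cmd": "wget http://data.dft.gov.uk/road-accidents-safety-data/Stats19-Data1979-2004.zip && unzip Stats19-Data1979-2004.zip && hadoop fs -put Accidents7904.csv hdfs://namenode1.hdfs.mesos:50071/Accidents7904.csv",
  "cpus": 1,
  "mem": 2048,
  "instances": 1
}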

Step #4 – Run the Jupyter container

Get the Marathon app definition with wget http://tc.pintostack.com/dcos-files/jupyter.json.

Run Jupyter with dcos marathon app add jupyter.json.
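Here too the downloaded jupyter.json is authoritative; a Dockerized Marathon app of this shape might look roughly like this (the image name and port mapping are assumptions):

{
  "id": "jupyter",
  "cpus": 1,
  "mem": 2048,
  "instances": 1,
  "container": {
    "type": "DOCKER",
    "docker": {
      "image": "jupyter/pyspark-notebook",
      "network": "BRIDGE",
      "portMappings": [
        { "containerPort": 8888, "hostPort": 0, "protocol": "tcp" }
      ]
    }
  }
}

With hostPort set to 0, Marathon picks a free host port; the Marathon UI shows the endpoint to open in your browser.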

Step #5 – Create a Python notebook

Note that if namenode1.hdfs.mesos is in standby and you get an error message, try namenode2.hdfs.mesos instead (a fallback helper sketch follows Block 2).

Block 1:

from pyspark import SparkContext
sc = SparkContext()

Block 2:

import time
start_time = int(round(time.time() * 1000))
# load the CSV we uploaded to HDFS
text_file = sc.textFile("hdfs://namenode1.hdfs.mesos:50071/Accidents7904.csv")
# the first line is the header; split it into column names
header = text_file.first().split(",")
# split every line into fields and drop the header row
data = text_file \
    .map(lambda line: line.split(",")) \
    .filter(lambda row: row[header.index('Date')] != 'Date')
# keep rows with a well-formed dd/mm/yyyy date, extract the year,
# and count accidents per year
output = data.filter(lambda row: len(row[header.index('Date')].strip().split("/")) == 3) \
    .map(lambda row: row[header.index('Date')].strip().split("/")[2]) \
    .map(lambda year: (year, 1)) \
    .reduceByKey(lambda a, b: a + b) \
    .sortByKey(True) \
    .collect()
for (year, count) in output:
    print("%s: %i" % (year, count))
print("Duration is '%i' ms" % (int(round(time.time() * 1000)) - start_time))

Block 3:

%matplotlib inline

import matplotlib.pyplot as plt

# plot accidents per year; cast the values to int so both axes are numeric
plt.plot([int(x[0]) for x in output], [int(x[1]) for x in output])

Run the notebook. First you will notice new tasks in Mesos; these are the Spark executors:

[Screenshot: Spark executor tasks appearing in the Mesos UI]

Your Jupyter notebook will look like this:

[Screenshots: the notebook with per-year accident counts and the resulting plot]

As you’ve seen in this post, you can start containerized services in minutes. DC/OS gives you a complete environment and lets you focus on your problem, not on routine deployment and service configuration.

