Whether you are a journalist, a researcher, or a data geek, getting started with large data sets usually means laborious work: setting up infrastructure, configuring an environment, learning unfamiliar tools, and coding complicated apps. With DC/OS, you can start crunching those numbers within minutes.
Let’s start with a concrete analysis problem and take the road safety data from Great Britain, 1979-2004. While the data set might seem small, some of the analysis may require distributed processing, so we need an environment that lets our processing jobs scale horizontally. To achieve this, we’ll run DC/OS on top of a cluster of virtual machines. We’ll use AWS EC2 in this scenario, but the same solution can be ported to other public and private clouds.
DC/OS sets up a cluster and deploys the pre-configured services needed to complete the task at hand. You don’t have to fully understand the complexity of the infrastructure or how to set it up; DC/OS helps you create the necessary abstractions. Once complete, you will have a running cluster with an interactive research notebook (a container with Jupyter Python Notebook and Apache Spark) and a distributed file system (HDFS), ready to tackle any large-scale data processing task.
Using the instructions from the DC/OS Dashboard, install the DC/OS CLI. You can use the bootstrap node for this.
Install the HDFS package with dcos package install hdfs.
HDFS starts its services slowly; eventually it will have nine services ready.
Get the JSON for Marathon with wget http://tc.pintostack.com/dcos-files/jupyter.json.
Run Jupyter with dcos marathon app add jupyter.json.
Step #5 - Create Python notebook.
Note that if namenode1.hdfs.mesos is in standby and you get an error message, try namenode2.hdfs.mesos instead.
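The manual fallback described in the note could also be automated. The following is only a sketch: the helper first_reachable and the list HDFS_CANDIDATES are hypothetical names introduced here, and the probe assumes that reading the first line through a standby NameNode raises an exception.

```python
def first_reachable(candidates, probe):
    """Return the first candidate URL for which probe(url) does not raise."""
    errors = []
    for url in candidates:
        try:
            probe(url)
            return url
        except Exception as exc:  # assumption: a standby NameNode shows up as an error
            errors.append("%s: %s" % (url, exc))
    raise RuntimeError("no reachable NameNode: " + "; ".join(errors))


# Both NameNode hosts from the note, with the port and file used in this post.
HDFS_CANDIDATES = [
    "hdfs://namenode1.hdfs.mesos:50071/Accidents7904.csv",
    "hdfs://namenode2.hdfs.mesos:50071/Accidents7904.csv",
]

# In the notebook (assumes a SparkContext named sc already exists):
# url = first_reachable(HDFS_CANDIDATES, lambda u: sc.textFile(u).first())
# text_file = sc.textFile(url)
```

Because textFile is lazy, the probe calls first() to force an actual read; without it, a failure would only surface later in the job.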
import time
from pyspark import SparkContext

sc = SparkContext()
start_time = int(round(time.time() * 1000))
# now we have a file
text_file = sc.textFile("hdfs://namenode1.hdfs.mesos:50071/Accidents7904.csv")
# getting the header as an array
header = text_file.first().split(",")
# position of the Date column, looked up once
date_idx = header.index('Date')
# getting data, skipping the header row
data = text_file \
    .map(lambda line: line.split(",")) \
    .filter(lambda w: w[date_idx] != 'Date')
# count rows per year; the year is the third "/"-separated part of the date
output = data.filter(lambda row: len(row[date_idx].strip().split("/")) == 3) \
    .map(lambda row: row[date_idx].strip().split("/")[2]) \
    .map(lambda word: (word, 1)) \
    .reduceByKey(lambda a, b: a + b) \
    .sortByKey(True) \
    .collect()
for (line, count) in output:
    print("%s : %i" % (line, count))
print("Duration is '%i' ms" % (int(round(time.time() * 1000)) - start_time))
%matplotlib inline
import matplotlib
import numpy as np
import matplotlib.pyplot as plt
# plot numeric years against numeric counts rather than strings
plt.plot([int(x[0]) for x in output], [x[1] for x in output])
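To see what the Spark job computes without a cluster, the same per-year count can be sketched in plain Python on a handful of rows. The function count_by_year and the sample rows (including every column name other than Date) are made up for illustration and are not taken from the real data set.

```python
from collections import Counter


def count_by_year(lines, date_column="Date"):
    """Count rows per year, taking the year as the third '/'-separated date part."""
    header = lines[0].split(",")
    date_idx = header.index(date_column)
    counts = Counter()
    for line in lines[1:]:  # skip the header row
        fields = line.split(",")
        if len(fields) <= date_idx:
            continue  # row too short to contain a date
        parts = fields[date_idx].strip().split("/")
        if len(parts) == 3:  # same validity check as the Spark filter
            counts[parts[2]] += 1
    return sorted(counts.items())  # ascending by year, like sortByKey(True)


# Hypothetical sample rows; only the Date column matters here.
sample = [
    "Accident_Index,Date,Number_of_Vehicles",
    "A1,18/01/1979,2",
    "A2,01/02/1979,1",
    "A3,04/01/1980,3",
    "A4,bad-date,1",
]
print(count_by_year(sample))  # [('1979', 2), ('1980', 1)]
```

The Counter plays the role of the map and reduceByKey pair: each valid row contributes 1 to its year, and the malformed date is dropped just as the length check drops it in the Spark version.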
Run the notebook. First you will notice new tasks in Mesos; these are the Spark executors:
Your Jupyter notebook will look like this:
As you’ve seen in this post, you can start containerized services in minutes. DC/OS gives you a complete environment and lets you focus on your problem rather than on routine deployment or service configuration adjustments.