On-demand Interactive Data Science with DC/OS

By Yuri Gubin
Chief Innovation Officer, DataArt

Traffic accidents in the UK, 1979-2004.

Whether you are a journalist, a researcher, or a data geek, starting to work with large data sets usually means laborious setup: provisioning infrastructure, configuring an environment, learning unfamiliar tools, and coding complicated apps. With DC/OS you can start crunching those numbers within minutes.

Let's start with the problem of analyzing a data set and take road safety data from Great Britain, 1979-2004. While the data set might seem small, some of the analysis may require distributed processing, so we want an environment that lets our processing jobs scale horizontally. To achieve this, we'll run a DC/OS cluster on top of a cluster of virtual machines. We'll use AWS EC2 in this scenario, but the same solution can be ported to other public and private clouds.

DC/OS sets up a cluster and deploys the pre-configured services needed to complete the task at hand. You don't have to fully understand the complexity of the infrastructure or how to set it up; DC/OS creates the necessary abstractions for you. Once complete, you will have a running cluster with an interactive research notebook (a container with Jupyter Python Notebook and Apache Spark) and a distributed file system (HDFS), ready to tackle any large-scale data processing task.

Step #1 - Set up your DC/OS cluster

Refer to the manual installation documentation and set up your cluster: http://docs.mesosphere.com/latest/administration/installing/custom/scripted-installer. You should have one bootstrap node, at least one master node, and at least five slave nodes. Use m3.xlarge instances to have enough memory for HDFS. You can also set up DC/OS with Amazon CloudFormation, as described at https://dcos.io/installing. IMPORTANT: if you've chosen an OS other than CoreOS, add a user named core to every Mesos slave.

Step #2 - Start HDFS

Following the instructions in the DC/OS Dashboard, install the DC/OS CLI; you can use the bootstrap node for this. Install the HDFS package with dcos package install hdfs. HDFS will gradually start all of its services, and eventually nine services will be ready.

Step #3 - Upload a file to HDFS

The file will be uploaded by a one-time task executed by Marathon. Download the task definition with wget http://tc.pintostack.com/dcos-files/uk-data-to-hdfs.json, then run it in Marathon with dcos marathon add uk-data-to-hdfs.json. This task downloads the archive http://data.dft.gov.uk/road-accidents-safety-data/Stats19-Data1979-2004.zip and saves its contents to HDFS.
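The JSON at that URL is not reproduced here, but a one-time Marathon task of this kind typically looks roughly like the following sketch. The cmd line, resource sizes, and target path are illustrative assumptions, not the contents of the actual file:

```json
{
  "id": "uk-data-to-hdfs",
  "cmd": "wget http://data.dft.gov.uk/road-accidents-safety-data/Stats19-Data1979-2004.zip && unzip Stats19-Data1979-2004.zip && hadoop fs -put Accidents7904.csv hdfs://namenode1.hdfs.mesos:50071/Accidents7904.csv",
  "cpus": 0.5,
  "mem": 1024,
  "instances": 1
}
```

Marathon normally restarts finished apps; a download-and-exit task like this is simply left to complete and can then be removed from Marathon.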

Step #4 - Run the Jupyter container

Get the app definition for Marathon with wget http://tc.pintostack.com/dcos-files/jupyter.json and run Jupyter with dcos marathon add jupyter.json.

Step #5 - Create a Python notebook

Note: if namenode1.hdfs.mesos is in standby and you get an error message, try namenode2.hdfs.mesos instead. Block 1:
from pyspark import SparkContext
sc = SparkContext()
Block 2:
import time

start_time = int(round(time.time() * 1000))
# load the accident data from HDFS
text_file = sc.textFile("hdfs://namenode1.hdfs.mesos:50071/Accidents7904.csv")
# the header row as an array of column names
header = text_file.first().split(",")
# split each line into fields and drop the header row
data = text_file \
    .map(lambda line: line.split(",")) \
    .filter(lambda w: w[header.index('Date')] != 'Date')
# count accidents per year: take the year part of each well-formed Date,
# map it to (year, 1) and reduce by key
output = data.filter(lambda row: len(row[header.index('Date')].strip().split("/")) == 3) \
    .map(lambda row: row[header.index('Date')].strip().split("/")[2]) \
    .map(lambda word: (word, 1)) \
    .reduceByKey(lambda a, b: a + b) \
    .sortByKey(True) \
    .collect()
for (line, count) in output:
    print("%s : %i" % (line, count))
print("Duration is '%i' ms" % (int(round(time.time() * 1000)) - start_time))
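If you want to sanity-check the aggregation logic without a cluster, the same per-year count can be reproduced in plain Python on a few sample rows. The rows below are made-up illustrations in the shape of the CSV, not real records from the data set:

```python
from collections import Counter

# made-up sample rows: a header line followed by data lines
lines = [
    "Index,Date,Severity",
    "1,14/01/1979,2",
    "2,03/07/1979,1",
    "3,21/11/1980,3",
    "4,bad-date,1",  # malformed date, filtered out just like in the Spark job
]

header = lines[0].split(",")
date_idx = header.index("Date")

rows = [line.split(",") for line in lines[1:]]
# keep rows whose Date splits into day/month/year, then take the year part
years = [
    row[date_idx].strip().split("/")[2]
    for row in rows
    if len(row[date_idx].strip().split("/")) == 3
]

# Counter plays the role of reduceByKey; sorted() plays the role of sortByKey
output = sorted(Counter(years).items())
print(output)  # [('1979', 2), ('1980', 1)]
```

The structure mirrors the Spark pipeline step for step, which makes it easy to verify the filtering and grouping rules before running them on the full file.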
Block 3:
%matplotlib inline

import matplotlib.pyplot as plt

# plot year on the x-axis and accident count on the y-axis;
# convert both to numbers so matplotlib treats them as numeric series
plt.plot([int(x[0]) for x in output], [int(x[1]) for x in output])
Run the notebook. First you will notice new tasks in Mesos: these are Spark executors. Your Jupyter notebook will show the per-year counts and the plot. As you've seen in this post, you can start containerized services in minutes. DC/OS gives you a complete environment and lets you focus on your problem rather than on routine deployment and service configuration.