spark-submit a Python application in cluster mode

To launch a Python Spark application in cluster mode, the application code must be distributed to the workers with the --py-files option of spark-submit. The approach I settled on is to package the .py files into a fat egg and keep the entry-point Python file outside of it. The entry-point file then makes the packaged code importable by adding the egg to the import path:

import sys
sys.path.insert(0, <name of egg file>)

A simple working example can be found here:

https://github.com/sparkfireworks/spark-submit-cluster-python
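The import mechanism itself can be tried without Spark. The following is a minimal sketch, assuming that an egg behaves like any zip archive placed on sys.path; the names demo_app.egg, demo_pkg and greet are hypothetical and are not taken from the linked repository.

```python
import os
import sys
import tempfile
import zipfile

# Hypothetical archive and package names, used only for illustration.
workdir = tempfile.mkdtemp()
archive = os.path.join(workdir, "demo_app.egg")

# An egg is a zip archive; this one holds a single package with one function.
with zipfile.ZipFile(archive, "w") as zf:
    zf.writestr(
        "demo_pkg/__init__.py",
        "def greet(name):\n    return 'hello ' + name\n",
    )

# Same idea as the line in the entry point file:
# put the archive first on the import path.
sys.path.insert(0, archive)

import demo_pkg  # resolved from inside the archive

message = demo_pkg.greet("spark")
print(message)
```

In a real job the archive would be the egg passed through --py-files, and the entry-point file would only need the sys.path.insert line.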

Checksum a file with md5sum

md5sum computes the MD5 hash of a file. The hash depends only on the file contents, so renaming the file leaves the hash unchanged.

$ md5sum README.md 
bb4955b6d0855ce20c30223273124dd7  README.md

To create a file containing hashes, do:

$ md5sum README.md > MD5SUM.txt

To verify that the file still matches the recorded hash (that is, it has not been tampered with):

$ md5sum -c MD5SUM.txt
README.md: OK
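The same check can be sketched in Python with the standard hashlib module. This is an illustration only: the helper md5_of_file and the two temporary file names are assumptions made for the example, not part of md5sum.

```python
import hashlib
import os
import tempfile


def md5_of_file(path, chunk_size=65536):
    """Return the hex MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


# Two hypothetical files with identical contents but different names.
workdir = tempfile.mkdtemp()
first = os.path.join(workdir, "README.md")
second = os.path.join(workdir, "RENAMED.md")
for path in (first, second):
    with open(path, "wb") as handle:
        handle.write(b"same contents\n")

# The digests match because only the contents are hashed.
print(md5_of_file(first) == md5_of_file(second))
```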

Scala foldLeft

foldLeft is a curried method with two parameter lists: the first takes an initial value, and the second takes an operation that combines the accumulated result with the next element of the sequence, going from left to right:

def foldLeft[B](z: B)(op: (B, A) ⇒ B): B

scala> val xs: List[Int] = List(1,2,3)
xs: List[Int] = List(1, 2, 3)

scala> xs.foldLeft(0){(acc, x) => acc + x}
res9: Int = 6

scala> xs.foldLeft(0)(_+_)
res10: Int = 6
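For readers more used to Python, the order of evaluation can be sketched as follows. The helper fold_left is hypothetical and written only for this comparison; functools.reduce with an initial value gives the same result.

```python
from functools import reduce


def fold_left(xs, z, op):
    """Combine the elements of xs from left to right, starting from z."""
    acc = z
    for x in xs:
        acc = op(acc, x)  # accumulator first, element second, as in (B, A) => B
    return acc


xs = [1, 2, 3]

total = fold_left(xs, 0, lambda acc, x: acc + x)
same_total = reduce(lambda acc, x: acc + x, xs, 0)

# Building a string instead of a number makes the left-to-right nesting visible.
trace = fold_left(xs, "z", lambda acc, x: "({} + {})".format(acc, x))

print(total, same_total, trace)
```

The trace string shows that the initial value is combined with the first element before anything else.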

Scala string operations

Create an empty string:

scala> val emptyString: String = ""
emptyString: String = ""

Concatenate strings:

scala> "London " + "city"
res1: String = London city

String length:

scala> val str: String = "London " + "city"
str: String = London city

scala> str.length()
res7: Int = 11

scala> str.size
res8: Int = 11

Multiline String:

scala> val str: String = """I am a multiline
     | String
     | In Scala""".stripMargin
str: String =
I am a multiline
String
In Scala

Parametrize a String:

scala> val name: String = "London"
name: String = London

scala> val str: String = s"""${name} city""".stripMargin
str: String = London city

Concatenate a Sequence of Strings:

scala> val xs: List[String] = List("To", "be", "or", "not", "to", "be")
xs: List[String] = List(To, be, or, not, to, be)

scala> xs.mkString(" ")
res1: String = To be or not to be

Pull all branches from Git

To create a local tracking branch for every remote branch of a git repository (assuming the remote is named origin) do:

$ git branch -r | grep -v '\->' | while read remote; do git branch --track "${remote#origin/}" "$remote"; done
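What the pipeline does to each line can be sketched in Python. This is only an illustration of the filtering and prefix stripping: the function name local_branch_names and the sample output are assumptions, and no git command is run.

```python
def local_branch_names(branch_r_output, remote="origin"):
    """Map each line of `git branch -r` output to (local name, remote ref)."""
    prefix = remote + "/"
    pairs = []
    for line in branch_r_output.splitlines():
        ref = line.strip()
        # Same role as grep -v '\->': skip the symbolic HEAD entry.
        if not ref or "->" in ref:
            continue
        # Same role as ${remote#origin/}: drop the leading remote name.
        name = ref[len(prefix):] if ref.startswith(prefix) else ref
        pairs.append((name, ref))
    return pairs


# Hypothetical output of `git branch -r`.
sample = "  origin/HEAD -> origin/master\n  origin/master\n  origin/feature/login\n"

pairs = local_branch_names(sample)
for name, ref in pairs:
    print("git branch --track {} {}".format(name, ref))
```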

Read a file in Python

To read a file in Python do:

with open("testfile.txt") as file:
    # Read the whole file and join its lines with spaces
    contents = file.read().replace("\n", " ")
    print(contents)

To read a file line by line in Python do:

with open("testfile.txt") as file:
    contents = file.readlines()
    for line in contents:
        # Each line keeps its trailing newline, so strip it before printing
        print(line.rstrip("\n"))
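A file object can also be iterated directly, which avoids loading every line into a list first. The following is a minimal sketch; the sample file is created in a temporary directory only so that the example runs on its own.

```python
import os
import tempfile

# Hypothetical sample file, created only so the sketch is runnable.
path = os.path.join(tempfile.mkdtemp(), "testfile.txt")
with open(path, "w") as out:
    out.write("first line\nsecond line\n")

numbered = []
with open(path) as file:
    # The file object yields one line at a time.
    for number, line in enumerate(file, start=1):
        numbered.append("{}: {}".format(number, line.rstrip("\n")))

print("\n".join(numbered))
```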

Switch Java version on Linux

To switch between multiple Java versions do:

$ sudo update-alternatives --config java

first post – podcasts I follow

I am an avid podcast consumer; these are some of my favorites: