Apache Spark as Cross-Over Hit for Data Science

TRANSCRIPT

  • 1 Apache Spark as Cross-over Hit for Data Science Sean Owen / Director of Data Science / Cloudera
  • Investigative vs Operational Analytics 2
  • Tools of the Trade 3
  • Trade-offs of the Tools 4
      Data:     Investigative = historical subset / sample, on a workstation; Operational = production data, on a large-scale shared cluster
      Context:  Investigative = ad hoc investigation, offline; Operational = continuous operation, online
      Metrics:  Investigative = accuracy; Operational = throughput, QPS
      Library:  Investigative = many, sophisticated; Operational = few, simple
      Language: Investigative = scripting, high level, ease of development; Operational = systems language, performance
  • R 5 (the trade-off table from slide 4, repeated with R's position highlighted)
  • Python + scikit 6 (the trade-off table repeated, with Python + scikit's position highlighted)
  • MapReduce, Crunch, Mahout 7 (the trade-off table repeated, with the MapReduce stack's position highlighted)
  • Spark: Something For Everyone 8
      From UC Berkeley; now an Apache TLP
      Scala-based: expressive, efficient, JVM-based
      Scala-like API: distributed works like local; collection-like, as in Crunch
      REPL: interactive
      Distributed, Hadoop-friendly: integrates with where data already is; ETL no longer separate
      MLlib
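As a minimal sketch of what "collection-like" and REPL-driven mean in practice (the path and word-count example here are made up, not from the talk), RDD operations compose like local Scala collection operations but execute across the cluster:

      // In spark-shell, sc is a ready-made SparkContext.
      val lines = sc.textFile("hdfs:///tmp/example.txt")   // hypothetical path
      val counts = lines
        .flatMap(_.split("\\s+"))      // split lines into words
        .map(word => (word, 1))        // pair each word with a count of 1
        .reduceByKey(_ + _)            // sum counts per word, distributed
      counts.take(5).foreach(println)  // pull a few results back to the REPL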
  • Spark 9 (the trade-off table repeated once more, with Spark highlighted as spanning both the investigative and operational columns)
  • Stack Overflow Tag Recommender Demo 10
      Questions have tags like java or mysql; recommend new tags to questions
      Available as a data dump (Jan 20 2014): Posts.xml, 24.4GB
      2.1M questions, 9.3M tag occurrences (34K unique tags)
  • 11 (image-only slide)
  • Stack Overflow Tag Recommender Demo 12
      CDH 5.0.1, Spark 0.9.0, standalone mode
      Install libgfortran (required by MLlib's jblas-based linear algebra)
      1 master, 5 workers; 24 cores, 64GB RAM
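For reference, wiring a program to a standalone-mode cluster like this one looks roughly as follows (the master hostname and app name are hypothetical; spark-shell builds the context for you from its MASTER setting):

      import org.apache.spark.SparkContext

      // 7077 is the standalone master's default port; the hostname is made up.
      val sc = new SparkContext("spark://master-host:7077", "TagRecommenderDemo")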
  • 13 (image-only slide)
  • 14
      val postsXML = sc.textFile(
        "hdfs:///user/srowen/SparkDemo/Posts.xml")
      postsXML: org.apache.spark.rdd.RDD[String] =
        MappedRDD[13] at textFile at <console>:15

      postsXML.count
      ...
      res1: Long = 18066983
  • 15
      (4,"c#")
      (4,"winforms")
      ...
      (4,3104,1.0)
      (4,2148819,1.0)
      ...
  • 16
      val postIDTags = postsXML.flatMap { line =>
        // Matches Id="..." ... Tags="..." in a row of the dump
        val idTagRegex = "Id=\"(\\d+)\".+Tags=\"([^\"]+)\"".r
        // Tags are HTML-escaped in the dump: &lt;tag&gt;
        val tagRegex = "&lt;([^&]+)&gt;".r
        idTagRegex.findFirstMatchIn(line) match {
          case None => None
          case Some(m) => {
            val postID = m.group(1).toInt
            val tagsString = m.group(2)
            val tags = tagRegex.findAllMatchIn(tagsString)
              .map(_.group(1)).toList
            // Keep only questions with at least 4 tags
            if (tags.size >= 4) tags.map((postID,_)) else None
          }
        }
      }
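To make the parsing concrete, here is a small sketch of how those two regexes act on one row (this sample line is fabricated, but follows Posts.xml's row format, where tags sit HTML-escaped inside the Tags attribute):

      val line = """<row Id="4" PostTypeId="1" Tags="&lt;c#&gt;&lt;winforms&gt;&lt;type-conversion&gt;&lt;decimal&gt;" />"""
      val idTagRegex = "Id=\"(\\d+)\".+Tags=\"([^\"]+)\"".r
      val tagRegex = "&lt;([^&]+)&gt;".r
      for (m <- idTagRegex.findFirstMatchIn(line)) {
        println(m.group(1))   // 4
        println(tagRegex.findAllMatchIn(m.group(2)).map(_.group(1)).toList)
        // List(c#, winforms, type-conversion, decimal)
      }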
  • 17
      // Hash tags to nonnegative Ints, since MLlib's ALS takes Int IDs
      def nnHash(tag: String) = tag.hashCode & 0x7FFFFF
      var tagHashes = postIDTags.map(_._2).distinct.map(tag => (nnHash(tag),tag))

      import org.apache.spark.mllib.recommendation._
      // Each (question, tag) pair becomes an implicit "rating" of strength 1.0
      val alsInput = postIDTags.map(t => Rating(t._1, nnHash(t._2), 1.0))
      // Factorization rank 40, 10 iterations
      val model = ALS.trainImplicit(alsInput, 40, 10)
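A note on nnHash, not from the slides: it keeps only the low 23 bits of hashCode so IDs stay nonnegative, which means distinct tags can collide. A birthday-bound estimate with 34K unique tags in 2^23 ≈ 8.4M buckets gives roughly 34,000² / (2 × 2^23) ≈ 69 expected collisions, a tolerable fraction. A quick hedged check, using the tagHashes RDD defined above:

      // Buckets holding more than one distinct tag are collisions.
      val collisions = tagHashes
        .groupByKey()              // hash -> all tags sharing that hash
        .filter(_._2.size > 1)
        .count()
      println("colliding hash buckets: " + collisions)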
  • 18 (image-only slide)
  • 19
      def recommend(questionID: Int, howMany: Int = 5): Array[(String, Double)] = {
        // Score this question against every candidate tag hash
        val predictions = model.predict(
          tagHashes.map(t => (questionID,t._1)))
        // Keep the highest-scoring tags
        val topN = predictions.top(howMany)(Ordering.by[Rating,Double](_.rating))
        // Map hashes back to tag strings
        topN.map(r => (tagHashes.lookup(r.product)(0), r.rating))
      }
      recommend(7122697).foreach(println)
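One practical note, not from the slides: recommend both scores against tagHashes and calls lookup on it, so if it will be invoked repeatedly it is worth caching that RDD first:

      tagHashes.cache()   // keep the hash -> tag mapping in memory across calls
      recommend(7122697, 3).foreach(println)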
  • 20
      (sql,0.1666023080230586)
      (database,0.14425980384610013)
      (oracle,0.09742911781766687)
      (ruby-on-rails,0.06623183702418671)
      (sqlite,0.05568507618047555)

      For question 7122697 (stackoverflow.com/questions/7122697/how-to-make-substring-matching-query-work-fast-on-a-large-table), actually tagged postgresql, query-optimization, substring, text-search:
      "I have a large table with a text field, and want to make queries to this table, to find records that contain a given substring, using ILIKE. It works perfectly on small tables, but in my case it is a rather time-consuming operation, and I need it work fast, because I use it in a live-search field in my website. Any ideas would be appreciated..."
  • 21
      blog.cloudera.com/blog/2014/03/why-apache-spark-is-a-crossover-hit-for-data-scientists/
      goo.gl/4K5YEI
      sowen@cloudera.com