CrossClj

2.1.3 docs

SourceDocs



RECENT

    sparkling

    Clojars

    Apr 10, 2018


    Readme

    Index of all namespaces


    « Project + dependencies

    A Clojure Library for Apache Spark

    sparkling.sqlDocsSource
    spark sql api for clojure. As sparkling.core, pass the sql context as last parameter.
    Read json or write json like sparkling.core/text-file or save-as-text-file.
    sparkling.sql.typesDocsSource
    allows for creation of dataframes from an existing rdd, steps:
    1. turn rdd into rows using create row
    2. create struct type passing in representation of the schema
    3. pass rdd of rows and struct type to sparkling.sql/data-frame
    The README below is fetched from the published project artifact. Some relative links may be broken.

    Sparkling - A Clojure API for Apache Spark

    Sparkling is a Clojure API for Apache Spark.

    Show me a small sample

    (do
      (require '[sparkling.conf :as conf])
      (require '[sparkling.core :as spark])
      (spark/with-context sc (-> (conf/spark-conf)              ; this creates a spark context from the given context
                                 (conf/app-name "sparkling-test")
                                 (conf/master "local"))
                          (let [lines-rdd (spark/into-rdd sc ["This is the first line"   ;; here we provide data from a clojure collection.
                                                              "Testing spark"           ;; You could also read from a text file, or avro file.
                                                              "and sparkling"           ;; You could even approach a JDBC datasource
                                                              "Happy hacking!"])]
                            (spark/collect                      ;; get every element from the filtered RDD
                              (spark/filter                     ;; filter elements in the given RDD (lines-rdd)
                                #(.contains % "spark")          ;; a pure clojure function as filter predicate
                                lines-rdd)))))
    

    Where to find more info

    Check out our site for information about Gorillalabs Sparkling and a getting started guide.

    Sample Project repo available

    Just clone our getting-started repo and get going right now.

    But note: There’s one thing you need to be aware of: Certain namespaces need to be AOT-compiled, e.g. because the classes are referenced in the startup process by name. I’m doing this in my project.clj using the :aot directive like this

                :aot [#".*" sparkling.serialization sparkling.destructuring]
    

    Availabilty from Clojars

    Sparkling is available from Clojars. To use with Leiningen, add

    Clojars Project to your dependencies.

    Build Status

    Release Notes

    2.0.0 - switch to Spark 2.0

    • added support for Spark SQL

    1.2.3 - more developer friendly

    • added @/deref support for broadcasts Making it easier to work with broadcasts by using Clojure mechanisms. This is especially true for unit tests, as you could test without actual broadcasts, but with anything deref-able.
    • added RDD autonaming from fn metadata, eases navigation in SparkUI
    • added lookup functionality. Make sure the key to your Tuples is Serializable (Java serialization), as it will be serialized as part of your task definition, not only as part of your data. These are handled differently in Spark.

    1.2.2 - added

    whole-text-files in sparkling.core. (thanks to Jase Bell)

    1.2.1 - improved Kryo Registration, AVRO reader + new Accumulator feature

    • feature: added accumulators (Thanks to Oleg Smirnov for that)
    • change: overhaul of Kryo Registration: Deprecated defregistrator macro, added Registrator type (see sparkling.serialization), with basic support of required types. This introduced a breaking change (sorry!): You need to aot-compile (or require) sparkling.serialization to run stuff in the REPL.
    • feature: added support for your own avro readers, making it possible to read types/records instead of maps. Major improvement on memory consumption.

    1.1.1 - cleaned dependencies

    • No more spilling of unwanted stuff in your application. You only need to refer to sparkling to get a proper environment with Spark 1.2.1. In order to deploy to a cluster with Spark pre-installed, you need to set Spark dependency to provided in your project, though.

    1.1.0 - Added a more clojuresque API

    • Use sparkling.core instead of sparkling.api for parameter orders similar to Clojure. Easier currying using partial.
    • Made it possible to use Keywords as Functions by serializing IFn instead of AFunction.
    • Tested with Spark 1.1.0 and Spark 1.2.1.

    1.0.0 - Added value to the existing libraries (clj-spark and flambo)

    • It’s about twice as fast by getting rid of a reflection call (thanks to David Jacot for his take on this).
    • Get rid of mapping/remapping inside the api functions, which
    • bloated the execution plan (mine shrinked to a third) and
    • (more importantly) allowed me to keep partitioner information.
    • adding more -values functions (e.g. map-values), againt to keep partitioner information.
    • Additional Sources for RDDs:
    • JdbcRDD: Reading Data from your JDBC source.
    • Hadoop-Avro-Reader: Reading AVRO Files from HDFS

    Contributing

    Feel free to fork the Sparkling repository, improve stuff and open up a pull request against our “develop” branch. However, we’ll only add features with tests, so make sure everything is green ;)

    Acknowledgements

    Thanks to The Climate Corporation and their open source clj-spark project, and to Yieldbot for yieldbot/flambo which served as the starting point for this project.

    License

    Copyright (C) 2014-2015 Dr. Christian Betz, and the Gorillalabs team.

    Distributed under the Eclipse Public License either version 1.0 or (at your option) any later version.