sparkling.core (sparkling 2.1.3)

    VARS
    aggregate
    cache
    cartesian
    checkpoint
    coalesce
    coalesce-max
    cogroup
    collect
    collect-map
    combine-by-key
    count
    count-by-key
    count-by-value
    count-partitions
    default-min-partitions
    default-parallelism
    distinct
    filter
    first
    flat-map
    flat-map-to-pair
    flat-map-values
    fold
    foreach
    foreach-partition
    ftruthy?
    glom
    group-by
    group-by-key
    hash-partitioner
    histogram
    intersect-by-key
    intersection
    into-pair-rdd
    into-rdd
    jar-of-ns
    join
    key-by
    key-by-fn
    keys
    left-outer-join
    local-spark-context
    lookup
    map
    map-partition
    map-partition-with-index
    map-partitions-to-pair
    map-to-pair
    map-values
    max
    min
    parallelize
    parallelize-pairs
    partition-by
    partitioner
    partitioner-aware-union
    partitions
    partitionwise-sampled-rdd
    rdd-name
    reduce
    reduce-by-key
    rekey
    repartition
    sample
    save-as-text-file
    sort-by-key
    spark-context
    stop
    storage-level!
    STORAGE-LEVELS
    subtract
    subtract-by-key
    take
    text-file
    tuple
    tuple-by
    uncache
    union
    values
    whole-text-files
    with-context
    wrap-comparator
    zip-with-index
    zip-with-unique-id


    This is the main entry point to sparkling, typically required like `[sparkling.core :as s]`.
    
    By design, most operations in sparkling are built up via the thread-last macro ->>.
    Thus, they work exactly like their clojure.core counterparts, making it easy to migrate Clojure-only code to Spark code.
    
    If you find an RDD operation missing from the API that you'd like to use, pull requests are
    happily accepted!
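
    For example, assuming a running SparkContext bound to sc (see spark-context below) and the namespace required as s, a small pipeline might look like this minimal sketch:

        (->> (s/into-rdd sc (range 1 11))        ; distribute a local collection as an RDD
             (s/filter even?)
             (s/map (fn [x] (* x x)))
             (s/collect))
        ;; => the squares of the even numbers, (4 16 36 64 100)
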
    (aggregate item-to-seq-fn combine-seqs-fn zero-value rdd)
    Aggregates the elements of each partition, and then the results for all the partitions,
    using a given combine function and a neutral 'zero value'.
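    A minimal sketch (assuming a SparkContext sc): summing 1..100 by using + both as the per-partition fold and as the combiner of partition results, with 0 as the zero value.

        (->> (s/into-rdd sc (range 1 101))
             (s/aggregate + + 0))
        ;; => 5050
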
    (cache rdd)
    Persists rdd with the default storage level (MEMORY_ONLY).
    
    (cartesian rdd1 rdd2)
    Creates the cartesian product of two RDDs returning an RDD of pairs
    
    (checkpoint rdd)
    (coalesce n rdd)(coalesce n shuffle? rdd)
    Decrease the number of partitions in rdd to n.
    Useful for running operations more efficiently after filtering down a large dataset.
    (coalesce-max n rdd)(coalesce-max n shuffle? rdd)
    Decrease the number of partitions in rdd to n.
    Useful for running operations more efficiently after filtering down a large dataset.
    (cogroup rdd other)(cogroup rdd other1 other2)(cogroup rdd other1 other2 other3)
    (collect rdd)
    Returns all the elements of rdd as an array at the driver process.
    
    (collect-map pair-rdd)
    Returns all elements of pair-rdd as a map at the driver process.
    Attention: The resulting map will only have one entry per key.
               Thus, if you have multiple tuples with the same key in the pair-rdd, the collection returned will not contain all elements!
               The function itself will *not* issue a warning of any kind!
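    A sketch of this pitfall (assuming a SparkContext sc):

        (->> (s/into-pair-rdd sc [(s/tuple "a" 1) (s/tuple "a" 2) (s/tuple "b" 3)])
             (s/collect-map))
        ;; => a map with one entry per key, e.g. {"a" 2, "b" 3}; one of the "a" tuples is silently dropped
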
    (combine-by-key seq-fn conj-fn merge-fn rdd)(combine-by-key seq-fn conj-fn merge-fn n rdd)
    Combines the elements for each key using a custom set of aggregation functions.
    Turns an RDD of (K, V) pairs into a result of type (K, C), for a 'combined type' C.
    Note that V and C can be different -- for example, one might group an RDD of type
    (Int, Int) into an RDD of type (Int, List[Int]).
    Users must provide three functions:
    -- seq-fn, which turns a V into a C (e.g., creates a one-element list)
    -- conj-fn, to merge a V into a C (e.g., adds it to the end of a list)
    -- merge-fn, to combine two C's into a single one.
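    A sketch (assuming a SparkContext sc) that collects the values for each key into a vector: vector builds a one-element combiner from a value, conj adds a further value to a combiner, and into merges two combiners.

        (->> (s/into-pair-rdd sc [(s/tuple "a" 1) (s/tuple "a" 2) (s/tuple "b" 3)])
             (s/combine-by-key vector conj into)
             (s/collect))
        ;; => pairs like ("a", [1 2]) and ("b", [3]), as Spark Tuple2 objects
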
    (count rdd)
    Return the number of elements in rdd.
    
    (count-by-key rdd)
    Only available on RDDs of type (K, V).
    Returns a map of (K, Int) pairs with the count of each key.
    (count-by-value rdd)
    Return the count of each unique value in rdd as a map of (value, count)
    pairs.
    (count-partitions rdd)
    (default-min-partitions spark-context)
    Default min number of partitions for Hadoop RDDs when not given by user
    
    (default-parallelism spark-context)
    Default level of parallelism to use when not given by user (e.g. parallelize and makeRDD).
    
    (distinct rdd)(distinct n rdd)
    Return a new RDD that contains the distinct elements of the source rdd.
    
    (filter f rdd)
    Returns a new RDD containing only the elements of rdd that satisfy a predicate f.
    
    (first rdd)
    Returns the first element of rdd.
    
    (flat-map f rdd)
    Similar to map, but each input item can be mapped to 0 or more output items (so the
    function f should return a collection rather than a single item)
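    A sketch (assuming a SparkContext sc) where each input item expands to zero or more output items:

        (->> (s/into-rdd sc [1 2 3])
             (s/flat-map (fn [n] (range n)))     ; 1 -> (0), 2 -> (0 1), 3 -> (0 1 2)
             (s/collect))
        ;; => (0 0 1 0 1 2)
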
    (flat-map-to-pair f rdd)
    Returns a new JavaPairRDD by first applying f to all elements of rdd, and then flattening
    the results.
    (flat-map-values f rdd)
    Returns a JavaPairRDD by applying f to all values of rdd, and then 
    flattening the results
    (fold f zero-value rdd)
    Aggregates the elements of each partition, and then the results for all the partitions,
    using a given associative function and a neutral 'zero value'
    (foreach f rdd)
    Applies the function f to all elements of rdd.
    
    (foreach-partition f rdd)
    Applies the function f to each partition of rdd (f is called once per partition rather than once per element).
    
    Private
    (ftruthy? f)
    (glom rdd)
    Returns an RDD created by coalescing all elements of rdd within each partition into a list.
    
    (group-by f rdd)(group-by f n rdd)
    Returns an RDD of items grouped by the return value of function f.
    
    (group-by-key rdd)(group-by-key n rdd)
    Groups the values for each key in rdd into a single sequence.
    
    (hash-partitioner n)(hash-partitioner subkey-fn n)
    multimethod
    histogram
    Compute the histogram of an RDD of doubles.
    
    (intersect-by-key rdd1 keyfn keybackfn rdd2)
    Intersects rdd1 with rdd2 by key,
    i.e. rdd1 is rekeyed by keyfn,
    then joined to keep only those elements with keys in rdd2
    and rekeyed again with keybackfn to bring back the original structure.
    
    Remember, rekey is performed by partition,
    thus keyfn and keybackfn and the original partitioning should work with the given partitioner.
    (intersection rdd1 rdd2)
    (into-pair-rdd spark-context lst)(into-pair-rdd spark-context num-slices lst)
    Distributes a local collection of key-value tuples to form/return a pair RDD (JavaPairRDD)
    
    (into-rdd spark-context lst)(into-rdd spark-context num-slices lst)
    Distributes a local collection to form/return an RDD
    
    (jar-of-ns ns)
    (join rdd other)
    When called on an rdd of type (K, V) and other of type (K, W), returns a dataset of
    (K, (V, W)) pairs with all pairs of elements for each key.
    (key-by f rdd)
    Creates tuples of the elements in this RDD by applying f.
    
    (key-by-fn f)
    Wraps a function f to be called with the value v of a tuple from spark,
    so that the wrapped function returns a tuple [f(v),v]
    (keys rdd)
    Return an RDD with the keys of each tuple.
    
    (left-outer-join rdd other)
    Performs a left outer join of rdd and other. For each element (K, V)
    in the RDD, the resulting RDD will either contain all pairs (K, (V, W)) for W in other,
    or the pair (K, (V, nil)) if no elements in other have key K.
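    A sketch with hypothetical data (assuming a SparkContext sc):

        (let [users    (s/into-pair-rdd sc [(s/tuple 1 "alice") (s/tuple 2 "bob")])
              accounts (s/into-pair-rdd sc [(s/tuple 1 "premium")])]
          (s/collect (s/left-outer-join users accounts)))
        ;; => pairs like (1, ("alice", "premium")) and (2, ("bob", nil)), per the description above
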
    (local-spark-context app-name)
    (lookup pair-rdd key)
    (map f rdd)
    Returns a new RDD formed by passing each element of the source through the function f.
    
    (map-partition f rdd)
    Similar to map, but runs separately on each partition (block) of the rdd, so function f
    must be of type Iterator<T> => Iterable<U>.
    https://issues.apache.org/jira/browse/SPARK-3369
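    A sketch (assuming a SparkContext sc): f is handed an iterator over one partition and must return a collection, here a one-element vector holding that partition's sum.

        (->> (s/into-rdd sc 4 (range 100))                             ; 4 partitions
             (s/map-partition (fn [it] [(reduce + (iterator-seq it))]))
             (s/collect))
        ;; => one partial sum per partition, e.g. (300 925 1550 2175)
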
    (map-partition-with-index f rdd)
    Similar to map-partition but function f is of type (Int, Iterator<T>) => Iterator<U> where
    i represents the index of partition.
    (map-partitions-to-pair f preserve-partitioning? rdd)
    Similar to map, but runs separately on each partition (block) of the rdd, so function f
    must be of type Iterator<T> => Iterable<U>.
    https://issues.apache.org/jira/browse/SPARK-3369
    (map-to-pair f rdd)
    Returns a new JavaPairRDD of (K, V) pairs by applying f to all elements of rdd.
    
    (map-values f rdd)
    (max compare-fn rdd)
    Return the maximum value in rdd in the ordering defined by compare-fn
    
    (min compare-fn rdd)
    Return the minimum value in rdd in the ordering defined by compare-fn
    
    (partition-by partitioner rdd)
    (partitioner rdd)
    (partitioner-aware-union pair-rdd1 pair-rdd2 & pair-rdds)
    (partitions javaRdd)
    Returns a vector of partitions for a given JavaRDD
    
    (partitionwise-sampled-rdd sampler preserve-partitioning? seed rdd)
    Creates a PartitionwiseSampledRDD from an existing RDD and a sampler object
    
    (rdd-name name rdd)(rdd-name rdd)
    (reduce f rdd)
    Aggregates the elements of rdd using the function f (which takes two arguments
    and returns one). The function should be commutative and associative so that it can be
    computed correctly in parallel.
    (reduce-by-key f rdd)
    When called on an rdd of (K, V) pairs, returns an RDD of (K, V) pairs
    where the values for each key are aggregated using the given reduce function f.
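    A sketch of the classic word count (assuming a SparkContext sc), combining flat-map, map-to-pair, tuple and reduce-by-key:

        (require '[clojure.string :as string])

        (->> (s/into-rdd sc ["to be" "or not to be"])
             (s/flat-map (fn [line] (string/split line #" ")))
             (s/map-to-pair (fn [word] (s/tuple word 1)))
             (s/reduce-by-key +)
             (s/collect))
        ;; => pairs like ("to", 2), ("be", 2), ("or", 1), ("not", 1)
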
    (rekey rekey-fn rdd)
    This re-keys a pair-rdd by applying the rekey-fn to generate new tuples.
    However, it does not check whether your new keys would keep the same partitioning, so watch out!
    (repartition n rdd)
    Returns a new rdd with exactly n partitions.
    
    (sample with-replacement? fraction seed rdd)
    Returns a fraction sample of rdd, with or without replacement,
    using a given random number generator seed.
    (save-as-text-file path rdd)(save-as-text-file path rdd codec-class)
    Writes the elements of rdd as a text file (or set of text files)
    in a given directory path in the local filesystem, HDFS or any other Hadoop-supported
    file system. Supports an optional codec class like org.apache.hadoop.io.compress.GzipCodec.
    Spark will call toString on each element to convert it to a line of
    text in the file.
    (sort-by-key rdd)(sort-by-key x rdd)(sort-by-key compare-fn asc? rdd)
    When called on an rdd of (K, V) pairs where K implements Ordered, returns a dataset of
    (K, V) pairs sorted by keys in ascending or descending order, as specified by the boolean
    ascending argument.
    (spark-context conf)(spark-context master app-name)
    Creates a spark context that loads settings from given configuration object
    or system properties
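    A sketch of building a local context; the configuration helpers (spark-conf, master, app-name) are assumed to come from the sparkling.conf namespace:

        (require '[sparkling.conf :as conf])

        (def sc
          (s/spark-context (-> (conf/spark-conf)
                               (conf/master "local[*]")
                               (conf/app-name "sparkling-docs-example"))))
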
    (storage-level! storage-level rdd)
    Sets the storage level of rdd to persist its values across operations
    after the first time it is computed. Storage levels are available in the `STORAGE-LEVELS` map.
    This can only be used to assign a new storage level if the RDD does not have a storage level set already.
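    A sketch (assuming a SparkContext sc, and assuming STORAGE-LEVELS is keyed by keywords such as :memory-and-disk):

        (->> (s/text-file sc "hdfs:///some/large/input")               ; hypothetical path
             (s/storage-level! (:memory-and-disk s/STORAGE-LEVELS))
             (s/count))                                                ; first action computes and persists the RDD
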
    (subtract rdd1 rdd2)
    Removes all elements from rdd1 that are present in rdd2.
    
    (subtract-by-key rdd1 rdd2)
    Return each (key, value) pair in rdd1 that has no pair with matching key in rdd2.
    
    (take cnt rdd)
    Return an array with the first n elements of rdd.
    (Note: this is currently not executed in parallel. Instead, the driver
    program computes all the elements).
    (text-file spark-context filename)(text-file spark-context filename min-partitions)
    Reads a text file from HDFS, a local file system (available on all nodes),
    or any Hadoop-supported file system URI, and returns it as a JavaRDD of Strings.
    (uncache blocking? rdd)(uncache rdd)
    Marks rdd as non-persistent (removes all blocks for it from memory and disk).  If blocking? is true, block until the operation is complete.
    
    (union rdd1 rdd2)(union rdd1 rdd2 & rdds)
    Build the union of two or more RDDs
    
    (values rdd)
    Returns the values of a JavaPairRDD
    
    (whole-text-files spark-context filename min-partitions)(whole-text-files spark-context filename)
    Read a directory of text files from HDFS, a local file system (available on all nodes),
    or any Hadoop-supported file system URI. Each file is read as a single record and returned
    as a key-value pair, where the key is the path of the file and the value is its content.
    macro
    (with-context context-sym conf & body)
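    A sketch, assuming with-context stops the context once the body has finished, and that the configuration helpers come from sparkling.conf:

        (require '[sparkling.conf :as conf])

        (s/with-context sc (-> (conf/spark-conf)
                               (conf/master "local[*]")
                               (conf/app-name "with-context-example"))
          (->> (s/into-rdd sc (range 10))
               (s/map inc)
               (s/collect)))
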
    Private
    (wrap-comparator f)
    (zip-with-index rdd)
    Zips this RDD with its element indices, creating an RDD of tuples of (item, index)
    
    (zip-with-unique-id rdd)
    Zips this RDD with generated unique Long ids, creating an RDD of tuples of (item, uniqueId)
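    A sketch of zip-with-index (assuming a SparkContext sc):

        (->> (s/into-rdd sc ["a" "b" "c"])
             (s/zip-with-index)
             (s/collect))
        ;; => pairs like ("a", 0), ("b", 1), ("c", 2)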