sparkling.api

aggregate

(aggregate rdd zero-value seq-op comb-op)

Aggregates the elements of each partition, and then the results for all the partitions, using a given combine function and a neutral ‘zero value’.
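
A minimal sketch (assuming this namespace is required as api, sc is an existing SparkContext, and plain Clojure functions are serializable in your setup) computing a sum and a count in one pass:

    ;; zero-value is [sum count]; seq-op folds one element into the
    ;; accumulator, comb-op merges the per-partition accumulators.
    (api/aggregate (api/parallelize sc [1 2 3 4])
                   [0 0]
                   (fn [[s c] x] [(+ s x) (inc c)])
                   (fn [[s1 c1] [s2 c2]] [(+ s1 s2) (+ c1 c2)]))
    ;; => [10 4]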

cache

(cache rdd)

Persists rdd with the default storage level (MEMORY_ONLY).

cartesian

(cartesian rdd1 rdd2)

Returns the Cartesian product of two RDDs as an RDD of pairs.

checkpoint

(checkpoint rdd)

Marks rdd for checkpointing: it will be saved to a file inside the checkpoint directory, and all references to its parent RDDs will be removed.

coalesce

(coalesce rdd n)(coalesce rdd n shuffle?)

Decreases the number of partitions in rdd to n. Useful for running operations more efficiently after filtering down a large dataset.

coalesce-max

(coalesce-max rdd n)(coalesce-max rdd n shuffle?)

Decreases the number of partitions in rdd to at most n. Useful for running operations more efficiently after filtering down a large dataset.

cogroup

(cogroup rdd other)(cogroup rdd other1 other2)

When called on RDDs of type (K, V), (K, W1), (K, W2), ..., groups the values for each key across rdd and the other RDDs, returning an RDD of (K, (Seq[V], Seq[W1], ...)) tuples.

collect

(collect rdd)

Returns all the elements of rdd as an array at the driver process.

combine-by-key

(combine-by-key rdd create-combiner merge-value merge-combiners)(combine-by-key rdd create-combiner merge-value merge-combiners n)

Combines the elements for each key using a custom set of aggregation functions. Turns an RDD of (K, V) pairs into a result of type (K, C), for a ‘combined type’ C. Note that V and C can be different; for example, one might group an RDD of type (Int, Int) into an RDD of type (Int, List[Int]). Users must provide three functions:

- create-combiner, which turns a V into a C (e.g., creates a one-element list)
- merge-value, to merge a V into a C (e.g., adds it to the end of a list)
- merge-combiners, to combine two C’s into a single one.
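
A hedged sketch (same api and sc assumptions as above) that groups numbers per key into vectors, so V is a number and C a vector:

    (-> (api/parallelize-pairs sc [(api/tuple :a 1) (api/tuple :a 2) (api/tuple :b 3)])
        (api/combine-by-key vector   ; create-combiner: V -> C
                            conj     ; merge-value: (C, V) -> C
                            into)    ; merge-combiners: (C, C) -> C
        api/collect)
    ;; => tuples along the lines of (:a [1 2]) and (:b [3])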

count

(count rdd)

Returns the number of elements in rdd.

count-by-key

(count-by-key rdd)

Only available on RDDs of type (K, V). Returns a map of (K, Int) pairs with the count of each key.

count-by-value

(count-by-value rdd)

Returns the count of each unique value in rdd as a map of (value, count) pairs.

count-partitions

(count-partitions rdd)

Returns the number of partitions of rdd.

distinct

(distinct rdd)(distinct rdd n)

Returns a new RDD containing the distinct elements of the source rdd.

filter

(filter rdd f)

Returns a new RDD containing only the elements of rdd that satisfy a predicate f.
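
A minimal sketch, under the same assumptions:

    (-> (api/parallelize sc (range 10))
        (api/filter even?)
        api/collect)
    ;; => [0 2 4 6 8]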

first

(first rdd)

Returns the first element of rdd.

flat-map

(flat-map rdd f)

Similar to map, but each input item can be mapped to 0 or more output items (so the function f should return a collection rather than a single item).
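
For example (a sketch under the same assumptions; clojure.string must be loaded), splitting lines into words:

    (-> (api/parallelize sc ["hello world" "foo bar"])
        (api/flat-map (fn [line] (clojure.string/split line #" ")))
        api/collect)
    ;; => ["hello" "world" "foo" "bar"]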

flat-map-to-pair

(flat-map-to-pair rdd f)

Returns a new JavaPairRDD by first applying f to all elements of rdd, and then flattening the results.

flat-map-values

(flat-map-values rdd f)

Returns a pair RDD obtained by applying f to each value of rdd without changing the keys, then flattening the results.

fold

(fold rdd zero-value f)

Aggregates the elements of each partition, and then the results for all the partitions, using a given associative function and a neutral ‘zero value’.
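
For instance (a sketch, same assumptions), summing with 0 as the neutral zero-value:

    (api/fold (api/parallelize sc [1 2 3 4]) 0 +)
    ;; => 10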

foreach

(foreach rdd f)

Applies the function f to all elements of rdd.

foreach-partition

(foreach-partition rdd f)

Applies the function f to each partition (as an iterator) of rdd.

glom

(glom rdd)

Returns an RDD created by coalescing all elements of rdd within each partition into a list.

group-by

(group-by rdd f)(group-by rdd f n)

Returns an RDD of items grouped by the return value of function f.

group-by-key

(group-by-key rdd)(group-by-key rdd n)

Groups the values for each key in rdd into a single sequence.
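
A sketch under the same assumptions:

    (-> (api/parallelize-pairs sc [(api/tuple :a 1) (api/tuple :a 2) (api/tuple :b 3)])
        api/group-by-key
        api/collect)
    ;; => tuples along the lines of (:a [1 2]) and (:b [3])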

hash-partitioner

(hash-partitioner n)(hash-partitioner subkey-fn n)

Returns a Spark HashPartitioner with n partitions. The two-argument form partitions on the result of applying subkey-fn to each key.

histogram

multimethod

Computes a histogram of an RDD of doubles.

jar-of-ns

(jar-of-ns ns)

join

(join rdd other)

When called on RDDs of type (K, V) and (K, W), returns a dataset of (K, (V, W)) pairs with all pairs of elements for each key.

key-by

(key-by rdd f)

Creates a pair RDD by applying f to each element of rdd and using the result as that element’s key.

keys

(keys rdd)

Returns an RDD with the keys of each tuple.

left-outer-join

(left-outer-join rdd other)

Performs a left outer join of rdd and other. For each element (K, V) in rdd, the resulting RDD will either contain all pairs (K, (V, W)) for W in other, or the pair (K, (V, nil)) if no elements in other have key K.
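
A sketch (same assumptions; the exact representation of a missing right side, e.g. an Optional, depends on the Spark version):

    (-> (api/parallelize-pairs sc [(api/tuple :a 1) (api/tuple :b 2)])
        (api/left-outer-join (api/parallelize-pairs sc [(api/tuple :a 10)]))
        api/collect)
    ;; :a joins to (1, 10); :b keeps its value with an empty right side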

local-spark-context

(local-spark-context app-name)

Creates a SparkContext running in local mode, with the given application name.

map

(map rdd f)

Returns a new RDD formed by passing each element of the source through the function f.

map-partition

(map-partition rdd f)

Similar to map, but runs separately on each partition (block) of the rdd, so function f must be of type Iterator => Iterable. https://issues.apache.org/jira/browse/SPARK-3369
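
A sketch under the same assumptions; f receives each partition’s iterator and returns an iterable:

    (-> (api/parallelize sc (range 6) 2)
        (api/map-partition (fn [it] (map inc (iterator-seq it))))
        api/collect)
    ;; => [1 2 3 4 5 6]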

map-partition-with-index

(map-partition-with-index rdd f)

Similar to map-partition, but function f is of type (Int, Iterator) => Iterator, where the Int argument is the index of the partition.

map-partitions-to-pair

(map-partitions-to-pair rdd f & {:keys [preserves-partitioning]})

Similar to map, but runs separately on each partition (block) of the rdd, so function f must be of type Iterator => Iterable. https://issues.apache.org/jira/browse/SPARK-3369

map-to-pair

(map-to-pair rdd f)

Returns a new JavaPairRDD of (K, V) pairs by applying f to all elements of rdd.
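
A sketch (same assumptions) keying each word with a count of 1, the usual first step of a word count:

    (-> (api/parallelize sc ["a" "b" "a"])
        (api/map-to-pair (fn [w] (api/tuple w 1)))
        api/collect)
    ;; => tuples ("a" 1) ("b" 1) ("a" 1)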

map-values

(map-values rdd f)

Returns a pair RDD by applying f to each value of rdd, without changing the keys.

parallelize

(parallelize spark-context lst)(parallelize spark-context lst num-slices)

Distributes a local collection to form an RDD.

parallelize-pairs

(parallelize-pairs spark-context lst)(parallelize-pairs spark-context lst num-slices)

Distributes a local collection of key-value tuples to form a pair RDD.

partition-by

(partition-by rdd partitioner)

Returns a copy of rdd partitioned by the given partitioner.

partitioner

(partitioner rdd)

Returns the partitioner of rdd, if one is set.

partitioner-aware-union

(partitioner-aware-union pair-rdd1 pair-rdd2 & pair-rdds)

Builds the union of the given pair RDDs, which must all share the same partitioner; the resulting RDD keeps that partitioner.

partitions

(partitions javaRdd)

Returns a vector of the partitions of a given JavaRDD.

partitionwise-sampled-rdd

(partitionwise-sampled-rdd rdd sampler preserve-partitioning? seed)

Returns an RDD sampled partition-wise from rdd using the given sampler and seed; preserve-partitioning? controls whether rdd’s partitioner is kept.

persist

(persist rdd storage-level)

Sets the storage level of rdd to persist its values across operations after the first time it is computed. Storage levels are available in the STORAGE-LEVELS map. This can only be used to assign a new storage level if the RDD does not have a storage level set already.
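
A sketch; the exact keyword (here :memory-and-disk) is an assumption about the keys of the STORAGE-LEVELS map:

    (-> (api/parallelize sc (range 100))
        (api/persist (:memory-and-disk api/STORAGE-LEVELS)))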

rdd-name

(rdd-name rdd name)(rdd-name rdd)

Sets the name of rdd to name, or returns rdd’s current name when called with a single argument.

reduce

(reduce rdd f)

Aggregates the elements of rdd using the function f (which takes two arguments and returns one). The function should be commutative and associative so that it can be computed correctly in parallel.

reduce-by-key

(reduce-by-key rdd f)

When called on an rdd of (K, V) pairs, returns an RDD of (K, V) pairs where the values for each key are aggregated using the given reduce function f.
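
Completing the word count begun under map-to-pair (same assumptions):

    (-> (api/parallelize sc ["a" "b" "a"])
        (api/map-to-pair (fn [w] (api/tuple w 1)))
        (api/reduce-by-key +)
        api/collect)
    ;; => tuples along the lines of ("a" 2) ("b" 1)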

rekey-preserving-partitioning-without-check

(rekey-preserving-partitioning-without-check rdd rekey-fn)

Re-keys a pair RDD by applying rekey-fn to generate new tuples. It does not check whether the new keys preserve the existing partitioning, so use with care!

repartition

(repartition rdd n)

Returns a new RDD with exactly n partitions.

sample

(sample rdd with-replacement? fraction seed)

Returns a sampled fraction of rdd, with or without replacement, using a given random number generator seed.

save-as-sequence-file

(save-as-sequence-file rdd path)

Writes the elements of rdd as a Hadoop SequenceFile in a given path in the local filesystem, HDFS or any other Hadoop-supported file system. This is available on RDDs of key-value pairs that either implement Hadoop’s Writable interface or are implicitly convertible to it.

save-as-text-file

(save-as-text-file rdd path)

Writes the elements of rdd as a text file (or set of text files) in a given directory path in the local filesystem, HDFS or any other Hadoop-supported file system. Spark will call toString on each element to convert it to a line of text in the file.

sort-by-key

(sort-by-key rdd)(sort-by-key rdd x)(sort-by-key rdd compare-fn asc?)

When called on an rdd of (K, V) pairs where K implements Ordered, returns a dataset of (K, V) pairs sorted by key, in ascending or descending order as specified by the boolean asc? argument.
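
A sketch (same assumptions) sorting in descending key order with Clojure’s compare, which implements java.util.Comparator:

    (-> (api/parallelize-pairs sc [(api/tuple 2 :b) (api/tuple 1 :a) (api/tuple 3 :c)])
        (api/sort-by-key compare false)
        api/collect)
    ;; => tuples ordered (3 :c) (2 :b) (1 :a)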

spark-context

(spark-context conf)(spark-context master app-name)

Creates a Spark context that loads settings from a given configuration object or from system properties.

STORAGE-LEVELS

Map from keywords to the available Spark storage levels, for use with persist.

take

(take rdd cnt)

Returns an array with the first cnt elements of rdd. (Note: this is currently not executed in parallel; instead, the driver program computes all the elements.)

text-file

(text-file spark-context filename)

Reads a text file from HDFS, a local file system (available on all nodes), or any Hadoop-supported file system URI, and returns it as a JavaRDD of Strings.

tuple

(tuple k v)

Returns a scala.Tuple2 of k and v, as consumed and produced by the pair-RDD functions.

union

(union rdd1 rdd2)(union rdd1 rdd2 & rdds)

Builds the union of two or more RDDs.

values

(values rdd)

Returns the values of a JavaPairRDD.

with-context

macro

(with-context context-sym conf & body)

Evaluates body with context-sym bound to a SparkContext built from conf, stopping the context when body completes.
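
A sketch of the full lifecycle, assuming sparkling.api is aliased as api and sparkling.conf as conf:

    (api/with-context sc (-> (conf/spark-conf)
                             (conf/master "local[2]")
                             (conf/app-name "example"))
      (-> (api/parallelize sc [1 2 3])
          (api/map inc)
          api/collect))
    ;; => [2 3 4]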