sparkling.api
aggregate
(aggregate rdd zero-value seq-op comb-op)
Aggregates the elements of each partition, and then the results for all the partitions, using a given combine function and a neutral ‘zero value’.
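For example, a minimal sketch (assuming sparkling.api is aliased as spark and sc is an existing spark context) that computes the sum and count of a numeric RDD in one pass:
(spark/aggregate (spark/parallelize sc [1 2 3 4])
                 [0 0]                                         ;; zero value: [sum count]
                 (fn [[sum cnt] x] [(+ sum x) (inc cnt)])      ;; seq-op: fold one element into the accumulator
                 (fn [[s1 c1] [s2 c2]] [(+ s1 s2) (+ c1 c2)])) ;; comb-op: merge two partition accumulators
;; => [10 4]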
cartesian
(cartesian rdd1 rdd2)
Creates the cartesian product of two RDDs, returning an RDD of pairs.
coalesce
(coalesce rdd n)
(coalesce rdd n shuffle?)
Decrease the number of partitions in rdd to n. Useful for running operations more efficiently after filtering down a large dataset.
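For example (a sketch; filtered-rdd stands for an RDD that has already been filtered down):
(spark/coalesce filtered-rdd 4)        ;; shrink to 4 partitions, avoiding a shuffle where possible
(spark/coalesce filtered-rdd 4 true)   ;; pass shuffle? to force a shuffle while repartitioning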
coalesce-max
(coalesce-max rdd n)
(coalesce-max rdd n shuffle?)
Decrease the number of partitions in rdd to n. Useful for running operations more efficiently after filtering down a large dataset.
combine-by-key
(combine-by-key rdd create-combiner merge-value merge-combiners)
(combine-by-key rdd create-combiner merge-value merge-combiners n)
Combines the elements for each key using a custom set of aggregation functions. Turns an RDD of (K, V) pairs into a result of type (K, C), for a ‘combined type’ C. Note that V and C can be different – for example, one might group an RDD of type (Int, Int) into an RDD of type (Int, List[Int]). Users must provide three functions:
– createCombiner, which turns a V into a C (e.g., creates a one-element list)
– mergeValue, to merge a V into a C (e.g., adds it to the end of a list)
– mergeCombiners, to combine two C’s into a single one.
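A sketch of collecting the values for each key into a vector (assumes sc is an existing context, sparkling.api is aliased as spark, and spark/tuple builds a key-value tuple):
(-> (spark/parallelize-pairs sc [(spark/tuple :a 1) (spark/tuple :a 2) (spark/tuple :b 3)])
    (spark/combine-by-key
      (fn [v] [v])                ;; create-combiner: turn a V into a C (one-element vector)
      (fn [c v] (conj c v))       ;; merge-value: add a V to an existing C
      (fn [c1 c2] (into c1 c2)))) ;; merge-combiners: combine two C's
;; => pair RDD with (:a [1 2]) and (:b [3])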
count-by-key
(count-by-key rdd)
Only available on RDDs of type (K, V). Returns a map of (K, Int) pairs with the count of each key.
count-by-value
(count-by-value rdd)
Return the count of each unique value in rdd as a map of (value, count) pairs.
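For example (sketch, assuming sc is an existing context):
(spark/count-by-value (spark/parallelize sc ["a" "b" "a" "a"]))
;; => a map along the lines of {"a" 3, "b" 1}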
distinct
(distinct rdd)
(distinct rdd n)
Return a new RDD that contains the distinct elements of the source rdd.
filter
(filter rdd f)
Returns a new RDD containing only the elements of rdd that satisfy a predicate f.
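For example (sketch, assuming sc is an existing context):
(spark/filter (spark/parallelize sc (range 10)) even?)  ;; keep only the even numbers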
flat-map
(flat-map rdd f)
Similar to map, but each input item can be mapped to 0 or more output items (so the function f should return a collection rather than a single item).
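For example (sketch, assuming sc is an existing context), splitting lines into words:
(spark/flat-map (spark/parallelize sc ["to be" "or not to be"])
                (fn [line] (re-seq #"\S+" line)))  ;; each line yields zero or more words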
flat-map-to-pair
(flat-map-to-pair rdd f)
Returns a new JavaPairRDD by first applying f to all elements of rdd, and then flattening the results.
fold
(fold rdd zero-value f)
Aggregates the elements of each partition, and then the results for all the partitions, using a given associative function and a neutral ‘zero value’.
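For example (sketch, assuming sc is an existing context), summing an RDD of numbers:
(spark/fold (spark/parallelize sc [1 2 3 4]) 0 +)
;; => 10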
glom
(glom rdd)
Returns an RDD created by coalescing all elements of rdd within each partition into a list.
group-by
(group-by rdd f)
(group-by rdd f n)
Returns an RDD of items grouped by the return value of function f.
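For example (sketch, assuming sc is an existing context):
(spark/group-by (spark/parallelize sc (range 10)) even?)
;; => pair RDD keyed by true/false, each key paired with the matching numbers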
group-by-key
(group-by-key rdd)
(group-by-key rdd n)
Groups the values for each key in rdd into a single sequence.
join
(join rdd other)
When called on rdd of type (K, V) and other of type (K, W), returns a dataset of (K, (V, W)) pairs with all pairs of elements for each key.
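A sketch (assumes spark/tuple builds a key-value tuple and sc is an existing context):
(let [ages   (spark/parallelize-pairs sc [(spark/tuple "alice" 30) (spark/tuple "bob" 25)])
      cities (spark/parallelize-pairs sc [(spark/tuple "alice" "Berlin")])]
  (spark/join ages cities))
;; => pair RDD containing only ("alice" (30, "Berlin")); "bob" has no match in cities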
left-outer-join
(left-outer-join rdd other)
Performs a left outer join of rdd and other. For each element (K, V) in rdd, the resulting RDD will either contain all pairs (K, (V, W)) for W in other, or the pair (K, (V, nil)) if no elements in other have key K.
map
(map rdd f)
Returns a new RDD formed by passing each element of the source through the function f.
map-partition
(map-partition rdd f)
Similar to map, but runs separately on each partition (block) of the rdd, so function f must be of type Iterator<T> => Iterator<U>.
map-partition-with-index
(map-partition-with-index rdd f)
Similar to map-partition, but function f is of type (Int, Iterator<T>) => Iterator<U>, where the Int argument is the index of the partition.
map-partitions-to-pair
(map-partitions-to-pair rdd f & {:keys [preserves-partitioning]})
Similar to map, but runs separately on each partition (block) of the rdd, so function f must be of type Iterator<T> => Iterator<U>.
map-to-pair
(map-to-pair rdd f)
Returns a new JavaPairRDD of (K, V) pairs by applying f to all elements of rdd.
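For example (sketch; words-rdd stands for an RDD of strings and spark/tuple is assumed to build a key-value tuple):
(spark/map-to-pair words-rdd (fn [word] (spark/tuple word 1)))  ;; key each word by itself, ready for reduce-by-key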
parallelize
(parallelize spark-context lst)
(parallelize spark-context lst num-slices)
Distributes a local collection to form/return an RDD
parallelize-pairs
(parallelize-pairs spark-context lst)
(parallelize-pairs spark-context lst num-slices)
Distributes a local collection to form/return an RDD
partitionwise-sampled-rdd
(partitionwise-sampled-rdd rdd sampler preserve-partitioning? seed)
persist
(persist rdd storage-level)
Sets the storage level of rdd to persist its values across operations after the first time it is computed. Storage levels are available in the STORAGE-LEVELS map. This can only be used to assign a new storage level if the RDD does not have a storage level set already.
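A sketch (the keyword used to look up the level, here :memory-only, is an assumption; check the keys of STORAGE-LEVELS in your version, and lines-rdd stands for an existing RDD):
(spark/persist lines-rdd (get spark/STORAGE-LEVELS :memory-only))  ;; cache lines-rdd in memory for reuse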
reduce
(reduce rdd f)
Aggregates the elements of rdd using the function f (which takes two arguments and returns one). The function should be commutative and associative so that it can be computed correctly in parallel.
reduce-by-key
(reduce-by-key rdd f)
When called on an rdd of (K, V) pairs, returns an RDD of (K, V) pairs where the values for each key are aggregated using the given reduce function f.
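For example, the classic word count (sketch; assumes sc is an existing context and spark/tuple builds a key-value tuple):
(-> (spark/parallelize sc ["to be" "or not to be"])
    (spark/flat-map (fn [line] (re-seq #"\S+" line)))
    (spark/map-to-pair (fn [word] (spark/tuple word 1)))
    (spark/reduce-by-key +))
;; => pair RDD with ("to" 2), ("be" 2), ("or" 1), ("not" 1)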
rekey-preserving-partitioning-without-check
(rekey-preserving-partitioning-without-check rdd rekey-fn)
This re-keys a pair-rdd by applying the rekey-fn to generate new tuples. However, it does not check whether your new keys would keep the same partitioning, so watch out!!!!
sample
(sample rdd with-replacement? fraction seed)
Returns a fraction sample of rdd, with or without replacement, using a given random number generator seed.
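For example (sketch; big-rdd stands for an existing RDD):
(spark/sample big-rdd false 0.1 42)  ;; roughly 10% of the elements, without replacement, seeded for reproducibility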
save-as-sequence-file
(save-as-sequence-file rdd path)
Writes the elements of rdd as a Hadoop SequenceFile in a given path in the local filesystem, HDFS or any other Hadoop-supported file system. This is available on RDDs of key-value pairs that implement Hadoop’s Writable interface.
save-as-text-file
(save-as-text-file rdd path)
Writes the elements of rdd as a text file (or set of text files) in a given directory path in the local filesystem, HDFS or any other Hadoop-supported file system. Spark will call toString on each element to convert it to a line of text in the file.
sort-by-key
(sort-by-key rdd)
(sort-by-key rdd x)
(sort-by-key rdd compare-fn asc?)
When called on an rdd of (K, V) pairs where K implements Ordered, returns a dataset of (K, V) pairs sorted by keys in ascending or descending order, as specified by the boolean asc? argument.
spark-context
(spark-context conf)
(spark-context master app-name)
Creates a Spark context that loads settings from a given configuration object or from system properties.
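A typical setup sketch (assumes sparkling.conf provides spark-conf, master and app-name, as in the project's getting-started examples):
(require '[sparkling.conf :as conf]
         '[sparkling.api :as spark])

(def sc
  (spark/spark-context
    (-> (conf/spark-conf)
        (conf/master "local[*]")              ;; run locally, one thread per core
        (conf/app-name "sparkling-example"))))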
take
(take rdd cnt)
Return an array with the first cnt elements of rdd. (Note: this is currently not executed in parallel; instead, the driver program computes all the elements.)
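For example (sketch, assuming sc is an existing context):
(spark/take (spark/parallelize sc (range 100)) 5)
;; => the first 5 elements, (0 1 2 3 4), as a local collection on the driver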
text-file
(text-file spark-context filename)
Reads a text file from HDFS, a local file system (available on all nodes), or any Hadoop-supported file system URI, and returns it as a JavaRDD of Strings.
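For example (sketch; the path is illustrative and sc is an existing context):
(def lines (spark/text-file sc "data/input.txt"))  ;; one RDD element per line
(spark/take lines 3)                                ;; peek at the first three lines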