pigpen.core documentation
cogroup
macro
(cogroup selects+ f opts?)
Joins many relations together by a common key. Each relation specifies a
key-selector function on which to join. A combiner function is applied to each
join key and all values from each relation that match that join key. This is
similar to join, without flattening the data. Optionally takes a map of options.
Example:
(pig/cogroup (foo on :a)
(bar on :b required)
(fn [key foos bars] ...)
{:parallel 20})
In this example, foo and bar are other pig queries and :a and :b are the
key-selector functions for foo and bar, respectively. These can be any
functions - not just keywords. There can be more than two select clauses.
By default, a matching key value from each source relation is optional,
meaning that keys don't have to exist in all source relations to be part of the
output. To specify a relation as required, add 'required' to the select clause.
The third argument is a function used to consolidate matching key values. For
each unique key value, this function is called with the value of the key and all
values with that key from foo and bar. As such, foos and bars are both
collections. The last argument is an optional map of options.
Options:
:parallel - The degree of parallelism to use
See also: pigpen.core/join, pigpen.core/group-by
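The grouping behavior described above can be approximated locally with clojure.core alone. This is only a sketch of the semantics, not how PigPen implements cogroup; cogroup-local, foos, and bars are hypothetical names, and the 'required' modifier is not modeled.

```clojure
;; Hypothetical local analogue of cogroup for two collections.
(defn cogroup-local
  "Groups xs by kx and ys by ky, then calls f once per distinct key with
  the key and the (possibly empty) matches from each side."
  [xs kx ys ky f]
  (let [gx (clojure.core/group-by kx xs)
        gy (clojure.core/group-by ky ys)
        ks (clojure.core/distinct (clojure.core/concat (keys gx) (keys gy)))]
    (for [k ks]
      (f k (get gx k []) (get gy k [])))))

;; Made-up sample data.
(def foos [{:a 1 :x "p"} {:a 1 :x "q"} {:a 2 :x "r"}])
(def bars [{:b 1 :y "s"} {:b 3 :y "t"}])

;; One result per key; keys missing from one side get an empty collection.
(def result
  (set (cogroup-local foos :a bars :b
                      (fn [k fs bs] [k (count fs) (count bs)]))))
;; result is #{[1 2 1] [2 1 0] [3 0 1]}
```

Because both sides are optional here, keys 2 and 3 still appear even though each exists in only one input.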
concat
(concat relations+)
Concatenates all relations provided. Does not guarantee any ordering of the
relations. Identical to pigpen.core/union-multiset.
Example:
(pig/concat
(pig/return [1 2 2 3 3 3 4 5])
(pig/return [1 2 2 3 3])
(pig/return [1 1 2 2 3 3]))
=> [1 1 1 1 2 2 2 2 2 2 3 3 3 3 3 3 3 4 5]
See also: pigpen.core/union, pigpen.core/distinct, pigpen.core/union-multiset
constantly
(constantly data)
Returns a function that takes any number of arguments and returns a constant
set of data as if it had been loaded by pigpen. This is useful for testing,
but not supported in generated scripts. The parameter 'data' must be a sequence.
The values of 'data' can be any clojure type.
Example:
(pig/constantly [1 2 3])
(pig/constantly [{:a 123} {:b 456}])
See also: pigpen.core/return
difference
(difference opts? relations+)
Performs a set difference on all relations provided and returns the distinct
results. Optionally takes a map of options as the first parameter.
Example:
(pig/difference
(pig/return [1 2 2 3 3 3 4 5])
(pig/return [1 2])
(pig/return [3]))
=> [4 5]
Options:
:parallel - The degree of parallelism to use
See also: pigpen.core/difference-multiset, pigpen.core/intersection
difference-multiset
(difference-multiset opts? relations+)
Performs a multiset difference on all relations provided and returns all
results. Optionally takes a map of options as the first parameter.
Example:
(pig/difference-multiset
(pig/return [1 2 2 3 3 3 3 4 5])
(pig/return [1 2 3])
(pig/return [1 2 3]))
=> [3 3 4 5]
Options:
:parallel - The degree of parallelism to use
See also: pigpen.core/difference, pigpen.core/intersection
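One way to reason about the multiset example above is to subtract occurrence counts. The following is a clojure.core sketch under that assumption; difference-multiset-local is a hypothetical helper, not part of PigPen, and it sorts its output only to make the result easy to compare.

```clojure
;; Hypothetical local analogue: subtract per-value counts, clamping at zero.
(defn difference-multiset-local
  [& colls]
  (let [[head & others] (map frequencies colls)
        remaining (reduce (fn [acc freqs]
                            ;; only subtract counts for values still tracked
                            (merge-with - acc (select-keys freqs (keys acc))))
                          head
                          others)]
    (sort (mapcat (fn [[v n]] (repeat (max 0 n) v)) remaining))))

(def result
  (difference-multiset-local [1 2 2 3 3 3 3 4 5] [1 2 3] [1 2 3]))
;; result is (3 3 4 5)
```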
distinct
(distinct relation)
(distinct opts relation)
Returns a relation with the distinct values of relation. Optionally takes a
map of options.
Example:
(pig/distinct foo)
(pig/distinct {:parallel 20} foo)
Options:
:parallel - The degree of parallelism to use
See also: pigpen.core/union, pigpen.core/union-multiset, pigpen.core/filter
dump
(dump script)
Executes a script locally and returns the resulting values as a clojure
sequence. This command is very useful for unit tests.
Example:
(->>
(pig/load-clj "input.clj")
(pig/map inc)
(pig/filter even?)
(pig/dump)
(clojure.core/map #(* % %))
(clojure.core/filter even?))
(deftest test-script
(is (= (->>
(pig/load-clj "input.clj")
(pig/map inc)
(pig/filter even?)
(pig/dump))
[2 4 6])))
Note: pig/store commands return an empty set
pig/script commands merge their results
See also: pigpen.core/show, pigpen.core/dump&show
dump&show
(dump&show script)
Combines pig/show and pig/dump. This is useful so that the graph & resulting
script have the same ids.
dump&show+
(dump&show+ script)
Combines pig/show+ and pig/dump. This is useful so that the graph & resulting
script have the same ids.
dump-async
(dump-async script)
Executes a script asynchronously and prints the results to the console.
filter
macro
(filter pred relation)
Returns a relation that only contains the items for which (pred item)
returns true.
Example:
(pig/filter even? foo)
(pig/filter (fn [x] (even? (* x x))) foo)
See also: pigpen.core/remove, pigpen.core/take, pigpen.core/sample, pigpen.core/distinct
generate-script
(generate-script script)
(generate-script opts script)
Generates a Pig script from the relation specified and returns it as a string.
You can pass any relation to this and it will generate a Pig script - it doesn't
have to be an output. However, if there are no store commands, the script won't
do much. If you have more than one store command, use pigpen.core/script to
combine them. Optionally takes a map of options.
Example:
(pig/generate-script (pig/store-clj "output.clj" foo))
(pig/generate-script {:debug "/temp/"} (pig/store-clj "output.clj" foo))
Options:
:debug - Enables debugging, which writes the output of every step to a file.
The value is a path to place the debug output.
:dedupe - Set to false to disable command deduping.
See also: pigpen.core/write-script, pigpen.core/script
group-by
macro
(group-by key-selector relation)
(group-by key-selector opts relation)
Groups relation by the result of calling (key-selector item) for each item.
This produces a sequence of map entry values, similar to using seq with a
map. Each value will be a lazy sequence of the values that match key.
Optionally takes a map of options.
Example:
(pig/group-by :a foo)
(pig/group-by count {:parallel 20} foo)
Options:
:parallel - The degree of parallelism to use
See also: pigpen.core/cogroup
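The shape of the output can be previewed with clojure.core/group-by on an in-memory collection. This is an analogy only: foo here is a made-up vector rather than a relation, and the local values are vectors rather than lazy sequences.

```clojure
;; Made-up sample data standing in for a relation.
(def foo [{:a 1 :v "x"} {:a 2 :v "y"} {:a 1 :v "z"}])

;; Each element of the result is a map entry: [key values-with-that-key].
(def entries (seq (clojure.core/group-by :a foo)))

;; Destructure each entry the way a downstream function typically would.
(def counts-by-key
  (clojure.core/into {} (clojure.core/map (fn [[k vs]] [k (count vs)]) entries)))
;; counts-by-key is {1 2, 2 1}
```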
intersection
(intersection opts? relations+)
Performs an intersection on all relations provided and returns the distinct
results. Optionally takes a map of options as the first parameter.
Example:
(pig/intersection
(pig/return [1 2 2 3 3 3 4 5])
(pig/return [1 2 2 3 3])
(pig/return [1 1 2 2 3 3]))
=> [1 2 3]
Options:
:parallel - The degree of parallelism to use
See also: pigpen.core/intersection-multiset, pigpen.core/difference
intersection-multiset
(intersection-multiset opts? relations+)
Performs a multiset intersection on all relations provided and returns all
results. Optionally takes a map of options as the first parameter.
Example:
(pig/intersection-multiset
(pig/return [1 2 2 3 3 3 4 5])
(pig/return [1 2 2 3 3])
(pig/return [1 1 2 2 3 3]))
=> [1 2 2 3 3]
Options:
:parallel - The degree of parallelism to use
See also: pigpen.core/intersection, pigpen.core/difference
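The multiset result above is consistent with keeping each value the minimum number of times it occurs in any input. A clojure.core sketch under that reading follows; intersection-multiset-local is a hypothetical helper, not PigPen code, and the output is sorted only for readability.

```clojure
;; Hypothetical local analogue: keep each value min(count) times.
(defn intersection-multiset-local
  [& colls]
  (let [common (reduce (fn [acc m]
                         ;; restrict both maps to shared values, then take min
                         (merge-with min
                                     (select-keys acc (keys m))
                                     (select-keys m (keys acc))))
                       (map frequencies colls))]
    (sort (mapcat (fn [[v n]] (repeat n v)) common))))

(def result
  (intersection-multiset-local [1 2 2 3 3 3 4 5]
                               [1 2 2 3 3]
                               [1 1 2 2 3 3]))
;; result is (1 2 2 3 3)
```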
into
macro
(into to relation)
Returns a new relation with all values from relation conjoined onto to.
Note: This operation uses a single reducer and won't work for large datasets.
See also: pigpen.core/reduce
join
macro
(join selects+ f opts?)
Joins many relations together by a common key. Each relation specifies a
key-selector function on which to join. A function is applied to each join
key and each pair of values from each relation that match that join key.
Optionally takes a map of options.
Example:
(pig/join (foo on :a)
(bar on :b optional)
(fn [f b] ...)
{:parallel 20})
In this example, foo and bar are other pig queries and :a and :b are the
key-selector functions for foo and bar, respectively. These can be any
functions - not just keywords. There can be more than two select clauses.
By default, a matching key value from each source relation is required,
meaning that they must exist in all source relations to be part of the output.
To specify a relation as optional, add 'optional' to the select clause. The
third argument is a function used to consolidate matching key values. For each
unique key value, this function is called with each set of values from the cross
product of each source relation. By default, this does a standard inner join.
Use 'optional' to do outer joins. The last argument is an optional map of
options.
Options:
:parallel - The degree of parallelism to use
See also: pigpen.core/cogroup, pigpen.core/union
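The cross-product behavior of the default (inner) join can be sketched with clojure.core. This illustrates the semantics only; inner-join-local and the sample data are hypothetical, and 'optional' (outer join) handling is left out.

```clojure
;; Hypothetical local analogue of an inner join between two collections.
(defn inner-join-local
  "Calls f on every pair (x, y) where (kx x) equals (ky y)."
  [xs kx ys ky f]
  (let [gy (clojure.core/group-by ky ys)]
    (for [x xs
          y (get gy (kx x))] ; no match yields no pairs for this x
      (f x y))))

;; Made-up sample data.
(def foos [{:a 1 :x "p"} {:a 1 :x "q"} {:a 2 :x "r"}])
(def bars [{:b 1 :y "s"} {:b 1 :y "t"} {:b 3 :y "u"}])

(def joined
  (set (inner-join-local foos :a bars :b
                         (fn [f b] [(:x f) (:y b)]))))
;; joined is #{["p" "s"] ["p" "t"] ["q" "s"] ["q" "t"]}
```

Key 1 has two values on each side, so the function is called four times; keys 2 and 3 appear on only one side and produce nothing.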
load-clj
macro
(load-clj location)
Loads clojure data from a file. Each line should contain one value and will
be parsed using clojure.edn/read-string into a value.
Example:
(pig/load-clj "input.clj")
See also: pigpen.core/load-tsv
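To make the expected file layout concrete, the sketch below parses a string that stands in for a file's contents, one edn value per line. The sample text is invented for illustration, and this is not the loader's actual code.

```clojure
(require '[clojure.edn :as edn]
         '[clojure.string :as str])

;; Stand-in for the contents of a hypothetical input.clj file.
(def file-contents "{:a 1, :b \"foo\"}\n[1 2 3]\n42")

;; Parse each line independently into a Clojure value.
(def values (mapv edn/read-string (str/split-lines file-contents)))
;; values is [{:a 1, :b "foo"} [1 2 3] 42]
```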
load-lazy
(load-lazy location)
(load-lazy location delimiter)
Loads data from a tsv file. Each line is returned as a lazy seq, split by
the specified delimiter. The default delimiter is \t.
Note: The delimiter is wrapped with [^ ]+ to negate it for use with re-seq.
Thus, only simple delimiters are supported. Experimental & might not work.
Note: Internally this uses \u0000 as the split char so Pig won't split the line.
This won't work for files that actually contain that char.
See also: pigpen.core/load-tsv
load-pig
macro
(load-pig location fields)
Loads data stored in Pig format and converts it to the equivalent Clojure
data structures. The data is a tab-delimited file. 'fields' defines the name for
each input field. The data is returned as a map with 'fields' as the keys.
Example:
(pig/load-pig "input.pig" [a b c])
Note: This is extremely slow. Don't use it.
See also: pigpen.core/load-tsv, pigpen.core/load-clj
load-tsv
(load-tsv location)
(load-tsv location delimiter)
Loads data from a tsv file. Each line is returned as a vector of strings,
split by the specified regex delimiter. The default delimiter is #"\t".
Example:
(pig/load-tsv "input.tsv")
(pig/load-tsv "input.tsv" #",")
Note: Internally this uses \u0000 as the split char so Pig won't split the line.
This won't work for files that actually contain that char.
See also: pigpen.core/load-clj
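The per-line result can be pictured as a regex split. The sketch below uses clojure.string/split as a stand-in; split-line is a hypothetical helper, and the real loader may treat edge cases (such as trailing empty fields, which clojure.string/split drops) differently.

```clojure
(require '[clojure.string :as str])

;; Hypothetical helper mirroring the two arities of load-tsv for one line.
(defn split-line
  ([line] (split-line line #"\t"))
  ([line delimiter] (str/split line delimiter)))

(def tab-fields (split-line "a\tb\tc"))      ; ["a" "b" "c"]
(def comma-fields (split-line "1,2,3" #",")) ; ["1" "2" "3"]
```

Note that every field comes back as a string; numeric parsing is left to a later step.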
map
macro
(map f relation)
Returns a relation of f applied to every item in the source relation.
Function f should be a function of one argument.
Example:
(pig/map inc foo)
(pig/map (fn [x] (* x x)) foo)
Note: Unlike clojure.core/map, pigpen.core/map takes only one relation because
relations in pigpen have no defined order. See pig/join, pig/cogroup, and
pig/union for combining sets of data.
See also: pigpen.core/mapcat, pigpen.core/map-indexed, pigpen.core/join,
pigpen.core/cogroup, pigpen.core/union
map-indexed
macro
(map-indexed f relation)
(map-indexed f opts relation)
Returns a relation of f applied to the index and value of every item in
the source relation. Function f should be a function of two arguments: the index
and the value. If you require sequential ids, use option {:dense true}.
Example:
(pig/map-indexed (fn [i x] (* i x)) foo)
(pig/map-indexed vector {:dense true} foo)
Options:
:dense - force sequential ids
Note: If you require sorted data, use sort or sort-by immediately before
this command.
See also: pigpen.core/sort, pigpen.core/sort-by, pigpen.core/map, pigpen.core/mapcat
mapcat
macro
(mapcat f relation)
Applies f to each item in relation and concatenates (flattens) the results
into a single relation. Thus f should return a collection.
Example:
(pig/mapcat (fn [x] [(dec x) x (inc x)]) foo)
See also: pigpen.core/map, pigpen.core/map-indexed
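The flattening step is the same idea as clojure.core/mapcat. The sketch below runs the example function over an in-memory vector (a made-up stand-in for foo) to show that the per-item collections are merged into one sequence.

```clojure
;; Made-up input standing in for the relation foo.
(def foo [1 5])

;; Each item expands to three values; the results are concatenated.
(def expanded
  (clojure.core/mapcat (fn [x] [(dec x) x (inc x)]) foo))
;; expanded is (0 1 2 4 5 6)
```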
reduce
macro
(reduce f relation)
(reduce f val relation)
Reduces all items in relation into a single value. Follows semantics of
clojure.core/reduce. If a sequence is returned, it is kept as a single value
for further processing.
Example:
(pig/reduce + foo)
(pig/reduce conj [] foo)
Note: This operation uses a single reducer and won't work for large datasets.
See also: pigpen.core/into
remove
macro
(remove pred relation)
Returns a relation without items for which (pred item) returns true.
Example:
(pig/remove even? foo)
(pig/remove (fn [x] (even? (* x x))) foo)
See also: pigpen.core/filter, pigpen.core/take, pigpen.core/sample, pigpen.core/distinct
return
(return data)
Returns a constant set of data as a pigpen relation. This is useful for
testing, but not supported in generated scripts. The parameter 'data' must be a
sequence. The values of 'data' can be any clojure type.
Example:
(pig/return [1 2 3])
(pig/return [{:a 123} {:b 456}])
See also: pigpen.core/constantly
sample
(sample p relation)
Samples the input records, keeping roughly the fraction p. This is
non-deterministic; different values may be selected on subsequent runs. p
should be a value between 0.0 and 1.0.
Example:
(pig/sample 0.01 foo)
Note: This is potentially an expensive operation when run locally.
See also: pigpen.core/filter, pigpen.core/take
script
(script outputs+)
Combines multiple store commands into a single script. This is not required
if you have a single output.
Example:
(pig/script
(pig/store-tsv "foo.tsv" foo)
(pig/store-clj "bar.clj" bar))
Note: When run locally, this will merge the results of any source relations.
show
(show script)
Generates a graph image for a PigPen query. This allows you to see what steps
will be executed when the script is run. The image is opened in another window.
This command uses a terse description for each operation.
Example:
(pigpen.core/show foo)
See also: pigpen.core/show+, pigpen.core/dump&show
show+
(show+ script)
Generates a graph image for a PigPen query. This allows you to see what steps
will be executed when the script is run. The image is opened in another window.
This command uses a verbose description for each operation, including user code.
Example:
(pigpen.core/show+ foo)
See also: pigpen.core/show, pigpen.core/dump&show+
sort
macro
(sort relation)
(sort comp relation)
(sort comp opts relation)
Sorts the data with an optional comparator. Takes an optional map of options.
Example:
(pig/sort foo)
(pig/sort :desc foo)
(pig/sort :desc {:parallel 20} foo)
Notes:
The default comparator is :asc (ascending sort order).
Only :asc and :desc are supported comparators.
The values must be primitive values (string, int, etc).
Maps, vectors, etc are not supported.
Options:
:parallel - The degree of parallelism to use
See also: pigpen.core/sort-by
sort-by
macro
(sort-by key-fn relation)
(sort-by key-fn comp relation)
(sort-by key-fn comp opts relation)
Sorts the data by the specified key-fn with an optional comparator. Takes an
optional map of options.
Example:
(pig/sort-by :a foo)
(pig/sort-by #(count %) :desc foo)
(pig/sort-by (fn [x] (* x x)) :desc {:parallel 20} foo)
Notes:
The default comparator is :asc (ascending sort order).
Only :asc and :desc are supported comparators.
The key-fn values must be primitive values (string, int, etc).
Maps, vectors, etc are not supported.
Options:
:parallel - The degree of parallelism to use
See also: pigpen.core/sort
store-clj
macro
(store-clj location relation)
Stores the relation into location using edn (clojure format). Each value is
written as a single line.
Example:
(pig/store-clj "output.clj" foo)
See also: pigpen.core/store-tsv
See: https://github.com/edn-format/edn
store-pig
macro
(store-pig location relation)
Stores the relation into location as Pig formatted data.
Example:
(pig/store-pig "output.pig" foo)
Note: Pig formatted data is not idempotent. Don't use this.
See also: pigpen.core/store-clj, pigpen.core/store-tsv
store-tsv
(store-tsv location relation)
(store-tsv location delimiter relation)
Stores the relation into location as a tab-delimited file. Thus, each input
value must be sequential. Complex values are stored as edn (clojure format).
Single string values are not quoted. You may optionally pass a different delimiter.
Example:
(pig/store-tsv "output.tsv" foo)
(pig/store-tsv "output.csv" "," foo)
See also: pigpen.core/store-clj
See: https://github.com/edn-format/edn
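The line format described above can be pictured as follows. This is a sketch based only on the description (strings unquoted, other values printed as edn, fields joined by the delimiter); format-row is a hypothetical helper and the actual writer may differ in details.

```clojure
(require '[clojure.string :as str])

;; Hypothetical helper: render one sequential value as a delimited line.
(defn format-row
  ([row] (format-row "\t" row))
  ([delimiter row]
   (str/join delimiter
             (clojure.core/map (fn [v] (if (string? v) v (pr-str v))) row))))

(def tab-line (format-row ["a" 1 {:b 2}]))   ; "a\t1\t{:b 2}"
(def csv-line (format-row "," ["x" [1 2]]))  ; "x,[1 2]"
```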
take
(take n relation)
Limits the number of records to n items.
Example:
(pig/take 200 foo)
Note: This is potentially an expensive operation when run on the server.
See also: pigpen.core/filter, pigpen.core/sample
union
(union opts? relations+)
Performs a union on all relations provided and returns the distinct results.
Optionally takes a map of options as the first parameter.
Example:
(pig/union
(pig/return [1 2 2 3 3 3 4 5])
(pig/return [1 2 2 3 3])
(pig/return [1 1 2 2 3 3]))
=> [1 2 3 4 5]
Options:
:parallel - The degree of parallelism to use
See also: pigpen.core/union-multiset, pigpen.core/distinct
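The difference between union and union-multiset can be checked locally with clojure.core on the same three inputs. This is an analogy on in-memory vectors, not PigPen code; the results are sorted only because relations carry no guaranteed order.

```clojure
;; The three inputs from the example, as plain vectors.
(def a [1 2 2 3 3 3 4 5])
(def b [1 2 2 3 3])
(def c [1 1 2 2 3 3])

;; union-multiset keeps every occurrence; union keeps one of each value.
(def union-multiset-local (sort (clojure.core/concat a b c)))
(def union-local (sort (clojure.core/distinct (clojure.core/concat a b c))))
;; union-local is (1 2 3 4 5)
;; (count union-multiset-local) is 19
```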
union-multiset
(union-multiset relations+)
Performs a union on all relations provided and returns all results.
Identical to pigpen.core/concat.
Example:
(pig/union-multiset
(pig/return [1 2 2 3 3 3 4 5])
(pig/return [1 2 2 3 3])
(pig/return [1 1 2 2 3 3]))
=> [1 1 1 1 2 2 2 2 2 2 3 3 3 3 3 3 3 4 5]
See also: pigpen.core/union, pigpen.core/distinct, pigpen.core/concat
write-script
(write-script location script)
(write-script location opts script)
Generates a Pig script from the relation specified and writes it to location.
You can pass any relation to this and it will generate a Pig script - it doesn't
have to be an output. However, if there are no store commands, the script won't
do much. If you have more than one store command, use pigpen.core/script to
combine them. Optionally takes a map of options.
Example:
(pig/write-script "my-script.pig" (pig/store-clj "output.clj" foo))
(pig/write-script "my-script.pig" {:debug "/temp/"} (pig/store-clj "output.clj" foo))
Options:
:debug - Enables debugging, which writes the output of every step to a file.
The value is a path to place the debug output.
:dedupe - Set to false to disable command deduping.
See also: pigpen.core/generate-script, pigpen.core/script