pigpen.core documentation

cogroup

macro

(cogroup selects+ f opts?)
Joins many relations together by a common key. Each relation specifies a
key-selector function on which to join. A combiner function is applied to each
join key and all values from each relation that match that join key. This is
similar to join, without flattening the data. Optionally takes a map of options.

  Example:

    (pig/cogroup (foo on :a)
                 (bar on :b required)
                 (fn [key foos bars] ...)
                 {:parallel 20})

In this example, foo and bar are other pig queries and :a and :b are the
key-selector functions for foo and bar, respectively. These can be any
functions - not just keywords. There can be more than two select clauses.
By default, a matching key value from each source relation is optional,
meaning that keys don't have to exist in all source relations to be part of the
output. To specify a relation as required, add 'required' to the select clause.
The third argument is a function used to consolidate matching key values. For
each unique key value, this function is called with the value of the key and all
values with that key from foo and bar. As such, foos and bars are both
collections. The last argument is an optional map of options.

  Options:

    :parallel - The degree of parallelism to use

  See also: pigpen.core/join, pigpen.core/group-by
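
  The semantics described above can be modeled on in-memory collections. The
  following is a minimal sketch in plain Clojure, not pigpen's implementation.
  The name cogroup-model is hypothetical, every relation is treated as optional,
  and passing an empty vector for a relation with no matches is an assumption
  about how missing values are represented.

```clojure
;; Hypothetical model of cogroup semantics on plain collections.
;; Assumes every relation is optional; this is not pigpen code.
(defn cogroup-model
  [foos foo-key bars bar-key f]
  (let [foo-groups (group-by foo-key foos)
        bar-groups (group-by bar-key bars)
        ;; every key seen in either relation takes part in the output
        ks         (distinct (concat (keys foo-groups) (keys bar-groups)))]
    (for [k ks]
      ;; f receives the key plus one collection per relation
      (f k (get foo-groups k []) (get bar-groups k [])))))

(def cogroup-result
  (cogroup-model [{:a 1 :x "p"} {:a 1 :x "q"} {:a 2 :x "r"}] :a
                 [{:b 1 :y "s"} {:b 3 :y "t"}] :b
                 (fn [k foos bars] [k (count foos) (count bars)])))
```

  Here key 1 appears in both inputs, key 2 only in the first, and key 3 only in
  the second, yet all three keys reach the combiner function.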

concat

(concat relations+)
Concatenates all relations provided. Does not guarantee any ordering of the
relations. Identical to pigpen.core/union-multiset.

  Example:

    (pig/concat
      (pig/return [1 2 2 3 3 3 4 5])
      (pig/return [1 2 2 3 3])
      (pig/return [1 1 2 2 3 3]))

    => [1 1 1 1 2 2 2 2 2 2 3 3 3 3 3 3 3 4 5]

  See also: pigpen.core/union, pigpen.core/distinct, pigpen.core/union-multiset

constantly

(constantly data)
Returns a function that takes any number of arguments and returns a constant
set of data as if it had been loaded by pigpen. This is useful for testing,
but not supported in generated scripts. The parameter 'data' must be a sequence.
The values of 'data' can be any clojure type.

  Example:

    (pig/constantly [1 2 3])
    (pig/constantly [{:a 123} {:b 456}])

  See also: pigpen.core/return

difference

(difference opts? relations+)
Performs a set difference on all relations provided and returns the distinct
results. Optionally takes a map of options as the first parameter.

  Example:

    (pig/difference
      (pig/return [1 2 2 3 3 3 4 5])
      (pig/return [1 2])
      (pig/return [3]))

    => [4 5]

  Options:

    :parallel - The degree of parallelism to use

  See also: pigpen.core/difference-multiset, pigpen.core/intersection

difference-multiset

(difference-multiset opts? relations+)
Performs a multiset difference on all relations provided and returns all
results. Optionally takes a map of options as the first parameter.

  Example:

    (pig/difference-multiset
      (pig/return [1 2 2 3 3 3 3 4 5])
      (pig/return [1 2 3])
      (pig/return [1 2 3]))

    => [3 3 4 5]

  Options:

    :parallel - The degree of parallelism to use

  See also: pigpen.core/difference, pigpen.core/intersection
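
  One way to read the example above is in terms of occurrence counts: each
  value's count in the first relation is reduced by its count in every other
  relation. The sketch below models that on plain collections; the name
  difference-multiset-model is hypothetical and this is not how pigpen executes
  the operation.

```clojure
;; Hypothetical model of multiset difference using occurrence counts.
(defn difference-multiset-model
  [first-coll & more]
  (let [;; total number of times each value must be removed
        removed (apply merge-with + {} (map frequencies more))]
    (mapcat (fn [[v n]]
              ;; keep v as many times as it survives the subtraction
              (repeat (max 0 (- n (get removed v 0))) v))
            (frequencies first-coll))))

(def difference-multiset-result
  (sort (difference-multiset-model [1 2 2 3 3 3 3 4 5] [1 2 3] [1 2 3])))
```

  The result is sorted only to make it easy to compare; pigpen does not define
  an order for the output.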

distinct

(distinct relation)(distinct opts relation)
Returns a relation with the distinct values of relation. Optionally takes a
map of options.

  Example:

    (pig/distinct foo)
    (pig/distinct {:parallel 20} foo)

  Options:

    :parallel - The degree of parallelism to use

  See also: pigpen.core/union, pigpen.core/union-multiset, pigpen.core/filter

dump

(dump script)
Executes a script locally and returns the resulting values as a clojure
sequence. This command is very useful for unit tests.

  Example:

    (->>
      (pig/load-clj "input.clj")
      (pig/map inc)
      (pig/filter even?)
      (pig/dump)
      (clojure.core/map #(* % %))
      (clojure.core/filter even?))

    (deftest test-script
      (is (= (->>
               (pig/load-clj "input.clj")
               (pig/map inc)
               (pig/filter even?)
               (pig/dump))
             [2 4 6])))

  Note: pig/store commands return an empty set
        pig/script commands merge their results

  See also: pigpen.core/show, pigpen.core/dump&show

dump&show

(dump&show script)
Combines pig/show and pig/dump. This is useful so that the graph & resulting
script have the same ids.

dump&show+

(dump&show+ script)
Combines pig/show+ and pig/dump. This is useful so that the graph & resulting
script have the same ids.

dump-async

(dump-async script)
Executes a script asynchronously and prints the results to the console.

filter

macro

(filter pred relation)
Returns a relation that only contains the items for which (pred item)
returns true.

  Example:

    (pig/filter even? foo)
    (pig/filter (fn [x] (even? (* x x))) foo)

  See also: pigpen.core/remove, pigpen.core/take, pigpen.core/sample, pigpen.core/distinct

generate-script

(generate-script script)(generate-script opts script)
Generates a Pig script from the relation specified and returns it as a string.
You can pass any relation to this and it will generate a Pig script - it doesn't
have to be an output. However, if there are no store commands, the script won't
do much. If you have more than one store command, use pigpen.core/script to
combine them. Optionally takes a map of options.

  Example:

    (pig/generate-script (pig/store-clj "output.clj" foo))
    (pig/generate-script {:debug "/temp/"} (pig/store-clj "output.clj" foo))

  Options:

    :debug - Enables debugging, which writes the output of every step to a file.
             The value is a path to place the debug output.

    :dedupe - Set to false to disable command deduping.

  See also: pigpen.core/write-script, pigpen.core/script

group-by

macro

(group-by key-selector relation)(group-by key-selector opts relation)
Groups relation by the result of calling (key-selector item) for each item.
This produces a sequence of map entry values, similar to using seq with a
map. Each value will be a lazy sequence of the values that match key.
Optionally takes a map of options.

  Example:

    (pig/group-by :a foo)
    (pig/group-by count {:parallel 20} foo)

  Options:

    :parallel - The degree of parallelism to use

  See also: pigpen.core/cogroup
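
  The shape of the output can be illustrated with clojure.core/group-by and seq
  on an in-memory collection. This is only a sketch of the result shape; in
  pigpen the grouped values are lazy sequences and no output order is defined.

```clojure
;; Each element is a map entry: the key, then the values that produced it.
(def grouped
  (seq (clojure.core/group-by count ["a" "b" "cc" "ddd" "ee"])))

;; Entries can be destructured like any [key values] pair.
(def group-sizes
  (into {} (for [[k vs] grouped] [k (count vs)])))
```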

intersection

(intersection opts? relations+)
Performs an intersection on all relations provided and returns the distinct
results. Optionally takes a map of options as the first parameter.

  Example:

    (pig/intersection
      (pig/return [1 2 2 3 3 3 4 5])
      (pig/return [1 2 2 3 3])
      (pig/return [1 1 2 2 3 3]))

    => [1 2 3]

  Options:

    :parallel - The degree of parallelism to use

  See also: pigpen.core/intersection-multiset, pigpen.core/difference

intersection-multiset

(intersection-multiset opts? relations+)
Performs a multiset intersection on all relations provided and returns all
results. Optionally takes a map of options as the first parameter.

  Example:

    (pig/intersection-multiset
      (pig/return [1 2 2 3 3 3 4 5])
      (pig/return [1 2 2 3 3])
      (pig/return [1 1 2 2 3 3]))

    => [1 2 2 3 3]

  Options:

    :parallel - The degree of parallelism to use

  See also: pigpen.core/intersection, pigpen.core/difference
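
  The example above is consistent with keeping each value as many times as its
  smallest occurrence count across all relations. The sketch below models that
  on plain collections; the name intersection-multiset-model is hypothetical
  and this is not pigpen's implementation.

```clojure
;; Hypothetical model of multiset intersection using occurrence counts.
(defn intersection-multiset-model
  [& colls]
  (let [freqs (map frequencies colls)]
    (mapcat (fn [[v n]]
              ;; keep v as many times as its smallest count across all inputs
              (repeat (apply min n (map #(get % v 0) (rest freqs))) v))
            (first freqs))))

(def intersection-multiset-result
  (sort (intersection-multiset-model [1 2 2 3 3 3 4 5]
                                     [1 2 2 3 3]
                                     [1 1 2 2 3 3])))
```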

into

macro

(into to relation)
Returns a new relation with all values from relation conjoined onto to.

  Note: This operation uses a single reducer and won't work for large datasets.

  See also: pigpen.core/reduce

join

macro

(join selects+ f opts?)
Joins many relations together by a common key. Each relation specifies a
key-selector function on which to join. A function is applied to each join
key and each pair of values from each relation that match that join key.
Optionally takes a map of options.

  Example:

    (pig/join (foo on :a)
              (bar on :b optional)
              (fn [f b] ...)
              {:parallel 20})

In this example, foo and bar are other pig queries and :a and :b are the
key-selector functions for foo and bar, respectively. These can be any
functions - not just keywords. There can be more than two select clauses.
By default, a matching key value from each source relation is required,
meaning that keys must exist in all source relations to be part of the output.
To specify a relation as optional, add 'optional' to the select clause. The
third argument is a function used to consolidate matching key values. For each
unique key value, this function is called with each set of values from the cross
product of each source relation. By default, this does a standard inner join.
Use 'optional' to do outer joins. The last argument is an optional map of
options.

  Options:

    :parallel - The degree of parallelism to use

  See also: pigpen.core/cogroup, pigpen.core/union
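
  The default inner-join behavior can be modeled on in-memory collections. The
  following is a minimal sketch in plain Clojure for two relations; the name
  inner-join-model is hypothetical and this is not pigpen's implementation.

```clojure
;; Hypothetical model of the default (inner) join on plain collections.
(defn inner-join-model
  [foos foo-key bars bar-key f]
  (let [bar-groups (group-by bar-key bars)]
    (for [foo foos
          ;; only bars sharing foo's key take part; no match means no output
          bar (get bar-groups (foo-key foo))]
      ;; f is called once per matching pair, i.e. the per-key cross product
      (f foo bar))))

(def join-result
  (inner-join-model [{:a 1 :x "p"} {:a 1 :x "q"} {:a 2 :x "r"}] :a
                    [{:b 1 :y "s"} {:b 1 :y "t"} {:b 3 :y "u"}] :b
                    (fn [f b] [(:x f) (:y b)])))
```

  Key 1 has two values on each side, so the function is called four times. Keys
  2 and 3 each appear in only one input and produce nothing.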

load-clj

macro

(load-clj location)
Loads clojure data from a file. Each line should contain one value and will
be parsed using clojure.edn/read-string into a value.

  Example:

    (pig/load-clj "input.clj")

  See also: pigpen.core/load-tsv

load-lazy

(load-lazy location)(load-lazy location delimiter)
Loads data from a tsv file. Each line is returned as a lazy seq, split by
the specified delimiter. The default delimiter is \t.

  Note: The delimiter is wrapped with [^ ]+ to negate it for use with re-seq.
        Thus, only simple delimiters are supported. Experimental & might not work.

  Note: Internally this uses \u0000 as the split char so Pig won't split the line.
        This won't work for files that actually contain that char.

  See also: pigpen.core/load-tsv

load-pig

macro

(load-pig location fields)
Loads data stored in Pig format and converts it to the equivalent Clojure
data structures. The data is a tab-delimited file. 'fields' defines the name for
each input field. The data is returned as a map with 'fields' as the keys.

  Example:

    (pig/load-pig "input.pig" [a b c])

  Note: This is extremely slow. Don't use it.

  See also: pigpen.core/load-tsv, pigpen.core/load-clj

load-tsv

(load-tsv location)(load-tsv location delimiter)
Loads data from a tsv file. Each line is returned as a vector of strings,
split by the specified regex delimiter. The default delimiter is #"\t".

  Example:

    (pig/load-tsv "input.tsv")
    (pig/load-tsv "input.tsv" #",")

  Note: Internally this uses \u0000 as the split char so Pig won't split the line.
        This won't work for files that actually contain that char.

  See also: pigpen.core/load-clj
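
  The per-line result can be pictured with clojure.string/split. This is a
  sketch of the described output shape only; split-line-model is a hypothetical
  name, and whether pigpen keeps trailing empty fields is not stated here.

```clojure
(require '[clojure.string :as string])

;; Hypothetical model: one input line becomes a vector of strings.
(defn split-line-model
  ([line] (split-line-model line #"\t"))              ; default delimiter
  ([line delimiter] (string/split line delimiter)))   ; regex delimiter

(def tab-fields   (split-line-model "a\tb\tc"))
(def comma-fields (split-line-model "1,2,3" #","))
```

  Every field is a string, so numeric columns still need to be parsed in a
  later step.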

map

macro

(map f relation)
Returns a relation of f applied to every item in the source relation.
Function f should be a function of one argument.

  Example:

    (pig/map inc foo)
    (pig/map (fn [x] (* x x)) foo)

  Note: Unlike clojure.core/map, pigpen.core/map takes only one relation. This
        is because there is no defined order in pigpen. See pig/join,
        pig/cogroup, and pig/union for combining sets of data.

  See also: pigpen.core/mapcat, pigpen.core/map-indexed, pigpen.core/join,
            pigpen.core/cogroup, pigpen.core/union

map-indexed

macro

(map-indexed f relation)(map-indexed f opts relation)
Returns a relation of f applied to the index and value of every item in
the source relation. Function f should be a function of two arguments: the index
and the value. If you require sequential ids, use option {:dense true}.

  Example:

    (pig/map-indexed (fn [i x] (* i x)) foo)
    (pig/map-indexed vector {:dense true} foo)

  Options:

    :dense - force sequential ids

  Note: If you require sorted data, use sort or sort-by immediately before
        this command.

  See also: pigpen.core/sort, pigpen.core/sort-by, pigpen.core/map, pigpen.core/mapcat

mapcat

macro

(mapcat f relation)
Returns a relation containing the concatenated (flattened) results of applying
f to each item in relation. Thus f should return a collection.

  Example:

    (pig/mapcat (fn [x] [(dec x) x (inc x)]) foo)

  See also: pigpen.core/map, pigpen.core/map-indexed
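
  The flattening can be seen by running the same function through
  clojure.core/mapcat on an in-memory collection. This is a sketch of the
  result only; pigpen does not define an order for the output.

```clojure
;; Each input item yields three values, and the results are flattened
;; into one sequence rather than a sequence of vectors.
(def mapcat-result
  (clojure.core/mapcat (fn [x] [(dec x) x (inc x)]) [1 5]))
```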

reduce

macro

(reduce f relation)(reduce f val relation)
Reduces all items in relation into a single value. Follows semantics of
clojure.core/reduce. If a sequence is returned, it is kept as a single value
for further processing.

  Example:

    (pig/reduce + foo)
    (pig/reduce conj [] foo)

  Note: This operation uses a single reducer and won't work for large datasets.

  See also: pigpen.core/into

remove

macro

(remove pred relation)
Returns a relation without items for which (pred item) returns true.

  Example:

    (pig/remove even? foo)
    (pig/remove (fn [x] (even? (* x x))) foo)

  See also: pigpen.core/filter, pigpen.core/take, pigpen.core/sample, pigpen.core/distinct

return

(return data)
Returns a constant set of data as a pigpen relation. This is useful for
testing, but not supported in generated scripts. The parameter 'data' must be a
sequence. The values of 'data' can be any clojure type.

  Example:

    (pig/return [1 2 3])
    (pig/return [{:a 123} {:b 456}])

  See also: pigpen.core/constantly

sample

(sample p relation)
Samples the input records by the fraction p. This is non-deterministic;
different values may be selected on subsequent runs. p should be a value
between 0.0 and 1.0.

  Example:

    (pig/sample 0.01 foo)

  Note: This is potentially an expensive operation when run locally.

  See also: pigpen.core/filter, pigpen.core/take

script

(script outputs+)
Combines multiple store commands into a single script. This is not required
if you have a single output.

  Example:

    (pig/script
      (pig/store-tsv "foo.tsv" foo)
      (pig/store-clj "bar.clj" bar))

  Note: When run locally, this will merge the results of any source relations.

show

(show script)
Generates a graph image for a PigPen query. This allows you to see what steps
will be executed when the script is run. The image is opened in another window.
This command uses a terse description for each operation.

  Example:

    (pigpen.core/show foo)

  See also: pigpen.core/show+, pigpen.core/dump&show

show+

(show+ script)
Generates a graph image for a PigPen query. This allows you to see what steps
will be executed when the script is run. The image is opened in another window.
This command uses a verbose description for each operation, including user code.

  Example:

    (pigpen.core/show+ foo)

  See also: pigpen.core/show, pigpen.core/dump&show+

sort

macro

(sort relation)(sort comp relation)(sort comp opts relation)
Sorts the data with an optional comparator. Takes an optional map of options.

  Example:

    (pig/sort foo)
    (pig/sort :desc foo)
    (pig/sort :desc {:parallel 20} foo)

  Notes:
    The default comparator is :asc (ascending sort order).
    Only :asc and :desc are supported comparators.
    The values must be primitive values (string, int, etc).
    Maps, vectors, etc are not supported.

  Options:

    :parallel - The degree of parallelism to use

  See also: pigpen.core/sort-by

sort-by

macro

(sort-by key-fn relation)(sort-by key-fn comp relation)(sort-by key-fn comp opts relation)
Sorts the data by the specified key-fn with an optional comparator. Takes an
optional map of options.

  Example:

    (pig/sort-by :a foo)
    (pig/sort-by #(count %) :desc foo)
    (pig/sort-by (fn [x] (* x x)) :desc {:parallel 20} foo)

  Notes:
    The default comparator is :asc (ascending sort order).
    Only :asc and :desc are supported comparators.
    The key-fn values must be primitive values (string, int, etc).
    Maps, vectors, etc are not supported.

  Options:

    :parallel - The degree of parallelism to use

  See also: pigpen.core/sort

store-clj

macro

(store-clj location relation)
Stores the relation into location using edn (clojure format). Each value is
written as a single line.

  Example:

    (pig/store-clj "output.clj" foo)

  See also: pigpen.core/store-tsv

  See: https://github.com/edn-format/edn

store-pig

macro

(store-pig location relation)
Stores the relation into location as Pig formatted data.

  Example:

    (pig/store-pig "output.pig" foo)

  Note: Pig formatted data is not idempotent. Don't use this.

  See also: pigpen.core/store-clj, pigpen.core/store-tsv

store-tsv

(store-tsv location relation)(store-tsv location delimiter relation)
Stores the relation into location as a tab-delimited file. Thus, each input
value must be sequential. Complex values are stored as edn (clojure format).
Single string values are not quoted. You may optionally pass a different delimiter.

  Example:

    (pig/store-tsv "output.tsv" foo)
    (pig/store-tsv "output.csv" "," foo)

  See also: pigpen.core/store-clj

  See: https://github.com/edn-format/edn

take

(take n relation)
Limits the number of records to n items.

  Example:

    (pig/take 200 foo)

  Note: This is potentially an expensive operation when run on the server.

  See also: pigpen.core/filter, pigpen.core/sample

union

(union opts? relations+)
Performs a union on all relations provided and returns the distinct results.
Optionally takes a map of options as the first parameter.

  Example:

    (pig/union
      (pig/return [1 2 2 3 3 3 4 5])
      (pig/return [1 2 2 3 3])
      (pig/return [1 1 2 2 3 3]))

    => [1 2 3 4 5]

  Options:

    :parallel - The degree of parallelism to use

  See also: pigpen.core/union-multiset, pigpen.core/distinct
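
  The difference between union and union-multiset can be modeled on plain
  collections: the multiset form keeps every occurrence, while union keeps one
  of each value. The names union-model and union-multiset-model are
  hypothetical and this is not pigpen's implementation.

```clojure
;; Hypothetical models on plain collections; not pigpen code.
(defn union-multiset-model [& colls] (apply concat colls))
(defn union-model [& colls] (distinct (apply concat colls)))

(def union-inputs [[1 2 2 3 3 3 4 5] [1 2 2 3 3] [1 1 2 2 3 3]])

;; Sorted only for comparison; pigpen does not define an output order.
(def union-result (sort (apply union-model union-inputs)))
(def union-multiset-result (sort (apply union-multiset-model union-inputs)))
```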

union-multiset

(union-multiset relations+)
Performs a union on all relations provided and returns all results.
Identical to pigpen.core/concat.

  Example:

    (pig/union-multiset
      (pig/return [1 2 2 3 3 3 4 5])
      (pig/return [1 2 2 3 3])
      (pig/return [1 1 2 2 3 3]))

    => [1 1 1 1 2 2 2 2 2 2 3 3 3 3 3 3 3 4 5]

  See also: pigpen.core/union, pigpen.core/distinct, pigpen.core/concat

write-script

(write-script location script)(write-script location opts script)
Generates a Pig script from the relation specified and writes it to location.
You can pass any relation to this and it will generate a Pig script - it doesn't
have to be an output. However, if there are no store commands, the script won't
do much. If you have more than one store command, use pigpen.core/script to
combine them. Optionally takes a map of options.

  Example:

    (pig/write-script "my-script.pig" (pig/store-clj "output.clj" foo))
    (pig/write-script "my-script.pig" {:debug "/temp/"} (pig/store-clj "output.clj" foo))

  Options:

    :debug - Enables debugging, which writes the output of every step to a file.
             The value is a path to place the debug output.

    :dedupe - Set to false to disable command deduping.

  See also: pigpen.core/generate-script, pigpen.core/script