Configure text analyzers

Bayard can analyze text by combining the built-in tokenizers and filters described below into text analyzers.
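Each analyzer is declared under a name of your own choosing and pairs one tokenizer with a list of filters. As a rough sketch of the overall shape (the analyzer name my_analyzer below is a placeholder; the available tokenizers and filters are described in the rest of this section):

{
  "my_analyzer": {
    "tokenizer": {
      "name": "simple"
    },
    "filters": [
      {
        "name": "lower_case"
      }
    ]
  }
}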

Tokenizers

Tokenizers are responsible for breaking field data into lexical units, or tokens.

raw

For each value of the field, emit a single unprocessed token.

{
  "name": "raw"
}

simple

Tokenize the text by splitting on whitespace and punctuation.

{
  "name": "simple"
}

ngram

Tokenize the text by splitting words into n-grams of the given size(s).

  • min_gram:
         Minimum size of the n-grams.

  • max_gram:
         Maximum size of the n-grams.

  • prefix_only:
         If true, only n-grams from the leading edge of the input are emitted (see the prefix-only sketch below).

{
  "name": "ngram",
  "args": {
    "min_gram": 1,
    "max_gram": 3,
    "prefix_only": false
  }
}
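With prefix_only set to true, only n-grams from the leading edge of the input are emitted, which can be useful for prefix (autocomplete-style) matching. A sketch of such a configuration (the gram sizes are arbitrary examples):

{
  "name": "ngram",
  "args": {
    "min_gram": 2,
    "max_gram": 10,
    "prefix_only": true
  }
}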

facet

Processes a facet's binary representation and emits a token for each of its parents.

{
  "name": "facet"
}

cang_jie

A Chinese tokenizer based on jieba-rs.

  • hmm:
         Whether to enable HMM (Hidden Markov Model) word segmentation.

  • tokenizer_option:
         Tokenization mode. One of the following:

    • all:
           Cut the input text, returning all possible words.

    • default:
           Cut the input text using the default mode.

    • search:
           Cut the input text in search mode.

    • unicode:
           Cut the input text into individual Unicode characters.

{
  "name": "cang_jie",
  "args": {
    "hmm": false,
    "tokenizer_option": "search"
  }
}

lindera

A tokenizer based on Lindera, a morphological analysis library.

  • mode:
         Tokenization mode.

    • normal:
           Tokenize faithfully based on words registered in the dictionary. (Default)

    • decompose:
           Additionally decompose compound nouns into their component words.

  • dict:
         Path to a pre-built dictionary directory to use instead of the default dictionary (IPADIC). Refer to the following repositories for building a dictionary:
         - Lindera IPADIC Builder (Japanese)
         - Lindera IPADIC NEologd Builder (Japanese)
         - Lindera UniDic Builder (Japanese)
         - Lindera ko-dic Builder (Korean)

{
  "name": "lindera",
  "args": {
    "mode": "decompose"
  }
}
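To use a custom dictionary built with one of the builders listed above, point dict at its directory. A sketch, assuming a hypothetical dictionary path:

{
  "name": "lindera",
  "args": {
    "mode": "normal",
    "dict": "/path/to/custom-dictionary"
  }
}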

Filters

Filters examine a stream of tokens and keep, transform, or discard them, depending on the filter type being used.

alpha_num_only

Removes all tokens that contain non-ASCII-alphanumeric characters, i.e. keeps only tokens made up entirely of ASCII letters and digits.

{
  "name": "alpha_num_only"
}

ascii_folding

Converts alphabetic, numeric, and symbolic Unicode characters which are not in the first 127 ASCII characters (the "Basic Latin" Unicode block) into their ASCII equivalents, if one exists.

{
  "name": "ascii_folding"
}

lower_case

Converts terms to lowercase.

{
  "name": "lower_case"
}

remove_long

Removes tokens that are longer than a given number of bytes (in their UTF-8 representation). This is especially useful when indexing unconstrained content, e.g. mail containing base64-encoded pictures.

  • length_limit:
         A limit in bytes of the UTF-8 representation.

{
  "name": "remove_long",
  "args": {
    "length_limit": 40
  }
}

stemming

Stemming token filter. Several languages are supported. Tokens are expected to be lowercased beforehand.

  • stemmer_algorithm:
         The stemming algorithm (language) to apply. One of the following:

    • arabic
    • danish
    • dutch
    • english
    • finnish
    • french
    • german
    • greek
    • hungarian
    • italian
    • norwegian
    • portuguese
    • romanian
    • russian
    • spanish
    • swedish
    • tamil
    • turkish

{
  "name": "stemming",
  "args": {
    "stemmer_algorithm": "english"
  }
}

stop_word

Removes stop words from a token stream.

  • words:
         A list of words to remove.

{
  "name": "stop_word",
  "args": {
    "words": [
      "a", "an", "and", "are", "as", "at", "be", "but", "by", "for", "if", "in", "into",
      "is", "it", "no", "not", "of", "on", "or", "such", "that", "the", "their", "then",
      "there", "these", "they", "this", "to", "was", "will", "with"
    ]
  }
}

Text Analyzers

A text analyzer combines a tokenizer with zero or more filters and uses them to process the text of a field.
For example, an English-language analyzer named lang_en can be defined as follows:

{
  "lang_en": {
    "tokenizer": {
      "name": "simple"
    },
    "filters": [
      {
        "name": "remove_long",
        "args": {
          "length_limit": 40
        }
      },
      {
        "name": "ascii_folding"
      },
      {
        "name": "lower_case"
      },
      {
        "name": "stemming",
        "args": {
          "stemmer_algorithm": "english"
        }
      },
      {
        "name": "stop_word",
        "args": {
          "words": [
            "a", "an", "and", "are", "as", "at", "be", "but", "by", "for", "if", "in", "into",
            "is", "it", "no", "not", "of", "on", "or", "such", "that", "the", "their", "then",
            "there", "these", "they", "this", "to", "was", "will", "with"
          ]
        }
      }
    ]
  }
}

A field that uses the above text analyzer is defined as follows:

[
  {
    "name": "description",
    "type": "text",
    "options": {
      "indexing": {
        "record": "position",
        "tokenizer": "lang_en"
      },
      "stored": true
    }
  }
]
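
Analyzers for other languages are composed in the same way. For example, a sketch of a Japanese analyzer built on the lindera tokenizer (the analyzer name lang_ja and the choice of filters here are illustrative, not prescribed):

{
  "lang_ja": {
    "tokenizer": {
      "name": "lindera",
      "args": {
        "mode": "normal"
      }
    },
    "filters": [
      {
        "name": "remove_long",
        "args": {
          "length_limit": 40
        }
      },
      {
        "name": "lower_case"
      }
    ]
  }
}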