class English

Load models into a callable object to process English text. Intended use is for one instance to be created per process. You can create more if you're doing something unusual. You may wish to make the instance a global variable or "singleton". We usually instantiate the object in the main() function and pass it around as an explicit argument.

from spacy.en import English
from spacy._doc_examples import download_war_and_peace

# Get some raw unicode text to process.
unprocessed_unicode = download_war_and_peace()

# Load the models, then apply the full pipeline to the text.
nlp = English()
doc = nlp(unprocessed_unicode)
__init__(self, data_dir=True, Tagger=True, Parser=True, Entity=True, Matcher=True, Packer=None, load_vectors=True)

Load the resources. Loading takes 20 seconds, and the instance consumes 2 to 3 gigabytes of memory.

Load data from default directory:

>>> nlp = English()
>>> nlp = English(data_dir=u'')

Load data from specified directory:

>>> nlp = English(data_dir=u'path/to/data_directory')

Disable (and avoid loading) parts of the processing pipeline:

>>> nlp = English(load_vectors=False, Parser=False, Tagger=False, Entity=False)

Start with nothing loaded:

>>> nlp = English(data_dir=None)
__call__(self, text, tag=True, parse=True, entity=True)

The main entry point to spaCy. Takes raw unicode text, and returns a Doc object, which can be iterated to access Token and Span objects. spaCy's models are all linear-time, so you can supply documents of arbitrary length, e.g. whole novels.

from spacy.en import English
nlp = English()
doc = nlp(u'Some text.') # Applies tagger, parser, entity
doc = nlp(u'Some text.', parse=False) # Applies tagger and entity, not parser
doc = nlp(u'Some text.', entity=False) # Applies tagger and parser, not entity
doc = nlp(u'Some text.', tag=False) # Does not apply tagger, entity or parser
doc = nlp(u'') # Zero-length tokens, not an error
# doc = nlp(b'Some text') <-- Error: need unicode
doc = nlp(b'Some text'.decode('utf8')) # Decode to unicode first.
class Doc

A sequence of Token objects. Access sentences and named entities, export annotations to numpy arrays, losslessly serialize to compressed binary strings.

Internally, the Doc object holds an array of TokenC structs. The Python-level Token and Span objects are views of this array, i.e. they don't own the data themselves. These details of the internals shouldn't matter for the API – but they may help you read the code, and understand how spaCy is designed.

Constructors

via English.__call__(unicode text)
__init__(self, vocab, orth_and_spaces=None) This method of constructing a Doc object is usually only used for deserialization. Standard usage is to construct the document via a call to the language object.
  • vocab – A Vocab object, which must match any models you want to use (e.g. tokenizer, parser, entity recognizer).
  • orth_and_spaces – A list of (orth_id, has_space) tuples, where orth_id is an integer, and has_space is a boolean, indicating whether the token has a trailing space (see the sketch below).
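
A minimal sketch of direct construction (the import path for Doc is assumed here to be spacy.tokens.doc; the orth IDs are looked up through the vocabulary's string store):

from spacy.en import English
from spacy.tokens.doc import Doc

nlp = English()

# Look up the integer orth IDs for each word.
hello_id = nlp.vocab.strings[u'Hello']
world_id = nlp.vocab.strings[u'world']

# (orth_id, has_space) pairs: 'Hello' has a trailing space, 'world' does not.
doc = Doc(nlp.vocab, orth_and_spaces=[(hello_id, True), (world_id, False)])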

Sequence API

  • doc[i] Get the Token object at position i, where i is an integer. Negative indexing is supported, and follows the usual Python semantics, i.e. doc[-2] is doc[len(doc) - 2].
  • doc[start : end] Get a Span object, starting at position start and ending at position end. For instance, doc[2:5] produces a span consisting of tokens 2, 3 and 4. Stepped slices (e.g. doc[start : end : step]) are not supported, as Span objects must be contiguous (cannot have gaps).
  • for token in doc Iterate over Token objects, from which the annotations can be easily accessed. This is the main way of accessing Token objects, which are the main way annotations are accessed from Python. If faster-than-Python speeds are required, you can instead access the annotations as a numpy array, or access the underlying C data directly from Cython, via Doc.data, an array of TokenC structs. The C API has not yet been finalized, and is subject to change.
  • len(doc) The number of tokens in the document. A short sketch of this sequence API follows.
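
The sequence API in one short sketch (the example text, and the tokenization shown in the comments, are illustrative):

from spacy.en import English

nlp = English()
doc = nlp(u'Hello, world!')

assert len(doc) == 4         # 'Hello', ',', 'world', '!'
first = doc[0]               # Token at position 0
last = doc[-1]               # negative indexing: doc[len(doc) - 1]
span = doc[0:2]              # contiguous Span over tokens 0 and 1

for token in doc:            # the main way to reach the annotations
    print(token.orth_)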
  • Sentence, entity and noun chunk spans

    sents

    Yields sentence Span objects. Iterate over the span to get individual Token objects. Sentence spans have no label.

    >>> from spacy.en import English
    >>> nlp = English()
    >>> doc = nlp(u"This is a sentence. Here's another...")
    >>> for sentence in doc.sents:
    ...     sentence.root.orth_
    is
    's

    ents

    Yields named-entity Span objects. Iterate over the span to get individual Token objects, or access the label:

    >>> from spacy.en import English
    >>> nlp = English()
    >>> tokens = nlp(u'Mr. Best flew to New York on Saturday morning.')
    >>> ents = list(tokens.ents)
    >>> ents[0].label, ents[0].label_, ents[0].orth_, ents[0].string
    (112504, u'PERSON', u'Best', u'Best ')

    noun_chunks

    Yields base noun-phrase Span objects. A base noun phrase, or "NP chunk", is a noun phrase that does not permit other NPs to be nested within it – so no NP-level coordination, no prepositional phrases, and no relative clauses. For example:

    >>> from spacy.en import English
    >>> nlp = English()
    >>> doc = nlp(u'The sentence in this example has three noun chunks.')
    >>> for chunk in doc.noun_chunks:
    ...     print(chunk.label_, chunk.orth_, '<--', chunk.root.head.orth_)
    NP The sentence <-- has
    NP this example <-- in
    NP three noun chunks <-- has

    Export/Import

    to_array(attr_ids) Given a list of M attribute IDs, export the tokens to a numpy ndarray of shape N*M, where N is the length of the document.
    • attr_ids (list[int]) – A list of attribute ID ints. Attribute IDs can be imported from spacy.attrs.
    count_by(attr_id) Produce a dict of {attribute (int): count (int)} frequencies, keyed by the values of the given attribute ID.
    >>> from spacy.en import English, attrs
    >>> nlp = English()
    >>> tokens = nlp(u'apple apple orange banana')
    >>> tokens.count_by(attrs.ORTH)
    {12800L: 1, 11880L: 2, 7561L: 1}
    >>> tokens.to_array([attrs.ORTH])
    array([[11880],
           [11880],
           [7561],
           [12800]])
    from_array(attrs, array) Write to a Doc object, from an N*M array of attributes.
    from_bytes(byte_string) Deserialize, loading from bytes.
    to_bytes() Serialize, producing a byte string.
    read_bytes(file) A classmethod, used to read serialized Doc objects from a file, as shown in the sketch below.
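
    The byte-string serialization round-trips as in this sketch (a minimal example; the file name is illustrative, and the Doc import path is assumed to be spacy.tokens.doc):

    from spacy.en import English
    from spacy.tokens.doc import Doc

    nlp = English()

    # Serialize two documents into one binary file.
    with open('docs.bin', 'wb') as file_:
        file_.write(nlp(u'This is a document.').to_bytes())
        file_.write(nlp(u'This is another.').to_bytes())

    # read_bytes yields one byte string per stored Doc.
    docs = []
    with open('docs.bin', 'rb') as file_:
        for byte_string in Doc.read_bytes(file_):
            doc = Doc(nlp.vocab)
            doc.from_bytes(byte_string)
            docs.append(doc)
    assert len(docs) == 2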
    class Token

    A Token represents a single word, punctuation or significant whitespace symbol. Integer IDs are provided for all string features. The (unicode) string is provided by an attribute of the same name followed by an underscore, e.g. token.orth is an integer ID, token.orth_ is the unicode value. The only exception is the Token.string attribute, which is (unicode) string-typed.

    String Features

    Boolean Flags

    check_flag(flag_id) Get the value of one of the boolean flags.

    Distributional Features

    Alignment and Output

    Navigating the Parse Tree

  • head The immediate syntactic head of the token. If the token is the root of its sentence, it is the token itself, i.e. root_token.head is root_token.
  • children An iterator that yields from lefts, and then yields from rights.
  • subtree An iterator for the part of the sentence syntactically governed by the word, including the word itself.
  • left_edge The leftmost edge of the token's subtree.
  • right_edge The rightmost edge of the token's subtree.
  • nbor(i=1) Get the ith next / previous neighboring token. A short sketch of these attributes follows this list.
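
    The following is a rough sketch of walking the parse tree (the parse shown in the comments is what the model typically predicts for this sentence, not a guarantee):

    from spacy.en import English

    nlp = English()
    doc = nlp(u'The quick fox jumped over the lazy dog.')

    fox = doc[2]
    jumped = doc[3]

    # The root's head is the root itself.
    print(jumped.head is jumped)                   # True

    # children yields left dependents, then right dependents.
    print([w.orth_ for w in jumped.children])

    # subtree covers the word and everything it governs, in order.
    print([w.orth_ for w in fox.subtree])          # typically: The, quick, fox
    print(fox.left_edge.orth_, fox.right_edge.orth_)

    # nbor gets a linear neighbor: the next token by default.
    print(fox.nbor().orth_)                        # 'jumped'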

    Named Entities

    Constructors

    __init__(self, vocab, doc, offset)
    • vocab – A Vocab object
    • doc – The parent sequence
    • offset (int) – The index of the token within the document
    class Span

    A Span is a slice of a Doc object, consisting of zero or more tokens. Spans are used to represent sentences, named entities, phrases, and arbitrary contiguous slices from the Doc object. Span objects are views – that is, they do not copy the underlying C data. This makes them cheap to construct, as internally they are simply a reference to the Doc object, a start position, an end position, and a label ID.
  • token = span[i] Get the Token object at position i, where i is an offset within the Span, not the document. That is:
    span = doc[4:6]
    token = span[0]
    assert token.i == 4
  • Navigating the Parse Tree

  • subtree Tokens in the range (start, end+1), where start is the index of the leftmost word descended from a token in the span, and end is the index of the rightmost token descended from a token in the span.
  • Constructors

    __init__ Spans are usually constructed by slicing the Doc object, e.g.:

    span = doc[0:4]

    String Views

    • string – String
    • lemma / lemma_ – String
    • label / label_ – String

    class Lexeme

    The Lexeme object represents a lexical type, stored in the vocabulary – as opposed to a token, occurring in a document.

    Lexemes store various features, so that these features can be computed once per type, rather than once per token. As job sizes grow, this can amount to a substantial efficiency improvement.

    All Lexeme attributes are therefore context-independent, as a single lexeme is reused for all usages of that word. Lexemes are keyed by the "orth" attribute.

    All Lexeme attributes are accessible directly on the Token object.
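
    For example, a brief sketch (assuming the vocabulary supports lookup by unicode string, so that nlp.vocab[u'apple'] returns the Lexeme for that type):

    from spacy.en import English

    nlp = English()

    # Compute the features once per type, in the vocabulary...
    apple = nlp.vocab[u'apple']
    print(apple.orth, apple.orth_, apple.is_alpha)

    # ...or reach the same context-independent attributes through any token.
    doc = nlp(u'I ate an apple.')
    token = doc[3]
    print(token.orth, token.orth_, token.is_alpha)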

    String Features

    Boolean Features

    Distributional Features

    Constructors

    __init__

    class Vocab

    Constructors

    __init__

    Save and Load

    dump(loc)
    • loc (unicode) – Path where the vocabulary should be saved
    load_lexemes(loc)
    • loc (unicode) – Path to load the lexemes.bin file from
    load_vectors(loc)
    • loc (unicode) – Path to load the vectors.bin file from
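
    A minimal sketch of saving and restoring the lexeme table, following the signatures above (the path is illustrative):

    from spacy.en import English

    nlp = English()

    # Persist the lexeme table to a binary file...
    nlp.vocab.dump(u'/tmp/lexemes.bin')

    # ...and load it back into a vocabulary later.
    nlp.vocab.load_lexemes(u'/tmp/lexemes.bin')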
    class StringStore

    Intern strings, and map them to sequential integer IDs. The mapping table is very efficient, and a small-string optimization is used to maintain a small memory footprint. Only the integer IDs are held by spaCy's data classes (Doc, Token, Span and Lexeme) – when you use a string-valued attribute like token.orth_, you access a property that computes token.vocab.strings[token.orth].

    Constructors

    StringStore.__init__ takes no arguments, so a new instance can be constructed as follows:

    string_store = StringStore()

    However, in practice you'll usually use the instance owned by the language's vocab object, which all classes hold a reference to – for instance, string_store = nlp.vocab.strings.

    If you create another instance, it will map strings to different integers – which is usually not what you want.
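
    A short sketch of the mapping in both directions, using the vocab-owned instance (it is assumed here that indexing with a unicode string returns the integer ID, and indexing with an integer returns the string):

    from spacy.en import English

    nlp = English()
    string_store = nlp.vocab.strings

    # unicode -> integer ID: the same string always maps to the same ID.
    apple_id = string_store[u'apple']
    assert string_store[u'apple'] == apple_id

    # integer ID -> unicode: recover the original string.
    assert string_store[apple_id] == u'apple'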

    Save and Load

    dump(loc)

    Save the strings mapping to the given location, in plain text. The format is subject to change; if you need to read/write compatible files, you can find details in the strings.pyx source.

    load(loc)

    Load the strings mapping from a plain-text file in the given location. The format is subject to change; if you need to read/write compatible files, you can find details in the strings.pyx source.