English
Load models into a callable object to process English text. Intended use is for one instance to be created per process. You can create more if you're doing something unusual. You may wish to make the instance a global variable or "singleton". We usually instantiate the object in the main() function and pass it around as an explicit argument.
from spacy.en import English
from spacy._doc_examples import download_war_and_peace
unprocessed_unicode = download_war_and_peace()
nlp = English()
doc = nlp(unprocessed_unicode)
__init__
self, data_dir=True, Tagger=True, Parser=True, Entity=True, Matcher=True, Packer=None, load_vectors=True
Load the resources. Loading takes 20 seconds, and the instance consumes 2 to 3 gigabytes of memory.
Load data from default directory:
>>> nlp = English()
>>> nlp = English(data_dir=u'')
Load data from specified directory:
>>> nlp = English(data_dir=u'path/to/data_directory')
Disable (and avoid loading) parts of the processing pipeline:
>>> nlp = English(load_vectors=False, Parser=False, Tagger=False, Entity=False)
Start with nothing loaded:
>>> nlp = English(data_dir=None)
Tagger
True, to load the default tagger. If falsey, no tagger is loaded.
You can also supply your own class/function, which will be called once on setup. The returned function will then be called in English.__call__. The function passed must accept two arguments, of types (StringStore, directory), and produce a function that accepts one argument, of type Doc. Its return type is unimportant.
Parser
True, to load the default parser. If falsey, no parser is loaded.
You can also supply your own class/function, which will be called once on setup. The returned function will then be called in English.__call__. The function passed must accept two arguments, of types (StringStore, directory), and produce a function that accepts one argument, of type Doc. Its return type is unimportant.
Entity
True, to load the default entity recognizer. If falsey, no entity recognizer is loaded.
You can also supply your own class/function, which will be called once on setup. The returned function will then be called in English.__call__. The function passed must accept two arguments, of types (StringStore, directory), and produce a function that accepts one argument, of type Doc. Its return type is unimportant.
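For illustration, here is a minimal sketch of what such a hook might look like, following the signature described above. The names make_my_tagger and apply_my_tagger are invented, and the body is left empty; how you actually write annotations onto the Doc depends on your component.
from spacy.en import English

def make_my_tagger(string_store, data_dir):
    # Called once on setup, with a StringStore and the data directory.
    # Load any resources here; they are captured by the closure below.
    def apply_my_tagger(doc):
        # Called from English.__call__ with each Doc. Annotate the tokens
        # in place; the return value is ignored.
        for token in doc:
            pass
    return apply_my_tagger

nlp = English(Tagger=make_my_tagger)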
__call__
text, tag=True, parse=True, entity=True
The main entry point to spaCy. Takes raw unicode text, and returns a Doc object, which can be iterated to access Token and Span objects. spaCy's models are all linear-time, so you can supply documents of arbitrary length, e.g. whole novels.
from spacy.en import English
nlp = English()
doc = nlp(u'Some text.') # Applies tagger, parser, entity
doc = nlp(u'Some text.', parse=False) # Applies tagger and entity, not parser
doc = nlp(u'Some text.', entity=False) # Applies tagger and parser, not entity
doc = nlp(u'Some text.', tag=False) # Does not apply tagger, entity or parser
doc = nlp(u'') # Zero-length tokens, not an error
# doc = nlp(b'Some text') <-- Error: need unicode
doc = nlp(b'Some text'.decode('utf8')) # Decode to unicode first.
Doc
A sequence of Token objects. Access sentences and named entities, export annotations to numpy arrays, losslessly serialize to compressed binary strings.
Internally, the Doc object holds an array of TokenC structs. The Python-level Token and Span objects are views of this array, i.e. they don't own the data themselves. These details of the internals shouldn't matter for the API – but it may help you read the code, and understand how spaCy is designed.
Usually created via English.__call__(unicode text).
__init__
self, vocab, orth_and_spaces=None
This method of constructing a Doc object is usually only used for deserialization. Standard usage is to construct the document via a call to the language object.
orth_and_spaces: a list of (orth_id, has_space) tuples, where orth_id is an integer, and has_space is a boolean, indicating whether the token has a trailing space.
doc[i]
Get the Token object at position i, where i is an integer. Negative indexing is supported, and follows the usual Python semantics, i.e. doc[-2] is doc[len(doc) - 2].
doc[start : end]
Get a Span object, starting at position start and ending at position end. For instance, doc[2:5] produces a span consisting of tokens 2, 3 and 4. Stepped slices (e.g. doc[start : end : step]) are not supported, as Span objects must be contiguous (cannot have gaps).
for token in doc
Iterate over Token objects, from which the annotations can be easily accessed. This is the main way annotations are accessed from Python. If faster-than-Python speeds are required, you can instead access the annotations as a numpy array, or access the underlying C data directly from Cython, via Doc.data, an array of TokenC structs. The C API has not yet been finalized, and is subject to change.
len(doc)
The number of tokens in the document.
sents
Yields sentence Span objects. Iterate over the span to get individual Token objects. Sentence spans have no label.
>>> from spacy.en import English
>>> nlp = English()
>>> doc = nlp(u"This is a sentence. Here's another...")
>>> for sentence in doc.sents:
...     print(sentence.root.orth_)
is
's
ents
Yields named-entity Span objects. Iterate over the span to get individual Token objects, or access the label:
>>> from spacy.en import English
>>> nlp = English()
>>> tokens = nlp(u'Mr. Best flew to New York on Saturday morning.')
>>> ents = list(tokens.ents)
>>> ents[0].label, ents[0].label_, ents[0].orth_, ents[0].string
(112504, 'PERSON', 'Best', 'Best ')
noun_chunks
Yields base noun-phrase Span objects. A base noun phrase, or "NP chunk", is a noun phrase that does not permit other NPs to be nested within it – so no NP-level coordination, no prepositional phrases, and no relative clauses. For example:
>>> from spacy.en import English
>>> nlp = English()
>>> doc = nlp(u'The sentence in this example has three noun chunks.')
>>> for chunk in doc.noun_chunks:
...     print(chunk.label_, chunk.orth_, '<--', chunk.root.head.orth_)
NP The sentence <-- has
NP this example <-- in
NP three noun chunks <-- has
to_array
attr_ids
Given a list of attribute IDs, export the token annotations to a numpy ndarray with one row per token and one column per attribute. Attribute IDs can be imported from spacy.attrs.
count_by
attr_id
Produce a dict of {attribute (int): count (int)} frequencies, keyed by the values of the given attribute ID.
>>> from spacy.en import English, attrs
>>> nlp = English()
>>> tokens = nlp(u'apple apple orange banana')
>>> tokens.count_by(attrs.ORTH)
{12800L: 1, 11880L: 2, 7561L: 1}
>>> tokens.to_array([attrs.ORTH])
array([[11880],
[11880],
[7561],
[12800]])
from_array
attrs, array
Write to the Doc object, from an M*N array of attributes.
from_bytes
Deserialize, loading the Doc's annotations from a byte string.
to_bytes
Serialize, producing a compressed byte string that can be restored with from_bytes.
read_bytes
A staticmethod, used to read serialized Doc objects from a file.
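A sketch of how these methods fit together, continuing from the nlp and doc variables above, and assuming Doc can be imported from spacy.tokens; it uses the __init__ signature shown earlier:
>>> from spacy.tokens import Doc
>>> byte_string = doc.to_bytes()        # serialize the annotations
>>> new_doc = Doc(nlp.vocab)            # an empty Doc sharing the same vocab
>>> new_doc.from_bytes(byte_string)     # restore the annotations into it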
Token
String features are provided in pairs: token.orth is an integer ID, token.orth_ is the unicode value. The only exception is the Token.string attribute, which is (unicode) string-typed.
lemma / lemma_
The "base" of the word, with no inflectional suffixes, e.g. the lemma of "developing" is "develop", the lemma of "geese" is "goose", etc. Note that derivational suffixes are not stripped, e.g. the lemma of "instutitions" is "institution", not "institute". Lemmatization is performed using the WordNet data, but extended to also cover closed-class words such as pronouns. By default, the WN lemmatizer returns "hi" as the lemma of "his". We assign pronouns the lemma -PRON-
.
orth / orth_
The form of the word with no string normalization or processing, as it appears in the string, without trailing whitespace.
lower / lower_
The form of the word, but forced to lower-case, i.e. lower = word.orth_.lower()
shape / shape_
A transform of the word's string, to show orthographic features. The characters a-z are mapped to x, A-Z is mapped to X, 0-9 is mapped to d. After these mappings, sequences of 4 or more of the same character are truncated to length 4. Examples: C3Po --> XdXx, favorite --> xxxx, :) --> :)
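The transform is simple enough to sketch in pure Python. This is only an illustration of the mapping described above, not spaCy's internal implementation:
import re

def word_shape(text):
    # a-z -> x, A-Z -> X, 0-9 -> d; all other characters are kept as-is.
    shape = ''.join('x' if 'a' <= c <= 'z' else
                    'X' if 'A' <= c <= 'Z' else
                    'd' if '0' <= c <= '9' else c
                    for c in text)
    # Sequences of 4 or more of the same character are truncated to length 4.
    return re.sub(r'(.)\1{4,}', lambda m: m.group(1) * 4, shape)

assert word_shape('C3Po') == 'XdXx'
assert word_shape('favorite') == 'xxxx'
assert word_shape(':)') == ':)'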
prefix / prefix_
A length-N substring from the start of the word. Length may vary by language; currently for English n=1, i.e. prefix = word.orth_[:1]
suffix / suffix_
A length-N substring from the end of the word. Length may vary by language; currently for English n=3, i.e. suffix = word.orth_[-3:]
is_alpha
Equivalent to word.orth_.isalpha()
is_ascii
Equivalent to all(ord(c) < 128 for c in word.orth_)
is_digit
Equivalent to word.orth_.isdigit()
is_lower
Equivalent to word.orth_.islower()
is_title
Equivalent to word.orth_.istitle()
is_punct
True if the word consists only of punctuation characters.
is_space
Equivalent to word.orth_.isspace()
like_url
Does the word resemble a URL?
like_num
Does the word represent a number? e.g. “10.9”, “10”, “ten”, etc.
like_email
Does the word resemble an email?
is_oov
Is the word out-of-vocabulary?
check_flag
flag_id
Check the value of the boolean flag with the given flag ID.
prob
The unigram log-probability of the word, estimated from counts from a large corpus, smoothed using Simple Good Turing estimation.
cluster
The Brown cluster ID of the word. These are often useful features for linear models. If you’re using a non-linear model, particularly a neural net or random forest, consider using the real-valued word representation vector, in Token.repvec, instead.
repvec
A “word embedding” representation: a dense real-valued vector that supports similarity queries between words. By default, spaCy currently loads vectors produced by the Levy and Goldberg (2014) dependency-based word2vec model.
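For example, repvec can be used for simple similarity comparisons. This is a sketch: it assumes vectors were loaded (load_vectors was not disabled), and the cosine helper is our own, not part of the API.
>>> import numpy
>>> def cosine(a, b):
...     return numpy.dot(a, b) / (numpy.linalg.norm(a) * numpy.linalg.norm(b))
>>> doc = nlp(u'apple orange chair')
>>> apple, orange, chair = doc
>>> # We would expect the two fruit words to be closer to each other than to 'chair':
>>> cosine(apple.repvec, orange.repvec) > cosine(apple.repvec, chair.repvec)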
idx
Start index of the token in the string
len(token)
Length of the token's orth string, in unicode code-points.
unicode(token)
Same as token.orth_
str(token)
In Python 3, returns token.orth_. In Python 2, returns token.orth_.encode('utf8')
string
token.orth_ + token.whitespace_, i.e. the form of the word as it appears in the string, including trailing whitespace.
whitespace_
The trailing whitespace character(s) following the word in the original string, if any.
head
The immediate syntactic head of the token. If the token is the root of its sentence, it is the token itself, i.e. root_token.head is root_token
children
An iterator that yields from lefts, and then yields from rights.
subtree
An iterator for the part of the sentence syntactically governed by the word, including the word itself.
left_edge
The leftmost edge of the token's subtree
right_edge
The rightmost edge of the token's subtree
nbor(i=1)
Get the token i positions away from this one in the document, e.g. token.nbor() returns the token immediately following.
ent_type
If the token is part of an entity, its entity type.
ent_iob
The IOB (inside, outside, begin) entity recognition tag for the token.
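To illustrate the tree-navigation attributes (head, children, subtree) described above, here is a sketch using the same sentence as the Span.root example below; that example shows that the head of 'New' is 'York', and the head of 'York' is 'like'.
>>> doc = nlp(u'I like New York in Autumn.')
>>> york = doc[3]
>>> york.head.orth_
'like'
>>> [w.orth_ for w in york.children]    # includes 'New', whose head is 'York'
>>> [w.orth_ for w in york.subtree]     # the subtree includes 'York' itself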
__init__
vocab, doc, offset
Span
Span is a slice of a Doc object, consisting of zero or more tokens. Spans are used to represent sentences, named entities, phrases, and arbitrary contiguous slices from the Doc object. Span objects are views – that is, they do not copy the underlying C data. This makes them cheap to construct, as internally they are simply a reference to the Doc object, a start position, an end position, and a label ID.
token = span[i]
Get the Token object at position i, where i is an offset within the Span, not the document. That is:
span = doc[4:6]
token = span[0]
assert token.i == 4
for token in span
Iterate over the Token objects in the span.
__len__
Number of tokens in the span.
start
The start offset of the span, i.e. span[0].i.
end
The end offset of the span, i.e. span[-1].i + 1
root
The first ancestor of the first word of the span that has its head outside the span. For example:
>>> toks = nlp(u'I like New York in Autumn.')
Let's name the indices --- easier than writing toks[4] etc.
>>> i, like, new, york, in_, autumn, dot = range(len(toks))
The head of new is York, and the head of York is like
>>> toks[new].head.orth_
'York'
>>> toks[york].head.orth_
'like'
Create a span for "New York". Its root is "York".
>>> new_york = toks[new:york+1]
>>> new_york.root.orth_
'York'
When there are multiple words with external dependencies, we take the first:
>>> toks[autumn].head.orth_, toks[dot].head.orth_
('in', 'like')
>>> autumn_dot = toks[autumn:]
>>> autumn_dot.root.orth_
'Autumn'
lefts
Tokens that are to the left of the span, whose head is within the span, i.e.
lefts = [span.doc[i] for i in range(0, span.start)
         if span.doc[i].head in span]
rights
Tokens that are to the right of the span, whose head is within the span, i.e.
rights = [span.doc[i] for i in range(span.end, len(span.doc))
          if span.doc[i].head in span]
subtree
Tokens in the range (start, end+1), where start is the index of the leftmost word descended from a token in the span, and end is the index of the rightmost token descended from a token in the span.
A Span object can be obtained in any of the following ways:
doc[start : end]
for entity in doc.ents
for sentence in doc.sents
for noun_phrase in doc.noun_chunks
span = Span(doc, start, end, label=0)
__init__
span = doc[0:4]
string
String
lemma / lemma_
String
label / label_
String
Lexeme
The Lexeme object represents a lexical type, stored in the vocabulary – as opposed to a token, occurring in a document.
Lexemes store various features, so that these features can be computed once per type, rather than once per token. As job sizes grow, this can amount to a substantial efficiency improvement.
All Lexeme attributes are therefore context independent, as a single lexeme is reused for all usages of that word. Lexemes are keyed by the “orth” attribute.
All Lexeme attributes are accessible directly on the Token object.
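For example, a lexical type can be looked up directly in the vocabulary. This is a sketch; the outputs shown follow from the attribute definitions below.
>>> from spacy.en import English
>>> nlp = English()
>>> apple = nlp.vocab[u'apple']
>>> apple.orth_, apple.lower_, apple.is_alpha
(u'apple', u'apple', True)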
orth / orth_
The form of the word with no string normalization or processing, as it appears in the string, without trailing whitespace.
lower / lower_
The form of the word, but forced to lower-case, i.e. lower = word.orth_.lower()
shape / shape_
A transform of the word's string, to show orthographic features. The characters a-z are mapped to x, A-Z is mapped to X, 0-9 is mapped to d. After these mappings, sequences of 4 or more of the same character are truncated to length 4. Examples: C3Po --> XdXx, favorite --> xxxx, :) --> :)
prefix / prefix_
A length-N substring from the start of the word. Length may vary by language; currently for English n=1, i.e. prefix = word.orth_[:1]
suffix / suffix_
A length-N substring from the end of the word. Length may vary by language; currently for English n=3, i.e. suffix = word.orth_[-3:]
is_alpha
Equivalent to word.orth_.isalpha()
is_ascii
Equivalent to all(ord(c) < 128 for c in word.orth_)
is_digit
Equivalent to word.orth_.isdigit()
is_lower
Equivalent to word.orth_.islower()
is_title
Equivalent to word.orth_.istitle()
is_punct
True if the word consists only of punctuation characters.
is_space
Equivalent to word.orth_.isspace()
like_url
Does the word resemble a URL?
like_num
Does the word represent a number? e.g. “10.9”, “10”, “ten”, etc.
like_email
Does the word resemble an email?
is_oov
Is the word out-of-vocabulary?
prob
The unigram log-probability of the word, estimated from counts from a large corpus, smoothed using Simple Good Turing estimation.
cluster
The Brown cluster ID of the word. These are often useful features for linear models. If you’re using a non-linear model, particularly a neural net or random forest, consider using the real-valued word representation vector, in Token.repvec, instead.
repvec
A “word embedding” representation: a dense real-valued vector that supports similarity queries between words. By default, spaCy currently loads vectors produced by the Levy and Goldberg (2014) dependency-based word2vec model.
__init__
Init
Vocab
lexeme = vocab[integer_id]
Get a lexeme by its orth ID
lexeme = vocab[string]
Get a lexeme by the string corresponding to its orth ID.
for lexeme in vocab
Iterate over Lexeme
objects
vocab[integer_id] = attributes_dict
A props dictionary
len(vocab)
Number of lexemes (unique words) in the vocabulary.
__init__
StringStore
Intern strings, and map them to sequential integer IDs. The mapping table is very efficient, and a small-string optimization is used to maintain a small memory footprint. Only the integer IDs are held by spaCy's data classes (Doc, Token, Span and Lexeme) – when you use a string-valued attribute like token.orth_, you access a property that computes token.vocab.strings[token.orth].
string = string_store[int_id]
Retrieve a string from a given integer ID. If the integer ID is not found, raise IndexError
int_id = string_store[unicode_string]
Map a unicode string to an integer ID. If the string is previously unseen, it is interned, and a new ID is returned.
int_id = string_store[utf8_byte_string]
Byte strings are assumed to be in UTF-8 encoding. Strings encoded with other codecs may fail silently. Given a utf8 string, the behaviour is the same as for unicode strings. Internally, strings are stored in UTF-8 format. So if you start with a UTF-8 byte string, it's less efficient to first decode it as unicode, as StringStore will then have to encode it as UTF-8 once again.
n_strings = len(string_store)
Number of strings in the string-store
for string in string_store
Iterate over strings in the string store, in order, such that the ith string in the sequence has the ID i:
for i, string in enumerate(string_store):
assert i == string_store[string]
StringStore.__init__ takes no arguments, so a new instance can be constructed as follows:
from spacy.strings import StringStore
string_store = StringStore()
However, in practice you'll usually use the instance owned by the language's vocab object, which all classes hold a reference to:
english.vocab.strings
doc.vocab.strings
span.vocab.strings
token.vocab.strings
lexeme.vocab.strings
If you create another instance, it will map strings to different integers – which is usually not what you want.
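For example, the mapping round-trips between strings and integer IDs through the shared instance. A sketch, using the vocabulary's own store as recommended above:
>>> strings = nlp.vocab.strings
>>> apple_id = strings[u'apple']    # intern the string (or look it up), returning its integer ID
>>> strings[apple_id]               # map the ID back to the original string
u'apple'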
dump
loc
Save the strings mapping to the given location, in plain text. The format is subject to change; if you need to read/write compatible files, you can find details in the strings.pyx source.
load
loc
Load the strings mapping from a plain-text file at the given location. The format is subject to change; if you need to read/write compatible files, you can find details in the strings.pyx source.
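A minimal sketch of the dump/load round trip, assuming loc is a file path as the descriptions above suggest; the path used here is just an example:
>>> from spacy.strings import StringStore
>>> nlp.vocab.strings.dump('/tmp/strings.txt')    # example path
>>> fresh_store = StringStore()
>>> fresh_store.load('/tmp/strings.txt')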