nlp_architect.data package

Subpackages
- nlp_architect.data.cdc_resources package
  - Subpackages
    - nlp_architect.data.cdc_resources.data_types package
    - nlp_architect.data.cdc_resources.embedding package
    - nlp_architect.data.cdc_resources.gen_scripts package
      - Submodules
        - nlp_architect.data.cdc_resources.gen_scripts.create_reference_dict_dump module
        - nlp_architect.data.cdc_resources.gen_scripts.create_verbocean_dump module
        - nlp_architect.data.cdc_resources.gen_scripts.create_wiki_dump module
        - nlp_architect.data.cdc_resources.gen_scripts.create_word_embed_elmo_dump module
        - nlp_architect.data.cdc_resources.gen_scripts.create_word_embed_glove_dump module
        - nlp_architect.data.cdc_resources.gen_scripts.create_wordnet_dump module
      - Module contents
    - nlp_architect.data.cdc_resources.relations package
      - Submodules
        - nlp_architect.data.cdc_resources.relations.computed_relation_extraction module
        - nlp_architect.data.cdc_resources.relations.referent_dict_relation_extraction module
        - nlp_architect.data.cdc_resources.relations.relation_extraction module
        - nlp_architect.data.cdc_resources.relations.relation_types_enums module
        - nlp_architect.data.cdc_resources.relations.verbocean_relation_extraction module
        - nlp_architect.data.cdc_resources.relations.wikipedia_relation_extraction module
        - nlp_architect.data.cdc_resources.relations.within_doc_coref_extraction module
        - nlp_architect.data.cdc_resources.relations.word_embedding_relation_extraction module
        - nlp_architect.data.cdc_resources.relations.wordnet_relation_extraction module
      - Module contents
    - nlp_architect.data.cdc_resources.wikipedia package
    - nlp_architect.data.cdc_resources.wordnet package
  - Module contents
Submodules

nlp_architect.data.amazon_reviews module

class nlp_architect.data.amazon_reviews.Amazon_Reviews(review_file, run_balance=True)
  Bases: object

  Takes the *.json file of Amazon reviews as downloaded from http://jmcauley.ucsd.edu/data/amazon/, then performs data cleaning and balancing and maps the 1-5 review scores to a sentiment label.
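A minimal usage sketch, assuming a locally downloaded reviews file; the file name below is illustrative and only the documented constructor arguments are used:

    # Hypothetical sketch: build a cleaned, balanced sentiment dataset from a
    # reviews JSON file downloaded from http://jmcauley.ucsd.edu/data/amazon/
    from nlp_architect.data.amazon_reviews import Amazon_Reviews

    reviews = Amazon_Reviews("reviews_Electronics_5.json", run_balance=True)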
nlp_architect.data.babi_dialog module

class nlp_architect.data.babi_dialog.BABI_Dialog(path='.', task=1, oov=False, use_match_type=False, use_time=True, use_speaker_tag=True, cache_match_type=False, cache_vectorized=False)
  Bases: object

  Loads the Facebook bAbI goal-oriented dialog dataset and vectorizes it into user utterances, bot utterances, and answers, as described in "Learning End-to-End Goal-Oriented Dialog" (https://arxiv.org/abs/1605.07683).
  For a particular task, the class reads both the train and test files and combines their vocabulary.

  Parameters:
  - path (str) – Directory to store the dataset
  - task (str) – a particular task to solve (all bAbI tasks are trained and tested separately)
  - oov (bool, optional) – Load the test set with out-of-vocabulary entity words
  - use_match_type (bool, optional) – Flag to use match-type features
  - use_time (bool, optional) – Add time words to each memory, encoding when the memory was formed
  - use_speaker_tag (bool, optional) – Add speaker words (<BOT> or <USER>) to each memory, indicating who spoke it
  - cache_match_type (bool, optional) – Flag to save match-type features after processing
  - cache_vectorized (bool, optional) – Flag to save all vectorized data after processing
  Attributes:
  - data_dict (dict) – Dictionary containing the final vectorized train, val, and test datasets
  - cands (np.array) – Vectorized array of potential candidate answers, encoded as integers, as returned by the BABI_Dialog class. Shape = [num_cands, max_cand_length]
  - num_cands (int) – Number of potential candidate answers
  - max_cand_len (int) – Maximum length of a candidate answer sentence, in words
  - memory_size (int) – Maximum number of sentences to keep in memory at any given time
  - max_utt_len (int) – Maximum length of any given sentence / user utterance
  - vocab_size (int) – Number of unique words in the vocabulary + 2 (0 is reserved for the padding symbol, 1 for OOV)
  - use_match_type (bool, optional) – Flag to use match-type features
  - kb_ents_to_type (dict, optional) – For use with match-type features; maps entities found in the dataset to their associated match type
  - kb_ents_to_cand_idxs (dict, optional) – For use with match-type features; maps each entity in the knowledge base to the set of indices in the candidate_answers array that contain that entity
  - match_type_idxs (dict, optional) – For use with match-type features; maps each match type to the fixed index of the candidate vector that indicates that match type
  static clean_cands(cand)
    Remove the leading line number and trailing newline from a candidate answer.

  create_cands_mat(data_split, cache_match_type)
    Add match-type features to the candidate answers for each example in the dataset. Caches the result once complete.

  create_match_maps()
    Create a dictionary mapping each entity in the knowledge base to the set of indices in the candidate_answers array that contain that entity. Used to quickly add the match-type features to the candidate answers during fprop.

  load_candidate_answers()
    Load candidate answers from file, compute their number, and store them for the final softmax.

  load_data()
    Fetch and extract the Facebook bAbI-dialog dataset if it has not already been downloaded.
    Returns: tuple – the training and test filenames

  one_hot_vector(answer)
    Create a one-hot representation of an answer.
    Parameters: answer (str) – the answer word
    Returns: list – one-hot representation of the answer

  static parse_dialog(fn, use_time=True, use_speaker_tag=True)
    Given a dialog file, parse it into user and bot utterances, adding time and speaker tags.
    Parameters:
    - fn (str) – filename to parse
    - use_time (bool, optional) – flag to append 'time words' to the end of each utterance
    - use_speaker_tag (bool, optional) – flag to append tags specifying the speaker of each utterance

  process_interactive(line_in, context, response, db_results, time_feat)
    Parse a given user's input into the same format as training, build the memory from the given context and previous response, and update the context.

  vectorize_cands(data)
    Convert candidate-answer word data into vectors. If a sentence is shorter than max_cand_len, it is padded with 0's.
    Parameters: data (list of lists) – candidate answers split into words
    Returns: tuple (2d numpy array) – padded numpy array of word indexes for all candidate answers

  vectorize_stories(data)
    Convert (memory, user_utt, answer) word data into vectors. If a sentence is shorter than max_utt_len, it is padded with 0's; if a memory is shorter than memory_size, it is padded with empty memories (max_utt_len 0's).
    Parameters: data (tuple) – tuple of memories, user_utt, and answer word data
    Returns: tuple – tuple of memories, memory_lengths, user_utt, and answer vectors
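A minimal usage sketch for this class; the data directory is illustrative, and the 'train' key into data_dict is an assumption based on the attribute description above:

    # Hypothetical sketch: load and vectorize bAbI-dialog task 1
    from nlp_architect.data.babi_dialog import BABI_Dialog

    babi = BABI_Dialog(path='babi_dialog_data', task=1,
                       use_time=True, use_speaker_tag=True)

    train_split = babi.data_dict['train']   # assumed key into the vectorized splits
    print(babi.vocab_size, babi.num_cands, babi.max_cand_len)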
nlp_architect.data.conll module

nlp_architect.data.fasttext_emb module

class nlp_architect.data.fasttext_emb.Dictionary(id2word, word2id, lang)
  Bases: object

  Merges the id2word and word2id dictionaries.
  Parameters:
  - id2word – id-to-word dictionary
  - word2id – word-to-id dictionary
  - lang – language of the dictionary
  Usage:
  - dico.index(word) – returns the index of a word
  - dico[index] – returns the word at an index
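A short sketch of the documented usage, assuming id-to-word and word-to-id mappings built elsewhere:

    # Hypothetical sketch: wrap existing id<->word mappings in a Dictionary
    from nlp_architect.data.fasttext_emb import Dictionary

    id2word = {0: "hello", 1: "world"}
    word2id = {"hello": 0, "world": 1}
    dico = Dictionary(id2word, word2id, "en")

    assert dico.index("world") == 1   # index lookup, per the Usage notes above
    assert dico[0] == "hello"         # word lookup by index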
class nlp_architect.data.fasttext_emb.FastTextEmb(path, language, vocab_size, emb_dim=300)
  Bases: object

  Downloads FastText embeddings for a given language to the given path.
  Parameters:
  - path (str) – local path to copy the embeddings to
  - language (str) – embeddings language
  - vocab_size (int) – size of the vocabulary
  Returns: a dictionary and a reverse dictionary, and a numpy array of embeddings with shape emb_size x vocab_size

nlp_architect.data.fasttext_emb.get_eval_data(eval_path, src_lang, tgt_lang)
  Downloads evaluation cross-lingual dictionaries to eval_path.
  Parameters:
  - eval_path – path where the cross-lingual dictionaries are downloaded
  - src_lang – source language
  - tgt_lang – target language
  Returns: path to where the cross-lingual dictionaries were downloaded
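A minimal sketch of fetching embeddings and evaluation dictionaries; the paths, languages, and vocabulary size are illustrative, and reading the downloaded embeddings back is not shown:

    # Hypothetical sketch: download English FastText embeddings and an
    # English-French evaluation dictionary
    from nlp_architect.data.fasttext_emb import FastTextEmb, get_eval_data

    emb = FastTextEmb(path="embeddings", language="en", vocab_size=200000)
    eval_dir = get_eval_data(eval_path="eval_dicts", src_lang="en", tgt_lang="fr")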
nlp_architect.data.glue_tasks module

class nlp_architect.data.glue_tasks.ColaProcessor
  Bases: nlp_architect.data.utils.DataProcessor
  Processor for the CoLA data set (GLUE version).

class nlp_architect.data.glue_tasks.InputFeatures(input_ids, input_mask, segment_ids, label_id, valid_ids=None)
  Bases: object
  A single set of features of data.

class nlp_architect.data.glue_tasks.MnliMismatchedProcessor
  Bases: nlp_architect.data.glue_tasks.MnliProcessor
  Processor for the MultiNLI Mismatched data set (GLUE version).

class nlp_architect.data.glue_tasks.MnliProcessor
  Bases: nlp_architect.data.utils.DataProcessor
  Processor for the MultiNLI data set (GLUE version).

class nlp_architect.data.glue_tasks.MrpcProcessor
  Bases: nlp_architect.data.utils.DataProcessor
  Processor for the MRPC data set (GLUE version).

class nlp_architect.data.glue_tasks.QnliProcessor
  Bases: nlp_architect.data.utils.DataProcessor
  Processor for the QNLI data set (GLUE version).

class nlp_architect.data.glue_tasks.QqpProcessor
  Bases: nlp_architect.data.utils.DataProcessor
  Processor for the QQP data set (GLUE version).

class nlp_architect.data.glue_tasks.RteProcessor
  Bases: nlp_architect.data.utils.DataProcessor
  Processor for the RTE data set (GLUE version).

class nlp_architect.data.glue_tasks.Sst2Processor
  Bases: nlp_architect.data.utils.DataProcessor
  Processor for the SST-2 data set (GLUE version).

class nlp_architect.data.glue_tasks.StsbProcessor
  Bases: nlp_architect.data.utils.DataProcessor
  Processor for the STS-B data set (GLUE version).

class nlp_architect.data.glue_tasks.WnliProcessor
  Bases: nlp_architect.data.utils.DataProcessor
  Processor for the WNLI data set (GLUE version).
nlp_architect.data.glue_tasks.convert_examples_to_features(examples, label_list, max_seq_length, tokenizer, output_mode, cls_token_at_end=False, pad_on_left=False, cls_token='[CLS]', sep_token='[SEP]', pad_token=0, sequence_a_segment_id=0, sequence_b_segment_id=1, cls_token_segment_id=1, pad_token_segment_id=0, mask_padding_with_zero=True)
  Loads a data file into a list of InputBatch objects.
  cls_token_at_end defines the location of the CLS token:
  - False (default, BERT/XLM pattern): [CLS] + A + [SEP] + B + [SEP]
  - True (XLNet/GPT pattern): A + [SEP] + B + [SEP] + [CLS]
  cls_token_segment_id defines the segment id associated with the CLS token (0 for BERT, 2 for XLNet).
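A sketch of how a GLUE processor and this function might fit together; the tokenizer import, the processor's get_train_examples/get_labels calls, the data path, and the output_mode value are assumptions based on the standard GLUE processor interface, not taken from this page:

    # Hypothetical sketch: convert MRPC examples into BERT-style input features
    from pytorch_transformers import BertTokenizer  # assumed tokenizer dependency
    from nlp_architect.data.glue_tasks import MrpcProcessor, convert_examples_to_features

    processor = MrpcProcessor()
    examples = processor.get_train_examples("glue_data/MRPC")  # assumed DataProcessor method
    features = convert_examples_to_features(
        examples,
        label_list=processor.get_labels(),                     # assumed DataProcessor method
        max_seq_length=128,
        tokenizer=BertTokenizer.from_pretrained("bert-base-uncased"),
        output_mode="classification",
        cls_token_at_end=False,  # BERT/XLM pattern: [CLS] + A + [SEP] + B + [SEP]
    )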
nlp_architect.data.intent_datasets module

class nlp_architect.data.intent_datasets.IntentDataset(sentence_length=50, word_length=12)
  Bases: object

  Intent extraction dataset base class.
  Parameters: sentence_length (int) – max sentence length

  Attributes:
  - char_vocab (dict) – word character vocabulary
  - char_vocab_size (int) – character vocabulary size
  - intent_size (int) – intent label vocabulary size
  - intents_vocab (dict) – intent labels vocabulary
  - label_vocab_size (int) – label vocabulary size
  - tags_vocab (dict) – labels vocabulary
  - test_set (tuple of numpy.ndarray) – test set
  - train_set (tuple of numpy.ndarray) – train set
  - word_vocab (dict) – tokens vocabulary
  - word_vocab_size (int) – vocabulary size
class nlp_architect.data.intent_datasets.SNIPS(path, sentence_length=30, word_length=12)
  Bases: nlp_architect.data.intent_datasets.IntentDataset

  SNIPS dataset class.
  Parameters:
  - path (str) – dataset path
  - sentence_length (int, optional) – max sentence length
  - word_length (int, optional) – max word length

  files = ['train', 'test']

  test_files = ['AddToPlaylist/validate_AddToPlaylist.json', 'BookRestaurant/validate_BookRestaurant.json', 'GetWeather/validate_GetWeather.json', 'PlayMusic/validate_PlayMusic.json', 'RateBook/validate_RateBook.json', 'SearchCreativeWork/validate_SearchCreativeWork.json', 'SearchScreeningEvent/validate_SearchScreeningEvent.json']

  train_files = ['AddToPlaylist/train_AddToPlaylist_full.json', 'BookRestaurant/train_BookRestaurant_full.json', 'GetWeather/train_GetWeather_full.json', 'PlayMusic/train_PlayMusic_full.json', 'RateBook/train_RateBook_full.json', 'SearchCreativeWork/train_SearchCreativeWork_full.json', 'SearchScreeningEvent/train_SearchScreeningEvent_full.json']
class nlp_architect.data.intent_datasets.TabularIntentDataset(train_file, test_file, sentence_length=30, word_length=12)
  Bases: nlp_architect.data.intent_datasets.IntentDataset

  Tabular intent/slot-tags dataset loader. Compatible with many sequence tagging datasets (ATIS, CoNLL, etc.). The data must be in tabular format, where:
  - there is one word per line, with the tag annotation and intent type separated by tabs: <token> <tag_label> <intent>
  - sentences are separated by an empty line

  Parameters:
  - train_file (str) – path to the train set file
  - test_file (str) – path to the test set file
  - sentence_length (int) – max sentence length
  - word_length (int) – max word length

  files = ['train', 'test']
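A minimal loading sketch for the tabular loader, assuming tab-separated train/test files in the format described above; the file names are illustrative:

    # Hypothetical sketch: load a tab-separated intent/slot dataset
    from nlp_architect.data.intent_datasets import TabularIntentDataset

    dataset = TabularIntentDataset(train_file="atis/train.txt",
                                   test_file="atis/test.txt",
                                   sentence_length=30,
                                   word_length=12)
    train_x = dataset.train_set   # tuple of numpy.ndarray, per the base class attributes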
nlp_architect.data.ptb module

Data loader for the Penn Treebank dataset.

class nlp_architect.data.ptb.PTBDataLoader(word_dict, seq_len=100, data_dir='/Users/pizsak/data', dataset='WikiText-103', batch_size=32, skip=30, split_type='train', loop=True)
  Bases: object

  Class that defines the data loader.

  decode_line(tokens)
    Decode a given line from indexes back to words.
    Parameters: tokens – list of indexes
    Returns: str – a sentence

class nlp_architect.data.ptb.PTBDictionary(data_dir='/Users/pizsak/data', dataset='WikiText-103')
  Bases: object

  Class for generating a dictionary of all words in the PTB corpus.

  add_word(word)
    Add a single word to the dictionary.
    Parameters: word (str) – word to be added
    Returns: None
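A minimal construction sketch; the data directory is illustrative, and the default dataset name from the signatures above is kept:

    # Hypothetical sketch: build a dictionary and a training-split loader
    from nlp_architect.data.ptb import PTBDictionary, PTBDataLoader

    word_dict = PTBDictionary(data_dir="data", dataset="WikiText-103")
    loader = PTBDataLoader(word_dict, seq_len=100, data_dir="data",
                           dataset="WikiText-103", batch_size=32,
                           split_type="train", loop=True)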
nlp_architect.data.sequence_classification module

class nlp_architect.data.sequence_classification.SequenceClsInputExample(guid: str, text: str, text_b: str = None, label: str = None)
  Bases: nlp_architect.data.utils.InputExample

  A single training/test example for simple sequence classification.
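A small sketch of constructing a single-sentence classification example; the guid, text, and label values are illustrative:

    # Hypothetical sketch: one labeled sequence classification example
    from nlp_architect.data.sequence_classification import SequenceClsInputExample

    example = SequenceClsInputExample(guid="train-1",
                                      text="the movie was great",
                                      label="positive")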
nlp_architect.data.sequential_tagging module

class nlp_architect.data.sequential_tagging.CONLL2000(data_path, sentence_length=None, max_word_length=None, extract_chars=False, lowercase=True)
  Bases: object

  CONLL 2000 POS/chunking task data set (numpy).
  Parameters:
  - data_path (str) – directory containing the CONLL2000 files
  - sentence_length (int, optional) – number of time steps to embed the data. A None value will not truncate vectors
  - max_word_length (int, optional) – max word length in characters. A None value will not truncate vectors
  - extract_chars (boolean, optional) – yield char-RNN features
  - lowercase (bool, optional) – lowercase sentence words

  Attributes:
  - char_vocab – character Vocabulary
  - chunk_vocab – chunk label Vocabulary
  - dataset_files = {'test': 'test.txt', 'train': 'train.txt'}
  - pos_vocab – POS label Vocabulary
  - test_set – the test set
  - train_set – the train set
  - word_vocab – word Vocabulary
class nlp_architect.data.sequential_tagging.SequentialTaggingDataset(train_file, test_file, max_sentence_length=30, max_word_length=20, tag_field_no=2)
  Bases: object

  Sequential tagging dataset loader. Loads train/test files with tabular separation.
  Parameters:
  - train_file (str) – path to the train file
  - test_file (str) – path to the test file
  - max_sentence_length (int, optional) – max sentence length
  - max_word_length (int, optional) – max word length
  - tag_field_no (int, optional) – index of the column to use as y-samples

  Attributes:
  - char_vocab – characters vocabulary
  - char_vocab_size – character vocabulary size
  - test_set – the test set
  - train_set – the train set
  - word_vocab – words vocabulary
  - word_vocab_size – word vocabulary size
  - y_labels – the y labels
class nlp_architect.data.sequential_tagging.TokenClsInputExample(guid: str, text: str, tokens: List[str], label: List[str] = None)
  Bases: nlp_architect.data.utils.InputExample

  A single training/test example for simple sequence token classification.
class nlp_architect.data.sequential_tagging.TokenClsProcessor(data_dir, tag_col: int = -1)
  Bases: nlp_architect.data.utils.DataProcessor

  Sequence token classification processor / dataset loader. Loads a directory containing train.txt/test.txt/dev.txt files in tab-separated format (one token per line, CoNLL style). The label dictionary is given in a labels.txt file.
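A minimal loading sketch for the column-separated tagging loader; the file paths are illustrative, and the third column (index 2) is used as the tag field as in the default signature above:

    # Hypothetical sketch: load a CoNLL-style tagging dataset
    from nlp_architect.data.sequential_tagging import SequentialTaggingDataset

    data = SequentialTaggingDataset(train_file="conll2003/train.txt",
                                    test_file="conll2003/test.txt",
                                    max_sentence_length=30,
                                    max_word_length=20,
                                    tag_field_no=2)
    print(data.word_vocab_size, data.char_vocab_size, data.y_labels)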
nlp_architect.data.utils module

class nlp_architect.data.utils.DataProcessor
  Bases: object

  Base class for data converters for sequence/token classification data sets.

class nlp_architect.data.utils.InputExample(guid: str, text, label=None)
  Bases: abc.ABC

  Base class for a single training/dev/test example.

class nlp_architect.data.utils.Task(name: str, processor: nlp_architect.data.utils.DataProcessor, data_dir: str, task_type: str)
  Bases: object

  A task definition class.
  Parameters:
  - name (str) – the name of the task
  - processor (DataProcessor) – a DataProcessor class containing a dataset loader
  - data_dir (str) – path to the data source
  - task_type (str) – the task type (classification/regression/tagging)
nlp_architect.data.utils.read_column_tagged_file(filename: str, tag_col: int = -1)
  Reads a column-tagged (CoNLL-style) file (tab-separated, one token per line). tag_col is the column number to use as the tag of the token (defaults to the last column in the line).
  Return format: [['token', 'TAG'], ['token', 'TAG2'], ...]

nlp_architect.data.utils.read_tsv(input_file, quotechar=None)
  Reads a tab-separated value file.
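A closing sketch tying the utilities together; the paths and the choice of TokenClsProcessor are illustrative:

    # Hypothetical sketch: define a tagging task and read a column-tagged file
    from nlp_architect.data.sequential_tagging import TokenClsProcessor
    from nlp_architect.data.utils import Task, read_column_tagged_file

    task = Task(name="ner",
                processor=TokenClsProcessor("ner_data"),
                data_dir="ner_data",
                task_type="tagging")

    # tokens paired with tags taken from the last column, per the return format above
    tagged = read_column_tagged_file("ner_data/train.txt")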