sptm package¶
Submodules¶
sptm.conditional module¶
Compute conditional probability matrix
class sptm.conditional.ConditionalMatrix(doctopics_path, tokens_path)[source]¶
Compute the conditional matrix of topics
From the data used to train the LDA model, build a matrix of topics vs. topics and compute the conditional probability of topic B occurring after topic A. Take this matrix and process it (sort and label it).
doc_matrix¶
Output from training the LDA model provided by Mallet. Contains the topic probabilities of each sentence.
data_index¶
List containing the data index number for each sentence.
num_sent¶
Number of sentences.
num_topics¶
Number of topics.
topic_freq¶
Sum of the weights for a topic over the whole dataset.
freq_matrix¶
Matrix of floats (conditional probabilities).
labels¶
List of strings, one label per topic; labels must be assigned manually.
labeled¶
freq_matrix with labels.
sorted¶
freq_matrix with labels, sorted in descending order.
construct_matrix()[source]¶
Compute the conditional probabilities.
Construct a simple frequency matrix of each topic (current sentence) vs. topic (next sentence), and identify the topics with high probabilities for each sentence.
Raises: Exception – "Sentence Missing"; you can ignore this message
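A minimal usage sketch, assuming a doc-topics file and a tokens file produced while training the LDA model (both paths here are hypothetical):

    from sptm.conditional import ConditionalMatrix

    # Hypothetical paths: point these at the doc-topics output of a
    # trained Mallet model and the tokens used to train it.
    cm = ConditionalMatrix('output/doctopics.txt', 'output/tokens.txt')

    # Build the topic-vs-topic matrix of conditional probabilities,
    # i.e. P(topic B in the next sentence | topic A in the current one).
    cm.construct_matrix()

    print(cm.freq_matrix)  # matrix of conditional probabilities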
sptm.inference module¶
Inferencer functions
class sptm.inference.Inferencer(model, dictionary)[source]¶
Inferencer object to compute the probability of the next sentence given the current sentence
model¶
LDA Mallet model.
dictionary¶
Dictionary used to train the LDA model.
infer(query, sentence_ml=2, token_ml=1)[source]¶
Run an inference on the query.
NOTE: Use the same minimum lengths here as were used during preprocessing.
Parameters: - query – List of reviews
- sentence_ml – Minimum length of a sentence, in words
- token_ml – Minimum length of a token, in characters
Returns: List of topics with their probabilities
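A minimal inference sketch, assuming a trained sptm.model.Model whose Gensim wrapper and dictionary are exposed as lda_model_mallet and id2word (see sptm.model below); the review text is made up:

    from sptm.inference import Inferencer

    # model is a trained sptm.model.Model (see below).
    inf = Inferencer(model.lda_model_mallet, model.id2word)

    # Use the same minimum lengths as were used during preprocessing.
    topics = inf.infer(['The service was quick and the staff were friendly.'],
                       sentence_ml=2, token_ml=1)
    print(topics)  # topics with their probabilities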
sptm.model module¶
Adjust, train and optimize Model
class sptm.model.Model(mallet_path, tokens=None, input_path=None)[source]¶
Adjust, train and optimize an LDA model
This class is responsible for training the topic model using Mallet's LDA, which can be found at http://mallet.cs.umass.edu/topics.php
mallet_path¶
Path to the Mallet binary.
tokens¶
List of lists containing data index number and tokens.
id2word¶
Dictionary of the corpus.
corpus¶
Term-document frequency.
alpha¶
Model alpha hyperparameter.
workers¶
Number of workers spawned while training the model.
prefix¶
Prefix.
optimize_interval¶
Number of iterations after which to re-evaluate hyperparameters.
iterations¶
Number of iterations.
topic_threshold¶
Topic threshold.
num_topics¶
Number of topics.
lda_model_mallet¶
Gensim Mallet LDA wrapper object.
fit()[source]¶
Generate the id2word dictionary and term-document frequency of the given tokens.
NOTE: Should be called only after making sure that the tokens have been properly read.
Raises: Exception – self.tokens is empty or not in the required format
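A minimal construction-and-fit sketch, assuming tokens produced by sptm.preprocess.Corpus; the Mallet path is machine-specific and hypothetical:

    from sptm.model import Model

    # corpus is an sptm.preprocess.Corpus that has been read, split,
    # and tokenized (see sptm.preprocess below).
    model = Model('/opt/mallet-2.0.8/bin/mallet', tokens=corpus.tokens)

    # Build id2word and the term-document frequency from the tokens.
    model.fit()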
get_coherence()[source]¶
Compute the coherence score of the model.
NOTE: You cannot compute the coherence score of a saved model.
Returns: Float value
load(saved_model)[source]¶
Load a previously saved Mallet LDA model.
Parameters: saved_model – Location of the saved model
Raises: IOError – File already present or location does not exist
optimum_topic(start=10, limit=100, step=11)[source]¶
Compute c_v coherence for various numbers of topics.
If you want to change the parameters of the models trained during this search, call Model.params() first, as this method uses the same parameters.
NOTE: You cannot compute the coherence score of a saved model.
Parameters: - start – Number of topics to start the search from
- limit – Maximum number of topics
- step – Increment in the number of topics between runs
Returns: Dictionary of {num_topics: c_v}
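A sketch of using the sweep to pick a topic count, assuming the returned dictionary maps topic counts to their c_v scores; the range values are illustrative:

    # Re-initialize training parameters first; the sweep reuses them.
    model.params()
    scores = model.optimum_topic(start=10, limit=60, step=10)

    # Keep the topic count with the highest coherence.
    best = max(scores, key=scores.get)
    model.params(num_topics=best)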
params(alpha=50, workers=4, prefix=None, optimize_interval=0, iterations=1000, topic_threshold=0.0, num_topics=100)[source]¶
Set model parameters.
NOTE: These are the same parameters used while training models for coherence computation. Call this function to re-initialize parameter values in that case.
Parameters: - alpha – Alpha value (Dirichlet hyperparameter)
- workers – Number of threads to spawn to parallelize the training process
- prefix – Prefix
- optimize_interval – Number of iterations after which to recompute hyperparameters
- iterations – Number of iterations
- topic_threshold – Topic threshold
- num_topics – Number of topics
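A sketch of overriding the defaults before training; the values are illustrative, not recommendations:

    model.params(alpha=50,
                 workers=8,             # threads for parallel training
                 optimize_interval=10,  # re-optimize hyperparameters every 10 iterations
                 iterations=2000,
                 num_topics=40)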
save(output_path)[source]¶
Save the Mallet LDA model.
Also save the document-topic distribution, corpus and inferencer.
Parameters: output_path – Location, with filename, at which to save the LDA model
Raises: IOError – Error with output_path / file already exists
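A save/load round trip for a trained model; the paths are hypothetical:

    # Persist the model (plus its document-topic distribution,
    # corpus, and inferencer).
    model.save('models/reviews_lda')

    # Later: restore it into a fresh Model instance.
    restored = Model('/opt/mallet-2.0.8/bin/mallet')
    restored.load('models/reviews_lda')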
sptm.postprocess module¶
Used to graph the Hellinger distance between topic vectors
sptm.preprocess module¶
Preprocess the data
class sptm.preprocess.Corpus(path=None, raw_review=None, sentences=None, tokens=None)[source]¶
Corpus object to handle all pre-processing of data
The read_reviews() method assumes data in the file to be in the following format:
<metadata>\t…\t<data_in_multiple_sentences>
Data to be preprocessed must be in the last column.
path¶
Path to the data file.
raw_review¶
Data read from the file, in a list.
sentences¶
List of lists containing data index number and sentence.
tokens¶
List of lists of data index number followed by tokenized sentence.
read_reviews(delimiter='\t', reg='(\\\\u[0-z][0-z][0-z])\\w', rep=' ')[source]¶
Read reviews and store them in a list.
Parameters: - delimiter – The separator between data fields
- reg – Custom regex to filter on
- rep – String with which to replace the regex matches
Raises: - IOError – File not found
- Exception – Data format in the opened file does not follow the specified template style
split_sentence(min_len=2)[source]¶
Split each data index into its individual sentences.
Splits each data entry at periods.
Parameters: min_len – Minimum length (in words) above which a sentence is included
tokenize_custom(min_len=1)[source]¶
Process sentences.
Tokenize, drop tokens that are too short, lemmatize, and filter out grammar (stop words, symbols, prepositions, numbers, etc.).
Parameters: min_len – Minimum length of tokens
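A sketch of the full preprocessing pipeline feeding into model training; the data path is hypothetical and the file is assumed to follow the tab-delimited format described above:

    from sptm.preprocess import Corpus
    from sptm.model import Model

    corpus = Corpus(path='data/reviews.tsv')  # hypothetical path
    corpus.read_reviews()               # rows -> corpus.raw_review
    corpus.split_sentence(min_len=2)    # entries -> per-sentence lists
    corpus.tokenize_custom(min_len=1)   # sentences -> lemmatized tokens

    # The resulting tokens feed straight into training.
    model = Model('/opt/mallet-2.0.8/bin/mallet', tokens=corpus.tokens)
    model.fit()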
sptm.utils module¶
Utility functions
sptm.utils.force_unicode(string, encoding='utf-8', errors='ignore')[source]¶
Force-converts a string to a unicode object.
Treats bytestrings using the 'encoding' codec.
Parameters: - string – String to be converted
- encoding – Encoding type, defaults to utf-8
- errors – Whether or not to ignore errors, defaults to ignore
Returns: Unicode object
Raises: TypeError – string argument left empty
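A quick illustration of the helper; the byte string is made up:

    from sptm.utils import force_unicode

    # Decode a raw bytestring, ignoring any undecodable bytes.
    text = force_unicode(b'caf\xc3\xa9 review')
    print(text)  # café review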