sptm package

Submodules

sptm.conditional module

Compute conditional probability matrix

class sptm.conditional.ConditionalMatrix(doctopics_path, tokens_path)[source]

Compute the conditional matrix of topics

From the data used to train the LDA model, build a matrix of topics vs topics and compute the conditional probability of topic B occurring after topic A. This matrix can then be processed (sorted and labeled).

doc_matrix

Output from training the LDA model provided by Mallet. Contains the topic probabilities of each sentence.

data_index

List containing the data index number for each sentence

num_sent

Number of sentences

num_topics

Number of topics

topic_freq

Sum of the weights for a topic over the whole dataset

freq_matrix

Matrix of floats (conditional probabilities)

labels

List of strings, one label per topic; labels must be supplied manually

labeled

freq_matrix with labels

sorted

freq_matrix with labels, sorted in descending order

construct_matrix()[source]

Compute the conditional probabilities

Construct a frequency matrix of topic (current sentence) vs topic (next sentence) and identify the topics with high probabilities for each sentence

Raises: Exception – Sentence missing (this message can safely be ignored)
save(output_path, matrix)[source]

Save matrix

Parameters:
  • output_path – Location with filename to save matrix
  • matrix – Matrix to save
Raises: IOError – Output path does not exist

sort_and_label(labels_path)[source]

Sort and label each value in the matrix

Parameters: labels_path – Path to a labels file

Raises:
  • IOError – Labels file not found
  • Exception – Error matching topics and labels, or error sorting conditional probabilities
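
A minimal usage sketch of the class above; the file paths are placeholders, not part of sptm:

    from sptm.conditional import ConditionalMatrix

    # Hypothetical paths: doctopics.txt is Mallet's doc-topics output,
    # tokens.txt holds the tokens used to train the LDA model
    cond = ConditionalMatrix("output/doctopics.txt", "output/tokens.txt")
    cond.construct_matrix()                   # topic-vs-topic conditional probabilities
    cond.sort_and_label("output/labels.txt")  # attach manual labels, sort each row
    cond.save("output/conditional.txt", cond.sorted)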

sptm.inference module

Inferencer functions

class sptm.inference.Inferencer(model, dictionary)[source]

Inferencer object to compute the probability of the next sentence given the current sentence

model

LDA Mallet model

dictionary

Dictionary used to train LDA model

infer(query, sentence_ml=2, token_ml=1)[source]

Run an inference on the query

NOTE: use the same minimum lengths here as used during preprocessing

Parameters:
  • query – List of reviews
  • sentence_ml – Minimum length of the sentence in words
  • token_ml – Minimum length of the tokens in characters
Returns: List of topics with their probabilities
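
A short sketch of running inference, assuming `model` is a previously trained sptm.model.Model (see below); the query review is illustrative:

    from sptm.inference import Inferencer

    # model.lda_model_mallet and model.id2word are the Model attributes
    # documented in sptm.model below
    inf = Inferencer(model.lda_model_mallet, model.id2word)
    topics = inf.infer(["The food was great but the service was slow."],
                       sentence_ml=2, token_ml=1)  # same minimums as preprocessing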

sptm.model module

Adjust, train and optimize Model

class sptm.model.Model(mallet_path, tokens=None, input_path=None)[source]

Adjust, train and optimize LDA model

This class is responsible for training the topic model using Mallet's LDA, which can be found at http://mallet.cs.umass.edu/topics.php

mallet_path

Path to Mallet binary

tokens

List of lists containing data index number and tokens

id2word

Dictionary of the Corpus

corpus

Term-document frequency

alpha

Model alpha hyperparameter

workers

Number of workers spawned while training the model

prefix

Prefix for files produced by Mallet

optimize_interval

Number of iterations after which to re-evaluate hyperparameters

iterations

Number of iterations

topic_threshold

Topic threshold

num_topics

Number of topics

lda_model_mallet

Gensim Mallet LDA wrapper object

fit()[source]

Generate the id2word dictionary and term-document frequency from the given tokens

NOTE: Should be called only after making sure that the tokens have been properly read

Raises: Exception – self.tokens is empty or not in the required format
get_coherence()[source]

Compute the coherence score of the model

NOTE: You cannot compute the coherence score of a saved model

Returns: Float value
load(saved_model)[source]

Load a Mallet LDA model previously saved

Parameters: saved_model – Location of the saved model
Raises: IOError – Location does not exist
optimum_topic(start=10, limit=100, step=11)[source]

Compute c_v coherence for various number of topics

If you want to change the parameters of the model used during training, call Model.params() first, as this method uses the same parameters.

NOTE: You cannot compute the coherence score of a saved model.

Parameters:
  • start – Number of topics to start the sweep from
  • limit – Maximum number of topics
  • step – Step size for the number of topics
Returns: Dictionary mapping each num_topics value to its c_v coherence score
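
For example, a sweep over topic counts might look like the following sketch, assuming the returned dictionary maps each topic count to its c_v score:

    model.params()                        # re-initialize training parameters
    scores = model.optimum_topic(start=10, limit=100, step=10)
    best_k = max(scores, key=scores.get)  # topic count with the highest c_v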

params(alpha=50, workers=4, prefix=None, optimize_interval=0, iterations=1000, topic_threshold=0.0, num_topics=100)[source]

Model parameters

NOTE: These are the same parameters used while training models for coherence computation. Call this function to re-initialize parameter values in that case

Parameters:
  • alpha – Alpha value (Dirichlet hyperparameter)
  • workers – Number of threads spawned to parallelize the training process
  • prefix – Prefix for files produced by Mallet
  • optimize_interval – Number of iterations after which to re-optimize hyperparameters
  • iterations – Number of iterations
  • topic_threshold – Topic threshold
  • num_topics – Number of topics
save(output_path)[source]

Save the Mallet LDA model

Also saves the document-topic distribution, corpus and inferencer

Parameters:output_path – Location with filename to save the LDA model
Raises: IOError – Error with output_path, or file already exists
topics(num_topics=100, num_words=10)[source]

Return top <num_words> words for the first <num_topics> topics

Parameters:
  • num_topics – Number of topics to print
  • num_words – Number of top words to print for each topic
Returns: List of topics and their top words

train()[source]

Train the LDA Mallet model using gensim's Mallet wrapper
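
Putting the methods above together, a minimal end-to-end training sketch; the paths are placeholders and Mallet must be installed separately:

    from sptm.model import Model

    model = Model("/path/to/mallet/bin/mallet",    # hypothetical Mallet binary
                  input_path="output/tokens.txt")  # tokens from sptm.preprocess
    model.params(num_topics=50)   # optional: override the training defaults
    model.fit()                   # build id2word and corpus from the tokens
    model.train()                 # run Mallet LDA through gensim's wrapper
    print(model.get_coherence())  # coherence of the freshly trained model
    model.save("output/model")    # also writes doc-topic distribution etc.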

sptm.postprocess module

Used to graph the Hellinger distance between topic vectors

class sptm.postprocess.TopicDistanceMap(lda_mallet, label_filename)[source]

intertopic_distance()[source]

Calculate the Hellinger distance between all pairs of topic vectors

plot_map()[source]

Plot the intertopic distance map

save_dist(filename)[source]

Save the matrix

Parameters: filename – Location with filename to save the topic matrix
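
A usage sketch, assuming `model` is a trained sptm.model.Model and labels.txt is a manually written labels file:

    from sptm.postprocess import TopicDistanceMap

    tdm = TopicDistanceMap(model.lda_model_mallet, "output/labels.txt")
    tdm.intertopic_distance()              # pairwise Hellinger distances
    tdm.save_dist("output/distances.txt")  # persist the distance matrix
    tdm.plot_map()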

sptm.preprocess module

Preprocess the data

class sptm.preprocess.Corpus(path=None, raw_review=None, sentences=None, tokens=None)[source]

Corpus object to handle all pre-processing of data

The read_reviews() method assumes the data in the file to be in the following format:

<metadata>\t…\t<data_in_multiple_sentences>

Data to be preprocessed must be in the last column.
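
For example, a single line of a valid input file might look like this (tab-separated; the metadata columns are illustrative):

    review_042<TAB>5<TAB>The pasta was excellent. The service was a bit slow.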

path

Path to the data file

raw_review

Data read from the file, stored in a list

sentences

List of lists containing the data index number and sentence

tokens

List of lists containing the data index number followed by the tokenized sentence

read_reviews(delimiter='\t', reg='(\\\\u[0-z][0-z][0-z])\\w', rep=' ')[source]

Read reviews and store them in a list

Parameters:
  • delimiter – The separator between data fields
  • reg – Custom regex to filter
  • rep – String to replace the regex values
Raises:
  • IOError – file not found
  • Exception – Data format in the opened file does not follow the specified template
split_sentence(min_len=2)[source]

Split each data entry into its individual sentences

Splits entries at periods.

Parameters: min_len – Minimum length (in words) a sentence must have to be included
tokenize_custom(min_len=1)[source]

Processes sentences

Tokenize, ignore tokens that are too short, lemmatize, and filter out stop words, symbols, prepositions, numbers, etc.

Parameters: min_len – Minimum length of tokens
tokenize_simple(deacc=False, min_len=2, max_len=15)[source]

Processes sentences

Tokenize, ignore tokens that are too small

Parameters:
  • deacc – Remove accentuation
  • min_len – Minimum length of tokens in the result
  • max_len – Maximum length of tokens in the result
write_processed(name)[source]

Save to file

Appends tokens to the given file

Parameters: name – Name of the file

Raises:
  • IOError – Path does not exist
  • Exception – self.tokens structure not supported; check its value manually
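
Taken together, a minimal preprocessing sketch; the input path and output name are placeholders:

    from sptm.preprocess import Corpus

    corpus = Corpus(path="data/reviews.tsv")     # hypothetical input file
    corpus.read_reviews(delimiter="\t")          # metadata + review text per line
    corpus.split_sentence(min_len=2)             # one entry per sentence
    corpus.tokenize_custom(min_len=1)            # lemmatize and filter tokens
    corpus.write_processed("output/tokens.txt")  # appends to the file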

sptm.utils module

Utility functions

sptm.utils.force_unicode(string, encoding='utf-8', errors='ignore')[source]

Forcibly convert a string to a unicode object

Treats bytestrings using the ‘encoding’ codec.

Parameters:
  • string – String to be converted
  • encoding – Encoding type; defaults to utf-8
  • errors – Error handling scheme; defaults to ignore
Returns: Unicode object
Raises: TypeError – string argument left empty
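
A minimal sketch; the byte string is illustrative:

    from sptm.utils import force_unicode

    text = force_unicode(b"caf\xc3\xa9")  # UTF-8 bytes decoded to u"café"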

Module contents