nlp_architect.solutions.trend_analysis package¶
Subpackages¶
Submodules¶
nlp_architect.solutions.trend_analysis.np_scorer module¶
nlp_architect.solutions.trend_analysis.scoring_utils module¶
-
class
nlp_architect.solutions.trend_analysis.scoring_utils.
CorpusIndex
(documents: list, spans: list)[source]¶ Bases:
object
Text span index class. Holds term frequency (TF) and document frequency (DF) values per span. Text spans are normalized, and similar spans are mapped to the same TF and DF values.
-
class
nlp_architect.solutions.trend_analysis.scoring_utils.
TextSpanScoring
(documents, spans, min_tf=1)[source]¶ Bases:
object
Text span scoring class. Contains miscellaneous algorithms for scoring text fragments extracted from a corpus.
Parameters: - documents (list) – List of spaCy documents.
- spans (list[list]) – List of spaCy spans representing the noun phrases of each document.
-
doc_text_spans
¶
-
documents
¶
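The TF and DF bookkeeping described for CorpusIndex can be illustrated with a minimal sketch. This is not the library's implementation: it works on plain strings rather than spaCy spans, and the names normalize and build_tf_df, as well as the lowercase-and-collapse-whitespace normalization, are assumptions made for illustration only.

```python
from collections import Counter


def normalize(span_text):
    """Hypothetical normalization: lowercase and collapse whitespace."""
    return " ".join(span_text.lower().split())


def build_tf_df(doc_spans):
    """Count TF and DF per normalized span.

    doc_spans holds one list of span strings per document.
    """
    tf, df = Counter(), Counter()
    for spans in doc_spans:
        normalized = [normalize(s) for s in spans]
        tf.update(normalized)       # every occurrence counts toward TF
        df.update(set(normalized))  # each document counts once toward DF
    return tf, df


doc_spans = [
    ["Neural Networks", "neural  networks", "training data"],
    ["Training Data", "language model"],
]
tf, df = build_tf_df(doc_spans)
```

Because both spellings of "neural networks" normalize to the same key, they share one TF entry and one DF entry, which mirrors the "similar spans are mapped to the same values" behaviour described above.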
nlp_architect.solutions.trend_analysis.topic_extraction module¶
-
nlp_architect.solutions.trend_analysis.topic_extraction.
create_w2v_model
(text_list_t, text_list_r)[source]¶ Create a w2v model on the given corpora
Parameters: - text_list_t – A list of documents - target corpus (List[String])
- text_list_r – A list of documents - reference corpus (List[String])
-
nlp_architect.solutions.trend_analysis.topic_extraction.
get_urls_from_file
(file)[source]¶ Read a list of URLs from a text file
Parameters: file – A path to a text file containing the URLs (String)
Returns: A list of URLs (List[String])
-
nlp_architect.solutions.trend_analysis.topic_extraction.
load_text_from_folder
(folder)[source]¶ Load files content into a list of docs (texts)
Parameters: folder – A path to a folder containing text files Returns: A list of documents (List[String])
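As a rough sketch of what loading a folder into a list of documents can look like, the snippet below reads every regular file in a folder into a string. The function name load_texts, the sorted file order and the UTF-8 decoding with ignored errors are assumptions; the library function may differ in all three.

```python
import tempfile
from pathlib import Path


def load_texts(folder):
    """Read every regular file in `folder` (sorted by name) into a list of strings."""
    docs = []
    for path in sorted(Path(folder).iterdir()):
        if path.is_file():
            docs.append(path.read_text(encoding="utf-8", errors="ignore"))
    return docs


# Demo on a throwaway folder with two small files.
with tempfile.TemporaryDirectory() as tmp:
    Path(tmp, "a.txt").write_text("first document", encoding="utf-8")
    Path(tmp, "b.txt").write_text("second document", encoding="utf-8")
    docs = load_texts(tmp)
```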
-
nlp_architect.solutions.trend_analysis.topic_extraction.
load_url_content
(url_list)[source]¶ Load articles content into a list of docs (texts)
Parameters: url_list (List[String]) – A list of urls Returns: A list of documents (List[String])
-
nlp_architect.solutions.trend_analysis.topic_extraction.
main
(corpus_t, corpus_r, single_thread, no_train, url)[source]¶
-
nlp_architect.solutions.trend_analysis.topic_extraction.
noun_phrase_extraction
(docs, parser)[source]¶ Extract noun-phrases from a textual corpus
Parameters: - docs (List[String]) – A list of documents
- parser (SpacyInstance) – Spacy NLP parser
Returns: List of topics with their tf_idf, c_value, language-model scores
-
nlp_architect.solutions.trend_analysis.topic_extraction.
save_scores
(np_result, file_path)[source]¶ Save the result of a topic extraction into a file
Parameters: - np_result – A list of topics with different score types (tfidf, cvalue, freq)
- file_path – The output file path
-
nlp_architect.solutions.trend_analysis.topic_extraction.
train_w2v_model
(data)[source]¶ Train a w2v (skipgram) model using fasttext package
Parameters: data – A path to the training data (String)
-
nlp_architect.solutions.trend_analysis.topic_extraction.
unify_corpora_from_folders
(corpus_a, corpus_b)[source]¶ Merge two corpora into a single text file
Parameters: - corpus_a – A folder containing text files (String)
- corpus_b – A folder containing text files (String)
Returns: The path of the unified corpus
-
nlp_architect.solutions.trend_analysis.topic_extraction.
unify_corpora_from_texts
(text_list_t, text_list_r)[source]¶ Merge two corpora into a single text file
Parameters: - text_list_t – A list of documents - target corpus (List[String])
- text_list_r – A list of documents - reference corpus (List[String])
Returns: The path of the unified corpus
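Merging a target and a reference corpus into one text file might look like the sketch below. It is an illustration under assumptions, not the library code: unify_texts is a hypothetical name, the explicit out_path argument is added here so the example controls where the file goes, and writing one document per line is an assumed layout.

```python
import os
import tempfile


def unify_texts(text_list_t, text_list_r, out_path):
    """Write target documents, then reference documents, one per line."""
    with open(out_path, "w", encoding="utf-8") as out:
        for text in list(text_list_t) + list(text_list_r):
            # Flatten internal whitespace so each document occupies one line.
            out.write(" ".join(text.split()) + "\n")
    return out_path


out_path = os.path.join(tempfile.mkdtemp(), "unified_corpus.txt")
unified = unify_texts(["target doc\none"], ["reference doc"], out_path)
```

A single unified file of this kind is a convenient input for training one embedding model over both corpora.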
nlp_architect.solutions.trend_analysis.trend_analysis module¶
-
nlp_architect.solutions.trend_analysis.trend_analysis.
analyze
(target_data, ref_data, tar_header, ref_header, top_n=10000, top_n_vectors=500, re_analysis=False, tfidf_w=0.5, cval_w=0.5, lm_w=0)[source]¶ Compare the topic list of the target data with the topic list of the reference data and extract hot topics, trends and clusters. Topic lists can be generated by running topic_extraction.py
Parameters: - target_data – A list of topics with importance scores extracted from the target corpus
- ref_data – A list of topics with importance scores extracted from the reference corpus
- tar_header – The header to appear for the target topics graphs
- ref_header – The header to appear for the reference topics graphs
- top_n (int) – Limit the analysis to only the top N phrases of each list
- top_n_vectors (int) – The number of vectors to include in the scatter plot
- re_analysis (Boolean) – Whether a first analysis has already been run
- tfidf_w (Float) – the TF_IDF weight for the final score calculation
- cval_w (Float) – the C_Value weight for the final score calculation
- lm_w (Float) – the Language-Model weight for the final score calculation
-
nlp_architect.solutions.trend_analysis.trend_analysis.
calc_scores
(scores_file, tfidf_w, cval_w, lm_w, output_path)[source]¶ Given a topic list with tf_idf, c_value and language_model scores, compute a final score for each phrase group according to the given weights
Parameters: - scores_file (String) – A path to the file with groups and raw scores
- tfidf_w (Float) – the TF_IDF weight for the final score calculation
- cval_w (Float) – the C_Value weight for the final score calculation
- lm_w (Float) – the Language-Model weight for the final score calculation
- output_path – A path for the output file of final scores (String)
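One plausible reading of "a final score according to the given weights" is a weighted linear combination of the three raw scores. The sketch below assumes exactly that; the formula and the name final_score are not confirmed by this page, and the library may normalize or combine the scores differently. The default weights are the ones shown in the analyze signature.

```python
def final_score(tfidf, cvalue, lm, tfidf_w=0.5, cval_w=0.5, lm_w=0.0):
    """Assumed combination: weighted sum of the three raw scores."""
    return tfidf_w * tfidf + cval_w * cvalue + lm_w * lm


# Hypothetical raw scores for two phrase groups: (tf_idf, c_value, language_model).
raw_scores = {
    "neural network": (0.8, 0.4, 0.9),
    "training data": (0.2, 0.6, 0.1),
}
scored = {group: final_score(*scores) for group, scores in raw_scores.items()}

# Rank groups from the highest to the lowest final score.
ranking = sorted(scored, key=scored.get, reverse=True)
```

With lm_w left at 0, the language-model score has no influence on the ranking, so only the TF-IDF and C-value scores decide the order.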
-
nlp_architect.solutions.trend_analysis.trend_analysis.
clean_group
(phrase_group)[source]¶ Return the shortest element in a group of phrases
Parameters: phrase_group (String) – A group of phrases separated by ‘;’
Returns: The shortest phrase in the group (String)
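Picking the shortest phrase from a ‘;’-separated group can be sketched in a few lines. The name shortest_phrase is hypothetical, and the handling of surrounding whitespace, empty entries and ties (the first shortest phrase wins) are assumptions rather than documented behaviour.

```python
def shortest_phrase(phrase_group):
    """Return the shortest non-empty phrase in a ';'-separated group."""
    phrases = [p.strip() for p in phrase_group.split(";") if p.strip()]
    # min() keeps the first of several equally short phrases.
    return min(phrases, key=len) if phrases else ""


representative = shortest_phrase("deep neural network;neural net;neural networks")
```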
-
nlp_architect.solutions.trend_analysis.trend_analysis.
compute_scatter_subwords
(top_groups, w2v_loc)[source]¶ Compute 2D vectors for the provided phrase groups
Parameters: - top_groups – A list of group-representative phrases (List(String))
- w2v_loc – A path to a w2v model (String)
Returns: A tuple (phrases, x, y, n), where:
- phrases – A list of phrases that are part of the model
- x – A DataFrame column holding the x values of the phrase vectors
- y – A DataFrame column holding the y values of the phrase vectors
- n – The number of computed vectors
-
nlp_architect.solutions.trend_analysis.trend_analysis.
merge_phrases
(data, is_ref_data, hash2group, rep2rank, top_n, topics_count)[source]¶ Analyze the provided topics data and detect trends (changes in importance)
Parameters: - data – A list of topics with importance scores
- is_ref_data (Boolean) – Whether the data was extracted from the reference corpus (True) or the target corpus (False)
- hash2group – A dictionary storing the data of each topic
- rep2rank – A dict of all groups representatives and their ranks
- top_n (int) – Limit the analysis to only the top N phrases of each list
- topics_count (int) – The total sum of all topics extracted from both corpora
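To make "changes in importance" concrete, the sketch below compares the rank of each topic in a target list with its rank in a reference list. It is a simplified illustration only: rank_changes and both input dictionaries are hypothetical, and the library's merge_phrases works on richer structures (hash2group, rep2rank) whose exact layout is not documented here.

```python
def rank_changes(target_ranks, ref_ranks):
    """Return (topic, change) pairs for topics present in both rankings.

    change = reference rank - target rank, so a positive value means the
    topic sits closer to rank 1 in the target corpus than in the reference.
    """
    shared = target_ranks.keys() & ref_ranks.keys()
    changes = [(topic, ref_ranks[topic] - target_ranks[topic]) for topic in shared]
    # Largest rise first; ties broken alphabetically for a stable order.
    return sorted(changes, key=lambda item: (-item[1], item[0]))


target_ranks = {"transformers": 1, "svm": 5, "attention": 2}
ref_ranks = {"transformers": 4, "svm": 2, "lstm": 1}
trends = rank_changes(target_ranks, ref_ranks)
```

Topics that appear in only one of the two rankings are left out of this comparison; they would need separate handling, for example as newly emerged or vanished topics.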