# The infer function¶

hdhp.infer(history, alpha_0, mu_0, omega=1, beta=1, theta_0=None, threads=1, num_particles=1, particle_weight_threshold=1, resample_every=10, update_kernels=True, mu_rate=0.6, keep_alpha_history=False, progress_file=None, seed=None)[source]

Runs the inference algorithm and returns a particle.

Parameters: history (list) – A list of 4-tuples (user, time, content, metadata) that represents the event history that we want to infer our model on. alpha_0 (tuple) – The Gamma prior parameter for a pattern’s time kernel. mu_0 (tuple) – The Gamma prior parameter for the user activity rate. omega (float) – The time decay parameter. beta (float) – A parameter that controls the new-task probability. theta_0 (list, default is None) – If not None, theta_0 corresponds to the Dirichlet prior used for the word distribution. It should have as many dimensions as the number of words. By default, this is the vector $$[1 / |V|, \ldots, 1 / |V|]$$, where $$|V|$$ is the size of the vocabulary. threads (int, default is 1) – The number of CPU threads that will be used during inference. num_particles (int, default is 1) – The number of particles that the SMC algorithm will use. particle_weight_threshold (float, default is 1) – A parameter that controls when the particles need to be resampled resample_every (int, default is 10) – The frequency with which we check if we need to resample or not. The number is in inference steps (number of events) update_kernels (bool, default is True) – Controls wheter the time kernel parameter of each pattern will be updated from the posterior, or not. mu_rate (float, default is 0.6) – The learning-rate with which we update the activity rate of a user. keep_alpha_history (bool, default is False) – For debug reasons, we make want to keep the complete history of the value of each pattern’s time kernel parameter as we see more events in that pattern. progress_file (str, default is None) – Since the computation might be slow, we want to save progress information to a file instead of printing it. If None, a temporary, randomly-named file is generated for this purpose.

# The HDHProcess object¶

class hdhp.HDHProcess(num_patterns, alpha_0, mu_0, vocabulary, omega=1, doc_length=20, doc_min_length=5, words_per_pattern=10, random_state=None)[source]
__init__(num_patterns, alpha_0, mu_0, vocabulary, omega=1, doc_length=20, doc_min_length=5, words_per_pattern=10, random_state=None)[source]
Parameters: num_patterns (int) – The maximum number of patterns that will be shared across the users. alpha_0 (tuple) – The parameter that is used when sampling the time kernel weights of each pattern. The distribution that is being used is a Gamma. This tuple should be of the form (shape, scale). mu_0 (tuple) – The parameter of the Gamma distribution that is used to sample each user’s mu (activity level). This tuple should be of the form (shape, scale). vocabulary (list) – The list of available words to use when generating documents. omega (float, default is 1) – The decay parameter for the decay of the exponential decay kernel. doc_length (int, default is 20) – The maximum number of words per document. doc_min_length (int, default is 5) – The minimum number of words per document. words_per_pattern (int, default is 10) – The number of words that will have a non-zero probability to appear in each pattern. random_state (int or RandomState object, default is None) – The random number generator.
kernel(t_i, t_j)[source]

Returns the kernel function for t_i and t_j.

Parameters: t_i (float) – Timestamp representing now. t_j (float) – Timestamp representaing past. float
pattern_content_str(patterns=None, show_words=-1, min_word_occurence=5)[source]

Return the content information for the patterns of the process.

Parameters: patterns (list, default is None) – If this list is provided, only information about the patterns in the list will be returned. show_words (int, default is -1) – The maximum number of words to show for each pattern. Notice that the words are sorted according to their occurence count. min_word_occurence (int, default is 5) – Only show words that show up at least min_word_occurence number of times in the documents of the respective pattern. A string with all the content information str
plot(num_samples=500, T_min=0, T_max=None, start_date=None, users=None, user_limit=50, patterns=None, task_detail=False, save_to_file=False, filename='user_timelines', intensity_threshold=None, paper=True, colors=None, fig_width=20, fig_height_per_user=5, time_unit='months', label_every=3, seed=None)[source]

Plots the intensity of a set of users for a set of patterns over a time period.

In this plot, each user is a separate subplot and for each user the plot shows her event_rate for each separate pattern that she has been active at.

Parameters: num_samples (int, default is 500) – The granularity level of the intensity line. Smaller number of samples results in faster plotting, while larger numbers give much more detailed result. T_min (float, default is 0) – The minimum timestamp that the plot shows, in seconds. T_max (float, default is None) – If not None, this is the maximum timestamp that the plot considers, in seconds. start_date (datetime, default is None) – If provided, this is the actual datetime that corresponds to time 0. This is required if paper is True. users (list, default is None) – If provided, this list contains the id’s of the users that will be plotted. Actually, only the first user_limit of them will be shown. user_limit (int, default is 50) – The maximum number of users to plot. patterns (list, default is None) – The list of patterns that will be shown in the final plot. If None, all of the patterns will be plotted. task_detail (bool, default is False) – If True, thee plot has one line per task. Otherwise, we only plot the commulative intensity of all tasks under the same pattern. save_to_file (bool, default is False) – If True, the plot will be saved to a pdf and a png file. filename (str, default is 'user_timelines') – The name of the output file that will be used when saving the plot. intensity_threshold (float, default is None) – If provided, this is the maximum intensity value that will be plotted, i.e. the y_max that will be the cut-off threshold for the y-axis. paper (bool, default is True) – If True, the plot result will be the same as the figures that are in the published paper. colors (list, default is None) – A list of colors that will be used for the plot. Each color will correspond to a single pattern, and will be shared across all the users. fig_width (int, default is 20) – The width of the figure that will be returned. fig_height_per_user (int, default is 5) – The height of each separate user-plot of the final figure. If multiplied by the number of users, this determines the total height of the figure. Notice that due to a matplotlib constraint(?) the total height of the figure cannot be over 70. time_unit (str, default is 'months') – Controls wether the time units is measured in days (in which case it should be set to ‘days’) or months. label_every (int, default is 3) – The frequency of the labels that show in the x-axis. seed (int, default is None) – A seed to the random number generator used to assign colors to patterns. fig matplotlib.Figure object
reset()[source]

Removes all the events and users already sampled.

Note

It does not reseed the random number generator. It also retains the already sampled pattern parameters (word distributions and alphas)

sample_document(pattern)[source]

Sample a random document from a specific pattern.

Parameters: pattern (int) – The pattern from which to sample the content. A space separeted string that contains all the sampled words. str
sample_mu()[source]

Samples a value from the prior of the base intensity mu.

Returns: mu_u – The base intensity of a user, sampled from the prior. float
sample_next_time(pattern, user)[source]

Samples the time of the next event of a pattern for a given user.

Parameters: pattern (int) – The pattern index that we want to sample the next event for. user (int) – The index of the user that we want to sample for. timestamp float
sample_pattern_params()[source]

Returns the word distributions for each pattern.

Returns: parameters – A list of word distributions, one for each pattern. list
sample_pattern_popularity()[source]

Returns a popularity distribution over the patterns.

Returns: pattern_popularities – A list with the popularity distribution of each pattern. list
sample_time_kernels()[source]

Returns the time decay parameter of each pattern.

Returns: alphas – A list of time decay parameters, one for each pattern. list
sample_user_events(min_num_events=100, max_num_events=None, t_max=None)[source]

Samples events for a user.

Parameters: min_num_events (int, default is 100) – The minimum number of events to sample. max_num_events (int, default is None) – If not None, this is the maximum number of events to sample. t_max (float, default is None) – The time limit until which to sample events. events – A list of the form [(t_i, doc_i, user_i, meta_i), ...] sorted by increasing time that has all the events of the sampled users. Above, doc_i is the document and meta_i is any sort of metadata that we want for doc_i, e.g. question_id. The generator will return an empty list for meta_i. list
show_annotated_events(user=None, patterns=None, show_time=True, T_min=0, T_max=None)[source]

Returns a string where each event is annotated with the inferred pattern.

Parameters: user (int, default is None) – If given, the events returned are limited to the selected user patterns (list, default is None) – If not None, an event is return only if it belongs to one of the selected patterns show_time (bool, default is True) – Controls whether the time of the event will be shown T_min (float, default is 0) – Controls the minimum timestamp after which the events will be shown T_max (float, default is None) – If given, T_max controls the maximum timestamp shown str
show_pattern_content(patterns=None, words=0, detail_threshold=5)[source]

Shows the content distrubution of the inferred patterns.

Parameters: patterns (list, default is None) – If not None, only the content of the selected patterns will be shown words (int, default is 0) – A positive number that control how many words will be shown. The words are being shown sorted by their likelihood, starting with the most probable. detail_threshold (int, default is 5) – A positive number that sets the lower bound in the number of times that a word appeared in a pattern so that its count is shown. str
user_pattern_history_str(user=None, patterns=None, show_time=True, t_min=0)[source]

Returns a representation of the history of a user’s actions and the pattern that they correspond to.

Parameters: user (int, default is None) – An index to the user we want to inspect. If None, the function runs over all the users. patterns (list, default is None) – If not None, limit the actions returned to the ones that belong in the provided patterns. show_time (bool, default is True) – Control wether the timestamp will appear in the representation or not. t_min (float, default is 0) – The timestamp after which we only consider actions. str
user_patterns(user)[source]

Returns a list with the patterns that a user has adopted.

Parameters: user (int) –
user_patterns_set(user)[source]

Return the patterns that a specific user adopted.

Parameters: user (int) – The index of a user. The set of the patterns that the user adopted. set

# The Particle object¶

class hdhp.HDHProcess(num_patterns, alpha_0, mu_0, vocabulary, omega=1, doc_length=20, doc_min_length=5, words_per_pattern=10, random_state=None)[source]
__init__(num_patterns, alpha_0, mu_0, vocabulary, omega=1, doc_length=20, doc_min_length=5, words_per_pattern=10, random_state=None)[source]
Parameters: num_patterns (int) – The maximum number of patterns that will be shared across the users. alpha_0 (tuple) – The parameter that is used when sampling the time kernel weights of each pattern. The distribution that is being used is a Gamma. This tuple should be of the form (shape, scale). mu_0 (tuple) – The parameter of the Gamma distribution that is used to sample each user’s mu (activity level). This tuple should be of the form (shape, scale). vocabulary (list) – The list of available words to use when generating documents. omega (float, default is 1) – The decay parameter for the decay of the exponential decay kernel. doc_length (int, default is 20) – The maximum number of words per document. doc_min_length (int, default is 5) – The minimum number of words per document. words_per_pattern (int, default is 10) – The number of words that will have a non-zero probability to appear in each pattern. random_state (int or RandomState object, default is None) – The random number generator.
kernel(t_i, t_j)[source]

Returns the kernel function for t_i and t_j.

Parameters: t_i (float) – Timestamp representing now. t_j (float) – Timestamp representaing past. float
pattern_content_str(patterns=None, show_words=-1, min_word_occurence=5)[source]

Return the content information for the patterns of the process.

Parameters: patterns (list, default is None) – If this list is provided, only information about the patterns in the list will be returned. show_words (int, default is -1) – The maximum number of words to show for each pattern. Notice that the words are sorted according to their occurence count. min_word_occurence (int, default is 5) – Only show words that show up at least min_word_occurence number of times in the documents of the respective pattern. A string with all the content information str
plot(num_samples=500, T_min=0, T_max=None, start_date=None, users=None, user_limit=50, patterns=None, task_detail=False, save_to_file=False, filename='user_timelines', intensity_threshold=None, paper=True, colors=None, fig_width=20, fig_height_per_user=5, time_unit='months', label_every=3, seed=None)[source]

Plots the intensity of a set of users for a set of patterns over a time period.

In this plot, each user is a separate subplot and for each user the plot shows her event_rate for each separate pattern that she has been active at.

Parameters: num_samples (int, default is 500) – The granularity level of the intensity line. Smaller number of samples results in faster plotting, while larger numbers give much more detailed result. T_min (float, default is 0) – The minimum timestamp that the plot shows, in seconds. T_max (float, default is None) – If not None, this is the maximum timestamp that the plot considers, in seconds. start_date (datetime, default is None) – If provided, this is the actual datetime that corresponds to time 0. This is required if paper is True. users (list, default is None) – If provided, this list contains the id’s of the users that will be plotted. Actually, only the first user_limit of them will be shown. user_limit (int, default is 50) – The maximum number of users to plot. patterns (list, default is None) – The list of patterns that will be shown in the final plot. If None, all of the patterns will be plotted. task_detail (bool, default is False) – If True, thee plot has one line per task. Otherwise, we only plot the commulative intensity of all tasks under the same pattern. save_to_file (bool, default is False) – If True, the plot will be saved to a pdf and a png file. filename (str, default is 'user_timelines') – The name of the output file that will be used when saving the plot. intensity_threshold (float, default is None) – If provided, this is the maximum intensity value that will be plotted, i.e. the y_max that will be the cut-off threshold for the y-axis. paper (bool, default is True) – If True, the plot result will be the same as the figures that are in the published paper. colors (list, default is None) – A list of colors that will be used for the plot. Each color will correspond to a single pattern, and will be shared across all the users. fig_width (int, default is 20) – The width of the figure that will be returned. fig_height_per_user (int, default is 5) – The height of each separate user-plot of the final figure. If multiplied by the number of users, this determines the total height of the figure. Notice that due to a matplotlib constraint(?) the total height of the figure cannot be over 70. time_unit (str, default is 'months') – Controls wether the time units is measured in days (in which case it should be set to ‘days’) or months. label_every (int, default is 3) – The frequency of the labels that show in the x-axis. seed (int, default is None) – A seed to the random number generator used to assign colors to patterns. fig matplotlib.Figure object
reset()[source]

Removes all the events and users already sampled.

Note

It does not reseed the random number generator. It also retains the already sampled pattern parameters (word distributions and alphas)

sample_document(pattern)[source]

Sample a random document from a specific pattern.

Parameters: pattern (int) – The pattern from which to sample the content. A space separeted string that contains all the sampled words. str
sample_mu()[source]

Samples a value from the prior of the base intensity mu.

Returns: mu_u – The base intensity of a user, sampled from the prior. float
sample_next_time(pattern, user)[source]

Samples the time of the next event of a pattern for a given user.

Parameters: pattern (int) – The pattern index that we want to sample the next event for. user (int) – The index of the user that we want to sample for. timestamp float
sample_pattern_params()[source]

Returns the word distributions for each pattern.

Returns: parameters – A list of word distributions, one for each pattern. list
sample_pattern_popularity()[source]

Returns a popularity distribution over the patterns.

Returns: pattern_popularities – A list with the popularity distribution of each pattern. list
sample_time_kernels()[source]

Returns the time decay parameter of each pattern.

Returns: alphas – A list of time decay parameters, one for each pattern. list
sample_user_events(min_num_events=100, max_num_events=None, t_max=None)[source]

Samples events for a user.

Parameters: min_num_events (int, default is 100) – The minimum number of events to sample. max_num_events (int, default is None) – If not None, this is the maximum number of events to sample. t_max (float, default is None) – The time limit until which to sample events. events – A list of the form [(t_i, doc_i, user_i, meta_i), ...] sorted by increasing time that has all the events of the sampled users. Above, doc_i is the document and meta_i is any sort of metadata that we want for doc_i, e.g. question_id. The generator will return an empty list for meta_i. list
show_annotated_events(user=None, patterns=None, show_time=True, T_min=0, T_max=None)[source]

Returns a string where each event is annotated with the inferred pattern.

Parameters: user (int, default is None) – If given, the events returned are limited to the selected user patterns (list, default is None) – If not None, an event is return only if it belongs to one of the selected patterns show_time (bool, default is True) – Controls whether the time of the event will be shown T_min (float, default is 0) – Controls the minimum timestamp after which the events will be shown T_max (float, default is None) – If given, T_max controls the maximum timestamp shown str
show_pattern_content(patterns=None, words=0, detail_threshold=5)[source]

Shows the content distrubution of the inferred patterns.

Parameters: patterns (list, default is None) – If not None, only the content of the selected patterns will be shown words (int, default is 0) – A positive number that control how many words will be shown. The words are being shown sorted by their likelihood, starting with the most probable. detail_threshold (int, default is 5) – A positive number that sets the lower bound in the number of times that a word appeared in a pattern so that its count is shown. str
user_pattern_history_str(user=None, patterns=None, show_time=True, t_min=0)[source]

Returns a representation of the history of a user’s actions and the pattern that they correspond to.

Parameters: user (int, default is None) – An index to the user we want to inspect. If None, the function runs over all the users. patterns (list, default is None) – If not None, limit the actions returned to the ones that belong in the provided patterns. show_time (bool, default is True) – Control wether the timestamp will appear in the representation or not. t_min (float, default is 0) – The timestamp after which we only consider actions. str
user_patterns(user)[source]

Returns a list with the patterns that a user has adopted.

Parameters: user (int) –
user_patterns_set(user)[source]

Return the patterns that a specific user adopted.

Parameters: user (int) – The index of a user. The set of the patterns that the user adopted. set