Illuminate Insights: A Beacon of Light in Data Projects
Latest News X (formerly known as Twitter) Bot with Python

Part 1: Creation

Brandon I. King, Ph.D.
Apr 01, 2024

Introduction

In today’s exponentially growing digital realm, my news interests fall into three categories: technology, gaming, and sports (specifically the National Basketball Association and Mixed Martial Arts). I enjoy staying up to date on the latest technology, new game releases, National Basketball Association (NBA) analytics, and Mixed Martial Arts (MMA) fight nights. I wanted a way to consolidate my favorite news topics into one condensed post, which is one of the reasons I chose X(Twitter). In parallel, I wanted to create data visualizations related to the articles, since I enjoy numbers and analytics. This article dives deep into creating an X(Twitter) bot using Python, guiding you through the steps of building your own bespoke news aggregator.


Environment Setup

  • Operating System: Windows

  • Integrated Development Environment: Visual Studio Code


Prerequisites

Packages and Libraries

  • Python 3.11

  • Tweepy

  • Requests

  • BeautifulSoup

  • Pandas

  • Matplotlib

  • DateTime

  • NLTK

Install these packages and libraries by running this command in the terminal:

pip install -r requirements.txt

requirements.txt

beautifulsoup4==4.12.3
bs4==0.0.2
certifi==2024.2.2
charset-normalizer==3.3.2
click==8.1.7
colorama==0.4.6
contourpy==1.2.0
cycler==0.12.1
DateTime==5.4
fonttools==4.49.0
idna==3.6
joblib==1.3.2
kiwisolver==1.4.5
matplotlib==3.8.3
nltk==3.8.1
numpy==1.26.4
oauthlib==3.2.2
packaging==23.2
pandas==2.2.1
pillow==10.2.0
pyparsing==3.1.1
python-dateutil==2.9.0
pytz==2024.1
regex==2023.12.25
requests==2.31.0
requests-oauthlib==1.3.1
six==1.16.0
soupsieve==2.5
tabulate==0.9.0
tqdm==4.66.2
tweepy==4.14.0
tzdata==2024.1
urllib3==2.2.1
zope.interface==6.2

Data Sources

  • Google News: A great source for the latest news by topic or category, hosted by Google


Authenticating Twitter API

In creating our Twitter bot, one of the crucial initial steps is authenticating our access to the Twitter API. The process requires four credentials: an API key, an API secret, an access token, and an access token secret. The tweepy library lets you authenticate your developer account with these four credentials and initialize the X(Twitter) API in two simple lines of code. This section covers the mechanics of authentication; the following sections demonstrate how to initialize the X(Twitter) API seamlessly, setting the stage for our bot’s data retrieval and interaction capabilities. Below is a concise code snippet illustrating the authentication procedure, paving the way for the analytics journey ahead.

Code for Authenticating Twitter API v1

import tweepy

# Twitter API credentials (values from your X developer portal)
consumer_key = API_KEY
consumer_secret = API_SECRET
access_token = ACCESS_TOKEN
access_token_secret = ACCESS_TOKEN_SECRET

# Authenticate with Twitter v1.0a
auth = tweepy.OAuth1UserHandler(consumer_key, consumer_secret, access_token, access_token_secret)
api = tweepy.API(auth)
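
The snippet above assumes the four credential values already exist as variables. One common pattern (a sketch of my own, not code from the article) is to load them from environment variables so the secrets never live in the script:

import os

# Hypothetical environment variable names -- adjust to your own setup
API_KEY = os.environ["TWITTER_API_KEY"]
API_SECRET = os.environ["TWITTER_API_SECRET"]
ACCESS_TOKEN = os.environ["TWITTER_ACCESS_TOKEN"]
ACCESS_TOKEN_SECRET = os.environ["TWITTER_ACCESS_TOKEN_SECRET"]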

In transitioning to Twitter API v2, we encounter an additional credential vital for authentication: the bearer token. The bearer token provides application-only (app-only) authentication, giving developers a safer entry point to the API. The client is created using the bearer token along with the four credentials from before. As this article progresses, this client will demonstrate its significance when connecting to the X(Twitter) account and creating posts. Below, we present a concise code snippet encapsulating the creation of this client, laying the groundwork for our bot’s seamless integration with Twitter’s API v2.

Code for Authenticating Twitter API v2

# Authenticate Twitter API v2 Client
bearer_token = BEARER_TOKEN

client = tweepy.Client(
    bearer_token=bearer_token,
    consumer_key=consumer_key,
    consumer_secret=consumer_secret,
    access_token=access_token,
    access_token_secret=access_token_secret
    )
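
As a quick sanity check (an optional step, not part of the original walkthrough), Tweepy’s documented get_me() call confirms the client authenticated with the right account:

# Fetch the authenticated account; raises an error if the credentials are invalid
me = client.get_me()
print(me.data.username)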

Extracting Data Using BeautifulSoup

Before creating the post, the data needs to be extracted from the data sources. In this article, three different web pages from Google News are used. The BeautifulSoup library is a great tool for extracting data from web pages if you are familiar with HTML. Reading HTML is beyond the scope of this article, so I recommend gaining some prior knowledge of HTML before moving forward.

The function below extracts the first matching tag (i.e., the latest link) displayed on the web page using BeautifulSoup’s find() function, given the tag name ‘a’ and the class ‘gPFEn’. After retrieving the first matching tag (stored in the ‘link’ variable) with the article information, the tag’s text attribute is used to extract its string value (in this case, the article title). Then, the title is condensed to a specified number of words to fit within the character limit of the X(Twitter) post using the summarize_title() function described in the Formatting Data section. Passing href=True to find() guarantees that the tag found has a hyperlink. The hyperlink is extracted from the ‘link’ variable using link[‘href’][1:], and the value is concatenated to the base URL (i.e., https://news.google.com). The values created (title, sum_title, and link) are stored in the dictionary and returned for use in later functions. The code snippet below displays the get_top_news() function, a cornerstone in the quest to harvest, refine, and distribute timely news updates via X(Twitter).

# Function to fetch the top news link from a parsed webpage
def get_top_news(key:str, dictionary:dict, soup):
    """
    Function to fetch the top news link from a parsed webpage
    and store additional information in the dictionary.

    Args:
        key (str): category of news related to the article
        dictionary (dict): dictionary of topics and urls
        soup (Beautiful Soup object): Beautiful Soup object of the HTML document of the webpage

    Returns:
        dict: updated dictionary with title, sum_title, and link
    """
    # Google base URL
    base_url = 'https://news.google.com'
    
    # Extract first news link from the latest webpage
    link = soup.find('a', 'gPFEn', href=True)
    # Get title and link to most recent article
    dictionary[key]['title'] = link.text
    dictionary[key]['sum_title'] = summarize_title(link.text)
    dictionary[key]['link'] = base_url + link['href'][1:]
   
    return dictionary  # Return the top news link details by category
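
For context, here is a hedged sketch of how the soup argument might be produced and the function called; the topic URL placeholder and the dictionary layout are illustrative assumptions, and summarize_title() from the Formatting Data section must already be defined:

import requests
from bs4 import BeautifulSoup

# Placeholder URL -- substitute the Google News category page you want to track
url = 'https://news.google.com/topics/<topic-id>'

news = {'Technology': {}}  # one category shown as an example
soup = BeautifulSoup(requests.get(url).text, 'html.parser')
news = get_top_news('Technology', news, soup)
print(news['Technology']['sum_title'], news['Technology']['link'])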

On top of providing the latest link for each category, Google supplies three supporting articles related to the most recent article. Leveraging the robust capabilities of Python and BeautifulSoup, this data is extracted from the Google News webpage with finesse. The find_all() function of the BeautifulSoup library, given the tag ‘div’ and the classes ‘UOVeFe Jjkwtf’, retrieves the three supporting links related to the most recent article. After retrieving the three links, the user-created split_post_info() function extracts each supporting article’s author and the time posted relative to the current moment. The split_post_info() function sorts the values in descending order by time posted and stores them in the dictionary by category, as discussed in the Creating DataFrame Using Pandas section. The ‘sorted_dicts’ value created encapsulates these enriched datasets and is stored in the dictionary, and the returned result is used in later functions. The code snippet for the get_posts_info() function is shown below.

# Function to fetch the 3 latest posts about the top news link
def get_posts_info(key:str, dictionary:dict, soup):
    """
    Function to fetch the 3 latest posts about the top news link

    Args:
        key (str): category of news related to the article
        dictionary (dict): dictionary of topics and urls
        soup (Beautiful Soup object): Beautiful Soup object of the HTML document of the webpage

    Returns:
        dict: updated dictionary with sorted_dicts and df
    """
    posts_info = []

    # Extract the top 3 news posts related to the main article from the webpage
    for post in soup.find_all('div', 'UOVeFe Jjkwtf'):
        posts_info.append(post.text)
    # Store the formatted top 3 posts and their DataFrame by category
    dictionary[key]['sorted_dicts'], dictionary[key]['df'] = split_post_info(key, posts_info[:3])

    return dictionary  # Return the updated dictionary
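
Putting the two extraction functions together, a plausible driver loop (an assumption about the surrounding script, since the article shows only the functions) might look like this:

import requests
from bs4 import BeautifulSoup

# Hypothetical category-to-URL map; the article's actual topic URLs are not shown
topics = {
    'Technology': 'https://news.google.com/topics/<tech-id>',
    'NBA': 'https://news.google.com/topics/<nba-id>',
    'MMA': 'https://news.google.com/topics/<mma-id>',
}

news = {key: {} for key in topics}
for key, url in topics.items():
    soup = BeautifulSoup(requests.get(url).text, 'html.parser')
    news = get_top_news(key, news, soup)    # latest headline, summary, and link
    news = get_posts_info(key, news, soup)  # three supporting posts per headline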

Formatting Data

Navigating the X(Twitter) platform’s 280-character limit presents a challenge when crafting a concise and impactful post. The summarize_title() function uses the extractive summarization method, selecting an equal, specified number of words from the beginning and the end of the title to dynamically control the length of the message. This summarization method extracts the most important words from a sentence to provide a summary. After extraction, the words in the list are joined into one sentence, and the whitespace around symbols in the text is removed using the replace() function. Below, the code segment for the summarize_title() function portrays the process of transforming article titles into polished, Twitter-ready summaries.

from nltk.tokenize import word_tokenize  # NLTK tokenizer (requires the 'punkt' data)

# Function to summarize a title to a set length
def summarize_title(title:str, extract_len:int=10):
    """
    Function to summarize the title of an article extracted
    from the webpage using BeautifulSoup

    Args:
        title (str): the title of the article
        extract_len (int): the constraint for the length of the article title

    Returns:
        str: summarized title of the article
    """
    # Check if the title has more than 2*extract_len words
    if len(title.split()) > (extract_len*2):
        # Tokenize the title
        words = word_tokenize(title)

        # Reduce the number of words (adjust as needed) by selecting the first and last few words
        summary_length = extract_len
        summary_words = words[:summary_length] + words[-summary_length:]

        # Summarized title using the Extractive Summarization method
        summarized = " ".join(summary_words).replace(" ,", ",").replace(" '", "'").replace(" ?", "?").replace("‘ ", "‘").replace(" ’", "’")
    else:
        summarized = title

    return summarized
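
A brief illustration of the truncation behavior, using an invented headline (requires NLTK’s ‘punkt’ tokenizer data, e.g. nltk.download('punkt')):

# Invented headline with more than 20 words, so it gets truncated
long_title = ("NBA insiders say the trade deadline could reshape the Western "
              "Conference playoff race as contenders weigh several blockbuster "
              "moves before the cutoff on Thursday night, sources report")
print(summarize_title(long_title))  # first 10 words + last 10 words, rejoined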

This is the part of the article where we transition into the analytical portion of the project. Here, the focus pivots toward harnessing the power of data to extract meaningful insights. The convert_string_to_hours() function identifies the format of the time string (e.g., minutes, days, or “Yesterday”), converts the numerical value to hours, and returns the result as an integer. The code snippet for the convert_string_to_hours() function demonstrates the conversion from strings to integer hours.

def convert_string_to_hours(split_string:list):
    """
    Function that converts strings to hours
    Examples:
        - '34 hours ago'
        - '10 minutes ago'
        - 'Yesterday'

    Args:
        split_string (list): list of split strings evaluated to convert to time (in hours)

    Returns:
        int: converted time value in hours
    """
    if "minutes" in split_string[0]:
        minutes = int(split_string[0].split()[0])
        hours = minutes // 60  # floor division: under 60 minutes rounds down to 0
        time_part = hours
    elif "days" in split_string[0]:
        days = int(split_string[0].split()[0])
        hours = days * 24
        time_part = hours
    else:
        try:
            time_part = int(split_string[0].split()[0])
        except ValueError:  # If time part is "Yesterday"
            time_part = 24
            split_string[0] = split_string[0].replace("Yesterday", "")  # Remove "Yesterday" from the string
    
    return time_part
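
A few worked calls show the conversion rules; note that minutes are floor-divided by 60, so anything under an hour rounds down to 0:

print(convert_string_to_hours(['10 minutes']))  # 10 // 60 -> 0
print(convert_string_to_hours(['2 days']))      # 2 * 24   -> 48
print(convert_string_to_hours(['5 hours']))     # int('5') -> 5
print(convert_string_to_hours(['Yesterday']))   # ValueError branch -> 24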

Creating DataFrame Using Pandas

In the data analysis domain, the ability to transform raw data into a clean and structured format is pivotal. A Pandas DataFrame lets users format and analyze the data as a useful table. The split_post_info() function takes two inputs: a key and a list of strings. First, the function converts each post-time string to hours using the convert_string_to_hours() function. Then, it extracts the author’s name from each string and stores the values in a dictionary. The list of dictionaries is converted into a DataFrame for data visualization to complete the result, and the code snippet for the split_post_info() function is presented below.

import pandas as pd

def split_post_info(key:str, post_strings:list):
    """
    This function:
    - converts the post-time strings of related article urls to hours
    - extracts the author name of related article urls

    Args:
        key (str): category of news related to the article
        post_strings (list): list of information for related article urls

    Returns:
        dictionary, pd.DataFrame: dictionary and DataFrame of post time and author sorted by post time (descending)
    """
    # New list to store dictionaries
    modified_dicts = []

    # Convert days to hours and split strings by "ago"
    for string in post_strings:
        parts = string.split(" ago")
    
        time_part = convert_string_to_hours(parts)

        # Extract the name part
        try:
            name_part = parts[1].split("By")[0].strip()
            if name_part == "":
                name_part = "unknown"
        except IndexError:  # If time part is "Yesterday"
            name_part = parts[0].split("By")[0].strip()
            if name_part == "":
                name_part = "unknown"
        
        # Create a dictionary
        data_dict = {"time": time_part, "author": name_part}
        modified_dicts.append(data_dict)

    # Sort the list of dictionaries by "time" in descending order, then "author" in ascending order
    sorted_dicts = sorted(modified_dicts, key=lambda x: (-x["time"], x["author"]))

    # convert to DataFrame
    df = pd.DataFrame(sorted_dicts)
    
    # Add a header row to the DataFrame
    df.columns = [key, ""]
    
    # Create a DataFrame for the row above column names
    header_row = pd.DataFrame([["Prior Posts Time (hours ago)", "Author"]], columns=df.columns)

    # Concatenate the header row with the original DataFrame
    df = pd.concat([header_row, df], ignore_index=True)
    
    return sorted_dicts, df

Generating a Data Visualization Image Using Matplotlib

The DataFrame created by the split_post_info() function allows us to create a data visualization using the Matplotlib library. The steps to create this table visualization are:

  1. Create a figure using plt.subplots() and specify the size of the figure in inches using the figsize parameter. We convert the pixel dimensions to inches by dividing by the DPI (100 in this case).

  2. Set the DPI to 100 using the dpi parameter.

  3. Hide the axes using ax.axis(‘off’) to display the table without any axes.

  4. Create the table using ax.table() with the data from the DataFrame, passing df.values as cellText and df.columns as colLabels to populate the cells and headers.

  5. Finally, we save the figure as a PNG file using plt.savefig() with the desired filename (‘table.png’).

Below is the code for the generate_table_image() function, which converts the DataFrame to a table image.

import matplotlib.pyplot as plt

def generate_table_image(df:pd.DataFrame):
    """
    Function that creates a table using Matplotlib
    of the 3 latest posts about the top news link
    and saves the table as a .png file

    Args:
        df (pd.DataFrame): DataFrame of the top 3 news posts related to the main article from the webpage
    """
    # Create a figure and axis (1600 x 900 pixels at 100 DPI)
    fig, ax = plt.subplots(figsize=(1600/100, 900/100), dpi=100)
    
    # Create a custom table
    colors = ['lemonchiffon', 'lemonchiffon', 'lightgreen', 'lightgreen', 'lightcoral', 'lightcoral']
    ax.table(cellText=df.values, colLabels=df.columns, loc='center', cellLoc='center', colColours=colors, fontsize=1000)

    # Remove axis
    ax.axis('off')  # Turn off axis
    
    # Add title
    ax.set_title('Most Recent Articles by Category', fontsize=20, fontweight='bold')
    
    fig.tight_layout()  # Adjust layout

    # Save the plot as a .png file
    fig.savefig('table.png')
    
    # Show the plot (optional)
    # plt.show()
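
With the DataFrame from split_post_info() in hand, generating the image is a single call (the dictionary key here is illustrative):

# Writes table.png to the working directory, ready to attach to the post
generate_table_image(news['Technology']['df'])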

Posting Messages to X(Twitter)

The pivotal moment of sharing the curated message with the X(Twitter) world arrives once the previous steps are complete. As stated in a previous section, X(Twitter) limits posts to 280 characters, so you may have to condense your message depending on your ideal post. With that, the bot’s inaugural post was ready to go live.
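
A minimal sketch of the posting step using Tweepy’s documented endpoints (not necessarily the article’s exact code; the message text is an invented placeholder): the v1.1 API handles the media upload, since Tweepy’s v2 client has no upload endpoint, and the v2 client publishes the post.

# Upload the table image generated earlier with the v1.1 media endpoint
media = api.media_upload('table.png')

# Publish the post with the v2 client, attaching the uploaded image
# Placeholder message -- condense as needed to stay under 280 characters
message = "Today's top Technology, NBA, and MMA headlines:"
client.create_tweet(text=message, media_ids=[media.media_id])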
