Building a Simple Web Scraper with Python BeautifulSoup Requests #158255
Replies: 8 comments 5 replies
-
Thanks for posting in the GitHub Community, @Samuelcoderg! We're happy you're here. You are more likely to get a useful response if you post your question in the applicable category; the Discussions category is solely for conversations around the GitHub Discussions product. This question should be in the Programming Help category, so I've gone ahead and moved it for you. Good luck!
-
I notice a few concerns with your post: this doesn't seem to be genuine product feedback for GitHub; it appears to be code sharing disguised as feedback. If you're looking to share Python projects and code examples, GitHub Discussions isn't typically the right forum for this type of content. Instead, I'd recommend creating a proper GitHub repository to host your code. If you're genuinely looking to provide feedback about a GitHub product or feature, I'd recommend reformatting your post to clearly explain what aspect of GitHub you're providing feedback on.
-
To help you improve and refine your simple web scraper, here are a few tips and suggestions for your code: 1. Error Handling for
-
That's a cool start on web scraping! Check out this version I made with some features you might want to add to your script, like a User-Agent header, URL validation, error handling, and a timeout parameter to avoid hanging:

import requests
from bs4 import BeautifulSoup
import time
from urllib.parse import urlparse
import csv

def scrape_blog_titles(url, selector='h2', class_name=None, save_output=False):
    # Identify as a regular browser; some sites block the default requests User-Agent
    headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36'}
    try:
        # Basic URL validation before making the request
        parsed_url = urlparse(url)
        if not all([parsed_url.scheme, parsed_url.netloc]):
            print("Invalid URL format. Please include http:// or https://")
            return
        print(f"Requesting {url}...")
        # The timeout keeps the script from hanging on an unresponsive server
        response = requests.get(url, headers=headers, timeout=10)
        response.raise_for_status()
        soup = BeautifulSoup(response.text, 'html.parser')
        # Search by tag, optionally narrowed down by CSS class
        if class_name:
            titles = soup.find_all(selector, class_=class_name)
        else:
            titles = soup.find_all(selector)
        if not titles:
            print("No titles found. Try adjusting your selector parameters.")
            return
        print(f"Found {len(titles)} article(s):")
        results = []
        for idx, title in enumerate(titles, 1):
            title_text = title.get_text(strip=True)
            print(f"{idx}. {title_text}")
            results.append(title_text)
        # Optionally export the titles to a timestamped CSV file
        if save_output and results:
            filename = f"scraped_titles_{int(time.time())}.csv"
            with open(filename, 'w', newline='', encoding='utf-8') as f:
                writer = csv.writer(f)
                writer.writerow(['Title'])
                for title in results:
                    writer.writerow([title])
            print(f"Results saved to {filename}")
    except requests.exceptions.RequestException as e:
        print(f"Error during request: {e}")
    except Exception as e:
        print(f"Unexpected error: {e}")

if __name__ == "__main__":
    scrape_blog_titles("https://jekyllrb.com/")
-
Hey, this looks great! Would definitely like to see more stuff like this from you. Maybe next time you could show how to save the results or handle pagination?
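For pagination, here is a rough sketch of one possible approach, assuming the blog exposes a rel="next" link on each page; that assumption will not hold for every site, so the selector would need adjusting for real targets:

import time
import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin

def scrape_all_pages(start_url, max_pages=5):
    # Collect <h2> titles from each page, following rel="next" links until none remain
    titles = []
    url = start_url
    for _ in range(max_pages):
        response = requests.get(url, timeout=10)
        response.raise_for_status()
        soup = BeautifulSoup(response.text, 'html.parser')
        titles.extend(h2.get_text(strip=True) for h2 in soup.find_all('h2'))
        next_link = soup.find('a', rel='next')
        if not next_link or not next_link.get('href'):
            break
        url = urljoin(url, next_link['href'])
        time.sleep(1)  # small delay to be polite to the server
    return titles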
-
Hi @Samuelcoderg, this is a great example of an initial web scraping project. Next time you could take on more complicated problems, such as scraping speed, IP blocking from making too many requests, anti-bot checks, and so on. I hope you keep growing in this field!
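As a concrete first step on the rate-limiting side, here is a minimal sketch that backs off when a server answers HTTP 429 (Too Many Requests); it assumes the Retry-After header, when present, is given in seconds rather than as a date:

import time
import requests

def get_with_rate_limit(url, max_attempts=3):
    # Retry a GET request when the server signals rate limiting with HTTP 429
    response = requests.get(url, timeout=10)
    for attempt in range(max_attempts):
        if response.status_code != 429:
            return response
        # Respect the server's Retry-After hint if present, otherwise wait a few seconds
        wait = int(response.headers.get("Retry-After", 5))
        print(f"Rate limited, waiting {wait}s before retrying...")
        time.sleep(wait)
        response = requests.get(url, timeout=10)
    return response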
-
Hey, nice work... I might also attempt to build one, and this is a useful example for me in the future... keep going, man :)
-
Suggestions to Improve Your Web Scraper

Thanks for sharing your project! Here are some tips to make your web scraper more robust and user-friendly:

1. Add Error Handling. Use a try/except block around the request so failures are reported instead of crashing the script:

    try:
        response = requests.get(url)
        response.raise_for_status()
    except requests.exceptions.RequestException as e:
        print(f"Error occurred: {e}")
        return

2. Save Results to a File. Exporting scraped data (like article titles) to a CSV makes your scraper more reusable and the data easier to analyze or share:

    import csv

    def save_to_csv(titles):
        with open('article_titles.csv', 'w', newline='', encoding='utf-8') as file:
            writer = csv.writer(file)
            writer.writerow(['Title'])
            for title in titles:
                writer.writerow([title])

3. Respect Website Rules. Always check the site's robots.txt before scraping.

4. Throttle Your Requests. Add a delay (e.g., time.sleep(1)) between requests so you don't overload the server.

5. Use Flexible Selectors. Instead of hardcoding tags and classes, consider using broader CSS selectors so small layout changes don't break your scraper.

6. Implement Retries. For transient errors (like timeouts or temporary server issues), add simple retry logic to improve reliability; a small sketch follows below.

By following these suggestions, your scraper will be more robust, flexible, and friendly to both users and websites.
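For point 6, a minimal retry sketch; the attempt count and delay are arbitrary, and it retries on any request failure such as a timeout, a connection error, or an HTTP error response:

import time
import requests

def get_with_retries(url, attempts=3, delay=2):
    # Try the request a few times before giving up, pausing between attempts
    for attempt in range(1, attempts + 1):
        try:
            response = requests.get(url, timeout=10)
            response.raise_for_status()
            return response
        except requests.exceptions.RequestException as e:
            print(f"Attempt {attempt}/{attempts} failed: {e}")
            if attempt == attempts:
                raise
            time.sleep(delay)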
-
Discussion Type
Product Feedback
Discussion Content
Hi everyone! 👋
I wanted to share a small Python project I've been working on: a web scraper that pulls specific data from a webpage. I used the BeautifulSoup and requests libraries for this, and it's a great way to get started with web scraping. It's here for you; leave a comment if you like it and I will share more code, totally free.
import requests
from bs4 import BeautifulSoup

# Function to scrape the titles of articles from a blog
def scrape_blog_titles(url):
    # Send a GET request to the URL
    response = requests.get(url)
    # Parse the page and print each article title
    # (assuming titles live in <h2> tags; adjust the tag/class for your target blog)
    soup = BeautifulSoup(response.text, 'html.parser')
    for title in soup.find_all('h2'):
        print(title.get_text(strip=True))

# Example usage
url = "https://example-blog.com"  # Replace with any blog URL
scrape_blog_titles(url)