Building a Simple Web Scraper with Python BeautifulSoup Requests #158255
Replies: 8 comments 5 replies
-
Thanks for posting in the GitHub Community, @Samuelcoderg! We're happy you're here. You are more likely to get a useful response if you post your question in the applicable category; the Discussions category is solely for conversations around the GitHub Discussions product. This question should be in the Programming Help category, so I've gone ahead and moved it for you. Good luck!
-
I notice a few concerns with your post: this doesn't seem to be genuine product feedback for GitHub; it appears to be code sharing disguised as feedback. If you're looking to share Python projects and code examples, GitHub Discussions isn't typically the right forum for this type of content. Instead, I'd recommend creating a proper GitHub repository to host your code. If you're genuinely looking to provide feedback about a GitHub product or feature, I'd recommend reformatting your post to clearly explain what aspect of GitHub you're providing feedback on.
-
To help you improve and refine your simple web scraper, here are a few tips and suggestions for your code: 1. Error Handling for
-
That's a cool start on web scraping! Check out this version I made with some features you might want to add to your script, like a User-Agent header, URL validation, error handling, and a timeout parameter to avoid hanging:

import requests
from bs4 import BeautifulSoup
import time
from urllib.parse import urlparse
import csv

def scrape_blog_titles(url, selector='h2', class_name=None, save_output=False):
    # Identify as a regular browser; some sites block the default requests User-Agent
    headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36'}
    try:
        # Basic URL validation before making the request
        parsed_url = urlparse(url)
        if not all([parsed_url.scheme, parsed_url.netloc]):
            print("Invalid URL format. Please include http:// or https://")
            return
        print(f"Requesting {url}...")
        # The timeout keeps the script from hanging on an unresponsive server
        response = requests.get(url, headers=headers, timeout=10)
        response.raise_for_status()
        soup = BeautifulSoup(response.text, 'html.parser')
        # Search by tag, optionally narrowed down by CSS class
        if class_name:
            titles = soup.find_all(selector, class_=class_name)
        else:
            titles = soup.find_all(selector)
        if not titles:
            print("No titles found. Try adjusting your selector parameters.")
            return
        print(f"Found {len(titles)} article(s):")
        results = []
        for idx, title in enumerate(titles, 1):
            title_text = title.get_text(strip=True)
            print(f"{idx}. {title_text}")
            results.append(title_text)
        # Optionally export the titles to a timestamped CSV file
        if save_output and results:
            filename = f"scraped_titles_{int(time.time())}.csv"
            with open(filename, 'w', newline='', encoding='utf-8') as f:
                writer = csv.writer(f)
                writer.writerow(['Title'])
                for title in results:
                    writer.writerow([title])
            print(f"Results saved to {filename}")
    except requests.exceptions.RequestException as e:
        print(f"Error during request: {e}")
    except Exception as e:
        print(f"Unexpected error: {e}")

if __name__ == "__main__":
    scrape_blog_titles("https://jekyllrb.com/")
-
Hey, this looks great! Would definitely like to see more stuff like this from you. Maybe next time you could show how to save the results or handle pagination?
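For pagination, here is a rough sketch of one possible approach, assuming the blog exposes a rel="next" link on each page; that assumption will not hold for every site, so the selector would need adjusting for real targets:

import time
import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin

def scrape_all_pages(start_url, max_pages=5):
    # Collect <h2> titles from each page, following rel="next" links until none remain
    titles = []
    url = start_url
    for _ in range(max_pages):
        response = requests.get(url, timeout=10)
        response.raise_for_status()
        soup = BeautifulSoup(response.text, 'html.parser')
        titles.extend(h2.get_text(strip=True) for h2 in soup.find_all('h2'))
        next_link = soup.find('a', rel='next')
        if not next_link or not next_link.get('href'):
            break
        url = urljoin(url, next_link['href'])
        time.sleep(1)  # small delay to be polite to the server
    return titles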
-
Hi @Samuelcoderg, this is a great example of an initial web scraping project. Next time you could take on more complicated problems, such as scraping speed, IP blocking from making too many requests, anti-bot checks, and so on. I hope you keep growing in this field!
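As a concrete first step on the rate-limiting side, here is a minimal sketch that backs off when a server answers HTTP 429 (Too Many Requests); it assumes the Retry-After header, when present, is given in seconds rather than as a date:

import time
import requests

def get_with_rate_limit(url, max_attempts=3):
    # Retry a GET request when the server signals rate limiting with HTTP 429
    response = requests.get(url, timeout=10)
    for attempt in range(max_attempts):
        if response.status_code != 429:
            return response
        # Respect the server's Retry-After hint if present, otherwise wait a few seconds
        wait = int(response.headers.get("Retry-After", 5))
        print(f"Rate limited, waiting {wait}s before retrying...")
        time.sleep(wait)
        response = requests.get(url, timeout=10)
    return response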
-
Hey, nice work... I might also attempt to build one, and this is a useful example for me in the future... keep going, man :)
-
Suggestions to Improve Your Web Scraper

Thanks for sharing your project! Here are some tips to make your web scraper more robust and user-friendly:

1. Add Error Handling. Use a try/except block around the request so failures are reported instead of crashing the script:

    try:
        response = requests.get(url)
        response.raise_for_status()
    except requests.exceptions.RequestException as e:
        print(f"Error occurred: {e}")
        return

2. Save Results to a File. Exporting scraped data (like article titles) to a CSV makes your scraper more reusable and the data easier to analyze or share:

    import csv

    def save_to_csv(titles):
        with open('article_titles.csv', 'w', newline='', encoding='utf-8') as file:
            writer = csv.writer(file)
            writer.writerow(['Title'])
            for title in titles:
                writer.writerow([title])

3. Respect Website Rules. Always check the site's robots.txt before scraping.

4. Throttle Your Requests. Add a delay (e.g., time.sleep(1)) between requests so you don't overload the server.

5. Use Flexible Selectors. Instead of hardcoding tags and classes, consider using broader CSS selectors so small layout changes don't break your scraper.

6. Implement Retries. For transient errors (like timeouts or temporary server issues), add simple retry logic to improve reliability; a small sketch follows below.

By following these suggestions, your scraper will be more robust, flexible, and friendly to both users and websites.
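For point 6, a minimal retry sketch; the attempt count and delay are arbitrary, and it retries on any request failure such as a timeout, a connection error, or an HTTP error response:

import time
import requests

def get_with_retries(url, attempts=3, delay=2):
    # Try the request a few times before giving up, pausing between attempts
    for attempt in range(1, attempts + 1):
        try:
            response = requests.get(url, timeout=10)
            response.raise_for_status()
            return response
        except requests.exceptions.RequestException as e:
            print(f"Attempt {attempt}/{attempts} failed: {e}")
            if attempt == attempts:
                raise
            time.sleep(delay)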
-
Discussion Type
Product Feedback
Discussion Content
Hi everyone! 👋
I wanted to share a small Python project I've been working on: a web scraper that pulls specific data from a webpage. I used the BeautifulSoup and requests libraries for this, and it's a great way to get started with web scraping. It's here for you; leave a comment if you like it and I will share more code, totally free.
import requests
from bs4 import BeautifulSoup

# Function to scrape the titles of articles from a blog
def scrape_blog_titles(url):
    # Send a GET request to the URL
    response = requests.get(url)
    # Parse the page and print each article title
    # (assuming titles live in <h2> tags; adjust the tag/class for your target blog)
    soup = BeautifulSoup(response.text, 'html.parser')
    for title in soup.find_all('h2'):
        print(title.get_text(strip=True))

# Example usage
url = "https://example-blog.com"  # Replace with any blog URL
scrape_blog_titles(url)