How to Scrape News Articles Automatically and Save Them to a Database
Guide: Learn how to scrape news articles automatically and save them to a database, with a step-by-step walkthrough covering Python code, scheduling, and database storage.
News moves fast — and if you're building a media monitoring tool, a research dataset, a sentiment analysis pipeline, or a competitive intelligence feed, you can't be manually copying articles into a spreadsheet. You need a system that watches sources continuously, pulls new content as it appears, and persists it somewhere you can actually query.
Scraping news articles automatically means building a pipeline that fetches article content from news sites on a schedule, extracts the structured data you care about — headline, author, body text, publication date, URL — and saves it to a database where it's available for downstream analysis. With the right tools and architecture, this is genuinely straightforward to build. In this guide, we'll walk through the complete process: reading RSS feeds for efficient article discovery, extracting clean article content from web pages, storing everything to a database, and scheduling the pipeline to run continuously without manual intervention.
By the end, you'll have a working foundation for an automated news scraping pipeline you can adapt to your specific sources and storage requirements.
Table of Contents
- What Is News Scraping and How Does It Work?
- How the News Scraping Pipeline Works
- Step-by-Step Guide: Building an Automated News Scraper
- Common Challenges and Limitations
- Conclusion
- What We Learned
- FAQ
What Is News Scraping and How Does It Work?
News scraping is the automated extraction of article content — headlines, body text, authors, publication dates, URLs — from news publisher websites. It's a specific application of web scraping, distinguished by its data targets (article-structured content), its cadence (continuous monitoring rather than one-time extraction), and its downstream use cases (media monitoring, sentiment analysis, research datasets, content aggregation).
The most practical approach to news scraping combines two techniques. RSS feeds are the discovery layer: most news publishers expose RSS or Atom feeds listing their latest articles with basic metadata. Polling these feeds gives you a reliable, low-overhead way to know when new articles are published without having to scrape the site's homepage or section pages. Full article extraction is the content layer: once you have the article URL from the feed, you fetch and parse the full article page to extract the complete body text, author information, and any additional structured data the feed doesn't provide.
This two-layer approach is more efficient than crawling a publisher's full website, more reliable than detecting new articles by scraping index pages, and more respectful of the publisher's infrastructure. The RSS 2.0 specification, hosted by Harvard's Berkman Klein Center, describes RSS as a web content syndication format, so using it as your discovery mechanism is using the web as it was intended.
The database layer sits at the end of the pipeline, converting the stream of article extractions into a persistent, queryable store. SQLite works for single-machine prototypes and moderate volume; PostgreSQL or MySQL are appropriate for production workloads and team access.
How the News Scraping Pipeline Works
Before diving into code, it helps to see the full pipeline as a sequence of distinct stages — because each stage has its own concerns and failure modes.
Stage 1 — Feed polling. Your scraper fetches the RSS or Atom feed from each configured source. The feed returns an XML document listing recent articles with their URLs, titles, publication timestamps, and sometimes a short summary. The scraper parses this XML, identifies articles it hasn't seen before (by checking URLs against the database), and queues the new ones for content extraction.
Stage 2 — Article fetching. For each new article URL, the scraper sends an HTTP request to the full article page. For news sites that render their content server-side (most traditional publishers), the HTML response contains the article body. For newer, JavaScript-rendered news sites — many modern digital-native publishers use React or similar front-end frameworks — the initial HTML is a skeleton, and the actual content loads via JavaScript after page load. These require a different fetch approach.
Stage 3 — Content extraction. The raw HTML of the article page needs to be parsed into clean, structured fields. This is harder than it sounds: news article pages contain navigation, headers, footers, related article widgets, comment sections, ads, and social sharing elements — all wrapped around the content you actually want. A naive extraction of all text from a news page produces unusable noise. Article extraction libraries handle this automatically, identifying the main content region and separating it from the surrounding page chrome.
Stage 4 — Database storage. Extracted article data — headline, author, body text, publication date, source URL, feed source — gets written to the database. A deduplication check (usually based on URL) prevents re-inserting articles the pipeline has already processed on previous runs.
Stage 5 — Scheduling. The whole pipeline runs on a schedule — every 15 minutes, every hour, or whatever cadence matches your freshness requirements. A scheduler triggers Stage 1 on the configured interval, and the pipeline processes whatever new articles have appeared since the last run.
Step-by-Step Guide: Building an Automated News Scraper
Step 1: Set Up Dependencies
You'll need three core libraries. Install them with pip:
pip install feedparser trafilatura apscheduler
feedparser handles RSS and Atom feed parsing across different feed formats and encoding variants — it's the de facto standard Python library for feed consumption, documented at https://feedparser.readthedocs.io. trafilatura is a battle-tested article extraction library that identifies the main content region of news pages and strips surrounding boilerplate — it consistently outperforms manual BeautifulSoup extraction for news article body text. apscheduler handles the scheduling layer, letting you run the pipeline on a configurable interval without needing a separate cron configuration.
Step 2: Set Up the Database Schema
SQLite is the right choice for getting started quickly — no server to configure, no credentials to manage, and the built-in sqlite3 module is all you need.
import sqlite3

def init_db(db_path="news.db"):
    """Create the articles table if it does not exist yet."""
    conn = sqlite3.connect(db_path)
    cursor = conn.cursor()
    cursor.execute("""
        CREATE TABLE IF NOT EXISTS articles (
            id INTEGER PRIMARY KEY AUTOINCREMENT,
            url TEXT UNIQUE NOT NULL,  -- UNIQUE prevents duplicate inserts
            title TEXT,
            author TEXT,
            published_date TEXT,
            body_text TEXT,
            source_feed TEXT,
            scraped_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
        )
    """)
    conn.commit()
    conn.close()
Two design decisions are worth noting. First, url TEXT UNIQUE means that inserting an article that already exists raises sqlite3.IntegrityError with a plain INSERT and is skipped silently with INSERT OR IGNORE, giving you automatic deduplication. Second, scraped_at records when your pipeline captured the article, separate from published_date, which is when the publisher released it; both are useful and serve different analytical purposes.
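To make the deduplication behavior concrete, here is a minimal sketch using an in-memory database and a trimmed-down version of the schema. The insert_article helper and the example.com URLs are illustrative assumptions, not part of the pipeline above.

```python
import sqlite3

# In-memory database so the sketch leaves no file behind.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE articles (
        id INTEGER PRIMARY KEY AUTOINCREMENT,
        url TEXT UNIQUE NOT NULL,
        title TEXT
    )
""")

def insert_article(conn, url, title):
    """Hypothetical helper: return True if the row was new, False if the URL existed."""
    cursor = conn.execute(
        "INSERT OR IGNORE INTO articles (url, title) VALUES (?, ?)",
        (url, title),
    )
    conn.commit()
    # rowcount is 1 when a row was written and 0 when the insert was ignored.
    return cursor.rowcount == 1

first = insert_article(conn, "https://example.com/story-1", "Story one")
second = insert_article(conn, "https://example.com/story-1", "Story one (again)")
row_count = conn.execute("SELECT COUNT(*) FROM articles").fetchone()[0]
print(first, second, row_count)
```

The second insert is ignored rather than raising, so the table still holds a single row for that URL.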
Step 3: Poll RSS Feeds for New Articles
import feedparser
import sqlite3

FEED_URLS = [
    "https://feeds.bbci.co.uk/news/rss.xml",
    "https://rss.nytimes.com/services/xml/rss/nyt/HomePage.xml",
    # Add your target sources here
]

def get_known_urls(db_path="news.db"):
    """Return a set of URLs already in the database."""
    conn = sqlite3.connect(db_path)
    cursor = conn.cursor()
    cursor.execute("SELECT url FROM articles")
    known = {row[0] for row in cursor.fetchall()}
    conn.close()
    return known

def fetch_new_article_urls(db_path="news.db"):
    """Poll configured feeds and return URLs not yet in the database."""
    known_urls = get_known_urls(db_path)
    new_articles = []
    for feed_url in FEED_URLS:
        feed = feedparser.parse(feed_url)
        for entry in feed.entries:
            url = entry.get("link", "")
            if url and url not in known_urls:
                new_articles.append({
                    "url": url,
                    "title": entry.get("title", ""),
                    # Many feeds omit the author; fall back to an empty string
                    "author": entry.get("author", ""),
                    "published_date": entry.get("published", ""),
                    "source_feed": feed_url,
                })
                # Avoid queuing the same URL twice when it appears in several feeds
                known_urls.add(url)
    return new_articles
feedparser.parse() handles the HTTP request and XML parsing in one call, normalizing the result across RSS 2.0, Atom 1.0, and other variants into a consistent Python dictionary structure. The deduplication check against known_urls ensures the pipeline only processes articles it hasn't seen before, even if the same article stays in the feed across multiple polling cycles.
Step 4: Extract Article Content
import trafilatura

def extract_article_content(url):
    """Fetch a full article page and extract clean body text."""
    downloaded = trafilatura.fetch_url(url)
    if downloaded is None:
        return None
    # extract() returns clean article body text, stripping navigation and boilerplate
    body_text = trafilatura.extract(downloaded)
    return body_text
trafilatura.fetch_url() handles the HTTP request, and trafilatura.extract() identifies the main content region using a combination of HTML structural signals and content density heuristics. In practice, it correctly strips navigation, sidebars, comments, and ad regions on the vast majority of standard news article pages. For pages where it returns None — heavily paywalled content, pages that fail to load, or unusual page structures — the pipeline should log the failure and skip rather than crashing.
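One way to get the log-and-skip behavior described above is to wrap the extraction call. The sketch below is a minimal illustration: extract_all and fake_extractor are hypothetical names, and the extractor is passed in as a parameter so the example runs without network access. In the real pipeline you would pass extract_article_content instead.

```python
import logging

logger = logging.getLogger("news_pipeline")

def extract_all(urls, extractor):
    """Run extractor on each URL; return (results, failed_urls) without raising."""
    results, failed = {}, []
    for url in urls:
        try:
            body = extractor(url)
        except Exception:
            # Network errors, parser errors, etc.: record and move on
            logger.exception("Extraction raised for %s", url)
            failed.append(url)
            continue
        if not body:
            # None or empty text: paywall, empty shell, unusual layout
            logger.warning("No content extracted from %s", url)
            failed.append(url)
            continue
        results[url] = body
    return results, failed

def fake_extractor(url):
    """Stand-in for extract_article_content, used only for this demonstration."""
    if "broken" in url:
        raise ValueError("simulated network error")
    if "paywall" in url:
        return None
    return f"Body of {url}"

urls = [
    "https://example.com/ok",
    "https://example.com/paywall",
    "https://example.com/broken",
]
results, failed = extract_all(urls, fake_extractor)
```

Returning the failed URLs separately lets a later run retry them or lets you track the failure rate per source.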
Step 5: Save to the Database and Schedule the Pipeline
import sqlite3
from apscheduler.schedulers.blocking import BlockingScheduler

# Assumes init_db, fetch_new_article_urls, and extract_article_content
# from Steps 2-4 are defined in the same file.

def save_articles(articles, db_path="news.db"):
    """Write a list of article dicts to the database, ignoring duplicates."""
    conn = sqlite3.connect(db_path)
    cursor = conn.cursor()
    for article in articles:
        cursor.execute("""
            INSERT OR IGNORE INTO articles
                (url, title, author, published_date, body_text, source_feed)
            VALUES (?, ?, ?, ?, ?, ?)
        """, (
            article["url"],
            article.get("title", ""),
            article.get("author", ""),
            article.get("published_date", ""),
            article.get("body_text", ""),
            article.get("source_feed", ""),
        ))
    conn.commit()
    conn.close()

def run_pipeline():
    """One full pipeline run: discover new articles, extract content, save."""
    print("Pipeline run started...")
    new_articles = fetch_new_article_urls()
    extracted = []
    for article in new_articles:
        body_text = extract_article_content(article["url"])
        if not body_text:
            # Log and skip; the URL stays unsaved, so a later run can retry it
            print(f"Skipping (no content extracted): {article['url']}")
            continue
        article["body_text"] = body_text
        extracted.append(article)
    save_articles(extracted)
    print(f"Saved {len(extracted)} new articles.")

if __name__ == "__main__":
    init_db()
    run_pipeline()  # Run once immediately on start
    scheduler = BlockingScheduler()
    # Run the pipeline every 30 minutes
    scheduler.add_job(run_pipeline, "interval", minutes=30)
    scheduler.start()
INSERT OR IGNORE in SQLite means duplicate URLs skip silently rather than raising an exception — the UNIQUE constraint on the url column handles deduplication at the database level. APScheduler's BlockingScheduler keeps the process running and fires run_pipeline() on the configured interval. For production deployment, running this in a Docker container or as a systemd service keeps it alive across machine restarts.
Common Challenges and Limitations
Paywalled and subscription-gated content. Many premium news publishers — the WSJ, FT, Bloomberg — put their article content behind paywalls. The RSS feed may list article URLs and titles, but fetching the full article page returns a paywall prompt rather than the article body. trafilatura will extract whatever text is visible on the paywall page, which is typically the lede paragraph or a truncated preview. There's no clean technical solution for this within the bounds of legitimate scraping — accessing paywalled content requires a subscription, and extracting content the site intentionally restricts violates terms of service.
JavaScript-rendered article pages. Most traditional news publishers render articles server-side, which means trafilatura.fetch_url() receives the full article content in the initial HTML response. But some digital-native publishers and aggregators render their article content via JavaScript after page load — fetching the URL with a simple HTTP request returns an empty shell. For these targets, you need a browser-based fetch rather than a plain HTTP request. Tools like Playwright can render these pages fully before passing the HTML to trafilatura for extraction. For teams managing multiple sources across different rendering types, a managed platform like MrScraper handles browser rendering and content extraction through a single API call, removing the need to maintain both a Playwright setup and an extraction library simultaneously. More at https://mrscraper.com.
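Before paying the cost of a browser render, it can help to check whether the plain HTTP response already contains readable text. The sketch below is one possible heuristic, not an established method: it counts visible characters outside script and style tags and flags pages under a threshold. The names VisibleTextCounter and looks_like_js_shell and the 200-character threshold are assumptions to tune against your own sources.

```python
from html.parser import HTMLParser

class VisibleTextCounter(HTMLParser):
    """Count characters of text outside <script>, <style>, and <noscript>."""
    SKIPPED = {"script", "style", "noscript"}

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.visible_chars = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0:
            self.visible_chars += len(data.strip())

def looks_like_js_shell(html, min_visible_chars=200):
    """Return True when the HTML has too little visible text to be an article."""
    counter = VisibleTextCounter()
    counter.feed(html)
    counter.close()
    return counter.visible_chars < min_visible_chars

shell_html = (
    "<html><head><script>window.__DATA__ = {};</script></head>"
    '<body><div id="root"></div></body></html>'
)
article_html = "<html><body><article><p>" + "word " * 100 + "</p></article></body></html>"

shell_flag = looks_like_js_shell(shell_html)
article_flag = looks_like_js_shell(article_html)
```

Pages flagged this way are candidates for a browser-based fetch; everything else can stay on the cheaper plain-HTTP path.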
Rate limiting and IP-based blocking. Making hundreds of requests per hour to the same news domain will trigger rate limiting or IP blocking on most publisher infrastructure. Implement per-domain request delays — at minimum a few seconds between requests to the same host — and consider distributing requests across your configured sources rather than exhausting one feed's articles before moving to the next. For pipelines monitoring many sources at high frequency, rotating request IPs or using a proxy layer reduces the likelihood of blocking at the publisher level.
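A per-domain delay can be sketched with the standard library alone. The DomainThrottle class below is a hypothetical helper, and the 3-second default is an assumption rather than a recommended value. The clock and sleep functions are injectable so the behavior can be checked without actually waiting.

```python
import time
from urllib.parse import urlparse

class DomainThrottle:
    """Enforce a minimum delay between requests to the same host."""

    def __init__(self, min_delay=3.0, clock=time.monotonic, sleep=time.sleep):
        self.min_delay = min_delay
        self._clock = clock
        self._sleep = sleep
        self._last_request = {}  # host -> time of the most recent request

    def wait(self, url):
        """Sleep if needed before requesting url; return the seconds slept."""
        host = urlparse(url).netloc
        now = self._clock()
        last = self._last_request.get(host)
        delay = 0.0
        if last is not None:
            delay = max(0.0, self.min_delay - (now - last))
        if delay > 0:
            self._sleep(delay)
        # Record when this request is actually allowed to go out
        self._last_request[host] = now + delay
        return delay

# Usage sketch: call wait() right before each fetch.
# throttle = DomainThrottle(min_delay=3.0)
# throttle.wait(article_url)
# body = extract_article_content(article_url)
```

Because the delay is tracked per host, requests to different publishers are not slowed down by each other, which fits the advice to interleave sources rather than draining one feed at a time.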
Feed freshness and coverage gaps. RSS feeds don't always include every article a publisher produces — some publishers omit certain content types, only expose the last ten or twenty articles in the feed regardless of publication frequency, or update the feed with delays relative to actual publication time. For comprehensive coverage of a publisher's output, supplementing RSS discovery with periodic scraping of the site's section index pages catches articles that don't appear in the feed. This adds complexity to the pipeline but is sometimes necessary for research or monitoring use cases that require complete coverage.
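As a rough sketch of index-page discovery, the code below collects links from a section page's HTML, resolves them against the page URL, and keeps only same-host article links that are not already known. The function name discover_index_urls, the /news/ path prefix, and the publisher.example URLs are assumptions; real sites need a per-publisher rule for what counts as an article URL.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse

class LinkCollector(HTMLParser):
    """Collect the href attribute of every <a> tag."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.hrefs.append(href)

def discover_index_urls(index_html, index_url, known_urls, path_prefix="/news/"):
    """Return new same-host article URLs found on an index page, in page order."""
    collector = LinkCollector()
    collector.feed(index_html)
    host = urlparse(index_url).netloc
    found = []
    for href in collector.hrefs:
        absolute = urljoin(index_url, href).split("#")[0]  # drop fragments
        parsed = urlparse(absolute)
        if parsed.netloc != host:
            continue  # external link (ads, partners)
        if not parsed.path.startswith(path_prefix):
            continue  # navigation or non-article page
        if absolute in known_urls or absolute in found:
            continue  # already stored or already queued
        found.append(absolute)
    return found

index_html = (
    '<nav><a href="/about">About</a></nav>'
    '<a href="/news/story-1">One</a>'
    '<a href="/news/story-2#comments">Two</a>'
    '<a href="https://ads.example.net/news/x">Ad</a>'
    '<a href="/news/story-1">One again</a>'
)
new_urls = discover_index_urls(
    index_html,
    "https://publisher.example/news/",
    known_urls={"https://publisher.example/news/story-2"},
)
```

The returned URLs can be fed into the same extraction and storage steps as feed-discovered articles.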
Author and metadata field inconsistency. published_date and author fields are populated from RSS feed metadata or page-level extraction, and they're inconsistent across publishers. Some feeds include full author names; others include author codes or omit the field entirely. Publication dates appear in many different timestamp formats. Build normalization and fallback logic for these fields rather than treating them as reliably structured — and design your database schema and downstream queries to handle null or malformed values gracefully.
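For publication dates, one possible normalization is to try the RFC 822 style common in RSS first, then ISO 8601 as used in Atom, and return None for anything else. The function name normalize_published_date is illustrative, and treating timezone-less values as UTC is an assumption you may want to change per source.

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime

def normalize_published_date(raw):
    """Return an ISO 8601 UTC string, or None if the value cannot be parsed."""
    if not raw or not raw.strip():
        return None
    raw = raw.strip()
    parsed = None
    try:
        # RFC 822 / 2822 style, e.g. "Wed, 01 May 2024 08:30:00 -0400"
        parsed = parsedate_to_datetime(raw)
    except (TypeError, ValueError):
        pass
    if parsed is None:
        try:
            # ISO 8601 style, e.g. "2024-05-01T12:30:00Z"
            parsed = datetime.fromisoformat(raw.replace("Z", "+00:00"))
        except ValueError:
            return None
    if parsed.tzinfo is None:
        # Assumption: values without a timezone are treated as UTC
        parsed = parsed.replace(tzinfo=timezone.utc)
    return parsed.astimezone(timezone.utc).isoformat()

rss_style = normalize_published_date("Wed, 01 May 2024 08:30:00 -0400")
atom_style = normalize_published_date("2024-05-01T12:30:00Z")
unparseable = normalize_published_date("yesterday")
```

Storing the normalized value alongside the raw string keeps the original available when a publisher's format turns out to need special handling.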
Conclusion
Automated news scraping doesn't have to be complex — the RSS-based discovery approach keeps the pipeline efficient and publisher-friendly, trafilatura handles the messy work of separating article content from page chrome, and SQLite with APScheduler gets you to a running, scheduled pipeline in a single Python file. The foundation laid here scales naturally: swap SQLite for PostgreSQL when you need multi-user access or larger volumes, add Playwright for JavaScript-rendered sources, add a proxy layer when rate limiting becomes a concern.
The real value of this pipeline is in what it enables downstream: sentiment analysis across a monitored topic over time, competitive intelligence on how different publications cover the same event, research datasets built from primary sources, or real-time alerting when specific keywords appear in the news. Getting the collection infrastructure right is what makes those downstream applications possible — and it's more achievable than most people expect before they start.
What We Learned
- RSS feeds are the right discovery layer for news scraping: They're designed for content syndication, provide structured metadata including publication timestamps, and give you reliable new-article signals without scraping index pages on every polling cycle.
- Article extraction libraries outperform manual parsing for news content: trafilatura and similar tools use content density heuristics to isolate article body text from surrounding page chrome, so writing custom BeautifulSoup selectors for each publisher is more work with worse generalization.
- URL-based deduplication at the database layer is the cleanest approach: A UNIQUE constraint on the URL column with INSERT OR IGNORE handles deduplication automatically without requiring the pipeline to maintain a separate seen-URL state between runs.
- Scheduling belongs in the pipeline, not in a separate system for simple setups: APScheduler handles interval-based execution within the Python process, which is adequate for moderate pipelines without the overhead of a cron job or a separate workflow orchestration tool.
- Paywalled and JavaScript-rendered content are the two most common hard limits: Paywalls require legitimate subscription access; JS-rendered pages require a browser-based fetch. Both are solvable with the right approach but require explicit handling rather than falling through gracefully on their own.
- Rate limiting is a real-world constraint, not a theoretical concern: News publishers have infrastructure monitoring for excessive automated access — per-domain request delays and respectful polling intervals keep the pipeline running without triggering blocks.
FAQ
- Is it legal to scrape news articles?
  Scraping publicly accessible news content is generally legal in most jurisdictions for personal research, academic, and non-commercial purposes, following the principles established in cases like hiQ Labs v. LinkedIn in the US. The legal picture gets more complex for commercial use, content republication, or scraping behind paywalls or authentication. Always review each publisher's terms of service before scraping (many news sites explicitly address automated access), and consult legal counsel for commercial applications.
- What is the best Python library for extracting news article content?
  trafilatura is currently one of the most reliable open-source Python libraries for news article extraction, with consistent performance across a wide range of publisher page structures. It handles boilerplate removal, encoding normalization, and main content identification automatically. The newspaper3k library is an alternative with additional features like keyword extraction and NLP summaries, though it has not had a release in several years. For most article scraping pipelines, trafilatura is the lower-maintenance starting point.
- How do I handle news sites that don't have RSS feeds?
  Some publishers don't expose RSS feeds, or expose incomplete feeds that omit certain content types. For these sources, you need to scrape the publisher's section index or homepage on each polling cycle, extract article URLs from the page, and filter for new URLs not already in your database. This is more brittle than RSS-based discovery because it depends on the page's HTML structure, but it's the fallback when feeds aren't available. Some publishers also expose sitemaps (/sitemap.xml, /news-sitemap.xml) that can serve as a discovery source; these are often more complete than RSS feeds.
- How often should I poll news feeds?
  Polling frequency depends on how fresh you need your data and how many concurrent sources you're monitoring. For general news monitoring, every 15–30 minutes provides reasonable freshness without excessive server load. For breaking news monitoring where latency matters, every 5 minutes is reasonable for a small number of high-priority sources. Most RSS feeds don't update more frequently than every few minutes, so polling more aggressively than that provides diminishing freshness returns while increasing the risk of rate limiting.
- Can I scrape news articles directly from Google News?
  Google News doesn't offer an official API for direct content access, and its terms of service prohibit automated scraping of the service. For monitoring news by topic or keyword, individual publisher RSS feeds or the Google News RSS search feeds at news.google.com/rss/search?q=keyword (which Google provides for personal use) are a more appropriate approach. For production news monitoring at scale, dedicated news data APIs from providers like NewsAPI or GDELT provide licensed, structured access to news content without the legal and technical friction of scraping news aggregators directly.
- How do I scale this pipeline to hundreds of news sources?
  The single-process APScheduler approach works well up to dozens of sources at moderate polling intervals. For hundreds of sources or high-frequency polling, move to a distributed architecture: a job queue (Celery with Redis or RabbitMQ) to distribute article extraction tasks across multiple workers, a production database (PostgreSQL) for concurrent write access and better indexing, and a monitoring layer (Prometheus, Grafana, or a simple alerting script) to catch extraction failure rates before they silently affect data quality. The pipeline structure stays the same (feed polling, content extraction, database storage); the components just become horizontally scalable.
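For the sitemap fallback mentioned in the FAQ, a minimal sketch with the standard library could look like the following. It assumes a plain urlset document in the standard sitemap namespace; sitemap index files that point to other sitemaps are not handled here. The function name urls_from_sitemap and the publisher.example URLs are illustrative.

```python
import xml.etree.ElementTree as ET

# Namespace used by standard sitemap files
SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

def urls_from_sitemap(sitemap_xml, known_urls=frozenset()):
    """Return the <loc> URLs from a urlset sitemap that are not already known."""
    root = ET.fromstring(sitemap_xml)
    urls = []
    for loc in root.findall("sm:url/sm:loc", SITEMAP_NS):
        url = (loc.text or "").strip()
        if url and url not in known_urls:
            urls.append(url)
    return urls

sample = """
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://publisher.example/news/story-1</loc></url>
  <url><loc>https://publisher.example/news/story-2</loc></url>
</urlset>
"""
sitemap_urls = urls_from_sitemap(
    sample.strip(),
    known_urls={"https://publisher.example/news/story-1"},
)
```

In a real pipeline the XML string would come from fetching the publisher's sitemap URL, and the known URLs from the database.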