Programmatically collect online-published news articles. Open-sourced & free tools only
The ultimate guide to where to find online-published news articles, with pros & cons for each source.
While building NewsCatcher, we discovered many open-source & free tools, services, and libraries that help you find & parse online-published news articles. We even published two Python packages that help you work with news (newscatcher and pygooglenews, which together have accumulated over 3k stars on GitHub).
Developers, data scientists, and NLP enthusiasts love working with news data. While there are a few paid options (including what NewsCatcher does), I believe that the vast majority of non-commercial use cases can be satisfied with open-source & free options.
Who is this list for?
- Students/portfolio builders
Are you searching for a data science/data engineering job? You’d need to prove you can deliver some results.
Here’s one example of a data engineering project from Damian Kliś that I’m quite fascinated by.
Off-topic comment: one thing that makes Damian’s GitHub repository stand out is its clear and concise README. If you add a repository to your CV, you’d better explain well what it is about. A repo with no README might harm you more than help you find a job: it shows you don’t care about documenting & explaining your work. It’s a huge red flag for people who’d consider hiring you.
- Side projects
You might find inspiration for your next side project here, such as building another news aggregator.
- Indie hackers
Don’t have money to pay news data providers? Try building your own tool.
Where to find online-published news articles
1) GDELT 2.0 Global Knowledge Graph
GDELT analyses news articles published online. They do NLP to understand what this planet is talking about. However, in this post we will use one of the datasets to grab the links to newly published news articles.
Pros:
- ~400,000 news articles/day
- updates every 15 minutes
- worldwide multi-language coverage
- 5 years of history
Cons:
- just URL to the article: you have to scrape & extract it yourself
- not consistent in terms of delay
You might think that a list of URLs isn't much, but I bet you're wrong. It's half the job done: you no longer have to discover the articles, only fetch and parse them.
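Here is a minimal sketch of how you might grab article links from the latest GKG file. It rests on a few assumptions that this post does not spell out, so check the GDELT documentation before relying on them: that an index file lives at `lastupdate.txt` under `data.gdeltproject.org/gdeltv2/`, that the GKG archive name ends in `.gkg.csv.zip`, and that the article URL is the fifth tab-separated column (DocumentIdentifier). The function names are made up for this example.

```python
import io
import urllib.request
import zipfile

# Assumed location of GDELT's "last update" index (verify against the GDELT docs).
LAST_UPDATE_URL = "http://data.gdeltproject.org/gdeltv2/lastupdate.txt"
# Assumed 0-based position of the DocumentIdentifier (article URL) column in a GKG row.
URL_COLUMN = 4


def find_gkg_url(last_update_text):
    """Return the URL of the GKG archive listed in the index text, or None."""
    for line in last_update_text.splitlines():
        parts = line.split()
        if parts and parts[-1].lower().endswith(".gkg.csv.zip"):
            return parts[-1]
    return None


def extract_article_urls(gkg_text):
    """Collect unique http(s) links from tab-separated GKG rows, keeping order."""
    seen = set()
    urls = []
    for line in gkg_text.split("\n"):
        row = line.split("\t")
        if len(row) <= URL_COLUMN:
            continue  # skip empty or truncated rows
        link = row[URL_COLUMN].strip()
        if link.startswith(("http://", "https://")) and link not in seen:
            seen.add(link)
            urls.append(link)
    return urls


def fetch_latest_article_urls():
    """Download the newest GKG archive and return the article URLs it contains."""
    with urllib.request.urlopen(LAST_UPDATE_URL, timeout=30) as resp:
        gkg_url = find_gkg_url(resp.read().decode("utf-8"))
    if gkg_url is None:
        return []
    with urllib.request.urlopen(gkg_url, timeout=120) as resp:
        archive = zipfile.ZipFile(io.BytesIO(resp.read()))
    with archive.open(archive.namelist()[0]) as handle:
        text = handle.read().decode("utf-8", errors="replace")
    return extract_article_urls(text)
```

Calling `fetch_latest_article_urls()` on a schedule (for example every 15 minutes) would give you a rolling list of fresh links; scraping and extracting the articles is still up to you.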
2) News Crawl by Common Crawl
Common Crawl crawls the web and open-sources all of the pages it finds. They are a non-profit, so I highly encourage you to donate to them if you end up using their solution.
In 2016, Common Crawl decided to decouple the news crawl part from their main dataset.
News Crawl uses RSS & news sitemaps to parse the news. This part of the crawl is open-sourced separately. Check the GitHub repo.
Pros:
- ~600,000 news articles/day
- worldwide multi-language coverage
- a few years of history
- full HTML of a page
Cons:
- updates only several times a day, not continuously
- you still need to parse the content from the HTML
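To give a feel for what "parsing the content from the HTML" involves, here is a rough sketch that uses only Python's standard library. It naively treats the `<title>` tag as the headline and every `<p>` tag as body text, which is an assumption that will pull in boilerplate on many real news pages. The class and function names are hypothetical, and this is not how News Crawl or NewsCatcher extract articles.

```python
from html.parser import HTMLParser


class ArticleTextParser(HTMLParser):
    """Collects the <title> text and the text of every <p> element."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.title = ""
        self.paragraphs = []
        self._in_title = False
        self._in_paragraph = False
        self._buffer = []

    def flush_paragraph(self):
        # Join the buffered chunks and collapse runs of whitespace.
        text = " ".join("".join(self._buffer).split())
        if text:
            self.paragraphs.append(text)
        self._buffer = []

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True
        elif tag == "p":
            self.flush_paragraph()  # an unclosed <p> ends when the next one starts
            self._in_paragraph = True

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False
        elif tag == "p":
            self.flush_paragraph()
            self._in_paragraph = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data
        elif self._in_paragraph:
            self._buffer.append(data)


def extract_article(html):
    """Return a dict with a naive 'title' and 'text' pulled from raw HTML."""
    parser = ArticleTextParser()
    parser.feed(html)
    parser.close()
    parser.flush_paragraph()  # keep text from a trailing unclosed <p>
    return {"title": parser.title.strip(), "text": "\n".join(parser.paragraphs)}
```

For anything beyond a quick experiment you would probably want a dedicated article-extraction library, since navigation menus, cookie banners, and comment sections also tend to live in `<p>` tags.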
3) RSS feeds
RSS feeds still exist. Our beta version used to rely solely on RSS feeds. You can read a full article here:

Pros:
- partially structured & contains some data points (title, published date)
Cons:
- you have to find the RSS feed, which is not a trivial task when you need it at scale
- an RSS feed can disappear at any time
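As an illustration of the "partially structured" part, the sketch below reads the title, link, and published date out of an RSS 2.0 document with Python's standard library. It assumes a plain RSS 2.0 feed with `<item>` elements (Atom feeds use different tags and are not handled), and the function name is made up for this example.

```python
import xml.etree.ElementTree as ET
from email.utils import parsedate_to_datetime


def parse_rss_items(rss_xml):
    """Return a list of dicts (title, link, published) for each <item> in the feed."""
    root = ET.fromstring(rss_xml)
    items = []
    for item in root.iter("item"):
        raw_date = item.findtext("pubDate")
        published = None
        if raw_date:
            try:
                # RSS 2.0 dates use the RFC 822 format, e.g. "Mon, 01 Mar 2021 10:00:00 GMT".
                published = parsedate_to_datetime(raw_date.strip())
            except (TypeError, ValueError):
                published = None  # keep the item even if the date is malformed
        items.append({
            "title": (item.findtext("title") or "").strip(),
            "link": (item.findtext("link") or "").strip(),
            "published": published,
        })
    return items
```

You would fetch the feed body yourself (for example with `urllib.request`) and pass the text to `parse_rss_items`. Note that the article body is usually not in the feed, so you still end up fetching the linked page.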
4) Google News (RSS)
Google News is the biggest UI-first news aggregator.
Google News has an RSS feed for any UI page. This RSS is lightweight, and you will not get blocked for accessing it many times a day.
I created a Python library that helps you parse any Google News RSS page. Even if you are not a Python person, you can use this repository as unofficial Google News RSS documentation (there is no official one).
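For a rough idea of what such an RSS page looks like, here is a small sketch that builds a Google News RSS search URL. The base URL and the `q`, `hl`, `gl`, and `ceid` parameters are assumptions based on how the feed has been observed to behave; since there is no official documentation, they may change without notice. The function name is hypothetical and is not part of pygooglenews.

```python
from urllib.parse import urlencode

# Base URL observed for Google News RSS; undocumented, so treat it as an assumption.
GOOGLE_NEWS_RSS = "https://news.google.com/rss"


def google_news_search_url(query, lang="en", country="US"):
    """Build an RSS URL for a Google News keyword search."""
    params = {
        "q": query,                   # search keywords
        "hl": f"{lang}-{country}",    # interface language
        "gl": country,                # country edition
        "ceid": f"{country}:{lang}",  # country:language pair
    }
    return f"{GOOGLE_NEWS_RSS}/search?{urlencode(params)}"


print(google_news_search_url("tesla earnings"))
```

The resulting URL returns ordinary RSS, so any RSS parser can read the items from it.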
5) NewsCatcher's Free News API
This April, we will launch an absolutely Free News API:

Preliminary list of features/limitations:
- articles from 500 worldwide news sources
- 50 calls per hour
- 50 news articles/call
- filter by keyword, country, source, published time
- 1 month of historical data
Pros:
- searchable API on top of an indexed database
- news articles from the top 500 news providers
- already structured
- ready to integrate
Cons:
- not the full article body text (500-character limit)
- vendor lock-in (even though it's free)
- 1 month of historical data
Next week, we will write about how to programmatically parse news articles. Follow us on LinkedIn so you don't miss our guides, product updates, and latest news.