Programmatically collect news articles published online. Open-source & free tools only

A guide to where to find news articles published online, with pros & cons for each source.

While building NewsCatcher, we discovered many open-source & free tools, services, and libraries that help you find & parse news articles published online. We even published two Python packages for working with news (newscatcher and pygooglenews, which together have accumulated over 3k stars on GitHub).

Developers, data scientists, and NLP enthusiasts love working with news data. While there are a few paid options (including NewsCatcher's own), I believe the vast majority of non-commercial use cases can be covered by open-source & free options.


Who is this list for?

  1. Students/portfolio builders
    Are you searching for a data science/data engineering job? You’ll need to prove you can deliver results.
    Here’s one example of a data engineering project from Damian Kliś that I find quite fascinating.
    Off-topic comment: one thing that makes Damian’s GitHub repository stand out is its clear and concise README. If you add a repository to your CV, explain well what it is about. A repo with no README might harm you more than it helps you find a job: it shows you don’t care about documenting & explaining your work. That’s a huge red flag for people who’d consider hiring you.
  2. Side-project
    You might find inspiration for your next side project here, such as building another news aggregator.
  3. Indie hackers
    Don’t have money to pay news data providers? Try building your own tool.

Where to find online-published news articles

1) GDELT 2.0 Global Knowledge Graph

GDELT analyses news articles published online. They do NLP to understand what this planet is talking about. However, in this post we will use one of the datasets to grab the links to newly published news articles.

Pros:

  • ~400,000 news articles/day
  • updates every 15 minutes
  • worldwide multi-language coverage
  • 5 years of history

Cons:

  • only the URL of each article: you have to scrape & extract the content yourself
  • the delay before an article shows up is not consistent

You might think that a list of URLs isn't much, but I bet you're wrong. It's half the job done.
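To make "grab the links" concrete, here is a minimal sketch of how the newest batch of article URLs could be pulled out of a GKG file. It rests on assumptions that are not stated in this post and should be checked against the GDELT documentation: that an index file at `LAST_UPDATE_URL` lists the latest files as `<size> <hash> <url>` lines, that the GKG archive name ends in `.gkg.csv.zip`, and that the article URL sits in the fifth tab-separated column. The function names are made up for illustration.

```python
import io
import urllib.request
import zipfile

# Assumption: GDELT 2.0 publishes a small index file listing the newest
# 15-minute files, one per line, as "<size> <hash> <url>".
LAST_UPDATE_URL = "http://data.gdeltproject.org/gdeltv2/lastupdate.txt"

# Assumption: in the tab-delimited GKG file the article URL is the fifth
# column. Verify this against the GKG codebook before relying on it.
URL_COLUMN = 4


def find_gkg_file(last_update_text):
    """Return the URL of the GKG archive named in the index text, or None."""
    for line in last_update_text.splitlines():
        parts = line.split()
        if parts and parts[-1].endswith(".gkg.csv.zip"):
            return parts[-1]
    return None


def extract_article_urls(gkg_text):
    """Collect unique http(s) links from the URL column, keeping their order."""
    seen = {}
    for row in gkg_text.split("\n"):
        fields = row.split("\t")
        if len(fields) > URL_COLUMN and fields[URL_COLUMN].startswith("http"):
            seen.setdefault(fields[URL_COLUMN], None)
    return list(seen)


def fetch_latest_article_urls():
    """Download the newest GKG archive and return its article URLs (needs network)."""
    with urllib.request.urlopen(LAST_UPDATE_URL, timeout=30) as resp:
        index_text = resp.read().decode("utf-8", errors="replace")
    gkg_url = find_gkg_file(index_text)
    if gkg_url is None:
        return []
    with urllib.request.urlopen(gkg_url, timeout=120) as resp:
        archive = zipfile.ZipFile(io.BytesIO(resp.read()))
    raw = archive.read(archive.namelist()[0])
    return extract_article_urls(raw.decode("utf-8", errors="replace"))
```

Calling `fetch_latest_article_urls()` on a schedule that matches the 15-minute update cycle would give you a rolling list of links to scrape; the two pure helper functions can be tested offline on saved files.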

2) News Crawl by Common Crawl

Common Crawl crawls the web and openly publishes all the pages it finds. It is a non-profit, so I highly encourage you to donate if you end up using its data.

In 2016, Common Crawl decided to decouple the news crawl part from their main dataset.

News Crawl uses RSS feeds & news sitemaps to discover news articles. The crawler for this part is open-sourced separately. Check the GitHub repo.

Documentation

Pros:

  • ~600,000 news articles/day
  • worldwide multi-language coverage
  • a few years of history
  • full HTML of a page

Cons:

  • updates only several times a day (less often than GDELT's 15 minutes)
  • you still need to parse the content from the HTML
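To give a feel for that last point, below is a deliberately naive sketch of pulling a title and body text out of raw HTML using only the Python standard library. The class and function names are hypothetical, and the "everything inside `<p>` tags is the article" rule is an assumption that breaks on many real news pages; treat it as a starting point, not as a production extractor.

```python
from html.parser import HTMLParser


class ArticleTextParser(HTMLParser):
    """Collect the <title> and the text inside <p> tags of one HTML page."""

    def __init__(self):
        super().__init__()
        self.title = ""
        self.paragraphs = []
        self._in_title = False
        self._in_paragraph = False
        self._buffer = []

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True
        elif tag == "p":
            self.flush_paragraph()  # tolerate a missing </p>
            self._in_paragraph = True

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False
        elif tag == "p":
            self.flush_paragraph()
            self._in_paragraph = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data
        elif self._in_paragraph:
            self._buffer.append(data)

    def flush_paragraph(self):
        # Join the collected chunks and collapse runs of whitespace.
        text = " ".join("".join(self._buffer).split())
        if text:
            self.paragraphs.append(text)
        self._buffer = []


def extract_article(html):
    """Return a dict with the page title and the paragraph text."""
    parser = ArticleTextParser()
    parser.feed(html)
    parser.close()
    parser.flush_paragraph()  # keep a trailing paragraph that was never closed
    return {"title": parser.title.strip(), "text": "\n".join(parser.paragraphs)}
```

Navigation menus, cookie banners, and "related stories" blocks often live in `<p>` tags too, which is exactly why content extraction is listed as a con here.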

3) RSS feeds

RSS feeds still exist. Our beta version used to rely solely on RSS feeds. You can read a full article here:

How we built News API beta in 60 days
We are a team of 2 data engineers. From February to March 2020 we dedicated most of our spare time building an API that allows you to search for the news articles’ data.

Pros:

  • partially structured & contains some data points (title, published date)

Cons:

  • you have to find the RSS feed – it's not a trivial task when you need it at scale
  • an RSS feed can disappear at any time
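As an illustration of the "partially structured" point, here is a minimal sketch that reads the title, link, and published date from an RSS 2.0 document with the Python standard library. It assumes the feed uses the plain RSS 2.0 `<item>` layout; Atom feeds and namespaced extensions are not handled, and `parse_rss_items` is a name chosen for this example.

```python
import xml.etree.ElementTree as ET
from email.utils import parsedate_to_datetime


def parse_rss_items(rss_xml):
    """Return title, link and published date for each <item> in an RSS 2.0 feed."""
    root = ET.fromstring(rss_xml)
    items = []
    for item in root.iter("item"):
        raw_date = item.findtext("pubDate")
        try:
            published = parsedate_to_datetime(raw_date) if raw_date else None
        except (TypeError, ValueError):
            published = None  # malformed dates are common in real feeds
        items.append({
            "title": (item.findtext("title") or "").strip(),
            "link": (item.findtext("link") or "").strip(),
            "published": published,
        })
    return items
```

The feed gives you the metadata for free; the article body still has to be fetched from each `link` and extracted separately.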

4) Google News (RSS)

Google News is the biggest UI-first news aggregator.

Google News has an RSS feed for every UI page. The feed is lightweight, and you will not get blocked for accessing it many times a day.

I created a Python library that helps you parse any Google News RSS page. Even if you are not a Python person, you can use this repository as unofficial Google News RSS documentation (there is no official one).

kotartemiy/pygooglenews
If Google News had a Python library.
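If you would rather build the feed URLs yourself, the sketch below shows one way to do it with the standard library only. Since there is no official documentation, the URL patterns and the `hl`, `gl`, and `ceid` parameters are assumptions based on how the Google News site behaves, and they may change without notice. `google_news_rss_url` is a hypothetical helper, not part of pygooglenews.

```python
from urllib.parse import urlencode

# Assumption: these URL patterns come from observing Google News, not from
# official documentation, and they may change without notice.
BASE_URL = "https://news.google.com/rss"


def google_news_rss_url(query=None, lang="en", country="US"):
    """Build the RSS URL for top stories, or for a search when query is given."""
    params = {
        "hl": f"{lang}-{country}",   # interface language
        "gl": country,               # country edition
        "ceid": f"{country}:{lang}",
    }
    if query:
        return f"{BASE_URL}/search?" + urlencode({"q": query, **params})
    return f"{BASE_URL}?" + urlencode(params)
```

For example, `google_news_rss_url("data science")` returns a search feed URL whose response can be read like any other RSS feed.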

5) NewsCatcher's Free News API

This April, we will launch an absolutely Free News API:

Free News API
Free News API for developers, indie hackers, students

Preliminary list of features/limitations:

  • articles from 500 worldwide news sources
  • 50 calls per hour
  • 50 news articles/call
  • filter by keyword, country, source, published time
  • 1 month of historical data
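To see what these preliminary limits mean in practice, here is a small back-of-the-envelope sketch. It only does arithmetic on the two numbers quoted in the list (50 calls per hour, 50 articles per call); the function names are made up, and the final limits may differ at launch.

```python
import math

# Preliminary limits quoted in the list above; they may change at launch.
CALLS_PER_HOUR = 50
ARTICLES_PER_CALL = 50


def calls_needed(total_articles):
    """Number of API calls required to page through total_articles."""
    return math.ceil(total_articles / ARTICLES_PER_CALL)


def min_seconds_between_calls():
    """Spacing that spreads the hourly quota evenly over the hour."""
    return 3600 / CALLS_PER_HOUR


def hours_needed(total_articles):
    """Number of hourly quota windows needed to fetch total_articles."""
    return math.ceil(calls_needed(total_articles) / CALLS_PER_HOUR)
```

Under these numbers, one hour of quota covers at most 50 × 50 = 2,500 articles, and spacing calls about 72 seconds apart keeps a long-running collector within the limit.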

Pros:

  • searchable API on top of an indexed database
  • news articles from the top 500 news providers
  • already structured
  • ready to integrate

Cons:

  • not the full article body (limited to 500 characters)
  • vendor lock-in (even though it's free)
  • 1 month of historical data

Next week, we will write about how to programmatically parse news articles. Follow us on LinkedIn to not miss our guides, product updates, and latest news.