Articles on The Verge sometimes contain a section for related articles.
This section can be distracting in reader mode. Therefore, filter the
related article section using the scraper rules.
Feed entries are usually ordered from most to least recent.
Processing older entries first ensures that their creation timestamp
is lower than that of newer entries.
This is useful when we order by creation, because then we get a
consistent timeline.
* In some cases, the scraper only gets a landing page. The user can now use scraper rules to extract the link from the landing page and follow it.
* This also fixes the wrong scraper rule being applied when the server redirects the request to another host.
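
The redirect fix in the second point can be sketched like this, assuming rules are looked up by host name. The rule table, the selectors, and the names `ruleForURL` and `pickRule` are hypothetical and chosen for illustration only.

```go
package main

import (
	"net/url"
	"strings"
)

// rulesByDomain is a hypothetical rule table; the selectors are placeholders.
var rulesByDomain = map[string]string{
	"example.org": "article.main",
	"example.net": "div.post",
}

// ruleForURL returns the rule registered for the host of rawURL,
// or an empty string when the URL is invalid or no rule is known.
func ruleForURL(rawURL string) string {
	u, err := url.Parse(rawURL)
	if err != nil {
		return ""
	}
	host := strings.TrimPrefix(u.Hostname(), "www.")
	return rulesByDomain[host]
}

// pickRule prefers the final URL (after redirects) over the requested URL,
// so a redirect to another host selects that host's rule.
func pickRule(requestedURL, finalURL string) string {
	if finalURL != "" {
		return ruleForURL(finalURL)
	}
	return ruleForURL(requestedURL)
}
```

The key design choice is that the lookup uses the URL the response actually came from, not the URL that was first requested.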
Trim space around CDATA elements before extracting the CharData.
This problem was discovered when reading https://www.sethvargo.com/feed.xml.
The Title and Summary fields have newlines and spaces between the <title>
element and the CDATA element, e.g.
<title>
<![CDATA[Entry title here]]>
</title>
This meant the title of the feed was coming into MiniFlux as,
<![CDATA[Entry title here]]>
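
A minimal sketch of the trimming step is shown below. It assumes the raw inner text of the element is available as a string. The helper name `stripCDATA` is hypothetical, and the real parser may handle this differently.

```go
package main

import "strings"

// stripCDATA is a hypothetical helper. It trims surrounding whitespace first,
// then unwraps the CDATA section if the trimmed text is exactly one section.
func stripCDATA(raw string) string {
	const (
		prefix = "<![CDATA["
		suffix = "]]>"
	)
	s := strings.TrimSpace(raw)
	if strings.HasPrefix(s, prefix) && strings.HasSuffix(s, suffix) {
		return s[len(prefix) : len(s)-len(suffix)]
	}
	return s
}
```

Without the `strings.TrimSpace` call, the leading newline would prevent the prefix check from matching, and the CDATA markers would be kept in the title.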
Some articles (especially the recent year-in-review ones) include a YouTube
video. The server-side rendered articles do not include the YouTube iframe,
but they do have a script that looks like
<script type="text/javascript" data-reactid="6">
window.__APOLLO_STATE__ = {
...
youtube_id: "9uASADiYe_8",
We add a reformatting function that tries to detect obvious JavaScript code
with a field or variable called youtube_id that has an 11-character
double-quoted value, and adds the referenced YouTube videos at the beginning of
the article. This is slightly more general than needed for Quanta, in the hope
that it could be useful for similar sites.