228 commits

Author SHA1 Message Date
nemunaire
5a07fd8932 Add new rewrite rule to decode base64 content 2022-05-25 20:44:04 -07:00
lf94
fa8431c5c6 Try to use outermost element text when title is empty 2022-04-13 21:51:54 -07:00
Frédéric Guillot
f6825c1c60 Fix invalid parsing of data URL
Fetching icons crashes with "slice bounds out of range" error if no encoding is specified.
2022-03-25 22:30:20 -07:00
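The crash described above is the kind of bug that comes from slicing a data URL at a fixed marker such as ";base64" without checking that the marker exists. The following is a minimal sketch of a defensive parser, not the code from this commit; `parseDataURL` is a hypothetical name, and the percent-decoding fallback for payloads without an encoding is an assumption based on the data URL scheme.

```go
package main

import (
	"encoding/base64"
	"errors"
	"fmt"
	"net/url"
	"strings"
)

// parseDataURL is a hypothetical helper (not Miniflux's actual function).
// It splits "data:[<mediatype>][;base64],<data>" and never assumes that an
// encoding marker is present.
func parseDataURL(raw string) (mediaType string, data []byte, err error) {
	if !strings.HasPrefix(raw, "data:") {
		return "", nil, errors.New("not a data URL")
	}
	rest := strings.TrimPrefix(raw, "data:")

	// Look for the separator first; slicing on a missing index is what
	// leads to "slice bounds out of range" panics.
	comma := strings.IndexByte(rest, ',')
	if comma < 0 {
		return "", nil, errors.New("data URL has no comma separator")
	}
	header, payload := rest[:comma], rest[comma+1:]

	if strings.HasSuffix(header, ";base64") {
		data, err = base64.StdEncoding.DecodeString(payload)
		return strings.TrimSuffix(header, ";base64"), data, err
	}

	// No encoding specified: treat the payload as percent-encoded text.
	decoded, err := url.PathUnescape(payload)
	if err != nil {
		return "", nil, err
	}
	return header, []byte(decoded), nil
}

func main() {
	mediaType, data, err := parseDataURL("data:image/svg+xml,%3Csvg%3E%3C/svg%3E")
	fmt.Println(mediaType, string(data), err)
}
```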
Frédéric Guillot
1eb01b39e7 Use truncated entry description as title if unavailable 2022-03-04 17:10:32 -08:00
Frédéric Guillot
c9e0f0b3e4 Do not fallback to InnerXML if XHTML title is empty 2022-03-04 14:28:56 -08:00
Romain de Laage
808635e314 Add a rewrite rule for castopod episodes 2022-01-30 16:33:17 -08:00
Adrian Smith
cc3e65dd3c Handle atom feed with space around CDATA
Trim space around CDATA elements before extracting the CharData.

This problem was discovered when reading https://www.sethvargo.com/feed.xml.
Title and Summary fields have newlines and space between the <title>
element and the CDATA element. e.g.

  <title>
    <![CDATA[Entry title here]]>
  </title>

This meant the title of the feed was coming into Miniflux as:
  <![CDATA[Entry title here]]>
2022-01-17 15:25:22 -08:00
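The trimming step described in this commit can be sketched as follows. This is an illustration under stated assumptions, not the commit's code: `stripCDATA` is a hypothetical helper that operates on the raw inner XML of a title element.

```go
package main

import (
	"fmt"
	"strings"
)

const (
	cdataOpen  = "<![CDATA["
	cdataClose = "]]>"
)

// stripCDATA is a hypothetical helper: it trims the whitespace around a
// CDATA section first, then removes the CDATA wrapper if one is present.
func stripCDATA(inner string) string {
	trimmed := strings.TrimSpace(inner)
	if strings.HasPrefix(trimmed, cdataOpen) && strings.HasSuffix(trimmed, cdataClose) {
		return trimmed[len(cdataOpen) : len(trimmed)-len(cdataClose)]
	}
	return trimmed
}

func main() {
	inner := "\n    <![CDATA[Entry title here]]>\n  "
	fmt.Printf("%q\n", stripCDATA(inner))
}
```

Without the `TrimSpace` call the prefix check fails on the leading newline, which matches the symptom described in the message: the wrapper is left in the title verbatim.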
Frédéric Guillot
f18ded6117 Add support for multiple authors in Atom feeds 2022-01-14 20:20:55 -08:00
Frédéric Guillot
2309b27458 Use custom feed user agent to fetch website icon 2022-01-08 15:20:18 -08:00
Romain de Laage
8329e9b46c Make Invidious instance configurable 2022-01-05 20:43:03 -08:00
Jouni K. Seppänen
bb0d2bf675 Add Youtube videos in Quanta articles
Some articles (especially the recent year-in-review ones) include a Youtube
video. The server-side rendered articles do not include the Youtube iframe,
but they do have a script that looks like

    <script type="text/javascript" data-reactid="6">
      window.__APOLLO_STATE__ = {
        ...
          youtube_id: "9uASADiYe_8",

We add a reformatting function that tries to detect obvious JavaScript code
that has a field or variable called youtube_id that has an 11-character
double-quoted value, and adds the referenced Youtube videos in the beginning of
the article. This is slightly more general than needed for Quanta, in the hope
that it could be useful for similar sites.
2022-01-03 10:10:13 -08:00
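A rough sketch of the detection described in this commit is shown below. The regular expression and the `findYoutubeIDs` name are assumptions made for illustration; the pattern actually used by the commit may differ.

```go
package main

import (
	"fmt"
	"regexp"
)

// youtubeIDPattern is an assumed pattern: a field or variable named
// youtube_id, followed by ':' or '=', followed by a double-quoted
// 11-character value.
var youtubeIDPattern = regexp.MustCompile(`youtube_id\s*[:=]\s*"([a-zA-Z0-9_-]{11})"`)

// findYoutubeIDs (hypothetical name) returns every video ID found in the
// given page source, in order of appearance.
func findYoutubeIDs(source string) []string {
	var ids []string
	for _, match := range youtubeIDPattern.FindAllStringSubmatch(source, -1) {
		ids = append(ids, match[1])
	}
	return ids
}

func main() {
	page := `<script>window.__APOLLO_STATE__ = { youtube_id: "9uASADiYe_8", }</script>`
	for _, id := range findYoutubeIDs(page) {
		// Each ID could then be rendered as an embedded player at the
		// beginning of the article.
		fmt.Println("https://www.youtube.com/embed/" + id)
	}
}
```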
Jouni K. Seppänen
dcf87bd642 Add scrape and rewrite rules for quantamagazine
This is a somewhat complex React site so the rules could be a little fragile.
Text content seems to be always inside .outer--content, and most h6 elements
are fluff like "read later" or pointers to other articles. However, h6.byline
and h6.post__title__kicker are relevant to the current article.

Figure captions are sometimes inside both figure and div.outer--content
elements, sometimes only inside figure, so take both and remove the
intersection.

The figure elements sometimes contain multiple copies of images or
videos, and we just take them all. Math articles seem to use Mathjax,
which we don't add.
2022-01-03 10:10:13 -08:00
Jouni K. Seppänen
2fedd8f234 Add scraper rule for ikiwiki.iki.fi
Feed: https://ikiwiki.iki.fi/feed.php?linkto=current&ns=uutiset%3Ablog&num=5

Example page: https://ikiwiki.iki.fi/uutiset/blog/20210923100421viiveita

(To clarify, I'm not a representative of iki.fi although I have an email address in the domain. This is a nonprofit association that offers email forwarding addresses, and the rss feed in question contains news for their members.)
2021-12-27 20:51:37 -08:00
Thiago Perrotta
28d036434f Add rewrite rule: monkeyuser.com
Comics site, uses alt image text similarly to xkcd.com.
2021-12-16 11:50:26 -08:00
Thiago Perrotta
4b12043cea Sort rewrite rules 2021-12-16 11:50:26 -08:00
Frédéric Guillot
0f6f4c8c60 Add <head> tag to OPML export 2021-12-16 11:49:50 -08:00
Artémis
b585dab6b4 Add data-srcset support to "add_dynamic_image" rewrite rule 2021-10-22 18:12:23 -07:00
Frank Steinborn
2dcabc840c Fix minor typo 2021-10-17 16:58:42 -07:00
Frédéric Guillot
5f9d6fd81b Handle srcset images with no space after comma 2021-10-13 21:31:08 -07:00
三三
34dd358eb0 Add Telegram integration 2021-09-07 20:04:22 -07:00
Lukas Dietrich
93596c1218 Add rewrite rule to remove dom elements 2021-09-06 09:47:05 -07:00
hulb
01f678c3b1 add proxy arg in scraper.Fetch 2021-08-28 21:57:11 -07:00
James Loh
2f6895e118 Fix finding JSON feeds with new MIME type
Version 1.1 of the JSON Feed specification (https://jsonfeed.org/version/1.1) says that feeds should have a MIME type of `application/feed+json`, which Miniflux wasn't searching for.
2021-08-21 13:01:08 -07:00
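As an illustration of the check this commit extends, the sketch below accepts both the newer `application/feed+json` type and plain `application/json`. The `isJSONFeedType` helper and the lookup table are hypothetical, and the way Miniflux's finder actually matches types may differ.

```go
package main

import (
	"fmt"
	"mime"
)

// jsonFeedTypes lists the media types treated as JSON feeds in this sketch.
var jsonFeedTypes = map[string]bool{
	"application/feed+json": true, // MIME type named in JSON Feed 1.1
	"application/json":      true, // generic JSON type
}

// isJSONFeedType (hypothetical name) reports whether a type attribute or
// Content-Type value names a JSON feed, ignoring parameters such as charset.
func isJSONFeedType(contentType string) bool {
	mediaType, _, err := mime.ParseMediaType(contentType)
	if err != nil {
		return false
	}
	return jsonFeedTypes[mediaType]
}

func main() {
	fmt.Println(isJSONFeedType("application/feed+json; charset=utf-8"))
	fmt.Println(isJSONFeedType("text/html"))
}
```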
Frédéric Guillot
b7c229f30f Update scraper rule for theregister.com 2021-08-16 20:04:02 -07:00
Alexandros Kosiaris
b8b16c3bdf Add /rss/ in finder's wellKnownUrls
ATCOM netvolution WCM, a CMS powering several high-profile, high-traffic
Greek news sites among other sites, publishes the RSS feed under /rss/,
and it is probably not the only one. Add it to the list. It's generic
enough to allow us to assume other software might do it too.

On a select set of 627 Greek news media sites (the infamous Petsas list),
adding this rule increased discoverability of RSS feeds by
2.61% (from 498 to 511).
2021-07-22 19:46:40 -07:00
Dave Marquard
fc766de02d use authors entry for json 1.1 feeds 2021-07-21 21:28:37 -07:00
Jan-Lukas Else
20cd023c07 Use runes instead of bytes to truncate JSON feed titles
This fix avoids breaking Unicode strings.

It solves this error:

pq: invalid byte sequence for encoding "UTF8": 0xf0 0x9f 0x9a 0x2e
2021-05-31 11:42:59 -07:00
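The difference between byte and rune truncation can be sketched as follows. `truncateRunes` is a hypothetical helper written for illustration, not the function changed by this commit.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// truncateRunes (hypothetical name) keeps at most max Unicode code points,
// so a multi-byte character is never cut in half.
func truncateRunes(s string, max int) string {
	if max <= 0 {
		return ""
	}
	runes := []rune(s)
	if len(runes) <= max {
		return s
	}
	return string(runes[:max])
}

func main() {
	title := "go🚀." // the rocket emoji occupies four bytes in UTF-8

	byteCut := title[:5] // slices through the middle of the emoji
	runeCut := truncateRunes(title, 3)

	fmt.Println(utf8.ValidString(byteCut)) // byte slicing can produce invalid UTF-8
	fmt.Println(utf8.ValidString(runeCut), runeCut)
}
```

A byte-sliced string like `byteCut` is what a strict UTF-8 database column rejects with an "invalid byte sequence" error.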
Frédéric Guillot
5b8eb4735c Handle RSS feed title with encoded Unicode entities 2021-04-30 22:57:29 -07:00
yue
18e414ec45 Fix typo in reader/json/doc.go 2021-04-02 19:00:06 -07:00
Frédéric Guillot
6e2e2d1665 Setup golangci-lint Github Action 2021-03-22 21:34:48 -07:00
Darius
9242350f0e Add per feed cookies option 2021-03-22 20:27:58 -07:00
Frédéric Guillot
e60e0ba3c4 Add workaround to handle some invalid dates 2021-03-21 10:52:27 -07:00
Frédéric Guillot
5877048749 Improve handling of Atom text content with CDATA 2021-03-20 20:47:35 -07:00
Frédéric Guillot
c8c1f05328 Add better support of Atom text constructs
- Note that Miniflux does not render entry title with HTML tags as of now
- Omit XHTML div element because it should not be part of the content
2021-03-19 22:05:00 -07:00
Frédéric Guillot
96f3e888cf Handle RDF feed with HTML encoded entry title
Example: http://rss.slashdot.org/Slashdot/slashdotMain
2021-03-19 18:49:51 -07:00
Frédéric Guillot
14888f1cb8 Fix incorrect parsing of Atom entry content of type HTML 2021-03-18 21:43:59 -07:00
Gabriel Augendre
1d80c12e18 Prevent Youtube scraping if entry already exists 2021-03-08 20:10:53 -08:00
hykhd
053b1d0f8d Handle RSS feeds with CDATA in author item element 2021-02-28 12:26:52 -08:00
Frédéric Guillot
ec3c604a83 Add option to allow self-signed or invalid certificates 2021-02-21 13:58:52 -08:00
Ilya Mateyko
c3f871b49b Use YouTube video duration as read time
This feature works by scraping YouTube website.

To enable it, set the FETCH_YOUTUBE_WATCH_TIME environment variable to
1.

Resolves #972.
2021-02-21 11:13:52 -08:00
hykhd
3cb04b2c56 Update whitelist to fix bilibili video 2021-02-20 10:29:42 -08:00
Frédéric Guillot
a352aff93b Remove deprecated io/ioutil package
Miniflux now requires at least Go 1.16 and io/ioutil is deprecated.

https://golang.org/doc/go1.16#ioutil
2021-02-16 21:25:21 -08:00
Frédéric Guillot
04f9c456d5 Handle entry title with double encoded entities in Atom feeds 2021-02-14 11:19:21 -08:00
Frédéric Guillot
0413daf76b Remove iframe inner HTML contents
An iframe element never has fallback content, as it will always create a nested
browsing context, regardless of whether the specified initial contents are
successfully used.

https://www.w3.org/TR/2010/WD-html5-20101019/the-iframe-element.html#the-iframe-element
2021-02-13 14:00:21 -08:00
Frédéric Guillot
5043749b9f Add workaround for entry title with double encoded entities
Example: &amp;#39;Text&amp;#39;
2021-02-13 13:33:59 -08:00
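One way to handle the example above is to run the standard entity decoder twice. This is a sketch of the idea only; `unescapeTwice` is a hypothetical name and the commit's actual workaround may be implemented differently.

```go
package main

import (
	"fmt"
	"html"
)

// unescapeTwice (hypothetical name) decodes HTML entities in two passes:
// the first pass turns "&amp;#39;" into "&#39;", the second pass turns
// "&#39;" into an apostrophe.
func unescapeTwice(s string) string {
	return html.UnescapeString(html.UnescapeString(s))
}

func main() {
	fmt.Println(unescapeTwice("&amp;#39;Text&amp;#39;"))
}
```

The trade-off is that a title that legitimately contains the literal text `&amp;` after one decoding pass will be decoded once more than intended.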
Nick Chitwood
793f475edd Update date parser to fix another time zone issue
The Washington Post has its feeds with EST, which is getting parsed by Miniflux as UTC, and showing up as 8 hours off.

See http://feeds.washingtonpost.com/rss/politics for an example.

This fix applies a similar workaround for EST/EDT as was done for PST/PDT.
2021-02-10 22:45:02 -08:00
Frédéric Guillot
864dd9f219 Allow images with data URLs
Only URLs with a mime-type image/* are allowed
2021-02-06 14:46:01 -08:00
Ilya Mateyko
4464802947 Reformat some Go files
When working on #994 I noticed that some Go files are not formatted with
`gofmt`.

This PR fixes this.
2021-01-27 18:13:58 -08:00
Frédéric Guillot
806b9545a9 Refactor feed validator 2021-01-04 14:47:25 -08:00
Frédéric Guillot
4468ef1410 Refactor category validation 2021-01-03 22:50:24 -08:00