Commit graph

103 commits

Author SHA1 Message Date
Frédéric Guillot
56efd2eb3f Add workaround for non GMT dates (RFC822, RFC850, and RFC1123)
RFC822, RFC850, and RFC1123 are supposed to be always in GMT.

This is a workaround for dates defined in the PST timezone.
2018-12-26 20:24:38 -08:00
Frédéric Guillot
012138179c Add function storage.UpdateFeedError() 2018-12-15 13:04:38 -08:00
Tom Matthews
8b40778ee1 Add BBC News scraping rule 2018-12-13 20:25:30 -08:00
Frederic Guillot
61bfb3cfa8 Make password prompt compatible with Windows 2018-12-09 17:44:33 -08:00
Frédéric Guillot
1bc8535dbb Move image proxy filter to template functions 2018-12-02 21:09:53 -08:00
Frédéric Guillot
6f5d93cbbe Update scraper rule for lemonde.fr 2018-12-02 20:53:22 -08:00
Frédéric Guillot
311a133ab8 Refactor manual entry scraper 2018-12-02 20:51:06 -08:00
mapl
e47188eab2 Update scraper rule for heise.de 2018-12-01 11:49:30 -08:00
Frédéric Guillot
487852f07e Replace daemon and scheduler package with service package 2018-11-11 15:32:48 -08:00
Frédéric Guillot
3b6e44c331 Allow the scraper to parse XHTML documents
Only "text/html" was authorized before.
2018-11-03 13:44:13 -07:00
Frédéric Guillot
ae1dc1a91e Handle more encoding conversion edge cases 2018-10-29 23:00:03 -07:00
Frédéric Guillot
7d1b471d88 Add test case to check different feed encoding and HTTP headers 2018-10-29 19:04:36 -07:00
Frédéric Guillot
85d48c8a71 Add entries storage error to feed errors count 2018-10-21 11:44:29 -07:00
Frédéric Guillot
b8f874a37d Simplify feed entries filtering
- Rename processor package to filter
- Remove boilerplate code
2018-10-14 22:33:19 -07:00
Frédéric Guillot
778346b0b0 Simplify feed fetcher
- Add browser package to handle HTTP errors
- Reduce code duplication
2018-10-14 21:43:48 -07:00
Frédéric Guillot
5870f04260 Simplify feed parser and format detection
- Avoid doing multiple buffer copies
- Move parser and format detection logic to its own package
2018-10-14 11:46:41 -07:00
Frédéric Guillot
9606126196 Convert text links and line feeds to HTML in YouTube channels 2018-10-08 20:47:10 -07:00
Frédéric Guillot
9dc38a0803 Add missing package descriptions for GoDoc 2018-10-08 17:32:17 -07:00
Frédéric Guillot
11dfcdd3d6 Fix typo in license header 2018-10-08 15:50:15 -07:00
Frédéric Guillot
b1e8f534ef Simplify locale package usage (refactoring) 2018-09-22 15:04:55 -07:00
Frédéric Guillot
beb7a0cfcb Use unique translation IDs instead of English text as key 2018-09-21 22:23:23 -07:00
Patrick
2538eea177 Add the possibility to override default user agent for each feed 2018-09-19 18:19:24 -07:00
Frédéric Guillot
df2bebaf3d Update scraper rule for heise.de 2018-08-25 10:33:18 -07:00
Frédéric Guillot
dbcc5d8a97 Use canonical imports 2018-08-24 21:56:39 -07:00
neepl
5365f31e90 Add support for published tag in Atom feeds 2018-07-17 21:52:05 -07:00
Frédéric Guillot
a786e78aca Add embedly.com to iframe whitelist 2018-07-10 20:56:54 -07:00
dzaikos
6d25e02cb5 New add_dynamic_image rewriter for JavaScript-loaded images.
Searches tags for various `data-*` attributes and sets `img` tag `src` attribute appropriately. Falls back to searching `noscript` for `img` tags.

Includes unit tests.
2018-07-09 01:22:48 -04:00
dzaikos
e1c56b2e53 Processor: Do rewriter before sanitizer for entry.Content.
Addresses #163.
2018-07-06 00:17:07 -04:00
Frédéric Guillot
de1a4aad30 Add support for protocol relative YouTube URLs 2018-07-04 22:45:44 -07:00
dzaikos
7d4a195519 Sandbox iframes when sanitizing.
Updated iframe unit tests.

Refactored sanitizer.getExtraAttributes() to use `switch` instead of multiple `if` statements.
2018-07-03 12:55:18 -07:00
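
The `switch` refactoring mentioned in 7d4a195519 could look roughly like the following Go sketch. `extraAttributes` is a hypothetical stand-in for `sanitizer.getExtraAttributes()`, and the attribute values (including the `sandbox` flags) are illustrative assumptions, not taken from the commit.

```go
package main

import "fmt"

// extraAttributes is a hypothetical stand-in for sanitizer.getExtraAttributes().
// It returns attributes to append to a tag while sanitizing.
// All attribute values below are assumptions made for illustration.
func extraAttributes(tagName string) []string {
	switch tagName {
	case "a":
		return []string{`rel="noopener noreferrer"`, `target="_blank"`}
	case "video", "audio":
		return []string{"controls"}
	case "iframe":
		// Sandboxing restricts what the embedded document is allowed to do.
		return []string{`sandbox="allow-scripts allow-same-origin"`}
	default:
		// Other tags get no extra attributes.
		return nil
	}
}

func main() {
	for _, tag := range []string{"a", "iframe", "p"} {
		fmt.Printf("%s: %v\n", tag, extraAttributes(tag))
	}
}
```

A single `switch` keeps each tag's extra attributes in one place, so adding a tag means adding one `case` rather than another `if` statement.
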
Frédéric Guillot
9c0f882ba0 Add specific 404 and 401 error messages 2018-06-30 12:42:12 -07:00
dzaikos
45d7105ed1 Refactor AddImageTitle rewriter.
* Only processes images with `src` **and** `title` attributes (others are ignored).
* Processes **all** images in the document (not just the first one).
* Wraps the image and its title attribute in a `figure` tag with the title attribute's contents in a `figcaption` tag.

Updated xkcd rewriter unit test.

Added another xkcd rewriter unit test to check rendering of images without title attributes.
2018-06-26 17:50:18 -04:00
dzaikos
c9131b0e89 Improve sanitizer to remove style tag contents.
See #157.

Refactored how blacklisted tags are handled so they're easier to manage in the future.
2018-06-24 19:53:23 -07:00
Dave Z
d847b10e32 Improve sanitizer to remove script and noscript contents
These tags were removed but the content was rendered as escaped HTML.

See #157
2018-06-23 17:50:43 -07:00
Frédéric Guillot
bddca15b69 Add new fields for feed username/password 2018-06-19 22:58:29 -07:00
Frédéric Guillot
c719cf7df0 Rewrite iframe YouTube URLs to https://www.youtube-nocookie.com 2018-06-12 18:45:09 -07:00
Frédéric Guillot
0c2e5ff0dc Handle feeds with dates formatted as Unix timestamp 2018-05-08 20:41:24 -07:00
Frédéric Guillot
5cacae6cf2 Add API endpoint to import OPML file 2018-04-29 18:56:40 -07:00
Frédéric Guillot
1eba1730d1 Move HTTP client to its own package 2018-04-28 10:51:07 -07:00
aniran
322b265d7a Scrape parent element for iframe
Current behavior: if you have an `iframe` scraper rule, `scrapContent`
tries to return the inner HTML of the `iframe`, which turns up blank.

New behavior: like `img` elements, if an `iframe` is matched by a scraper rule,
the parent element's inner HTML (i.e. the `iframe`) is returned.
2018-04-27 17:57:22 -07:00
aniran
920dda79b7 Add soundcloud and bandcamp iframe sources 2018-04-27 17:55:58 -07:00
Frédéric Guillot
dcbb5047b1 Add support for Dublin Core date in RDF feeds 2018-04-10 18:13:05 -07:00
Frédéric Guillot
02ba735ba9 Handle some non-English date formats 2018-04-09 21:27:15 -07:00
Frédéric Guillot
e2d02bac5a Rename RSS parser getters 2018-04-09 20:38:12 -07:00
Frédéric Guillot
f76093690c Get the right comments URL when having multiple namespaces 2018-04-09 20:30:55 -07:00
Frédéric Guillot
702256bcc0 Add unit test for comments URL and French translation 2018-04-07 13:56:11 -07:00
Ben Brooks
538d08c16c Add CommentsURL to entry 2018-04-07 13:50:45 -07:00
Frédéric Guillot
6ea4da3bce Handle RSS author elements with inner HTML 2018-03-18 11:57:46 -07:00
Frédéric Guillot
482785c5e6 Convert enclosure size field to bigint 2018-03-14 20:09:06 -07:00
Frédéric Guillot
ec08f45bf5 Fix broken OPML import with Go 1.10 2018-03-14 18:50:06 -07:00