Commit graph

21 commits

Author SHA1 Message Date
jebbs
10207967c4 scraper follow the only link
* in some cases the scraper only gets a landing page; the user can use scraper rules to extract the link from the landing page and follow it
* it also fixes the wrong scraper rule being applied when the server redirects to another host
2022-10-31 19:49:34 -07:00
hulb
01f678c3b1 add proxy arg in scraper.Fetch 2021-08-28 21:57:11 -07:00
Darius
9242350f0e Add per feed cookies option 2021-03-22 20:27:58 -07:00
Frédéric Guillot
ec3c604a83 Add option to allow self-signed or invalid certificates 2021-02-21 13:58:52 -08:00
Frédéric Guillot
c394a61a4e Add Prometheus exporter 2020-09-27 20:04:48 -07:00
Frédéric Guillot
16b7b3bc3e http client: remove dependency on global config options 2020-09-27 14:37:46 -07:00
cinput
8e1ed8bef3 Return outer HTML when scraping elements 2019-12-21 21:18:31 -08:00
Frédéric Guillot
311a133ab8 Refactor manual entry scraper 2018-12-02 20:51:06 -08:00
Frédéric Guillot
3b6e44c331 Allow the scraper to parse XHTML documents
Only "text/html" was authorized before.
2018-11-03 13:44:13 -07:00
Frédéric Guillot
5870f04260 Simplify feed parser and format detection
- Avoid doing multiple buffer copies
- Move parser and format detection logic to its own package
2018-10-14 11:46:41 -07:00
Patrick
2538eea177 Add the possibility to override default user agent for each feed 2018-09-19 18:19:24 -07:00
Frédéric Guillot
dbcc5d8a97 Use canonical imports 2018-08-24 21:56:39 -07:00
Frédéric Guillot
1eba1730d1 Move HTTP client to its own package 2018-04-28 10:51:07 -07:00
aniran
322b265d7a Scrape parent element for iframe
Current behavior: if you have an `iframe` scraper rule, `scrapContent`
tries to return the inner HTML of the `iframe`, which turns up blank.

New behavior: like `img` elements, if an `iframe` is matched by a scraper rule,
the parent element's inner HTML (i.e. the `iframe`) is returned.
2018-04-27 17:57:22 -07:00
Frédéric Guillot
3c3f397bf5 Make sure the scraper parses only HTML documents 2018-01-02 18:32:01 -08:00
Frédéric Guillot
1d8193b892 Add logger 2017-12-15 18:55:57 -08:00
Frédéric Guillot
c6d9eb3614 Improve content scraper 2017-12-13 21:30:40 -08:00
Frédéric Guillot
84d912c979 Rewrite imports 2017-12-12 21:48:13 -08:00
Frédéric Guillot
ef097f02fe Add the possibility to enable crawler for feeds 2017-12-12 19:19:36 -08:00
Frédéric Guillot
87ccad5c7f Add scraper rules 2017-12-10 20:51:04 -08:00
Frédéric Guillot
7a35c58f53 Add readability package to fetch original content 2017-12-10 19:01:38 -08:00
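
Commits 3c3f397bf5 and 3b6e44c331 above describe restricting the scraper to HTML documents and later also accepting XHTML. The check they describe might look roughly like the sketch below. It is an illustration only, not the repository's code: the helper name `isAllowedContentType` is hypothetical, and it assumes XHTML responses are identified by the `application/xhtml+xml` media type.

```go
package main

import (
	"fmt"
	"mime"
)

// isAllowedContentType is a hypothetical helper (not taken from the
// repository). It reports whether a Content-Type header value names an
// HTML or XHTML document, ignoring parameters such as charset.
func isAllowedContentType(contentType string) bool {
	// ParseMediaType strips parameters and lowercases the media type.
	mediaType, _, err := mime.ParseMediaType(contentType)
	if err != nil {
		// Missing or malformed header: refuse to scrape.
		return false
	}
	// Assumption: XHTML is served as application/xhtml+xml.
	return mediaType == "text/html" || mediaType == "application/xhtml+xml"
}

func main() {
	headers := []string{
		"text/html; charset=utf-8",
		"application/xhtml+xml",
		"application/pdf",
	}
	for _, h := range headers {
		fmt.Printf("%q allowed: %v\n", h, isAllowedContentType(h))
	}
}
```

Parsing the header instead of comparing raw strings keeps a value such as `text/html; charset=utf-8` from being rejected because of its parameters.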