Commit graph

72 commits

Author SHA1 Message Date
Frédéric Guillot
6d97f8b458 Parse podcast categories 2024-03-11 22:30:27 -07:00
Frédéric Guillot
f8e50947f2 Move iTunes and GooglePlay XML definitions to their own packages 2024-03-11 22:09:31 -07:00
Frédéric Guillot
9a637ce95e Refactor RSS parser to use default namespace
This change avoid some limitations of the Go XML parser regarding XML namespaces
2024-03-11 21:07:13 -07:00
jvoisin
a074773e6c Use an io.ReadSeeker instead of an io.Reader to parse feeds
This will allow to make use of func (*Reader) Seek, instead of re-recreating a
new reader. It's a large commit for a small change, but anything to simply the
reader/buffer/ReadAll/… mess is a step in the right direction I think, and it
should enable more follow-up simplifications.
2024-03-06 20:13:39 -08:00
jvoisin
3d0126be0b Speed the sanitizer up a bit, again
- allow youtube urls to start with `www`
- use `strings.Builder` instead of a `bytes.Buffer`
- use a `strings.NewReader` instead of a `bytes.NewBufferString`
- sprinkles a couple of `continue` to make the code-flow more obvious
- inline calls to `inList`, and put their parameters in the right order
- simplify isPixelTracker
- simplify `isValidIframeSource`, by extracting the hostname and comparing it
  directly, instead of using the full url and checking if it starts with
  multiple variations of the same one (`//`, `http:`, `https://` multiplied by
  ``/`www.`)
- add a benchmark
2024-03-05 19:31:50 -08:00
jvoisin
111e3f2106 Reuse a Reader instead of copying to a buffer when parsing an atom feed 2024-03-04 17:36:10 -08:00
jvoisin
3339d9d3d7 Preallocate memory when exporting to OPML
This should marginally increase performance when export a large amount of feeds
to OPML.
2024-03-03 20:34:37 -08:00
jvoisin
347740dce1 Speed up removeUnlikelyCandidates
`.Not` returns a brand new Selection, copied element by element.
2024-02-29 19:38:43 -08:00
jvoisin
ab85d4d678 Improve EstimateReadingTime's speed by a factor 7
- Refactorise the tests and add some
- Use 250 signs instead of the whole text
- Only check for Korean, Chinese and Japanese script
- Add a benchmark
- Use a more idiomatic control flow

```console
$ # main branch
$ go test -bench=.
goos: linux
goarch: amd64
pkg: miniflux.app/v2/internal/reader/readingtime
BenchmarkEstimateReadingTime-12              267           4821268 ns/op
PASS
ok      miniflux.app/v2/internal/reader/readingtime     1.754s
$ # speed_up_reading_time branch
$ go test -bench=.
goos: linux
goarch: amd64
pkg: miniflux.app/v2/internal/reader/readingtime
cpu: 12th Gen Intel(R) Core(TM) i7-1265U
BenchmarkEstimateReadingTime-12             1941            653312 ns/op
PASS
ok      miniflux.app/v2/internal/reader/readingtime     1.342s
$
```
2024-02-29 19:24:15 -08:00
jvoisin
31ac62f410 Don't compute reading-time when unused
If the user doesn't display reading times, there is no need to compute them.
This should speed things up a bit, since `whatlanggo.Detect` is abysmally slow.
2024-02-29 19:14:17 -08:00
Frédéric Guillot
97765b93a9 Revert "Minor internal/reader/readability/readability.go speedup"
This reverts commit 4db138d4b8.

```
panic: runtime error: index out of range [-1]

goroutine 49 [running]:
miniflux.app/v2/internal/reader/readability.getArticle.func1(0x8?, 0xc000b56570)
        /home/fred/repos/miniflux/v2/internal/reader/readability/readability.go:120 +0x2ac
github.com/PuerkitoBio/goquery.(*Selection).Each(0xc000b56510, 0xc000892fa8)
        /home/fred/go/pkg/mod/github.com/!puerkito!bio/goquery@v1.9.0/iteration.go:10 +0x62
miniflux.app/v2/internal/reader/readability.getArticle(0xc00044f1f0, 0xc000a04a50)
        /home/fred/repos/miniflux/v2/internal/reader/readability/readability.go:101 +0x15d
miniflux.app/v2/internal/reader/readability.ExtractContent({0x1005d00?, 0xc0001522d0?})
        /home/fred/repos/miniflux/v2/internal/reader/readability/readability.go:91 +0x211
miniflux.app/v2/internal/reader/scraper.ScrapeWebsite(0xc000893688?, {0xc0007ce720, 0x54}, {0x0, 0x0})
        /home/fred/repos/miniflux/v2/internal/reader/scraper/scraper.go:63 +0x859
miniflux.app/v2/internal/reader/processor.ProcessFeedEntries(0xc000133188, 0xc000502c40, 0xc0003e6360, 0x0)
        /home/fred/repos/miniflux/v2/internal/reader/processor/processor.go:77 +0x8ea
miniflux.app/v2/internal/reader/handler.RefreshFeed(0xc000133188, 0x10cf, 0x52d5c, 0x0)
        /home/fred/repos/miniflux/v2/internal/reader/handler/handler.go:301 +0x1485
miniflux.app/v2/internal/cli.refreshFeeds.func1(0x0)
        /home/fred/repos/miniflux/v2/internal/cli/refresh_feeds.go:59 +0x2d7
created by miniflux.app/v2/internal/cli.refreshFeeds in goroutine 1
        /home/fred/repos/miniflux/v2/internal/cli/refresh_feeds.go:50 +0x5d5
```
2024-02-29 19:06:03 -08:00
Frédéric Guillot
c493f8921e Add missing regex anchor detected by CodeQL 2024-02-28 20:50:17 -08:00
jvoisin
4db138d4b8 Minor internal/reader/readability/readability.go speedup
- Don't use a capturing group in `divToPElementsRegexp`
- Remove a duplicate condition
- Replace a regex with a fixed-comparison and a `Contains`
2024-02-28 20:03:14 -08:00
jvoisin
f12d5131b0 Divide the sanitization time by 3
Instead of having to allocate a ~100 keys map containing possibly dynamic
values (at least to the go compiler), allocate it once in a global variable.
This significantly speeds things up, by reducing the garbage
collector/allocator involvements.

Local synthetic benchmarks have shown a improvements from 38% of wall time to only
12%.
2024-02-28 20:00:13 -08:00
jvoisin
645a817685 Use modern for loops
Go 1.22 introduced a new [for-range](https://go.dev/ref/spec#For_range)
construct that looks a tad better than the usual `for i := 0; i < N; i++`
construct. I also tool the liberty of replacing some
`for i := 0; i < len(myitemsarray); i++ { … myitemsarray[i] …}`
with  `for item := range myitemsarray` when `myitemsarray` contains only pointers.
2024-02-28 19:55:28 -08:00
jvoisin
543a690bfd Close resources as soon as possible, instead of using defer() in a loop
So that resources can be freed as soon as they're not used anymore, instead of
waiting for the two nested loops to finish.
2024-02-28 19:47:30 -08:00
jvoisin
c4e5dad549 Remove superfluous escaping in a regex 2024-02-28 19:47:30 -08:00
jvoisin
fa12c23d79 Use strings.ReplaceAll instead of strings.Replace(…, -1) 2024-02-28 19:47:30 -08:00
jvoisin
4fe902a5d2 Use strings.EqualFold instead of strings.ToLower(…) == 2024-02-28 19:47:30 -08:00
jvoisin
61af08a721 Use .WriteString( instead of .Write([]byte(… 2024-02-28 19:47:30 -08:00
jvoisin
b04550e2f2 Use %q instead of "%s" 2024-02-28 19:47:30 -08:00
jvoisin
b94756bbf0 Add a warning for StripTags 2024-02-27 20:41:47 -08:00
jvoisin
db6ae707ef Add some tests for add_image_title
I'm not sure if the behaviour is expected, but I didn't manage to
get the content injection to work in my browser, so I guess it's alright?
2024-02-27 20:41:15 -08:00
jvoisin
06e256e5ef Simplify internal/reader/icon/finder.go
- Use a simple regex to parse data uri instead of a hand-rolled parser, and
  document what fields are considered mandatory.
- Use case-insensitive matching to find (fav)icons, instead of doing the same
  query twice with different letter cases
- Add 'apple-touch-icon-precomposed.png' as a fallback favicon
- Reorder the queries to have i`con` first, since it seems to be the most
  popular one. It used to be last, meaning that pages had to be parsed
  completely 4 times, instead of one now.
- Minor factorisation in findIconURLsFromHTMLDocument
2024-02-26 18:18:04 -08:00
jvoisin
040938ff6d Small refactoring of internal/reader/date/parser.go
- Split dates formats into those that require local times
  and those who don't, so that there is no need to have a switch-case in the
  for loop with around 250 iterations at most.
- Be more strict when it comes to timezones, previously invalid ones like -13
  were accepted. Also add a test for this.
- Bail out early if the date is an empty string.
2024-02-26 18:08:04 -08:00
jvoisin
c2d2f31438 Improve a bit internal/reader/scraper/scraper.go
- make findContentUsingCustomRules' more idiomatic,
  since in golang a function returning an error might
  return garbage in other parameter. Moreover, ignoring
  errors is bad practise.
- getPredefinedScraperRules is now running in constant-time,
  instead of iterating on a list with around 50 items in it.
2024-02-26 18:00:23 -08:00
jvoisin
5b2558bf92 Miscellaneous improvements to internal/reader/subscription/finder.go
- Surface `localizedError` in FindSubscriptionsFromWellKnownURLs via slog
- Use an inline declaration for new subscriptions, like done elsewhere in the
  file, if only for consistency's sake
- Preallocate the `subscriptions` slice when using an RSS-bridge,
  it's a good practise, and it might even marginally improve
  performances when adding __a lot__ of feeds via an rss-bridge instance, wooo!
2024-02-26 17:52:21 -08:00
jvoisin
ecd59009fb Add a couple of new possible locations for feeds
- Hugo likes to generate index.xml
- feed.atom and feed.rss are used by enterprise-scale/old-school gigantic CMS
2024-02-26 17:43:51 -08:00
jvoisin
4a943b722d Add a couple of fuzzers 2024-02-26 17:23:49 -08:00
jvoisin
54b5be5e7d Significantly simplify/speed up the sanitizer
- Use constant time access for maps instead of iterating on them
- Build a ~large whitelist map inline instead of constructing it item by item
  (and remove a duplicate key/value pair)
- Use `slices` instead of hand-rolled loops
2024-02-25 17:29:46 -08:00
Frédéric Guillot
eae4cb1417 Add feed option to disable HTTP/2 to avoid fingerprinting 2024-02-24 22:30:26 -08:00
jvoisin
b48ad6dbfb Make use of go≥1.21 slices package instead of hand-rolled loops
This makes the code a tad smaller, moderner,
and maybe even marginally faster, yay!
2024-02-24 20:22:53 -08:00
jvoisin
c544dadd55 Fix categories import from Thunderbird's OPML
Thunderbird OPML exports are looking like this:

```xml
<opml version="1.0" xmlns:fz="urn:forumzilla:">
<head>
	<title>Thunderbird OPML Export - RSS</title>
    	<dateCreated>Sat, 24 Feb 2024 11:31:13 GMT</dateCreated>
</head>
<body>
	<outline title="News">
		<outline type="rss" ...>
		<outline type="rss" ...>
		...
	</outline>
	<outline title="Blogs">
		<outline type="rss" ...>
		<outline type="rss" ...>
		...
	</outline>
</body>
```

This commit make it so that categories are now correctly imported.
2024-02-24 19:43:33 -08:00
Frédéric Guillot
c595c80356 Handle RDF feeds with duplicated <title> elements 2024-02-23 17:40:58 -08:00
Matt Stobo
4a50ca9122 Allow filtering feeds on entry.Author 2024-01-31 19:42:07 -08:00
Dave
1159dd6982 Add addDynamicIframe rewrite function.
Add unit tests for `add_dynamic_iframe` rewrite.
2024-01-23 19:23:57 -08:00
dzaikos
d68f2306c6 Add attribute to add_dynamic_image rewrite candidates. 2024-01-21 14:27:06 -08:00
Frédéric Guillot
8553188ae4 Add missing translation argument 2024-01-20 10:48:27 -08:00
Frédéric Guillot
ce32d181d5 Change default Accept header 2024-01-13 13:53:57 -08:00
Filipe de Luna
1441dc7600
Update entry processor to allow blocking/keeping entries by tags 2024-01-09 21:15:11 -08:00
Jan Tojnar
074393d3bf fix: Include type for OPML subscriptions
As per [OPML 2.0 specification]:

> Each sub-element of the body of the OPML document is a node of type rss or an outline element that contains nodes of type rss.

> Required attributes: type, text, xmlUrl.

[OPML 2.0 specification]: http://opml.org/spec2.opml#subscriptionLists
2023-12-31 10:00:50 -08:00
Darwin
d90667777f request_builder.go: fetcher: Force try HTTP/2 2023-12-15 16:27:00 -08:00
Kristof Mattei
0465f9b188 fix: tests for allow popups to escape sandbox 2023-12-10 16:59:58 -08:00
Kristof Mattei
d53ad3b79a fix: clicking youtube links in iframes returns ERR_BLOCKED_BY_RESPONSE 2023-12-10 16:59:58 -08:00
Frédéric Guillot
d0f99cee1a Regression: ensure all HTML documents are encoded in UTF-8
Fixes #2196
2023-12-01 16:52:03 -08:00
Frédéric Guillot
5de0714256 Deduplicate feed URLs when parsing HTML document during discovery process
Fixes #2232
2023-12-01 13:57:05 -08:00
Shizun Ge
27ec6dbd7d Setting NextCheckAt due to TTL of a feed in feed.go.
Add unit tests.
2023-12-01 12:22:30 -08:00
Thomas J Faughnan Jr
fe0ef8b579 Fix conditional requests regression
The recent HTTP client refactor in 14e25ab9fe
caused feed refreshes to no longer make conditional requests. Prior to
the refactor, `client.WithCacheHeaders` handled this. Now this function
is split into `fetcher.WithETag` and `fetcher.WithLastModified` but
these functions are only declared and never actually used. Fix this by
calling them inside `handler.RefreshFeed`.
2023-11-29 19:46:50 -08:00
Thomas J Faughnan Jr
7a03291442 Fix default User-Agent regression
The recent HTTP client refactor in 14e25ab9fe
introduced a bug in which the global default User-Agent is no longer
used for requests. Unless a per-feed User-Agent exists, the Go standard
library's default User-Agent is used, which looks something like
"Go-http-client/1.1". To fix this, make RequestBuilder.WithUserAgent
take an additional argument, the default User-Agent, which will be used
if there is no per-feed User-Agent (i.e. it is an empty string).

Fixes #2188
Fixes #2189
2023-11-18 20:57:47 +01:00
Frédéric Guillot
e3eaaea15a Update date parser to parse more invalid date formats 2023-11-01 20:55:35 +01:00