Commit graph

92 commits

Author SHA1 Message Date
jvoisin
f109e3207c reader/rss: don't add empty tags to RSS items
This commit adds a bunch of checks to prevent reader/rss from adding empty tags
to rss items, as well as some minor refactors like nested conditions and loops
unrolling.
2024-03-24 19:46:56 -07:00
Frédéric Guillot
ad1d349a0c rss: use Channel tags only if there is no Item tags 2024-03-23 13:46:48 -07:00
jvoisin
fc4bdf3ab0 Inline a one-liner function
No need to expose a symbol for this.
2024-03-20 17:21:30 -07:00
Frédéric Guillot
08640b27d5 Ensure enclosure URLs are always absolute 2024-03-19 21:57:46 -07:00
jvoisin
4be993e055 Minor refactoring of internal/reader/atom/atom_10_adapter.go
- Move the population of the feed's entries into a new function, to make
  `BuildFeed` easier to understand/separate concerns/implementation details
- Use `sort+compact` instead of `compact+sort` to remove duplicates
- Change `if !a { a = } if !a {a = }` constructs into `if !a { a = ; if !a {a = }}`.
  This reduces the number of comparisons and also slightly improves
  control-flow readability.
2024-03-19 20:41:44 -07:00
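The `sort+compact` ordering mentioned in this commit matters because `slices.Compact` only removes adjacent duplicates. A minimal sketch, assuming a hypothetical `dedupe` helper (not the name used in the adapter):

```go
package main

import (
	"fmt"
	"slices"
)

// dedupe is a hypothetical helper: it returns a sorted copy of tags with
// duplicates removed. Sorting first makes equal values adjacent, which is
// what slices.Compact needs in order to drop all of them.
func dedupe(tags []string) []string {
	out := slices.Clone(tags)
	slices.Sort(out)
	return slices.Compact(out)
}

func main() {
	fmt.Println(dedupe([]string{"go", "rss", "go", "atom", "rss"})) // [atom go rss]
}
```

Compacting before sorting would leave the non-adjacent `go` and `rss` duplicates in place.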
Jean Khawand
a78d1c79da Add FILTER_ENTRY_MAX_AGE_DAYS config option to limit fetching all feed items 2024-03-20 02:58:53 +00:00
Frédéric Guillot
fa9697b972 Remove trailing space in SiteURL and FeedURL 2024-03-18 17:51:06 -07:00
jvoisin
91f5522ce0 Minor simplification of internal/reader/media/media.go
- Simplify a switch-case by moving a common condition above it.
- Remove a superfluous error-check: `strconv.ParseInt` returns `0` when passed
  an empty string.
2024-03-18 16:09:32 -07:00
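The `strconv.ParseInt` behaviour this commit relies on can be checked with a short sketch. `parseSize` is a hypothetical helper for illustration, not a function from `media.go`:

```go
package main

import (
	"fmt"
	"strconv"
)

// parseSize is a hypothetical helper. For an empty or syntactically invalid
// string, strconv.ParseInt returns 0 together with an error, so a separate
// "is it empty?" check before the call is not needed to get a zero value.
func parseSize(s string) int64 {
	n, _ := strconv.ParseInt(s, 10, 64)
	return n
}

func main() {
	fmt.Println(parseSize(""), parseSize("42"))
}
```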
Frédéric Guillot
8212f16aa2 atom: avoid debug message when the date is empty 2024-03-17 15:29:50 -07:00
Frédéric Guillot
b1e73fafdf Enable go-critic linter and fix various issues detected 2024-03-17 13:52:34 -07:00
jvoisin
c29ca0e313 Minor simplifications of the rewriter
- Inline some one-line functions
- Transform a free-standing function into a method
- Massively simplify `removeClickbait`
- Use a proper constant instead of a magic number in `applyFuncOnTextContent`
2024-03-17 12:15:46 -07:00
jvoisin
02a074ed26 Compile block/keep regex only once per feed
No need to compile them once for matching on the url,
once per tag, once per title, once per author, … one time is enough.
It also simplifies error handling, since while regexp compilation can fail,
matching can't.
2024-03-17 12:08:03 -07:00
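The idea of compiling a block rule once and reusing it for every field can be sketched as follows. The `entry` type and `blockedEntries` function are assumptions made for illustration, not the actual processor code:

```go
package main

import (
	"fmt"
	"regexp"
)

// entry is a hypothetical, trimmed-down feed entry.
type entry struct {
	Title, URL, Author string
}

// blockedEntries compiles the rule a single time, then matches it against
// several fields of every entry. Compilation is the only step that can fail,
// so there is exactly one error to handle.
func blockedEntries(rule string, entries []entry) ([]entry, error) {
	re, err := regexp.Compile(rule)
	if err != nil {
		return nil, err
	}
	var blocked []entry
	for _, e := range entries {
		if re.MatchString(e.URL) || re.MatchString(e.Title) || re.MatchString(e.Author) {
			blocked = append(blocked, e)
		}
	}
	return blocked, nil
}

func main() {
	entries := []entry{
		{Title: "Sponsored: buy now", URL: "https://example.org/1"},
		{Title: "Release notes", URL: "https://example.org/2"},
	}
	blocked, err := blockedEntries(`(?i)sponsored`, entries)
	fmt.Println(len(blocked), err)
}
```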
Frédéric Guillot
309fdbb9fc Fix force refresh 2024-03-15 19:42:09 -07:00
Frédéric Guillot
4834e934f2 Remove some duplicated code in RSS parser 2024-03-15 18:40:06 -07:00
Frédéric Guillot
dd4fb660c1 Refactor Atom parser to use an adapter 2024-03-15 17:27:16 -07:00
Frédéric Guillot
5948786b15 Add support for RSS <media:category> element 2024-03-13 21:35:39 -07:00
Frédéric Guillot
648b9a8f6f Refactor RSS Parser to use an adapter 2024-03-13 21:25:09 -07:00
Frédéric Guillot
8429c6b0ab Refactor JSON Feed parser to use an adapter 2024-03-12 22:37:14 -07:00
Frédéric Guillot
6bc4b35e38 Refactor RDF parser to use an adapter
Avoid tight coupling between `model.Feed` and the original XML RDF feed.
2024-03-12 20:54:05 -07:00
jvoisin
45d486b919 When detecting the format, detect its version as well
There is no need to detect the format and then the version when both can be
done at the same time.

Add a benchmark as well, on large and small atom and rss files.
2024-03-12 18:56:56 -07:00
Frédéric Guillot
6d97f8b458 Parse podcast categories 2024-03-11 22:30:27 -07:00
Frédéric Guillot
f8e50947f2 Move iTunes and GooglePlay XML definitions to their own packages 2024-03-11 22:09:31 -07:00
Frédéric Guillot
9a637ce95e Refactor RSS parser to use default namespace
This change avoids some limitations of the Go XML parser regarding XML namespaces.
2024-03-11 21:07:13 -07:00
jvoisin
a074773e6c Use an io.ReadSeeker instead of an io.Reader to parse feeds
This makes it possible to use `func (*Reader) Seek` instead of re-creating a
new reader. It's a large commit for a small change, but anything that simplifies the
reader/buffer/ReadAll/… mess is a step in the right direction I think, and it
should enable more follow-up simplifications.
2024-03-06 20:13:39 -08:00
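What an `io.ReadSeeker` buys over a plain `io.Reader` can be sketched like this: peek at the start of the input, rewind, then read everything from the same reader. `sniffThenRead` and `sniffString` are hypothetical names, not functions from the parser:

```go
package main

import (
	"fmt"
	"io"
	"strings"
)

// sniffThenRead reads up to n bytes, rewinds with Seek, and then reads the
// whole input again from the same reader instead of building a new one.
func sniffThenRead(r io.ReadSeeker, n int) (head, body string, err error) {
	buf := make([]byte, n)
	k, err := io.ReadFull(r, buf)
	if err != nil && err != io.EOF && err != io.ErrUnexpectedEOF {
		return "", "", err
	}
	if _, err = r.Seek(0, io.SeekStart); err != nil {
		return "", "", err
	}
	all, err := io.ReadAll(r)
	if err != nil {
		return "", "", err
	}
	return string(buf[:k]), string(all), nil
}

// sniffString is a convenience wrapper for in-memory input.
func sniffString(s string, n int) (string, string, error) {
	return sniffThenRead(strings.NewReader(s), n)
}

func main() {
	head, body, err := sniffString(`<rss version="2.0"></rss>`, 4)
	fmt.Println(head, len(body), err)
}
```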
jvoisin
3d0126be0b Speed the sanitizer up a bit, again
- allow youtube urls to start with `www`
- use `strings.Builder` instead of a `bytes.Buffer`
- use a `strings.NewReader` instead of a `bytes.NewBufferString`
- sprinkle a couple of `continue` statements to make the code flow more obvious
- inline calls to `inList`, and put their parameters in the right order
- simplify isPixelTracker
- simplify `isValidIframeSource`, by extracting the hostname and comparing it
  directly, instead of using the full url and checking if it starts with
  multiple variations of the same one (`//`, `http:`, `https://`, each with
  and without a `www.` prefix)
- add a benchmark
2024-03-05 19:31:50 -08:00
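The hostname-based check described for `isValidIframeSource` can be sketched as below. The allow list and the `isAllowedIframeSource` name are assumptions for illustration. The real sanitizer's list and logic may differ:

```go
package main

import (
	"fmt"
	"net/url"
	"strings"
)

// allowedIframeHosts is a hypothetical allow list, keyed by hostname
// without any "www." prefix.
var allowedIframeHosts = map[string]struct{}{
	"youtube.com":      {},
	"player.vimeo.com": {},
}

// isAllowedIframeSource parses the URL once and compares only its hostname,
// so scheme variations (//, http://, https://) and an optional "www."
// prefix do not need to be enumerated one by one.
func isAllowedIframeSource(src string) bool {
	u, err := url.Parse(src)
	if err != nil {
		return false
	}
	host := strings.TrimPrefix(strings.ToLower(u.Hostname()), "www.")
	_, ok := allowedIframeHosts[host]
	return ok
}

func main() {
	fmt.Println(isAllowedIframeSource("//www.youtube.com/embed/abc"))
	fmt.Println(isAllowedIframeSource("https://evil.example/youtube.com"))
}
```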
jvoisin
111e3f2106 Reuse a Reader instead of copying to a buffer when parsing an atom feed 2024-03-04 17:36:10 -08:00
jvoisin
3339d9d3d7 Preallocate memory when exporting to OPML
This should marginally increase performance when exporting a large number of feeds
to OPML.
2024-03-03 20:34:37 -08:00
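Preallocation of this kind usually means sizing a slice's capacity up front so `append` never has to grow it. A minimal sketch, with a hypothetical `feed` type and `outlineTitles` function rather than the actual OPML code:

```go
package main

import "fmt"

// feed is a hypothetical, trimmed-down feed record.
type feed struct {
	Title string
}

// outlineTitles knows the final length in advance, so it allocates the
// backing array once instead of letting append reallocate as it grows.
func outlineTitles(feeds []feed) []string {
	titles := make([]string, 0, len(feeds))
	for _, f := range feeds {
		titles = append(titles, f.Title)
	}
	return titles
}

func main() {
	titles := outlineTitles([]feed{{Title: "A"}, {Title: "B"}})
	fmt.Println(titles, cap(titles))
}
```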
jvoisin
347740dce1 Speed up removeUnlikelyCandidates
`.Not` returns a brand new Selection, copied element by element.
2024-02-29 19:38:43 -08:00
jvoisin
ab85d4d678 Improve EstimateReadingTime's speed by a factor of 7
- Refactor the tests and add some
- Use 250 signs instead of the whole text
- Only check for Korean, Chinese and Japanese script
- Add a benchmark
- Use a more idiomatic control flow

```console
$ # main branch
$ go test -bench=.
goos: linux
goarch: amd64
pkg: miniflux.app/v2/internal/reader/readingtime
BenchmarkEstimateReadingTime-12              267           4821268 ns/op
PASS
ok      miniflux.app/v2/internal/reader/readingtime     1.754s
$ # speed_up_reading_time branch
$ go test -bench=.
goos: linux
goarch: amd64
pkg: miniflux.app/v2/internal/reader/readingtime
cpu: 12th Gen Intel(R) Core(TM) i7-1265U
BenchmarkEstimateReadingTime-12             1941            653312 ns/op
PASS
ok      miniflux.app/v2/internal/reader/readingtime     1.342s
$
```
2024-02-29 19:24:15 -08:00
jvoisin
31ac62f410 Don't compute reading-time when unused
If the user doesn't display reading times, there is no need to compute them.
This should speed things up a bit, since `whatlanggo.Detect` is abysmally slow.
2024-02-29 19:14:17 -08:00
Frédéric Guillot
97765b93a9 Revert "Minor internal/reader/readability/readability.go speedup"
This reverts commit 4db138d4b8.

```
panic: runtime error: index out of range [-1]

goroutine 49 [running]:
miniflux.app/v2/internal/reader/readability.getArticle.func1(0x8?, 0xc000b56570)
        /home/fred/repos/miniflux/v2/internal/reader/readability/readability.go:120 +0x2ac
github.com/PuerkitoBio/goquery.(*Selection).Each(0xc000b56510, 0xc000892fa8)
        /home/fred/go/pkg/mod/github.com/!puerkito!bio/goquery@v1.9.0/iteration.go:10 +0x62
miniflux.app/v2/internal/reader/readability.getArticle(0xc00044f1f0, 0xc000a04a50)
        /home/fred/repos/miniflux/v2/internal/reader/readability/readability.go:101 +0x15d
miniflux.app/v2/internal/reader/readability.ExtractContent({0x1005d00?, 0xc0001522d0?})
        /home/fred/repos/miniflux/v2/internal/reader/readability/readability.go:91 +0x211
miniflux.app/v2/internal/reader/scraper.ScrapeWebsite(0xc000893688?, {0xc0007ce720, 0x54}, {0x0, 0x0})
        /home/fred/repos/miniflux/v2/internal/reader/scraper/scraper.go:63 +0x859
miniflux.app/v2/internal/reader/processor.ProcessFeedEntries(0xc000133188, 0xc000502c40, 0xc0003e6360, 0x0)
        /home/fred/repos/miniflux/v2/internal/reader/processor/processor.go:77 +0x8ea
miniflux.app/v2/internal/reader/handler.RefreshFeed(0xc000133188, 0x10cf, 0x52d5c, 0x0)
        /home/fred/repos/miniflux/v2/internal/reader/handler/handler.go:301 +0x1485
miniflux.app/v2/internal/cli.refreshFeeds.func1(0x0)
        /home/fred/repos/miniflux/v2/internal/cli/refresh_feeds.go:59 +0x2d7
created by miniflux.app/v2/internal/cli.refreshFeeds in goroutine 1
        /home/fred/repos/miniflux/v2/internal/cli/refresh_feeds.go:50 +0x5d5
```
2024-02-29 19:06:03 -08:00
Frédéric Guillot
c493f8921e Add missing regex anchor detected by CodeQL 2024-02-28 20:50:17 -08:00
jvoisin
4db138d4b8 Minor internal/reader/readability/readability.go speedup
- Don't use a capturing group in `divToPElementsRegexp`
- Remove a duplicate condition
- Replace a regex with a fixed-comparison and a `Contains`
2024-02-28 20:03:14 -08:00
jvoisin
f12d5131b0 Divide the sanitization time by 3
Instead of having to allocate a ~100 keys map containing possibly dynamic
values (at least to the go compiler), allocate it once in a global variable.
This significantly speeds things up, by reducing the garbage
collector/allocator involvements.

Local synthetic benchmarks have shown an improvement from 38% of wall time to only
12%.
2024-02-28 20:00:13 -08:00
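Hoisting a large map into a package-level variable means it is built once at program start rather than on every call. A rough sketch, assuming a tiny hypothetical allow list (the real one has around 100 keys):

```go
package main

import "fmt"

// allowedAttrs is built once, when the package is initialised. Building the
// same map inside isAllowedAttr would allocate it again on every call and
// give the garbage collector more work.
var allowedAttrs = map[string][]string{
	"a":   {"href", "title"},
	"img": {"src", "alt"},
	"p":   {},
}

// isAllowedAttr only reads from the shared map.
func isAllowedAttr(tag, attr string) bool {
	attrs, ok := allowedAttrs[tag]
	if !ok {
		return false
	}
	for _, a := range attrs {
		if a == attr {
			return true
		}
	}
	return false
}

func main() {
	fmt.Println(isAllowedAttr("a", "href"), isAllowedAttr("a", "onclick"))
}
```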
jvoisin
645a817685 Use modern for loops
Go 1.22 introduced a new [for-range](https://go.dev/ref/spec#For_range)
construct that looks a tad better than the usual `for i := 0; i < N; i++`
construct. I also took the liberty of replacing some
`for i := 0; i < len(myitemsarray); i++ { … myitemsarray[i] …}`
with `for _, item := range myitemsarray` when `myitemsarray` contains only pointers.
2024-02-28 19:55:28 -08:00
jvoisin
543a690bfd Close resources as soon as possible, instead of using defer() in a loop
So that resources can be freed as soon as they're not used anymore, instead of
waiting for the two nested loops to finish.
2024-02-28 19:47:30 -08:00
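The difference from `defer` in a loop is that a deferred call only runs when the surrounding function returns, so every resource stays open until the loops finish. A minimal sketch with a hypothetical `resource` type that records when it is closed:

```go
package main

import "fmt"

// resource is a hypothetical stand-in for a file, response body, or similar.
// It appends to a shared event log so the order of operations is visible.
type resource struct {
	name   string
	events *[]string
}

func (r *resource) Close() {
	*r.events = append(*r.events, "close "+r.name)
}

// processAll closes each resource at the end of its own iteration. With
// `defer r.Close()` inside the loop, all the closes would instead pile up
// and run only when processAll returns.
func processAll(names []string) []string {
	var events []string
	for _, name := range names {
		r := &resource{name: name, events: &events}
		events = append(events, "use "+name)
		r.Close()
	}
	return events
}

func main() {
	fmt.Println(processAll([]string{"a", "b"}))
}
```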
jvoisin
c4e5dad549 Remove superfluous escaping in a regex 2024-02-28 19:47:30 -08:00
jvoisin
fa12c23d79 Use strings.ReplaceAll instead of strings.Replace(…, -1) 2024-02-28 19:47:30 -08:00
jvoisin
4fe902a5d2 Use strings.EqualFold instead of strings.ToLower(…) == 2024-02-28 19:47:30 -08:00
jvoisin
61af08a721 Use .WriteString( instead of .Write([]byte(… 2024-02-28 19:47:30 -08:00
jvoisin
b04550e2f2 Use %q instead of "%s" 2024-02-28 19:47:30 -08:00
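The four small standard-library clean-ups listed above (`strings.ReplaceAll`, `strings.EqualFold`, `WriteString`, and `%q`) can be illustrated together. The function names below are hypothetical, chosen only for this sketch:

```go
package main

import (
	"fmt"
	"strings"
)

// tabsToSpaces: ReplaceAll(s, old, new) is equivalent to Replace(s, old, new, -1).
func tabsToSpaces(s string) string {
	return strings.ReplaceAll(s, "\t", " ")
}

// sameMIME: EqualFold compares case-insensitively without allocating the
// lowercased copies that strings.ToLower(a) == strings.ToLower(b) creates.
func sameMIME(a, b string) bool {
	return strings.EqualFold(a, b)
}

// join: WriteString avoids the []byte conversion that Write([]byte(s)) needs.
func join(parts []string) string {
	var b strings.Builder
	for _, p := range parts {
		b.WriteString(p)
	}
	return b.String()
}

// describe: %q quotes and escapes the value, unlike a hand-written "%s"
// wrapped in literal quotes.
func describe(name string) string {
	return fmt.Sprintf("unknown feed %q", name)
}

func main() {
	fmt.Println(tabsToSpaces("a\tb"), sameMIME("Text/HTML", "text/html"))
	fmt.Println(join([]string{"<p>", "hi", "</p>"}), describe("x"))
}
```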
jvoisin
b94756bbf0 Add a warning for StripTags 2024-02-27 20:41:47 -08:00
jvoisin
db6ae707ef Add some tests for add_image_title
I'm not sure if the behaviour is expected, but I didn't manage to
get the content injection to work in my browser, so I guess it's alright?
2024-02-27 20:41:15 -08:00
jvoisin
06e256e5ef Simplify internal/reader/icon/finder.go
- Use a simple regex to parse data uri instead of a hand-rolled parser, and
  document what fields are considered mandatory.
- Use case-insensitive matching to find (fav)icons, instead of doing the same
  query twice with different letter cases
- Add 'apple-touch-icon-precomposed.png' as a fallback favicon
- Reorder the queries to have `icon` first, since it seems to be the most
  popular one. It used to be last, meaning that pages had to be parsed
  completely 4 times, instead of once now.
- Minor factorisation in findIconURLsFromHTMLDocument
2024-02-26 18:18:04 -08:00
jvoisin
040938ff6d Small refactoring of internal/reader/date/parser.go
- Split date formats into those that require local times
  and those that don't, so that there is no need to have a switch-case in the
  for loop with around 250 iterations at most.
- Be more strict when it comes to timezones: previously, invalid ones like -13
  were accepted. Also add a test for this.
- Bail out early if the date is an empty string.
2024-02-26 18:08:04 -08:00
jvoisin
c2d2f31438 Slightly improve internal/reader/scraper/scraper.go
- Make `findContentUsingCustomRules` more idiomatic,
  since in Go a function returning an error might
  return garbage in its other return values. Moreover, ignoring
  errors is bad practice.
- `getPredefinedScraperRules` now runs in constant time,
  instead of iterating on a list with around 50 items in it.
2024-02-26 18:00:23 -08:00
jvoisin
5b2558bf92 Miscellaneous improvements to internal/reader/subscription/finder.go
- Surface `localizedError` in FindSubscriptionsFromWellKnownURLs via slog
- Use an inline declaration for new subscriptions, like done elsewhere in the
  file, if only for consistency's sake
- Preallocate the `subscriptions` slice when using an RSS-bridge,
  it's good practice, and it might even marginally improve
  performance when adding __a lot__ of feeds via an rss-bridge instance, wooo!
2024-02-26 17:52:21 -08:00
jvoisin
ecd59009fb Add a couple of new possible locations for feeds
- Hugo likes to generate index.xml
- feed.atom and feed.rss are used by enterprise-scale/old-school gigantic CMS
2024-02-26 17:43:51 -08:00
jvoisin
4a943b722d Add a couple of fuzzers 2024-02-26 17:23:49 -08:00
jvoisin
54b5be5e7d Significantly simplify/speed up the sanitizer
- Use constant time access for maps instead of iterating on them
- Build a ~large whitelist map inline instead of constructing it item by item
  (and remove a duplicate key/value pair)
- Use `slices` instead of hand-rolled loops
2024-02-25 17:29:46 -08:00
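The first and last bullets of this commit can be sketched side by side: a map literal used as a set for constant-time lookups, and `slices.Contains` in place of a hand-rolled loop over a short list. The tag lists and function names are assumptions, not the sanitizer's actual data:

```go
package main

import (
	"fmt"
	"slices"
)

// blockedTags is declared inline as a literal and used as a set: a lookup
// is a single map access rather than a scan over every key.
var blockedTags = map[string]struct{}{
	"script":   {},
	"style":    {},
	"noscript": {},
}

// selfClosingTags is short enough that a linear search is fine.
var selfClosingTags = []string{"br", "hr", "img"}

func isBlockedTag(tag string) bool {
	_, ok := blockedTags[tag]
	return ok
}

// isSelfClosing uses slices.Contains instead of a hand-written for loop.
func isSelfClosing(tag string) bool {
	return slices.Contains(selfClosingTags, tag)
}

func main() {
	fmt.Println(isBlockedTag("script"), isBlockedTag("p"), isSelfClosing("br"))
}
```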