Content extraction

From IndieWeb

This article is a stub. You can help the IndieWeb wiki by expanding it.

Content extraction is a set of techniques and tools for getting the main/structured content out of web pages.
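To make the idea concrete, here is a minimal sketch of one common heuristic: collect paragraph text per container element, skip obvious page chrome, and keep the container holding the most paragraph text. It uses only the Python standard library and is not how any of the tools listed on this page work. The names `ParagraphCollector` and `extract_main_text`, and the tag sets `CONTAINERS` and `SKIP`, are assumptions made for illustration.

```python
from html.parser import HTMLParser

# Assumed tag sets for this sketch; real extractors use far richer signals.
CONTAINERS = {"article", "main", "section", "div", "body"}
SKIP = {"script", "style", "nav", "header", "footer", "aside"}


class ParagraphCollector(HTMLParser):
    """Groups <p> text by the nearest enclosing container element."""

    def __init__(self):
        super().__init__()
        self.stack = []        # ids of currently open containers
        self.paragraphs = {}   # container id -> list of paragraph strings
        self.skip_depth = 0    # > 0 while inside page chrome (nav, footer, ...)
        self.in_p = False
        self.buf = []
        self.next_id = 0

    def handle_starttag(self, tag, attrs):
        if tag in SKIP:
            self.skip_depth += 1
        elif tag in CONTAINERS:
            self.stack.append(self.next_id)
            self.next_id += 1
        elif tag == "p" and not self.skip_depth:
            self.in_p = True
            self.buf = []

    def handle_endtag(self, tag):
        if tag in SKIP:
            self.skip_depth = max(0, self.skip_depth - 1)
        elif tag in CONTAINERS and self.stack:
            self.stack.pop()
        elif tag == "p" and self.in_p:
            self.in_p = False
            # Collapse runs of whitespace into single spaces.
            text = " ".join("".join(self.buf).split())
            if text:
                key = self.stack[-1] if self.stack else -1
                self.paragraphs.setdefault(key, []).append(text)

    def handle_data(self, data):
        if self.in_p and not self.skip_depth:
            self.buf.append(data)


def extract_main_text(html):
    """Return the paragraphs of the container with the most paragraph text."""
    parser = ParagraphCollector()
    parser.feed(html)
    parser.close()
    if not parser.paragraphs:
        return ""
    best = max(parser.paragraphs.values(),
               key=lambda ps: sum(len(p) for p in ps))
    return "\n\n".join(best)
```

Calling `extract_main_text(page_html)` on a page whose article body sits in one element and whose navigation, sidebar, and footer sit in others should return only the article paragraphs, separated by blank lines. The heuristic breaks down on pages that split the body across many sibling containers, which is one reason to prefer a dedicated tool.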

tools

If you have experience with any of these tools, please add your experience.

https://github.com/n1k0/readable-proxy – based on readability.js, the basis of

oh, content extraction service

readability-lxml
- limited
breadability
- limited
newspaper3k
- memory issues?
https://mercury.postlight.com/web-parser/

indieweb/silo specific tools

XRay
granary
