Content extraction
This article is a stub. You can help the IndieWeb wiki by expanding it.
Content extraction is a set of techniques and tools for getting the main or structured content out of web pages.
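As a rough illustration of the idea, here is a minimal sketch of one naive heuristic: ignore text inside elements that usually hold navigation or boilerplate, then keep the block container with the most text. It uses only Python's standard library. The names `MainContentParser`, `extract_main_text`, `SKIP_TAGS`, `BLOCK_TAGS` and the sample page are made up for this example, and the heuristic is an assumption for illustration, not how any of the tools listed on this page actually work.

```python
from html.parser import HTMLParser

# Assumption: text inside these elements is rarely the main content.
SKIP_TAGS = {"script", "style", "nav", "header", "footer", "aside", "form"}
# Assumption: these containers are the candidate content regions.
BLOCK_TAGS = {"article", "main", "section", "div"}


class MainContentParser(HTMLParser):
    """Collects the text of each candidate block outside skipped elements."""

    def __init__(self):
        super().__init__()
        self.skip_depth = 0  # > 0 while inside a skipped element
        self.stack = []      # text parts of the currently open blocks
        self.blocks = []     # joined text of every closed block

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_TAGS:
            self.skip_depth += 1
        elif tag in BLOCK_TAGS and self.skip_depth == 0:
            self.stack.append([])

    def handle_endtag(self, tag):
        if tag in SKIP_TAGS:
            self.skip_depth = max(0, self.skip_depth - 1)
        elif tag in BLOCK_TAGS and self.skip_depth == 0 and self.stack:
            self.blocks.append(" ".join(self.stack.pop()))

    def handle_data(self, data):
        text = data.strip()
        # Text is credited to the innermost open block only.
        if text and self.skip_depth == 0 and self.stack:
            self.stack[-1].append(text)


def extract_main_text(html):
    """Return the text of the block with the most text, or "" if none."""
    parser = MainContentParser()
    parser.feed(html)
    parser.close()
    return max(parser.blocks, key=len, default="")


# Hypothetical sample page, for demonstration only.
SAMPLE = """
<html><body>
<nav><div><a href="/">Home</a> <a href="/about">About</a></div></nav>
<div class="sidebar">Related links</div>
<article><h1>Hello</h1><p>This is the main text of the post.</p>
<p>It has a second paragraph.</p></article>
<footer>Copyright</footer>
</body></html>
"""

if __name__ == "__main__":
    print(extract_main_text(SAMPLE))
```

Real pages are much messier than this sample (nested containers, comment threads, ads inside the article), which is why dedicated tools score candidate blocks with more signals than raw text length.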
Tools
If you have experience with any of these tools, please add it.
- https://github.com/n1k0/readable-proxy: a content extraction service based on readability.js
- readability-lxml
  - limited
- breadability
  - limited
- newspaper3k
  - memory issues?
- https://mercury.postlight.com/web-parser/