langchain_core.utils.html.extract_sub_links

langchain_core.utils.html.extract_sub_links(raw_html: str, url: str, *, base_url: Optional[str] = None, pattern: Optional[Union[str, Pattern]] = None, prevent_outside: bool = True, exclude_prefixes: Sequence[str] = (), continue_on_failure: bool = False) → List[str]

Extract all links from a raw HTML string and convert them into absolute paths.

Parameters
  • raw_html (str) – The original HTML.

  • url (str) – The URL of the HTML.

  • base_url (Optional[str]) – The base URL against which outside links are checked.

  • pattern (Optional[Union[str, Pattern]]) – Regex to use for extracting links from the raw HTML.

  • prevent_outside (bool) – If True, ignore external links that are not children of the base URL.

  • exclude_prefixes (Sequence[str]) – Exclude any URLs that start with one of these prefixes.

  • continue_on_failure (bool) – If True, continue if parsing a specific link raises an exception. Otherwise, raise the exception.

Returns

The extracted sub-links, as absolute paths.

Return type

List[str]
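
Example

The parameter descriptions above can be illustrated with a small standard-library sketch. This is not the langchain_core implementation: `sketch_extract_sub_links` and `HREF_PATTERN` are hypothetical names, the regex is a simplified assumption rather than the library's default pattern, the prefix check is a simplified stand-in for however the library decides that a link is a child of the base URL, and the `pattern` and `continue_on_failure` parameters are left out. It only shows how `url`, `base_url`, `prevent_outside`, and `exclude_prefixes` might interact.

```python
import re
from typing import List, Optional, Sequence
from urllib.parse import urljoin

# Hypothetical, simplified link pattern; the library's default regex may differ.
HREF_PATTERN = re.compile(r"href=[\"']([^\"'#]+)[\"']")


def sketch_extract_sub_links(
    raw_html: str,
    url: str,
    *,
    base_url: Optional[str] = None,
    prevent_outside: bool = True,
    exclude_prefixes: Sequence[str] = (),
) -> List[str]:
    """Illustrative stand-in only, not the langchain_core function."""
    # Assumption: when no base URL is given, the page URL acts as the base.
    base = base_url if base_url is not None else url
    results: List[str] = []
    for link in HREF_PATTERN.findall(raw_html):
        # Resolve relative links against the URL of the page.
        absolute = urljoin(url, link)
        # Drop anything that starts with an excluded prefix.
        if any(absolute.startswith(prefix) for prefix in exclude_prefixes):
            continue
        # Drop links that are not children of the base URL.
        if prevent_outside and not absolute.startswith(base):
            continue
        # Keep first occurrences only, preserving document order.
        if absolute not in results:
            results.append(absolute)
    return results


html = (
    '<a href="/docs/intro">Intro</a>'
    '<a href="guide.html">Guide</a>'
    '<a href="https://other.example.org/page">External</a>'
    '<a href="/docs/private/secret">Private</a>'
)

links = sketch_extract_sub_links(
    html,
    "https://example.com/docs/index.html",
    base_url="https://example.com/docs",
    exclude_prefixes=("https://example.com/docs/private",),
)
print(links)
```

With these inputs the sketch keeps the two links under `https://example.com/docs`, drops the external link because `prevent_outside` is True, and drops the link under the excluded prefix. The real function is called with the same argument names, but its matching and filtering details may differ from this sketch.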

© 2023, LangChain, Inc. Last updated on May 04, 2024.