Document loaders
Features
The following table shows the feature support for all document loaders.
| Document Loader | Description | Lazy loading | Native async support |
|---|---|---|---|
| AZLyricsLoader | Load AZLyrics webpages. | ✅ | ✅ |
| AcreomLoader | Load acreom vault from a directory. | ✅ | ❌ |
| AirbyteCDKLoader | Load with an Airbyte source connector implemented using the CDK. | ✅ | ❌ |
| AirbyteGongLoader | Load from Gong using an Airbyte source connector. | ✅ | ❌ |
| AirbyteHubspotLoader | Load from Hubspot using an Airbyte source connector. | ✅ | ❌ |
| AirbyteJSONLoader | Load local Airbyte JSON files. | ❌ | ❌ |
| AirbyteSalesforceLoader | Load from Salesforce using an Airbyte source connector. | ✅ | ❌ |
| AirbyteShopifyLoader | Load from Shopify using an Airbyte source connector. | ✅ | ❌ |
| AirbyteStripeLoader | Load from Stripe using an Airbyte source connector. | ✅ | ❌ |
| AirbyteTypeformLoader | Load from Typeform using an Airbyte source connector. | ✅ | ❌ |
| AirbyteZendeskSupportLoader | Load from Zendesk Support using an Airbyte source connector. | ✅ | ❌ |
| AirtableLoader | Load Airtable tables. | ✅ | ❌ |
| AmazonTextractPDFLoader | Load PDF files from a local file system, HTTP or S3. | ✅ | ❌ |
| ApifyDatasetLoader | Load datasets from Apify web scraping, crawling, and data extraction platform. | ❌ | ❌ |
| ArcGISLoader | Load records from an ArcGIS FeatureLayer. | ✅ | ❌ |
| ArxivLoader | Load a query result from Arxiv. | ✅ | ❌ |
| AssemblyAIAudioLoaderById | | ✅ | ❌ |
| AssemblyAIAudioTranscriptLoader | Load AssemblyAI audio transcripts. | ✅ | ❌ |
| AstraDBLoader | [Deprecated] | ✅ | ✅ |
| AsyncChromiumLoader | Scrape HTML pages from URLs using a headless instance of Chromium. | ✅ | ✅ |
| AsyncHtmlLoader | Load HTML asynchronously. | ✅ | ✅ |
| AthenaLoader | Load documents from AWS Athena. | ✅ | ❌ |
| AzureAIDataLoader | Load from Azure AI Data. | ✅ | ❌ |
| AzureAIDocumentIntelligenceLoader | Load a PDF with Azure Document Intelligence. | ✅ | ❌ |
| AzureBlobStorageContainerLoader | Load from Azure Blob Storage container. | ❌ | ❌ |
| AzureBlobStorageFileLoader | Load from Azure Blob Storage files. | ❌ | ❌ |
| BSHTMLLoader | Load HTML files and parse them with beautiful soup. | ✅ | ❌ |
| BibtexLoader | Load a bibtex file. | ✅ | ❌ |
| BigQueryLoader | [Deprecated] Load from Google Cloud Platform BigQuery. | ❌ | ❌ |
| BiliBiliLoader | | ❌ | ❌ |
| BlackboardLoader | Load a Blackboard course. | ✅ | ✅ |
| BlockchainDocumentLoader | Load elements from a blockchain smart contract. | ❌ | ❌ |
| BraveSearchLoader | Load with Brave Search engine. | ✅ | ❌ |
| BrowserbaseLoader | Load pre-rendered web pages using a headless browser hosted on Browserbase. | ✅ | ❌ |
| BrowserlessLoader | Load webpages with Browserless /content endpoint. | ✅ | ❌ |
| CSVLoader | Load a CSV file into a list of Documents. | ✅ | ❌ |
| CassandraLoader | | ✅ | ✅ |
| ChatGPTLoader | Load conversations from exported ChatGPT data. | ❌ | ❌ |
| CoNLLULoader | Load CoNLL-U files. | ❌ | ❌ |
| CollegeConfidentialLoader | Load College Confidential webpages. | ✅ | ✅ |
| ConcurrentLoader | Load and parse Documents concurrently. | ✅ | ❌ |
| ConfluenceLoader | Load Confluence pages. | ✅ | ❌ |
| CouchbaseLoader | Load documents from Couchbase. | ✅ | ❌ |
| CubeSemanticLoader | Load Cube semantic layer metadata. | ✅ | ❌ |
| DataFrameLoader | Load Pandas DataFrame. | ✅ | ❌ |
| DatadogLogsLoader | Load Datadog logs. | ❌ | ❌ |
| DiffbotLoader | Load Diffbot JSON file. | ❌ | ❌ |
| DirectoryLoader | Load from a directory. | ✅ | ❌ |
| DiscordChatLoader | Load Discord chat logs. | ❌ | ❌ |
| DocugamiLoader | [Deprecated] Load from Docugami. | ❌ | ❌ |
| DocusaurusLoader | Load from Docusaurus Documentation. | ✅ | ✅ |
| Docx2txtLoader | Load a DOCX file using docx2txt and chunk at character level. | ❌ | ❌ |
| DropboxLoader | Load files from Dropbox. | ❌ | ❌ |
| DuckDBLoader | Load from DuckDB. | ❌ | ❌ |
| EtherscanLoader | Load transactions from Ethereum mainnet. | ✅ | ❌ |
| EverNoteLoader | Load from EverNote. | ✅ | ❌ |
| FacebookChatLoader | Load Facebook Chat messages directory dump. | ✅ | ❌ |
| FaunaLoader | Load from FaunaDB. | ✅ | ❌ |
| FigmaFileLoader | Load Figma file. | ❌ | ❌ |
| FireCrawlLoader | Load web pages as Documents using FireCrawl. | ✅ | ❌ |
| GCSDirectoryLoader | [Deprecated] Load from GCS directory. | ❌ | ❌ |
| GCSFileLoader | [Deprecated] Load from GCS file. | ❌ | ❌ |
| GeoDataFrameLoader | Load geopandas Dataframe. | ✅ | ❌ |
| GitHubIssuesLoader | Load issues of a GitHub repository. | ✅ | ❌ |
| GitLoader | Load Git repository files. | ✅ | ❌ |
| GitbookLoader | Load GitBook data. | ✅ | ✅ |
| GithubFileLoader | Load a GitHub file. | ✅ | ❌ |
| GlueCatalogLoader | Load table schemas from AWS Glue. | ✅ | ❌ |
| GoogleApiYoutubeLoader | Load all Videos from a YouTube Channel. | ❌ | ❌ |
| GoogleDriveLoader | [Deprecated] Load Google Docs from Google Drive. | ❌ | ❌ |
| GoogleSpeechToTextLoader | [Deprecated] Loader for Google Cloud Speech-to-Text audio transcripts. | ❌ | ❌ |
| GutenbergLoader | Load from Gutenberg.org. | ❌ | ❌ |
| HNLoader | Load Hacker News data. | ✅ | ✅ |
| HuggingFaceDatasetLoader | Load from Hugging Face Hub datasets. | ✅ | ❌ |
| HuggingFaceModelLoader | | ✅ | ❌ |
| IFixitLoader | Load iFixit repair guides, device wikis and answers. | ❌ | ❌ |
| IMSDbLoader | Load IMSDb webpages. | ✅ | ✅ |
| ImageCaptionLoader | Load image captions. | ❌ | ❌ |
| IuguLoader | Load from IUGU. | ❌ | ❌ |
| JSONLoader | | ✅ | ❌ |
| JoplinLoader | Load notes from Joplin. | ✅ | ❌ |
| KineticaLoader | Load from Kinetica API. | ✅ | ❌ |
| LLMSherpaFileLoader | Load Documents using LLMSherpa. | ✅ | ❌ |
| LakeFSLoader | Load from lakeFS. | ❌ | ❌ |
| LarkSuiteDocLoader | Load from LarkSuite (FeiShu). | ✅ | ❌ |
| MHTMLLoader | Parse MHTML files with BeautifulSoup. | ✅ | ❌ |
| MWDumpLoader | Load MediaWiki dump from an XML file. | ✅ | ❌ |
| MastodonTootsLoader | Load the Mastodon 'toots'. | ✅ | ❌ |
| MathpixPDFLoader | Load PDF files using Mathpix service. | ❌ | ❌ |
| MaxComputeLoader | Load from Alibaba Cloud MaxCompute table. | ✅ | ❌ |
| MergedDataLoader | Merge documents from a list of loaders. | ✅ | ✅ |
| ModernTreasuryLoader | Load from Modern Treasury. | ❌ | ❌ |
| MongodbLoader | Load MongoDB documents. | ❌ | ✅ |
| NewsURLLoader | Load news articles from URLs using Unstructured. | ✅ | ❌ |
| NotebookLoader | Load Jupyter notebook (.ipynb) files. | ❌ | ❌ |
| NotionDBLoader | Load from Notion DB. | ❌ | ❌ |
| NotionDirectoryLoader | Load Notion directory dump. | ❌ | ❌ |
| OBSDirectoryLoader | Load from Huawei OBS directory. | ❌ | ❌ |
| OBSFileLoader | Load from a Huawei OBS file. | ❌ | ❌ |
| ObsidianLoader | Load Obsidian files from directory. | ✅ | ❌ |
| OneDriveFileLoader | Load a file from Microsoft OneDrive. | ❌ | ❌ |
| OneDriveLoader | Load from Microsoft OneDrive. | ✅ | ❌ |
| OnlinePDFLoader | Load online PDF. | ❌ | ❌ |
| OpenCityDataLoader | Load from Open City. | ✅ | ❌ |
| OracleAutonomousDatabaseLoader | | ❌ | ❌ |
| OracleDocLoader | Read documents using OracleDocLoader. | ❌ | ❌ |
| OutlookMessageLoader | | ✅ | ❌ |
| PDFMinerLoader | Load PDF files using PDFMiner. | ✅ | ❌ |
| PDFMinerPDFasHTMLLoader | Load PDF files as HTML content using PDFMiner. | ✅ | ❌ |
| PDFPlumberLoader | Load PDF files using pdfplumber. | ❌ | ❌ |
| PagedPDFSplitter | Load PDF using pypdf into list of documents. | ✅ | ❌ |
| PebbloSafeLoader | Pebblo Safe Loader class is a wrapper around document loaders enabling the data to be scrutinized. | ✅ | ❌ |
| PlaywrightURLLoader | Load HTML pages with Playwright and parse with Unstructured. | ✅ | ✅ |
| PolarsDataFrameLoader | Load Polars DataFrame. | ✅ | ❌ |
| PsychicLoader | Load from Psychic.dev. | ✅ | ❌ |
| PubMedLoader | Load from the PubMed biomedical library. | ✅ | ❌ |
| PyMuPDFLoader | Load PDF files using PyMuPDF. | ✅ | ❌ |
| PyPDFDirectoryLoader | Load a directory of PDF files using pypdf and chunk at character level. | ❌ | ❌ |
| PyPDFLoader | Load PDF using pypdf into list of documents. | ✅ | ❌ |
| PyPDFium2Loader | Load PDF using pypdfium2 and chunk at character level. | ✅ | ❌ |
| PySparkDataFrameLoader | Load PySpark DataFrames. | ✅ | ❌ |
| PythonLoader | Load Python files, respecting any non-default encoding if specified. | ✅ | ❌ |
| RSSFeedLoader | Load news articles from RSS feeds using Unstructured. | ✅ | ❌ |
| ReadTheDocsLoader | Load ReadTheDocs documentation directory. | ✅ | ❌ |
| RecursiveUrlLoader | Recursively load all child links from a root URL. | ✅ | ❌ |
| RedditPostsLoader | Load Reddit posts. | ❌ | ❌ |
| RoamLoader | Load Roam files from a directory. | ❌ | ❌ |
| RocksetLoader | Load from a Rockset database. | ✅ | ❌ |
| S3DirectoryLoader | Load from Amazon AWS S3 directory. | ❌ | ❌ |
| S3FileLoader | Load from Amazon AWS S3 file. | ✅ | ❌ |
| SQLDatabaseLoader | | ✅ | ❌ |
| SRTLoader | Load .srt (subtitle) files. | ❌ | ❌ |
| ScrapflyLoader | Turn a URL into LLM-accessible Markdown with Scrapfly.io. | ✅ | ❌ |
| SeleniumURLLoader | Load HTML pages with Selenium and parse with Unstructured. | ❌ | ❌ |
| SharePointLoader | Load from SharePoint. | ✅ | ❌ |
| SitemapLoader | Load a sitemap and its URLs. | ✅ | ✅ |
| SlackDirectoryLoader | Load from a Slack directory dump. | ✅ | ❌ |
| SnowflakeLoader | Load from Snowflake API. | ✅ | ❌ |
| SpiderLoader | Load web pages as Documents using Spider AI. | ✅ | ❌ |
| SpreedlyLoader | Load from Spreedly API. | ❌ | ❌ |
| StripeLoader | Load from Stripe API. | ❌ | ❌ |
| SurrealDBLoader | Load SurrealDB documents. | ❌ | ✅ |
| TelegramChatApiLoader | Load Telegram chat JSON directory dump. | ❌ | ❌ |
| TelegramChatFileLoader | Load from Telegram chat dump. | ❌ | ❌ |
| TelegramChatLoader | Load from Telegram chat dump. | ❌ | ❌ |
| TencentCOSDirectoryLoader | Load from Tencent Cloud COS directory. | ✅ | ❌ |
| TencentCOSFileLoader | Load from Tencent Cloud COS file. | ✅ | ❌ |
| TensorflowDatasetLoader | Load from TensorFlow Dataset. | ✅ | ❌ |
| TextLoader | Load text file. | ✅ | ❌ |
| TiDBLoader | Load documents from TiDB. | ✅ | ❌ |
| ToMarkdownLoader | Load HTML using 2markdown API. | ✅ | ❌ |
| TomlLoader | Load TOML files. | ✅ | ❌ |
| TrelloLoader | Load cards from a Trello board. | ✅ | ❌ |
| TwitterTweetLoader | Load Twitter tweets. | ❌ | ❌ |
| UnstructuredAPIFileIOLoader | Load files using Unstructured API. | ✅ | ❌ |
| UnstructuredAPIFileLoader | Load files using Unstructured API. | ✅ | ❌ |
| UnstructuredCHMLoader | Load CHM files using Unstructured. | ✅ | ❌ |
| UnstructuredCSVLoader | Load CSV files using Unstructured. | ✅ | ❌ |
| UnstructuredEPubLoader | Load EPub files using Unstructured. | ✅ | ❌ |
| UnstructuredEmailLoader | Load email files using Unstructured. | ✅ | ❌ |
| UnstructuredExcelLoader | Load Microsoft Excel files using Unstructured. | ✅ | ❌ |
| UnstructuredFileIOLoader | Load files using Unstructured. | ✅ | ❌ |
| UnstructuredFileLoader | Load files using Unstructured. | ✅ | ❌ |
| UnstructuredHTMLLoader | Load HTML files using Unstructured. | ✅ | ❌ |
| UnstructuredImageLoader | Load PNG and JPG files using Unstructured. | ✅ | ❌ |
| UnstructuredMarkdownLoader | Load Markdown files using Unstructured. | ✅ | ❌ |
| UnstructuredODTLoader | Load OpenOffice ODT files using Unstructured. | ✅ | ❌ |
| UnstructuredOrgModeLoader | Load Org-Mode files using Unstructured. | ✅ | ❌ |
| UnstructuredPDFLoader | Load PDF files using Unstructured. | ✅ | ❌ |
| UnstructuredPowerPointLoader | Load Microsoft PowerPoint files using Unstructured. | ✅ | ❌ |
| UnstructuredRSTLoader | Load RST files using Unstructured. | ✅ | ❌ |
| UnstructuredRTFLoader | Load RTF files using Unstructured. | ✅ | ❌ |
| UnstructuredTSVLoader | Load TSV files using Unstructured. | ✅ | ❌ |
| UnstructuredURLLoader | Load files from remote URLs using Unstructured. | ❌ | ❌ |
| UnstructuredWordDocumentLoader | Load Microsoft Word file using Unstructured. | ✅ | ❌ |
| UnstructuredXMLLoader | Load XML file using Unstructured. | ✅ | ❌ |
| VsdxLoader | | ❌ | ❌ |
| WeatherDataLoader | Load weather data with Open Weather Map API. | ✅ | ❌ |
| WebBaseLoader | Load HTML pages using urllib and parse them with `BeautifulSoup`. | ✅ | ✅ |
| WhatsAppChatLoader | Load WhatsApp messages text file. | ✅ | ❌ |
| WikipediaLoader | Load from Wikipedia. | ✅ | ❌ |
| XorbitsLoader | Load Xorbits DataFrame. | ✅ | ❌ |
| YoutubeLoader | Load YouTube video transcripts. | ❌ | ❌ |
| YuqueLoader | Load documents from Yuque. | ❌ | ❌ |
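
To make the two feature columns concrete, here is a minimal sketch of what "lazy loading" and "native async support" could look like at the interface level. It uses only the standard library: `Document`, `LineLoader`, and `collect` are hypothetical stand-ins written for this illustration, not classes from the library. The method names `load`, `lazy_load`, and `alazy_load` mirror those on the base loader interface, and the reading of "native async" as "the loader implements the async method itself" is an assumption.

```python
import asyncio
from dataclasses import dataclass, field
from typing import AsyncIterator, Iterator, List


@dataclass
class Document:
    """Hypothetical stand-in for a document: text plus metadata."""
    page_content: str
    metadata: dict = field(default_factory=dict)


class LineLoader:
    """Hypothetical loader that turns each line of a string into a Document."""

    def __init__(self, text: str) -> None:
        self.text = text

    def lazy_load(self) -> Iterator[Document]:
        # Lazy loading: yield one Document at a time instead of building a list.
        for number, line in enumerate(self.text.splitlines(), start=1):
            yield Document(line, {"line": number})

    def load(self) -> List[Document]:
        # Eager loading: materialize everything lazy_load would yield.
        return list(self.lazy_load())

    async def alazy_load(self) -> AsyncIterator[Document]:
        # Async variant implemented by the loader itself: an async generator
        # that hands control back to the event loop between documents.
        for doc in self.lazy_load():
            await asyncio.sleep(0)
            yield doc


async def collect(loader: LineLoader) -> List[Document]:
    # Consume the async generator with `async for`.
    return [doc async for doc in loader.alazy_load()]


loader = LineLoader("alpha\nbeta\ngamma")
first = next(loader.lazy_load())           # processes only the first line
all_docs = loader.load()                   # processes all three lines
async_docs = asyncio.run(collect(loader))  # same documents, via the async path
```

In this sketch, a loader marked ✅ for lazy loading corresponds to one that provides a real `lazy_load` generator, so callers can stop early or stream large sources without holding every document in memory.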