
Internet Archive Research Publication Crawls

Internet Archive Web Group

A series of open web crawls targeting journal articles, technical memos, essays, datasets, and other research publications.




20,353 results


UNPAYWALL-PDF-CRAWL-2018-07
collection; 1,241 items; 6.8M views; by Internet Archive Web Group
Web archive data from a crawl of open access PDF URLs provided by Unpaywall.
Open Access Journal Test Crawl (2018)
collection; 794 items; 5.6M views; by Internet Archive Web Group
MSAG-PDF-CRAWL-2017
collection; 1,855 items; 6.3M views; by Internet Archive Web Group
Microsoft Academic Graph public corpus (Feb 2016) PDF URLs, filtered to remove large sites (pubmed, citeseerx, arxiv) and already-crawled URLs.
Topics: papers, journals
UNPAYWALL-PDF-CRAWL-2019-04
collection; 641 items; 1.1M views; by Internet Archive Web Group
MAG-PDF-CRAWL-2020-03
collection; 489 items; 1.2M views; by Internet Archive Web Group
OA-DOI-CRAWL-2020-02
collection; 278 items; 1M views; by Internet Archive Web Group
OA-JOURNAL-CRAWL-2020-07
collection; 1,923 items; 4M views; by Internet Archive Web Group
DATACITE-DOI-CRAWL-2020-01
collection; 1,417 items; 1.5M views; by Internet Archive Web Group
Wide Web Targeted PDF Crawling (2017)
collection; 922 items; 1.4M views; by Internet Archive Web Group
UNPAYWALL-PDF-CRAWL-2020-05
collection; 282 items; 482,165 views; by Internet Archive Web Group
UNPAYWALL-PDF-CRAWL-2020-11
collection; 199 items; 470,278 views; by Internet Archive Web Group
UNPAYWALL-PDF-CRAWL-2020-03
collection; 344 items; 567,246 views; by Internet Archive Web Group
OAI-PMH-CRAWL-2020-06
collection; 2,946 items; 1.4M views; by Internet Archive Web Group
MAG-PDF-CRAWL-2020-07
collection; 196 items; 498,609 views; by Internet Archive Web Group
OA-DOI-CRAWL-2020-12
collection; 191 items; 455,162 views; by Internet Archive Web Group
DIRECT-OA-CRAWL-2019
collection; 2,566 items; 1.3M views; by Internet Archive Web Group
DOI-LANDING-CRAWL-2018-06
collection; 279 items; 1.9M views; by Internet Archive Web Group
SEMSCHOLAR-DIRECT-PDF-CRAWL-2020-02
collection; 1,011 items; 411,896 views; by Internet Archive Web Group
DOAJ-CRAWL-2020-11
collection; 102 items; 298,462 views; by Internet Archive Web Group
UNPAYWALL-PDF-CRAWL-2021-05
collection; 123 items; 178,169 views; by Internet Archive Web Group
CiteSeerX URL Crawl 2017
collection; 206 items; 600,333 views
A targeted crawl to fetch research publications from the public web which have been crawled by CiteSeerX but have not previously been crawled by the Internet Archive.
Topics: scholarly, papers, journal
OA-JOURNAL-CRAWL-2019-08
collection; 201 items; 1.6M views; by Internet Archive Web Group
CORE-UPSTREAM-CRAWL-2018-11
collection; 741 items; 442,094 views; by Internet Archive Web Group
Crawl of "upstream" URLs from CORE (core.ac.uk) metadata dump. Only a partial seedlist of files crawled.
collection; 1.2M views
Internet Archive crawl of PDF URLs provided by Semantic Scholar.
Topic: pdf
UNPAYWALL-PDF-CRAWL-2021-07
collection; 174 items; 5,658 views
SCIELO-CRAWL-2020-07
collection; 41 items; 83,604 views; by Internet Archive Web Group
OA-JOURNAL-CRAWL-2020-07
web; 88,506 views; 0 favorites; 0 comments
Internet Archive crawldata of scholarly web journal content captured by wbgrp-svc282.us.archive.org:OA-JOURNAL-CRAWL-2020-07 from Sun Aug 2 19:00:58 PDT 2020 to Sun Aug 2 13:24:24 PDT 2020.
Topic: crawldata
DOAJ-CRAWL-2020-11
web; 31,262 views; 0 favorites; 0 comments
Internet Archive crawldata of scholarly web landing page content captured by wbgrp-svc279.us.archive.org:DOAJ-CRAWL-2020-11 from Tue Nov 24 17:59:21 PST 2020 to Tue Nov 24 11:43:19 PST 2020.
Topic: crawldata
Internet Archive crawldata of web PDF content captured by wbgrp-svc284.us.archive.org:OA-JOURNAL-TESTCRAWL-TWO-2018 from Sat Apr 21 00:18:07 PDT 2018 to Fri Apr 20 18:04:30 PDT 2018.
Topic: crawldata
UNPAYWALL-PDF-CRAWL-2020-05
web; 3,738 views; 0 favorites; 0 comments
Internet Archive crawldata of scholarly web landing page content captured by wbgrp-svc279.us.archive.org:UNPAYWALL-PDF-CRAWL-2020-05 from Thu May 7 12:56:01 PDT 2020 to Thu May 7 06:36:12 PDT 2020.
Topic: crawldata
UNPAYWALL-PDF-CRAWL-2020-05
web; 3,580 views; 0 favorites; 0 comments
Internet Archive crawldata of scholarly web landing page content captured by wbgrp-svc279.us.archive.org:UNPAYWALL-PDF-CRAWL-2020-05 from Thu May 7 13:34:04 PDT 2020 to Thu May 7 07:16:48 PDT 2020.
Topic: crawldata
Internet Archive crawldata of web PDF content captured by wbgrp-svc284.us.archive.org:OA-JOURNAL-TESTCRAWL-TWO-2018 from Sat Apr 21 10:29:39 PDT 2018 to Sat Apr 21 04:08:43 PDT 2018.
Topic: crawldata
UNPAYWALL-PDF-CRAWL-2020-05
web; 3,406 views; 0 favorites; 0 comments
Internet Archive crawldata of scholarly web landing page content captured by wbgrp-svc279.us.archive.org:UNPAYWALL-PDF-CRAWL-2020-05 from Thu May 7 11:44:01 PDT 2020 to Thu May 7 05:22:43 PDT 2020.
Topic: crawldata
UNPAYWALL-PDF-CRAWL-2020-05
web; 3,488 views; 0 favorites; 0 comments
Internet Archive crawldata of scholarly web landing page content captured by wbgrp-svc279.us.archive.org:UNPAYWALL-PDF-CRAWL-2020-05 from Thu May 7 14:14:33 PDT 2020 to Thu May 7 07:51:46 PDT 2020.
Topic: crawldata
UNPAYWALL-PDF-CRAWL-2020-05
web; 3,436 views; 0 favorites; 0 comments
Internet Archive crawldata of scholarly web landing page content captured by wbgrp-svc279.us.archive.org:UNPAYWALL-PDF-CRAWL-2020-05 from Thu May 7 10:32:44 PDT 2020 to Thu May 7 04:10:23 PDT 2020.
Topic: crawldata
UNPAYWALL-PDF-CRAWL-2020-05
web; 3,384 views; 0 favorites; 0 comments
Internet Archive crawldata of scholarly web landing page content captured by wbgrp-svc279.us.archive.org:UNPAYWALL-PDF-CRAWL-2020-05 from Thu May 7 15:23:28 PDT 2020 to Thu May 7 08:58:05 PDT 2020.
Topic: crawldata
UNPAYWALL-PDF-CRAWL-2020-05
web; 3,194 views; 0 favorites; 0 comments
Internet Archive crawldata of scholarly web landing page content captured by wbgrp-svc279.us.archive.org:UNPAYWALL-PDF-CRAWL-2020-05 from Thu May 7 09:15:41 PDT 2020 to Thu May 7 02:54:05 PDT 2020.
Topic: crawldata
UNPAYWALL-PDF-CRAWL-2020-05
web; 3,415 views; 0 favorites; 0 comments
Internet Archive crawldata of scholarly web landing page content captured by wbgrp-svc279.us.archive.org:UNPAYWALL-PDF-CRAWL-2020-05 from Thu May 7 12:21:19 PDT 2020 to Thu May 7 05:58:01 PDT 2020.
Topic: crawldata
UNPAYWALL-PDF-CRAWL-2020-05
web; 3,388 views; 0 favorites; 0 comments
Internet Archive crawldata of scholarly web landing page content captured by wbgrp-svc279.us.archive.org:UNPAYWALL-PDF-CRAWL-2020-05 from Thu May 7 15:56:21 PDT 2020 to Thu May 7 09:37:18 PDT 2020.
Topic: crawldata
UNPAYWALL-PDF-CRAWL-2020-05
web; 3,412 views; 0 favorites; 0 comments
Internet Archive crawldata of scholarly web landing page content captured by wbgrp-svc279.us.archive.org:UNPAYWALL-PDF-CRAWL-2020-05 from Thu May 7 09:53:14 PDT 2020 to Thu May 7 03:33:58 PDT 2020.
Topic: crawldata
UNPAYWALL-PDF-CRAWL-2020-05
web; 3,236 views; 0 favorites; 0 comments
Internet Archive crawldata of scholarly web landing page content captured by wbgrp-svc279.us.archive.org:UNPAYWALL-PDF-CRAWL-2020-05 from Thu May 7 11:09:02 PDT 2020 to Thu May 7 04:45:56 PDT 2020.
Topic: crawldata
UNPAYWALL-PDF-CRAWL-2020-05
web; 5,142 views; 0 favorites; 0 comments
Internet Archive crawldata of scholarly web landing page content captured by wbgrp-svc279.us.archive.org:UNPAYWALL-PDF-CRAWL-2020-05 from Thu May 7 14:49:55 PDT 2020 to Thu May 7 08:24:42 PDT 2020.
Topic: crawldata
UNPAYWALL-PDF-CRAWL-2020-11
web; 3,916 views; 0 favorites; 0 comments
Internet Archive crawldata of scholarly web landing page content captured by wbgrp-svc279.us.archive.org:UNPAYWALL-PDF-CRAWL-2020-11 from Fri Nov 6 13:50:50 PST 2020 to Fri Nov 6 06:21:56 PST 2020.
Topic: crawldata
UNPAYWALL-PDF-CRAWL-2020-11
web; 3,487 views; 0 favorites; 0 comments
Internet Archive crawldata of scholarly web landing page content captured by wbgrp-svc279.us.archive.org:UNPAYWALL-PDF-CRAWL-2020-11 from Fri Nov 6 12:48:46 PST 2020 to Fri Nov 6 05:19:40 PST 2020.
Topic: crawldata
OA-DOI-CRAWL-2020-12
web; 3,107 views; 0 favorites; 0 comments
Internet Archive crawldata of scholarly web landing page content captured by wbgrp-svc279.us.archive.org:OA-DOI-CRAWL-2020-12 from Mon Dec 21 21:37:22 PST 2020 to Tue Dec 22 02:37:32 PST 2020.
Topic: crawldata
UNPAYWALL-PDF-CRAWL-2020-05
web; 3,252 views; 0 favorites; 0 comments
Internet Archive crawldata of scholarly web landing page content captured by wbgrp-svc279.us.archive.org:UNPAYWALL-PDF-CRAWL-2020-05 from Thu May 7 08:34:18 PDT 2020 to Thu May 7 02:16:10 PDT 2020.
Topic: crawldata
DOAJ-CRAWL-2020-11
web; 4,813 views; 0 favorites; 0 comments
Internet Archive crawldata of scholarly web landing page content captured by wbgrp-svc279.us.archive.org:DOAJ-CRAWL-2020-11 from Tue Nov 24 21:22:12 PST 2020 to Tue Nov 24 15:21:23 PST 2020.
Topic: crawldata
UNPAYWALL-PDF-CRAWL-2020-05
web; 3,037 views; 0 favorites; 0 comments
Internet Archive crawldata of scholarly web landing page content captured by wbgrp-svc279.us.archive.org:UNPAYWALL-PDF-CRAWL-2020-05 from Thu May 7 07:51:56 PDT 2020 to Thu May 7 01:35:46 PDT 2020.
Topic: crawldata
UNPAYWALL-PDF-CRAWL-2020-05
web; 3,281 views; 0 favorites; 0 comments
Internet Archive crawldata of scholarly web landing page content captured by wbgrp-svc279.us.archive.org:UNPAYWALL-PDF-CRAWL-2020-05 from Thu May 7 05:11:32 PDT 2020 to Wed May 6 22:48:13 PDT 2020.
Topic: crawldata
UNPAYWALL-PDF-CRAWL-2020-05
web; 3,109 views; 0 favorites; 0 comments
Internet Archive crawldata of scholarly web landing page content captured by wbgrp-svc279.us.archive.org:UNPAYWALL-PDF-CRAWL-2020-05 from Thu May 7 03:49:43 PDT 2020 to Wed May 6 21:31:25 PDT 2020.
Topic: crawldata
UNPAYWALL-PDF-CRAWL-2020-05
web; 3,060 views; 0 favorites; 0 comments
Internet Archive crawldata of scholarly web landing page content captured by wbgrp-svc279.us.archive.org:UNPAYWALL-PDF-CRAWL-2020-05 from Thu May 7 04:30:51 PDT 2020 to Wed May 6 22:13:10 PDT 2020.
Topic: crawldata
UNPAYWALL-PDF-CRAWL-2020-05
web; 2,870 views; 0 favorites; 0 comments
Internet Archive crawldata of scholarly web landing page content captured by wbgrp-svc279.us.archive.org:UNPAYWALL-PDF-CRAWL-2020-05 from Thu May 7 01:03:22 PDT 2020 to Wed May 6 18:48:21 PDT 2020.
Topic: crawldata
Wide Web Targeted PDF Crawling (2017)
web; 7,752 views; 0 favorites; 0 comments
Internet Archive crawldata of uncrawled Semantic Scholar seedlist PDF URLs captured by wbgrp-svc285.us.archive.org:TARGETED-PDF-CRAWL-2017 from Fri Sep 22 23:47:42 PDT 2017 to Fri Sep 22 17:06:59 PDT 2017.
Topic: crawldata
OA-DOI-CRAWL-2020-02
web; 13,665 views; 0 favorites; 0 comments
Internet Archive crawldata of scholarly web landing page content captured by wbgrp-svc206.us.archive.org:OA-DOI-CRAWL-2020-02 from Fri Feb 7 19:03:43 PST 2020 to Fri Feb 7 12:06:54 PST 2020.
Topic: crawldata
UNPAYWALL-PDF-CRAWL-2020-05
web; 3,187 views; 0 favorites; 0 comments
Internet Archive crawldata of scholarly web landing page content captured by wbgrp-svc279.us.archive.org:UNPAYWALL-PDF-CRAWL-2020-05 from Thu May 7 00:20:35 PDT 2020 to Wed May 6 18:05:50 PDT 2020.
Topic: crawldata
Wide Web Targeted PDF Crawling (2017)
web; 9,120 views; 0 favorites; 0 comments
Internet Archive crawldata of uncrawled Semantic Scholar seedlist PDF URLs captured by wbgrp-svc285.us.archive.org:TARGETED-PDF-CRAWL-2017 from Sat Sep 23 00:02:35 PDT 2017 to Fri Sep 22 17:22:58 PDT 2017.
Topic: crawldata
Internet Archive crawldata of web PDF content captured by wbgrp-svc284.us.archive.org:OA-JOURNAL-TESTCRAWL-TWO-2018 from Tue Apr 10 08:37:31 PDT 2018 to Tue Apr 10 02:31:06 PDT 2018.
Topic: crawldata
UNPAYWALL-PDF-CRAWL-2020-05
web; 3,073 views; 0 favorites; 0 comments
Internet Archive crawldata of scholarly web landing page content captured by wbgrp-svc279.us.archive.org:UNPAYWALL-PDF-CRAWL-2020-05 from Thu May 7 06:27:14 PDT 2020 to Thu May 7 00:09:54 PDT 2020.
Topic: crawldata
UNPAYWALL-PDF-CRAWL-2020-05
web; 3,029 views; 0 favorites; 0 comments
Internet Archive crawldata of scholarly web landing page content captured by wbgrp-svc279.us.archive.org:UNPAYWALL-PDF-CRAWL-2020-05 from Thu May 7 05:46:20 PDT 2020 to Wed May 6 23:28:28 PDT 2020.
Topic: crawldata
Wide Web Targeted PDF Crawling (2017)
web; 10,662 views; 0 favorites; 0 comments
Internet Archive crawldata of web PDF content captured by wbgrp-svc284.us.archive.org:TARGETED-PDF-CRAWL-2017 from Fri Sep 22 01:03:29 PDT 2017 to Thu Sep 21 18:23:52 PDT 2017.
Topic: crawldata
Wide Web Targeted PDF Crawling (2017)
web; 5,187 views; 0 favorites; 0 comments
Internet Archive crawldata of uncrawled Semantic Scholar seedlist PDF URLs captured by wbgrp-svc285.us.archive.org:TARGETED-PDF-CRAWL-2017 from Sat Sep 23 00:18:26 PDT 2017 to Fri Sep 22 17:37:31 PDT 2017.
Topic: crawldata
Internet Archive crawldata of web PDF content captured by wbgrp-svc284.us.archive.org:OA-JOURNAL-TESTCRAWL-TWO-2018 from Mon Apr 9 22:02:31 PDT 2018 to Mon Apr 9 16:01:16 PDT 2018.
Topic: crawldata
Internet Archive crawldata of web PDF content captured by wbgrp-svc284.us.archive.org:OA-JOURNAL-TESTCRAWL-TWO-2018 from Tue Apr 10 17:40:25 PDT 2018 to Tue Apr 10 11:40:41 PDT 2018.
Topic: crawldata
DOAJ-CRAWL-2020-11
web; 4,061 views; 0 favorites; 0 comments
Internet Archive crawldata of scholarly web landing page content captured by wbgrp-svc279.us.archive.org:DOAJ-CRAWL-2020-11 from Tue Nov 24 19:39:57 PST 2020 to Tue Nov 24 13:28:27 PST 2020.
Topic: crawldata
UNPAYWALL-PDF-CRAWL-2020-05
web; 2,857 views; 0 favorites; 0 comments
Internet Archive crawldata of scholarly web landing page content captured by wbgrp-svc279.us.archive.org:UNPAYWALL-PDF-CRAWL-2020-05 from Thu May 7 02:26:36 PDT 2020 to Wed May 6 20:10:05 PDT 2020.
Topic: crawldata
DOAJ-CRAWL-2020-11
web; 5,693 views; 0 favorites; 0 comments
Internet Archive crawldata of scholarly web landing page content captured by wbgrp-svc279.us.archive.org:DOAJ-CRAWL-2020-11 from Tue Nov 24 10:05:43 PST 2020 to Tue Nov 24 03:46:40 PST 2020.
Topic: crawldata
DOAJ-CRAWL-2020-11
web; 3,857 views; 0 favorites; 0 comments
Internet Archive crawldata of scholarly web landing page content captured by wbgrp-svc279.us.archive.org:DOAJ-CRAWL-2020-11 from Wed Nov 25 08:38:18 PST 2020 to Wed Nov 25 02:12:58 PST 2020.
Topic: crawldata
DOAJ-CRAWL-2020-11
web; 21,599 views; 0 favorites; 0 comments
Internet Archive crawldata of scholarly web landing page content captured by wbgrp-svc279.us.archive.org:DOAJ-CRAWL-2020-11 from Tue Nov 24 08:37:04 PST 2020 to Tue Nov 24 02:09:27 PST 2020.
Topic: crawldata
OA-DOI-CRAWL-2020-12
web; 2,297 views; 0 favorites; 0 comments
Internet Archive crawldata of scholarly web landing page content captured by wbgrp-svc279.us.archive.org:OA-DOI-CRAWL-2020-12 from Mon Dec 21 00:53:11 PST 2020 to Mon Dec 21 13:37:20 PST 2020.
Topic: crawldata
DOAJ-CRAWL-2020-11
web; 6,104 views; 0 favorites; 0 comments
Internet Archive crawldata of scholarly web landing page content captured by wbgrp-svc279.us.archive.org:DOAJ-CRAWL-2020-11 from Wed Nov 25 04:05:43 PST 2020 to Tue Nov 24 21:38:53 PST 2020.
Topic: crawldata
DOAJ-CRAWL-2020-11
web; 4,584 views; 0 favorites; 0 comments
Internet Archive crawldata of scholarly web landing page content captured by wbgrp-svc279.us.archive.org:DOAJ-CRAWL-2020-11 from Tue Nov 24 16:11:54 PST 2020 to Tue Nov 24 10:05:02 PST 2020.
Topic: crawldata
UNPAYWALL-PDF-CRAWL-2020-11
web; 3,154 views; 0 favorites; 0 comments
Internet Archive crawldata of scholarly web landing page content captured by wbgrp-svc279.us.archive.org:UNPAYWALL-PDF-CRAWL-2020-11 from Fri Nov 6 13:18:09 PST 2020 to Fri Nov 6 05:52:09 PST 2020.
Topic: crawldata
UNPAYWALL-PDF-CRAWL-2020-05
web; 2,927 views; 0 favorites; 0 comments
Internet Archive crawldata of scholarly web landing page content captured by wbgrp-svc279.us.archive.org:UNPAYWALL-PDF-CRAWL-2020-05 from Thu May 7 03:07:42 PDT 2020 to Wed May 6 20:51:21 PDT 2020.
Topic: crawldata
Internet Archive crawldata of web PDF content captured by wbgrp-svc284.us.archive.org:OA-JOURNAL-TESTCRAWL-TWO-2018 from Wed Apr 11 11:59:16 PDT 2018 to Wed Apr 11 06:39:18 PDT 2018.
Topic: crawldata
UNPAYWALL-PDF-CRAWL-2020-11
web; 2,757 views; 0 favorites; 0 comments
Internet Archive crawldata of scholarly web landing page content captured by wbgrp-svc279.us.archive.org:UNPAYWALL-PDF-CRAWL-2020-11 from Fri Nov 6 09:59:02 PST 2020 to Fri Nov 6 02:51:09 PST 2020.
Topic: crawldata