
Worldwide Web Crawls

Wide crawls of the Internet conducted by Internet Archive. Please visit the Wayback Machine to explore archived web sites.



Collection: 2B views
The seed for Wide00014 was:
- Slash pages from every domain on the web:
-- a list of domains using Survey crawl seeds
-- a list of domains using Wide00012 web graph
-- a list of domains using Wide00013 web graph
- Top ranked pages (up to a max of 100) from every linked-to domain using the Wide00012 inter-domain navigational link graph
-- a ranking of all URLs that have more than one incoming inter-domain link (rank was determined by number of incoming links using Wide00012 inter domain links)...
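The ranking step described above can be sketched roughly as follows. This is only an illustration, not the Internet Archive's actual tooling: the function name `top_pages_per_domain`, the input format (pairs of source and target URLs), and the use of the URL hostname as a stand-in for "domain" are all assumptions made for the example.

```python
from collections import Counter, defaultdict
from urllib.parse import urlsplit


def top_pages_per_domain(links, max_per_domain=100):
    """Rank linked-to URLs by incoming inter-domain links (hypothetical helper).

    links: iterable of (source_url, target_url) pairs.
    Returns {hostname: [urls, most-linked first]} with at most
    max_per_domain URLs per hostname.
    """
    incoming = Counter()
    for source, target in links:
        source_host = urlsplit(source).hostname
        target_host = urlsplit(target).hostname
        # Count only links that cross from one host to another.
        if source_host and target_host and source_host != target_host:
            incoming[target] += 1

    by_host = defaultdict(list)
    for url, count in incoming.items():
        # Keep URLs with more than one incoming inter-domain link.
        if count > 1:
            by_host[urlsplit(url).hostname].append((count, url))

    ranked = {}
    for host, pages in by_host.items():
        # Sort by link count (descending), then by URL for a stable order.
        pages.sort(key=lambda page: (-page[0], page[1]))
        ranked[host] = [url for _, url in pages[:max_per_domain]]
    return ranked
```

A production version would need a proper notion of registered domain (for example via a public suffix list) rather than the raw hostname used here.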
Collection: 1.4B views
Wide17 was seeded with the "Total Domains" list of 256,796,456 URLs provided by Domains Index on June 26th, and crawled with max-hops set to "3" and de-duplication set "on".
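To make the max-hops setting concrete, here is a minimal sketch of a hop-limited, breadth-first crawl over an in-memory link graph. It is not the crawler's real implementation: the function `crawl` and the `get_links` callback are hypothetical, and de-duplication is interpreted here simply as "never visit the same URL twice". The actual de-duplication mechanism is not described in this listing and may instead operate on content.

```python
from collections import deque


def crawl(seeds, get_links, max_hops=3):
    """Visit URLs reachable within max_hops link hops of any seed.

    seeds: iterable of seed URLs.
    get_links: hypothetical callback returning the outgoing links of a URL.
    Returns the URLs in the order they were visited.
    """
    seen = set()
    queue = deque()
    for seed in seeds:
        if seed not in seen:  # skip duplicate seeds
            seen.add(seed)
            queue.append((seed, 0))

    visited = []
    while queue:
        url, hops = queue.popleft()
        visited.append(url)
        if hops == max_hops:
            continue  # do not follow links beyond the hop limit
        for link in get_links(url):
            if link not in seen:  # URL-level de-duplication (assumption)
                seen.add(link)
                queue.append((link, hops + 1))
    return visited
```

With a toy graph such as `{"s": ["a"], "a": ["b"], "b": ["c"], "c": ["d"]}` and seed `"s"`, the URL `"d"` sits four hops away and is therefore never visited when `max_hops=3`.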
Collection: 1.2B views
Web wide crawl.
Collection: 1B views
Web wide crawl number 16. The seed list for Wide00016 was made from the join of the top 1 million domains from Cisco and the top 1 million domains from Alexa.
Wide Crawl Number 12 - started March 14th, 2015
Collection: 49,621 items, 1.1B views
Web wide crawl with initial seedlist and crawler configuration from January 2015.
Wide Crawl started June 2014
Collection: 45,341 items, 1.1B views
Web wide crawl with initial seedlist and crawler configuration from June 2014.
Wide Crawl Number 13
Collection: 46,050 items, 799.2M views
Web Wide Crawl Number 13
Wide Crawl started April 2013
Collection: 25,035 items, 1.2B views
Web wide crawl with initial seedlist and crawler configuration from April 2013.
Wide Crawl started August 2013
Collection: 21,932 items, 770.4M views
Web wide crawl with initial seedlist and crawler configuration from August 2013.
Wide Crawl started January 2012
Collection: 30,373 items, 669.5M views
Web wide crawl with initial seedlist and crawler configuration from January 2012 using HQ software.
Wide Crawl started October 2011
Collection: 12,648 items, 409.8M views
Web wide crawl with initial seedlist and crawler configuration from March 2011 using HQ software.
Wide Crawl started April 2012
Collection: 39,279 items, 587.7M views
Web wide crawl with initial seedlist and crawler configuration from April 2012.
Wide Crawl started February 2014
Collection: 9,806 items, 488.2M views
Web wide crawl with initial seedlist and crawler configuration from February 2014.
Wide Crawl started October 2010
Collection: 15,839 items, 452.5M views
Web wide crawl with initial seedlist and crawler configuration from October 2010.
Wide Crawl started March 2011
Collection: 8,528 items, 381.4M views
Web wide crawl with initial seedlist and crawler configuration from March 2011. This uses the new HQ software for distributed crawling by Kenji Nagahashi.
What’s in the data set:
- Crawl start date: 09 March, 2011
- Crawl end date: 23 December, 2011
- Number of captures: 2,713,676,341
- Number of unique URLs: 2,273,840,159
- Number of hosts: 29,032,069
The seed list for this crawl was a list of Alexa’s top 1 million web sites, retrieved close to the crawl start date. We used Heritrix (3.1.1-SNAPSHOT)...
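The figures quoted for the March 2011 crawl allow a few derived ratios to be worked out. The short calculation below uses only the numbers stated in the description. The variable names are illustrative, and the per-day rate assumes captures were spread evenly between the stated start and end dates, which the description does not claim.

```python
from datetime import date

# Figures as stated in the crawl description.
captures = 2_713_676_341
unique_urls = 2_273_840_159
hosts = 29_032_069
crawl_start = date(2011, 3, 9)
crawl_end = date(2011, 12, 23)

# Derived values (simple averages, not reported by the source).
crawl_days = (crawl_end - crawl_start).days
captures_per_url = captures / unique_urls
urls_per_host = unique_urls / hosts
captures_per_day = captures / crawl_days

print(f"Crawl length: {crawl_days} days")
print(f"Captures per unique URL: {captures_per_url:.2f}")
print(f"Unique URLs per host: {urls_per_host:.1f}")
print(f"Average captures per day: {captures_per_day:,.0f}")
```

Under these assumptions the crawl ran for 289 days, with roughly 1.19 captures per unique URL, about 78 unique URLs per host, and on the order of 9.4 million captures per day.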
Wide Crawl started September 2012
Collection: 22,423 items, 429.4M views
Web wide crawl with initial seedlist and crawler configuration from September 2012.
Wide Crawl Started January 2013
Collection: 15,157 items, 436M views
Wide crawls of the Internet conducted by Internet Archive. Access to content is restricted. Please visit the Wayback Machine to explore archived web sites.
Host Screen Captures
Collection: 17,413 items, 123.9M views
Screen captures of hosts discovered during wide crawls. This data is currently not publicly accessible.
survey_00010
Web item: 7.8M views
"Internet Archive crawldata from feed-driven by 1.2 million top ranked domains from data.domainrank.io - captured by crawl423.us.archive.org:survey_00010 from Mon May 11 14:14:43 PDT 2020 to Mon May 11 09:09:55 PDT 2020."
Topics: survey_00010, crawldata
Wide Crawl started September 2010
Collection: 332 items, 13.4M views
Web wide crawl with initial seedlist and crawler configuration from September 2010.
Wide Crawl started February 2014
Web item: 6.8M views
Internet Archive crawldata from Webwide Crawl, captured by crawl426.us.archive.org:wide from Wed Feb 19 07:58:38 PST 2014 to Wed Feb 19 05:13:46 PST 2014.
Topic: crawldata
Wide Crawl started February 2014
Web item: 7M views
Internet Archive crawldata from Webwide Crawl, captured by crawl453.us.archive.org:wide from Wed Feb 19 01:09:37 PST 2014 to Tue Feb 18 21:33:27 PST 2014.
Topic: crawldata
Wide Crawl started February 2014
Web item: 7M views
Internet Archive crawldata from Webwide Crawl, captured by crawl420.us.archive.org:wide from Tue Feb 18 17:01:58 PST 2014 to Tue Feb 18 13:14:06 PST 2014.
Topic: crawldata
Wide Crawl started February 2014
Web item: 6.9M views
Internet Archive crawldata from Webwide Crawl, captured by crawl427.us.archive.org:wide from Wed Feb 19 09:49:01 PST 2014 to Wed Feb 19 06:07:15 PST 2014.
Topic: crawldata
Wide Crawl started February 2014
Web item: 6.9M views
Internet Archive crawldata from Webwide Crawl, captured by crawl454.us.archive.org:wide from Wed Feb 19 05:20:19 PST 2014 to Wed Feb 19 01:54:33 PST 2014.
Topic: crawldata
Wide Crawl started October 2011
Web item: 936,454 views
Internet Archive crawldata from Webwide Crawl, captured by crawl426.us.archive.org:wide from Fri Oct 7 04:29:15 PDT 2011 to Fri Oct 7 00:11:17 PDT 2011.
Topic: crawldata
Wide Crawl started October 2011
Web item: 806,001 views
Internet Archive crawldata from Webwide Crawl, captured by crawl426.us.archive.org:wide from Fri Oct 7 14:36:17 PDT 2011 to Fri Oct 7 08:44:43 PDT 2011.
Topic: crawldata
Internet Archive crawldata from Webwide Crawl, captured by crawl428.us.archive.org:wide from Tue Jun 13 00:55:34 PDT 2017 to Mon Jun 12 19:36:27 PDT 2017.
Topic: crawldata
Internet Archive crawldata from Webwide Crawl, captured by crawl421.us.archive.org:wide from Mon Feb 12 21:42:38 PST 2018 to Mon Feb 12 15:20:34 PST 2018.
Topic: crawldata
Host Screen Captures
Web item: 851,624 views
Internet Archive crawldata from Webwide Crawl, captured by crawl431.us.archive.org:widewebcap from Tue Jan 17 22:09:27 UTC 2012 to Wed Feb 1 11:19:58 UTC 2012.
Topic: crawldata
Wide Crawl started February 2014
Web item: 1.1M views
Internet Archive crawldata from Webwide Crawl, captured by crawl422.us.archive.org:wide from Tue Feb 18 06:45:58 PST 2014 to Tue Feb 18 01:16:03 PST 2014.
Topic: crawldata
Wide Crawl Number 17: Started August 3rd, 2018 - Still running
Web item: 950,964 views
Internet Archive crawldata from Webwide Crawl, captured by crawl806.us.archive.org:wide from Sun Mar 17 18:09:04 PDT 2019 to Sun Mar 17 17:28:58 PDT 2019.
Topic: crawldata
Wide Crawl started October 2010
Web item: 60,999 views
Internet Archive crawldata from all sites, captured by ia360905.us.archive.org:wide from Thu Dec 2 04:45:56 UTC 2010 to Thu Dec 2 09:10:27 UTC 2010.
Topic: crawldata
Internet Archive crawldata from Webwide Crawl, captured by crawl805.us.archive.org:wide from Mon May 13 17:55:38 PDT 2019 to Mon May 13 13:45:55 PDT 2019.
Reviews: 1
Topic: crawldata
Internet Archive crawldata from Webwide Crawl, captured by crawl803.us.archive.org:wide from Mon Aug 6 09:40:02 PDT 2018 to Wed Aug 8 08:03:57 PDT 2018.
Topic: crawldata
Internet Archive crawldata from Webwide Crawl, captured by crawl803.us.archive.org:wide from Sun Aug 5 01:06:54 PDT 2018 to Mon Aug 6 15:07:05 PDT 2018.
Topic: crawldata
Internet Archive crawldata from Webwide Crawl, captured by crawl424.us.archive.org:wide from Mon Aug 6 13:48:48 PDT 2018 to Wed Aug 8 11:44:48 PDT 2018.
Topic: crawldata
Internet Archive crawldata from Webwide Crawl, captured by crawl421.us.archive.org:wide from Mon Aug 6 16:39:31 PDT 2018 to Wed Aug 8 09:59:07 PDT 2018.
Topic: crawldata
Internet Archive crawldata from Webwide Crawl, captured by crawl800.us.archive.org:wide from Sun Aug 5 03:33:48 PDT 2018 to Tue Aug 7 01:48:53 PDT 2018.
Topic: crawldata
Internet Archive crawldata from Webwide Crawl, captured by crawl422.us.archive.org:wide from Mon Aug 6 06:15:56 PDT 2018 to Wed Aug 8 04:21:46 PDT 2018.
Topic: crawldata
Internet Archive crawldata from Webwide Crawl, captured by crawl802.us.archive.org:wide from Sun Aug 5 00:35:40 PDT 2018 to Tue Aug 7 01:25:09 PDT 2018.
Topic: crawldata
Internet Archive crawldata from Webwide Crawl, captured by crawl812.us.archive.org:wide from Mon Aug 6 13:38:33 PDT 2018 to Wed Aug 8 08:00:38 PDT 2018.
Topic: crawldata
Internet Archive crawldata from Webwide Crawl, captured by crawl808.us.archive.org:wide from Sat Aug 4 22:37:55 PDT 2018 to Mon Aug 6 22:17:37 PDT 2018.
Topic: crawldata
Internet Archive crawldata from Webwide Crawl, captured by crawl421.us.archive.org:wide from Sun Aug 5 04:29:12 PDT 2018 to Mon Aug 6 09:39:30 PDT 2018.
Topic: crawldata
Internet Archive crawldata from Webwide Crawl, captured by crawl812.us.archive.org:wide from Sun Aug 5 05:36:11 PDT 2018 to Mon Aug 6 06:38:33 PDT 2018.
Topic: crawldata
Internet Archive crawldata from Webwide Crawl, captured by crawl802.us.archive.org:wide from Mon Aug 6 09:28:15 PDT 2018 to Wed Aug 8 18:02:43 PDT 2018.
Topic: crawldata
Internet Archive crawldata from Webwide Crawl, captured by crawl807.us.archive.org:wide from Sun Aug 5 04:40:59 PDT 2018 to Tue Aug 7 09:40:31 PDT 2018.
Topic: crawldata
Internet Archive crawldata from Webwide Crawl, captured by crawl428.us.archive.org:wide from Sun Aug 5 05:35:10 PDT 2018 to Mon Aug 6 17:04:32 PDT 2018.
Topic: crawldata
Internet Archive crawldata from Webwide Crawl, captured by crawl425.us.archive.org:wide from Sun Aug 5 03:16:34 PDT 2018 to Mon Aug 6 16:55:41 PDT 2018.
Topic: crawldata
Internet Archive crawldata from Webwide Crawl, captured by crawl811.us.archive.org:wide from Sun Aug 5 01:02:39 PDT 2018 to Tue Aug 7 17:04:42 PDT 2018.
Topic: crawldata
Internet Archive crawldata from Webwide Crawl, captured by crawl806.us.archive.org:wide from Mon Aug 6 06:15:02 PDT 2018 to Wed Aug 8 17:46:21 PDT 2018.
Topic: crawldata
Internet Archive crawldata from Webwide Crawl, captured by crawl428.us.archive.org:wide from Mon Aug 6 12:03:20 PDT 2018 to Wed Aug 8 23:51:48 PDT 2018.
Topic: crawldata
Internet Archive crawldata from Webwide Crawl, captured by crawl422.us.archive.org:wide from Sat Aug 4 20:34:59 PDT 2018 to Mon Aug 6 22:41:00 PDT 2018.
Topic: crawldata
Internet Archive crawldata from Webwide Crawl, captured by crawl804.us.archive.org:wide from Sun Aug 5 04:04:57 PDT 2018 to Tue Aug 7 13:12:57 PDT 2018.
Topic: crawldata
Internet Archive crawldata from Webwide Crawl, captured by crawl429.us.archive.org:wide from Sun Aug 5 05:22:26 PDT 2018 to Mon Aug 6 22:26:16 PDT 2018.
Topic: crawldata
Internet Archive crawldata from Webwide Crawl, captured by crawl808.us.archive.org:wide from Mon Aug 6 09:25:04 PDT 2018 to Wed Aug 8 09:47:22 PDT 2018.
Topic: crawldata
Internet Archive crawldata from Webwide Crawl, captured by crawl813.us.archive.org:wide from Mon Aug 6 09:17:48 PDT 2018 to Wed Aug 8 09:30:08 PDT 2018.
Topic: crawldata
Internet Archive crawldata from Webwide Crawl, captured by crawl426.us.archive.org:wide from Sun Aug 5 05:32:56 PDT 2018 to Tue Aug 7 09:24:53 PDT 2018.
Topic: crawldata
Internet Archive crawldata from Webwide Crawl, captured by crawl800.us.archive.org:wide from Sat Aug 4 03:13:18 PDT 2018 to Sun Aug 5 11:15:19 PDT 2018.
Topic: crawldata
Internet Archive crawldata from Webwide Crawl, captured by crawl424.us.archive.org:wide from Sat Aug 4 23:46:37 PDT 2018 to Tue Aug 7 01:00:56 PDT 2018.
Topic: crawldata
Internet Archive crawldata from Webwide Crawl, captured by crawl806.us.archive.org:wide from Sun Aug 5 03:32:04 PDT 2018 to Mon Aug 6 22:50:58 PDT 2018.
Topic: crawldata
Wide Crawl started June 2014
Web item: 4.2M views
Internet Archive crawldata from Webwide Crawl, captured by crawl421.us.archive.org:wide from Thu Jul 10 06:43:41 PDT 2014 to Thu Jul 10 01:23:01 PDT 2014.
Topic: crawldata
Internet Archive crawldata from Webwide Crawl, captured by crawl805.us.archive.org:wide from Mon Aug 6 04:16:16 PDT 2018 to Wed Aug 8 01:01:30 PDT 2018.
Topic: crawldata
Internet Archive crawldata from Webwide Crawl, captured by crawl813.us.archive.org:wide from Sun Aug 5 06:35:49 PDT 2018 to Mon Aug 6 18:20:05 PDT 2018.
Topic: crawldata
Internet Archive crawldata from Webwide Crawl, captured by crawl427.us.archive.org:wide from Sat Aug 4 20:05:05 PDT 2018 to Mon Aug 6 06:35:51 PDT 2018.
Topic: crawldata
Wide Crawl started June 2014
Web item: 4.2M views
Internet Archive crawldata from Webwide Crawl, captured by crawl429.us.archive.org:wide from Thu Jul 10 07:24:15 PDT 2014 to Thu Jul 10 01:45:52 PDT 2014.
Topic: crawldata
Internet Archive crawldata from Webwide Crawl, captured by crawl425.us.archive.org:wide from Mon Aug 6 09:59:22 PDT 2018 to Wed Aug 8 07:33:30 PDT 2018.
Topic: crawldata
survey_00010
Web item: 32,680 views
"Internet Archive crawldata from feed-driven by 1.2 million top ranked domains from data.domainrank.io - captured by crawl421.us.archive.org:survey_00010 from Sun May 24 19:26:13 PDT 2020 to Sun May 24 15:10:12 PDT 2020."
Topics: survey_00010, crawldata
Internet Archive crawldata from Webwide Crawl, captured by crawl429.us.archive.org:wide from Sat Aug 4 03:13:13 PDT 2018 to Sun Aug 5 16:18:03 PDT 2018.
Topic: crawldata
Internet Archive crawldata from Webwide Crawl, captured by crawl426.us.archive.org:wide from Fri Aug 3 21:18:13 PDT 2018 to Sun Aug 5 13:22:13 PDT 2018.
Topic: crawldata
Internet Archive crawldata from Webwide Crawl, captured by crawl805.us.archive.org:wide from Sun Aug 5 07:05:50 PDT 2018 to Mon Aug 6 04:10:11 PDT 2018.
Topic: crawldata
Internet Archive crawldata from Webwide Crawl, captured by crawl429.us.archive.org:wide from Mon Aug 6 13:08:43 PDT 2018 to Wed Aug 8 10:49:06 PDT 2018.
Topic: crawldata
Internet Archive crawldata from Webwide Crawl, captured by crawl814.us.archive.org:wide from Mon Aug 6 07:50:16 PDT 2018 to Tue Aug 7 10:08:16 PDT 2018.
Topic: crawldata
Internet Archive crawldata from Webwide Crawl, captured by crawl814.us.archive.org:wide from Sun Aug 5 05:20:25 PDT 2018 to Mon Aug 6 00:50:15 PDT 2018.
Topic: crawldata
Host Screen Captures
Web item: 870,071 views
Internet Archive crawldata from Webwide Crawl, captured by crawl431.us.archive.org:widewebcap from Wed Jan 18 15:27:56 UTC 2012 to Wed Feb 1 15:16:33 UTC 2012.
Topic: crawldata