How to find all links / pages on a website

75

42

Is it possible to find all the pages and links on ANY given website? I'd like to enter a URL and produce a directory tree of all links from that site.

I've looked at HTTrack but that downloads the whole site and I simply need the directory tree.

Jonathan Lyon

Posted 2009-09-17T14:43:10.030

Reputation: 1 586

2

crawlmysite.in - site does not exist

– Sarah Trees – 2015-10-20T07:40:33.760

Answers

60

Check out linkchecker—it will crawl the site (while obeying robots.txt) and generate a report. From there, you can script up a solution for creating the directory tree.
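As a rough illustration of the "script up a solution" step, here is a minimal Python sketch. It assumes you have already pulled the checked URLs out of the report into a plain list (how you do that depends on the report format you chose). The names `build_tree` and `render` are made up for this example and are not part of linkchecker.

```python
from urllib.parse import urlparse


def build_tree(urls):
    """Nest URL paths into a dict of dicts keyed by path segment."""
    tree = {}
    for url in urls:
        node = tree
        # "/news/world" becomes the segments "news" and "world".
        for segment in urlparse(url).path.strip("/").split("/"):
            if segment:
                node = node.setdefault(segment, {})
    return tree


def render(tree, indent=0):
    """Return the tree as indented lines, with siblings sorted by name."""
    lines = []
    for name in sorted(tree):
        lines.append("  " * indent + name)
        lines.extend(render(tree[name], indent + 1))
    return lines
```

Calling `print("\n".join(render(build_tree(urls))))` then prints one indented line per path segment. Query strings and fragments are ignored here, which may or may not be what you want.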

Hank Gay

Posted 2009-09-17T14:43:10.030

Reputation: 51 460

thank you so much Hank! Perfect - exactly what I needed. Very much appreciated. – Jonathan Lyon – 2009-09-17T15:08:05.003

How do I do that myself? And what if a website has no robots.txt? – Alan Coromano – 2013-07-30T17:15:50.783

1

@MariusKavansky How do you manually crawl a website? Or how do you build a crawler? I'm not sure I understand your question. If there is no robots.txt file, that just means you can crawl to your heart's content.

– Hank Gay – 2013-07-31T15:14:02.103
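For anyone wondering what "building a crawler" might look like, here is a minimal Python sketch of a breadth-first, same-site crawl that consults robots.txt rules. It is only an outline under some assumptions: `crawl`, `get_links`, `robots_lines` and `max_pages` are hypothetical names, and `get_links(url)` stands for whatever code you use to fetch a page and return the href values on it. `RobotFileParser` comes from Python's standard library. With no robots.txt lines, nothing is disallowed.

```python
from collections import deque
from urllib.parse import urldefrag, urljoin, urlparse
from urllib.robotparser import RobotFileParser


def crawl(start_url, get_links, robots_lines=(), user_agent="*", max_pages=1000):
    """Breadth-first walk of one site; returns the URLs in visiting order.

    get_links(url) is a caller-supplied (hypothetical) callback returning
    the href values found on that page, relative or absolute.
    robots_lines holds the lines of the site's robots.txt, or nothing
    if the site has none.
    """
    robots = RobotFileParser()
    robots.parse(list(robots_lines))
    host = urlparse(start_url).netloc

    seen = {start_url}
    queue = deque([start_url])
    visited = []
    while queue and len(visited) < max_pages:
        page = queue.popleft()
        visited.append(page)
        for href in get_links(page):
            # Resolve relative links against the page and drop #fragments.
            url, _fragment = urldefrag(urljoin(page, href))
            # Skip pages already queued and anything on another host.
            if url in seen or urlparse(url).netloc != host:
                continue
            # Skip paths the robots.txt rules disallow for this agent.
            if not robots.can_fetch(user_agent, url):
                continue
            seen.add(url)
            queue.append(url)
    return visited
```

The fetching is left to the callback on purpose, so the traversal logic can be tried against a dictionary of fake pages before it is pointed at a real site.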

And this is available in Ubuntu's repository (actually it works with Windows/Mac/Linux) – Adi Fatol – 2013-11-26T22:28:49.610

Such a great little program! – Arash Saidi – 2014-11-29T19:24:07.880

5

Hi guys, linkchecker has not worked for me: when I scan the site, it only returns a report of broken links, and a very small one. It does say that it checked thousands of links, but I can't see where those are reported. I'm using version 9.3. Can you please help?

– Jawad – 2015-11-05T10:33:53.440

How do I send the output to a file with --out or -o? – Pandya – 2018-10-09T09:25:39.827

2

A nice tool. I was using "Xenu's Link Sleuth" before. Linkchecker is far more verbose.

– Mateng – 2011-11-14T20:42:56.590

28

Or you could use Google to display all the pages it has indexed for a domain, e.g. site:www.bbc.co.uk

John Magnolia

Posted 2009-09-17T14:43:10.030

Reputation: 9 093

5

But if you use extra search operators in Google such as site: or intitle:, you are restricted to 700 entries, even if the top of the results page claims far more than 700, e.g. "About 87,300 results (0.73 seconds)".

– Mbarry – 2013-04-01T22:57:58.157

1

@Mbarry, and how do you know that?

– Pacerier – 2015-04-06T13:46:24.690

It is easy to find out. Skip 30 to 50 pages ahead in the search results for "site:www.bbc.co.uk" and you will soon reach the end, rather than the thousands of results promised. – Zon – 2016-04-07T15:23:04.733

Even on normal searches, Google now does not return more than 400 results. – Lothar – 2017-12-05T21:01:21.793

27

If you have the developer console (JavaScript) in your browser, you can type this code in:

const links = document.querySelectorAll('a'); for (const link of links) console.log(link.href);

Shortened:

for(const a of $$('a'))console.log(a.href)

ElectroBit

Posted 2009-09-17T14:43:10.030

Reputation: 664

1

What about "JavaScript-ed" URLs?

– Pacerier – 2015-02-25T00:56:13.550

Like what? What do you mean? – ElectroBit – 2015-04-03T20:53:48.133

2

I mean a link done using JavaScript. Your solution wouldn't show it.

– Pacerier – 2015-04-06T13:45:53.350

2

@ElectroBit I really like it, but I'm not sure what I'm looking at. What is the $$ operator? Or is that just an arbitrary function name, the same as n=ABC('a')? I'm not understanding how urls gets all the 'a' tagged elements. Can you explain? I'm assuming it's not jQuery. What prototype library function are we talking about?

– zipzit – 2016-05-28T17:32:18.747

1

@zipzit In a handful of browsers, $$() is basically shorthand for document.querySelectorAll(). More info at this link: https://developer.mozilla.org/en-US/docs/Web/API/Document/querySelectorAll

– ElectroBit – 2016-05-28T17:54:13.940

There is no complete computable solution to traversing javascripted urls beyond some very rudimentary attempts. At least this tip is working with the DOM and not the HTML source. – Lothar – 2017-12-05T21:03:31.263

0

If this is a programming question, then I would suggest you write your own regular expression to parse all the retrieved contents. The target tags are IMG and A for standard HTML. In Java:

final String openingTags = "(<a [^>]*href=['\"]?|<img [^>]*src=['\"]?)";

This, along with the Pattern and Matcher classes, should detect the beginning of the tags. Add the LINK tag if you also want CSS.

However, it is not as easy as you may have initially thought. Many web pages are not well-formed. Programmatically extracting all the links that a human being can "recognize" is really difficult if you need to take into account all the irregular expressions.
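One way to sidestep some of the malformed-markup trouble is to let a tolerant HTML parser find the tags instead of a regular expression. The following is a minimal sketch of that alternative in Python rather than Java, using the standard library's `html.parser`. The names `LinkExtractor` and `extract_links` are invented for this example, and it only sees links present in the HTML source, not ones created by scripts.

```python
from html.parser import HTMLParser


class LinkExtractor(HTMLParser):
    """Collect href/src values from <a>, <img> and <link> tags."""

    # Which attribute carries the URL for each tag of interest.
    WANTED = {"a": "href", "img": "src", "link": "href"}

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # The parser lowercases tag and attribute names before calling this.
        wanted = self.WANTED.get(tag)
        for name, value in attrs:
            if name == wanted and value:
                self.links.append(value)


def extract_links(html):
    """Return the link targets found in an HTML string, in document order."""
    parser = LinkExtractor()
    parser.feed(html)
    parser.close()
    return parser.links
```

The values come back exactly as written in the page, so relative links still need to be resolved against the page URL, for instance with `urllib.parse.urljoin`.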

Good luck!

mizubasho

Posted 2009-09-17T14:43:10.030

Reputation: 44

14

No no no no, don't parse HTML with regex, it makes Baby Jesus cry!

– dimo414 – 2013-05-29T05:47:10.587

-2

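function getalllinks($url) {
    $links = array();
    // Fetch the page; return an empty list if it cannot be read.
    $content = @file_get_contents($url);
    if ($content === false) {
        return $links;
    }
    $pos = 0;
    // Scan for each "<a " tag in turn until strpos finds no more.
    while (($pos = strpos($content, '<a ', $pos)) !== false) {
        $tagEnd = strpos($content, '>', $pos);
        if ($tagEnd === false) break;
        $tag = substr($content, $pos, $tagEnd - $pos);
        $pos = $tagEnd;
        // Take the text between the double quotes that follow href.
        $hpos = strpos($tag, 'href');
        if ($hpos === false) continue;
        $spos = strpos($tag, '"', $hpos);
        if ($spos === false) continue;
        $epos = strpos($tag, '"', $spos + 1);
        if ($epos === false) continue;
        $link = substr($tag, $spos + 1, $epos - $spos - 1);
        // Keep absolute http(s) links only.
        if (strpos($link, 'http://') === 0 || strpos($link, 'https://') === 0) $links[] = $link;
    }
    return $links;
}
Try this code.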

user4318981

Posted 2009-09-17T14:43:10.030

Reputation: 38

8

While this answer is probably correct and useful, it is preferred if you include some explanation along with it to explain how it helps to solve the problem. This becomes especially useful in the future, if there is a change (possibly unrelated) that causes it to stop working and users need to understand how it once worked.

– Kevin Brown – 2015-03-06T00:12:06.320

1

Eh, it's a little long.

– ElectroBit – 2015-05-03T18:29:40.007

1

It is completely unnecessary to parse the HTML in this manner in PHP. http://php.net/manual/en/class.domdocument.php PHP does have the ability to understand the DOM!

– JamesH – 2015-06-26T12:30:11.250

it worked for me thanks – Mohamm6d – 2016-10-04T13:59:36.057