
# Tuesday, 14 February 2012
( Spider | Testing )

A fun web spider project for CodeCamp.

I'm using TDD, so the first test is:

[Test]
        public void GetHtml_GivenAWebsite_ReturnsTheRawHtml()
        {
            Spider s = new Spider();

            string result = s.GetHtml("http://www.stuff.co.nz");

            Assert.IsNotNullOrEmpty(result);
            Console.WriteLine(result);
        }

and to make it pass:

// needs: using System.IO; using System.Net; using System.Text;
public string GetHtml(string initialWebsite)
        {
            WebRequest wr = WebRequest.Create(initialWebsite);
            // Dispose the response and the reader once the body has been read
            using (HttpWebResponse response = (HttpWebResponse)wr.GetResponse())
            using (StreamReader sr = new StreamReader(response.GetResponseStream(), Encoding.UTF8))
            {
                return sr.ReadToEnd();
            }
        }

Find Links

[Test]
        public void GetFirstLink_GivenAWebsite_ReturnsTheFirstLink()
        {
            Spider s = new Spider();

            string result = s.GetFirstLink("http://www.stuff.co.nz");

            Assert.IsNotNullOrEmpty(result);
            Console.WriteLine(result);
        }

and to make it pass:

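public string GetFirstLink(string initialWebsite)
        {
            string rawHtml = GetHtml(initialWebsite);

            // Look for the first <a href=" and read up to the closing quote
            const string marker = "<a href=\"";
            int start = rawHtml.IndexOf(marker);
            if (start < 0) return null; // no link found

            // Skip past the marker so the first character of the URL is kept
            start += marker.Length;
            int end = rawHtml.IndexOf('"', start);
            if (end < 0) return null; // unterminated attribute

            return rawHtml.Substring(start, end - start);
        }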

This does work, but finding multiple links this way was going to get tedious, and regex felt like a much better way to go. After some googling, and not wishing to use an external library such as http://htmlagilitypack.codeplex.com/, I based the regex on:

http://stackoverflow.com/questions/122856/parse-html-links-using-c-sharp
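
As a rough illustration of the regex approach, here is a minimal sketch of a link extractor. It is not the code from the project: the class name LinkFinder and the pattern itself are assumptions, and a regex like this will miss unquoted attributes and other unusual markup.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class LinkFinder
{
    // Matches an <a ...> tag and captures the quoted href value in the group "url".
    // Hypothetical pattern for illustration only; real-world HTML can defeat it.
    private static readonly Regex HrefPattern = new Regex(
        "<a\\s[^>]*?href\\s*=\\s*[\"'](?<url>[^\"']*)[\"']",
        RegexOptions.IgnoreCase | RegexOptions.Compiled);

    // Returns every href value found in the HTML, in document order.
    public static List<string> GetAllLinks(string html)
    {
        var links = new List<string>();
        foreach (Match m in HrefPattern.Matches(html))
        {
            links.Add(m.Groups["url"].Value);
        }
        return links;
    }
}
```

Calling LinkFinder.GetAllLinks(html) on the output of GetHtml gives the full list of links rather than just the first one.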

Find Unique Links

image

Testing is very useful for finding bugs. Why is it returning holidayhouses.co.nz twice when it should only be visiting unique sites?

The code so far is:

public List<String> RunSpider(string startingSite, int numberOfJumps)
        {
            string site = startingSite;
            var listOfSitesVisited = new List<String>();
            for (int i = 1; i <= numberOfJumps; i++)
            {
                Console.WriteLine("Going to: " + site);
                string html = GetHtml(site);
                listOfSitesVisited.Add(site);

                List<String> listOfLinks = GetAllLinks(html);
                List<String> listOfExternalLinks = GetExternalLinks(listOfLinks, site);
                
                string siteToGoToNext = listOfExternalLinks[0].ToString();
                bool keepGoing = true;
                int j = 0;
                while (keepGoing) {
                    if (siteToGoToNext.Contains("holidayhouses.co.nz"))
                    {
                        var c = 1;
                    }

                    if (listOfSitesVisited.Contains(siteToGoToNext))
                    {
                        keepGoing = true;
                        siteToGoToNext = listOfExternalLinks[j].ToString();
                        j++;
                    }
                    else
                    {
                        siteToGoToNext = listOfExternalLinks[j].ToString();
                        keepGoing = false;
                    }
                }

                site = siteToGoToNext;
            }

            return listOfSitesVisited;
        }

and the test is:

[Test]
        public void RunSpider_GivenAStartingWebsiteAnd5Jumps_ReturnAListOfWebsitesVisitedWhichShouldBeUnique()
        {
            Spider s = new Spider();
            var listOfSitesVisited = s.RunSpider(startingSite, 5);
            CollectionAssert.Contains(listOfSitesVisited, startingSite);

            CollectionAssert.AllItemsAreUnique(listOfSitesVisited);

            Assert.AreEqual(5, listOfSitesVisited.Count);
        }

Debugging to the output window:

image

Why is siteToGoToNext findsomeone, when it then goes to holidayhouses.co.nz?

Solution: a counter was in the wrong place.
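
Reading the loop in RunSpider, one plausible version of the problem is that j is advanced after each read, so the link checked against listOfSitesVisited is not necessarily the one assigned in the else branch. The sketch below sidesteps the counter entirely by returning the first unvisited link. The names NextSitePicker and PickNextSite are made up for illustration; this is not the fix that was actually applied.

```csharp
using System.Collections.Generic;

public static class NextSitePicker
{
    // Returns the first link that has not been visited yet,
    // or null when every candidate has already been visited.
    public static string PickNextSite(IList<string> externalLinks, ICollection<string> visited)
    {
        foreach (string candidate in externalLinks)
        {
            if (!visited.Contains(candidate))
            {
                return candidate;
            }
        }
        return null;
    }
}
```

Because the candidate that is checked is the same one that is returned, the check and the assignment cannot drift apart. The null return also gives the caller a way to notice when a page has no suitable links left.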

Edge Cases

image

image

An interesting edge case: giveway.govt.nz is actually a redirect.

I handled this with:

[Test]
        public void RunSpider_GivenAStartingWebsiteAnd20Jumps_ReturnAListOfWebsitesVisitedWhichShouldBeUniqueAndHandleCasesWhereAWebsiteIsARedirectByRevertingBackToLastWebsiteAndGoingToNextLink()
        {
            Spider s = new Spider();
            //startingSite = "http://www.giveway.govt.nz";
            var listOfSitesVisited = s.RunSpider(startingSite, 20);
            CollectionAssert.Contains(listOfSitesVisited, startingSite);

            CollectionAssert.AllItemsAreUnique(listOfSitesVisited);
        }

just by reverting to the previous site and going to its next link.
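
How the redirect gets detected is not shown here. One possible way, sketched under that assumption, is to compare the host that was requested with the host that actually answered (for an HttpWebResponse that would be its ResponseUri property). RedirectCheck and LandedOnAnotherHost are hypothetical names.

```csharp
using System;

public static class RedirectCheck
{
    // True when the response came from a different host than the one requested,
    // which suggests the site redirected somewhere else.
    public static bool LandedOnAnotherHost(Uri requested, Uri responded)
    {
        return !string.Equals(requested.Host, responded.Host, StringComparison.OrdinalIgnoreCase);
    }
}
```

When this returns true, the spider could revert to the previous site and try its next link, as described above.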

Going too deep

image

Getting bogged down in deep links and stuck in an infinite loop. Trying to keep it simple (KISS), as I want to explore.

Set a limit of 30 chars.
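
A minimal sketch of such a limit, assuming the 30 chars apply to the length of the whole link. That is a guess, and LinkFilter and KeepShortLinks are illustrative names rather than code from the project.

```csharp
using System.Collections.Generic;
using System.Linq;

public static class LinkFilter
{
    // Assumed limit: links longer than this are treated as "too deep".
    public const int MaxLinkLength = 30;

    // Keeps only the links whose length is at or under the limit.
    public static List<string> KeepShortLinks(IEnumerable<string> links, int maxLength = MaxLinkLength)
    {
        return links.Where(link => link.Length <= maxLength).ToList();
    }
}
```

Short links tend to be front pages, so filtering this way nudges the spider towards hopping between sites instead of burrowing into one.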

Running Out of Suitable Links

image

The m.twitter problem.

Took out the mobile site handling; just go for it anyway.

404 Exception

Wrapped the request in a try/catch; on a failure, go to stuff.co.nz.
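
A sketch of what that fallback might look like. The fetch function is passed in as a delegate so the idea can be tried without hitting the network. SafeFetcher and GetHtmlOrFallback are hypothetical names, and catching WebException is an assumption about which exception the 404 surfaces as.

```csharp
using System;
using System.Net;

public static class SafeFetcher
{
    // Tries to fetch the site; if the request fails (for example with a 404),
    // fetches the fallback site instead and reports which site was actually visited.
    public static string GetHtmlOrFallback(
        Func<string, string> getHtml, string site, string fallbackSite, out string siteVisited)
    {
        try
        {
            siteVisited = site;
            return getHtml(site);
        }
        catch (WebException)
        {
            siteVisited = fallbackSite;
            return getHtml(fallbackSite);
        }
    }
}
```

In the spider this could be called with the GetHtml method as the delegate and "http://www.stuff.co.nz" as the fallback. If the fallback itself fails, the exception is left to propagate.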

Frames problem with mongolia site

ahh

davemateer.com problem – no links

no external links on front page.

Features

image

Websites visited in order – starting with holidayhouses.co.nz

image

bbc.co.uk start, with a 404 in the middle, which led back to stuff.co.nz

image

Starting with cnn.com I get here after 80 hops :)

image

cnn start.

Ideas:

display on a map where servers are?

display on map which countries visited (by domain name extension)

See the WPF post for the UI that I did.
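
For the countries-by-domain-extension idea, a first step could be pulling the last label off the host name. A minimal sketch follows; DomainInfo and GetDomainExtension are made-up names, and mapping the extension to a country (and deciding what to do with generic ones like com) is left open.

```csharp
using System;

public static class DomainInfo
{
    // Returns the last label of the host name, e.g. "nz" for www.stuff.co.nz.
    public static string GetDomainExtension(string url)
    {
        string host = new Uri(url).Host;
        int lastDot = host.LastIndexOf('.');
        return lastDot >= 0 ? host.Substring(lastDot + 1) : host;
    }
}
```

Running this over the list returned by RunSpider would give one extension per site visited, ready to be looked up against a table of country codes.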
