Web Scraper Firefox add-in How-To, and other add-ins


Web Scraper

I was looking at Firefox add-ins and came across a couple of web-scraping add-ins: Data Scraper and Web Scraper. The first one I couldn’t get working, and for the second I had to watch a video before I could get the settings sorted.

Recently, about four months ago, I added a view counter to my WordPress blog. Unfortunately, you had to open a specific post to see its number of views. I decided that this was a good test case for the web scraper tool (note: I could just go into the site’s database and run a query on that particular table).

I have played with web scrapers before, in this post and some others experimenting with APIs. I liked the idea of them but could not think where to use one, so I filed it away as a useful tool and didn’t think too much more about it. Then I came across this article: Are you buying an apartment? How to hack competition in the real estate market with data monitoring.

I thought it was a good use of the tool to gain information pertinent to the job in hand. And then I thought I could use it on job sites to look for work. I am a freelance architect/services designer, so I am always looking for the next project. I have tried LinkedIn, but it’s pretty useless really; I’m not sure why it has any sort of reputation at all. I do share my posts there, just as another means of getting the information into the ether, and it’s only my son’s and daughter’s circles of friends that link to them when they like the articles (or be disinherited).

Another use I had thought of was for other interesting blog sites. A lot of the current posts are shown, but the archived ones are listed only by date, and it’s hard to know what they are about. For topics such as Revit, a lot of old posts are still relevant, so a web scrape is a good way into the older posts.

This is why I decided to do the exercise on my site. https://cr8ive.cf/

So go into Firefox add-ins, get Web Scraper and install it. I will not go into that in this post; I’m sure there are lots of videos out there for it. Once installed, a little icon is added to your ribbon. Clicking it tells you that you need to open the console screen (CTRL + SHIFT + C), and then the tool is available there on the ribbon.

So:

1. Go to the site you want to crawl.

2. Open the Firefox console (CTRL + SHIFT + C).

3. Click on the Web Scraper icon.

4. Go to the Create new Sitemap tab.

5. Choose Create Sitemap on the pull-down.

I first watched this video and followed the steps for the web scrape:

And I ran into some difficulties with the pagination part of the tutorial. I had 14 pages and it only scraped pages 1, 2, 3 & 14 and left the rest. So I had to look at other resources and found this one, the original site, which also has video tutorials:

https://www.webscraper.io/

I’ve done a bit of a jump ahead on this next image.

I have just gone to another page, in this case:

https://cr8ive.cf/page/2/

I have then given the sitemap a name, and the start URL is:

https://cr8ive.cf/page/[1-14]/

(This means it will scrape all of those pages rather than just the first. I tried the process in the first video a number of times with no success.)
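As a rough Python sketch of what that [1-14] range notation expands to (the URL pattern is the one from the sitemap; the expansion itself is my illustration, not the tool’s own code):

```python
# Expand the start-URL range [1-14] into the individual page URLs,
# mirroring what Web Scraper does with the bracket notation.
base = "https://cr8ive.cf/page/{}/"
urls = [base.format(n) for n in range(1, 15)]

print(len(urls))   # 14 pages in total
print(urls[0])     # https://cr8ive.cf/page/1/
print(urls[-1])    # https://cr8ive.cf/page/14/
```

This is why the first video’s approach missed pages: without the range, the scraper only follows the pagination links it happens to find.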

Then hit the Create Sitemap button.

You will see it has gone into the sitemap cr8ive, and there is a blue button asking you to Add new selector. Click that.

This pulls up another interface where you give it a name and choose what type of selector you want (in this example it is a Link selector).

Next, because we want to select all the posts, we tick the Multiple box (1).

Then we click on Select, and the line object with a button (see items 5 and 6) pops up.

We then select one of the headings; it initially turns green, then red. Then we select another, and that also turns green, then red.

Then we click the tick box (item 5) and hit the Done Selecting! button. Then we hit the blue Save Selector button at the bottom of the page (or, if we don’t get what we want, hit Cancel).

So once we save the selector, we have created a new selector called Title, as you see below, and to the right you can do a number of things with it.

In the hierarchy, we have:

Sitemap : cr8ive [1]

_root [2]

Title [3]: this is the post title; the data we want to scrape is within the post. So we need to click on the ID Title [3], and this will take us to the next level down (i.e. within the post itself). So, in the web page above, we need to open a post to look at the web elements within it.

So, in the hierarchy we can see:

Sitemap : cr8ive [1]

_root [2]/ Title [3]

Note, if you click on _root, it is shown in blue as an active link and will take you one step up the hierarchy, back to the previous screenshot above.

Next, we click into an actual post, and we want to choose the categories (in this case BIM, Data Extraction, Productivity…) and also Post Views (in this case, the number 10). So we use the Add new selector button at the bottom [5].

So we add a couple of selectors below.

We will now check that the data is structured correctly using the dendrogram tool:

So here, from _root, it goes and finds Title, and within each post it gets the categories and view number.

The categories and views are within the title link, so all is well.
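Web Scraper stores this selector tree as a JSON sitemap, which is what you get when you export the setup. Below is a sketch of what ours might look like, checked here with Python; the exact CSS selectors are placeholders I made up, not what the tool actually recorded:

```python
import json

# Hypothetical export of the sitemap built above: Title is a link selector
# under _root, and the categories/views selectors sit under Title.
sitemap = json.loads("""
{
  "_id": "cr8ive",
  "startUrl": ["https://cr8ive.cf/page/[1-14]/"],
  "selectors": [
    {"id": "Title", "type": "SelectorLink", "parentSelectors": ["_root"],
     "selector": "h2 a", "multiple": true},
    {"id": "categories", "type": "SelectorText", "parentSelectors": ["Title"],
     "selector": ".cat-links a", "multiple": true},
    {"id": "views", "type": "SelectorText", "parentSelectors": ["Title"],
     "selector": ".post-views", "multiple": false}
  ]
}
""")

# The parentSelectors fields encode the hierarchy: _root -> Title -> data.
for s in sitemap["selectors"]:
    print(s["parentSelectors"][0], "->", s["id"])
```

The parentSelectors chain is what the dendrogram is drawing: each child selector runs inside the page its parent navigated to.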

For pagination over the 14 pages, we modified the start URL at the top to go to each page:

https://cr8ive.cf/page/[1-14]/

and then look for the title links on that page, of which there are a few.

 

As the structure is OK, we go to Scrape:

The program then opens a pop-up window and proceeds to go through each page and open each post.

On the main page there is, at the bottom:

“No data scraped yet”, with a refresh button beside it.

As there are 135-odd posts, it will take a while to open each one, get the data, then close it and open another. So if you want to do some work while the program is scraping, use a different browser; otherwise it stops.

I stopped it after a short while (I’d already run it previously on another PC). When it has something ready, a little pop-up box appears in the bottom right, and you need to press the REFRESH button on the left-hand side before the data shows.

You can then export the data as a CSV.

You get a download link that you need to press, and it asks whether you want to save the file or open it in Excel.

And in Excel this is how it appears.

I have put a filter on the top headers of the columns to allow sorting, and I have sorted by title (A to Z).

This shows that, because I had multiple categories on my posts, each category is listed as a separate row, so there is a bit of repetition. As my initial aim was to get a count of the views, I can delete the duplicate rows in Excel using the Remove Duplicates button.
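The same de-duplication can be done outside Excel. Here is a small Python sketch using the standard csv module; the column names and sample rows are assumptions for illustration (Web Scraper’s real export also includes bookkeeping columns such as the scrape order):

```python
import csv
import io

# A miniature stand-in for the exported CSV: one row per (post, category),
# so posts with several categories appear several times.
raw = """Title,categories,views
Post A,BIM,10
Post A,Productivity,10
Post B,Revit,7
"""

rows = list(csv.DictReader(io.StringIO(raw)))

# Keep the first row seen for each post title, like Excel's
# Remove Duplicates applied to the Title column.
unique = {}
for row in rows:
    unique.setdefault(row["Title"], row)

views = {title: int(row["views"]) for title, row in unique.items()}
print(views)   # {'Post A': 10, 'Post B': 7}
```

Because the views value repeats on every category row of the same post, dropping duplicates by title loses nothing for the view count.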

Next time, instead of setting the category selector to Multiple, I’d use single and get one row per post.

End thoughts

So, it is a bit slow grinding through all 135 posts, but pretty robust. The other nice thing is you can save the setup and export it to use over again.

I can see a couple of uses in tracking down relevant details for products, so setting up some good scrapes and having them saved for later is appealing.

There is a Regex box that is interesting. I’d also be interested to see if it works on JavaScript sites that load data dynamically (I had an API issue with the MetService data for tides in Wellington that I think was because of JavaScript; not certain, and that needs a bit more research).
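The Regex box lets you keep just part of the scraped text. For example, if the views selector returned a string like “10 views” instead of the bare number, a pattern like the one below would pull out only the digits (Python’s re module is shown here purely to illustrate the pattern; the tool applies it itself):

```python
import re

# Extract the leading number from a hypothetical scraped view-counter string.
scraped = "10 views"
match = re.search(r"\d+", scraped)
views = int(match.group()) if match else None
print(views)   # 10
```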

It’s great that it sits in your browser, so you do not have to go hunting for it. That makes it more accessible and so more likely to be used.

 

A couple of other add-ins I put into Firefox were:

Google translator for Firefox.

This is so I can look at the news from countries I’m interested in, such as Russia and China, and translate it to see what the people of those countries are reading.

Five Notes.

This pops out a sidebar on the left with 5 tabs for jotting down notes. I think this will be handy, especially at work where I’m trying to source architectural products.

I also use DuckDuckGo as the search engine (as it doesn’t track you)

and also uBlock Origin to block ads.

Also HTTPS Everywhere, though I’m not quite sure why; I read in a privacy article to use it, so I have.

 

 

 
