Google searching, free web scraping tools and a free tool to extract tables from PDFs.

I have been looking at some of the free Data Science and Cognitive Computing Courses and was following the Data Journalism: First Steps, Skills and Tools course.

The Google Searching video was on improving your searches in Google using the following:

  • quotes to match an exact phrase, eg "data science on construction"
  • the - sign to exclude results containing certain words, eg "data science on construction" -mbie -nz (note there is no space between the - and the word you want excluded)
  • the wildcard * to stand in for unknown words, eg "data on construction * 2018"
  • site: to restrict the search to a specific site, eg a statistics site, or more broadly site:*.nhs, or by country, eg site:nz "poverty statistics" (no space after the colon)
  • filetype: to get a particular file type, eg site:nl filetype:pdf
  • for a database, where searches could vary, eg site:nz database "search by" (the "search by" would most probably be the place for putting a query to the database)
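The operators above can also be put together in code. This is only a minimal sketch in Python: the function names build_query and search_url are my own, and the search URL format is an assumption about how Google accepts a query in the q parameter.

```python
from urllib.parse import urlencode


def build_query(phrase=None, words=(), exclude=(), site=None, filetype=None):
    """Combine the search operators from the list above into one query string."""
    parts = []
    if phrase:
        parts.append(f'"{phrase}"')              # quotes: exact phrase
    parts.extend(words)                          # plain key words
    parts.extend(f"-{w}" for w in exclude)       # minus sign, no space after it
    if site:
        parts.append(f"site:{site}")             # no space after the colon
    if filetype:
        parts.append(f"filetype:{filetype}")
    return " ".join(parts)


def search_url(query):
    """Turn a query string into a search URL (assumed URL format)."""
    return "https://www.google.com/search?" + urlencode({"q": query})


# The exclusion example from the list above.
print(build_query(phrase="data science on construction", exclude=["mbie", "nz"]))
```

Pasting the printed query into the Google search box should give the same results as typing it by hand.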

There was a web scraping video that I thought was great. I had previously made a couple of attempts with Python and the Beautiful Soup package, but with limited results so far.

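I particularly liked the Google Spreadsheets example (from 3.40 to 6.30 in the video above). This required a formula in a cell of the form =IMPORTHTML("URL", query, index), where query is either "table" or "list" and index is the position of that table or list on the page, counting from 1. In the example he used, it was a table.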

I also liked the free Outwit Hub tool that he demonstrated, which works over several pages (from 9.10 to 12.40 in the video above).

Outwit example, also using the Google search site:nz database "search by"

I tried it out on a database search that I found. The web site did not include the actual search in the URL, so I had to run the search in the web page within Outwit Hub, and then it crawled through the pages to get the first 100 lines (there were some messy lines I had to clear). The results exceeded the 100 row limit of the free version, but I could run the search again from a later date to grab more data. I am still impressed by the tool.
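Running the search again from a later date can return some rows that were already in the first batch. As a rough sketch only (merge_batches is a name I made up, and I am assuming each batch has been exported as a list of rows), the batches could be combined in Python like this:

```python
def merge_batches(*batches):
    """Join several batches of scraped rows, keeping the first copy of each row."""
    seen = set()
    merged = []
    for batch in batches:
        for row in batch:
            key = tuple(row)          # rows are compared on all of their cells
            if key not in seen:
                seen.add(key)
                merged.append(list(row))
    return merged
```

Rows keep the order in which they were first seen, so the earlier batch stays on top.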

I did a search on the web for some other tools and came across this article which then referenced a later article highlighting other web scraping tools, some free.

Google Spreadsheet importhtml()

Using Google Spreadsheets as a test, I also got this (although I couldn't get the table itself, as I think it sits in a separate tab on the page: there is a graph tab and a table tab).
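For comparison with my earlier Python attempts, here is a minimal sketch of the same idea as =IMPORTHTML("URL", "table", index) using only the Python standard library. The names TableCollector and import_html_table are my own, the function takes the page HTML as text rather than fetching a URL, and it does not handle tables nested inside other tables.

```python
from html.parser import HTMLParser


class TableCollector(HTMLParser):
    """Collect every <table> in a page as a list of rows of cell text."""

    def __init__(self):
        super().__init__()
        self.tables = []
        self._row = None    # row being built, or None when outside a <tr>
        self._cell = None   # text pieces of the current cell, or None

    def handle_starttag(self, tag, attrs):
        if tag == "table":
            self.tables.append([])
        elif tag == "tr" and self.tables:
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.tables[-1].append(self._row)
            self._row = None


def import_html_table(html, index):
    """Return the index-th table on the page, counting from 1 like IMPORTHTML."""
    parser = TableCollector()
    parser.feed(html)
    return parser.tables[index - 1]
```

Like the spreadsheet formula, this only sees tables that are in the HTML of the page itself, so a table that is loaded into a separate tab by the browser would still be missed.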


PDF Table Extract using Tabula

On a similar subject, another tool I would like to mention is Tabula, which is able to extract tables from PDFs and export them to, say, CSV files. It runs as a server on your computer and opens in your browser. FlowingData recommended this tool.

On this subject, I also came across an online version here.
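Tabula can also be driven from Python through the separate tabula-py package. This is not something I have covered above, so treat it as a sketch under assumptions: tabula-py and Java need to be installed, and the helper names csv_path_for and pdf_tables_to_csv are my own.

```python
from pathlib import Path


def csv_path_for(pdf_path):
    """Name the output file after the PDF, swapping the extension for .csv."""
    return str(Path(pdf_path).with_suffix(".csv"))


def pdf_tables_to_csv(pdf_path):
    """Extract the tables on every page of a PDF into one CSV file."""
    # Third-party package (pip install tabula-py); it needs Java to run.
    import tabula

    out_path = csv_path_for(pdf_path)
    tabula.convert_into(pdf_path, out_path, output_format="csv", pages="all")
    return out_path
```

Calling pdf_tables_to_csv on a PDF that sits in the current folder should write a CSV file with the same name next to it.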

