Google searching, free web scraping tools and a free tool for extracting tables from PDFs.
I have been looking at some of the free Data Science and Cognitive Computing Courses and was following the Data Journalism: First Steps, Skills and Tools course.
The Google Searching video was on improving your searches in Google using the following:
- quotes to match an exact phrase, eg "data science on construction"
- using the - sign to exclude results containing certain words, eg "data science on construction" -mbie -nz (note there is no space between the - and the word you want excluded)
- using the wildcard * in a search, eg "data on construction * 2018"
- restricting the search to a specific site, eg site:police.uk statistics, or more specifically site:dorset.police.uk statistics or site:*.nhs, or by country, eg site:nz "poverty statistics" (note there is no space after site:)
- restricting by file type, eg site:nl filetype:pdf
- finding a database, where the searches could vary, eg site:nz database "search by" (the "search by" text is most probably the label next to a box for putting a query to the database)
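The operators above can be combined in one search. As a rough illustration only, here is a minimal Python sketch that assembles a search string from them and turns it into a Google search URL. The function names build_query and search_url are my own invention for this example, not something from the course.

```python
from urllib.parse import urlencode


def build_query(phrase=None, words=(), exclude=(), site=None, filetype=None):
    """Compose a search string from the operators in the list above.

    build_query is a hypothetical helper written for this post.
    """
    parts = []
    if site:
        parts.append(f"site:{site}")              # no space after the colon
    if filetype:
        parts.append(f"filetype:{filetype}")
    parts.extend(words)                            # plain, unquoted words
    if phrase:
        parts.append(f'"{phrase}"')               # quotes ask for the exact phrase
    parts.extend(f"-{term}" for term in exclude)   # no space after the minus sign
    return " ".join(parts)


def search_url(query):
    """Turn the search string into a URL that can be pasted into a browser."""
    return "https://www.google.com/search?" + urlencode({"q": query})


query = build_query(phrase="poverty statistics", site="nz",
                    filetype="pdf", exclude=["mbie"])
print(query)
print(search_url(query))
```

Typing the printed search string straight into the Google search box does the same thing; the URL form is only handy if you want to save or share a search.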
There was a web scraping video that I thought was great. I had previously made a couple of attempts with Python and the Beautiful Soup package, with limited results so far.
I particularly liked the Google Spreadsheets example (from 3.40 to 6.30 in the video above). This required a formula in a cell of the form =IMPORTHTML("URL", query, index), where query is either "table" or "list"; in the example he used, it was a table.
I also liked the free Outwit Hub tool that he demonstrated, which works across several pages (from 9.10 to 12.40 in the video above).
Outwit example, also using the Google search site:nz database "search by"
I tried it out on a database search that I found. The web site did not include the actual search in the URL, so I had to run the search in the web page within Outwit Hub, and it then crawled through the result pages to get the first 100 rows (there were some messy lines that I had to clear). As the results exceeded the 100 row limit of the free version, I could run the search again starting from a later date to grab more data. I am still impressed by the tool.
I searched the web for some other tools and came across this article, which in turn referenced a later article highlighting other web scraping tools, some of them free.
Google Spreadsheet importhtml()
Using Google Spreadsheets as a test, I also got this (although I couldn't get the table from https://www.stats.govt.nz/topics/building, as I think it is in a separate tab: the page has a graph tab and a table tab).
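For comparison with my earlier Python attempts, here is a minimal sketch of what the "table" form of importhtml does: pick the table at a given position out of a page's HTML. It uses only Python's built-in html.parser rather than Beautiful Soup, the names TableExtractor and import_html_table are made up for this example, and the sample HTML and its numbers are invented test data, not real statistics. It also assumes the tables are not nested inside each other.

```python
from html.parser import HTMLParser


class TableExtractor(HTMLParser):
    """Collect every <table> in a page as a list of rows of cell text."""

    def __init__(self):
        super().__init__()
        self.tables = []    # one list of rows per table found
        self._row = None    # cells of the row currently being read
        self._cell = None   # text fragments of the cell currently being read

    def handle_starttag(self, tag, attrs):
        if tag == "table":
            self.tables.append([])
        elif tag == "tr" and self.tables:
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.tables[-1].append(self._row)
            self._row = None


def import_html_table(html, index):
    """Return table number `index`, counting from 1 as the spreadsheet formula does."""
    parser = TableExtractor()
    parser.feed(html)
    return parser.tables[index - 1]


# Invented sample page, for illustration only.
sample = """
<table><tr><th>Region</th><th>Consents</th></tr>
<tr><td>North</td><td>120</td></tr>
<tr><td>South</td><td>95</td></tr></table>
"""
for row in import_html_table(sample, 1):
    print(row)
```

To try it on a real page, the HTML could be fetched first, for example with urllib.request.urlopen(url).read().decode(). Like importhtml, this only sees tables that are present in the page's HTML when it is downloaded, so a table that is loaded later by the page itself would be missed.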
PDF Table Extract using Tabula
On a similar subject, another tool I would like to mention is Tabula, which can extract tables from PDFs and export them to, say, CSV files. It runs as a server on your computer and opens in your browser. FlowingData recommended this tool.
On this subject I also came across an online version here.