From my last post on RapidMiner I saw that it connected to AWS (Amazon Web Services) S3 storage. On exploring the AWS Free Tier I noted that you can have 5GB of storage for free. I also noted that they have a Free Tier for DynamoDB, which is a NoSQL database, so I thought I would try to set that up. I don't think RapidMiner can link to DynamoDB, but it interested me.
I thought that setting up the DB would be simple; it's not. You have to wire it up so that after creating a table in DynamoDB with a key field, you then have to upload a JSON file to S3 storage and use a Lambda function to import the data, with an IAM (Identity and Access Management) Policy and Role to allow access to these services (it sort of makes sense, but it's tortuous).
I followed this video for the setup:
NOTE: This process only uploads one row of data at a time.
So, let's hop to it. I have previously set up a free (for a year) EC2 VPS (virtual private server, for a demo site for OpenMAINT), so I will not go through the process of showing how to set up an AWS account. I did have to select an S3 service and a DynamoDB instance.
Set up IAM (Identity and Access Management) Policy and Role.
The policy combines all of the services that you are using, in this case CloudWatch Logs (see below), the S3 bucket and DynamoDB. From what I can see, this is the PERMISSIONS process for being able to access these services.
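For reference, a combined policy along these lines covers the three services. This is only a sketch, not my exact policy: the bucket name, table name and ARNs are placeholders you would swap for your own.

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": ["logs:CreateLogGroup", "logs:CreateLogStream", "logs:PutLogEvents"],
      "Resource": "arn:aws:logs:*:*:*"
    },
    {
      "Effect": "Allow",
      "Action": ["s3:GetObject"],
      "Resource": "arn:aws:s3:::my-upload-bucket/*"
    },
    {
      "Effect": "Allow",
      "Action": ["dynamodb:PutItem"],
      "Resource": "arn:aws:dynamodb:*:*:table/Prop"
    }
  ]
}
```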
Set up a CloudWatch Log. This is required for when you create a TRIGGER in the Lambda function that is activated when a new JSON file is uploaded to a specific S3 bucket.
When the file is uploaded to the bucket, the TRIGGER is activated and the CloudWatch Log shows what happened. So it's a DEBUG tool for checking that the process works.
It is used in this example to: 1/ see what the "RECORD" of the process looks like, so that you can find the objects (as per the video) and get the right names.
On the S3 part I didn't initially grant All resources, but I later went back and modified this so it was All (all permissions). (It didn't seem to make any difference.)
Screenshots are missing for DynamoDB, but it is the same process as the S3 setup.
Then press create policy and the next screen confirms it has been created.
Then you need to setup Roles for the Policy that has just been created.
Then the actual programming bit with the Lambda Function.
I am going with the Python 3.6 version, as that is what is loaded on my machine (also following the video, rather than the older Python 2.7).
So the environment is set up, ready to go.
In Services, go to DynamoDB and press Create Table.
Fill in the Table Name and Primary Key Name (in this case I have an ID number that is unique), and make sure you have the correct Data Type (number, string etc.) for the primary key. Then create the table (button, bottom right).
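For the curious, the same table could be created from code instead of the console. This is a hypothetical sketch: it just builds the parameters (matching my "Prop" table with its numeric "ID" key), and the boto3 call itself is commented out so nothing touches AWS.

```python
# Parameters equivalent to the console form: table name, primary
# (partition) key, and its data type ("N" = number in DynamoDB).
table_params = {
    "TableName": "Prop",
    "KeySchema": [{"AttributeName": "ID", "KeyType": "HASH"}],
    "AttributeDefinitions": [{"AttributeName": "ID", "AttributeType": "N"}],
    "BillingMode": "PAY_PER_REQUEST",
}

# The actual call would look like this (needs credentials configured):
# import boto3
# boto3.client("dynamodb").create_table(**table_params)

print(table_params["TableName"])  # Prop
```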
Going to Tables in the left sidebar shows you the tables that you have created. Here, if you click on the "Prop" table you will see, under the Items tab, the "ID (pk)" field, the only one created.
Next we make the Trigger in the Lambda Function tab that we have open.
In the middle of the screen is a dotted box with "Add triggers…". Click on S3 in the left column (as we want to activate a trigger when we upload a JSON file into our S3 storage).
A box appears at the bottom of the screen asking for configuration. Fill it in.
The Event is "an object is created" (i.e. uploaded) of type .json. After adding the configuration you need to save the trigger.
Creating JSON file from CSV file
I have some CSV data, and to convert it to JSON I used an online CSV to JSON converter. I took the top header and first line of data for the test file.
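The same conversion can be done with a few lines of Python's standard library. This is a sketch, and the field names below are made up for illustration, not from my actual CSV:

```python
import csv
import io
import json

# Stand-in for the real file: header row plus the first data line.
csv_text = "ID,Address,Suburb\n1,12 Smith St,Epsom\n"

# csv.DictReader pairs each data cell with its header name.
rows = list(csv.DictReader(io.StringIO(csv_text)))

# json.dumps always emits the double quotes DynamoDB expects.
json_text = json.dumps(rows[0])
print(json_text)  # {"ID": "1", "Address": "12 Smith St", "Suburb": "Epsom"}
```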
Amazon S3 – upload .json file
Going into the S3 instance, I uploaded the JSON file (to check whether the TRIGGER was activated).
Going next to the CloudWatch service, click on Logs and then the log group instance.
Opening it out shows the Lambda function process. In this case, we are calling the information from the uploaded file, so we can see what has been called.
This is the Python code (note the indent for the print function):

def lambda_handler(event, context):
    print(str(event))
    return 'Hello from Lambda'
The print(str(event)) shows, in the CloudWatch log, the Record of that activity.
If you cut/paste this into an ONLINE JSON VIEWER and, after pasting, Format it (with the Format button), it will show you the JSON output.
If you then go into the VIEWER tab, it will show you the structure of the Data.
This can be used to find the objects that you are looking for (watch the video as it goes through it very well).
From this we can programmatically find the S3 "Bucket Name" and the S3 file "Name" (the Key) by seeing what is called in the record.
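As a sketch of what that digging looks like in code: the record below is a cut-down, made-up version of an S3 event, keeping only the parts we need (a real event carries many more keys), but the paths walked are the same ones the Lambda handler would use.

```python
# Minimal made-up S3 event record, shaped like the one the
# trigger passes to lambda_handler(event, context).
event = {
    "Records": [
        {
            "s3": {
                "bucket": {"name": "my-upload-bucket"},
                "object": {"key": "test.json"},
            }
        }
    ]
}

# Dig out the bucket name and the file key from the record.
bucket = event["Records"][0]["s3"]["bucket"]["name"]
key = event["Records"][0]["s3"]["object"]["key"]
print(bucket, key)  # my-upload-bucket test.json
```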
I stuffed this up on a number of levels in the process.
I ended up with 2 instances of S3 in my IAM policy. That caused a glitch and lots of errors, and took a long, long time to debug. It was hard to know that this was the issue; I just had to go back and re-do steps until it finally fixed itself, and only then did I realise about the 2 instances in the IAM policy.
JSON format. If you look in the video, his dataset has DOUBLE QUOTES (") around all the items (apart from numbers, which have no quotes). DynamoDB wants double quotes, single ones won't do, so I had to use search/replace in Notepad++ to change these.
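The quote problem can also be checked, and fixed, in Python rather than Notepad++. A sketch using only the standard library (the sample data is made up):

```python
import ast
import json

bad = "{'ID': 1, 'Suburb': 'Epsom'}"  # single quotes: not valid JSON

# json.loads rejects single-quoted text outright.
try:
    json.loads(bad)
    valid = True
except json.JSONDecodeError:
    valid = False
print(valid)  # False

# ast.literal_eval reads the text as a Python dict, and json.dumps
# re-emits it with the double quotes DynamoDB wants.
fixed = json.dumps(ast.literal_eval(bad))
print(fixed)  # {"ID": 1, "Suburb": "Epsom"}
```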
A few issues with indentation in Python. I had forgotten about that, so there were a lot of re-uploads of the file to S3 to retest it and check the CloudWatch log.
Overall about a day debugging 12 lines of code that looked so simple and elegant in the video.
The CloudWatch log wasn't that great at pointing to where the problem was (apart from the double quotes in the JSON file). At least it did tell you that there was a glitch, though (nothing more frustrating than code running and then nothing; bloody hard to debug).
I ended up going back to printing out the bucket and key to see that these were being read correctly. Early on in the debug process I did have an issue with it not reading the Key correctly.
Another issue, when trying to upload lots of rows of data, was it timing out. You need to be in the Lambda console and click on the wheel to find the timeout setting (set at 3 seconds); I altered it to a longer duration.
Another thing is empty cells! If there are any in the JSON file then the row will not load, as DynamoDB will not accept empty values for attributes. So this is a bit of a challenge for uploading data from a CSV file where there are empty cells. They have to be cleaned out so the file can be uploaded into DynamoDB.
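One way to clean these out is a dict comprehension that drops empty values from each row before it goes to DynamoDB. A sketch, with made-up field names:

```python
# A made-up row as it might come from a CSV with a blank cell.
row = {"ID": "7", "Address": "12 Smith St", "Suburb": ""}

# Drop any attribute whose value is an empty string, so the
# upload to DynamoDB isn't rejected.
clean = {k: v for k, v in row.items() if v != ""}
print(clean)  # {'ID': '7', 'Address': '12 Smith St'}
```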
I have just ascertained that this process uploads one row at a time. I think this may be down to the Python dictionary setup being used, so it's not a fast way to populate a database from scratch. I will need to do some more research on how to upload multiple lines of data into DynamoDB.
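From what I can see, DynamoDB's batch write API takes up to 25 items per request, so one approach is to chunk the rows first. This is a sketch: the chunking below is plain, runnable Python, while the boto3 part is commented out (boto3's Table.batch_writer() handles the 25-item batching and retries for you, so it may be the simpler route).

```python
def chunks(rows, size=25):
    """Split rows into lists of at most `size` items
    (25 is the DynamoDB batch-write limit per request)."""
    return [rows[i:i + size] for i in range(0, len(rows), size)]

rows = [{"ID": n} for n in range(60)]  # made-up rows
batches = chunks(rows)
print([len(b) for b in batches])  # [25, 25, 10]

# With boto3, the batching can instead be left to the library:
# import boto3
# table = boto3.resource("dynamodb").Table("Prop")
# with table.batch_writer() as writer:
#     for row in rows:
#         writer.put_item(Item=row)
```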
Not easy; quite a complex process to populate a database. I now have to figure out how to do multi-line loads.
Good to see that the S3 storage is free.
I like the Lambda Function Python process; I think I'll need to explore this more.
Overall not a particularly successful process to date, but some learning of some of the AWS services and how they connect to each other.
I think I would set up a KNIME or RapidMiner process to prepare the JSON files with no blanks, ordered so that they upload well.
Where to from here? I would like to explore connecting to S3 externally (with RapidMiner) and also connecting to DynamoDB for some other purpose. A bit more research required for the latter.
End note: The Featured image is from the video by Java Home Cloud. I have used it without their permission but am acknowledging their source.