It’s been a while since I did a proper post about something techy. But I did spend quite a bit of time with Knime, and I love it, so I think it warrants a post. Knime is a complete data mining, manipulation and analysis toolkit. It’s completely open-source and really flexible. I have experience with other ETL and reporting tools, like Talend Open Studio, JasperReports and Pentaho Kettle, but for me (in my use case) Knime rises above them. Talking about my use case, what is it?
At the solar power company that I am a partner in (among other activities) we sell inverters from various brands. As a consequence, our clients use a variety of (web-based) monitoring tools, each with its own dashboard and output. But we need overviews comparing all our clients with each other, and besides that, we can get extra data from those inverters internally, which allows us to do advanced debugging and monitoring. Visualising the data from those different sources is non-trivial.
This is where Knime comes in. We can download exports of the web-based monitoring dashboards that our clients use. Some of them produce CSV files, others Excel files, and the format of all those files is wildly different. Knime allows me to read in the data from those different files, combine it into one common dataset and then analyse that data. I used it, for instance, to create an interactive graph that plots the average daily solar array output per week for each of our clients (as a % of their maximum output) so I could compare the performance of each solar array.
Knime does its work by creating a chain of “nodes” (operations) that are connected by arrows. Each node does something small, like calculating a number, filtering rows or renaming columns. The data flows through the chain of operations from the start node to the end node.
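To give an idea of what such a chain boils down to, here is a minimal Python sketch of the weekly comparison described above. It is not how Knime works internally and not my actual workflow: the sample records, the function names and the choice to treat “maximum output” as each client’s best observed day are all assumptions made for illustration.

```python
from collections import defaultdict
from datetime import date
from statistics import mean

# Hypothetical uniform records: (client, day, energy produced that day in kWh).
RECORDS = [
    ("client_a", date(2015, 6, 1), 20.0),
    ("client_a", date(2015, 6, 2), 30.0),
    ("client_a", date(2015, 6, 8), 40.0),
    ("client_b", date(2015, 6, 1), 5.0),
    ("client_b", date(2015, 6, 2), 10.0),
]


def drop_empty_days(rows):
    """Filter node: keep only days with a positive reading."""
    return [row for row in rows if row[2] > 0]


def as_percent_of_max(rows):
    """Calculation node: express each day as a % of that client's best day.

    Assumption: "maximum output" means the highest daily value observed.
    """
    best = defaultdict(float)
    for client, _, kwh in rows:
        best[client] = max(best[client], kwh)
    return [(client, day, 100.0 * kwh / best[client]) for client, day, kwh in rows]


def weekly_average(rows):
    """Grouping node: average the daily percentages per client and ISO week."""
    groups = defaultdict(list)
    for client, day, pct in rows:
        iso = day.isocalendar()
        groups[(client, iso[0], iso[1])].append(pct)
    return {key: mean(values) for key, values in groups.items()}


def run_chain(rows, nodes):
    """Pass the data through each node in order, like following the arrows."""
    for node in nodes:
        rows = node(rows)
    return rows


result = run_chain(RECORDS, [drop_empty_days, as_percent_of_max, weekly_average])
```

Each function plays the role of one node, and `run_chain` plays the role of the arrows. The result is a dictionary keyed by client, ISO year and ISO week number, which is roughly the table a plotting node would receive.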
With the nodes you can do calculations, filtering, joining, grouping, pivot tables, transposing, etc. Knime also has graphing tools, and an enormous number of plugins that let you go into seriously complex modelling, genetics, bioinformatics, geo-information, etc. It can read a whole directory of files or a database, download data and files automatically from the internet or from web services, and interface with online APIs like Google Analytics and Twitter. Most of this stuff I haven’t tried out though.
On the left of the image above you see a bunch of nodes with no input. Those are meta nodes. These allow you to embed a whole separate workflow in another workflow. For each client, for instance, I have a workflow that reads the respective export files, combines them and transforms the data into a format that is uniform across all clients. This way you can break complex workflows up into several separate ones. And if I ever need to read the data from one of those clients in another workflow, I just copy the meta node over to the new workflow without having to redo any work.
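In code terms, a meta node behaves a lot like a reusable function: it hides the messy, client-specific reading and hands back data in the common format. The sketch below illustrates that idea in Python. The two export layouts, the column names and the reader functions are hypothetical and do not correspond to any real monitoring portal.

```python
import csv
import io
from datetime import datetime

# Hypothetical export A: semicolon-separated, day-first dates, decimal commas, kWh.
EXPORT_A = "Day;Yield (kWh)\n01.06.2015;20,5\n02.06.2015;31,0\n"
# Hypothetical export B: comma-separated, ISO dates, values in watt-hours.
EXPORT_B = "date,energy_wh\n2015-06-01,5200\n2015-06-02,10400\n"


def read_client_a(text):
    """'Meta node' for client A: returns uniform (client, day, kWh) rows."""
    reader = csv.DictReader(io.StringIO(text), delimiter=";")
    return [
        (
            "client_a",
            datetime.strptime(row["Day"], "%d.%m.%Y").date(),
            float(row["Yield (kWh)"].replace(",", ".")),  # decimal comma to point
        )
        for row in reader
    ]


def read_client_b(text):
    """'Meta node' for client B: same output format, different input layout."""
    reader = csv.DictReader(io.StringIO(text))
    return [
        (
            "client_b",
            datetime.strptime(row["date"], "%Y-%m-%d").date(),
            int(row["energy_wh"]) / 1000,  # watt-hours to kWh
        )
        for row in reader
    ]


# Any other workflow can reuse the readers without knowing the file details.
combined = read_client_a(EXPORT_A) + read_client_b(EXPORT_B)
```

Because both readers return the same three fields, everything downstream only has to deal with one format, which is the same reason copying a meta node into a new workflow saves redoing the work.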
The code below does the matplotlib work. It is commented and, together with the pyplot docs, should be quite understandable.
Knime is really an end-to-end tool for people who need to work with heterogeneous data and have to load, transform and correct it before visualising and analysing it. Although there is a bit of a learning curve, Knime is quite user-friendly, and anyone (even without programming experience) will be able to learn to use it. The additional Python and R scripting possibilities and the host of plugins take it beyond a simple ETL tool to a proper business intelligence and data analysis suite.