Biodata: a flexible research data entry application

Over the years I have wondered why researchers still rely on spreadsheet applications to record their research data. Although most people know how to use them, spreadsheets have very serious drawbacks, especially if multiple people contribute data to the same dataset/project. With spreadsheets it is usually not possible to have multiple people enter data concurrently. It is also hard to check the sanity of entered data unless you start working with complex formulas and macros, which quickly become very complex, error prone and hard to maintain. And lastly, transforming data in a spreadsheet where you combine data from different tabs/tables, transpose data, filter/sort data, group/split data, etc becomes hard and error prone. For these (and more) reasons, (relational) databases with a decent data entry application in front of it are much better suited for research data.

Although a lot of applications exist for specific research projects or fields, there is very little available for generic research data entry applications which allow a researcher to define their own datasets and let multiple people enter the data using a simple interface that validates the user input. So when I started working at Marine Conservation Philippines as science officer, I decided to build Biodata to solve this problem.

Functional

Biodata is a webapplication, centered around the concept of a dataset. Each dataset is a set of objects and variables, that together make up the dataset for a specific (research) project. Multiple datasets can be created by a researcher for different projects. In the webapplication users who enter data, just get a choice of different datasets that they want to enter (and view/edit) data for. The forms/overviews for these different datasets are automatically generated based on the dataset definition.

Datasets have four basic objects, namely Sample, Site, Observer and Observation. These link together in the basic dataset, where each sample has a site (location), one or multiple participants (observers) and one or multiple Observations (basically rows in a spreadsheet). Each observation in turn has an observer that recorded that specific observation.

On each of these extra objects fields can be added, like numeric fields, text fields, date fields, etc. Extra objects can also be added, like a Species object that is linked to an Observation, so a user can choose a species as part of the fields for the observation. Extra objects can be called anything the researcher wishes. They are just a bundle of fields with a name that makes sense in the scope of the research project. All the researcher has to do is define these objects and their respective fields and relationships, and everything from database management, form creation, etc is handled by the application.

Technical gibberish

Biodata is written in Python and uses the Flask webapplication framework. It uses SQLAlchemy to define datasets and act as a wrapper around the actual database. As such it should work with many different databases, but at Marine Conservation Philippines, we run it with a PostgreSQL backend. It contains a command-line tool for database migrations, created using Flask-migrate, which in turn uses Alembic.

It is written using a test driven development approach, and the code is hosted on GitHub.

Because of the test-driven approach, and because Flask is a framework that puts very few restrictions on the development of the application with very little “magic” going on behind the scenes, this combination of technologies allows for very quick development of the application. In practice it has often proved to be quicker to just develop a new feature than write it down in a project management application and get back to it later.