Getting started

Clean Data Sets offers 24/7 access to the best data cleansing features on the market. Whether you're a solo journalist trying to make sense of messy data or a member of a team of scientists putting data together for research, we structure your data so you can clean it up in the least amount of time.

In Clean Data Sets you work with your data in the broadly known spreadsheet grid layout (e,g, rows, columns, cells), but have access to highly specialized features designed to make data cleansing as simple as possible.

You can import your data into Clean Data Sets from a broad range of formats that include: CSV (Comma Separated Values), XML, spreadsheets formats (e.g. Excel, OpenOffice), PDF, JSON & you can even provide a website URL or paste data from your computer's clipboard. Once your data is in Clean Data Sets, you can apply clean up rules in seconds!.

Basic concepts

Although we strive to use commonly known technical concepts like rows, columns & cells, there are some key concepts that are important to understand in order to use Clean Data Sets effectively.

Project actions & history

A project is made up of a set of data you upload to Clean Data Sets on which you perform actions. At any point in time -- irrespective of the actions you undertake on a project -- you can revert back to a project's original data state. This means in Clean Data Sets you have an undo/redo history sequence, ensuring you can go back or forward to certain actions in a project.

It's important to understand that not all project actions behave the same way. While some actions go on to form part of a project's history (i.e. undo/redo sequence), other actions are temporary and need to be either explicitly made permanent or are just intended as a visual aid in the data clean up process. As a rule of thumb, if an action alters data and doesn't just present it in another manner, it's considered a permanent action that goes on to form part of a project's history. The rest of the documentation emphasizes when actions are permanent, when they require further actions to be made permanent or when actions are temporary.

Finally, keep in mind a Clean Data Sets project history (i.e. undo/redo sequence) behaves like most software applications and isn't infinite. For example, if you perform 10 permanent actions on a project and then revert back to the 3rd action and perform a new permanent action, actions 4 through 10 will be effectively lost. In other words, you can move up or down in a project's history, so long as you don't perform any new permanent action after it. The moment you do perform a new permanent action all post-events from that point on are cleared. This behavior is pretty similar in most software applications, but it's important to know so you understand a Clean Data Sets project history limitations.

Edit & Transpose

You can edit any piece of data in Clean Data Sets. But since the edition process is geared toward cleaning up data, editing is simpler & more powerful than in a standard spreadsheet. For example:

  • You can edit an individual cell and also apply the changes to other identical cells in one click.
  • You can edit all the cells in a column and apply a transformation rule (e.g. transform to uppercase, change data type) in one click.
  • You can also edit all the cells in a column to be split into more columns/rows or consolidate multiple columns into a single column/row.
  • You can edit a column's cells to be filled with a given or blank value.
  • A column can be cloned, renamed, removed, moved to the start/end/left/right, all in a single click.
  • And if you want to fetch data from the Internet based on a column's data, you can create a new column that automatically collects the data for you.

All edit tasks go on to form part of a project's history (i.e. undo/redo sequence).

You can also transpose rows and columns, a process that swaps values from a row into columns or a column into rows. Transposing is just a more elaborate form of editing that's achievable in a few clicks and which also forms part of a project's history.

Sort & View

You can sort a column's cells based on text, number, date or boolean values. In addition to sorting in ascending & descending order, it's also possible to sort the position of cells with blank and error values. Sort tasks are temporary and must be made permanent manually, otherwise all sort tasks are lost.

A view allows you to hide (a.k.a. collapse) or expand columns, a feature that's helpful for projects with a large number of columns. View tasks are temporary and only intended to aid visualization, you will need to remove a column if you want it to disappear permanently or sort a project's column order if you want a specific column to be permanently besides another column.

Row stars and flags

You can mark rows with a 'star' or 'flag' -- icons present at the start of each row -- to distinguish rows that match certain criteria. The 'star' and 'flag' marks mean nothing per se and only have the meaning you give them. For example, you can mark a row with a 'star' or 'flag' to indicate it's good, incomplete, erroneous, bad or any other characteristic, as there are no side effects to using either mark.

You can apply 'star' or 'flag' marks to individual rows or apply them to multiple rows after bulk tasks (e.g. facets or filters) to keep track of rows across multiple actions.

All tasks that involve marking rows with a 'star' or 'flag' go on to form part of a project's history (i.e. undo/redo sequence).

Facets & Clusters

A Facet is a type grouping operation available on all project columns. A Facet lets you quickly gain insight on the different values of a given column which is helpful to consolidate records and verify value ranges.

For example, a column with sloppy data capturing might have values like 'Mobile','mobile', 'cellular' and 'mobile.' which all indicate the same thing, with a Facet you can quickly identify these different values and consolidate them into a single value. With a Facet you can also quickly generate a histogram to visually identify the value ranges of a column to determine if the values fit an expected pattern.

When a Facet produces many results, you can use a cluster to determine similar values based on different algorithms. On the top right hand side of every Facet there's a cluster button, once you select a cluster and choose an algorithm you'll see a list of results you can consolidate.

Facet tasks are temporary and only intended to aid in the application of permanent tasks (e.g. consolidating values), therefore Facets are lost between sessions.

Filters

A Filter lets you group a set of records that match a given column's value. Unlike a Facet which generates groups for the various values of a column, a Filter generates a single group for the value you provide. For example, you can create a filter to get records that have a number 9 in their column 'X' or the text 'CA' in their column 'Y'. In this sense, a Filter is very similar to filters in standard spreadsheets.

Filter tasks are temporary and only intended to aid in the application of permanent tasks (e.g. consolidating values), therefore Filters are lost between sessions.

How Tos

With a basic understanding of the basic concepts in Clean Data Sets, let's explore how to do some of the most common data cleansing tasks.

Clean up duplicate values

Duplicate values are easily detected by treating all data values as text (e.g. 1 (the number) is cast to '1' (the string) with no ambiguity), so there's no need to deal with data types to detect duplicate values in Clean Data Sets.

Once you identify a column where you wish to clean up duplicate values (e.g. email, number, city), create a Text Facet on it.

If there are a small number of Facet results, you'll immediately be presented with a list you can sort by count to quickly detect which values are present more than once. Clicking on any of the Facet results generates a filtered list with the clicked value where you can take further action (i.e. edit the value, delete the entire record).

If there are a large number of Facet results, you'll be presented with the option 'Facet by choice counts' that generates a histogram with the various counts for each value. Once you generate the histogram by clicking on the 'Facet by choice counts' link, move the histogram slider to 2 and beyond to detect values that are present more than once. Once you move the histogram slider, the Facet is updated to reflect matching records in the histogram range, at which point you can click on any Facet result to generate a filtered list to take further action (i.e. edit the value, delete the entire record).

Clean up value ranges

Value ranges are useful because they help detect anomalous values in a series.

Value ranges composed of text are easily identified with a Text Facet. For example, to detect if a column's values match a strict set of values (e.g. months of the year) you can generate a Text Facet on the column and inspect the resulting Facet for unexpected values which you can then edit, consolidate or completely remove.

Value ranges composed of numbers require a Numeric Facet to be easily identifiable and sortable (e.g. treating the numbers 1,100,1000,10000 as text, produces text-based ordering which makes it difficult to detecting other series numbers like 2,3,4,5,6,7,8,9).

To create a Numeric Facet, you must first ensure a column's font appears in light green which indicates a number data type. If the desired number column font appears in black, then the column's values must be converted to a number data type. To convert a column to a number data type, click on the 'Edit cells' option, followed by 'Common transforms' and the 'To number' option, after which point a column's font will turn to light green to indicate a number data type.

Once a column's values are of a number data type (i.e. light green) create a Numeric Facet on the column to create a histogram with the various numeric values. You can quickly see the range of numeric values and move the histogram slider to detect out of range values. For example, if a column represents a person's age, negative values or values over 100 could be out of range and easily detectable by moving the histogram slider. Once you move the histogram slider, matching records are displayed for the selected range, at which point you can take further actions (i.e. edit the value, delete the entire record)

Extensions / Plugins

To provide extra functionality, Clean Data Sets supports OpenRefine extensions. Although you can technically upload any OpenRefine compliant extension designed for OpenRefine v2.6, some extensions may not function out of box 'as is'. Because OpenRefine runs locally by default, many OpenRefine extensions assume this type of deployment environment (e.g. relative paths, access to local disk) which in Clean Data Sets may not be compatible.

To make the use of OpenRefine extensions in Clean Data Sets as easy as possible, we provide some of OpenRefine's most popular extensions verified to run on Clean Data Sets. In this manner, you can download/inspect the extensions and upload them yourself to Clean Data Sets. If you install an OpenRefine extension not provided by us and it doesn't work as expected, please contact support and we may be able to help you set it up.

VIB bioinformatics extensions

VIB bionformatics offers three different extensions:

DERI extensions

DERI offers an extension to export RDF (Resource Description Framework) - Download | Documentation.