Cleansing of unstructured data – Keep, Delete, Not sure

Cleansing of unstructured data prior to migration

Data cleansing and Migration of Data

Data cleansing and migration must be considered whenever a data management system is upgraded or newly installed.  Most data migration advice and white papers address the cleansing and migration of structured data, that is, data already held in databases.  Far fewer articles explain how to go about cleansing unstructured data, for example when implementing an Electronic Document and Records Management System (EDRMS) or Document Management System (DMS) and migrating data from file shares.

Windows Explorer and the Windows file structure are great for their flexibility and the ease with which folders can be created and used.  However, there is an inherent disadvantage: folder structures soon proliferate, with duplicate folders and documents created according to each user's personal perspective.  One of the first challenges for any organisation that wants to standardise documents and records management using an Electronic Document Management System is therefore to introduce a Business Classification Scheme or Taxonomy that is common to, and appropriate for, the whole organisation.

Unstructured Data

As more and more organisations move their unstructured data, such as the documents and records created as part of day-to-day work, into a dedicated system, a Business Classification Scheme or file plan becomes essential.  Consistency in folder naming is the first step in establishing the file plan.  Once the file plan is agreed, the difficult task of cleansing and migrating existing or legacy data into the new system begins.

This task is onerous.  Anyone who uses a computer will know the state of their own folder structure and the difficulty of maintaining consistency within it.  In a corporate environment this is exacerbated by the number of people using the system and their individual preferences for naming files and folders.  Establishing a common file plan is quite a task in itself, but once it is agreed, the next hurdle is to get everyone organised to start cleaning up their existing data.

This is another daunting task.  There are not many solutions that can help to identify which files are redundant and which may be important for the future.  De-duplication software can be used to identify duplicates.  However, the decision on what to keep can only be made by the users of those file areas, or by the responsible owner of the data within the organisation, who may be a senior manager for the department.  Such accountabilities are not usually clearly defined, and the work tends to fall on the admin support staff who help to keep things tidy.  They will need to engage with their user community to get support for this effort.  It is better if the senior responsible person for the function takes accountability and gets their team to support the effort.
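To illustrate what a de-duplication tool does at its simplest, here is a minimal sketch in Python that groups files with byte-identical content by comparing SHA-256 hashes.  The function names `file_digest` and `find_duplicates` are our own hypothetical ones and are not taken from any particular product.  The sketch only reports candidates and changes nothing; the decision on which copy to keep still rests with the users or the data owner.

```python
import hashlib
from collections import defaultdict
from pathlib import Path


def file_digest(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def find_duplicates(root: Path) -> list[list[Path]]:
    """Group files under root that have byte-identical content."""
    by_digest: dict[str, list[Path]] = defaultdict(list)
    for path in sorted(root.rglob("*")):
        if path.is_file():
            by_digest[file_digest(path)].append(path)
    # Only groups with more than one member are duplicates.
    return [paths for paths in by_digest.values() if len(paths) > 1]
```

Note that this only finds exact copies.  Two versions of a document that differ by a single character will not be grouped, which is another reason the final judgement has to stay with people who know the content.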

Method of Keep, Delete and Not sure

So how should this cleaning effort be approached?  One approach we have advocated with clients over the last few years is the method we call “Keep, Delete, Not sure”.  It works because it gives people a practical way to tackle the difficult task of cleaning existing folders (hundreds?) which contain thousands of documents.

Each folder needs to be reviewed to assess whether its content, that is, all the documents within it, is appropriate to the folder name.  It is amazing how many folders are called ‘Misc’, presumably for miscellaneous, or ‘Joe’s Stuff’.  Often a folder holds documents relating to more than one subject, and these may need separating.  Identifying the purpose of each document and relating it to the folder name will be helpful.

Once the folders are rationalised, the documents can be reviewed, initially by printing a list of them and assessing each one for one of the following actions:

  1. Keep
  2. Delete
  3. Not Sure

It is important to stress that this should be a paper exercise: DO NOT carry out this review on a LIVE system!  During the assessment, three different highlighter pens can be used to annotate the listings, one colour per action, so that the actions can be carried out later on the electronic system.  Although tedious, this approach is a practical way to overcome the great task of dealing with electronic files.
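One way to produce the listing for printing is sketched below in Python.  It is only an illustration under our own assumptions: the function name `write_review_listing`, the CSV layout and the column headings are hypothetical choices, not part of any EDRMS product.  The script reads the folder but never changes it, and the listing should be written somewhere outside the folder being reviewed.

```python
import csv
from datetime import datetime
from pathlib import Path


def write_review_listing(folder: Path, listing: Path) -> int:
    """Write one CSV row per file in folder, with a blank Action column.

    The folder is only read, never modified. Returns the number of files listed.
    """
    files = sorted(p for p in folder.rglob("*") if p.is_file())
    with listing.open("w", newline="", encoding="utf-8") as handle:
        writer = csv.writer(handle)
        writer.writerow(["Path", "Size (bytes)", "Last modified", "Action"])
        for path in files:
            stat = path.stat()
            modified = datetime.fromtimestamp(stat.st_mtime).strftime("%Y-%m-%d")
            # Action is left blank, to be marked Keep, Delete or Not sure on paper.
            writer.writerow(
                [path.relative_to(folder).as_posix(), stat.st_size, modified, ""]
            )
    return len(files)
```

The resulting file opens in any spreadsheet application and can be printed from there, leaving the Action column free for the highlighter pens.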

The files identified for deletion can be deleted immediately or renamed with a prefix of ‘zz-’, so that in the default file listing they fall to the bottom of the list and out of the way.  DO NOT delete files until you are absolutely certain that you have the authority to act and that you do not need them.

The files identified to Keep can stay in the folder.

The ‘Not sure’ ones can initially be annotated with a prefix of ‘xx-’, or they can be moved into a separate new folder within the same area.  Remember to name this new folder appropriately, e.g. ‘NOT sure-Engineering design’ with a date, so you will know these have been reviewed.  This folder is a temporary holding area until decisions are made about the content.  Do ensure that the user community is aware of this and does not add any new work here!
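Once the paper review is complete, the prefix renaming for the Delete and ‘Not sure’ files can be scripted rather than done by hand.  The Python sketch below is one hedged way to do it: the function names `plan_renames` and `apply_renames`, and the idea of passing the decisions in as a simple mapping of file name to action, are our own assumptions.  It defaults to a dry run that only prints what would change, so nothing is touched until the team has checked the plan.

```python
from pathlib import Path

# Prefixes taken from the method above: 'zz-' for Delete, 'xx-' for Not sure.
PREFIXES = {"delete": "zz-", "not sure": "xx-"}


def plan_renames(folder: Path, decisions: dict[str, str]) -> list[tuple[Path, Path]]:
    """Return (old, new) path pairs for files marked Delete or Not sure."""
    renames = []
    for name, action in decisions.items():
        prefix = PREFIXES.get(action.strip().lower())
        source = folder / name
        if prefix is None or source.name.startswith(prefix):
            continue  # Keep, unrecognised action, or already prefixed
        renames.append((source, source.with_name(prefix + source.name)))
    return renames


def apply_renames(renames: list[tuple[Path, Path]], dry_run: bool = True) -> None:
    """Print the planned renames; only touch the disk when dry_run is False."""
    for source, target in renames:
        print(f"{source} -> {target}")
        if not dry_run:
            if target.exists():
                raise FileExistsError(target)  # never overwrite an existing file
            source.rename(target)
```

For example, `apply_renames(plan_renames(folder, {"old.doc": "Delete", "maybe.doc": "Not sure"}))` prints the two proposed renames and changes nothing; passing `dry_run=False` carries them out.  Files marked Keep are left exactly as they are.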


This approach identifies the actions that can be carried out as soon as the assessment is complete.  It resolves the Keep and Delete items straight away and gives everyone the comfort of knowing that ‘Not sure’ is perfectly normal: you can be unsure and keep things until you are ready to deal with them.

The ‘Not sure’ category will reduce over time as more and more data is managed proactively across the teams.  Meanwhile, the fact that a folder has been created with a name that includes a target date for deletion may prompt colleagues to check whether they need anything from it.  A collaborative effort across teams is the best way to achieve data cleansing!  One superb example from a recent client is a clean-up team folder called ‘TO BE DELETED December 2016’, which allows team members to retrieve last-minute items if they discover a need, but when December comes they will find it hard to argue “Well how was I to know?”
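If holding folders follow a naming convention like the client example above, a small script can flag the ones whose target month has arrived.  The Python sketch below assumes folder names containing ‘TO BE DELETED’ followed by an English month name and a four-digit year; that pattern, and the function name `deletion_due`, are our own assumptions based on the single example given.  It only reports and never deletes anything.

```python
import re
from datetime import date
from typing import Optional

MONTHS = [
    "january", "february", "march", "april", "may", "june",
    "july", "august", "september", "october", "november", "december",
]
# Assumed naming convention, e.g. 'TO BE DELETED December 2016'.
PATTERN = re.compile(r"TO BE DELETED (\w+) (\d{4})", re.IGNORECASE)


def deletion_due(folder_name: str, today: date) -> Optional[bool]:
    """Return True if the target month has arrived, False if not yet,
    or None if the name does not follow the assumed convention."""
    match = PATTERN.search(folder_name)
    if not match or match.group(1).lower() not in MONTHS:
        return None
    month = MONTHS.index(match.group(1).lower()) + 1
    year = int(match.group(2))
    return (today.year, today.month) >= (year, month)
```

A folder flagged as due should still go through the team review before anything is removed.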


Key points to NOTE:

  1. Get commitment from the data owner or the responsible person for the areas of work that are to be covered. Get them to engage with the team to allocate the effort required.
  2. Do make sure that this is carried out first as a paper exercise only, before you make changes to the electronic system (even if this is in shared folders).
  3. Ensure that items identified for deletion are reviewed by the team before they are finally deleted.
  4. Ensure that the ‘NOT sure’ folder is reviewed regularly, so that the folder can eventually also be cleaned out.
  5. Ensure your records management (RM) policies cover this activity, e.g. that the review process and the subsequent destruction of unwanted legacy data are written into your governance policies.