Microsoft’s Excel 2007 added an amazing new command not present in previous versions, Remove Duplicate Data (its just right of center on the Data ribbon).
Removing duplicate data before 2007 involved either Visual Basic programming or a manual comparison after sorting the data. Both solutions are time consuming (and error prone in the case of manual comparison).
Remove Duplicate Data can search unsorted tables with thousands of rows in less than a second, pulling any duplicates that match in every column, or in a user defined group of columns.
However, Remove Duplicate Data is scary. You click on the button, decide what columns must match, and almost instantaneously its done. Your only confirmation is the number of rows deleted.
I know Excel 2010 is about to be released to early testing, so here’s my wish list of improvements to Remove Duplicate Data in order of difficulty to implement:
- Give an option to put the deleted rows into a new worksheet after they are deleted. It can be a checkbox on the columns to check dialog. That lets you do the removal, some other processing and still recover (or review) the deleted items.
- Give an option to review the duplicated records, like Microsoft does in Access. All matching duplicated rows would be put in a new worksheet, I can decide what to delete/keep and then copy the remaining records back to the original worksheet.
- Give an option to perform arbitrary processing on the matching duplicates. For instance I might want to sum the values of one or more column, or do comparisons to decide which value of one or more columns to keep.
Remove Duplicate Data is amazingly powerful and simultaneously scary. I hope the next iteration in Excel 2010 can be made even better.