Archive for October 2011

Data Recovery Tools



Data recovery refers to the recovery of data on a computer that has been lost for any reason. Most operating systems in use today have some kind of repair tool built in, even if it is very basic. For example, Microsoft Windows comes with the chkdsk utility, Apple’s Mac OS X has Disk Utility, and Linux has the fsck utility.

While these utilities help repair minor inconsistencies, they are of little use in the event of large-scale data loss. Third-party utilities are available, some of which are far superior to the built-in ones. They can even recover data from disks that the operating system’s own repair utility does not recognize.

Data recovery tools use two main techniques. The first is consistency checking, which scans the logical structure of the disk and verifies that it is consistent with its specification.
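As a rough illustration of consistency checking, the sketch below validates a toy, FAT-style cluster table. The table layout, the `END` and `FREE` markers, and the function name are all assumptions made for this example. They do not describe the on-disk format of any real file system or the internals of chkdsk, Disk Utility, or fsck.

```python
END = -1      # hypothetical end-of-chain marker
FREE = None   # hypothetical marker for an unallocated cluster


def check_chains(table, file_starts):
    """Return a list of inconsistencies found in a toy cluster table.

    table[i] holds the cluster that follows cluster i, END, or FREE.
    file_starts maps a file name to the first cluster of that file.
    """
    problems = []
    owner = {}  # cluster index -> name of the file that claimed it first
    for name, start in file_starts.items():
        cluster = start
        seen = set()  # clusters already visited in this file's chain
        while cluster != END:
            if not isinstance(cluster, int) or not 0 <= cluster < len(table):
                problems.append(f"{name}: chain points outside the table ({cluster})")
                break
            if table[cluster] is FREE:
                problems.append(f"{name}: chain uses free cluster {cluster}")
                break
            if cluster in seen:
                problems.append(f"{name}: chain loops at cluster {cluster}")
                break
            if cluster in owner:
                problems.append(f"{name}: cluster {cluster} is also used by {owner[cluster]}")
                break
            seen.add(cluster)
            owner[cluster] = name
            cluster = table[cluster]
    return problems


# Made-up table: a.txt occupies clusters 0 -> 1 -> 2, b.txt runs 3 -> 4 -> 2.
table = [1, 2, END, 4, 2, FREE]
for problem in check_chains(table, {"a.txt": 0, "b.txt": 3}):
    print(problem)
```

A real tool checks many more structures than this, but the principle is the same: walk the metadata and flag anything the specification does not allow.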

The second technique is to assume very little about the state of the file system being analyzed and to use hints and fragments of the undamaged file system to rebuild the destroyed file system from scratch.

There are numerous data recovery tools available on the market. A simple online search turns up thousands of companies, along with descriptions of the tools they offer.

Different data recovery tools work in different ways, though most use the same concept. The method of recovery depends on the type and extent of damage.

Most software data recovery tools are quite ineffective when the damage is physical. Physical damage to a drive requires completely different recovery techniques from logical damage.

Selection of the right data recovery tool depends on a number of factors like the type and extent of damage, effectiveness of the tool, and its cost.

Plan Your Data Analysis in 3 Steps



Back when I was doing psychology research, one of my biggest challenges was having enough data. And while this is probably still true for many experimental researchers, with the internet’s ability to make data sets accessible, it seems data is reproducing faster than rabbits.

Now I see many researchers grappling with overwhelm at managing and analyzing enormous data sets.

Even a moderate number of variables can lead to endless variations on analyses. “Hmm, I wonder if the stem length is correlated with the wind direction. What about this measure of plant size, leaf area? Is that correlated with wind direction?” This can go on forever.

According to Frank Scarpaci, owner of Project Designworks, there is a

“1:10:100 Rule:
Every dollar spent on planning and preparation saves $10 on project work or $100 on fixing problems after the project is done.”

And since, even in academia, time is money (or even more precious), planning your statistical analysis will save endless time and frustration. I mean, you’d rather spend an hour now planning the analysis than redoing it in a year after the reviewers rip it to shreds, right?

The best time to plan the analysis is before collecting data. This prevents those (all too common) situations where you realize you needed another variable or you should have measured something on a different scale. Grant applications force you to do this, but every study would benefit.

How do you plan it? You base it on the results you will report. You should already know the questions you want to answer in this study, but having them written down in a list keeps you on track. You do not need to (and should not) answer every question the data could answer in this study.

I find a great outline for a simple analysis plan comes from a brilliant article written by Daryl Bem about writing journal articles. The entire article is excellent (and I highly recommend it), but most helpful for planning is the section, “Presenting the Findings”. This section outlines 7 steps for reporting each finding. For planning purposes, I condense these into three:

1. State the conceptual hypothesis you are testing
2. Restate this hypothesis in the terms of the variables that measure the concept
3. List the statistical test or method that will answer this question

Simply repeat these three steps for all hypotheses the study is set up to answer. Start with the most general and important, and work down from there.
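One way to keep such a plan honest is to write it down as data before any analysis code exists. The sketch below is only an illustration: the class name, field names, hypotheses, variable names, and tests are all invented for this example and do not come from Bem's article.

```python
from dataclasses import dataclass


@dataclass
class PlannedAnalysis:
    conceptual_hypothesis: str   # step 1: the question in conceptual terms
    operational_hypothesis: str  # step 2: the same question in terms of measured variables
    variables: list              # step 2: the variables that measure the concepts
    method: str                  # step 3: the statistical test or method


# Hypothetical plan, most general and important hypothesis first.
plan = [
    PlannedAnalysis(
        conceptual_hypothesis="Wind exposure limits plant growth",
        operational_hypothesis="stem_length is negatively correlated with wind_speed",
        variables=["stem_length", "wind_speed"],
        method="Pearson correlation",
    ),
    PlannedAnalysis(
        conceptual_hypothesis="Wind exposure limits leaf development",
        operational_hypothesis="leaf_area differs between sheltered and exposed plots",
        variables=["leaf_area", "plot_exposure"],
        method="Independent-samples t-test",
    ),
]


def format_plan(entries):
    """Render the plan as numbered text, one block per hypothesis."""
    lines = []
    for number, entry in enumerate(entries, start=1):
        lines.append(f"{number}. {entry.conceptual_hypothesis}")
        lines.append(f"   Restated:  {entry.operational_hypothesis}")
        lines.append(f"   Variables: {', '.join(entry.variables)}")
        lines.append(f"   Method:    {entry.method}")
    return "\n".join(lines)


print(format_plan(plan))
```

A plain text file or spreadsheet works just as well. The point is that each hypothesis carries its variables and its test with it.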

Doing these three steps before you sit down to analyze has three advantages:

1. It forces you to choose the variables you will use. Choosing early allows you to drop unwanted variables from large data sets, making processing time much faster. It also defines the variables on which to conduct univariate analysis, and precisely defines which variables to collect.
2. It discourages you from performing irrelevant analyses, saving time, energy, and frustration as well as making your article clearer and more logical.
3. It makes writing the results section a breeze.
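The first advantage can be sketched in a few lines: once the plan names its variables, everything else can be dropped before any analysis runs. The column names and values below are made up for illustration, and the helper name is hypothetical.

```python
import csv
import io


def keep_planned_columns(rows, planned_variables):
    """Yield each row reduced to the planned variables, in plan order."""
    for row in rows:
        yield {name: row[name] for name in planned_variables}


# Stand-in for a large data file; a real study would open a file instead.
raw = io.StringIO(
    "plant_id,stem_length,leaf_area,wind_speed,soil_ph,observer\n"
    "1,12.5,30.1,4.2,6.8,AB\n"
    "2,9.8,22.4,7.9,6.5,CD\n"
)

planned = ["plant_id", "stem_length", "wind_speed"]  # taken from the analysis plan
reduced = list(keep_planned_columns(csv.DictReader(raw), planned))

for row in reduced:
    print(row)
```

Because the helper is a generator, rows are filtered one at a time, so the full-width data set never has to sit in memory.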

It is certainly true that there is a place for exploratory data analysis, and some surprises always pop up: missing data, an unexpected skew, and so on. But getting back on track is always easier if you know the direction in which you’re heading.

For more information, see:

Bem, D. (2003). “Writing the Empirical Journal Article.” In Darley, J. M., Zanna, M. P., & Roediger III, H. L. (Eds.), The Compleat Academic: A Practical Guide for the Beginning Social Scientist, 2nd Edition. Washington, DC: American Psychological Association.

Introduction to Dimensional Modeling for Data Warehousing Part 2, Dimensional Modeling Principles



In part 1 of this article series, we described the general structure of a dimensional model. In the present article we describe the basic design principles of dimensional modeling. Dimensional modeling follows the four steps defined below.

A. Selection of the business process (or processes) whose performance is to be monitored. Priority should go to business processes whose performance is considered critical and for which sufficient relevant data exists (e.g. operations data derived from these processes). The selected business process may relate to a single organizational unit or span more than one. Capturing a single data stream for an ‘end-to-end’ process avoids having different departments capture overlapping information, which can lead to many versions of the truth.

B. Determination of the level of detail at which the process shall be monitored (also called the grain statement). The grain statement is the first step in a dimensional model design. Examples of grain statements are: