Advice on Building Your Own Dataset

Over the past month or so, I’ve been collecting various bits of data for a personal project that will initially be used for a term paper and later (possibly) for my dissertation research.  I thought this would be a great opportunity to blog about my experience compiling a “medium N” dataset (N = 364 at the moment, and it can’t grow much larger).

#1: Don’t Do It

If you can avoid creating your own dataset and can piggyback off of the thousands of fairly reliable, publicly available datasets – don’t attempt this at home, kids.  Parsing bits of data that are scattered across multiple sources in various formats into one spreadsheet is incredibly time-consuming, stressful, annoying, and tedious.

#2: Plan Ahead

If you must create a dataset from scratch, or at least nearly from scratch using some of the work someone else has already done, be prepared for the time commitment.  Coding takes longer than you might think.  If your paper is just for a class and you don’t intend to push it forward for conferences, publication, or a dissertation chapter, refer to #1 and find something feasible that has already been compiled.  If you are ready for the long haul, make sure you are prepared for it.  Which brings us to #3 –

#3: Be prepared to recode…and recode…and recode

First of all, your sources for data will not be user-friendly. They are all going to give you observations in various forms – HTML, Excel, SPSS, STATA, etc.  Sometimes they’ll use commas to separate the thousands (which STATA hates, btw).  Other times, there’ll be massive gaps in the data and you’ll have to comb the inter-universe for alternative sources to fill in those gaps.  While you’re combing to fill in those gaps, you’ll probably find data on new variables you want to test, and also on observations that were missing from other unrelated variables – which ultimately leads to more coding and recoding.
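The comma-separated thousands are an easy one to script away before the numbers ever touch your stats package. A minimal sketch in Python (the values here are invented for illustration, not from my dataset):

```python
def clean_number(raw):
    """Turn a string like '1,234,567' into a plain int that STATA will accept."""
    return int(raw.replace(",", ""))

# Example of the kind of mess a source spreadsheet hands you:
raw_values = ["12,345", "1,234,567", "890"]
cleaned = [clean_number(v) for v in raw_values]
print(cleaned)  # [12345, 1234567, 890]
```

Running every imported column through something like this once beats hand-deleting commas cell by cell.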

#4: Stay organized

Keep your data files organized.  This means labeling your variables in a pragmatic fashion, mostly so you don’t forget what they are, but also so that eventual replicators [aka freeloaders like those who do #1] can use the data easily.  Make sure you keep track of ANY and ALL sources.  Where did I get that birthrate variable again?  This is a bad position to be in when it comes time to write the codebook/appendix for the paper.  It also means making sure that all your files are organized on your Mac (or PC? Who uses those?).  This is a major pitfall of mine. I save stuff to the desktop, to random folders, to Dropbox. Everywhere. It’s insane.
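One way to never ask the birthrate question again is to log every variable’s source the moment you code it. A hypothetical sketch (variable names, labels, and sources here are made up as examples):

```python
import csv

# One row per variable: what it is, where it came from, and when you grabbed it.
codebook = [
    {"variable": "birthrate", "label": "Crude birth rate per 1,000",
     "source": "World Bank WDI", "date_accessed": "2013-02-01"},
    {"variable": "lit_rate", "label": "Adult literacy rate (%)",
     "source": "UNESCO", "date_accessed": "2013-02-03"},
]

with open("codebook.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=list(codebook[0].keys()))
    writer.writeheader()
    writer.writerows(codebook)
```

When it’s time to write the actual codebook/appendix, most of the work is already done.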

#5: BACK UP!!!!!

There is nothing worse than pouring a month of your life into coding data only to have your favorite spreadsheet software or the computer itself freak out on you and lose everything (or at least what you coded in the last 15 minutes).  So make sure you back up anywhere and everywhere, all the time. I’m talking about an external drive, a Dropbox folder, a flash/thumb/pen (whatever you like to call it) drive, email, servers – print it out if you have to!
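You can even automate the paranoia. A small sketch (folder names are my invention) that copies the working file to a timestamped backup, so no save ever clobbers an earlier one:

```python
import os
import shutil
import time

def backup(path, backup_dir="backups"):
    """Copy `path` into backup_dir under a timestamped name and return the copy's path."""
    os.makedirs(backup_dir, exist_ok=True)
    stamp = time.strftime("%Y%m%d-%H%M%S")
    name, ext = os.path.splitext(os.path.basename(path))
    dest = os.path.join(backup_dir, f"{name}-{stamp}{ext}")
    shutil.copy2(path, dest)  # copy2 preserves timestamps too
    return dest
```

Run it every time you finish a coding session (or wire it into a cron job) and the 15-minutes-lost scenario mostly disappears.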

#6: Use your network

Don’t get discouraged when the World Bank doesn’t have that perfect proxy you were looking for or when Afrobarometer didn’t ask the specific question you need for a given round.  Use your resources – especially your professional and personal networks (this means you need to hone these first).  There’s not always going to be a perfect proxy (see the point on flexibility below), but there might be someone in your network who has access to it if it does exist, or who at least has suggestions for alternatives.  Don’t be afraid to call in the cavalry.  Besides, they might end up using your data for their own project later.

#7: When the network fails, cold call

If your network fails and you still can’t get your hands on the data – start cold calling leaders in the field. Sometimes this works, sometimes it doesn’t.  I’ve found it effective when the data exist but are currently not publicly available.  There’s always a chance of success if you email or call the PI or professor managing the data or who is prominent in the field. If you don’t email or call, there’s no chance at all that they will help you (duh!).  Besides, from the cold call exchange you might be able to add him/her to your network, thus avoiding this step next time.

#8: Be flexible

You’ve read all this great literature and know exactly which variables need to be tested and which ones have to be controlled for.  Great. Now what?  Well, unfortunately, proxies are proxies.  They are intended to approximate reality, but they aren’t going to capture everything.  The best proxy (which has already been used by someone somewhere – yay! It’s coded and available) is probably not the “best” in terms of completeness.  Especially if you study SSA (whoop!), you’ll find that indicators that are typically complete elsewhere (like literacy rates) are sparse when it comes to the continent.  This is where the creativity comes in – find something that is a close approximation.  Worst-case scenario – drop the variable, but explain why in your codebook and your paper.
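Before committing to a proxy, it’s worth checking how much of your sample it actually covers. A toy sketch (the variables and numbers are invented; `None` marks a missing observation):

```python
# Candidate proxies for the same concept, with invented coverage patterns.
proxies = {
    "literacy_rate": [71.2, None, None, 58.4, None],
    "primary_enrollment": [82.0, 75.5, None, 90.1, 68.3],
}

# Share of observations that are non-missing for each candidate.
coverage = {var: sum(v is not None for v in vals) / len(vals)
            for var, vals in proxies.items()}
best = max(coverage, key=coverage.get)
print(best, coverage[best])  # primary_enrollment 0.8
```

The conceptually "best" proxy and the best-covered proxy are often not the same variable, which is exactly the trade-off this tip is about.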

#9: You might develop a bit of OCD

Every time you read something new and/or pursue a new variable, you are bound to want to correct stuff you’ve already coded. Remember the advice on being flexible?  Well, don’t be so flexible that you incorporate everything, and don’t be so obsessive that you spend the rest of your life coding and never run a single cross-tab, let alone a regression.  Breathe. It won’t ever be perfect.

#10: Give Credit where Credit is Due

This goes back to #4.  At the end of the day, someone else has gone through the same terrible ordeal as you, but possibly worse, as they have looked through hospital records, financial papers, and/or handwritten census data. Make sure you cite your sources.  And absolutely make sure you acknowledge those in your network or new network (through cold calling) who provided you with help and/or the data itself.


We have to draw the line somewhere, stop coding variables, and start running models.  Knowing where to draw the line is tough, and of course the line should be flexible (see the point on flexibility above) if you find that your existing data are inadequate.  But at the end of the day, the dataset is not a paper.  While it might seem like a work of art, there is much exploration left to do.  It’s time for some t’s, p’s, R’s, and of course z’s.

