Data Quality: The Heart of Big Data

After last week’s post on the promise and perils of big data, I wanted to pursue the discussion further around data quality. This is usually covered by “veracity” and “validity,” the additional “Vs” of big data. In my experience, these two go hand-in-hand and speak to the issue at the heart of driving business value from big data. If users are not confident in the data quality, it doesn’t matter what insights the system delivers; no adoption will occur.

Merriam-Webster defines something as valid when it is “well-grounded or justifiable: being at once relevant and meaningful; logically correct.” Veracity is defined as “something true.” In the big data conversation (as found on insideBIGDATA), veracity refers to the biases, noise and abnormalities found in data, while validity refers to accuracy and correctness. At its core, the question of data quality comes down to whether the data you have is reliable and trustworthy for making decisions.

In a January 2015 article on big data veracity, IBM’s Jean Francois Puget speculated that uncertain data will account for 80% of your data, and that data cleansing makes up 80% of an analytics project. I wholeheartedly agree that 80% of data analytics project time should be spent on data cleansing. Unfortunately, in my experience, project urgency and over-promising reduce this time significantly.

While the reality might be slightly alarming, I think there are steps in the process that can minimize data quality issues.

  1. Make sure the team understands the business problem – How can you know what data you need if you don’t understand the business problem? Further, how can you know whether your data is accurate or true for solving it? The necessary data analysis and quality checks become obvious once the project team is grounded in the business problem.
  2. Map out the data needed to solve the business problem – Once you understand the business problem, you can start to map out the data you need to solve it. You’ll want to consider how the data will be sourced. Data obtained from outside your sphere of influence (department, organization, etc.) may require additional manipulation and cleansing to get to a usable state.
  3. Analyze your source data – Even if you are comfortable with the data source, you will still want to do some analysis once you start receiving data. Just because documentation says something is true does not make it so. Values may differ from what you expect, which could have a significant impact on your model or visualization.
  4. Make decisions about the business rules – It is very rare for data to be usable without any manipulation or cleansing. In response to steps 1-3, decide how the data needs to be manipulated:
    1. How should the specific business rule be applied (e.g., fill gaps with the average of the last 10 records)?
    2. When will the business rule run (i.e., at what step of the process)?
    3. Who (specifically, which system) is responsible for applying the business rule? Is it the source system, an intermediary system (like MuleSoft) that feeds the data to you, the data load process, or a post-load process?
  5. Clearly document & get sign-off on everything captured above – Leveraging big data to deliver business value is hard. Documentation acts as a reminder of what was done and why. It should clearly define the problem, the data, the sources and the business rules, as well as confirm that the project team and audience agreed to the path taken.
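To make the business-rule decisions in step 4 concrete, here is a minimal sketch of the example rule from 4.1 (fill gaps with the average of the last 10 records). The function name, the use of None to mark missing values, the window size and the sample readings are all illustrative assumptions, not a prescribed implementation:

```python
from statistics import mean

def fill_gaps(values, window=10):
    """Replace missing readings (None) with the mean of the last
    `window` known values -- one possible gap-filling business rule."""
    filled, history = [], []
    for v in values:
        if v is None and history:
            v = mean(history[-window:])  # apply the rule
        filled.append(v)
        if v is not None:
            history.append(v)           # track known values for later gaps
    return filled

# hypothetical sensor feed with gaps
readings = [10.0, 12.0, None, 11.0, None, None, 13.0]
print(fill_gaps(readings))  # → [10.0, 12.0, 11.0, 11.0, 11.0, 11.0, 13.0]
```

Trivial as it looks, even this rule forces the questions in 4.2 and 4.3: a rule like this behaves differently depending on whether it runs in the source system, the load process, or a post-load step.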

During my projects, I spend a lot of time on data analysis. As the project manager, I want to fully understand the data, and I want confidence that it is correct and reliable. That same effort needs to be made with project stakeholders and end users to give them the same comfort level. Each validated assumption is a proof point for building trust, and it is trust that yields adoption.


Big Data: The Promise and the Peril

Last night, Women in Technology hosted an amazing panel on Big Data for their monthly WIT.Connect. Carol Hayes (Navy Federal Credit Union), Carrie Cordero (Georgetown University), Kim Garner (Neustar Advisory Services Group), Rashmi Mathur (IBM Global Business Services), and Stacey Halota (Graham Holdings Company) joined moderator Susan Gibson (ODNI) for a discussion on the promise and perils of big data. I’ve compiled my notes to share with you.


How Do We Define “Big Data?”

The first stop in the conversation was for the panel to define “big data.” Carol provided us with a brief history of the term, starting with its 1997 use by a group of NASA scientists who were primarily referring to memory & disk space issues. The term was further legitimized in 2014 when it was added to the Merriam-Webster and Oxford dictionaries. This usually requires at least 10 years of use, which points to even earlier usage. The panel definition referred to the 4 Vs – volume, variety, velocity & veracity. UC Berkeley researchers are now pushing to include elements of analytics & security.

Does Big Data Live Up to the Hype/Promise?

Much has changed about data over the years. Not only have we seen significant increases in data volume, but about 80% of big data is unstructured. Historically, data has been very siloed. The struggle for most businesses is figuring out how to use the data to make business decisions. The goal for many is to form a 360-degree view of the customer. Technology is constantly improving to help tackle this challenge. Some of the specific use cases discussed included fraud detection and marketing.

This week’s news brought us word of IBM’s $2.6 billion acquisition of Truven Health.  Susan wondered if investments of this size are worth it. Rashmi challenged us to look beyond a single investment and focus on the broader IBM initiative to use data to improve health, and view this acquisition as part of a broader healthcare ecosystem.

On the topic of whether small businesses get left out of the big data conversation, we were reminded that all businesses have very valuable data. Leveraging the data you have with 3rd party data can significantly increase the data value. For small businesses, the bigger challenge tends to be having the internal resources available to drive the business questions and analysis.

What About the Perils? Privacy? Security?

In thinking about the perils, privacy & security of big data, we need to consider them from the perspectives of housing, aggregating & sharing the data. Stacey challenged us to answer the question “how much data are we deleting?” on a regular basis. She recommends at least having an annual discussion. Before that, you should have already done a data inventory to document what you have & why you have it. Stacey cautions us to be straightforward in thinking about the data you should (or shouldn’t) have. For those working with their clients’ data, it all begins with the business question. Once you know what the business goal is, you can decide what data you need. You will need to consider the balance of risk, opportunity and cost.

The collection of data outpaces laws & compliance, which has resulted in a decade of breaches. Protecting information is a governance issue, not a technical one; governance should drive protection. Advances in big data technology have resulted in newer tools that include security measures from the beginning (rather than adding them as an afterthought).

It’s agreed that privacy & security issues impact businesses of all sizes. Unfortunately, smaller organizations take on higher risk as a result of limited structure & longevity; it is much harder for a small business to survive the reputational hit a breach can cause. We need to ensure that employees get the education needed to handle data. There should be no distinction between programming and secure programming. Ideally, security becomes so ingrained in our business processes that it just happens, without the need for separate functions.

The panel recommended these 5 actions for getting ahead of compliance issues:

  1. Review the California recommendations on breach of personal information
  2. Review the ISO 27001 information security standards
  3. Establish an Incident Response Plan that outlines points of contact, forensic partner(s), legal counsel, etc.
  4. Have a plan & test it
    • there are incident response simulation consultants to assist you
    • the general process is to answer a list of questions & receive a checklist of legal & custom actions to take
  5. Share incident information with other companies within the industry

Inherent in housing, aggregating, analyzing and sharing data is risk. How much risk is too much? That will depend on the nature of the organization and the data. IBM has the business group respond to a simple questionnaire that helps drive that assessment during the initial phases of new projects.

While the discussion touched on global compliance for companies, this is currently in flux. The Safe Harbor framework, which allowed US companies to self-certify that they handle data in a way consistent with EU requirements, was recently challenged. Privacy Shield is the new framework being developed to outline the new requirements. Global companies should be watching these decisions closely.

Where Will the Future Take Us?

  • Tell me before it happens – Companies are leveraging historical data to predict the future. Increasingly, companies want to be told what will happen before it happens. Insights will get crisper as we begin to home in on relevance.
  • Data journalism – the ability to tell stories with data is the wave of the future.
  • Natural language and machine learning will make people smarter. It’s all about enablement.
  • Threat intelligence – the sharing of information becomes more critical.
  • Regulatory compliance – inequality and accountability of government versus private sector will converge.

Any Parting Words of Advice?

The consensus is that big data yields plenty of opportunity. It’s one of the few industries with abundant educational opportunities and effectively negative unemployment. “Any career with ‘data’ is good.” Be sure to look at degrees and certifications, but those aren’t required. Natural curiosity lends itself nicely to the human side of data analytics. Deliver “the art of the science.”

One final Afterthought

No good conversation about big data occurs without a mention of the veracity of data. Long before modeling or analysis begins, significant time is spent on ensuring good data exists. Thought and care need to go into cleaning the data, filling in missing data, and ensuring the data makes sense.

4 Considerations When Choosing Project Management Tools

Project Managers tend to be very opinionated on the “right” PM tools to use. This discussion, and interview question, has always frustrated me. I strongly believe that it doesn’t matter what tools you as the Project Manager prefer; you should be able to adapt to whatever tools work best for the organization and your customer. If you Google “recommended project management tools” or “the best project management tools” you get between 34 million and 125 million results. In my experience, there are four key considerations I use as the basis for deciding which tools will work best.


Keep it Simple! The least common denominator is usually the best option. Tons of fancy features aren’t generally very helpful, and you often pay for the luxury of having them available. Apply the 80/20 rule to your evaluation process – 80% of your needs will be fulfilled by 20% of the features & functionality in most project management tools.


Who is your audience? The tools you leverage with your project team or for your own planning will probably be different than those you use with broader stakeholders. I have found that simple tools are often best. While I can develop a complex Gantt chart, and will sometimes do so for my own planning, I find basic word-processing & spreadsheet applications are easier for less technical (or less PM-focused) audiences.


Is your audience internal or external to your organization or team? Oftentimes you need to deliver project status and meeting notes to an audience with internal and external participants. These call for more static update tools, rather than interactive ones; they tend to capture status, action items, changes, decisions, etc. This is also the right kind of tool for conveying priorities or tasks to another team within your organization with whom you may share resources. Additionally, you will want to figure out how to make project artifacts easily accessible. This may be through an intranet for internal access only, or a collaborative/extranet solution.

If your audience is located in your same office, I would highly recommend using tools like white boarding and sticky notes to brainstorm and organize. Recent studies (here’s Inc’s write-up) have shown that leveraging less technical tools improves creativity and brain power.


What’s available to you? This consideration encompasses cost and availability. Sometimes your decision is just made for you. If you are running your project budget or organization very lean, you’ll want to leverage open-source or free software. Alternatively, if your organization has negotiated a software licensing deal that gives you access to specific tools, it may make the most sense to just use what’s available.

At the end of the day, don’t overthink your tool selection. Decide what your critical goals are, and find the tools that meet them. There have been significant advances in project management tools, providing plenty of options. Not all of them are right for all projects, teams or organizations. Sometimes it’s best to get everyone in the room, using the most basic tools of all.

My Take on the Data & Women DC Inaugural Event

It’s been a while since I wrote about women in tech, but I attended the Data + Women DC Inaugural Event last night, hosted by CHIEF (check out their blog for their monthly events), and was really inspired by what I saw and heard. In some ways the format was like all other meetups, networking followed by a program, but this group did something a little bit different by splitting into smaller groups for more intimate discussions. It was definitely easier to get to know people, and as one person in my group said, “maybe all meetups need to treat each event like it’s an inaugural one, and give everyone a smaller forum to be heard.” I tend to agree.

Unfortunately (or, given the aforementioned feedback, fortunately), I was coming from another appointment, so I missed the networking. I caught part of the panel discussion and then all of the small group discussion. We hit on quite a few pieces of advice and considerations that I wanted to share.


Bragging & Self-Promotion

One of the most critical points made in response to the question about what you and/or your company can do to help advance women was about bragging. Often we are uncomfortable with other people bragging about our work, especially if it’s unexpected. It’s important to promote the work you do, and if you’re not comfortable doing that, then having your friends and colleagues do it for you may help make it more comfortable. One participant said she was going to take that recommendation back to her corporate Lean In circle.

Emotion & Passion

We definitely touched on not allowing your emotions to get in the way of your passion (or lack thereof). Several participants shared their experience of setting a goal to accomplish X just to prove they could, then realizing part of the way through that they didn’t want or like it. In the same vein, if something isn’t working for you in your current role or with your current company, it’s within your rights to fix it. And if your company isn’t willing to work with you, then it’s time to fulfill that need somewhere else.

Confidence & Competence

We had a fairly extensive conversation about women’s confidence & perceived competence. There have been many studies showing that men interview for potential (what they believe they can do) while women interview for performance (what they know they can do). The overall consensus was that we want to be true to who we are and what our capabilities are, while still acknowledging what we can do. Some discussion occurred around how frustrating it is to work with people who say they “can do everything” but in reality can only do some portion of it. This conversation brought to mind the differences I see between male and female developers. Many male developers I know will say they have experience in languages a, b and c, and therefore have learned the programming methodologies and frameworks and feel they can do languages d, e and f. Female developers that I know tend to put more weight on what they have done (i.e., languages a, b and c). I hope female developers will become comfortable enough to take the same stance as men, extrapolating from their experiences to what they can do.

Inherent Bias in Open Source and Software Language Naming

Our group shared some interesting experiences with open source apps and software language naming conventions. One participant was recently using an app and came across very male-gendered language in its examples and documentation. By pushing the issue on social media, she was able to get some changes made, but no clear alternatives to the problem. Another participant noted that software languages named for women tend to use very comfortable, personal first names (Ada, Ruby, Julia), which is rather interesting when you consider the aggressive verbs (“wrangling”, “manipulating”) applied to them. I can’t say that I had observed either of these first hand. I wonder if I just don’t notice.

I had a great time connecting with my small group. I hope I represented our conversation well. And I hope to see everyone again.