How IBM builds an effective data science team
Data science is a team sport. This sentiment rings true not only with our experiences within IBM, but with our enterprise customers, who often ask us for advice on how to structure data science teams within their own organizations.
Before structuring a team, however, it’s important to remember that the various skills required to execute a data science project are both rare and distinct. That means we need to make sure that each team member can focus on what they do best.
Consider this breakdown of a data science project, along with the skills required for each role:
While each role is distinct, each team member needs T-shaped skills: depth in their own role, plus at least a cursory understanding of the adjacent roles.
Let’s explore each role from the chart in a little more depth.
Product owners are the subject matter experts, with a deep understanding of the particular business sector and its concerns. In some instances, the product owner’s primary role is on the business side, and they work periodically with the data science team to address a specific data science problem or set of problems before cycling back into the broader role.
In fact, cycling back to the normal role is a benefit to the data science team. It means the product owner acts as the ultimate end user of the models and can offer concrete feedback and requests. It also means the product owner can advocate for data science from within the business units themselves.
Product owners are most often responsible for:
- Defining the business problem and working with data scientists to define the working hypothesis
- Helping to locate data and data stewards as necessary
- Brokering and resolving data quality issues
Data engineers are the wizards who move all the data to the center of gravity and connect that data via services and message queues. They also build APIs to make the data generally available to the enterprise, and they’re responsible for engineering the data onto the platform that best fits the needs of the team. With data engineers, we look for these top three skills:
- Proficient in at least three of the following: Python, Scala, Java, Ruby, SQL
- Proficient at consuming and building REST APIs
- Proficient at integrating predictive and prescriptive models into applications and processes
Data scientists tend to fill one of two distinct roles: machine learning engineers and decision optimization engineers. Because market demand has made “data scientist” such a hot role, the title is often applied loosely, and making this distinction removes some of that ambiguity. (For our detailed thoughts on this, see our recent article on VentureBeat.)
Machine learning engineers
Machine learning engineers build the machine learning models, which means identifying the important data elements and features to use in each model. They determine which types of models to use, and they test the accuracy and precision of those models. They’re also responsible for the long-term monitoring and maintenance of the models. They need these top three skills:
- Training and experience applying probability and statistics
- Experience in data modeling and evaluation and a deep understanding of supervised and unsupervised machine learning
- Experience programming in at least two of the following: Python, R, Scala, Julia, or Java, with a preference for Python expertise
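As a small illustration of what testing “accuracy and precision” can mean in practice, the sketch below computes both metrics for a binary classifier using only the Python standard library. The function name `accuracy_and_precision` and the sample labels are hypothetical, chosen for illustration; a real team would more likely use an established evaluation library and a held-out test set.

```python
def accuracy_and_precision(y_true, y_pred, positive=1):
    """Return (accuracy, precision) for two equal-length lists of binary labels."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    if not y_true:
        raise ValueError("at least one label is required")

    pairs = list(zip(y_true, y_pred))
    # Accuracy: share of all predictions that match the true label.
    correct = sum(t == p for t, p in pairs)
    # Precision: share of predicted positives that are truly positive.
    true_pos = sum(t == positive and p == positive for t, p in pairs)
    pred_pos = sum(p == positive for p in y_pred)

    accuracy = correct / len(y_true)
    precision = true_pos / pred_pos if pred_pos else 0.0
    return accuracy, precision


# Illustrative (made-up) labels: 1 = positive class, 0 = negative class.
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 1]
accuracy, precision = accuracy_and_precision(y_true, y_pred)
print(f"accuracy={accuracy:.3f} precision={precision:.3f}")
```

The two numbers answer different questions, which is why both are worth tracking: accuracy counts every correct prediction, while precision looks only at how trustworthy the positive predictions are.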
Decision optimization engineers
The skills and experience of decision optimization engineers overlap with those of machine learning engineers, but the differences are important. Decision optimization engineers need these top three skills:
- Experience applying mathematical modeling and/or constraint programming to a range of industry problems
- Proficiency programming in Python and the ability to use predictive models as input to decision optimization problems
- Experience building Monte Carlo simulation/optimization for what-if scenario analysis
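To make the last two skills more concrete, here is a minimal sketch of a Monte Carlo what-if analysis in which a predictive model’s output feeds a decision. Everything in it is an assumption made for illustration: the scenario (choosing an order quantity), the function names `simulate_profit` and `best_order_quantity`, the normally distributed demand forecast, and the cost and price figures. It is not a description of IBM’s tooling.

```python
import random


def simulate_profit(order_qty, forecast_mean, forecast_std,
                    unit_cost, unit_price, n_trials=10_000, seed=0):
    """Estimate average profit for one order quantity by simulating demand."""
    rng = random.Random(seed)  # fixed seed: every candidate sees the same draws
    total = 0.0
    for _ in range(n_trials):
        # Assumed predictive-model output: demand ~ Normal(mean, std), floored at 0.
        demand = max(0.0, rng.gauss(forecast_mean, forecast_std))
        sold = min(order_qty, demand)
        total += unit_price * sold - unit_cost * order_qty
    return total / n_trials


def best_order_quantity(candidates, **scenario):
    """Pick the candidate quantity with the highest simulated average profit."""
    return max(candidates, key=lambda q: simulate_profit(q, **scenario))


# Hypothetical what-if scenario; change any value to explore another one.
scenario = dict(forecast_mean=100.0, forecast_std=20.0,
                unit_cost=4.0, unit_price=10.0)
candidates = range(60, 161, 10)
best_qty = best_order_quantity(candidates, **scenario)
print("best order quantity:", best_qty)
```

Re-running the search with a different `forecast_std` or `unit_cost` is the what-if part: the same simulation shows how the recommended decision shifts as the assumptions change. A production system would typically hand this to a dedicated optimization solver rather than a brute-force search over candidates.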
That brings us to data journalists, the team members who help represent the output of the model in the context of the data that drove it and who can clearly articulate the business problem at hand. With data journalists, we look for these top three skills:
- Coding skills in Python, Java, or Scala
- Experience integrating data and the output of predictive and prescriptive models within the context of a business problem
- Proficiency with data parsing, scraping, and wrangling
If you can gather a team with these essential skills, and if you can ensure they collaborate well and maintain a meaningful understanding of one another’s work, you’ll be well on your way to uncovering the insights and understanding that can supercharge whatever organization you’re leading.
Without them, you could be flying blind.
Seth Dobrin is vice president and chief data officer at IBM Analytics.