Learn by Directing AI

Introduction to Data Science

What data scientists actually do

Data science is about answering questions with data, but "answering questions" covers a lot of ground. Here are the main roles you'll see:

Data Scientist (general). The broadest title. Frames business questions, explores data, chooses analytical approaches, builds models, validates results, and communicates findings. Works across description ("what happened?"), inference ("is this difference real?"), prediction ("what will happen?"), and sometimes causal analysis ("what caused this?").

Research Scientist / Applied Scientist. Heavier on methodology. Runs experiments, builds statistical models, publishes findings. More rigorous about assumptions and validation. Common in tech companies, healthcare, and policy research.

Decision Scientist. Focuses on the business side. Less modeling, more framing the right question, designing the right analysis, and translating findings into recommendations that decision-makers can act on.

Quantitative Analyst. Specialized in finance, risk, or pricing. Heavy on statistical modeling and forecasting. Often works with time series and financial data.

These roles overlap significantly. At a small company, one data scientist does all of it. The underlying workflow is the same regardless of title.

The professional loop

Every data science project, whether it's a simple summary or a complex causal analysis, moves through the same cycle:

1. Problem framing. What decision does the client face? What question does the data need to answer? What would a useful answer look like? Getting this wrong means doing excellent analysis on the wrong question.

2. Data audit and exploratory analysis. What data exists? What shape is it in? What's missing? What patterns are visible before any formal analysis? You can't answer a question with data you don't understand.

3. Question typology. What kind of question is this? Description, inference, causal, prediction, or forecasting? The answer determines the method. A causal question answered with a descriptive method gives a wrong answer that looks right.

4. Data preparation and assumption design. Clean the data, handle missing values, check assumptions. Every analytical method has assumptions, and violating them produces results that look valid but aren't.

5. Analysis and modeling. Run the analysis or build the model. This is where AI does the heaviest lifting, and where its mistakes are most consequential.

6. Validation and sensitivity analysis. Does the result hold up? What happens if you change the assumptions? Where does it break? A result that only works under one specific set of conditions isn't a finding. It's a coincidence.

7. Communication and recommendation. Translate the analysis into something the client can use. Not a technical report, but a recommendation with evidence, uncertainty honestly communicated, and limitations clearly stated.

8. Handoff or monitoring. If the analysis produces something that runs in production (a prediction model, a scoring system), hand it off to engineering or set up monitoring. Not every project reaches this step.

You'll run this loop in every project. What changes is the complexity: early projects give you a clean dataset and a clear question. Later projects give you messy data and vague problems, and expect you to figure out what kind of question you're even answering.
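A minimal sketch of the middle of the loop (steps 2 through 6), using pandas and scipy on invented before/after data. Everything here, the scenario, the column names, the numbers, is hypothetical, made up purely to show how exploration, question typing, analysis, and a sensitivity check connect:

```python
import numpy as np
import pandas as pd
from scipy import stats

# Hypothetical data: daily conversions before and after a site change.
rng = np.random.default_rng(42)
df = pd.DataFrame({
    "period": ["before"] * 30 + ["after"] * 30,
    "conversions": np.concatenate([
        rng.normal(100, 10, 30),   # before the change
        rng.normal(106, 10, 30),   # after the change
    ]),
})

# Step 2: exploratory look -- summary statistics per period.
print(df.groupby("period")["conversions"].describe())

# Steps 3-5: "is this difference real?" is an inference question, which
# maps to a two-sample test; Welch's t-test avoids assuming equal variances.
before = df.loc[df["period"] == "before", "conversions"]
after = df.loc[df["period"] == "after", "conversions"]
t_stat, p_value = stats.ttest_ind(after, before, equal_var=False)
print(f"t = {t_stat:.2f}, p = {p_value:.4f}")

# Step 6: a crude sensitivity check -- does the conclusion survive
# dropping the most favorable observation on each side?
trimmed_t, trimmed_p = stats.ttest_ind(
    after.drop(after.idxmax()), before.drop(before.idxmin()), equal_var=False
)
print(f"after trimming extremes: t = {trimmed_t:.2f}, p = {trimmed_p:.4f}")
```

If the trimmed p-value tells a different story than the original, that's exactly the kind of fragility step 6 exists to surface.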

What you'll work on

Each project is built for a client with a specific problem. You'll direct AI to analyze data, build models, and produce recommendations, then verify whether the analysis is sound and the conclusions are warranted. Here's a sample:

  • An exploratory analysis that reveals patterns in a client's business data
  • A statistical inference project that tests whether a change had a real effect
  • A prediction model deployed as a scoring system
  • A causal analysis that disentangles correlation from causation
  • A time series forecast with honest uncertainty bounds
  • A communication deliverable that translates complex findings for a non-technical audience

The projects get harder in specific ways. The questions get vaguer. The data gets messier. The methods get more demanding. The client stops telling you what kind of analysis they need. And throughout, AI is your primary tool, capable and fast, but prone to specific statistical mistakes that you'll learn to catch.

Core tools

These are the tools data scientists use daily. You'll set up the core ones in the track setup; the rest are introduced as projects need them.

Terminal. Your command line. Everything runs through it.

Claude Code. Your AI coding agent. You'll direct it to explore data, run analyses, build models, and produce visualizations. It's strong at writing analytical code, and it makes specific, predictable mistakes with statistics that you'll learn to catch.

Git and GitHub. Version control. Every project lives in a repository.

Python. The language of data science. Nearly every analytical library is Python-first. You don't need to be an expert programmer (you're directing AI), but you need to read Python and understand what it does.

Jupyter notebooks. Interactive documents where you run code and see results immediately. The standard environment for data exploration and analysis. Most data science work lives in notebooks.

pandas. Data manipulation. Loading, cleaning, transforming, and summarizing data. The workhorse library.
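The basic pandas rhythm, load, clean, transform, summarize, looks like this. The orders table below is invented for illustration; in real work it would come from `pd.read_csv` or a database query:

```python
import numpy as np
import pandas as pd

# Hypothetical orders data; in practice: orders = pd.read_csv("orders.csv")
orders = pd.DataFrame({
    "order_id": [1, 2, 3, 4, 5],
    "region": ["east", "west", "east", None, "west"],
    "amount": [120.0, 85.5, np.nan, 40.0, 300.0],
})

# Cleaning: fill a missing category explicitly rather than silently dropping it.
orders["region"] = orders["region"].fillna("unknown")

# Missing numeric values: dropping vs. imputing is an analytical choice,
# not a mechanical one -- here we drop, and record how much we lost.
n_dropped = int(orders["amount"].isna().sum())
clean = orders.dropna(subset=["amount"])

# Summarizing: revenue by region.
summary = clean.groupby("region")["amount"].agg(["count", "sum", "mean"])
print(summary)
print(f"dropped {n_dropped} rows with missing amounts")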

matplotlib and seaborn. Visualization. Charts, plots, and statistical graphics. How you see what the data looks like.
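A minimal matplotlib example, plotting a simulated monthly revenue series. The data is invented; the `Agg` backend line makes the script runnable without a display and isn't needed in a notebook, where figures render inline:

```python
import matplotlib
matplotlib.use("Agg")  # headless rendering; omit this line in a notebook
import matplotlib.pyplot as plt
import numpy as np

# Hypothetical monthly revenue series, invented for illustration.
rng = np.random.default_rng(0)
months = np.arange(1, 13)
revenue = 50 + 2 * months + rng.normal(0, 3, 12)

fig, ax = plt.subplots(figsize=(6, 3))
ax.plot(months, revenue, marker="o")
ax.set_xlabel("Month")
ax.set_ylabel("Revenue ($k)")
ax.set_title("Monthly revenue (simulated)")
fig.tight_layout()
fig.savefig("revenue.png")  # in a notebook, the figure displays inline instead
```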

scipy and statsmodels. Statistical analysis. Hypothesis tests, confidence intervals, regression models. Where the inferential and causal work happens.

scikit-learn. Machine learning for prediction tasks. Preprocessing, model training, evaluation. Used when the question is "what will happen?" rather than "what caused this?"
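The standard scikit-learn shape for a prediction task, on synthetic data invented for this sketch: split off a test set, fit on the training set only, and evaluate on data the model never saw. The pipeline keeps preprocessing inside the fit so test data is scaled with training-set statistics only:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Synthetic binary-outcome data: two features, label driven by their sum.
rng = np.random.default_rng(1)
X = rng.normal(size=(200, 2))
y = (X[:, 0] + X[:, 1] + rng.normal(0, 0.5, 200) > 0).astype(int)

# Held-out evaluation: the model is judged on data it never saw.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# Scaling lives inside the pipeline, so the test set is transformed
# with training-set statistics only (no leakage).
clf = make_pipeline(StandardScaler(), LogisticRegression())
clf.fit(X_train, y_train)
acc = accuracy_score(y_test, clf.predict(X_test))
print(f"test accuracy: {acc:.2f}")
```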

You'll install additional tools as the track progresses: causal inference libraries, forecasting tools, and others. Each project tells you what's needed.