{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Цель лабораторной работы\n",
"Изучить библиотеки обработки данных Pandas и PandaSQL."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Задание\n",
"Задание состоит из двух частей.\n",
"\n",
"## Часть 1\n",
"Требуется выполнить первое демонстрационное задание под названием «Exploratory data analysis with Pandas» со страницы курса [mlcourse.ai](https://mlcourse.ai/assignments).\n",
"\n",
"## Часть 2\n",
"Требуется выполнить следующие запросы с использованием двух различных библиотек — Pandas и PandaSQL:\n",
"\n",
"- один произвольный запрос на соединение двух наборов данных,\n",
"- один произвольный запрос на группировку набора данных с использованием функций агрегирования.\n",
"\n",
"Также требуется сравнить время выполнения каждого запроса в Pandas и PandaSQL."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Часть 1"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Unique values of all features (for more information, please see the links above):\n",
"\n",
"- `age`: continuous.\n",
"- `workclass`: `Private`, `Self-emp-not-inc`, `Self-emp-inc`, `Federal-gov`, `Local-gov`, `State-gov`, `Without-pay`, `Never-worked`.\n",
"- `fnlwgt`: continuous.\n",
"- `education`: `Bachelors`, `Some-college`, `11th`, `HS-grad`, `Prof-school`, `Assoc-acdm`, `Assoc-voc`, `9th`, `7th-8th`, `12th`, `Masters`, `1st-4th`, `10th`, `Doctorate`, `5th-6th`, `Preschool`.\n",
"- `education-num`: continuous.\n",
"- `marital-status`: `Married-civ-spouse`, `Divorced`, `Never-married`, `Separated`, `Widowed`, `Married-spouse-absent`, `Married-AF-spouse`.\n",
"- `occupation`: `Tech-support`, `Craft-repair`, `Other-service`, `Sales`, `Exec-managerial`, `Prof-specialty`, `Handlers-cleaners`, `Machine-op-inspct`, `Adm-clerical`, `Farming-fishing`, `Transport-moving`, `Priv-house-serv`, `Protective-serv`, `Armed-Forces`.\n",
"- `relationship`: `Wife`, `Own-child`, `Husband`, `Not-in-family`, `Other-relative`, `Unmarried`.\n",
"- `race`: `White`, `Asian-Pac-Islander`, `Amer-Indian-Eskimo`, `Other`, `Black`.\n",
"- `sex`: `Female`, `Male`.\n",
"- `capital-gain`: continuous.\n",
"- `capital-loss`: continuous.\n",
"- `hours-per-week`: continuous.\n",
"- `native-country`: `United-States`, `Cambodia`, `England`, `Puerto-Rico`, `Canada`, `Germany`, `Outlying-US(Guam-USVI-etc)`, `India`, `Japan`, `Greece`, `South`, `China`, `Cuba`, `Iran`, `Honduras`, `Philippines`, `Italy`, `Poland`, `Jamaica`, `Vietnam`, `Mexico`, `Portugal`, `Ireland`, `France`, `Dominican-Republic`, `Laos`, `Ecuador`, `Taiwan`, `Haiti`, `Columbia`, `Hungary`, `Guatemala`, `Nicaragua`, `Scotland`, `Thailand`, `Yugoslavia`, `El-Salvador`, `Trinadad&Tobago`, `Peru`, `Hong`, `Holand-Netherlands`. \n",
"- `salary`: `>50K`, `<=50K`."
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"import pandas as pd"
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"pd.set_option(\"display.width\", 70)"
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"
\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" age | \n",
" workclass | \n",
" fnlwgt | \n",
" education | \n",
" education-num | \n",
" marital-status | \n",
" occupation | \n",
" relationship | \n",
" race | \n",
" sex | \n",
" capital-gain | \n",
" capital-loss | \n",
" hours-per-week | \n",
" native-country | \n",
" salary | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" 39 | \n",
" State-gov | \n",
" 77516 | \n",
" Bachelors | \n",
" 13 | \n",
" Never-married | \n",
" Adm-clerical | \n",
" Not-in-family | \n",
" White | \n",
" Male | \n",
" 2174 | \n",
" 0 | \n",
" 40 | \n",
" United-States | \n",
" <=50K | \n",
"
\n",
" \n",
" 1 | \n",
" 50 | \n",
" Self-emp-not-inc | \n",
" 83311 | \n",
" Bachelors | \n",
" 13 | \n",
" Married-civ-spouse | \n",
" Exec-managerial | \n",
" Husband | \n",
" White | \n",
" Male | \n",
" 0 | \n",
" 0 | \n",
" 13 | \n",
" United-States | \n",
" <=50K | \n",
"
\n",
" \n",
" 2 | \n",
" 38 | \n",
" Private | \n",
" 215646 | \n",
" HS-grad | \n",
" 9 | \n",
" Divorced | \n",
" Handlers-cleaners | \n",
" Not-in-family | \n",
" White | \n",
" Male | \n",
" 0 | \n",
" 0 | \n",
" 40 | \n",
" United-States | \n",
" <=50K | \n",
"
\n",
" \n",
" 3 | \n",
" 53 | \n",
" Private | \n",
" 234721 | \n",
" 11th | \n",
" 7 | \n",
" Married-civ-spouse | \n",
" Handlers-cleaners | \n",
" Husband | \n",
" Black | \n",
" Male | \n",
" 0 | \n",
" 0 | \n",
" 40 | \n",
" United-States | \n",
" <=50K | \n",
"
\n",
" \n",
" 4 | \n",
" 28 | \n",
" Private | \n",
" 338409 | \n",
" Bachelors | \n",
" 13 | \n",
" Married-civ-spouse | \n",
" Prof-specialty | \n",
" Wife | \n",
" Black | \n",
" Female | \n",
" 0 | \n",
" 0 | \n",
" 40 | \n",
" Cuba | \n",
" <=50K | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" age workclass fnlwgt education education-num \\\n",
"0 39 State-gov 77516 Bachelors 13 \n",
"1 50 Self-emp-not-inc 83311 Bachelors 13 \n",
"2 38 Private 215646 HS-grad 9 \n",
"3 53 Private 234721 11th 7 \n",
"4 28 Private 338409 Bachelors 13 \n",
"\n",
" marital-status occupation relationship race \\\n",
"0 Never-married Adm-clerical Not-in-family White \n",
"1 Married-civ-spouse Exec-managerial Husband White \n",
"2 Divorced Handlers-cleaners Not-in-family White \n",
"3 Married-civ-spouse Handlers-cleaners Husband Black \n",
"4 Married-civ-spouse Prof-specialty Wife Black \n",
"\n",
" sex capital-gain capital-loss hours-per-week \\\n",
"0 Male 2174 0 40 \n",
"1 Male 0 0 13 \n",
"2 Male 0 0 40 \n",
"3 Male 0 0 40 \n",
"4 Female 0 0 40 \n",
"\n",
" native-country salary \n",
"0 United-States <=50K \n",
"1 United-States <=50K \n",
"2 United-States <=50K \n",
"3 United-States <=50K \n",
"4 Cuba <=50K "
]
},
"execution_count": 3,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"data = pd.read_csv('adult.data.csv')\n",
"data.head()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**1. How many men and women (`sex` feature) are represented in this dataset?** "
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"Male 21790\n",
"Female 10771\n",
"Name: sex, dtype: int64"
]
},
"execution_count": 4,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"data[\"sex\"].value_counts()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**2. What is the average age (`age` feature) of women?**"
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"36.85823043357163"
]
},
"execution_count": 5,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"data[data[\"sex\"] == \"Female\"][\"age\"].mean()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**3. What is the percentage of German citizens (`native-country` feature)?**"
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"0.420749%\n"
]
}
],
"source": [
"print(\"{0:%}\".format(data[data[\"native-country\"] == \"Germany\"]\n",
" .shape[0] / data.shape[0]))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**4-5. What are the mean and standard deviation of age for those who earn more than 50K per year (`salary` feature) and those who earn less than 50K per year?**"
]
},
{
"cell_type": "code",
"execution_count": 7,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"<=50K: = 36.78373786407767 ± 14.020088490824813 years\n",
" >50K: = 44.24984058155847 ± 10.51902771985177 years\n"
]
}
],
"source": [
"ages1 = data[data[\"salary\"] == \"<=50K\"][\"age\"]\n",
"ages2 = data[data[\"salary\"] == \">50K\"][\"age\"]\n",
"print(\"<=50K: = {0} ± {1} years\".format(ages1.mean(), ages1.std()))\n",
"print(\" >50K: = {0} ± {1} years\".format(ages2.mean(), ages2.std()))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**6. Is it true that people who earn more than 50K have at least high school education? (`education` – `Bachelors`, `Prof-school`, `Assoc-acdm`, `Assoc-voc`, `Masters` or `Doctorate` feature)**"
]
},
{
"cell_type": "code",
"execution_count": 8,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"False"
]
},
"execution_count": 8,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"high_educations = set([\"Bachelors\", \"Prof-school\", \"Assoc-acdm\",\n",
" \"Assoc-voc\", \"Masters\", \"Doctorate\"])\n",
"def high_educated(e):\n",
" return e in high_educations\n",
"\n",
"data[data[\"salary\"] == \">50K\"][\"education\"].map(high_educated).all()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**7. Display age statistics for each race (`race` feature) and each gender (`sex` feature). Use `groupby()` and `describe()`. Find the maximum age of men of `Amer-Indian-Eskimo` race.**"
]
},
{
"cell_type": "code",
"execution_count": 9,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" | \n",
" count | \n",
" mean | \n",
" std | \n",
" min | \n",
" 25% | \n",
" 50% | \n",
" 75% | \n",
" max | \n",
"
\n",
" \n",
" race | \n",
" sex | \n",
" | \n",
" | \n",
" | \n",
" | \n",
" | \n",
" | \n",
" | \n",
" | \n",
"
\n",
" \n",
" \n",
" \n",
" Amer-Indian-Eskimo | \n",
" Female | \n",
" 119.0 | \n",
" 37.117647 | \n",
" 13.114991 | \n",
" 17.0 | \n",
" 27.0 | \n",
" 36.0 | \n",
" 46.00 | \n",
" 80.0 | \n",
"
\n",
" \n",
" Male | \n",
" 192.0 | \n",
" 37.208333 | \n",
" 12.049563 | \n",
" 17.0 | \n",
" 28.0 | \n",
" 35.0 | \n",
" 45.00 | \n",
" 82.0 | \n",
"
\n",
" \n",
" Asian-Pac-Islander | \n",
" Female | \n",
" 346.0 | \n",
" 35.089595 | \n",
" 12.300845 | \n",
" 17.0 | \n",
" 25.0 | \n",
" 33.0 | \n",
" 43.75 | \n",
" 75.0 | \n",
"
\n",
" \n",
" Male | \n",
" 693.0 | \n",
" 39.073593 | \n",
" 12.883944 | \n",
" 18.0 | \n",
" 29.0 | \n",
" 37.0 | \n",
" 46.00 | \n",
" 90.0 | \n",
"
\n",
" \n",
" Black | \n",
" Female | \n",
" 1555.0 | \n",
" 37.854019 | \n",
" 12.637197 | \n",
" 17.0 | \n",
" 28.0 | \n",
" 37.0 | \n",
" 46.00 | \n",
" 90.0 | \n",
"
\n",
" \n",
" Male | \n",
" 1569.0 | \n",
" 37.682600 | \n",
" 12.882612 | \n",
" 17.0 | \n",
" 27.0 | \n",
" 36.0 | \n",
" 46.00 | \n",
" 90.0 | \n",
"
\n",
" \n",
" Other | \n",
" Female | \n",
" 109.0 | \n",
" 31.678899 | \n",
" 11.631599 | \n",
" 17.0 | \n",
" 23.0 | \n",
" 29.0 | \n",
" 39.00 | \n",
" 74.0 | \n",
"
\n",
" \n",
" Male | \n",
" 162.0 | \n",
" 34.654321 | \n",
" 11.355531 | \n",
" 17.0 | \n",
" 26.0 | \n",
" 32.0 | \n",
" 42.00 | \n",
" 77.0 | \n",
"
\n",
" \n",
" White | \n",
" Female | \n",
" 8642.0 | \n",
" 36.811618 | \n",
" 14.329093 | \n",
" 17.0 | \n",
" 25.0 | \n",
" 35.0 | \n",
" 46.00 | \n",
" 90.0 | \n",
"
\n",
" \n",
" Male | \n",
" 19174.0 | \n",
" 39.652498 | \n",
" 13.436029 | \n",
" 17.0 | \n",
" 29.0 | \n",
" 38.0 | \n",
" 49.00 | \n",
" 90.0 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" count mean std min \\\n",
"race sex \n",
"Amer-Indian-Eskimo Female 119.0 37.117647 13.114991 17.0 \n",
" Male 192.0 37.208333 12.049563 17.0 \n",
"Asian-Pac-Islander Female 346.0 35.089595 12.300845 17.0 \n",
" Male 693.0 39.073593 12.883944 18.0 \n",
"Black Female 1555.0 37.854019 12.637197 17.0 \n",
" Male 1569.0 37.682600 12.882612 17.0 \n",
"Other Female 109.0 31.678899 11.631599 17.0 \n",
" Male 162.0 34.654321 11.355531 17.0 \n",
"White Female 8642.0 36.811618 14.329093 17.0 \n",
" Male 19174.0 39.652498 13.436029 17.0 \n",
"\n",
" 25% 50% 75% max \n",
"race sex \n",
"Amer-Indian-Eskimo Female 27.0 36.0 46.00 80.0 \n",
" Male 28.0 35.0 45.00 82.0 \n",
"Asian-Pac-Islander Female 25.0 33.0 43.75 75.0 \n",
" Male 29.0 37.0 46.00 90.0 \n",
"Black Female 28.0 37.0 46.00 90.0 \n",
" Male 27.0 36.0 46.00 90.0 \n",
"Other Female 23.0 29.0 39.00 74.0 \n",
" Male 26.0 32.0 42.00 77.0 \n",
"White Female 25.0 35.0 46.00 90.0 \n",
" Male 29.0 38.0 49.00 90.0 "
]
},
"execution_count": 9,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"data.groupby([\"race\", \"sex\"])[\"age\"].describe()"
]
},
{
"cell_type": "code",
"execution_count": 10,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"82"
]
},
"execution_count": 10,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"data[(data[\"race\"] == \"Amer-Indian-Eskimo\")\n",
" & (data[\"sex\"] == \"Male\")][\"age\"].max()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**8. Among whom is the proportion of those who earn a lot (>50K) greater: married or single men (`marital-status` feature)? Consider as married those who have a `marital-status` starting with `Married` (Married-civ-spouse, Married-spouse-absent or Married-AF-spouse), the rest are considered bachelors.**"
]
},
{
"cell_type": "code",
"execution_count": 11,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"True 5965\n",
"False 697\n",
"Name: married, dtype: int64"
]
},
"execution_count": 11,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"def is_married(m):\n",
" return m.startswith(\"Married\")\n",
"\n",
"data[\"married\"] = data[\"marital-status\"].map(is_married)\n",
"(data[(data[\"sex\"] == \"Male\") & (data[\"salary\"] == \">50K\")]\n",
" [\"married\"].value_counts())"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**9. What is the maximum number of hours a person works per week (`hours-per-week` feature)? How many people work such a number of hours, and what is the percentage of those who earn a lot (>50K) among them?**"
]
},
{
"cell_type": "code",
"execution_count": 12,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Maximum is 99 hours/week.\n",
"85 people work this time at week.\n",
"29.411765% get >50K salary.\n"
]
}
],
"source": [
"m = data[\"hours-per-week\"].max()\n",
"print(\"Maximum is {} hours/week.\".format(m))\n",
"\n",
"people = data[data[\"hours-per-week\"] == m]\n",
"c = people.shape[0]\n",
"print(\"{} people work this time at week.\".format(c))\n",
"\n",
"s = people[people[\"salary\"] == \">50K\"].shape[0]\n",
"print(\"{0:%} get >50K salary.\".format(s / c))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**10. Count the average time of work (`hours-per-week`) for those who earn a little and a lot (`salary`) for each country (`native-country`). What will these be for Japan?**"
]
},
{
"cell_type": "code",
"execution_count": 13,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" salary | \n",
" <=50K | \n",
" >50K | \n",
"
\n",
" \n",
" native-country | \n",
" | \n",
" | \n",
"
\n",
" \n",
" \n",
" \n",
" ? | \n",
" 40.164760 | \n",
" 45.547945 | \n",
"
\n",
" \n",
" Cambodia | \n",
" 41.416667 | \n",
" 40.000000 | \n",
"
\n",
" \n",
" Canada | \n",
" 37.914634 | \n",
" 45.641026 | \n",
"
\n",
" \n",
" China | \n",
" 37.381818 | \n",
" 38.900000 | \n",
"
\n",
" \n",
" Columbia | \n",
" 38.684211 | \n",
" 50.000000 | \n",
"
\n",
" \n",
" Cuba | \n",
" 37.985714 | \n",
" 42.440000 | \n",
"
\n",
" \n",
" Dominican-Republic | \n",
" 42.338235 | \n",
" 47.000000 | \n",
"
\n",
" \n",
" Ecuador | \n",
" 38.041667 | \n",
" 48.750000 | \n",
"
\n",
" \n",
" El-Salvador | \n",
" 36.030928 | \n",
" 45.000000 | \n",
"
\n",
" \n",
" England | \n",
" 40.483333 | \n",
" 44.533333 | \n",
"
\n",
" \n",
" France | \n",
" 41.058824 | \n",
" 50.750000 | \n",
"
\n",
" \n",
" Germany | \n",
" 39.139785 | \n",
" 44.977273 | \n",
"
\n",
" \n",
" Greece | \n",
" 41.809524 | \n",
" 50.625000 | \n",
"
\n",
" \n",
" Guatemala | \n",
" 39.360656 | \n",
" 36.666667 | \n",
"
\n",
" \n",
" Haiti | \n",
" 36.325000 | \n",
" 42.750000 | \n",
"
\n",
" \n",
" Holand-Netherlands | \n",
" 40.000000 | \n",
" NaN | \n",
"
\n",
" \n",
" Honduras | \n",
" 34.333333 | \n",
" 60.000000 | \n",
"
\n",
" \n",
" Hong | \n",
" 39.142857 | \n",
" 45.000000 | \n",
"
\n",
" \n",
" Hungary | \n",
" 31.300000 | \n",
" 50.000000 | \n",
"
\n",
" \n",
" India | \n",
" 38.233333 | \n",
" 46.475000 | \n",
"
\n",
" \n",
" Iran | \n",
" 41.440000 | \n",
" 47.500000 | \n",
"
\n",
" \n",
" Ireland | \n",
" 40.947368 | \n",
" 48.000000 | \n",
"
\n",
" \n",
" Italy | \n",
" 39.625000 | \n",
" 45.400000 | \n",
"
\n",
" \n",
" Jamaica | \n",
" 38.239437 | \n",
" 41.100000 | \n",
"
\n",
" \n",
" Japan | \n",
" 41.000000 | \n",
" 47.958333 | \n",
"
\n",
" \n",
" Laos | \n",
" 40.375000 | \n",
" 40.000000 | \n",
"
\n",
" \n",
" Mexico | \n",
" 40.003279 | \n",
" 46.575758 | \n",
"
\n",
" \n",
" Nicaragua | \n",
" 36.093750 | \n",
" 37.500000 | \n",
"
\n",
" \n",
" Outlying-US(Guam-USVI-etc) | \n",
" 41.857143 | \n",
" NaN | \n",
"
\n",
" \n",
" Peru | \n",
" 35.068966 | \n",
" 40.000000 | \n",
"
\n",
" \n",
" Philippines | \n",
" 38.065693 | \n",
" 43.032787 | \n",
"
\n",
" \n",
" Poland | \n",
" 38.166667 | \n",
" 39.000000 | \n",
"
\n",
" \n",
" Portugal | \n",
" 41.939394 | \n",
" 41.500000 | \n",
"
\n",
" \n",
" Puerto-Rico | \n",
" 38.470588 | \n",
" 39.416667 | \n",
"
\n",
" \n",
" Scotland | \n",
" 39.444444 | \n",
" 46.666667 | \n",
"
\n",
" \n",
" South | \n",
" 40.156250 | \n",
" 51.437500 | \n",
"
\n",
" \n",
" Taiwan | \n",
" 33.774194 | \n",
" 46.800000 | \n",
"
\n",
" \n",
" Thailand | \n",
" 42.866667 | \n",
" 58.333333 | \n",
"
\n",
" \n",
" Trinadad&Tobago | \n",
" 37.058824 | \n",
" 40.000000 | \n",
"
\n",
" \n",
" United-States | \n",
" 38.799127 | \n",
" 45.505369 | \n",
"
\n",
" \n",
" Vietnam | \n",
" 37.193548 | \n",
" 39.200000 | \n",
"
\n",
" \n",
" Yugoslavia | \n",
" 41.600000 | \n",
" 49.500000 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
"salary <=50K >50K\n",
"native-country \n",
"? 40.164760 45.547945\n",
"Cambodia 41.416667 40.000000\n",
"Canada 37.914634 45.641026\n",
"China 37.381818 38.900000\n",
"Columbia 38.684211 50.000000\n",
"Cuba 37.985714 42.440000\n",
"Dominican-Republic 42.338235 47.000000\n",
"Ecuador 38.041667 48.750000\n",
"El-Salvador 36.030928 45.000000\n",
"England 40.483333 44.533333\n",
"France 41.058824 50.750000\n",
"Germany 39.139785 44.977273\n",
"Greece 41.809524 50.625000\n",
"Guatemala 39.360656 36.666667\n",
"Haiti 36.325000 42.750000\n",
"Holand-Netherlands 40.000000 NaN\n",
"Honduras 34.333333 60.000000\n",
"Hong 39.142857 45.000000\n",
"Hungary 31.300000 50.000000\n",
"India 38.233333 46.475000\n",
"Iran 41.440000 47.500000\n",
"Ireland 40.947368 48.000000\n",
"Italy 39.625000 45.400000\n",
"Jamaica 38.239437 41.100000\n",
"Japan 41.000000 47.958333\n",
"Laos 40.375000 40.000000\n",
"Mexico 40.003279 46.575758\n",
"Nicaragua 36.093750 37.500000\n",
"Outlying-US(Guam-USVI-etc) 41.857143 NaN\n",
"Peru 35.068966 40.000000\n",
"Philippines 38.065693 43.032787\n",
"Poland 38.166667 39.000000\n",
"Portugal 41.939394 41.500000\n",
"Puerto-Rico 38.470588 39.416667\n",
"Scotland 39.444444 46.666667\n",
"South 40.156250 51.437500\n",
"Taiwan 33.774194 46.800000\n",
"Thailand 42.866667 58.333333\n",
"Trinadad&Tobago 37.058824 40.000000\n",
"United-States 38.799127 45.505369\n",
"Vietnam 37.193548 39.200000\n",
"Yugoslavia 41.600000 49.500000"
]
},
"execution_count": 13,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"p = pd.crosstab(data[\"native-country\"], data[\"salary\"],\n",
" values=data['hours-per-week'], aggfunc=\"mean\")\n",
"p"
]
},
{
"cell_type": "code",
"execution_count": 14,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"salary\n",
"<=50K 41.000000\n",
">50K 47.958333\n",
"Name: Japan, dtype: float64"
]
},
"execution_count": 14,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"p.loc[\"Japan\"]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Часть 2"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Импортируем `pandasql`:"
]
},
{
"cell_type": "code",
"execution_count": 15,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Requirement already satisfied: pandasql in /home/wrapper228/anaconda/lib/python3.6/site-packages\n",
"Requirement already satisfied: pandas in /home/wrapper228/anaconda/lib/python3.6/site-packages (from pandasql)\n",
"Requirement already satisfied: sqlalchemy in /home/wrapper228/anaconda/lib/python3.6/site-packages (from pandasql)\n",
"Requirement already satisfied: numpy in /home/wrapper228/anaconda/lib/python3.6/site-packages (from pandasql)\n",
"Requirement already satisfied: python-dateutil>=2 in /home/wrapper228/anaconda/lib/python3.6/site-packages (from pandas->pandasql)\n",
"Requirement already satisfied: pytz>=2011k in /home/wrapper228/anaconda/lib/python3.6/site-packages (from pandas->pandasql)\n",
"Requirement already satisfied: six>=1.5 in /home/wrapper228/anaconda/lib/python3.6/site-packages (from python-dateutil>=2->pandas->pandasql)\n",
"\u001b[33mYou are using pip version 9.0.1, however version 19.0.3 is available.\n",
"You should consider upgrading via the 'pip install --upgrade pip' command.\u001b[0m\n"
]
}
],
"source": [
"!pip install pandasql"
]
},
{
"cell_type": "code",
"execution_count": 16,
"metadata": {},
"outputs": [],
"source": [
"from pandasql import sqldf\n",
"pysqldf = lambda q: sqldf(q, globals())"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Для выполнения данного задания возьмём два набора данных из исходных данных, представленных NASA для своего хакатона по предсказанию мощности солнечного излучения:"
]
},
{
"cell_type": "code",
"execution_count": 17,
"metadata": {},
"outputs": [],
"source": [
"wind = (pd.read_csv('wind speed.csv', header=None,\n",
" names=[\"row\", \"UNIX\", \"date\",\n",
" \"time\", \"speed\", \"text\"])\n",
" .drop(\"text\", axis=1))\n",
"temp = (pd.read_csv('temperature.csv', header=None,\n",
" names=[\"row\", \"UNIX\", \"date\",\n",
" \"time\", \"temperature\", \"text\"])\n",
" .drop(\"text\", axis=1))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Посмотрим на эти наборы данных:"
]
},
{
"cell_type": "code",
"execution_count": 18,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" row | \n",
" UNIX | \n",
" date | \n",
" time | \n",
" speed | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" 1 | \n",
" 1475315718 | \n",
" 2016-09-30 | \n",
" 23:55:18 | \n",
" 7.87 | \n",
"
\n",
" \n",
" 1 | \n",
" 2 | \n",
" 1475315423 | \n",
" 2016-09-30 | \n",
" 23:50:23 | \n",
" 7.87 | \n",
"
\n",
" \n",
" 2 | \n",
" 3 | \n",
" 1475315124 | \n",
" 2016-09-30 | \n",
" 23:45:24 | \n",
" 9.00 | \n",
"
\n",
" \n",
" 3 | \n",
" 4 | \n",
" 1475314821 | \n",
" 2016-09-30 | \n",
" 23:40:21 | \n",
" 13.50 | \n",
"
\n",
" \n",
" 4 | \n",
" 5 | \n",
" 1475314522 | \n",
" 2016-09-30 | \n",
" 23:35:22 | \n",
" 15.75 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" row UNIX date time speed\n",
"0 1 1475315718 2016-09-30 23:55:18 7.87\n",
"1 2 1475315423 2016-09-30 23:50:23 7.87\n",
"2 3 1475315124 2016-09-30 23:45:24 9.00\n",
"3 4 1475314821 2016-09-30 23:40:21 13.50\n",
"4 5 1475314522 2016-09-30 23:35:22 15.75"
]
},
"execution_count": 18,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"wind.head()"
]
},
{
"cell_type": "code",
"execution_count": 19,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"row int64\n",
"UNIX int64\n",
"date object\n",
"time object\n",
"speed float64\n",
"dtype: object"
]
},
"execution_count": 19,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"wind.dtypes"
]
},
{
"cell_type": "code",
"execution_count": 20,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" row | \n",
" UNIX | \n",
" date | \n",
" time | \n",
" temperature | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" 1 | \n",
" 1475315718 | \n",
" 2016-09-30 | \n",
" 23:55:18 | \n",
" 48 | \n",
"
\n",
" \n",
" 1 | \n",
" 2 | \n",
" 1475315423 | \n",
" 2016-09-30 | \n",
" 23:50:23 | \n",
" 48 | \n",
"
\n",
" \n",
" 2 | \n",
" 3 | \n",
" 1475315124 | \n",
" 2016-09-30 | \n",
" 23:45:24 | \n",
" 48 | \n",
"
\n",
" \n",
" 3 | \n",
" 4 | \n",
" 1475314821 | \n",
" 2016-09-30 | \n",
" 23:40:21 | \n",
" 48 | \n",
"
\n",
" \n",
" 4 | \n",
" 5 | \n",
" 1475314522 | \n",
" 2016-09-30 | \n",
" 23:35:22 | \n",
" 48 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" row UNIX date time temperature\n",
"0 1 1475315718 2016-09-30 23:55:18 48\n",
"1 2 1475315423 2016-09-30 23:50:23 48\n",
"2 3 1475315124 2016-09-30 23:45:24 48\n",
"3 4 1475314821 2016-09-30 23:40:21 48\n",
"4 5 1475314522 2016-09-30 23:35:22 48"
]
},
"execution_count": 20,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"temp.head()"
]
},
{
"cell_type": "code",
"execution_count": 21,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"row int64\n",
"UNIX int64\n",
"date object\n",
"time object\n",
"temperature int64\n",
"dtype: object"
]
},
"execution_count": 21,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"temp.dtypes"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Объединим эти наборы данных различными способами, проверяя время их выполнения:"
]
},
{
"cell_type": "code",
"execution_count": 22,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" row | \n",
" UNIX | \n",
" date | \n",
" time | \n",
" speed | \n",
" temperature | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" 1 | \n",
" 1475315718 | \n",
" 2016-09-30 | \n",
" 23:55:18 | \n",
" 7.87 | \n",
" 48 | \n",
"
\n",
" \n",
" 1 | \n",
" 2 | \n",
" 1475315423 | \n",
" 2016-09-30 | \n",
" 23:50:23 | \n",
" 7.87 | \n",
" 48 | \n",
"
\n",
" \n",
" 2 | \n",
" 3 | \n",
" 1475315124 | \n",
" 2016-09-30 | \n",
" 23:45:24 | \n",
" 9.00 | \n",
" 48 | \n",
"
\n",
" \n",
" 3 | \n",
" 4 | \n",
" 1475314821 | \n",
" 2016-09-30 | \n",
" 23:40:21 | \n",
" 13.50 | \n",
" 48 | \n",
"
\n",
" \n",
" 4 | \n",
" 5 | \n",
" 1475314522 | \n",
" 2016-09-30 | \n",
" 23:35:22 | \n",
" 15.75 | \n",
" 48 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" row UNIX date time speed temperature\n",
"0 1 1475315718 2016-09-30 23:55:18 7.87 48\n",
"1 2 1475315423 2016-09-30 23:50:23 7.87 48\n",
"2 3 1475315124 2016-09-30 23:45:24 9.00 48\n",
"3 4 1475314821 2016-09-30 23:40:21 13.50 48\n",
"4 5 1475314522 2016-09-30 23:35:22 15.75 48"
]
},
"execution_count": 22,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"wind.merge(temp[[\"UNIX\", \"temperature\"]], on=\"UNIX\").head()"
]
},
{
"cell_type": "code",
"execution_count": 23,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"5.29 ms ± 25.5 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)\n"
]
}
],
"source": [
"%%timeit\n",
"wind.merge(temp[[\"UNIX\", \"temperature\"]], on=\"UNIX\")"
]
},
{
"cell_type": "code",
"execution_count": 24,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" row | \n",
" UNIX | \n",
" date | \n",
" time | \n",
" speed | \n",
" temperature | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" 1 | \n",
" 1475315718 | \n",
" 2016-09-30 | \n",
" 23:55:18 | \n",
" 7.87 | \n",
" 48 | \n",
"
\n",
" \n",
" 1 | \n",
" 2 | \n",
" 1475315423 | \n",
" 2016-09-30 | \n",
" 23:50:23 | \n",
" 7.87 | \n",
" 48 | \n",
"
\n",
" \n",
" 2 | \n",
" 3 | \n",
" 1475315124 | \n",
" 2016-09-30 | \n",
" 23:45:24 | \n",
" 9.00 | \n",
" 48 | \n",
"
\n",
" \n",
" 3 | \n",
" 4 | \n",
" 1475314821 | \n",
" 2016-09-30 | \n",
" 23:40:21 | \n",
" 13.50 | \n",
" 48 | \n",
"
\n",
" \n",
" 4 | \n",
" 5 | \n",
" 1475314522 | \n",
" 2016-09-30 | \n",
" 23:35:22 | \n",
" 15.75 | \n",
" 48 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" row UNIX date time speed temperature\n",
"0 1 1475315718 2016-09-30 23:55:18 7.87 48\n",
"1 2 1475315423 2016-09-30 23:50:23 7.87 48\n",
"2 3 1475315124 2016-09-30 23:45:24 9.00 48\n",
"3 4 1475314821 2016-09-30 23:40:21 13.50 48\n",
"4 5 1475314522 2016-09-30 23:35:22 15.75 48"
]
},
"execution_count": 24,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"pysqldf(\"\"\"SELECT w.row, w.UNIX, w.date, w.time,\n",
" w.speed, t.temperature\n",
" FROM wind AS w JOIN temp AS t\n",
" ON w.UNIX = t.UNIX\n",
" \"\"\").head()"
]
},
{
"cell_type": "code",
"execution_count": 25,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"395 ms ± 1.65 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)\n"
]
}
],
"source": [
"%%timeit\n",
"pysqldf(\"\"\"SELECT w.row, w.UNIX, w.date, w.time,\n",
" w.speed, t.temperature\n",
" FROM wind AS w JOIN temp AS t\n",
" ON w.UNIX = t.UNIX\n",
" \"\"\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Видно, что `pandasql` в 50 раз медленнее, чем `pandas`."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Сгруппируем набор данных с использованием функций агрегирования различными способами:"
]
},
{
"cell_type": "code",
"execution_count": 26,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"date\n",
"2016-09-01 6.396560\n",
"2016-09-02 5.804086\n",
"2016-09-03 4.960248\n",
"2016-09-04 5.184571\n",
"2016-09-05 5.830676\n",
"Name: speed, dtype: float64"
]
},
"execution_count": 26,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"wind.groupby(\"date\")[\"speed\"].mean().head()"
]
},
{
"cell_type": "code",
"execution_count": 27,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"1.18 ms ± 17.6 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)\n"
]
}
],
"source": [
"%%timeit\n",
"wind.groupby(\"date\")[\"speed\"].mean()"
]
},
{
"cell_type": "code",
"execution_count": 28,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" date | \n",
" AVG(speed) | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" 2016-09-01 | \n",
" 6.396560 | \n",
"
\n",
" \n",
" 1 | \n",
" 2016-09-02 | \n",
" 5.804086 | \n",
"
\n",
" \n",
" 2 | \n",
" 2016-09-03 | \n",
" 4.960248 | \n",
"
\n",
" \n",
" 3 | \n",
" 2016-09-04 | \n",
" 5.184571 | \n",
"
\n",
" \n",
" 4 | \n",
" 2016-09-05 | \n",
" 5.830676 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" date AVG(speed)\n",
"0 2016-09-01 6.396560\n",
"1 2016-09-02 5.804086\n",
"2 2016-09-03 4.960248\n",
"3 2016-09-04 5.184571\n",
"4 2016-09-05 5.830676"
]
},
"execution_count": 28,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"pysqldf(\"\"\"SELECT date, AVG(speed)\n",
" FROM wind\n",
" GROUP BY date\n",
" \"\"\").head()"
]
},
{
"cell_type": "code",
"execution_count": 29,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"163 ms ± 867 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)\n"
]
}
],
"source": [
"%%timeit\n",
"pysqldf(\"\"\"SELECT date, AVG(speed)\n",
" FROM wind\n",
" GROUP BY date\n",
" \"\"\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Здесь разница уже более чем в 100 раз. Таким образом для таких простых запросов проще использовать Pandas."
]
}
],
"metadata": {
"authors": [
{
"name": "Лещев Артем Олегович"
}
],
"celltoolbar": "Raw Cell Format",
"group": "ИУ5-24М",
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"lab_number": 2,
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.6.3"
},
"title": "Изучение библиотек обработки данных"
},
"nbformat": 4,
"nbformat_minor": 1
}