{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Цель лабораторной работы\n",
    "Изучить библиотеки обработки данных Pandas и PandaSQL<cite data-cite=\"ue:lab2\"></cite>."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Задание\n",
    "Задание состоит из&nbsp;двух частей<cite data-cite=\"ue:lab2\"></cite>.\n",
    "\n",
    "## Часть 1\n",
    "Требуется выполнить первое демонстрационное задание под&nbsp;названием «Exploratory data analysis with Pandas» со&nbsp;страницы курса [mlcourse.ai](https://mlcourse.ai/assignments).\n",
    "\n",
    "## Часть 2\n",
    "Требуется выполнить следующие запросы с&nbsp;использованием двух различных библиотек&nbsp;— Pandas и PandaSQL:\n",
    "\n",
    "- один произвольный запрос на соединение двух наборов данных,\n",
    "- один произвольный запрос на группировку набора данных с использованием функций агрегирования.\n",
    "\n",
    "Также требуется сравнить время выполнения каждого запроса в&nbsp;Pandas и PandaSQL."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Часть 1"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Unique values of all features (for more information, please see the links above):\n",
    "\n",
    "- `age`: continuous.\n",
    "- `workclass`: `Private`, `Self-emp-not-inc`, `Self-emp-inc`, `Federal-gov`, `Local-gov`, `State-gov`, `Without-pay`, `Never-worked`.\n",
    "- `fnlwgt`: continuous.\n",
    "- `education`: `Bachelors`, `Some-college`, `11th`, `HS-grad`, `Prof-school`, `Assoc-acdm`, `Assoc-voc`, `9th`, `7th-8th`, `12th`, `Masters`, `1st-4th`, `10th`, `Doctorate`, `5th-6th`, `Preschool`.\n",
    "- `education-num`: continuous.\n",
    "- `marital-status`: `Married-civ-spouse`, `Divorced`, `Never-married`, `Separated`, `Widowed`, `Married-spouse-absent`, `Married-AF-spouse`.\n",
    "- `occupation`: `Tech-support`, `Craft-repair`, `Other-service`, `Sales`, `Exec-managerial`, `Prof-specialty`, `Handlers-cleaners`, `Machine-op-inspct`, `Adm-clerical`, `Farming-fishing`, `Transport-moving`, `Priv-house-serv`, `Protective-serv`, `Armed-Forces`.\n",
    "- `relationship`: `Wife`, `Own-child`, `Husband`, `Not-in-family`, `Other-relative`, `Unmarried`.\n",
    "- `race`: `White`, `Asian-Pac-Islander`, `Amer-Indian-Eskimo`, `Other`, `Black`.\n",
    "- `sex`: `Female`, `Male`.\n",
    "- `capital-gain`: continuous.\n",
    "- `capital-loss`: continuous.\n",
    "- `hours-per-week`: continuous.\n",
    "- `native-country`: `United-States`, `Cambodia`, `England`, `Puerto-Rico`, `Canada`, `Germany`, `Outlying-US(Guam-USVI-etc)`, `India`, `Japan`, `Greece`, `South`, `China`, `Cuba`, `Iran`, `Honduras`, `Philippines`, `Italy`, `Poland`, `Jamaica`, `Vietnam`, `Mexico`, `Portugal`, `Ireland`, `France`, `Dominican-Republic`, `Laos`, `Ecuador`, `Taiwan`, `Haiti`, `Columbia`, `Hungary`, `Guatemala`, `Nicaragua`, `Scotland`, `Thailand`, `Yugoslavia`, `El-Salvador`, `Trinadad&Tobago`, `Peru`, `Hong`, `Holand-Netherlands`.   \n",
    "- `salary`: `>50K`, `<=50K`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "import pandas as pd"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "pd.set_option(\"display.width\", 70)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style>\n",
       "    .dataframe thead tr:only-child th {\n",
       "        text-align: right;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: left;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>age</th>\n",
       "      <th>workclass</th>\n",
       "      <th>fnlwgt</th>\n",
       "      <th>education</th>\n",
       "      <th>education-num</th>\n",
       "      <th>marital-status</th>\n",
       "      <th>occupation</th>\n",
       "      <th>relationship</th>\n",
       "      <th>race</th>\n",
       "      <th>sex</th>\n",
       "      <th>capital-gain</th>\n",
       "      <th>capital-loss</th>\n",
       "      <th>hours-per-week</th>\n",
       "      <th>native-country</th>\n",
       "      <th>salary</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>39</td>\n",
       "      <td>State-gov</td>\n",
       "      <td>77516</td>\n",
       "      <td>Bachelors</td>\n",
       "      <td>13</td>\n",
       "      <td>Never-married</td>\n",
       "      <td>Adm-clerical</td>\n",
       "      <td>Not-in-family</td>\n",
       "      <td>White</td>\n",
       "      <td>Male</td>\n",
       "      <td>2174</td>\n",
       "      <td>0</td>\n",
       "      <td>40</td>\n",
       "      <td>United-States</td>\n",
       "      <td>&lt;=50K</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>50</td>\n",
       "      <td>Self-emp-not-inc</td>\n",
       "      <td>83311</td>\n",
       "      <td>Bachelors</td>\n",
       "      <td>13</td>\n",
       "      <td>Married-civ-spouse</td>\n",
       "      <td>Exec-managerial</td>\n",
       "      <td>Husband</td>\n",
       "      <td>White</td>\n",
       "      <td>Male</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>13</td>\n",
       "      <td>United-States</td>\n",
       "      <td>&lt;=50K</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>38</td>\n",
       "      <td>Private</td>\n",
       "      <td>215646</td>\n",
       "      <td>HS-grad</td>\n",
       "      <td>9</td>\n",
       "      <td>Divorced</td>\n",
       "      <td>Handlers-cleaners</td>\n",
       "      <td>Not-in-family</td>\n",
       "      <td>White</td>\n",
       "      <td>Male</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>40</td>\n",
       "      <td>United-States</td>\n",
       "      <td>&lt;=50K</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>53</td>\n",
       "      <td>Private</td>\n",
       "      <td>234721</td>\n",
       "      <td>11th</td>\n",
       "      <td>7</td>\n",
       "      <td>Married-civ-spouse</td>\n",
       "      <td>Handlers-cleaners</td>\n",
       "      <td>Husband</td>\n",
       "      <td>Black</td>\n",
       "      <td>Male</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>40</td>\n",
       "      <td>United-States</td>\n",
       "      <td>&lt;=50K</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>28</td>\n",
       "      <td>Private</td>\n",
       "      <td>338409</td>\n",
       "      <td>Bachelors</td>\n",
       "      <td>13</td>\n",
       "      <td>Married-civ-spouse</td>\n",
       "      <td>Prof-specialty</td>\n",
       "      <td>Wife</td>\n",
       "      <td>Black</td>\n",
       "      <td>Female</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>40</td>\n",
       "      <td>Cuba</td>\n",
       "      <td>&lt;=50K</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "   age         workclass  fnlwgt  education  education-num  \\\n",
       "0   39         State-gov   77516  Bachelors             13   \n",
       "1   50  Self-emp-not-inc   83311  Bachelors             13   \n",
       "2   38           Private  215646    HS-grad              9   \n",
       "3   53           Private  234721       11th              7   \n",
       "4   28           Private  338409  Bachelors             13   \n",
       "\n",
       "       marital-status         occupation   relationship   race  \\\n",
       "0       Never-married       Adm-clerical  Not-in-family  White   \n",
       "1  Married-civ-spouse    Exec-managerial        Husband  White   \n",
       "2            Divorced  Handlers-cleaners  Not-in-family  White   \n",
       "3  Married-civ-spouse  Handlers-cleaners        Husband  Black   \n",
       "4  Married-civ-spouse     Prof-specialty           Wife  Black   \n",
       "\n",
       "      sex  capital-gain  capital-loss  hours-per-week  \\\n",
       "0    Male          2174             0              40   \n",
       "1    Male             0             0              13   \n",
       "2    Male             0             0              40   \n",
       "3    Male             0             0              40   \n",
       "4  Female             0             0              40   \n",
       "\n",
       "  native-country salary  \n",
       "0  United-States  <=50K  \n",
       "1  United-States  <=50K  \n",
       "2  United-States  <=50K  \n",
       "3  United-States  <=50K  \n",
       "4           Cuba  <=50K  "
      ]
     },
     "execution_count": 3,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "data = pd.read_csv('adult.data.csv')\n",
    "data.head()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**1. How many men and women (`sex` feature) are represented in this dataset?** "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "Male      21790\n",
       "Female    10771\n",
       "Name: sex, dtype: int64"
      ]
     },
     "execution_count": 4,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "data[\"sex\"].value_counts()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**2. What is the average age (`age` feature) of women?**"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "36.85823043357163"
      ]
     },
     "execution_count": 5,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "data[data[\"sex\"] == \"Female\"][\"age\"].mean()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**3. What is the percentage of German citizens (`native-country` feature)?**"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "0.420749%\n"
     ]
    }
   ],
   "source": [
    "print(\"{0:%}\".format(data[data[\"native-country\"] == \"Germany\"]\n",
    "                     .shape[0] / data.shape[0]))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**4-5. What are the mean and standard deviation of age for those who earn more than 50K per year (`salary` feature) and those who earn less than 50K per year?**"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "<=50K: = 36.78373786407767 ± 14.020088490824813 years\n",
      " >50K: = 44.24984058155847 ± 10.51902771985177 years\n"
     ]
    }
   ],
   "source": [
    "ages1 = data[data[\"salary\"] == \"<=50K\"][\"age\"]\n",
    "ages2 = data[data[\"salary\"] ==  \">50K\"][\"age\"]\n",
    "print(\"<=50K: = {0} ± {1} years\".format(ages1.mean(), ages1.std()))\n",
    "print(\" >50K: = {0} ± {1} years\".format(ages2.mean(), ages2.std()))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**6. Is it true that people who earn more than 50K have at least high school education? (`education` – `Bachelors`, `Prof-school`, `Assoc-acdm`, `Assoc-voc`, `Masters` or `Doctorate` feature)**"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "False"
      ]
     },
     "execution_count": 8,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "high_educations = set([\"Bachelors\", \"Prof-school\", \"Assoc-acdm\",\n",
    "                       \"Assoc-voc\", \"Masters\", \"Doctorate\"])\n",
    "def high_educated(e):\n",
    "    return e in high_educations\n",
    "\n",
    "data[data[\"salary\"] == \">50K\"][\"education\"].map(high_educated).all()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**7. Display age statistics for each race (`race` feature) and each gender (`sex` feature). Use `groupby()` and `describe()`. Find the maximum age of men of `Amer-Indian-Eskimo` race.**"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style>\n",
       "    .dataframe thead tr:only-child th {\n",
       "        text-align: right;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: left;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th>count</th>\n",
       "      <th>mean</th>\n",
       "      <th>std</th>\n",
       "      <th>min</th>\n",
       "      <th>25%</th>\n",
       "      <th>50%</th>\n",
       "      <th>75%</th>\n",
       "      <th>max</th>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>race</th>\n",
       "      <th>sex</th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th rowspan=\"2\" valign=\"top\">Amer-Indian-Eskimo</th>\n",
       "      <th>Female</th>\n",
       "      <td>119.0</td>\n",
       "      <td>37.117647</td>\n",
       "      <td>13.114991</td>\n",
       "      <td>17.0</td>\n",
       "      <td>27.0</td>\n",
       "      <td>36.0</td>\n",
       "      <td>46.00</td>\n",
       "      <td>80.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>Male</th>\n",
       "      <td>192.0</td>\n",
       "      <td>37.208333</td>\n",
       "      <td>12.049563</td>\n",
       "      <td>17.0</td>\n",
       "      <td>28.0</td>\n",
       "      <td>35.0</td>\n",
       "      <td>45.00</td>\n",
       "      <td>82.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th rowspan=\"2\" valign=\"top\">Asian-Pac-Islander</th>\n",
       "      <th>Female</th>\n",
       "      <td>346.0</td>\n",
       "      <td>35.089595</td>\n",
       "      <td>12.300845</td>\n",
       "      <td>17.0</td>\n",
       "      <td>25.0</td>\n",
       "      <td>33.0</td>\n",
       "      <td>43.75</td>\n",
       "      <td>75.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>Male</th>\n",
       "      <td>693.0</td>\n",
       "      <td>39.073593</td>\n",
       "      <td>12.883944</td>\n",
       "      <td>18.0</td>\n",
       "      <td>29.0</td>\n",
       "      <td>37.0</td>\n",
       "      <td>46.00</td>\n",
       "      <td>90.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th rowspan=\"2\" valign=\"top\">Black</th>\n",
       "      <th>Female</th>\n",
       "      <td>1555.0</td>\n",
       "      <td>37.854019</td>\n",
       "      <td>12.637197</td>\n",
       "      <td>17.0</td>\n",
       "      <td>28.0</td>\n",
       "      <td>37.0</td>\n",
       "      <td>46.00</td>\n",
       "      <td>90.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>Male</th>\n",
       "      <td>1569.0</td>\n",
       "      <td>37.682600</td>\n",
       "      <td>12.882612</td>\n",
       "      <td>17.0</td>\n",
       "      <td>27.0</td>\n",
       "      <td>36.0</td>\n",
       "      <td>46.00</td>\n",
       "      <td>90.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th rowspan=\"2\" valign=\"top\">Other</th>\n",
       "      <th>Female</th>\n",
       "      <td>109.0</td>\n",
       "      <td>31.678899</td>\n",
       "      <td>11.631599</td>\n",
       "      <td>17.0</td>\n",
       "      <td>23.0</td>\n",
       "      <td>29.0</td>\n",
       "      <td>39.00</td>\n",
       "      <td>74.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>Male</th>\n",
       "      <td>162.0</td>\n",
       "      <td>34.654321</td>\n",
       "      <td>11.355531</td>\n",
       "      <td>17.0</td>\n",
       "      <td>26.0</td>\n",
       "      <td>32.0</td>\n",
       "      <td>42.00</td>\n",
       "      <td>77.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th rowspan=\"2\" valign=\"top\">White</th>\n",
       "      <th>Female</th>\n",
       "      <td>8642.0</td>\n",
       "      <td>36.811618</td>\n",
       "      <td>14.329093</td>\n",
       "      <td>17.0</td>\n",
       "      <td>25.0</td>\n",
       "      <td>35.0</td>\n",
       "      <td>46.00</td>\n",
       "      <td>90.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>Male</th>\n",
       "      <td>19174.0</td>\n",
       "      <td>39.652498</td>\n",
       "      <td>13.436029</td>\n",
       "      <td>17.0</td>\n",
       "      <td>29.0</td>\n",
       "      <td>38.0</td>\n",
       "      <td>49.00</td>\n",
       "      <td>90.0</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "                             count       mean        std   min  \\\n",
       "race               sex                                           \n",
       "Amer-Indian-Eskimo Female    119.0  37.117647  13.114991  17.0   \n",
       "                   Male      192.0  37.208333  12.049563  17.0   \n",
       "Asian-Pac-Islander Female    346.0  35.089595  12.300845  17.0   \n",
       "                   Male      693.0  39.073593  12.883944  18.0   \n",
       "Black              Female   1555.0  37.854019  12.637197  17.0   \n",
       "                   Male     1569.0  37.682600  12.882612  17.0   \n",
       "Other              Female    109.0  31.678899  11.631599  17.0   \n",
       "                   Male      162.0  34.654321  11.355531  17.0   \n",
       "White              Female   8642.0  36.811618  14.329093  17.0   \n",
       "                   Male    19174.0  39.652498  13.436029  17.0   \n",
       "\n",
       "                            25%   50%    75%   max  \n",
       "race               sex                              \n",
       "Amer-Indian-Eskimo Female  27.0  36.0  46.00  80.0  \n",
       "                   Male    28.0  35.0  45.00  82.0  \n",
       "Asian-Pac-Islander Female  25.0  33.0  43.75  75.0  \n",
       "                   Male    29.0  37.0  46.00  90.0  \n",
       "Black              Female  28.0  37.0  46.00  90.0  \n",
       "                   Male    27.0  36.0  46.00  90.0  \n",
       "Other              Female  23.0  29.0  39.00  74.0  \n",
       "                   Male    26.0  32.0  42.00  77.0  \n",
       "White              Female  25.0  35.0  46.00  90.0  \n",
       "                   Male    29.0  38.0  49.00  90.0  "
      ]
     },
     "execution_count": 9,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "data.groupby([\"race\", \"sex\"])[\"age\"].describe()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "82"
      ]
     },
     "execution_count": 10,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "data[(data[\"race\"] == \"Amer-Indian-Eskimo\")\n",
    "     & (data[\"sex\"] == \"Male\")][\"age\"].max()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**8. Among whom is the proportion of those who earn a lot (>50K) greater: married or single men (`marital-status` feature)? Consider as married those who have a `marital-status` starting with `Married` (Married-civ-spouse, Married-spouse-absent or Married-AF-spouse), the rest are considered bachelors.**"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "True     5965\n",
       "False     697\n",
       "Name: married, dtype: int64"
      ]
     },
     "execution_count": 11,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "def is_married(m):\n",
    "    return m.startswith(\"Married\")\n",
    "\n",
    "data[\"married\"] = data[\"marital-status\"].map(is_married)\n",
    "(data[(data[\"sex\"] == \"Male\") & (data[\"salary\"] == \">50K\")]\n",
    "     [\"married\"].value_counts())"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**9. What is the maximum number of hours a person works per week (`hours-per-week` feature)? How many people work such a number of hours, and what is the percentage of those who earn a lot (>50K) among them?**"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Maximum is 99 hours/week.\n",
      "85 people work this time at week.\n",
      "29.411765% get >50K salary.\n"
     ]
    }
   ],
   "source": [
    "m = data[\"hours-per-week\"].max()\n",
    "print(\"Maximum is {} hours/week.\".format(m))\n",
    "\n",
    "people = data[data[\"hours-per-week\"] == m]\n",
    "c = people.shape[0]\n",
    "print(\"{} people work this time at week.\".format(c))\n",
    "\n",
    "s = people[people[\"salary\"] == \">50K\"].shape[0]\n",
    "print(\"{0:%} get >50K salary.\".format(s / c))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**10. Count the average time of work (`hours-per-week`) for those who earn a little and a lot (`salary`) for each country (`native-country`). What will these be for Japan?**"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 13,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style>\n",
       "    .dataframe thead tr:only-child th {\n",
       "        text-align: right;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: left;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th>salary</th>\n",
       "      <th>&lt;=50K</th>\n",
       "      <th>&gt;50K</th>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>native-country</th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>?</th>\n",
       "      <td>40.164760</td>\n",
       "      <td>45.547945</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>Cambodia</th>\n",
       "      <td>41.416667</td>\n",
       "      <td>40.000000</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>Canada</th>\n",
       "      <td>37.914634</td>\n",
       "      <td>45.641026</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>China</th>\n",
       "      <td>37.381818</td>\n",
       "      <td>38.900000</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>Columbia</th>\n",
       "      <td>38.684211</td>\n",
       "      <td>50.000000</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>Cuba</th>\n",
       "      <td>37.985714</td>\n",
       "      <td>42.440000</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>Dominican-Republic</th>\n",
       "      <td>42.338235</td>\n",
       "      <td>47.000000</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>Ecuador</th>\n",
       "      <td>38.041667</td>\n",
       "      <td>48.750000</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>El-Salvador</th>\n",
       "      <td>36.030928</td>\n",
       "      <td>45.000000</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>England</th>\n",
       "      <td>40.483333</td>\n",
       "      <td>44.533333</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>France</th>\n",
       "      <td>41.058824</td>\n",
       "      <td>50.750000</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>Germany</th>\n",
       "      <td>39.139785</td>\n",
       "      <td>44.977273</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>Greece</th>\n",
       "      <td>41.809524</td>\n",
       "      <td>50.625000</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>Guatemala</th>\n",
       "      <td>39.360656</td>\n",
       "      <td>36.666667</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>Haiti</th>\n",
       "      <td>36.325000</td>\n",
       "      <td>42.750000</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>Holand-Netherlands</th>\n",
       "      <td>40.000000</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>Honduras</th>\n",
       "      <td>34.333333</td>\n",
       "      <td>60.000000</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>Hong</th>\n",
       "      <td>39.142857</td>\n",
       "      <td>45.000000</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>Hungary</th>\n",
       "      <td>31.300000</td>\n",
       "      <td>50.000000</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>India</th>\n",
       "      <td>38.233333</td>\n",
       "      <td>46.475000</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>Iran</th>\n",
       "      <td>41.440000</td>\n",
       "      <td>47.500000</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>Ireland</th>\n",
       "      <td>40.947368</td>\n",
       "      <td>48.000000</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>Italy</th>\n",
       "      <td>39.625000</td>\n",
       "      <td>45.400000</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>Jamaica</th>\n",
       "      <td>38.239437</td>\n",
       "      <td>41.100000</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>Japan</th>\n",
       "      <td>41.000000</td>\n",
       "      <td>47.958333</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>Laos</th>\n",
       "      <td>40.375000</td>\n",
       "      <td>40.000000</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>Mexico</th>\n",
       "      <td>40.003279</td>\n",
       "      <td>46.575758</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>Nicaragua</th>\n",
       "      <td>36.093750</td>\n",
       "      <td>37.500000</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>Outlying-US(Guam-USVI-etc)</th>\n",
       "      <td>41.857143</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>Peru</th>\n",
       "      <td>35.068966</td>\n",
       "      <td>40.000000</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>Philippines</th>\n",
       "      <td>38.065693</td>\n",
       "      <td>43.032787</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>Poland</th>\n",
       "      <td>38.166667</td>\n",
       "      <td>39.000000</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>Portugal</th>\n",
       "      <td>41.939394</td>\n",
       "      <td>41.500000</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>Puerto-Rico</th>\n",
       "      <td>38.470588</td>\n",
       "      <td>39.416667</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>Scotland</th>\n",
       "      <td>39.444444</td>\n",
       "      <td>46.666667</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>South</th>\n",
       "      <td>40.156250</td>\n",
       "      <td>51.437500</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>Taiwan</th>\n",
       "      <td>33.774194</td>\n",
       "      <td>46.800000</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>Thailand</th>\n",
       "      <td>42.866667</td>\n",
       "      <td>58.333333</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>Trinadad&amp;Tobago</th>\n",
       "      <td>37.058824</td>\n",
       "      <td>40.000000</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>United-States</th>\n",
       "      <td>38.799127</td>\n",
       "      <td>45.505369</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>Vietnam</th>\n",
       "      <td>37.193548</td>\n",
       "      <td>39.200000</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>Yugoslavia</th>\n",
       "      <td>41.600000</td>\n",
       "      <td>49.500000</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "salary                          <=50K       >50K\n",
       "native-country                                  \n",
       "?                           40.164760  45.547945\n",
       "Cambodia                    41.416667  40.000000\n",
       "Canada                      37.914634  45.641026\n",
       "China                       37.381818  38.900000\n",
       "Columbia                    38.684211  50.000000\n",
       "Cuba                        37.985714  42.440000\n",
       "Dominican-Republic          42.338235  47.000000\n",
       "Ecuador                     38.041667  48.750000\n",
       "El-Salvador                 36.030928  45.000000\n",
       "England                     40.483333  44.533333\n",
       "France                      41.058824  50.750000\n",
       "Germany                     39.139785  44.977273\n",
       "Greece                      41.809524  50.625000\n",
       "Guatemala                   39.360656  36.666667\n",
       "Haiti                       36.325000  42.750000\n",
       "Holand-Netherlands          40.000000        NaN\n",
       "Honduras                    34.333333  60.000000\n",
       "Hong                        39.142857  45.000000\n",
       "Hungary                     31.300000  50.000000\n",
       "India                       38.233333  46.475000\n",
       "Iran                        41.440000  47.500000\n",
       "Ireland                     40.947368  48.000000\n",
       "Italy                       39.625000  45.400000\n",
       "Jamaica                     38.239437  41.100000\n",
       "Japan                       41.000000  47.958333\n",
       "Laos                        40.375000  40.000000\n",
       "Mexico                      40.003279  46.575758\n",
       "Nicaragua                   36.093750  37.500000\n",
       "Outlying-US(Guam-USVI-etc)  41.857143        NaN\n",
       "Peru                        35.068966  40.000000\n",
       "Philippines                 38.065693  43.032787\n",
       "Poland                      38.166667  39.000000\n",
       "Portugal                    41.939394  41.500000\n",
       "Puerto-Rico                 38.470588  39.416667\n",
       "Scotland                    39.444444  46.666667\n",
       "South                       40.156250  51.437500\n",
       "Taiwan                      33.774194  46.800000\n",
       "Thailand                    42.866667  58.333333\n",
       "Trinadad&Tobago             37.058824  40.000000\n",
       "United-States               38.799127  45.505369\n",
       "Vietnam                     37.193548  39.200000\n",
       "Yugoslavia                  41.600000  49.500000"
      ]
     },
     "execution_count": 13,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "p = pd.crosstab(data[\"native-country\"], data[\"salary\"],\n",
    "                values=data['hours-per-week'], aggfunc=\"mean\")\n",
    "p"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 14,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "salary\n",
       "<=50K    41.000000\n",
       ">50K     47.958333\n",
       "Name: Japan, dtype: float64"
      ]
     },
     "execution_count": 14,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "p.loc[\"Japan\"]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Часть 2"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Импортируем `pandasql`:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 15,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Requirement already satisfied: pandasql in /home/wrapper228/anaconda/lib/python3.6/site-packages\n",
      "Requirement already satisfied: pandas in /home/wrapper228/anaconda/lib/python3.6/site-packages (from pandasql)\n",
      "Requirement already satisfied: sqlalchemy in /home/wrapper228/anaconda/lib/python3.6/site-packages (from pandasql)\n",
      "Requirement already satisfied: numpy in /home/wrapper228/anaconda/lib/python3.6/site-packages (from pandasql)\n",
      "Requirement already satisfied: python-dateutil>=2 in /home/wrapper228/anaconda/lib/python3.6/site-packages (from pandas->pandasql)\n",
      "Requirement already satisfied: pytz>=2011k in /home/wrapper228/anaconda/lib/python3.6/site-packages (from pandas->pandasql)\n",
      "Requirement already satisfied: six>=1.5 in /home/wrapper228/anaconda/lib/python3.6/site-packages (from python-dateutil>=2->pandas->pandasql)\n",
      "\u001b[33mYou are using pip version 9.0.1, however version 19.0.3 is available.\n",
      "You should consider upgrading via the 'pip install --upgrade pip' command.\u001b[0m\n"
     ]
    }
   ],
   "source": [
    "!pip install pandasql"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 16,
   "metadata": {},
   "outputs": [],
   "source": [
    "from pandasql import sqldf\n",
    "pysqldf = lambda q: sqldf(q, globals())"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Для&nbsp;выполнения данного задания возьмём два набора данных из&nbsp;исходных данных, представленных NASA для&nbsp;своего хакатона по&nbsp;предсказанию мощности солнечного излучения<cite data-cite=\"data:sunshine\"></cite>:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 17,
   "metadata": {},
   "outputs": [],
   "source": [
    "wind = (pd.read_csv('wind speed.csv', header=None,\n",
    "                    names=[\"row\", \"UNIX\", \"date\",\n",
    "                           \"time\", \"speed\", \"text\"])\n",
    "                    .drop(\"text\", axis=1))\n",
    "temp = (pd.read_csv('temperature.csv', header=None,\n",
    "                    names=[\"row\", \"UNIX\", \"date\",\n",
    "                           \"time\", \"temperature\", \"text\"])\n",
    "                    .drop(\"text\", axis=1))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Посмотрим на&nbsp;эти наборы данных:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 18,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style>\n",
       "    .dataframe thead tr:only-child th {\n",
       "        text-align: right;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: left;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>row</th>\n",
       "      <th>UNIX</th>\n",
       "      <th>date</th>\n",
       "      <th>time</th>\n",
       "      <th>speed</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>1</td>\n",
       "      <td>1475315718</td>\n",
       "      <td>2016-09-30</td>\n",
       "      <td>23:55:18</td>\n",
       "      <td>7.87</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>2</td>\n",
       "      <td>1475315423</td>\n",
       "      <td>2016-09-30</td>\n",
       "      <td>23:50:23</td>\n",
       "      <td>7.87</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>3</td>\n",
       "      <td>1475315124</td>\n",
       "      <td>2016-09-30</td>\n",
       "      <td>23:45:24</td>\n",
       "      <td>9.00</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>4</td>\n",
       "      <td>1475314821</td>\n",
       "      <td>2016-09-30</td>\n",
       "      <td>23:40:21</td>\n",
       "      <td>13.50</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>5</td>\n",
       "      <td>1475314522</td>\n",
       "      <td>2016-09-30</td>\n",
       "      <td>23:35:22</td>\n",
       "      <td>15.75</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "   row        UNIX        date      time  speed\n",
       "0    1  1475315718  2016-09-30  23:55:18   7.87\n",
       "1    2  1475315423  2016-09-30  23:50:23   7.87\n",
       "2    3  1475315124  2016-09-30  23:45:24   9.00\n",
       "3    4  1475314821  2016-09-30  23:40:21  13.50\n",
       "4    5  1475314522  2016-09-30  23:35:22  15.75"
      ]
     },
     "execution_count": 18,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "wind.head()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 19,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "row        int64\n",
       "UNIX       int64\n",
       "date      object\n",
       "time      object\n",
       "speed    float64\n",
       "dtype: object"
      ]
     },
     "execution_count": 19,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "wind.dtypes"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 20,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style>\n",
       "    .dataframe thead tr:only-child th {\n",
       "        text-align: right;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: left;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>row</th>\n",
       "      <th>UNIX</th>\n",
       "      <th>date</th>\n",
       "      <th>time</th>\n",
       "      <th>temperature</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>1</td>\n",
       "      <td>1475315718</td>\n",
       "      <td>2016-09-30</td>\n",
       "      <td>23:55:18</td>\n",
       "      <td>48</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>2</td>\n",
       "      <td>1475315423</td>\n",
       "      <td>2016-09-30</td>\n",
       "      <td>23:50:23</td>\n",
       "      <td>48</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>3</td>\n",
       "      <td>1475315124</td>\n",
       "      <td>2016-09-30</td>\n",
       "      <td>23:45:24</td>\n",
       "      <td>48</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>4</td>\n",
       "      <td>1475314821</td>\n",
       "      <td>2016-09-30</td>\n",
       "      <td>23:40:21</td>\n",
       "      <td>48</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>5</td>\n",
       "      <td>1475314522</td>\n",
       "      <td>2016-09-30</td>\n",
       "      <td>23:35:22</td>\n",
       "      <td>48</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "   row        UNIX        date      time  temperature\n",
       "0    1  1475315718  2016-09-30  23:55:18           48\n",
       "1    2  1475315423  2016-09-30  23:50:23           48\n",
       "2    3  1475315124  2016-09-30  23:45:24           48\n",
       "3    4  1475314821  2016-09-30  23:40:21           48\n",
       "4    5  1475314522  2016-09-30  23:35:22           48"
      ]
     },
     "execution_count": 20,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "temp.head()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 21,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "row             int64\n",
       "UNIX            int64\n",
       "date           object\n",
       "time           object\n",
       "temperature     int64\n",
       "dtype: object"
      ]
     },
     "execution_count": 21,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "temp.dtypes"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Объединим эти наборы данных различными способами, проверяя время их выполнения<cite data-cite=\"doc:pandas,gh:pandasql,doc:ipython\"></cite>:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 22,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style>\n",
       "    .dataframe thead tr:only-child th {\n",
       "        text-align: right;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: left;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>row</th>\n",
       "      <th>UNIX</th>\n",
       "      <th>date</th>\n",
       "      <th>time</th>\n",
       "      <th>speed</th>\n",
       "      <th>temperature</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>1</td>\n",
       "      <td>1475315718</td>\n",
       "      <td>2016-09-30</td>\n",
       "      <td>23:55:18</td>\n",
       "      <td>7.87</td>\n",
       "      <td>48</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>2</td>\n",
       "      <td>1475315423</td>\n",
       "      <td>2016-09-30</td>\n",
       "      <td>23:50:23</td>\n",
       "      <td>7.87</td>\n",
       "      <td>48</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>3</td>\n",
       "      <td>1475315124</td>\n",
       "      <td>2016-09-30</td>\n",
       "      <td>23:45:24</td>\n",
       "      <td>9.00</td>\n",
       "      <td>48</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>4</td>\n",
       "      <td>1475314821</td>\n",
       "      <td>2016-09-30</td>\n",
       "      <td>23:40:21</td>\n",
       "      <td>13.50</td>\n",
       "      <td>48</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>5</td>\n",
       "      <td>1475314522</td>\n",
       "      <td>2016-09-30</td>\n",
       "      <td>23:35:22</td>\n",
       "      <td>15.75</td>\n",
       "      <td>48</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "   row        UNIX        date      time  speed  temperature\n",
       "0    1  1475315718  2016-09-30  23:55:18   7.87           48\n",
       "1    2  1475315423  2016-09-30  23:50:23   7.87           48\n",
       "2    3  1475315124  2016-09-30  23:45:24   9.00           48\n",
       "3    4  1475314821  2016-09-30  23:40:21  13.50           48\n",
       "4    5  1475314522  2016-09-30  23:35:22  15.75           48"
      ]
     },
     "execution_count": 22,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "wind.merge(temp[[\"UNIX\", \"temperature\"]], on=\"UNIX\").head()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 23,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "5.29 ms ± 25.5 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)\n"
     ]
    }
   ],
   "source": [
    "%%timeit\n",
    "wind.merge(temp[[\"UNIX\", \"temperature\"]], on=\"UNIX\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 24,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style>\n",
       "    .dataframe thead tr:only-child th {\n",
       "        text-align: right;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: left;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>row</th>\n",
       "      <th>UNIX</th>\n",
       "      <th>date</th>\n",
       "      <th>time</th>\n",
       "      <th>speed</th>\n",
       "      <th>temperature</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>1</td>\n",
       "      <td>1475315718</td>\n",
       "      <td>2016-09-30</td>\n",
       "      <td>23:55:18</td>\n",
       "      <td>7.87</td>\n",
       "      <td>48</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>2</td>\n",
       "      <td>1475315423</td>\n",
       "      <td>2016-09-30</td>\n",
       "      <td>23:50:23</td>\n",
       "      <td>7.87</td>\n",
       "      <td>48</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>3</td>\n",
       "      <td>1475315124</td>\n",
       "      <td>2016-09-30</td>\n",
       "      <td>23:45:24</td>\n",
       "      <td>9.00</td>\n",
       "      <td>48</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>4</td>\n",
       "      <td>1475314821</td>\n",
       "      <td>2016-09-30</td>\n",
       "      <td>23:40:21</td>\n",
       "      <td>13.50</td>\n",
       "      <td>48</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>5</td>\n",
       "      <td>1475314522</td>\n",
       "      <td>2016-09-30</td>\n",
       "      <td>23:35:22</td>\n",
       "      <td>15.75</td>\n",
       "      <td>48</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "   row        UNIX        date      time  speed  temperature\n",
       "0    1  1475315718  2016-09-30  23:55:18   7.87           48\n",
       "1    2  1475315423  2016-09-30  23:50:23   7.87           48\n",
       "2    3  1475315124  2016-09-30  23:45:24   9.00           48\n",
       "3    4  1475314821  2016-09-30  23:40:21  13.50           48\n",
       "4    5  1475314522  2016-09-30  23:35:22  15.75           48"
      ]
     },
     "execution_count": 24,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "pysqldf(\"\"\"SELECT w.row, w.UNIX, w.date, w.time,\n",
    "                  w.speed, t.temperature\n",
    "           FROM wind AS w JOIN temp AS t\n",
    "           ON w.UNIX = t.UNIX\n",
    "        \"\"\").head()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 25,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "395 ms ± 1.65 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)\n"
     ]
    }
   ],
   "source": [
    "%%timeit\n",
    "pysqldf(\"\"\"SELECT w.row, w.UNIX, w.date, w.time,\n",
    "                  w.speed, t.temperature\n",
    "           FROM wind AS w JOIN temp AS t\n",
    "           ON w.UNIX = t.UNIX\n",
    "        \"\"\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Видно, что `pandasql` в&nbsp;50&nbsp;раз медленнее, чем `pandas`."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Сгруппируем набор данных с&nbsp;использованием функций агрегирования различными способами:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 26,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "date\n",
       "2016-09-01    6.396560\n",
       "2016-09-02    5.804086\n",
       "2016-09-03    4.960248\n",
       "2016-09-04    5.184571\n",
       "2016-09-05    5.830676\n",
       "Name: speed, dtype: float64"
      ]
     },
     "execution_count": 26,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "wind.groupby(\"date\")[\"speed\"].mean().head()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 27,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "1.18 ms ± 17.6 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)\n"
     ]
    }
   ],
   "source": [
    "%%timeit\n",
    "wind.groupby(\"date\")[\"speed\"].mean()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 28,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style>\n",
       "    .dataframe thead tr:only-child th {\n",
       "        text-align: right;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: left;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>date</th>\n",
       "      <th>AVG(speed)</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>2016-09-01</td>\n",
       "      <td>6.396560</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>2016-09-02</td>\n",
       "      <td>5.804086</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>2016-09-03</td>\n",
       "      <td>4.960248</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>2016-09-04</td>\n",
       "      <td>5.184571</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>2016-09-05</td>\n",
       "      <td>5.830676</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "         date  AVG(speed)\n",
       "0  2016-09-01    6.396560\n",
       "1  2016-09-02    5.804086\n",
       "2  2016-09-03    4.960248\n",
       "3  2016-09-04    5.184571\n",
       "4  2016-09-05    5.830676"
      ]
     },
     "execution_count": 28,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "pysqldf(\"\"\"SELECT date, AVG(speed)\n",
    "           FROM wind\n",
    "           GROUP BY date\n",
    "        \"\"\").head()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 29,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "163 ms ± 867 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)\n"
     ]
    }
   ],
   "source": [
    "%%timeit\n",
    "pysqldf(\"\"\"SELECT date, AVG(speed)\n",
    "           FROM wind\n",
    "           GROUP BY date\n",
    "        \"\"\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Здесь разница уже более чем в&nbsp;100&nbsp;раз. Таким образом для&nbsp;таких простых запросов проще использовать Pandas."
   ]
  }
 ],
 "metadata": {
  "authors": [
   {
    "name": "Лещев Артем Олегович"
   }
  ],
  "celltoolbar": "Raw Cell Format",
  "group": "ИУ5-24М",
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "lab_number": 2,
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.6.3"
  },
  "title": "Изучение библиотек обработки данных"
 },
 "nbformat": 4,
 "nbformat_minor": 1
}