Data Munging (?)

G├╝nl├╝k hayatta veriler istedi─čimiz kadar basit olmaz, bunlar ├╝zerinde i┼člemler yaparak uygun hale getiririz

Verilerin Sa─člamas─▒ Gereken ├ľzellikler

  • Tek tablodan olu┼čan basit veya ba─člant─▒l─▒ bir ka├ž tablodan olu┼čan

    • Farkl─▒ veriler i├žin mapping ile veri tipleri birbirine benzetilir

  • Kolay analiz edilebilir formatta olan

  • Makine ├Â─črenimine sokulabilecek veriler

  • D├╝┼č├╝k karma┼č─▒kl─▒─ča sahip

  • Y├╝ksek boyutlu veriler i├žin optimizasyon

Veri ├çekme ─░┼člemleri

Web siteleri ├╝zerindeki tablolar─▒ ├žekmek i├žin pd.read_html kullan.

Veri ├çekme Sorunlar─▒ Engellemek i├žin UserAgent Ayarlama

Baz─▒ websiteleri, isteklerin nereden geldi─čini bilmeden hareket edemezler. Bu sebeple iste─či detayland─▒rmam─▒z gerekmektedir.

HTML alan─▒na ba─člant─▒y─▒ yaz─▒n, pd.read_html(html) ┼čeklinde kullan─▒n

from urllib.request import urlopen, Request
ÔÇő
HTML = "" # ├ľrn: https://en.wikipedia.org/
ÔÇő
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2228.0 Safari/537.3'}
reg_url = HTML
req = Request(url=reg_url, headers=headers)
html = urlopen(req).read() # Pandas i├žinullan─▒lacak html objesi

Wikipedia'dan tablo ├žekme T├╝m tablo verileri aras─▒nda `0`, `1` ... de─čerleri ile gezinebiliriz. ```py import pandas as pd import json df = pd.read_html('https://en.wikipedia.org/w/index.php?title=Fortune_Global_500&oldid=855890446', header=0)[1] fortune_500 = json.loads(df.to_json(orient="records")) df ``` ![](../res/ex_wikipedia_tablo.png) ```py df_list = pd.read_html("https://en.wikipedia.org/w/index.php?title=Automotive_industry&oldid=875776152", header=0) car_totals = json.loads(df_list[1].to_json(orient="records")) car_by_man = json.loads(df_list[3].to_json(orient='records')) ``` ![](../res/ex2_wiki_tablo.png)

Harici verileri tablomuza aktarma Harici verilerimizde Inc, AG gibi ┼čirket k─▒saltmalar─▒ mevcut, bunlar─▒ kald─▒rmak i├žin *mapping* i┼člemine ba┼čvururuz ```py other_data = [ {"name": "Walmart", "employees": 2300000, "year founded": 1962 }, {"name": "State Grid Corporation of China", "employees": 927839, "year founded": 2002}, {"name": "China Petrochemical Corporation", "employees":358571, "year founded": 1998 }, {"name": "China National Petroleum Corporation", "employees": 1636532, "year founded": 1988}, {"name": "Toyota Motor Corporation", "employees": 364445, "year founded": 1937}, {"name": "Volkswagen AG", "employees": 642292, "year founded": 1937}, {"name": "Royal Dutch Shell", "employees": 92000, "year founded": 1907}, {"name": "Berkshire Hathaway Inc.", "employees":377000, "year founded": 1839}, {"name": "Apple Inc.", "employees": 123000, "year founded": 1976}, {"name": "Exxon Mobile Corporation", "employees": 69600, "year founded": 1999}, {"name": "BP plc", "employees": 74000, "year founded": 1908} ] mapping = { 'Apple': 'Apple Inc.', 'BP': 'BP plc', 'Berkshire Hathaway': 'Berkshire Hathaway Inc.', 'China National Petroleum': 'China National Petroleum Corporation', 'Exxon Mobil': 'Exxon Mobile Corporation', 'Sinopec Group': 'China Petrochemical Corporation', 'State Grid': 'State Grid Corporation of China', 'Toyota Motor': 'Toyota Motor Corporation', 'Volkswagen': 'Volkswagen AG' } ```

Veri yap─▒s─▒n─▒ de─či┼čtirme ├Ârne─či `500$ billion` ┼čeklindeki verileri bilimsem `500e9` (500 x 10^9) verisine ├ževirme ```py def convert_revenue(x): return float(x.lstrip('$').rstrip('billion')) * 1e9 assert convert_revenue('$500 billion') == 500e9 # Test i┼člemi ```

Yeni i┼členmi┼č veri ortaya ├ž─▒karma ─░┼členmi┼č verileri her daim ana veriyi bozmadan, ek objelerde tutmal─▒y─▒z. ```py def rev_per_emp(company): name = company[u'Company'] n_employees = dict_data[mapping.get(name, name)].get('employees') company['rev per emp'] = convert_revenue(company[u'Revenue in USD'])/n_employees return company def compute_copy(d, func): return func({k:v for k,v in d.items()}) data = list(map(lambda x : compute_copy(x, rev_per_emp), fortune_500)) ``` ![](../res/ex_copied_processed_data.png)

Verileri s─▒ralama i┼člemleri S─▒ralama i┼člemleri karar verme i┼člemleri i├žin ├žok ├Ânemlidir. ```py rev_per_emp = sorted( [i[u'Company'], i['rev per emp'] for i in data], key=lambda x : x[1], reverse=True ) rev_per_emp ``` ![](../res/ex_data_sorting.png)

Verileri sayd─▒rma i┼člemleri ```py from collection import Counter Counter(i[u'Industry'] for i in data) ``` ![](../res/ex_counter.png)

Verileri kategorilere ay─▒rma Belli de─čerlere ├Âzg├╝ analiz yapmak i├žin etkili bir ├ž├Âz├╝md├╝r. ```py sub_data = [i for i in data if i[u'Industry'] in [u'Automobiles', u'Petroleum']] sub_data ``` ![](../res/ex_categorized_data.png)

Ek s─▒ralama y├Ântemi ─░lk ├Ânce sorting yap─▒l─▒r ard─▒ndan son eleman al─▒n─▒r. ```py total_prod = sorted((i[u"Year"], i[u'Production']) for i in car_totals)[-1][1] ```

H─▒zl─▒ Notlar

# Index'e g├Âre de─čil, kay─▒tlara g├Âre dict olu┼čturma ­čîč
soil_data = json.loads(rev.to_json(orient="records"))
ÔÇő
# *'lar soldan kald─▒r─▒p toplama
total = sum([float(i[u'Revenue'].rstrip('*')) for i in oil_data])
ÔÇő
# Max i┼člemi (2. itema g├Âre max alma)
max(overlaps, key=lambda x : x[1])[0]
ÔÇő
# S─▒ralama ve en y├╝ksek de─čeri alma
total_prod = df_totals.sort_values(by='Year').iloc[-1]['Production']
# orient varken (sat─▒ra g├Âre yapar)
[{'Company': 'Saudi Aramco', 'Revenue': '465.49'},
{'Company': 'Sinopec Group', 'Revenue': '448.00'}]
ÔÇő
# orient olmazsa (s├╝tuna g├Âre yapar)
{'Company': {'0': 'Saudi Aramco',
'1': 'Sinopec Group',
'2': 'China National Petroleum Corporation',
'3': 'Exxon Mobil',
'4': 'Royal Dutch Shell',

Fuzzy Match

Kelimelerin birbirine ├žok yak─▒n olanlar─▒n─▒ bulur.

def fuzzy_match(word, s):
words = set(word.split(' '))
overlaps = [(k, len(v.intersection(words))) for k, v in s.items()]
return max(overlaps, key=lambda x : x[1])[0]
split_names = {i: set(i.split(' ')) for i in shares.keys()}
for i in petro_companies:
match = fuzzy_match(i, split_names)
print("matched {} to {}".format(i, match))
market_share[i] = shares[match]