When it started to get harder to deal with the Data with the conventional ways and DataBank Systems, IBM Data Scientists realiesed some features of this undealable big Data. What makes hard to work with the Data in old easy way was described in 4 words, and to call a data as Big Data we need to check these 4 characters of the Data:
Volume: The Size of the Data should be in the range, which is beyond the way that we can handle it in custom ways. So, lets say, 20 years ago GB Datas were Big Data, but during these 20 years "the traditional technologies to deal with data", CPU, RAM, Harddisk Technologies improved also a little more and now 1 computer is enough to deal with GB Data, then unfortunately "The Big Data" which was big 20 years ago, is not any more "Big" enough. In short, the Size of the Data, should demand more than One computer with the best Hardware of its time.
Variety: The Diversity of the Data should demand other ways of handling than the traditional ways. Traditionally we are used to see the Database in the shape of "Tables" which are consist of "Rows" and "Tables", in a structered way. When we need to save the photos, texts, audios and Videos in the Database, then we need a little flexible ones. So here Variety stands for the different types of Data to keep in DataBase.
Velocity: Velocity is a synonym for Speed actually. Here we should understand it so, that the Data increase periodically, in a way that even if the Data is not big now, it can be very big soon. This kind of speed comes from a Streaming Data, which are mostly origining from Maschines. Z.B. Sensors(GPS - Location of a Ship, or Plane or an Application in a Cellphone), Cameras etc.
Veracity: This V stands for the Trustworthiness of the Data. Trustworthiness here means, the quality of represantation Power of the Data.
If all of these 4 conditions exist, then we can call the Data as "Big Data"
a.) Erkläre anhand eines selbstgewählten Beispiels was fehlende Werte sind.
Lets consider that we have a Data of a Poll, which brings the values of the people who have children or expecting one, and their level of expectancies in life and relatively their happiness in life.
Lets have the following columns: sex, number_of_children, pregnancy, spouse_pregnancy, house/apartment, demand_to_improve_accomodation, car, demanding_car
Permanent Missing Values: First of all we need to be ready for any kind of data lost during the filling in the poll, or copying them into computer or piling them up in a file, or transforming them in an application.
Structural Missing Values: We certainly should expect missing values in the pregnancy and spouse_pregnancy columns, depending on the sex of the Poll Participator. We can understand that instead of "No", the participator skipped the area, just because logically it is understood that, the area should have no information.
The data type of missing values can be:
b.) Warum muss sich bei der Datenanalyse damit beschäftigt werden?
As we see in the datatypes of missing values, when we want to apply some mathematical process on the columns, these can cause problems, or affect the statistical measures(mean, median, mode... etc.) of the whole column. Then, the result from Data Analysis will not be trustworthy.
c.) Wie können fehlende Werte in der Datenanalyse identifiziert werden?
We can use certain codes to check the values in the columns and get familiar with the data.
df.head() : to see the first 5 rows
df.tail(3) : to see the last 3 rows
df.info() : To check if the supposed type and the real type fits.
(Supposed type: Integer, Real Type: Float => There can be NaNs)
df.isnull.sum() : To check the columns without values and some them up.
df.nunique() : to check the number of unique items before we check the unique items.
df.unique() : to see the unique values
some libraries and their plots (z.B. missingno => msno.bar, msno.matrix, msno.heatmap... etc.)
Du arbeitest als Data Scientist in einem Start Up. Ihr habt vor einem Jahr euer Geschäft eröffnet und nun wollt ihr den nächsten Schritt gehen und eure Dienste erweitern. Euer Geschäftsmodell ist das Betreiben einer Plattform bei der sich Personen die eine Geschäftsidee haben, aber nicht das benötigte Geld, anmelden und für ihr Projekt innerhalb einer vorgegebenen Zeit Geld sammeln können. Auf der anderen Seite habt ihr Geldgeber, die gern ihr Geld in Projekte anlegen möchten und nach Investitionen suchen. Als Vermittler bringt eure Plattform also Geldnehmer und Geldgeber zusammen. Ihr verdient euer Geld mit einer Provision für jedes Projekt was auf eurer Plattform landet.
Deine Datenbasis ist die Historie eurer Plattform. Alle Projekte sind abgeschlossene Projekte, d.h. die Zeit, um für sein Projekt Geld zu sammeln ist abgelaufen. Euer Geschäftsmodell sieht es vor, dass die gesammelten Gelder ausgezahlt werden, auch wenn der Zielbetrag nicht erreicht wurde.
In dem Datensatz gibt es KEINE Duplikate.
Der gesplittete Datensatz enthält folgende Spalten (inkl. Bedeutung):
repayment interval ... Rückzahlungsmodalitäten/-rhythmus
Data Preprocessing
- Datensatz einlesen
- fehlende Werte
- Ausreißer
- Pairplot - erste visuelle Inspektion
- Features(kommentar)
- Speicherplatz
Visualisierung
import pandas as pd
df_part1 = pd.read_csv("part1.csv", sep="/n", engine="python", nrows=2)
df_part1 # We see in this output that the seperator is ",", so we import our file without sep. And the file has an index column.
,funded_amount,loan_amount,activity,sector,use,country_code,country,region,currency,term_in_months,lender_count,borrower_genders,repayment_interval | |
---|---|
0 | 0,300.0,300.0,Fruits & Vegetables,Food,"To buy... |
1 | 1,575.0,575.0,Rickshaw,Transportation,to repai... |
df_part1 = pd.read_csv("part1.csv", engine="python", index_col=0)
df_part1.head(2)
funded_amount | loan_amount | activity | sector | use | country_code | country | region | currency | term_in_months | lender_count | borrower_genders | repayment_interval | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 300.0 | 300.0 | Fruits & Vegetables | Food | To buy seasonal, fresh fruits to sell. | PK | Pakistan | Lahore | PKR | 12.0 | 12 | female | irregular |
1 | 575.0 | 575.0 | Rickshaw | Transportation | to repair and maintain the auto rickshaw used ... | PK | Pakistan | Lahore | PKR | 11.0 | 14 | female, female | irregular |
df_part2 = pd.read_csv("part2.csv", sep="/n", engine="python", nrows=2)
df_part2 # We see in this output that the seperator is "#", so we import our file that way. And the file has an index column.
#funded_amount#loan_amount#activity#sector#use#country_code#country#region#currency#term_in_months#lender_count#borrower_genders#repayment_interval | |
---|---|
0 | 0#175.0#175.0#Liquor Store / Off-License#Food#... |
1 | 1#325.0#325.0#Livestock#Agriculture#to buy 3 z... |
df_part2 = pd.read_csv("part2.csv", sep="#", engine="python", index_col=0)
df_part2.head(2)
funded_amount | loan_amount | activity | sector | use | country_code | country | region | currency | term_in_months | lender_count | borrower_genders | repayment_interval | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 175.0 | 175.0 | Liquor Store / Off-License | Food | to purchase additional stock of coconut wine t... | PH | Philippines | Palo, Leyte | PHP | 8.0 | 6 | female | irregular |
1 | 325.0 | 325.0 | Livestock | Agriculture | to buy 3 zebus and food to fatten them up. | MG | Madagascar | Antsirabe | MGA | 12.0 | 13 | female | monthly |
df_part1.columns == df_part2.columns
array([ True, True, True, True, True, True, True, True, True, True, True, True, True])
len(df_part1.index)+len(df_part2.index)
671205
df_funding = pd.concat([df_part1,df_part2])
len(df_funding.index)
671205
df_funding.reset_index(inplace=True)
df_funding.drop(columns="index", inplace=True)
df_funding.head(51)
funded_amount | loan_amount | activity | sector | use | country_code | country | region | currency | term_in_months | lender_count | borrower_genders | repayment_interval | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 300.0 | 300.0 | Fruits & Vegetables | Food | To buy seasonal, fresh fruits to sell. | PK | Pakistan | Lahore | PKR | 12.0 | 12 | female | irregular |
1 | 575.0 | 575.0 | Rickshaw | Transportation | to repair and maintain the auto rickshaw used ... | PK | Pakistan | Lahore | PKR | 11.0 | 14 | female, female | irregular |
2 | 150.0 | 150.0 | Transportation | Transportation | To repair their old cycle-van and buy another ... | IN | India | Maynaguri | INR | 43.0 | 6 | female | bullet |
3 | 200.0 | 200.0 | Embroidery | Arts | to purchase an embroidery machine and a variet... | PK | Pakistan | Lahore | PKR | 11.0 | 8 | female | irregular |
4 | 400.0 | 400.0 | Milk Sales | Food | to purchase one buffalo. | PK | Pakistan | Abdul Hakeem | PKR | 14.0 | 16 | female | monthly |
5 | 250.0 | 250.0 | Services | Services | purchase leather for my business using ksh 20000. | KE | Kenya | NaN | KES | 4.0 | 6 | female | irregular |
6 | 200.0 | 200.0 | Dairy | Agriculture | To purchase a dairy cow and start a milk produ... | IN | India | Maynaguri | INR | 43.0 | 8 | female | bullet |
7 | 400.0 | 400.0 | Beauty Salon | Services | to buy more hair and skin care products. | PK | Pakistan | Ellahabad | PKR | 14.0 | 8 | female | monthly |
8 | 475.0 | 475.0 | Manufacturing | Manufacturing | to purchase leather, plastic soles and heels i... | PK | Pakistan | Lahore | PKR | 14.0 | 19 | female | monthly |
9 | 625.0 | 625.0 | Food Production/Sales | Food | to buy a stall, gram flour, ketchup, and coal ... | PK | Pakistan | Lahore | PKR | 11.0 | 24 | female | irregular |
10 | 200.0 | 200.0 | Rickshaw | Transportation | to cover the cost of repairing rickshaw | PK | Pakistan | Lahore | PKR | 11.0 | 3 | female | irregular |
11 | 400.0 | 400.0 | Wholesale | Wholesale | to purchase biscuits, sweets and juices in bulk. | PK | Pakistan | Faisalabad | PKR | 14.0 | 16 | female | monthly |
12 | 400.0 | 400.0 | General Store | Retail | to buy stock of rice, sugar and flour . | PK | Pakistan | Faisalabad | PKR | 14.0 | 16 | female | monthly |
13 | 400.0 | 400.0 | Clothing Sales | Clothing | to purchase variety of winter clothes to sell. | PK | Pakistan | Lahore | PKR | 12.0 | 10 | female | irregular |
14 | 225.0 | 225.0 | Poultry | Agriculture | to expand her existing poultry farm business. | IN | India | Dhupguri | INR | 43.0 | 7 | female | bullet |
15 | 300.0 | 300.0 | Rickshaw | Transportation | to buy a three-wheeled rickshaw. | PK | Pakistan | Lahore | PKR | 11.0 | 9 | female | irregular |
16 | 400.0 | 400.0 | General Store | Retail | to buy packs of salts, biscuits and beverages. | PK | Pakistan | Faisalabad | PKR | 14.0 | 11 | female | monthly |
17 | 875.0 | 875.0 | Tailoring | Services | To buy a sewing machine, lace, zippers and but... | PK | Pakistan | Lahore | PKR | 11.0 | 25 | female, female, female | irregular |
18 | 250.0 | 250.0 | Sewing | Services | to purchase a sewing machine. | IN | India | Maynaguri | INR | 43.0 | 4 | female | bullet |
19 | 475.0 | 475.0 | Beauty Salon | Services | to buy more cosmetics products for her beauty ... | PK | Pakistan | Lahore | PKR | 14.0 | 18 | female | monthly |
20 | 250.0 | 250.0 | Bakery | Food | to buy ingredients to make bakery products. | PK | Pakistan | Lahore | PKR | 11.0 | 10 | female | irregular |
21 | 350.0 | 350.0 | Restaurant | Food | to purchase vegetables, chicken, and oil to co... | PK | Pakistan | Lahore | PKR | 12.0 | 13 | female | irregular |
22 | 575.0 | 575.0 | Clothing Sales | Clothing | To buy winter clothing to sell | PK | Pakistan | Lahore | PKR | 12.0 | 20 | female, female | irregular |
23 | 350.0 | 350.0 | Embroidery | Arts | to buy reels of threads in different colors an... | PK | Pakistan | Lahore | PKR | 14.0 | 6 | female | monthly |
24 | 250.0 | 250.0 | Food Stall | Food | to purchase a variety of needed food items to ... | PK | Pakistan | Lahore | PKR | 11.0 | 7 | female | irregular |
25 | 250.0 | 250.0 | Farming | Agriculture | to purchase potato seeds and fertilizers for g... | IN | India | Dhupguri | INR | 43.0 | 9 | female | bullet |
26 | 250.0 | 250.0 | Construction Supplies | Construction | to purchase stones for starting a business sup... | IN | India | Maynaguri | INR | 43.0 | 10 | female | bullet |
27 | 475.0 | 475.0 | Rickshaw | Transportation | to cover the cost of repairing rickshaw | PK | Pakistan | Lahore | PKR | 12.0 | 17 | female | irregular |
28 | 125.0 | 125.0 | Farming | Agriculture | to purchase potato seeds and fertilizers for g... | IN | India | Maynaguri | INR | 43.0 | 5 | female | bullet |
29 | 250.0 | 250.0 | Farming | Agriculture | to purchase potato seeds and fertilizer for fa... | IN | India | Dhupguri | INR | 43.0 | 10 | female | bullet |
30 | 250.0 | 250.0 | Clothing Sales | Clothing | to buy clothing to sell | PK | Pakistan | Lahore | PKR | 11.0 | 8 | female | irregular |
31 | 400.0 | 400.0 | Manufacturing | Manufacturing | to buy bundles of good quality paper along wit... | PK | Pakistan | Jaranwala | PKR | 14.0 | 7 | female | monthly |
32 | 450.0 | 450.0 | General Store | Retail | to buy packs of salt, biscuits, and beverages. | PK | Pakistan | Lahore | PKR | 14.0 | 18 | female | monthly |
33 | 250.0 | 250.0 | Farming | Agriculture | to cultivate broad beans on her own land. | IN | India | Dhupguri | INR | 43.0 | 10 | female | bullet |
34 | 400.0 | 400.0 | Milk Sales | Food | to purchase one cow. | PK | Pakistan | Sialkot | PKR | 14.0 | 16 | female | monthly |
35 | 2225.0 | 2225.0 | Personal Products Sales | Retail | to buy hair oils to sell. | PK | Pakistan | Lahore | PKR | 11.0 | 58 | female, female, female, female, female, female... | irregular |
36 | 400.0 | 400.0 | Services | Services | to purchase bamboo, nails and iron sheets | PK | Pakistan | Lahore | PKR | 11.0 | 16 | female | irregular |
37 | 475.0 | 475.0 | Sewing | Services | to buy more spools of threads in different col... | PK | Pakistan | Lahore | PKR | 14.0 | 12 | female | monthly |
38 | 350.0 | 350.0 | Home Products Sales | Retail | to buy different kinds of knives to sell | PK | Pakistan | Lahore | PKR | 11.0 | 4 | female | irregular |
39 | 475.0 | 475.0 | General Store | Retail | to buy rice, sugar and flour in bulk. | PK | Pakistan | Jaranwala | PKR | 14.0 | 14 | female | monthly |
40 | 250.0 | 250.0 | Natural Medicines | Health | to buy herbal medicines to sell in their herba... | PK | Pakistan | Lahore | PKR | 11.0 | 6 | female | irregular |
41 | 475.0 | 475.0 | Fish Selling | Food | to purchase a new, bigger-size cart. | PK | Pakistan | Jhang | PKR | 14.0 | 16 | female | monthly |
42 | 200.0 | 200.0 | Education provider | Education | to purchase furniture for husband's academy. | PK | Pakistan | Lahore | PKR | 11.0 | 8 | female | irregular |
43 | 250.0 | 250.0 | Shoe Sales | Retail | To buy women's shoes to sell | PK | Pakistan | Lahore | PKR | 12.0 | 8 | female | irregular |
44 | 450.0 | 450.0 | Rickshaw | Transportation | to buy new tires for her son's rickshaw. | PK | Pakistan | Faisalabad | PKR | 14.0 | 15 | female | monthly |
45 | 150.0 | 150.0 | Farming | Agriculture | to purchase potato seeds and fertilizers for c... | IN | India | Maynaguri | INR | 43.0 | 6 | female | bullet |
46 | 250.0 | 250.0 | Fruits & Vegetables | Food | to purchase sacks of tomatoes, potatoes, fruit... | KE | Kenya | Voi | KES | 13.0 | 7 | female | irregular |
47 | 250.0 | 250.0 | Farming | Agriculture | to purchase potato seeds and fertilizers for g... | IN | India | Dhupguri | INR | 43.0 | 10 | female | bullet |
48 | 600.0 | 600.0 | Machinery Rental | Services | to invest in working capital and to maintain g... | NI | Nicaragua | Leon | NIO | 14.0 | 16 | female | monthly |
49 | 450.0 | 450.0 | General Store | Retail | to stock his store. | SV | El Salvador | NaN | USD | 14.0 | 18 | male | monthly |
50 | 3175.0 | 3175.0 | Butcher Shop | Food | to buy meat and also to start selling fish in ... | TZ | Tanzania | Dar es Salaam | TZS | 10.0 | 93 | male, male, male, male, male | monthly |
df_funding.tail(51)
funded_amount | loan_amount | activity | sector | use | country_code | country | region | currency | term_in_months | lender_count | borrower_genders | repayment_interval | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
671154 | 0.0 | 725.0 | Livestock | Agriculture | Translated loan use to english. | BO | Bolivia | La Paz | BOB | 13.0 | 0 | female | monthly |
671155 | 50.0 | 875.0 | Livestock | Agriculture | Translated loan use to english. | BO | Bolivia | La Paz | BOB | 13.0 | 2 | female, female | monthly |
671156 | 75.0 | 250.0 | Livestock | Agriculture | Reviewed loan use in english. | GH | Ghana | Dansoman | GHS | 13.0 | 3 | female | monthly |
671157 | 75.0 | 75.0 | Livestock | Agriculture | Pretend the issue with spanish loan was addres... | MX | Mexico | Iztacalco | MXN | 13.0 | 3 | female | monthly |
671158 | 0.0 | 75.0 | Livestock | Agriculture | Pretend the issue with spanish loan was addres... | MX | Mexico | Iztacalco | MXN | 13.0 | 0 | female | monthly |
671159 | 50.0 | 725.0 | Livestock | Agriculture | Translated loan use to english. | BO | Bolivia | La Paz | BOB | 13.0 | 2 | female | monthly |
671160 | 0.0 | 25.0 | Livestock | Agriculture | [True, u'to start a turducken farm.'] - this l... | KE | Kenya | NaN | KES | 13.0 | 0 | female | monthly |
671161 | 0.0 | 725.0 | Livestock | Agriculture | Translated loan use to english. | BO | Bolivia | La Paz | BOB | 13.0 | 0 | female | monthly |
671162 | 50.0 | 875.0 | Livestock | Agriculture | Translated loan use to english. | BO | Bolivia | La Paz | BOB | 13.0 | 2 | female, female | monthly |
671163 | 25.0 | 25.0 | Livestock | Agriculture | [True, u'para compara: cemento, arenya y ladri... | PY | Paraguay | Concepción | USD | 13.0 | 1 | female | monthly |
671164 | 25.0 | 25.0 | Livestock | Agriculture | Reviewed loan use in english. | PK | Pakistan | Attock | PKR | 13.0 | 1 | female | monthly |
671165 | 50.0 | 875.0 | Livestock | Agriculture | Translated loan use to english. | BO | Bolivia | La Paz | BOB | 13.0 | 2 | female, female | monthly |
671166 | 25.0 | 25.0 | Livestock | Agriculture | [True, u'to start a turducken farm.'] - this l... | KE | Kenya | NaN | KES | 13.0 | 1 | female | monthly |
671167 | 0.0 | 25.0 | Livestock | Agriculture | Kiva Coordinator fixed issue loan (no longer v... | KE | Kenya | NaN | KES | 13.0 | 0 | female, female | monthly |
671168 | 0.0 | 25.0 | Livestock | Agriculture | Pretend the issue with loan got addressed by K... | KE | Kenya | NaN | KES | 13.0 | 0 | female | monthly |
671169 | 0.0 | 25.0 | Livestock | Agriculture | Kiva Coordinator fixed issue loan (no longer v... | KE | Kenya | NaN | KES | 13.0 | 0 | female, female | monthly |
671170 | 25.0 | 25.0 | Livestock | Agriculture | Reviewed loan use in english. | PK | Pakistan | Attock | PKR | 13.0 | 1 | female | monthly |
671171 | 0.0 | 25.0 | Games | Entertainment | Kiva Coordinator replaced loan use. Should see... | KE | Kenya | NaN | KES | 13.0 | 0 | female, female | monthly |
671172 | 0.0 | 25.0 | Livestock | Agriculture | Pretend the issue with loan got addressed by K... | KE | Kenya | NaN | KES | 13.0 | 0 | female | monthly |
671173 | 75.0 | 125.0 | Livestock | Agriculture | Pretend the flagged issue was addressed by KC. | MX | Mexico | Iztacalco | MXN | 13.0 | 3 | female, female | monthly |
671174 | 0.0 | 25.0 | Games | Entertainment | NaN | KE | Kenya | NaN | KES | 13.0 | 0 | NaN | monthly |
671175 | 75.0 | 125.0 | Livestock | Agriculture | Pretend the flagged issue was addressed by KC. | MX | Mexico | Iztacalco | MXN | 13.0 | 3 | female, female | monthly |
671176 | 0.0 | 50.0 | Livestock | Agriculture | Edited loan use in english. | GH | Ghana | Dansoman | GHS | 13.0 | 0 | female, female | monthly |
671177 | 25.0 | 25.0 | Livestock | Agriculture | [True, u'to start a turducken farm.'] - this l... | KE | Kenya | NaN | KES | 13.0 | 1 | female | monthly |
671178 | 0.0 | 25.0 | Livestock | Agriculture | NaN | KE | Kenya | NaN | KES | 13.0 | 0 | NaN | monthly |
671179 | 0.0 | 25.0 | Livestock | Agriculture | [True, u'para compara: cemento, arenya y ladri... | PY | Paraguay | Concepción | USD | 13.0 | 0 | female | monthly |
671180 | 0.0 | 725.0 | Livestock | Agriculture | Translated loan use to english. | BO | Bolivia | La Paz | BOB | 13.0 | 0 | female | monthly |
671181 | 0.0 | 25.0 | Livestock | Agriculture | Reviewed loan use in english. | PK | Pakistan | Attock | PKR | 13.0 | 0 | female | monthly |
671182 | 0.0 | 125.0 | Livestock | Agriculture | Pretend the flagged issue was addressed by KC. | MX | Mexico | Iztacalco | MXN | 13.0 | 0 | female, female | monthly |
671183 | 0.0 | 75.0 | Livestock | Agriculture | Pretend the issue with spanish loan was addres... | MX | Mexico | Iztacalco | MXN | 13.0 | 0 | female | monthly |
671184 | 0.0 | 875.0 | Livestock | Agriculture | Translated loan use to english. | BO | Bolivia | La Paz | BOB | 13.0 | 0 | female, female | monthly |
671185 | 0.0 | 25.0 | Livestock | Agriculture | NaN | KE | Kenya | NaN | KES | 13.0 | 0 | NaN | monthly |
671186 | 25.0 | 25.0 | Livestock | Agriculture | [True, u'para compara: cemento, arenya y ladri... | PY | Paraguay | Concepción | USD | 13.0 | 1 | female | monthly |
671187 | 0.0 | 25.0 | Livestock | Agriculture | [True, u'to start a turducken farm.'] - this l... | KE | Kenya | NaN | KES | 13.0 | 0 | female | monthly |
671188 | 0.0 | 250.0 | Livestock | Agriculture | Reviewed loan use in english. | GH | Ghana | Dansoman | GHS | 13.0 | 0 | female | monthly |
671189 | 0.0 | 25.0 | Livestock | Agriculture | [True, u'to start a turducken farm.'] - this l... | KE | Kenya | NaN | KES | 13.0 | 0 | female | monthly |
671190 | 0.0 | 25.0 | Livestock | Agriculture | [True, u'to start a turducken farm.'] - this l... | KE | Kenya | NaN | KES | 13.0 | 0 | female | monthly |
671191 | 0.0 | 725.0 | Livestock | Agriculture | Translated loan use to english. | BO | Bolivia | La Paz | BOB | 13.0 | 0 | female | monthly |
671192 | 0.0 | 875.0 | Livestock | Agriculture | Translated loan use to english. | BO | Bolivia | La Paz | BOB | 13.0 | 0 | female, female | monthly |
671193 | 0.0 | 125.0 | Livestock | Agriculture | Pretend the flagged issue was addressed by KC. | MX | Mexico | Iztacalco | MXN | 13.0 | 0 | female, female | monthly |
671194 | 0.0 | 25.0 | Livestock | Agriculture | Kiva Coordinator fixed issue loan (no longer v... | KE | Kenya | NaN | KES | 13.0 | 0 | female, female | monthly |
671195 | 0.0 | 50.0 | Livestock | Agriculture | Edited loan use in english. | GH | Ghana | Dansoman | GHS | 13.0 | 0 | female, female | monthly |
671196 | 0.0 | 250.0 | Livestock | Agriculture | Reviewed loan use in english. | GH | Ghana | Dansoman | GHS | 13.0 | 0 | female | monthly |
671197 | 0.0 | 25.0 | Livestock | Agriculture | Pretend the issue with loan got addressed by K... | KE | Kenya | NaN | KES | 13.0 | 0 | female | monthly |
671198 | 0.0 | 75.0 | Livestock | Agriculture | Pretend the issue with spanish loan was addres... | MX | Mexico | Iztacalco | MXN | 13.0 | 0 | female | monthly |
671199 | 0.0 | 25.0 | Livestock | Agriculture | [True, u'para compara: cemento, arenya y ladri... | PY | Paraguay | Concepción | USD | 13.0 | 0 | female | monthly |
671200 | 0.0 | 25.0 | Livestock | Agriculture | [True, u'para compara: cemento, arenya y ladri... | PY | Paraguay | Concepción | USD | 13.0 | 0 | female | monthly |
671201 | 25.0 | 25.0 | Livestock | Agriculture | [True, u'to start a turducken farm.'] - this l... | KE | Kenya | NaN | KES | 13.0 | 1 | female | monthly |
671202 | 0.0 | 25.0 | Games | Entertainment | NaN | KE | Kenya | NaN | KES | 13.0 | 0 | NaN | monthly |
671203 | 0.0 | 25.0 | Livestock | Agriculture | [True, u'to start a turducken farm.'] - this l... | KE | Kenya | NaN | KES | 13.0 | 0 | female | monthly |
671204 | 0.0 | 25.0 | Livestock | Agriculture | [True, u'to start a turducken farm.'] - this l... | KE | Kenya | NaN | KES | 13.0 | 0 | female | monthly |
df_funding.info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 671205 entries, 0 to 671204 Data columns (total 13 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 funded_amount 671205 non-null float64 1 loan_amount 671205 non-null float64 2 activity 671205 non-null object 3 sector 671205 non-null object 4 use 666973 non-null object 5 country_code 671197 non-null object 6 country 671205 non-null object 7 region 614405 non-null object 8 currency 671205 non-null object 9 term_in_months 671205 non-null float64 10 lender_count 671205 non-null int64 11 borrower_genders 666984 non-null object 12 repayment_interval 671205 non-null object dtypes: float64(3), int64(1), object(9) memory usage: 66.6+ MB
df_funding.nunique()
funded_amount 610 loan_amount 479 activity 163 sector 15 use 424912 country_code 86 country 87 region 12695 currency 67 term_in_months 148 lender_count 503 borrower_genders 11298 repayment_interval 4 dtype: int64
df_funding.sector.unique()
array(['Food', 'Transportation', 'Arts', 'Services', 'Agriculture', 'Manufacturing', 'Wholesale', 'Retail', 'Clothing', 'Construction', 'Health', 'Education', 'Personal Use', 'Housing', 'Entertainment'], dtype=object)
df_funding.country.unique()
array(['Pakistan', 'India', 'Kenya', 'Nicaragua', 'El Salvador', 'Tanzania', 'Philippines', 'Peru', 'Senegal', 'Cambodia', 'Liberia', 'Vietnam', 'Iraq', 'Honduras', 'Palestine', 'Mongolia', 'United States', 'Mali', 'Colombia', 'Tajikistan', 'Guatemala', 'Ecuador', 'Bolivia', 'Yemen', 'Ghana', 'Sierra Leone', 'Haiti', 'Chile', 'Jordan', 'Uganda', 'Burundi', 'Burkina Faso', 'Timor-Leste', 'Indonesia', 'Georgia', 'Ukraine', 'Kosovo', 'Albania', 'The Democratic Republic of the Congo', 'Costa Rica', 'Somalia', 'Zimbabwe', 'Cameroon', 'Turkey', 'Azerbaijan', 'Dominican Republic', 'Brazil', 'Mexico', 'Kyrgyzstan', 'Armenia', 'Paraguay', 'Lebanon', 'Samoa', 'Israel', 'Rwanda', 'Zambia', 'Nepal', 'Congo', 'Mozambique', 'South Africa', 'Togo', 'Benin', 'Belize', 'Suriname', 'Thailand', 'Nigeria', 'Mauritania', 'Vanuatu', 'Panama', 'Virgin Islands', 'Saint Vincent and the Grenadines', "Lao People's Democratic Republic", 'Malawi', 'Myanmar (Burma)', 'Moldova', 'South Sudan', 'Solomon Islands', 'China', 'Egypt', 'Guam', 'Afghanistan', 'Madagascar', 'Namibia', 'Puerto Rico', 'Lesotho', "Cote D'Ivoire", 'Bhutan'], dtype=object)
df_funding.country_code.unique()
array(['PK', 'IN', 'KE', 'NI', 'SV', 'TZ', 'PH', 'PE', 'SN', 'KH', 'LR', 'VN', 'IQ', 'HN', 'PS', 'MN', 'US', 'ML', 'CO', 'TJ', 'GT', 'EC', 'BO', 'YE', 'GH', 'SL', 'HT', 'CL', 'JO', 'UG', 'BI', 'BF', 'TL', 'ID', 'GE', 'UA', 'XK', 'AL', 'CD', 'CR', 'SO', 'ZW', 'CM', 'TR', 'AZ', 'DO', 'BR', 'MX', 'KG', 'AM', 'PY', 'LB', 'WS', 'IL', 'RW', 'ZM', 'NP', 'CG', 'MZ', 'ZA', 'TG', 'BJ', 'BZ', 'SR', 'TH', 'NG', 'MR', 'VU', 'PA', 'VI', 'VC', 'LA', 'MW', 'MM', 'MD', 'SS', 'SB', 'CN', 'EG', 'GU', 'AF', 'MG', nan, 'PR', 'LS', 'CI', 'BT'], dtype=object)
df_funding.loc[df_funding.country_code.isna()==True, :]
funded_amount | loan_amount | activity | sector | use | country_code | country | region | currency | term_in_months | lender_count | borrower_genders | repayment_interval | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
202537 | 4150.0 | 4150.0 | Wholesale | Wholesale | To purchase lighting products for sale to loca... | NaN | Namibia | EEnhana | NAD | 6.0 | 162 | female | bullet |
202823 | 4150.0 | 4150.0 | Wholesale | Wholesale | To purchase lighting products for sale to loca... | NaN | Namibia | Rundu | NAD | 6.0 | 159 | male | bullet |
344929 | 3325.0 | 3325.0 | Wholesale | Wholesale | To purchase lighting products for sale to loca... | NaN | Namibia | EEnhana | NAD | 7.0 | 120 | female | bullet |
351177 | 3325.0 | 3325.0 | Wholesale | Wholesale | To purchase lighting products for sale to loca... | NaN | Namibia | Rundu | NAD | 7.0 | 126 | male | bullet |
420953 | 3325.0 | 3325.0 | Wholesale | Wholesale | To purchase lighting products for sale to loca... | NaN | Namibia | EEnhana | NAD | 7.0 | 118 | female | bullet |
421218 | 4000.0 | 4000.0 | Wholesale | Wholesale | purchase solar lighting products for sale to l... | NaN | Namibia | Rundu | NAD | 7.0 | 150 | male | bullet |
487207 | 5100.0 | 5100.0 | Renewable Energy Products | Retail | to pay for stock of solar lights and cell phon... | NaN | Namibia | Katima Mulilo | NAD | 7.0 | 183 | male | bullet |
487653 | 5000.0 | 5000.0 | Wholesale | Wholesale | to maintain a stock of solar lights and cell p... | NaN | Namibia | Oshakati | NAD | 7.0 | 183 | female | bullet |
df_funding.loc[df_funding.country_code.isna()==True, "country_code"] = "NA"
# We check it again if everything is alright
df_funding.nunique()
funded_amount 610 loan_amount 479 activity 163 sector 15 use 424912 country_code 87 country 87 region 12695 currency 67 term_in_months 148 lender_count 503 borrower_genders 11298 repayment_interval 4 dtype: int64
### Now lets focus on the columns "Term in Months" and "Borrower Genders"
df_funding.term_in_months.unique()
### it seems everything is ok, the more quantity was origined because of a big range of payment flexibility
array([ 12., 11., 43., 14., 4., 13., 10., 8., 5., 20., 7., 3., 17., 15., 9., 39., 23., 26., 6., 30., 22., 25., 27., 16., 52., 18., 36., 19., 28., 62., 32., 24., 21., 50., 38., 2., 35., 72., 137., 1., 49., 33., 42., 29., 37., 51., 113., 79., 31., 44., 74., 34., 48., 114., 40., 124., 104., 63., 85., 78., 70., 86., 61., 60., 67., 55., 53., 41., 68., 143., 77., 130., 45., 111., 134., 107., 142., 148., 56., 122., 133., 141., 110., 81., 106., 54., 147., 112., 59., 145., 121., 109., 80., 47., 97., 75., 101., 128., 98., 87., 71., 66., 46., 125., 76., 73., 120., 144., 118., 131., 65., 108., 58., 123., 84., 99., 82., 92., 69., 91., 57., 90., 93., 129., 89., 88., 64., 126., 138., 158., 83., 100., 105., 132., 96., 127., 135., 95., 154., 156., 94., 115., 102., 116., 136., 103., 139., 146.])
df_funding.borrower_genders.unique()
# This columns seem to have lots of different unique values
array(['female', 'female, female', 'female, female, female', ..., 'female, female, male, female, female, female, female, female, female, female, male, male, female, female, male, female, female, female, female, female, female, female', 'male, female, female, female, female, female, female, female, male, male, female, male, female, male, male, male', 'female, female, female, male, female, female, female, male, female, female, female, male, female, male, female, female, female, female, female, female, female, female, female, female, female, female, female, female, male'], dtype=object)
df_funding.loc[df_funding["borrower_genders"].isna(), "borrower_genders"]
140 NaN 145 NaN 170 NaN 412 NaN 414 NaN ... 671151 NaN 671174 NaN 671178 NaN 671185 NaN 671202 NaN Name: borrower_genders, Length: 4221, dtype: object
# There are two options in front of us,
# Delete the column of borrower_genders, or somehow make use of it and maybe get some valuable decision out of it.
# I say, lets parse the column into two other columns, which we can hold the info of the columns as integers
b_g_procent_female = []
borrowers_total = []
female_list = []
male_list = []
for i in df_funding["borrower_genders"]:
female = 0
male = 0
total = 0
try:
if len(i) > 6:
for j in i.split(", "):
if j == "female":
female += 1
if j == "male":
male += 1
else:
if len(i) == 6:
female += 1
else:
male +=1
b_g_procent_female.append(round(100*female / (female + male), 2) )
borrowers_total.append(female + male)
female_list.append(female)
male_list.append(male)
except:
b_g_procent_female.append(np.nan)
borrowers_total.append(np.nan)
female_list.append(np.nan)
male_list.append(np.nan)
df_funding["borrowers_total"]= borrowers_total
df_funding["b_g_percent_female"]= b_g_procent_female
df_funding["female"]= female_list
df_funding["male"]= male_list
df_funding.head(51)
df_funding.repayment_interval.unique()
array(['irregular', 'bullet', 'monthly', 'weekly'], dtype=object)
# This line proves each row with value "0" in lender_count column, corresponds to the row with the value "0" in "funded_amount" row
(df_funding.loc[(df_funding.loc[:,"lender_count"]== 0), "lender_count"] != df_funding.loc[(df_funding.loc[:,"funded_amount"]== 0), "lender_count"]).sum()
0
# There are no 0 value for "loan_amount" column
df_funding.loc[(df_funding.loc[:,"loan_amount"]== 0), "loan_amount"]
Series([], Name: loan_amount, dtype: float64)
pip install requests import requests response = requests.get("https://api.exchangerate-api.com/v4/latest/EUR")
to see the documentation
import json def jprint(obj):
# create a formatted string of the Python JSON object
text = json.dumps(obj, sort_keys=True, indent=4)
print(text)
jprint(response.json())
how we reach the currencies
response.json()['rates']['TRY']
assigning the new column with the local currencies
df_funding['euro_converted'] = [response.json()['rates'][i] if i !='ZWD' else "" for i in df_funding['currency']] df_funding['euro_converted'] = df_funding['funded_amount'] / [float(x) if x !="" else 0 for x in df_funding['euro_converted']] df_funding.head()
df_dropped = df_funding.drop(columns=["use", "currency", "borrower_genders", "b_g_percent_female"])
df_dropped_with_outliers = df_funding.drop(columns=["use", "currency", "borrower_genders"])
df_dropped.describe().T
count | mean | std | min | 25% | 50% | 75% | max | |
---|---|---|---|---|---|---|---|---|
funded_amount | 671205.0 | 785.995061 | 1130.398941 | 0.0 | 250.0 | 450.0 | 900.0 | 100000.0 |
loan_amount | 671205.0 | 842.397107 | 1198.660073 | 25.0 | 275.0 | 500.0 | 1000.0 | 100000.0 |
term_in_months | 671205.0 | 13.739022 | 8.598919 | 1.0 | 8.0 | 13.0 | 14.0 | 158.0 |
lender_count | 671205.0 | 20.590922 | 28.459551 | 0.0 | 7.0 | 13.0 | 24.0 | 2986.0 |
borrowers_total | 666984.0 | 2.018357 | 3.413631 | 1.0 | 1.0 | 1.0 | 1.0 | 50.0 |
female | 666984.0 | 1.606197 | 3.041054 | 0.0 | 1.0 | 1.0 | 1.0 | 50.0 |
male | 666984.0 | 0.412160 | 1.119501 | 0.0 | 0.0 | 0.0 | 1.0 | 44.0 |
import seaborn as sns
sns.boxplot(x=df_dropped.funded_amount)
<AxesSubplot:xlabel='funded_amount'>
# Calculate the Interquartile portion
Q1 = df_dropped.funded_amount.quantile(0.25)
Q3 = df_dropped.funded_amount.quantile(0.75)
IQR = Q3 - Q1
Q1, Q3
(250.0, 900.0)
# Then the limits of the boxplot
lower_limit = Q1 - 1.5*IQR
upper_limit = Q3 + 1.5*IQR
lower_limit, upper_limit
(-725.0, 1875.0)
#Checking the upper outliers in the column "funded_amount"
df_dropped.loc[df_dropped.loc[:,"funded_amount"] >upper_limit, "funded_amount"]
35 2225.0 50 3175.0 59 3175.0 62 4275.0 70 2000.0 ... 671105 2600.0 671108 4025.0 671110 2700.0 671115 2650.0 671134 3525.0 Name: funded_amount, Length: 55112, dtype: float64
# 26000 rows is better than 55000 to go, so we increase the upper outlier limit to 3000
df_dropped.loc[df_dropped.loc[:,"funded_amount"] > 3000, "funded_amount"]
50 3175.0 59 3175.0 62 4275.0 85 3175.0 86 3050.0 ... 670843 4300.0 670997 6025.0 671011 4925.0 671108 4025.0 671134 3525.0 Name: funded_amount, Length: 26025, dtype: float64
# because our lower outlier limit is minus, there are no lower outliers in the column "funded amount"
df_dropped.loc[df_dropped.loc[:,"funded_amount"] < lower_limit, "funded_amount"]
Series([], Name: funded_amount, dtype: float64)
# Now we can permanentyly change the column by pressing down the values
upper_limit = 3000
df_dropped['funded_amount'] = df_dropped['funded_amount'].apply(lambda x: upper_limit if x > upper_limit else x)
#Checking the upper outliers in the column "funded_amount", this time there are no outliers
df_dropped.loc[df_dropped.loc[:,"funded_amount"] >upper_limit, "funded_amount"]
Series([], Name: funded_amount, dtype: float64)
#we are checking the outliers in funded_amount again, now all our Outliers are between the value of 1875 and 3000.
sns.boxplot(x=df_dropped.funded_amount)
<AxesSubplot:xlabel='funded_amount'>
# We check the "loan_amount" column
sns.boxplot(x=df_dropped.loan_amount)
<AxesSubplot:xlabel='loan_amount'>
# Same situation we see in here, so i will push down the values more than 3000 to 3000.
df_dropped['loan_amount'] = df_dropped['loan_amount'].apply(lambda x: upper_limit if x > upper_limit else x)
sns.boxplot(x=df_dropped.funded_amount)
<AxesSubplot:xlabel='funded_amount'>
# Lets check other two columns
sns.boxplot(x=df_dropped.term_in_months)
<AxesSubplot:xlabel='term_in_months'>
# It seems 30 is a good value to go with. It is a little higher than the 3. quartile, we got the quantity of almost 25000
df_dropped.loc[df_dropped.loc[:,"term_in_months"] >30, "term_in_months"]
2 43.0 6 43.0 14 43.0 18 43.0 25 43.0 ... 671086 38.0 671089 38.0 671092 38.0 671100 38.0 671101 62.0 Name: term_in_months, Length: 24527, dtype: float64
# We press down the term_in_months columns values to lower than 30
df_dropped['term_in_months'] = df_dropped['term_in_months'].apply(lambda x: 30 if x > 30 else x)
sns.boxplot(x=df_dropped.term_in_months)
<AxesSubplot:xlabel='term_in_months'>
# checking another column for outliers
sns.boxplot(x=df_dropped.lender_count)
<AxesSubplot:xlabel='lender_count'>
# lets try to catch the count of 25000 again
df_dropped.loc[df_dropped.loc[:,"lender_count"] >70, "lender_count"]
50 93 59 114 62 144 70 71 73 101 ... 668164 110 668582 278 670283 123 670503 94 670734 158 Name: lender_count, Length: 27708, dtype: int64
# Presses the values of lender_count column's values under 70
df_dropped['lender_count'] = df_dropped['lender_count'].apply(lambda x: 70 if x > 70 else x)
sns.boxplot(x=df_dropped.lender_count)
<AxesSubplot:xlabel='lender_count'>
df_dropped.describe().T
count | mean | std | min | 25% | 50% | 75% | max | |
---|---|---|---|---|---|---|---|---|
funded_amount | 671205.0 | 709.756125 | 705.929139 | 0.0 | 250.0 | 450.0 | 900.0 | 3000.0 |
loan_amount | 671205.0 | 754.061762 | 728.286964 | 25.0 | 275.0 | 500.0 | 1000.0 | 3000.0 |
term_in_months | 671205.0 | 13.161863 | 6.118905 | 1.0 | 8.0 | 13.0 | 14.0 | 30.0 |
lender_count | 671205.0 | 18.727781 | 17.287630 | 0.0 | 7.0 | 13.0 | 24.0 | 70.0 |
borrowers_total | 666984.0 | 2.018357 | 3.413631 | 1.0 | 1.0 | 1.0 | 1.0 | 50.0 |
female | 666984.0 | 1.606197 | 3.041054 | 0.0 | 1.0 | 1.0 | 1.0 | 50.0 |
male | 666984.0 | 0.412160 | 1.119501 | 0.0 | 0.0 | 0.0 | 1.0 | 44.0 |
df_dropped.to_csv('dropped.csv', index = False)
df_dropped_with_outliers.to_csv('dropped_w_o.csv', index = False)