1.1 Part 1 - Theory Questions

1.1.1 1.1) The 4 V's

Explain the 4 V's in your own words using an example. Don't forget to put the 4 V's into their thematic context.

When it became harder to handle data with conventional approaches and database systems, IBM data scientists identified the characteristic features of this unmanageable big data. What makes such data hard to work with in the old, easy way was described in four words; to call a dataset "Big Data", we check these four characteristics: Volume (the sheer amount of data), Variety (the different types and formats), Velocity (the speed at which new data arrives), and Veracity (how trustworthy the data is).

Only if all four of these conditions are met can we call the data "Big Data".

Sample Case 1: A company running an online shop to sell its products has gigabytes of data, depending on the diversity of the products. However, the company does not accept comments or input from customers in this shop, because it handles that in a separate system. Here we certainly have Volume, and perhaps Variety too, but if there is hardly any new input, which comes only from the company itself, and no frequent new products to multiply the volume, we cannot say there is Velocity. Since not all four V's are met here, the company does not need a Big Data analyst or scientist, and it does not need to change its system.

Sample Case 2: Now consider a security company that also hosts the servers for the cameras it installs. Let's say these cameras also have microphones, and the company has to keep the data for at least one year for later investigations. For:

1.1.2 1.2) Missing Values

a.) Explain, using an example of your own choosing, what missing values are.

Let's consider data from a poll that records whether people have children or are expecting one, their level of expectations in life and, relatedly, their happiness in life.

Let's have the following columns: sex, number_of_children, pregnancy, spouse_pregnancy, house/apartment, demand_to_improve_accomodation, car, demanding_car

Permanent Missing Values: First of all, we need to be prepared for any kind of data loss while the poll is being filled in, while the answers are copied into the computer, filed away, or transformed in an application.

Structural Missing Values: We should certainly expect missing values in the pregnancy and spouse_pregnancy columns, depending on the sex of the participant. We can understand that instead of answering "No", the participant simply skipped the field, because it is logically understood that the field should contain no information.

The data type of missing values can be: NaN (a float), None (an object), NaT (for datetimes), or a placeholder such as an empty string.
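A minimal sketch of how pandas represents such gaps, using made-up rows with the poll columns from above:

```python
import numpy as np
import pandas as pd

# Made-up poll answers: the male participant skips the pregnancy field,
# so it is structurally missing rather than an explicit "No"
df = pd.DataFrame({
    "sex": ["f", "m", "f"],
    "number_of_children": [1, 0, 2],
    "pregnancy": ["yes", np.nan, "no"],
})

# pandas stores the gap as NaN; mixing strings and NaN gives an object column
print(df["pregnancy"].isnull().sum())  # 1 structurally missing entry
print(df["pregnancy"].dtype)           # object
```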

b.) Why do missing values have to be dealt with during data analysis?

As the data types of missing values show, they can cause problems when we apply mathematical operations to the columns, or they can distort the statistical measures (mean, median, mode, etc.) of the whole column. The results of the data analysis would then not be trustworthy.

c.) How can missing values be identified in data analysis?

We can use a few standard calls to check the values in the columns and get familiar with the data.

df.head()          : to see the first 5 rows
df.tail(3)         : to see the last 3 rows
df.info()          : to check whether the expected type and the actual type match
                     (expected type: integer, actual type: float => there may be NaNs)
df.isnull().sum()  : to find the missing values per column and sum them up
df.nunique()       : to check the number of unique items before we look at them
df['col'].unique() : to see the unique values of a column (unique() exists on a Series, not on a DataFrame)

Some libraries also offer dedicated plots (e.g. missingno => msno.bar, msno.matrix, msno.heatmap, etc.).
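Put together on a toy frame (column names invented for the illustration), the checks above look like this:

```python
import numpy as np
import pandas as pd

# Toy frame with one gap in each column
df = pd.DataFrame({
    "age": [25, np.nan, 31, 40],
    "city": ["Berlin", "Hamburg", None, "Berlin"],
})

df.info()                    # "age" shows float64 although ages are integers -> hints at NaNs
print(df.isnull().sum())     # missing values per column: age 1, city 1
print(df["city"].nunique())  # 2 distinct cities (NaN is not counted)
print(df["city"].unique())   # the distinct values themselves, including None
```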




1.2 Part 2 - Practice

You work as a data scientist in a start-up. You opened for business a year ago and now want to take the next step and expand your services. Your business model is running a platform where people who have a business idea, but not the required money, can register and collect money for their project within a given time frame. On the other side you have investors who would like to put their money into projects and are looking for investments. As an intermediary, your platform brings borrowers and lenders together. You earn your money through a commission on every project that lands on your platform.

Your data basis is the history of your platform. All projects are completed, i.e. the time to collect money for a project has expired. Your business model provides that the collected funds are paid out even if the target amount was not reached.

There are NO duplicates in the dataset.

The split dataset contains the following columns (incl. their meaning):

Our two files seem to have the same columns; let's check it.

Since we know that there are no duplicates, we can simply concatenate them one after the other.
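A sketch of the check-and-concat step; the two small frames stand in for the real CSV files, whose names are not given here:

```python
import pandas as pd

# Stand-ins for the two split files (in the real notebook they come from pd.read_csv)
df_a = pd.DataFrame({"id": [1, 2], "funded_amount": [300, 500]})
df_b = pd.DataFrame({"id": [3, 4], "funded_amount": [250, 1000]})

# Same columns in the same order?
assert list(df_a.columns) == list(df_b.columns)

# No duplicates in the data, so we simply stack one frame after the other
df_funding = pd.concat([df_a, df_b], ignore_index=True)
print(len(df_funding))  # 4
```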

There are null values in the columns "use", "country_code", "region" and "borrower_genders".

We check the number of unique values per column.

Evaluation of the columns' unique values:

1. We would expect the number of "country_code" values and the number of "country" values to be the same, but here they differ by one.

2. The counts of distinct "term_in_months" and "borrower_genders" values seem higher than expected.

Let's check their unique values.

We see the values are ordered, and the "nan" value belongs to Namibia.

So we check whether NaN belongs to Namibia; if yes, we replace it with "NA", Namibia's ISO country code.
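A sketch of that check, assuming the frame is called df_funding as later on (the three rows are invented): Namibia's country code is literally "NA", which pandas parses as NaN by default when reading a CSV.

```python
import numpy as np
import pandas as pd

# Invented rows; in the real data the NaN country_code appears on Namibia rows
df_funding = pd.DataFrame({
    "country": ["Kenya", "Namibia", "Peru"],
    "country_code": ["KE", np.nan, "PE"],
})

# Does every missing country_code belong to Namibia?
mask = df_funding["country_code"].isnull()
if (df_funding.loc[mask, "country"] == "Namibia").all():
    # the string "NA" was parsed as NaN when reading the CSV
    df_funding.loc[mask, "country_code"] = "NA"

print(df_funding["country_code"].isnull().sum())  # 0
```

Alternatively, passing `keep_default_na=False` (together with an explicit `na_values` list) to `pd.read_csv` avoids the mis-parse at read time.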

The relation between the missing values:

It seems there is a relation between the "funded_amount" and "lender_count" columns.

What we see in both columns is that they have no missing values...

But what about noisy data?

Ideas:

- Our income depends on how big the loan amounts are.

        So it would be good to see where we earn more.

- We can also compare the sectors: which one helps us more in which country.

- How good are the terms in each country and sector?
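These ideas translate to simple groupby aggregations; a sketch on invented rows (the real frame has these columns per project):

```python
import pandas as pd

# Invented loans for illustration
df = pd.DataFrame({
    "country": ["Kenya", "Kenya", "Peru", "Peru"],
    "sector": ["Food", "Retail", "Food", "Retail"],
    "funded_amount": [500, 300, 800, 200],
})

# Where do the big loan amounts (and hence our commission) come from?
by_country = df.groupby("country")["funded_amount"].sum().sort_values(ascending=False)

# Which sector contributes most within each country?
by_sector = df.groupby(["country", "sector"])["funded_amount"].sum()

print(by_country)
print(by_sector)
```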

Let's make some changes to the DataFrame so that we can get a first comparison of our data with a pairplot.

In case the amount values are in the local currencies:

To use an API for our exchange-rate conversion, we need the requests library.

pip install requests

import requests
response = requests.get("https://api.exchangerate-api.com/v4/latest/EUR")


            to inspect the API response

import json

def jprint(obj):
    # create a formatted string of the Python JSON object
    text = json.dumps(obj, sort_keys=True, indent=4)
    print(text)

jprint(response.json())


            how we access a single currency rate

response.json()['rates']['TRY']


            converting the funded amounts from the local currencies into EUR

rates = response.json()['rates']
# ZWD is not in the API's rate table, so mark it as missing instead of dividing by zero
df_funding['euro_converted'] = [rates[c] if c != 'ZWD' else float('nan') for c in df_funding['currency']]
df_funding['euro_converted'] = df_funding['funded_amount'] / df_funding['euro_converted']
df_funding.head()

Let's drop some columns we won't need.

Let's check our outliers.

I'd say capping at 55,000 suppresses the values too much, so I will set the upper limit a little higher.
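One way to put a number on "a little higher" is the IQR rule with a relaxed fence; a sketch on invented amounts (using 3·IQR instead of the classic 1.5·IQR is my assumption, not the document's):

```python
import pandas as pd

s = pd.Series([200, 350, 500, 700, 900, 1200, 55000])  # invented funded amounts

q1, q3 = s.quantile(0.25), s.quantile(0.75)
iqr = q3 - q1
upper = q3 + 1.5 * iqr          # classic IQR fence
upper_relaxed = q3 + 3.0 * iqr  # "a little higher", so fewer values get suppressed

# Winsorize: clip the extreme values down to the relaxed upper limit
s_capped = s.clip(upper=upper_relaxed)
print(s_capped.max())
```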

Now we are at the end of the outliers part.

Before creating our first visuals, we save the DataFrame to a CSV and continue in another file.
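The save step itself is a one-liner; the file name here is illustrative:

```python
import pandas as pd

# Tiny stand-in for the cleaned frame
df_funding = pd.DataFrame({"id": [1, 2], "euro_converted": [270.5, 450.0]})

# index=False keeps the row index out of the file, so re-reading adds no extra column
df_funding.to_csv("kiva_loans_cleaned.csv", index=False)

df_check = pd.read_csv("kiva_loans_cleaned.csv")
print(df_check.shape)  # (2, 2)
```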