Does H2O Driverless AI have built-in support for merging multiple datasets and using the merged dataset for training?

Issue

Suppose we have three datasets containing data from a company.

  1. employee.csv: details of the employees working in the company, such as employee ID, employee name, the dept id of the department they work in, the country code of their home country, and their annual salary.
  2. dept.csv: information about the company's departments, such as dept id, dept name, and dept specialization.
  3. country.csv: country names, each with its country code and capital city.

Is there a feature in H2O Driverless AI that lets us upload these datasets as they are (without first merging them in Python), merge them inside the Driverless AI platform on their overlapping columns, and use the merged dataset for training?

Solution

Yes. You can use a data recipe to process datasets inside Driverless AI, including joining them. See the Driverless AI docs for more about data recipes. The recipe code below joins the three datasets:

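# Recipe body: X is the frame for `employee.csv`, supplied by Driverless AI.
# Join `dept.csv` (Y1) and `country.csv` (Y2) onto X.
import datatable as dt

# Define the locations of the Y1/Y2 datasets and read them
Y_file_name1 = "./tmp/user/location_of_dept.csv.bin"
Y_file_name2 = "./tmp/user/location_of_country.csv.bin"
Y1 = dt.fread(Y_file_name1)
Y2 = dt.fread(Y_file_name2)

# Set the key on Y1 and join it to X.
# The key column must hold unique values in Y1 and have the same name in X.
key1 = ["dept_id"]
Y1.key = key1
X = X[:, :, dt.join(Y1)]

# Set the key on Y2 and join it to X (same requirements as above)
key2 = ["country_code"]
Y2.key = key2
X = X[:, :, dt.join(Y2)]

# Return the merged frame to Driverless AI
return X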

See this recipe as an example for joining one dataset to another.
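
To make the join behaviour concrete, here is a small library-free sketch of what a keyed left join does to the three example tables. It is only an illustration of the semantics, not how datatable or Driverless AI implements the join. The `left_join` helper and the sample rows are made up for this example. Two things it is meant to show: the lookup table must have unique key values, and a row of X with no match is kept, with missing values in the joined columns.

```python
def left_join(rows, lookup_rows, key):
    """Left-join lookup_rows onto rows using a single shared key column."""
    # Index the lookup table by key; duplicate keys are rejected,
    # mirroring the requirement that the key be unique.
    index = {}
    for row in lookup_rows:
        if row[key] in index:
            raise ValueError(f"duplicate key {row[key]!r} in lookup table")
        index[row[key]] = row

    # Columns contributed by the lookup table (everything except the key).
    extra_cols = [c for c in lookup_rows[0] if c != key] if lookup_rows else []

    joined = []
    for row in rows:
        match = index.get(row[key])
        merged = dict(row)
        for col in extra_cols:
            # Rows without a match are kept; their new columns are None.
            merged[col] = match[col] if match is not None else None
        joined.append(merged)
    return joined


# Hypothetical sample rows standing in for the three CSV files.
employees = [
    {"employee_id": 1, "name": "Ana", "dept_id": 10, "country_code": "US"},
    {"employee_id": 2, "name": "Raj", "dept_id": 20, "country_code": "IN"},
    {"employee_id": 3, "name": "Li", "dept_id": 30, "country_code": "CN"},
]
depts = [
    {"dept_id": 10, "dept_name": "Sales"},
    {"dept_id": 20, "dept_name": "R&D"},
]
countries = [
    {"country_code": "US", "capital": "Washington, D.C."},
    {"country_code": "IN", "capital": "New Delhi"},
    {"country_code": "CN", "capital": "Beijing"},
]

# Join departments first, then countries, as in the recipe above.
merged = left_join(left_join(employees, depts, "dept_id"), countries, "country_code")

for row in merged:
    print(row)
```

In this sample, the employee with `dept_id` 30 has no matching department, so the row survives the join with an empty `dept_name`. It is worth checking for such gaps in the merged dataset before training on it.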

Answered By – Neema Mashayekhi

Answer Checked By – Dawn Plyler (AngularFixing Volunteer)
