How to correctly generate training data based on percentages?


I am currently generating training data for my Bayesian network as follows (the code is shown below):

[Image: picture of data]

-> infected stands for people who are infected (0= not infected, 1= infected)
-> p_tests is the result of the test (0= test negative, 1= test positive)

Whether TP, FN, FP and TN contain a 0 or a 1 is decided like this:

import numpy as np
import pandas as pd

# 10 rows of uniformly random 0/1 values for both columns
data = np.random.randint(2, size=(10, 2))
columns = ['infected', 'p_tests']
df = pd.DataFrame(data=data, columns=columns)

df["TP"] = ((df['infected'] == 1) & (df['p_tests'] == 1)).astype(int)
df["FN"] = ((df['infected'] == 1) & (df['p_tests'] == 0)).astype(int)
df["FP"] = ((df['infected'] == 0) & (df['p_tests'] == 1)).astype(int)
df["TN"] = ((df['infected'] == 0) & (df['p_tests'] == 0)).astype(int)


So that is running fine.

My question now is how I can decide, e.g., the 1s and 0s of the infected column based on my probabilities.

The chance of being infected is 10%. How can I generate the data so that 10% of my set have 1s (showing that 10% are infected)?

The same applies to the probabilities of TP (80%), TN (98%), FP (2%) and FN (20%).

Does anyone have an idea of how to solve this?


To assign values at random with a set probability, e.g. P(infected) = 0.1, you can proceed like this:

  1. Draw random floating-point values r uniformly from the range [0, 1.0):

     df['r'] = np.random.uniform(0, 1.0, size=df.shape[0])
  2. Set the value of infected by comparing r to a threshold. Since r >= 0.9 holds with probability 0.1, about 10% of the rows get a 1:

     df['infected'] = (df['r'] >= 0.9).astype(int)
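
The same thresholding idea can be extended to the test result. The sketch below is only one possible reading of the question: it assumes that TP (80%) and FN (20%) are rates among infected people, and that TN (98%) and FP (2%) are rates among non-infected people. The names `p_infected`, `sensitivity`, `specificity` and the sample size `n` are illustrative and not from the original post.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(seed=0)  # seeded so the run is reproducible
n = 100_000                          # illustrative sample size

p_infected = 0.10    # P(infected = 1)
sensitivity = 0.80   # assumed meaning of TP: P(p_tests = 1 | infected = 1)
specificity = 0.98   # assumed meaning of TN: P(p_tests = 0 | infected = 0)

# Step 1: draw the infection status with probability p_infected.
infected = (rng.random(n) < p_infected).astype(int)

# Step 2: pick the per-row probability of a positive test,
# which depends on the infection status, then draw the test result.
p_positive = np.where(infected == 1, sensitivity, 1 - specificity)
p_tests = (rng.random(n) < p_positive).astype(int)

df = pd.DataFrame({"infected": infected, "p_tests": p_tests})

# Same indicator columns as in the question.
df["TP"] = ((df["infected"] == 1) & (df["p_tests"] == 1)).astype(int)
df["FN"] = ((df["infected"] == 1) & (df["p_tests"] == 0)).astype(int)
df["FP"] = ((df["infected"] == 0) & (df["p_tests"] == 1)).astype(int)
df["TN"] = ((df["infected"] == 0) & (df["p_tests"] == 0)).astype(int)

# Empirical rates, which should be close to the targets for large n.
print(df["infected"].mean())
print(df.loc[df["infected"] == 1, "p_tests"].mean())
print(df.loc[df["infected"] == 0, "p_tests"].mean())
```

Because every row is drawn independently, the proportions match the targets only on average, and small samples such as 10 rows can deviate noticeably. If exactly 10% of the rows must be 1, shuffling an array with a fixed number of 1s would be an alternative.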

Answered By – alexis

Answer Checked By – Marie Seifert (AngularFixing Admin)
