Skip to content

Config

This is the reference to the functions contained in config. For now, they are all accesible directly through machine-learning-datasets and you don't need to use the config namespace.

Initialize Dataset Configuration Functions

init(new_dssave_path=None, new_dsconfig_path=None)

Initializes the paths and configuration for the dataset.

Parameters:

Name Type Description Default
new_dssave_path Optional[str]

The new path to save the dataset. Defaults to None.

None
new_dsconfig_path Optional[str]

The new path to the dataset configuration file. Defaults to None.

None

Returns:

Type Description
Tuple[str, int, Dict]

Tuple[str, int, Dict]: A tuple containing the dataset configuration path, the dataset count, and the dataset configuration.

Source code in machine_learning_datasets/config.py
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
def init(
        new_dssave_path:Optional[str] = None,
        new_dsconfig_path:Optional[str] = None
    ) -> Tuple[str, int, Dict]:
    """Initializes the paths and configuration for the dataset.

    Args:
        new_dssave_path (Optional[str]): The new path to save the dataset. Defaults to None.
        new_dsconfig_path (Optional[str]): The new path to the dataset configuration file.
                                           Defaults to None.

    Returns:
        Tuple[str, int, Dict]: A tuple containing the dataset configuration path, the dataset
                               count, and the dataset configuration.
    """
    global dssave_path, dsconfig_path, dsconfig, dsconfig_dscount

    if new_dssave_path is not None:
        dssave_path = new_dssave_path

    if new_dsconfig_path is not None:
        dsconfig_path = new_dsconfig_path

    if os.path.exists(dsconfig_path):
        with open(dsconfig_path, encoding="utf8") as dsconfig_file:
            dsconfig = json.load(dsconfig_file)
        dsconfig_dscount = len(dsconfig['datasets'])
    else:
        dsconfig = False

    return dsconfig_path, dsconfig_dscount, dsconfig

load(name=None, **kwargs)

Load a dataset from a source.

Parameters:

Name Type Description Default
name Optional[str]

The name of the dataset to load. If not provided, all keyword arguments will be used as parameters.

None
**kwargs Any

Additional keyword arguments that will be used as parameters if name is not provided.

{}

Returns:

Name Type Description
Any Any

The loaded dataset.

Raises:

Type Description
Warning

If the dataset specified by name is not found in the dsconfig JSON.

Warning

If the source parameter is not defined.

Note

This function assumes that the dsconfig variable is defined and contains a JSON object with a datasets key, which is a list of dataset configurations.

Example

dataset = load(name='my_dataset', source='Kaggle', file_path='data.csv')

Source code in machine_learning_datasets/config.py
 49
 50
 51
 52
 53
 54
 55
 56
 57
 58
 59
 60
 61
 62
 63
 64
 65
 66
 67
 68
 69
 70
 71
 72
 73
 74
 75
 76
 77
 78
 79
 80
 81
 82
 83
 84
 85
 86
 87
 88
 89
 90
 91
 92
 93
 94
 95
 96
 97
 98
 99
100
101
102
def load(
        name:Optional[str] = None,
        **kwargs:Any
    ) -> Any:
    """Load a dataset from a source.

    Args:
        name (Optional[str]): The name of the dataset to load. If not provided, all keyword
                              arguments will be used as parameters.
        **kwargs (Any): Additional keyword arguments that will be used as parameters if
                        `name` is not provided.

    Returns:
        Any: The loaded dataset.

    Raises:
        Warning: If the dataset specified by `name` is not found in the `dsconfig` JSON.
        Warning: If the `source` parameter is not defined.

    Note:
        This function assumes that the `dsconfig` variable is defined and contains a JSON
        object with a `datasets` key, which is a list of dataset configurations.

    Example:
        dataset = load(name='my_dataset', source='Kaggle', file_path='data.csv')
    """
    kparams = locals()['kwargs']
    params = {}
    dataset = None
    if name is not None:
        for i in range(len(dsconfig['datasets'])):
            if dsconfig['datasets'][i]["name"] == name:
                params = dsconfig['datasets'][i].copy()
                del params['name']
                break
        if not bool(params):
            warnings.warn("Dataset not found in dsconfig JSON")
    params.update(kparams)
    if 'source' not in params:
        warnings.warn("Source not defined")
    else:
        if params['source'] == 'Kaggle':
            source = Kaggle()
        else:
            source = URL()

        params1 = source.fetch(**params)
        params2 = source.extract(**params1)
        params3 = source.parse(**params2)

        #TODO add code that differenciates the x, y, train, test
        dataset = source.prepare(**params3)

    return dataset