Ticker name "NA" makes the exists_qlib_data function report errors. #1720

Open
OzzyXu opened this issue Jan 1, 2024 · 3 comments
Labels
bug Something isn't working

Comments

@OzzyXu
Contributor

OzzyXu commented Jan 1, 2024

🐛 Bug Description

The ticker name "NA" in "all.txt" under /instruments makes the exists_qlib_data function fail, because pandas parses the string "NA" as the float nan instead of keeping it as a string.
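
A minimal sketch of the failure mode, assuming an instruments file with tab-separated symbol, start date, and end date columns (the sample rows and dates below are made up for illustration, not taken from the attached file). With the pandas defaults, the symbol NA is read as a missing value, so str.lower receives a float and raises TypeError:

```python
import io

import pandas as pd

# Hypothetical instruments content: symbol, start date, end date (tab-separated).
SAMPLE = "AAPL\t2000-01-03\t2020-11-10\nNA\t2000-01-03\t2020-11-10\n"

# Same read_csv arguments as the line in exists_qlib_data quoted in this issue.
symbols = pd.read_csv(io.StringIO(SAMPLE), sep="\t", header=None).loc[:, 0]

# The second symbol is now float("nan") rather than the string "NA".
parsed_as_missing = bool(symbols.isna().iloc[1])

try:
    symbols.apply(str.lower)
    raised_type_error = False
except TypeError:
    # str.lower only accepts str instances, so the float nan is rejected.
    raised_type_error = True

print(parsed_as_missing, raised_type_error)
```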

To Reproduce

Steps to reproduce the behavior:

  1. Save the attached all.txt under ~/.qlib/qlib_data/us_data/instruments.
  2. Run the following code:
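import sys

from qlib.constant import REG_US
from qlib.utils import exists_qlib_data

# scripts_dir is assumed to point to the scripts directory of the Qlib repository.
provider_uri = "~/.qlib/qlib_data/us_data_new"  # target_dir
if not exists_qlib_data(provider_uri):
    print(f"Qlib data is not found in {provider_uri}")
    sys.path.append(str(scripts_dir))
    from get_data import GetData

    GetData().qlib_data(target_dir=provider_uri, region=REG_US)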

Expected Behavior

The code should run without errors.

Screenshot

[Screenshot from the original issue; image not available]

Environment

Note: You can run "cd scripts && python collect_info.py all" from the project directory to get system information and paste it here directly.

  • Qlib version: 0.93
  • Python version: 3.8.10
  • OS (Windows, Linux, MacOS): Windows
  • Commit number (optional, please provide it if you are using the dev version):

Additional Notes

  1. The bug is caused by how pandas.read_csv is called in the following line of exists_qlib_data in qlib\utils\__init__.py: with the default settings, the string "NA" is treated as a missing value. Refer to the page for more details.
miss_code = set(pd.read_csv(_instrument, sep="\t", header=None).loc[:, 0].apply(str.lower)) - set(code_names)
  2. The cause of the bug can be verified with the following code (the output is shown on the last line):
temp = pd.read_csv("all.txt", sep="\t", header=None).loc[:, 0]
non_string_values = [i for i in temp if not isinstance(i, str)]
print(non_string_values)
[nan]
  3. The bug can be fixed by adding keep_default_na=False, after which the same check finds no non-string values:
temp = pd.read_csv("all.txt", sep="\t", header=None, keep_default_na=False).loc[:, 0]
non_string_values = [i for i in temp if not isinstance(i, str)]
print(non_string_values)
[]
  4. I can help with the fix. I would just like to know which tests I need to run to make sure the fix does not cause any other issues.
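
On the question of what to test, one option is a small regression check around the symbol-reading step. The sketch below is not Qlib's actual code: read_instrument_codes is a hypothetical helper that mirrors the line quoted above with keep_default_na=False added, and the sample rows are made up for illustration.

```python
import io

import pandas as pd


def read_instrument_codes(source) -> set:
    """Hypothetical helper: return the lower-cased symbols of an instruments file.

    Mirrors the read_csv call quoted from exists_qlib_data, with
    keep_default_na=False so that symbols such as "NA" stay strings.
    """
    frame = pd.read_csv(source, sep="\t", header=None, keep_default_na=False)
    return set(frame.loc[:, 0].apply(str.lower))


# Made-up rows; "NA" and "NULL" are both in pandas' default list of NA markers.
SAMPLE = (
    "AAPL\t2000-01-03\t2020-11-10\n"
    "NA\t2000-01-03\t2020-11-10\n"
    "NULL\t2000-01-03\t2020-11-10\n"
)

codes = read_instrument_codes(io.StringIO(SAMPLE))
print(sorted(codes))
```

Running the same sample through the unmodified call can serve as a negative control: it should raise before the fix and pass after it.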
OzzyXu added the bug label Jan 1, 2024
OzzyXu changed the title from 'Ticker name "NaN" sometimes makes the exists_qlib_data function report errors.' to 'Ticker name "NA" makes the exists_qlib_data function report errors.' Jan 4, 2024
@SunsetWolf
Collaborator

Would you like to create a PR to fix this and become one of the contributors to Qlib?

@OzzyXu
Contributor Author

OzzyXu commented Jan 16, 2024

@SunsetWolf Sure. I will double-check whether my fix causes any issues, and if it does not, I will create a PR for it. I am happy to be a contributor to Qlib and to help with other issues.

@SarthakNikhal

@OzzyXu Let me know about it. I'd like to help.
