Summary
The BigQueryReadClient returns multiple 'empty' streams when a read session is instantiated via create_read_session. Attempting to invoke to_dataframe() on the stream reader yields an AttributeError.
Firstly, is this behaviour abnormal? If not, I will just wrap the method in a try/except and plow on.
Environment details
- OS type and version: Linux - WSL - Ubuntu 22.04.3 LTS
- Python version: 3.9.18
- pip version: 23.3.1
- package manager: poetry@1.7.1
- google-cloud-bigquery-storage version: 2.24.0
Steps to reproduce
from google.cloud.bigquery_storage_v1 import BigQueryReadClient, types
client = BigQueryReadClient()
requested_session = types.ReadSession()
requested_session.table = "projects/<project>/datasets/<dataset>/tables/<table>"
requested_session.data_format = types.DataFormat.AVRO
requested_session.read_options.selected_fields = <some_fields>
requested_session.read_options.row_restriction = <some_row_restriction>
parent = "projects/<project_id>"
session = client.create_read_session(
    parent=parent,
    read_session=requested_session,
)

dfs = []
for stream in session.streams:
    reader = client.read_rows(stream.name)
    sub_df = reader.to_dataframe()  # < error raised here, for all but 1 of the streams: 'NoneType' object has no attribute '_parse_avro_schema'
    dfs.append(sub_df)
...
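As a stopgap, the loop above can be made tolerant of the empty streams by wrapping each per-stream read in a try/except. This is only a sketch, under the assumption that the AttributeError fires exclusively for streams that carried no data; the helper name and the pandas concat at the end are mine, not part of the library:

```python
import pandas as pd

def read_all_streams(client, session):
    """Collect one DataFrame per stream, skipping streams whose readers
    raise AttributeError because they never received any data."""
    frames = []
    for stream in session.streams:
        reader = client.read_rows(stream.name)
        try:
            frames.append(reader.to_dataframe())
        except AttributeError:
            # Empty stream: the reader's internal stream parser was never
            # initialized, so to_dataframe() fails as described below.
            continue
    return pd.concat(frames, ignore_index=True) if frames else pd.DataFrame()
```

This silently swallows AttributeError, so it would also mask a genuine parser bug; a stricter variant could re-raise unless the reader was verifiably empty.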
Stack trace
Exception has occurred: AttributeError
'NoneType' object has no attribute '_parse_avro_schema'
File "reader.py", line 424, in to_dataframe
self._stream_parser._parse_avro_schema()
File "reader.py", line 299, in to_dataframe
return self.rows(read_session=read_session).to_dataframe(dtypes=dtypes)
AttributeError: 'NoneType' object has no attribute '_parse_avro_schema'
The relevant line (python-bigquery-storage/google/cloud/bigquery_storage_v1/reader.py, line 422 in fe09e3b):

    self._stream_parser._parse_avro_schema()
So clearly, the object is not being populated as expected. After inspecting the data from the one stream that does yield data, it seems that the remaining streams are empty.
Detail
The emergence of this problem is specific to the table I am accessing, and to the combination of filtering and the type of the requested field. The minimal case where this occurs is when querying a single BYTES-type field. The approximate size of this field is 0.1 MB.
The issue persists when querying one row: I can query a single row of just this BYTES field from the BigQuery table and I will get some 13 empty streams and 1 populated stream.
If I wrap each stream read in a try/except, I am able to successfully grab the data from the one populated stream.
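If the fan-out itself is unwanted, create_read_session also accepts a max_stream_count argument that caps how many streams the server may return. A minimal sketch; the wrapper function below is mine, and max_stream_count is only an upper bound (the server may still return fewer streams):

```python
def create_capped_session(client, requested_session, parent, max_streams=1):
    """Request a read session with at most `max_streams` streams.

    With max_stream_count=1, all rows arrive on a single stream, which
    sidesteps iterating over (possibly empty) parallel streams entirely.
    """
    return client.create_read_session(
        parent=parent,
        read_session=requested_session,
        max_stream_count=max_streams,
    )
```

The trade-off is losing the parallel-read throughput that multiple streams are designed to provide, so this fits small, filtered reads like the one-row case above.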
Am I doing something wrong here, or is this normal?