Synthetic Data

Researching and developing algorithms to generate realistic Synthetic Datasets applicable to financial services. Access our public datasets below.

The infographic illustrates a process flow for generating and comparing synthetic data with real data. The flow is depicted using a series of connected elements, each labeled with a number to indicate the sequence of steps:

  1. Real Data Storage: A cylindrical container labeled "Real Data" represents the storage of real data.
  2. Data Generator Input: An arrow points from a smiling face icon to a rectangular box labeled "Data Generator," indicating user input or initiation of the data generation process.
  3. Real Data Input to Data Generator: An arrow connects the "Real Data" container to the "Data Generator" box, signifying the use of real data in the generation process.
  4. Synthetic Data Output: An arrow extends from the "Data Generator" box to another cylindrical container labeled "Synthetic Data," representing the output of synthetic data.
  5. Synthetic Data Metrics: An arrow leads from the "Synthetic Data" container to a stack of documents labeled "Synthetic Data Metrics," indicating the analysis and metrics generation for synthetic data.
  6. Real Data Metrics: A stack of documents labeled "Real Data Metrics" is connected to a box labeled "Metrics Comparator," showing the analysis and metrics generation for real data.
  7. Metrics Comparison: The "Metrics Comparator" box receives inputs from both "Real Data Metrics" and "Synthetic Data Metrics," indicating the comparison of metrics between real and synthetic data.

The entire process is enclosed within a dashed line, suggesting a continuous or iterative cycle.

While real data can be very valuable, it is not always easily available. At J.P. Morgan AI Research, we conduct research and develop algorithms to generate realistic Synthetic Datasets, with the aim of advancing AI research and development in financial services. Feel free to explore our available datasets in the left-hand panel.

Process

Below is a sample process we devised at J.P. Morgan AI Research to generate synthetic financial datasets.  To learn more about the challenges and opportunities in generating data in finance, please read Generating Synthetic Data in Finance: Opportunities, challenges and pitfalls.

Step 1: Compute metrics for the real data

Step 2: Develop a Generator (may be statistical methods or an agent-based simulation)

Step 3: (Optional) Calibrate the Generator using the real data

Step 4: Run the Generator to generate synthetic data

Step 5: Compute metrics for the synthetic data

Step 6: Compare the metrics of the real data and synthetic data

Step 7: (Optional) Refine the Generator to improve against comparison metrics
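
The seven steps above can be sketched in a few lines of Python. This is only an illustrative sketch, not the generator used by J.P. Morgan AI Research: the Gaussian generator, the two metrics (mean and standard deviation), the function names, and the sample numbers are all assumptions chosen to keep the example small.

```python
import random
import statistics


def compute_metrics(values):
    """Steps 1 and 5: summarize a dataset with two illustrative metrics."""
    return {"mean": statistics.fmean(values), "stdev": statistics.stdev(values)}


def calibrate(real_values):
    """Step 3: fit the parameters of a (hypothetical) Gaussian generator."""
    return statistics.fmean(real_values), statistics.stdev(real_values)


def generate(mean, stdev, n, seed=0):
    """Steps 2 and 4: a toy statistical generator drawing from a normal distribution."""
    rng = random.Random(seed)
    return [rng.gauss(mean, stdev) for _ in range(n)]


def compare(real_metrics, synthetic_metrics):
    """Step 6: absolute gap per metric (smaller means closer to the real data)."""
    return {k: abs(real_metrics[k] - synthetic_metrics[k]) for k in real_metrics}


# Illustrative "real" values (assumed for this sketch).
real = [492.67, 388.32, 280.70, 118.14, 164.97, 413.11, 377.65, 338.03]

real_metrics = compute_metrics(real)
mean, stdev = calibrate(real)
synthetic = generate(mean, stdev, n=1000)
gaps = compare(real_metrics, compute_metrics(synthetic))
print(gaps)
```

Step 7 would correspond to changing the generator (for example, a different distribution or an agent-based simulation) and re-running the comparison until the gaps are acceptable.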

For upcoming workshops and updates, visit:

ICAIF

NeurIPS

Sample AML Trace Data

Time step | Label | Action | Arg1 | Arg2 | Arg3
1/26/21 14:33 | GOOD | CREATE-ACCOUNT | COMPANY-57251 | CHECKING-57245 | JAPAN
1/26/21 14:37 | GOOD | CREATE-ACCOUNT | COMPANY-57342 | CHECKING-57337 | ICELAND
1/26/21 14:37 | BAD | CREATE-ACCOUNT | COMPANY-57590 | CHECKING-57585 | MAURITANIA
1/26/21 14:39 | BAD | CREATE-ACCOUNT | COMPANY-57661 | CHECKING-57619 | GUERNSEY
1/26/21 14:46 | GOOD | CREATE-ACCOUNT | COMPANY-57364 | DIGITAL-MONEY-57358 | EQUATORIAL-GUINEA
1/26/21 14:46 | GOOD | CREATE-ACCOUNT | COMPANY-57368 | CHECKING-57542 | AFGHANISTAN
1/26/21 14:50 | GOOD | CREATE-ACCOUNT | COMPANY-57387 | DIGITAL-MONEY-57381 | TURKS-AND-CAICOS-IS
1/26/21 14:50 | BAD | CREATE-ACCOUNT | COMPANY-57338 | CHECKING-57358 | BOSNIA-AND-HERZEGOVINA
1/26/21 14:50 | BAD | CREATE-ACCOUNT | COMPANY-57661 | CHECKING-57654 | CONGO-DEM-REP
1/26/21 14:50 | GOOD | CREATE-ACCOUNT | COMPANY-57364 | DIGITAL-MONEY-57358 | KYRGYZSTAN
1/26/21 14:54 | BAD | CREATE-ACCOUNT | COMPANY-57285 | DIGITAL-MONEY-57285 | COCOS-KEELING-IS
1/26/21 14:54 | BAD | QUICK-DEPOSIT | CLIENT-57574 | T-CASH-IN-57548 | HEARD-ISLAND-AND-MCDONALD-ISLANDS
1/26/21 15:03 | BAD | CREATE-ACCOUNT | COMPANY-57821 | DIGITAL-MONEY-57811 | LIBYA
1/26/21 15:03 | BAD | CREATE-ACCOUNT | COMPANY-57746 | CHECKING-57714 | VIETNAM
1/26/21 15:04 | BAD | CREATE-ACCOUNT | COMPANY-57335 | DIGITAL-MONEY-57335 | SUDAN
2/1/21 9:09 | GOOD | CREATE-ACCOUNT | COMPANY-57288 | CHECKING-57288 | THAILAND
2/1/21 9:14 | BAD | CREATE-ACCOUNT | COMPANY-57268 | CHECKING-57542 | COMPANY-57539
2/2/21 0:52 | BAD | SET-OWNERSHIP-ACCOUNT | CLIENT-57574 | CHECKING-57542 | VIETNAM
2/2/21 4:16 | BAD | CREATE-ACCOUNT | COMPANY-57747 | DIGITAL-MONEY-57737 | SVALBARD

Money laundering is the process of introducing money from illegal activities into the financial system in order to use it for legal or illegal purposes. This data represents a sequence of high-level interactions with a financial institution by legitimate clients and by clients engaged in money laundering. The current data contains state and action pairs of bank-customer-related activities, such as opening an account or making transactions, payments, withdrawals and purchases. The data was generated by running an AI planning-execution simulator.
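
As a rough illustration of how such trace rows might be consumed, the sketch below parses a few rows in the sample layout and groups actions into per-actor traces. The timestamp format string and the reading of Arg1 as the acting entity are assumptions made for this example; the source does not define the argument columns.

```python
from collections import defaultdict
from datetime import datetime

# A few rows in the sample layout: time step, label, action, Arg1, Arg2, Arg3.
rows = [
    ("2/2/21 0:52", "BAD", "SET-OWNERSHIP-ACCOUNT", "CLIENT-57574", "CHECKING-57542", "VIETNAM"),
    ("1/26/21 14:33", "GOOD", "CREATE-ACCOUNT", "COMPANY-57251", "CHECKING-57245", "JAPAN"),
    ("1/26/21 14:54", "BAD", "QUICK-DEPOSIT", "CLIENT-57574", "T-CASH-IN-57548",
     "HEARD-ISLAND-AND-MCDONALD-ISLANDS"),
]


def parse(row):
    """Turn one raw row into a dict with a real datetime (format is an assumption)."""
    time_step, label, action, *args = row
    return {
        "time": datetime.strptime(time_step, "%m/%d/%y %H:%M"),
        "label": label,
        "action": action,
        "args": args,
    }


events = sorted((parse(r) for r in rows), key=lambda e: e["time"])

# Group actions by Arg1, assumed here to identify the acting entity.
traces = defaultdict(list)
for event in events:
    traces[event["args"][0]].append(event["action"])

print(dict(traces))
```

Each resulting trace is a time-ordered action sequence that a classifier could label as GOOD or BAD.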

References

1. Generating Synthetic Data in Finance: Opportunities, challenges and pitfalls. S Assefa, D Dervovic, M Mahfouz, R Tillman, P Reddy, T Balch and M Veloso. Proceedings of the 1st International Conference on AI in Finance (ICAIF), 2020. Also in NeurIPS 2019 Workshop on AI in Financial Services

2. Simulating and classifying behavior in adversarial environments based on action-state traces: An application to money laundering, D Borrajo, M Veloso, S Shah. Proceedings of the 1st International Conference on AI in Finance (ICAIF), 2020. Also in arXiv preprint arXiv:2011.01826, 2020

Sample Customer Journey Data

Time step | Label | Event | Customer id
1/28/21 15:48 | STANDARD-FAIL-RARELY-DIGITAL | mobile : logon | ID-22522
1/29/21 2:39 | STANDARD-FAIL-RARELY-DIGITAL | web : logon | ID-22425
1/29/21 18:30 | STANDARD-FAIL-RARELY-DIGITAL | mobile : logon | ID-22791
1/29/21 20:50 | STANDARD-FAIL-RARELY-DIGITAL | atm : authentication | ID-22710
1/30/21 1:28 | STANDARD-FAIL-RARELY-DIGITAL | web : logon | ID-22658
1/30/21 2:28 | STANDARD-FAIL-RARELY-DIGITAL | mobile : logon | ID-22425
1/30/21 4:24 | STANDARD-FAIL-RARELY-DIGITAL | web : logon | ID-22483
1/30/21 4:24 | STANDARD-FAIL-RARELY-DIGITAL | mobile : funds transfer activity | ID-22454
1/30/21 4:26 | STANDARD-FAIL-RARELY-DIGITAL | mobile : logon | ID-22454
1/30/21 4:28 | STANDARD-FAIL-RARELY-DIGITAL | mobile : investment portfolio | ID-22454
1/30/21 4:30 | STANDARD-FAIL-RARELY-DIGITAL | mobile : transaction summary multiple products | ID-22454
1/30/21 4:53 | STANDARD-FAIL-RARELY-DIGITAL | mobile : logoff | ID-22454
1/30/21 5:55 | STANDARD-FAIL-RARELY-DIGITAL | mobile : profile maintenance | ID-22437
1/30/21 5:55 | STANDARD-FAIL-RARELY-DIGITAL | mobile : logon | ID-22437
1/30/21 6:01 | STANDARD-FAIL-RARELY-DIGITAL | mobile : quickpay receipient view | ID-22437
1/30/21 21:35 | STANDARD-FAIL-RARELY-DIGITAL | mobile : travel notification | ID-22437
1/30/21 22:50 | STANDARD-FAIL-RARELY-DIGITAL | mobile : logoff | ID-22437
1/30/21 22:50 | STANDARD-FAIL-RARELY-DIGITAL | web : logon | ID-22563
1/30/21 22:50 | STANDARD-FAIL-RARELY-DIGITAL | mobile : logon | ID-22600

Customer journey events represent sequences of lower-level interactions between retail banking clients and the bank. Example event types include logging in to a web application, making payments and withdrawing money from ATMs. The data was generated by running an AI planning-execution simulator and translating the output planning traces into tabular format.
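
A minimal sketch of working with this layout follows. It splits each event string into a channel and an event name and collects one journey per customer. The `channel : event` reading of the Event column is inferred from the sample rows, and the helper name `split_event` is hypothetical.

```python
from collections import Counter, defaultdict

# (time step, event, customer id) triples taken from the sample rows above.
rows = [
    ("1/29/21 2:39", "web : logon", "ID-22425"),
    ("1/30/21 2:28", "mobile : logon", "ID-22425"),
    ("1/30/21 4:26", "mobile : logon", "ID-22454"),
    ("1/30/21 4:28", "mobile : investment portfolio", "ID-22454"),
    ("1/30/21 4:53", "mobile : logoff", "ID-22454"),
]


def split_event(event):
    """Split 'channel : name' into its two parts (format inferred from the sample)."""
    channel, _, name = event.partition(" : ")
    return channel, name


journeys = defaultdict(list)  # customer id -> ordered list of event names
channels = Counter()          # channel -> number of events

for _, event, customer in rows:
    channel, name = split_event(event)
    journeys[customer].append(name)
    channels[channel] += 1

print(dict(journeys))
print(dict(channels))
```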

References

1. Generating Synthetic Data in Finance: Opportunities, challenges and pitfalls. S Assefa, D Dervovic, M Mahfouz, R Tillman, P Reddy, T Balch and M Veloso. Proceedings of the 1st International Conference on AI in Finance (ICAIF), 2020. Also in NeurIPS 2019 Workshop on AI in Financial Services

2. Domain-independent generation and classification of behavior traces. D Borrajo and M Veloso. arXiv preprint arXiv:2011.02918.

Sample Order Book Data

100024 | 1E+10 | 1E+10 | 99989 | 200 | 1E+10 | 1E+10 | -1E+10
100024 | 100 | 99989 | 200 | 1E+10 | 1E+10 | 1E+10 | 1E+10
100024 | 100 | 100009 | 100 | 1E+10 | 1E+10 | 1E+10 | 1E+10
100024 | 92 | 100009 | 100 | 100045 | 100 | 99989 | 100
100024 | 92 | 100009 | 100 | 100039 | 100 | 99989 | 100
100024 | 92 | 100009 | 100 | 100039 | 100 | 99979 | 100
100039 | 92 | 100009 | 100 | 100045 | 100 | 99979 | 100
100039 | 92 | 100009 | 100 | 100045 | 100 | 99979 | 100
100028 | 200 | 100009 | 24 | 100039 | 92 | 99979 | 100
100028 | 200 | 100009 | 24 | 100039 | 92 | 99987 | 29

Sample Order Stream Data

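34200 | 1 | 8 | 200 | 99989 | 1_2 |  | 1042
34200 | 1 | 2 | 100 | 100024 | -1 |  | 1029
34200 | 4 | 3 | 100 | 99985 | -1 |  | 1077
34200 | 4 | 9 | 100 | 99989 | 1 |  | 1042
34200 | 1 | 3 | 100 | 99985 | -1 |  | 1077
34200 | 1 | 4 | 100 | 100009 | 1 |  | 1010
34200 | 1 | 5 | 8 | 100042 | 1 |  | 1083
34200 | 4 | 5 | 8 | 100042 | 1 |  | 1083
34200 | 4 | 2 | 8 | 100024 | -1 |  | 1029
34200 | 1 | 6 | 100 | 100045 | 1 |  | 1006
34200 | 1 | 7 | 100 | 100039 | -1 |  | 1040
34200 | 1 | 10 | 100 | 99979 | 1 |  | 1016
34200 | 1 | 11 | 100 | 100009 | 1 |  | 1096
34200 | 1 | 12 | 100 | 100040 | 1 |  | 1003
34200 | 4 | 12 | 92 | 100040 | 1 |  | 1003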

Synthetic limit order book data describing a series of buy and sell orders of financial instruments (stocks) by various market participants at a public stock exchange. Specifically, this data will contain messages and snapshots of orders over time. The data represents N trading days of simulated data for high liquidity stocks in different market regimes (e.g., trending up/down, high/low volatility).

References

1. Generating Synthetic Data in Finance: Opportunities, challenges and pitfalls. S Assefa, D Dervovic, M Mahfouz, R Tillman, P Reddy, T Balch and M Veloso. Proceedings of the 1st International Conference on AI in Finance (ICAIF), 2020. Also in NeurIPS 2019 Workshop on AI in Financial Services

2. Get Real: Realism Metrics for Robust Limit Order Book Market Simulations. S. Vyetrenko et al. Proceedings of the 1st International Conference on AI in Finance (ICAIF), 2020.

Sample Data

Transaction_Id | Sender_Id | Sender_Account | Sender_Country | Sender_Sector | Sender_Job | Bene_Id | Bene_Account | Bene_Country | USD_Amount | label | Transaction_Type
PAY-BILL-3589 | CLIENT-3566 | ACCOUNT-3578 | USA | 21264 | CCB | COMPANY-3574 | ACCOUNT-3587 | GERMANY | 492.67 | 0 | MAKE-PAYMENT
WITHDRAWAL-3591 | CLIENT-3566 | ACCOUNT-3579 | USA | 18885 | CCB | COMPANY-3516 | ACCOUNT-3527 | GERMANY | 388.32 | 0 | WITHDRAWAL
MOVE-FUNDS-3528 | CLIENT-3508 | ACCOUNT-3510 | USA | 4809 | CCB | COMPANY-3516 | ACCOUNT-3527 | GERMANY | 280.7 | 0 | MOVE-FUNDS
WITHDRAWAL-3529 | CLIENT-3508 | ACCOUNT-3510 | USA | 7455 | CCB | CLIENT-3442 | ACCOUNT-3461 | USA | 118.14 | 0 | WITHDRAWAL
QUICK-DEPOSIT-3471 | CLIENT-3442 | ACCOUNT-3460 | USA | 10516 | CCB | CLIENT-3442 | ACCOUNT-3460 | USA | 164.97 | 0 | DEPOSIT-CASH
QUICK-DEPOSIT-3406 | CLIENT-3384 | ACCOUNT-3395 | USA | 36316 | CCB | CLIENT-3384 | ACCOUNT-3396 | USA | 413.11 | 0 | DEPOSIT-CASH
PAY-BILL-3347 | CLIENT-3330 | ACCOUNT-3341 | USA | 20626 | CCB | CLIENT-3333 | ACCOUNT-3333 | CANADA | 377.65 | 0 | PAY-CHECK
PAY-CHECK-3438 | CLIENT-3330 | ACCOUNT-3340 | USA | 20264 | CCB | CLIENT-3333 | ACCOUNT-3333 | CANADA | 338.03 | 0 | PAY-CHECK
MOVE-FUNDS-3294 | CLIENT-3272 | ACCOUNT-3274 | USA | 21568 | CCB | CLIENT-3275 | ACCOUNT-3275 | CANADA | 100.85 | 0 | MOVE-FUNDS
MOVE-FUNDS-2929 | CLIENT-3222 | ACCOUNT-3224 | USA | 29004 | CCB | CLIENT-3225 | ACCOUNT-3225 | CANADA | 200.03 | 0 | MOVE-FUNDS
PAY-BILL-3222 | CLIENT-3203 | ACCOUNT-3205 | USA | 27393 | CCB | CLIENT-3203 | ACCOUNT-3218 | GERMANY | 234.86 | 0 | PAY-BILL
QUICK-DEPOSIT-3243 | CLIENT-3203 | ACCOUNT-3220 | USA | 9452 | CCB | CLIENT-3203 | ACCOUNT-3222 | USA | 95.22 | 0 | DEPOSIT-CASH
DEPOSIT-CASH-3163 | CLIENT-3147 | ACCOUNT-3153 | USA | 25066 | CCB | COMPANY-3147 | ACCOUNT-3160 | GERMANY | 675.37 | 0 | DEPOSIT-CASH
WITHDRAWAL-3100 | CLIENT-3075 | ACCOUNT-3090 | USA | 22778 | CCB | CLIENT-3075 | ACCOUNT-3090 | EXCHANGE | 319.95 | 0 | EXCHANGE
QUICK-PAYMENT-3099 | CLIENT-3075 | ACCOUNT-3091 | USA | 39013 | CCB | CLIENT-3078 | ACCOUNT-3087 | TAIWAN | 771.54 | 0 | QUICK-PAYMENT
PAY-BILL-3036 | CLIENT-3016 | ACCOUNT-3028 | USA | 43951 | CCB | CLIENT-3022 | ACCOUNT-3033 | GERMANY | 730.69 | 0 | MAKE-PAYMENT

Data representing transactions from a subject-centric view, with the goal of identifying fraudulent transactions. This data contains a large variety of transaction types representing normal activities as well as abnormal/fraudulent activities that are introduced with predefined probabilities. The data was generated by running an AI planning-execution simulator and translating the output planning traces into tabular format. Parameters of the data generation model include the number of clients, the time duration and the probabilities of fraud.
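
To make the role of the three parameters concrete, here is a deliberately simplified sketch. It is not the AI planning-execution simulator described above: it swaps in plain random sampling, and the function name, the amount range, the reduced column set and the reading of `label` as 1 for fraud are all assumptions made for illustration.

```python
import random

TRANSACTION_TYPES = ["MAKE-PAYMENT", "WITHDRAWAL", "MOVE-FUNDS", "DEPOSIT-CASH"]


def simulate(num_clients, num_steps, fraud_probability, seed=0):
    """Emit one toy transaction per time step; each is flagged as fraud
    independently with the given probability (assumed: label 1 = fraud)."""
    rng = random.Random(seed)
    rows = []
    for step in range(num_steps):
        rows.append({
            "Transaction_Id": f"TXN-{step}",
            "Sender_Id": f"CLIENT-{rng.randrange(num_clients)}",
            "USD_Amount": round(rng.uniform(50, 800), 2),
            "label": int(rng.random() < fraud_probability),
            "Transaction_Type": rng.choice(TRANSACTION_TYPES),
        })
    return rows


rows = simulate(num_clients=10, num_steps=1000, fraud_probability=0.05)
fraud_rate = sum(r["label"] for r in rows) / len(rows)
print(f"observed fraud rate: {fraud_rate:.3f}")
```

With enough time steps, the observed fraud rate should sit close to the configured probability, which is one simple check on a generator of this kind.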

References

1. Generating Synthetic Data in Finance: Opportunities, challenges and pitfalls. S Assefa, D Dervovic, M Mahfouz, R Tillman, P Reddy, T Balch and M Veloso. Proceedings of the 1st International Conference on AI in Finance (ICAIF), 2020. Also in NeurIPS 2019 Workshop on AI in Financial Services

2. Domain-independent generation and classification of behavior traces. D Borrajo and M Veloso. arXiv preprint arXiv:2011.02918.

Sample Synthetic Document With Annotation

 

A sample image of a legal document, structured with various colored bounding boxes that organize and highlight different sections of the document. Each color serves a specific purpose in categorizing the information visually.

  1. Title Box:
    • Color: Brown
    • Content: Contains the title "Justice FY 2014 Budget - Water Violations in West."
  2. Main Content Boxes:
    • Color: Purple
    • Content: Contain the main text and information about legal actions, arrests, and federal court actions.
  3. Header Boxes:
    • Color: Green
    • Content: Contains case reference information.
  4. Statistical Data Boxes:
    • Color: Blue
    • Content: Numerical data related to the case.
  5. Subsection Headers:
    • Color: Red
    • Content: Header titles within sections.

Dataset of synthetic document images along with labels and bounding boxes of the layout elements. The documents correspond to three different domains, namely articles, resumes and forms. We focus mainly on the document structure and produce visually unique samples capturing complex and diverse layouts. The layout categories include generic elements such as titles, sections, headers/footers, tables and figures, and domain-specific elements such as equations, skills, profiles, questions and answers.

References

1.  Synthetic Document Generator for Annotation-free Layout Recognition.
N Raman, S Shah, and M Veloso.
Pattern Recognition, 2022.

Synthetic Equity Market Data

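spot | call 80% 20 | call 90% 20 | call 100% 20 | call 110% 20 | call 120% 20 | call 80% 40 | call 90% 40 | call 100% 40 | call 110% 40 | call 120% 40 | put 80% 120 | put 90% 120 | put 100% 120 | put 110% 120 | put 120% 120
1 | x | x | x | x | x | x | x | x | x | x | x | x | x | x | x
2 | x | x | x | x | x | x | x | x | x | x | x | x | x | x | x
3 | x | x | x | x | x | x | x | x | x | x | x | x | x | x | x
4 | x | x | x | x | x | x | x | x | x | x | x | x | x | x | x
5 | x | x | x | x | x | x | x | x | x | x | x | x | x | x | x
2 | x | x | x | x | x | x | x | x | x | x | x | x | x | x | x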

Synthetic equity market data contains simulated time series of spot and option prices for a given asset. Spot is one-dimensional, while options are defined on a high-dimensional grid of relative strikes (e.g. [80%, 90%, 100%, 110%, 120%]) and floating maturities (e.g. [20, 40, 60, 120]). The time series has a daily interval.

Simulated data is generated by a machine learning model which is trained on data derived from historical spot and option prices. Historical prices are sourced from Bloomberg via RMDS. For spot, we adjust raw prices by removing dividend, borrow and rates impact. For options, an internal vol fitting process is used to convert raw prices to implied volatilities which are then transformed to discrete local volatilities (DLVs). The transformation is mainly to remove possible static arbitrage from the implied vol surface.

The machine learning model is then developed using the adjusted spot and DLV data. In the pipeline, preprocessing first compresses the high-dimensional data into a low-dimensional representation via an autoencoder. A neural-network-based generative model is then trained on the low-dimensional data. The generative model takes random noise plus some initial state up to time t as input and generates the next state at t+1. The objective is to minimize the distance between the generated (fake) and historical (real) conditional distributions. Once the model is trained, it can generate synthetic low-dimensional data, which the decoder of the autoencoder reconstructs into high-dimensional data. The generated high-dimensional data contains synthetic spot and DLVs. The DLVs are then converted back to option prices.

The shape of the generated data set is (num_paths, num_days, num_variables). For example, to simulate 10,000 paths of an asset's spot and call option prices for the next 252 days using the aforementioned option grid (5 strikes × 4 maturities = 20 options), the shape will be (10000, 252, 21), where 21 counts the spot and the 20 call options. By default we include put options too, so the shape will be (10000, 252, 41).
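
The arithmetic behind these shapes can be checked with a short sketch. The function names and the column-naming scheme (modelled on the sample header above, e.g. "call 80% 20") are assumptions for illustration, not part of the published dataset.

```python
def dataset_shape(num_paths, num_days, strikes, maturities, include_puts=True):
    """Spot contributes 1 variable; each (strike, maturity) pair contributes one
    call, plus one put when puts are included."""
    grid_size = len(strikes) * len(maturities)
    num_variables = 1 + grid_size * (2 if include_puts else 1)
    return (num_paths, num_days, num_variables)


def column_names(strikes, maturities, include_puts=True):
    """Hypothetical column labels: maturity varies slowest, strike fastest."""
    kinds = ["call", "put"] if include_puts else ["call"]
    names = ["spot"]
    for kind in kinds:
        for maturity in maturities:
            for strike in strikes:
                names.append(f"{kind} {strike}% {maturity}")
    return names


strikes = [80, 90, 100, 110, 120]   # relative strikes, in percent
maturities = [20, 40, 60, 120]      # floating maturities, in days

print(dataset_shape(10000, 252, strikes, maturities, include_puts=False))
print(dataset_shape(10000, 252, strikes, maturities))
print(column_names(strikes, maturities)[:3])
```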

References

1. Deep Hedging: Learning to Simulate Equity Option Markets. M Wiese, L Bai, B Wood, H Buehler.

2. Conditional Sig-Wasserstein GANs for Time Series Generation. H Ni, L Szpruch, M Wiese, S Liao, B Xiao.

The following datasets developed by AI Research are available for public use for non-commercial purposes subject to the terms of each dataset’s original license.

Contact us to request access to a dataset.

DocLLM: Instruction tuning dataset

Motivation: Visually-rich Document Understanding (VrDU) models require deeply annotated datasets for training and validation. A diverse collection of such datasets exists for tasks such as Key Information Extraction, Document Classification, and Question Answering. With the recent popularity of Large Multimodal Language Models, researchers have shifted to using instruction tuning datasets that are more suitable for Generative Models. This has led to a bottleneck, as new datasets need to be created, or prior datasets need to be converted into a new format that is suitable for instruction tuning.

Outcome: To address this issue, AI Research has converted 16 previously published collections into instruction tuning datasets, covering the four tasks of Key Information Extraction, Document Classification, Visual Question Answering, and Natural Language Inference/Tabular Reasoning. The dataset enables researchers to seamlessly train and test their models against a wide variety of tasks and collections. It also creates a unified benchmark against which future models can be evaluated.

Citation:

Dongsheng Wang, Natraj Raman, Mathieu Sibue, Zhiqiang Ma, Petr Babkin, Simerjot Kaur, Yulong Pei, Armineh Nourbakhsh, and Xiaomo Liu. 2024. DocLLM: A Layout-Aware Generative Language Model for Multimodal Document Understanding (ACL’24), August 11-16, 2024, Bangkok, Thailand, 9 pages.

Datasets cont’d

This collection includes a set of 16 previously released VrDU datasets, and is meant to be used for research purposes only. Users are subject to the terms of each dataset’s original license.

REFinD: Relation extraction financial dataset

Motivation: Relation extraction (RE) from text is a core problem in Natural Language Processing (NLP).

Datasets for Relation Extraction (RE) are created to aid downstream tasks such as building knowledge graphs, information retrieval, semantic search, question/answering and textual entailment. However, most available large-scale RE datasets are compiled using general knowledge sources such as Wikipedia, web texts and news. These datasets often fail to capture domain-specific challenges. Therefore, various state-of-the-art models that perform competitively on such datasets fail to perform well in the financial domain.

Outcome: To address this limitation, AIR has created REFinD, the Relation Extraction Financial Dataset. It is the largest-scale annotated dataset of relations, with ∼29K instances and 22 relations among 8 types of entity pairs, generated entirely over financial documents. It is the first RE dataset to use Securities and Exchange Commission (SEC) filings, a rich and complex data source. The team introduced diversity in the dataset by including all context surrounding the entities, capturing longer contexts than seen in financial texts.

Citation:

Simerjot Kaur, Charese Smiley, Akshat Gupta, Joy Sain, Dongsheng Wang, Suchetha Siddagangappa, Toyin Aguda, and Sameena Shah. 2023. REFinD: Relation Extraction Financial Dataset. In Proceedings of the 46th International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR ’23), July 23–27, 2023, Taipei, Taiwan, 9 pages. https://doi.org/10.1145/3539618.3591911

This dataset is licensed under the Creative Commons Attribution-Noncommercial 4.0 International License. Therefore, the dataset provided may only be used for non-commercial purposes and may not be used for any commercial use.

BizGraphQA:  A Dataset for Image-based Inference over Graph-structured Diagrams from Business Domains

Motivation: Graph-structured diagrams, such as enterprise ownership charts or management hierarchies, are a challenging medium for deep learning models, as they require the capacity to model not only language and spatial relations but also the topology of links between entities and the varying semantics of what those links represent. Devising Question Answering models that automatically process and understand such diagrams has vast applications in many enterprise domains, and can move the state of the art in multimodal document understanding to a new frontier.

Curating real-world datasets to train these models can be difficult, due to scarcity and confidentiality of the documents where such diagrams are included. Recently released synthetic datasets are often prone to repetitive structures that can be memorized or tackled using heuristics.

Outcome: In this paper, we present a collection of 10,000 synthetic graphs that faithfully reflect properties of real graphs in four business domains, and are realistically rendered within a PDF document with varying styles and layouts. In addition, we have generated over 130,000 question instances that target complex graphical relationships specific to each domain. We hope this challenge will encourage the development of models capable of robust reasoning about graph structured images, which are ubiquitous in numerous sectors in business and across scientific disciplines.

Citation:

Petr Babkin, William Watson, Zhiqiang Ma, Lucas Cecchi, Natraj Raman, Armineh Nourbakhsh, and Sameena Shah. 2023. BizGraphQA: A Dataset for Image-based Inference over Graph-structured Diagrams from Business Domains. In Proceedings of the 46th International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR ’23), July 23–27, 2023, Taipei, Taiwan. ACM, New York, NY, USA, 10 pages. https://doi.org/10.1145/3539618.3591875

The dataset provided may only be used for non-commercial purposes and may not be used for any commercial use.

Request Synthetic Data

Please include type of synthetic data requested, purpose of the request (research questions), name, affiliation and email address.

Contact us