Synthetic Data

Process

Below is a sample process we devised at J.P. Morgan AI Research to generate synthetic financial datasets. To learn more about the challenges and opportunities in generating data in finance, please read Generating Synthetic Data in Finance: Opportunities, challenges and pitfalls.

Step 1: Compute metrics for the real data

Step 2: Develop a Generator (may be statistical methods or an agent-based simulation)

Step 3: (Optional) Calibrate the Generator using the real data

Step 4: Run the Generator to generate synthetic data

Step 5: Compute metrics for the synthetic data

Step 6: Compare the metrics of the real data and synthetic data

Step 7: (Optional) Refine the Generator to improve against comparison metrics
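The seven steps above can be sketched as a small loop. This is only an illustration under stated assumptions, not the process used by J.P. Morgan AI Research: the "real" data is a made-up list of numbers, the generator is a plain Gaussian sampler, and the function names (`compute_metrics`, `calibrate`, `generate`, `compare`) are hypothetical.

```python
import random
import statistics

def compute_metrics(data):
    """Steps 1 and 5: summarize a dataset with a few simple metrics."""
    return {"mean": statistics.fmean(data), "stdev": statistics.stdev(data)}

def calibrate(real_metrics):
    """Step 3: set the generator's parameters from the real-data metrics."""
    return {"mu": real_metrics["mean"], "sigma": real_metrics["stdev"]}

def generate(params, n, rng):
    """Step 4: run the generator (Step 2 chose a Gaussian sampler here)."""
    return [rng.gauss(params["mu"], params["sigma"]) for _ in range(n)]

def compare(real_metrics, synthetic_metrics):
    """Step 6: absolute difference per metric."""
    return {k: abs(real_metrics[k] - synthetic_metrics[k]) for k in real_metrics}

rng = random.Random(0)
# Stand-in for real data (illustrative values only, not an actual dataset).
real_data = [rng.gauss(100.0, 15.0) for _ in range(5000)]

real_metrics = compute_metrics(real_data)            # Step 1
params = calibrate(real_metrics)                     # Steps 2-3
synthetic_data = generate(params, 5000, rng)         # Step 4
synthetic_metrics = compute_metrics(synthetic_data)  # Step 5
gaps = compare(real_metrics, synthetic_metrics)      # Step 6
print(gaps)
# Step 7 would adjust the generator and repeat if the gaps were too large.
```

In practice the metrics would be chosen for the domain (for example, distributional or temporal statistics), and the comparison in Step 6 drives whether the generator needs refinement.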

For upcoming workshops and updates, visit:

ICAIF

NeurIPS

Money laundering is the process of introducing money from illegal activities into the financial system so that it can be used for legal or illegal purposes. This data represents sequences of high-level interactions with a financial institution by legitimate clients and by clients engaged in money laundering. The current data contains state-action pairs of bank customer activities, such as opening an account or making transactions, payments, withdrawals and purchases. The data was generated by running an AI planning-execution simulator.

References

1. Generating Synthetic Data in Finance: Opportunities, challenges and pitfalls. S Assefa, D Dervovic, M Mahfouz, R Tillman, P Reddy, T Balch and M Veloso. Proceedings of the 1st International Conference on AI in Finance (ICAIF), 2020. Also in NeurIPS 2019 Workshop on AI in Financial Services

2. Simulating and classifying behavior in adversarial environments based on action-state traces: An application to money laundering, D Borrajo, M Veloso, S Shah. Proceedings of the 1st International Conference on AI in Finance (ICAIF), 2020. Also in arXiv preprint arXiv:2011.01826, 2020

Customer journey events represent sequences of lower-level interactions between retail banking clients and the bank. Example event types include logging in to a web application, making payments and withdrawing money from ATMs. The data was generated by running an AI planning-execution simulator and translating the output planning traces into tabular format.

References

2. Domain-independent generation and classification of behavior traces. D Borrajo and M Veloso. arXiv preprint arXiv:2011.02918.

Synthetic limit order book data describes a series of buy and sell orders for financial instruments (stocks) placed by various market participants at a public stock exchange. Specifically, this data contains messages and snapshots of orders over time. The data represents N trading days of simulated data for high-liquidity stocks in different market regimes (e.g., trending up/down, high/low volatility).

References

2. Get Real: Realism Metrics for Robust Limit Order Book Market Simulations. S. Vyetrenko et al. Proceedings of the 1st International Conference on AI in Finance (ICAIF), 2020.

Data representing transactions from a subject-centric view, with the goal of identifying fraudulent transactions. This data contains a large variety of transaction types representing normal activities as well as abnormal/fraudulent activities that are introduced with predefined probabilities. The data was generated by running an AI planning-execution simulator and translating the output planning traces into tabular format. Parameters of the data generation model include the number of clients, the time duration and the probabilities of fraud.
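To make the role of the generation parameters concrete, here is a toy sketch in which fraudulent rows are introduced with a predefined probability. It uses plain random sampling and does not reproduce the AI planning-execution simulator; the function name `simulate_transactions`, the column names and the amount range are all hypothetical.

```python
import random

def simulate_transactions(num_clients, num_steps, fraud_probability, seed=0):
    """Emit one transaction per client per time step; flag each as fraud
    independently with the given probability (a simplifying assumption)."""
    rng = random.Random(seed)
    rows = []
    for step in range(num_steps):
        for client in range(num_clients):
            is_fraud = rng.random() < fraud_probability
            rows.append({
                "step": step,
                "client": f"CLIENT-{client}",
                "amount": round(rng.uniform(10.0, 1000.0), 2),  # arbitrary range
                "label": 1 if is_fraud else 0,
            })
    return rows

rows = simulate_transactions(num_clients=50, num_steps=40, fraud_probability=0.05)
fraud_rate = sum(r["label"] for r in rows) / len(rows)
print(len(rows), round(fraud_rate, 3))
```

The observed fraud rate should land close to the configured probability, which is one simple check that the generator respects its parameters.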

References

2. Domain-independent generation and classification of behavior traces. D Borrajo and M Veloso. arXiv preprint arXiv:2011.02918.

Dataset of synthetic document images along with labels and bounding boxes of the layout elements. The documents correspond to three different domains: articles, resumes and forms. We focus mainly on the document structure and produce visually unique samples capturing complex and diverse layouts. The layout categories include generic elements such as titles, sections, headers/footers, tables and figures, and domain-specific elements such as equations, skills, profiles, questions and answers.

References

1. Synthetic Document Generator for Annotation-free Layout Recognition. N Raman, S Shah, and M Veloso. Pattern Recognition, 2022.

Synthetic equity market data contains simulated time series of spot and option prices for a given asset. Spot is one-dimensional, while options are defined on a high-dimensional grid of relative strikes (e.g., [80%, 90%, 100%, 110%, 120%]) and floating maturities (e.g., [20, 40, 60, 120]). The time series has a daily interval.

Simulated data is generated by a machine learning model which is trained on data derived from historical spot and option prices. Historical prices are sourced from Bloomberg via RMDS. For spot, we adjust raw prices by removing dividend, borrow and rates impact. For options, an internal vol fitting process is used to convert raw prices to implied volatilities which are then transformed to discrete local volatilities (DLVs). The transformation is mainly to remove possible static arbitrage from the implied vol surface.

The machine learning model is then developed using the adjusted spot and DLV data. In the pipeline, preprocessing first compresses the high-dimensional data into a low-dimensional representation via an autoencoder. A neural-network-based generative model is then trained on the low-dimensional data. The generative model takes random noise plus the state history up to time t as input, and generates the next state at t+1. The objective is to minimize the distance between the generated (fake) and historical (real) conditional distributions. Once the model is trained, it can generate synthetic low-dimensional data, which is then reconstructed into high-dimensional data via the decoder of the autoencoder. The generated high-dimensional data contains synthetic spot and DLVs. The DLVs are then converted back to option prices.
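The sampling loop described above (noise plus state history in, next state out) can be sketched as follows. This is a minimal illustration only: `toy_generator` is a hypothetical stand-in for the trained network, and the autoencoder, the training objective and the DLV conversion are all omitted.

```python
import random

def toy_generator(history, noise):
    """Hypothetical stand-in for the trained model: derive the next
    low-dimensional state from the latest state plus random noise."""
    last = history[-1]
    return [0.95 * x + 0.1 * z for x, z in zip(last, noise)]

def rollout(generator, initial_state, num_days, rng):
    """Generate a path one step at a time, feeding each new state back in."""
    history = [list(initial_state)]
    for _ in range(num_days):
        noise = [rng.gauss(0.0, 1.0) for _ in initial_state]
        history.append(generator(history, noise))
    return history[1:]  # drop the initial state, keep the generated days

rng = random.Random(0)
path = rollout(toy_generator, initial_state=[0.0, 0.0, 0.0], num_days=252, rng=rng)
print(len(path), len(path[0]))  # 252 3
```

In the pipeline described above, each generated low-dimensional state would then pass through the decoder to recover spot and DLVs.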

The shape of the generated dataset is (num_paths, num_days, num_variables). For example, suppose we want to simulate 10000 paths of an asset's spot and call option prices for the next 252 days. Using the aforementioned option grid (5 strikes by 4 maturities), the shape will be (10000, 252, 21), where 21 covers spot plus 20 call options. By default we include put options too, so the shape will be (10000, 252, 41).
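The arithmetic behind these shapes can be checked with a short helper. The function name `dataset_shape` and its parameters are hypothetical and used only for illustration.

```python
def dataset_shape(num_paths, num_days, strikes, maturities, include_puts=True):
    """Return (num_paths, num_days, num_variables) for a strike/maturity grid."""
    options_per_type = len(strikes) * len(maturities)  # one option per grid point
    option_types = 2 if include_puts else 1            # calls only, or calls and puts
    num_variables = 1 + option_types * options_per_type  # 1 for spot
    return (num_paths, num_days, num_variables)

strikes = [0.8, 0.9, 1.0, 1.1, 1.2]   # relative strikes from the example grid
maturities = [20, 40, 60, 120]        # floating maturities from the example grid

print(dataset_shape(10000, 252, strikes, maturities, include_puts=False))  # (10000, 252, 21)
print(dataset_shape(10000, 252, strikes, maturities))                      # (10000, 252, 41)
```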

References

1. Deep Hedging: Learning to Simulate Equity Option Markets. M Wiese, L Bai, B Wood, H Buehler.

2. Conditional Sig-Wasserstein GANs for Time Series Generation. H Ni, L Szpruch, M Wiese, S Liao, B Xiao.

The following datasets developed by AI Research are available for public use for non-commercial purposes subject to the terms of each dataset’s original license.

DocLLM: Instruction tuning dataset

Motivation: Visually-rich Document Understanding (VrDU) models require deeply annotated datasets for training and validation. A diverse collection of such datasets exists for tasks such as Key Information Extraction, Document Classification, and Question Answering. With the recent popularity of Large Multimodal Language Models, researchers have shifted to using instruction tuning datasets that are more suitable for Generative Models. This has led to a bottleneck, as new datasets need to be created, or prior datasets need to be converted into a new format that is suitable for instruction tuning.

Outcome: To address this issue, AI Research has converted 16 previously published collections into instruction tuning datasets, covering the four tasks of Key Information Extraction, Document Classification, Visual Question Answering, and Natural Language Inference/Tabular Reasoning. The dataset enables researchers to seamlessly train and test their models against a wide variety of tasks and collections. It also creates a unified benchmark against which future models can be evaluated.

Citation:

Dongsheng Wang, Natraj Raman, Mathieu Sibue, Zhiqiang Ma, Petr Babkin, Simerjot Kaur, Yulong Pei, Armineh Nourbakhsh, and Xiaomo Liu. 2024. DocLLM: A Layout-Aware Generative Language Model for Multimodal Document Understanding (ACL’24), August 11-16, 2024, Bangkok, Thailand, 9 pages.

Datasets cont’d

This collection includes a set of 16 previously released VrDU datasets, and is meant to be used for research purposes only. Users are subject to the terms of each dataset’s original license.

REFinD: Relation extraction financial dataset

Motivation: Relation extraction (RE) from text is a core problem in Natural Language Processing (NLP).

Datasets for Relation Extraction (RE) are created to aid downstream tasks such as building knowledge graphs, information retrieval, semantic search, question/answering and textual entailment. However, most available large-scale RE datasets are compiled using general knowledge sources such as Wikipedia, web texts and news. These datasets often fail to capture domain-specific challenges. Therefore, various state-of-the-art models that perform competitively on such datasets fail to perform well in the financial domain.

Outcome: To address this limitation, AI Research has created REFinD, the Relation Extraction Financial Dataset. It is the largest-scale annotated dataset of relations, with ∼29K instances and 22 relations among 8 types of entity pairs, generated entirely over financial documents. It is the first RE dataset to use Securities and Exchange Commission (SEC) filings, a rich and complex data source. The team introduced diversity in the dataset by including all context surrounding the entities, capturing longer contexts than seen in financial texts.

Citation:

Simerjot Kaur, Charese Smiley, Akshat Gupta, Joy Sain, Dongsheng Wang, Suchetha Siddagangappa, Toyin Aguda, and Sameena Shah. 2023. REFinD: Relation Extraction Financial Dataset. In Proceedings of the 46th International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR ’23), July 23–27, 2023, Taipei, Taiwan, 9 pages. https://doi.org/10.1145/3539618.3591911

This dataset is licensed under the Creative Commons Attribution-Noncommercial 4.0 International License. Therefore, the dataset provided may only be used for non-commercial purposes and may not be used for any commercial use.

BizGraphQA: A Dataset for Image-based Inference over Graph-structured Diagrams from Business Domains

Motivation: Graph-structured diagrams, such as enterprise ownership charts or management hierarchies, are a challenging medium for deep learning models: they require the capacity to model not only language and spatial relations but also the topology of links between entities and the varying semantics of what those links represent. Devising Question Answering models that automatically process and understand such diagrams has vast applications in many enterprise domains, and can move the state of the art in multimodal document understanding to a new frontier.

Curating real-world datasets to train these models can be difficult, due to scarcity and confidentiality of the documents where such diagrams are included. Recently released synthetic datasets are often prone to repetitive structures that can be memorized or tackled using heuristics.

Outcome: In this paper, we present a collection of 10,000 synthetic graphs that faithfully reflect properties of real graphs in four business domains, and are realistically rendered within a PDF document with varying styles and layouts. In addition, we have generated over 130,000 question instances that target complex graphical relationships specific to each domain. We hope this challenge will encourage the development of models capable of robust reasoning about graph structured images, which are ubiquitous in numerous sectors in business and across scientific disciplines.

Citation:

Petr Babkin, William Watson, Zhiqiang Ma, Lucas Cecchi, Natraj Raman, Armineh Nourbakhsh, and Sameena Shah. 2023. BizGraphQA: A Dataset for Image-based Inference over Graph-structured Diagrams from Business Domains. In Proceedings of the 46th International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR ’23), July 23–27, 2023, Taipei, Taiwan. ACM, New York, NY, USA, 10 pages. https://doi.org/10.1145/3539618.3591875

The dataset provided may only be used for non-commercial purposes and may not be used for any commercial use.

Request Synthetic Data

Please include the type of synthetic data requested, the purpose of the request (research questions), and your name, affiliation and email address.


Time step	Label	Action	Arg1	Arg2	Arg3
1/26/21 14:33	GOOD	CREATE-ACCOUNT	COMPANY-57251	CHECKING-57245	JAPAN
1/26/21 14:37	GOOD	CREATE-ACCOUNT	COMPANY-57342	CHECKING-57337	ICELAND
1/26/21 14:37	BAD	CREATE-ACCOUNT	COMPANY-57590	CHECKING-57585	MAURITANIA
1/26/21 14:39	BAD	CREATE-ACCOUNT	COMPANY-57661	CHECKING-57619	GUERNSEY
1/26/21 14:46	GOOD	CREATE-ACCOUNT	COMPANY-57364	DIGITAL-MONEY-57358	EQUATORIAL-GUINEA
1/26/21 14:46	GOOD	CREATE-ACCOUNT	COMPANY-57368	CHECKING-57542	AFGHANISTAN
1/26/21 14:50	GOOD	CREATE-ACCOUNT	COMPANY-57387	DIGITAL-MONEY-57381	TURKS-AND-CAICOS-IS
1/26/21 14:50	BAD	CREATE-ACCOUNT	COMPANY-57338	CHECKING-57358	BOSNIA-AND-HERZEGOVINA
1/26/21 14:50	BAD	CREATE-ACCOUNT	COMPANY-57661	CHECKING-57654	CONGO-DEM-REP
1/26/21 14:50	GOOD	CREATE-ACCOUNT	COMPANY-57364	DIGITAL-MONEY-57358	KYRGYZSTAN
1/26/21 14:54	BAD	CREATE-ACCOUNT	COMPANY-57285	DIGITAL-MONEY-57285	COCOS-KEELING-IS
1/26/21 14:54	BAD	QUICK-DEPOSIT	CLIENT-57574	T-CASH-IN-57548	HEARD-ISLAND-AND-MCDONALD-ISLANDS
1/26/21 15:03	BAD	CREATE-ACCOUNT	COMPANY-57821	DIGITAL-MONEY-57811	LIBYA
1/26/21 15:03	BAD	CREATE-ACCOUNT	COMPANY-57746	CHECKING-57714	VIETNAM
1/26/21 15:04	BAD	CREATE-ACCOUNT	COMPANY-57335	DIGITAL-MONEY-57335	SUDAN
2/1/21 9:09	GOOD	CREATE-ACCOUNT	COMPANY-57288	CHECKING-57288	THAILAND
2/1/21 9:14	BAD	CREATE-ACCOUNT	COMPANY-57268	CHECKING-57542	COMPANY-57539
2/2/21 0:52	BAD	SET-OWNERSHIP-ACCOUNT	CLIENT-57574	CHECKING-57542	VIETNAM
2/2/21 4:16	BAD	CREATE-ACCOUNT	COMPANY-57747	DIGITAL-MONEY-57737	SVALBARD

Time step	Label	Event	Customer id
1/28/21 15:48	STANDARD-FAIL-RARELY-DIGITAL	mobile : logon	ID-22522
1/29/21 2:39	STANDARD-FAIL-RARELY-DIGITAL	web : logon	ID-22425
1/29/21 18:30	STANDARD-FAIL-RARELY-DIGITAL	mobile : logon	ID-22791
1/29/21 20:50	STANDARD-FAIL-RARELY-DIGITAL	atm : authentication	ID-22710
1/30/21 1:28	STANDARD-FAIL-RARELY-DIGITAL	web : logon	ID-22658
1/30/21 2:28	STANDARD-FAIL-RARELY-DIGITAL	mobile : logon	ID-22425
1/30/21 4:24	STANDARD-FAIL-RARELY-DIGITAL	web : logon	ID-22483
1/30/21 4:24	STANDARD-FAIL-RARELY-DIGITAL	mobile : funds transfer activity	ID-22454
1/30/21 4:26	STANDARD-FAIL-RARELY-DIGITAL	mobile : logon	ID-22454
1/30/21 4:28	STANDARD-FAIL-RARELY-DIGITAL	mobile : investment portfolio	ID-22454
1/30/21 4:30	STANDARD-FAIL-RARELY-DIGITAL	mobile : transaction summary multiple products	ID-22454
1/30/21 4:53	STANDARD-FAIL-RARELY-DIGITAL	mobile : logoff	ID-22454
1/30/21 5:55	STANDARD-FAIL-RARELY-DIGITAL	mobile : profile maintenance	ID-22437
1/30/21 5:55	STANDARD-FAIL-RARELY-DIGITAL	mobile : logon	ID-22437
1/30/21 6:01	STANDARD-FAIL-RARELY-DIGITAL	mobile : quickpay receipient view	ID-22437
1/30/21 21:35	STANDARD-FAIL-RARELY-DIGITAL	mobile : travel notification	ID-22437
1/30/21 22:50	STANDARD-FAIL-RARELY-DIGITAL	mobile : logoff	ID-22437
1/30/21 22:50	STANDARD-FAIL-RARELY-DIGITAL	web : logon	ID-22563
1/30/21 22:50	STANDARD-FAIL-RARELY-DIGITAL	mobile : logon	ID-22600

100024	1E+10	1E+10	99989	200	1E+10	1E+10	-1E+10
100024	100	99989	200	1E+10	1E+10	1E+10	1E+10
100024	100	100009	100	1E+10	1E+10	1E+10	1E+10
100024	92	100009	100	100045	100	99989	100
100024	92	100009	100	100039	100	99989	100
100024	92	100009	100	100039	100	99979	100
100039	92	100009	100	100045	100	99979	100
100039	92	100009	100	100045	100	99979	100
100028	200	100009	24	100039	92	99979	100
100028	200	100009	24	100039	92	99987	29

34200	1	8	200	99989	1_2	1042
34200	1	2	100	100024	-1	1029
34200	4	3	100	99985	-1	1077
34200	4	9	100	99989	1	1042
34200	1	3	100	99985	-1	1077
34200	1	4	100	100009	1	1010
34200	1	5	8	100042	1	1083
34200	4	5	8	100042	1	1083
34200	4	2	8	100024	-1	1029
34200	1	6	100	100045	1	1006
34200	1	7	100	100039	-1	1040
34200	1	10	100	99979	1	1016
34200	1	11	100	100009	1	1096
34200	1	12	100	100040	1	1003
34200	4	12	92	100040	1	1003

Transaction_Id	Sender_Id	Sender_Account	Sender_Country	Sender_Sector	Sender_Job	Bene_Id	Bene_Account	Bene_Country	USD_Amount	label	Transaction_Type
PAY-BILL-3589	CLIENT-3566	ACCOUNT-3578	USA	21264	CCB	COMPANY-3574	ACCOUNT-3587	GERMANY	492.67	0	MAKE-PAYMENT
WITHDRAWAL-3591	CLIENT-3566	ACCOUNT-3579	USA	18885	CCB	COMPANY-3516	ACCOUNT-3527	GERMANY	388.32	0	WITHDRAWAL
MOVE-FUNDS-3528	CLIENT-3508	ACCOUNT-3510	USA	4809	CCB	COMPANY-3516	ACCOUNT-3527	GERMANY	280.7	0	MOVE-FUNDS
WITHDRAWAL-3529	CLIENT-3508	ACCOUNT-3510	USA	7455	CCB	CLIENT-3442	ACCOUNT-3461	USA	118.14	0	WITHDRAWAL
QUICK-DEPOSIT-3471	CLIENT-3442	ACCOUNT-3460	USA	10516	CCB	CLIENT-3442	ACCOUNT-3460	USA	164.97	0	DEPOSIT-CASH
QUICK-DEPOSIT-3406	CLIENT-3384	ACCOUNT-3395	USA	36316	CCB	CLIENT-3384	ACCOUNT-3396	USA	413.11	0	DEPOSIT-CASH
PAY-BILL-3347	CLIENT-3330	ACCOUNT-3341	USA	20626	CCB	CLIENT-3333	ACCOUNT-3333	CANADA	377.65	0	PAY-CHECK
PAY-CHECK-3438	CLIENT-3330	ACCOUNT-3340	USA	20264	CCB	CLIENT-3333	ACCOUNT-3333	CANADA	338.03	0	PAY-CHECK
MOVE-FUNDS-3294	CLIENT-3272	ACCOUNT-3274	USA	21568	CCB	CLIENT-3275	ACCOUNT-3275	CANADA	100.85	0	MOVE-FUNDS
MOVE-FUNDS-2929	CLIENT-3222	ACCOUNT-3224	USA	29004	CCB	CLIENT-3225	ACCOUNT-3225	CANADA	200.03	0	MOVE-FUNDS
PAY-BILL-3222	CLIENT-3203	ACCOUNT-3205	USA	27393	CCB	CLIENT-3203	ACCOUNT-3218	GERMANY	234.86	0	PAY-BILL
QUICK-DEPOSIT-3243	CLIENT-3203	ACCOUNT-3220	USA	9452	CCB	CLIENT-3203	ACCOUNT-3222	USA	95.22	0	DEPOSIT-CASH
DEPOSIT-CASH-3163	CLIENT-3147	ACCOUNT-3153	USA	25066	CCB	COMPANY-3147	ACCOUNT-3160	GERMANY	675.37	0	DEPOSIT-CASH
WITHDRAWAL-3100	CLIENT-3075	ACCOUNT-3090	USA	22778	CCB	CLIENT-3075	ACCOUNT-3090	EXCHANGE	319.95	0	EXCHANGE
QUICK-PAYMENT-3099	CLIENT-3075	ACCOUNT-3091	USA	39013	CCB	CLIENT-3078	ACCOUNT-3087	TAIWAN	771.54	0	QUICK-PAYMENT
PAY-BILL-3036	CLIENT-3016	ACCOUNT-3028	USA	43951	CCB	CLIENT-3022	ACCOUNT-3033	GERMANY	730.69	0	MAKE-PAYMENT
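A table like the transaction sample above can be read with Python's standard `csv` module. The sketch below assumes the data is available as tab-separated text with the header row shown; the four rows are copied from the sample, and reading `label` 0 as a normal transaction is an assumption rather than something the table states.

```python
import csv
import io
from collections import defaultdict

header = ["Transaction_Id", "Sender_Id", "Sender_Account", "Sender_Country",
          "Sender_Sector", "Sender_Job", "Bene_Id", "Bene_Account",
          "Bene_Country", "USD_Amount", "label", "Transaction_Type"]
sample_rows = [
    ["PAY-BILL-3589", "CLIENT-3566", "ACCOUNT-3578", "USA", "21264", "CCB",
     "COMPANY-3574", "ACCOUNT-3587", "GERMANY", "492.67", "0", "MAKE-PAYMENT"],
    ["WITHDRAWAL-3591", "CLIENT-3566", "ACCOUNT-3579", "USA", "18885", "CCB",
     "COMPANY-3516", "ACCOUNT-3527", "GERMANY", "388.32", "0", "WITHDRAWAL"],
    ["MOVE-FUNDS-3528", "CLIENT-3508", "ACCOUNT-3510", "USA", "4809", "CCB",
     "COMPANY-3516", "ACCOUNT-3527", "GERMANY", "280.7", "0", "MOVE-FUNDS"],
    ["PAY-BILL-3036", "CLIENT-3016", "ACCOUNT-3028", "USA", "43951", "CCB",
     "CLIENT-3022", "ACCOUNT-3033", "GERMANY", "730.69", "0", "MAKE-PAYMENT"],
]
# Build tab-separated text so the example does not depend on an external file.
tsv_text = "\n".join("\t".join(row) for row in [header] + sample_rows)

totals = defaultdict(float)
flagged = 0
for record in csv.DictReader(io.StringIO(tsv_text), delimiter="\t"):
    totals[record["Transaction_Type"]] += float(record["USD_Amount"])
    flagged += int(record["label"])  # assumed: nonzero label marks fraud

for transaction_type, total in sorted(totals.items()):
    print(transaction_type, round(total, 2))
print("flagged rows:", flagged)
```

With a real export, `io.StringIO(tsv_text)` would be replaced by an open file handle; the grouping logic stays the same.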
